Tree-sitter vs. Language Servers

lambdaland.org

210 points by ashton314 11 hours ago


DanRosenwasser - 5 hours ago

> It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this. The language server can be a more complicated program and so could surface particularly detailed information about the syntax; it might also be slower than tree-sitter.

We (TypeScript) used to do this for Visual Studio prior to tmLanguage. It was nice because we didn't have to write a second parser. Our parser was already error-tolerant and incremental, and syntax highlighting just involved descending into the syntax tree's tokens. So there was no room for divergence bugs in parsers, and there was also no need to figure out how to encode oddities and ambiguity-breaking logic in limited formats like tmLanguage.

This all predated TSServer (which predated LSP, though that's coming in TypeScript 7). The latency for syntax highlighting over JSON was too much, and other editors often didn't make syntax highlighting available outside of tmLanguage anyway. Eventually semantic highlighting became a thing, which is more latency-tolerant, and overlays colors on top of a syntactic highlighter in VS Code.

The other issue with this approach was that we still needed a dedicated thread just for fast syntax highlighting. That thread was a separate instance of the JS language service without anything shared, so that was a decent amount of memory overhead just for syntax highlighting.

thramp - 9 hours ago

(Hi, I’m on the rust-analyzer team, but I’ve been less active for reasons that are clear in my bio.)

> Language servers are powerful because they can hook into the language’s runtime and compiler toolchain to get semantically correct answers to user queries. For example, suppose you have two versions of a pop function, one imported from a stack library, and another from a heap library. If you use a tool like the dumb-jump package in Emacs and you use it to jump to the definition for a call to pop, it might get confused as to where to go because it’s not sure what module is in scope at the point. A language server, on the other hand, should have access to this information and would not get confused.

You are correct that a language server will generally provide correct navigation/autocomplete, but a language server doesn’t necessarily need to hook into an existing compiler: a language server might be a latency-sensitive re-implementation of an existing compiler toolchain (rust-analyzer is the one I’m most familiar with, but the recent crop of new language servers tend to take this direction if the language’s compiler isn’t query-oriented).

> It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this.

Since I spend a lot of time writing Rust, I’ll use Rust as an example: you can highlight a binding if it’s mutable or style an enum/struct differently. It’s one of those small things that makes a big impact once you get used to it: editors without semantic syntax highlighting (as it is called in the LSP specification) feel like they’re naked to me.

torginus - 3 hours ago

I just want to add that treesitter is a heuristic, incremental parser.

The difference between regular parsers and treesitter, is that regular parsers start eating tokens from the start of the file, and try to assemble and AST from that. The AST is built from the top down.

Treesitter works differently, it grabs tokens from an arbitrary point, and assembles them into AST nodes, then tries to extend the AST until the whole file is parsed.

This method supports incremental edits (as you can throw away the AST for the modified part, and try to re-parse), but the problem is that most languages are designed to be unambiguous when parsed left to right, and parsing them like this might involve some retries and guesswork.

Also, unlike modern languages, like Go, which is designed to be parseable without any semantic analysis, a lot of older languages don't have this property, notably C/C++ needs a symbol table. In this case, treesitter has to guess, and it might guess wrong.

As for what can you do with an AST and what can't you: you can tell if something is a function call, a variable reference, or any other piece of syntax, but if you write something like x = 2; then tree-sitter has no idea what x is, is it a float, an int? is it a local, a class variable, or a global? You can tell this with a symbol table which the compiler uses to dereference symbols, but treesitter cant do this for you.

Fiveplus - 9 hours ago

>It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this.

Hmm, the strong reason could be latency and layout stability. Tree-sitter parses on the main thread (or a close worker) typically in sub-ms timeframes, ensuring that syntax coloring is synchronous with keystrokes. LSP semantic tokens are asynchronous by design. If you rely solely on LSP for highlighting, you introduce a flash of unstyled content or color-shifting artifacts every time you type, because the round-trip to the server (even a local one) and the subsequent re-tokenization takes longer than the frame budget.

The ideal hygiene could be something like -> tree-sitter provides the high-speed lexical coloring (keywords, punctuation, basic structure) instantly and LSP paints the semantic modifiers (interfaces vs classes, mutable vs const) asynchronously like 200ms later. Relying on LSP for the base layer makes the editor feel sluggish.

KlayLay - 10 hours ago

Side note, but thanks for the note about not using AI to write your articles. I'm tired of looking for information online, finding an article that may answer it, and not being sure about the author's integrity (this is so rampant on Medium).

FjordWarden - 10 hours ago

This is like the difference between an orange and fruit juice. You can squeeze an orange to extract its juices, but that is not the only thing you can do with it, nor is it the only way to make fruit juice.

I use tree-sitter for developing a custom programming language, you still need an extra step to get from CST to AST, but the overall DevEx is much quicker that hand-rolling the parser.

tetris11 - 10 hours ago

I love tree-sitter+eglot but a few of the languages/schemes I work in, simply don't have parsers:

    > pacman -Ssq tree-sitter
    tree-sitter
    tree-sitter-bash
    tree-sitter-c
    tree-sitter-cli
    tree-sitter-javascript
    tree-sitter-lua
    tree-sitter-markdown
    tree-sitter-python
    tree-sitter-query
    tree-sitter-rust
    tree-sitter-vim
    tree-sitter-vimdoc

Where's R, YAML, Golang, and several others?
Conscat - 4 hours ago

I find neither LSP for Tree Sitter sufficient for syntax highlighting, but I am extremely satisfied by my special combination of _both_ in Emacs. I love how easy it is to write Tree Sitter queries for very special patterns, like highlighting namespace declarations differently from scope resolution, or highlighting inline assembly differently from normal strings.

But I really want the semantic highlighting from a language server, such as highlighting constants or macros special, and Emacs (among some other editors) make it trivial to blend the strengths of both together.

mickeyp - 10 hours ago

Tree-sitter is great. It powers Combobulate in Emacs. Structured editing and movement would not have been easily done without it.

geophph - 5 hours ago

Funny timing. I’ve been building an LSP for a niche DSL I use at work. I’ve been using tree-sitter to build out the AST of sorts for the LSP functions, but just yesterday it dawned on me that the syntax highlighting my LSP does is all just TS queries and encoding them properly for the protocol. So I was looking into if that can be done in the vscode extension that provides the LSP hookup instead. Kinda nice that the same tree-sitter grammar can be used across the extension and LSP, even tho they’re in different languages.

vivzkestrel - 9 hours ago

- as a guy who is absolutely not familiar with the idea of how code editors work and has to make a browser based code editor, what are the things that you think I should know?

- i got a hint of language server and tree sitter thanks to this wonderfully written post but it is still missing a lot of details like how does the protocol actually look like, what does a standard language server or tree sitter implementation looks like

- what are the other building blocks?

Metasyntactic - 4 hours ago

C# Language Designer here, and one of the designers/architects of 'Roslyn', the semantic analysis engine that powers the C#/VB compilers, VS IDE experiences, and our LSP server.

The original post conflates some concepts worth separating. LSP and language servers operate at an IDE/Editor feature level, whereas tree-sitter is a particular technological choice for parsing text and producing a syntax tree. They serve different purposes but can work together.

What does a language server actually do? LSP defines features like:

  1. Finding references (`textDocument/references`)

  2. Go-to-definition (`textDocument/definition`)

  3. Syntax highlighting (`textDocument/semanticTokens/...`)

  4. Code completion, diagnostics, refactorings
A language server for language X could use tree-sitter internally to implement these features. But it can use whatever technologies it wants. LSP is protocol-level; tree-sitter is an implementation detail.

The article talks about tree-sitter avoiding the problem of "maintaining two parsers" (one for the compiler, one for the editor). This misunderstands how production compiler/IDE systems actually work. In Roslyn, we don't have two parsers. We have one parser that powers both the compiler and the IDE. Same code, same behavior, same error recovery. This works better, not worse. You want your IDE to understand code exactly the way the compiler does, not approximately.

The article highlights tree-sitter being "error-tolerant" and "incremental" as key advantages. These are real concerns. If you're starting from scratch with no existing language infrastructure, tree-sitter's error tolerance is valuable. But this isn't unique to tree-sitter. Production compiler parsers are already extremely error-tolerant because they have to be. People are typing invalid code 99% of the time in an editor.

Roslyn was designed from day one for IDE scenarios. We do incremental parsing (https://github.com/dotnet/roslyn/blob/main/docs/compilers/De...), but more importantly, we do incremental semantic analysis. When you change a file, we recompute semantic information for just the parts that changed, not the entire project. Tree-sitter gives you incremental parsing. That's good. But if you want rich IDE features, you need incremental semantics too.

The article suggests language servers are inherently "heavy" while tree-sitter is "lightweight." This isn't quite right. An LSP server is as heavy or light as you make it. If all you need is parsing and there's no existing language library, fine, use tree-sitter and build a minimal LSP server on top. But if you want to do more, LSP is designed for that. The protocol supports everything from basic syntax highlighting to complex refactorings.

Now, as to syntax highlighting. Despite the name, it isn't just syntactic in modern IDEs. In C#, we call this "classification," and it's powered by the full semantic model. A reference to a symbol is classified by what that symbol is: local, parameter, field, property, class, struct, type parameter, method, etc. Symbol attributes affect presentation. Static members are italicized, unused variables are faded, overwritten values are underlined. We classify based on runtime behavior: `async` methods, `const` fields, extension methods.

This requires deep semantic understanding. Binding symbols, resolving types, understanding scope and lifetime. Tree-sitter gives you a parse tree. That's it. It's excellent at what it does, but it's fundamentally a syntactic tool.

Example: in C#, `var x = GetValue();` is syntactically ambiguous. Is `var` a keyword or a type name? Only semantic analysis can tell you definitively. Tree-sitter would have to guess or mark it generically.

Tree-sitter is definitely a great technology though. Want to add basic syntax highlighting for a new language to your editor? Tree-sitter makes this trivial. Need structural editing or code folding? Perfect use case. However, for rich IDE experiences, the kind where clicking on a variable highlights all its uses, or where hovering shows documentation, or where renaming a method updates all call sites across your codebase, you need semantic analysis. That's a fundamentally different problem than parsing.

Tree-sitter definitely lowers the barrier to supporting new languages in editors. But it's not a replacement for language servers or semantic analysis engines. They're complementary technologies. For languages with mature compilers and semantic engines (C#, TypeScript, Rust, etc.), using the real compiler infrastructure for IDE features makes sense. For cases with simpler tooling needs, tree-sitter is an excellent foundation to build on.

avtar - 8 hours ago

I thought that the blog's domain looked familiar. The author maintains an awesome and well maintained Emacs starter kit https://codeberg.org/ashton314/emacs-bedrock

thepancake - 5 hours ago

Damn the end of your blog post really touched me. I'm high AF, but I think this is still valid. Thanks for the lovely post!

mellery451 - 9 hours ago

one topic not mentioned is creating refactoring tools. My sense is that LSPs generally have the advantage here because they have the full parsed tree, but I suspect it would be possible to build simple syntactic refactorings in TS with the potential to be both faster and less sensitive to broken syntax.

briaoeuidhtns - 10 hours ago

I think the big reason to put syntax highlighting in the language server is you have more info, ex you can highlight symbols imported from a different file in one color for integers and a different for functions

jbreckmckye - 9 hours ago

I'm doing a project with tree sitter right now

Any tips for keeping the grammar sizes under control? I'm distributing a CLI tool that needs to support several languages, and I can see the grammars gradually bloating the binary size

I could build some clever thing where language packs are opt-in and distributed as WASM, maybe. But that could be complex

williamcotton - 8 hours ago

Tree-sitter does incremental parsing which will speed up the Language Server having to otherwise re-parse an entire file.

the_jizzler - 6 hours ago

[dead]