Vim Syntax Highlighting: Trading Accuracy For Speed

Vim has been my preferred text editor for nearly eighteen years. I appreciate its efficiency and ubiquity, the way I can rely on it regardless of what project I am working on or what machine I have ssh’d into. Like any software, however, vim reflects the time in which it was written. In many cases, vim optimizes for speed above all else, an approach that made sense given the limitations of late ’90s computers. Nowhere is this trade-off more apparent than in vim’s implementation of syntax highlighting.

Vim syntax highlighting first appeared in version 5, which was released in 1998. Syntax highlighting quirks have confused vim users ever since. A quick Internet search yields many bug reports, such as this rather plaintive Reddit post from 2015. The Vim Tips Wiki has a full page titled simply “Fix syntax highlighting”.

With its typical candor, the vim user guide explains exactly why syntax highlighting errors occur:

“Vim doesn’t read the whole file to parse the text. It starts parsing wherever you are viewing the file. That saves a lot of time, but sometimes the colors are wrong. A simple fix is hitting CTRL-L. Or scroll back a bit and then forward again.”

It is easy to confuse vim’s syntax highlighter accidentally. For example, starting a new comment at the beginning of a long document will highlight the first screen, but if you jump to the end, vim fails to recognize the text as part of the comment:

Vim provides several knobs to control the “sync point” from which the syntax highlighter begins reprocessing edited text. One can set the sync point to the beginning of the document using :syntax sync fromstart, or to a fixed number of lines before the edited line using :syntax sync minlines={N}. Syncing from the start, or from far enough back that minlines gives reliable results, is often prohibitively slow for large documents.
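To make the trade-off concrete, here is a toy model in Python. It is only a sketch, not vim's implementation: the `highlight` function, its `start` parameter, and the line labels are all invented for illustration, and the only syntax it knows is the `/* ... */` block comment. Like a highlighter that syncs at the top of the screen, it assumes it is not inside a comment at the line where it starts.

```python
def highlight(lines, start=0):
    """Label each line from `start` onward as 'comment' or 'code'.

    Toy model: only /* ... */ block comments are recognized, and the
    scanner assumes it is NOT inside a comment at line `start`.
    """
    in_comment = False
    labels = {}
    for i in range(start, len(lines)):
        line = lines[i]
        touched_comment = in_comment  # line begins inside a comment
        pos = 0
        while True:
            marker = "*/" if in_comment else "/*"
            hit = line.find(marker, pos)
            if hit == -1:
                break
            in_comment = not in_comment
            touched_comment = True
            pos = hit + 2
        labels[i] = "comment" if touched_comment else "code"
    return labels


# A comment opened on the first line and closed roughly 1000 lines later.
lines = ["/* start of a long comment"] + ["text"] * 1000 + ["end */", "x = 1;"]

full = highlight(lines)                   # like ":syntax sync fromstart"
from_view = highlight(lines, start=900)   # start at the top of the "screen"
near = highlight(lines, start=900 - 50)   # like a small minlines value

print(full[950], from_view[950], near[950])  # comment code code
```

Starting 50 lines earlier does not help here, because the comment opens 900 lines above the screen. In this model only a sync distance that reaches back to the opening `/*` gives the right answer, and that is exactly the case where the scan becomes expensive.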

Vim has a dedicated sync method for C-style comments (:syntax sync ccomment, which also accepts a comment group name such as javaComment) that provides more accurate results for highlighting comments. Vim also allows setting the sync point based on a regular expression. We can see an example in the default Python syntax highlighting rules:

" Sync at the beginning of class, function, or method definition.
syn sync match pythonSync grouphere NONE "^\%(def\|class\)\s\+\h\w*\s*[(:]"

Such rules are easy to get wrong, either by missing edge cases or re-parsing too much text. Even when implemented well, they do not necessarily guarantee correct results: an incremental parse might still produce different highlights than a full re-parse. Writing syntax rules is challenging enough without also compensating for the limitations of vim’s parser.

Another anomaly can occur when highlighting a large document. After a timeout controlled by the redrawtime setting, vim will simply stop highlighting. The result looks like this:

[Figure: Vim stops highlighting a large JSON document after a timeout]

For better or worse, many software developers rely on syntax highlighting to quickly identify problems in code. Seeing the colors disappear is an oddly disorienting experience.

It is worth asking whether vim’s approach still makes sense today. To the best of my knowledge, no major text editor developed in the last decade has needed to sacrifice correctness to achieve acceptable performance on modern hardware.

A more promising approach, originally invented by Tim Wagner in his Ph.D. dissertation,[1] has recently been popularized by the Tree-Sitter and Lezer projects.[2] The algorithm uses a clever trick to find syntax regions that might have changed after an edit. It then reruns the parser from the start of the first affected region until it detects that the edit cannot affect subsequent regions. This can drastically reduce the amount of text the parser needs to process.
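The "rerun until the edit cannot affect subsequent regions" idea can be sketched in a few lines of Python. This is a heavily simplified illustration, not Wagner's algorithm and not the code in Tree-Sitter, Lezer, or aretext. It assumes a line-based lexer whose only state is "inside a block comment or not", and it handles only the replacement of a single line. All names (`lex_line`, `full_parse`, `reparse_after_edit`) are invented for the example.

```python
def lex_line(line, in_comment):
    """Lex one line; return (label, in_comment_after)."""
    label = "comment" if in_comment else "code"
    pos = 0
    while True:
        marker = "*/" if in_comment else "/*"
        hit = line.find(marker, pos)
        if hit == -1:
            return label, in_comment
        in_comment = not in_comment
        label = "comment"
        pos = hit + 2


def full_parse(lines):
    """Lex every line, remembering the lexer state after each one."""
    labels, end_states = [], []
    state = False
    for line in lines:
        label, state = lex_line(line, state)
        labels.append(label)
        end_states.append(state)
    return labels, end_states


def reparse_after_edit(lines, labels, end_states, edited):
    """Update labels/end_states in place after lines[edited] was replaced.

    Re-lexes from the edited line and stops as soon as the state after a
    line matches the state recorded there before the edit: from that point
    on, the old results are still valid. Returns the number of lines lexed.
    """
    state = end_states[edited - 1] if edited > 0 else False
    relexed = 0
    for i in range(edited, len(lines)):
        label, state = lex_line(lines[i], state)
        relexed += 1
        converged = state == end_states[i]
        labels[i], end_states[i] = label, state
        if converged:
            break
    return relexed


lines = ["x"] * 10 + ["/* note */"] + ["y"] * 989
labels, end_states = full_parse(lines)

lines[500] = "y = 2"   # an edit that does not change the lexer state
print(reparse_after_edit(lines, labels, end_states, 500))  # 1

lines[0] = "/* x"      # opens a comment that ends at the existing "*/"
print(reparse_after_edit(lines, labels, end_states, 0))    # 11
print(labels == full_parse(lines)[0])                      # True
```

The early exit is safe because each line's result depends only on its text and the state before it. Once the state matches what was recorded before the edit, every later line would lex exactly as it did previously. The worst case is still a full re-lex, for example when an edit opens a comment that is never closed.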

Aretext, the minimalist vim clone I am building, uses this approach. Unlike vim, aretext guarantees that incremental parsing produces the same result as re-parsing the entire document, while maintaining acceptable performance. The video below shows how aretext handles the same files that vim parsed incorrectly above:

Like any system, aretext makes trade-offs to achieve its correctness guarantees. For some edits, aretext may need to re-parse most of the document, which can cause noticeable delays after an edit. To mitigate this issue, aretext’s parsing algorithm has been heavily optimized. Internally, it uses a B+ tree to efficiently search and modify the highlight regions. It also pre-compiles the parser into a minimal-state deterministic finite automaton (DFA) that runs extremely quickly. This speed comes at the expense of flexibility: unlike vim, aretext does not support user-defined syntax rules. Whereas vim aims to be general and extensible, aretext aims to be minimalist and reliable.
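For readers who have not seen a table-driven DFA, the following Python sketch shows why one runs quickly: each input character costs a single table lookup. It is a hand-written toy that recognizes only `/* ... */` comments, not aretext's generated or minimized automaton, and the state names, `TABLE`, and `comment_spans` are invented for illustration.

```python
# States of the toy automaton.
CODE, SLASH, COMMENT, STAR = range(4)


def char_class(ch):
    """Map a character to a column of the transition table."""
    return 0 if ch == "/" else 1 if ch == "*" else 2


# TABLE[state][char_class] -> next state
#    '/'      '*'      other
TABLE = [
    [SLASH,   CODE,    CODE],     # CODE: outside any comment
    [SLASH,   COMMENT, CODE],     # SLASH: saw "/", so "*" opens a comment
    [COMMENT, STAR,    COMMENT],  # COMMENT: inside a comment
    [CODE,    STAR,    COMMENT],  # STAR: saw "*" in a comment, "/" closes it
]


def comment_spans(text):
    """Return (start, end) offsets of block comments, end exclusive."""
    spans = []
    state = CODE
    start = None
    for i, ch in enumerate(text):
        prev, state = state, TABLE[state][char_class(ch)]
        if prev == SLASH and state == COMMENT:
            start = i - 1                  # the comment began at the "/"
        elif prev == STAR and state == CODE:
            spans.append((start, i + 1))   # include the closing "/"
            start = None
    if start is not None:
        spans.append((start, len(text)))   # unterminated comment
    return spans


print(comment_spans("a /* b */ c"))  # [(2, 9)]
```

The fixed table is also where the flexibility is lost: supporting a new language means generating a new table, not loading a user-supplied rule at runtime.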

You can find the full source code for aretext, including the incremental parsing implementation, at

  1. Wagner, T. A. (1997). Practical algorithms for incremental software development environments (Doctoral dissertation, University of California, Berkeley). See especially Chapter 5, “General Incremental Lexical Analysis”.
  2. Some developers working on the neovim project have discussed incorporating tree-sitter.