How Your Text Editor Does Syntax Highlighting

By Tristan Hume

Where I learned this

Regex-based Tokenizers

• Can be simple to implement, understand and reasonably fast
• Regexes match basic constructs like numbers, keywords, operators
• Often relies on common patterns in programming language grammars
  ◦ Keywords, paired delimiters, strings, comments
  ◦ Example: Emacs only supports delimiters up to two characters without fancy features
• Can add features until it highlights everything you want
  ◦ Multi-line strings/comments: special paired delimiter functionality
  ◦ String escapes: ability to set rules to run while between delimiters
  ◦ Heredocs: feature to put capture group from start regex into end regex
• Give standardized names to language constructs for themes:
  ◦ Vim examples: Comment, String, Keyword, Type, Function
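As a rough illustration of the idea above, here is a minimal regex tokenizer sketch in Python. The rule table, the patterns and the `tokenize` helper are all hypothetical; only the class names (Comment, String, Keyword) come from the Vim examples on the slide, and "Number" is an extra made-up class.

```python
import re

# Hypothetical rule table: (highlight class, regex). The patterns are
# illustrative only and are not taken from any real editor's grammar.
RULES = [
    ("Comment", r"//[^\n]*"),
    ("String", r'"(?:\\.|[^"\\])*"'),
    ("Number", r"\b\d+(?:\.\d+)?\b"),
    ("Keyword", r"\b(?:if|else|return|function|var)\b"),
]

# One alternation with a named group per rule; the group that matched
# tells us which highlight class to assign.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def tokenize(text):
    """Return (class, matched text) pairs for every construct the rules recognise."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]
```

For example, `tokenize('var x = 42; // answer')` yields a Keyword, a Number and a Comment token. Because the string rule consumes its whole match, a keyword that appears inside a string literal is not reported separately.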

Editors using regex-based tokenizers

• Gedit / GtkSourceView
• Emacs
• Vim
• Textmate
• Sublime Text 2
• Atom
• Visual Studio Code
• Textmate 2
• Sublime Text 3

Editors using Textmate-compatible parsing

• Gedit / GtkSourceView
• Emacs
• Vim
• Textmate: Plist-based tmLanguage format
• Sublime Text 2: Same grammars as Textmate
• Atom: Basically tmLanguage grammars translated to JSON
• Visual Studio Code: similar to Atom
• Textmate 2: Original Textmate format with some extensions
• Sublime Text 3: a new YAML-based stack parser format

Vim: javascript.vim, 126 lines

Vim: Associating Highlight Classes

Textmate: Scopes

• Dot-separated paths that go from general to specific:
  ◦ constant.numeric.ruby
  ◦ support.function.builtin.ruby
  ◦ entity.name.class.js
• First few levels are standardized so themes work on any language
  ◦ https://www.sublimetext.com/docs/3/scope_naming.html
• Later parts allow themes to customize and enhance highlighting for individual languages and constructs.

Textmate: Scope Stacks

Textmate: Scope selectors

• Prefixes of scopes
  ◦ “string” matches “string.quoted.double.ruby” and “string.unquoted.heredoc.ruby”
• Nested scopes in a scope stack
  ◦ “source.python keyword” can highlight any keyword in a python file differently
• Exclude selectors
  ◦ “source.ruby string - string source” highlights strings, but not code nested in string interpolation.
• Way more power for themes than Vim-style simple classes
• Atom just translates scopes to HTML classes and uses CSS selectors
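The three selector forms above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Textmate's actual matcher: the function names are hypothetical, a selector is taken to be space-separated scope prefixes with optional " - " exclusions, and the ranking between competing selectors that real themes need is ignored. The scope name `source.ruby.embedded` in the usage note is also made up for the example.

```python
def scope_matches(prefix, scope):
    """True if `prefix` equals `scope` or is a prefix of it ending at a dot."""
    return scope == prefix or scope.startswith(prefix + ".")


def path_matches(path, stack):
    """True if each prefix in `path` matches some scope in `stack`, in order.

    `stack` is a list of scopes from outermost to innermost. Sharing one
    iterator across the prefixes makes this an ordered (subsequence) match.
    """
    remaining = iter(stack)
    return all(any(scope_matches(p, s) for s in remaining) for p in path.split())


def selector_matches(selector, stack):
    """Match 'include - exclude - exclude...' against a scope stack."""
    include, *excludes = selector.split(" - ")
    return path_matches(include, stack) and not any(
        path_matches(e, stack) for e in excludes
    )
```

With this model, `selector_matches("source.ruby string - string source", ["source.ruby", "string.quoted.double.ruby"])` is true, while adding a third scope such as `source.ruby.embedded` (code inside an interpolation) makes it false because the exclusion `string source` now matches.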

Sublime syntaxes

• New YAML-based format in Sublime Text 3
• Turns the tmLanguage model into a full stack-based grammar format
• Allows very fancy syntax highlighting and language analysis
  ◦ Nested languages
  ◦ Heredocs with any delimiter where the rest of the line is highlighted properly
  ◦ Full parsing of language constructs and nesting
• The format my Syntect Rust library interprets

What Sublime/Syntect can do

Example: Javascript.sublime-syntax, 1400 lines

Basic procedure for syntax highlighting

• For each line in the file:
• Starting with the position at the beginning of the line
• Loop to find all the tokens on the line:
  ◦ Try matching each regex against the line from the current position forwards
  ◦ Take the one that matches closest to the current position, break if nothing matched
  ◦ Execute the associated action. Could be:
    ▪ Assigning a highlighting type to the matched substring
    ▪ If it’s a delimiter, push or pop a context of regexes from the parsing stack
  ◦ Move the current position to the end of the current match
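The procedure above can be sketched as follows. The tiny `GRAMMAR` table, its context names and the `highlight_line` function are hypothetical, and the sketch assumes every pattern consumes at least one character so the position always advances. A real grammar would also assign a scope to the unmatched text inside a context, which this leaves out.

```python
import re

# Hypothetical grammar: each context is a list of (regex, scope, action).
# action is None, ("push", context_name) or ("pop",).
GRAMMAR = {
    "main": [
        (re.compile(r'"'), "string", ("push", "string")),
        (re.compile(r"\b\d+\b"), "number", None),
        (re.compile(r"\b(?:if|else|return)\b"), "keyword", None),
    ],
    "string": [
        (re.compile(r"\\."), "escape", None),
        (re.compile(r'"'), "string", ("pop",)),
    ],
}


def highlight_line(line, stack):
    """Return (start, end, scope) tokens for one line.

    `stack` is a list of context names and is mutated in place, so the
    state at the end of one line carries over to the next line.
    """
    tokens = []
    pos = 0
    while pos < len(line):
        best = None
        # Try every regex of the current context from the current position.
        for regex, scope, action in GRAMMAR[stack[-1]]:
            m = regex.search(line, pos)
            # Keep the match that starts closest to pos (earlier rule wins ties).
            if m and (best is None or m.start() < best[0].start()):
                best = (m, scope, action)
        if best is None:
            break  # nothing else matches on this line
        m, scope, action = best
        tokens.append((m.start(), m.end(), scope))
        if action is not None:
            if action[0] == "push":
                stack.append(action[1])
            elif len(stack) > 1:
                stack.pop()
        pos = m.end()  # assumes non-empty matches
    return tokens
```

Calling it once per line with the same `stack` list gives multi-line constructs for free: a string opened on one line leaves `"string"` on the stack, so keywords on the next line are ignored until the closing quote pops it.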

Problem: this isn’t good enough

• For each line, for each token, we have to match every regex on the string
• It’s slow, but there are lots of optimizations we can do:
  ◦ Cache regex matches so that we remember if a regex matched a line and where
  ◦ Only output the places tokens start and end so that we don’t copy strings too much
• It can get stuck in infinite push/pop loops: need to detect them
• Need to support inheriting regexes in a stack for nested languages
• If you don’t lazily compile the regexes while highlighting, your editor will take forever to start up.
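Two of the optimizations above, caching regex matches and lazy compilation, can be sketched together. The `CachedRegex` class is a hypothetical illustration and not how any particular editor or library implements it. It relies on one observation: if a search from an earlier position on the same line found its leftmost match at or after the current position, that match is still the answer, and if it found nothing, there is still nothing.

```python
import re


class CachedRegex:
    """A pattern with lazy compilation and a one-entry per-line search cache."""

    def __init__(self, pattern):
        self.pattern = pattern
        self._compiled = None    # compiled on first use, not at startup
        self._cache_line = None  # line the cached result belongs to
        self._cache_from = None  # position the cached search started at
        self._cache_match = None
        self.searches = 0        # real regex-engine calls, for illustration

    def search(self, line, pos):
        if self._compiled is None:
            self._compiled = re.compile(self.pattern)
        if (
            self._cache_line == line
            and self._cache_from is not None
            and self._cache_from <= pos
        ):
            m = self._cache_match
            if m is None:
                return None  # no match from an earlier position, so none from here
            if m.start() >= pos:
                return m     # the earlier leftmost match is still ahead of us
        self.searches += 1
        m = self._compiled.search(line, pos)
        self._cache_line, self._cache_from, self._cache_match = line, pos, m
        return m
```

In the tokenizing loop every regex of a context is searched again after each token, so most of those searches hit the cache: only the regex whose match was just consumed has to run again.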

Problem: Regex engines

• Most of the time highlighting is spent in the regex engine
• Textmate grammars assume the Oniguruma regex engine
  ◦ Lookaheads, look-behinds, atomic groups, named captures…
  ◦ Suffers from catastrophic backtracking taking exponential time in some cases
• Sublime Text 3 has its own custom absurdly fast regex engine
  ◦ Only works on regexes that can be represented as DFAs
  ◦ Matches multiple regexes in one pass
• I’m working on using an engine based on Rust’s regex crate for syntect
  ◦ Reduces catastrophic backtracking by using a DFA engine where possible
  ◦ Theoretically increased performance, so far it just matches Oniguruma

jQuery benchmark: 9200 lines of code

• Time just for syntax highlighting:
  ◦ Syntect takes 680ms, or 13,000 lines per second
  ◦ Sublime Text 3 dev build takes 90ms
• Comparisons that also include rendering one screen of text:
  ◦ Sublime Text 3 dev build takes ~200ms
  ◦ Textmate 2, Spacemacs and Visual Studio Code all take ~2 seconds
  ◦ Atom takes 6 seconds
• These comparisons aren't totally fair, except the one to Sublime Text since that is using the same theme and the same complex definition for ES6 syntax.

Optimization: Only re-highlight the screen

• Changes can only affect text further on
• Can suspend the highlighting process between lines
• On keystroke, re-highlight that line and every further line on screen
• Kick off a background job to re-highlight the rest of the file
• Can stop if the state stack for a line didn’t change.
• Bounds the latency for highlighting to only ~100 lines = ~7ms
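The "stop if the state stack didn’t change" rule can be sketched like this. Both functions are hypothetical: `end_state` is a stand-in for a real highlighter that only tracks `/* ... */` comments, and `rehighlight` assumes the state stack at the end of every line was cached by an earlier full pass. The on-screen and background scheduling from the slide is left out.

```python
def end_state(line, state):
    """Stand-in highlighter: return the context stack at the end of `line`.

    `state` is a tuple of context names at the start of the line. Only
    /* ... */ comments are tracked, purely for illustration.
    """
    pos = 0
    while True:
        if state[-1] == "comment":
            i = line.find("*/", pos)
            if i < 0:
                return state
            state, pos = state[:-1], i + 2
        else:
            i = line.find("/*", pos)
            if i < 0:
                return state
            state, pos = state + ("comment",), i + 2


def rehighlight(lines, states, changed):
    """Re-highlight from line `changed` after an edit; return the lines redone.

    states[i] is the cached stack at the end of line i and is updated in
    place. Work stops once a recomputed state equals the cached one,
    because every later line would then be highlighted exactly as before.
    """
    redone = []
    state = states[changed - 1] if changed > 0 else ("main",)
    for i in range(changed, len(lines)):
        state = end_state(lines[i], state)
        redone.append(i)
        if states[i] == state:
            break
        states[i] = state
    return redone
```

An edit that leaves the end-of-line state unchanged touches only the edited line; an edit that opens or closes a comment propagates only as far as the first line whose cached state agrees again.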

Code Tour (if there’s time left)

The End
