Slide 1

Slide 1 text

How your text editor does syntax highlighting
By Tristan Hume

Slide 2

Slide 2 text

Where I learned this

Slide 3

Slide 3 text

Regex-based Tokenizers
● Can be simple to implement and understand, and reasonably fast
● Regexes match basic constructs like numbers, keywords, operators
● Often rely on common patterns in programming language grammars
  ○ Keywords, paired delimiters, strings, comments
  ○ Example: Emacs only supports delimiters up to two characters without fancy features
● Can add features until it highlights everything you want
  ○ Multi-line strings/comments: special paired delimiter functionality
  ○ String escapes: ability to set rules to run while between delimiters
  ○ Heredocs: feature to put capture group from start regex into end regex
● Give standardized names to language constructs for themes
  ○ Vim examples: Comment, String, Keyword, Type, Function
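The heredoc feature mentioned above can be sketched as follows. This is a simplified illustration, not how any particular editor implements it: the names `HEREDOC_START` and `heredoc_end_regex` are hypothetical, and the start pattern only handles a bare `<<NAME` form.

```python
import re

# Hypothetical start regex: captures the heredoc delimiter name after "<<".
HEREDOC_START = re.compile(r"<<(\w+)")


def heredoc_end_regex(start_match):
    """Build the end regex by substituting the delimiter captured by the start regex."""
    delimiter = start_match.group(1)
    # re.escape keeps the captured text literal inside the new pattern.
    return re.compile("^" + re.escape(delimiter) + "$")


start = HEREDOC_START.search("puts <<EOS")
end = heredoc_end_regex(start)
```

Here `end` only matches a line consisting of exactly the captured delimiter, so the same grammar rule works for any delimiter name the user picks.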

Slide 4

Slide 4 text

Editors using regex-based tokenizers
● Gedit / GtkSourceView
● Emacs
● Vim
● Textmate
● Sublime Text 2
● Atom
● Visual Studio Code
● Textmate 2
● Sublime Text 3

Slide 5

Slide 5 text

Editors using Textmate-compatible parsing
● Gedit / GtkSourceView
● Emacs
● Vim
● Textmate: Plist-based tmLanguage format
● Sublime Text 2: Same grammars as Textmate
● Atom: Basically tmLanguage grammars translated to JSON
● Visual Studio Code: similar to Atom
● Textmate 2: Original Textmate format with some extensions
● Sublime Text 3: a new YAML-based stack parser format

Slide 6

Slide 6 text

Vim: javascript.vim, 126 lines

Slide 7

Slide 7 text

Vim: Associating Highlight Classes

Slide 8

Slide 8 text

Textmate: Scopes
● Dot-separated paths that go from general to specific:
  ○ constant.numeric.ruby
  ○ support.function.builtin.ruby
  ○ entity.name.class.js
● First few levels are standardized so themes work on any language
  ○ https://www.sublimetext.com/docs/3/scope_naming.html
● Later parts allow themes to customize and enhance highlighting for individual languages and constructs.

Slide 9

Slide 9 text

Textmate: Scope Stacks

Slide 10

Slide 10 text

Textmate: Scope selectors
● Prefixes of scopes
  ○ “string” matches “string.quoted.double.ruby” and “string.unquoted.heredoc.ruby”
● Nested scopes in a scope stack
  ○ “source.python keyword” can highlight any keyword in a python file differently
● Exclude selectors
  ○ “source.ruby string - string source” highlights strings, but not code nested in string interpolation.
● Way more power for themes than Vim-style simple classes
● Atom just translates scopes to HTML classes and uses CSS selectors
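A minimal sketch of how these three selector features could be checked against a scope stack. It assumes simplified semantics (no scoring or ranking between competing selectors), and the function names are hypothetical rather than taken from Textmate or Syntect.

```python
def scope_matches(prefix, scope):
    """True if `prefix` equals `scope` or is a dot-delimited prefix of it."""
    return scope == prefix or scope.startswith(prefix + ".")


def path_matches(selector, stack):
    """True if the selector's scopes match scopes in `stack` in order, gaps allowed."""
    parts = selector.split()
    i = 0
    for scope in stack:
        if i < len(parts) and scope_matches(parts[i], scope):
            i += 1
    return i == len(parts)


def selector_matches(selector, stack):
    """Handle 'include - exclude' selectors: include must match, no exclude may match."""
    include, *excludes = [s.strip() for s in selector.split(" - ")]
    return path_matches(include, stack) and not any(
        path_matches(e, stack) for e in excludes
    )


plain_string = ["source.ruby", "string.quoted.double.ruby"]
interpolated = ["source.ruby", "string.quoted.double.ruby", "source.ruby.embedded"]
in_string = selector_matches("source.ruby string - string source", plain_string)
in_interpolation = selector_matches("source.ruby string - string source", interpolated)
```

With the two example stacks, the selector applies to the plain string but not to the code embedded in the interpolation, which is the behaviour the exclude selector is there to provide.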

Slide 11

Slide 11 text

Sublime syntaxes
● New YAML-based format in Sublime Text 3
● Turns the tmLanguage model into a full stack-based grammar format
● Allows very fancy syntax highlighting and language analysis
  ○ Nested languages
  ○ Heredocs with any delimiter where the rest of the line is highlighted properly
  ○ Full parsing of language constructs and nesting
● The format my Syntect Rust library interprets

Slide 12

Slide 12 text

What Sublime/Syntect can do

Slide 13

Slide 13 text

Example: Javascript.sublime-syntax, 1400 lines

Slide 14

Slide 14 text

Basic procedure for syntax highlighting
● For each line in the file:
● Starting with the position at the beginning of the line
● Loop to find all the tokens on the line:
  ○ Try matching each regex against the line from the current position forwards
  ○ Take the one that matches closest to the current position, break if nothing matched
  ○ Execute the associated action. Could be:
    ■ Assigning a highlighting type to the matched substring
    ■ If it’s a delimiter, push or pop a context of regexes from the parsing stack
  ○ Move the current position to the end of the current match
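The procedure above can be sketched in a few lines of Python. This is a toy under stated assumptions: the `CONTEXTS` grammar, its scope names, and `tokenize_line` are all hypothetical, and only matched substrings get tokens (a real highlighter would also scope the text between delimiters).

```python
import re

# Hypothetical toy grammar: each context is a list of (regex, scope, action) rules.
# action is None, ("push", context_name), or "pop".
CONTEXTS = {
    "main": [
        (re.compile(r"\b(?:if|else|return)\b"), "keyword", None),
        (re.compile(r"\d+"), "constant.numeric", None),
        (re.compile(r'"'), "string", ("push", "string")),
    ],
    "string": [
        (re.compile(r"\\."), "constant.character.escape", None),
        (re.compile(r'"'), "string", "pop"),
    ],
}


def tokenize_line(line, stack):
    """Return (start, end, scope) tokens for one line; `stack` is mutated in place."""
    tokens = []
    pos = 0
    while pos < len(line):
        # Try every regex of the current context from the current position forwards.
        best = None
        for regex, scope, action in CONTEXTS[stack[-1]]:
            m = regex.search(line, pos)
            # Keep the match that starts closest to the current position.
            if m and (best is None or m.start() < best[0].start()):
                best = (m, scope, action)
        if best is None:
            break  # nothing else matches on this line
        m, scope, action = best
        tokens.append((m.start(), m.end(), scope))
        if action == "pop":
            stack.pop()
        elif action is not None:
            stack.append(action[1])
        # Move to the end of the match, forcing progress on empty matches.
        pos = m.end() if m.end() > pos else pos + 1
    return tokens
```

Because the stack is passed in and mutated, calling `tokenize_line` on consecutive lines with the same list carries an unterminated string over to the next line.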

Slide 15

Slide 15 text

Problem: this isn’t good enough
● For each line, for each token, we have to match every regex on the string
● It’s slow, but there are lots of optimizations we can do:
  ○ Cache regex matches so that we remember if a regex matched a line and where
  ○ Only output the places tokens start and end so that we don’t copy strings too much
● It can get stuck in infinite push/pop loops: need to detect them
● Need to support inheriting regexes in a stack for nested languages
● If you don’t lazily compile the regexes while highlighting, your editor will take forever to start up.
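A sketch of two of these optimizations together: lazy compilation and match caching. The `CachedRegex` class is hypothetical and uses Python's `re` module as a stand-in for the real regex engine. The cache relies on the fact that a leftmost match found from an earlier position is still the answer for any later position at or before the match start.

```python
import re


class CachedRegex:
    """Hypothetical wrapper: compiles lazily and remembers its last search on a line."""

    def __init__(self, pattern):
        self.pattern = pattern
        self._compiled = None  # compiled on first use, not at editor startup
        self._line = None      # line the cached result belongs to
        self._from = 0         # position the cached search started from
        self._match = None     # cached match object, or None for "no match"
        self.searches = 0      # counts real engine calls, for illustration only

    def search(self, line, pos):
        if self._compiled is None:
            self._compiled = re.compile(self.pattern)
        # The cached result is reusable if it was computed on the same line from
        # an earlier position and the match (if any) starts at or after `pos`.
        cached_ok = (
            self._line == line
            and self._from <= pos
            and (self._match is None or self._match.start() >= pos)
        )
        if not cached_ok:
            self.searches += 1
            self._line, self._from = line, pos
            self._match = self._compiled.search(line, pos)
        return self._match
```

A real implementation would more likely key the cache on a line number than compare line strings, but the reuse condition is the same.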

Slide 16

Slide 16 text

Problem: Regex engines
● Most of the time highlighting is spent in the regex engine
● Textmate grammars assume the Oniguruma regex engine
  ○ Lookaheads, lookbehinds, atomic groups, named captures…
  ○ Suffers from catastrophic backtracking taking exponential time in some cases
● Sublime Text 3 has its own custom absurdly fast regex engine
  ○ Only works on regexes that can be represented as DFAs
  ○ Matches multiple regexes in one pass
● I’m working on using an engine based on Rust’s regex crate for syntect
  ○ Reduces catastrophic backtracking by using a DFA engine where possible
  ○ Theoretically faster, but so far it only matches Oniguruma’s performance

Slide 17

Slide 17 text

jQuery benchmark: 9200 lines of code
● Time just for syntax highlighting:
  ○ Syntect takes 680ms, or 13,000 lines per second
  ○ Sublime Text 3 dev build takes 90ms
● Comparisons that also include rendering one screen of text:
  ○ Sublime Text 3 dev build takes ~200ms
  ○ Textmate 2, Spacemacs and Visual Studio Code all take ~2 seconds
  ○ Atom takes 6 seconds
● These comparisons aren't totally fair, except the one to Sublime Text since that is using the same theme and the same complex definition for ES6 syntax.

Slide 18

Slide 18 text

Optimization: Only re-highlight the screen
● Changes can only affect text further on
● Can suspend the highlighting process between lines
● On keystroke, re-highlight that line and every further line on screen
● Kick off a background job to re-highlight the rest of the file
● Can stop if the state stack for a line didn’t change.
● Bounds highlighting latency to only ~100 lines, or ~7ms
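A sketch of the "stop if the state stack didn't change" idea, assuming a single line was edited in place (no lines inserted or deleted). `end_state` is a toy stand-in for a real tokenizer that only tracks `/* ... */` comments; all names here are hypothetical.

```python
def end_state(line, state):
    """Toy tokenizer: return the parser stack (a tuple) at the end of `line`."""
    pos = 0
    while True:
        if state and state[-1] == "comment":
            idx = line.find("*/", pos)
            if idx < 0:
                return state
            state, pos = state[:-1], idx + 2
        else:
            idx = line.find("/*", pos)
            if idx < 0:
                return state
            state, pos = state + ("comment",), idx + 2


def highlight_all(lines):
    """Return the end-of-line state for every line of the file."""
    states, state = [], ()
    for line in lines:
        state = end_state(line, state)
        states.append(state)
    return states


def rehighlight(lines, states, edited):
    """Re-highlight from line `edited`; return how many lines were reprocessed."""
    state = states[edited - 1] if edited > 0 else ()
    count = 0
    for i in range(edited, len(lines)):
        new_state = end_state(lines[i], state)
        count += 1
        unchanged = new_state == states[i]
        states[i] = new_state
        state = new_state
        if unchanged:
            break  # later lines start from the same state, so they cannot change
    return count
```

An ordinary edit reprocesses one line, while opening a comment reprocesses every following line until the stored state matches again. An editor would cap the synchronous part at the visible lines and hand the rest to the background job.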

Slide 19

Slide 19 text

Code Tour (if there’s time left)

Slide 20

Slide 20 text

The End