How Your Text Editor Does Syntax Highlighting

How Your Text Editor Does Syntax Highlighting

Presented at the UWaterloo CS Club Alt+Tab Talks event in Winter 2017.

7f9969d3b17c6fd9d604ad2c09a06d58?s=128

Tristan Hume

April 16, 2017
Tweet

Transcript

  1. How your text editor does syntax highlighting By Tristan Hume

  2. Where I learned this

  3. Regex-based Tokenizers • Can be simple to implement, understand and

    reasonably fast • Regexes match basic constructs like numbers, keywords, operators • Often relies on common patterns in programming language grammars ◦ Keywords, paired delimiters, strings, comments ◦ Example: Emacs only supports delimiters up to two characters without fancy features • Can add features until it highlights everything you want ◦ Multi-line strings/comments: special paired delimiter functionality ◦ String escapes: ability to set rules to run while between delimiters ◦ Heredocs: feature to put capture group from start regex into end regex • Give standardized names to language constructs for themes: ◦ Vim examples: Comment, String, Keyword, Type, Function
  4. Editors using regex-based tokenizers • Gedit / GtkSourceView • Emacs

    • Vim • Textmate • Sublime Text 2 • Atom • Visual Studio Code • Textmate 2 • Sublime Text 3
  5. Editors using Textmate-compatible parsing • Gedit / GtkSourceView • Emacs

    • Vim • Textmate: Plist-based tmLanguage format • Sublime Text 2: Same grammars as Textmate • Atom: Basically tmLanguage grammars translated to JSON • Visual Studio Code: similar to Atom • Textmate 2: Original Textmate format with some extensions • Sublime Text 3: a new YAML-based stack parser format
  6. Vim: javascript.vim, 126 lines

  7. Vim: Associating Highlight Classes

  8. Textmate: Scopes • Dot-separated paths that go from general to

    specific: ◦ constant.numeric.ruby ◦ support.function.builtin.ruby ◦ entity.name.class.js • First few levels are standardized so themes work on any language ◦ https://www.sublimetext.com/docs/3/scope_naming.html • Later parts allow themes to customize and enhance highlighting for individual languages and constructs.
  9. Textmate: Scope Stacks

  10. Textmate: Scope selectors • Prefixes of scopes ◦ “string” matches

    “string.quoted.double.ruby” and “string.unquoted.heredoc.ruby” • Nested scopes in a scope stack ◦ “source.python keyword” can highlight any keyword in a python file differently • Exclude selectors ◦ “source.ruby string - string source” highlights strings, but not code nested in string interpolation. • Way more power for themes than Vim-style simple classes • Atom just translates scopes to HTML classes and uses CSS selectors
  11. Sublime syntaxes • New YAML-based format in Sublime Text 3

    • Turns the tmLanguage model into a full stack-based grammar format • Allows very fancy syntax highlighting and language analysis ◦ Nested languages ◦ Heredocs with any delimiter where the rest of the line is highlighted properly ◦ Full parsing of language constructs and nesting • The format my Syntect Rust library interprets
  12. What Sublime/Syntect can do

  13. Example: Javascript.sublime-syntax, 1400 lines

  14. Basic procedure for syntax highlighting • For each line in

    the file: • Starting with the position at the beginning of the line • Loop to find all the tokens on the line: ◦ Try matching each regex against the line from the current position forwards ◦ Take the one that matches closest to the current position, break if nothing matched ◦ Execute the associated action. Could be: ▪ Assigning a highlighting type to the matched substring ▪ If it’s a delimiter, push or pop a context of regexes from the parsing stack ◦ Move the current position to the end of the current match
  15. Problem: this isn’t good enough • For each line for

    each token we have to match every regex on the string • It’s slow, but there’s lots of optimizations we can do: ◦ Cache regex matches so that we remember if a regex matched a line and where ◦ Only output the places tokens start and end so that we don’t copy strings too much • It can get stuck in infinite push/pop loops: need to detect them • Need to support inheriting regexes in a stack for nested languages • If you don’t lazily compile the regexes while highlighting your editor will take forever to start up.
  16. Problem: Regex engines • Most of the time highlighting is

    spent in the regex engine • Textmate grammars assume the Oniguruma regex engine ◦ Lookaheads, look-behinds, atomic groups, named captures… ◦ Suffers from catastrophic backtracking taking exponential time in some cases • Sublime Text 3 has its own custom absurdly fast regex engine ◦ Only works on regexes that can be represented as DFAs ◦ Matches multiple regexes in one pass • I’m working on using an engine based on Rust’s regex crate for syntect ◦ Reduces catastrophic backtracking by using a DFA engine where possible ◦ Theoretically increased performance, so far it just matches Oniguruma
  17. jQuery benchmark: 9200 lines of code • Time just for

    syntax highlighting: ◦ Syntect takes 680ms, or 13,000 lines per second ◦ Sublime Text 3 dev build takes 90ms • Comparisons that also include rendering one screen of text: ◦ Sublime Text 3 dev build takes ~200ms ◦ Textmate 2, Spacemacs and Visual Studio Code all take ~2 seconds ◦ Atom takes 6 seconds • These comparisons aren't totally fair, except the one to Sublime Text since that is using the same theme and the same complex definition for ES6 syntax.
  18. Optimization: Only re-highlight the screen • Changes can only affect

    text further on • Can suspend the highlighting process between lines • On keystroke, re-highlight that line and every further line on screen • Kick off a background job to re-highlight the rest of the file • Can stop if the state stack for a line didn’t change. • Bounds the latency for highlighting to only ~100 lines = ~7ms
  19. Code Tour (if there’s time left)

  20. The End