
Reimplement RegExp in JavaScript

Regular expressions are used in many places to filter and validate input. While they are cute and powerful, they are usually hard to write and debug. So why not implement the entire RegExp object in JavaScript, such that each execution step can be visualized and debugging becomes easy? And what happens if you build a RegExp JIT in JavaScript that gets JITed by the JavaScript JIT … ;)

Julian Viereck

September 15, 2013

Transcript

  1. Overview • ParseTree • NodeList • Matcher • RegExpJS object

    itself • Writing a JIT in a JIT • Conclusion
  2. ParseTree • ParseTree: • structural representation of a string •

    described by grammar rules • High-level: • RegExp =[Parser]=> ParseTree
  3. Char: “c” Term Char: “b” Term Char: “a” Term Alternative

    Alternative Disjunction Example: /(ab|c)/ ParseTree Group
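
The ParseTree for /(ab|c)/ can be reproduced with a small recursive-descent parser. The following is a minimal sketch, assuming a tiny subset of the grammar (literal characters, groups and `|` only); the function name `parse` and the node fields are hypothetical, and Term and Char are collapsed into a single node for brevity.

```javascript
// Sketch of a hand-written parser for a small RegExp subset.
// Node shapes ({ type, alternatives, terms, body, value }) are illustrative only.
function parse(pattern) {
  let pos = 0;

  // Disjunction ::= Alternative ('|' Alternative)*
  function parseDisjunction() {
    const alternatives = [parseAlternative()];
    while (pattern[pos] === '|') {
      pos++;
      alternatives.push(parseAlternative());
    }
    return { type: 'Disjunction', alternatives };
  }

  // Alternative ::= Term*
  function parseAlternative() {
    const terms = [];
    while (pos < pattern.length && pattern[pos] !== '|' && pattern[pos] !== ')') {
      terms.push(parseTerm());
    }
    return { type: 'Alternative', terms };
  }

  // Term ::= '(' Disjunction ')' | Char
  function parseTerm() {
    if (pattern[pos] === '(') {
      pos++;
      const body = parseDisjunction();
      if (pattern[pos] !== ')') throw new SyntaxError('Expected ) at ' + pos);
      pos++;
      return { type: 'Group', body };
    }
    return { type: 'Char', value: pattern[pos++] };
  }

  const tree = parseDisjunction();
  if (pos !== pattern.length) {
    throw new SyntaxError('Unexpected ' + pattern[pos] + ' at ' + pos);
  }
  return tree;
}

const parseTree = parse('(ab|c)');
console.log(JSON.stringify(parseTree, null, 2));
```

The outer Disjunction holds a single Alternative whose only term is the Group; inside the Group sit the two Alternatives "ab" and "c".
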
  4. Parser • Handwritten - LL(?) • Only basic error handling/messages

    • Uses RegExp itself << replace them ;) • Emits “useful” information
  5. Notes on parsing • Should have gone with parser generator?

    • Mostly follows the spec • Addition: added handling for /]/ • Parsing numbers needs an extra slide...
  6. Number parsing • parseDecimalEscape(...) • /(a)(b)\2/ ➡ reference • /(a)(b)\71/

    ➡ NOT reference • IF: octal number, parseInt(71,8) • /(a)(b)\91/ ➡ NOT reference
  7. Number parsing • parseDecimalEscape(...) • /(a)(b)\2/ ➡ reference • /(a)(b)\71/

    ➡ NOT reference • IF: octal number, parseInt(71,8) • /(a)(b)\91/ ➡ NOT reference • ELSE: Ignore slash, match /91/
  8. Number parsing • parseDecimalEscape(...) • /(a)(b)\2/ ➡ reference • /(a)(b)\71/

    ➡ NOT reference • IF: octal number, parseInt(71,8) • /(a)(b)\91/ ➡ NOT reference • ELSE: Ignore slash, match /91/ V8 at least...
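
The three cases above can be written down as one decision function. This is a sketch only: `resolveDecimalEscape` and its return shape are hypothetical names, not the talk's `parseDecimalEscape(...)`, and the octal and "ignore slash" fallbacks follow the legacy (non-Unicode) behaviour the slide attributes to V8.

```javascript
// digits: the run of decimal digits that follows the backslash, e.g. '71'.
// groupCount: the number of capturing groups in the whole pattern.
function resolveDecimalEscape(digits, groupCount) {
  const value = parseInt(digits, 10);

  // Case 1: the number names an existing group, so it is a backreference.
  if (value >= 1 && value <= groupCount) {
    return { type: 'backreference', group: value, consumed: digits.length };
  }

  // Case 2: legacy octal escape, at most three digits and at most 0o377.
  const octal = /^(?:[0-3][0-7]{0,2}|[4-7][0-7]?)/.exec(digits);
  if (octal !== null) {
    return { type: 'char', charCode: parseInt(octal[0], 8), consumed: octal[0].length };
  }

  // Case 3: ignore the backslash; the first digit matches itself and the
  // remaining digits are parsed as ordinary characters.
  return { type: 'char', charCode: digits.charCodeAt(0), consumed: 1 };
}

console.log(resolveDecimalEscape('2', 2));   // backreference to group 2
console.log(resolveDecimalEscape('71', 2));  // octal 71 is the character "9"
console.log(resolveDecimalEscape('91', 2));  // literal "9", then "1" follows
```

Native engines can be used as a cross-check: `/(a)(b)\71/.test('ab9')` and `/(a)(b)\91/.test('ab91')` are both true in non-Unicode mode.
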
  9. CHAR Data: “b” • ParseTree ➡ NodeList • Data structure

    Matcher uses NodeList Data: “a” CHAR Example: /ab/
  10. DONE CHAR Data: “b” • ParseTree ➡ NodeList • Data

    structure Matcher uses NodeList Data: “a” CHAR Example: /ab/
  11. Data: “b” CHAR Data: “a” CHAR CHAR Data: “c” ALTR

    Example: /(ab|c)*c/ Index: 1 BEGIN_GROUP Id: 1 Min: 0 Max: ? REPEAT
  12. JOIN Data: “b” CHAR Data: “a” CHAR CHAR Data: “c”

    ALTR Example: /(ab|c)*c/ Index: 1 BEGIN_GROUP Id: 1 Min: 0 Max: ? REPEAT
  13. Index: 1 END_GROUP JOIN Data: “b” CHAR Data: “a” CHAR

    CHAR Data: “c” ALTR Example: /(ab|c)*c/ Index: 1 BEGIN_GROUP Id: 1 Min: 0 Max: ? REPEAT
  14. Index: 1 END_GROUP JOIN Data: “b” CHAR Data: “a” CHAR

    CHAR Data: “c” ALTR Example: /(ab|c)*c/ Index: 1 BEGIN_GROUP Id: 1 Min: 0 Max: ? REPEAT
  15. Index: 1 END_GROUP JOIN Data: “b” CHAR Data: “a” CHAR

    CHAR Data: “c” ALTR Data: “c” CHAR Example: /(ab|c)*c/ Index: 1 BEGIN_GROUP Id: 1 Min: 0 Max: ? REPEAT
  16. Index: 1 END_GROUP JOIN Data: “b” CHAR Data: “a” CHAR

    CHAR Data: “c” ALTR DONE Data: “c” CHAR Example: /(ab|c)*c/ Index: 1 BEGIN_GROUP Id: 1 Min: 0 Max: ? REPEAT
  17. Index: 1 END_GROUP JOIN Data: “b” CHAR Data: “a” CHAR

    CHAR Data: “c” ALTR DONE Data: “c” CHAR Example: /(ab|c)*c/ Index: 1 BEGIN_GROUP Id: 1 Min: 0 Max: ? REPEAT Char: “c” Term Char: “b” Term Char: “a” Term Alternative Alternative Disjunction Group
  18. NodeList • Looks complicated to build • Easy by walking

    over ParseTree • Combine one node after the other
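
Walking the ParseTree to produce the NodeList could look roughly like this. It is a sketch under assumptions: the node kinds CHAR, ALTR, JOIN, BEGIN_GROUP, END_GROUP and DONE come from the slides, but the field names (`kind`, `data`, `next`, `branches`) and the input tree shape are invented for illustration, and REPEAT is left out.

```javascript
// Build the node chain for `node` so that it continues with `next` afterwards.
function build(node, next) {
  switch (node.type) {
    case 'Char':
      return { kind: 'CHAR', data: node.value, next };
    case 'Alternative':
      // Combine one node after the other, starting from the last term.
      return node.terms.reduceRight((rest, term) => build(term, rest), next);
    case 'Disjunction': {
      if (node.alternatives.length === 1) return build(node.alternatives[0], next);
      // All branches meet again in one shared JOIN node.
      const join = { kind: 'JOIN', next };
      return { kind: 'ALTR', branches: node.alternatives.map((alt) => build(alt, join)) };
    }
    case 'Group': {
      const end = { kind: 'END_GROUP', index: node.index, next };
      return { kind: 'BEGIN_GROUP', index: node.index, next: build(node.body, end) };
    }
    default:
      throw new Error('Unknown parse tree node: ' + node.type);
  }
}

// Hand-written tree for /ab/.
const treeForAb = {
  type: 'Alternative',
  terms: [{ type: 'Char', value: 'a' }, { type: 'Char', value: 'b' }],
};
const nodeListForAb = build(treeForAb, { kind: 'DONE' });
console.log(nodeListForAb.data, nodeListForAb.next.data, nodeListForAb.next.next.kind);
```

Passing the continuation as `next` is what makes the walk simple: every subtree only needs to know what comes after it.
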
  19. Matcher • Does the actual RegExp matching • Matcher(state, nodeList)

    • state = { string, index, traces, counters, ...} • Tries all possibilities till match OR no progress • Calls Matcher(state’, nodeList’) recursively • If trace fails ➠ return function = backtracking
  20. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) a b c d a b d ^
  21. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) a b c d a b d ^ “ s’=s.clone()
  22. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) a b c d a b d ^ “ s’=s.clone()
  23. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) • test(‘b’, /b/): ✓ a b c d a b d ^ “ s’=s.clone()
  24. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) • test(‘b’, /b/): ✓ a b c d a b d ^ “ s’=s.clone()
  25. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) • test(‘b’, /b/): ✓ a b c d a b d ^ “ s’=s.clone()
  26. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) • test(‘b’, /b/): ✓ • test(‘d’, /d/): ✓ a b c d a b d ^ “ s’=s.clone()
  27. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) • test(‘b’, /b/): ✓ • test(‘d’, /d/): ✓ a b c d a b d ^ “ s’=s.clone()
  28. Matcher • s = new State(‘abc’) • match(s, /a(b|c)d/) •

    test(s, /a/): ✓ • try /b/: match(s’, /(b|c)d/) • test(‘b’, /b/): ✓ • test(‘d’, /d/): ✓ • DONE: ✓ a b c d a b d ^ “ s’=s.clone()
  29. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) a c d ^ a b c d
  30. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) a c d ^ a b c d “ s’=s.clone()
  31. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ a c d ^ a b c d “ s’=s.clone()
  32. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack a c d ^ a b c d “ s’=s.clone()
  33. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack a c d ^ a b c d “ s’=s.clone()
  34. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) a c d ^ a b c d “ s’=s.clone()
  35. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) a c d ^ a b c d “ s’=s.clone()
  36. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) • test(‘c’, /c/): ✓ a c d ^ a b c d “ s’=s.clone()
  37. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) • test(‘c’, /c/): ✓ a c d ^ a b c d “ s’=s.clone()
  38. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) • test(‘c’, /c/): ✓ a c d ^ a b c d “ s’=s.clone()
  39. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) • test(‘c’, /c/): ✓ • test(‘d’, /d/): ✓ a c d ^ a b c d “ s’=s.clone()
  40. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) • test(‘c’, /c/): ✓ • test(‘d’, /d/): ✓ a c d ^ a b c d “ s’=s.clone()
  41. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) • test(‘c’, /c/): ✓ • test(‘d’, /d/): ✓ • DONE a c d ^ a b c d “ s’=s.clone()
  42. • s = new State(‘abc’); • match(s, /a(b|c)d/) • test(‘a’,

    /a/): ✓ • try /b/: match(s’, /bd/) • test(s’, /b/): ✗ ➡ Backtrack • try /c/: match(s’’, /(b|c)d/) • test(‘c’, /c/): ✓ • test(‘d’, /d/): ✓ • DONE record the trace along the way a c d ^ a b c d “ s’=s.clone()
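
The walkthrough above can be condensed into a small recursive matcher. This is a sketch, not the talk's Matcher: the state is a plain object instead of a `State` class, a failed trace returns `null` rather than a function, and the node list for /a(b|c)d/ is written by hand with invented field names. The input is "acd", as in the diagram.

```javascript
// Try to match `node` and everything after it, starting from `state`.
// Returns the final state on success, or null so the caller can backtrack.
function match(state, node) {
  switch (node.kind) {
    case 'DONE':
      return state;
    case 'CHAR': {
      if (state.string[state.index] !== node.data) return null;
      // Advance on a copy and record the trace along the way.
      const advanced = {
        ...state,
        index: state.index + 1,
        trace: [...state.trace, node.data],
      };
      return match(advanced, node.next);
    }
    case 'JOIN':
      return match(state, node.next);
    case 'ALTR':
      // Each branch gets its own clone, so a failed branch leaves no side effects.
      for (const branch of node.branches) {
        const result = match({ ...state }, branch);
        if (result !== null) return result;
      }
      return null;
    default:
      throw new Error('Unknown node kind: ' + node.kind);
  }
}

// Hand-built node list for /a(b|c)d/ (group bookkeeping omitted).
const doneNode = { kind: 'DONE' };
const charD = { kind: 'CHAR', data: 'd', next: doneNode };
const joinNode = { kind: 'JOIN', next: charD };
const altrNode = {
  kind: 'ALTR',
  branches: [
    { kind: 'CHAR', data: 'b', next: joinNode },
    { kind: 'CHAR', data: 'c', next: joinNode },
  ],
};
const startNode = { kind: 'CHAR', data: 'a', next: altrNode };

const matched = match({ string: 'acd', index: 0, trace: [] }, startNode);
console.log(matched.index, matched.trace.join(''));
```

On "acd" the /b/ branch fails first, the loop moves on to /c/, and the remaining /d/ completes the match.
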
  43. Matcher pitfalls • /(a(b)*c)*/ • need to reset the matches

    of (b)* to null • also need to reset repeat counter • /()*a/ • Infinite matches ➡ Loop detection!
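
The loop-detection pitfall can be illustrated with a greedy star that refuses to repeat an iteration which made no progress. This is a simplified sketch: `matchStar` and `char` are hypothetical helpers, and the loop body returns a single end index instead of supporting backtracking inside the body.

```javascript
// A matcher here is (string, index) => end index, or null on failure.
const char = (c) => (string, index) => (string[index] === c ? index + 1 : null);

// Greedy `body*` followed by `rest`.
function matchStar(body, string, index, rest) {
  const afterBody = body(string, index);
  // Loop detection: only iterate again if the body consumed at least one character.
  if (afterBody !== null && afterBody > index) {
    const result = matchStar(body, string, afterBody, rest);
    if (result !== null) return result;
  }
  // Zero further iterations: continue with whatever follows the loop.
  return rest(string, index);
}

const emptyGroup = (string, index) => index; // the body of /()*/
console.log(matchStar(emptyGroup, 'a', 0, char('a')));  // /()*a/ on "a"
console.log(matchStar(char('b'), 'bba', 0, char('a'))); // /(b)*a/ on "bba"
```

Without the `afterBody > index` check, the empty group would recurse at the same index until the stack overflows.
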
  44. RegExpJS Object • So far: • created ParseTree • created

    NodeList • have Matcher(string, parseNodes) • Needed: • something to play with :) • actual JS object similar to RegExp object
  45. RegExpJS Object • first attempt: • RegExpJS(regExp).exec = (str) =>

    { regExp = this.regExp; nodes = buildNodeList(parse(regExp)); return match(str, nodes); } • RegExpJS(/abc/).exec(‘abc’): ✓ • RegExpJS(/abc/).exec(‘JSConf abc’): ✗
  46. Match arbitrary start position • easy fix: • wrap the

    nodeList by “fake loop” • e.g: /abc/
  47. Match arbitrary start position • easy fix: • wrap the

    nodeList by “fake loop” • e.g: /abc/ • /abc/ ➠ /[^]*?(abc)/
  48. Match arbitrary start position • easy fix: • wrap the

    nodeList by “fake loop” • e.g: /abc/ • /abc/ ➠ /[^]*?(abc)/ • assign the first group index 0
  49. Match arbitrary start position • easy fix: • wrap the

    nodeList by “fake loop” • e.g: /abc/ • /abc/ ➠ /[^]*?(abc)/ • assign the first group index 0 • /abc/.exec(‘JSConf abc’): ✓
  50. Match arbitrary start position • easy fix: • wrap the

    nodeList by “fake loop” • e.g: /abc/ • /abc/ ➠ /[^]*?(abc)/ • assign the first group index 0 • /abc/.exec(‘JSConf abc’): ✓ • problem: V8 stack size not large enough
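
The "fake loop" trick can be tried out with the native engine by emulating an anchored matcher with the sticky flag. This is only an illustration of the rewrite /abc/ ➠ /[^]*?(abc)/; the helper names are hypothetical, and in this emulation the real match lands in group 1, whereas the slide renumbers it to group 0.

```javascript
// An anchored matcher: the sticky flag forces the match to start at index 0.
function execAnchored(source, str) {
  const re = new RegExp(source, 'y');
  const m = re.exec(str);
  return m === null ? null : { match: m[0], index: 0 };
}

// The same anchored matcher, with the pattern wrapped in a lazy "fake loop".
function execWrapped(source, str) {
  const re = new RegExp('[^]*?(' + source + ')', 'y');
  const m = re.exec(str);
  if (m === null) return null;
  // Everything before group 1 was skipped by the lazy loop.
  return { match: m[1], index: m[0].length - m[1].length };
}

console.log(execAnchored('abc', 'JSConf abc')); // null: no match at position 0
console.log(execWrapped('abc', 'JSConf abc'));  // finds "abc" at index 7
```

Because the loop is lazy, the leftmost possible start position is found first, which is the behaviour `exec` needs.
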
  51. How about flags? • multiline flag: /foo/m • affects only /^/

    and /$/ • easy to implement • global flag: /foo/g
  52. How about flags? • multiline flag: /foo/m • affects only /^/

    and /$/ • easy to implement • global flag: /foo/g • repeat from last position • okay-ish to implement
  53. How about flags? • ignore-case flag: /foo/i • not just str.toUpperCase()

    • e.g. if it turns one character into two, then use the input character • fun with ranges: • /[a-}]/ ➠ /[a-}A-Z]/
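
The ignore-case rule can be sketched as a per-character canonicalization. The function below follows the steps described for the non-Unicode case in the ECMAScript specification; the name `canonicalize` and the helper `equalsIgnoreCase` are chosen here for illustration.

```javascript
// Map one character to the form used for case-insensitive comparison.
function canonicalize(ch) {
  const upper = ch.toUpperCase();
  // If upper-casing turns one character into two (e.g. "ß" to "SS"),
  // use the input character unchanged.
  if (upper.length !== 1) return ch;
  // A non-ASCII character must not collapse onto an ASCII one.
  if (ch.charCodeAt(0) >= 128 && upper.charCodeAt(0) < 128) return ch;
  return upper;
}

function equalsIgnoreCase(a, b) {
  return canonicalize(a) === canonicalize(b);
}

console.log(canonicalize('a'));          // "A"
console.log(canonicalize('ß'));          // "ß", not "SS"
console.log(equalsIgnoreCase('b', 'B')); // true
console.log(/[a-}]/i.test('B'));         // true: the range also covers A-Z
```

For a range such as /[a-}]/, every character in the range is canonicalized, which is why the upper-case letters end up matching too.
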
  54. More fun • Spec requirement: • Call `ToString(...)`, `ToNumber(...)` on

    input • E.g. • str = { toString: function() { return false; } } • /LS/i.exec(str) ➡ returns?
  55. More fun • Spec requirement: • Call `ToString(...)`, `ToNumber(...)` on

    input • E.g. • str = { toString: function() { return false; } } • /LS/i.exec(str) ➡ returns? • /LS/i.exec(str) === “ls”
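
The example can be run as is against the native RegExp. `exec` first converts its argument to a string, so the object turns into "false", and /LS/i then finds "ls" inside it.

```javascript
const str = { toString: function () { return false; } };

// ToString(str) calls str.toString(), gets the boolean false,
// and converts that primitive to the string "false".
const coerced = String(str);
const lsResult = /LS/i.exec(str);

console.log(coerced);                         // "false"
console.log(lsResult[0], lsResult.index);     // "ls" 2
```

A reimplementation has to perform the same conversion before matching, otherwise it would see an object instead of a string.
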
  56. so, RegExpJS object • There is more “fun” stuff required

    to get the ECMA test suite to pass • index.js (implements RegExpJS) ~240 LOC
  57. JIT Overview • Convert RegExp ➡ JS-Code • Have a

    RegExp-JIT running in JS-JIT • Idea: match input faster • use state machine / finite automata • only support a subset of RegExp • ... but these are very fast & efficient
  58. Supported features • All features, except: • group matches, lookahead

    - only (?: ) • no backreferences • only greedy repetitions • Not supported yet • repetitions, ... (under construction) • (written in three evenings of hacking ;) )
  59. RegExp: NFA: 0 1 4 a b 2 ε 3

    a ε DFA: 0 1 a b 2 /ab|a/
  60. JIT Implementation • Convert RegExp to state machine • RegExp

    • Non-deterministic Finite Automata (NFA) • Deterministic Finite Automata (DFA) • CodeGen ➠ JavaScript code
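
The last step, CodeGen, can be sketched for the DFA of /ab|a/ shown earlier. This is an assumption-laden illustration rather than the talk's JIT: the DFA is written by hand, the object layout and `compileDfa` are hypothetical, and the generated function simply returns the end of the longest match starting at index 0 (or -1).

```javascript
// Turn a DFA description into JavaScript source and compile it with `new Function`.
function compileDfa(dfa) {
  const cases = dfa.states.map((state, id) => {
    const tests = Object.keys(state.transitions).map(
      (ch) => `if (c === ${JSON.stringify(ch)}) { state = ${state.transitions[ch]}; break; }`
    );
    // No transition fits: stop and report the last accepting position.
    return `case ${id}: ${tests.join(' ')} return last;`;
  });
  const accepting = JSON.stringify(dfa.states.map((s) => Boolean(s.accepting)));
  const body = `
    const accepting = ${accepting};
    let state = ${dfa.start};
    let last = accepting[state] ? 0 : -1;
    for (let i = 0; i < str.length; i++) {
      const c = str[i];
      switch (state) {
        ${cases.join('\n        ')}
      }
      if (accepting[state]) last = i + 1;
    }
    return last;`;
  return new Function('str', body);
}

// DFA for /ab|a/: 0 -a-> 1 (accepting), 1 -b-> 2 (accepting).
const dfaAbOrA = {
  start: 0,
  states: [
    { accepting: false, transitions: { a: 1 } },
    { accepting: true, transitions: { b: 2 } },
    { accepting: true, transitions: {} },
  ],
};

const matchAbOrA = compileDfa(dfaAbOrA);
console.log(matchAbOrA('ab'), matchAbOrA('ac'), matchAbOrA('b'));
```

The generated code is an ordinary function with a `switch` in a loop, so the JavaScript engine's own JIT is free to optimize it. Note that longest-match is a simplification; JavaScript's alternation prefers the leftmost alternative, which a full implementation would have to respect.
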
  61. What alphabet to use? • e.g.: ranges like /[a-\u0777]/ •

    use one transition per character 0 1 a, b, c, ...
  62. What alphabet to use? • e.g.: ranges like /[a-\u0777]/ •

    use one transition per character • better solution? 0 1 a, b, c, ...
  63. Alphabet Classes • Example: /0[1-9]/ • before /ab/ 0 1

    2 0 1...9 alphabet: Σ={0, 1, 2, ..., 9}
  64. Alphabet Classes • Example: /0[1-9]/ • before /ab/ 0 1

    2 a b 0 1 2 0 1...9 alphabet: Σ={0, 1, 2, ..., 9}
  65. Alphabet Classes • Example: /0[1-9]/ • before /ab/ • let

    0 1 2 a b 0 1 2 0 1...9 alphabet: Σ={0, 1, 2, ..., 9}
  66. Alphabet Classes • Example: /0[1-9]/ • before /ab/ • let

    • a ≐ 0 0 1 2 a b 0 1 2 0 1...9 alphabet: Σ={0, 1, 2, ..., 9}
  67. Alphabet Classes • Example: /0[1-9]/ • before /ab/ • let

    • a ≐ 0 • b ≐ 1...9 0 1 2 a b 0 1 2 0 1...9 alphabet: Σ={0, 1, 2, ..., 9}
  68. Alphabet Classes • Example: /0[1-9]/ • before /ab/ • let

    • a ≐ 0 • b ≐ 1...9 0 1 2 a b 0 1 2 0 1...9 alphabet: Σ={0, 1, 2, ..., 9} alphabet: Σ={a, b}
  69. Alphabet Classes Calculation A-E /[A-E]/ C-Z /[C-Z]/ A-C D-E F-Z

    ≐ a ≐ b ≐ c A ≐ 65 Z ≐ 90 0 numberline
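
One way to calculate alphabet classes is to cut the number line at every range boundary and keep the pieces that are covered by at least one range. This is a sketch of that idea with a hypothetical `alphabetClasses` helper working on inclusive character-code ranges; it is not taken from the project's source.

```javascript
// ranges: array of [from, to] character codes, both inclusive.
// Returns disjoint [from, to] classes; within one class, every character
// belongs to exactly the same set of input ranges.
function alphabetClasses(ranges) {
  // Cut points: every range start and the position just after every range end.
  const cuts = new Set();
  for (const [from, to] of ranges) {
    cuts.add(from);
    cuts.add(to + 1);
  }
  const sorted = [...cuts].sort((a, b) => a - b);

  const classes = [];
  for (let i = 0; i + 1 < sorted.length; i++) {
    const from = sorted[i];
    const to = sorted[i + 1] - 1;
    // Drop pieces that fall into a gap between ranges.
    if (ranges.some(([f, t]) => f <= from && to <= t)) {
      classes.push([from, to]);
    }
  }
  return classes;
}

const code = (ch) => ch.charCodeAt(0);
const classes = alphabetClasses([
  [code('A'), code('E')], // /[A-E]/
  [code('C'), code('Z')], // /[C-Z]/
]);
console.log(classes.map(([f, t]) => String.fromCharCode(f) + '-' + String.fromCharCode(t)));
```

With this cutting rule the ranges [A-E] and [C-Z] split into three classes, so the DFA needs three transitions instead of one per character.
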
  70. TODO • Code cleanup • JIT still under construction •

    Align Parser/Traversal API with Esprima’s
  71. Status • Matcher works against all test cases • Help

    needed: • project page • bugfixes • testers!
  72. More important • Crazy ideas for what to use this for

    :) • Esprima enabled many analysis tools • Hope RegExp.JS gets used as well • Lint RegExp • Monitor RegExp execution