RubyKaigi 2024 - Make Your Own Regex Engine!

These are the presentation slides for the talk "Make Your Own Regex Engine!" by Hiroya Fujinami.

https://rubykaigi.org/2024/presentations/makenowjust.html#day3

kantan-regex book:
https://makenowjust.github.io/kantan-regex-book/

TSUYUSATO Kitsune

May 17, 2024

Transcript

  1. $ whoami

    ‣ Hiroya Fujinami (藤浪大弥)
      - Twitter: @make_now_just / GitHub: @makenowjust
    ‣ Ph.D. student at National Institute of Informatics / SOKENDAI
    ‣ Works part-time at STORES, Inc.
    ‣ Ruby committer (regex engine, memoization)
  2. Recent work as a Ruby committer

    ‣ Improved Ruby's regex engine (Onigmo).
      - Memoization now works for lookaround assertions and atomic groupings.
      - Shipped in Ruby 3.3.
    ‣ Contributed to the Prism parser.

    Examples: /(?<=foo)bar(?=baz)/ and /(?>foo+)bar+/
  3. Goal

    Talk theme: Make Your Own Regex Engine!
    ‣ Let you know the internals of regex engines!
      - And I hope you'll want to contribute to Ruby's regex implementation.
    ‣ A regex is like a small programming language, so regex matching is a good first programming language to implement.
  4. Goal

    Talk theme: Make Your Own Regex Engine!
    ‣ This talk follows my book (Japanese only):
      - https://makenowjust.github.io/kantan-regex-book/
    ‣ Let's try it!
  6. Outline

    Talk theme: Make Your Own Regex Engine!
    1. Transpile regex to Ruby program (bad).
    2. Use backtracking VM for regex matching.
    Why is regex matching hard?
    Next step (Capture / Null-loop / Lookahead)
  7. Transpile

    ‣ Transpilation is a conversion from source code to source code.
      - e.g., TypeScript is a language that is transpiled to JavaScript.
    ‣ We will first demonstrate how to transpile from regex to a Ruby program.
      - But it will become clear that this idea is bad.

    1. Transpile regex to Ruby program (bad). / Hiroya Fujinami, "Make Your Own Regex Engine", 2024-05-17
  8. Parser / Abstract syntax tree (AST)

    ‣ An abstract syntax tree (AST) is a data structure representing a parsed regex.

    Example regexes: /a/ /a*/ /a|b/ /ab/
  11. Transpile from regex

    The abstract syntax tree is transpiled to Ruby source of the following shape, which is then passed to eval(" "):

    # @param str [String] an input string
    # @param pos [Integer] a current position
    # @return [Integer, nil] return the matched pos, or return nil if not matched
    p = -> (str, pos) {
      # (transpiled code for the regex is placed here)
      return pos
    }
  12. Transpile from regex: Literal

    Literal['a'] # /a/

    transpiles to:

    p = -> (str, pos) {
      if str[pos] == 'a'
        pos += 1
      else
        return nil
      end
      return pos
    }
  13. Transpile from regex: Concat

    Concat[[r1, r2]] # /r1r2/

    transpiles to:

    p = -> (str, pos) {
      # (transpiled code for r1)
      # (transpiled code for r2)
      return pos
    }
  14. Transpile from regex: Concat (example)

    Concat[[ Literal['a'], Literal['b'] ]] # /ab/

    transpiles to:

    p = -> (str, pos) {
      if str[pos] == 'a'
        pos += 1
      else
        return nil
      end
      if str[pos] == 'b'
        pos += 1
      else
        return nil
      end
      return pos
    }
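The Literal and Concat rules above can be sketched as a tiny transpiler. This is a minimal sketch, not the code from the talk or the book: the Struct-based AST nodes, their field names (value, children), and the helper names transpile_body and transpile are assumptions made for illustration.

```ruby
# Hypothetical AST nodes; the field names are assumptions.
Literal = Struct.new(:value)
Concat  = Struct.new(:children)

# Emit the Ruby statements that match `node` against `str` at `pos`.
def transpile_body(node)
  case node
  when Literal
    "if str[pos] == #{node.value.inspect}\n  pos += 1\nelse\n  return nil\nend\n"
  when Concat
    # Concatenation is just the children's code placed one after another.
    node.children.map { |child| transpile_body(child) }.join
  else
    raise ArgumentError, "unsupported node: #{node.inspect}"
  end
end

# Wrap the statements in a lambda and evaluate the generated source.
def transpile(node)
  eval("->(str, pos) {\n#{transpile_body(node)}return pos\n}")
end

ab = transpile(Concat[[Literal['a'], Literal['b']]]) # /ab/
ab.call("ab", 0) #=> 2
ab.call("ac", 0) #=> nil
```

Only these two node types are handled; Choice is deliberately left out because, as the following slides show, it is where this approach breaks down.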
  15. Transpile from regex: Choice (?)

    Choice[[r1, r2]] # /r1|r2/

    transpiles to:

    p = -> (str, pos) {
      p1 = ->(str, pos) {
        # (transpiled code for r1)
        return pos
      }; r1 = p1.call(str, pos)
      if r1; pos = r1; else
        # (transpiled code for r2)
      end;
      return pos
    }
  17. Concat & Choice: Problematic case

    # /(a|aa)b/
    Concat[[
      Choice[[
        Literal['a'],
        Concat[[ Literal['a'], Literal['a'], ]],
      ]],
      Literal['b'],
    ]]

    transpiles to (inside p):

    p1 = ->(str, pos) {
      if str[pos] == 'a' pos += 1 else return nil end
      return pos
    }; r1 = p1.call(str, pos)
    if r1; pos = r1; else
      if str[pos] == 'a' pos += 1 else return nil end
      if str[pos] == 'a' pos += 1 else return nil end
    end
    if str[pos] == 'b' pos += 1 else return nil end
    return pos

    What is the problem?

    p.("ab", 0) #=> 2
    p.("aab",0) #=> nil
  18. Concat & Choice: Problematic case

    ‣ This transpilation doesn't capture backtracking behavior correctly.
    ‣ To fix this problem: use a continuation (or yield), or use a backtracking VM.

    Next: 2. Backtracking VM
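The continuation (or yield) idea mentioned above can be sketched as follows. This is an interpreter over the AST rather than a transpiler, and it is an illustration only: the Struct-based nodes and the names each_match and match_prefix are assumptions, not code from the talk. Each node calls its block once for every position where it can finish, so a later failure can return into an earlier Choice and try the next branch.

```ruby
# Hypothetical AST nodes; the field names are assumptions.
Literal = Struct.new(:value)
Concat  = Struct.new(:children)
Choice  = Struct.new(:children)

# Call `blk` with every position at which `node` can finish when started at `pos`.
def each_match(node, str, pos, &blk)
  case node
  when Literal
    blk.call(pos + 1) if str[pos] == node.value
  when Concat
    first, *rest = node.children
    if first.nil?
      blk.call(pos) # empty concatenation matches without consuming input
    else
      # The rest of the concatenation is the continuation of `first`.
      each_match(first, str, pos) do |mid|
        each_match(Concat[rest], str, mid, &blk)
      end
    end
  when Choice
    # Trying the branches in order is what gives us backtracking.
    node.children.each { |child| each_match(child, str, pos, &blk) }
  else
    raise ArgumentError, "unsupported node: #{node.inspect}"
  end
end

# Return the end position of the first successful match, or nil.
def match_prefix(node, str, pos = 0)
  each_match(node, str, pos) { |end_pos| return end_pos }
  nil
end

# /(a|aa)b/
re = Concat[[Choice[[Literal['a'], Concat[[Literal['a'], Literal['a']]]]], Literal['b']]]
match_prefix(re, "ab")  #=> 2
match_prefix(re, "aab") #=> 3
```

Unlike the naive transpilation, "aab" now matches: when Literal['b'] fails after the first branch, control returns into the Choice loop, which tries the second branch.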
  19. Backtracking VM

    ‣ A backtracking VM is a small virtual machine (VM) for regex matching.
    ‣ Only four instructions are needed to implement regex matching.
      - push, jump, char, match

    The abstract syntax tree is compiled to a backtracking VM program (an array of instructions):

    0 [:push, 5]
    1 [:push, 4]
    2 [:char, 'a']
    3 [:jump, 1]
    4 [:jump, 7]
    5 [:char, 'b']
    6 [:char, 'c']
    7 [:match]

    2. Use backtracking VM for regex matching. / Hiroya Fujinami, "Make Your Own Regex Engine", 2024-05-17
  20. Backtracking VM: Structure

    ‣ Input: str, start_pos, program
    ‣ State: pos, pc (program counter), stack

    Example program:

    0 [:push, 3]
    1 [:char, 'a']
    2 [:jump, 5]
    3 [:char, 'a']
    4 [:char, 'a']
    5 [:char, 'b']
    6 [:match]
  21. Instruction: push

    ‣ [:push, backtrack_pc]
    ‣ Push (backtrack_pc, pos) to stack, and increment pc.
    ‣ This value is popped when matching fails. → See the next slide.

    (Diagram: executing [:push, 3] at pc 0 of the example program from slide 20 pushes (pc=3,pos=0) onto the stack as the backtrack point.)
  22. Instruction: char

    ‣ [:char, c]
    ‣ If str[pos] == c, then pos and pc are incremented.
    ‣ Otherwise, backtrack:
      - Pop values from stack and restore pc and pos from them, or return a matching failure if the stack is empty.

    (Diagram: the example program from slide 20 with (pc=3,pos=0) on the stack.)
  23. Instruction: jump, match

    ‣ [:jump, next_pc]
      - Set pc to next_pc.
    ‣ [:match]
      - Return pos immediately.

    (Diagram: the example program from slide 20 with (pc=3,pos=0) on the stack; [:jump, 5] moves pc to 5.)
  25. Implementation: Backtracking VM

    ‣ Just 21 lines!
    ‣ But the corresponding code in Ruby is ~2000 lines (match_at in regexec.c).
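The code shown on this slide is not part of the transcript. Based only on the instruction descriptions above, a VM of roughly that size might look like the following sketch; the method name exec_vm and the representation of stack entries as [pc, pos] pairs are assumptions.

```ruby
# Run `program` against `str` from `start_pos`.
# Returns the end position of the match, or nil if matching fails.
def exec_vm(program, str, start_pos = 0)
  pos = start_pos
  pc = 0
  stack = []
  loop do
    op, arg = program[pc]
    case op
    when :push
      stack.push([arg, pos]) # remember a backtrack point
      pc += 1
    when :jump
      pc = arg
    when :char
      if str[pos] == arg
        pos += 1
        pc += 1
      else
        # Backtrack, or fail when no backtrack point is left.
        return nil if stack.empty?
        pc, pos = stack.pop
      end
    when :match
      return pos
    else
      raise ArgumentError, "unknown instruction: #{op.inspect}"
    end
  end
end

# Example program from the slides, corresponding to /(a|aa)b/.
program = [
  [:push, 3], [:char, 'a'], [:jump, 5],
  [:char, 'a'], [:char, 'a'],
  [:char, 'b'], [:match],
]
exec_vm(program, "ab")  #=> 2
exec_vm(program, "aab") #=> 3
exec_vm(program, "b")   #=> nil
```

On "aab" the first branch consumes one a, [:char, 'b'] fails, and the VM pops (pc=3, pos=0) to retry with the second branch, which is the backtracking the naive transpilation could not express.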
  26. Compile

    ‣ Convert regexes (abstract syntax trees) to backtracking VM programs (arrays of instructions).

    0 [:push, 5]
    1 [:push, 4]
    2 [:char, 'a']
    3 [:jump, 1]
    4 [:jump, 7]
    5 [:char, 'b']
    6 [:char, 'c']
    7 [:match]
  27. Compile: Literal

    Literal['a'] # /a/

    compiles to:

        [:char, 'a']
  28. Compile: Concat

    Concat[[r1,r2]] # /r1r2/

    compiles to:

        compile(r1)
        compile(r2)
        ⋮
  29. Compile: Choice

    Choice[[r1,r2]] # /r1|r2/

    compiles to:

        [:push, §1]
        compile(r1)
        [:jump, §2]
    §1: compile(r2)
    §2: ⋮
  30. Compile: Repetition (star)

    Repetition[r,:star] # /r*/

    compiles to:

    §2: [:push, §1]
        compile(r)
        [:jump, §2]
    §1: ⋮
  31. Compile: Repetition (plus, question)

    Repetition[r,:plus] # /r+/

    compiles to:

    §2: compile(r)
        [:push, §1]
        [:jump, §2]
    §1: ⋮

    Repetition[r,:question] # /r?/

    compiles to:

        [:push, §1]
        compile(r)
    §1: ⋮
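The compile rules for Literal, Concat, Choice and Repetition can be sketched in a few lines. This is not the roughly 60-line compiler from the talk: the Struct-based nodes, their field names, and the helper compile_node with its base parameter (the absolute address of the first emitted instruction, used to resolve the § labels) are assumptions.

```ruby
# Hypothetical AST nodes; the field names are assumptions.
Literal    = Struct.new(:value)
Concat     = Struct.new(:children)
Choice     = Struct.new(:children)
Repetition = Struct.new(:child, :quantifier)

# Compile `node`, assuming its first instruction lands at address `base`.
def compile_node(node, base)
  case node
  when Literal
    [[:char, node.value]]
  when Concat
    node.children.reduce([]) do |code, child|
      code + compile_node(child, base + code.size)
    end
  when Choice
    r1, r2 = node.children # this sketch handles two branches only
    code1 = compile_node(r1, base + 1)
    l1 = base + 1 + code1.size + 1 # address of the second branch
    code2 = compile_node(r2, l1)
    l2 = l1 + code2.size           # address after the whole choice
    [[:push, l1]] + code1 + [[:jump, l2]] + code2
  when Repetition
    case node.quantifier
    when :star
      body = compile_node(node.child, base + 1)
      [[:push, base + body.size + 2]] + body + [[:jump, base]]
    when :plus
      body = compile_node(node.child, base)
      body + [[:push, base + body.size + 2], [:jump, base]]
    when :question
      body = compile_node(node.child, base + 1)
      [[:push, base + body.size + 1]] + body
    else
      raise ArgumentError, "unknown quantifier: #{node.quantifier.inspect}"
    end
  else
    raise ArgumentError, "unsupported node: #{node.inspect}"
  end
end

def compile(node)
  compile_node(node, 0) + [[:match]]
end

# /a*|bc/ yields the 8-instruction program shown on the "Compile" slide.
program = compile(Choice[[Repetition[Literal['a'], :star],
                          Concat[[Literal['b'], Literal['c']]]]])
```

Passing the base address down keeps the compiler single-pass; an alternative design would emit symbolic labels first and patch the addresses afterwards.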
  32. Implementation: Compiler

    ‣ ~60 lines.
  33. Implementation: Public API

    The public API chains three steps: parse → compile → exec.

    ‣ parse produces the abstract syntax tree (AST):
      Choice[[ Repetition[ Literal['a'], :star, ], Concat[[ Literal['b'], …
    ‣ compile produces the backtracking VM program:

    0 [:push, 5]
    1 [:push, 4]
    2 [:char, 'a']
    3 [:jump, 1]
    4 [:jump, 7]
    5 [:char, 'b']
    6 [:char, 'c']
    7 [:match]

    ‣ exec runs the program on the input string "aaabc" and returns the matching result.
  35. All source code of kantan-regex

    ‣ ~250 lines.
    ‣ Components: AST, Parser, Backtracking VM, Compiler, API

    Next: Next step
  36. Next step

    Talk theme: Make Your Own Regex Engine!
    1. Implement captures.
    2. Fix null-loop problem.
    3. Implement lookahead assertions.
  37. Captures

    ‣ A capture is regex syntax that allows later reference to the string matched by the parenthesized portion.
    ‣ AST: Capture[i, r]

    Next step - Implement captures / Hiroya Fujinami, "Make Your Own Regex Engine", 2024-05-17
  38. Compile: Capture

    Capture[i,r] # /(r)/

    compiles to:

        [:cap,i*2]
        compile(r)
        [:cap,i*2+1]
        ⋮

    ‣ The backtracking VM has a new state caps.
      - caps[0] corresponds to $~.begin(0), caps[1] to $~.end(0), caps[2] to $~.begin(1), …
  39. New instruction: cap

    ‣ [:cap, cap_index]
    ‣ Update caps[cap_index] to pos, and push a pair with a tag C:
      - the capture index cap_index
      - the old caps[cap_index]
    ‣ When a C(cap, pos) pair appears during backtracking, caps[cap_index] is reset to pos.

    (Diagram: executing [:cap, 2] at pos 0 pushes C(cap=2,pos=0) onto the stack and updates caps.)

    Next: null-loop problem
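The cap instruction and its undo-on-backtrack behavior can be sketched by extending a minimal VM. This is an illustration under assumptions: the method name exec_vm, the cap_count parameter, the :B/:C tags on stack entries, and returning caps from :match are all choices made here, not details confirmed by the talk.

```ruby
# Returns the caps array on success, or nil on failure.
def exec_vm(program, str, start_pos = 0, cap_count = 1)
  pos = start_pos
  pc = 0
  stack = []
  caps = Array.new(cap_count * 2) # [begin(0), end(0), begin(1), end(1), ...]
  loop do
    op, arg = program[pc]
    case op
    when :push
      stack.push([:B, arg, pos])
      pc += 1
    when :jump
      pc = arg
    when :cap
      stack.push([:C, arg, caps[arg]]) # remember the old value for undo
      caps[arg] = pos
      pc += 1
    when :char
      if str[pos] == arg
        pos += 1
        pc += 1
      else
        # Backtrack: undo captures until a backtrack point is found.
        loop do
          return nil if stack.empty?
          tag, a, b = stack.pop
          if tag == :C
            caps[a] = b
          else
            pc = a
            pos = b
            break
          end
        end
      end
    when :match
      return caps
    else
      raise ArgumentError, "unknown instruction: #{op.inspect}"
    end
  end
end

# Hand-compiled program for /(a|aa)b/ with group 0 (whole match) and group 1.
program = [
  [:cap, 0], [:cap, 2],
  [:push, 5], [:char, 'a'], [:jump, 7],
  [:char, 'a'], [:char, 'a'],
  [:cap, 3], [:char, 'b'], [:cap, 1], [:match],
]
caps = exec_vm(program, "aab", 0, 2)
caps              #=> [0, 3, 0, 2]
"aab"[caps[2]...caps[3]] #=> "aa"
```

On "aab", caps[3] is first set to 1 after the short branch, then restored to nil when [:char, 'b'] fails, and finally set to 2 by the second branch.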
  40. Null-loop problem

    Repetition[ Repetition[ Literal['a'], :question ], :star ] # /(a?)*/

    compiles to:

    0 [:push,4]
    1 [:push,3]
    2 [:char,'a']
    3 [:jump,0]
    4 [:match]

    KantanRegex.new('(a?)*').match('aa') does not terminate.

    Next step - Fix null-loop problem / Hiroya Fujinami, "Make Your Own Regex Engine", 2024-05-17
  41. How to fix null-loop problem

    ‣ One major solution: treating a null-loop as a failure (JavaScript).
    ‣ Another solution: escaping a null-loop if it is found (Ruby?).
    ‣ The difference between the JavaScript and Ruby ways shows up in captures.
      - JavaScript: /(a|())*/.match("aaa") == ["aaa", "a", undefined]
      - Ruby?: /(a|())*/.match("aaa") == ["aaa", "", ""]
  42. New instructions for null-loop

    ‣ Anyway, we need new instructions for detecting a null-loop.
    ‣ New instructions: push_null_check, null_check

    Repetition[r,:star] # /r*/

    compiles to:

    §2: [:push, §1]
        [:push_null_check, 1]
        compile(r)
        [:null_check, 1]
        [:jump, §2]
    §1: ⋮
  43. New instruction: push_null_check

    ‣ [:push_null_check, t]
    ‣ Push NC(t, pos) to stack.
    ‣ This value is used by null_check.

    (Diagram: executing [:push_null_check,1] at pos 1 pushes NC(t=1,pos=1) onto the stack.)
  44. New instruction: null_check

    ‣ [:null_check,t]
    ‣ Find the top NC(t,pos=nc_pos) in stack.
    ‣ If pos == nc_pos, a null-loop is detected:
      - treat it as a failure (JS way), or
      - skip the next instruction (Ruby way).

    (Diagram: with NC(t=1,pos=1) on the stack and [:null_check,1] followed by [:jump,2], the JS way backtracks while the Ruby way skips the jump.)
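Both null-loop strategies can be sketched on top of a minimal VM. This is an illustration, not the talk's code: the method name exec_vm, the null_loop keyword with the values :skip and :fail, and the :B/:NC stack tags are assumptions. The example program is a hand compilation of /(a?)*/ using the star rule with push_null_check and null_check.

```ruby
# null_loop: :skip follows the "Ruby way", :fail follows the "JS way".
def exec_vm(program, str, start_pos = 0, null_loop: :skip)
  pos = start_pos
  pc = 0
  stack = []
  # Pop entries until a backtrack point is restored; false if none is left.
  backtrack = lambda do
    until stack.empty?
      tag, a, b = stack.pop
      next unless tag == :B # NC entries are simply discarded
      pc = a
      pos = b
      return true
    end
    false
  end
  loop do
    op, arg = program[pc]
    case op
    when :push
      stack.push([:B, arg, pos])
      pc += 1
    when :jump
      pc = arg
    when :char
      if str[pos] == arg
        pos += 1
        pc += 1
      else
        return nil unless backtrack.call
      end
    when :push_null_check
      stack.push([:NC, arg, pos])
      pc += 1
    when :null_check
      entry = stack.reverse_each.find { |tag, t, _| tag == :NC && t == arg }
      if entry && entry[2] == pos
        if null_loop == :skip
          pc += 2 # skip the jump that would restart the loop
        else
          return nil unless backtrack.call
        end
      else
        pc += 1
      end
    when :match
      return pos
    else
      raise ArgumentError, "unknown instruction: #{op.inspect}"
    end
  end
end

# /(a?)*/ compiled with null checks (captures omitted).
program = [
  [:push, 6], [:push_null_check, 1],
  [:push, 4], [:char, 'a'],
  [:null_check, 1], [:jump, 0],
  [:match],
]
exec_vm(program, "aa")                   #=> 2
exec_vm(program, "aa", null_loop: :fail) #=> 2
```

Without captures the two strategies return the same position here; the difference described on the earlier slide only becomes visible once captures are recorded inside the loop.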
  46. null-check: Ruby's implementation

    ‣ Ruby's implementation is more complex. Actually, it is capture-aware.
    ‣ e.g., /(a|\2b|\3()|())*/.match("aaabbb") == ["aaabbb", "", "", ""]
    ‣ Implementing this is an exercise ;)

    But this case combined with lookaround assertions is still unresolved. Many implementations get stuck in infinite loops:

    /((?=(\3a|a))(?=(\2a)))*/ =~ "aaaa"

    Next: lookahead assertions
  47. Lookahead assertions

    ‣ Lookahead assertions are one kind of zero-width assertion.
    ‣ Syntax: (?=pattern)
    ‣ Lookahead (?=pattern) tries to match pattern, but it does not consume characters if it matches.
    ‣ In the backtracking VM, this means resetting pos if pattern matches.

    Examples: /(?=.*abc).*cba/ and /hello (?=world)/

    Next step - Implement lookahead assertions / Hiroya Fujinami, "Make Your Own Regex Engine", 2024-05-17
  48. Compile: Lookahead

    ‣ New instructions: save_pos, reset_pos

    Lookahead[r] # /(?=r)/

    compiles to:

        [:save_pos]
        compile(r)
        [:reset_pos]
        ⋮
  50. New instruction: save_pos, reset_pos

    ‣ [:save_pos]
      - Push pos with a tag LA.
    ‣ [:reset_pos]
      - Pop stack until LA(pos=la_pos) is found, then set pos to la_pos.

    (Diagram: at [:reset_pos], the stack entries (pc=3,pos=2), (pc=2,pos=1) and LA(pos=1) are popped.)
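A minimal sketch of save_pos and reset_pos, assuming a VM whose stack entries carry a tag (:B for backtrack points, :LA for saved lookahead positions). The method name exec_vm and the tag names are assumptions; the example program is a hand compilation of /a(?=b)/.

```ruby
def exec_vm(program, str, start_pos = 0)
  pos = start_pos
  pc = 0
  stack = []
  loop do
    op, arg = program[pc]
    case op
    when :push
      stack.push([:B, arg, pos])
      pc += 1
    when :jump
      pc = arg
    when :save_pos
      stack.push([:LA, nil, pos])
      pc += 1
    when :reset_pos
      # Drop everything pushed inside the lookahead, then restore pos.
      loop do
        raise "no LA entry on stack" if stack.empty?
        tag, _, saved_pos = stack.pop
        if tag == :LA
          pos = saved_pos
          break
        end
      end
      pc += 1
    when :char
      if str[pos] == arg
        pos += 1
        pc += 1
      else
        # Backtrack; LA entries are skipped because they are not resumable.
        loop do
          return nil if stack.empty?
          tag, saved_pc, saved_pos = stack.pop
          next unless tag == :B
          pc = saved_pc
          pos = saved_pos
          break
        end
      end
    when :match
      return pos
    else
      raise ArgumentError, "unknown instruction: #{op.inspect}"
    end
  end
end

# /a(?=b)/
program = [
  [:char, 'a'],
  [:save_pos], [:char, 'b'], [:reset_pos],
  [:match],
]
exec_vm(program, "ab") #=> 1
exec_vm(program, "ac") #=> nil
```

Because reset_pos also discards the backtrack points created inside the lookahead, the assertion in this sketch is not re-entered once it has succeeded.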
  51. Negative lookahead

    ‣ Negative lookahead is a variant of lookahead assertions.
    ‣ Syntax: (?!pattern)
    ‣ push_neg_la: Like push, but it is tagged with NLA.
    ‣ fail: Pop until NLA.

    Example: /Ruby(?!Kaigi)/

    Negative lookahead on r compiles to:

        [:push_neg_la, §]
        compile(r)
        [:fail]
    §:  ⋮
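One possible reading of push_neg_la and fail is sketched below. The slide only says "Pop until NLA"; this sketch assumes that fail pops entries up to and including the nearest NLA and then backtracks as an ordinary failure, and that an NLA entry reached by ordinary backtracking resumes at its target address. The method name exec_vm and the tags :B and :NLA are also assumptions.

```ruby
def exec_vm(program, str, start_pos = 0)
  pos = start_pos
  pc = 0
  stack = []
  # Resume from the top entry (B or NLA); false if the stack is empty.
  backtrack = lambda do
    return false if stack.empty?
    _tag, saved_pc, saved_pos = stack.pop
    pc = saved_pc
    pos = saved_pos
    true
  end
  loop do
    op, arg = program[pc]
    case op
    when :push
      stack.push([:B, arg, pos])
      pc += 1
    when :push_neg_la
      stack.push([:NLA, arg, pos])
      pc += 1
    when :jump
      pc = arg
    when :char
      if str[pos] == arg
        pos += 1
        pc += 1
      else
        return nil unless backtrack.call
      end
    when :fail
      # The inner pattern matched, so the negative lookahead itself fails.
      loop do
        raise "no NLA entry on stack" if stack.empty?
        tag, _pc, _pos = stack.pop
        break if tag == :NLA
      end
      return nil unless backtrack.call
    when :match
      return pos
    else
      raise ArgumentError, "unknown instruction: #{op.inspect}"
    end
  end
end

# /a(?!b)/
program = [
  [:char, 'a'],
  [:push_neg_la, 4], [:char, 'b'], [:fail],
  [:match],
]
exec_vm(program, "ac") #=> 1
exec_vm(program, "ab") #=> nil
```

When the inner pattern fails, ordinary backtracking lands on the NLA entry, which restores pos and continues after the assertion; when it succeeds, fail removes that entry so the match cannot continue there.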
  52. Exercises

    Talk theme: Make Your Own Regex Engine!
    1. Implement other lookaround assertions: negative lookahead, positive/negative lookbehind.
    2. Implement backreferences (e.g., /(a*)\1/).
    3. Apply optimization/acceleration: memoization (see my previous talk), VM program to Ruby transpilation.
  53. Thank you for listening!

    Talk theme: Make Your Own Regex Engine!
    ‣ In this talk, we described:
      - what is strange about backtracking behavior,
      - how to implement a simple regex engine in ~250 lines of Ruby,
      - how to extend this simple regex engine.