(Re)make Regexp in Ruby: Democratizing internals for the JIT

Presented at RubyKaigi 2026 in Hakodate, Day 3 (2026/04/24).

https://github.com/makenowjust/naraku

TSUYUSATO Kitsune

April 24, 2026

Transcript

  1. Hiroya Fujinami (a.k.a. makenowjust) @RubyKaigi 2026 in Hakodate / 2026-04-24

    (Re)make Regexp in Ruby: Democratizing internals for the JIT
  2. Hiroya Fujinami a.k.a. makenowjust

    • https://github.com/makenowjust / https://x.com/make_now_just
    • Ph.D. student at NII (National Institute of Informatics).
    • Studying the application of automata theory and formal languages.
    • Recently interested in nominal sets.
    • Currently on the job market (!)
  3. Implementation plan (at the proposal time)

    • Regexp matching flow (Onigmo): regparse.c, regcomp.c (compiler), regexec.c (matching VM)
    • Plan: regcomp.rb and regexec.rb written in Ruby (so JIT powered) by exposing internals (parser, char class)
  4. Misperception about Ruby's Regexp

    • Regexp matching flow (Ruby): Ruby's preprocess (1), Ruby's preprocess (2), Onigmo (regparse.c; regcomp.c, compiler; regexec.c, matching VM)
    • Diagram labels: Ruby's Regexp; Problem area
  5. Today's contents

    • Chapter 1: Ruby's Regexp vs Onigmo's
    • Chapter 2: Why remake Onigmo?
    • Chapter 3: "Onigmo remake project" - Current status
    • Chapter 4: To the Regexp + JIT dream...
  6. 4-stage Regexp (pre)processing

    1. Handling escape sequences on parsing as Ruby (prism/prism.c)
    2. Fast checking for collecting named captures (prism/regexp.c)
    3. Handling escape sequences before compiling Regexp (re.c)
    4. Parsing with Onigmo (regparse.c)
  7. What each stage does (1)

    Stage 1: Handling escape sequences on parsing as Ruby (prism/prism.c)
    • original: `/(?<x>\woo)\ \u{1F600}/ =~ "foo "` and `p /x\ y/ # => ???`
    • processed (1): `/(?<x>\woo)\u{1F600}/ =~ "foo "` and `p /xy/ # => /xy/`
  8. What each stage does (2)

    Stage 2: Fast checking for collecting named captures (prism/regexp.c)
    • processed (1): `/(?<x>\woo)\u{1F600}/ =~ "foo "` and `p /xy/ # => /xy/`
    • processed (2): `/(?<x>\woo)\u{1F600}/ =~ "foo "`, `x = $~[:x] if $~`, and `p /xy/ # => /xy/`
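
The "processed (2)" form above reflects documented Ruby behavior: a regexp literal with named captures on the left-hand side of `=~` assigns local variables. A minimal sketch follows, with the pattern simplified from the slide (no emoji); the method names are hypothetical and exist only for illustration.

```ruby
# The parser sees the literal's named capture and creates the local `x`.
def capture_with_literal(str)
  if /(?<x>\woo)/ =~ str
    x
  end
end

# Roughly what the slide's "processed (2)" form spells out by hand.
def capture_by_hand(str)
  return nil unless /\woo/ =~ str
  md = /(?<x>\woo)/.match(str)
  md[:x] if md
end

puts capture_with_literal("foo").inspect
puts capture_by_hand("foo").inspect
puts capture_with_literal("bar").inspect
```

Both methods return the captured text when the pattern matches and `nil` otherwise.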
  9. What each stage does (3)

    Stage 3: Handling escape sequences before compiling Regexp (re.c)
    • processed (2): `/(?<x>\woo)\u{1F600}/ =~ "foo "`, `x = $~[:x] if $~`, and `p /xy/ # => /xy/`
    • internal: `s = "(?<x>\woo) "`, `e = UTF-8`, `onigmo_parse(s, e)`
  10. What each stage does (4)

    Stage 4: Parsing with Onigmo (regparse.c)
    • internal: `s = "(?<x>\woo) "`, `e = UTF-8`, `onigmo_parse(s, e)`
    • internal AST: list(enclose(list(ctype \w, string "oo")), string " ")
  11. Details of Regexp preprocessing in Ruby

    • These preprocessing stages are for the Prism case.
    • Regexp preprocessing stages depend on the Ruby parser and on how a Regexp value is created.
    • parse.y also has its own preprocessing stage, but it uses Onigmo for collecting named captures.
    • When Regexp.new(...) is called, preprocessing starts from stage 3.
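
One visible consequence of the entry point is that named captures become local variables only when the parser sees a regexp literal. The sketch below contrasts the literal path with `Regexp.new`; the method names are hypothetical, and the pattern is a simplified version of the one on the slides.

```ruby
# Literal path: the parser collects the named capture, so `x` is assigned.
def via_literal(str)
  x = nil
  /(?<x>\woo)/ =~ str
  x
end

# Regexp.new path: the pattern is only a string at parse time,
# so no local variable is assigned and `x` stays nil.
def via_regexp_new(str)
  x = nil
  Regexp.new('(?<x>\woo)') =~ str
  x
end

# The capture is still available through MatchData.
def via_match_data(str)
  md = Regexp.new('(?<x>\woo)').match(str)
  md && md[:x]
end

puts via_literal("foo").inspect
puts via_regexp_new("foo").inspect
puts via_match_data("foo").inspect
```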
  12. Bugs coming from preprocessing chaos

    • Double-, triple-, and quadruple-escape processing is highly problematic. There are countless bugs.
    • e.g.,
      1. `/\c?/ =~ "\x7F"`, but `Regexp.new('\c?') !~ "\x7F"`.
      2. `/(?<\x61>x)/ =~ "x"` raises IndexError.
      3. `/[]]/` is an error in Prism, but it works in parse.y.
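
To check these cases on a particular Ruby build, a small probe like the following may help. It is only a sketch: `probe`, `SNIPPETS`, and `RESULTS` are hypothetical names, each snippet goes through `eval` so that a parse-time error is reported rather than aborting the script, and the output is expected to differ between Ruby versions and parsers, so none is shown.

```ruby
# Evaluate a snippet; return its value, or the exception class if it raises.
def probe(source)
  eval(source)
rescue ScriptError, StandardError => e
  e.class
end

SNIPPETS = [
  '/\c?/ =~ "\x7F"',
  %q{Regexp.new('\c?') =~ "\x7F"},
  '/(?<\x61>x)/ =~ "x"',
  '/[]]/ =~ "]"',
].freeze

RESULTS = SNIPPETS.to_h { |src| [src, probe(src)] }

RESULTS.each do |src, result|
  puts "#{src.ljust(32)} # => #{result.inspect}"
end
```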
  13. Implementation plan (at the proposal time)

    • Regexp matching flow (Onigmo): regparse.c, regcomp.c (compiler), regexec.c (matching VM)
    • Plan: regcomp.rb and regexec.rb written in Ruby (so JIT powered) by exposing internals (parser, char class)
  14. Ruby's Regexp (diagram)

    • Files: prism/prism.c, prism/regexp.c, parse.y, re.c, regparse.c, regcomp.c, regexec.c
    • Paths: Prism /.../ path, parse.y /.../ path, Regexp.new('...') path, Regexp.parse path (?)
  15. Ideal architecture

    • Paths: Prism /.../ path, parse.y /.../ path, Regexp.new path, Regexp.parse path
    • Files: parse.c, comp.c, exec.c
  16. Today's contents

    • Chapter 1: Ruby's Regexp vs Onigmo's
    • Chapter 2: Why remake Onigmo?
    • Chapter 3: "Onigmo remake project" - Current status
    • Chapter 4: To the Regexp + JIT dream...
  17. Goals of Project Naraku

    Creating a "Ruby-dedicated" Regexp engine:
    1. Modern & clean architecture
    2. Providing a specification of Ruby's Regexp
    3. User-friendly new features
    4. Performance improvement (with JIT?)
  18. Goals of Project Naraku

    Creating a "Ruby-dedicated" Regexp engine:
    1. Modern & clean architecture
    2. Providing a specification of Ruby's Regexp
    3. User-friendly new features
    4. Performance improvement (with JIT?)
  19. History of Oniguruma, Onigmo, and Ruby

    • 2002: Oniguruma development started
    • 2007: Ruby 1.9
    • 2011: Onigmo forked from Oniguruma
    • 2013: Ruby 2.0
    • 2022: Ruby 3.2.0 (memoization)
    • Ruby's Regexp engine: Oniguruma, then Onigmo
  20. Goals of Project Naraku

    Creating a "Ruby-dedicated" Regexp engine:
    1. Modern & clean architecture
    2. Providing a specification of Ruby's Regexp
    3. User-friendly new features
    4. Performance improvement (with JIT?)
  21. Ruby's Regexp (diagram)

    • Files: prism/prism.c, prism/regexp.c, parse.y, re.c, regparse.c, regcomp.c, regexec.c
    • Paths: Prism /.../ path, parse.y /.../ path, Regexp.new('...') path
    • Label: Canon!
  22. Goals of Project Naraku

    Creating a "Ruby-dedicated" Regexp engine:
    1. Modern & clean architecture
    2. Fixing Ruby's Regexp behavior
    3. User-friendly new features
    4. Performance improvement (with JIT?)
  23. strasse STRASSE

    ſ ſt st ſ ß ẞ ſ
  24. Goals of Project Naraku

    Creating a "Ruby-dedicated" Regexp engine:
    1. Modern & clean architecture
    2. Providing a specification of Ruby's Regexp
    3. User-friendly new features
    4. Performance improvement (with JIT?)
  25. TODO

    • [-] Encoding
      • [x] UTF-8, Shift_JIS, ASCII-8BIT / [ ] Others
    • [x] Parser
    • [ ] Matching VM / [ ] Compilation
  26. Ideal Current architecture

    • Paths: NarakuRuby.parse path, Prism /.../ path, Regexp.new path
    • parse.c
    • comp.c, exec.c: not yet implemented
    • comp.rb, exec.rb: Ruby prototyping for JIT speed-up

    Chapter 3 -Fin-
  27. Today's contents

    • Chapter 1: Ruby's Regexp vs Onigmo's
    • Chapter 2: Why remake Onigmo?
    • Chapter 3: "Onigmo remake project" - Current status
    • Chapter 4: To the Regexp + JIT dream...
  28. Goals of Project Naraku

    Creating a "Ruby-dedicated" Regexp engine:
    1. Modern & clean architecture
    2. Providing a specification of Ruby's Regexp
    3. User-friendly new features
    4. Performance improvement (with JIT?)
  29. NarakuRuby::DFA

    • An on-the-fly DFA construction Regexp engine (like Go's `regexp` package) written purely in Ruby.
    • Limitations: UTF-8 only. No lookarounds, backreferences, or sub-expression calls.
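
As a rough illustration of on-the-fly DFA construction, the sketch below simulates a small hand-written NFA and caches each DFA transition the first time it is needed. It is not NarakuRuby::DFA: the `LazyDFA` class, the `build_abb_dfa` helper, and the example NFA for `(a|b)*abb` are all assumptions made for this example, and it performs whole-string matching only.

```ruby
# A DFA state is a sorted array of NFA states. Transitions are computed
# on demand and stored in @cache, so only the reachable part of the DFA
# that the inputs actually exercise is ever built.
class LazyDFA
  attr_reader :cache

  # step maps [nfa_state, char] to an array of next NFA states.
  def initialize(start:, accept:, step:)
    @start = [start].freeze
    @accept = accept
    @step = step
    @cache = {}
  end

  # Whole-string match.
  def match?(input)
    state = @start
    input.each_char do |ch|
      state = (@cache[[state, ch]] ||= advance(state, ch))
      return false if state.empty?   # dead state: no NFA state survives
    end
    state.include?(@accept)
  end

  private

  def advance(state, ch)
    state.flat_map { |s| @step.fetch([s, ch], []) }.uniq.sort.freeze
  end
end

# Hypothetical NFA for (a|b)*abb: state 0 loops on a and b,
# then a, b, b lead through states 1, 2 to the accepting state 3.
def build_abb_dfa
  LazyDFA.new(
    start: 0,
    accept: 3,
    step: {
      [0, "a"] => [0, 1],
      [0, "b"] => [0],
      [1, "b"] => [2],
      [2, "b"] => [3],
    }
  )
end

dfa = build_abb_dfa
%w[abb babb ab abba].each { |s| puts "#{s}: #{dfa.match?(s)}" }
puts "cached transitions: #{dfa.cache.size}"
```

Because every input character costs one hash lookup once the transition is cached, matching time stays linear in the input length, which is the usual motivation for this design.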
  30. NarakuRuby::DFA: Important disclaimer

    • This engine was built solely for "Regexp + [YZ]JIT" experiments.
    • It does NOT represent the future architectural direction of Project Naraku and Ruby.
  31. Benchmark results (in iterations per second): Comparing with [YZ]JIT

    • Columns: Benchcase; NarakuRuby::DFA.match? (no JIT); NarakuRuby::DFA.match? (w/ YJIT); YJIT / no JIT; NarakuRuby::DFA.match? (w/ ZJIT); ZJIT / no JIT
    • Rows: literal, alternation, repetition greedy, repetition ambiguous, char_class, anchor
    • Throughput cells are in i/s (k i/s for char_class and anchor with YJIT); both ratio columns read "x faster" in every row.
    • The numeric values are not recoverable from this transcript.
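
For readers who want to produce comparable iterations-per-second figures, here is a minimal sketch using only the standard library. The `iterations_per_second` helper and the sample pattern are hypothetical, not the benchmark harness used in the talk; run the script once with and once without `--yjit` to compare.

```ruby
require "benchmark"

# Call the block repeatedly for roughly `seconds` and return iterations per second.
def iterations_per_second(seconds: 0.2)
  count = 0
  elapsed = Benchmark.realtime do
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
    while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
      yield
      count += 1
    end
  end
  count / elapsed
end

# RubyVM::YJIT is defined only on builds that include YJIT.
yjit_on = defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled? ? true : false

ips = iterations_per_second { /a+b/.match?("aaaaaaaab") }
puts format("Regexp#match? (YJIT enabled: %s): %.0f i/s", yjit_on, ips)
```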
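  32. Benchmark results (in iterations per second): Comparing with Onigmo (YJIT enabled)

    • Columns: Benchcase; Regexp#match? (Onigmo); NarakuRuby::DFA.match? (Naraku); Naraku / Onigmo
    • Rows: literal, alternation, repetition greedy, repetition ambiguous, char_class, anchor
    • Onigmo cells are in k i/s except repetition ambiguous (i/s); Naraku cells are in i/s except char_class and anchor (k i/s); the ratio column reads "x slower" in every row.
    • The numeric values are not recoverable from this transcript.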
  33. Benchmark results (in iterations per second): with YJIT

    • Columns: Benchcase; Regexp#match? (Onigmo); NarakuRuby::DFA.match? (Naraku); Over-optimized NarakuRuby::DFA.match? (OO Naraku); OO Naraku / Onigmo
    • Rows: literal ("x slower"), repetition greedy ("x slower"), repetition ambiguous ("x faster")
    • The numeric values are not recoverable from this transcript.
  34. Lessons from benchmarks

    • YJIT lets a pure-Ruby Regexp engine run as fast as Onigmo.
    • However, this causes a maintainability issue.
    • Perhaps my program did not get the full power of ZJIT. I will try to learn the ZJIT architecture.
  35. Next plan for Project Naraku

    • To prevent performance degradation, we first go for the realistic way.
    • C side: parse.c, exec.c (matching VM)
    • Ruby side: comp.rb
  36. Direct byte-code compilation issue

    • Pattern: /(a|b)c/
    • AST: concat(alt(literal "a", literal "b"), literal "c")
    • Byte code (current): push :br; char "a"; jump :exit; :br: char "b"; :exit: char "c"
    • Issue: What prev char?
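
To make the byte code concrete, the sketch below runs it on a tiny backtracking VM. The instruction names follow the slide, but the VM itself, the resolved label indices, and the trailing `:match` instruction are assumptions added for this example; Onigmo's real instruction set is much richer.

```ruby
# Byte code for /(a|b)c/ with labels resolved to instruction indices.
ALT_PROGRAM = [
  [:push, 3],     # 0: push :br   (remember the alternative branch)
  [:char, "a"],   # 1
  [:jump, 4],     # 2: jump :exit
  [:char, "b"],   # 3: :br
  [:char, "c"],   # 4: :exit
  [:match],       # 5: added for this sketch
].freeze

# Returns true if the program matches a prefix of `input` starting at offset 0.
def vm_match?(program, input)
  stack = [[0, 0]]                        # backtrack points: [pc, position]
  until stack.empty?
    pc, pos = stack.pop
    loop do
      op, arg = program[pc]
      case op
      when :char
        break unless input[pos] == arg    # mismatch: backtrack
        pc += 1
        pos += 1
      when :push
        stack.push([arg, pos])
        pc += 1
      when :jump
        pc = arg
      when :match
        return true
      end
    end
  end
  false
end

%w[ac bc cc a].each { |s| puts "#{s}: #{vm_match?(ALT_PROGRAM, s)}" }
```

Note that instruction 4 (`char "c"`) is reached from both branches. One plausible reading of the slide's "What prev char?" note is that flat byte code no longer records which instruction ran just before it.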
  37. IR (Internal Representation) for Regexp

    • Pattern: /(a|b)c/
    • AST: concat(alt(literal "a", literal "b"), literal "c")
    • Byte code (current): push :br; char "a"; jump :exit; :br: char "b"; :exit: char "c"
    • IR (future): push, char "a", char "b", char "c"
  38. Memoization and IR

    • Pattern: /a+b/
    • AST: concat(quantifier {1,*}(literal "a"), literal "b")
    • Byte code (current): :enter: char "a"; push :exit; jump :enter; :exit: char "b"
    • IR (future): char "a", push, char "b"
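
One common form of memoization in a backtracking matcher records each (instruction, position) pair already explored and never explores it twice. The sketch below assumes that reading; it is not the Ruby 3.2 implementation. `run`, `A_PLUS_B`, and `AMBIGUOUS` are hypothetical names, and the second program, for `/(a|a)*b/`, is included only because the slide's `/a+b/` is too simple to show a difference.

```ruby
require "set"

# Byte code from the slide for /a+b/ (labels resolved, :match added).
A_PLUS_B = [
  [:char, "a"],   # 0: :enter
  [:push, 3],     # 1: push :exit
  [:jump, 0],     # 2: jump :enter
  [:char, "b"],   # 3: :exit
  [:match],       # 4
].freeze

# Hypothetical ambiguous program for /(a|a)*b/.
AMBIGUOUS = [
  [:push, 6],     # 0: loop head, the alternative is the exit
  [:push, 4],     # 1: second branch of (a|a)
  [:char, "a"],   # 2
  [:jump, 5],     # 3
  [:char, "a"],   # 4
  [:jump, 0],     # 5: back to the loop head
  [:char, "b"],   # 6
  [:match],       # 7
].freeze

# Returns [matched, steps]. With memoize: true, a [pc, position] pair
# that has already been explored is skipped.
def run(program, input, memoize:)
  seen = Set.new
  stack = [[0, 0]]
  steps = 0
  until stack.empty?
    pc, pos = stack.pop
    loop do
      break if memoize && !seen.add?([pc, pos])
      steps += 1
      op, arg = program[pc]
      case op
      when :char
        break unless input[pos] == arg
        pc += 1
        pos += 1
      when :push
        stack.push([arg, pos])
        pc += 1
      when :jump
        pc = arg
      when :match
        return [true, steps]
      end
    end
  end
  [false, steps]
end

input = "a" * 12
puts "plain:    #{run(AMBIGUOUS, input, memoize: false).inspect}"
puts "memoized: #{run(AMBIGUOUS, input, memoize: true).inspect}"
```

With memoization the step count is bounded by the number of instructions times the number of input positions, whereas the plain run grows exponentially on this input.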
  39. Next next future

    • This architecture makes it easy to introduce a Ruby matching VM!
    • C side: parse.c, exec.c (matching VM)
    • Ruby side: comp.rb, exec.rb (pure Ruby matching VM!)
  40. Remaining tasks

    • Matching VM / Compilation
    • Other encodings (GB18030, etc.)
    • Connecting to Ruby (replacing Onigmo with Naraku)
    • Subset for Prism (?)
    • Java or WebAssembly binding for JRuby (?)