(Re)make Regexp in Ruby: Democratizing internals for the JIT

Hiroya Fujinami (a.k.a. makenowjust) @RubyKaigi 2026 in Hakodate / 2026-04-24

Hiroya Fujinami a.k.a. makenowjust
• https://github.com/makenowjust / https://x.com/make_now_just
• Ph.D. student at NII (National Institute of Informatics).
• Studying the application of automata theory and formal languages.
• Recently interested in nominal sets.
• Currently on the job market (!)

Ruby committer

Regexp memoization

https://www.ruby-lang.org/ja/news/2022/12/25/ruby-3-2-0-released/

Regexp is good.

[YZ]JIT is good.

Regexp × JIT is good²...?

Implementation plan (at the proposal time)
Regexp matching flow (Onigmo): regparse.c → regcomp.c (compiler) → regexec.c (matching VM)
Plan: regcomp.rb and regexec.rb, written in Ruby (so JIT powered), by exposing internals (parser, char class).

Ruby's Regexp is not Onigmo.

Onigmo is a Regexp engine not only for Ruby.

Misperception about Ruby's Regexp
Regexp matching flow (Ruby): Ruby's preprocess (1) → Ruby's preprocess (2) → Onigmo: regparse.c → regcomp.c (compiler) → regexec.c (matching VM)
[Diagram labels: Ruby's Regexp, Problem area]

We need a "Ruby-dedicated" Regexp engine!

Onigmo should be made over!

Let's start "Onigmo remake project"! Chapter 1 -Fin-

Today's contents
• Chapter 1: Ruby's Regexp vs Onigmo's
• Chapter 2: Why remake Onigmo?
• Chapter 3: "Onigmo remake project" - Current status
• Chapter 4: To the Regexp + JIT dream...

Ruby's Regexp preprocessing is crazy.

1 pattern string requires 3 preprocessing + 1 true processing.

4-stage Regexp (pre)processing
1. Handling escape sequences on parsing as Ruby (prism/prism.c)
2. Fast checking for collecting named captures (prism/regexp.c)
3. Handling escape sequences before compiling Regexp (re.c)
4. Parsing with Onigmo (regparse.c)

What each stage does (1)
Stage 1: Handling escape sequences on parsing as Ruby (prism/prism.c)
original:
    /(?<x>\woo)\
    \u{1F600}/ =~ "foo😀"
    p /x\
    y/ # => ???
processed (1):
    /(?<x>\woo)\u{1F600}/ =~ "foo😀"
    p /xy/ # => /xy/

What each stage does (2)
Stage 2: Fast checking for collecting named captures (prism/regexp.c)
processed (1):
    /(?<x>\woo)\u{1F600}/ =~ "foo😀"
    p /xy/ # => /xy/
processed (2):
    /(?<x>\woo)\u{1F600}/ =~ "foo😀"
    x = $~[:x] if $~
    p /xy/ # => /xy/
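
The effect of stage 2 is visible from plain Ruby: a regexp literal on the left-hand side of `=~` assigns its named captures to local variables, while a Regexp object built at run time does not. The following is a minimal sketch of that difference; the method names `literal_match` and `dynamic_match` are made up for illustration, and the emoji from the slide is left out to keep the example short.

```ruby
# A regexp literal on the left of =~ assigns named captures to locals.
def literal_match
  if /(?<x>\woo)/ =~ "foo"
    x # local variable introduced by the parser
  end
end

# A Regexp created at run time only fills $~; no local `x` is created.
def dynamic_match
  re = Regexp.new('(?<x>\woo)')
  re =~ "foo"
  [binding.local_variable_defined?(:x), $~[:x]]
end

p literal_match  # => "foo"
p dynamic_match  # => [false, "foo"]
```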

What each stage does (3)
Stage 3: Handling escape sequences before compiling Regexp (re.c)
processed (2):
    /(?<x>\woo)\u{1F600}/ =~ "foo😀"
    x = $~[:x] if $~
    p /xy/ # => /xy/
internal:
    s = "(?<x>\woo)😀"
    e = UTF-8
    onigmo_parse(s, e)

What each stage does (4)
Stage 4: Parsing with Onigmo (regparse.c)
internal:
    s = "(?<x>\woo)😀"
    e = UTF-8
    onigmo_parse(s, e)
internal AST:
    list
      enclose
        list
          ctype \w
          string "oo"
      string "😀"

Details of Regexp preprocessing in Ruby
• These preprocessing stages are for the Prism case.
• Regexp preprocessing stages depend on the Ruby parser and on how a Regexp value is created.
• parse.y also has its own preprocessing stage, but it uses Onigmo for collecting named captures.
• When Regexp.new(...) is called, preprocessing starts from stage 3.
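
As a small illustration of the last bullet, the sketch below passes a `\u{...}` escape to `Regexp.new` inside a single-quoted string, so the Ruby parser leaves the backslash sequence untouched and the escape has to be handled at Regexp compile time. The constant names are made up for this example.

```ruby
# Literal: the escape goes through the parser-side stages first.
LITERAL = /\u{1F600}/

# Regexp.new with a single-quoted string: the parser does not touch the
# backslash sequence, so the escape is resolved when the Regexp is compiled.
DYNAMIC = Regexp.new('\u{1F600}')

TARGET = "\u{1F600}"

p LITERAL.match?(TARGET)  # => true
p DYNAMIC.match?(TARGET)  # => true
```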

[Diagram: paths into Ruby's Regexp (regparse.c, regcomp.c, regexec.c)]
• Prism /.../ path: prism/prism.c, prism/regexp.c, re.c
• parse.y /.../ path: parse.y
• Regexp.new('...') path: re.c

Preprocessing chaos

Bugs coming from preprocessing chaos
• Double-, triple-, and quadruple-escape processing is highly problematic. There are countless bugs.
• e.g.,
  1. `/\c?/ =~ "\x7F"`, but `Regexp.new('\c?') !~ "\x7F"`.
  2. `/(?<\x61>x)/ =~ "x"` raises IndexError.
  3. `/[]]/` is an error in Prism, but it works in parse.y.
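
These cases can be checked on a given Ruby build with a small probe. The sketch below evaluates each snippet from a string, so that a syntax error in one of them does not stop the others, and reports either the result or the exception class. `SNIPPETS`, `probe`, and `RESULTS` are names invented here; the outcome depends on the Ruby version and on which parser is in use, so no expected output is shown.

```ruby
# Snippets taken from the bullet list above, kept as source text.
SNIPPETS = [
  %q{/\c?/ =~ "\x7F"},
  %q{Regexp.new('\c?') =~ "\x7F"},
  %q{/(?<\x61>x)/ =~ "x"},
  %q{/[]]/ =~ "]"},
]

# Evaluate one snippet; return its inspected value or the exception class name.
def probe(source)
  eval(source).inspect
rescue ScriptError, StandardError => e
  e.class.name
end

RESULTS = SNIPPETS.to_h { |src| [src, probe(src)] }
RESULTS.each { |src, outcome| puts "#{src}  ->  #{outcome}" }
```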

Implementation plan (at the proposal time)
Regexp matching flow (Onigmo): regparse.c → regcomp.c (compiler) → regexec.c (matching VM)
Plan: regcomp.rb and regexec.rb, written in Ruby (so JIT powered), by exposing internals (parser, char class).

Onigmo.parse ?

Onigmo's Regexp ≠ Ruby's Regexp

Onigmo.parse → Regexp.parse

[Diagram: /.../ path, parse.y, Ruby's Regexp (regcomp.c, regexec.c), and a Regexp.parse path (?)]

Seriously?

Ideal architecture
[Diagram: Prism /.../ path, parse.y /.../ path, Regexp.new path, and Regexp.parse path all lead to parse.c, followed by comp.c and exec.c]

Onigmo is a Regexp engine not only for Ruby.

We need a "Ruby-dedicated" Regexp engine!

Do you understand?

Do you wanna remake Onigmo? Chapter 2 -Fin-

Project Naraku
The Onigmo Remake Project

Goals of Project Naraku: Creating a "Ruby-dedicated" Regexp engine
1. Modern & clean architecture
2. Providing the Ruby's Regexp specification
3. User friendly new features
4. Performance improvement (with JIT?)

History of Oniguruma, Onigmo, and Ruby
• 2002: Oniguruma development started
• 2007: Ruby 1.9
• 2011: Onigmo forked from Oniguruma
• 2013: Ruby 2.0
• 2022: Ruby 3.2.0 (memoization)
Ruby's Regexp engine: Oniguruma, then Onigmo

Number of `goto` statements: reg*.c (Onigmo) vs. others
• reg*.c: 404
• others: 839

`goto` hell

Ruby's Regexp Eliminate! Onigmo's Regexp Naraku

"Compatibility" Myth

Ruby's Regexp only in our brains/recognition

[Diagram: /.../ path, parse.y, Ruby's Regexp (regcomp.c, regexec.c): Canon!]

Goals of Project Naraku: Creating a "Ruby-dedicated" Regexp engine
1. Modern & clean architecture
2. Fix the Ruby's Regexp behavior
3. User friendly new features
4. Performance improvement (with JIT?)

`i` flag is complex due to Unicode full case folding

/(?i)Straße/ matches 704 variants.

s t r a s s e
S T R A S S E
ſ ﬅ ﬆ ſ ß ẞ ſ
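
The figure of 704 can be reproduced by counting combinations. The sketch below assumes the per-position alternatives suggested by the characters above: `s`/`S`/`ſ` for each "s", the ligatures `ﬅ` and `ﬆ` for "st", and `ß`/`ẞ` for "ss". The constant names are made up for illustration. The last line asks the current engine how many of the generated strings it matches; that number is printed rather than assumed.

```ruby
# "st": 3 spellings of s times 2 of t, plus the two ligatures = 8
ST_VARIANTS = %w[s S ſ].product(%w[t T]).map(&:join) + %w[ﬅ ﬆ]
# "ss": 3 times 3 two-letter spellings, plus ß and ẞ = 11
SS_VARIANTS = %w[s S ſ].product(%w[s S ſ]).map(&:join) + %w[ß ẞ]

# 8 * 2 * 2 * 11 * 2 combinations
VARIANTS = ST_VARIANTS
  .product(%w[r R], %w[a A], SS_VARIANTS, %w[e E])
  .map(&:join)

puts VARIANTS.size       # => 704
puts VARIANTS.uniq.size  # => 704

# How many of them does the running Ruby actually match?
puts VARIANTS.count { |v| /(?i)Straße/.match?(v) }
```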

Introduce `I` flag: ASCII-only case folding

Current status of Project Naraku

https://github.com/makenowjust/naraku

TODO
• [-] Encoding
  • [x] UTF-8, Shift_JIS, ASCII-8BIT / [ ] Others
• [x] Parser
• [ ] Matching VM / [ ] Compilation

Parser is implemented.

Current architecture
[Diagram: NarakuRuby.parse path, Prism /.../ path, Regexp.new path; parse.c, comp.c, exec.c; parts marked "not yet implemented"; comp.rb and exec.rb: Ruby prototyping for JIT speed-up]

Chapter 3 -Fin-

NarakuRuby::DFA
• An on-the-fly DFA construction Regexp engine (like Go's `regexp` package) written purely in Ruby.
• Limitations: UTF-8 only. No lookarounds, backreferences, or sub-exp calls.
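
To make "on-the-fly DFA construction" concrete, here is a minimal sketch of the general technique, not the NarakuRuby::DFA code. It assumes a hand-written NFA for `(a|b)*abb` with anchored, whole-string matching; `LazyDFA`, `ABB_NFA`, and `LAZY` are hypothetical names. Each DFA state is a set of NFA states, and a transition is computed only the first time it is needed, then cached.

```ruby
class LazyDFA
  attr_reader :cache

  # nfa: { state => { char => [next states] } }
  def initialize(nfa, start:, accept:)
    @nfa = nfa
    @start = [start].freeze
    @accept = accept
    @cache = {} # [dfa_state, char] => dfa_state, filled lazily
  end

  def match?(input)
    state = @start
    input.each_char do |ch|
      state = step(state, ch)
      return false if state.empty? # dead state: no NFA state survives
    end
    state.include?(@accept)
  end

  private

  # Compute (or reuse) the DFA transition for this state set and character.
  def step(state, ch)
    @cache[[state, ch]] ||= state
      .flat_map { |s| @nfa.dig(s, ch) || [] }
      .uniq.sort.freeze
  end
end

# NFA for (a|b)*abb; state 3 is accepting.
ABB_NFA = {
  0 => { "a" => [0, 1], "b" => [0] },
  1 => { "b" => [2] },
  2 => { "b" => [3] },
}

LAZY = LazyDFA.new(ABB_NFA, start: 0, accept: 3)
p LAZY.match?("ababb") # => true
p LAZY.match?("abab")  # => false
p LAZY.cache.size      # number of DFA transitions built so far
```

Only the transitions that the inputs actually exercise end up in the cache, which is the point of building the DFA lazily instead of ahead of time.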

NarakuRuby::DFA: Important disclaimer
• This engine was built solely for "Regexp + [YZ]JIT" experiments.
• It does NOT represent the future architectural direction of Project Naraku and Ruby.

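Benchmark results (in iterations per second): Comparing with [YZ]JIT
(Numeric values are not recoverable from this transcript; only units and direction remain.)
Columns: Bench case | NarakuRuby::DFA.match? (no JIT) | NarakuRuby::DFA.match? (w/ YJIT) | YJIT / no JIT | NarakuRuby::DFA.match? (w/ ZJIT) | ZJIT / no JIT
literal | … i/s | … i/s | …x faster | … i/s | …x faster
alternation | … i/s | … i/s | …x faster | … i/s | …x faster
repetition greedy | … i/s | … i/s | …x faster | … i/s | …x faster
repetition ambiguous | … i/s | … i/s | …x faster | … i/s | …x faster
char_class | … i/s | … ki/s | …x faster | … i/s | …x faster
anchor | … i/s | … ki/s | …x faster | … i/s | …x faster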
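
For readers who want to take comparable measurements, the following is a minimal sketch of an iterations-per-second loop using only core Ruby. It is not the harness used for the talk; `iterations_per_second` and `SAMPLE_IPS` are names invented here, and the built-in `Regexp#match?` stands in for the engine under test. Running the same script with and without `--yjit` gives the two columns to compare.

```ruby
# Count how many times the block runs within `duration` seconds.
def iterations_per_second(duration: 0.1)
  count = 0
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  deadline = start + duration
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    yield
    count += 1
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  count / elapsed
end

# Report which JIT mode this process is running in.
jit = defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled? ? "YJIT" : "no JIT"

SAMPLE_IPS = iterations_per_second { /(a|b)*abb/.match?("ababb") }
puts format("%s: %.1f i/s", jit, SAMPLE_IPS)
```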

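Benchmark results (in iterations per second): Comparing with Onigmo (YJIT enabled)
(Numeric values are not recoverable from this transcript; only units and direction remain.)
Columns: Bench case | Regexp#match? (Onigmo) | NarakuRuby::DFA.match? (Naraku) | Naraku / Onigmo
literal | … ki/s | … i/s | …x slower
alternation | … ki/s | … i/s | …x slower
repetition greedy | … ki/s | … i/s | …x slower
repetition ambiguous | … i/s | … i/s | …x slower
char_class | … ki/s | … ki/s | …x slower
anchor | … ki/s | … ki/s | …x slower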

Onigmo (and C) is fast.

Over optimization

Ahead-of-time (AOT) DFA construction + inline source expansion (source generation & eval)
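
A minimal sketch of what "source generation & eval" can look like, assuming a DFA that has already been built ahead of time for `(a|b)*abb`. This illustrates the general technique and is not the code from the talk; `ABB_DFA`, `generate_matcher_source`, and `MATCHER` are hypothetical names. The transition table is expanded into nested `case` statements as Ruby source text, which is then turned into a lambda with `eval`.

```ruby
# A precomputed DFA for (a|b)*abb (anchored, whole-string match).
ABB_DFA = {
  start: 0,
  accept: [3],
  trans: {
    0 => { "a" => 1, "b" => 0 },
    1 => { "a" => 1, "b" => 2 },
    2 => { "a" => 1, "b" => 3 },
    3 => { "a" => 1, "b" => 0 },
  },
}

# Expand the transition table into Ruby source for a matcher lambda.
def generate_matcher_source(dfa)
  lines = []
  lines << "lambda do |input|"
  lines << "  state = #{dfa[:start]}"
  lines << "  input.each_byte do |byte|"
  lines << "    case state"
  dfa[:trans].each do |state, edges|
    lines << "    when #{state}"
    lines << "      case byte"
    edges.each { |ch, to| lines << "      when #{ch.ord} then state = #{to}" }
    lines << "      else return false" # no transition: reject
    lines << "      end"
  end
  lines << "    end"
  lines << "  end"
  lines << "  #{dfa[:accept].inspect}.include?(state)"
  lines << "end"
  lines.join("\n")
end

MATCHER = eval(generate_matcher_source(ABB_DFA))
p MATCHER.call("ababb") # => true
p MATCHER.call("abab")  # => false
```

The generated method contains no table lookups, only integer comparisons, which is the kind of code a JIT can specialize well; the cost is that the matcher now lives in generated text rather than in readable source.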

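Benchmark results (in iterations per second): with YJIT
(Numeric values are not recoverable from this transcript; only units and direction remain.)
Columns: Bench case | Regexp#match? (Onigmo) | NarakuRuby::DFA.match? (Naraku) | Over-optimized NarakuRuby::DFA.match? (OO Naraku) | OO Naraku / Onigmo
literal | … ki/s | … i/s | … ki/s | …x slower
repetition greedy | … ki/s | … i/s | … ki/s | …x slower
repetition ambiguous | … i/s | … i/s | … ki/s | …x faster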

Lessons from benchmarks
• YJIT enables a pure-Ruby Regexp engine to be as fast as Onigmo.
• However, the over-optimization causes a maintainability issue.
• Perhaps my program did not obtain the full ZJIT power. I will try to learn the ZJIT architecture.

Next plan for Project Naraku
• To prevent performance degradation, we first go for the realistic way.
[Diagram: C side: parse.c, exec.c (matching VM); Ruby side: comp.rb]

Direct byte-code compilation issue (Current)
/(a|b)c/
AST: concat(alt(literal "a", literal "b"), literal "c")
byte code:
      push :br
      char "a"
      jump :exit
    :br
      char "b"
    :exit
      char "c"    ← What prev char?
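
To show how such byte code runs, here is a minimal backtracking VM sketch for the program above. It is an illustration and not Onigmo's or Naraku's VM: labels are replaced by instruction indexes, a `:match` instruction is added at the end, matching starts at position 0 only, and `AC_PROGRAM` and `run` are hypothetical names.

```ruby
# Byte code for /(a|b)c/ with labels resolved to indexes.
AC_PROGRAM = [
  [:push, 3],    # 0: try "a" first; on failure resume at 3 (:br)
  [:char, "a"],  # 1
  [:jump, 4],    # 2: skip the other branch
  [:char, "b"],  # 3: :br
  [:char, "c"],  # 4: :exit
  [:match],      # 5
]

def run(program, input)
  stack = [[0, 0]] # backtrack stack of [pc, position]
  until stack.empty?
    pc, pos = stack.pop
    loop do
      op, arg = program[pc]
      case op
      when :push
        stack.push([arg, pos]) # remember the alternative
        pc += 1
      when :jump
        pc = arg
      when :char
        break unless input[pos] == arg # fail: backtrack
        pos += 1
        pc += 1
      when :match
        return true
      else
        raise ArgumentError, "unknown instruction at #{pc}"
      end
    end
  end
  false
end

p run(AC_PROGRAM, "ac") # => true
p run(AC_PROGRAM, "bc") # => true
p run(AC_PROGRAM, "cc") # => false
```

When execution reaches index 4, the flat instruction list alone does not say whether it arrived from `char "a"` or from `char "b"`, which is the "What prev char?" question on the slide.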

IR (Internal Representation) for Regexp (Future)
/(a|b)c/
AST: concat(alt(literal "a", literal "b"), literal "c")
IR: push → { char "a" | char "b" } → char "c"
byte code:
      push :br
      char "a"
      jump :exit
    :br
      char "b"
    :exit
      char "c"

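One way to picture why a graph-shaped IR helps: with explicit successor edges, "which characters can come right before this node?" becomes a simple predecessor lookup. The sketch below is a guess at the idea and not Naraku's actual IR; `Node`, `AC_IR`, and `prev_chars` are hypothetical, and the lookup assumes an acyclic graph.

```ruby
Node = Struct.new(:id, :op, :arg, :succs)

# IR for /(a|b)c/: push branches to both alternatives, which join at "c".
AC_IR = [
  Node.new(0, :push, nil, [1, 2]),
  Node.new(1, :char, "a", [3]),
  Node.new(2, :char, "b", [3]),
  Node.new(3, :char, "c", []),
]

# Characters that can be consumed immediately before `target_id`.
# Non-char predecessors are looked through (acyclic graphs only).
def prev_chars(ir, target_id)
  preds = ir.select { |n| n.succs.include?(target_id) }
  preds.flat_map { |n| n.op == :char ? [n.arg] : prev_chars(ir, n.id) }.uniq
end

p prev_chars(AC_IR, 3) # => ["a", "b"]
p prev_chars(AC_IR, 1) # => []
```
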
Memoization and IR
/a+b/
AST: concat(quantifier {1,*}(literal "a"), literal "b")
Current byte code:
    :enter
      char "a"
      push :exit
      jump :enter
    :exit
      char "b"
Future: IR (char "a", push, char "b") / byte code
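
The idea behind memoization can be sketched on top of a tiny backtracking VM: once a pair of (instruction, input position) has been explored without reaching a match, exploring it again cannot succeed, so it can be skipped. The code below illustrates that idea only. It is not the Ruby 3.2 implementation, it memoizes at every instruction rather than at selected points, and `PLUS_PROGRAM` and `run_memoized` are hypothetical names.

```ruby
require "set"

# Byte code for /a+b/ with labels resolved to indexes.
PLUS_PROGRAM = [
  [:char, "a"],  # 0: :enter
  [:push, 3],    # 1: remember :exit as the alternative
  [:jump, 0],    # 2: greedily try another "a"
  [:char, "b"],  # 3: :exit
  [:match],      # 4
]

def run_memoized(program, input)
  visited = Set.new # [pc, position] pairs already explored
  stack = [[0, 0]]
  until stack.empty?
    pc, pos = stack.pop
    loop do
      # add? returns nil when the pair was seen before: it failed then,
      # so it will fail again, and this path can be abandoned.
      break unless visited.add?([pc, pos])
      op, arg = program[pc]
      case op
      when :push
        stack.push([arg, pos])
        pc += 1
      when :jump
        pc = arg
      when :char
        break unless input[pos] == arg
        pos += 1
        pc += 1
      when :match
        return true
      else
        raise ArgumentError, "unknown instruction at #{pc}"
      end
    end
  end
  false
end

p run_memoized(PLUS_PROGRAM, "aaab") # => true
p run_memoized(PLUS_PROGRAM, "aaa")  # => false
```

Because every pair is visited at most once, the work is bounded by program size times input length, at the price of the memory used by the visited set.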

Next next future
• With this architecture, it is easy to introduce a Ruby matching VM!
[Diagram: C side: parse.c, exec.c (matching VM); Ruby side: comp.rb, exec.rb (pure Ruby matching VM!)]

Remaining tasks
• Matching VM / Compilation
• Other encodings (GB18030, etc.)
• Connecting to Ruby (replacing Onigmo with Naraku)
• Subset for Prism (?)
• Java or WebAssembly binding for JRuby (?)

Made it!

To be continued...
