Make Regexp#match much faster

2023/5/11 RubyKaigi 2023 at Matsumoto, Japan Make Regexp#match Much Faster
Hiroya Fujinami (makenowjust)

Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi
2023 at Matsumoto, Japan Hiroya Fujinami ‣ Ph.D. student at SOKENDAI (NII) ‣ t @make_now_just g @makenowjust ‣ Ruby committer since 2022/12 - I contributed for speed up Regexp#match in CRuby ⚡ at Cookpad Inc. internship

2023 at Matsumoto, Japan Agenda ‣ Regexp is powerful   Powerful features and applications in Ruby's Regexp ‣ ReDoS — a vulnerability about Regexp   What a vulnerability is ReDoS (Regular Expression Denial of Service)? ‣ How to speed up Regexp matching   The optimization by memoization makes Regexp matching faster. ‣ Future work   Talking about feature vision on Regexp matching

2023 at Matsumoto, Japan Regexp is powerful   Powerful features and applications in Ruby's Regexp

2023 at Matsumoto, Japan Ruby's Regexp is powerful! ‣ Ruby's Regexp has some rich features: - multiple character encoding support, - look-around operators, atomic groups, conditional branches, absent operator, - subexpression calls, and back-references (irregular extensions). → These features help our developments and hobbies. (?=foo) (?!foo) (?<=foo) (?<!foo) (?>foo) (?(x)foo|bar) look-around operators atomic group conditional branches (?<x>foo){0}\g<x> subexpression call (?<x>foo*)\k<x> back-reference (?~foo) absent operator

2023 at Matsumoto, Japan How powerful is it? ‣ The following Regexp computes (veri fi es) a sum equation of two binary digits. - e.g. "01+01=10" and "110+001=111" "01+01=00" /(?<s>[01](?=(?:(?<l0>(?<=0))|(?<l1>(?<=1)))[01]*\+(?<r>(?:\k<r-1>|(?!\k<r-1>)) [01])(?:(?<r0>(?<=0))|(?<r1>(?<=1)))[01]*=(?<a>(?:\k<a-1>|(?!\k<a-1>))[01])(?:(? <a0>(?<=0))|(?<a1>(?<=1)))[01]*\z(?:(?<c>\k<l0+0>\k<r0+0>\k<a1+0>| \k<l1+0>\k<r0+0>\k<a0+0>|\k<l0+0>\k<r1+0>\k<a0+0>|\k<l1+0>\k<r1+0>\k<a1+0>)| \k<l0+0>\k<r0+0>\k<a0+0>|\k<l1+0>\k<r0+0>\k<a1+0>|\k<l0+0>\k<r1+0>\k<a1+0>| \k<l1+0>\k<r1+0>\k<a0+0>)(?:\k<l0+0>(?:\k<r0+0>(?:\k<a0+0>(?!\k<c-1>)|\k<a1+0>(?! \k<c-1>))|\k<r1+0>(?:\k<a0+0>\k<c-1>|\k<a1+0>(?!\k<c-1>)))|\k<l1+0>(?:\k<r0+0>(?: \k<a0+0>\k<c-1>|\k<a1+0>(?!\k<c-1>))|\k<r1+0>(?:\k<a0+0>\k<c-1>| \k<a1+0>\k<c-1>))))\g<s>|(?!\k<c-1>)\+\k<r-1>=\k<a-1>){0}\A\g<s>\z/ Theoretically, this language (a set of strings) is neither regular nor context-free.

2023 at Matsumoto, Japan How powerful is it? ‣ A brainf*ck implementation in Ruby's Regexp exists. - https://github.com/shinh/hack/blob/master/bf_rb_reg/bf.rb - BF_REG =~ bf + BF_SUFFIX executes a bf program on Regexp matching   (loops may stop after 256 iterations.) ‣ Conclusion → ultimately powerful!

2023 at Matsumoto, Japan With great power   comes great responsibility.

2023 at Matsumoto, Japan ReDoS — a vulnerability about Regexp   What a vulnerability is ReDoS (Regular Expression Denial of Service)?

2023 at Matsumoto, Japan ReDoS (Regular Expression Denial of Service) ‣ A vulnerability about Regexp matching. ‣ The time taken for Regexp matching can be explosive. - time ruby -e '/\A(a|a)*\z/ =~ "a" * 30 + "b"'   30.99s user 0.09s system 99% cpu 31.188 total - The Regexp /\A(a|a)*\z/ takes an exponential matching time   against such an input string "a" * n + "b".

2023 at Matsumoto, Japan What a problem is Regexp matching explosion? One scenario: ‣ A web app uses /\A[^@]+@[^@]+([.][^@]+)+\z/   for validating E-mail address. ‣ Then, one user sends "a@" + "." * 50 + "@" to this web app. ‣ For processing this request, web app takes some minutes   because the Regexp matching time is explosive. ‣ Other users see a loading window or 500 Internal Server Error ¯\_(π)_/¯. → ReDoS gives a bad experience to users and hurts a business opportunity.

2023 at Matsumoto, Japan ReDoS is a threat in the real world. ‣ For example, Cloud fl are was down for 27 minutes due to ReDoS.   https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/ ‣ Recently, many ReDoS vulnerabilities in Ruby gems are reported. - CVE-2022-24836 (Nokogiri), CVE-2022-30122 (Rack), etc.   (See details in Japanese, https://zenn.dev/ooooooo_q/articles/ruby_3_2_redos) ‣ ReDoS vulnerabilities in Ruby's core libraries are also reported. → We need to prevent ReDoS in Ruby itself.

2023 at Matsumoto, Japan How to prevent ReDoS ‣ Some regexes cause ReDoS and some do not. ‣ ReDoS can be prevented if we are careful...? ‣ One way: using a ReDoS detection tool. - https://github.com/makenowjust-labs/recheck ‣ Another way: improving Regexp matching implementation (today's talk). - Ruby 3.2+ has the improved Regexp implementation that prevents ReDoS.

2023 at Matsumoto, Japan Improvement result (1) polynomial case https://www.ruby-lang.org/ja/news/2022/12/25/ruby-3-2-0-released/

2023 at Matsumoto, Japan Improvement result (2) exponential case https://www.ruby-lang.org/ja/news/2022/12/25/ruby-3-2-0-released/

2023 at Matsumoto, Japan How to speed up Regexp matching ⚡ The optimization by memoization makes Regexp matching faster.

2023 at Matsumoto, Japan VM-based Regexp engine ‣ CRuby's Regexp engine (Onigmo) is VM-based. ‣ VM-based means - Regexp is compiled into byte codes, and - byte code execution (matching) uses backtracking. ‣ push @label pushes a @label and the current position   into the stack for backtracking. begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end /\A(?:a|a)*\z/ Regexp byte codes compile

2023 at Matsumoto, Japan Memoization ‣ Memoization is a technique that records states and their results,   and omits computations for the same state. ‣ On the VM-based matching, a pair of the current PC (program counter)   and the position on the input string is recorded as a state. - If the same state is reached again, it is not necessary to record the result,   since the matching has failed once from that state. ‣ Starting on the next slide, I will explain how VM-based matching works.

2023 at Matsumoto, Japan Example: VM-based Regexp matching (1) ‣ Matching against the string "aab" is started. ‣ Initially, the position is 0, and the stack is empty. begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 0 Position Stack []

2023 at Matsumoto, Japan Example: VM-based Regexp matching (2) ‣ The position is 0, then begin-buf (\A) test is passed. begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 0 Position Stack []

2023 at Matsumoto, Japan Example: VM-based Regexp matching (3) ‣ push @loop-end and push @branch push   pairs of labels and the current position into the stack. begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 0 Position Stack [[@loop-end, 0], [@branch, 0]]

2023 at Matsumoto, Japan Example: VM-based Regexp matching (4) begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 1 Position Stack [[@loop-end, 0], [@branch, 0]] ‣ The position 0 points a character 'a', then exact1 'a' test is passed. ‣ Further, exact1 'a' advances the position. 0

2023 at Matsumoto, Japan Example: VM-based Regexp matching (5) ‣ jump @branch-end and jump @loop-begin update the PC (program counter). ‣ Then, the PC is @loop-begin (push @loop-end). begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 1 Position Stack [[@loop-end, 0], [@branch, 0]]

2023 at Matsumoto, Japan Example: VM-based Regexp matching (6) ‣ push @loop-end and push @branch push   pairs of labels and the current position into the stack. begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 1 Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1], [@branch, 1]]

2023 at Matsumoto, Japan Example: VM-based Regexp matching (7) begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 2 Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1], [@branch, 1]] ‣ The position 1 points a character 'a', then exact1 'a' test is passed. ‣ Further, exact1 'a' advances the position. 1

2023 at Matsumoto, Japan Example: VM-based Regexp matching (8) ‣ jump @branch-end and jump @loop-begin update the PC. ‣ Then, the PC is @loop-begin (push @loop-end). begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1], [@branch, 1]] 2

2023 at Matsumoto, Japan Example: VM-based Regexp matching (9) ‣ push @loop-end and push @branch push   pairs of labels and the current position into the stack. begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1], [@branch, 1], [@loop-end, 2], [@branch, 2]] 2

2023 at Matsumoto, Japan Example: VM-based Regexp matching (10) begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1], [@branch, 1], [@loop-end, 2]] 2 Backtrack [@branch, 2] pop! ‣ The position 2 points a character 'b', then exact1 'a' test is failed. ‣ VM pops the label and the position from the stack   and does backtrack.

2023 at Matsumoto, Japan ‣ The position 2 points a character 'b', then exact1 'a' test is failed. ‣ VM pops the label and the position from the stack   and does backtrack. Example: VM-based Regexp matching (11) begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1], [@branch, 1]] 2 Backtrack [@loop-end, 2] pop!

2023 at Matsumoto, Japan ‣ The position 2 is not the end position, then end-buf (\z) test is failed. ‣ VM pops the label and the position from the stack   and does backtrack. Example: VM-based Regexp matching (12) begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1]] 1 Backtrack [@branch, 1] pop! 2

2023 at Matsumoto, Japan Example: VM-based Regexp matching (13) begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 2 Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1]] ‣ The position 1 points a character 'a', then exact1 'a' test is passed. ‣ Further, exact1 'a' advances the position. 1

2023 at Matsumoto, Japan Example: VM-based Regexp matching (14) ‣ jump @loop-begin updates the PC. ‣ Then, the PC is @loop-begin (push @loop-end). begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 2 Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1]]

2023 at Matsumoto, Japan Example: VM-based Regexp matching (15) ‣ push @loop-end and push @branch push   pairs of labels and the current position into the stack. ‣ But, we have already reached this situation at (9).   → Memoization begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 2 Position Stack [[@loop-end, 0], [@branch, 0], [@loop-end, 1], [@loop-end, 2],   [@branch, 2]] (9)

2023 at Matsumoto, Japan Memoization ‣ In the previous example, (9) and (15) look similar situation except for the stack. - Both PCs are @loop-begin (push @loop-end), and positions are 2. ‣ That is, we already know the matching will be failed from (15). ‣ By recording the PC and the position pairs once reached,   we can reduce unnecessary backtracking. → Let's introduce memoization! (9) (15)

2023 at Matsumoto, Japan ‣ [push @loop-end, 2] is already reached (memoized) at (9). ‣ Thus, immediately VM does backtrack. ‣ After this, many backtracks are omitted   by memoization, and the matching fails. ‣ Details are described in this article (in Japanese).   https://techlife.cookpad.com/entry/2022/12/12/162023 After introducing memoization begin-buf @loop-begin   push @loop-end push @branch exact1 'a' jump @branch-end @branch exact1 'a' @branch-end jump @loop-begin @loop-end end-buf end "aab" Input byte codes 1 Position Stack [[@loop-end, 0], [@branch, 0]] Backtrack [@loop-end, 1] pop! 2

2023 at Matsumoto, Japan Theoretical background of memoization ‣ [David et al, SP '21] studied memoization for Regexp matching. ‣ According to the paper,   memoization makes matchings linear time against input string length . - Matching by backtracking may be exponential or polynomial time. - Thus, linear time is much faster. O(n) n

2023 at Matsumoto, Japan Pros of memoization ‣ Memoization is implemented to the original VM directly. ‣ Therefore, - we can guarantee high-level backward compatibility, and - the optimization can be implemented with fewer modi fi cations. ‣ Actual PR: https://github.com/ruby/ruby/pull/6486 - Diff +744 -12

2023 at Matsumoto, Japan Cons of memoization (1): Memory consumption ‣ Memory consumption is greater for memoization than not using memoization. - A memoization table is a bit-array, and the memory consumption   will increase by (string length x number of branch instructions) / 8. • Typically, the number of branch instructions is about 80 at most. - Allocating memoization table is delayed until the number of backtracks   exceeds a certain number. ‣ Thus, we conclude memory consumption is not a big problem.

2023 at Matsumoto, Japan Cons of memoization (2): Unsupported features ‣ Memoization is not enabled when the following features are used. - look-around operators, atomic groups, conditional branches, absent operator, - subexpression calls, back-references, etc. ‣ There are also strange limitations due to the implementation. - Nested ranged repetitions are not supported. - Nested null loops are not supported too. ‣ Not all Regexps can be optimized by memoization.

2023 at Matsumoto, Japan Regexp.linear_time? ‣ Ruby 3.2 also introduced Regexp.linear_time?. ‣ Regexp.linear_time?(re) checks   whether the given re can be optimized by memoization or not. - e.g. Regexp.linear_time?(/^(a|a)*$/) #=> true   e.g. Regexp.linear_time?(/^(a*)\1*$/) #=> false

2023 at Matsumoto, Japan Future work   Talking about feature vision on Regexp matching

2023 at Matsumoto, Japan Future work ‣ Showing a warning if memoization is not enabled. - Adding a new Rubocop rule. - Adding a new warning to Ruby. ‣ Introducing a new backtrack-less Regexp engine. → DFA-based Regexp engine

2023 at Matsumoto, Japan DFA-based Regexp engine ‣ DFA (Deterministic Finite-state Automaton) ‣ This can be done in linear time for matching. ‣ Go and Rust (modern languages) use such Regexp engines. ‣ Irregular extensions (subexpression calls, back-references) cannot be supported. ‣ We would like to use a DFA-based engine in Ruby if possible.

2023 at Matsumoto, Japan Implementation plan ‣ We would reuse existing components as much as possible for backward compatibility. - Regexp parser, character class inclusion test, ignore-case expansion, etc. (Just a personal vision...) Compiler Matcher Parser ignore-case   expansion char-class   inclusion DFA-based   Matcher

2023 at Matsumoto, Japan Implementation plan ‣ The recent paper [Moseley et al, PLDI '23] uses complex data structures. - It seems hard to implement it in C. ‣ Therefore, we would like to be able to implement a Regexp engine in Ruby. - It is getting ready to write fast code in Ruby (e.g. RJIT). ‣ To-do for implementing a Regexp engine in Ruby - Expose a Regexp internal APIs (parser, character-class, etc.) to Ruby. - Allow a Regexp engine to be replaced from the Ruby side. (Just a personal vision...)

2023 at Matsumoto, Japan TRegex ‣ TRegex is a Regexp matching engine used by Truf fl eRuby. ‣ It is DFA-based and JIT-enabled [Daloze and Haider, RubyKaigi '21]. ‣ However, TRegex is - implemented in Java (on GraalVM), and - not based on a CRuby's Regex engine   (Thus, a backward compatibility issue may exist...?)

2023 at Matsumoto, Japan Summary of today's talk ‣ ReDoS is a vulnerability about Regexp matching. ‣ To prevent ReDoS, the optimization by memoization is   introduced by makenowjust (me). ‣ Memoization speeds up matchings in many cases,   but in some cases memoization is not enabled. - Some extensions (look-around, back-references, etc.) are not supported. - There are some limitations (nested null loops, etc.) due to the implementation. ‣ Next: a new DFA-based Regexp engine written in Ruby...? Thank you!

2023 at Matsumoto, Japan References ‣ Davis, James C., Francisco Servant, and Dongyoon Lee. "Using selective memoization to defeat regular expression denial of service (ReDoS)." 2021 IEEE symposium on security and privacy (SP). IEEE, 2021. ‣ Dan Moseley, Mario Nishio, Jose Perez Rodriguez, Olli Saarikivi, Stephen Toub, Margus Veanes, Tiki Wan, Eric Xu. "Derivative Based Nonbacktracking Real-World Regex Matching with Backtracking Semantics" Proceeding of ACM SIGPLAN 2023 conference on Programming Language Design and Implementation. 2023. ‣ Benoit Daloze, and Josef Haider. "Just-in-Time Compiling Ruby Regexps on Truf fl eRuby" RubyKaigi 2021. 2021.

Make Regexp#match much faster

Make Regexp#match much faster

More Decks by TSUYUSATO Kitsune

Other Decks in Programming

Featured

Transcript