Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Make Regexp#match much faster

Make Regexp#match much faster

TSUYUSATO Kitsune

May 11, 2023
Tweet

More Decks by TSUYUSATO Kitsune

Other Decks in Programming

Transcript

  1. 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Make Regexp#match


    Much Faster
    Hiroya Fujinami (makenowjust)

    View Slide

  2. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Hiroya Fujinami
    ‣ Ph.D. student at SOKENDAI (NII)



    t
    @make_now_just
    g
    @makenowjust


    ‣ Ruby committer since 2022/12


    - I contributed for speed up Regexp#match in CRuby

    at Cookpad Inc. internship

    View Slide

  3. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Agenda
    ‣ Regexp is powerful

    Powerful features and applications in Ruby's Regexp


    ‣ ReDoS — a vulnerability about Regexp

    What a vulnerability is ReDoS (Regular Expression Denial of Service)?


    ‣ How to speed up Regexp matching

    The optimization by memoization makes Regexp matching faster.


    ‣ Future work

    Talking about feature vision on Regexp matching

    View Slide

  4. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Regexp is powerful

    Powerful features and applications in Ruby's Regexp

    View Slide

  5. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Ruby's Regexp is powerful!
    ‣ Ruby's Regexp has some rich features:


    - multiple character encoding support,


    - look-around operators, atomic groups, conditional branches, absent operator,


    - subexpression calls, and back-references (irregular extensions).


    → These features help our developments and hobbies.

    (?=foo)
    (?!foo)
    (?<=foo)
    (?(?>foo)
    (?(x)foo|bar)
    look-around operators
    atomic group
    conditional branches
    (?foo){0}\g
    subexpression call
    (?foo*)\k
    back-reference
    (?~foo)
    absent operator

    View Slide

  6. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    How powerful is it?
    ‣ The following Regexp computes (veri
    fi
    es) a sum equation of two binary digits.


    - e.g. "01+01=10" and "110+001=111" "01+01=00"

    /(?[01](?=(?:(?(?<=0))|(?(?<=1)))[01]*\+(?(?:\k|(?!\k))
    [01])(?:(?(?<=0))|(?(?<=1)))[01]*=(?(?:\k|(?!\k))[01])(?:(?
    (?<=0))|(?(?<=1)))[01]*\z(?:(?\k\k\k|
    \k\k\k|\k\k\k|\k\k\k)|
    \k\k\k|\k\k\k|\k\k\k|
    \k\k\k)(?:\k(?:\k(?:\k(?!\k)|\k(?!
    \k))|\k(?:\k\k|\k(?!\k)))|\k(?:\k(?:
    \k\k|\k(?!\k))|\k(?:\k\k|
    \k\k))))\g|(?!\k)\+\k=\k){0}\A\g\z/
    Theoretically, this language (a set of strings) is neither regular nor context-free.

    View Slide

  7. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    How powerful is it?
    ‣ A brainf*ck implementation in Ruby's Regexp exists.


    - https://github.com/shinh/hack/blob/master/bf_rb_reg/bf.rb


    - BF_REG =~ bf + BF_SUFFIX executes a bf program on Regexp matching

    (loops may stop after 256 iterations.)


    ‣ Conclusion → ultimately powerful!

    View Slide

  8. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    With great power

    comes great responsibility.

    View Slide

  9. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    ReDoS — a vulnerability about Regexp

    What a vulnerability is ReDoS (Regular Expression Denial of Service)?

    View Slide

  10. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    ReDoS (Regular Expression Denial of Service)
    ‣ A vulnerability about Regexp matching.


    ‣ The time taken for Regexp matching can be explosive.


    - time ruby -e '/\A(a|a)*\z/ =~ "a" * 30 + "b"'

    30.99s user 0.09s system 99% cpu 31.188 total


    - The Regexp /\A(a|a)*\z/ takes an exponential matching time

    against such an input string "a" * n + "b".

    View Slide

  11. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    What a problem is Regexp matching explosion?
    One scenario:


    ‣ A web app uses /\A[^@][email protected][^@]+([.][^@]+)+\z/

    for validating E-mail address.


    ‣ Then, one user sends "[email protected]" + "." * 50 + "@" to this web app.


    ‣ For processing this request, web app takes some minutes

    because the Regexp matching time is explosive.


    ‣ Other users see a loading window or 500 Internal Server Error ¯\_(π)_/¯.


    → ReDoS gives a bad experience to users and hurts a business opportunity.

    View Slide

  12. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    ReDoS is a threat in the real world.
    ‣ For example, Cloud
    fl
    are was down for 27 minutes due to ReDoS.

    https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/


    ‣ Recently, many ReDoS vulnerabilities in Ruby gems are reported.


    - CVE-2022-24836 (Nokogiri), CVE-2022-30122 (Rack), etc.

    (See details in Japanese, https://zenn.dev/ooooooo_q/articles/ruby_3_2_redos)


    ‣ ReDoS vulnerabilities in Ruby's core libraries are also reported.


    → We need to prevent ReDoS in Ruby itself.

    View Slide

  13. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    How to prevent ReDoS
    ‣ Some regexes cause ReDoS and some do not.


    ‣ ReDoS can be prevented if we are careful...?


    ‣ One way: using a ReDoS detection tool.


    - https://github.com/makenowjust-labs/recheck


    ‣ Another way: improving Regexp matching implementation (today's talk).


    - Ruby 3.2+ has the improved Regexp implementation that prevents ReDoS.

    View Slide

  14. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Improvement result (1) polynomial case

    https://www.ruby-lang.org/ja/news/2022/12/25/ruby-3-2-0-released/

    View Slide

  15. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Improvement result (2) exponential case

    https://www.ruby-lang.org/ja/news/2022/12/25/ruby-3-2-0-released/

    View Slide

  16. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    How to speed up Regexp matching

    The optimization by memoization makes Regexp matching faster.

    View Slide

  17. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    VM-based Regexp engine
    ‣ CRuby's Regexp engine (Onigmo) is VM-based.


    ‣ VM-based means


    - Regexp is compiled into byte codes, and


    - byte code execution (matching) uses backtracking.


    ‣ push @label pushes a @label and the current position

    into the stack for backtracking.

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    /\A(?:a|a)*\z/
    Regexp
    byte codes
    compile

    View Slide

  18. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Memoization
    ‣ Memoization is a technique that records states and their results,

    and omits computations for the same state.


    ‣ On the VM-based matching, a pair of the current PC (program counter)

    and the position on the input string is recorded as a state.


    - If the same state is reached again, it is not necessary to record the result,

    since the matching has failed once from that state.


    ‣ Starting on the next slide, I will explain how VM-based matching works.

    View Slide

  19. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (1)
    ‣ Matching against the string "aab" is started.


    ‣ Initially, the position is 0, and the stack is empty.

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    0
    Position
    Stack
    []

    View Slide

  20. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (2)
    ‣ The position is 0, then begin-buf (\A) test is passed.

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    0
    Position
    Stack
    []

    View Slide

  21. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (3)
    ‣ push @loop-end and push @branch push

    pairs of labels and the current position into the stack.

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    0
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0]]

    View Slide

  22. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (4)

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    1
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0]]
    ‣ The position 0 points a character 'a', then exact1 'a' test is passed.


    ‣ Further, exact1 'a' advances the position.
    0

    View Slide

  23. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (5)
    ‣ jump @branch-end and jump @loop-begin update the PC (program counter).


    ‣ Then, the PC is @loop-begin (push @loop-end).

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    1
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0]]

    View Slide

  24. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (6)
    ‣ push @loop-end and push @branch push

    pairs of labels and the current position into the stack.

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    1
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1],


    [@branch, 1]]

    View Slide

  25. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (7)

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    2
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1],


    [@branch, 1]]
    ‣ The position 1 points a character 'a', then exact1 'a' test is passed.


    ‣ Further, exact1 'a' advances the position.
    1

    View Slide

  26. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (8)
    ‣ jump @branch-end and jump @loop-begin update the PC.


    ‣ Then, the PC is @loop-begin (push @loop-end).

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1],


    [@branch, 1]]
    2

    View Slide

  27. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (9)
    ‣ push @loop-end and push @branch push

    pairs of labels and the current position into the stack.

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1],


    [@branch, 1],


    [@loop-end, 2],


    [@branch, 2]]
    2

    View Slide

  28. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (10)

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1],


    [@branch, 1],


    [@loop-end, 2]]
    2
    Backtrack
    [@branch, 2]
    pop!
    ‣ The position 2 points a character 'b', then exact1 'a' test is failed.


    ‣ VM pops the label and the position from the stack

    and does backtrack.

    View Slide

  29. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    ‣ The position 2 points a character 'b', then exact1 'a' test is failed.


    ‣ VM pops the label and the position from the stack

    and does backtrack.
    Example: VM-based Regexp matching (11)

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1],


    [@branch, 1]]
    2
    Backtrack
    [@loop-end, 2]
    pop!

    View Slide

  30. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    ‣ The position 2 is not the end position, then end-buf (\z) test is failed.


    ‣ VM pops the label and the position from the stack

    and does backtrack.
    Example: VM-based Regexp matching (12)

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1]]
    1
    Backtrack
    [@branch, 1]
    pop!
    2

    View Slide

  31. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (13)

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    2
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1]]
    ‣ The position 1 points a character 'a', then exact1 'a' test is passed.


    ‣ Further, exact1 'a' advances the position.
    1

    View Slide

  32. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (14)
    ‣ jump @loop-begin updates the PC.


    ‣ Then, the PC is @loop-begin (push @loop-end).

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    2
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1]]

    View Slide

  33. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Example: VM-based Regexp matching (15)
    ‣ push @loop-end and push @branch push

    pairs of labels and the current position into the stack.


    ‣ But, we have already reached this situation at (9).

    → Memoization

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    2
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0],


    [@loop-end, 1],


    [@loop-end, 2],

    [@branch, 2]]
    (9)

    View Slide

  34. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Memoization
    ‣ In the previous example, (9) and (15) look similar situation except for the stack.


    - Both PCs are @loop-begin (push @loop-end), and positions are 2.


    ‣ That is, we already know the matching will be failed from (15).


    ‣ By recording the PC and the position pairs once reached,

    we can reduce unnecessary backtracking.


    → Let's introduce memoization!

    (9) (15)

    View Slide

  35. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    ‣ [push @loop-end, 2] is already reached (memoized) at (9).


    ‣ Thus, immediately VM does backtrack.


    ‣ After this, many backtracks are omitted

    by memoization, and the matching fails.


    ‣ Details are described in this article (in Japanese).

    https://techlife.cookpad.com/entry/2022/12/12/162023
    After introducing memoization

    begin-buf


    @loop-begin

    push @loop-end


    push @branch


    exact1 'a'


    jump @branch-end


    @branch


    exact1 'a'


    @branch-end


    jump @loop-begin


    @loop-end


    end-buf


    end
    "aab"
    Input byte codes
    1
    Position
    Stack
    [[@loop-end, 0],


    [@branch, 0]]
    Backtrack
    [@loop-end, 1]
    pop!
    2

    View Slide

  36. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Theoretical background of memoization
    ‣ [David et al, SP '21] studied memoization for Regexp matching.


    ‣ According to the paper,

    memoization makes matchings linear time against input string length .


    - Matching by backtracking may be exponential or polynomial time.


    - Thus, linear time is much faster.
    O(n) n

    View Slide

  37. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Pros of memoization
    ‣ Memoization is implemented to the original VM directly.


    ‣ Therefore,


    - we can guarantee high-level backward compatibility, and


    - the optimization can be implemented with fewer modi
    fi
    cations.


    ‣ Actual PR: https://github.com/ruby/ruby/pull/6486


    - Diff +744 -12

    View Slide

  38. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Cons of memoization (1): Memory consumption
    ‣ Memory consumption is greater for memoization than not using memoization.


    - A memoization table is a bit-array, and the memory consumption

    will increase by (string length x number of branch instructions) / 8.


    • Typically, the number of branch instructions is about 80 at most.


    - Allocating memoization table is delayed until the number of backtracks

    exceeds a certain number.


    ‣ Thus, we conclude memory consumption is not a big problem.

    View Slide

  39. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Cons of memoization (2): Unsupported features
    ‣ Memoization is not enabled when the following features are used.


    - look-around operators, atomic groups, conditional branches, absent operator,


    - subexpression calls, back-references, etc.


    ‣ There are also strange limitations due to the implementation.


    - Nested ranged repetitions are not supported.


    - Nested null loops are not supported too.


    ‣ Not all Regexps can be optimized by memoization.

    View Slide

  40. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Regexp.linear_time?
    ‣ Ruby 3.2 also introduced Regexp.linear_time?.


    ‣ Regexp.linear_time?(re) checks

    whether the given re can be optimized by memoization or not.


    - e.g. Regexp.linear_time?(/^(a|a)*$/) #=> true

    e.g. Regexp.linear_time?(/^(a*)\1*$/) #=> false

    View Slide

  41. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Future work

    Talking about feature vision on Regexp matching

    View Slide

  42. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Future work
    ‣ Showing a warning if memoization is not enabled.


    - Adding a new Rubocop rule.


    - Adding a new warning to Ruby.


    ‣ Introducing a new backtrack-less Regexp engine.


    → DFA-based Regexp engine

    View Slide

  43. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    DFA-based Regexp engine
    ‣ DFA (Deterministic Finite-state Automaton)


    ‣ This can be done in linear time for matching.


    ‣ Go and Rust (modern languages) use such Regexp engines.


    ‣ Irregular extensions (subexpression calls, back-references) cannot be supported.


    ‣ We would like to use a DFA-based engine in Ruby if possible.

    View Slide

  44. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Implementation plan
    ‣ We would reuse existing components as much as possible for backward compatibility.


    - Regexp parser, character class inclusion test, ignore-case expansion, etc.

    (Just a personal vision...)
    Compiler Matcher
    Parser
    ignore-case

    expansion
    char-class

    inclusion
    DFA-based

    Matcher

    View Slide

  45. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Implementation plan
    ‣ The recent paper [Moseley et al, PLDI '23] uses complex data structures.


    - It seems hard to implement it in C.


    ‣ Therefore, we would like to be able to implement a Regexp engine in Ruby.


    - It is getting ready to write fast code in Ruby (e.g. RJIT).


    ‣ To-do for implementing a Regexp engine in Ruby


    - Expose a Regexp internal APIs (parser, character-class, etc.) to Ruby.


    - Allow a Regexp engine to be replaced from the Ruby side.

    (Just a personal vision...)

    View Slide

  46. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    TRegex
    ‣ TRegex is a Regexp matching engine used by Truf
    fl
    eRuby.


    ‣ It is DFA-based and JIT-enabled [Daloze and Haider, RubyKaigi '21].


    ‣ However, TRegex is


    - implemented in Java (on GraalVM), and


    - not based on a CRuby's Regex engine

    (Thus, a backward compatibility issue may exist...?)

    View Slide

  47. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    Summary of today's talk
    ‣ ReDoS is a vulnerability about Regexp matching.


    ‣ To prevent ReDoS, the optimization by memoization is

    introduced by makenowjust (me).


    ‣ Memoization speeds up matchings in many cases,

    but in some cases memoization is not enabled.


    - Some extensions (look-around, back-references, etc.) are not supported.


    - There are some limitations (nested null loops, etc.) due to the implementation.


    ‣ Next: a new DFA-based Regexp engine written in Ruby...?

    Thank you!

    View Slide

  48. Make Regexp#match much faster - Hiroya Fujinami (@makenowjust) 2023/5/11 RubyKaigi 2023 at Matsumoto, Japan
    References
    ‣ Davis, James C., Francisco Servant, and Dongyoon Lee. "Using selective
    memoization to defeat regular expression denial of service (ReDoS)." 2021 IEEE
    symposium on security and privacy (SP). IEEE, 2021.


    ‣ Dan Moseley, Mario Nishio, Jose Perez Rodriguez, Olli Saarikivi, Stephen
    Toub, Margus Veanes, Tiki Wan, Eric Xu. "Derivative Based Nonbacktracking
    Real-World Regex Matching with Backtracking Semantics" Proceeding of ACM
    SIGPLAN 2023 conference on Programming Language Design and
    Implementation. 2023.


    ‣ Benoit Daloze, and Josef Haider. "Just-in-Time Compiling Ruby Regexps on
    Truf
    fl
    eRuby" RubyKaigi 2021. 2021.

    View Slide