mrzasa
March 13, 2025

Magical Journey through Regex Engine Internals - T3chFest 2025

How does the computer know whether `^\w+\d$` matches "Java8" or "PHP15"? It's magic that allows it to match patterns with strings, right? Yes, and we geeks love learning how such magic works!

It was a highly technical journey through the regex engine: from ancient scripts explaining the theory (automata and regular languages), through thickets of various regex implementations, to the cave of regex performance. We spiced up the adventure by falling into the trap of an exponentially slow regex that can crash a real app. Finally, we analysed the regex traps that Stack Overflow and Cloudflare set for themselves.

It's a geeky talk and you'll learn geeky stuff: how the language of regular expressions is implemented. But it won't be in vain: you'll be able to optimise a regex and prevent your app from being killed by a badly crafted one.
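As a rough illustration of the kind of optimisation meant here, the sketch below uses Python's standard `re` module and the pattern from the "Surprising" slide in the transcript. The stricter rewrite `STRICT` and the helper name `matches_at_start` are assumptions made for this example, not something taken from the talk. The rewrite accepts fewer strings than the original (every `-` must sit between two numbers), and that restriction is what removes the ambiguity behind the heavy backtracking.

```python
import re

# Pattern from the talk: the nested quantifiers let the engine split a run
# of digits between iterations of the group in many different ways.
SLOW = re.compile(r'(\d+-?)+=')

# Hypothetical stricter rewrite: every repetition must start with '-',
# so there is only one way to split the input into numbers.
STRICT = re.compile(r'\d+(?:-\d+)*=')


def matches_at_start(text, pattern=STRICT):
    """Return True if `pattern` matches at the beginning of `text`."""
    return pattern.match(text) is not None


ok = '123456-432156-123456-432156-123456='
bad_long = '123456-432156-123456-432156-123456'   # no trailing '='
bad_short = '123456-432156'                       # short enough for SLOW too

print(matches_at_start(ok), matches_at_start(bad_long))
print(matches_at_start(ok, SLOW), matches_at_start(bad_short, SLOW))
```

The long non-matching input is only tried against `STRICT`; running it against `SLOW` is the multi-second case shown in the transcript.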


Transcript

  1. Magical Journey through RegEx Engine Internals

    Maciek Rząsa @mjrzasa
    Photo by Leo_Visions on Unsplash
  2. Weird

    # Ruby
    pattern = /(?<=Total:)(\d+\.\d+)/
    pattern.match("Part1:10.43, Part2:2 Total:12.43")
    # => #<MatchData "12.43" 1:"12.43">

    # grep
    $ cat test.txt
    Part1:10.43 Part2:2 Total:12.43
    $ grep -e '(?<=Total:)(\d+\.\d+)' test.txt --count
    # => 0
  3. Surprising

    ipython> %time re.match( r'(\d+-?)+=', '123456-432156-123456-432156-123456=' )
    CPU times: user 85 μs, sys: 0 ns, total: 85 μs
    Wall time: 89.2 μs

    ipython> %time re.match( r'(\d+-?)+=', '123456-432156-123456-432156-123456' )
    CPU times: user 6.38 s, sys: 0 ns, total: 6.38 s
    Wall time: 6.37 s
  4. A mix of languages

    // Java
    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("123");
    boolean b = m.matches();

    // JavaScript
    var pattern = /\d+/;
    var results = pattern.exec("123");
  5. Real, separate language

    require 'onigmo'

    Onigmo.parse('-?\d+')
    => list(
         quantifier(lower: 0, upper: nil, greedy: true, string("-")),
         quantifier(lower: 1, upper: nil, greedy: true, cclass(["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]))
       )

    Onigmo.compile('-?\d+')
    => [[:push, 2],
        [:exact1, "-"],
        [:cclass, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]],
        [:push, 38],
        [:cclass, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]],
        [:jump, -43],
        [:end]]
  6. Brackets & Balance

    (((((((a or a)))))))
    ((((((a))))))
    (a+b)-(a+(b+c))

    # Regular
    A -> a
    A -> Ba
    B -> B(
    B -> (

    # Context free
    A -> (A)
    A -> a
  7. Implementations

    Text-directed
    Thompson 1968, 400 LOC C
    Deterministic Finite Automata
    grep, awk, sed, golang

    Pattern-directed
    Larry Wall, 1987
    Nondeterministic Finite Automata
    Perl-Compatible Regular Expressions (Perl, JavaScript, Ruby, .Net, PHP...)
    "WikiWiki" -> /(\w+)\1/

    Figures sources: Russ Cox: Regular Expression Matching Can Be Simple And Fast
  8. Regexing fast and slow

    /{.*}/ -> "text{1+2}text"
    /🢃{.*}/ ""
    /{🢃.*}/ "{"
    /{🢃.*}/ "{1"
    /{🢃.*}/ "{1+"
    /{🢃.*}/ "{1+2"
    /{🢃.*}/ "{1+2}"
    /{🢃.*}/ "{1+2}t"
    /{🢃.*}/ "{1+2}te"
    /{🢃.*}/ "{1+2}tex"
    /{.*🢃}/ "{1+2}text"
    /{.*🢃}/ "{1+2}tex"
    /{.*🢃}/ "{1+2}te"
    /{.*🢃}/ "{1+2}t"
    /{.*🢃}/ "{1+2}"
    /{.*🢃}/ "{1+2"
    /{.*}🢃/ "{1+2}"

    /{[^}]*}/
    /🢃{[^}]*}/ ""
    /{🢃[^}]*}/ "{"
    /{🢃[^}]*}/ "{1"
    /{🢃[^}]*}/ "{1+"
    /{[^}]*🢃}/ "{1+2"
    /{[^}]*}🢃/ "{1+2}"
  9. Regexing fast and slow

    Onigmo.parse("a[bc]")
    => list(string("a"), cclass(["b", "c"]))

    Onigmo.parse("a[b]")
    => list(string("a"), string("b"))

    Onigmo.compile("a[bc]")
    => [ [:exact1, "a"], [:cclass, ["b", "c"]], [:end] ]

    Onigmo.compile("a(?:b|c)")
    => [ [:exact1, "a"], [:push, 7], [:exact1, "b"], [:jump, 2], [:exact1, "c"], [:end] ]
  10. Fast regex? Limit backtracking!

    precise patterns: /<.+>/ => /<[^>]+>/
    possessive quantifiers: /<.+>/ => /<.++>/
    atomic groups: /<.+>/ => /<(?>.+)>/
    short alternatives: (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    character classes: (,|;|\.) => [,.;]
  11. Very bad regex

    /(a+)+b/ "aaaaaaaaaaaaaaaaaaaaa"

    "(aaaaaaaaaaaaaaaaaaaaa)"
    "(aaaaaaaaaaaaaaaaaaaa)(a)"
    "(aaaaaaaaaaaaaaaaaaa)(aa)"
    …
    "(aaaaaaaaaaaaaaaaaaa)(a)(a)"
    "(aaaaaaaaaaaaaaaaaa)(aa)(a)"
    "(aaaaaaaaaaaaaaaaa)(aaa)(a)"
    …
    "(aaaaaaaaaaaaaaaaaa)(a)(a)(a)"
    "(aaaaaaaaaaaaaaaaa)(aa)(a)(a)"

    120 000 steps
  12. How to write a slow regex?

    Exercise 12-32: Please calculate 1+2= or 12+43=
    Then do something harder and calculate 12+21+54= and 21-1+2=

    # v1.0
    /\d+[-+]\d+=/
    # v2.0-rc
    /(\d+[-+])+=/
    # bugfix!
    /(\d+[-+]?)+=/
    # v2.0
    /(\d+|[-+])+=/
  13. Discourse client

    # ::Typography.to_html_french
    text.gsub( /(\s|)+([!?;]+(\s|\z))/, '&thinsp;\2\3' )

    /(\s|)+/
    # <-58 whitespaces ->

    GET /wp-login.php HTTP/1.1 69
    GET /show.aspx HTTP/1.1 15
  14. Timeout

    Regexp.timeout = 1
    /(a+)+\1b+/.match?("aaaaaaaaaaaaaaaaaaaaaaaaaaa")
    # Regexp::TimeoutError:
    # regexp match timeout (Regexp::TimeoutError)
    # from (pry):122:in 'Regexp#match?'
  15. Memoisation

    Regexp.linear_time?(/.*(:?.*=.*)/) # => true
    Regexp.linear_time?(/.*.*=.*/) # => true
    Regexp.linear_time?(/(a+)+/) # => true
    Regexp.linear_time?(/(\s|)+/) # => true
    Regexp.linear_time?(/(a+)+\1/) # => false
  16. Regex engine optimisations

    Bhuiyan, Masudul & Çakar, Berk & Burmane, Ethan & Davis, James & Staicu, Cristian-Alexandru. (2024). SoK: A Literature and Engineering Review of Regular Expression Denial of Service. 10.48550/arXiv.2406.11618
  17. 1. Curiosity

    I went through life like this, discovering next something that had first been discovered in 1889, then something from 1921... And finally I discovered something that had the same date as when I discovered it. (...) You are unlikely to discover something new without a lot of practice on old stuff.
    Richard Feynman, Feynman Lectures on Computation
  18. There is no magic, just software

    Maciek Rząsa @mjrzasa
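
The blow-up from the "Very bad regex" slide and the effect of the memoisation mentioned near the end of the transcript can be sketched with a toy backtracking matcher. This is a minimal model under stated assumptions: `count_steps` is a hypothetical helper written for this example, it hard-codes the single pattern `/(a+)+b/` anchored at the start of the input, and its notion of a "step" is invented here. It does not reproduce how Onigmo or any real engine works or counts, so its numbers will differ from the 120 000 steps quoted on the slide.

```python
def count_steps(text, memoise=False):
    """Toy backtracking matcher for /(a+)+b/ anchored at the start of `text`.

    Returns (matched, steps). A step is one attempt to start a group at a
    position or one attempt to match the final 'b'. Illustration only.
    """
    steps = 0
    failed = set()  # positions from which the rest of the pattern cannot match

    def groups_then_b(pos):
        nonlocal steps
        steps += 1
        if memoise and pos in failed:
            return False
        # Length of the run of 'a' characters starting at pos.
        run = 0
        while pos + run < len(text) and text[pos + run] == 'a':
            run += 1
        # Greedy a+: take the longest run first, then give back one 'a' at a time.
        for taken in range(run, 0, -1):
            end = pos + taken
            if groups_then_b(end):          # try another iteration of (a+)
                return True
            steps += 1
            if text[end:end + 1] == 'b':    # otherwise try to finish with 'b'
                return True
        if memoise:
            failed.add(pos)
        return False

    return groups_then_b(0), steps


for n in (5, 10, 15):
    plain = count_steps('a' * n)[1]
    memo = count_steps('a' * n, memoise=True)[1]
    print(n, plain, memo)
```

In this model a failing input of n letters `a` costs 2^(n+1) - 1 steps without memoisation and 1 + n(n+1) steps with it, because each position is fully explored only once.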