I Can Kill Your Browser With a Simple Regexp. Workshop

mrzasa
March 09, 2020

Transcript

  1. I CAN KILL YOUR BROWSER WITH A SIMPLE REGEX

    MACIEK RZĄSA, TOPTAL, @mjrzasa
  2. I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem.

    Jeff Atwood, 2006
  3. JEFF ATWOOD DOESN'T WORRY ABOUT REGEX PERFORMANCE, WHY SHOULD I?

    <input id="operation" pattern='(\d+[+-]?)+='>
  4. WHAT'S NEXT?

    1. Warm-up: basics of regex performance.
    2. Catastrophic performance.
    3. Real world failures.
  5. DEVELOPER @ TOPTAL

    at work: Ruby & Postgres, migrating to services, Scrum Mastering
    after work: Rzeszów Ruby User Group (rrug.pl), Rzeszów University of Technology
    software that matters, agile text processing, distributed systems
  6. RUBY, JAVA, JAVASCRIPT

    Ruby:
      pattern = /<.*>/
      pattern.match("text <br>")
    Java:
      Pattern p = Pattern.compile("<.*>");
      Matcher m = p.matcher("text <br>");
      boolean b = m.find(); // find() searches inside the string; matches() would require the whole string to match
    JavaScript:
      var pattern = /<.*>/;
      var results = pattern.exec("text <br>");
  7. THEORY...

    regular grammar:
      A -> abB
      B -> bb
      B -> ab
    regular expression: abab|abbb
    finite automaton: [figure: automaton for abab|abbb]
    source: https://swtch.com/~rsc/regexp/regexp1.html
  8. ...MEETS PRACTICE

    formal languages theory: a* a+ a|b a? a(a|b)
    popular programming languages: a* a+ a|b a? a(a|b) a*? \d \W (?<!b)a \1 (?R)...
    /(\w+)\1/ -> papa WikiWiki
    /\(((?R)|\w+)\)/ -> (((12)))
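
The two non-regular features on this slide can be tried directly in Ruby. This is a small sketch rather than code from the deck: Ruby's engine has no `(?R)`, so the recursive pattern is rewritten with a named group and a `\g<...>` subexpression call, and the group name `parens` is made up for the example.

```ruby
# Backreference: \1 must repeat whatever group 1 captured.
DOUBLED = /(\w+)\1/

# Recursion: Ruby has no (?R), so the pattern calls itself through a
# named group. The name "parens" is an assumption made for this sketch.
NESTED = /(?<parens>\((?:\g<parens>|\w+)\))/

doubled_papa = DOUBLED.match("papa")
doubled_wiki = DOUBLED.match("WikiWiki")
nested = NESTED.match("(((12)))")

puts doubled_papa[1] # the part of "papa" that is repeated
puts doubled_wiki[1] # the part of "WikiWiki" that is repeated
puts nested[0]       # the balanced parenthesised text
```

Neither pattern describes a regular language, which is why engines that support them cannot rely on a plain finite automaton.
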
  9. TWO TYPES OF REGEX ENGINES

    1. Text-directed
       Thompson 1968, 400 LOC in C
       grep, awk, sed, Go
       based on DFA (Deterministic Finite Automata)
       simpler implementation
    2. Regex-directed
       Larry Wall, Perl, 1987
       Perl-Compatible Regular Expressions (JS, Ruby, .NET, ...)
       based on NFA (Nondeterministic Finite Automata)
       broader feature set
  10. EXAMPLE

    /abab|abbb/ =~ 'abbb'
    [figure: automaton for abab|abbb]
    source of figures on this and the next few slides: https://swtch.com/~rsc/regexp/regexp1.html
  11. Text-directed: abab|abbb

    [figures: automaton state after each step]
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
  12. Regex-directed: abab|abbb

    [figures: automaton state after each step]
    •abbb
    •abbb
    a•bbb
    ab•bb
    •abbb   (failure, backtracking)
    a•bbb
    ab•bb
    abb•b
    abbb•
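
The two traces can be approximated with a toy model. The sketch below is an illustration only, not how either kind of engine is implemented: it handles nothing but a list of literal alternatives, matches only at the start of the input, and the method names are invented for this example.

```ruby
ALTERNATIVES = %w[abab abbb].freeze

# Regex-directed: walk the pattern one alternative at a time and go back
# to the start of the input whenever an alternative fails.
def regex_directed(input, alternatives = ALTERNATIVES)
  comparisons = 0
  alternatives.each do |alternative|
    pos = 0
    while pos < alternative.length
      comparisons += 1
      break unless input[pos] == alternative[pos]
      pos += 1
    end
    return { matched: true, comparisons: comparisons } if pos == alternative.length
  end
  { matched: false, comparisons: comparisons }
end

# Text-directed: walk the input once, keeping every alternative that is
# still consistent with the characters read so far.
def text_directed(input, alternatives = ALTERNATIVES)
  alive = alternatives.dup
  input.each_char.with_index do |char, index|
    alive = alive.select { |alternative| alternative[index] == char }
    return { matched: false, chars_read: index + 1 } if alive.empty?
    if alive.any? { |alternative| alternative.length == index + 1 }
      return { matched: true, chars_read: index + 1 }
    end
  end
  { matched: false, chars_read: input.length }
end

p regex_directed("abbb") # 3 comparisons against abab, then 4 against abbb
p text_directed("abbb")  # each input character is read once
```

In this model the regex-directed walk re-reads the first characters after the failed `abab` attempt, while the text-directed walk never moves backwards in the input.
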
  13. WARM-UP SUMMARY

    Don't be lazy
      .* => [^X]*
      .*? => [^X]*
    Avoid ambiguous alternatives
      (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    Leverage character classes
      (,|;|\.) => [,.;]
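
A short Ruby sketch of the three rewrites from the summary. The sample strings are invented for the example, `X` is taken to be the closing delimiter `>`, and the `\A`/`\z` anchors are an addition.

```ruby
html = "text <br> and <b>bold</b>"

greedy  = html[/<.*>/]    # .* runs to the end, then backs up to the last ">"
lazy    = html[/<.*?>/]   # .*? grows one character at a time until ">"
negated = html[/<[^>]*>/] # [^>]* cannot run past the first ">"

# Factoring out the shared prefix leaves one way to match "pre-".
ambiguous = /\A(pre-d1|pre-e2)\z/
factored  = /\Apre-(d1|e2)\z/

# A character class replaces single-character alternatives.
by_alternation = "a,b;c.d".split(/(?:,|;|\.)/)
by_class       = "a,b;c.d".split(/[,.;]/)

puts greedy, lazy, negated
```

The greedy pattern returns a different (longer) match here, while the lazy and negated-class patterns agree; the rewritten alternation and character class accept exactly the same inputs as the originals.
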
  14. REPETITION INSIDE REPETITION / OVERLAPPING REPETITIONS

    repetition inside repetition: /(a+)*b/
    overlapping repetitions: /a*c?a*b/
    inputs: aaaaaaaaaab, aaaaaaaaaa
    ALL OK?
    CATASTROPHIC BACKTRACKING
    see Exercise 2.1 & 2.2
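
To see why `/(a+)*b/` blows up on input without a `b`, the sketch below counts the positions a naive backtracking search would visit. It is a simplified model written for this note, not a measurement of Ruby's engine (real engines count steps differently and may apply optimizations), and `explore` and `count_calls` are hypothetical names.

```ruby
# Simplified model of a backtracking search for /(a+)*b/ starting at `pos`.
def explore(input, pos, counter)
  counter[:calls] += 1
  run = 0
  run += 1 while input[pos + run] == "a"
  # (a+)*: try the longest run of "a" first, then every shorter one.
  run.downto(1) do |length|
    return true if explore(input, pos + length, counter)
  end
  # No further (a+) iteration worked: the pattern now needs "b".
  input[pos] == "b"
end

def count_calls(input)
  counter = { calls: 0 }
  matched = explore(input, 0, counter)
  [matched, counter[:calls]]
end

p count_calls("a" * 10 + "b") # matching input: the first attempt succeeds
p count_calls("a" * 10)       # no "b": every way of splitting the run is tried
p count_calls("a" * 11)       # one more "a" doubles the work
```

In this model the number of calls for a run of n letters `a` with no `b` is 2^n, which is the exponential growth behind the term "catastrophic backtracking".
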
  15. ARITHMETIC OPERATIONS

    320-12=
    430-32+1=

    pattern = /\d+[-+]\d+=/ # v1.0
    pattern = /(\d+[-+])+=/ # v2.0-rc
    pattern = /(\d+[-+]?)+=/ # v2.0
    pattern = /(\d+|[-+])+=/

    "320-12=" # 12 steps
    "32-12+230=" # 18 steps
    "32+12-320-2132+32123=" # 28 steps
    "32+12-320-2132+32123" # 95 854 steps
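
The sketch below checks which of the two sample operations each labelled version accepts. The patterns are the ones from the slide; the `\A` and `\z` anchors are an addition so that each pattern has to account for the whole input, and `accepted_by` is a helper invented for this example.

```ruby
# Patterns from the slide, anchored with \A and \z for this sketch.
VERSIONS = {
  "v1.0"    => /\A\d+[-+]\d+=\z/,
  "v2.0-rc" => /\A(\d+[-+])+=\z/,
  "v2.0"    => /\A(\d+[-+]?)+=\z/
}.freeze

INPUTS = ["320-12=", "430-32+1="].freeze

# Returns the sample inputs that the given version accepts.
def accepted_by(version)
  INPUTS.select { |input| VERSIONS.fetch(version).match?(input) }
end

VERSIONS.each_key do |version|
  puts "#{version}: #{accepted_by(version).inspect}"
end
```

Making the operator optional is what lets v2.0 accept both inputs, but it also means a run of digits can be split between iterations in many ways, which is where the step count for the input without `=` comes from.
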
  16. SOLUTIONS

    unrolling the loop
      (A+|B?)+ => A+(BA+)*
      (A+|B?)+ => (A+B)*A+
    possessive quantifier (repetition that never backtracks)
      (A+|B?)+ => (A++|B?)+ (*)
    atomic groups
      (A+|B?)+ => (?>A+|B?)+
    Optimize pattern from 2.3, then do 2.4
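
A sketch of the three techniques applied to the arithmetic example, with A = `\d` and B = `[-+]`. The possessive and atomic patterns are adapted from the `(\d+[-+]?)+=` form rather than copied from the slide, and the anchors are an addition.

```ruby
UNROLLED   = /\A\d+(?:[-+]\d+)*=\z/ # A+(BA+)*
POSSESSIVE = /\A(?:\d++[-+]?)+=\z/  # \d++ never gives digits back
ATOMIC     = /\A(?>\d+[-+]?)+=\z/   # each iteration is locked once matched

VALID        = "32+12-320-2132+32123="
ALMOST_VALID = "32+12-320-2132+32123" # only "=" is missing

[UNROLLED, POSSESSIVE, ATOMIC].each do |pattern|
  puts "#{pattern.source}: #{pattern.match?(VALID)} / #{pattern.match?(ALMOST_VALID)}"
end
```

All three reject the almost-valid input without re-splitting the digit runs. Note that the unrolled form is also stricter: it rejects `32+=`, which the possessive and atomic forms still accept.
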
  17. PATTERNS TO AVOID

    overlapping scopes: /\d+\w*/
    overlapping alternatives: /(\d+|\w+)/
    remote overlapping quantifiers: /.*-some text-.*;/
    nested quantifiers: /(\d+)*\w/
  18. COMPUTERS ARE NOW SO FAST THAT EVEN HAVING 100 000 STEPS WON'T MATTER
  19. BENCHMARK: NUMBERS

                greedy     lazy       unrolled      possessive
    Ruby        70.4 i/s   65.9 i/s   1,296.1 i/s   1,187.0 i/s
    JavaScript  191 i/s    220 i/s    6,689 i/s     -

    greedy = /(-?(\d+[,.]?)+)/
    lazy = /(-?(\d+?[,.]?)+?)/
    unrolled = /(-?\d+([,.]\d+)*)/
    possessive = /(-?(\d++[,.]?)++)/
    string = "..." # ~11000 chars

    def count(string, regex)
      # count how many times regex is matched on a string
      string.scan(regex).size # one possible implementation, not shown on the slide
    end
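
A rough way to repeat this kind of comparison with Ruby's standard `Benchmark` module. The sample text is hypothetical (the ~11000-character string is not part of the transcript), `count` is one possible implementation of the helper, and the timings it prints will not reproduce the numbers above. The lazy pattern is left out because it splits numbers into shorter matches on this sample and so counts differently.

```ruby
require "benchmark"

PATTERNS = {
  greedy:     /(-?(\d+[,.]?)+)/,
  unrolled:   /(-?\d+([,.]\d+)*)/,
  possessive: /(-?(\d++[,.]?)++)/
}.freeze

# Hypothetical sample text: four numbers per line, repeated 50 times.
SAMPLE = "1,234.50 and -42 then 3.14, 7\n" * 50

# One possible implementation of the count helper.
def count(string, regex)
  string.scan(regex).size
end

PATTERNS.each do |name, regex|
  matches = nil
  seconds = Benchmark.realtime { matches = count(SAMPLE, regex) }
  puts format("%-10s %d matches in %.6f s", name, matches, seconds)
end
```

Checking that every variant finds the same number of matches before comparing speed guards against benchmarking patterns that are not actually equivalent on the data.
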
  20. BENCHMARK: LOG SEARCH

    NASA webserver, 196 MB
    199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

    /.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/ # 5m18,298s
    /.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/ # 0m10,013s
    /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/ # 0m5,897s
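
As an illustration of what the fastest pattern extracts, the sketch below applies it, unchanged, to the sample log line. The field names and the `parse` helper are chosen for this example and do not come from the deck.

```ruby
LINE = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] ' \
       '"GET /history/apollo/ HTTP/1.0" 200 6245'

# Third pattern from the slide, unchanged.
LOG_PATTERN = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

# Field names are assumptions made for this sketch.
FIELDS = %i[timestamp verb path protocol status bytes].freeze

# Returns a hash of named fields, or nil when the line does not match.
def parse(line)
  match = LOG_PATTERN.match(line)
  return nil unless match

  FIELDS.zip(match.captures).to_h
end

p parse(LINE)
```

Each capture uses a class that cannot cross its own delimiter (`[^\]]`, `\S`, `\d`), so the engine has little to reconsider when a later part of the line fails to match.
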
  21. WHO MADE SUCH A MISTAKE?

    Virginia Tech study, 2018
      PyPI & npm modules
      1-10% superlinear regexes
      Django, MongoDB, Python core
    Also
      StackOverflow
      Cloudflare
      Rack
  22. COPY-PASTING

    RegExLib, id=1757 (email validation)
      ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
    OWASP Validation Regex Repository, Java Classname
      ^(([a-z])+.)+[A-Z]([a-z])+$
      'aaaaaaaaaaaaaaa'
  23. GOOD PRACTICES

    .* => [^X]*
    .*? => [^X]*
    (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    (,|;|\.) => [,.;]
    (A+|B?)+ => A+(BA+)*
    (A+|B?)+ => (A+B)*A+
    \w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
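
The last line lists three ways of stopping `\d+` from giving digits back. The sketch below builds each variant and then appends an extra `\d` that can only match if the digit run backtracks. The `FORMS` table and `pattern` helper are hypothetical names, and the anchors are an addition.

```ruby
FORMS = {
  plain:      '\d+',
  possessive: '\d++',
  atomic:     '(?>\d+)',
  lookahead:  '(?=(\d+))\1' # lookahead plus backreference
}.freeze

# Builds \A\w+-<form><suffix>\z for the chosen form.
def pattern(form, suffix = "")
  Regexp.new('\A\w+-' + FORMS.fetch(form) + suffix + '\z')
end

FORMS.each_key do |form|
  accepts   = pattern(form).match?("abc-123")
  gives_back = pattern(form, '\d').match?("abc-123")
  puts "#{form}: matches=#{accepts} gives_digits_back=#{gives_back}"
end
```

All four forms accept `abc-123`, but only the plain `\d+` can release a digit for the trailing `\d`. The lookahead form is useful in engines that lack possessive quantifiers and atomic groups.
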
  24. FINAL ADVICE

    TAFT: valid, invalid, almost valid input
    disambiguate, don't be lazy
    it's a sharp tool, use with care
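
One way to act on the first piece of advice is to keep a valid, an invalid and an almost valid input next to every pattern and run them under a time limit. This is a minimal sketch: the pattern is the unrolled arithmetic form with anchors added, `match_within?` is a hypothetical helper, and `Timeout.timeout` is used as a simple guard whose ability to interrupt a long match may depend on the Ruby implementation.

```ruby
require "timeout"

# Unrolled arithmetic pattern, anchored for this sketch.
OPERATION = /\A\d+(?:[-+]\d+)*=\z/

CASES = {
  "32+12-320="           => true,  # valid
  "thirty-two"           => false, # invalid
  "32+12-320-2132+32123" => false  # almost valid: only "=" is missing
}.freeze

# Hypothetical helper: raises Timeout::Error if matching exceeds the limit.
def match_within?(pattern, input, seconds: 1)
  Timeout.timeout(seconds) { pattern.match?(input) }
end

CASES.each do |input, expected|
  actual = match_within?(OPERATION, input)
  puts "#{input.inspect}: expected #{expected}, got #{actual}"
end
```

The almost valid case is the important one: it is the input that forces a backtracking engine to exhaust its alternatives before giving up.
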