I Can Kill Your Browser With a Simple Regexp. Workshop

mrzasa
March 09, 2020

Transcript

  1. I CAN KILL YOUR BROWSER WITH A SIMPLE REGEX

    MACIEK RZĄSA, TOPTAL, @mjrzasa
  2. None
  3. None
  4. None
  5. "I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem."

    Jeff Atwood, 2006
  6. JEFF ATWOOD DOESN'T WORRY ABOUT REGEX PERFORMANCE, WHY SHOULD I?

    <input id="operation" pattern='(\d+[+-]?)+='>
  7. WHAT'S NEXT?

    1. Warm-up: basics of regex performance.
    2. Catastrophic performance.
    3. Real world failures.
  8. DEVELOPER @ TOPTAL

    at work: Ruby & Postgres, migrating to services, Scrum Mastering
    after work: Rzeszów Ruby User Group (rrug.pl), Rzeszów University of Technology
    software that matters, agile
    text processing, distributed systems
  9. HOW DO REGEXPS REALLY WORK?

  10. RUBY, JAVA, JAVASCRIPT

    Ruby:
    pattern = /<.*>/
    pattern.match("text <br>")

    Java:
    Pattern p = Pattern.compile("<.*>");
    Matcher m = p.matcher("text <br>");
    boolean b = m.find(); // find() searches for a substring; matches() would require the whole input to match

    JavaScript:
    var pattern = /<.*>/;
    var results = pattern.exec("text <br>");
  11. THEORY...

    regular grammar: A -> abB, B -> bb, B -> ab
    regular expression: abab|abbb
    finite automaton: [figure: automaton for abab|abbb]
    source: https://swtch.com/~rsc/regexp/regexp1.html
  12. ...MEETS PRACTICE

    formal languages theory: a* a+ a|b a? a(a|b)
    popular programming languages: a* a+ a|b a? a(a|b) a*? \d \W (?<!b)a \1 (?R)...
    /(\w+)\1/ -> papa, WikiWiki
    /\(((?R)|\w+)\)/ -> (((12)))
  13. TWO TYPES OF REGEX ENGINES

    1. Text-directed
       Thompson 1968, 400 LOC in C lang
       grep, awk, sed, go
       based on DFA (Deterministic Finite Automata)
       simpler implementation
    2. Regex-directed
       Larry Wall, perl, 1987
       Perl-Compatible Regular Expressions (JS, Ruby, .Net,...)
       based on NFA (Nondeterministic Finite Automata)
       broader feature set
  14. EXAMPLE

    /abab|abbb/ =~ 'abbb'
    [figure: automaton for abab|abbb]
    source of figures on this and few next slides: https://swtch.com/~rsc/regexp/regexp1.html
  15. Text-directed abab|abbb

    [figures: automaton state after each input character]
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
  16. Regex-directed abab|abbb

    [figures: automaton state after each step]
    •abbb
    •abbb
    a•bbb
    ab•bb (failure, backtracking)
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
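The regex-directed trace above can be reproduced with a toy matcher. This is a minimal sketch, not how any real engine is implemented: it assumes the pattern is a plain alternation of literals, only checks whether the input starts with one of them, and the name `match_alternation` is made up for illustration.

```ruby
# Toy regex-directed matcher for a pattern that is a plain alternation of
# literals, e.g. "abab|abbb". Alternatives are tried left to right; when one
# fails, matching restarts from the beginning of the input (backtracking).
def match_alternation(pattern, input)
  trace = []
  pattern.split("|").each do |alternative|
    pos = 0
    trace << "try #{alternative}"
    alternative.each_char do |expected|
      break unless input[pos] == expected
      pos += 1
      trace << "#{input[0...pos]}•#{input[pos..-1]}"
    end
    return [true, trace] if pos == alternative.length
    trace << "failure, backtracking"
  end
  [false, trace]
end

matched, trace = match_alternation("abab|abbb", "abbb")
puts trace
puts "matched: #{matched}"
```

The first alternative consumes `ab`, fails on the third character, and the matcher starts over with the second alternative. A text-directed engine would instead track both alternatives at once and read each input character only once.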
  17. None
  18. WARM-UP EXERCISE (1)

  19. WARM-UP SUMMARY

    Don't be lazy
      .* => [^X]*
      .*? => [^X]*
    Avoid ambiguous alternatives
      (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    Leverage character classes
      (,|;|\.) => [,.;]
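A small Ruby sketch of the three rewrites, assuming `>` plays the role of `X` and using made-up sample strings. It only checks that each rewritten pattern returns the same result as the original on these inputs; it does not measure speed. The factored alternation uses a non-capturing group so that both variants return the whole match.

```ruby
# 1. Don't be lazy: a negated character class instead of .*?
tags_text = "text <br> more <hr>"
lazy_tags    = tags_text.scan(/<.*?>/)
negated_tags = tags_text.scan(/<[^>]*>/)

# 2. Avoid ambiguous alternatives: factor out the common prefix
ambiguous = "code pre-e2 here"[/pre-d1|pre-e2/]
factored  = "code pre-e2 here"[/pre-(?:d1|e2)/]

# 3. Leverage character classes instead of single-character alternatives
alternation_parts = "a,b;c.d".split(/(?:,|;|\.)/)
class_parts       = "a,b;c.d".split(/[,.;]/)

puts lazy_tags.inspect, negated_tags.inspect
puts ambiguous, factored
puts alternation_parts.inspect, class_parts.inspect
```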
  20. IT'S ALL INTERESTING

    But that's not the reason we are here
  21. REPETITION INSIDE REPETITION, OVERLAPPING REPETITIONS

    repetition inside repetition: /(a+)*b/
    overlapping repetitions: /a*c?a*b/
    ALL OK? aaaaaaaaaab
    CATASTROPHIC BACKTRACKING: aaaaaaaaaa
    see Exercise 2.1 & 2.2
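To see why `/(a+)*b/` blows up on input without a `b`, the sketch below models a naive backtracking matcher for exactly that pattern (anchored at the start) and counts its attempts. It is a toy model under simplifying assumptions, not the algorithm of any real engine: real engines count steps differently and may include mitigations. The method name `attempts` is hypothetical.

```ruby
# Toy model of a backtracking matcher for /\A(a+)*b/.
# Returns [matched?, number_of_attempts].
def attempts(input, pos = 0, counter = { steps: 0 })
  counter[:steps] += 1
  # Greedy a+: measure the longest run of "a" starting at pos.
  run = 0
  run += 1 while input[pos + run] == "a"
  # Try the longest run first, then give characters back one by one.
  run.downto(1) do |len|
    return [true, counter[:steps]] if attempts(input, pos + len, counter).first
  end
  # No further group iteration possible: the literal "b" must follow.
  [input[pos] == "b", counter[:steps]]
end

[5, 10, 15].each do |n|
  matched, steps = attempts("a" * n)
  puts "#{n} x 'a', no 'b': matched=#{matched}, attempts=#{steps}"
end
matched, steps = attempts("a" * 10 + "b")
puts "10 x 'a' + 'b': matched=#{matched}, attempts=#{steps}"
```

In this model the matching input needs 2 attempts, while the failing input needs 2^n attempts for n letters: every way of splitting the run of `a` between group iterations is tried before the matcher gives up.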
  22. WHO WRITES SUCH REGEXPS?

    EXERCISE 2.3
  23. ARITHMETIC OPERATIONS

    320-12=
    430-32+1=

    pattern = /\d+[-+]\d+=/ # v1.0
    pattern = /(\d+[-+])+=/ # v2.0-rc
    pattern = /(\d+[-+]?)+=/ # v2.0
    pattern = /(\d+|[-+])+=/

    "320-12=" # 12 steps
    "32-12+230=" # 18 steps
    "32+12-320-2132+32123=" # 28 steps
    "32+12-320-2132+32123" # 95 854 steps
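The evolution of the pattern can be checked against the two sample inputs. This sketch assumes the pattern is used as in the `<input pattern=...>` attribute shown earlier, where the whole value has to match, so each regex is wrapped in `\A(?:...)\z`. The helper `whole_match?` is a made-up name.

```ruby
VERSIONS = {
  "v1.0"    => /\d+[-+]\d+=/,
  "v2.0-rc" => /(\d+[-+])+=/,
  "v2.0"    => /(\d+[-+]?)+=/
}.freeze
INPUTS = ["320-12=", "430-32+1="].freeze

# Mimic the whole-value matching of the HTML pattern attribute.
def whole_match?(regex, input)
  Regexp.new("\\A(?:#{regex.source})\\z").match?(input)
end

results = VERSIONS.transform_values do |regex|
  INPUTS.map { |input| whole_match?(regex, input) }
end
results.each { |version, accepted| puts "#{version}: #{INPUTS.zip(accepted).to_h}" }
```

v1.0 accepts only a single operation, and v2.0-rc rejects both inputs because it demands an operator right before `=`. Making the operator optional in v2.0 accepts both inputs, but it also lets the engine split a run of digits between iterations in many ways, which is where the step count for the input without `=` comes from.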
  24. SOLUTIONS

    unrolling the loop
      (A+|B?)+ => A+(BA+)*
      (A+|B?)+ => (A+B)*A+
    possessive quantifier (repetition that never backtracks)
      (A+|B?)+ => (A++|B?)+ (*)
    atomic groups
      (A+|B?)+ => (?>A+|B?)+
    Optimize pattern from 2.3, then do 2.4
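One possible way to apply these techniques to the arithmetic pattern `(\d+[-+]?)+=`, as a sketch rather than the workshop's reference solution. The constant names and the third sample input are assumptions, the patterns are anchored for whole-string matching, and the atomic variant wraps only the digits instead of the whole alternation.

```ruby
CANDIDATES = {
  original:   /\A(\d+[-+]?)+=\z/,
  unrolled:   /\A\d+([-+]\d+)*=\z/,       # A+(BA+)* with A = digit, B = operator
  possessive: /\A(\d++[-+]?)+=\z/,        # \d++ never gives digits back
  atomic:     /\A((?>\d+)[-+]?)+=\z/      # same idea with an atomic group
}.freeze

SAMPLES = ["320-12=", "32+12-320-2132+32123", "1+="].freeze

verdicts = CANDIDATES.transform_values do |regex|
  SAMPLES.map { |input| regex.match?(input) }
end
verdicts.each { |name, accepted| puts "#{name}: #{SAMPLES.zip(accepted).to_h}" }
```

All four agree on the first two inputs. The unrolled version is stricter: it rejects `1+=`, which the original accepts, so unrolling can change which inputs are valid and that has to be a deliberate choice.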
  25. PATTERNS TO AVOID

    overlapping scopes: /\d+\w*/
    overlapping alternatives: /(\d+|\w+)/
    remote overlapping quantifiers: /.*-some text-.*;/
    nested quantifiers: /(\d+)*\w/
  26. None
  27. COMPUTERS ARE NOW SO FAST THAT EVEN HAVING 100 000 STEPS WON'T MATTER
  28. BENCHMARK: NUMBERS

                greedy     lazy       unrolled      possessive
    Ruby        70.4 i/s   65.9 i/s   1,296.1 i/s   1,187.0 i/s
    JavaScript  191 i/s    220 i/s    6,689 i/s     -

    greedy = /(-?(\d+[,.]?)+)/
    lazy = /(-?(\d+?[,.]?)+?)/
    unrolled = /(-?\d+([,.]\d+)*)/
    possessive = /(-?(\d++[,.]?)++)/
    string = "..." # ~11000 chars
    def count(string, regex)
      # count how many times regex is matched on a string
    end
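The body of `count` is not shown on the slide. One plausible implementation, assuming it simply counts non-overlapping matches, uses `String#scan`. The sample string below is made up and much shorter than the roughly 11000 characters used in the benchmark, so this only checks that three of the variants agree; it says nothing about speed.

```ruby
GREEDY     = /(-?(\d+[,.]?)+)/
UNROLLED   = /(-?\d+([,.]\d+)*)/
POSSESSIVE = /(-?(\d++[,.]?)++)/

# Assumed implementation: count non-overlapping matches of regex in string.
def count(string, regex)
  string.scan(regex).size
end

sample = "1,5 and -20.75 or 300" # hypothetical input
patterns = { greedy: GREEDY, unrolled: UNROLLED, possessive: POSSESSIVE }
counts = patterns.transform_values { |regex| count(sample, regex) }
puts counts
```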
  29. BENCHMARK: LOG SEARCH

    NASA webserver, 196 MB

    199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

    /.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/ # 5m18,298s
    /.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/ # 0m10,013s
    /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/ # 0m5,897s
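The fastest of the three patterns can be tried on the sample line directly. In this sketch the variable names for the captured fields (`timestamp`, `verb`, and so on) are assumptions; the slide does not name them.

```ruby
LOG_LINE = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Every field is described by what it may contain (\S, [^\]], \d)
# rather than by .* and .+, so there is little left to backtrack over.
PRECISE = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

timestamp, verb, path, protocol, status, bytes = PRECISE.match(LOG_LINE).captures
puts timestamp, verb, path, protocol, status, bytes
```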
  30. WHO COULD MAKE SUCH MISTAKE?

  31. WHO DID MAKE SUCH MISTAKE?

    Virginia Tech study, 2018
      pypi & npm modules
      1-10% superlinear regexes
      Django, MongoDB, python core
    Also
      StackOverflow
      Cloudflare
      Rack
  32. COPY-PASTING

    RegExLib, id=1757 (email validation)
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$

    OWASP Validation Regex Repository, Java Classname
    ^(([a-z])+.)+[A-Z]([a-z])+$

    'aaaaaaaaaaaaaaa'
  33. EXERCISE 3 - REAL WORLD EXAMPLES
  34. LET THE USERS WRITE REGEXPS!

    feature request
    (<.*>\s*)*
  35. None
  36. None
  37. SUMMARY

  38. GOOD PRACTICES

    .* => [^X]*
    .*? => [^X]*
    (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    (,|;|\.) => [,.;]
    (A+|B?)+ => A+(BA+)*
    (A+|B?)+ => (A+B)*A+
    \w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
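The last line lists three ways to stop `\d+` from giving characters back: a possessive quantifier, an atomic group, and a lookahead with a backreference, which is useful in engines that lack the first two. The sketch below uses made-up sample strings. In the second half, `[5]` is written as a character class only to keep the backreference `\1` from being read together with the digit that follows it.

```ruby
sample = "item-2020"
variants = {
  greedy:     /\w+-\d+/,
  possessive: /\w+-\d++/,
  atomic:     /\w+-(?>\d+)/,
  lookahead:  /\w+-(?=(\d+))\1/
}
# String#[] with a regex returns the whole match (or nil).
matches = variants.transform_values { |regex| sample[regex] }
puts matches

# The difference shows when the rest of the pattern needs a character
# that \d+ has already consumed: only the greedy version backtracks.
digits = "12345"
trailing_five = {
  greedy:     /\d+[5]/,
  possessive: /\d++[5]/,
  atomic:     /(?>\d+)[5]/,
  lookahead:  /(?=(\d+))\1[5]/
}
backtracks = trailing_five.transform_values { |regex| regex.match?(digits) }
puts backtracks
```

Only the greedy variant matches `12345` in the second half, because it is the only one allowed to hand the final `5` back. That refusal to backtrack is what keeps the other three from exploring every split of a digit run.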
  39. FINAL ADVICE

    TAFT: valid, invalid, almost valid input
    disambiguate, don't be lazy
    it's a sharp tool, use with care
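A sketch of what testing with valid, invalid and almost valid input could look like in Ruby. The pattern under test, the inputs and the one-second budget are all assumptions chosen for illustration. The almost valid case is the important one: it is correct right up to the end, which is the kind of input that sends a vulnerable pattern into heavy backtracking. `Timeout.timeout` is used as a guard here on the assumption that the runtime can interrupt a running match.

```ruby
require "timeout"

SAFE = /\A\d+([-+]\d+)*=\z/ # hypothetical pattern under test

CASES = {
  valid:        ["320-12=", true],
  invalid:      ["abc", false],
  almost_valid: ["1+" * 5_000 + "1", false] # valid right up to the missing "="
}.freeze

outcomes = CASES.transform_values do |(input, _expected)|
  # Fail loudly instead of hanging if the match takes too long.
  Timeout.timeout(1) { SAFE.match?(input) }
end

CASES.each do |name, (_input, expected)|
  status = outcomes[name] == expected ? "ok" : "FAILED"
  puts "#{name}: #{status}"
end
```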
  40. None
  41. WHAT DO YOU THINK? LET'S TALK!

    @MJRZASA
  42. None