I Can Kill Your Browser With a Simple Regexp. Workshop

mrzasa
March 09, 2020

Transcript

  1. I CAN KILL YOUR BROWSER
    WITH A SIMPLE REGEX
    MACIEK RZĄSA
    TOPTAL
    @mjrzasa

  5. I'm definitely guilty of this. When I throw a regex
    together, I never worry about performance;
    I know the target strings will generally be
    far too small to ever cause a problem.
    Jeff Atwood, 2006

  6. JEFF ATWOOD DOESN'T
    WORRY ABOUT REGEX
    PERFORMANCE,
    WHY SHOULD I?

  7. WHAT'S NEXT?
    . Warm-up: basics of regex performance.
    . Catastrophic performance.
    . Real world failures.

  8. DEVELOPER
    @ TOPTAL
    at work
    Ruby & Postgres
    migrating to services
    Scrum Mastering
    after work
    Rzeszów Ruby User Group (rrug.pl),
    Rzeszów University of Technology
    software that matters, agile
    text processing, distributed systems

  9. HOW DO REGEXPS
    REALLY WORK?

  10. RUBY
    pattern = /<.*>/
    pattern.match("text\n")
    JAVA
    Pattern p = Pattern.compile("<.*>");
    Matcher m = p.matcher("text\n");
    boolean b = m.matches(); // true only if the whole input matches
    JAVASCRIPT
    var pattern = /<.*>/;
    var results = pattern.exec("text\n");

  11. THEORY...
    regular grammar:
    A -> abB
    B -> bb
    B -> ab
    regular expression: abab|abbb
    finite automaton: [figure: automaton accepting abab|abbb]
    source: https://swtch.com/~rsc/regexp/regexp1.html

  12. ...MEETS PRACTICE
    formal languages theory:
    a* a+ a|b a? a(a|b)
    popular programming languages:
    a* a+ a|b a? a(a|b) a*? \d \W
    /(\w+)\1/ -> papa WikiWiki
    /\(((?R)|\w+)\)/ -> (((12)))
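
The two non-regular features on this slide can be tried directly. This is a minimal Ruby sketch, not taken from the deck: the backreference pattern is the slide's own, while the recursive one is rewritten with a named subexpression call (`\g<nested>`), which is an assumed stand-in for the slide's `(?R)` form. The constant names are illustrative only.

```ruby
# Backreference: \1 must repeat exactly what group 1 captured.
DOUBLED = /(\w+)\1/

# Recursion, written with a named subexpression call. The group name
# "nested" is an assumption for illustration; the slide uses (?R).
NESTED_PARENS = /(?<nested>\((?:\g<nested>|\w+)\))/

puts "papa"[DOUBLED]
puts "WikiWiki"[DOUBLED]
puts "(((12)))"[NESTED_PARENS]
```

Neither pattern describes a regular language, which is why an engine that supports them cannot be a plain finite automaton.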

  13. TWO TYPES OF
    REGEX ENGINES
    . Text-directed
    Thompson 1968, 400 LOC in C
    grep, awk, sed, Go
    based on DFA (Deterministic Finite Automata)
    simpler implementation
    . Regex-directed
    Larry Wall, Perl, 1987
    Perl-Compatible Regular Expressions (JS, Ruby, .NET, ...)
    based on NFA (Nondeterministic Finite Automata)
    broader feature set

  14. EXAMPLE
    /abab|abbb/ =~ 'abbb'
    [figure: automaton for abab|abbb]
    source of figures on this and the next few slides:
    https://swtch.com/~rsc/regexp/regexp1.html

  15. Text-directed abab|abbb
    [figures: automaton for abab|abbb, one per input position]
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
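
The text-directed walk can be sketched as a table-driven automaton. The Ruby below is an illustrative sketch under stated assumptions: the transition table is hand-built for `abab|abbb`, the state numbers are arbitrary labels, and `dfa_match` is a hypothetical helper that only handles whole-string matching.

```ruby
# Hand-built DFA for abab|abbb. State numbers are arbitrary labels.
TRANSITIONS = {
  0 => { "a" => 1 },
  1 => { "b" => 2 },
  2 => { "a" => 3, "b" => 4 },
  3 => { "b" => 5 },
  4 => { "b" => 5 }
}.freeze
ACCEPTING = [5].freeze

# Returns [matched?, characters_read]. Each input character is read
# at most once and the position never moves backwards.
def dfa_match(input)
  state = 0
  reads = 0
  input.each_char do |char|
    reads += 1
    state = TRANSITIONS.fetch(state, {})[char]
    return [false, reads] if state.nil?
  end
  [ACCEPTING.include?(state), reads]
end

p dfa_match("abbb") # => [true, 4]
p dfa_match("abab") # => [true, 4]
p dfa_match("abba") # => [false, 4]
```

The work is bounded by the length of the input, whatever the pattern looks like.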

  16. Regex-directed abab|abbb
    [figures: automaton for abab|abbb, one per step]
    •abbb
    •abbb
    a•bbb
    ab•bb (failure, backtracking)
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
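
The regex-directed walk can be mimicked by trying the alternatives one at a time and restarting from the beginning of the input when one fails. This is a toy Ruby sketch, not how a production engine is written: `backtracking_match` is a hypothetical helper, it supports only literal alternatives, and it only matches whole strings.

```ruby
# Try each alternative left to right from the same start position,
# going back to the start (backtracking) when one fails.
def backtracking_match(alternatives, input)
  comparisons = 0
  alternatives.each do |alternative|
    position = 0
    matched = alternative.each_char.all? do |expected|
      comparisons += 1
      ok = input[position] == expected
      position += 1 if ok
      ok
    end
    return [true, comparisons] if matched && position == input.length
  end
  [false, comparisons]
end

p backtracking_match(%w[abab abbb], "abbb") # => [true, 7]
```

Three comparisons are spent on `abab` before it fails on the third character, then four more on `abbb`. The input has four characters, so some of them are inspected twice.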

  18. WARM-UP EXERCISE (1)

  19. WARM-UP SUMMARY
    Don't be lazy
    .* => [^X]*
    .*? => [^X]*
    Avoid ambiguous alternatives
    (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    Leverage character classes
    (,|;|\.) => [,.;]
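
The three rewrites from the summary can be checked on small inputs. The Ruby below is an illustrative sketch: the sample strings and constant names are assumptions, not material from the workshop exercises.

```ruby
HTML = "<b>bold</b> and <i>italic</i>"

# Greedy .* runs to the end of the input, then backtracks to the last ">".
GREEDY_TAG = /<.*>/
# A negated class stops at the first ">" instead.
NEGATED_TAG = /<[^>]*>/

# Shared prefix factored out of the alternatives.
FLAT_ALTERNATIVES     = /\A(pre-d1|pre-e2)\z/
FACTORED_ALTERNATIVES = /\Apre-(d1|e2)\z/

# One character class instead of single-character alternatives.
ALTERNATION = /(,|;|\.)/
CHAR_CLASS  = /[,.;]/

puts HTML[GREEDY_TAG]  # the whole string
puts HTML[NEGATED_TAG] # <b>
puts "pre-e2".match?(FLAT_ALTERNATIVES) == "pre-e2".match?(FACTORED_ALTERNATIVES)
puts "a;b,c.d".scan(CHAR_CLASS).inspect
```

The first rewrite also changes what is matched, so `[^X]*` is a replacement for `.*` only when stopping at the first `X` is what was meant.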

  20. IT'S ALL INTERESTING
    But that's not the reason we are here

  21. REPETITION INSIDE REPETITION
    /(a+)*b/
    OVERLAPPING REPETITIONS
    /a*c?a*b/
    aaaaaaaaaab: ALL OK?
    aaaaaaaaaa: CATASTROPHIC BACKTRACKING
    see Exercise 2.1 & 2.2
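
Why the missing `b` is so expensive for `/(a+)*b/` can be seen in a toy model of a backtracking matcher. The Ruby below is a sketch under stated assumptions: it is not Ruby's real engine, it only tries a match starting at position 0, and `star_group_then_b` and `steps_for` are hypothetical helpers that count how often the matcher is entered.

```ruby
# Toy model of a backtracking engine running (a+)*b from one position.
def star_group_then_b(input, position, stats)
  stats[:steps] += 1
  # Length of the run of "a" available at this position.
  run = 0
  run += 1 while input[position + run] == "a"
  # Greedy a+: take the longest run first, then give characters back.
  run.downto(1) do |taken|
    return true if star_group_then_b(input, position + taken, stats)
  end
  # No further iteration of the group: the next character must be "b".
  input[position] == "b"
end

def steps_for(input)
  stats = { steps: 0 }
  matched = star_group_then_b(input, 0, stats)
  [matched, stats[:steps]]
end

p steps_for("a" * 10 + "b") # => [true, 2]
p steps_for("a" * 10)       # => [false, 1024]
p steps_for("a" * 15)       # => [false, 32768]
```

In this model every way of splitting the run of `a` into groups is tried before giving up, so the step count doubles with each extra character. Real engines may apply optimisations that this sketch ignores.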

  22. WHO WRITES SUCH
    REGEXPS?
    EXERCISE 2.3

  23. ARITHMETIC OPERATIONS
    320-12=
    430-32+1=
    pattern = /\d+[-+]\d+=/ # v1.0
    pattern = /(\d+[-+])+=/ # v2.0-rc
    pattern = /(\d+[-+]?)+=/ # v2.0
    pattern = /(\d+|[-+])+=/
    "320-12=" # 12 steps
    "32-12+230=" # 18 steps
    "32+12-320-2132+32123=" # 28 steps
    "32+12-320-2132+32123" # 95 854 steps

  24. SOLUTIONS
    unrolling the loop
    (A+|B?)+ => A+(BA+)*
    (A+|B?)+ => (A+B)*A+
    possessive quantifier (repetition that never backtracks)
    (A+|B?)+ => (A++|B?)+
    (*) atomic groups
    (A+|B?)+ => (?>A+|B?)+
    Optimize pattern from 2.3, then do 2.4
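
Applied to the arithmetic pattern `/(\d+[-+]?)+=/`, the three techniques might look as follows. This is one possible Ruby sketch rather than the workshop's reference solution; the constant names are illustrative.

```ruby
V2         = /(\d+[-+]?)+=/      # the v2.0 pattern from the arithmetic slide
UNROLLED   = /\d+(?:[-+]\d+)*=/  # A+(BA+)* with A = \d and B = [-+]
POSSESSIVE = /(\d++[-+]?)+=/     # \d++ never gives digits back
ATOMIC     = /(?>\d+[-+]?)+=/    # one loop iteration is never re-split

CANDIDATES = { v2: V2, unrolled: UNROLLED, possessive: POSSESSIVE, atomic: ATOMIC }.freeze

["32+12-320-2132+32123=", "32+12-320-2132+32123", "5+="].each do |input|
  results = CANDIDATES.map { |name, regex| "#{name}=#{input.match?(regex)}" }
  puts "#{input.inspect}: #{results.join(' ')}"
end
```

All four accept the valid expression and reject the one without `=`. The unrolled form is also stricter: it rejects `5+=`, which the other three accept because the operator is optional in them.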

  25. PATTERNS TO AVOID
    overlapping scopes /\d+\w*/
    overlapping alternatives /(\d+|\w+)/
    remote overlapping quantifiers
    /.*-some text-.*;/
    nested quantifiers /(\d+)*\w/

  27. COMPUTERS ARE NOW
    SO FAST THAT
    EVEN HAVING 100 000 STEPS
    WON'T MATTER

  28. BENCHMARK: NUMBERS
                greedy    lazy      unrolled     possessive
    Ruby        70.4 i/s  65.9 i/s  1,296.1 i/s  1,187.0 i/s
    JavaScript  191 i/s   220 i/s   6,689 i/s    -
    greedy = /(-?(\d+[,.]?)+)/
    lazy = /(-?(\d+?[,.]?)+?)/
    unrolled = /(-?\d+([,.]\d+)*)/
    possessive = /(-?(\d++[,.]?)++)/
    string = "..." # ~11000 chars
    def count(string, regex)
      # count how many times regex is matched on a string
    end
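
A comparison of this kind can be set up with Ruby's standard `benchmark` library. The sketch below makes several assumptions: the body of `count` (a plain `scan`) is a guess at the elided helper, the sample input is made up because the original string is not shown, and it reports wall-clock seconds rather than the iterations per second on the slide. The lazy variant is left out because it splits numbers into more matches, so its count is not comparable.

```ruby
require "benchmark"

PATTERNS = {
  greedy:     /(-?(\d+[,.]?)+)/,
  unrolled:   /(-?\d+([,.]\d+)*)/,
  possessive: /(-?(\d++[,.]?)++)/
}.freeze

# Assumed implementation of the slide's helper.
def count(string, regex)
  string.scan(regex).size
end

# Hypothetical input: three numbers per repetition.
SAMPLE = ("12,5 and -3.14 then 1.000.000 " * 50).freeze

PATTERNS.each do |name, regex|
  matches = nil
  seconds = Benchmark.realtime { 100.times { matches = count(SAMPLE, regex) } }
  puts format("%-10s %5d matches %.4fs", name, matches, seconds)
end
```

Timings from this sketch depend on the machine, the Ruby version and the input, so they should not be expected to reproduce the figures above.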

  29. BENCHMARK: LOG SEARCH
    NASA webserver, 196 MB
    199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
    /.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/
    # 5m18,298s
    /.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/
    # 0m10,013s
    /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/
    # 0m5,897s
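
To see what the fastest pattern captures, it can be run against the sample line from the slide. A minimal Ruby sketch; the constant names are illustrative and the variable names given to the captures are assumptions about what each field means.

```ruby
LOG_LINE = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] ' \
           '"GET /history/apollo/ HTTP/1.0" 200 6245'

# The third pattern from the slide, unchanged.
ACCESS_LOG = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

time, verb, path, protocol, status, bytes = LOG_LINE.match(ACCESS_LOG).captures
puts time     # 01/Jul/1995:00:00:01 -0400
puts verb     # GET
puts path     # /history/apollo/
puts protocol # HTTP/1.0
puts status   # 200
puts bytes    # 6245
```

Each token in this pattern excludes the delimiter that follows it (`\S` cannot cross a space, `[^\]]` cannot cross the closing bracket), so there is only one way to split the line.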

  30. WHO COULD MAKE SUCH
    A MISTAKE?

  31. WHO DID MAKE SUCH
    A MISTAKE?
    Virginia Tech study, 2018
    PyPI & npm modules
    1-10% superlinear regexes
    Django, MongoDB, Python core
    Also
    Stack Overflow
    Cloudflare
    Rack

  32. COPY-PASTING
    RegExLib, id=1757 (email validation)
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
    OWASP Validation Regex Repository, Java Classname
    ^(([a-z])+.)+[A-Z]([a-z])+$
    'aaaaaaaaaaaaaaa'
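
The Java classname pattern becomes ambiguous because the unescaped `.` can also match a letter, so a run of lowercase letters can be split between `([a-z])+` and `.` in many ways. The Ruby sketch below shows one possible unambiguous rewrite. It assumes the intent was lowercase package segments separated by literal dots followed by a capitalised class name, which the source does not state, and the constant names are illustrative.

```ruby
# Pattern as published (see the slide above).
ORIGINAL = /^(([a-z])+.)+[A-Z]([a-z])+$/

# Assumed intent: the dot is literal, so every character has one role.
STRICT = /\A(?:[a-z]+\.)+[A-Z][a-z]+\z/

puts "java.lang.String".match?(ORIGINAL) # true
puts "java.lang.String".match?(STRICT)   # true
puts "aaaaaaaaaaaaaaa".match?(STRICT)    # false
```

The rewrite accepts a narrower set of strings than the original, so it would need to be checked against whatever inputs the original was supposed to validate.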

  33. EXERCISE 3 - REAL WORLD
    EXAMPLES

  34. LET THE USERS WRITE REGEXPS!
    feature request
    (<.*>\s*)*

  37. SUMMARY

  38. GOOD PRACTICES
    .* => [^X]*
    .*? => [^X]*
    (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    (,|;|\.) => [,.;]
    (A+|B?)+ => A+(BA+)*
    (A+|B?)+ => (A+B)*A+
    \w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
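
The last line lists three ways to stop `\d+` from giving characters back: a possessive quantifier, an atomic group, and a lookahead with a backreference for engines that have neither. The Ruby sketch below illustrates that they behave alike; the `\d+5` patterns and the group name `digits` are assumptions chosen to make the difference visible.

```ruby
# The three rewrites from the slide match the same text as the original.
VARIANTS = [/\w+-\d+/, /\w+-\d++/, /\w+-(?>\d+)/, /\w+-(?=(\d+))\1/].freeze
puts VARIANTS.map { |regex| "abc-123"[regex] }.uniq.inspect

# The difference shows only when backtracking would be needed:
GREEDY     = /\d+5/                             # gives back the final "5"
POSSESSIVE = /\d++5/                            # keeps every digit
ATOMIC     = /(?>\d+)5/                         # same, as an atomic group
LOOKAHEAD  = /(?=(?<digits>\d+))\k<digits>5/    # same, emulated

puts "12345".match?(GREEDY)     # true
puts "12345".match?(POSSESSIVE) # false
puts "12345".match?(ATOMIC)     # false
puts "12345".match?(LOOKAHEAD)  # false
```

The lookahead form works because a lookahead is not re-entered once it has succeeded, so the backreference always consumes the full run of digits.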

  39. FINAL ADVICE
    TAFT: valid, invalid, almost valid input
    disambiguate, don't be lazy
    it's a sharp tool, use with care

  41. WHAT DO YOU THINK?
    LET'S TALK!
    @MJRZASA
