Writing slow regexp is easier than you think (and want it to be)

mrzasa
July 02, 2018

Although regular expressions are commonly used in software development, few developers think about their performance. The sad truth is that a badly written regexp can severely damage application performance, both on the server side and in the browser. How do you write a regular expression that is not only correct but also efficient?

Slides for the presentation given at the SPA Software in Practice conference, London 2018 (https://www.spaconference.org/spa2018/).

Transcript

  1. WRITING SLOW REGEXP IS EASIER THAN YOU THINK (AND WANT IT TO BE)
    MACIEK RZĄSA
    TEXTMASTER
    SPA Software London, 2nd July 2018
    @mjrzasa

  5. I'm definitely guilty of this. When I throw a regex
    together, I never worry about performance;
    I know the target strings will generally be
    far too small to ever cause a problem.
    Jeff Atwood, 2006

  6. JEFF ATWOOD DOESN'T WORRY ABOUT REGEX PERFORMANCE, WHY SHOULD I?

    https://link.do/spa-input
    https://link.do/spa-page

  7. WHAT'S NEXT
    regex engines: theory & internals
    performance of basic regex elements
    examples & applications
    what could go wrong?
    can regex be fast?
    https://link.do/spa-page

  8. RUBY DEVELOPER @ TEXTMASTER
    at work
    translation solution available online
    network of expert translators, >50 langs
    SaaS platform, API, integrations
    text processing
    Ruby, Java, MongoDB, Elasticsearch
    after work
    Rzeszów Ruby User Group (rrug.pl),
    Rzeszów University of Technology
    software that matters, agile

  9. HOW DO REGEXPS REALLY WORK?

  10. RUBY
    pattern = /<.*>/
    pattern.match("text
    ")
    JAVA
    Pattern p = Pattern.compile("<.*>");
    Matcher m = p.matcher("text
    ");
    boolean b = m.matches();
    JAVASCRIPT
    var pattern = /<.*>/;
    var results = pattern.exec("text
    ");

  11. THEORY...
    regular grammar:
    A -> abB
    B -> bb
    B -> ab
    regular expression: abab|abbb
    finite automaton: [figure: automaton for abab|abbb]
    source: https://swtch.com/~rsc/regexp/regexp1.html

  12. ...MEETS PRACTICE
    formal languages theory:
    a* a+ a|b a? a(a|b)
    popular programming languages:
    a* a+ a|b a? a(a|b) a*? \d \W
    /(\w+)\1/ -> papa WikiWiki
    /\(((?R)|\w+)\)/ -> (((12)))
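
The two non-regular features on this slide can be tried out directly. The sketch below is only an illustration in Ruby: the constant names are invented for the example, and because Ruby's engine has no `(?R)`, it assumes the subexpression call `\g<name>` as the closest equivalent.

```ruby
# Backreference: the same word repeated twice, e.g. "papa" or "WikiWiki".
DOUBLED = /\A(\w+)\1\z/

# Recursion: a word wrapped in any number of balanced parentheses.
# Assumption: Ruby's \g<name> subexpression call stands in for PCRE's (?R).
NESTED = /\A(?<paren>\((?:\g<paren>|\w+)\))\z/

puts DOUBLED.match?("papa")
puts DOUBLED.match?("WikiWiki")
puts DOUBLED.match?("paper")
puts NESTED.match?("(((12)))")
puts NESTED.match?("(((12))")
```

Neither pattern describes a regular language, which is one reason practical engines cannot always rely on a plain finite automaton.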

  13. TWO TYPES OF REGEX ENGINES

  14. EXAMPLE
    /abab|abbb/ =~ 'abbb'
    [figure: automaton for abab|abbb]
    source of figures on this and the next few slides: https://swtch.com/~rsc/regexp/regexp1.html

  15. Text-directed abab|abbb
    [figures: automaton state after each character]
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•

  16. Regex-directed abab|abbb
    [figures: automaton state after each step]
    •abbb
    •abbb
    a•bbb
    ab•bb
    failure, backtracking
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
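
The difference between the two walks can be reproduced with a toy model. The sketch below is not how any real engine is implemented: it treats the regex as a plain list of literal alternatives, matches only at the start of the text, and counts one step per character comparison. All names in it are invented for the example.

```ruby
ALTERNATIVES = %w[abab abbb].freeze

# Regex-directed: try one alternative at a time; on failure go back to the
# start of the text and try the next one (backtracking).
def regex_directed(text)
  steps = 0
  ALTERNATIVES.each do |alt|
    matched = alt.each_char.with_index.all? do |ch, i|
      steps += 1
      text[i] == ch
    end
    return [true, steps] if matched
  end
  [false, steps]
end

# Text-directed: read every character of the text once and keep track of
# all alternatives that are still possible.
def text_directed(text)
  alive = ALTERNATIVES.dup
  steps = 0
  text.each_char.with_index do |ch, i|
    steps += 1
    alive = alive.select { |alt| alt[i] == ch }
    return [false, steps] if alive.empty?
    return [true, steps] if alive.any? { |alt| alt.length == i + 1 }
  end
  [false, steps]
end

p regex_directed("abbb") # [true, 7]
p text_directed("abbb")  # [true, 4]
```

In this model the regex-directed walk spends three comparisons on the failing `abab` branch before it succeeds with `abbb`, while the text-directed walk never revisits a character.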

  18. WHAT DO WE ALREADY KNOW?
    regexps are a separate programming language
    two types of regexp engines (virtual machines): text-directed and regex-directed
    performance depends on the number of steps and backtracks

  19. PERFORMANCE ANALYSIS

  20. EXERCISE 1
    Match HTML tags.
    1. Add text inside/after the tag and see if the step count changes (see the debugger).
    2. Add another tag and see the result.
    3. Two solutions: limit repetition with .*?, or limit scope with [^>]
    4. Try both; add text inside/after the tag and see how the step count changes.
    https://link.do/spa-page https://link.do/spa-greedy

  21. QUANTIFIERS (REPETITION)
    .* greedy
    .*? lazy
    /<.*>/=~"
    regexp text "
    => "
    " # so far so good
    /<.*>/=~" regexp text "
    => " regexp text " #hmmm...
    /<[^>]*>/=~" regexp text "
    => "" # great!
    /<.*?>/=~" regexp text "
    => "" # great!
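
The three variants can be compared on any tagged string. The input below is hypothetical (the tags in the slide text did not survive extraction), so treat this as a sketch of the idea rather than a reproduction of the slide.

```ruby
# Hypothetical input; not the string used on the slide.
html = "<b>regexp</b> text"

greedy  = html[/<.*>/]     # runs to the end of the line, then backtracks to the last ">"
lazy    = html[/<.*?>/]    # grows one character at a time until the first ">"
negated = html[/<[^>]*>/]  # cannot cross a ">" at all

puts greedy   # <b>regexp</b>
puts lazy     # <b>
puts negated  # <b>
```

The lazy and negated-class versions return the same tag here; they differ in how many steps the engine needs to get there.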

  22. SHOULD WE BE LAZY?
    " regexp text "
    /<[^>]*>/ /<.*?>/

  23. CONTEXT IS THE KING
    "
    some really long text"
    <.*> 27 steps
    <.*?> 7 steps
    <[^>]*> 4 steps
    "some really long text

    "
    <.*> 5 steps
    <.*?> 7 steps
    <[^>]*> 4 steps

  24. EXERCISE 2
    Match numbers with units ending with a semicolon:
    123cm; 32kg; 1m3;
    1. Try to add digits to the number.
    2. Remove the semicolon and see the steps and backtracking in the debugger.
    3. Replace the greedy quantifier with the possessive one (++) and see the steps in the debugger.
    https://link.do/spa-possessive

  25. ALMOST MATCHED
    possessive .++
    numbers with units
    what if it almost matches?
    123cm; 32kg; 1m3;
    /^(\d+)(\w+);/
    123cm 32kg 1m3

  26. 19 STEPS
    /^(\d+)(\w+);/
    9 STEPS
    /^(\d++)(\w+);/ # (Java, Ruby)
    /^(?>\d+)(\w+);/ # (.NET)
    /^(?=(\d+))\1(\w+);/ # (JavaScript)
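
Ruby's engine accepts the possessive quantifier, the atomic group and the lookahead emulation, so the variants can be compared side by side. The sketch below assumes the lookahead variant is written with a capturing group inside the lookahead; the constant names are invented for the example. It also shows that switching off backtracking is not always a pure optimization: on some inputs it changes what matches.

```ruby
GREEDY     = /^(\d+)(\w+);/
POSSESSIVE = /^(\d++)(\w+);/
ATOMIC     = /^(?>\d+)(\w+);/
LOOKAHEAD  = /^(?=(\d+))\1(\w+);/

# All four agree on input that really has a unit after the number.
[GREEDY, POSSESSIVE, ATOMIC, LOOKAHEAD].each do |re|
  puts "#{re.source}: #{re.match?('123cm;')}"
end

# They can disagree once backtracking is switched off: the greedy version
# gives back the "3" so that \w+ has something to match, the others cannot.
puts GREEDY.match?("123;")      # true  (\d+ = "12", \w+ = "3")
puts POSSESSIVE.match?("123;")  # false
puts ATOMIC.match?("123;")      # false
```

Whether that difference is a bug or a feature depends on what the pattern is supposed to accept, which is worth covering with an "almost matching" test.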

  27. QUANTIFIERS
    performance depends on context
    greedy .* .+ - performance depends on the text after the match
    lazy .*? .+? - performance depends on the length of the match
    possessive .*+ .++ - no backtracking
    positive tests: matching substring in various positions in the test string
    negative tests: test string very similar to the desired one
    negative tests: test string very similar to the desired one

    View Slide

  28. EXERCISE 3
    EXERCISE 3
    Optimize those two expressions:
    1. Match Tea column in CSV text:
    2. (*) Find some CSS classes related to product: product-
    size, product-column, product-info and
    product ids that has digits 1,2,3.
    https://link.do/spa-csv
    https://link.do/spa-css

  29. IT'S ALL INTERESTING
    But that's not the reason we are here

  30. CATASTROPHIC BACKTRACKING

  31. REPETITION INSIDE REPETITION (ex. 4)
    ...but who writes such regexps?
    /(a+)*b/
    aaaaaaaaaab
    aaaaaaaaaa
    https://link.do/spa-exp
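
Why does the string without the final `b` hurt so much? A rough way to see it is to count how many ways `(a+)*` can split a run of `a`s into groups, since a naive backtracking engine may try each split before giving up. The sketch below only does that counting; `splits` is a hypothetical helper, and real engines apply optimizations that can lower the actual step count.

```ruby
# Number of ways to cut a run of n "a"s into consecutive non-empty groups,
# i.e. the ways (a+)* can consume the whole run.
def splits(n)
  return 1 if n.zero?
  (1..n).sum { |first| splits(n - first) }
end

[5, 10, 15].each { |n| puts "#{n} a's: #{splits(n)} ways" }
```

The count doubles with every extra `a`, which is the exponential growth that makes this kind of backtracking catastrophic.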

  32. EXERCISE 5 (*)
    Arithmetic operations.
    You have a regex matching simple arithmetic operations. Allowed: two numbers separated with a plus or minus sign, ending with an equals sign, e.g. 12+34= or 32121-23=
    1. Enhance the regex to allow 3 numbers and 2 signs (e.g. 12+322-1=).
    2. Enhance the regex to allow any length of the operation (e.g. 12+322-1+223-2323+...=).
    3. Remove the equals sign from the test string and check the steps in the debugger.
    https://link.do/spa-operations

  33. ARITHMETIC OPERATIONS
    320-12=
    430-
    32+1=
    pattern = /\d+[-+]\d+=/ # v1.0
    pattern = /(\d+[-+])+=/ # v2.0-rc
    pattern = /(\d+[-+]?)+=/ #v2.0
    pattern = /(\d+|[-+])+=/
    "320-12=" # 12 steps
    "32-12+230=" # 18 steps
    "32+12-320-2132+32123=" # 28 steps
    "32+12-320-2132+32123" # 95 854 steps

  34. OPTIMIZATION: UNROLLING THE LOOP
    extracting the mandatory part \d+[-+]\d+
    repeating the long optional part ([-+]\d+)*
    pattern = /(\d+[-+]?)+=/ #v2.0
    pattern = /\d+[-+]\d+([-+]\d+)*=/ #v2.0.1
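
A quick check of the two versions on sample input can look like the sketch below. The `\A` and `\z` anchors and the constant names are assumptions added for the example so that the whole string has to match; the slide's patterns are unanchored.

```ruby
V2_0   = /\A(\d+[-+]?)+=\z/             # nested quantifiers
V2_0_1 = /\A\d+[-+]\d+([-+]\d+)*=\z/    # unrolled loop

samples = [
  "320-12=",
  "32-12+230=",
  "32+12-320-2132+32123=",
  "32+12-320-2132+32123", # no equals sign: the "almost matching" case
  "12+="                  # sign without a second number
]

samples.each do |s|
  puts format("%-24s v2.0: %-5s v2.0.1: %s", s, V2_0.match?(s), V2_0_1.match?(s))
end
```

Besides avoiding the nested quantifier, the unrolled version is also stricter: it rejects `12+=`, which v2.0 accepts.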

  35. COPY-PASTING
    RegExLib, id=1757 (email validation)
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
    OWASP Validation Regex Repository, Java Classname
    ^(([a-z])+.)+[A-Z]([a-z])+$
    'aaaaaaaaaaaaaaa'

  36. PATTERNS TO AVOID
    overlapping scopes /\d+\w*/
    overlapping alternatives /(\d+|\w+)/
    remote overlapping quantifiers /.*-some text-.*;/
    nested quantifiers /(\d+)*\w/

  38. COMPUTERS ARE NOW SO FAST THAT EVEN HAVING 100 000 STEPS WON'T MATTER
    a couple of applications

  39. COUNTING NUMBER OCCURRENCES IN TEXT
    piece of cake, right?
    # number
    # 1, 3243, 4323
    pattern = /\d+/
    # number with decimal part and minus
    # -1, 1, 32.32, -2.2324
    pattern = /(-?\d+(\.\d+)?)/
    # number with decimal part (dot or comma)
    # -23,23 4323.23
    pattern = /(-?\d+([.,]\d+)?)/

  40. COUNTING NUMBER OCCURRENCES IN TEXT
    # number with decimal part and thousands separator
    # -21,321,321.1111 433.233,12
    greedy = /(-?(\d+[,.]?)+)/
    # same, lazy
    lazy = /(-?(\d+?[,.]?)+?)/
    # same, limited backtracking
    unrolled = /(-?(\d+[,.])*\d+)/
    # same, possessive
    possessive = /(-?(\d++[,.]?)++)/

  42. TESTS
                greedy     lazy       unrolled      possessive
    Ruby        70.4 i/s   65.9 i/s   1,296.1 i/s   1,187.0 i/s
    JavaScript  191 i/s    220 i/s    6,689 i/s     -
    greedy = /(-?(\d+[,.]?)+)/
    lazy = /(-?(\d+?[,.]?)+?)/
    unrolled = /(-?\d+([,.]\d+)*)/
    possessive = /(-?(\d++[,.]?)++)/
    string = "..." # ~11000 chars
    def count(string, regex)
    # count how many times regex is matched on a string
    end
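
A comparison like this can be approximated with a few lines of Ruby. The sketch below is not the benchmark behind the numbers above: the test string is hypothetical, `count` is just one possible implementation of the helper (using `String#scan`), timing is a single wall-clock measurement, and the lazy variant is left out for brevity.

```ruby
VARIANTS = {
  greedy:     /(-?(\d+[,.]?)+)/,
  unrolled:   /(-?\d+([,.]\d+)*)/,
  possessive: /(-?(\d++[,.]?)++)/
}.freeze

# Count how many times regex matches in string (one possible implementation).
def count(string, regex)
  string.scan(regex).size
end

# Hypothetical test data, not the ~11000-character string from the talk.
text = "total 1,000.5 units, delta -2 and ratio 3.14 " * 200

VARIANTS.each do |name, regex|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  matches = count(text, regex)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts format("%-10s %5d matches in %.4fs", name, matches, elapsed)
end
```

Absolute timings depend heavily on the Ruby version and on the input, so only the relative differences on your own data are meaningful.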

  43. LOG SEARCH
    NASA webserver, 196 MB
    199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
    /.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/
    # 5m18,298s
    /.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/
    # 0m10,013s
    /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/
    # 0m5,897s
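
To make the fastest pattern concrete, the sketch below applies it to the sample log line from the slide. The variable names for the captured fields are assumptions chosen for readability; the slide does not name the groups.

```ruby
LOG_LINE = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Field names are illustrative only.
timestamp, verb, path, protocol, status, bytes = LOG_LINE.match(line).captures
puts "#{verb} #{path} (#{protocol}) -> #{status}, #{bytes} bytes at #{timestamp}"
```

Every token in this version says exactly which characters it accepts (`\S`, `[^\]]`, `\d`), so the engine has very little room to backtrack.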

  44. REDOS - WHAT COULD REALLY GO WRONG?
    Regular expression Denial of Service

  45. HOW TO STOP A FRONTEND APP?
    vue.js https://link.do/spa-vue

  46. HOW TO TAKE 100% CPU?
    Witaj! → ¡Hola! → Salut !
    # Ruby
    # Function source
    ::Typography.to_html_french
    # Put thin space before punctuation
    text.gsub(/(\s|)+([!?;]+(\s|\z))/, ' \2\3')
    # Data
    # <-58 space chars ->
    GET /wp-login.php HTTP/1.1 69
    GET /show.aspx HTTP/1.1 15
    customer of Discourse, described by Sam Saffron

  47. LET THE USERS WRITE REGEXPS!
    feature request
    (<.*>\s*)*

  50. GOOD PRACTICES
    .* => [^X]*
    .*? => [^X]*
    (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    (,|;|\.) => [,.;]
    (A*|B?)+ => A+(BA*)*
    \w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
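
Before swapping a pattern for its "good practice" form, it is worth checking that both versions find the same matches. The sketch below does that for two of the rewrites on a made-up sample string; the variable names are hypothetical.

```ruby
# Made-up sample text for the comparison.
text = "pre-d1, pre-e2; pre-x9. done"

# (,|;|\.) => [,.;]
alternation = text.scan(/(?:,|;|\.)/)
char_class  = text.scan(/[,.;]/)

# (pre-d1|pre-e2) => pre-(d1|e2)
repeated_prefix = text.scan(/(?:pre-d1|pre-e2)/)
factored_prefix = text.scan(/pre-(?:d1|e2)/)

puts alternation == char_class          # true
puts repeated_prefix == factored_prefix # true
```

An equality check like this only guards correctness; the step counts still have to be compared in a debugger or a benchmark.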

  51. CONCLUSIONS
    regular expressions: theory (formal languages) and practice (programming languages) differ
    every engine is different - check yours
    main issues: overlapping scopes or alternatives, nested quantifiers
    tests: matching, not matching, almost matching; various positions in text

  52. REFERENCE
    Mastering Regular Expressions, 3rd Edition, Jeffrey Friedl, 2006
    and next parts (1-4)
    thanks to
    http://www.rexegg.com
    https://www.regular-expressions.info
    https://regex101.com/
    Russ Cox
    test app - original Kamil Szubrycht
    OWASP ReDoS
    Loggly: 5 techniques
    Loggly: regexes bad better...
    Katafrakt: Regular expression how do they work?

  54. WHAT DO YOU THINK? LET'S TALK!
    @MJRZASA