Writing slow regexp is easier than you think (and want it to be)

mrzasa
July 02, 2018

Although regular expressions are commonly used in software development, few developers think about their performance. The sad truth is that a badly written regexp can severely damage application performance, both on the server side and in the browser. How do you write a regular expression that is not only correct but also efficient?

Slides for the presentation given at the SPA Software in Practice conference, London 2018 (https://www.spaconference.org/spa2018/).

Transcript

  1. WRITING SLOW REGEXP IS EASIER THAN YOU THINK (AND WANT IT TO BE)

    MACIEK RZĄSA, TEXTMASTER
    SPA Software, London, 2nd July 2018
    @mjrzasa
  2. "I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem."

    Jeff Atwood, 2006
  3. JEFF ATWOOD DOESN'T WORRY ABOUT REGEX PERFORMANCE, WHY SHOULD I?

    <input id="operation" pattern='(\d+[+-]?)+='>
    https://link.do/spa-input
    https://link.do/spa-page
  4. WHAT'S NEXT

    regex engines: theory & internals
    performance of basic regex elements
    examples & applications
    what could go wrong?
    can regex be fast?
    https://link.do/spa-page
  5. RUBY DEVELOPER @ TEXTMASTER

    at work:
      translation solution available online
      network of expert translators, >50 langs
      SaaS platform, API, integrations
      text processing
      Ruby, Java, MongoDB, Elasticsearch
    after work:
      Rzeszów Ruby User Group (rrug.pl), Rzeszów University of Technology
      software that matters, agile
  6. RUBY, JAVA, JAVASCRIPT

    # Ruby
    pattern = /<.*>/
    pattern.match("text <br>")

    // Java
    Pattern p = Pattern.compile("<.*>");
    Matcher m = p.matcher("text <br>");
    boolean b = m.find(); // find() searches anywhere in the text; matches() would require the whole string to match

    // JavaScript
    var pattern = /<.*>/;
    var results = pattern.exec("text <br>");
  7. THEORY...

    regular grammar: A -> abB, B -> bb, B -> ab
    regular expression: abab|abbb
    finite automaton: [figure: automaton for abab|abbb]
    source: https://swtch.com/~rsc/regexp/regexp1.html
  8. ...MEETS PRACTICE

    formal languages theory:
      a* a+ a|b a? a(a|b)
    popular programming languages:
      a* a+ a|b a? a(a|b) a*? \d \W (?<!b)a \1 (?R)...
      /(\w+)\1/ -> papa WikiWiki
      /\(((?R)|\w+)\)/ -> (((12)))
  9. EXAMPLE

    /abab|abbb/ =~ 'abbb'
    [figure: automaton for abab|abbb]
    source of figures on this and the next few slides: https://swtch.com/~rsc/regexp/regexp1.html
  10. Text-directed: abab|abbb

    [figures: automaton state after each step]
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
  11. Regex-directed: abab|abbb

    [figures: automaton state after each step]
    •abbb
    •abbb
    a•bbb
    ab•bb
    •abbb
    a•bbb
    ab•bb
    abb•b
    abbb•
    failure, backtracking
  12. WHAT DO WE ALREADY KNOW?

    regexps are a separate programming language
    two types of regexp engines (virtual machines): text-directed and regex-directed
    performance depends on the number of steps and backtracks
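
The difference between the two engine types can be illustrated with a toy step counter. This is only a sketch under strong assumptions: it handles alternatives made of plain literals (like abab|abbb), anchors the match at the start of the text, and counts one step per character comparison. The method names `regex_directed` and `text_directed` are hypothetical; real engines are far more elaborate and count steps differently.

```ruby
ALTERNATIVES = %w[abab abbb].freeze

# Regex-directed: try each alternative in turn; on failure, go back
# to the start of the text and try the next alternative (backtracking).
def regex_directed(alternatives, text)
  steps = 0
  alternatives.each do |alt|
    matched = alt.each_char.with_index.all? { |ch, i|
      steps += 1
      text[i] == ch
    }
    return [true, steps] if matched
  end
  [false, steps]
end

# Text-directed: read each character of the text once and keep
# every alternative that is still consistent with the text so far.
def text_directed(alternatives, text)
  steps = 0
  alive = alternatives.dup
  text.each_char.with_index do |ch, i|
    steps += 1
    alive = alive.select { |alt| alt[i] == ch }
    return [false, steps] if alive.empty?
    return [true, steps] if alive.any? { |alt| alt.length == i + 1 }
  end
  [false, steps]
end

p regex_directed(ALTERNATIVES, "abbb") # => [true, 7]
p text_directed(ALTERNATIVES, "abbb")  # => [true, 4]
```

In this toy model the regex-directed matcher spends 3 comparisons on the failed `abab` attempt before succeeding with `abbb`, while the text-directed one never revisits a character.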
  13. EXERCISE 1

    Match HTML tags.
    1. Add text inside/after the tag and see if the step count changes; see the debugger
    2. Add another tag, see the result
    3. Two solutions: limit repetition (.*?), limit scope ([^>])
    4. Try both, add text inside/after the tag, see how the step count changes
    https://link.do/spa-page
    https://link.do/spa-greedy
  14. QUANTIFIERS (REPETITION)

    .*  greedy
    .*? lazy

    /<.*>/ =~ "<br /> regexp text "
    => "<br />" # so far so good
    /<.*>/ =~ "<abbr> regexp text </abbr>"
    => "<abbr> regexp text </abbr>" # hmmm...
    /<[^>]*>/ =~ "<abbr> regexp text </abbr>"
    => "<abbr>" # great!
    /<.*?>/ =~ "<abbr> regexp text </abbr>"
    => "<abbr>" # great!
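
The matches above can be reproduced in Ruby. Note that `=~` itself returns the index of the match, not the matched text, so this sketch uses `String#[]` with a regexp to extract the matched substring. The constant names are only illustrative.

```ruby
TEXT = "<abbr> regexp text </abbr>"

GREEDY_MATCH  = TEXT[/<.*>/]     # .* runs to the end, then backtracks to the last ">"
LAZY_MATCH    = TEXT[/<.*?>/]    # .*? stops at the first ">"
NEGATED_MATCH = TEXT[/<[^>]*>/]  # [^>]* cannot run past a ">"

p GREEDY_MATCH          # => "<abbr> regexp text </abbr>"
p LAZY_MATCH            # => "<abbr>"
p NEGATED_MATCH         # => "<abbr>"
p TEXT.scan(/<[^>]*>/)  # => ["<abbr>", "</abbr>"]
```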
  15. CONTEXT IS KING

    text                              <.*>       <.*?>      <[^>]*>
    "<br/> some really long text"     27 steps   7 steps    4 steps
    "some really long text <br/> "    5 steps    7 steps    4 steps
  16. EXERCISE 2

    Match numbers with units ending with semicolon: 123cm; 32kg; 1m3;
    1. Try to add digits to the number
    2. Remove semicolon - see steps and backtracking in debugger
    3. Replace greedy quantifier with the possessive one (++), see steps in debugger
    https://link.do/spa-possessive
  17. ALMOST MATCHED

    possessive .++
    numbers with units: 123cm; 32kg; 1m3;
    /^(\d+)(\w+);/
    what if it almost matches? 123cm 32kg 1m3
  18. 19 STEPS, 9 STEPS

    /^(\d+)(\w+);/         # 19 steps
    /^(\d++)(\w+);/        # 9 steps (Java, Ruby)
    /^(?>\d+)(\w+);/       # (.NET)
    /^(?=(\d+))\1(\w+);/   # (JavaScript)
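
A possessive quantifier or atomic group does more than save steps: because it never gives characters back, it can also change what matches. The sketch below uses the patterns from this slide (Ruby supports both `++` and `(?>...)`); the `captures` helper is hypothetical and only returns the captured groups or nil.

```ruby
GREEDY     = /^(\d+)(\w+);/
POSSESSIVE = /^(\d++)(\w+);/
ATOMIC     = /^(?>\d+)(\w+);/

# Return the captured groups, or nil when the pattern does not match.
def captures(regex, text)
  match = regex.match(text)
  match && match.captures
end

# A regular input: all variants agree on what is matched.
p captures(GREEDY, "123cm;")      # => ["123", "cm"]
p captures(POSSESSIVE, "123cm;")  # => ["123", "cm"]
p captures(ATOMIC, "123cm;")      # => ["cm"]

# Almost matching (no semicolon): all variants fail; they differ only
# in how much backtracking happens before they give up.
p captures(GREEDY, "123cm")       # => nil
p captures(POSSESSIVE, "123cm")   # => nil

# Digits only: greedy \d+ gives back "3" so that \w+ can match it;
# the possessive and atomic variants refuse to give it back.
p captures(GREEDY, "123;")        # => ["12", "3"]
p captures(POSSESSIVE, "123;")    # => nil
p captures(ATOMIC, "123;")        # => nil
```

The last case is worth a test of its own when switching a pattern to a possessive quantifier.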
  19. QUANTIFIERS

    performance depends on context
    greedy .* .+ - performance depends on the text after the match
    lazy .*? .+? - performance depends on the length of the match
    possessive .*+ .++ - no backtracking
    positive tests: matching substring in various positions in the test string
    negative tests: test string very similar to the desired one
  20. EXERCISE 3

    Optimize these two expressions:
    1. Match the Tea column in CSV text: https://link.do/spa-csv
    2. (*) Find some CSS classes related to product: product-size, product-column, product-info and product ids that have digits 1, 2, 3: https://link.do/spa-css
  21. REPETITION INSIDE REPETITION (ex. 4)

    ...but who writes such regexps?
    /(a+)*b/
    aaaaaaaaaab
    aaaaaaaaaa
    https://link.do/spa-exp
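
Why is `/(a+)*b/` so expensive on a string of a's with no b? A backtracking engine may, in the worst case, try every way of dividing the run of a's between iterations of the outer group before it can report failure. The sketch below only counts those divisions (it does not run a regex); the function name `ways_to_split` is hypothetical. Engines with optimizations such as memoization may avoid this blow-up, so treat it as a model of the worst case, not a measurement.

```ruby
# Number of ways to cut a run of `length` characters into consecutive
# non-empty groups, i.e. the ways (a+)* can consume the whole run.
def ways_to_split(length)
  return 1 if length.zero?
  # Pick the size of the first group, then split the rest recursively.
  (1..length).sum { |first| ways_to_split(length - first) }
end

COUNTS = [1, 5, 10, 15].map { |n| [n, ways_to_split(n)] }
p COUNTS # => [[1, 1], [5, 16], [10, 512], [15, 16384]]
```

The count doubles with every extra character (2^(n-1) for n characters), which is why adding a few a's to the failing string makes such a difference.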
  22. EXERCISE 5 (*)

    Arithmetic operations. You have a regex matching simple arithmetic operations. Allowed: two numbers separated with a plus or minus sign, ending with an equals sign, e.g. 12+34= or 32121-23=
    1. Enhance the regex to allow 3 numbers and 2 signs (e.g. 12+322-1=).
    2. Enhance the regex to allow any length of the operation (e.g. 12+322-1+223-2323+...=).
    3. Remove the equals sign from the test string, check steps in debugger.
    https://link.do/spa-operations
  23. ARITHMETIC OPERATIONS

    320-12=
    430-32+1=

    pattern = /\d+[-+]\d+=/    # v1.0
    pattern = /(\d+[-+])+=/    # v2.0-rc
    pattern = /(\d+[-+]?)+=/   # v2.0
    pattern = /(\d+|[-+])+=/

    "320-12="                # 12 steps
    "32-12+230="             # 18 steps
    "32+12-320-2132+32123="  # 28 steps
    "32+12-320-2132+32123"   # 95 854 steps
  24. OPTIMIZATION: UNROLLING THE LOOP

    extracting the mandatory part: \d+[-+]\d+
    repeating the long optional part: ([-+]\d+)*

    pattern = /(\d+[-+]?)+=/          # v2.0
    pattern = /\d+[-+]\d+([-+]\d+)*=/ # v2.0.1
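
A quick check of the two patterns in Ruby, as a sketch. The sample strings come from the slides; `12=` and `12+=` are extra inputs added here for illustration. It shows that the unrolled version accepts the same well-formed operations but is also stricter, so unrolling is not always a pure performance change.

```ruby
V2       = /(\d+[-+]?)+=/           # v2.0
UNROLLED = /\d+[-+]\d+([-+]\d+)*=/  # v2.0.1

VALID = ["320-12=", "430-32+1=", "32+12-320-2132+32123="].freeze

# Both versions accept well-formed operations...
p VALID.all? { |s| V2.match?(s) && UNROLLED.match?(s) }  # => true

# ...and both reject the operation without the equals sign.
p V2.match?("32+12-320-2132+32123")        # => false
p UNROLLED.match?("32+12-320-2132+32123")  # => false

# The unrolled version requires at least two numbers and a digit after
# every sign, so it rejects inputs that v2.0 lets through.
p V2.match?("12=")         # => true
p UNROLLED.match?("12=")   # => false
p V2.match?("12+=")        # => true
p UNROLLED.match?("12+=")  # => false
```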
  25. COPY-PASTING

    RegExLib, id=1757 (email validation):
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$

    OWASP Validation Regex Repository, Java Classname:
    ^(([a-z])+.)+[A-Z]([a-z])+$
    'aaaaaaaaaaaaaaa'
  26. PATTERNS TO AVOID

    overlapping scopes: /\d+\w*/
    overlapping alternatives: /(\d+|\w+)/
    remote overlapping quantifiers: /.*-some text-.*;/
    nested quantifiers: /(\d+)*\w/
  27. COMPUTERS ARE NOW SO FAST THAT EVEN HAVING 100 000 STEPS WON'T MATTER

    a couple of applications
  28. COUNTING NUMBER OCCURRENCES IN TEXT

    piece of cake, right?

    # number
    # 1, 3243, 4323
    pattern = /\d+/
    # number with decimal part and minus
    # -1, 1, 32.32, -2.2324
    pattern = /(-?\d+(\.\d+)?)/
    # number with decimal part (dot or comma)
    # -23,23 4323.23
    pattern = /(-?\d+([.,]\d+)?)/
  29. COUNTING NUMBER OCCURRENCES IN TEXT

    # number with decimal part and thousands separator
    # -21,321,321.1111 433.233,12
    greedy = /(-?(\d+[,.]?)+)/
    # same, lazy
    lazy = /(-?(\d+?[,.]?)+?)/
    # same, limited backtracking
    unrolled = /(-?(\d+[,.])*\d+)/
    # same, possessive
    possessive = /(-?(\d++[,.]?)++)/
  30. TESTS

                  greedy      lazy        unrolled       possessive
    Ruby          70.4 i/s    65.9 i/s    1,296.1 i/s    1,187.0 i/s
    JavaScript    191 i/s     220 i/s     6,689 i/s      -

    greedy = /(-?(\d+[,.]?)+)/
    lazy = /(-?(\d+?[,.]?)+?)/
    unrolled = /(-?\d+([,.]\d+)*)/
    possessive = /(-?(\d++[,.]?)++)/

    string = "..." # ~11000 chars
    def count(string, regex)
      # count how many times regex is matched on a string
    end
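
One possible way to fill in `count`, as a sketch: the slide leaves the body out, so the `String#scan` implementation below is an assumption, and the short `SAMPLE` string is made up for illustration (the benchmark used a text of about 11000 characters that is not reproduced here).

```ruby
GREEDY     = /(-?(\d+[,.]?)+)/
UNROLLED   = /(-?\d+([,.]\d+)*)/
POSSESSIVE = /(-?(\d++[,.]?)++)/

# Count how many times regex is matched on a string.
# scan returns one entry per non-overlapping match.
def count(string, regex)
  string.scan(regex).size
end

# Hypothetical sample text, not the benchmark input.
SAMPLE = "paid -21,321,321.1111 then 433.233,12 and 7 more"

p count(SAMPLE, GREEDY)      # => 3
p count(SAMPLE, UNROLLED)    # => 3
p count(SAMPLE, POSSESSIVE)  # => 3
```

Before comparing the speed of two variants it is worth checking, as here, that they return the same count on the same input.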
  31. LOG SEARCH

    NASA webserver, 196 MB
    199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

    /.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/                                 # 5m18,298s
    /.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/                      # 0m10,013s
    /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/          # 0m5,897s
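
For reference, this is what the fastest pattern extracts from the sample log line. A minimal sketch: the `parse` helper and the constant names are hypothetical, and the timings on the slide come from the full 196 MB file, not from this snippet.

```ruby
LOG_LINE = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# The third (fastest) pattern: every field uses a narrow character
# class (\S, [^\]], \d) instead of .* so there is little to backtrack over.
LOG_PATTERN = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

# Return the captured fields, or nil when the line does not match.
def parse(line)
  match = LOG_PATTERN.match(line)
  match && match.captures
end

p parse(LOG_LINE)
# => ["01/Jul/1995:00:00:01 -0400", "GET", "/history/apollo/", "HTTP/1.0", "200", "6245"]
p parse("not a log line") # => nil
```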
  32. REDOS - WHAT COULD REALLY GO WRONG?

    Regular expression Denial of Service
  33. HOW TO STOP A FRONTEND APP?

    vue.js
    https://link.do/spa-vue
  34. HOW TO TAKE 100% CPU?

    Witaj! → ¡Hola! → Salut !

    # Ruby
    # Function source ::Typography.to_html_french
    # Put thin space before punctuation
    text.gsub(/(\s|)+([!?;]+(\s|\z))/, '&thinsp;\2\3')

    # Data
    # <- 58 space chars ->
    GET /wp-login.php HTTP/1.1 69
    GET /show.aspx HTTP/1.1 15

    customer of Discourse, described by Sam Saffron
  35. GOOD PRACTICES

    .* => [^X]*
    .*? => [^X]*
    (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    (,|;|\.) => [,.;]
    (A*|B?)+ => A+(BA*)*
    \w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
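
When applying rewrites like these, it helps to check that the old and new pattern still find the same matches. The sketch below does this for two of the rewrites; the `same_matches?` helper and the sample strings are hypothetical, and agreement on a handful of samples is a sanity check, not a proof of equivalence.

```ruby
SAMPLES = ["a,b;c.d", "pre-d1 pre-e2 pre-x3", "no match here"].freeze

# True when both patterns find exactly the same matches in every sample.
# Non-capturing groups (?:...) make scan return the whole match.
def same_matches?(before, after, samples = SAMPLES)
  samples.all? { |s| s.scan(before) == s.scan(after) }
end

# (,|;|\.) => [,.;]
p same_matches?(/(?:,|;|\.)/, /[,.;]/)                 # => true

# (pre-d1|pre-e2|...) => pre-(d1|e2|...)
p same_matches?(/(?:pre-d1|pre-e2)/, /pre-(?:d1|e2)/)  # => true

# A rewrite that is not equivalent is caught by the same check.
p same_matches?(/(?:,|;|\.)/, /[,;]/)                  # => false
```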
  36. CONCLUSIONS

    regular expressions: theory (formal languages) and practice (programming languages) differ
    every engine is different - check yours
    main issues: overlapping scopes or alternatives, nested quantifiers
    tests: matching, not matching, almost matching; various positions in text
  37. REFERENCE

    Mastering Regular Expressions, 3rd Edition, Jeffrey Friedl, 2009
    and next parts (1-4)
    thanks to
    http://www.rexegg.com
    https://www.regular-expressions.info
    https://regex101.com/
    Russ Cox
    test app - original Kamil Szubrycht
    OWASP ReDoS
    Loggly: 5 techniques
    Loggly: regexes bad better...
    Katafrakt: Regular expression how do they work?