Writing slow regexp is easier than you think (and want it to be)

Although regular expressions are widely used in software development, few developers think about their performance. The sad truth is that a badly written regexp can severely degrade application performance, both on the server side and in the browser. How do you write a regular expression that is not only correct but also efficient?

Slides for the presentation given at the SPA Software in Practice conference, London 2018 (https://www.spaconference.org/spa2018/).

mrzasa

July 02, 2018

Transcript

  1. WRITING SLOW REGEXP IS EASIER THAN YOU THINK (AND WANT IT TO BE)

    MACIEK RZĄSA, TEXTMASTER. SPA Software London, 2nd July 2018. @mjrzasa
  2. None
  3. None
  4. None
  5. "I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem."

    Jeff Atwood, 2006
  6. JEFF ATWOOD DOESN'T WORRY ABOUT REGEX PERFORMANCE, WHY SHOULD I?

    <input id="operation" pattern='(\d+[+-]?)+='>
    https://link.do/spa-input https://link.do/spa-page
  7. WHAT'S NEXT

    regex engines: theory & internals
    performance of basic regex elements
    examples & applications
    what could go wrong?
    can regex be fast?
    https://link.do/spa-page
  8. RUBY DEVELOPER @ TEXTMASTER

    at work:
    translation solution available online
    network of expert translators, >50 langs
    SaaS platform, API, integrations
    text processing
    Ruby, Java, mongodb, elastic search
    after work:
    Rzeszów Ruby User Group (rrug.pl), Rzeszów University of Technology
    software that matters, agile
  9. HOW DO REGEXPS REALLY WORK?

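  10. RUBY, JAVA, JAVASCRIPT

    # Ruby
    pattern = /<.*>/
    pattern.match("text <br>")

    // Java
    Pattern p = Pattern.compile("<.*>");
    Matcher m = p.matcher("text <br>");
    // find() searches for a match anywhere in the input, like the Ruby and
    // JavaScript calls; matches() would require the whole input to match
    boolean b = m.find();

    // JavaScript
    var pattern = /<.*>/;
    var results = pattern.exec("text <br>");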
  11. THEORY...

    regular grammar: A -> abB, B -> bb, B -> ab
    regular expression: abab|abbb
    finite automaton: [figure: automaton for abab|abbb]
    source: https://swtch.com/~rsc/regexp/regexp1.html
  12. ...MEETS PRACTICE

    formal languages theory: a* a+ a|b a? a(a|b)
    popular programming languages: a* a+ a|b a? a(a|b) a*? \d \W (?<!b)a \1 (?R)...
    /(\w+)\1/ -> papa WikiWiki
    /\(((?R)|\w+)\)/ -> (((12)))
  13. TWO TYPES OF REGEX ENGINES

  14. EXAMPLE

    /abab|abbb/ =~ 'abbb'
    [figure: automaton for abab|abbb]
    source of figures on this and the next few slides: https://swtch.com/~rsc/regexp/regexp1.html
  15. Text-directed

    abab|abbb
    [figures: automaton for abab|abbb with the active states highlighted at each step]
    input position after each step: •abbb, a•bbb, ab•bb, abb•b, abbb•
  16. Regex-directed

    abab|abbb
    [figures: automaton for abab|abbb with the current state highlighted at each step]
    input position after each step: •abbb, •abbb, a•bbb, ab•bb, •abbb, a•bbb, ab•bb, abb•b, abbb•
    failure, backtracking: the first alternative (abab) fails at ab•bb, so the engine returns to •abbb and tries the second one (abbb)
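The contrast between slides 15 and 16 can be sketched as a toy model in Ruby. This is not how any real engine is implemented: the helper names `regex_directed_steps` and `text_directed_steps` are made up for illustration, both handle only an alternation of literal strings anchored at the start of the text, and a "step" here means one character comparison (regex-directed) or one character read (text-directed).

```ruby
# Regex-directed sketch: try each alternative in turn; on a mismatch,
# give up on that alternative and restart from the beginning of the text.
def regex_directed_steps(alternatives, text)
  steps = 0
  alternatives.each do |alt|
    matched = true
    alt.each_char.with_index do |ch, i|
      steps += 1
      if text[i] != ch
        matched = false   # failure: backtrack to the next alternative
        break
      end
    end
    return [true, steps] if matched
  end
  [false, steps]
end

# Text-directed sketch: read every input character once and keep the set
# of alternatives that are still consistent with what has been read.
def text_directed_steps(alternatives, text)
  alive = alternatives.dup
  steps = 0
  text.each_char.with_index do |ch, i|
    steps += 1
    alive = alive.select { |alt| alt[i] == ch }
    break if alive.empty?
    return [true, steps] if alive.any? { |alt| alt.length == i + 1 }
  end
  [false, steps]
end

p regex_directed_steps(%w[abab abbb], "abbb")  # => [true, 7]
p text_directed_steps(%w[abab abbb], "abbb")   # => [true, 4]
```

In this model the regex-directed walk pays three comparisons for the failed `abab` attempt before it succeeds with `abbb`, while the text-directed walk never revisits a character.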
  17. None
  18. WHAT DO WE ALREADY KNOW?

    regexps: a separate programming language
    two types of regexp engines (virtual machines): text-directed and regex-directed
    performance depends on the number of steps and backtracks
  19. PERFORMANCE ANALYSIS

  20. EXERCISE 1

    Match HTML tags.
    1. Add text inside/after the tag, see if the step count changes; see the debugger
    2. Add another tag, see the result
    3. Two solutions: limit repetition .*?, limit scope [^>]
    4. Try both, add text inside/after the tag, see how the step count changes
    https://link.do/spa-page https://link.do/spa-greedy
  21. QUANTIFIERS (REPETITION)

    .* greedy, .*? lazy
    /<.*>/ =~ "<br/> regexp text" => "<br/>" # so far so good
    /<.*>/ =~ "<abbr> regexp text </abbr>" => "<abbr> regexp text </abbr>" # hmmm...
    /<[^>]*>/ =~ "<abbr> regexp text </abbr>" => "<abbr>" # great!
    /<.*?>/ =~ "<abbr> regexp text </abbr>" => "<abbr>" # great!
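The three variants from this slide can be tried directly in Ruby. A minimal sketch: it uses `String#[]` with a regexp to get the matched text (the slide's `=~` notation returns the match position in real Ruby), and the variable names are only illustrative.

```ruby
text = "<abbr> regexp text </abbr>"

greedy  = text[/<.*>/]     # .* runs to the end, then backtracks to the last ">"
lazy    = text[/<.*?>/]    # .*? expands one character at a time until the first ">"
negated = text[/<[^>]*>/]  # [^>]* cannot step over a ">" at all

puts greedy   # <abbr> regexp text </abbr>
puts lazy     # <abbr>
puts negated  # <abbr>

# With the scope limited, every tag is matched separately.
all_tags = text.scan(/<[^>]*>/)
p all_tags    # ["<abbr>", "</abbr>"]
```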
  22. SHOULD WE BE LAZY?

    "<abbr> regexp text </abbr>"
    /<[^>]*>/
    /<.*?>/
  23. CONTEXT IS THE KING

    "<br/> some really long text"
    <.*>     27 steps
    <.*?>    7 steps
    <[^>]*>  4 steps

    "some really long text <br/> "
    <.*>     5 steps
    <.*?>    7 steps
    <[^>]*>  4 steps
  24. EXERCISE 2

    Match numbers with units ending with a semicolon: 123cm; 32kg; 1m3;
    1. Try to add digits to the number
    2. Remove the semicolon, see steps and backtracking in the debugger
    3. Replace the greedy quantifier with the possessive one (++), see steps in the debugger
    https://link.do/spa-possessive
  25. ALMOST MATCHED

    possessive .++
    numbers with units
    what if it almost matches?
    123cm; 32kg; 1m3;
    /^(\d+)(\w+);/
    123cm 32kg 1m3
  26. 19 STEPS / 9 STEPS

    /^(\d+)(\w+);/ # 19 steps
    /^(\d++)(\w+);/ # 9 steps (Java, Ruby)
    /^(?>\d+)(\w+);/ # (.Net)
    /^(?=(\d+))\1(\w+);/ # (JavaScript)
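Ruby has no built-in step counter, but the behavioural difference behind those numbers can still be observed. The sketch below assumes Ruby 2.4 or later (for `match?`). The short patterns `/^\d+\d/` and `/^\d++\d/` are not from the slides; they are added only to make the "no giving back" rule visible.

```ruby
greedy     = /^(\d+)(\w+);/
possessive = /^(\d++)(\w+);/

# On matching input both patterns agree.
greedy_captures     = greedy.match("123cm;").captures
possessive_captures = possessive.match("123cm;").captures
p greedy_captures      # ["123", "cm"]
p possessive_captures  # ["123", "cm"]

# On almost-matching input both fail; the greedy one first retries with
# shorter and shorter \d+ matches, the possessive one does not.
p greedy.match?("123cm")      # false
p possessive.match?("123cm")  # false

# The difference in isolation: a greedy \d+ gives back the last digit so
# that the trailing \d can match; possessive and atomic versions never do.
p "123".match?(/^\d+\d/)      # true
p "123".match?(/^\d++\d/)     # false
p "123".match?(/^(?>\d+)\d/)  # false
```

The possessive form is only safe when whatever follows the quantifier can never need characters that the quantifier has already consumed.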
  27. QUANTIFIERS

    performance depends on context
    greedy .* .+ : performance depends on the text after the match
    lazy .*? .+? : performance depends on the length of the match
    possessive .*+ .++ : no backtracking
    positive tests: matching substring in various positions in the test string
    negative tests: test string very similar to the desired one
  28. EXERCISE 3

    Optimize these two expressions:
    1. Match the Tea column in CSV text: https://link.do/spa-csv
    2. (*) Find some CSS classes related to product: product-size, product-column, product-info, and product ids that have digits 1, 2, 3: https://link.do/spa-css
  29. IT'S ALL INTERESTING

    But that's not the reason we are here
  30. CATASTROPHIC BACKTRACKING

  31. REPETITION INSIDE REPETITION (ex. 4)

    /(a+)*b/
    aaaaaaaaaab
    aaaaaaaaaa
    ...but who writes such regexps?
    https://link.do/spa-exp
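Why does the string without the final `b` hurt so much? With `(a+)*`, a run of n `a` characters can be divided among the repetitions of the group in many different ways, and a backtracking engine may have to try all of them before it can report a failure. The sketch below only counts those divisions; the helper name `splits` is hypothetical, and real step counts depend on the engine and its optimizations.

```ruby
# Number of ways to cut a run of n characters into consecutive non-empty
# pieces, i.e. the ways (a+)* can distribute the run over its iterations.
def splits(n)
  return 1 if n.zero?
  (1..n).sum { |first| splits(n - first) }
end

(1..10).each { |n| puts "#{n} a's: #{splits(n)} ways" }
# the last line printed is "10 a's: 512 ways"
```

The count doubles with every extra character (2^(n-1) for n >= 1), which is the exponential growth behind catastrophic backtracking.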
  32. EXERCISE 5 (*)

    Arithmetic operations. You have a regex matching simple arithmetic operations. Allowed: two numbers separated with a plus or minus sign, ending with an equals sign, e.g. 12+34= or 32121-23=
    1. Enhance the regex to allow 3 numbers and 2 signs (e.g. 12+322-1=).
    2. Enhance the regex to allow any length of the operation (e.g. 12+322-1+223-2323+...=).
    3. Remove the equals sign from the test string, check steps in the debugger.
    https://link.do/spa-operations
  33. ARITHMETIC OPERATIONS

    320-12=
    430-32+1=
    pattern = /\d+[-+]\d+=/ # v1.0
    pattern = /(\d+[-+])+=/ # v2.0-rc
    pattern = /(\d+[-+]?)+=/ # v2.0
    pattern = /(\d+|[-+])+=/
    "320-12=" # 12 steps
    "32-12+230=" # 18 steps
    "32+12-320-2132+32123=" # 28 steps
    "32+12-320-2132+32123" # 95 854 steps
  34. OPTIMIZATION: UNROLLING THE LOOP

    extracting the mandatory part: \d+[-+]\d+
    repeating the long optional part: ([-+]\d+)*
    pattern = /(\d+[-+]?)+=/ # v2.0
    pattern = /\d+[-+]\d+([-+]\d+)*=/ # v2.0.1
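A quick Ruby check of the two patterns from this slide on the test strings from slide 33. It compares only what the patterns accept, not how many steps they take; the variable names `nested`, `unrolled`, `matching` and `almost` are illustrative.

```ruby
nested   = /(\d+[-+]?)+=/            # v2.0: optional sign inside a repeated group
unrolled = /\d+[-+]\d+([-+]\d+)*=/   # v2.0.1: mandatory part, then repeated "sign + number"

matching = ["320-12=", "32-12+230=", "32+12-320-2132+32123="]
matching.each do |op|
  puts "#{op}: nested=#{nested.match?(op)} unrolled=#{unrolled.match?(op)}"
end

# Almost matching: the equals sign is missing. In the unrolled version every
# digit and every sign can be consumed in only one way, so failure is cheap.
almost = "32+12-320-2132+32123"
puts "almost: unrolled=#{unrolled.match?(almost)}"

# The two versions are not strictly equivalent: the unrolled one insists on
# at least two numbers, which is what the exercise asks for.
puts "12=: nested=#{nested.match?('12=')} unrolled=#{unrolled.match?('12=')}"
```

Note that recent Ruby releases apply their own optimizations to some patterns, so the nested version may not be as slow there as the step counts from the online debugger suggest.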
  35. COPY-PASTING

    RegExLib, id=1757 (email validation):
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
    OWASP Validation Regex Repository, Java Classname:
    ^(([a-z])+.)+[A-Z]([a-z])+$
    'aaaaaaaaaaaaaaa'
  36. PATTERNS TO AVOID

    overlapping scopes: /\d+\w*/
    overlapping alternatives: /(\d+|\w+)/
    remote overlapping quantifiers: /.*-some text-.*;/
    nested quantifiers: /(\d+)*\w/
  37. None
  38. COMPUTERS ARE NOW SO FAST THAT EVEN HAVING 100 000 STEPS WON'T MATTER

    a couple of applications
  39. COUNTING NUMBER OCCURRENCES IN TEXT

    piece of cake, right?
    # number
    # 1, 3243, 4323
    pattern = /\d+/
    # number with decimal part and minus
    # -1, 1, 32.32, -2.2324
    pattern = /(-?\d+(\.\d+)?)/
    # number with decimal part (dot or comma)
    # -23,23 4323.23
    pattern = /(-?\d+([.,]\d+)?)/
  40. COUNTING NUMBER OCCURRENCES IN TEXT

    # number with decimal part and thousands separator
    # -21,321,321.1111 433.233,12
    greedy = /(-?(\d+[,.]?)+)/
    # same, lazy
    lazy = /(-?((\d+?[,.])+?))/
    # same, limited backtracking
    unrolled = /(-?(\d+[,.])*\d+)/
    # same, possessive
    possessive = /(-?((\d++[,.])++))/
  41. None
  42. TESTS

                greedy     lazy       unrolled      possessive
    Ruby        70.4 i/s   65.9 i/s   1,296.1 i/s   1,187.0 i/s
    JavaScript  191 i/s    220 i/s    6,689 i/s     -

    greedy = /(-?(\d+[,.]?)+)/
    lazy = /(-?(\d+?[,.]?)+?)/
    unrolled = /(-?\d+([,.]\d+)*)/
    possessive = /(-?(\d++[,.]?)++)/
    string = "..." # ~11000 chars
    def count(string, regex)
      # count how many times regex is matched on a string
    end
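The body of `count` is not shown on the slide. One plausible way to fill it in, assuming that non-overlapping matches are what is being counted, is `String#scan`. The short `sample` string below is made up for illustration; the benchmark on the slide used a text of about 11000 characters that is not reproduced here.

```ruby
# Assumed implementation: count non-overlapping matches with String#scan.
def count(string, regex)
  string.scan(regex).size
end

greedy     = /(-?(\d+[,.]?)+)/
unrolled   = /(-?\d+([,.]\d+)*)/
possessive = /(-?(\d++[,.]?)++)/

sample = "total 1,000.25 for 3 items, change -2.5"

puts count(sample, greedy)      # 3
puts count(sample, unrolled)    # 3
puts count(sample, possessive)  # 3
```

To reproduce a comparison in iterations per second, the same calls could be wrapped in a benchmarking tool of your choice; the figures will differ between Ruby versions and machines.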
  43. LOG SEARCH

    NASA webserver, 196 MB
    199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
    /.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/ # 5m18,298s
    /.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/ # 0m10,013s
    /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/ # 0m5,897s
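The fastest pattern from this slide, applied to the sample log line in Ruby. The constant names and the names given to the captured fields (`timestamp`, `verb`, and so on) are only illustrative; the slide does not name them.

```ruby
LOG_LINE = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Every field is limited to the characters it can actually contain,
# so the engine has almost nothing to backtrack over.
PATTERN = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

match = PATTERN.match(LOG_LINE)
timestamp, verb, path, protocol, status, bytes = match.captures

puts timestamp  # 01/Jul/1995:00:00:01 -0400
puts verb       # GET
puts path       # /history/apollo/
puts protocol   # HTTP/1.0
puts status     # 200
puts bytes      # 6245
```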
  44. REDOS - WHAT COULD REALLY GO WRONG?

    Regular expression Denial of Service
  45. HOW TO STOP A FRONTEND APP?

    vue.js https://link.do/spa-vue
  46. HOW TO TAKE 100% CPU?

    Witaj! → ¡Hola! → Salut !
    # Ruby
    # Function source ::Typography.to_html_french
    # Put thin space before punctuation
    text.gsub(/(\s|)+([!?;]+(\s|\z))/, '&thinsp;\2\3')
    # Data
    # <-58 space chars -> GET /wp-login.php HTTP/1.1 69 GET /show.aspx HTTP/1.1 15
    customer of Discourse, described by Sam Saffron
  47. LET THE USERS WRITE REGEXPS!

    feature request
    (<.*>\s*)*
  48. None
  49. None
  50. GOOD PRACTICES

    .* => [^X]*
    .*? => [^X]*
    (pre-d1|pre-e2|...) => pre-(d1|e2|...)
    (,|;|\.) => [,.;]
    (A*|B?)+ => A+(BA*)*
    \w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
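Two of these rewrites, factoring out a common prefix and replacing a single-character alternation with a character class, can be checked in Ruby to confirm that they find the same matches. The sample text is invented for this sketch, and non-capturing groups `(?:...)` are used so that `scan` returns plain strings.

```ruby
text = "pre-d1, pre-e2; pre-x9."

# (,|;|\.) => [,.;]
separators_alternation = text.scan(/(?:,|;|\.)/)
separators_class       = text.scan(/[,.;]/)
p separators_alternation  # [",", ";", "."]
p separators_class        # [",", ";", "."]

# (pre-d1|pre-e2|...) => pre-(d1|e2|...)
prefixed = text.scan(/(?:pre-d1|pre-e2)/)
factored = text.scan(/pre-(?:d1|e2)/)
p prefixed  # ["pre-d1", "pre-e2"]
p factored  # ["pre-d1", "pre-e2"]
```

The results are the same; the rewritten forms simply give a backtracking engine fewer alternatives to try at each position.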
  51. CONCLUSIONS

    regular expressions: theory (formal languages) and practice (programming languages) differ
    every engine is different: check yours
    main issues: overlapping scopes or alternatives, nested quantifiers
    tests: matching, not matching, almost matching; various positions in text
  52. REFERENCE

    Mastering Regular Expressions, 3rd Edition, Jeffrey Friedl, 2009 and next parts (1-4) thanks to http://www.rexegg.com https://www.regular-expressions.info https://regex101.com/ Russ Cox test app - original Kamil Szubrycht OWASP ReDoS Loggly: 5 techniques Loggly: regexes bad better... Katafrakt: Regular expression how do they work?
  53. None
  54. WHAT DO YOU THINK? LET'S TALK!

    @MJRZASA
  55. None