Slide 1

Slide 1 text

WRITING SLOW REGEXP IS EASIER THAN YOU THINK (AND WANT IT TO BE)
MACIEK RZĄSA
TEXTMASTER
SPA Software London, 2nd July 2018
@mjrzasa

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem. Jeff Atwood, 2006

Slide 6

Slide 6 text

JEFF ATWOOD DOESN'T WORRY ABOUT REGEX PERFORMANCE, WHY SHOULD I?
https://link.do/spa-input
https://link.do/spa-page

Slide 7

Slide 7 text

WHAT'S NEXT
regex engines: theory & internals
performance of basic regex elements
examples & applications
what could go wrong?
can regex be fast?
https://link.do/spa-page

Slide 8

Slide 8 text

RUBY DEVELOPER @ TEXTMASTER
at work: translation solution available online; network of expert translators, >50 langs; SaaS platform, API, integrations; text processing; Ruby, Java, MongoDB, Elasticsearch
after work: Rzeszów Ruby User Group (rrug.pl), Rzeszów University of Technology; software that matters, agile

Slide 9

Slide 9 text

HOW DO REGEXPS REALLY WORK?

Slide 10

Slide 10 text

RUBY
pattern = /<.*>/
pattern.match("text<br>")
JAVA
Pattern p = Pattern.compile("<.*>");
Matcher m = p.matcher("text<br>");
boolean b = m.matches();
JAVASCRIPT
var pattern = /<.*>/;
var results = pattern.exec("text<br>");

Slide 11

Slide 11 text

THEORY...
regular grammar: A -> abB, B -> bb, B -> ab
regular expression: abab|abbb
finite automaton: [figure: automaton accepting abab|abbb]
source: https://swtch.com/~rsc/regexp/regexp1.html

Slide 12

Slide 12 text

...MEETS PRACTICE
formal languages theory: a* a+ a|b a? a(a|b)
popular programming languages: a* a+ a|b a? a(a|b) a*? \d \W ( papa WikiWiki /\(((?R)|\w+)\)/ -> (((12)))

Slide 13

Slide 13 text

TWO TYPES OF REGEX ENGINES

Slide 14

Slide 14 text

EXAMPLE
/abab|abbb/ =~ 'abbb'
[figure: automaton for abab|abbb]
source of figures on this and the next few slides: https://swtch.com/~rsc/regexp/regexp1.html

Slide 15

Slide 15 text

Text-directed abab|abbb
[figure: automaton for abab|abbb, one frame per input position]
•abbb -> a•bbb -> ab•bb -> abb•b -> abbb•

Slide 16

Slide 16 text

Regex-directed abab|abbb
[figure: automaton for abab|abbb, one frame per step]
start: •abbb
first alternative (abab): •abbb -> a•bbb -> ab•bb -> failure, backtracking
second alternative (abbb): •abbb -> a•bbb -> ab•bb -> abb•b -> abbb•
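The two walks above can be imitated with a small Ruby sketch. It is a toy model hard-coded for the alternatives abab and abbb, not the algorithm of any real engine; the method names and the way work is counted (characters read versus character comparisons) are assumptions made for illustration.

```ruby
ALTERNATIVES = %w[abab abbb].freeze

# Text-directed: read each input character once and keep every alternative
# that still agrees with the text read so far.
def text_directed(text)
  alive = ALTERNATIVES.dup
  chars_read = 0
  text.each_char.with_index do |ch, i|
    chars_read += 1
    alive = alive.select { |alt| alt[i] == ch }
    break if alive.empty?
  end
  matched = alive.any? { |alt| alt.length == text.length }
  [matched, chars_read]
end

# Regex-directed: try the alternatives left to right and go back to the
# start of the text (backtrack) whenever one of them fails.
def regex_directed(text)
  comparisons = 0
  ALTERNATIVES.each do |alt|
    ok = alt.each_char.with_index.all? do |ch, i|
      comparisons += 1
      text[i] == ch
    end
    return [true, comparisons] if ok && alt.length == text.length
  end
  [false, comparisons]
end

puts text_directed("abbb").inspect   # [true, 4]
puts regex_directed("abbb").inspect  # [true, 7]
```

In this model the text-directed walk touches each of the four characters once, while the regex-directed walk spends three comparisons on abab before backtracking and four more on abbb.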

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

WHAT DO WE ALREADY KNOW?
regexps - separate programming language
two types of regexp engines (virtual machines) - text-directed, regex-directed
performance dependent on number of steps and backtracks

Slide 19

Slide 19 text

PERFORMANCE ANALYSIS

Slide 20

Slide 20 text

EXERCISE 1
Match HTML tags.
1. Add text inside/after the tag, see if step count changes; see debugger
2. Add another tag, see the result
3. Two solutions: limit repetition .*?, limit scope [^>]
4. Try both, add text inside/after the tag, see step count changes
https://link.do/spa-page
https://link.do/spa-greedy

Slide 21

Slide 21 text

QUANTIFIERS (REPETITION)
.* greedy
.*? lazy
/<.*>/ =~ "<br>regexp text " => "<br>" # so far so good
/<.*>/ =~ " regexp text " => " regexp text " # hmmm...
/<[^>]*>/ =~ " regexp text " => "" # great!
/<.*?>/ =~ " regexp text " => "" # great!

Slide 22

Slide 22 text

SHOULD WE BE LAZY?
" regexp text "
/<[^>]*>/
/<.*?>/
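A minimal Ruby sketch of the three tag patterns discussed here. The sample string is hypothetical (it is not the string from the slide) and is chosen only so that the greedy result visibly differs from the other two.

```ruby
# Hypothetical sample: a word wrapped in two tags, followed by plain text.
html = "<b>regexp</b> text"

greedy  = html[/<.*>/]     # .* runs to the end of the line, then backtracks to the last ">"
lazy    = html[/<.*?>/]    # .*? grows one character at a time until the first ">"
negated = html[/<[^>]*>/]  # [^>]* cannot run past the first ">"

puts greedy   # <b>regexp</b>
puts lazy     # <b>
puts negated  # <b>
```

The lazy and negated-class patterns return the same text here; the difference between them is in how much work the engine does, which is what the step counts measure.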

Slide 23

Slide 23 text

CONTEXT IS KING
"<br>some really long text"
<.*> 27 steps
<.*?> 7 steps
<[^>]*> 4 steps
"some really long text<br>"
<.*> 5 steps
<.*?> 7 steps
<[^>]*> 4 steps

Slide 24

Slide 24 text

EXERCISE 2
Match numbers with units ending with semicolon: 123cm; 32kg; 1m3;
1. Try to add digits to the number
2. Remove semicolon - see steps and backtracking in debugger
3. Replace greedy quantifier with the possessive one ++, see steps in debugger
https://link.do/spa-possessive

Slide 25

Slide 25 text

ALMOST MATCHED
possessive .++
numbers with units: 123cm; 32kg; 1m3;
/^(\d+)(\w+);/
what if it almost matches? 123cm 32kg 1m3

Slide 26

Slide 26 text

19 STEPS
/^(\d+)(\w+);/
9 STEPS
/^(\d++)(\w+);/ # (Java, Ruby)
/^(?>\d+)(\w+);/ # (.Net)
/^(?=(\d+))\1(\w+);/ # (JavaScript)
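A short Ruby sketch of the greedy, possessive and atomic variants. Ruby accepts both the `++` and the `(?>...)` syntax; the helper name `captures` and the third sample input ("123;") are assumptions added for illustration. Ruby does not report step counts, so the sketch shows only what each pattern accepts.

```ruby
GREEDY     = /^(\d+)(\w+);/
POSSESSIVE = /^(\d++)(\w+);/
ATOMIC     = /^(?>\d+)(\w+);/   # the atomic group itself does not capture

# Hypothetical helper: the captured groups, or nil when there is no match.
def captures(pattern, text)
  m = pattern.match(text)
  m && m.captures
end

p captures(GREEDY, "123cm;")      # ["123", "cm"]
p captures(POSSESSIVE, "123cm;")  # ["123", "cm"]
p captures(ATOMIC, "123cm;")      # ["cm"]

# Almost matching: no semicolon, so every variant fails.
p captures(GREEDY, "123cm")       # nil
p captures(POSSESSIVE, "123cm")   # nil

# Possessive quantifiers can also change what is accepted:
p captures(GREEDY, "123;")        # ["12", "3"], \d+ gives one digit back to \w+
p captures(POSSESSIVE, "123;")    # nil, \d++ never gives anything back
```

The last pair is worth checking before swapping a greedy quantifier for a possessive one: the faster pattern is only a safe replacement when the parts that follow never need the characters it refuses to give back.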

Slide 27

Slide 27 text

QUANTIFIERS
performance dependent on context
greedy .* .+ - performance depends on text after match
lazy .*? .+? - performance depends on the length of the match
possessive .*+ .++ - no backtracking
positive tests: matching substring in various positions in test string
negative tests: test string very similar to the desired one

Slide 28

Slide 28 text

EXERCISE 3
Optimize those two expressions:
1. Match Tea column in CSV text: https://link.do/spa-csv
2. (*) Find some CSS classes related to product: product-size, product-column, product-info and product ids that have digits 1,2,3. https://link.do/spa-css

Slide 29

Slide 29 text

IT'S ALL INTERESTING
But that's not the reason we are here

Slide 30

Slide 30 text

CATASTROPHIC BACKTRACKING

Slide 31

Slide 31 text

REPETITION INSIDE REPETITION (ex. 4)
...but who writes such regexps?
/(a+)*b/
aaaaaaaaaab
aaaaaaaaaa
https://link.do/spa-exp
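Why the second string hurts can be sketched with a toy backtracking matcher in Ruby. It is hard-coded for the shape of /(a+)*b/, starts only at the beginning of the text, and counts recursive attempts rather than the steps a real engine or debugger would report; the method names are hypothetical.

```ruby
# Toy backtracking matcher for the shape of /(a+)*b/, anchored at position 0.
# An illustration only: real engines count and optimize differently.
def try_from(text, pos, counter)
  counter[:calls] += 1
  run = 0
  run += 1 while text[pos + run] == "a"
  # One more (a+) iteration: take the longest run first, then give characters back.
  run.downto(1) do |len|
    return true if try_from(text, pos + len, counter)
  end
  # No further iteration: the pattern now needs a literal "b".
  text[pos] == "b"
end

def calls_needed(text)
  counter = { calls: 0 }
  matched = try_from(text, 0, counter)
  [matched, counter[:calls]]
end

p calls_needed("a" * 10 + "b")  # [true, 2]
p calls_needed("a" * 10)        # [false, 1024]
p calls_needed("a" * 11)        # [false, 2048]
```

With the trailing b the first attempt succeeds. Without it, the matcher tries every way of splitting the run of a's into groups, so in this model each extra a doubles the work.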

Slide 32

Slide 32 text

EXERCISE 5 (*)
Arithmetic operations. You have a regex matching simple arithmetic operations. Allowed: two numbers separated with plus or minus sign, ending with equals sign, e.g. 12+34= or 32121-23=
1. Enhance regex to allow 3 numbers and 2 signs (e.g. 12+322-1= ).
2. Enhance regex to allow any length of the operation (e.g. 12+322-1+223-2323+...=).
3. Remove equals sign from the test string, check steps in debugger.
https://link.do/spa-operations

Slide 33

Slide 33 text

ARITHMETIC OPERATIONS
320-12= 430-32+1=
pattern = /\d+[-+]\d+=/ # v1.0
pattern = /(\d+[-+])+=/ # v2.0-rc
pattern = /(\d+[-+]?)+=/ # v2.0
pattern = /(\d+|[-+])+=/
"320-12=" # 12 steps
"32-12+230=" # 18 steps
"32+12-320-2132+32123=" # 28 steps
"32+12-320-2132+32123" # 95 854 steps

Slide 34

Slide 34 text

OPTIMIZATION: UNROLLING THE LOOP
extracting the mandatory part \d+[-+]\d+
repeating the long optional part ([-+]\d+)*
pattern = /(\d+[-+]?)+=/ # v2.0
pattern = /\d+[-+]\d+([-+]\d+)*=/ # v2.0.1
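A small Ruby comparison of the v2.0 and v2.0.1 patterns. The `\A` and `\z` anchors and the last two sample strings are assumptions added here so that the sketch checks whole strings; the slide's patterns are unanchored.

```ruby
# Anchored copies of the slide's patterns (the anchors are an addition).
V2       = /\A(\d+[-+]?)+=\z/              # v2.0: nested quantifiers
UNROLLED = /\A\d+[-+]\d+([-+]\d+)*=\z/     # v2.0.1: unrolled loop

samples = [
  "320-12=",
  "32+12-320-2132+32123=",
  "32+12-320-2132+32123",   # almost matching: no equals sign
  "12+=",                   # hypothetical malformed input
  "12="                     # hypothetical single number
]

samples.each do |s|
  puts format("%-24s v2.0: %-5s v2.0.1: %s", s, V2.match?(s), UNROLLED.match?(s))
end
```

Both patterns accept the well-formed operations and reject the string without the equals sign. The unrolled pattern is also stricter: it rejects "12+=" and "12=", which v2.0 lets through because its sign is optional.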

Slide 35

Slide 35 text

COPY-PASTING
RegExLib, id=1757 (email validation)
^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
OWASP Validation Regex Repository, Java Classname
^(([a-z])+.)+[A-Z]([a-z])+$
'aaaaaaaaaaaaaaa'

Slide 36

Slide 36 text

PATTERNS TO AVOID
overlapping scopes /\d+\w*/
overlapping alternatives /(\d+|\w+)/
remote overlapping quantifiers /.*-some text-.*;/
nested quantifiers /(\d+)*\w/

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

COMPUTERS ARE NOW SO FAST THAT EVEN HAVING 100 000 STEPS WON'T MATTER
a couple of applications

Slide 39

Slide 39 text

COUNTING NUMBER OCCURRENCES IN TEXT
piece of cake, right?
# number
# 1, 3243, 4323
pattern = /\d+/
# number with decimal part and minus
# -1, 1, 32.32, -2.2324
pattern = /(-?\d+(\.\d+)?)/
# number with decimal part (dot or comma)
# -23,23 4323.23
pattern = /(-?\d+([.,]\d+)?)/

Slide 40

Slide 40 text

COUNTING NUMBER OCCURRENCES IN TEXT
# number with decimal part and thousands separator
# -21,321,321.1111 433.233,12
greedy = /(-?(\d+[,.]?)+)/
# same, lazy
lazy = /(-?(\d+?[,.]?)+?)/
# same, limited backtracking
unrolled = /(-?(\d+[,.])*\d+)/
# same, possessive
possessive = /(-?(\d++[,.]?)++)/

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

TESTS
            greedy     lazy       unrolled      possessive
Ruby        70.4 i/s   65.9 i/s   1,296.1 i/s   1,187.0 i/s
JavaScript  191 i/s    220 i/s    6,689 i/s     -
greedy = /(-?(\d+[,.]?)+)/
lazy = /(-?(\d+?[,.]?)+?)/
unrolled = /(-?\d+([,.]\d+)*)/
possessive = /(-?(\d++[,.]?)++)/
string = "..." # ~11000 chars
def count(string, regex)
  # count how many times regex is matched on a string
end
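One way the `count` stub could be filled in, as a hedged sketch: `String#scan` is an assumption about how matches were counted, the sample text is made up (it is not the string behind the numbers above), and the timing loop is a plain wall-clock measurement rather than the iterations-per-second benchmark used for the slide.

```ruby
PATTERNS = {
  greedy:     /(-?(\d+[,.]?)+)/,
  lazy:       /(-?(\d+?[,.]?)+?)/,
  unrolled:   /(-?\d+([,.]\d+)*)/,
  possessive: /(-?(\d++[,.]?)++)/
}.freeze

# Assumed implementation: count non-overlapping matches with String#scan.
def count(string, regex)
  string.scan(regex).size
end

# Made-up sample text of about 12000 characters, three numbers per repetition.
text = "-21,321,321.1111 433.233,12 7 " * 400

PATTERNS.each do |name, regex|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  matches = count(text, regex)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts format("%-10s %6d matches %.4f s", name, matches, elapsed)
end
```

On this sample the greedy, unrolled and possessive patterns find the same 1200 numbers. The lazy pattern has nothing after it that forces it to expand, so it stops at the shortest possible match and reports more, smaller matches; its count is not comparable with the other three.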

Slide 43

Slide 43 text

LOG SEARCH
NASA webserver, 196 MB
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
/.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/ # 5m18,298s
/.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/ # 0m10,013s
/\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/ # 0m5,897s
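For reference, this is what the fastest of the three patterns extracts from the sample log line, as a minimal Ruby sketch. The constant name and the field names are assumptions chosen for readability; the slide does not name the groups.

```ruby
LOG_LINE = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Field names are illustrative labels for the six capture groups.
timestamp, verb, path, protocol, status, bytes = LOG_LINE.match(line).captures

puts timestamp  # 01/Jul/1995:00:00:01 -0400
puts verb       # GET
puts path       # /history/apollo/
puts protocol   # HTTP/1.0
puts status     # 200
puts bytes      # 6245
```

Every group is limited to characters that cannot run into the next delimiter (`[^\]]`, `\S`, `\d`), so a line that fails to match fails quickly instead of retrying many ways of splitting the text.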

Slide 44

Slide 44 text

REDOS - WHAT COULD REALLY GO WRONG?
Regular expression Denial of Service

Slide 45

Slide 45 text

HOW TO STOP A FRONTEND APP?
vue.js
https://link.do/spa-vue

Slide 46

Slide 46 text

HOW TO TAKE 100% CPU?
Witaj! → ¡Hola! → Salut !
# Ruby
# Function source ::Typography.to_html_french
# Put thin space before punctuation
text.gsub(/(\s|)+([!?;]+(\s|\z))/, ' \2\3')
# Data
# <-58 space chars ->
GET /wp-login.php HTTP/1.1 69
GET /show.aspx HTTP/1.1 15
customer of Discourse, described by Sam Saffron

Slide 47

Slide 47 text

LET THE USERS WRITE REGEXPS!
feature request
(<.*>\s*)*

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

GOOD PRACTICES
.* => [^X]*
.*? => [^X]*
(pre-d1|pre-e2|...) => pre-(d1|e2|...)
(,|;|\.) => [,.;]
(A*|B?)+ => A+(BA*)*
\w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
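Two of the rewrites above can be spot-checked in Ruby before they are swapped into real code. This is a sketch under assumptions: the anchors, the sample strings and the helper name `agree?` are all added for illustration, and agreement on a handful of samples is a sanity check, not a proof of equivalence.

```ruby
# Each pair: the original form and its suggested rewrite, both anchored.
PAIRS = {
  /\A(,|;|\.)\z/        => /\A[,.;]\z/,
  /\A(pre-d1|pre-e2)\z/ => /\Apre-(d1|e2)\z/
}.freeze

SAMPLES = [",", ";", ".", ":", "pre-d1", "pre-e2", "pre-d2", "pre-"].freeze

# Hypothetical helper: true when both patterns accept exactly the same samples.
def agree?(before, after, samples)
  samples.all? { |s| before.match?(s) == after.match?(s) }
end

PAIRS.each do |before, after|
  verdict = agree?(before, after, SAMPLES) ? "same" : "different"
  puts "#{before.source} vs #{after.source}: #{verdict} on samples"
end
```

Keeping such a check next to the regex, together with almost-matching inputs, makes it safer to apply the faster form.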

Slide 51

Slide 51 text

CONCLUSIONS
regular expression - theory (formal languages) and practice (programming languages) differ
every engine is different - check yours
main issues: overlapping scopes or alternatives, nested quantifiers
tests: matching, not matching, almost matching; various positions in text

Slide 52

Slide 52 text

REFERENCE
Mastering Regular Expressions, 3rd Edition, Jeffrey Friedl, 2009 and next parts (1-4) thanks to
http://www.rexegg.com
https://www.regular-expressions.info
https://regex101.com/
Russ Cox
test app - original Kamil Szubrycht
OWASP ReDoS
Loggly: 5 techniques
Loggly: regexes bad better...
Katafrakt: Regular expression how do they work?

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

WHAT DO YOU THINK? LET'S TALK!
@MJRZASA

Slide 55

Slide 55 text

No content