Slide 1

Slide 1 text

I CAN KILL YOUR BROWSER WITH A SIMPLE REGEX
MACIEK RZĄSA
TOPTAL
@mjrzasa

Slide 5

Slide 5 text

"I'm definitely guilty of this. When I throw a regex together, I never worry about performance; I know the target strings will generally be far too small to ever cause a problem."
Jeff Atwood, 2006

Slide 6

Slide 6 text

JEFF ATWOOD DOESN'T WORRY ABOUT REGEX PERFORMANCE, WHY SHOULD I?

Slide 7

Slide 7 text

WHAT'S NEXT?
1. Warm-up: basics of regex performance.
2. Catastrophic performance.
3. Real-world failures.

Slide 8

Slide 8 text

DEVELOPER @ TOPTAL
at work: Ruby & Postgres, migrating to services, Scrum Mastering
after work: Rzeszów Ruby User Group (rrug.pl), Rzeszów University of Technology
software that matters, agile, text processing, distributed systems

Slide 9

Slide 9 text

HOW DO REGEXPS REALLY WORK?

Slide 10

Slide 10 text

RUBY
pattern = /<.*>/
pattern.match("text\n")
JAVA
Pattern p = Pattern.compile("<.*>");
Matcher m = p.matcher("text\n");
boolean b = m.matches();
JAVASCRIPT
var pattern = /<.*>/;
var results = pattern.exec("text\n");

Slide 11

Slide 11 text

THEORY...
regular grammar: A -> abB, B -> bb, B -> ab
regular expression: abab|abbb
finite automaton: [figure: automaton for abab|abbb, edges labeled a and b]
source: https://swtch.com/~rsc/regexp/regexp1.html

Slide 12

Slide 12 text

...MEETS PRACTICE
formal languages theory:
  a* a+ a|b a? a(a|b)
popular programming languages:
  a* a+ a|b a? a(a|b)
  a*? \d \W
  ( papa WikiWiki
  /\(((?R)|\w+)\)/ -> (((12)))

Slide 13

Slide 13 text

TWO TYPES OF REGEX ENGINES
1. Text-directed
   Thompson 1968, 400 LOC in C
   grep, awk, sed, go
   based on DFA (Deterministic Finite Automata)
   simpler implementation
2. Regex-directed
   Larry Wall, Perl, 1987
   Perl-Compatible Regular Expressions (JS, Ruby, .Net,...)
   based on NFA (Nondeterministic Finite Automata)
   broader feature set

Slide 14

Slide 14 text

EXAMPLE
/abab|abbb/ =~ 'abbb'
[figure: automaton for abab|abbb, edges labeled a and b]
source of figures on this and the next few slides: https://swtch.com/~rsc/regexp/regexp1.html

Slide 15

Slide 15 text

Text-directed
abab|abbb
[figure: automaton for abab|abbb, repeated for each step]
•abbb
a•bbb
ab•bb
abb•b
abbb•
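The text-directed walk above can be sketched as a table-driven automaton. This is a minimal illustration, not how grep or any real engine is implemented: the state names and the `dfa_match?` helper are made up for this example, and it checks the whole string rather than searching inside it.

```ruby
# Transition table of a deterministic automaton for abab|abbb.
# State names are arbitrary labels chosen for this sketch.
DFA = {
  start:  { "a" => :a },
  a:      { "b" => :ab },
  ab:     { "a" => :aba, "b" => :abb },
  aba:    { "b" => :accept },
  abb:    { "b" => :accept },
  accept: {}
}.freeze

# Reads every character exactly once and never goes back in the input.
def dfa_match?(text)
  state = :start
  text.each_char do |char|
    state = DFA.fetch(state)[char]
    return false if state.nil? # no transition for this character: reject
  end
  state == :accept
end

puts dfa_match?("abbb") # true
puts dfa_match?("abba") # false
```

The work is proportional to the length of the text, whatever the pattern looks like.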

Slide 16

Slide 16 text

Regex-directed
abab|abbb
[figure: automaton for abab|abbb, repeated for each step]
•abbb
•abbb
a•bbb
ab•bb
failure, backtracking
•abbb
a•bbb
ab•bb
abb•b
abbb•
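A regex-directed engine walks the pattern instead, and returns to an earlier position when an alternative fails. The sketch below handles only a list of literal alternatives matched at the start of the text; `backtracking_match` and `mark` are hypothetical helpers written for this illustration, and the trace is similar to, not identical with, the one on the slide.

```ruby
# Shows the current position in the text with a bullet, as on the slides.
def mark(text, pos)
  text[0...pos] + "•" + text[pos..]
end

# Tries each alternative in order from position 0 and records every
# position visited. Returns [matched, trace].
def backtracking_match(alternatives, text)
  trace = []
  alternatives.each do |alternative|
    pos = 0
    trace << mark(text, pos)
    alternative.each_char do |expected|
      break unless text[pos] == expected
      pos += 1
      trace << mark(text, pos)
    end
    return [true, trace] if pos == alternative.length
    trace << "failure, backtracking" # give up this alternative, rewind to 0
  end
  [false, trace]
end

matched, trace = backtracking_match(%w[abab abbb], "abbb")
puts trace
puts "matched: #{matched}"
```

The first two characters are read twice, once per alternative. That repeated work is what grows out of control in the patterns discussed later.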

Slide 18

Slide 18 text

WARM-UP EXERCISE (1)

Slide 19

Slide 19 text

WARM-UP SUMMARY
Don't be lazy
  .* => [^X]*
  .*? => [^X]*
Avoid ambiguous alternatives
  (pre-d1|pre-e2|...) => pre-(d1|e2|...)
Leverage character classes
  (,|;|\.) => [,.;]
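A small check of these rewrites, assuming `>` plays the role of the delimiter X and using made-up sample strings. Note that replacing a greedy `.*` changes the result when the delimiter occurs more than once, while the other rewrites give the same result here.

```ruby
HTML        = "<a><b>"
GREEDY_TAG  = /<.*>/
LAZY_TAG    = /<.*?>/
NEGATED_TAG = /<[^>]*>/

puts HTML[GREEDY_TAG]  # <a><b>
puts HTML[LAZY_TAG]    # <a>
puts HTML[NEGATED_TAG] # <a>

# Factoring out the common prefix keeps the overall match the same.
puts "pre-e2"[/(pre-d1|pre-e2)/] == "pre-e2"[/pre-(d1|e2)/] # true

# A character class replaces an alternation of single characters.
puts "a,b;c.d".gsub(/(,|;|\.)/, " ") == "a,b;c.d".gsub(/[,.;]/, " ") # true
```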

Slide 20

Slide 20 text

IT'S ALL INTERESTING
But that's not the reason we are here

Slide 21

Slide 21 text

CATASTROPHIC BACKTRACKING
REPETITION INSIDE REPETITION: /(a+)*b/
OVERLAPPING REPETITIONS: /a*c?a*b/
aaaaaaaaaab: ALL OK?
aaaaaaaaaa: CATASTROPHIC BACKTRACKING
see Exercise 2.1 & 2.2
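To see why `(a+)*b` explodes on a run of a's with no b, here is a toy model of a naive backtracking matcher that counts its attempts. It is an illustration only: `nested_attempts` is a hypothetical helper, the pattern is hard-coded, and real engines apply optimizations, so their step counts will differ.

```ruby
# Toy backtracking matcher for (a+)*b anchored at the start of the text.
# Returns [matched, attempts], where attempts counts positions tried.
def nested_attempts(text, pos = 0, counter = [0])
  counter[0] += 1
  run = 0
  run += 1 while text[pos + run] == "a"
  # One more iteration of (a+): try every possible length, longest first.
  run.downto(1) do |len|
    matched, = nested_attempts(text, pos + len, counter)
    return [true, counter[0]] if matched
  end
  # No further iteration: the closing b has to be right here.
  [text[pos] == "b", counter[0]]
end

[5, 10, 15].each do |n|
  _, attempts = nested_attempts("a" * n)
  puts "#{n} a's, no b: #{attempts} attempts" # 32, 1024, 32768
end

puts nested_attempts("a" * 10 + "b").inspect # [true, 2]
```

In this model every extra `a` doubles the work on a failing input, while the matching input succeeds on the first greedy try.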

Slide 22

Slide 22 text

WHO WRITES SUCH REGEXPS?
EXERCISE 2.3

Slide 23

Slide 23 text

ARITHMETIC OPERATIONS
320-12=
430-32+1=
pattern = /\d+[-+]\d+=/ # v1.0
pattern = /(\d+[-+])+=/ # v2.0-rc
pattern = /(\d+[-+]?)+=/ # v2.0
pattern = /(\d+|[-+])+=/
"320-12=" # 12 steps
"32-12+230=" # 18 steps
"32+12-320-2132+32123=" # 28 steps
"32+12-320-2132+32123" # 95 854 steps
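A quick way to see how the labelled versions behave on the two sample inputs. The `whole_match?` helper is an assumption of this sketch: it checks that the first match covers the entire string, because the patterns themselves are not anchored.

```ruby
ARITH_V1    = /\d+[-+]\d+=/
ARITH_V2_RC = /(\d+[-+])+=/
ARITH_V2    = /(\d+[-+]?)+=/

# True when the first match of the pattern covers the entire text.
def whole_match?(pattern, text)
  pattern.match(text).to_s == text
end

["320-12=", "430-32+1="].each do |input|
  puts format("%-10s v1.0: %-5s v2.0-rc: %-5s v2.0: %s",
              input,
              whole_match?(ARITH_V1, input),
              whole_match?(ARITH_V2_RC, input),
              whole_match?(ARITH_V2, input))
end

# The "almost valid" input from the slide: everything but the final "=".
puts whole_match?(ARITH_V2, "32+12-320-2132+32123") # false
```

v1.0 handles only a single operator, v2.0-rc demands an operator after every number, and v2.0 accepts both inputs at the price of the nested, overlapping repetition.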

Slide 24

Slide 24 text

SOLUTIONS
unrolling the loop
  (A+|B?)+ => A+(BA+)*
  (A+|B?)+ => (A+B)*A+
possessive quantifier (repetition that never backtracks)
  (A+|B?)+ => (A++|B?)+ (*)
atomic groups
  (A+|B?)+ => (?>A+|B?)+
Optimize pattern from 2.3, then do 2.4
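One possible application of the three techniques to the arithmetic pattern `(\d+[-+]?)+=`, with A as `\d+` and B as `[-+]`. These rewrites are a sketch, not necessarily the intended exercise solution; the constant names are made up for this example.

```ruby
FIX_ORIGINAL   = /(\d+[-+]?)+=/
FIX_UNROLLED   = /\d+([-+]\d+)*=/  # A+(BA+)*
FIX_POSSESSIVE = /(\d++[-+]?)+=/   # digits are never given back
FIX_ATOMIC     = /(?>\d+[-+]?)+=/  # each iteration is locked once matched

VARIANTS = {
  original: FIX_ORIGINAL,
  unrolled: FIX_UNROLLED,
  possessive: FIX_POSSESSIVE,
  atomic: FIX_ATOMIC
}.freeze

["320-12=", "430-32+1=", "32+12-320-2132+32123"].each do |input|
  results = VARIANTS.map { |name, regex| "#{name}=#{regex.match?(input)}" }
  puts "#{input}: #{results.join(' ')}"
end

# The unrolled form is also stricter: it rejects a dangling operator.
puts FIX_ORIGINAL.match?("320-=") # true
puts FIX_UNROLLED.match?("320-=") # false
```

All four agree on the sample inputs. The rewritten forms leave the engine only one way to split a run of digits, so a failing input is rejected without trying every split.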

Slide 25

Slide 25 text

PATTERNS TO AVOID
overlapping scopes: /\d+\w*/
overlapping alternatives: /(\d+|\w+)/
remote overlapping quantifiers: /.*-some text-.*;/
nested quantifiers: /(\d+)*\w/

Slide 27

Slide 27 text

COMPUTERS ARE NOW SO FAST THAT EVEN HAVING 100 000 STEPS WON'T MATTER

Slide 28

Slide 28 text

BENCHMARK: NUMBERS
             greedy     lazy       unrolled      possessive
Ruby         70.4 i/s   65.9 i/s   1,296.1 i/s   1,187.0 i/s
JavaScript   191 i/s    220 i/s    6,689 i/s     -
greedy = /(-?(\d+[,.]?)+)/
lazy = /(-?(\d+?[,.]?)+?)/
unrolled = /(-?\d+([,.]\d+)*)/
possessive = /(-?(\d++[,.]?)++)/
string = "..." # ~11000 chars
def count(string, regex)
  # count how many times regex is matched on a string
end
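A rough way to reproduce this kind of measurement. Everything beyond the three patterns is an assumption: the body of `count` uses `String#scan`, the sample text is invented and much shorter than the original input, and `iterations_per_second` is a simple stand-in for a real benchmarking tool, so the numbers will not match the table.

```ruby
BENCH_GREEDY     = /(-?(\d+[,.]?)+)/
BENCH_UNROLLED   = /(-?\d+([,.]\d+)*)/
BENCH_POSSESSIVE = /(-?(\d++[,.]?)++)/

# Counts how many times the regex matches in the string.
def count(string, regex)
  string.scan(regex).size
end

# Crude throughput estimate based on wall-clock time.
def iterations_per_second(string, regex, runs: 50)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { count(string, regex) }
  runs / (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
end

# Hypothetical input, not the text used on the slide.
SAMPLE = ("total: 1,234.5 and -42, then 7. " * 100).freeze

{ greedy: BENCH_GREEDY, unrolled: BENCH_UNROLLED, possessive: BENCH_POSSESSIVE }.each do |name, regex|
  puts format("%-10s %d matches, %.1f i/s",
              name, count(SAMPLE, regex), iterations_per_second(SAMPLE, regex))
end
```

It is worth checking that the variants find the same number of matches before comparing speed. On this sample the three patterns above agree, while the lazy variant from the slide would split numbers into smaller matches.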

Slide 29

Slide 29 text

BENCHMARK: LOG SEARCH
NASA webserver, 196 MB
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
/.* - - \[(.*)\] "((.+) ?)*" (.*) (.*)/ # 5m18,298s
/.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/ # 0m10,013s
/\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/ # 0m5,897s
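Applied to the sample line, the second and third patterns extract the same fields; the difference lies in how much the engine has to guess. The `fields` helper and the constant names are made up for this sketch.

```ruby
LOG_LINE = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

LOG_LOOSE  = /.* - - \[(.*)\] "(.+) (.*) (HTTP\/.*)" (.*) (.*)/
LOG_STRICT = /\S* - - \[([^\]]*)\] "(\S+) (\S*) (HTTP\/\d.\d)" (\d+) (\d+)/

# Returns the captured fields, or nil when the line does not match.
def fields(pattern, line)
  match = pattern.match(line)
  match && match.captures
end

p fields(LOG_STRICT, LOG_LINE)
# ["01/Jul/1995:00:00:01 -0400", "GET", "/history/apollo/", "HTTP/1.0", "200", "6245"]
p fields(LOG_LOOSE, LOG_LINE) == fields(LOG_STRICT, LOG_LINE) # true
```

With `.*` the engine first runs to the end of the line and backs up until the rest of the pattern fits; `\S`, `[^\]]` and `\d` stop at the field boundary by themselves.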

Slide 30

Slide 30 text

WHO COULD MAKE SUCH A MISTAKE?

Slide 31

Slide 31 text

WHO DID MAKE SUCH A MISTAKE?
Virginia Tech study, 2018
  PyPI & npm modules
  1-10% superlinear regexes
  Django, MongoDB, Python core
Also
  StackOverflow
  Cloudflare
  Rack

Slide 32

Slide 32 text

COPY-PASTING
RegExLib, id=1757 (email validation)
^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
OWASP Validation Regex Repository, Java Classname
^(([a-z])+.)+[A-Z]([a-z])+$
'aaaaaaaaaaaaaaa'

Slide 33

Slide 33 text

EXERCISE 3 - REAL WORLD EXAMPLES

Slide 34

Slide 34 text

LET THE USERS WRITE REGEXPS!
feature request
(<.*>\s*)*

Slide 37

Slide 37 text

SUMMARY

Slide 38

Slide 38 text

GOOD PRACTICES
.* => [^X]*
.*? => [^X]*
(pre-d1|pre-e2|...) => pre-(d1|e2|...)
(,|;|\.) => [,.;]
(A+|B?)+ => A+(BA+)*
(A+|B?)+ => (A+B)*A+
\w+-\d+ => \w+-\d++, \w+-(?>\d+), \w+-(?=(\d+))\1
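The last line lists three ways of saying "take the digits and never give them back". A short sketch with a made-up input; the lookahead-plus-backreference form is handy in engines that lack possessive quantifiers and atomic groups.

```ruby
ID_PLAIN      = /\w+-\d+/
ID_POSSESSIVE = /\w+-\d++/
ID_ATOMIC     = /\w+-(?>\d+)/
ID_LOOKAHEAD  = /\w+-(?=(\d+))\1/

input = "order-12345"
[ID_PLAIN, ID_POSSESSIVE, ID_ATOMIC, ID_LOOKAHEAD].each do |regex|
  puts input[regex] # order-12345 in all four cases
end

# The difference shows once something must follow the digits:
puts (/\w+-\d+5/.match?(input))     # true: \d+ gives the last digit back
puts (/\w+-\d++5/.match?(input))    # false: possessive, nothing is given back
puts (/\w+-(?>\d+)5/.match?(input)) # false: atomic group behaves the same way
```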

Slide 39

Slide 39 text

FINAL ADVICE
TAFT: valid, invalid, almost valid input
disambiguate, don't be lazy
it's a sharp tool, use with care

Slide 41

Slide 41 text

WHAT DO YOU THINK? LET'S TALK!
@MJRZASA
