How to do regexp analysis

Slide 1

Slide 1 text

How to do regexp analysis @quasilyte / GolangKazan 2020

Slide 2

Slide 2 text

Not why, but how Implementation advice and potential issues overview.

Slide 3

Slide 3 text

go-critic NoVerify Open-Source analyzers

Slide 4

Slide 4 text

Discussion plan ● Handling regexp syntax ● Analyzing regexp flow ● Finding bugs in regular expressions ● Regexp rewriting

Slide 5

Slide 5 text

Handling regexp syntax

Slide 6

Slide 6 text

Why write your own parser? Most regexp libraries use parsers that give up on the first error. For analysis, we need a rich AST (a parse tree, even) and an error-tolerant parser.

Slide 7

Slide 7 text

Writing a parser Useful resources: ● Regexp syntax docs (BNF, re2-syntax) ● Pratt parsers tutorial (RU, EN) ● Regexp corpus for tests (gist) ● Dialect-specific documentation

Slide 8

Slide 8 text

Composition operators Only two: ● Concatenation: xy (“x” followed by “y”) ● Alternation: x|y (“x” or “y”) Concatenation is implicit, and we want it to be explicit in the AST.

Slide 9

Slide 9 text

Concat operation `0|xy[a-z]` ⬇ 0 | x ⋅ y ⋅ [a-z]

Slide 10

Slide 10 text

Parsing concatenation ● Insert concat tokens ● Parse regexp like it has explicit concat xy? ⬇ “x” “⋅” “y” “?”
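The token insertion step can be sketched as follows. This is a minimal illustration, not the parser used by the talk's library: the token set is reduced to single-char operands, `|`, parentheses and the `?`, `*`, `+` postfix operators, and the names `insertConcat`, `endsOperand`, `startsOperand` and `concatTok` are made up for this example.

```go
package main

import "fmt"

// concatTok is the explicit concatenation token (a name assumed for this sketch).
const concatTok = "⋅"

// endsOperand reports whether an operand can end at tok.
func endsOperand(tok string) bool {
	return tok != "(" && tok != "|"
}

// startsOperand reports whether tok can begin a new operand.
func startsOperand(tok string) bool {
	switch tok {
	case ")", "|", "?", "*", "+":
		return false
	}
	return true
}

// insertConcat returns a copy of toks with a concat token inserted
// between every pair of adjacent operands.
func insertConcat(toks []string) []string {
	var out []string
	for i, tok := range toks {
		if i > 0 && endsOperand(toks[i-1]) && startsOperand(tok) {
			out = append(out, concatTok)
		}
		out = append(out, tok)
	}
	return out
}

func main() {
	fmt.Println(insertConcat([]string{"x", "y", "?"}))               // [x ⋅ y ?]
	fmt.Println(insertConcat([]string{"0", "|", "x", "y", "[a-z]"})) // [0 | x ⋅ y ⋅ [a-z]]
}
```

After this pass, the parser can treat “⋅” as an ordinary binary operator with its own precedence.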

Slide 11

Slide 11 text

Char classes (are hard) ● Different escaping rules ● Char ranges can be tricky This is a char range: `[\n-\r]` (4 chars). This is not: `[\d-\r]` (“\d”, “-” and “\r”).

Slide 12

Slide 12 text

Char classes syntax `[][]` What is it?

Slide 13

Slide 13 text

Char classes syntax `[][]` A char class of “]” and “[”! Equivalent to `[\]\[]`

Slide 14

Slide 14 text

Char classes syntax `[^]*|\[[^\]]` What is it?

Slide 15

Slide 15 text

Char classes syntax `[^]*|\[[^\]]` A single char class! `[^\]*|\[\[^\]]`

Slide 16

Slide 16 text

Char classes syntax `[+=-_]` What will be matched?

Slide 17

Slide 17 text

Char classes syntax `[+=-_]` “F” matched: `=-_` is a range from “=” to “_”, and it includes the uppercase letters.

Slide 18

Slide 18 text

Char classes syntax `[+=\-_]` “F” not matched: the escaped “-” is a literal char, not a range operator.
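The difference between the two classes is easy to check with Go's standard `regexp` package. The helper name `matchesF` is made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesF reports whether pattern matches the single-letter string "F".
func matchesF(pattern string) bool {
	return regexp.MustCompile(pattern).MatchString("F")
}

func main() {
	fmt.Println(matchesF(`[+=-_]`))  // true: "=-_" is a range that covers A-Z
	fmt.Println(matchesF(`[+=\-_]`)) // false: only "+", "=", "-" and "_"
}
```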

Slide 19

Slide 19 text

Chars and literals ● Consecutive “chars” can be merged ● A single char should not be converted Both forms (with and without merge) are useful. Merged chars simplify literal substring analysis.

Slide 20

Slide 20 text

Concat operation `foox?y` ⬇ lit(foo) ⋅ ?(char(x)) ⋅ char(y)
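A merge pass over the concat operands might look like the sketch below. It assumes the parser has already attached “x” to the “?” node. The `Expr` type here is a local, reduced variant, and `ExprQuestion` and `mergeChars` are names invented for the example.

```go
package main

import "fmt"

type ExprKind int

const (
	ExprChar     ExprKind = iota
	ExprLiteral           // list of chars
	ExprQuestion          // x?
)

type Expr struct {
	Kind  ExprKind
	Value string
	Args  []Expr
}

// mergeChars replaces every run of two or more adjacent ExprChar
// nodes with a single ExprLiteral; a lone char is kept as is.
func mergeChars(args []Expr) []Expr {
	var out []Expr
	for i := 0; i < len(args); {
		// Find the end of the char run that starts at i.
		j := i
		for j < len(args) && args[j].Kind == ExprChar {
			j++
		}
		if j-i >= 2 {
			lit := ""
			for _, ch := range args[i:j] {
				lit += ch.Value
			}
			out = append(out, Expr{Kind: ExprLiteral, Value: lit})
			i = j
			continue
		}
		out = append(out, args[i]) // single char or non-char node
		i++
	}
	return out
}

func main() {
	// Operands of `foox?y` before merging.
	args := []Expr{
		{Kind: ExprChar, Value: "f"},
		{Kind: ExprChar, Value: "o"},
		{Kind: ExprChar, Value: "o"},
		{Kind: ExprQuestion, Value: "x?", Args: []Expr{{Kind: ExprChar, Value: "x"}}},
		{Kind: ExprChar, Value: "y"},
	}
	for _, e := range mergeChars(args) {
		fmt.Printf("%d %q\n", e.Kind, e.Value)
	}
}
```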

Slide 21

Slide 21 text

AST types There are at least two approaches: ● One type + enum tags ● Many types + shared interface/base Both have pros and cons.

Slide 22

Slide 22 text

AST types

    type Expr struct {
        Kind  ExprKind // enum tag
        Value string   // source text
        Args  []Expr   // sub-expr list
    }

    type ExprKind int

Slide 23

Slide 23 text

AST types

    const (
        ExprNone ExprKind = iota
        ExprChar
        ExprLiteral // list of chars
        ExprConcat  // xy
        ExprAlt     // x|y
        // etc.
    )

Slide 24

Slide 24 text

Helper for the next slide

    func charExpr(val string) Expr {
        return Expr{
            Kind:  ExprChar,
            Value: val,
        }
    }

Slide 25

Slide 25 text

AST of `x|yz`

    Expr{
        Kind:  ExprAlt,
        Value: "x|yz",
        Args: []Expr{
            charExpr("x"),
            {
                Kind:  ExprConcat,
                Value: "yz",
                Args: []Expr{
                    charExpr("y"),
                    charExpr("z"),
                },
            },
        },
    }

Slide 26

Slide 26 text

Go regexp parsing library https://github.com/quasilyte/regex contains a `regex/syntax` package that is used in both NoVerify and go-critic. It can parse both re2 and pcre patterns.

Slide 27

Slide 27 text

Analyzing regexp flow

Slide 28

Slide 28 text

Regexp flags A regular expression can have an initial set of flags, and it can then add or remove any of them inside the expression. The effect is localized to the current (potentially capturing) group.

Slide 29

Slide 29 text

Flags flow `/((?i)a(?m)b(?-m)c)d/s` At “a”, flags: si. Entered a group with the “i” flag.

Slide 30

Slide 30 text

Flags flow `/((?i)a(?m)b(?-m)c)d/s` At “b”, flags: sim. Mid-group flags: add “m”.

Slide 31

Slide 31 text

Flags flow `/((?i)a(?m)b(?-m)c)d/s` At “c”, flags: si. Mid-group flags: clear “m”.

Slide 32

Slide 32 text

Flags flow `/((?i)a(?m)b(?-m)c)d/s` At “d”, flags: s. Left a group with the “i” flag.

Slide 33

Slide 33 text

Flags flow ● Flags are lexically scoped ● Groups are a scoping unit ● Leaving a group drops a scope ● Entering a group adds a scope
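The scoping rules above map naturally onto a stack of flag sets. The sketch below walks the pattern text directly instead of an AST to stay short, and it is deliberately incomplete: escapes, char classes, non-capturing `(?:...)` groups and scoped `(?i:...)` forms are not handled. The function name `traceFlags` and the `char:flags` output format are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// traceFlags returns the flags active at every literal char of pattern,
// formatted as "char:flags". initial is the pattern-level flag set.
func traceFlags(pattern, initial string) []string {
	var trace []string
	scopes := []string{initial} // top of the stack holds the current flags
	for i := 0; i < len(pattern); i++ {
		top := len(scopes) - 1
		switch c := pattern[i]; {
		case c == '(' && strings.HasPrefix(pattern[i:], "(?"):
			// Mid-group flags like (?i) or (?-m) update the current scope.
			rel := strings.IndexByte(pattern[i:], ')')
			if rel < 0 {
				return trace // malformed input: stop the walk
			}
			end := i + rel
			remove := false
			for _, f := range pattern[i+2 : end] {
				switch {
				case f == '-':
					remove = true
				case remove:
					scopes[top] = strings.ReplaceAll(scopes[top], string(f), "")
				case !strings.ContainsRune(scopes[top], f):
					scopes[top] += string(f)
				}
			}
			i = end
		case c == '(':
			scopes = append(scopes, scopes[top]) // entering a group adds a scope
		case c == ')':
			if top > 0 {
				scopes = scopes[:top] // leaving a group drops a scope
			}
		default:
			trace = append(trace, string(c)+":"+scopes[top])
		}
	}
	return trace
}

func main() {
	fmt.Println(traceFlags("((?i)a(?m)b(?-m)c)d", "s")) // [a:si b:sim c:si d:s]
}
```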

Slide 34

Slide 34 text

Back references ● Rules vary among engines/dialects ● Syntax may clash with octal literals ● Can also be relative/named: \g{-1}, etc We’ll use PHP rules as an example.

Slide 35

Slide 35 text

Back reference QUIZ! (PHP) \0 ??? \1 … \9 ??? \10 … \77 ???

Slide 36

Slide 36 text

Back reference QUIZ! (PHP) \0 Octal literal \1 … \9 ??? \10 … \77 ???

Slide 37

Slide 37 text

Back reference QUIZ! (PHP) \0 Octal literal \1 … \9 Back reference \10 … \77 ???

Slide 38

Slide 38 text

Back reference QUIZ! (PHP) \0 Octal literal \1 … \9 Back reference \10 … \77 It depends!

Slide 39

Slide 39 text

Groups flow ● Capturing groups are numbered from left to right. ● Non-capturing groups are ignored. ● Groups can have a name.
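Go's own `regexp` package follows the same numbering rules, so it can be used to check an analyzer's group numbering. The helper name `groupNames` is made up for the example; `SubexpNames` is the standard library method.

```go
package main

import (
	"fmt"
	"regexp"
)

// groupNames returns the capture group names of pattern, indexed by group number.
// Index 0 stands for the whole match; unnamed groups have an empty name.
func groupNames(pattern string) []string {
	return regexp.MustCompile(pattern).SubexpNames()
}

func main() {
	// (?:b) is non-capturing, so it does not get a number.
	names := groupNames(`(a)(?:b)(?P<year>\d+)(c)`)
	for i, name := range names {
		fmt.Printf("%d: %q\n", i, name)
	}
}
```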

Slide 40

Slide 40 text

Finding bugs in regular expressions

Slide 41

Slide 41 text

“^” anchor diagnostic Let’s check that “^” is used only at the beginning of the pattern. Without the multiline flag, a “^” that follows a non-empty match can never succeed.

Slide 42

Slide 42 text

Correct “^” usages `^foo` `^a|^b` `a|(b|^c)`

Slide 43

Slide 43 text

Incorrect “^” usages `foo^` `a^b` `(a|b)^c`

Slide 44

Slide 44 text

Algorithm ● Traverse all starting branches ● Mark all reached “^” as “good” Then traverse the pattern AST normally and report any “^” that was not marked.

Slide 45

Slide 45 text

The starting branches? ● For every “concat” met, it’s the first element (applied recursively). ● If the root regexp element is not a “concat”, consider it to be a concat of 1 element.
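The two passes can be sketched over a reduced AST as follows. This is an illustration under assumptions, not the go-critic or NoVerify code: the kinds `ExprCaret` and `ExprGroup`, the helper `node`, and the functions `markStarting` and `badCarets` are invented here, nodes are identified by pointer, and the multiline flag is ignored.

```go
package main

import "fmt"

type ExprKind int

const (
	ExprChar   ExprKind = iota
	ExprCaret           // ^
	ExprConcat          // xy
	ExprAlt             // x|y
	ExprGroup           // (x)
)

type Expr struct {
	Kind  ExprKind
	Value string
	Args  []Expr
}

func node(kind ExprKind, args ...Expr) Expr { return Expr{Kind: kind, Args: args} }

// markStarting marks every "^" reachable through starting branches.
func markStarting(e *Expr, good map[*Expr]bool) {
	switch e.Kind {
	case ExprCaret:
		good[e] = true
	case ExprConcat:
		if len(e.Args) > 0 {
			markStarting(&e.Args[0], good) // only the first element starts a match
		}
	case ExprAlt, ExprGroup:
		for i := range e.Args {
			markStarting(&e.Args[i], good) // every alternative is a starting branch
		}
	}
}

// badCarets counts the "^" nodes that were not marked as good.
func badCarets(root *Expr) int {
	good := map[*Expr]bool{}
	markStarting(root, good)
	bad := 0
	var walk func(e *Expr)
	walk = func(e *Expr) {
		if e.Kind == ExprCaret && !good[e] {
			bad++
		}
		for i := range e.Args {
			walk(&e.Args[i])
		}
	}
	walk(root)
	return bad
}

func main() {
	c := func(v string) Expr { return Expr{Kind: ExprChar, Value: v} }
	caret := Expr{Kind: ExprCaret, Value: "^"}
	// `a|(b|^c)`: the "^" starts a branch.
	ok := node(ExprAlt, c("a"), node(ExprGroup, node(ExprAlt, c("b"), node(ExprConcat, caret, c("c")))))
	// `(a|b)^c`: the "^" follows a non-empty match.
	bad := node(ExprConcat, node(ExprGroup, node(ExprAlt, c("a"), c("b"))), caret, c("c"))
	fmt.Println(badCarets(&ok), badCarets(&bad)) // 0 1
}
```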

Slide 46

Slide 46 text

URL matching `google.com`

Slide 47

Slide 47 text

URL matching `google.com` http://googleocom.ru

Slide 48

Slide 48 text

URL matching `google.com` http://googleocom.ru http://a.github.io/google.com

Slide 49

Slide 49 text

URL matching `google\.com` http://googleocom.ru http://a.github.io/google.com

Slide 50

Slide 50 text

URL matching `^https?://google\.com/` http://googleocom.ru http://a.github.io/google.com

Slide 51

Slide 51 text

URL matching When an unescaped “.” is used before a common top-level domain like “com”, it’s probably a mistake. If we have char sequences represented as a single AST node, this analysis is trivial.

Slide 52

Slide 52 text

Handling unescaped dot `google.com` lit(google) ⋅ . ⋅ lit(com) Warn if “.” is followed by a lit with domain name value.
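With merged literals, the check reduces to a scan over adjacent concat operands. The sketch below uses a reduced local `Expr` type; `ExprDot`, `commonTLDs` and `dotWarnings` are names invented for the example, and the domain list is a tiny illustrative sample. It only compares the whole literal, so a pattern like `google.com/` (where the literal is `com/`) would need a prefix check instead.

```go
package main

import "fmt"

type ExprKind int

const (
	ExprChar    ExprKind = iota
	ExprLiteral          // merged list of chars
	ExprDot              // unescaped "."
	ExprConcat           // xy
)

type Expr struct {
	Kind  ExprKind
	Value string
	Args  []Expr
}

// commonTLDs is a small illustrative sample, not a complete list.
var commonTLDs = map[string]bool{"com": true, "org": true, "net": true, "ru": true}

// dotWarnings reports every unescaped "." that is directly followed
// by a literal whose value is a common top-level domain.
func dotWarnings(concat Expr) []string {
	var warnings []string
	for i := 0; i+1 < len(concat.Args); i++ {
		cur, next := concat.Args[i], concat.Args[i+1]
		if cur.Kind == ExprDot && next.Kind == ExprLiteral && commonTLDs[next.Value] {
			warnings = append(warnings, "unescaped dot before \""+next.Value+"\"")
		}
	}
	return warnings
}

func main() {
	// `google.com` parsed as lit(google) ⋅ . ⋅ lit(com).
	pat := Expr{Kind: ExprConcat, Value: "google.com", Args: []Expr{
		{Kind: ExprLiteral, Value: "google"},
		{Kind: ExprDot, Value: "."},
		{Kind: ExprLiteral, Value: "com"},
	}}
	fmt.Println(dotWarnings(pat))
}
```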

Slide 53

Slide 53 text

Regexp rewriting

Slide 54

Slide 54 text

Regexp input generation It’s quite simple to generate a string that will be matched by a regular expression if you have that regexp’s AST.

Slide 55

Slide 55 text

Generating matching string (N=2) `\w*[0-9]?$` *(\w) ⋅ ?([0-9]) ⋅ $

Slide 56

Slide 56 text

Generating matching string (N=2) `\w*[0-9]?$` *(\w) ⋅ ?([0-9]) ⋅ $ aa N matches of \w

Slide 57

Slide 57 text

Generating matching string (N=2) `\w*[0-9]?$` *(\w) ⋅ ?([0-9]) ⋅ $ aa7 1 match of [0-9]

Slide 58

Slide 58 text

Generating matching string (N=2) `\w*[0-9]?$` *(\w) ⋅ ?([0-9]) ⋅ $ aa7 May do nothing for $
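The walk shown on these slides can be written as a short recursive function. This is a sketch under assumptions: the AST is well-formed, N is not negative, and picking a member of a char class is replaced by the hypothetical lookup table `sampleChar` (real code would analyze the class contents). The kind names and `generate` are invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type ExprKind int

const (
	ExprChar     ExprKind = iota
	ExprLiteral           // list of chars
	ExprClass             // \w, [0-9], ...
	ExprConcat            // xy
	ExprAlt               // x|y
	ExprStar              // x*
	ExprQuestion          // x?
	ExprDollar            // $
)

type Expr struct {
	Kind  ExprKind
	Value string
	Args  []Expr
}

// sampleChar maps a class's source text to one char it matches.
// A hypothetical stand-in for real char class analysis.
var sampleChar = map[string]string{`\w`: "a", `\d`: "0", `[0-9]`: "7"}

// generate returns a string matched by e, repeating every x* operand n times.
func generate(e Expr, n int) string {
	switch e.Kind {
	case ExprChar, ExprLiteral:
		return e.Value
	case ExprClass:
		return sampleChar[e.Value]
	case ExprConcat:
		var sb strings.Builder
		for _, arg := range e.Args {
			sb.WriteString(generate(arg, n))
		}
		return sb.String()
	case ExprAlt:
		return generate(e.Args[0], n) // any alternative works; take the first
	case ExprStar:
		return strings.Repeat(generate(e.Args[0], n), n) // N matches
	case ExprQuestion:
		return generate(e.Args[0], n) // 1 match
	default:
		return "" // anchors such as $ consume no input
	}
}

func main() {
	// `\w*[0-9]?$` as *(\w) ⋅ ?([0-9]) ⋅ $
	pat := Expr{Kind: ExprConcat, Args: []Expr{
		{Kind: ExprStar, Args: []Expr{{Kind: ExprClass, Value: `\w`}}},
		{Kind: ExprQuestion, Args: []Expr{{Kind: ExprClass, Value: `[0-9]`}}},
		{Kind: ExprDollar, Value: "$"},
	}}
	s := generate(pat, 2)
	fmt.Println(s, regexp.MustCompile(`\w*[0-9]?$`).MatchString(s)) // aa7 true
}
```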

Slide 59

Slide 59 text

Regexp input generation Generating non-matching strings can be useful for catastrophic backtracking evaluation.

Slide 60

Slide 60 text

Regexp simplification Instead of writing matching characters, we can write the pattern syntax itself. By replacing recognized AST node sequences with something simpler, we can perform regexp simplification.

Slide 61

Slide 61 text

Regexp simplification `\dxx*` \d ⋅ x ⋅ *(x)

Slide 62

Slide 62 text

Regexp simplification `\dxx*` \d ⋅ x ⋅ *(x) \d Can’t simplify \d, write as is

Slide 63

Slide 63 text

Regexp simplification `\dxx*` \d ⋅ x ⋅ *(x) \dx+ xx* -> x+

Slide 64

Slide 64 text

Oh, the possibilities! ● x{1,} -> x+ ● [a-z\d][a-z\d] -> [a-z\d]{2} ● [^\d] -> \D ● a|b|c -> [abc]
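As an illustration, the `xx*` -> `x+` rule can be sketched as a writer that walks concat operands and emits pattern text. This is not the regexp-lint implementation: the reduced `Expr` type, the `ExprEscape` kind, and the functions `simplifyConcat` and `isStarOf` are invented here, and the rule is limited to single chars and escapes such as `\d`, where no extra parentheses are needed.

```go
package main

import (
	"fmt"
	"strings"
)

type ExprKind int

const (
	ExprChar   ExprKind = iota
	ExprEscape          // \d, \w, ...
	ExprStar            // x*
	ExprConcat          // xy
)

type Expr struct {
	Kind  ExprKind
	Value string // source text
	Args  []Expr
}

// isStarOf reports whether star is `x*` for the same single-char
// (or single-escape) operand x.
func isStarOf(star, x Expr) bool {
	if star.Kind != ExprStar || len(star.Args) != 1 {
		return false
	}
	if x.Kind != ExprChar && x.Kind != ExprEscape {
		return false
	}
	return star.Args[0].Kind == x.Kind && star.Args[0].Value == x.Value
}

// simplifyConcat writes concat back as pattern text, replacing xx* with x+.
func simplifyConcat(concat Expr) string {
	var sb strings.Builder
	args := concat.Args
	for i := 0; i < len(args); i++ {
		if i+1 < len(args) && isStarOf(args[i+1], args[i]) {
			sb.WriteString(args[i].Value + "+")
			i++ // the star node is consumed as well
			continue
		}
		sb.WriteString(args[i].Value) // can't simplify, write as is
	}
	return sb.String()
}

func main() {
	x := Expr{Kind: ExprChar, Value: "x"}
	// `\dxx*` as \d ⋅ x ⋅ *(x)
	pat := Expr{Kind: ExprConcat, Value: `\dxx*`, Args: []Expr{
		{Kind: ExprEscape, Value: `\d`},
		x,
		{Kind: ExprStar, Value: "x*", Args: []Expr{x}},
	}}
	fmt.Println(simplifyConcat(pat)) // \dx+
}
```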

Slide 65

Slide 65 text

https://quasilyte.dev/regexp-lint/ Online Demo

Slide 66

Slide 66 text

Submit your ideas! :) If you have a particular regexp simplification or bug pattern that is not detected by regexp-lint, let me know.

Slide 67

Slide 67 text

Thank you.