Slide 1

Slide 1 text

phpgrep: syntax-aware code search. Iskander @quasilyte, VKontakte

Slide 2

Slide 2 text

@quasilyte ❏ Go compiler (Go + x86 asm) @ Intel ❏ Backend infrastructure (Go) @ VK ❏ Gopher Konstructor Working on dev tools is my passion.

Slide 3

Slide 3 text

Talk structure ❏ phpgrep vs grep ❏ phpgrep features, pattern language ❏ Good use cases and examples ❏ PhpStorm structural search ❏ Code normalization and its applications

Slide 8

Slide 8 text

Today we’re code detectives

Slide 9

Slide 9 text

First mission: find all assignments where the assigned value is a string longer than 10 chars

Slide 10

Slide 10 text

Examples that should be matched:
$s = "quite a long text";
$x = "text with \" escaped quote";
$arr[$key] = "a string key";

Slide 11

Slide 11 text

Let’s try grep: basically, regular expressions

Slide 12

Slide 12 text

grep (text level)

Slide 13

Slide 13 text

grep $x = "this is a text"; Implication 1: sees the line above as a sequence of characters

Slide 14

Slide 14 text

grep $x = "this is a text"; Implication 2: uses char-oriented pattern language (regexp)

Slide 15

Slide 15 text

grep $x = "this is a text"; Implication 3: doesn’t know anything about PHP

Slide 16

Slide 16 text

$x = "this is a text"; \$\w+\s*=\s*"[^"]{10,}"\s* We need to deal with optional whitespace, but that’s OK

Slide 17

Slide 17 text

$x = "this is a text"; \$\w+\s*=\s*"[^"]{10,}"\s* But this solution is wrong. It doesn’t handle quote escaping

Slide 18

Slide 18 text

$x = "this is a text"; \$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s* Is it sufficient now?

Slide 19

Slide 19 text

$x = "this is a text"; \$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s* Is it sufficient now? Not really, we’re still matching only variable assignments
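To make the two failure modes concrete, here is a minimal Go sketch that runs both regular expressions from the slides against the three mission examples. The regexps are copied from the slides; the variable names `naive` and `escapeAware` are only illustrative. The first pattern stops at the escaped quote and reports a truncated string, and neither pattern sees the `$arr[$key]` assignment.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are copied from the slides; the names are illustrative.
var (
	naive       = regexp.MustCompile(`\$\w+\s*=\s*"[^"]{10,}"\s*`)
	escapeAware = regexp.MustCompile(`\$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s*`)
)

func main() {
	inputs := []string{
		`$s = "quite a long text";`,
		`$x = "text with \" escaped quote";`,
		`$arr[$key] = "a string key";`,
	}
	for _, src := range inputs {
		// FindString returns the matched text, or "" when nothing matches.
		fmt.Printf("%s\n  naive:       %q\n  escapeAware: %q\n",
			src, naive.FindString(src), escapeAware.FindString(src))
	}
}
```

Printing the matched text rather than a yes/no answer matters here: the naive pattern does "match" the second input, but only up to the escaped quote.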

Slide 20

Slide 20 text

Matching code with regexp is like trying to parse PHP using only regular expressions We (almost) succeeded, but...

Slide 21

Slide 21 text

phpgrep (syntax level)

Slide 22

Slide 22 text

$x = "this is a text"; $_ = ${"s:str"} Note that we don’t care about whitespace anymore

Slide 23

Slide 23 text

$x = "this is a text"; $_ = ${"s:str"} s~.{10,} To apply the “10 char length” restriction, we use result filtering (more on that later)

Slide 24

Slide 24 text

Second mission: find fstat function calls with 2 arguments

Slide 25

Slide 25 text

Examples that should be matched:
fstat($f, $flags)
fstat(getFile($ctx, $name), 0)
fstat($f->name, "b")

Slide 26

Slide 26 text

fstat(f($x, $y), $flags); fstat\(.*?, [^,]*\) It’s hard to match with regexp because arguments may be complex and contain commas (,)

Slide 27

Slide 27 text

fstat(f($x, $y), $flags); fstat($_, $_) With phpgrep it’s super simple to match arbitrary expressions
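The comma problem can be shown with a short Go sketch. The regexp is the one from the slide; `countTopLevelArgs` is a hypothetical helper written only for this illustration (it is not how phpgrep works, and it ignores commas and brackets inside string literals). The regexp accepts calls with one or three arguments, while even a crude nesting-aware count tells them apart.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The regexp from the slide.
var fstatRe = regexp.MustCompile(`fstat\(.*?, [^,]*\)`)

// countTopLevelArgs counts comma-separated arguments, ignoring commas
// nested inside parentheses or brackets. Hypothetical helper: string
// literals containing commas or brackets are not handled.
func countTopLevelArgs(args string) int {
	if strings.TrimSpace(args) == "" {
		return 0
	}
	depth, n := 0, 1
	for _, r := range args {
		switch r {
		case '(', '[':
			depth++
		case ')', ']':
			depth--
		case ',':
			if depth == 0 {
				n++
			}
		}
	}
	return n
}

func main() {
	// A one-argument call and a three-argument call: both are false positives.
	fmt.Println(fstatRe.MatchString(`fstat(f($x, $y));`))
	fmt.Println(fstatRe.MatchString(`fstat($a, $b, $c);`))

	// Counting only top-level commas gives the real argument counts.
	fmt.Println(countTopLevelArgs(`f($x, $y)`))
	fmt.Println(countTopLevelArgs(`f($x, $y), $flags`))
}
```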

Slide 28

Slide 28 text

Third mission: find array literals with duplicated keys (no regexp example this time!)

Slide 29

Slide 29 text

Examples that should be matched:
[7=>"1", 7=>"2"]
[$x, 7=>"1", 7=>"2"]
[7=>"1", $x, 7=>"2", $y]

Slide 30

Slide 30 text

[1=>$x, 2=>$y, 1=>$z] [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}] ${"*"} - capturing of 0-N exprs

Slide 31

Slide 31 text

[1=>$x, 2=>$y, 1=>$z] [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}] $k - named non-empty expr capture

Slide 32

Slide 32 text

[1=>$x, 2=>$y, 1=>$z] [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}] $_ - any non-empty expr capture

Slide 33

Slide 33 text

[1=>$x, 2=>$y, 1=>$z] [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}] An array with at least 2 identical key exprs. They can be located at any position
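What the pattern expresses can be written as a plain function. The Go sketch below assumes a hypothetical `item` type in which every array element carries the source text of its key expression (empty when the element has no explicit key). It illustrates the meaning of the pattern, not phpgrep's matching algorithm.

```go
package main

import "fmt"

// item is a hypothetical, simplified array element: key holds the source
// text of the key expression, or "" when the element has no explicit key.
type item struct {
	key string
}

// hasDuplicateKey reports whether two elements share the same key
// expression, regardless of what sits before, between or after them.
func hasDuplicateKey(items []item) bool {
	seen := make(map[string]bool)
	for _, it := range items {
		if it.key == "" {
			continue // plays the role of ${"*"}: anything may be skipped
		}
		if seen[it.key] {
			return true // the second $k=>$_ with an identical $k
		}
		seen[it.key] = true
	}
	return false
}

func main() {
	// [7=>"1", $x, 7=>"2", $y]
	fmt.Println(hasDuplicateKey([]item{{"7"}, {""}, {"7"}, {""}}))
	// [1=>$x, 2=>$y]
	fmt.Println(hasDuplicateKey([]item{{"1"}, {"2"}}))
}
```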

Slide 34

Slide 34 text

Features and syntax overview The pattern language

Slide 35

Slide 35 text

Running phpgrep phpgrep . '${"x:var"}++' 'x=i,j' File or directory name to search. By default, phpgrep recurses into nested directories

Slide 36

Slide 36 text

Running phpgrep phpgrep . '${"x:var"}++' 'x=i,j' Pattern to search, written in phpgrep pattern language (PPL)

Slide 37

Slide 37 text

Running phpgrep phpgrep . '${"x:var"}++' 'x=i,j' Additional filter (there can be many) that excludes results that don’t match the given criteria. Every filter is a separate command-line arg

Slide 38

Slide 38 text

PPL (phpgrep pattern language) It’s almost normal PHP code, but with 2 differences to keep in mind. 1. A variable such as $x is used for “any expr” matching 2. ${""} is a special matcher expression

Slide 39

Slide 39 text

PPL (phpgrep pattern language) It’s almost normal PHP code, but with 2 differences to keep in mind. => Can be parsed by any PHP parser.

Slide 40

Slide 40 text

PPL (phpgrep pattern language) Matcher expressions can specify the kind of nodes to match.

Slide 41

Slide 41 text

PPL (phpgrep pattern language) Matcher expressions can specify the kind of nodes to match. Filters are used to add additional conditions to the matcher variables.

Slide 42

Slide 42 text

Pattern language Matcher variables $x = $y All assignments $x = $x Self-assignments

Slide 43

Slide 43 text

Pattern language Matching variables literally $x 'x=foo' $foo variable $x 'x~_id$' Variable with “_id” suffix

Slide 44

Slide 44 text

Pattern language Matching strings "abc" "abc" string ${"x:str"} 'x~abc' String that contains "abc"

Slide 45

Slide 45 text

Pattern language Matching numbers 15 Int of 15 ${"x:int"} 'x=10,15' Int of 10 or 15

Slide 46

Slide 46 text

Let’s use phpgrep for something cool Use cases and (more) examples

Slide 47

Slide 47 text

Use case 1: finding a bug across the entire code base

Slide 48

Slide 48 text

Weird operator precedence An easy mistake to make, tough consequences const MASK = 0xff00; $x = 0x00ff; $x & MASK > 128; // => false (?)

Slide 49

Slide 49 text

Weird operator precedence An easy mistake to make, tough consequences const MASK = 0xff00; $x = 0x00ff; $x & (MASK > 128); No, it returns 1 (which is true)!

Slide 50

Slide 50 text

Weird operator precedence An easy mistake to make, tough consequences const MASK = 0xff00; $x = 0x00ff; ($x & MASK) > 128; // => false
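The arithmetic behind the two groupings can be checked step by step. Go gives `&` a higher precedence than `>`, so the sketch below spells out both groupings explicitly with parentheses; `boolToInt` is a helper introduced here to imitate PHP's conversion of `true` to `1` in a bitwise operation.

```go
package main

import "fmt"

const mask = 0xff00

// boolToInt imitates PHP converting a boolean operand of & to an integer.
func boolToInt(b bool) int {
	if b {
		return 1
	}
	return 0
}

// asPHPGroups computes $x & (MASK > 128), the grouping PHP actually uses.
func asPHPGroups(x int) int {
	return x & boolToInt(mask > 128)
}

// asIntended computes ($x & MASK) > 128, the grouping the author meant.
func asIntended(x int) bool {
	return (x & mask) > 128
}

func main() {
	x := 0x00ff
	fmt.Println(asPHPGroups(x)) // 0x00ff & 1
	fmt.Println(asIntended(x))  // (0x00ff & 0xff00) > 128, that is 0 > 128
}
```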

Slide 51

Slide 51 text

$x & $mask > $y Finds all similar defects in the code base

Slide 52

Slide 52 text

Pattern alternations Currently, there is no way to express any kind of pattern alternation.

Slide 53

Slide 53 text

Pattern alternations Currently, there is no way to express any kind of pattern alternation. Can’t say `$x $y` to match any kind of binary operator.

Slide 54

Slide 54 text

Pattern alternations Workaround: run phpgrep several times. $x & $mask < $y $x & $mask == $y $x & $mask != $y (and so on)
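Since the workaround is mechanical, the pattern list can be generated. The Go sketch below builds one pattern per comparison operator and prints the corresponding command lines; the operator list and the `maskPatterns` name are assumptions made for this example.

```go
package main

import "fmt"

// maskPatterns builds one phpgrep pattern per comparison operator,
// following the "run phpgrep several times" workaround.
func maskPatterns(ops []string) []string {
	patterns := make([]string, 0, len(ops))
	for _, op := range ops {
		patterns = append(patterns, "$x & $mask "+op+" $y")
	}
	return patterns
}

func main() {
	// The operator list is illustrative; extend it as needed.
	for _, p := range maskPatterns([]string{">", "<", "==", "!="}) {
		fmt.Printf("phpgrep . '%s'\n", p)
	}
}
```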

Slide 55

Slide 55 text

Use case 2: running a set of phpgrep patterns as checks in a CI pipeline

Slide 56

Slide 56 text

Project-specific CI checks Imagine that there are some project conventions you want to enforce.

Slide 57

Slide 57 text

Project-specific CI checks Imagine that there are some project conventions you want to enforce. You can write a set of patterns that catch violations of them and make CI reject the revision.

Slide 58

Slide 58 text

Project-specific CI checks 1. Prepare a list of patterns. 2. For every pattern, write an associated message. 3. Run phpgrep for every pattern inside the pipeline. 4. If any phpgrep run matches, stop the build. For every match, print the associated message
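The four steps can be sketched as a small Go program. This is a sketch under stated assumptions, not a ready-made CI integration: the `check` type, `runChecks` and `phpgrepMatch` are names invented here, the invocation follows the `phpgrep <target> <pattern> [filters...]` form shown earlier, and a run is treated as "matched" when phpgrep prints anything to stdout (its exit codes are deliberately not relied on, so a phpgrep error is indistinguishable from "no matches"). For brevity it prints one message per matching pattern rather than one per match.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// check pairs a pattern (plus optional filters) with its message.
type check struct {
	pattern string
	filters []string
	message string
}

// matchFunc reports whether a check matches anything under target.
type matchFunc func(target string, c check) (bool, error)

// runChecks returns the messages of all checks that matched.
func runChecks(target string, checks []check, match matchFunc) ([]string, error) {
	var failed []string
	for _, c := range checks {
		ok, err := match(target, c)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", c.pattern, err)
		}
		if ok {
			failed = append(failed, c.message)
		}
	}
	return failed, nil
}

// phpgrepMatch shells out to phpgrep. Assumption: any stdout output
// means at least one match; a non-zero exit status is not an error here.
func phpgrepMatch(target string, c check) (bool, error) {
	args := append([]string{target, c.pattern}, c.filters...)
	out, err := exec.Command("phpgrep", args...).Output()
	var exitErr *exec.ExitError
	if err != nil && !errors.As(err, &exitErr) {
		return false, err // phpgrep could not be started at all
	}
	return len(bytes.TrimSpace(out)) > 0, nil
}

func main() {
	if _, err := exec.LookPath("phpgrep"); err != nil {
		fmt.Println("phpgrep not found in PATH, skipping checks")
		return
	}
	checks := []check{
		{pattern: `$x & $mask > $y`, message: "suspicious precedence: add parentheses around the & operation"},
		{pattern: `derp($x, null)`, message: "do not pass null as the derp connection"},
	}
	failed, err := runChecks(".", checks, phpgrepMatch)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	for _, msg := range failed {
		fmt.Println(msg)
	}
	if len(failed) > 0 {
		os.Exit(1) // make the CI job fail
	}
}
```

Passing the matcher in as a function keeps `runChecks` testable without a phpgrep binary.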

Slide 62

Slide 62 text

Pluggable linter rules: how phpgrep is planned to be used inside the NoVerify linter

Slide 63

Slide 63 text

NoVerify

Slide 65

Slide 65 text

Use case 3: refactoring (search and replace)

Slide 66

Slide 66 text

Refactoring: modernizing the code
array(${"*"}) => [${"*"}] Replace the old array syntax with the new one
isset($x) ? $x : $y => $x ?? $y Use the null coalescing operator

Slide 67

Slide 67 text

Refactoring: project-specific evolution
// Don’t use $conn default value!
function derp($query, $conn = null)
Find unwanted derp calls:
derp($x)
derp($x, null)

Slide 68

Slide 68 text

phpgrep performance Running a list of patterns: - O(N) complexity - Becomes slow with high N => Optimizations are required

Slide 69

Slide 69 text

phpgrep performance Can still be many times faster than grep with an intricate regular expression. It’s a question of “a few seconds” vs “several tens of minutes”.

Slide 70

Slide 70 text

The closest functional equivalent to phpgrep PhpStorm structural search

Slide 71

Slide 71 text

Structural search and replace (SSR) There are some differences between the pattern languages used by PhpStorm and phpgrep. ❏ $name$ is used for all search “variables” ❏ All filters & options are external to the pattern

Slide 73

Slide 73 text

Structural search and replace (SSR) Filter examples. ❏ Regular expressions ❏ Type constraints ❏ Count (range) ❏ PSI-tree Groovy scripts

Slide 74

Slide 74 text

fstat(f($x, $y), $flags); fstat($x$, $y$) You can solve the same tasks with SSR in almost the same way

Slide 75

Slide 75 text

So, why make phpgrep? We know that PhpStorm is cool, but... ❏ Not everyone is using PhpStorm ❏ phpgrep is a standalone tool ❏ phpgrep is a Go library, not just a utility

Slide 78

Slide 78 text

Why make phpgrep? We know that PhpStorm is cool, but... ❏ Not everyone is using PhpStorm ❏ phpgrep is a standalone tool without deps ❏ phpgrep is a Go library, not just a utility Everything becomes better when re-written in Go!

Slide 79

Slide 79 text

Code normalization: how we can do it and what it enables

Slide 80

Slide 80 text

What is code normalization? It’s a way to turn input source code X into a normal (canonical) form.

Slide 81

Slide 81 text

What is code normalization? It’s a way to turn input source code X into a normal (canonical) form. Different input sources X and Y may end up as the same output after normalization.

Slide 82

Slide 82 text

What is code normalization? The exact rules of what is “normalized” are not that relevant.

Slide 83

Slide 83 text

What is code normalization? The exact rules of what is “normalized” are not that relevant. What is relevant is that among N alternatives we call only one canonical.

Slide 84

Slide 84 text

Code normalization

Slide 85

Slide 85 text

Why do we need normalization? So your pattern can match more equivalent code. ❏ Fuzzy code search ❏ Code duplication/similarity analysis ❏ Code simplifications, easier static analysis

Slide 88

Slide 88 text

But what about subtle details? Some forms are *almost* identical, but we still might want to consider them as 100% interchangeable.

Slide 89

Slide 89 text

But what about subtle details? Some forms are *almost* identical, but we still might want to consider them as 100% interchangeable. We use “normalization levels” to control that.

Slide 90

Slide 90 text

Normalization levels The best rule set depends on the goals. The following statements apply:

Slide 91

Slide 91 text

Normalization levels The best rule set depends on the goals. The following statements apply: ❏ More strict => less normalization ❏ Less strict => more normalization

Slide 92

Slide 92 text

Matching more with less Operation equivalence Are the expressions below identical? intval($x) (int)$x

Slide 93

Slide 93 text

Matching more with less Operation equivalence Are the expressions below identical? intval($x) (int)$x Yes!

Slide 94

Slide 94 text

Matching more with less Operation equivalence Are the expressions below identical? +$x (int)$x

Slide 95

Slide 95 text

Matching more with less Operation equivalence Are the expressions below identical? +$x (int)$x Not always!

Slide 96

Slide 96 text

Matching more with less Operation equivalence Are the expressions below identical? +$x (int)$x But sometimes we don’t care

Slide 97

Slide 97 text

Matching more with less Operation reordering Are the statement sequences below identical? $x++; $y--; versus $y--; $x++;

Slide 98

Slide 98 text

Matching more with less Operation reordering Are the statement sequences below identical? $x++; $y--; versus $y--; $x++; Independent ops can be reordered

Slide 99

Slide 99 text

Normalization level is a phpgrep parameter that can improve the search results
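How a normalization level could gate rewrite rules is sketched below in Go. Everything in it is an assumption made for illustration: the two level names, the choice of `(int)$x` as the canonical form, and the use of regexps over source text (a real implementation would rewrite the AST). The two rules mirror the slides: `intval($x)` is always folded into `(int)$x`, while `+$x` is folded only at the less strict level.

```go
package main

import (
	"fmt"
	"regexp"
)

// level is a hypothetical normalization level: more strict means
// fewer rewrites are applied.
type level int

const (
	strict level = iota // only forms the slides treat as identical
	loose               // also "almost identical" forms
)

// rule rewrites one alternative form into the canonical one.
type rule struct {
	minLevel level
	re       *regexp.Regexp
	repl     string
}

// The canonical form (int)$x is an arbitrary choice for this sketch.
var rules = []rule{
	{strict, regexp.MustCompile(`^intval\((.+)\)$`), `(int)${1}`},
	{loose, regexp.MustCompile(`^\+(.+)$`), `(int)${1}`},
}

// normalize applies every rule allowed at the given level.
func normalize(expr string, lvl level) string {
	for _, r := range rules {
		if lvl >= r.minLevel {
			expr = r.re.ReplaceAllString(expr, r.repl)
		}
	}
	return expr
}

func main() {
	for _, expr := range []string{"intval($x)", "+$x", "(int)$x"} {
		fmt.Printf("%-12s strict: %-10s loose: %s\n",
			expr, normalize(expr, strict), normalize(expr, loose))
	}
}
```

With this setup a single pattern `(int)$x` would find all three spellings at the loose level, but only the first and third at the strict one.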

Slide 100

Slide 100 text

Code search v2

Slide 101

Slide 101 text

Smooth transition slide...

Slide 102

Slide 102 text

The next time you’re going to do code search, make sure you’re using the proper tools. Like phpgrep and code normalization.

Slide 103

Slide 103 text

Closing words... #golang user group in Kazan #GolangKazan

Slide 104

Slide 104 text

The end

Slide 105

Slide 105 text

Slides that are optional. Need more?

Slide 106

Slide 106 text

Go performance: answering why Go is a viable option for a PHP tool

Slide 107

Slide 107 text

What does phpgrep do? 1. File I/O 2. PHP files parsing 3. The matching itself (AST against pattern)

Slide 108

Slide 108 text

What does phpgrep do? 1. File I/O 2. PHP files parsing 3. The matching itself (AST against pattern) With a careful use of goroutines, it’s possible to make I/O faster.

Slide 109

Slide 109 text

What does phpgrep do? 1. File I/O 2. PHP files parsing 3. The matching itself (AST against pattern) (2) and (3) benefit a lot from the performance of a compiled language.

Slide 110

Slide 110 text

Go memory management story ❏ Garbage collection ❏ Slices are the main “memory resource” ❏ Pointers should be local and short-lived I’ll explain why it matters.

Slide 111

Slide 111 text

Garbage collector Needs to visit every reachable pointer.

Slide 112

Slide 112 text

Garbage collector Needs to visit every reachable pointer. More pointers => more work.

Slide 113

Slide 113 text

Garbage collector Needs to visit every reachable pointer. More pointers => more work. Can take a *lot* of execution time.

Slide 114

Slide 114 text

Slice of strings, “hidden” pointers
pool := []string{s1, s2, s3}
slice := make([]string, N)
for i := range slice {
    slice[i] = pool[0]
}

Slide 115

Slide 115 text

Slice of pool indexes, pointer-free
pool := []string{s1, s2, s3}
slice := make([]int, N)
for i := range slice {
    slice[i] = 0
}

Slide 116

Slide 116 text

Slice of pool indexes, pointer-free
- slice := make([]string, N)
+ slice := make([]int, N)
- slice[i] = pool[0]
+ slice[i] = 0

Slide 117

Slide 117 text

Performance comparison
old time/op    new time/op    delta
2.53ms ± 1%    0.42ms ± 1%    -83.25%
Pointer-free code spends far less time in “runtime” (GC).
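A runnable version of the two snippets is sketched below. It is not the benchmark behind the numbers above: the element count, the pool contents and the timing method (a forced `runtime.GC()` while each slice is still live) are choices made for this illustration, and the printed durations will vary by machine.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// fillStrings stores the string itself: every element holds a pointer
// that the garbage collector has to visit.
func fillStrings(pool []string, n int) []string {
	s := make([]string, n)
	for i := range s {
		s[i] = pool[0]
	}
	return s
}

// fillIndexes stores an index into the pool: the slice is pointer-free,
// so the collector does not need to scan its contents.
func fillIndexes(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = 0
	}
	return s
}

// timeGC measures one forced collection after letting earlier garbage settle.
func timeGC() time.Duration {
	runtime.GC()
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 1 << 20
	pool := []string{"s1", "s2", "s3"}

	strs := fillStrings(pool, n)
	fmt.Println("GC with []string live:", timeGC())
	runtime.KeepAlive(strs)

	idxs := fillIndexes(n)
	fmt.Println("GC with []int live:   ", timeGC())
	runtime.KeepAlive(idxs)

	// Values are recovered through the pool when needed.
	fmt.Println(pool[idxs[0]])
}
```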

Slide 118

Slide 118 text

Go memory management story ❏ Garbage collection ❏ Slices are the main “memory resource” ❏ Pointers should be local and short-lived Your memory pools should be slices of value types (i.e. []T instead of []*T).

Slide 119

Slide 119 text

Go memory management story ❏ Garbage collection ❏ Slices are the main “memory resource” ❏ Pointers should be local and short-lived You return a pointer to a pool slice element. That pointer should be as local as possible.
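A minimal sketch of such a pool follows, assuming a hypothetical pointer-free `node` type; the names `nodePool`, `alloc` and `reset` are not taken from phpgrep. The pool owns a `[]node`, hands out pointers to its elements, and is reset as a whole.

```go
package main

import "fmt"

// node is a hypothetical pointer-free value type; children are
// referenced by index into the pool rather than by pointer.
type node struct {
	kind        int
	left, right int
}

// nodePool stores nodes by value ([]node, not []*node).
type nodePool struct {
	nodes []node
}

func newNodePool(capacity int) *nodePool {
	return &nodePool{nodes: make([]node, 0, capacity)}
}

// alloc returns a pointer to a fresh pool element. The pointer should
// stay local and short-lived: once the pool grows past its capacity,
// append moves the elements and older pointers refer to a stale copy.
func (p *nodePool) alloc() *node {
	p.nodes = append(p.nodes, node{})
	return &p.nodes[len(p.nodes)-1]
}

// reset drops all nodes at once while keeping the backing array.
func (p *nodePool) reset() {
	p.nodes = p.nodes[:0]
}

func main() {
	pool := newNodePool(1024)

	n := pool.alloc() // local, short-lived pointer
	n.kind = 7

	fmt.Println(len(pool.nodes), pool.nodes[0].kind)
	pool.reset()
	fmt.Println(len(pool.nodes), cap(pool.nodes))
}
```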