@quasilyte
❏ Go compiler (Go + x86 asm) @ Intel
❏ Backend infrastructure (Go) @ VK
❏ Gopher Konstructor
Working on dev tools is my passion.
Slide 3
Slide 3 text
Talk structure
❏ phpgrep vs grep
❏ phpgrep features, pattern language
❏ Good use cases and examples
❏ PhpStorm structural search
❏ Code normalization and its applications
Slide 8
Slide 8 text
Today we’re code detectives
Slide 9
Slide 9 text
Find all assignments where the assigned
value is a string longer than 10 chars
First mission
Slide 10
Slide 10 text
$s = "quite a long text";
$x = "text with \" escaped quote";
$arr[$key] = "a string key";
Examples that should be matched
Slide 11
Slide 11 text
Basically, regular expressions
Let’s try grep
Slide 12
Slide 12 text
grep (text level)
Slide 13
Slide 13 text
grep
$x = "this is a text";
Implication 1:
sees a line above as a sequence of characters
Slide 14
Slide 14 text
grep
$x = "this is a text";
Implication 2:
uses char-oriented pattern language (regexp)
Slide 15
Slide 15 text
grep
$x = "this is a text";
Implication 3:
doesn’t know anything about PHP
Slide 16
Slide 16 text
$x = "this is a text";
\$\w+\s*=\s*"[^"]{10,}"\s*
We need to deal with optional
whitespace, but that’s OK
Slide 17
Slide 17 text
$x = "this is a text";
\$\w+\s*=\s*"[^"]{10,}"\s*
But this solution is wrong.
It doesn’t handle quote escaping
Slide 18
Slide 18 text
$x = "this is a text";
\$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s*
Is it sufficient now?
Slide 19
Slide 19 text
$x = "this is a text";
\$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s*
Is it sufficient now?
Not really, we’re still matching
only variable assignments
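To see these limitations concretely, here is a minimal Go sketch that runs both regexps from the slides against the three "should be matched" examples. Go's regexp package stands in for grep here; that swap is an assumption of this sketch, not something the talk prescribes.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are copied from the slides above.
var (
	naive   = regexp.MustCompile(`\$\w+\s*=\s*"[^"]{10,}"\s*`)
	escaped = regexp.MustCompile(`\$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s*`)
)

// samples are the three examples from the first mission.
var samples = []string{
	`$s = "quite a long text";`,
	`$x = "text with \" escaped quote";`,
	`$arr[$key] = "a string key";`,
}

func main() {
	for _, src := range samples {
		// FindString returns the matched text, or "" when nothing matched.
		fmt.Printf("%s\n  naive:   %q\n  escaped: %q\n",
			src, naive.FindString(src), escaped.FindString(src))
	}
}
```

The naive pattern stops at the escaped quote in the second sample, and neither pattern matches the array element assignment in the third one.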
Slide 20
Slide 20 text
Matching code with regexp is like trying to
parse PHP using only regular expressions
We (almost) succeeded, but...
Slide 21
Slide 21 text
phpgrep (syntax level)
Slide 22
Slide 22 text
$x = "this is a text";
$_ = ${"s:str"}
Note that we don’t care about
whitespace anymore
Slide 23
Slide 23 text
$x = "this is a text";
$_ = ${"s:str"} s~.{10,}
To apply “10 char length”
restrictions, we use result filtering
(more on that later)
Slide 24
Slide 24 text
Find fstat function calls with 2 arguments
Second mission
Slide 25
Slide 25 text
fstat($f, $flags)
fstat(getFile($ctx, $name), 0)
fstat($f->name, "b")
Examples that should be matched
Slide 26
Slide 26 text
fstat(f($x, $y), $flags);
fstat\(.*?, [^,]*\)
It’s hard to match with regexp
because arguments may be
complex and contain commas (,)
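A small Go sketch of why the regexp above is unreliable. The helper name firstMatch and the one-argument and three-argument inputs are made up for illustration; only the pattern and the first input come from the slides.

```go
package main

import (
	"fmt"
	"regexp"
)

// fstatRe is the regexp from the slide above.
var fstatRe = regexp.MustCompile(`fstat\(.*?, [^,]*\)`)

// firstMatch returns the text matched by fstatRe, or "" if there is none.
func firstMatch(src string) string {
	return fstatRe.FindString(src)
}

func main() {
	inputs := []string{
		`fstat(f($x, $y), $flags);`, // 2 arguments: the whole call should match
		`fstat(f($x, $y));`,         // 1 argument: should not match
		`fstat($a, $b, $c);`,        // 3 arguments: should not match
	}
	for _, src := range inputs {
		fmt.Printf("%-28s => %q\n", src, firstMatch(src))
	}
}
```

The comma inside the nested call is taken for the argument separator, so the two-argument call is cut short and the other two calls are reported as false positives.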
Slide 27
Slide 27 text
fstat(f($x, $y), $flags);
fstat($_, $_)
With phpgrep it’s super simple to
match arbitrary expressions
Slide 28
Slide 28 text
Find array literals with duplicated keys
(No regexp example this time!)
Third mission
Slide 29
Slide 29 text
[7=>"1", 7=>"2"]
[$x, 7=>"1", 7=>"2"]
[7=>"1", $x, 7=>"2", $y]
Examples that should be matched
Slide 30
Slide 30 text
[1=>$x, 2=>$y, 1=>$z]
[${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
${"*"} - capturing of 0-N exprs
Slide 31
Slide 31 text
[1=>$x, 2=>$y, 1=>$z]
[${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
$k - named non-empty expr
capture
Slide 32
Slide 32 text
[1=>$x, 2=>$y, 1=>$z]
[${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
$_ - any non-empty expr capture
Slide 33
Slide 33 text
[1=>$x, 2=>$y, 1=>$z]
[${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
An array with at least 2 identical
key exprs. They can be located at
any position
Slide 34
Slide 34 text
Features and syntax overview
The pattern language
Slide 35
Slide 35 text
Running phpgrep
phpgrep . '${"x:var"}++' 'x=i,j'
File or directory name to search.
By default, phpgrep recurses into nested
directories
Slide 36
Slide 36 text
Running phpgrep
phpgrep . '${"x:var"}++' 'x=i,j'
Pattern to search, written in phpgrep pattern
language (PPL)
Slide 37
Slide 37 text
Running phpgrep
phpgrep . '${"x:var"}++' 'x=i,j'
Additional filter (can have many) that excludes
results if they don’t match given criteria.
Every filter is a separate command-line arg
Slide 38
Slide 38 text
PPL (phpgrep pattern language)
It’s almost normal PHP code, but with 2
differences to keep in mind.
1. $x (a plain variable) is used for “any expr” matching
2. ${"..."} is a special matcher expression, e.g. ${"x:var"}
Slide 39
Slide 39 text
PPL (phpgrep pattern language)
It’s almost normal PHP code, but with 2
differences to keep in mind.
=> Can be parsed by any PHP parser.
Slide 40
Slide 40 text
PPL (phpgrep pattern language)
Matcher expressions can specify the kind of
nodes to match.
Slide 41
Slide 41 text
PPL (phpgrep pattern language)
Matcher expressions can specify the kind of
nodes to match.
Filters are used to add additional conditions to
the matcher variables.
Slide 42
Slide 42 text
Pattern language
Matcher variables
$x = $y
All assignments
$x = $x
Self-assignments
Slide 43
Slide 43 text
Pattern language
Matching variables literally
$x 'x=foo'
$foo variable
$x 'x~_id$'
Variable with “_id” suffix
Slide 44
Slide 44 text
Pattern language
Matching strings
"abc"
"abc" string
${"x:str"} 'x~abc'
String that contains "abc"
Slide 45
Slide 45 text
Pattern language
Matching numbers
15
Int of 15
${"x:int"} 'x=10,15'
Int of 10 or 15
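The filter forms shown on these slides ('x=foo', 'x=10,15', 'x~_id$') can be read as "name, operator, argument". The Go sketch below models that reading with a hypothetical filterAccepts helper. It is not phpgrep's implementation, and it assumes the captured text is passed without the leading $ for variables.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// filterAccepts reports whether captured text passes a filter written like
// the ones on the slides. Hypothetical helper, not part of phpgrep.
//   "x=a,b" accepts text equal to any of the comma-separated values
//   "x~re"  accepts text that matches the regular expression re
func filterAccepts(filter, captured string) (bool, error) {
	i := strings.IndexAny(filter, "=~")
	if i < 0 {
		return false, fmt.Errorf("filter %q has no '=' or '~' operator", filter)
	}
	op, arg := filter[i], filter[i+1:]
	if op == '=' {
		for _, want := range strings.Split(arg, ",") {
			if captured == want {
				return true, nil
			}
		}
		return false, nil
	}
	re, err := regexp.Compile(arg)
	if err != nil {
		return false, err
	}
	return re.MatchString(captured), nil
}

func main() {
	for _, tc := range [][2]string{
		{"x=foo", "foo"}, {"x~_id$", "user_id"}, {"x=10,15", "12"},
	} {
		ok, err := filterAccepts(tc[0], tc[1])
		fmt.Println(tc[0], tc[1], ok, err)
	}
}
```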
Slide 46
Slide 46 text
Let’s use phpgrep for something cool
Use cases and (more) examples
$x & $mask > $y
Finds all similar defects in the
code base
Slide 52
Slide 52 text
Pattern alternations
Currently, there is no way to express any kind of
pattern alternation.
Slide 53
Slide 53 text
Pattern alternations
Currently, there is no way to express any kind of
pattern alternation.
Can’t say `$x $y` to match any kind of
binary operator.
Slide 54
Slide 54 text
Pattern alternations
Workaround: running phpgrep several times
$x & $mask < $y
$x & $mask == $y
$x & $mask != $y
(And so on)
Slide 55
Slide 55 text
Running a set of phpgrep patterns
as checks in a CI pipeline
Use case 2
Slide 56
Slide 56 text
Project-specific CI checks
Imagine that there are some project conventions
you want to enforce.
Slide 57
Slide 57 text
Project-specific CI checks
Imagine that there are some project conventions
you want to enforce.
You can write a set of patterns that catch
violations and make CI reject the revision.
Slide 58
Slide 58 text
Project-specific CI checks
1. Prepare a list of patterns.
2. For every pattern, write an associated message.
3. Run phpgrep for every pattern inside the pipeline.
4. If any of the phpgrep runs matches, stop the build.
For every match, print the associated message.
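The four steps above can be sketched in Go roughly as follows. Check, Runner, RunChecks and phpgrepRunner are hypothetical names, and the messages are invented. The sketch also assumes that phpgrep is on PATH, prints one line per match, and exits with status 1 when nothing matched. Verify those assumptions against the phpgrep version you use.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// Check pairs a phpgrep pattern with the message printed for each match.
type Check struct {
	Pattern string
	Message string
}

// Runner returns the matches reported for one pattern.
type Runner func(pattern string) ([]string, error)

// phpgrepRunner shells out to phpgrep (see the assumptions in the lead-in).
func phpgrepRunner(target string) Runner {
	return func(pattern string) ([]string, error) {
		out, err := exec.Command("phpgrep", target, pattern).Output()
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
			return nil, nil // assumed to mean "no matches"
		}
		if err != nil {
			return nil, err
		}
		text := strings.TrimSpace(string(out))
		if text == "" {
			return nil, nil
		}
		return strings.Split(text, "\n"), nil
	}
}

// RunChecks runs every check and returns one report line per match.
func RunChecks(checks []Check, run Runner) ([]string, error) {
	var report []string
	for _, c := range checks {
		matches, err := run(c.Pattern)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", c.Pattern, err)
		}
		for _, m := range matches {
			report = append(report, m+": "+c.Message)
		}
	}
	return report, nil
}

func main() {
	checks := []Check{
		{Pattern: `array(${"*"})`, Message: "use the short array syntax"},
		{Pattern: `$x = $x`, Message: "self-assignment"},
	}
	report, err := RunChecks(checks, phpgrepRunner("."))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	for _, line := range report {
		fmt.Println(line)
	}
	if len(report) > 0 {
		os.Exit(1) // a non-zero status makes the CI step fail
	}
}
```

Passing the runner as a function keeps RunChecks testable without a phpgrep binary.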
Slide 62
Slide 62 text
How phpgrep is planned to be used inside
the NoVerify linter
Pluggable linter rules
Slide 63
Slide 63 text
NoVerify
Slide 64
Slide 64 text
NoVerify
Slide 65
Slide 65 text
Refactoring (search and replace)
Use case 3
Slide 66
Slide 66 text
Refactoring
array(${"*"}) => [${"*"}]
Replace old array syntax with new
isset($x) ? $x : $y => $x ?? $y
Use null coalescing operator
Modernizing the code
phpgrep performance
Running a list of patterns:
- O(N) complexity
- Becomes slow with high N
=> Optimizations are required
Slide 69
Slide 69 text
phpgrep performance
Can still be many times faster than grep with
an intricate regular expression.
It’s a question of “a few seconds” vs
“several tens of minutes”.
Slide 70
Slide 70 text
The closest functional equivalent to phpgrep
PhpStorm structural search
Slide 71
Slide 71 text
Structural search and replace (SSR)
There are some differences between the pattern
languages used by PhpStorm and phpgrep.
❏ $name$ is used for all search “variables”
❏ All filters & options are external to the pattern
Slide 73
Slide 73 text
Structural search and replace (SSR)
Filter examples.
❏ Regular expressions
❏ Type constraints
❏ Count (range)
❏ PSI-tree Groovy scripts
Slide 74
Slide 74 text
fstat(f($x, $y), $flags);
fstat($x$, $y$)
You can solve the same tasks with
SSR in almost the same way
Slide 75
Slide 75 text
So, why make phpgrep?
We know that PhpStorm is cool, but...
❏ Not everyone is using PhpStorm
❏ phpgrep is a standalone tool
❏ phpgrep is a Go library, not just a utility
Slide 78
Slide 78 text
Why make phpgrep?
We know that PhpStorm is cool, but...
❏ Not everyone is using PhpStorm
❏ phpgrep is a standalone tool without deps
❏ phpgrep is a Go library, not just a utility
Everything becomes better when re-written in Go!
Slide 79
Slide 79 text
How we can do it and what it enables
Code normalization
Slide 80
Slide 80 text
What is code normalization?
It’s a way to turn input source code X into a
normal (canonical) form.
Slide 81
Slide 81 text
What is code normalization?
It’s a way to turn input source code X into a
normal (canonical) form.
Different input sources X and Y may end up as
the same output after normalization.
Slide 82
Slide 82 text
What is code normalization?
The exact rules of what is “normalized” are not
that relevant.
Slide 83
Slide 83 text
What is code normalization?
The exact rules of what is “normalized” are not
that relevant.
What is relevant is that among N alternatives we
call only one of them canonical.
Slide 84
Slide 84 text
Code normalization
Slide 85
Slide 85 text
Why do we need normalization?
So your pattern can match more code that is effectively identical.
❏ Fuzzy code search
❏ Code duplication/similarity analysis
❏ Code simplifications, easier static analysis
Slide 88
Slide 88 text
But what about subtle details?
Some forms are *almost* identical,
but we still might want to consider them as
100% interchangeable.
Slide 89
Slide 89 text
But what about subtle details?
Some forms are *almost* identical,
but we still might want to consider them as
100% interchangeable.
We use “normalization levels” to control that.
Slide 90
Slide 90 text
Normalization levels
The best rule set depends on the goals.
The following statements apply:
Slide 91
Slide 91 text
Normalization levels
The best rule set depends on the goals.
The following statements apply:
❏ More strict => less normalization
❏ Less strict => more normalization
Slide 92
Slide 92 text
Matching more with less
Operation equivalence
Are expressions below identical?
intval($x)
(int)$x
Slide 93
Slide 93 text
Matching more with less
Operation equivalence
Are expressions below identical?
intval($x)
(int)$x
Yes!
Slide 94
Slide 94 text
Matching more with less
Operation equivalence
Are expressions below identical?
+$x
(int)$x
Slide 95
Slide 95 text
Matching more with less
Operation equivalence
Are expressions below identical?
+$x
(int)$x
Not always!
Slide 96
Slide 96 text
Matching more with less
Operation equivalence
Are expressions below identical?
+$x
(int)$x
But sometimes we don’t care
Slide 97
Slide 97 text
Matching more with less
Operation reordering
Are expressions below identical?
$x++; $y--;
$y--; $x++;
Slide 98
Slide 98 text
Matching more with less
Operation reordering
Are expressions below identical?
$x++; $y--;
$y--; $x++;
Independent ops can be reordered
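A toy Go sketch of how normalization levels could gate the operation equivalences shown above. The Level names, the Expr type and the choice of (int)$x as the canonical form are all assumptions made for illustration. The talk does not say how phpgrep represents any of this.

```go
package main

import "fmt"

// Level controls how aggressively expressions are rewritten.
// Hypothetical names: the talk only says that levels exist.
type Level int

const (
	Strict Level = iota // only rewrites forms the talk calls identical
	Fuzzy               // also rewrites "almost identical" forms
)

// Expr is a toy one-argument operation, e.g. {Op: "intval", Arg: "$x"}.
type Expr struct {
	Op  string // "intval", "(int)" or "+"
	Arg string
}

// Normalize maps equivalent forms to one canonical form, here (int)$x.
func Normalize(e Expr, level Level) Expr {
	switch {
	case e.Op == "intval":
		// intval($x) and (int)$x are treated as identical at every level.
		return Expr{Op: "(int)", Arg: e.Arg}
	case e.Op == "+" && level >= Fuzzy:
		// +$x is not always the same as (int)$x, so it is only
		// rewritten when the caller asked for a less strict level.
		return Expr{Op: "(int)", Arg: e.Arg}
	}
	return e
}

func main() {
	forms := []Expr{{"intval", "$x"}, {"(int)", "$x"}, {"+", "$x"}}
	for _, level := range []Level{Strict, Fuzzy} {
		for _, e := range forms {
			fmt.Printf("level=%d %v -> %v\n", level, e, Normalize(e, level))
		}
	}
}
```

With the less strict level all three forms collapse into one, so a single pattern can match all of them.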
Slide 99
Slide 99 text
Normalization level is a phpgrep parameter that
can improve the search results
Slide 100
Slide 100 text
Code search v2
Slide 101
Slide 101 text
Smooth transition slide...
Slide 102
Slide 102 text
The next time you’re going to do code search,
make sure you’re using the proper tools.
Like phpgrep and code normalization.
Slide 103
Slide 103 text
Closing words...
#golang
user group in Kazan
#GolangKazan
Slide 104
Slide 104 text
The end
Slide 105
Slide 105 text
Slides that are optional.
Need more?
Slide 106
Slide 106 text
Answering why using Go is a viable option
for a PHP tool.
Go performance
Slide 107
Slide 107 text
What does phpgrep do?
1. File I/O
2. PHP files parsing
3. The matching itself (AST against pattern)
Slide 108
Slide 108 text
What does phpgrep do?
1. File I/O
2. PHP files parsing
3. The matching itself (AST against pattern)
With a careful use of goroutines, it’s possible to
make I/O faster.
Slide 109
Slide 109 text
What does phpgrep do?
1. File I/O
2. PHP files parsing
3. The matching itself (AST against pattern)
(2) and (3) benefit a lot from the
performance of a compiled language.
Slide 110
Slide 110 text
Go memory management story
❏ Garbage collection
❏ Slices are the main “memory resource”
❏ Pointers should be local and short-lived
I’ll explain why it matters.
Slide 111
Slide 111 text
Garbage collector
Needs to visit every reachable pointer.
Slide 112
Slide 112 text
Garbage collector
Needs to visit every reachable pointer.
More pointers => more work.
Slide 113
Slide 113 text
Garbage collector
Needs to visit every reachable pointer.
More pointers => more work.
Can take a *lot* of execution time.
Slide 114
Slide 114 text
Slice of strings, “hidden” pointers
pool := []string{s1, s2, s3}
slice := make([]string, N)
for i := range slice {
slice[i] = pool[0]
}
Slide 115
Slide 115 text
Slice of pool indexes, pointer-free
pool := []string{s1, s2, s3}
slice := make([]int, N)
for i := range slice {
slice[i] = 0
}
Performance comparison
old time/op    new time/op    delta
2.53ms ± 1%    0.42ms ± 1%    -83.25%
Pointer-free code spends far less
time in “runtime” (GC).
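A rough way to observe the effect yourself is to time a forced GC cycle while each kind of slice is alive. This is a sketch, not the benchmark behind the numbers above. The helper names and the slice size are arbitrary, and timings will vary between machines and Go versions.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// fillStrings mirrors the "hidden pointers" slide: every element is a
// string header that contains a pointer the GC has to follow.
func fillStrings(pool []string, n int) []string {
	slice := make([]string, n)
	for i := range slice {
		slice[i] = pool[0]
	}
	return slice
}

// fillIndexes mirrors the pointer-free slide: elements are plain ints
// that refer to pool entries by position.
func fillIndexes(n int) []int {
	slice := make([]int, n)
	for i := range slice {
		slice[i] = 0
	}
	return slice
}

// gcTime measures one forced collection cycle.
func gcTime() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 1 << 21 // arbitrary size, large enough to make scanning visible
	pool := []string{"s1", "s2", "s3"}

	strs := fillStrings(pool, n)
	withPointers := gcTime()
	runtime.KeepAlive(strs) // keep the slice reachable during the GC above

	idx := fillIndexes(n)
	pointerFree := gcTime()
	runtime.KeepAlive(idx)

	fmt.Println("GC with []string live:", withPointers)
	fmt.Println("GC with []int live:   ", pointerFree)
}
```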
Slide 118
Slide 118 text
Go memory management story
❏ Garbage collection
❏ Slices are the main “memory resource”
❏ Pointers should be local and short-lived
Your memory pools should be slices of value
types (i.e. []T instead of []*T).
Slide 119
Slide 119 text
Go memory management story
❏ Garbage collection
❏ Slices are the main “memory resource”
❏ Pointers should be local and short-lived
You return a pointer to a pool slice element.
That pointer should be as local as possible.
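A minimal sketch of such a pool, with hypothetical node and pool types. It also shows why the returned pointer should stay local: a later append may move the backing array, after which an old pointer no longer refers to the pool element.

```go
package main

import "fmt"

// node is a value type without pointers: children are pool indexes.
type node struct {
	value       int
	left, right int // indexes into pool.nodes, -1 when absent
}

// pool hands out nodes from a single []node instead of allocating each
// node separately ([]T rather than []*T).
type pool struct {
	nodes []node
}

// alloc appends a node and returns its index plus a pointer to it.
// The pointer refers to the pool element only until the next alloc,
// because append may move the backing array. Keep the index for
// anything long-lived.
func (p *pool) alloc(value int) (int, *node) {
	p.nodes = append(p.nodes, node{value: value, left: -1, right: -1})
	i := len(p.nodes) - 1
	return i, &p.nodes[i]
}

func main() {
	var p pool
	rootID, _ := p.alloc(1)
	leftID, left := p.alloc(2)
	left.value *= 10 // short-lived pointer: used right away, then dropped
	p.nodes[rootID].left = leftID
	fmt.Println(p.nodes)
}
```

Storing indexes instead of pointers inside node keeps the whole pool pointer-free, which is what lets the GC skip it.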