
phpgrep: syntax-aware code search

Iskander (Alex) Sharipov

August 02, 2019

Transcript

  1. phpgrep
    syntax-aware code search
    Iskander @quasilyte
    VKontakte


  2. @quasilyte
    ❏ Go compiler (Go + x86 asm) @ Intel
    ❏ Backend infrastructure (Go) @ VK
    ❏ Gopher Konstructor
    Working on dev tools is my passion.


  3. Talk structure
    ❏ phpgrep vs grep
    ❏ phpgrep features, pattern language
    ❏ Good use cases and examples
    ❏ PhpStorm structural search
    ❏ Code normalization and its applications


  8. Today we’re code detectives


  9. Find all assignments where the assigned
    value is a string longer than 10 chars
    First mission


  10. $s = "quite a long text";
    $x = "text with \" escaped quote";
    $arr[$key] = "a string key";
    Examples that should be matched


  11. Basically, regular expressions
    Let’s try grep


  12. grep (text level)


  13. grep
    $x = "this is a text";
    Implication 1:
    sees the line above as a sequence of characters


  14. grep
    $x = "this is a text";
    Implication 2:
    uses char-oriented pattern language (regexp)


  15. grep
    $x = "this is a text";
    Implication 3:
    doesn’t know anything about PHP


  16. $x = "this is a text";
    \$\w+\s*=\s*"[^"]{10,}"\s*
    We need to deal with optional
    whitespace, but that’s OK


  17. $x = "this is a text";
    \$\w+\s*=\s*"[^"]{10,}"\s*
    But this solution is wrong.
    It doesn’t handle quote escaping


  18. $x = "this is a text";
    \$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s*
    Is it sufficient now?


  19. $x = "this is a text";
    \$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"\s*
    Is it sufficient now?
    Not really, we’re still matching
    only variable assignments

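
The two regular expressions from the slides can be compared side by side. This is only an illustrative sketch: it uses Go's standard `regexp` package rather than grep, drops the trailing `\s*` from the slide patterns so the matched text is easier to compare, and the names `naive`, `escaped` and `findAssignment` are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns from the slides, without the trailing \s*.
var (
	// naive stops at the first double quote, even an escaped one.
	naive = regexp.MustCompile(`\$\w+\s*=\s*"[^"]{10,}"`)
	// escaped treats a backslash plus the next char as one unit.
	escaped = regexp.MustCompile(`\$\w+\s*=\s*"(?:[^"\\]|\\.){10,}"`)
)

// findAssignment returns the text matched by re, or "" if nothing matched.
func findAssignment(re *regexp.Regexp, line string) string {
	return re.FindString(line)
}

func main() {
	lines := []string{
		`$s = "quite a long text";`,
		`$x = "text with \" escaped quote";`,
		`$arr[$key] = "a string key";`,
	}
	for _, line := range lines {
		fmt.Printf("line:    %s\n", line)
		fmt.Printf("naive:   %q\n", findAssignment(naive, line))
		fmt.Printf("escaped: %q\n\n", findAssignment(escaped, line))
	}
}
```

On the second line the naive pattern reports a string that is cut off at the escaped quote, and neither pattern matches the `$arr[$key]` assignment.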

  20. Matching code with regexp is like trying to
    parse PHP using only regular expressions
    We (almost) succeeded, but...


  21. phpgrep (syntax level)


  22. $x = "this is a text";
    $_ = ${"s:str"}
    Note that we don’t care about
    whitespace anymore


  23. $x = "this is a text";
    $_ = ${"s:str"} s~.{10,}
    To apply the “10 char length”
    restriction, we use result filtering
    (more on that later)


  24. Find fstat function calls with 2 arguments
    Second mission


  25. fstat($f, $flags)
    fstat(getFile($ctx, $name), 0)
    fstat($f->name, "b")
    Examples that should be matched


  26. fstat(f($x, $y), $flags);
    fstat\(.*?, [^,]*\)
    It’s hard to match with regexp
    because arguments may be
    complex and contain commas (,)

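
To see where the regexp from this slide goes wrong, it can be run against two inputs. The sketch below uses Go's `regexp` package as a stand-in for grep; the variable name `fstatRe` is an assumption of this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// fstatRe is the regular expression from the slide.
var fstatRe = regexp.MustCompile(`fstat\(.*?, [^,]*\)`)

func main() {
	// The match ends at the inner call's closing paren, so the
	// reported "call" is cut off in the middle of the argument list.
	fmt.Println(fstatRe.FindString(`fstat(f($x, $y), $flags);`))

	// A call with three arguments is accepted as well.
	fmt.Println(fstatRe.MatchString(`fstat($a, $b, $c)`))
}
```

The first result shows a truncated match, the second a false positive: commas alone do not tell the regexp where one argument ends.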

  27. fstat(f($x, $y), $flags);
    fstat($_, $_)
    With phpgrep it’s super simple to
    match arbitrary expressions


  28. Find array literals with duplicated keys
    (No regexp example this time!)
    Third mission


  29. [7=>"1", 7=>"2"]
    [$x, 7=>"1", 7=>"2"]
    [7=>"1", $x, 7=>"2", $y]
    Examples that should be matched


  30. [1=>$x, 2=>$y, 1=>$z]
    [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
    ${"*"} - capturing of 0-N exprs


  31. [1=>$x, 2=>$y, 1=>$z]
    [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
    $k - named non-empty expr
    capture


  32. [1=>$x, 2=>$y, 1=>$z]
    [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
    $_ - any non-empty expr capture


  33. [1=>$x, 2=>$y, 1=>$z]
    [${"*"},$k=>$_,${"*"},$k=>$_,${"*"}]
    An array with at least 2 identical
    key exprs. They can be located at
    any position

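
What the pattern asks for can also be spelled out by hand. The sketch below is not how phpgrep is implemented; it assumes the array literal has already been parsed into a list of hypothetical `arrayItem` values and simply looks for a key expression that repeats.

```go
package main

import "fmt"

// arrayItem is a simplified stand-in for one array literal element.
// Key is the source text of the key expression; it is empty for
// elements written without "=>".
type arrayItem struct {
	Key   string
	Value string
}

// duplicatedKey returns the first key expression that occurs twice.
func duplicatedKey(items []arrayItem) (string, bool) {
	seen := make(map[string]bool)
	for _, it := range items {
		if it.Key == "" {
			continue // elements without a key cannot collide
		}
		if seen[it.Key] {
			return it.Key, true
		}
		seen[it.Key] = true
	}
	return "", false
}

func main() {
	// [1=>$x, 2=>$y, 1=>$z]
	items := []arrayItem{{"1", "$x"}, {"2", "$y"}, {"1", "$z"}}
	key, ok := duplicatedKey(items)
	fmt.Println(key, ok)
}
```

The pattern on the slide expresses the same check in a single line, with `$k` playing the role of the repeated map key.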

  34. Features and syntax overview
    The pattern language


  35. Running phpgrep
    phpgrep . '${"x:var"}++' 'x=i,j'
    File or directory name to search.
    By default, phpgrep recurses into nested
    directories


  36. Running phpgrep
    phpgrep . '${"x:var"}++' 'x=i,j'
    Pattern to search, written in phpgrep pattern
    language (PPL)


  37. Running phpgrep
    phpgrep . '${"x:var"}++' 'x=i,j'
    Additional filter (can have many) that excludes
    results if they don’t match given criteria.
    Every filter is a separate command-line arg

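
The three parts of the command line (target, pattern, filters) map naturally onto an argument list. The following sketch builds that list with Go's `os/exec` without running anything; the helper name `phpgrepCommand` is hypothetical, and actually running the command would assume a `phpgrep` binary is available on `PATH`.

```go
package main

import (
	"fmt"
	"os/exec"
)

// phpgrepCommand prepares (but does not run) a phpgrep invocation.
// Every filter becomes its own command-line argument.
func phpgrepCommand(target, pattern string, filters ...string) *exec.Cmd {
	args := append([]string{target, pattern}, filters...)
	return exec.Command("phpgrep", args...)
}

func main() {
	cmd := phpgrepCommand(".", `${"x:var"}++`, "x=i,j")
	fmt.Printf("%q\n", cmd.Args)
}
```

Passing the pattern as a separate argument avoids shell quoting issues with `$` and `"` inside the pattern.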

  38. PPL (phpgrep pattern language)
    It’s almost normal PHP code, but with 2
    differences to keep in mind.
    1. $ is used for “any expr” matching
    2. ${""} is a special matcher expression


  39. PPL (phpgrep pattern language)
    It’s almost normal PHP code, but with 2
    differences to keep in mind.
    => Can be parsed by any PHP parser.


  40. PPL (phpgrep pattern language)
    Matcher expressions can specify the kind of
    nodes to match.


  41. PPL (phpgrep pattern language)
    Matcher expressions can specify the kind of
    nodes to match.
    Filters are used to add additional conditions to
    the matcher variables.


  42. Pattern language
    Matcher variables
    $x = $y
    All assignments
    $x = $x
    Self-assignments


  43. Pattern language
    Matching variables literally
    $x 'x=foo'
    $foo variable
    $x 'x~_id$'
    Variable with “_id” suffix


  44. Pattern language
    Matching strings
    "abc"
    "abc" string
    ${"x:str"} 'x~abc'
    String that contains "abc"


  45. Pattern language
    Matching numbers
    15
    Int of 15
    ${"x:int"} 'x=10,15'
    Int of 10 or 15


  46. Let’s use phpgrep for something cool
    Use cases and (more) examples


  47. Finding a bug across the entire code base
    Use case 1


  48. Weird operator precedence
    Easy to make a mistake, tough consequences
    const MASK = 0xff00;
    $x = 0x00ff;
    $x & MASK > 128;
    // => false (?)


  49. Weird operator precedence
    Easy to make a mistake, tough consequences
    const MASK = 0xff00;
    $x = 0x00ff;
    $x & (MASK > 128);
    No, it returns 1 (which is true)!


  50. Weird operator precedence
    Easy to make a mistake, tough consequences
    const MASK = 0xff00;
    $x = 0x00ff;
    ($x & MASK) > 128;
    // => false

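
The two groupings can be evaluated step by step. Note that Go's own precedence differs from PHP's here (in Go, `&` binds tighter than `>`), so this sketch writes both groupings out explicitly instead of relying on the parser. The function names are made up for the example.

```go
package main

import "fmt"

const mask = 0xff00

// phpGrouping mirrors how PHP parses `$x & MASK > 128`:
// the comparison runs first and its boolean result is used as 0 or 1.
func phpGrouping(x int) int {
	cmp := 0
	if mask > 128 {
		cmp = 1
	}
	return x & cmp
}

// intendedGrouping is `($x & MASK) > 128`.
func intendedGrouping(x int) bool {
	return (x & mask) > 128
}

func main() {
	x := 0x00ff
	fmt.Println(phpGrouping(x))      // 1
	fmt.Println(intendedGrouping(x)) // false
}
```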

  51. $x & $mask > $y
    Finds all similar defects in the
    code base


  52. Pattern alternations
    Currently, there is no way to express any kind of
    pattern alternation.


  53. Pattern alternations
    Currently, there is no way to express any kind of
    pattern alternation.
    Can’t say `$x <op> $y` to match any kind of
    binary operator.


  54. Pattern alternations
    Workaround: running phpgrep several times
    $x & $mask < $y
    $x & $mask == $y
    $x & $mask != $y
    (And so on)

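
Since each operator needs its own run, the list of patterns can be generated instead of typed. A minimal sketch, assuming the operator list below is the set you care about (`comparisonOps` and `precedencePatterns` are names invented for this example):

```go
package main

import "fmt"

// comparisonOps lists the operators to try; extend it as needed.
var comparisonOps = []string{">", ">=", "<", "<=", "==", "!=", "===", "!=="}

// precedencePatterns builds one phpgrep pattern per operator.
func precedencePatterns() []string {
	patterns := make([]string, 0, len(comparisonOps))
	for _, op := range comparisonOps {
		patterns = append(patterns, fmt.Sprintf("$x & $mask %s $y", op))
	}
	return patterns
}

func main() {
	for _, p := range precedencePatterns() {
		fmt.Println(p)
	}
}
```

Each printed line can then be passed to phpgrep as the pattern argument of a separate run.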

  55. Running a set of phpgrep patterns as
    checks in a CI pipeline
    Use case 2


  56. Project-specific CI checks
    Imagine that there are some project conventions
    you want to enforce.


  57. Project-specific CI checks
    Imagine that there are some project conventions
    you want to enforce.
    You can write a set of patterns that catch violations
    and make CI reject the revision.


  58. Project-specific CI checks
    1. Prepare a list of patterns.
    2. For every pattern, write an associated message.
    3. Run phpgrep for every pattern inside the pipeline.
    4. If any phpgrep run finds a match, stop the build.
    For every match, print the associated message.

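
The four steps above can be sketched as a small driver. How phpgrep is invoked and how its output is turned into match locations is deliberately left out: the `finder` function type is an assumption of this example, and `main` plugs in a fake one. The example patterns, messages and file name are illustrative only.

```go
package main

import "fmt"

// check pairs a phpgrep pattern with the message shown on a match.
type check struct {
	pattern string
	message string
}

// finder runs one pattern and returns the locations it matched.
// How it talks to phpgrep is left open here.
type finder func(pattern string) ([]string, error)

// runChecks returns one report line per match across all checks.
func runChecks(checks []check, find finder) ([]string, error) {
	var report []string
	for _, c := range checks {
		locations, err := find(c.pattern)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", c.pattern, err)
		}
		for _, loc := range locations {
			report = append(report, loc+": "+c.message)
		}
	}
	return report, nil
}

// exitCode is what a CI step would hand to os.Exit.
func exitCode(report []string) int {
	if len(report) > 0 {
		return 1 // a non-zero status makes the build stop
	}
	return 0
}

func main() {
	checks := []check{
		{"$x & $mask > $y", "suspicious precedence, add parentheses"},
		{`array(${"*"})`, "use the short array syntax"},
	}
	// fakeFind stands in for a real phpgrep run.
	fakeFind := func(pattern string) ([]string, error) {
		if pattern == "$x & $mask > $y" {
			return []string{"src/flags.php:12"}, nil
		}
		return nil, nil
	}
	report, err := runChecks(checks, fakeFind)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, line := range report {
		fmt.Println(line)
	}
	fmt.Println("exit code:", exitCode(report))
}
```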

  62. How phpgrep is planned to be used inside
    the NoVerify linter
    Pluggable linter rules


  63. NoVerify


  65. Refactoring (search and replace)
    Use case 3


  66. Refactoring
    array(${"*"}) => [${"*"}]
    Replace old array syntax with new
    isset($x) ? $x : $y => $x ?? $y
    Use null coalescing operator
    Modernizing the code


  67. Refactoring
    Project-specific evolution
    // Don’t use $conn default value!
    function derp($query, $conn = null)
    derp($x)
    derp($x, null)
    Find unwanted derp calls


  68. phpgrep performance
    Running a list of patterns:
    - O(N) complexity
    - Becomes slow with high N
    => Optimizations are required


  69. phpgrep performance
    Can still be many times faster than grep with
    an intricate regular expression.
    It’s a question of “a few seconds” vs
    “several tens of minutes”.


  70. The closest functional equivalent to phpgrep
    PhpStorm structural search


  71. Structural search and replace (SSR)
    There are some differences between the pattern
    languages used by PhpStorm and phpgrep.
    ❏ $name$ is used for all search “variables”
    ❏ All filters & options are external to the pattern


  73. Structural search and replace (SSR)
    Filter examples.
    ❏ Regular expressions
    ❏ Type constraints
    ❏ Count (range)
    ❏ PSI-tree Groovy scripts


  74. fstat(f($x, $y), $flags);
    fstat($x$, $y$)
    You can solve the same tasks with
    SSR in almost the same way


  75. So, why make phpgrep?
    We know that PhpStorm is cool, but...
    ❏ Not everyone is using PhpStorm
    ❏ phpgrep is a standalone tool
    ❏ phpgrep is a Go library, not just a utility


  78. Why make phpgrep?
    We know that PhpStorm is cool, but...
    ❏ Not everyone is using PhpStorm
    ❏ phpgrep is a standalone tool without deps
    ❏ phpgrep is a Go library, not just a utility
    Everything becomes better when re-written in Go!


  79. How we can do it and what it enables
    Code normalization


  80. What is code normalization?
    It’s a way to turn input source code X into a
    normal (canonical) form.


  81. What is code normalization?
    It’s a way to turn input source code X into a
    normal (canonical) form.
    Different input sources X and Y may end up as
    the same output after normalization.


  82. What is code normalization?
    The exact rules of what is “normalized” are not
    that relevant.


  83. What is code normalization?
    The exact rules of what is “normalized” are not
    that relevant.
    What is relevant is that among N alternatives we
    call only one canonical.


  84. Code normalization


  85. Why do we need normalization?
    So your pattern can match more identical code.
    ❏ Fuzzy code search
    ❏ Code duplication/similarity analysis
    ❏ Code simplifications, easier static analysis


  88. But what about subtle details?
    Some forms are *almost* identical,
    but we still might want to consider them as
    100% interchangeable.


  89. But what about subtle details?
    Some forms are *almost* identical,
    but we still might want to consider them as
    100% interchangeable.
    We use “normalization levels” to control that.


  90. Normalization levels
    The best rule set depends on the goals.
    The following statements apply:


  91. Normalization levels
    The best rule set depends on the goals.
    The following statements apply:
    ❏ More strict => less normalization
    ❏ Less strict => more normalization


  92. Matching more with less
    Operation equivalence
    Are the expressions below identical?
    intval($x)
    (int)$x


  93. Matching more with less
    Operation equivalence
    Are the expressions below identical?
    intval($x)
    (int)$x
    Yes!


  94. Matching more with less
    Operation equivalence
    Are the expressions below identical?
    +$x
    (int)$x


  95. Matching more with less
    Operation equivalence
    Are the expressions below identical?
    +$x
    (int)$x
    Not always!


  96. Matching more with less
    Operation equivalence
    Are the expressions below identical?
    +$x
    (int)$x
    But sometimes we don’t care

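
The idea of normalization levels can be illustrated with a toy rewriter. This sketch is an assumption-heavy simplification: it rewrites expression text with regular expressions, whereas a real normalizer would presumably work on the syntax tree, and the level names `strict` and `fuzzy` as well as the rule table are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// level says how aggressively code is rewritten.
type level int

const (
	strict level = iota // only rewrites between identical forms
	fuzzy               // also rewrites between "close enough" forms
)

// rule rewrites one textual form into the canonical one.
type rule struct {
	minLevel level
	re       *regexp.Regexp
	repl     string
}

var rules = []rule{
	{strict, regexp.MustCompile(`^intval\((\$\w+)\)$`), "(int)${1}"},
	{fuzzy, regexp.MustCompile(`^\+(\$\w+)$`), "(int)${1}"},
}

// normalize applies every rule enabled at the given level.
func normalize(expr string, lv level) string {
	for _, r := range rules {
		if lv >= r.minLevel {
			expr = r.re.ReplaceAllString(expr, r.repl)
		}
	}
	return expr
}

func main() {
	for _, expr := range []string{"intval($x)", "+$x", "(int)$x"} {
		fmt.Printf("%-12s strict: %-10s fuzzy: %s\n",
			expr, normalize(expr, strict), normalize(expr, fuzzy))
	}
}
```

At the strict level only `intval($x)` is folded into `(int)$x`; at the fuzzy level `+$x` joins them, which is the "sometimes we don't care" case.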

  97. Matching more with less
    Operation reordering
    Are the expressions below identical?
    $x++; $y--;
    $y--; $x++;


  98. Matching more with less
    Operation reordering
    Are the expressions below identical?
    $x++; $y--;
    $y--; $x++;
    Independent ops can be reordered

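
One simple way to give reorderable statements a canonical form is to sort them. The sketch below assumes the caller has already established that the statements are independent, and it splits on `;` naively (a string literal containing `;` would break it); `canonicalOrder` is a made-up helper name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalOrder sorts statements that the caller already knows to be
// independent of each other.
func canonicalOrder(code string) string {
	var stmts []string
	for _, s := range strings.Split(code, ";") {
		if s = strings.TrimSpace(s); s != "" {
			stmts = append(stmts, s+";")
		}
	}
	sort.Strings(stmts)
	return strings.Join(stmts, " ")
}

func main() {
	fmt.Println(canonicalOrder("$x++; $y--;"))
	fmt.Println(canonicalOrder("$y--; $x++;"))
}
```

Both inputs end up as the same text, so a pattern written against one ordering also finds the other.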

  99. Normalization level is a phpgrep parameter that
    can improve the search results


  100. Code search v2


  101. Smooth transition slide...


  102. The next time you’re going to do code search,
    make sure you’re using the proper tools.
    Like phpgrep and code normalization.


  103. Closing words...
    #golang
    user group in Kazan
    #GolangKazan


  104. The end


  105. Slides that are optional.
    Need more?


  106. Answering why using Go is a viable option
    for a PHP tool.
    Go performance


  107. What does phpgrep do?
    1. File I/O
    2. PHP file parsing
    3. The matching itself (AST against pattern)


  108. What does phpgrep do?
    1. File I/O
    2. PHP file parsing
    3. The matching itself (AST against pattern)
    With careful use of goroutines, it’s possible to
    make I/O faster.


  109. What does phpgrep do?
    1. File I/O
    2. PHP file parsing
    3. The matching itself (AST against pattern)
    (2) and (3) benefit a lot from the
    performance of a compiled language.

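
A common way to spread per-file work over goroutines is a small worker pool. The following is a generic sketch, not phpgrep's actual code: `processAll` and the `process` callback (which would read, parse and match one file) are names assumed for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs process for every path on a fixed number of
// goroutines and sums up the per-file results.
func processAll(paths []string, workers int, process func(path string) int) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(chan int)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- process(p)
			}
		}()
	}

	// Feed the workers, then signal that no more paths are coming.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	total := 0
	for n := range results {
		total += n
	}
	return total
}

func main() {
	paths := []string{"a.php", "bb.php", "ccc.php"}
	// A stand-in for "read, parse and match one file".
	fakeMatchCount := func(path string) int { return len(path) }
	fmt.Println(processAll(paths, 2, fakeMatchCount))
}
```

While one goroutine waits on file I/O, the others can keep parsing and matching.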

  110. Go memory management story
    ❏ Garbage collection
    ❏ Slices are the main “memory resource”
    ❏ Pointers should be local and short-lived
    I’ll explain why it matters.


  111. Garbage collector
    Needs to visit every reachable pointer.


  112. Garbage collector
    Needs to visit every reachable pointer.
    More pointers => more work.


  113. Garbage collector
    Needs to visit every reachable pointer.
    More pointers => more work.
    Can take a *lot* of execution time.


  114. Slice of strings, “hidden” pointers
    pool := []string{s1, s2, s3}
    slice := make([]string, N)
    for i := range slice {
        slice[i] = pool[0]
    }


  115. Slice of pool indexes, pointer-free
    pool := []string{s1, s2, s3}
    slice := make([]int, N)
    for i := range slice {
        slice[i] = 0
    }


  116. Slice of pool indexes, pointer-free
    - slice := make([]string, N)
    + slice := make([]int, N)
    - slice[i] = pool[0]
    + slice[i] = 0


  117. Performance comparison
    old time/op    new time/op    delta
    2.53ms ± 1%    0.42ms ± 1%    -83.25%
    Pointer-free code spends far less
    time in “runtime” (GC).

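
The two layouts from the previous slides can be put into one runnable file. This is not the benchmark behind the numbers above, only a rough sketch: the pool contents, the slice size and the crude timing via `runtime.GC` are assumptions of this example, and the printed durations will vary between machines and runs.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

var pool = []string{"foo", "bar", "baz"}

// withPointers stores the strings themselves; every element carries
// a pointer that the garbage collector has to look at.
func withPointers(n int) []string {
	s := make([]string, n)
	for i := range s {
		s[i] = pool[0]
	}
	return s
}

// withIndexes stores positions inside pool; the slice holds no pointers.
func withIndexes(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = 0
	}
	return s
}

// gcTime reports how long one forced collection takes.
func gcTime() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 1 << 20

	strs := withPointers(n)
	fmt.Println("GC with []string alive:", gcTime())
	runtime.KeepAlive(strs)

	idxs := withIndexes(n)
	fmt.Println("GC with []int alive:   ", gcTime())
	fmt.Println(pool[idxs[0]]) // an index is resolved through the pool
}
```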

  118. Go memory management story
    ❏ Garbage collection
    ❏ Slices are the main “memory resource”
    ❏ Pointers should be local and short-lived
    Your memory pools should be slices of value
    types (i.e. []T instead of []*T).


  119. Go memory management story
    ❏ Garbage collection
    ❏ Slices are the main “memory resource”
    ❏ Pointers should be local and short-lived
    You return a pointer to a pool slice element.
    That pointer should be as local as possible.

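
A pool along these lines might look as follows. This is a minimal sketch under assumptions: the `node` fields, the `nodePool` type and its methods are invented for illustration and are not taken from phpgrep.

```go
package main

import "fmt"

// node is a value type without pointers, so a []node holds
// nothing for the garbage collector to follow.
type node struct {
	kind  int32
	start int32
	end   int32
}

// nodePool keeps all nodes in one slice of values ([]node, not []*node).
type nodePool struct {
	nodes []node
}

func newNodePool(capacity int) *nodePool {
	return &nodePool{nodes: make([]node, 0, capacity)}
}

// alloc hands out a pointer to the next pool element.
// Use the pointer right away and do not store it: once the slice
// grows past its capacity the elements move to a new backing array
// and older pointers refer to a stale copy.
func (p *nodePool) alloc() *node {
	p.nodes = append(p.nodes, node{})
	return &p.nodes[len(p.nodes)-1]
}

// reset makes all elements reusable without freeing memory.
func (p *nodePool) reset() {
	p.nodes = p.nodes[:0]
}

func main() {
	pool := newNodePool(1024)
	n := pool.alloc() // keep n local and short-lived
	n.kind, n.start, n.end = 1, 10, 25
	fmt.Println(len(pool.nodes), pool.nodes[0])
	pool.reset()
	fmt.Println(len(pool.nodes))
}
```

Storing an index into `pool.nodes` instead of the pointer is the safer choice whenever a reference has to outlive the current function.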