Mastering regex incantations

Regular expressions are a powerful tool in every software engineer's toolbox. If you know them well, you can apply them whenever you need to analyse a regular construct. In this talk, I will present the most advanced parts of regex syntax and discuss related concepts.

Tomasz Kowalczyk

November 17, 2017

Transcript

  1. regex incantations
    mastering
    Tomasz Kowalczyk / @tmmx

  4. /introduction/

  5. regular expression

  6. pattern match

  7. regex flavors
    syntax and feature differences

  8. Perl-Compatible
    Regular Expressions

  9. PHP Hypertext
    Preprocessor

  10. PCRE in PHP
    preg_​filter
    preg_​grep
    preg_​last_​error
    preg_​match_​all
    preg_​match
    preg_​quote
    preg_​replace_​callback_​array
    preg_​replace_​callback
    preg_​replace
    preg_​split

  11. ereg_​replace
    ereg
    eregi_​replace
    eregi
    split
    spliti
    sql_​regcase
    DEPRECATED IN PHP 5.3.0
    REMOVED IN PHP 7.0.0

  12. [character] [^classes]
    (capture) (?:groups)
    alter | nations
    optionals?
    quan+ tif* iers{1,2}

  13. /incantations/

  14. xkcd.com/208

  16. atomic groups
    lookarounds

  17. (?=regex) lookahead
    (?!regex) negative lookahead
    (?<=regex) lookbehind
    (?<!regex) negative lookbehind

  18. ~^A(?=\d)\d(?<=\d)Z$~
    A4Z
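
    A minimal PHP sketch of the pattern above, assuming `preg_match` as the entry point. It shows that the lookahead and lookbehind only assert what surrounds the current position and consume nothing. The second pattern and the variable names are illustrative and not taken from the talk.

    ```php
    <?php
    // Pattern from the slide: (?=\d) looks ahead, (?<=\d) looks behind.
    $pattern = '~^A(?=\d)\d(?<=\d)Z$~';

    $acceptsDigit = preg_match($pattern, 'A4Z');   // digit between A and Z
    $rejectsLetter = preg_match($pattern, 'AbZ');  // lookahead (?=\d) fails on "b"

    // Lookarounds are zero-width: "A" and "Z" are required but not part of the match.
    preg_match('~(?<=A)\d(?=Z)~', 'A4Z', $m);
    $matched = $m[0];

    var_dump($acceptsDigit, $rejectsLetter, $matched);
    ```

    Only the digit ends up in `$matched`, because the surrounding letters are checked but never consumed.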

  19. branch reset
    different captures under single group

  20. (?|(branch)|(reset))

  21. ~(?|(\d+)|([a-zA-Z]+))~g
    SymfonyCon2017
    Matches: SymfonyCon, 2017
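
    The effect of the branch reset group is easiest to see next to a plain alternation. The sketch below assumes `preg_match_all` with its default pattern ordering. The comparison pattern without `(?|...)` is an addition for illustration.

    ```php
    <?php
    $subject = 'SymfonyCon2017';

    // Branch reset: both alternatives write their capture into group 1.
    preg_match_all('~(?|(\d+)|([a-zA-Z]+))~', $subject, $reset);

    // Plain alternation: digits go to group 1, letters to group 2.
    preg_match_all('~(\d+)|([a-zA-Z]+)~', $subject, $plain);

    var_dump($reset[1]);  // both matches are available under the same index
    var_dump($plain[1], $plain[2]);
    ```

    With the plain alternation the caller has to check two groups per match. With the branch reset, group 1 is always the one to read.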

  22. backreferences
    reference previous capture groups

  23. (?<foo>\d+)(?&foo)
    (?<foo>\d+)(?1)
    (?<foo>\d+)\g{-1}

  24. ~^(?<r>A)(?&r)\1\g{-1}$~g
    AAAA
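
    The notations on these slides fall into two kinds: `\1` and `\g{-1}` repeat the text a group captured, while `(?1)` and `(?&name)` run the group's pattern again. A small sketch of the difference, with illustrative digit patterns that are not from the talk. The group name `r` in the last pattern is assumed from the `(?&r)` call on the slide.

    ```php
    <?php
    // \1 must see the same text again; (?1) only needs the same pattern to match.
    $backreference = '~^(\d)\1$~';
    $subroutine = '~^(\d)(?1)$~';

    $sameDigit = preg_match($backreference, '11');
    $otherDigit = preg_match($backreference, '12');   // "2" is not the captured "1"
    $anyDigit = preg_match($subroutine, '12');        // \d is applied a second time

    // Slide pattern: one capture, one subroutine call, two backreferences.
    $slide = preg_match('~^(?<r>A)(?&r)\1\g{-1}$~', 'AAAA');

    var_dump($sameDigit, $otherDigit, $anyDigit, $slide);
    ```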

  25. conditionals
    test current position and choose path

  26. (?(check)true|false)

  27. ~(?(?=A)A\d{2}|00Z)~g
    A1100Z
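
    As a sketch, the conditional above can be run with `preg_match_all` to see which branch fires at each position. The variable names are illustrative.

    ```php
    <?php
    // If the next character is "A", take the first branch (A plus two digits),
    // otherwise take the second branch (the literal 00Z).
    preg_match_all('~(?(?=A)A\d{2}|00Z)~', 'A1100Z', $m);
    $found = $m[0];

    var_dump($found);
    ```

    The first match comes from the "true" branch and the second from the "false" branch.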

  28. permanent anchors
    always match start or end of the input

  29. \Aregex\Z

  30. ~^\d+$~gm
    2016
    2017
    2018

  31. ~\A\d+\Z~gm
    2016
    2017
    2018

  32. ~\A\d+\Z~gm
    2017
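
    The three slides above can be reproduced with a short sketch, assuming the inputs are joined with `\n`. The last two lines go slightly beyond the slides: `\Z` still accepts one trailing newline, while lowercase `\z` matches only at the very end.

    ```php
    <?php
    $years = "2016\n2017\n2018";

    // With the m modifier, ^ and $ match around every line.
    $perLine = preg_match_all('~^\d+$~m', $years);

    // \A and \Z ignore the m modifier and anchor to the whole input.
    $wholeInput = preg_match_all('~\A\d+\Z~m', $years);
    $singleLine = preg_match_all('~\A\d+\Z~m', '2017');

    // \Z tolerates a final newline, \z does not.
    $upperZ = preg_match('~\A\d+\Z~', "2017\n");
    $lowerZ = preg_match('~\A\d+\z~', "2017\n");

    var_dump($perLine, $wholeInput, $singleLine, $upperZ, $lowerZ);
    ```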

  33. recursion
    match the regex itself inside the regex

  34. (?R)

  35. ~a(?R)?z~
    ~a(?R)?z~
    ~a(?R)?z~
    ~a(?R)?z~
    aaazzz
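
    A sketch of the recursive pattern in PHP. Each `(?R)` re-enters the whole regex, so every `a` has to be closed by its own `z`. The unbalanced input `aaz` is an illustrative addition.

    ```php
    <?php
    $pattern = '~a(?R)?z~';

    // Three levels of nesting: a( a( a z ) z ) z
    preg_match($pattern, 'aaazzz', $balanced);

    // Unbalanced input: only the innermost balanced pair can match.
    preg_match($pattern, 'aaz', $partial);

    var_dump($balanced[0], $partial[0]);
    ```

    Because the pattern is not anchored, the second call still finds a match, but only for the balanced part of the input.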

  36. subroutines
    define and call subregexes

  37. (?(DEFINE)(?<name>regex))
    (?(DEFINE)
    (?<name1>regex)
    (?<name2>regex)
    )

  38. ~(?(DEFINE)
    (?<alpha>[a-zA-Z]+)
    (?<digits>[0-9]+)
    )
    ((?&alpha))((?&digits))~x
    SymfonyCon2017
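
    A runnable sketch of the subroutine pattern. The group names `alpha` and `digits` are taken from the `(?&alpha)` and `(?&digits)` calls on the slide. Groups inside `(?(DEFINE)...)` still count towards the numbering, so the two calling groups are numbers 3 and 4.

    ```php
    <?php
    // The x modifier allows whitespace and line breaks inside the pattern.
    $pattern = '~(?(DEFINE)
        (?<alpha>[a-zA-Z]+)
        (?<digits>[0-9]+)
    )
    ((?&alpha))((?&digits))~x';

    preg_match($pattern, 'SymfonyCon2017', $m);

    // Groups 1 and 2 only define subpatterns and never capture anything themselves.
    $letters = $m[3];
    $numbers = $m[4];

    var_dump($letters, $numbers);
    ```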

  39. literal escapes

  40. \Qescape\E

  41. ~[\Q!@#$%^&*()[]{}\E]+~
    {[(!@#$%^&*)]}
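
    A small sketch of literal escaping with an illustrative input that is not from the talk. It also shows `preg_quote`, listed earlier among the PCRE functions, which does the same job for strings that are only known at runtime.

    ```php
    <?php
    // Between \Q and \E, "." and "*" lose their special meaning.
    $literal = preg_match('~^\Q1.5*2\E$~', '1.5*2');
    $notWildcard = preg_match('~^\Q1.5*2\E$~', '1x5552');

    // preg_quote() escapes metacharacters; the second argument names the delimiter.
    $needle = 'a.b~c';
    $quoted = preg_match('~^' . preg_quote($needle, '~') . '$~', 'a.b~c');

    var_dump($literal, $notWildcard, $quoted);
    ```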

  42. forcing failure

  43. (*FAIL)
    (?!)

  44. ~it (*FAIL)ed~
    ~it (?!)ed~
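
    Both spellings can be checked quickly in PHP. The subject string is an illustrative choice. The call to `preg_last_error` confirms that the result is an ordinary non-match and not a runtime error.

    ```php
    <?php
    // (*FAIL) and the empty negative lookahead (?!) fail at the point they are reached.
    $verb = preg_match('~it (*FAIL)ed~', 'it failed');
    $lookahead = preg_match('~it (?!)ed~', 'it failed');
    $error = preg_last_error();

    var_dump($verb, $lookahead, $error === PREG_NO_ERROR);
    ```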

  45. /quirks/

  46. xkcd.com/1171

  48. the dot

  49. 16-11-2017
    \d{2}.\d{2}.\d{4}

  50. 16a11z2017

  51. \d{2}[-.]\d{2}[-.]\d{4}
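
    The dot quirk from the last three slides, sketched in PHP. The `^` and `$` anchors and the delimiters are additions so that each pattern has to cover the whole input.

    ```php
    <?php
    $loose = '~^\d{2}.\d{2}.\d{4}$~';            // unescaped dot: any character
    $strict = '~^\d{2}[-.]\d{2}[-.]\d{4}$~';     // only "-" or a literal "."

    $looseDate = preg_match($loose, '16-11-2017');
    $looseJunk = preg_match($loose, '16a11z2017');   // accepted by accident
    $strictJunk = preg_match($strict, '16a11z2017');
    $strictDots = preg_match($strict, '16.11.2017');

    var_dump($looseDate, $looseJunk, $strictJunk, $strictDots);
    ```

    Inside a character class the dot is literal, which is why `[-.]` needs no backslash.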

  52. line anchors

  53. ~^\d+$~m
    abc
    def
    123

  54. ~\A\d+\Z~m
    abc
    def
    123

  55. balanced constructs

  56. ~<([a-z]+)>(?R)?</\1>~gi
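
    A sketch of matching balanced tags with recursion. It assumes the closing part of the pattern reads `</\1>`, which is a reconstruction: the transcript lost the characters before `\1`. The sample inputs are illustrative.

    ```php
    <?php
    // Assumed pattern: an opening tag, an optional nested copy of the whole
    // regex, then the closing tag that repeats the captured name.
    $pattern = '~<([a-z]+)>(?R)?</\1>~i';

    preg_match($pattern, '<a><b></b></a>', $nested);
    $crossed = preg_match($pattern, '<a><b></a></b>');

    var_dump($nested[0], $crossed);
    ```

    Group 1 is restored when a recursion level returns, so the outer `</\1>` still refers to the outer tag name. This handles only directly nested tags with no text in between, which is part of the reason the next slide links to the well-known answer about parsing HTML with regex.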

  57. stackoverflow.com/a/1732454/443341

  59. PREG_OFFSET_CAPTURE
    and the multibyte input

  60. ’’’’[xx]’’[yy]
    regular apostrophe (U+0027): '
    RIGHT SINGLE QUOTATION MARK (U+2019): ’

  61. $part = substr($text, 0, $match[1]);
    $offset = mb_strlen($part, 'utf-8');
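
    A self-contained sketch of the conversion above. It assumes UTF-8 input, the mbstring extension, and PHP 7 or later for the `\u{...}` escape. The sample text is a shortened version of the one on the previous slide.

    ```php
    <?php
    // U+2019 occupies three bytes in UTF-8, so byte and character offsets diverge.
    $text = "\u{2019}\u{2019}[xx]";

    // PREG_OFFSET_CAPTURE reports byte offsets, even with the u modifier.
    preg_match('~\[xx\]~u', $text, $match, PREG_OFFSET_CAPTURE);
    $byteOffset = $match[0][1];

    // Cut the prefix by bytes, then count its characters.
    $part = substr($text, 0, $byteOffset);
    $charOffset = mb_strlen($part, 'utf-8');

    var_dump($byteOffset, $charOffset);
    ```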

  62. email validation

  63. ex-parrot.com/~pdw/Mail-RFC822-Address.html
    (?:(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<
    >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\
    r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\
    [\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\
    t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?
    :\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[
    ([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(
    ?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\"
    .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t
    ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
    |\\.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*:(?
    :(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@
    ,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\
    n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
    ]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t]
    )*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\
    r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
    ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?:
    \r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\
    [\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])
    +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\
    \.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t]
    )*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<
    >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z
    |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?
    :(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^(
    )<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])
    +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\
    \.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.
    (?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\"
    .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["
    ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

  64. /use cases/

  65. Shortcode

  66. [name=bbCode arg=val param] content [/name]

  67. [name=bbCode arg=val param] content [/name]

  68. /\[(\[?)([a-zA-Z]+)(?![\w-])([^\]\/]*(?:\/(?!
    \])[^\]\/]*)*?)(?:(\/)\]|\](?:([^\[]*+(?:\[(?
    !\/\2\])[^\[]*+)*+)\[\/\2\])?)(\]?)/s
    /([\w-]+)\s*=\s*"([^"]*)"(?:\s|$)|([\w-]+)\s*
    =\s*'([^']*)'(?:\s|$)|([\w-]+)\s*=\s*([^\s'"]
    +)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/
    https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L239
    https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L448

  69. ~((?:\[\s*(?[a-zA-Z0-9-_]+)\s*(?:\=\
    s*(?\"(?:[^\"\\]*(?:\\.[^\"\\]*)*)
    \"|(?!=(?:\s*|\]|\/\]))))?\s*(?>(?:\s*(?:\w+(?:\s*\=\s*\"(?:[^\"\\]*(?:\\
    .[^\"\\]*)*)\"|\s*\=\s*(?!=(?:\s*|\]|\/\])
    )|(?=\s|\]|\/\s*\]|$))))*)\s*(?:\](?nt>.*?)\[\s*(?\/)\s*(\ke>)\s*\]|\]|(?\/)\s*\])))~us
    github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  70. Generator

  71. 3a[1!n]4!a10n
    between one and three alpha characters
    exactly one optional digit
    exactly four alpha characters
    and between one and ten digits

  72. ~^[a-zA-Z]{1,3}[0-9]{1}?[a-zA-Z]{4}[0-9]{1,10}$~
    between one and three alpha characters
    exactly one optional digit
    exactly four alpha characters
    and between one and ten digits

  73. ~^(?<a1>[a-zA-Z]{1,3})(?<n1>[0-9]{1}?)(?<a2>[a-zA-Z]{4})(?<n2>[0-9]{1,10})$~
    THU3NDER1337
    a1: THU, n1: 3, a2: NDER, n2: 1337
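
    A sketch of reading the named groups in PHP. The group names `a1`, `n1`, `a2` and `n2` come from the result line above. The `array_filter` step and the second input are illustrative additions.

    ```php
    <?php
    $pattern = '~^(?<a1>[a-zA-Z]{1,3})(?<n1>[0-9]{1}?)(?<a2>[a-zA-Z]{4})(?<n2>[0-9]{1,10})$~';

    preg_match($pattern, 'THU3NDER1337', $m);

    // preg_match stores every named group twice (by name and by number);
    // keep only the string keys.
    $named = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);

    // Same input without the digit after the first letters.
    $withoutDigit = preg_match($pattern, 'THUNDER1337');

    var_dump($named, $withoutDigit);
    ```

    The second call does not match: `{1}?` is a lazy "exactly one", not an optional digit. If the digit is meant to be optional, as the description says, `[0-9]?` would express that.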

  74. /tools/

  75. regular-expressions.info

  76. RexEgg

  77. Debuggex

  78. github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  79. Regex101

  80. /summary/

  81. (?P)
    Twitter / @tmmx

  82. please rate the talk and leave feedback
    joind.in/talk/b4015

  83. Resources
    http://rexegg.com
    http://www.regular-expressions.info
    https://en.wikipedia.org/wiki/Syntax_diagram
    https://en.wikipedia.org/wiki/Regular_expression
    Tools
    https://regex101.com
    https://debuggex.com
    http://regexper.com
    http://regexr.com
    Pictures (Creative Commons)
    https://www.flickr.com/photos/ghor/8394379683 (forest)

  84. (?=thanks!)
    Twitter / @tmmx
