Mastering regex incantations

Regular expressions are a powerful tool in every software engineer's toolbox. If you know them well, you can draw on that power whenever you analyse text with a regular structure. In this talk, I present the most advanced parts of regex syntax and discuss related concepts.

Tomasz Kowalczyk

November 17, 2017

Transcript

  1. mastering regex incantations
    Tomasz Kowalczyk / @tmmx

  2. /introduction/

  3. regular expression

  4. pattern match

  5. regex flavors
    syntax and feature differences

  6. Perl-Compatible
    Regular Expressions

  7. PHP Hypertext
    Preprocessor

  8. PCRE in PHP
    preg_filter
    preg_grep
    preg_last_error
    preg_match_all
    preg_match
    preg_quote
    preg_replace_callback_array
    preg_replace_callback
    preg_replace
    preg_split

  9. ereg_replace
    ereg
    eregi_replace
    eregi
    split
    spliti
    sql_regcase
    DEPRECATED IN PHP 5.3.0
    REMOVED IN PHP 7.0.0

  10. [character] [^classes]
    (capture) (?:groups)
    alter | nations
    optionals?
    quan+ tif* iers{1,2}

  11. /incantations/

  12. xkcd.com/208

  13. atomic groups
    lookarounds

  14. (?=regex) lookahead
    (?!regex) negative lookahead
    (?<=regex) lookbehind
    (?<!regex) negative lookbehind

  15. ~^A(?=\d)\d(?<=\d)Z$~
    A4Z
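
A minimal PHP sketch of the pattern above, assuming `preg_match` as the entry point. The `AXZ` subject and the shorter second pattern are illustrative additions, not taken from the slides; they only show that lookarounds assert without consuming input.

```php
<?php
// Pattern from the slide: the lookahead asserts that a digit follows "A",
// the lookbehind asserts that a digit precedes "Z".
$pattern = '~^A(?=\d)\d(?<=\d)Z$~';

$matchesA4Z = preg_match($pattern, 'A4Z'); // digit between A and Z
$matchesAXZ = preg_match($pattern, 'AXZ'); // lookahead fails on "X"

// Illustrative: the lookahead is zero-width, so the digit is not part of the match.
preg_match('~A(?=\d)~', 'A4Z', $m);

var_dump($matchesA4Z, $matchesAXZ, $m[0]);
```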

  16. branch reset
    different captures under single group

  17. (?|(branch)|(reset))

  18. ~(?|(\d+)|([a-zA-Z]+))~g
    SymfonyCon2017
    Matches: SymfonyCon, 2017
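
A small PHP sketch of the branch reset example, assuming `preg_match_all` stands in for the `g` flag shown on the slide (PHP has no `g` modifier). The plain-alternation pattern is an added comparison, not part of the talk.

```php
<?php
$subject = 'SymfonyCon2017';

// Branch reset: both alternatives store their capture in group 1.
preg_match_all('~(?|(\d+)|([a-zA-Z]+))~', $subject, $reset);

// Plain alternation for comparison: digits go to group 1, letters to group 2,
// so group 1 is empty for the "SymfonyCon" match.
preg_match_all('~(?:(\d+)|([a-zA-Z]+))~', $subject, $plain);

print_r($reset[1]); // SymfonyCon, 2017
print_r($plain[1]); // empty string, 2017
```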

  19. backreferences
    reference previous capture groups

  20. (?<foo>\d+)(?&foo)
    (?<foo>\d+)(?1)
    (?<foo>\d+)\g{-1}

  21. ~^(?<r>A)(?&r)\1\g{-1}$~g
    AAAA
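
A hedged PHP sketch of the three reference styles. It assumes the group on the slide is named `r` (inferred from the `(?&r)` call); the digit patterns and the group name `d` are illustrative additions that show how a subroutine call differs from a backreference.

```php
<?php
// Assumed reading of the slide pattern: named group, subroutine call,
// numbered backreference, relative backreference.
$pattern = '~^(?<r>A)(?&r)\1\g{-1}$~';
$matchesAAAA = preg_match($pattern, 'AAAA');

// (?&d) re-runs the subpattern \d, so a different second digit is accepted.
$callDifferent = preg_match('~^(?<d>\d)(?&d)$~', '12');

// \1 and \g{-1} require the exact text that the group captured.
$backrefDifferent = preg_match('~^(?<d>\d)\1$~', '12');
$relativeSame = preg_match('~^(?<d>\d)\g{-1}$~', '11');

var_dump($matchesAAAA, $callDifferent, $backrefDifferent, $relativeSame);
```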

  22. conditionals
    test current position and choose path

  23. (?(check)true|false)

  24. ~(?(?=A)A\d{2}|00Z)~g
    A1100Z
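
A short PHP sketch of the conditional above, again assuming `preg_match_all` in place of the `g` flag. At each position the lookahead `(?=A)` selects which branch is tried.

```php
<?php
// If the next character is "A", match "A" plus two digits; otherwise match "00Z".
preg_match_all('~(?(?=A)A\d{2}|00Z)~', 'A1100Z', $m);

print_r($m[0]); // A11, 00Z
```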

  25. permanent anchors
    always match start or end of the input

  26. ~^\d+$~gm
    2016
    2017
    2018

  27. ~\A\d+\Z~gm
    2016
    2017
    2018

  28. ~\A\d+\Z~gm
    2017
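
The three slides above can be sketched in PHP as follows, assuming `preg_match_all` for the `g` flag. The last two checks, with a trailing newline and `\z`, are an added illustration of how `\Z` treats a final line break.

```php
<?php
$lines = "2016\n2017\n2018";

// With /m, ^ and $ match at every line boundary.
$perLine = preg_match_all('~^\d+$~m', $lines, $a);

// \A and \Z ignore /m and only match at the start and end of the whole input.
$wholeInput = preg_match_all('~\A\d+\Z~m', $lines, $b);
$singleLine = preg_match('~\A\d+\Z~m', '2017');

// Illustrative: \Z still allows one trailing newline, \z does not.
$upperZ = preg_match('~\A\d+\Z~', "2017\n");
$lowerZ = preg_match('~\A\d+\z~', "2017\n");

var_dump($perLine, $wholeInput, $singleLine, $upperZ, $lowerZ);
```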

  29. recursion
    match the regex itself inside the regex

  30. ~a(?R)?z~
    aaazzz
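
A minimal PHP sketch of the recursive pattern. The anchored variant with `(?1)` is an added assumption for whole-string validation: anchors cannot simply be wrapped around `(?R)`, because the recursion would repeat them too.

```php
<?php
// Pattern from the slide: "a", optionally the whole pattern again, then "z".
preg_match('~a(?R)?z~', 'aaazzz', $m);

// Illustrative variant: recurse into group 1 only, so ^ and $ stay outside the recursion.
$balanced = preg_match('~^(a(?1)?z)$~', 'aaazzz');
$unbalanced = preg_match('~^(a(?1)?z)$~', 'aaazz');

var_dump($m[0], $balanced, $unbalanced);
```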

  31. subroutines
    define and call subregexes

  32. (?(DEFINE)(?<name>regex))
    (?(DEFINE)
    (?<name1>regex)
    (?<name2>regex)
    )

  33. ~(?(DEFINE)
    (?<alpha>[a-zA-Z]+)
    (?<digits>[0-9]+)
    )
    ((?&alpha))((?&digits))~x
    SymfonyCon2017
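
A hedged PHP sketch of the `DEFINE` example. It assumes the two definitions are named `alpha` and `digits`, as the `(?&alpha)` and `(?&digits)` calls suggest. The definitions themselves never match anything, so the useful captures are groups 3 and 4.

```php
<?php
// The (?(DEFINE)...) block only declares subpatterns; the last line calls them.
$pattern = '~(?(DEFINE)
    (?<alpha>[a-zA-Z]+)
    (?<digits>[0-9]+)
)
((?&alpha))((?&digits))~x';

preg_match($pattern, 'SymfonyCon2017', $m);

// Groups 1 and 2 belong to the definitions; 3 and 4 wrap the calls.
var_dump($m[0], $m[3], $m[4]);
```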

  34. literal escapes

  35. ~[\Q!@#$%^&*()[]{}\E]+~
    {[(!@#$%^&*)]}
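
A small PHP sketch of the literal escape above. Building the same character class with `preg_quote` (listed on slide 8) is an added alternative, not something the slide shows.

```php
<?php
$subject = '{[(!@#$%^&*)]}';

// Everything between \Q and \E is taken literally, even inside a character class.
preg_match('~[\Q!@#$%^&*()[]{}\E]+~', $subject, $quoted);

// Illustrative alternative: let preg_quote escape the special characters.
$class = preg_quote('!@#$%^&*()[]{}', '~');
preg_match('~[' . $class . ']+~', $subject, $built);

var_dump($quoted[0], $built[0]);
```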

  36. forcing failure

  37. ~it (*FAIL)ed~
    ~it (?!)ed~
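
A minimal PHP sketch showing that both spellings force a failure at that point. The subject string and the control pattern are illustrative assumptions.

```php
<?php
$subject = 'it failed';

// (*FAIL) and the empty negative lookahead (?!) can never succeed.
$withVerb = preg_match('~it (*FAIL)ed~', $subject);
$withLookahead = preg_match('~it (?!)ed~', $subject);

// Control: the same shape without a forced failure does match.
$control = preg_match('~it \w+ed~', $subject);

var_dump($withVerb, $withLookahead, $control);
```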

  38. xkcd.com/1171

  39. 16-11-2017
    \d{2}.\d{2}.\d{4}

  40. \d{2}[-.]\d{2}[-.]\d{4}
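
A hedged PHP sketch of the difference between the two date patterns. The `^` and `$` anchors, the extra subjects, and the backreference variant are additions for illustration; the unescaped dot in the first pattern matches any character.

```php
<?php
$loose = '~^\d{2}.\d{2}.\d{4}$~';         // "." matches any character
$strict = '~^\d{2}[-.]\d{2}[-.]\d{4}$~';  // only "-" or "." as separator
$same = '~^\d{2}([-.])\d{2}\1\d{4}$~';    // illustrative: both separators must be equal

$looseJunk = preg_match($loose, '16x11y2017');
$strictJunk = preg_match($strict, '16x11y2017');
$strictDate = preg_match($strict, '16-11-2017');
$strictMixed = preg_match($strict, '16-11.2017');
$sameMixed = preg_match($same, '16-11.2017');

var_dump($looseJunk, $strictJunk, $strictDate, $strictMixed, $sameMixed);
```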

  41. line anchors

  42. ~^\d+$~m
    abc
    def
    123

  43. ~\A\d+\Z~m
    abc
    def
    123

  44. balanced constructs

  45. ~<([a-z]+)>(?R)?</\1>~gi
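
A hedged PHP sketch of matching balanced tags with recursion. It assumes the closing part of the slide pattern reads `</\1>`, and the two subjects are made up for illustration. PCRE restores group 1 when a recursion returns, so each level checks its own tag name.

```php
<?php
// Opening tag, optionally a nested pair, then the closing tag with the same name.
$pattern = '~<([a-z]+)>(?R)?</\1>~i';

preg_match($pattern, '<a><b></b></a>', $m);
$crossed = preg_match($pattern, '<a><b></a></b>');

var_dump($m[0], $crossed);
```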

  46. stackoverflow.com/a/1732454/443341

  47. PREG_OFFSET_CAPTURE
    and the multibyte input

  48. ’’’’[xx]’’[yy]
    regular apostrophe (U+0027): '
    RIGHT SINGLE QUOTATION MARK (U+2019): ’

  49. $part = substr($text, 0, $match[1]);
    $offset = mb_strlen($part, 'utf-8');
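
A self-contained PHP sketch of the two lines above, using the sample text from the previous slide and assuming the mbstring extension is available. The search pattern for `[yy]` is an illustrative choice. `PREG_OFFSET_CAPTURE` reports byte offsets even with the `u` modifier, so the prefix is measured again in characters.

```php
<?php
// Four and two RIGHT SINGLE QUOTATION MARK characters (3 bytes each in UTF-8).
$text = str_repeat("\u{2019}", 4) . '[xx]' . str_repeat("\u{2019}", 2) . '[yy]';

preg_match('~\[yy\]~u', $text, $match, PREG_OFFSET_CAPTURE);
$byteOffset = $match[0][1];

// Convert the byte offset into a character offset, as on the slide.
$part = substr($text, 0, $byteOffset);
$offset = mb_strlen($part, 'utf-8');

var_dump($byteOffset, $offset); // 22 bytes, 10 characters
```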

  50. email validation

  51. ex-parrot.com/~pdw/Mail-RFC822-Address.html
    (?:(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<
    >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\
    r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\
    [\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\
    t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?
    :\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[
    ([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(
    ?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\"
    .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t
    ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
    |\\.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*:(?
    :(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@
    ,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\
    n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
    ]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t]
    )*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\
    r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
    ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?:
    \r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\
    [\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])
    +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\
    \.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t]
    )*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<
    >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z
    |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?
    :(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^(
    )<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])
    +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\
    \.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.
    (?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\"
    .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["
    ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

  52. [name=bbCode arg=val param] content [/name]

  53. [name=bbCode arg=val param] content [/name]

  54. /\[(\[?)([a-zA-Z]+)(?![\w-])([^\]\/]*(?:\/(?!
    \])[^\]\/]*)*?)(?:(\/)\]|\](?:([^\[]*+(?:\[(?
    !\/\2\])[^\[]*+)*+)\[\/\2\])?)(\]?)/s
    /([\w-]+)\s*=\s*"([^"]*)"(?:\s|$)|([\w-]+)\s*
    =\s*'([^']*)'(?:\s|$)|([\w-]+)\s*=\s*([^\s'"]
    +)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/
    https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L239
    https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L448

  55. ~((?:\[\s*(?[a-zA-Z0-9-_]+)\s*(?:\=\
    s*(?\"(?:[^\"\\]*(?:\\.[^\"\\]*)*)
    \"|(?!=(?:\s*|\]|\/\]))))?\s*(?>(?:\s*(?:\w+(?:\s*\=\s*\"(?:[^\"\\]*(?:\\
    .[^\"\\]*)*)\"|\s*\=\s*(?!=(?:\s*|\]|\/\])
    )|(?=\s|\]|\/\s*\]|$))))*)\s*(?:\](?nt>.*?)\[\s*(?\/)\s*(\ke>)\s*\]|\]|(?\/)\s*\])))~us
    github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  56. 3a[1!n]4!a10n
    between one and three alpha characters
    exactly one optional digit
    exactly four alpha characters
    and between one and ten digits

  57. ~^[a-zA-Z]{1,3}(?:[0-9]{1})?[a-zA-Z]{4}[0-9]{1,10}$~
    between one and three alpha characters
    exactly one optional digit
    exactly four alpha characters
    and between one and ten digits

  58. ~^(?<a1>[a-zA-Z]{1,3})(?<n1>(?:[0-9]{1})?)(?<a2>[a-zA-Z]{4})(?<n2>[0-9]{1,10})$~
    THU3NDER1337
    a1: THU, n1: 3, a2: NDER, n2: 1337
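
A hedged PHP sketch of the named-group version, assuming the group names `a1`, `n1`, `a2` and `n2` from the result line above. In PCRE, `{1}?` is a lazy "exactly one" rather than an optional digit, so this sketch writes the optional part as `(?:[0-9]{1})?`; the second subject, without the digit, is an illustrative addition.

```php
<?php
$pattern = '~^(?<a1>[a-zA-Z]{1,3})(?<n1>(?:[0-9]{1})?)(?<a2>[a-zA-Z]{4})(?<n2>[0-9]{1,10})$~';

preg_match($pattern, 'THU3NDER1337', $withDigit);
preg_match($pattern, 'THUNDER1337', $withoutDigit);

// Named groups are available by name next to their numeric index.
var_dump($withDigit['a1'], $withDigit['n1'], $withDigit['a2'], $withDigit['n2']);
var_dump($withoutDigit['n1']); // empty string when the optional digit is absent
```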

  59. regular-expressions.info

  60. github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  61. (?P)
    Twitter / @tmmx

  62. please rate the talk and leave feedback
    joind.in/talk/b4015

  63. Resources
    http://rexegg.com
    http://www.regular-expressions.info
    https://en.wikipedia.org/wiki/Syntax_diagram
    https://en.wikipedia.org/wiki/Regular_expression
    Tools
    https://regex101.com
    https://debuggex.com
    http://regexper.com
    http://regexr.com
    Pictures (Creative Commons)
    https://www.flickr.com/photos/ghor/8394379683 (forest)

  64. (?=thanks!)
    Twitter / @tmmx
