Mastering regex incantations

Regular expressions are a powerful tool in every software engineer's toolbox. If you know them well, you can draw on that power whenever you analyse text with a regular structure. In this talk, I will present the most advanced parts of regex syntax and discuss related concepts.

Tomasz Kowalczyk

November 17, 2017

Transcript

  1. mastering regex incantations Tomasz Kowalczyk / @tmmx

  4. /introduction/

  5. regular expression

  6. pattern match

  7. regex flavors syntax and feature differences

  8. Perl-Compatible Regular Expressions

  9. PHP Hypertext Preprocessor

  10. PCRE in PHP: preg_filter, preg_grep, preg_last_error, preg_match_all, preg_match, preg_quote, preg_replace_callback_array, preg_replace_callback, preg_replace, preg_split
  11. ereg_replace, ereg, eregi_replace, eregi, split, spliti, sql_regcase: DEPRECATED IN PHP 5.3.0, REMOVED IN PHP 7.0.0
  12. [character] [^classes] (capture) (?:groups) alter|nations optionals? quan+tif*iers{1,2}
  13. /incantations/

  14. xkcd.com/208

  16. atomic groups lookarounds

  17. (?=regex) lookahead (?!regex) negative lookahead (?<=regex) lookbehind (?<!regex) negative lookbehind

  18. ~^A(?=\d)\d(?<=\d)Z$~ A4Z
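
The lookaround example on the slide can be tried out directly in PHP. The following is a minimal sketch; the pattern and the subject `A4Z` come from the slide, while the variable names and the rejected subject `AxZ` are illustrative assumptions.

```php
<?php
// Pattern from the slide: lookarounds assert without consuming input.
$pattern = '~^A(?=\d)\d(?<=\d)Z$~';

// (?=\d) peeks at the next character, \d then consumes it,
// and (?<=\d) checks that the character just consumed was a digit.
$matched = preg_match($pattern, 'A4Z', $m);   // 1, $m[0] is 'A4Z'

// 'x' is not a digit, so the lookahead fails and nothing matches.
$rejected = preg_match($pattern, 'AxZ');      // 0

var_dump($matched, $m[0], $rejected);
```

Since lookarounds are zero-width, the overall match is still exactly `A4Z`; the assertions only add conditions on the digit.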

  19. branch reset different captures under single group

  20. (?|(branch)|(reset))

  21. ~(?|(\d+)|([a-zA-Z]+))~g SymfonyCon2017 Matches: SymfonyCon, 2017
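
To see what branch reset changes, the same alternation can be run with and without `(?|...)`. This is a sketch under the assumption that `preg_match_all` stands in for the `g` flag shown on the slide (PHP has no `g` modifier); variable names are illustrative.

```php
<?php
$subject = 'SymfonyCon2017';

// Without branch reset, each alternative gets its own group number,
// so every match leaves one of the two groups empty.
preg_match_all('~(\d+)|([a-zA-Z]+)~', $subject, $plain);

// With (?|...), both alternatives write into group 1.
preg_match_all('~(?|(\d+)|([a-zA-Z]+))~', $subject, $reset);

print_r($plain[1]);  // ['', '2017']
print_r($plain[2]);  // ['SymfonyCon', '']
print_r($reset[1]);  // ['SymfonyCon', '2017']
```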

  22. backreferences reference previous capture groups

  23. (?<foo>\d+)(?&foo) (?<bar>\d+)(?1) (?<baz>\d+)\g{-1}

  24. ~^(?<r>A)(?&r)\1\g{-1}$~g AAAA
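
The syntaxes on these slides fall into two families: `\1` and `\g{-1}` match the same text a group captured, while `(?&name)` and `(?1)` run the group's pattern again. The sketch below illustrates the difference; the `AAAA` pattern is from the slide, and the two-digit patterns are illustrative assumptions.

```php
<?php
// Slide pattern: group r, a call to r, a numbered and a relative backreference.
$all = preg_match('~^(?<r>A)(?&r)\1\g{-1}$~', 'AAAA');   // 1

// (?&d) re-runs the pattern \d, so any second digit is accepted.
$subroutine = preg_match('~^(?<d>\d)(?&d)$~', '12');     // 1

// \1 must repeat the exact text captured by group 1.
$backrefDifferent = preg_match('~^(?<d>\d)\1$~', '12');  // 0
$backrefSame      = preg_match('~^(?<d>\d)\1$~', '11');  // 1

var_dump($all, $subroutine, $backrefDifferent, $backrefSame);
```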

  25. conditionals test current position and choose path

  26. (?(check)true|false)

  27. ~(?(?=A)A\d{2}|00Z)~g A1100Z
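
The conditional from the slide can be checked with `preg_match_all`, which is assumed here as the PHP counterpart of the `g` flag. A minimal sketch:

```php
<?php
// (?(?=A) ... | ... ): if the next character is "A", take the first branch,
// otherwise take the second one.
preg_match_all('~(?(?=A)A\d{2}|00Z)~', 'A1100Z', $m);

print_r($m[0]);  // ['A11', '00Z']
```

The first match starts at `A`, so the "true" branch `A\d{2}` is used; the second starts at `0`, so the "false" branch `00Z` is used.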

  28. permanent anchors always match start or end of the input

  29. \Aregex\Z

  30. ~^\d+$~gm 2016 2017 2018

  31. ~\A\d+\Z~gm 2016 2017 2018

  32. ~\A\d+\Z~gm 2017
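
The three slides above compare `^`/`$` with `\A`/`\Z` under the `m` flag. The following sketch reproduces them, assuming the years on the slides are separated by newlines and that `preg_match_all` replaces the `g` flag.

```php
<?php
$lines = "2016\n2017\n2018";

// With the m flag, ^ and $ match at every line boundary.
$perLine = preg_match_all('~^\d+$~m', $lines, $m);      // 3

// \A and \Z ignore the m flag and only match at the ends of the input.
$wholeInput = preg_match_all('~\A\d+\Z~m', $lines);     // 0

// A single-line input satisfies the permanent anchors.
$single = preg_match('~\A\d+\Z~m', '2017');             // 1

var_dump($perLine, $m[0], $wholeInput, $single);
```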

  33. recursion match the regex itself inside the regex

  34. (?R)

  35. ~a(?R)?z~ aaazzz
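
A short sketch of the recursion slide follows. The first pattern is the one from the slide; the anchored variant using `(?1)` is an assumption added for illustration, because `(?R)` would also recurse into the `^` and `$` anchors.

```php
<?php
// Slide pattern: each level matches one "a", optionally recurses, then one "z".
preg_match('~a(?R)?z~', 'aaazzz', $m);
echo $m[0], "\n";  // aaazzz

// To validate a whole string, recurse into group 1 instead of the whole
// pattern, so the anchors stay outside the recursion.
$balanced   = preg_match('~^(a(?1)?z)$~', 'aaazzz');  // 1
$unbalanced = preg_match('~^(a(?1)?z)$~', 'aaazz');   // 0

var_dump($balanced, $unbalanced);
```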

  36. subroutines define and call subregexes

  37. (?(DEFINE)(?<name>regex)) (?(DEFINE) (?<foo>regex) (?<bar>regex) )

  38. ~(?(DEFINE) (?<alpha>[a-zA-Z]+) (?<digits>[0-9]+) ) ((?&alpha))((?&digits))~x SymfonyCon2017
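
The `(?(DEFINE)...)` example can be written out over several lines thanks to the `x` flag. This sketch uses the pattern and subject from the slide; only the variable names are assumptions. Note that the defined groups still count as groups 1 and 2, so the two calling groups are numbers 3 and 4.

```php
<?php
// The DEFINE block is never matched directly; it only declares subpatterns
// that are later called with (?&name).
$pattern = '~(?(DEFINE)
    (?<alpha>[a-zA-Z]+)
    (?<digits>[0-9]+)
)
((?&alpha))((?&digits))~x';

preg_match($pattern, 'SymfonyCon2017', $m);

echo $m[3], "\n";  // SymfonyCon
echo $m[4], "\n";  // 2017
```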

  39. literal escapes

  40. \Qescape\E

  41. ~[\Q!@#$%^&*()[]{}\E]+~ {[(!@#$%^&*)]}
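
The following sketch runs the `\Q...\E` example from the slide. The second pattern (`a.b`) is an illustrative assumption that shows the same escaping outside a character class.

```php
<?php
// Everything between \Q and \E is taken literally, even inside a character class.
preg_match('~[\Q!@#$%^&*()[]{}\E]+~', '{[(!@#$%^&*)]}', $m);
echo $m[0], "\n";  // {[(!@#$%^&*)]}

// Outside a class, \Q...\E turns the dot into a literal dot.
$literal = preg_match('~^\Qa.b\E$~', 'a.b');  // 1
$other   = preg_match('~^\Qa.b\E$~', 'axb');  // 0

var_dump($literal, $other);
```

When the literal text comes from a variable, `preg_quote` (listed on the "PCRE in PHP" slide) is the usual alternative to wrapping it in `\Q...\E`.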

  42. forcing failure

  43. (*FAIL) (?!)

  44. ~it (*FAIL)ed~ ~it (?!)ed~
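
Both spellings force the engine to fail at that position, so neither pattern can ever match. A minimal sketch, with the subject `it ed` and the control pattern chosen as illustrative assumptions:

```php
<?php
// (*FAIL) is a backtracking verb that always fails.
$byVerb = preg_match('~it (*FAIL)ed~', 'it ed');       // 0

// (?!) is an empty negative lookahead, which can never succeed either.
$byLookahead = preg_match('~it (?!)ed~', 'it ed');     // 0

// Control: the same pattern without the forced failure matches.
$control = preg_match('~it ed~', 'it ed');             // 1

var_dump($byVerb, $byLookahead, $control);
```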

  45. /quirks/

  46. xkcd.com/1171

  48. the dot

  49. 16-11-2017 \d{2}.\d{2}.\d{4}

  50. 16a11z2017

  51. \d{2}[-.]\d{2}[-.]\d{4}
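
The dot quirk is easy to reproduce. The sketch below uses the two patterns and the two subjects from the slides; the `~` delimiters and variable names are assumptions.

```php
<?php
$loose  = '~\d{2}.\d{2}.\d{4}~';        // unescaped dot matches any character
$strict = '~\d{2}[-.]\d{2}[-.]\d{4}~';  // inside a class the dot is literal

$looseDate   = preg_match($loose, '16-11-2017');   // 1
$looseJunk   = preg_match($loose, '16a11z2017');   // 1, probably not intended
$strictDate  = preg_match($strict, '16-11-2017');  // 1
$strictJunk  = preg_match($strict, '16a11z2017');  // 0

var_dump($looseDate, $looseJunk, $strictDate, $strictJunk);
```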

  52. line anchors

  53. ~^\d+$~m abc def 123

  54. ~\A\d+\Z~m abc def 123

  55. balanced constructs

  56. <html><head></head></html> ~<([a-z]+)>(?R)?</\1>~gi

  57. stackoverflow.com/a/1732454/443341
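
A minimal sketch of the balanced-tags pattern from the slide. The mismatched subject is an illustrative assumption, and the pattern only handles attribute-free tags nested one inside another, which is part of why the linked answer advises against parsing HTML this way.

```php
<?php
// Each level matches <tag>, optionally recurses into the whole pattern,
// then requires the closing tag with the same name as group 1.
$pattern = '~<([a-z]+)>(?R)?</\1>~i';

$nested = preg_match($pattern, '<html><head></head></html>', $m);  // 1
echo $m[0], "\n";  // <html><head></head></html>

// Closing tags in the wrong order cannot be matched at any level.
$crossed = preg_match($pattern, '<html><head></html></head>');     // 0

var_dump($nested, $crossed);
```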

  59. PREG_OFFSET_CAPTURE and the multibyte input

  60. ’’’’[xx]’’[yy] regular apostrophe (U+0027): ' RIGHT SINGLE QUOTATION MARK (U+2019): ’

  61. $part = substr($text, 0, $match[1]); $offset = mb_strlen($part, 'utf-8');
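
`PREG_OFFSET_CAPTURE` reports offsets in bytes, even with the `u` flag, which is why the slide converts them with `substr` and `mb_strlen`. The sketch below applies that conversion to the input from the previous slide; the pattern `\[\w+\]` and the `$offsets` array are illustrative assumptions, and the `mbstring` extension is assumed to be available.

```php
<?php
// U+2019 takes three bytes in UTF-8, so byte and character offsets diverge.
$text = str_repeat("\u{2019}", 4) . '[xx]' . str_repeat("\u{2019}", 2) . '[yy]';

preg_match_all('~\[\w+\]~u', $text, $matches, PREG_OFFSET_CAPTURE);

$offsets = [];
foreach ($matches[0] as $match) {
    // $match[0] is the matched text, $match[1] its offset in bytes.
    $part = substr($text, 0, $match[1]);
    $offsets[$match[0]] = [
        'bytes' => $match[1],
        'chars' => mb_strlen($part, 'utf-8'),
    ];
}

print_r($offsets);
// [xx]: bytes 12, chars 4
// [yy]: bytes 22, chars 10
```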

  62. email validation

  63. ex-parrot.com/~pdw/Mail-RFC822-Address.html (?:(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()< >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\ r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\ [\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\ t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(? :\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[ ([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:( ?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\" .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t

    ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*:(? :(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@ ,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\ n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t] )*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\ r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?: \r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\ 
[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t] )*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()< >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(? :(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^( )<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\. 
(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\" .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)
  64. /use cases/

  65. Shortcode

  66. [name=bbCode arg=val param] content [/name]

  68. /\[(\[?)([a-zA-Z]+)(?![\w-])([^\]\/]*(?:\/(?!\])[^\]\/]*)*?)(?:(\/)\]|\](?:([^\[]*+(?:\[(?!\/\2\])[^\[]*+)*+)\[\/\2\])?)(\]?)/s /([\w-]+)\s*=\s*"([^"]*)"(?:\s|$)|([\w-]+)\s*=\s*'([^']*)'(?:\s|$)|([\w-]+)\s*=\s*([^\s'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/ https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L239 https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L448

  69. ~((?:\[\s*(?<name>[a-zA-Z0-9-_]+)\s*(?:\=\s*(?<bbCode>\"(?:[^\"\\]*(?:\\.[^\"\\]*)*)\"|(?!=(?:\s*|\]|\/\]))))?\s*(?<parameters>(?:\s*(?:\w+(?:\s*\=\s*\"(?:[^\"\\]*(?:\\.[^\"\\]*)*)\"|\s*\=\s*(?!=(?:\s*|\]|\/\]))|(?=\s|\]|\/\s*\]|$))))*)\s*(?:\](?<content>.*?)\[\s*(?<markerContent>\/)\s*(\k<name>)\s*\]|\]|(?<marker>\/)\s*\])))~us github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  70. Generator

  71. 3a[1!n]4!a10n: between one and three alpha characters, exactly one optional digit, exactly four alpha characters, and between one and ten digits
  72. ~^[a-zA-Z]{1,3}(?:[0-9]{1})?[a-zA-Z]{4}[0-9]{1,10}$~ between one and three alpha characters, exactly one optional digit, exactly four alpha characters, and between one and ten digits
  73. ~^(?<a1>[a-zA-Z]{1,3})(?<n1>[0-9]{1})?(?<a2>[a-zA-Z]{4})(?<n2>[0-9]{1,10})$~ THU3NDER1337 a1: THU, n1: 3, a2: NDER, n2: 1337
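
One way such a generator could work is sketched below. It is not the author's implementation: `formatToRegex` is a hypothetical helper, and the reading of the format (a count, an optional `!` for "exactly", `a` or `n` for the character type, and `[...]` for an optional part) is an assumption based on the description on the slides. Optional parts are emitted as an optional named group.

```php
<?php
// Hypothetical helper: turns a format such as "3a[1!n]4!a10n" into a regex
// with one named group per part (a1, n1, a2, n2, ...).
function formatToRegex(string $format): string
{
    $classes = ['a' => '[a-zA-Z]', 'n' => '[0-9]'];
    $index = ['a' => 0, 'n' => 0];

    // Groups: 1 = "[", 2 = count, 3 = "!", 4 = type.
    // The conditional (?(1)\]) requires "]" only if "[" was matched.
    $body = preg_replace_callback(
        '~(\[)?(\d+)(!)?([an])(?(1)\])~',
        function (array $m) use ($classes, &$index): string {
            $type = $m[4];
            $name = $type . ++$index[$type];
            $quantifier = $m[3] === '!'
                ? '{' . $m[2] . '}'      // "4!a": exactly four
                : '{1,' . $m[2] . '}';   // "3a": between one and three
            $optional = $m[1] === '[' ? '?' : '';

            return '(?<' . $name . '>' . $classes[$type] . $quantifier . ')' . $optional;
        },
        $format
    );

    return '~^' . $body . '$~';
}

$regex = formatToRegex('3a[1!n]4!a10n');
preg_match($regex, 'THU3NDER1337', $m);

echo $regex, "\n";
echo "a1: {$m['a1']}, n1: {$m['n1']}, a2: {$m['a2']}, n2: {$m['n2']}\n";
// a1: THU, n1: 3, a2: NDER, n2: 1337
```

The parser itself relies on a conditional from earlier in the talk, so an opening `[` without its closing `]` is simply not recognised as a part.
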
  74. /tools/

  75. regular-expressions.info

  76. RexEgg

  77. Debuggex

  78. github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  79. Regex101

  80. /summary/

  81. (?P<questions>) Twitter / @tmmx

  82. please rate the talk and leave feedback joind.in/talk/b4015

  83. Resources: http://rexegg.com, http://www.regular-expressions.info, https://en.wikipedia.org/wiki/Syntax_diagram, https://en.wikipedia.org/wiki/Regular_expression. Tools: https://regex101.com, https://debuggex.com, http://regexper.com, http://regexr.com. Pictures (Creative Commons): https://www.flickr.com/photos/ghor/8394379683 (forest)
  84. (?=thanks!) Twitter / @tmmx