Mastering regex incantations

Regular expressions are a powerful tool in every software engineer's toolbox. If you know them well, you can apply that power whenever you need to analyse a regular construct. In this talk, I will present the most advanced parts of regex syntax and discuss related concepts.

Tomasz Kowalczyk

November 17, 2017

Transcript

  1. mastering regex incantations Tomasz Kowalczyk / @tmmx

  2. None
  3. None
  4. /introduction/

  5. regular expression

  6. pattern match

  7. regex flavors syntax and feature differences

  8. Perl-Compatible Regular Expressions

  9. PHP Hypertext Preprocessor

  10. PCRE in PHP preg_filter preg_grep preg_last_error preg_match_all preg_match preg_quote preg_replace_callback_array preg_replace_callback preg_replace preg_split
  11. ereg_replace ereg eregi_replace eregi split spliti sql_regcase DEPRECATED IN PHP 5.3.0, REMOVED IN PHP 7.0.0
  12. [character] [^classes] (capture) (?:groups) alter|nations optionals? quan+tif*iers{1,2}
  13. /incantations/

  14. xkcd.com/208

  15. None
  16. atomic groups lookarounds

  17. (?=regex) lookahead (?!regex) negative lookahead (?<=regex) lookbehind (?<!regex) negative lookbehind

  18. ~^A(?=\d)\d(?<=\d)Z$~ A4Z
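
A minimal PHP sketch of the slide above, assuming the pattern is run through `preg_match`. Lookarounds only assert and never consume input, so the plain `\d` is the single part of the pattern that moves past the digit. The extra subjects `AxZ` and `A44Z` are illustrative and not from the talk.

```php
<?php
// Pattern from the slide: (?=\d) looks ahead, (?<=\d) looks behind,
// and only the plain \d in the middle consumes a character.
$pattern = '~^A(?=\d)\d(?<=\d)Z$~';

$digit = preg_match($pattern, 'A4Z');      // one digit between A and Z
$letter = preg_match($pattern, 'AxZ');     // lookahead fails on "x"
$twoDigits = preg_match($pattern, 'A44Z'); // second digit is never consumed

var_dump($digit, $letter, $twoDigits); // int(1), int(0), int(0)
```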

  19. branch reset different captures under single group

  20. (?|(branch)|(reset))

  21. ~(?|(\d+)|([a-zA-Z]+))~g SymfonyCon2017 Matches: SymfonyCon, 2017
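
A short sketch of the same match in PHP, under the assumption that the `g` flag on the slide is tester notation: PHP has no `g` modifier, and `preg_match_all` plays that role. The second call drops the branch reset to show where the captures land without it.

```php
<?php
$subject = 'SymfonyCon2017';

// With branch reset, both alternatives write to group 1.
preg_match_all('~(?|(\d+)|([a-zA-Z]+))~', $subject, $reset);

// Without it, digits land in group 1 and letters in group 2.
preg_match_all('~(\d+)|([a-zA-Z]+)~', $subject, $plain);

print_r($reset[1]); // SymfonyCon, 2017
print_r($plain[1]); // group 1 holds only the digits
print_r($plain[2]); // group 2 holds only the letters
```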

  22. backreferences reference previous capture groups

  23. (?<foo>\d+)(?&foo) (?<bar>\d+)(?1) (?<baz>\d+)\g{-1}

  24. ~^(?<r>A)(?&r)\1\g{-1}$~g AAAA
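
The constructs on these slides fall into two families, which the sketch below tries to separate: `(?&name)` and `(?1)` call the group's sub-pattern again, while `\1` and `\g{-1}` must repeat the exact text the group captured. The digit subjects are illustrative assumptions, not part of the talk.

```php
<?php
// Slide pattern: named group, subroutine call, numbered and relative backreference.
$slide = preg_match('~^(?<r>A)(?&r)\1\g{-1}$~', 'AAAA');

// (?&d) re-runs the sub-pattern \d, so a different second digit is accepted.
$callDifferent = preg_match('~^(?<d>\d)(?&d)$~', '12');

// \1 and \g{-1} must repeat the text captured by the group.
$backrefDifferent = preg_match('~^(?<d>\d)\1$~', '12');
$backrefSame = preg_match('~^(?<d>\d)\g{-1}$~', '11');

var_dump($slide, $callDifferent, $backrefDifferent, $backrefSame);
// int(1), int(1), int(0), int(1)
```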

  25. conditionals test current position and choose path

  26. (?(check)true|false)

  27. ~(?(?=A)A\d{2}|00Z)~g A1100Z
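
A sketch of the conditional from the slide in PHP, again assuming `preg_match_all` stands in for the `g` flag. The second pattern is an added illustration of the other common check, "did group 1 participate", and is not taken from the talk.

```php
<?php
// Lookahead as the check: at "A" take the first branch, otherwise the second.
preg_match_all('~(?(?=A)A\d{2}|00Z)~', 'A1100Z', $found);
print_r($found[0]); // A11, 00Z

// Group as the check: require ")" only when "(" was matched.
$parens = '~^(\()?\d+(?(1)\))$~';
$wrapped = preg_match($parens, '(42)');
$bare = preg_match($parens, '42');
$unclosed = preg_match($parens, '(42');

var_dump($wrapped, $bare, $unclosed); // int(1), int(1), int(0)
```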

  28. permanent anchors always match start or end of the input

  29. \Aregex\Z

  30. ~^\d+$~gm 2016 2017 2018

  31. ~\A\d+\Z~gm 2016 2017 2018

  32. ~\A\d+\Z~gm 2017
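
The three slides above can be reproduced with a few lines of PHP. This sketch assumes the years on the slide are separated by newlines and uses the return value of `preg_match_all` (the number of matches) to compare both kinds of anchors under the `m` modifier.

```php
<?php
$lines = "2016\n2017\n2018";

// ^ and $ become line anchors under the m modifier: one match per line.
$lineAnchored = preg_match_all('~^\d+$~m', $lines);

// \A and \Z ignore the m modifier: the whole input must be digits.
$inputAnchored = preg_match_all('~\A\d+\Z~m', $lines);
$singleYear = preg_match_all('~\A\d+\Z~m', '2017');

var_dump($lineAnchored, $inputAnchored, $singleYear); // int(3), int(0), int(1)
```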

  33. recursion match the regex itself inside the regex

  34. (?R)

  35. ~a(?R)?z~ aaazzz
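
A small sketch of recursion in PHP. The first call uses the slide's pattern as is. The second variant is an assumption added for illustration: it recurses into group 1 with `(?1)` instead of `(?R)`, so that `^` and `$` can stay outside the recursive part and the whole subject must be balanced.

```php
<?php
// Slide pattern: each level matches "a", optionally itself, then "z".
preg_match('~a(?R)?z~', 'aaazzz', $m);
var_dump($m[0]); // string(6) "aaazzz"

// Anchored variant: recurse into group 1 so the anchors are not repeated.
$balancedPattern = '~^(a(?1)?z)$~';
$balanced = preg_match($balancedPattern, 'aaazzz');
$unbalanced = preg_match($balancedPattern, 'aaazz');

var_dump($balanced, $unbalanced); // int(1), int(0)
```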

  36. subroutines define and call subregexes

  37. (?(DEFINE)(?<name>regex)) (?(DEFINE) (?<foo>regex) (?<bar>regex) )

  38. ~(?(DEFINE) (?<alpha>[a-zA-Z]+) (?<digits>[0-9]+) ) ((?&alpha))((?&digits))~x SymfonyCon2017
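
The slide's pattern laid out over several lines, which the `x` modifier permits. In this sketch the two groups inside `(?(DEFINE)...)` take numbers 1 and 2 but never capture anything themselves, so the actual captures are groups 3 and 4.

```php
<?php
$pattern = '~
    (?(DEFINE)
        (?<alpha>[a-zA-Z]+)
        (?<digits>[0-9]+)
    )
    ((?&alpha))((?&digits))
~x';

preg_match($pattern, 'SymfonyCon2017', $m);

// Groups 1 and 2 are the definitions; 3 and 4 hold the matched text.
var_dump($m[3], $m[4]); // string(10) "SymfonyCon", string(4) "2017"
```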

  39. literal escapes

  40. \Qescape\E

  41. ~[\Q!@#$%^&*()[]{}\E]+~ {[(!@#$%^&*)]}
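
A sketch of literal escaping in PHP. The first call is the slide's pattern. The `1+1=2?` subject is an illustrative assumption used to compare an unescaped pattern, `\Q...\E` inside the pattern, and `preg_quote` (listed on the PCRE functions slide) on the PHP side.

```php
<?php
// Slide pattern: everything between \Q and \E is literal, even inside a class.
preg_match('~[\Q!@#$%^&*()[]{}\E]+~', '{[(!@#$%^&*)]}', $m);
var_dump($m[0]); // the whole subject

$literal = '1+1=2?';

// Unescaped, "+" and "?" act as quantifiers and the literal text does not match.
$unquoted = preg_match('~^1+1=2?$~', $literal);

// Escaped inside the pattern, or by PHP before the pattern is built.
$inline = preg_match('~^\Q1+1=2?\E$~', $literal);
$quoted = preg_match('~^' . preg_quote($literal, '~') . '$~', $literal);

var_dump($unquoted, $inline, $quoted); // int(0), int(1), int(1)
```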

  42. forcing failure

  43. (*FAIL) (?!)

  44. ~it (*FAIL)ed~ ~it (?!)ed~
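
A brief sketch showing that both spellings from the slide, plus the short form `(*F)`, make the engine fail at that point regardless of the subject. The subjects are illustrative assumptions.

```php
<?php
// The verb (*FAIL), its short form (*F), and the empty negative lookahead
// all force a failure at the position where they appear.
$verb = preg_match('~it (*FAIL)ed~', 'it failed');
$short = preg_match('~it (*F)ed~', 'it ed');
$lookahead = preg_match('~it (?!)ed~', 'it ed');

var_dump($verb, $short, $lookahead); // int(0), int(0), int(0)
```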

  45. /quirks/

  46. xkcd.com/1171

  47. None
  48. the dot

  49. 16-11-2017 \d{2}.\d{2}.\d{4}

  50. 16a11z2017

  51. \d{2}[-.]\d{2}[-.]\d{4}
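
A sketch of the dot quirk in PHP. The `^` and `$` anchors are added here as an assumption about validating a whole string. The third pattern is an extra illustration, not from the talk: it reuses the first separator through a backreference so that mixed separators are rejected.

```php
<?php
$loose = '~^\d{2}.\d{2}.\d{4}$~';          // unescaped dot: any character
$strict = '~^\d{2}[-.]\d{2}[-.]\d{4}$~';    // dash or literal dot
$same = '~^\d{2}([-.])\d{2}\1\d{4}$~';      // second separator must equal the first

$looseJunk = preg_match($loose, '16a11z2017');
$strictJunk = preg_match($strict, '16a11z2017');
$strictDate = preg_match($strict, '16-11-2017');
$strictMixed = preg_match($strict, '16.11-2017');
$sameMixed = preg_match($same, '16.11-2017');

var_dump($looseJunk, $strictJunk, $strictDate, $strictMixed, $sameMixed);
// int(1), int(0), int(1), int(1), int(0)
```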

  52. line anchors

  53. ~^\d+$~m abc def 123

  54. ~\A\d+\Z~m abc def 123
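
Why this matters for input validation, as a hedged sketch: the two helper functions are hypothetical, and the strict one uses `\z` instead of the slide's `\Z` because `\Z` still accepts one trailing newline.

```php
<?php
// Hypothetical validators for "digits only" input.
function isDigitsLoose(string $input): bool
{
    // Under m, ^ and $ match at any line, so one numeric line is enough.
    return preg_match('~^\d+$~m', $input) === 1;
}

function isDigitsStrict(string $input): bool
{
    // \A and \z match only at the very start and very end of the input.
    return preg_match('~\A\d+\z~', $input) === 1;
}

$input = "abc\ndef\n123";
var_dump(isDigitsLoose($input));  // bool(true)
var_dump(isDigitsStrict($input)); // bool(false)
var_dump(isDigitsStrict('123'));  // bool(true)

// \Z tolerates a final newline, \z does not.
$upperZ = preg_match('~\A\d+\Z~', "123\n");
$lowerZ = preg_match('~\A\d+\z~', "123\n");
var_dump($upperZ, $lowerZ); // int(1), int(0)
```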

  55. balanced constructs

  56. <html><head></head></html> ~<([a-z]+)>(?R)?</\1>~gi
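
A sketch of the slide's pattern in PHP, with two added subjects that are assumptions for illustration. Because `(?R)?` allows at most one nested element, sibling elements are not covered; the last pattern swaps in `(?R)*` to show one possible way around that. None of this makes regex a substitute for an HTML parser.

```php
<?php
$pattern = '~<([a-z]+)>(?R)?</\1>~i';

// Nested tags: \1 is resolved per recursion level.
preg_match($pattern, '<html><head></head></html>', $nested);
var_dump($nested[0]); // the whole subject

// Wrongly ordered closing tags: no balanced pair anywhere.
$crossed = preg_match($pattern, '<html><head></html></head>');
var_dump($crossed); // int(0)

// Siblings: (?R)? covers one child only, (?R)* covers any number.
$siblings = '<a><b></b><c></c></a>';
preg_match($pattern, $siblings, $oneChild);
preg_match('~<([a-z]+)>(?R)*</\1>~i', $siblings, $manyChildren);
var_dump($oneChild[0], $manyChildren[0]); // "<b></b>", then the whole subject
```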

  57. stackoverflow.com/a/1732454/443341

  58. None
  59. PREG_OFFSET_CAPTURE and the multibyte input

  60. ’’’’[xx]’’[yy] regular apostrophe (U+0027): ' RIGHT SINGLE QUOTATION MARK (U+2019): ’

  61. $part = substr($text, 0, $match[1]); $offset = mb_strlen($part, 'utf-8');
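
A runnable sketch around the two lines above, assuming the mbstring extension is loaded and PHP 7.1 or newer. `PREG_OFFSET_CAPTURE` reports byte offsets even with the `u` modifier, and each U+2019 mark takes three bytes in UTF-8, so byte and character offsets drift apart. The pattern and variable names are illustrative.

```php
<?php
// Four U+2019 marks, [xx], two more marks, [yy].
$text = str_repeat("\u{2019}", 4) . '[xx]' . str_repeat("\u{2019}", 2) . '[yy]';

preg_match_all('~\[[a-z]+\]~u', $text, $matches, PREG_OFFSET_CAPTURE);

$offsets = [];
foreach ($matches[0] as [$value, $byteOffset]) {
    // Same conversion as on the slide: cut by bytes, then count characters.
    $part = substr($text, 0, $byteOffset);
    $offsets[$value] = [
        'bytes' => $byteOffset,
        'chars' => mb_strlen($part, 'UTF-8'),
    ];
}

print_r($offsets); // [xx]: 12 bytes / 4 chars, [yy]: 22 bytes / 10 chars
```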

  62. email validation

  63. ex-parrot.com/~pdw/Mail-RFC822-Address.html (?:(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()< >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\ r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\ [\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\ t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(? :\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[ ([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:( ?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\" .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t

    ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*:(? :(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@ ,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\ n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t] )*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\ r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?: \r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\ 
[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t] )*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()< >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(? :(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^( )<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\. 
(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\" .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)
  64. /use cases/

  65. Shortcode

  66. [name=bbCode arg=val param] content [/name]

  67. [name=bbCode arg=val param] content [/name]

  68. /\[(\[?)([a-zA-Z]+)(?![\w-])([^\]\/]*(?:\/(?!\])[^\]\/]*)*?)(?:(\/)\]|\](?:([^\[]*+(?:\[(?!\/\2\])[^\[]*+)*+)\[\/\2\])?)(\]?)/s /([\w-]+)\s*=\s*"([^"]*)"(?:\s|$)|([\w-]+)\s*=\s*'([^']*)'(?:\s|$)|([\w-]+)\s*=\s*([^\s'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/ https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L239 https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L448

  69. ~((?:\[\s*(?<name>[a-zA-Z0-9-_]+)\s*(?:\=\s*(?<bbCode>\"(?:[^\"\\]*(?:\\.[^\"\\]*)*)\"|(?!=(?:\s*|\]|\/\]))))?\s*(?<parameters>(?:\s*(?:\w+(?:\s*\=\s*\"(?:[^\"\\]*(?:\\.[^\"\\]*)*)\"|\s*\=\s*(?!=(?:\s*|\]|\/\]))|(?=\s|\]|\/\s*\]|$))))*)\s*(?:\](?<content>.*?)\[\s*(?<markerContent>\/)\s*(\k<name>)\s*\]|\]|(?<marker>\/)\s*\])))~us github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  70. Generator

  71. 3a[1!n]4!a10n between one and three alpha characters, exactly one optional digit, exactly four alpha characters, and between one and ten digits
  72. ~^[a-zA-Z]{1,3}[0-9]?[a-zA-Z]{4}[0-9]{1,10}$~ between one and three alpha characters, exactly one optional digit, exactly four alpha characters, and between one and ten digits
  73. ~^(?<a1>[a-zA-Z]{1,3})(?<n1>[0-9]?)(?<a2>[a-zA-Z]{4})(?<n2>[0-9]{1,10})$~ THU3NDER1337 a1: THU, n1: 3, a2: NDER, n2: 1337
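
One way such a generator could look, as a rough sketch and not the speaker's implementation. The function name `formatToRegex` and the token grammar are assumptions: a number followed by `a` or `n` means "between one and that many", `!` makes the length exact, and square brackets around a single token make it optional.

```php
<?php
// Hypothetical helper: converts a format such as "3a[1!n]4!a10n" into a regex.
function formatToRegex(string $format): string
{
    $classes = ['a' => '[a-zA-Z]', 'n' => '[0-9]'];
    $counters = ['a' => 0, 'n' => 0];

    // One token: optional "[", length, optional "!" (exact length), type.
    preg_match_all('~(\[)?(\d+)(!)?([an])~', $format, $tokens, PREG_SET_ORDER);

    $regex = '';
    foreach ($tokens as $token) {
        [, $bracket, $length, $exact, $type] = $token;
        $quantifier = $exact === '!' ? '{' . $length . '}' : '{1,' . $length . '}';
        $part = $classes[$type] . $quantifier;
        if ($bracket === '[') {
            // Optional token: wrap the whole part, then make the wrapper optional.
            $part = '(?:' . $part . ')?';
        }
        // Name groups a1, n1, a2, n2, ... by type and order of appearance.
        $name = $type . (++$counters[$type]);
        $regex .= '(?<' . $name . '>' . $part . ')';
    }

    return '~^' . $regex . '$~';
}

$regex = formatToRegex('3a[1!n]4!a10n');
preg_match($regex, 'THU3NDER1337', $withDigit);
preg_match($regex, 'THUNDER1337', $withoutDigit);

echo $regex, PHP_EOL;
print_r([$withDigit['a1'], $withDigit['n1'], $withDigit['a2'], $withDigit['n2']]);
```

In this sketch the optional digit is expressed by wrapping the token in `(?:...)?`, so `THUNDER1337` (without the digit) is accepted as well, with an empty `n1`.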
  74. /tools/

  75. regular-expressions.info

  76. RexEgg

  77. Debuggex

  78. github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

  79. Regex101

  80. /summary/

  81. (?P<questions>) Twitter / @tmmx

  82. please rate the talk and leave feedback joind.in/talk/b4015

  83. Resources http://rexegg.com http://www.regular-expressions.info https://en.wikipedia.org/wiki/Syntax_diagram https://en.wikipedia.org/wiki/Regular_expression Tools https://regex101.com https://debuggex.com http://regexper.com http://regexr.com

    Pictures (Creative Commons) https://www.flickr.com/photos/ghor/8394379683 (forest)
  84. (?=thanks!) Twitter / @tmmx