Slide 1

Slide 1 text

regex incantations mastering Tomasz Kowalczyk / @tmmx

Slide 4

Slide 4 text

/introduction/

Slide 5

Slide 5 text

regular expression

Slide 6

Slide 6 text

pattern match

Slide 7

Slide 7 text

regex flavors syntax and feature differences

Slide 8

Slide 8 text

Perl-Compatible Regular Expressions

Slide 9

Slide 9 text

PHP Hypertext Preprocessor

Slide 10

Slide 10 text

PCRE in PHP: preg_filter, preg_grep, preg_last_error, preg_match_all, preg_match, preg_quote, preg_replace_callback_array, preg_replace_callback, preg_replace, preg_split

Slide 11

Slide 11 text

ereg_replace, ereg, eregi_replace, eregi, split, spliti, sql_regcase: DEPRECATED IN PHP 5.3.0, REMOVED IN PHP 7.0.0

Slide 12

Slide 12 text

[character] [^classes] (capture) (?:groups) alter | nations optionals? quan+ tif* iers{1,2}

Slide 13

Slide 13 text

/incantations/

Slide 14

Slide 14 text

xkcd.com/208

Slide 16

Slide 16 text

atomic groups lookarounds

Slide 17

Slide 17 text

(?=regex) lookahead (?!regex) negative lookahead (?<=regex) lookbehind (?<!regex) negative lookbehind

Slide 18

Slide 18 text

~^A(?=\d)\d(?<=\d)Z$~ A4Z
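
A minimal PHP sketch of the lookarounds above, assuming the slide pattern is run through preg_match. The EUR and dollar-sign subjects are made-up illustrations, not taken from the talk.

```php
<?php
// Slide pattern: both lookarounds inspect the digit without consuming
// anything beyond what \d itself consumes.
$slide = '~^A(?=\d)\d(?<=\d)Z$~';
$matchesA4Z = preg_match($slide, 'A4Z'); // 1
$matchesAxZ = preg_match($slide, 'AxZ'); // 0

// Lookahead: amounts followed by " EUR"; the currency is not part of the match.
preg_match_all('~\d+(?= EUR)~', '10 EUR, 20 USD, 30 EUR', $euro);

// Negative lookbehind: numbers not directly preceded by a dollar sign.
preg_match_all('~(?<!\$)\b\d+\b~', '$5 and 7', $noDollar);

var_dump($euro[0]);     // "10" and "30"
var_dump($noDollar[0]); // "7"
```

Because lookarounds are zero-width, the matched text in `$euro[0]` contains only the digits.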

Slide 19

Slide 19 text

branch reset different captures under single group

Slide 20

Slide 20 text

(?|(branch)|(reset))

Slide 21

Slide 21 text

~(?|(\d+)|([a-zA-Z]+))~g SymfonyCon2017 Matches: SymfonyCon, 2017
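
To see what branch reset changes, the following sketch compares the slide pattern with a plain alternation. The `g` flag on the slide is regex-tester notation; in PHP the equivalent is assumed to be preg_match_all.

```php
<?php
$subject = 'SymfonyCon2017';

// Plain alternation: each alternative owns its own group number,
// so results are spread over groups 1 and 2.
preg_match_all('~(\d+)|([a-zA-Z]+)~', $subject, $plain);

// Branch reset: both alternatives write into group 1.
preg_match_all('~(?|(\d+)|([a-zA-Z]+))~', $subject, $reset);

var_dump($plain[1]); // digits only (empty string for the letters match)
var_dump($plain[2]); // letters only (empty string for the digits match)
var_dump($reset[1]); // "SymfonyCon" and "2017"
```

With branch reset, calling code reads a single group regardless of which alternative matched.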

Slide 22

Slide 22 text

backreferences reference previous capture groups

Slide 23

Slide 23 text

(?<foo>\d+)(?&foo) (?<foo>\d+)(?1) (?<foo>\d+)\g{-1}

Slide 24

Slide 24 text

~^(?<r>A)(?&r)\1\g{-1}$~g AAAA
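
A short sketch of the difference between re-running a group's pattern and re-matching its captured text. It assumes the group on the slide is named `r` (inferred from the `(?&r)` call); the `12-34` subjects are illustrative only.

```php
<?php
// Slide example: (?&r) re-runs the pattern of group "r", while \1 and
// \g{-1} require the exact text that the group captured.
$slide = preg_match('~^(?<r>A)(?&r)\1\g{-1}$~', 'AAAA'); // 1

// Subroutine call: any digits may follow the dash.
$subroutineDifferent = preg_match('~^(?<d>\d+)-(?&d)$~', '12-34'); // 1

// Backreference: the same digits must follow the dash.
$backrefDifferent = preg_match('~^(?<d>\d+)-\k<d>$~', '12-34'); // 0
$backrefSame      = preg_match('~^(?<d>\d+)-\k<d>$~', '12-12'); // 1
```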

Slide 25

Slide 25 text

conditionals test current position and choose path

Slide 26

Slide 26 text

(?(check)true|false)

Slide 27

Slide 27 text

~(?(?=A)A\d{2}|00Z)~g A1100Z
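
The conditional on the slide can be tried out as below. The second pattern, which tests whether a capture group participated in the match, is an additional illustration and not part of the talk.

```php
<?php
// Slide example: if the next character is "A", take the first branch,
// otherwise take the second one.
preg_match_all('~(?(?=A)A\d{2}|00Z)~', 'A1100Z', $slide);
var_dump($slide[0]); // "A11" and "00Z"

// Condition on a group: require ">" only when "<" was matched.
$tag = '~^(<)?\w+(?(1)>)$~';
$bracketed = preg_match($tag, '<tag>'); // 1
$bare      = preg_match($tag, 'tag');   // 1
$unclosed  = preg_match($tag, '<tag');  // 0
$unopened  = preg_match($tag, 'tag>');  // 0
```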

Slide 28

Slide 28 text

permanent anchors always match start or end of the input

Slide 29

Slide 29 text

\Aregex\Z

Slide 30

Slide 30 text

~^\d+$~gm 2016 2017 2018

Slide 31

Slide 31 text

~\A\d+\Z~gm 2016 2017 2018

Slide 32

Slide 32 text

~\A\d+\Z~gm 2017
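
The three anchor slides can be reproduced with the sketch below, assuming the years on the slides were separated by newlines. The `\z` comparison at the end is an extra detail not shown in the talk.

```php
<?php
$lines  = "2016\n2017\n2018";
$single = "2017";

// With the m modifier, ^ and $ match at every line boundary.
$lineAnchors = preg_match_all('~^\d+$~m', $lines); // 3

// \A and \Z ignore the m modifier and only match at the input boundaries.
$permanentLines  = preg_match_all('~\A\d+\Z~m', $lines);  // 0
$permanentSingle = preg_match_all('~\A\d+\Z~m', $single); // 1

// \Z still tolerates one trailing newline; \z does not.
$upperZ = preg_match('~\A\d+\Z~', "2017\n"); // 1
$lowerZ = preg_match('~\A\d+\z~', "2017\n"); // 0
```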

Slide 33

Slide 33 text

recursion match the regex itself inside the regex

Slide 34

Slide 34 text

(?R)

Slide 35

Slide 35 text

~a(?R)?z~ aaazzz
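
A small sketch of the recursion above. The anchored variant with `(?1)` is an assumption about how one might validate a whole string; it recurses into group 1 so that the anchors are not repeated inside the recursion.

```php
<?php
// Slide example: each level matches one "a", optionally recurses,
// then matches the closing "z".
preg_match('~a(?R)?z~', 'aaazzz', $whole);
var_dump($whole[0]); // "aaazzz"

// Unanchored, the pattern finds the balanced part inside a longer string.
preg_match('~a(?R)?z~', 'xxaazzyy', $inner);
var_dump($inner[0]); // "aazz"

// Whole-string validation: recurse into group 1 instead of the full pattern.
$balanced   = preg_match('~^(a(?1)?z)$~', 'aaazzz'); // 1
$unbalanced = preg_match('~^(a(?1)?z)$~', 'aaazz');  // 0
```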

Slide 36

Slide 36 text

subroutines define and call subregexes

Slide 37

Slide 37 text

(?(DEFINE)(?<name>regex)) (?(DEFINE) (?<name1>regex) (?<name2>regex) )

Slide 38

Slide 38 text

~(?(DEFINE) (?<alpha>[a-zA-Z]+) (?<digits>[0-9]+) ) ((?&alpha))((?&digits))~x SymfonyCon2017
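
The DEFINE example can be run as follows. The group names `alpha` and `digits` are taken from the `(?&alpha)` and `(?&digits)` calls on the slide. Note that the defined groups still occupy numbers 1 and 2, so the two capturing groups in the main pattern are 3 and 4.

```php
<?php
$pattern = '~
    (?(DEFINE)
        (?<alpha>[a-zA-Z]+)   # building block, never matched on its own
        (?<digits>[0-9]+)     # building block, never matched on its own
    )
    ((?&alpha))((?&digits))   # groups 3 and 4 capture the actual text
~x';

preg_match($pattern, 'SymfonyCon2017', $m);

var_dump($m[3]); // "SymfonyCon"
var_dump($m[4]); // "2017"
```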

Slide 39

Slide 39 text

literal escapes

Slide 40

Slide 40 text

\Qescape\E

Slide 41

Slide 41 text

~[\Q!@#$%^&*()[]{}\E]+~ {[(!@#$%^&*)]}
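
A sketch of the literal escape from the slide, next to preg_quote from the function list earlier in the talk. The `$needle` value is a made-up example. One caveat worth keeping in mind: wrapping arbitrary input in `\Q...\E` stops working if the input itself contains `\E` or the delimiter, which is why preg_quote is usually the safer choice for dynamic input.

```php
<?php
// Slide example: everything between \Q and \E is literal, even inside
// a character class.
preg_match('~[\Q!@#$%^&*()[]{}\E]+~', '{[(!@#$%^&*)]}', $m);
var_dump($m[0]); // the whole subject

$needle = 'a.b*c'; // hypothetical input containing metacharacters

// Unescaped, "." and "*" keep their special meaning.
$unquoted = preg_match('~' . $needle . '~', 'axbbc'); // 1

// Escaped with preg_quote or \Q...\E, only the literal text matches.
$quoted  = preg_match('~' . preg_quote($needle, '~') . '~', 'axbbc');          // 0
$literal = preg_match('~' . preg_quote($needle, '~') . '~', 'see a.b*c here'); // 1
$viaQE   = preg_match('~\Q' . $needle . '\E~', 'see a.b*c here');              // 1
```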

Slide 42

Slide 42 text

forcing failure

Slide 43

Slide 43 text

(*FAIL) (?!)

Slide 44

Slide 44 text

~it (*FAIL)ed~ ~it (?!)ed~
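
On their own, both forms simply make the pattern fail at that point. A common practical use combines `(*FAIL)` with the `(*SKIP)` verb, which is not on the slide; the sketch below shows that idiom under the assumption that the goal is to match numbers outside double quotes.

```php
<?php
// Slide examples: neither pattern can ever match.
$verb      = preg_match('~it (*FAIL)ed~', 'it failed'); // 0
$lookahead = preg_match('~it (?!)ed~', 'it failed');    // 0

// (*SKIP)(*FAIL): consume a quoted string, then discard it and resume
// searching after it, so the digits inside the quotes are never seen.
preg_match_all('~"[^"]*"(*SKIP)(*FAIL)|\d+~', '7 "8" 9', $outside);
var_dump($outside[0]); // "7" and "9"

// Without (*SKIP) the engine retries inside the quotes and finds "8" too.
preg_match_all('~"[^"]*"(*FAIL)|\d+~', '7 "8" 9', $naive);
var_dump($naive[0]); // "7", "8" and "9"
```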

Slide 45

Slide 45 text

/quirks/

Slide 46

Slide 46 text

xkcd.com/1171

Slide 48

Slide 48 text

the dot

Slide 49

Slide 49 text

16-11-2017 \d{2}.\d{2}.\d{4}

Slide 50

Slide 50 text

16a11z2017

Slide 51

Slide 51 text

\d{2}[-.]\d{2}[-.]\d{4}
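
The dot quirk from the last three slides, as a runnable sketch. The `^` and `$` anchors are added here for clarity and are not on the slides.

```php
<?php
$loose  = '~^\d{2}.\d{2}.\d{4}$~';       // unescaped dot: any character
$strict = '~^\d{2}[-.]\d{2}[-.]\d{4}$~'; // inside a class the dot is literal

$looseDate     = preg_match($loose, '16-11-2017');  // 1
$looseGarbage  = preg_match($loose, '16a11z2017');  // 1, probably unintended
$strictDate    = preg_match($strict, '16-11-2017'); // 1
$strictDotted  = preg_match($strict, '16.11.2017'); // 1
$strictGarbage = preg_match($strict, '16a11z2017'); // 0
```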

Slide 52

Slide 52 text

line anchors

Slide 53

Slide 53 text

~^\d+$~m abc def 123

Slide 54

Slide 54 text

~\A\d+\Z~m abc def 123

Slide 55

Slide 55 text

balanced constructs

Slide 56

Slide 56 text

~<([a-z]+)>(?R)?~gi

Slide 57

Slide 57 text

stackoverflow.com/a/1732454/443341

Slide 59

Slide 59 text

PREG_OFFSET_CAPTURE and the multibyte input

Slide 60

Slide 60 text

’’’’[xx]’’[yy] regular apostrophe (U+0027): ' RIGHT SINGLE QUOTATION MARK (U+2019): ’

Slide 61

Slide 61 text

$part = substr($text, 0, $match[1]);
$offset = mb_strlen($part, 'utf-8');
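
A fuller sketch of the conversion above, assuming the mbstring extension is available and using the sample text from the previous slide. PREG_OFFSET_CAPTURE reports byte offsets even with the `u` modifier, and U+2019 takes three bytes in UTF-8, so byte and character offsets diverge. The variable names are illustrative.

```php
<?php
// Four U+2019 marks, [xx], two more marks, [yy].
$text = str_repeat("\u{2019}", 4) . '[xx]' . str_repeat("\u{2019}", 2) . '[yy]';

preg_match_all('~\[[a-z]+\]~u', $text, $matches, PREG_OFFSET_CAPTURE);

$byteOffsets = [];
$charOffsets = [];
foreach ($matches[0] as [$value, $byteOffset]) {
    // Cut the subject at the byte offset, then count characters in that prefix.
    $part = substr($text, 0, $byteOffset);
    $byteOffsets[$value] = $byteOffset;
    $charOffsets[$value] = mb_strlen($part, 'utf-8');
}

var_dump($byteOffsets); // [xx] => 12, [yy] => 22
var_dump($charOffsets); // [xx] => 4,  [yy] => 10
```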

Slide 62

Slide 62 text

email validation

Slide 63

Slide 63 text

ex-parrot.com/~pdw/Mail-RFC822-Address.html (?:(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()< >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\ r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\ [\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\ t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(? 
:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[ ([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:( ?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\" .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*:(? 
:(?:\r\n)?[\t])*(?:(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@ ,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\ n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t] )*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\ r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?: \r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\ [\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ 
\.)*\](?:(?:\r\n)?[\t])*))*\>(?:(?:\r\n)?[\t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t] )*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()< >@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*|(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(? :(?:\r\n)?[\t])*)*\<(?:(?:\r\n)?[\t])*(?:@(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^( )<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*))*(?:,@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[\t])*))*)*:(?:(?:\r\n)?[\t])*)?(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*)(?:\. 
(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[\t]))*"(?:(?:\r\n)?[\t])*))*@(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\" .\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[\t])*)(?:\.(?:(?:\r\n)?[\t])*(?:[^()<>@,;:\\".\[\]\000-\031]+(?:(?:(?:\r\n)?[\t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

Slide 64

Slide 64 text

/use cases/

Slide 65

Slide 65 text

Shortcode

Slide 66

Slide 66 text

[name=bbCode arg=val param] content [/name]

Slide 67

Slide 67 text

[name=bbCode arg=val param] content [/name]

Slide 68

Slide 68 text

/\[(\[?)([a-zA-Z]+)(?![\w-])([^\]\/]*(?:\/(?!\])[^\]\/]*)*?)(?:(\/)\]|\](?:([^\[]*+(?:\[(?!\/\2\])[^\[]*+)*+)\[\/\2\])?)(\]?)/s
/([\w-]+)\s*=\s*"([^"]*)"(?:\s|$)|([\w-]+)\s*=\s*'([^']*)'(?:\s|$)|([\w-]+)\s*=\s*([^\s'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/
https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L239
https://core.trac.wordpress.org/browser/tags/4.3.1/src/wp-includes/shortcodes.php#L448

Slide 69

Slide 69 text

~((?:\[\s*(?[a-zA-Z0-9-_]+)\s*(?:\=\ s*(?\"(?:[^\"\\]*(?:\\.[^\"\\]*)*) \"|(?!=(?:\s*|\]|\/\]))))?\s*(?(?:\s*(?:\w+(?:\s*\=\s*\"(?:[^\"\\]*(?:\\ .[^\"\\]*)*)\"|\s*\=\s*(?!=(?:\s*|\]|\/\]) )|(?=\s|\]|\/\s*\]|$))))*)\s*(?:\](?.*?)\[\s*(?\/)\s*(\k)\s*\]|\]|(?\/)\s*\])))~us github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

Slide 70

Slide 70 text

Generator

Slide 71

Slide 71 text

3a[1!n]4!a10n: between one and three alpha characters, exactly one optional digit, exactly four alpha characters, and between one and ten digits

Slide 72

Slide 72 text

~^[a-zA-Z]{1,3}[0-9]{1}?[a-zA-Z]{4}[0-9]{1,10}$~: between one and three alpha characters, exactly one optional digit, exactly four alpha characters, and between one and ten digits

Slide 73

Slide 73 text

~^(?<a1>[a-zA-Z]{1,3})(?<n1>[0-9]{1}?)(?<a2>[a-zA-Z]{4})(?<n2>[0-9]{1,10})$~ THU3NDER1337 a1: THU, n1: 3, a2: NDER, n2: 1337
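
A hedged sketch of how such a generator might look. The function `specToRegex` and its token grammar are hypothetical and inferred only from the one format string shown on the slides; the talk does not show its implementation. One detail to be aware of: PCRE reads `{1}?` as a lazy "exactly one", not as "optional", so this sketch wraps bracketed parts in an optional group instead.

```php
<?php
// Hypothetical converter: "3a" = 1 to 3 letters, "4!a" = exactly 4 letters,
// "10n" = 1 to 10 digits, "[...]" = the enclosed part is optional.
function specToRegex(string $spec): string
{
    $classes = ['a' => '[a-zA-Z]', 'n' => '[0-9]'];
    // The conditional (?(1)\]) requires "]" only when "[" was matched.
    preg_match_all('~(\[)?(\d+)(!)?([an])(?(1)\])~', $spec, $tokens, PREG_SET_ORDER);

    $regex = '';
    foreach ($tokens as $t) {
        $quantifier = $t[3] === '!' ? '{' . $t[2] . '}' : '{1,' . $t[2] . '}';
        $part = $classes[$t[4]] . $quantifier;
        $regex .= $t[1] === '[' ? '(?:' . $part . ')?' : $part;
    }

    return '~^' . $regex . '$~';
}

$generated = specToRegex('3a[1!n]4!a10n');

$withDigit    = preg_match($generated, 'THU3NDER1337'); // 1
$withoutDigit = preg_match($generated, 'THUNDER1337');  // 1

// The slide pattern with {1}? still demands the digit.
$slide = '~^[a-zA-Z]{1,3}[0-9]{1}?[a-zA-Z]{4}[0-9]{1,10}$~';
$slideWithoutDigit = preg_match($slide, 'THUNDER1337'); // 0
```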

Slide 74

Slide 74 text

/tools/

Slide 75

Slide 75 text

regular-expressions.info

Slide 76

Slide 76 text

RexEgg

Slide 77

Slide 77 text

Debuggex

Slide 78

Slide 78 text

github.com/thunderer/Shortcode/blob/master/src/Utility/RegexBuilderUtility.php

Slide 79

Slide 79 text

Regex101

Slide 80

Slide 80 text

/summary/

Slide 81

Slide 81 text

(?P) Twitter / @tmmx

Slide 82

Slide 82 text

please rate the talk and leave feedback joind.in/talk/b4015

Slide 83

Slide 83 text

Resources
http://rexegg.com
http://www.regular-expressions.info
https://en.wikipedia.org/wiki/Syntax_diagram
https://en.wikipedia.org/wiki/Regular_expression
Tools
https://regex101.com
https://debuggex.com
http://regexper.com
http://regexr.com
Pictures (Creative Commons)
https://www.flickr.com/photos/ghor/8394379683 (forest)

Slide 84

Slide 84 text

(?=thanks!) Twitter / @tmmx