Slide 1

Slide 1 text

REGEX 101
The Swiss Army knife of string manipulation

Slide 2

Slide 2 text

@matthiasmullie Regular expressions 101

Slide 3

Slide 3 text

INTRODUCTION
What are regular expressions?

Slide 4

Slide 4 text

Regular expressions 101 » Introduction
Google: Regular expressions are special characters that match or capture portions of a field, as well as the rules that govern all characters.

Slide 5

Slide 5 text

Wikipedia: A regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters.

Slide 6

Slide 6 text

/\{\$([a-z0-9_]*)((\.[a-z0-9_]*)*)(-\>[a-z0-9_]*((\.[a-z0-9_]*)*))?((\|[a-z_][a-z0-9_]*(:.*?)*)*)\}/i

Slide 7

Slide 7 text

Me: Regular expressions find patterns in strings.

Slide 8

Slide 8 text

Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...
‣ /[a-z]/i
‣ /[^\w]/i
‣ /ipsum/
‣ /(est|qui)/
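To make the four patterns on this slide concrete, here is a minimal sketch that runs them against the sample sentence. The variable names are made up for illustration, and `preg_match_all()` is used simply because it reports how many matches were found.

```php
<?php
// Sample sentence from the slide.
$text = 'Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...';

// /ipsum/ looks for the literal word "ipsum".
$hasIpsum = preg_match('/ipsum/', $text);

// /(est|qui)/ matches either alternative; preg_match_all() collects every hit.
preg_match_all('/(est|qui)/', $text, $alternatives);

// /[a-z]/i matches every single letter, regardless of case.
$letterCount = preg_match_all('/[a-z]/i', $text);

// /[^\w]/i matches every character that is not a letter, digit or underscore.
$nonWordCount = preg_match_all('/[^\w]/i', $text);

var_dump($hasIpsum, $alternatives[0], $letterCount, $nonWordCount);
```

Since the sentence contains no digits or underscores, the letter count and the non-word count together add up to the length of the string.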

Slide 9

Slide 9 text

BASICS
The syntax everyone should know already

Slide 10

Slide 10 text

/Delimiter/
‣ Any [^a-zA-Z0-9\\\s] character
‣ Opening char == terminating char
‣ Except for [ ] , ( ) , { } and < >
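As a rough illustration of the delimiter rules, the sketch below writes one pattern three ways. The subject string and variable names are invented for the example.

```php
<?php
$subject = 'path/to/file';

// With "/" as delimiter, a slash inside the pattern has to be escaped.
$withSlash = preg_match('/to\/file/', $subject);

// Any other non-alphanumeric, non-backslash, non-whitespace character works too.
$withHash = preg_match('#to/file#', $subject);

// Bracket-style delimiters are the exception: "{" opens and "}" closes.
$withBraces = preg_match('{to/file}', $subject);

var_dump($withSlash, $withHash, $withBraces);
```

All three calls describe the same pattern; only the amount of escaping differs.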

Slide 11

Slide 11 text

Delimiter
Use / (uniformity, you know)

Slide 12

Slide 12 text

Meta characters
‣ .
‣ [ ]
‣ ^ $
‣ |
‣ ( )
‣ \
‣ * ? +
‣ {n} {n,m}
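The following sketch combines most of these meta characters in one hypothetical file-name pattern. The pattern and the sample strings are assumptions chosen only to show each character at work.

```php
<?php
// ^ and $ anchor the match, [ ] is a character class, {4} and + are quantifiers,
// ( | ) groups alternatives, and \. escapes the dot so it is matched literally.
$pattern = '/^[0-9]{4}-[0-9]+\.(jpg|png)$/';

$valid   = preg_match($pattern, '2014-123.png');
$invalid = preg_match($pattern, '2014-123xpng'); // the escaped dot needs a literal "."

// ? makes the preceding "u" optional.
$optional = preg_match('/colou?r/', 'color');

// An unescaped . matches any single character.
$anyChar = preg_match('/a.c/', 'abc');

var_dump($valid, $invalid, $optional, $anyChar);
```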

Slide 13

Slide 13 text

Pattern modifiers //x
‣ i ‣ m ‣ s ‣ x ‣ e ‣ A ‣ D ‣ U ‣ J ‣ ...
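A short sketch of the three modifiers that come up most often (i, m and s); the sample text is invented. The e modifier from the list only ever applied to `preg_replace()` and was removed in PHP 7.0, so it is left out here.

```php
<?php
$text = "first line\nSecond line";

// i: case-insensitive matching.
$caseInsensitive = preg_match('/second/i', $text);

// m: ^ and $ also match at line breaks inside the subject.
$withoutM = preg_match('/^Second/', $text);
$withM    = preg_match('/^Second/m', $text);

// s: the dot also matches newline characters.
$withoutS = preg_match('/first.*Second/', $text);
$withS    = preg_match('/first.*Second/s', $text);

var_dump($caseInsensitive, $withoutM, $withM, $withoutS, $withS);
```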

Slide 14

Slide 14 text

Character classes [ ]
Ranges
‣ [0-9]
‣ [a-zA-Z]
‣ [A-F0-9]
Inverse ranges
‣ [^0-9]
‣ [^a-zA-Z]
‣ [^A-F0-9]

Slide 15

Slide 15 text

Character classes [ ]
No sequence of characters!
[lorem]
‣ matches a single l, o, r, e or m
‣ does not mean the sequence lorem
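The two character-class slides can be sketched as follows. The sample inputs (a hex string, a phone number) are assumptions made up for the example.

```php
<?php
// A range matches one character out of the set; + repeats it.
$isHex    = preg_match('/^[A-F0-9]+$/', 'C0FFEE');
$isNotHex = preg_match('/^[A-F0-9]+$/', 'COFFEE'); // the letter "O" is outside A-F

// An inverse range matches one character that is NOT in the set.
$digitsOnly = preg_replace('/[^0-9]/', '', '+32 (0)9 123 45 67');

// [lorem] is a set of single characters, not the word "lorem".
$classMatch    = preg_match('/[lorem]/', 'me'); // "m" alone is enough
$sequenceMatch = preg_match('/lorem/', 'me');

var_dump($isHex, $isNotHex, $digitsOnly, $classMatch, $sequenceMatch);
```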

Slide 16

Slide 16 text

Character classes [ ]
POSIX
‣ [:alnum:]
‣ [:blank:]
‣ [:lower:]
‣ ...
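One detail worth showing: in PCRE, a POSIX class is only recognised inside a bracket expression, so it is written `[[:alnum:]]` rather than `[:alnum:]`. A minimal sketch with made-up inputs:

```php
<?php
// The outer [ ] is the character class, the inner [:alnum:] is the POSIX name.
$alnum    = preg_match('/^[[:alnum:]]+$/', 'abc123');
$notAlnum = preg_match('/^[[:alnum:]]+$/', 'abc 123'); // the space is not alphanumeric

// Several POSIX names can share one class: lowercase letters plus space/tab.
$lowerBlank = preg_match('/^[[:lower:][:blank:]]+$/', "lower case\twords");

var_dump($alnum, $notAlnum, $lowerBlank);
```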

Slide 17

Slide 17 text

Greediness: greedy
<li>list-item1</li><li>list-item2</li>
/<li>.*<\/li>/
‣ <li>list-item1</li><li>list-item2</li>

Slide 18

Slide 18 text

Greediness: lazy
<li>list-item1</li><li>list-item2</li>
/<li>.*?<\/li>/ or /<li>.*<\/li>/U
‣ <li>list-item1</li>
‣ <li>list-item2</li>
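The difference between greedy and lazy matching can be sketched like this, assuming the sample markup on these two slides is a pair of adjacent `<li>` items and the patterns start with a literal `<li>`.

```php
<?php
$html = '<li>list-item1</li><li>list-item2</li>';

// Greedy: .* grabs as much as possible, so one match spans both items.
preg_match_all('/<li>.*<\/li>/', $html, $greedy);

// Lazy: .*? stops at the first closing tag, giving one match per item.
preg_match_all('/<li>.*?<\/li>/', $html, $lazy);

// The U modifier inverts greediness, so a plain .* behaves lazily.
preg_match_all('/<li>.*<\/li>/U', $html, $ungreedy);

var_dump($greedy[0], $lazy[0], $ungreedy[0]);
```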

Slide 19

Slide 19 text

Subpatterns
/([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i
‣ email: the whole match
‣ user: first subpattern
‣ hostname: second subpattern
Note: this regex only barely satisfies my needs for this particular example; do not use it to really find occurrences of email addresses, as it does not fully satisfy RFC 5321 & RFC 5322.
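A small sketch of how the subpatterns end up in the `$matches` array. The address `john@example.com` and the surrounding sentence are made up for the example.

```php
<?php
// Pattern from the slide: fine for a demo, not for real email validation.
$pattern = '/([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i';

preg_match($pattern, 'Mail john@example.com for details', $matches);

$email    = $matches[0]; // index 0 always holds the whole match
$user     = $matches[1]; // first pair of parentheses
$hostname = $matches[2]; // second pair of parentheses

var_dump($email, $user, $hostname);
```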

Slide 20

Slide 20 text

Questions?

Slide 21

Slide 21 text

ADVANCED
The juicy stuff you never knew about, until now

Slide 22

Slide 22 text

Back references
Problem: /href=['"](.*?)['"]/i
Matches:
‣ href="xxx"
‣ href='xxx'
But also matches:
‣ href="xxx'
‣ href='xxx"

Slide 23

Slide 23 text

Back references
Solution: /href=(['"])(.*?)\1/i
\1 references the first subpattern!
Don’t forget to also string-escape in PHP:
preg_match('/href=([\'"])(.*?)\\1/i', ...);
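Here is a runnable sketch of the back reference solution, using the `href="xxx"` samples from the previous slide. Variable names are illustrative only.

```php
<?php
// \1 must repeat whatever the first subpattern (the opening quote) captured.
// Inside a single-quoted PHP string the quote is written \' and the
// backslash of the back reference is doubled.
$pattern = '/href=([\'"])(.*?)\\1/i';

$double     = preg_match($pattern, 'href="xxx"', $m);
$value      = $m[2];
$single     = preg_match($pattern, "href='xxx'");
$mismatched = preg_match($pattern, 'href="xxx\''); // opens with " but closes with '

var_dump($double, $value, $single, $mismatched);
```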

Slide 24

Slide 24 text

Named subpatterns
Scenario: parsing large CSV
1,a title,5.00,92,green
2,another title,3.50,4,blue
3,one more,33699.99,15,white
...

Slide 25

Slide 25 text

Named subpatterns
/([0-9]+),(.*?),([0-9]+\.[0-9]{2}),([0-9]+),([a-z]+)/i
Result excerpt:
[1] => string(1) "1"
[2] => string(7) "a title"
[3] => string(4) "5.00"
[4] => string(2) "92"
[5] => string(5) "green"

Slide 26

Slide 26 text

Named subpatterns
/(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i
Result excerpt:
["id"] => string(1) "1"
[1] => string(1) "1"
["title"] => string(7) "a title"
[2] => string(7) "a title"
["price"] => string(4) "5.00"
[3] => string(4) "5.00"
["stock"] => string(2) "92"
[4] => string(2) "92"
["color"] => string(5) "green"
[5] => string(5) "green"
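The named version can be sketched as below, applied to the first CSV row from the scenario. The group names are taken from the keys in the result excerpt (id, title, price, stock, color).

```php
<?php
$pattern = '/(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i';

preg_match($pattern, '1,a title,5.00,92,green', $row);

// Every value is available under its name as well as its numeric index.
$title        = $row['title'];
$priceByName  = $row['price'];
$priceByIndex = $row[3];

var_dump($title, $priceByName, $priceByIndex);
```

Reading `$row['price']` instead of `$row[3]` keeps the calling code stable when a column is later added to the pattern.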

Slide 27

Slide 27 text

Named subpatterns
‣ (?P<name>pattern)
‣ (?<name>pattern) & (?'name'pattern) since PHP 5.2.2

Slide 28

Slide 28 text

Named subpatterns + back references
/href=(?P<quotes>['"])(?P<…>.*?)(?P=quotes)/i
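A sketch of a named back reference in use. The name `quotes` is the one the slide's `(?P=quotes)` refers to; the name `url` for the second group and the sample link are assumptions made for this example.

```php
<?php
// "quotes" is referenced by (?P=quotes); "url" is a hypothetical name.
$pattern = '/href=(?P<quotes>[\'"])(?P<url>.*?)(?P=quotes)/i';

preg_match($pattern, '<a href="http://www.mullie.eu">', $m);

$quote = $m['quotes'];
$url   = $m['url'];

var_dump($quote, $url);
```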

Slide 29

Slide 29 text

Lookahead/-behind assertions
“Take a peek, don’t eat it”

Slide 30

Slide 30 text

Lookahead/-behind assertions
Scenario: find all occurrences of “here”
“Where can I find here, not there?”

Slide 31

Slide 31 text

Lookahead/-behind assertions
Deduction: Find all here’s, not preceded or followed by an alphabetic character.
Solution: /(?
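The solution pattern is cut off in this transcript, so the sketch below is an assumption that simply implements the deduction as stated: a negative lookbehind and a negative lookahead around "here", each rejecting an adjacent letter.

```php
<?php
$sentence = 'Where can I find here, not there?';

// Assumed pattern: "here" not preceded (?<!...) or followed (?!...) by a letter.
$pattern = '/(?<![a-z])here(?![a-z])/i';

// PREG_OFFSET_CAPTURE also reports the byte offset of each match.
preg_match_all($pattern, $sentence, $m, PREG_OFFSET_CAPTURE);

// For comparison: the naive pattern also hits "Where" and "there".
$naiveCount = preg_match_all('/here/', $sentence);

var_dump($m[0], $naiveCount);
```

The assertions only peek at the neighbouring characters, so the match itself is still just the four letters of "here".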

Slide 32

Slide 32 text

Lookahead/-behind assertions
‣ Positive lookahead: (?=expression)
‣ Negative lookahead: (?!expression)
‣ Positive lookbehind: (?<=expression)
‣ Negative lookbehind: (?<!expression)

Slide 33

Slide 33 text

Lookahead/-behind assertions
“lookbehind assertion is not fixed length...”
In PHP, lookbehind cannot contain variable-length repetition, while lookahead can.
Lookahead:
‣ (?=.*)
‣ (?=abc)
Lookbehind:
‣ (?<=.*) (not allowed)
‣ (?<=abc)
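A quick way to see the restriction is to compile the patterns and check the return value. This is a sketch with made-up subjects; the `@` operator is used only to silence the compile warning that `preg_match()` emits for the invalid pattern.

```php
<?php
// Lookahead may contain unbounded repetition.
$lookahead = preg_match('/foo(?=.*bar)/', 'foo and bar');

// A fixed-length lookbehind is fine.
$fixedLookbehind = preg_match('/(?<=abc)def/', 'abcdef');

// Unbounded repetition inside a lookbehind fails to compile:
// preg_match() returns false instead of 0 or 1.
$unboundedLookbehind = @preg_match('/(?<=.*)def/', 'abcdef');

var_dump($lookahead, $fixedLookbehind, $unboundedLookbehind);
```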

Slide 34

Slide 34 text

Conditional subpatterns
if-then(-else) in regular expressions
YES RLY!

Slide 35

Slide 35 text

Conditional subpatterns
Scenario: match all (x|ht)ml tags
Caution! ‣ ‣

Slide 36

Slide 36 text

Conditional subpatterns
Solution: /<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i
‣ Named patterns: tag, self
‣ if / then / else: if self-closing, then do nothing; else, find the matching end tag
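The conditional can be tried out as follows. The group names `tag` and `self` are the ones the pattern itself refers to through `(?P=tag)` and `(?(self)`; the two sample snippets are invented for the example.

```php
<?php
$pattern = '/<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i';

// Self-closing tag: "self" captured the slash, so the "then" branch (nothing) is used.
preg_match($pattern, 'text <br /> text', $selfClosing);

// Paired tag: "self" did not take part, so the "else" branch looks for </p>.
preg_match($pattern, '<p class="x">hello</p> world', $paired);

var_dump($selfClosing[0], $paired[0], $paired['tag']);
```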

Slide 37

Slide 37 text

Conditional subpatterns
‣ With subpattern (named or by id):
  ‣ (?(pattern)then)
  ‣ (?(pattern)then|else)
‣ With lookahead/-behind:
  ‣ (?(?=assertion)then)
  ‣ (?(?=assertion)then|else)

Slide 38

Slide 38 text

Comments
/
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
    # pattern modifiers:
    #     u for UTF-8 interpretation (currency symbols),
    #     x to ignore whitespace (for comments)
/ux
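In PHP source, a commented pattern like this fits naturally in a nowdoc, which needs no string escaping for `$`. This is only a sketch: the sample subjects are invented, and the file is assumed to be saved as UTF-8 so the currency symbols work with the u modifier.

```php
<?php
// Each "#" comment runs to the end of its line, so the pattern must stay multi-line.
$pattern = <<<'REGEX'
/
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
/ux
REGEX;

$price   = preg_match($pattern, 'Only €5 today');
$noPrice = preg_match($pattern, 'Paid in € or $'); // no digit after either symbol

var_dump($price, $noPrice);
```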

Slide 39

Slide 39 text

Comments
‣ # Perl-style
‣ /x modifier
‣ Ignores unescaped whitespace

Slide 41

Slide 41 text

Questions?

Slide 42

Slide 42 text

Resources
mullie.eu
‣ www.mullie.eu/regular-expressions-basics/
‣ www.mullie.eu/regular-expressions-advanced/