Regular expressions 101

REGEX 101: The Swiss Army knife of string manipulation

@matthiasmullie

INTRODUCTION
What are regular expressions?

Google: "Regular expressions are special characters that match or capture portions of a field, as well as the rules that govern all characters."

Wikipedia: "A regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters."

/\{\$([a-z0-9_]*)((\.[a-z0-9_]*)*)  (-\>[a-z0-9_]*((\.[a-z0-9_]*)*))?  ((\|[a-z_][a-z0-9_]*(:.*?)*)*)\}/i

Me: "Regular expressions find patterns in strings."

Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...
‣ /[a-z]/i
‣ /[^\w]/i
‣ /ipsum/
‣ /(est|qui)/
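
To make the four patterns above concrete, here is a minimal PHP sketch that runs them against the sample sentence. The variable names are mine, not from the slides.

```php
<?php
// The sample sentence from the slide.
$subject = 'Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...';

// /ipsum/ looks for the literal word; preg_match() returns 1 on a match, 0 otherwise.
$hasIpsum = preg_match('/ipsum/', $subject);

// /(est|qui)/ matches either alternative; preg_match_all() returns the number of matches.
$estOrQui = preg_match_all('/(est|qui)/', $subject, $matches);

// /[a-z]/i matches one letter at a time, /[^\w]/i one non-word character at a time.
$letters  = preg_match_all('/[a-z]/i', $subject);
$nonWords = preg_match_all('/[^\w]/i', $subject);

printf("ipsum: %d, est|qui: %d, letters: %d, non-word: %d\n", $hasIpsum, $estOrQui, $letters, $nonWords);
print_r($matches[0]); // the individual "est" and "qui" hits, in order of appearance
```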

BASICS
The syntax everyone should know already

/Delimiter/
‣ Any [^a-zA-Z0-9\\\s] character (not alphanumeric, not a backslash, not whitespace)
‣ Opening char == terminating char
‣ Except for the bracket pairs [ ], ( ), { } and < >

Use / (uniformity, you know)
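
A small sketch of the delimiter rules, assuming a made-up URL as the subject. It shows the same pattern written with three different delimiters.

```php
<?php
// Hypothetical subject, chosen because it contains slashes.
$subject = 'http://example.com/path';

// With "/" as delimiter, every literal slash in the pattern must be escaped.
$withSlash  = preg_match('/^http:\/\/[a-z.]+\/path$/', $subject);

// "#" opens and closes the pattern, so the slashes stay readable.
$withHash   = preg_match('#^http://[a-z.]+/path$#', $subject);

// Bracket-style delimiters close with the matching bracket instead of the same character.
$withBraces = preg_match('{^http://[a-z.]+/path$}', $subject);

var_dump($withSlash, $withHash, $withBraces); // all three patterns are equivalent
```

Sticking to / everywhere still has the uniformity benefit the slide mentions; the alternatives mostly pay off when the pattern itself is full of slashes.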

Meta characters
‣ .
‣ [ ]
‣ ^ $
‣ |
‣ ( )
‣ \
‣ * ? +
‣ {n} {n,m}
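
As a quick tour of the meta characters listed above, the following sketch uses each of them once. The subjects are arbitrary examples of mine.

```php
<?php
// . matches any single character (except a newline by default).
$dot     = preg_match('/c.t/', 'cat');               // 1

// ^ and $ anchor the pattern to the start and end of the subject.
$anchors = preg_match('/^cat$/', 'concat');          // 0

// | separates alternatives, ( ) groups them.
$alt     = preg_match('/^(cat|dog)s$/', 'dogs');     // 1

// * means zero or more, + one or more, ? zero or one.
$quant   = preg_match('/^colou?r$/', 'color');       // 1

// {n} and {n,m} repeat exactly n times, or between n and m times.
$braces  = preg_match('/^[0-9]{2,4}$/', '12345');    // 0

// \ escapes a meta character so it is taken literally.
$escape  = preg_match('/^3\.14$/', '3514');          // 0
```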

Pattern modifiers //x
‣ i
‣ m
‣ s
‣ x
‣ e
‣ A
‣ D
‣ U
‣ J
‣ ...
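
The sketch below tries out a handful of these modifiers with throwaway subjects of my own. The e modifier is left out on purpose: it was deprecated in PHP 5.5 and removed in PHP 7.

```php
<?php
// i: case-insensitive matching.
$i = preg_match('/regex/i', 'REGEX 101');               // 1

// m: ^ and $ also match at line breaks inside the subject.
$m = preg_match_all('/^\w+$/m', "one\ntwo\nthree");     // 3

// s: the dot also matches newlines.
$withoutS = preg_match('/a.b/', "a\nb");                // 0
$withS    = preg_match('/a.b/s', "a\nb");               // 1

// x: unescaped whitespace in the pattern is ignored.
$x = preg_match('/ a b c /x', 'abc');                   // 1

// U: quantifiers become lazy by default.
preg_match('/a+/U', 'aaa', $ungreedy);                  // $ungreedy[0] is "a"
```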

Character classes [ ]
Ranges:
‣ [0-9]
‣ [a-zA-Z]
‣ [A-F0-9]
Inverse ranges:
‣ [^0-9]
‣ [^a-zA-Z]
‣ [^A-F0-9]

Character classes [ ]
No sequence of characters!
[lorem]
‣ Means: one of l, o, r, e or m
‣ Does not mean: the sequence lorem
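
A short sketch of the point above, with anchors added so that each pattern has to cover the whole subject. The subjects are examples of mine.

```php
<?php
// [lorem] is a set: it matches exactly one character out of l, o, r, e, m.
$single   = preg_match('/^[lorem]$/', 'r');         // 1
$word     = preg_match('/^[lorem]$/', 'lorem');     // 0: five characters, the class matches one

// Repeating the class accepts any mix of those letters, not just "lorem".
$mix      = preg_match('/^[lorem]+$/', 'memo');     // 1

// A range and an inverse range, as on the previous slide.
$hex      = preg_match('/^[A-F0-9]+$/', 'C0FFEE');  // 1
$nonDigit = preg_match('/[^0-9]/', '2024');         // 0: every character is a digit
```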

Character classes [ ]: POSIX
‣ [:alnum:]
‣ [:blank:]
‣ [:lower:]
‣ ...
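
One detail that is easy to miss: POSIX classes only work inside a character class, so they end up in double brackets. A minimal sketch with sample subjects of mine:

```php
<?php
// [:alnum:] inside [ ] gives [[:alnum:]]: letters and digits.
$alnum = preg_match('/^[[:alnum:]]+$/', 'abc123');                        // 1

// [:lower:] only covers lowercase letters.
$lower = preg_match('/^[[:lower:]]+$/', 'abcDEF');                        // 0

// POSIX classes can be combined with each other inside one character class.
// [:blank:] covers spaces and tabs.
$mixed = preg_match('/^[[:lower:][:blank:]]+$/', "lorem ipsum\tdolor");   // 1
```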

Greediness: greedy
Subject: <ul><li>list-item1</li><li>list-item2</li></ul>
Pattern: /<li>.*<\/li>/
Result:
‣ <li>list-item1</li><li>list-item2</li>

Greediness: lazy
Subject: <ul><li>list-item1</li><li>list-item2</li></ul>
Pattern: /<li>.*?<\/li>/ or /<li>.*<\/li>/U
Result:
‣ <li>list-item1</li>
‣ <li>list-item2</li>
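
The greedy and lazy variants can be compared side by side with preg_match_all(). This is a sketch using the list from the slides; only the variable names are mine.

```php
<?php
$html = '<ul><li>list-item1</li><li>list-item2</li></ul>';

// Greedy: .* grabs as much as possible, so both items end up in a single match.
preg_match_all('/<li>.*<\/li>/', $html, $greedy);

// Lazy: .*? stops at the first closing tag it can reach.
preg_match_all('/<li>.*?<\/li>/', $html, $lazy);

// The U modifier flips the default, making .* lazy without the extra question mark.
preg_match_all('/<li>.*<\/li>/U', $html, $ungreedy);

print_r($greedy[0]);   // one match spanning both items
print_r($lazy[0]);     // two matches, one per item
print_r($ungreedy[0]); // same as the lazy result
```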

Subpatterns
/([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i
‣ Full match: email
‣ First subpattern: user
‣ Second subpattern: hostname
Note: this regex only barely satisfies my needs for this particular example. Do not use it to really find occurrences of email addresses; it does not fully satisfy RFC 5321 and RFC 5322.
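
Here is how the subpatterns show up in PHP, sketched with a made-up address. The third argument of preg_match() receives the full match at index 0 and one entry per pair of parentheses after that.

```php
<?php
$pattern = '/([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i';

// Hypothetical subject; the address is an example, not from the slides.
$found = preg_match($pattern, 'Contact john@example.com today', $matches);

echo $matches[0], "\n"; // full match: the email address
echo $matches[1], "\n"; // first subpattern: the user
echo $matches[2], "\n"; // second subpattern: the hostname
```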

Questions?

ADVANCED
The juicy stuff you never knew about, until now

Back references
Problem: /href=['"](.*?)['"]/i
Matches, as intended:
‣ href="xxx"
‣ href='xxx'
But also matches mismatched quotes:
‣ href="xxx'
‣ href='xxx"

Back references
Solution: /href=(['"])(.*?)\1/i
\1 references the first subpattern!
Don't forget to also string-escape in PHP: preg_match('/href=([\'"])(.*?)\\1/i', ...);
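
A runnable sketch of the problem and the fix, using invented anchor tags as subjects. The naive pattern accepts mismatched quotes; the back reference does not.

```php
<?php
// In a single-quoted PHP string, \' yields ' and \\ yields \,
// so PCRE receives /href=(['"])(.*?)\1/i.
$naive   = '/href=[\'"](.*?)[\'"]/i';
$backref = '/href=([\'"])(.*?)\\1/i';

// Hypothetical subjects: one with matching quotes, one with mismatched quotes.
$good = '<a href="page.html">';
$bad  = "<a href=\"page.html'>";

$naiveOnBad   = preg_match($naive, $bad);        // the naive pattern is fooled
$backrefOnBad = preg_match($backref, $bad);      // \1 insists on the same quote
$backrefOnOk  = preg_match($backref, $good, $m); // $m[1] is the quote, $m[2] the URL

var_dump($naiveOnBad, $backrefOnBad, $backrefOnOk, $m[2]);
```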

Named subpatterns
Scenario: parsing large CSV
1,a title,5.00,92,green
2,another title,3.50,4,blue
3,one more,33699.99,15,white
...

Named subpatterns
/([0-9]+),(.*?),([0-9]+\.[0-9]{2}),([0-9]+),([a-z]+)/i
Result excerpt:
[1] => string(1) "1"
[2] => string(7) "a title"
[3] => string(4) "5.00"
[4] => string(2) "92"
[5] => string(5) "green"

Named subpatterns
/(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i
Result excerpt:
["id"] => string(1) "1"
[1] => string(1) "1"
["title"] => string(7) "a title"
[2] => string(7) "a title"
["price"] => string(4) "5.00"
[3] => string(4) "5.00"
["stock"] => string(2) "92"
[4] => string(2) "92"
["color"] => string(5) "green"
[5] => string(5) "green"
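
To show why the names are worth the extra typing, this sketch parses the three CSV rows from the scenario into associative arrays. The $products structure and the type casts are my own choices, not something the slides prescribe.

```php
<?php
$pattern = '/(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i';

// The sample rows from the scenario slide.
$rows = [
    '1,a title,5.00,92,green',
    '2,another title,3.50,4,blue',
    '3,one more,33699.99,15,white',
];

$products = [];
foreach ($rows as $row) {
    if (preg_match($pattern, $row, $m)) {
        // Named keys keep working even if a column is added to the pattern later.
        $products[] = [
            'id'    => (int) $m['id'],
            'title' => $m['title'],
            'price' => $m['price'],
            'stock' => (int) $m['stock'],
            'color' => $m['color'],
        ];
    }
}

print_r($products);
```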

Named subpatterns
‣ (?P<name>pattern)
‣ (?<name>pattern) & (?'name'pattern) since PHP 5.2.2

Named subpatterns + back references
/href=(?P<quotes>['"])(?P<href>.*?)(?P=quotes)/i
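
The combined form reads much like the numbered version, except that the result can be addressed by name. A minimal sketch with an invented link:

```php
<?php
// (?P=quotes) refers back to whatever the "quotes" subpattern captured.
$pattern = '/href=(?P<quotes>[\'"])(?P<href>.*?)(?P=quotes)/i';

// Hypothetical subject using single quotes around the attribute value.
$found = preg_match($pattern, "<a href='about.html'>About</a>", $m);

echo $m['href'], "\n";   // the URL between the quotes
echo $m['quotes'], "\n"; // the quote character that was used
```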

Lookahead/-behind assertions
"Take a peek, don't eat it"

Lookahead/-behind assertions
Scenario: find all occurrences of "here"
"Where can I find here, not there?"

Lookahead/-behind assertions
Deduction: find every "here" that is neither preceded nor followed by an alphabetic character.
Solution: /(?<![a-z])here(?![a-z])/i
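
A sketch that runs the solution against the scenario sentence and compares it with a plain /here/ search. PREG_OFFSET_CAPTURE is only there to show where the match sits.

```php
<?php
$subject = 'Where can I find here, not there?';

// Without assertions, "Where" and "there" are matched as well.
$naive = preg_match_all('/here/i', $subject);

// The assertions peek at the neighbouring characters without consuming them,
// so the match itself is still just "here".
$count = preg_match_all('/(?<![a-z])here(?![a-z])/i', $subject, $m, PREG_OFFSET_CAPTURE);

printf("naive: %d, with assertions: %d\n", $naive, $count);
print_r($m[0]); // each entry is [matched text, byte offset]
```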

Lookahead/-behind assertions
‣ Positive lookahead: (?=expression)
‣ Negative lookahead: (?!expression)
‣ Positive lookbehind: (?<=expression)
‣ Negative lookbehind: (?<!expression)

Lookahead/-behind assertions
"lookbehind assertion is not fixed length..."
In PHP, a lookbehind cannot contain variable-length repetition, while a lookahead can.
‣ Valid: (?=.*)
‣ Valid: (?=abc)
‣ Valid: (?<=abc)
‣ Invalid: (?<=.*)
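
The restriction is easy to reproduce. In this sketch the invalid pattern fails to compile, so preg_match() returns false and raises a warning (suppressed here with @). The exact warning text may differ between PCRE versions, so the code only checks the return value.

```php
<?php
// Lookahead may contain repetition.
$lookahead = preg_match('/foo(?=.*bar)/', 'foo and bar');   // 1

// Fixed-length lookbehind is fine.
$fixed     = preg_match('/(?<=abc)def/', 'abcdef');         // 1

// Unbounded repetition inside a lookbehind does not compile:
// preg_match() returns false instead of 0 or 1.
$unbounded = @preg_match('/(?<=.*)def/', 'abcdef');

var_dump($lookahead, $fixed, $unbounded);
```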

Conditional subpatterns
if-then(-else) in regular expressions
YES RLY!

Conditional subpatterns
Scenario: match all (x|ht)ml tags
Caution!
‣ <element></element>
‣ <element />

Conditional subpatterns
Solution: /<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i
‣ Named patterns: tag and self
‣ if: (self), the tag is self-closing
‣ then: do nothing
‣ else: find the matching end tag
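
A sketch of the solution in action on a tiny, invented fragment that contains both kinds of tags. Like the email example earlier, this is an illustration and not a substitute for a real HTML parser.

```php
<?php
$pattern = '/<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i';

// Hypothetical input: one element with an end tag, one self-closing element.
$html = '<p>text</p><br />';

$count = preg_match_all($pattern, $html, $m);

print_r($m[0]);     // the complete tags that were matched
print_r($m['tag']); // the tag name captured for each match
```

For <br />, the "self" subpattern captures the slash, so the condition is true and the empty "then" branch is used. For <p>, "self" stays unset and the "else" branch goes looking for </p>.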

Conditional subpatterns
‣ With subpattern (named or by id):
  ‣ (?(pattern)then)
  ‣ (?(pattern)then|else)
‣ With lookahead/-behind:
  ‣ (?(?=assertion)then)
  ‣ (?(?=assertion)then|else)
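
The assertion-based form can be sketched with a made-up validation rule: a code is either exactly four digits or letters only. The rule and the variable names are hypothetical; only the (?(?=assertion)then|else) syntax comes from the slide.

```php
<?php
// If the next character is a digit, require four digits; otherwise require letters.
$pattern = '/^(?(?=[0-9])[0-9]{4}|[a-z]+)$/i';

$digits  = preg_match($pattern, '2024');   // "then" branch
$letters = preg_match($pattern, 'abc');    // "else" branch
$mixed   = preg_match($pattern, '20ab');   // starts with a digit, so four digits are required
$tooLong = preg_match($pattern, '12345');  // five digits do not fit {4} followed by $

var_dump($digits, $letters, $mixed, $tooLong);
```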

Comments
/
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
    # pattern modifiers:
    #   u for UTF-8 interpretation (currency symbols),
    #   x to ignore whitespace (for comments)
/ux

Comments
‣ # Perl-style
‣ /x modifier
‣ Ignores unescaped whitespace
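
In PHP the commented pattern can be written as a multi-line single-quoted string. This sketch assumes the source file is saved as UTF-8, which the u modifier needs for the currency symbols; the subjects are invented.

```php
<?php
$pattern = '/
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
/ux';

// Hypothetical subjects: one with a price, one with bare symbols.
$priced   = preg_match($pattern, 'Total: €25', $m);
$unpriced = preg_match($pattern, 'Paid in €, not $');

var_dump($priced, $m[0], $unpriced);
```

Note that a comment must not contain the delimiter itself, since the first unescaped / would end the pattern.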


Questions?

Resources
mullie.eu
‣ www.mullie.eu/regular-expressions-basics/
‣ www.mullie.eu/regular-expressions-advanced/
