Regular expressions 101

Regular expressions are undervalued, and most developers know only the basics. A thorough understanding of how regular expressions work is incredibly helpful when you need to parse structured data.

This presentation assumes you already know what regular expressions are, and sums up (with examples) some fancy things you probably didn’t know were possible with them.

If you're interested in a more detailed write-up, I suggest you check out http://www.mullie.eu/regular-expressions-basics/ & http://www.mullie.eu/regular-expressions-advanced/

This presentation is based on the PHP implementation of PCRE, but nearly all programming languages support the same functionality, albeit sometimes with their own twists.

Matthias Mullie

June 20, 2012

Transcript

  1. The Swiss Army knife of string manipulation REGEX 101

  2. @matthiasmullie Regular expressions 101

  3. Regular expressions 101 INTRODUCTION What are regular expressions?

  4. Regular expressions 101 » Introduction: Google

    Regular expressions are special characters that match or capture portions of a field, as well as the rules that govern all characters.
  5. Introduction: Wikipedia

    A regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters.
  6. Introduction

    /\{\$([a-z0-9_]*)((\.[a-z0-9_]*)*)(-\>[a-z0-9_]*((\.[a-z0-9_]*)*))?((\|[a-z_][a-z0-9_]*(:.*?)*)*)\}/i

  7. Introduction: Me

    Regular expressions find patterns in strings.
  8. Introduction

    Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...
    ‣ /[a-z]/i
    ‣ /[^\w]/i
    ‣ /ipsum/
    ‣ /(est|qui)/
  9. Regular expressions 101 BASICS The syntax everyone should know already

  10. Delimiter: /Delimiter/

    ‣ Any [^a-zA-Z0-9\\\s] character
    ‣ Opening char == terminating char
    ‣ Except for [ ], ( ), { } and < >, which terminate with the matching closing bracket
  11. Delimiter: use / (uniformity, you know)

  12. Meta characters

    ‣ .
    ‣ [ ]
    ‣ ^ $
    ‣ |
    ‣ ( )
    ‣ \
    ‣ * ? +
    ‣ {n} {n,m}
  13. Pattern modifiers: //x

    ‣ i ‣ m ‣ s ‣ x ‣ e ‣ A ‣ D ‣ U ‣ J ‣ ...
  14. Character classes [ ]

    Ranges: ‣ [0-9] ‣ [a-zA-Z] ‣ [A-F0-9]
    Inverse ranges: ‣ [^0-9] ‣ [^a-zA-Z] ‣ [^A-F0-9]
  15. Character classes [ ]

    No sequence of characters!
    [lorem] ‣ matches a single l, o, r, e or m ‣ does not match the sequence lorem
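
The "no sequence of characters" point can be checked with a small sketch. It uses Python's `re` module rather than PHP's preg functions (an assumption made for illustration, so there are no delimiters); the sample text is made up.

```python
import re

text = "lorem ipsum"

# A character class matches exactly one character out of the listed set.
single_chars = re.findall(r"[lorem]", text)

# Without the brackets, the pattern is the literal sequence "lorem".
sequences = re.findall(r"lorem", text)

print(single_chars)  # ['l', 'o', 'r', 'e', 'm', 'm']
print(sequences)     # ['lorem']
```

The trailing 'm' in the first result comes from "ipsum": the class does not care where in the text a listed character appears.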
  16. Character classes [ ]: POSIX

    ‣ [:alnum:] ‣ [:blank:] ‣ [:lower:] ‣ ...
  17. Greediness: greedy

    <ul><li>list-item1</li><li>list-item2</li></ul>
    /<li>.*<\/li>/
    ‣ <li>list-item1</li><li>list-item2</li>
  18. Greediness: lazy

    <ul><li>list-item1</li><li>list-item2</li></ul>
    /<li>.*?<\/li>/ or /<li>.*<\/li>/U
    ‣ <li>list-item1</li>
    ‣ <li>list-item2</li>
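
As a rough illustration of the two greediness slides, here is a minimal sketch in Python's `re` module (not PHP; an assumption for illustration). Python has no equivalent of the `U` modifier, so only the `.*?` form is shown for the lazy case.

```python
import re

html = "<ul><li>list-item1</li><li>list-item2</li></ul>"

# Greedy: .* grabs as much as possible, so both items end up in one match.
greedy = re.findall(r"<li>.*</li>", html)

# Lazy: .*? stops at the first closing tag, giving one match per item.
lazy = re.findall(r"<li>.*?</li>", html)

print(greedy)  # ['<li>list-item1</li><li>list-item2</li>']
print(lazy)    # ['<li>list-item1</li>', '<li>list-item2</li>']
```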
  19. Subpatterns

    /([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i
    ‣ Whole match: email ‣ First subpattern: user ‣ Second subpattern: hostname
    Note: this regex only barely satisfies my needs for this particular example; do not use it to actually find email addresses, as it does not fully satisfy RFC 5321 & RFC 5322.
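
A short sketch of how the subpatterns come back as separate groups, again using Python's `re` instead of PHP (an assumption for illustration). The sample sentence and variable names are hypothetical, and the slide's caveat applies: this pattern is not a real email validator.

```python
import re

# Same pattern as on the slide; re.IGNORECASE stands in for the /i modifier.
email_pattern = re.compile(r"([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})", re.IGNORECASE)

match = email_pattern.search("mail user@example.com today")

email = match.group(0)     # the whole match
user = match.group(1)      # first subpattern
hostname = match.group(2)  # second subpattern

print(email, user, hostname)  # user@example.com user example.com
```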
  20. Questions? Regular expressions 101

  21. Regular expressions 101 ADVANCED

    The juicy stuff you never knew about, until now
  22. Back references

    Problem: /href=['"](.*?)['"]/i
    Matches:
    ‣ href="xxx"
    ‣ href='xxx'
    But also:
    ‣ href="xxx'
    ‣ href='xxx"
  23. Back references

    Solution: /href=(['"])(.*?)\1/i
    \1 references the first subpattern!
    Don’t forget to also string-escape in PHP: preg_match('/href=([\'"])(.*?)\\1/i', ...);
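
The problem and solution from the two back-reference slides can be compared side by side. This is a sketch in Python's `re` (not PHP), where a raw string avoids the extra string-escaping the slide warns about; the variable names are made up.

```python
import re

# Problem pattern: any quote may open and any quote may close.
naive = re.compile(r"""href=['"](.*?)['"]""", re.IGNORECASE)

# Solution pattern: \1 must repeat whatever quote the first group captured.
strict = re.compile(r"""href=(['"])(.*?)\1""", re.IGNORECASE)

samples = ['href="xxx"', "href='xxx'", "href=\"xxx'", "href='xxx\""]

naive_hits = [s for s in samples if naive.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(len(naive_hits))  # 4
print(strict_hits)      # ['href="xxx"', "href='xxx'"]
```

Note that with the back reference in place the URL moves from group 1 to group 2, since the quote now occupies the first subpattern.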
  24. Named subpatterns

    Scenario: parsing large CSV
    1,a title,5.00,92,green
    2,another title,3.50,4,blue
    3,one more,33699.99,15,white
    ...
  25. Named subpatterns

    /([0-9]+),(.*?),([0-9]+\.[0-9]{2}),([0-9]+),([a-z]+)/i
    Result excerpt:
    [1] => string(1) "1"
    [2] => string(7) "a title"
    [3] => string(4) "5.00"
    [4] => string(2) "92"
    [5] => string(5) "green"
  26. Named subpatterns

    /(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i
    Result excerpt:
    ["id"] => string(1) "1"
    [1] => string(1) "1"
    ["title"] => string(7) "a title"
    [2] => string(7) "a title"
    ["price"] => string(4) "5.00"
    [3] => string(4) "5.00"
    ["stock"] => string(2) "92"
    [4] => string(2) "92"
    ["color"] => string(5) "green"
    [5] => string(5) "green"
  27. Named subpatterns

    ‣ (?P<name>pattern)
    ‣ (?<name>pattern) & (?'name'pattern) since PHP 5.2.2
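
The CSV scenario can be sketched as follows, using Python's `re` instead of PHP (an assumption for illustration). Python accepts only the `(?P<name>pattern)` spelling, which is why that variant is used; `row_pattern`, `csv_text` and `rows` are hypothetical names.

```python
import re

row_pattern = re.compile(
    r"(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),"
    r"(?P<stock>[0-9]+),(?P<color>[a-z]+)",
    re.IGNORECASE,
)

csv_text = (
    "1,a title,5.00,92,green\n"
    "2,another title,3.50,4,blue\n"
    "3,one more,33699.99,15,white"
)

# groupdict() returns only the named groups, keyed by name.
rows = [m.groupdict() for m in row_pattern.finditer(csv_text)]

print(rows[0]["title"])  # a title
print(rows[2]["price"])  # 33699.99
```

Accessing fields by name means the calling code keeps working if a column is later added to the pattern, which is harder to guarantee with numeric indexes.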
  28. Named subpatterns + back references

    /href=(?P<quotes>['"])(?P<href>.*?)(?P=quotes)/i
  29. Lookahead/-behind assertions

    “Take a peek, don’t eat it”
  30. Lookahead/-behind assertions

    Scenario: find all occurrences of “here”
    “Where can I find here, not there?”
  31. Lookahead/-behind assertions

    Deduction: find every “here” that is neither preceded nor followed by an alphabetic character.
    Solution: /(?<![a-z])here(?![a-z])/i
  32. Lookahead/-behind assertions

    ‣ Positive lookahead: (?=expression)
    ‣ Negative lookahead: (?!expression)
    ‣ Positive lookbehind: (?<=expression)
    ‣ Negative lookbehind: (?<!expression)
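
To make the “here” scenario concrete, this sketch records where each pattern matches in the example sentence. It uses Python's `re` rather than PHP (an assumption for illustration), and the variable names are made up.

```python
import re

sentence = "Where can I find here, not there?"

# Plain search: also finds "here" inside "Where" and "there".
naive = [m.start() for m in re.finditer(r"here", sentence, re.IGNORECASE)]

# Negative lookbehind and lookahead: no letter directly before or after.
strict = [
    m.start()
    for m in re.finditer(r"(?<![a-z])here(?![a-z])", sentence, re.IGNORECASE)
]

print(naive)   # [1, 17, 28]
print(strict)  # [17]
```

The assertions only peek at the neighbouring characters; the match itself is still just the four letters of "here".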
  33. Lookahead/-behind assertions

    “lookbehind assertion is not fixed length...”
    In PHP, a lookbehind cannot contain variable-length repetition, while a lookahead can.
    ‣ Lookahead: (?=.*) and (?=abc) are both valid
    ‣ Lookbehind: (?<=abc) is valid, (?<=.*) triggers the error above
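
Python's `re` module enforces a comparable fixed-width rule for lookbehind, so the four patterns from this slide can be checked there as well. This is only a sketch of the Python behaviour, not of PHP's; the error message differs, and `compiles` is a hypothetical helper.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

results = {p: compiles(p) for p in [r"(?=.*)", r"(?=abc)", r"(?<=abc)", r"(?<=.*)"]}

for pattern, ok in results.items():
    print(pattern, "ok" if ok else "rejected")
```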
  34. Conditional subpatterns

    if-then(-else) in regular expressions
    YES RLY!
  35. Conditional subpatterns

    Scenario: match all (x|ht)ml tags
    Caution!
    ‣ <element></element>
    ‣ <element />
  36. Conditional subpatterns

    Solution: /<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i
    ‣ Named patterns: tag and self
    ‣ if then else: if self-closing, then do nothing; else, find the matching end tag
  37. Conditional subpatterns

    ‣ With subpattern (named or by id), where name is the subpattern's name or number:
      ‣ (?(name)then)
      ‣ (?(name)then|else)
    ‣ With lookahead/-behind:
      ‣ (?(?=assertion)then)
      ‣ (?(?=assertion)then|else)
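
The tag-matching solution can be tried out with the sketch below. It uses Python's `re` rather than PHP (an assumption for illustration); Python supports the subpattern form of the conditional but not the lookahead/-behind form. The sample markup is made up.

```python
import re

# (?(self)|...) has an empty "then" branch: a self-closing tag needs nothing more.
tag_pattern = re.compile(
    r"<(?P<tag>[a-z]+).*?(?P<self>/)?>(?(self)|.*?</(?P=tag)>)",
    re.IGNORECASE,
)

html = '<p>text</p><br /><div class="x">y</div>'

elements = [m.group(0) for m in tag_pattern.finditer(html)]

print(elements)  # ['<p>text</p>', '<br />', '<div class="x">y</div>']
```

Like the email example earlier in the deck, this is a demonstration of the construct and not a general-purpose HTML parser; nested elements with the same tag name, for instance, are not handled.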
  38. Comments

    /
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
    # pattern modifiers:
    # u for UTF-8 interpretation (currency symbols),
    # x to ignore whitespace (for comments)
    /ux
  39. Comments

    ‣ # Perl-style
    ‣ /x modifier
    ‣ Ignores unescaped whitespace
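
The commented currency pattern translates fairly directly. In this Python sketch (not PHP; an assumption for illustration) `re.VERBOSE` plays the role of the `x` modifier, and no counterpart to `u` is needed because Python 3 string patterns are Unicode by default. The sample text is made up.

```python
import re

price_marker = re.compile(
    r"""
    [$€£¥]      # currency symbols for USD, EUR, GBP and YEN
    (?=[0-9])   # must be followed by a digit to indicate a price
    """,
    re.VERBOSE,
)

text = "Costs $5, €10 or £ 3 and ¥200"

symbols = price_marker.findall(text)

print(symbols)  # ['$', '€', '¥']
```

The pound sign is skipped because a space, not a digit, follows it; the lookahead checks the digit without including it in the match.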
  40. Presentation title

  41. Questions? Regular expressions 101

  42. Resources

    mullie.eu
    ‣ www.mullie.eu/regular-expressions-basics/
    ‣ www.mullie.eu/regular-expressions-advanced/