Regular expressions 101

Regular expressions are undervalued, and most developers know only the basics. A thorough understanding of how regular expressions work is incredibly helpful when you need to parse structured data.

This presentation will assume you already know what regular expressions are, but will sum up (with an example) some fancy things you probably didn’t know were possible with regular expressions.

If you're interested in a more detailed write-up, I suggest you check out http://www.mullie.eu/regular-expressions-basics/ & http://www.mullie.eu/regular-expressions-advanced/

This presentation is based on the PHP implementation of PCRE, but nearly all programming languages support the same functionality, albeit sometimes with their own twists.

Matthias Mullie

June 20, 2012

Transcript

  1. Regular expressions 101 » Introduction
    Google: "Regular expressions are special characters that match or capture portions of a field, as well as the rules that govern all characters."
  2. Introduction
    Wikipedia: "A regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters."
  3. Introduction
    Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...
    ‣ /[a-z]/i
    ‣ /[^\w]/i
    ‣ /ipsum/
    ‣ /(est|qui)/
  4. Delimiter: /Delimiter/
    ‣ Any [^a-zA-Z0-9\\\s] character (not alphanumeric, not a backslash, not whitespace)
    ‣ Opening char == terminating char
    ‣ Except for the bracket pairs [ ], ( ), { } and < >, which are terminated by their closing counterpart
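As a rough illustration of the delimiter rules, the sketch below runs the same pattern through preg_match() with three different delimiters. The sample URL and the variable names are placeholders chosen for this example only.

```php
<?php
// Hypothetical subject string, used only for this illustration.
$subject = 'http://www.mullie.eu/regular-expressions-basics/';

// Slash delimiter: every literal slash inside the pattern must be escaped.
$withSlash = preg_match('/^http:\/\/www\./', $subject);

// Hash delimiter: slashes no longer need escaping.
$withHash = preg_match('#^http://www\.#', $subject);

// Bracket-style delimiter: an opening "{" is terminated by "}".
$withBraces = preg_match('{^http://www\.}', $subject);

var_dump($withSlash, $withHash, $withBraces); // int(1) three times
```

Picking a delimiter that does not occur in the pattern itself tends to keep the expression easier to read.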
  5. Meta characters
    ‣ .
    ‣ [ ]
    ‣ ^ $
    ‣ |
    ‣ ( )
    ‣ \
    ‣ * ? +
    ‣ {n} {n,m}
  6. Pattern modifiers: //x
    ‣ i
    ‣ m
    ‣ s
    ‣ x
    ‣ e
    ‣ A
    ‣ D
    ‣ U
    ‣ J
    ‣ ...
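A minimal sketch of what the four most common modifiers (i, m, s and x) change, assuming a made-up two-line subject. Note that the e modifier from the list was deprecated in PHP 5.5 and removed in PHP 7.0, so it is left out here.

```php
<?php
// Hypothetical two-line subject, used only for this illustration.
$text = "First line\nsecond LINE";

// No modifier: lowercase "line" is not at the end of the subject, so this fails.
$plain = preg_match('/line$/', $text);                    // 0

// i: case-insensitive, so "LINE" at the very end now matches.
$caseless = preg_match('/line$/i', $text);                // 1

// m: ^ matches at the start of every line, not only at the start of the subject.
$lineStarts = preg_match_all('/^\w+/m', $text, $words);   // 2 ("First", "second")

// s: the dot also matches newlines, so ".*" can span both lines.
$withoutS = preg_match('/First.*LINE/', $text);           // 0
$withS = preg_match('/First.*LINE/s', $text);             // 1

// x: unescaped whitespace inside the pattern is ignored.
$extended = preg_match('/ First \s line /x', $text);      // 1
```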
  7. Character classes [ ]
    Ranges:
    ‣ [0-9]
    ‣ [a-zA-Z]
    ‣ [A-F0-9]
    Inverse ranges:
    ‣ [^0-9]
    ‣ [^a-zA-Z]
    ‣ [^A-F0-9]
  8. Character classes [ ]
    No sequence of characters!
    [lorem] matches:
    ‣ a single l, o, r, e or m
    ‣ not the sequence "lorem"
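To make the difference concrete, here is a small sketch comparing the class [lorem] with the literal sequence lorem. The subject "model" is an arbitrary word picked because it contains most of those letters in a different order.

```php
<?php
// Arbitrary subject: contains m, o, e and l, but not the sequence "lorem".
$subject = 'model';

// A character class matches ONE character out of the set, wherever it occurs.
$singleChars = preg_match_all('/[lorem]/', $subject, $chars);

// Without brackets, the letters must appear in exactly this order.
$sequence = preg_match('/lorem/', $subject);

print_r($chars[0]);   // m, o, e, l ("d" is not in the class)
var_dump($sequence);  // int(0)
```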
  9. Character classes [ ]
    POSIX:
    ‣ [:alnum:]
    ‣ [:blank:]
    ‣ [:lower:]
    ‣ ...
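A short sketch of the POSIX classes in use. One detail worth showing: in PCRE a POSIX class name only works inside a bracket expression, so it is written as [[:alnum:]] rather than [:alnum:] on its own. The subject string is made up for the example.

```php
<?php
// Made-up subject for illustration.
$subject = 'Foo-bar 42';

// Runs of letters and digits.
preg_match_all('/[[:alnum:]]+/', $subject, $alnum);   // Foo, bar, 42

// Runs of lowercase letters only.
preg_match_all('/[[:lower:]]+/', $subject, $lower);   // oo, bar

// Spaces and tabs.
preg_match_all('/[[:blank:]]/', $subject, $blank);    // one space
```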
  10. Subpatterns
    /([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i
    ‣ Whole match: email
    ‣ First subpattern: user
    ‣ Second subpattern: hostname
    Note: this regex only barely satisfies my needs for this particular example. Do not use it to find real occurrences of email addresses; it does not fully satisfy RFC 5321 & RFC 5322.
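The sketch below shows how the subpatterns end up in the $matches array of preg_match(). The address is a placeholder, and the same caveat applies: this pattern is for demonstration, not for validating email addresses.

```php
<?php
$pattern = '/([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i';

// Placeholder input for this example.
preg_match($pattern, 'Contact: john@example.com', $matches);

// Index 0 holds the whole match, 1 and 2 the subpatterns in order.
echo $matches[0], "\n"; // john@example.com
echo $matches[1], "\n"; // john
echo $matches[2], "\n"; // example.com
```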
  11. Back references
    Problem: /href=['"](.*?)['"]/i
    Matches:
    ‣ href="xxx"
    ‣ href='xxx'
    But also matches mismatched quotes:
    ‣ href="xxx'
    ‣ href='xxx"
  12. Back references
    Solution: /href=(['"])(.*?)\1/i
    \1 references the first subpattern!
    Don’t forget to also string-escape in PHP: preg_match('/href=([\'"])(.*?)\\1/i', ...);
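A minimal sketch comparing the two patterns on a correctly quoted and a mismatched attribute. The HTML snippets and variable names are invented for the example.

```php
<?php
$naive = '/href=[\'"](.*?)[\'"]/i';
$strict = '/href=([\'"])(.*?)\\1/i'; // "\\1" in PHP source is \1 in the regex

$good = '<a href="page.html">';
$mismatched = "<a href=\"page.html'>";

// The naive pattern accepts any quote on either side.
var_dump(preg_match($naive, $mismatched));    // int(1)

// The back reference demands the same quote that opened the value.
var_dump(preg_match($strict, $mismatched));   // int(0)
var_dump(preg_match($strict, $good, $m));     // int(1)
echo $m[2], "\n";                             // page.html (group 1 is the quote)
```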
  13. Named subpatterns
    Scenario: parsing large CSV
    1,a title,5.00,92,green
    2,another title,3.50,4,blue
    3,one more,33699.99,15,white
    ...
  14. Named subpatterns
    /([0-9]+),(.*?),([0-9]+\.[0-9]{2}),([0-9]+),([a-z]+)/i
    Result excerpt:
    [1] => string(1) "1"
    [2] => string(7) "a title"
    [3] => string(4) "5.00"
    [4] => string(2) "92"
    [5] => string(5) "green"
  15. Named subpatterns
    /(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i
    Result excerpt:
    ["id"] => string(1) "1"
    [1] => string(1) "1"
    ["title"] => string(7) "a title"
    [2] => string(7) "a title"
    ["price"] => string(4) "5.00"
    [3] => string(4) "5.00"
    ["stock"] => string(2) "92"
    [4] => string(2) "92"
    ["color"] => string(5) "green"
    [5] => string(5) "green"
  16. Named subpatterns
    ‣ (?P<name>pattern)
    ‣ (?<name>pattern) & (?'name'pattern) since PHP 5.2.2
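One possible way to apply the named pattern to the CSV scenario is sketched below, using preg_match_all() with PREG_SET_ORDER so that each row becomes its own array. The three rows are the ones from the scenario slide; the variable names are illustrative.

```php
<?php
$pattern = '/(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),'
         . '(?P<stock>[0-9]+),(?P<color>[a-z]+)/i';

$csv = "1,a title,5.00,92,green\n"
     . "2,another title,3.50,4,blue\n"
     . "3,one more,33699.99,15,white";

// PREG_SET_ORDER groups the result per match (one entry per CSV row).
$count = preg_match_all($pattern, $csv, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    // Named keys read better than $row[2] and $row[3].
    printf("#%s %s costs %s\n", $row['id'], $row['title'], $row['price']);
}
```

The numeric keys remain available next to the named ones, so existing code that relies on positions keeps working.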
  17. Named subpatterns + back references
    /href=(?P<quotes>['"])(?P<href>.*?)(?P=quotes)/i
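A brief sketch of the combined pattern in PHP, assuming an invented anchor tag as input. The named back reference (?P=quotes) plays the role that \1 played earlier.

```php
<?php
$pattern = '/href=(?P<quotes>[\'"])(?P<href>.*?)(?P=quotes)/i';

// Invented input for illustration.
$found = preg_match($pattern, "<a href='/about'>About</a>", $m);

echo $m['href'], "\n";   // /about
echo $m['quotes'], "\n"; // '
```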
  18. Lookahead/-behind assertions
    Scenario: find all occurrences of “here”
    “Where can I find here, not there?”
  19. Lookahead/-behind assertions
    Deduction: find every “here” that is not preceded or followed by an alphabetic character.
    Solution: /(?<![a-z])here(?![a-z])/i
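The scenario can be checked with a few lines of PHP. This sketch counts the matches with and without the assertions on the sentence from the scenario; the variable names are illustrative.

```php
<?php
$sentence = 'Where can I find here, not there?';

// Without assertions, "here" is also found inside "Where" and "there".
$naive = preg_match_all('/here/i', $sentence);

// With assertions, only the standalone word remains.
$strict = preg_match_all('/(?<![a-z])here(?![a-z])/i', $sentence, $m, PREG_OFFSET_CAPTURE);

var_dump($naive);   // int(3)
var_dump($strict);  // int(1)
```

Because assertions are zero-width, the match itself is still just "here"; the neighbouring characters are inspected but not consumed.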
  20. Lookahead/-behind assertions
    ‣ Positive lookahead: (?=expression)
    ‣ Negative lookahead: (?!expression)
    ‣ Positive lookbehind: (?<=expression)
    ‣ Negative lookbehind: (?<!expression)
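To show the four variants side by side, here is a small sketch that picks numbers out of a made-up string depending on whether a currency code stands before or after them. The \b word boundaries are an addition for this example; they stop the engine from backtracking into part of a number.

```php
<?php
// Hypothetical subject mixing amounts and currency codes.
$subject = '10 EUR, 20 USD, EUR 30';

// Positive lookahead: numbers directly followed by " EUR".
preg_match_all('/[0-9]+(?= EUR)/', $subject, $ahead);          // 10

// Negative lookahead: whole numbers NOT followed by " EUR".
preg_match_all('/\b[0-9]+\b(?! EUR)/', $subject, $notAhead);   // 20, 30

// Positive lookbehind: numbers directly preceded by "EUR ".
preg_match_all('/(?<=EUR )[0-9]+/', $subject, $behind);        // 30

// Negative lookbehind: whole numbers NOT preceded by "EUR ".
preg_match_all('/(?<!EUR )\b[0-9]+/', $subject, $notBehind);   // 10, 20
```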
  21. Lookahead/-behind assertions
    “lookbehind assertion is not fixed length...”
    In PHP, a lookbehind cannot contain open-ended repetition such as * or +, while a lookahead can.
    ‣ (?=.*) is valid
    ‣ (?=abc) is valid
    ‣ (?<=.*) is invalid
    ‣ (?<=abc) is valid
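A minimal sketch of how this limitation surfaces in PHP: the pattern fails to compile, preg_match() emits a warning and returns false instead of 0 or 1. The warning is suppressed with @ here only to keep the output clean, and its exact wording may differ between PCRE versions.

```php
<?php
$subject = 'abcdef';

// Open-ended repetition inside a lookahead is fine.
$ahead = preg_match('/abc(?=.*)/', $subject);           // 1

// A fixed-length lookbehind is fine as well.
$fixedBehind = preg_match('/(?<=abc)def/', $subject);   // 1

// Open-ended repetition inside a lookbehind does not compile.
$openBehind = @preg_match('/(?<=.*)def/', $subject);    // false

var_dump($openBehind === false); // bool(true)
```

Checking for === false matters here, since a plain truthiness test cannot tell "no match" (0) from "invalid pattern" (false).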
  22. Conditional subpatterns
    Scenario: match all (x|ht)ml tags
    Caution! Both forms occur:
    ‣ <element></element>
    ‣ <element />
  23. Conditional subpatterns
    Solution: if then else
    /<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i
    ‣ Named patterns: (?P<tag>...) and (?P<self>...)
    ‣ If self-closing, then do nothing; else, find the matching end tag
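The pattern can be tried out as follows. This is only a sketch on an invented fragment containing one element with an end tag and one self-closing element; like the email example, it is not meant as a general-purpose HTML parser.

```php
<?php
$pattern = '/<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i';

// Invented fragment with both kinds of element.
$html = '<p>Hello</p> and <br />';

$count = preg_match_all($pattern, $html, $m);

print_r($m[0]);     // "<p>Hello</p>" and "<br />"
print_r($m['tag']); // "p" and "br"
```

For "<p>" the self group stays unset, so the else branch goes looking for "</p>"; for "<br />" the group captured the slash, so the empty then branch ends the match right after ">".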
  24. Conditional subpatterns
    ‣ With subpattern (named or by id):
      ‣ (?(pattern)then)
      ‣ (?(pattern)then|else)
    ‣ With lookahead/-behind:
      ‣ (?(?=assertion)then)
      ‣ (?(?=assertion)then|else)
  25. Comments
    /
        # match currency symbols for USD, EUR, GBP & YEN
        [$€£¥]
        # must be followed by a number to indicate a price
        (?=[0-9])
        # pattern modifiers:
        # u for UTF-8 interpretation (currency symbols),
        # x to ignore whitespace (for comments)
    /ux
  26. Comments
    ‣ # Perl-style
    ‣ /x modifier
    ‣ Ignores unescaped whitespace
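A hedged sketch of the commented currency pattern in a PHP script, assuming the source file is saved as UTF-8 and using an invented sentence as input. One thing to watch: a comment inside the pattern must not contain the delimiter character, or the pattern ends there.

```php
<?php
$pattern = '/
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
/ux';

// Invented input: two prices and one currency symbol without a number.
$count = preg_match_all($pattern, 'Costs €5 or $6, not € alone', $symbols);

print_r($symbols[0]); // "€" and "$"
```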