Regular expressions 101

Regular expressions are undervalued, and most developers know only the basics. A thorough understanding of how regular expressions work is incredibly helpful when you need to parse structured data.

This presentation will assume you already know what regular expressions are, but will sum up (with an example) some fancy things you probably didn’t know were possible with regular expressions.

If you're interested in a more detailed write-up, I suggest you check out http://www.mullie.eu/regular-expressions-basics/ & http://www.mullie.eu/regular-expressions-advanced/

This presentation is based on the PHP implementation of PCRE, but nearly all programming languages support the same functionality, albeit sometimes with their own twists.

Matthias Mullie

June 20, 2012

Transcript

  1. The Swiss Army knife of string manipulation
    REGEX 101

  2. @matthiasmullie
    Regular expressions 101

  3. Regular expressions 101
    INTRODUCTION
    What are regular expressions?

  4. Regular expressions 101 » Introduction
    Google
    Regular expressions are special
    characters that match or capture
    portions of a field, as well as the rules
    that govern all characters.

  5. Regular expressions 101 » Introduction
    Wikipedia
    A regular expression provides a
    concise and flexible means for
    "matching" strings of text, such as
    particular characters, words, or
    patterns of characters.

  6. Regular expressions 101 » Introduction
    /\{\$([a-z0-9_]*)((\.[a-z0-9_]*)*)
    (-\>[a-z0-9_]*((\.[a-z0-9_]*)*))?
    ((\|[a-z_][a-z0-9_]*(:.*?)*)*)\}/i

  7. Regular expressions 101 » Introduction
    Me
    Regular expressions find patterns in
    strings.

  8. Regular expressions 101 » Introduction
    Neque porro quisquam est qui
    dolorem ipsum quia dolor sit amet,
    consectetur, adipisci velit...
    ‣ /[a-z]/i
    ‣ /[^\w]/i
    ‣ /ipsum/
    ‣ /(est|qui)/
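
To make the four patterns on this slide concrete, here is a minimal sketch that runs them against the sample sentence with PHP's preg_match() and preg_match_all(). The variable names are illustrative assumptions, not part of the original deck.

```php
<?php
$text = 'Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...';

// /[a-z]/i matches every single letter, regardless of case.
$letters = preg_match_all('/[a-z]/i', $text);

// /[^\w]/i matches every character that is NOT a word character
// (letter, digit or underscore): here spaces, commas and dots.
$nonWord = preg_match_all('/[^\w]/i', $text);

// /ipsum/ matches the literal sequence "ipsum".
$hasIpsum = preg_match('/ipsum/', $text);

// /(est|qui)/ matches either alternative. Note that "qui" is also found
// inside "quisquam" and "quia", because nothing anchors it to word edges.
preg_match_all('/(est|qui)/', $text, $matches);
```

Here `$matches[0]` ends up holding four entries (qui, est, qui, qui), which shows that a pattern finds substrings, not words, unless you tell it otherwise.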

  9. Regular expressions 101
    BASICS
    The syntax everyone should know already

  10. Regular expressions 101 » Delimiter
    /Delimiter/
    ‣ Any [^a-zA-Z0-9\\\s] character
    ‣ Opening char == terminating char
    ‣ Except for [ ], ( ), { } and < >
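
The delimiter rules can be sketched as follows. The URL and variable names are only sample data for this illustration; the point is that the same pattern can be wrapped in different delimiters, and that the choice affects what needs escaping.

```php
<?php
$subject = 'http://www.mullie.eu/regular-expressions-basics/';

// With "/" as delimiter, every literal slash inside the pattern must be escaped.
$withSlash = preg_match('/^http:\/\/www\.mullie\.eu\//', $subject);

// Any other non-alphanumeric, non-backslash, non-whitespace character works too,
// and here it saves the escaping.
$withHash = preg_match('#^http://www\.mullie\.eu/#', $subject);

// Bracket-style delimiters are the exception to "opening == terminating":
// the pattern is closed by the matching bracket.
$withBraces = preg_match('{^http://www\.mullie\.eu/}', $subject);
```

All three calls return 1 for this subject.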

  11. Regular expressions 101 » Delimiter
    Use /
    (uniformity, you know)

  12. Regular expressions 101 » Meta characters
    Meta characters
    ‣ .
    ‣ [ ]
    ‣ ^ $
    ‣ |
    ‣ ( )
    ‣ \
    ‣ * ? +
    ‣ {n} {n,m}

  13. Regular expressions 101 » Pattern modifiers
    Pattern modifiers //x
    ‣ i
    ‣ m
    ‣ s
    ‣ x
    ‣ e
    ‣ A
    ‣ D
    ‣ U
    ‣ J
    ‣ ...
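
As a rough illustration of what the first few modifiers change, the sketch below applies i, m and s to a two-line sample string. The string and variable names are assumptions made for the example.

```php
<?php
$text = "First line\nsecond LINE";

// i: case-insensitive matching, so both "line" and "LINE" are found.
$i = preg_match_all('/line/i', $text);

// m: ^ and $ also match at the start and end of every line,
// so the first word of each line is found.
$m = preg_match_all('/^\w+/m', $text);

// s: the dot also matches newlines. Without it, .* cannot cross the "\n".
$withoutS = preg_match('/First.*LINE/', $text);
$withS = preg_match('/First.*LINE/s', $text);
```

With this input, `$i` and `$m` are both 2, `$withoutS` is 0 and `$withS` is 1.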

  14. Regular expressions 101 » Character classes
    Character classes [ ]
    Ranges
    ‣ [0-9]
    ‣ [a-zA-Z]
    ‣ [A-F0-9]
    Inverse ranges
    ‣ [^0-9]
    ‣ [^a-zA-Z]
    ‣ [^A-F0-9]

  15. Regular expressions 101 » Character classes
    Character classes [ ]
    No sequence of characters!
    [lorem]
    ‣ Matches a single character: l, o, r, e or m
    ‣ Does not match the sequence lorem
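
A short sketch of the difference between a character class and a literal sequence. The input strings and variable names are chosen purely for illustration.

```php
<?php
// [lorem] is a set of characters, so every match is exactly one character.
preg_match_all('/[lorem]/', 'lorem', $chars);

// The set has no notion of order: "role" is matched one character at a time too.
$count = preg_match_all('/[lorem]/', 'role');

// To match the sequence itself, write it without the brackets.
preg_match('/lorem/', 'lorem ipsum', $word);
```

`$chars[0]` holds five one-character matches, `$count` is 4, and only the last pattern yields the whole word in `$word[0]`.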

  16. Regular expressions 101 » Character classes
    Character classes [ ]
    POSIX
    ‣ [:alnum:]
    ‣ [:blank:]
    ‣ [:lower:]
    ‣ ...
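
In PCRE, these POSIX names are only recognized inside a bracket expression, which is why they show up with double brackets in practice. A minimal sketch, with sample strings that are assumptions for the example:

```php
<?php
// [[:alnum:]] = letters and digits; the outer brackets are the character class,
// the inner [:alnum:] is the POSIX name.
preg_match_all('/[[:alnum:]]+/', 'Regex 101, basics!', $alnum);

// [[:lower:]] = lowercase letters only.
preg_match_all('/[[:lower:]]+/', 'Regex 101', $lower);

// [[:blank:]] = space and tab, but not newline.
$blank = preg_match_all('/[[:blank:]]/', "a b\tc\nd");
```

This gives Regex, 101 and basics in `$alnum[0]`, egex in `$lower[0]`, and a count of 2 for `$blank`.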

  17. Regular expressions 101 » Greediness
    Greediness: greedy
    <li>list-item1</li><li>list-item2</li>
    /<li>.*<\/li>/
    ‣ <li>list-item1</li><li>list-item2</li>

  18. Regular expressions 101 » Greediness
    Greediness: lazy
    <li>list-item1</li><li>list-item2</li>
    /<li>.*?<\/li>/ or /<li>.*<\/li>/U
    ‣ <li>list-item1</li>
    ‣ <li>list-item2</li>
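
Greedy versus lazy matching can be compared side by side with a sketch like the one below. It assumes the sample input is a string of two `<li>` elements and uses variable names of my own choosing.

```php
<?php
$html = '<li>list-item1</li><li>list-item2</li>';

// Greedy: .* grabs as much as possible, so a single match spans both items.
preg_match_all('/<li>.*<\/li>/', $html, $greedy);

// Lazy: .*? stops at the first possible </li>, giving one match per item.
preg_match_all('/<li>.*?<\/li>/', $html, $lazy);

// The U modifier flips the default greediness for the whole pattern,
// so a plain .* now behaves lazily.
preg_match_all('/<li>.*<\/li>/U', $html, $ungreedy);
```

`$greedy[0]` contains one match (the entire string), while `$lazy[0]` and `$ungreedy[0]` each contain the two separate list items.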

  19. Regular expressions 101 » Subpatterns
    Subpatterns
    /([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i
    ‣ Entire match: email
    ‣ First subpattern: user
    ‣ Second subpattern: hostname
    Note: this regex only barely satisfies my needs for this particular example; do not use it to actually find email addresses, as it does not fully satisfy RFC 5321 & RFC 5322.
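
A minimal sketch of how the subpatterns surface in PHP: preg_match() fills index 0 with the entire match and the following indexes with each parenthesized group. The address used here is a made-up example.

```php
<?php
$pattern = '/([a-z0-9]*)@([a-z0-9\.]*\.[a-z0-9]{2,3})/i';

preg_match($pattern, 'Contact: someone@example.com', $m);

$email = $m[0];    // entire match
$user = $m[1];     // first subpattern, everything before the @
$hostname = $m[2]; // second subpattern, everything after the @
```

For this input, `$email` is someone@example.com, `$user` is someone and `$hostname` is example.com.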

  20. Questions?
    Regular expressions 101

  21. Regular expressions 101
    ADVANCED
    The juicy stuff you never knew about, until now

  22. Regular expressions 101 » Back references
    Back references
    Problem: /href=['"](.*?)['"]/i
    Matches:
    ‣ href="xxx"
    ‣ href='xxx'
    But also matches:
    ‣ href="xxx'
    ‣ href='xxx"

  23. Regular expressions 101 » Back references
    Back references
    Solution: /href=(['"])(.*?)\1/i
    \1 references first subpattern!
    Don’t forget to also string-escape in PHP:
    preg_match('/href=([\'"])(.*?)\\1/i', ...);
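
The problem and the back-reference solution can be compared in a small sketch. The sample HTML fragments and variable names are assumptions for the illustration.

```php
<?php
$loose = '/href=[\'"](.*?)[\'"]/i';
$strict = '/href=([\'"])(.*?)\\1/i';

// Mismatched quotes: opens with a double quote, closes with a single quote.
$broken = "<a href=\"xxx'>";

// The loose pattern accepts any quote on either side.
$looseResult = preg_match($loose, $broken);

// With \1, the closing quote must be the same character that group 1 captured.
$strictResult = preg_match($strict, $broken);

// A well-formed attribute still matches; group 2 now holds the value,
// because group 1 is taken by the quote.
preg_match($strict, '<a href="http://www.mullie.eu">', $m);
```

`$looseResult` is 1 and `$strictResult` is 0, while `$m[1]` holds the double quote and `$m[2]` the URL.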

  24. Regular expressions 101 » Named subpatterns
    Named subpatterns
    Scenario: parsing large CSV
    1,a title,5.00,92,green
    2,another title,3.50,4,blue
    3,one more,33699.99,15,white
    ...

  25. Regular expressions 101 » Named subpatterns
    Named subpatterns
    /([0-9]+),(.*?),([0-9]+\.[0-9]{2}),([0-9]+),([a-z]+)/i
    Result excerpt:
    [1] => string(1) "1"
    [2] => string(7) "a title"
    [3] => string(4) "5.00"
    [4] => string(2) "92"
    [5] => string(5) "green"

  26. Regular expressions 101 » Named subpatterns
    Named subpatterns
    /(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i
    Result excerpt:
    ["id"] => string(1) "1"
    [1] => string(1) "1"
    ["title"] => string(7) "a title"
    [2] => string(7) "a title"
    ["price"] => string(4) "5.00"
    [3] => string(4) "5.00"
    ["stock"] => string(2) "92"
    [4] => string(2) "92"
    ["color"] => string(5) "green"
    [5] => string(5) "green"
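
A sketch of how the named pattern might be applied to the CSV rows from the scenario. It assumes the rows are available as a newline-separated string and uses PREG_SET_ORDER so that each matched row comes back as its own array; the variable names are illustrative.

```php
<?php
$pattern = '/(?P<id>[0-9]+),(?P<title>.*?),(?P<price>[0-9]+\.[0-9]{2}),(?P<stock>[0-9]+),(?P<color>[a-z]+)/i';

$csv = "1,a title,5.00,92,green\n"
     . "2,another title,3.50,4,blue\n"
     . "3,one more,33699.99,15,white";

// PREG_SET_ORDER: $rows[0] is the first matched row, $rows[1] the second, ...
preg_match_all($pattern, $csv, $rows, PREG_SET_ORDER);

// Named keys make the code independent of the column positions in the pattern.
$titles = array_column($rows, 'title');
$firstPrice = $rows[0]['price'];
```

The numeric indexes are still present next to the named ones, so existing code that reads `$rows[0][3]` keeps working.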

  27. Regular expressions 101 » Named subpatterns
    Named subpatterns
    ‣ (?P<name>pattern)
    ‣ (?<name>pattern) & (?'name'pattern)
    since PHP 5.2.2

  28. Regular expressions 101 » Named subpatterns
    Named subpatterns + back references
    /href=(?P<quotes>['"])(?P<…>.*?)(?P=quotes)/i
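
A minimal sketch of a named back reference in PHP. The group name `quotes` comes from the `(?P=quotes)` reference on the slide; the name `url` for the second group is a hypothetical choice made for this example, as is the sample HTML.

```php
<?php
// "url" is a hypothetical group name, used here only for illustration.
$pattern = '/href=(?P<quotes>[\'"])(?P<url>.*?)(?P=quotes)/i';

// (?P=quotes) must match whatever the "quotes" group captured.
$matched = preg_match($pattern, "<a href='http://www.mullie.eu'>", $m);

// Mismatched quotes no longer match.
$mismatched = preg_match($pattern, "<a href='http://www.mullie.eu\">");
```

`$m['quotes']` holds the single quote and `$m['url']` the address, while `$mismatched` is 0.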

  29. Regular expressions 101 » Assertions
    “Take a peek, don’t eat it”
    Lookahead/-behind assertions

  30. Regular expressions 101 » Assertions
    Lookahead/-behind assertions
    Scenario: find all occurrences of “here”
    “Where can I find here, not there?”

  31. Regular expressions 101 » Assertions
    Lookahead/-behind assertions
    Deduction:
    Find all here’s, not preceded or followed by
    an alphabetic character.
    Solution: /(?<![a-z])here(?![a-z])/i

  32. Regular expressions 101 » Assertions
    Lookahead/-behind assertions
    ‣ Positive lookahead: (?=expression)
    ‣ Negative lookahead: (?!expression)
    ‣ Positive lookbehind: (?<=expression)
    ‣ Negative lookbehind: (?<!expression)
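
The “here” scenario can be sketched with a negative lookbehind and a negative lookahead. The pattern below is one reading of the deduction (“not preceded or followed by an alphabetic character”) and should be taken as an assumption, not necessarily the exact pattern from the original slide.

```php
<?php
$text = 'Where can I find here, not there?';

// Naive pattern: also hits the "here" inside "Where" and "there".
$naive = preg_match_all('/here/i', $text);

// (?<![a-z]) : the position must NOT be preceded by a letter.
// (?![a-z])  : the position must NOT be followed by a letter.
// Both assertions only peek; they consume no characters, so the match
// itself is still just "here".
preg_match_all('/(?<![a-z])here(?![a-z])/i', $text, $m, PREG_OFFSET_CAPTURE);
```

The naive pattern finds three occurrences; the version with assertions finds only the standalone word, at byte offset 17.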

  33. Regular expressions 101 » Assertions
    Lookahead/-behind assertions
    “lookbehind assertion is not fixed length...”
    In PHP, lookbehind cannot contain unbounded repetition such as * or +, while lookahead can.
    Lookahead:
    ‣ (?=.*) is allowed
    ‣ (?=abc) is allowed
    Lookbehind:
    ‣ (?<=.*) fails to compile
    ‣ (?<=abc) is allowed
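
The restriction can be observed directly. In this sketch the failing pattern is wrapped in `@` to silence the compilation warning, since preg_match() signals the problem by returning false; the exact wording of the warning differs between PCRE versions, so it is not checked here. Sample strings and variable names are assumptions.

```php
<?php
// Fixed-length lookbehind: compiles and matches.
$fixed = preg_match('/(?<=abc)def/', 'abcdef');

// Unbounded repetition inside a lookbehind: the pattern fails to compile,
// preg_match() emits a warning (suppressed here) and returns false.
$unbounded = @preg_match('/(?<=.*)def/', 'abcdef');

// The same repetition inside a lookahead is fine.
$ahead = preg_match('/abc(?=.*)/', 'abcdef');
```

Note the difference between 0 (pattern is valid, no match) and false (pattern is invalid), which is why a strict `=== false` comparison is the safe check.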

  34. Regular expressions 101 » Conditional subpatterns
    Conditional subpatterns
    if-then(-else) in regular expressions
    YES RLY!

  35. Regular expressions 101 » Conditional subpatterns
    Conditional subpatterns
    Scenario: match all (x|ht)ml tags
    Caution!

  36. Regular expressions 101 » Conditional subpatterns
    Conditional subpatterns
    Solution: if then else
    /<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i
    ‣ Named patterns
    ‣ If self-closing, then do nothing; else, find matching end tag
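
A sketch of the conditional pattern at work, assuming the named groups are `tag` and `self` as referenced by `(?P=tag)` and `(?(self)...)`. The HTML fragment is made up for the example.

```php
<?php
$pattern = '/<(?P<tag>[a-z]+).*?(?P<self>\/)?>(?(self)|.*?<\/(?P=tag)>)/i';

$html = 'Line one<br />and a <b>bold</b> word';

preg_match_all($pattern, $html, $m);

// <br /> : the "self" group captured the slash, so the empty "then" branch
//          applies and the match ends right after the tag.
// <b>    : "self" did not participate, so the "else" branch applies and the
//          pattern continues up to the matching </b>.
$tags = $m['tag'];
```

`$m[0]` contains `<br />` and `<b>bold</b>`. As a caveat in the spirit of the previous slide: nested tags of the same name are not handled, because the lazy `.*?` stops at the first closing tag it meets.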

  37. Regular expressions 101 » Conditional subpatterns
    Conditional subpatterns
    ‣ With subpattern (named or by id):
      ‣ (?(pattern)then)
      ‣ (?(pattern)then|else)
    ‣ With lookahead/-behind:
      ‣ (?(?=assertion)then)
      ‣ (?(?=assertion)then|else)

  38. Regular expressions 101 » Comments
    Comments
    /
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
    # pattern modifiers:
    # u for UTF-8 interpretation (currency symbols),
    # x to ignore whitespace (for comments)
    /ux

  39. Regular expressions 101 » Comments
    Comments
    ‣ # Perl-style
    ‣ /x modifier
    ‣ Ignores unescaped whitespace
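
The commented currency pattern can be used from PHP roughly like this. The sketch assumes the source file is saved as UTF-8 (required for the u modifier to interpret the currency symbols correctly); the sample sentence and variable names are illustrative.

```php
<?php
// With the x modifier, unescaped whitespace and everything from # to the end
// of the line are ignored, so the pattern can be spread out and documented.
$pattern = '/
    # match currency symbols for USD, EUR, GBP & YEN
    [$€£¥]
    # must be followed by a number to indicate a price
    (?=[0-9])
    /ux';

preg_match_all($pattern, 'Prices: $5, €10 and £ 3 or ¥100; a lone $ sign', $m);
```

Only the symbols directly followed by a digit are matched, so `$m[0]` holds `$`, `€` and `¥`. One thing to keep in mind: the delimiter character must not appear inside a comment, because PHP would treat it as the end of the pattern.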

  41. Questions?
    Regular expressions 101

  42. Regular expressions 101
    Resources
    mullie.eu
    ‣ www.mullie.eu/regular-expressions-basics/
    ‣ www.mullie.eu/regular-expressions-advanced/