Regular Expressions for Fun and Profit

Slides for a presentation given at OpenWest 2016 in Sandy, UT.

Spencer Christensen

July 18, 2016

Transcript

  1. © 2015 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
    Regular Expressions For Fun And Profit
    Spencer Christensen | Adobe Analytics SRE

  3. Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems.
    - Jamie Zawinski, circa 1997
    Some people, when confronted with a problem, think “I know, I’ll quote
    Jamie Zawinski.” Now they have two problems.
    - Martin Liebach, circa 2009

  4. Regular Expressions are...

  7. You have been invited to become Regex witches and wizards

  8. Patterns and Pattern matching

  12. Describing patterns
    Using white, cast on 61sts. Mark the centre stitch with a piece of
    coloured yarn.
    1st row: Knit to within 1 st of the centre (on the first row this will
    be 29 sts and every decrease row after that will be one stitch less),
    Sl2, K1, PSSO, K to end
    2nd row: Knit
    3rd row: Using red, knit to within 1 st of the centre, Sl2, K1, PSSO,
    K to end
    4th row: Purl
    Repeat these four rows, always working rows 1 and 2 in white,
    and rows 3 and 4 in rainbow stripes. When you have 5sts left
    work as follows:
    K1, Sl2, K1, PSSO, K1
    Next row: Knit
    Next row (don't change colours): Sl2, K1, PSSO
    Fasten off.

  13. Describing patterns
    Poetry and Rhyming patterns
    Here's an example of ABAB in action, as written by William Shakespeare:
    A O, if I say, you look upon this verse,
    B When I, perhaps, compounded am with clay,
    A Do not so much as my poor name rehearse,
    B But let your love even with my life decay…

  14. Describing patterns
    Rubik's Cube Notation

    A single letter by itself means to turn that face clockwise 90 degrees.
    A letter followed by an apostrophe means to turn that face counterclockwise 90
    degrees.
    A letter with the number 2 after it means to turn that face 180 degrees.
    e.g. R U R U R U2 R U
    ’ ’

  15. Languages and Symbols
    Using codes to represent ideas and expressions.
    if (def[d] && def[d].arg && param) {
      var rw = (d+":"+param).replace(/'|\\/g, '_');
      def.__exp = def.__exp || {};
      def.__exp[rw] = def[d].text.replace(new RegExp("(^|[^\\w$])" + def[d].arg +
        "([^\\w$])", "g"), "$1" + param + "$2");
      return s + "def.__exp['"+rw+"']";
    }

  17. Regex as a language
    Matching hello world:
    /hello world/
    Limitations of the hello world example:
    Case sensitive
    No explicit start or end of line
    Only matches a single space character

  18. Regex as a language
    Special Characters

    \ Quote the next metacharacter, or escape

    ^ Match the beginning of the line

    . Match any character (except newline)

    $ Match the end of the string (or before newline at the end of the string)

    | Alternation

    () Grouping

    [ ] Bracketed Character class

  19. Regex as a language
    Quantifiers

    * Match 0 or more times

    + Match 1 or more times

    ? Match 1 or 0 times

    {n} Match exactly n times

    {n,} Match at least n times

    {n,m} Match at least n but not more than m times

  20. Examples:
    /hello +world/
    /(h|H)ello +(w|W)orld/
    /^(h|H)ello +(w|W)orld$/
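
The three example patterns can be tried side by side. The following is a minimal Python sketch, assuming the `/.../` delimiters are dropped and using sample strings invented here to exercise each limitation listed earlier:

```python
import re

# The three patterns from the slide, without the /.../ delimiters.
patterns = [
    r"hello +world",
    r"(h|H)ello +(w|W)orld",
    r"^(h|H)ello +(w|W)orld$",
]

# Hypothetical sample inputs, not from the original deck.
samples = ["hello world", "hello    world", "Hello World", "say hello world!"]


def matching_samples(pattern):
    """Return the samples in which the pattern is found anywhere."""
    return [s for s in samples if re.search(pattern, s)]


for p in patterns:
    print(p, "->", matching_samples(p))
```

The first pattern tolerates extra spaces, the second also accepts capitalized words, and the anchored third rejects any text before or after the greeting.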

  22. Quiz time!
    Write a regex to match any white space at the beginning of a
    line: zero or more space or tab characters.
    /^( |\t)*/

  23. Character Classes
    [ ] Square brackets contain the set of possible characters for
    matching one character.

    [ABCDEF] matches only the specific literal characters

    [A-Z] matches all uppercase letters of the alphabet

    [A-Za-z] matches all upper and lower case letters

    [0-9] matches all digits

    [0-9A-Fa-f] matches hexadecimal digits, as in 9a31f

  24. Character Classes
    Order of contents within a character class doesn't matter as
    long as the matching is equivalent
    [abcd] == [dcba]
    However, ranges do matter: [a-Z] != [a-zA-Z]
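
A small Python sketch of both points, assuming Python's `re` module as the engine (other engines may report the bad range differently): member order is irrelevant, but a range must run from a lower code point to a higher one.

```python
import re

# Order inside a class does not matter: both classes accept the same characters.
same = all(
    bool(re.fullmatch("[abcd]", ch)) == bool(re.fullmatch("[dcba]", ch))
    for ch in "abcdxyz"
)

# 'a' is code point 97 and 'Z' is 90, so a-Z is a backwards range
# and Python refuses to compile it.
try:
    re.compile("[a-Z]")
    range_ok = True
except re.error:
    range_ok = False

print(same, range_ok)
```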

  25. Character Classes
    Special characters within character class

    Invert character class [^a-z], caret at beginning

    Dot, pipe, parens, braces, plus, question mark, star, caret,
    dollar are literals within a character class

    no need to escape, although escaping makes it clear
    [.|(){}+?*^$]
    [\.\|\(\)\{\}\+\?\*\^\$]

    To get a literal dash, have it at the beginning or escape it
    [-asdf] or [asdf\-]
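
These rules can be checked quickly in Python. This is only a sketch; the sample string `ab1c-D` is made up for illustration.

```python
import re

# Metacharacters from the slide are literals inside [ ], escaped or not.
unescaped = re.compile(r"[.|(){}+?*^$]")
escaped = re.compile(r"[\.\|\(\)\{\}\+\?\*\^\$]")

symbols = ".|(){}+?*^$"
both_match_all = all(
    unescaped.fullmatch(c) and escaped.fullmatch(c) for c in symbols
)

# A dash is literal when it comes first or is escaped.
dash_first = re.fullmatch(r"[-asdf]", "-") is not None
dash_escaped = re.fullmatch(r"[asdf\-]", "-") is not None

# A caret at the start inverts the class: everything except a-z.
inverted = re.findall(r"[^a-z]", "ab1c-D")

print(both_match_all, dash_first, dash_escaped, inverted)
```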

  26. Escaping characters
    When desiring a literal non-alphanumeric character
    and in doubt if you should escape it, then escape it.
    /USD$[0-9]+\.[0-9]{2}/
    /USD\$[0-9]+\.[0-9]{2}/
    Double backslash to get a literal backslash character /\\/
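
The difference between the two USD patterns shows up as soon as they are run. A short Python sketch follows, with a price string invented for the example; `re.escape` is the standard-library helper for escaping a literal when in doubt.

```python
import re

price = "Total: USD$19.99"  # hypothetical sample text

# Unescaped, $ is an end-of-string anchor, so digits can never follow it.
unescaped = re.search(r"USD$[0-9]+\.[0-9]{2}", price)

# Escaped, \$ is a literal dollar sign.
escaped = re.search(r"USD\$[0-9]+\.[0-9]{2}", price)

# re.escape() does the escaping of a literal prefix for you.
built = re.search(re.escape("USD$") + r"[0-9]+\.[0-9]{2}", price)

# A literal backslash needs a doubled backslash in the pattern.
backslash = re.search(r"\\", r"C:\temp")

print(unescaped, escaped.group(0), built.group(0), backslash.group(0))
```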

  28. Quiz time!
    Write a regex to match an IP address.
    e.g. 10.9.200.12
    Hint: use the { } quantifier
    /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/
    /([0-9]{1,3}\.?){4}/
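
Both answers are deliberately loose. The Python sketch below compares them and adds a stricter variant; the `octet` sub-pattern and the `is_ipv4` helper are assumptions introduced here, not part of the original slides.

```python
import re

dotted = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")
loose = re.compile(r"([0-9]{1,3}\.?){4}")

# Stricter sketch (an assumption, not from the slides): every octet is 0-255.
octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
strict = re.compile(rf"{octet}(?:\.{octet}){{3}}")


def is_ipv4(text):
    """True when the whole string is a dotted quad with octets in 0-255."""
    return strict.fullmatch(text) is not None


print(is_ipv4("10.9.200.12"), is_ipv4("999.999.999.999"))
```

Because the dot is optional in the shorter pattern, it also accepts a plain run of digits such as `1234`, and neither slide pattern limits an octet to 255.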

  29. PCRE character classes as metacharacters
    Metacharacters or escaped characters
    \w – word character == [a-zA-Z0-9_]
    \d – digit == [0-9]
    \s – white space == [ \t\r\n\f]
    \t – tab
    \n – newline
    \r – carriage return
    \b – word boundary
    \xhh – character with hex value hh, e.g. \x41 is A
    Inverses:
    \W == [^\w] == [^a-zA-Z0-9_]
    \D == [^\d] == [^0-9]
    \S == [^\s] == [^ \t\r\n\f]
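
A brief Python sketch of these shorthand classes. One caveat: Python's `\w`, `\d` and `\s` also match Unicode characters by default, so the `re.ASCII` flag is used here to get the ASCII-only behavior described above. The sample text is invented.

```python
import re

text = "user_42 logged in\tOK"  # hypothetical sample text

# \w covers letters, digits and underscore.
words = re.findall(r"\w+", text, re.ASCII)
same_words = re.findall(r"[a-zA-Z0-9_]+", text)

digits = re.findall(r"\d+", text, re.ASCII)
non_space = re.findall(r"\S+", text, re.ASCII)

# \b only matches at the edge of a word, so "in" inside "login" is skipped.
whole_word = re.search(r"\bin\b", text)
inside_word = re.search(r"\bin\b", "login")

# \x41 is the character with hex value 41, the letter A.
hex_a = re.fullmatch(r"\x41", "A")

print(words, digits, non_space)
```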

  30. POSIX character classes
    POSIX character classes are named classes in the form [[:class:]], e.g. [[:alpha:]]
    alpha Any alphabetical character ("[A-Za-z]").
    alnum Any alphanumeric character ("[A-Za-z0-9]").
    ascii Any character in the ASCII character set.
    blank A GNU extension, equal to a space or a horizontal tab ("\t").
    cntrl Any control character.
    digit Any decimal digit ("[0-9]"), equivalent to "\d".
    graph Any printable character, excluding a space.
    lower Any lowercase character ("[a-z]").
    print Any printable character, including a space.
    punct Any graphical character excluding "word" characters.
    space Any whitespace character. "\s" including the vertical tab ("\cK").
    upper Any uppercase character ("[A-Z]").
    word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
    xdigit Any hexadecimal digit ("[0-9a-fA-F]").

  31. Quiz time!
    Write a regex to match any white space at the beginning of a
    line: zero or more space or tab characters.
    /^( |\t)*/ => /^\s*/
    Note: \s also matches newlines and carriage returns; /^[ \t]*/ matches only spaces and tabs.

  32. Quiz time!
    Write a regex to match an IP address.
    e.g. 10.9.200.12
    Hint: use the { } quantifier
    /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ => /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
    /([0-9]{1,3}\.?){4}/ => /(\d{1,3}\.?){4}/

  33. Subexpressions, groups, and captures
    Parentheses enclose a subexpression, and the
    match of just that subexpression is saved in a
    buffer. These buffers, sometimes called groups or
    captures, can be referenced and used.
    Example: /"(GET|POST) (http[^"]+)"/

  34. Subexpressions, groups, and captures
    Depending on your programming language you can then use those groups
    and store them in variables and do something with them.
    Example in Python:
    import re
    # request_str holds the text to search, e.g. one access-log line
    matches = re.search(r'"(GET|POST) (http[^"]+)"', request_str)
    if matches:
        method = matches.group(1)
        url = matches.group(2)
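
For a version that runs end to end, here is a minimal sketch. The `parse_request` helper and the log line are hypothetical, invented to give the pattern something to match, and straight double quotes are assumed in both pattern and input.

```python
import re

REQUEST_RE = re.compile(r'"(GET|POST) (http[^"]+)"')


def parse_request(line):
    """Return (method, url) from a quoted request, or None if nothing matches."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical log line, invented for this sketch.
line = '203.0.113.7 - "GET http://example.com/index.html" 200'
print(parse_request(line))
```

Checking for `None` before calling `group()` matters, since `re.search` returns `None` when the pattern is not found.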

  36. Subexpressions, groups, and captures
    Groups can be nested, in which case group numbers are based on the order of the left parentheses
    Example: /(https?:\/\/([^\/]+)\/([^?]*)(\?.*)?)/
    How many groups are there?
    4
    Group 1 is the entire url
    Group 2 is the hostname
    Group 3 is the url path
    Group 4 is the query string if it exists, and is optional. You will need to check if it
    exists in your code
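
The group numbering can be confirmed in Python. In this sketch the path group is written as `([^?]*)`, an adjustment made here because a greedy `(.*)` would swallow the query string and leave group 4 empty. The URLs are made-up examples.

```python
import re

URL_RE = re.compile(r"(https?://([^/]+)/([^?]*)(\?.*)?)")

with_query = URL_RE.search("http://example.com/a/b.html?x=1&y=2")
without_query = URL_RE.search("https://example.com/index.html")

print(URL_RE.groups)            # number of capture groups in the pattern
print(with_query.groups())      # groups 1-4, numbered by opening parenthesis
print(without_query.group(4))   # None: the optional group did not take part

# With a greedy (.*) path, the query string ends up inside group 3.
greedy = re.search(
    r"(https?://([^/]+)/(.*)(\?.*)?)", "http://example.com/a/b.html?x=1&y=2"
)
print(greedy.group(3), greedy.group(4))
```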

  37. Subexpressions, groups, and captures
    Things to be aware of with groups/captures

    They have overhead copying text to the saved buffers. So if you don't really need the
    group you can improve performance slightly by using (?:...) notation. This tells
    the regex engine to not save the subexpression match in a buffer.
    Example: /(?:this|that|these)/
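
A quick way to see the difference is to compare how many groups each form records. This is a Python sketch with a sample sentence invented for illustration; it shows the bookkeeping only and makes no claim about the size of any speedup.

```python
import re

capturing = re.compile(r"(this|that|these)")
non_capturing = re.compile(r"(?:this|that|these)")

text = "take these apples"  # hypothetical sample text

# Both forms find the same overall match.
found_a = capturing.search(text).group(0)
found_b = non_capturing.search(text).group(0)

# Only the capturing form saves the subexpression as a numbered group.
print(capturing.groups, non_capturing.groups)
print(capturing.search(text).groups(), non_capturing.search(text).groups())
```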

  38. Subexpressions, groups, and captures
    You can reference a group within the same regex the groups are matching. To reference a
    group use \1, \2, \3, etc.
    Example: /(\w+) \1/
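
A common use of this backreference is spotting accidentally doubled words. The Python sketch below adds `\b` word boundaries around the slide's pattern, an assumption made here so that the `is is` hidden inside `this is` does not count; `doubled_words` is a hypothetical helper name.

```python
import re

# The slide's pattern, plus word boundaries (an addition for this sketch).
DOUBLED = re.compile(r"\b(\w+) \1\b")


def doubled_words(text):
    """Return each word that is immediately repeated in the text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]


print(doubled_words("it was the the best of of times"))

# Without the boundaries, the pattern also matches inside words.
unanchored = re.search(r"(\w+) \1", "this is fine")
print(unanchored.group(0))
```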

  39. Performance concerns

    If you are only matching a single literal string, it is faster to use a
    substring function

    Be careful using dynamic regexes inside loops. They are
    evaluated and compiled every time. In Perl, the /o modifier
    tells the engine to compile the pattern only once
    foreach my $animal (@zoo) {
        if ($animal =~ /(?:monkey|ape)/o) {
            $primate_count++;
        }
    }
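
The same two ideas carry over to Python, where the usual approach is to compile the pattern once outside the loop. This is a sketch with a made-up `zoo` list, not a benchmark.

```python
import re

zoo = ["monkey", "zebra", "ape", "grape", "spider monkey"]  # hypothetical data

# Compile once, outside the loop, and reuse the compiled object.
PRIMATE_RE = re.compile(r"(?:monkey|ape)")
primate_count = sum(1 for animal in zoo if PRIMATE_RE.search(animal))

# For a single literal string, a plain substring test needs no regex at all.
monkey_count = sum(1 for animal in zoo if "monkey" in animal)

print(primate_count, monkey_count)
```

Note that the unanchored pattern also counts `grape`, since it contains `ape`.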

  40. Greedy versus Non-greedy matching
    The quantifiers + and * are greedy by default.
    Example: //
    with the text:
    class=”button”>Home

  45. Greedy versus Non-greedy matching
    To make them non-greedy simply add ? to the end, like .+? or .*?. This
    tells the regex engine to match as few characters as possible, extending
    the match only as needed, which prevents it from going too far.
    Example: //
    with the text:
    class=”button”>Home
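
The slide's own pattern and markup did not survive extraction, so the following Python sketch uses a stand-in. The `html` string and the `<.+>` pattern are assumptions chosen to show the same effect.

```python
import re

# Hypothetical markup, not the original slide example.
html = '<a href="/home" class="button">Home</a>'

# Greedy: .+ runs to the last > it can reach.
greedy = re.search(r"<.+>", html).group(0)

# Non-greedy: .+? stops at the first > that lets the match succeed.
lazy = re.search(r"<.+?>", html).group(0)

print(greedy)
print(lazy)
```

The greedy form returns the whole string, while the non-greedy form returns only the opening tag.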

  46. Regular Expressions are Magic

  47. Congratulations!
    Regex Witches and Wizards!

  48. Thanks!
