Regular Expressions for Fun and Profit

Slides for a presentation given at OpenWest 2016 in Sandy, UT.

Spencer Christensen

July 18, 2016

Transcript

  1. © 2015 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

    Regular Expressions For Fun And Profit Spencer Christensen | Adobe Analytics SRE
  2.

    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski, circa 1997
  3.

    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski, circa 1997 Some people, when confronted with a problem, think “I know, I’ll quote Jamie Zawinski.” Now they have two problems. - Martin Liebach, circa 2009
  4.

    You have been invited to become Regex witches and wizards
  5.

    Describing patterns Using white, cast on 61sts. Mark the centre stitch with a piece of coloured yarn. 1st row: Knit to within 1 st of the centre (on the first row this will be 29 sts and every decrease row after that will be one stitch less), Sl2, K1, PSSO, K to end 2nd row: Knit 3rd row: Using red, knit to within 1 st of the centre, Sl2, K1, PSSO, K to end 4th row: Purl Repeat these four rows, always working rows 1 and 2 in white, and rows 3 and 4 in rainbow stripes. When you have 5sts left work as follows: K1, Sl2, K1, PSSO, K1 Next row: Knit Next row (don't change colours): Sl2, K1, PSSO Fasten off.
  6.

    Describing patterns
    Poetry and Rhyming patterns
    Here's an example of ABAB in action, as written by William Shakespeare:
    A  O, if I say, you look upon this verse,
    B  When I, perhaps, compounded am with clay,
    A  Do not so much as my poor name rehearse,
    B  But let your love even with my life decay…
  7.

    Describing patterns
    Rubik's Cube Notation
    A single letter by itself means to turn that face clockwise 90 degrees. A letter followed by an apostrophe means to turn that face counterclockwise 90 degrees. A letter with the number 2 after it means to turn that face 180 degrees.
    e.g. R U R U R U2 R U ’ ’
  8.

    Languages and Symbols
    Using codes to represent ideas and expressions.

        if (def[d] && def[d].arg && param) {
            var rw = (d+":"+param).replace(/'|\\/g, '_');
            def.__exp = def.__exp || {};
            def.__exp[rw] = def[d].text.replace(new RegExp("(^|[^\\w$])" + def[d].arg + "([^\\w$])", "g"), "$1" + param + "$2");
            return s + "def.__exp['"+rw+"']";
        }
  9.

    Regex as a language matching hello world: /hello world/
  10.

    Regex as a language
    matching hello world: /hello world/
    Limitations of hello world example:
    • Case sensitive
    • No explicit start or end of line
    • Only matches a single space character
  11.

    Regex as a language
    Special Characters
    • \ Quote the next metacharacter, or escape
    • ^ Match the beginning of the line
    • . Match any character (except newline)
    • $ Match the end of the string (or before newline at the end of the string)
    • | Alternation
    • () Grouping
    • [ ] Bracketed Character class
  12.

    Regex as a language
    Quantifiers
    • * Match 0 or more times
    • + Match 1 or more times
    • ? Match 1 or 0 times
    • {n} Match exactly n times
    • {n,} Match at least n times
    • {n,m} Match at least n but not more than m times
  13.

    Examples: /hello +world/ /(h|H)ello +(w|W)orld/ /^(h|H)ello +(w|W)orld$/
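
One way to see how the three example patterns differ is to run them against a few strings. The sketch below uses Python's re module; the sample strings and the helper name matching_patterns are made up for illustration and are not part of the talk.

```python
import re

# The three patterns from the slide, from loosest to strictest.
patterns = {
    "spaces": re.compile(r"hello +world"),
    "either_case_initials": re.compile(r"(h|H)ello +(w|W)orld"),
    "anchored": re.compile(r"^(h|H)ello +(w|W)orld$"),
}

# Hypothetical sample inputs.
samples = ["hello world", "Hello   World", "say hello world!"]


def matching_patterns(text):
    """Return the names of the patterns that match somewhere in text."""
    return [name for name, rx in patterns.items() if rx.search(text)]


for s in samples:
    print(repr(s), "->", matching_patterns(s))
```

The anchored pattern rejects the last sample because of the surrounding text, while the first pattern rejects the capitalized sample because it is case sensitive.
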
  14.

    Quiz time! Write a regex to match any white space at the beginning of a line: zero or more space or tab characters.
  15.

    Quiz time! Write a regex to match any white space at the beginning of a line: zero or more space or tab characters.
    /^( |\t)*/
  16.

    Character Classes [ ]
    Square brackets contain the set of possible characters for matching one character.
    • [ABCDEF] matches only the specific literal characters
    • [A-Z] matches any uppercase letter of the alphabet
    • [A-Za-z] matches any upper or lower case letter
    • [0-9] matches any digit
    • [0-9A-Fa-f] matches any hexadecimal digit, as found in a number like 9a31f
  17.

    Character Classes
    Order of contents within a character class doesn't matter as long as the matching is equivalent: [abcd] == [dcba]
    However, ranges do matter: [a-Z] != [a-zA-Z]. A range runs from a lower character code to a higher one, so [a-Z] is a reversed range that most engines reject as invalid.
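
The points about ordering and ranges can be checked directly. This is a small sketch using Python's re module; the helper name compiles is hypothetical, and other engines may word their errors differently.

```python
import re

# Order inside a class does not matter: both classes accept exactly a, b, c, d.
same = all(
    bool(re.fullmatch(r"[abcd]", ch)) == bool(re.fullmatch(r"[dcba]", ch))
    for ch in "abcdexyz"
)


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(same)
print(compiles(r"[a-zA-Z]"))  # two valid ranges
print(compiles(r"[a-Z]"))     # reversed range: 'a' sorts after 'Z'

# [A-z] does compile, but it also covers the punctuation that sits
# between 'Z' and 'a' in ASCII, such as the underscore.
print(bool(re.fullmatch(r"[A-z]", "_")))
```
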
  18.

    Character Classes
    Special characters within a character class
    • Invert a character class with a caret at the beginning: [^a-z]
    • Dot, pipe, parens, braces, plus, question mark, star, dollar, and a caret that is not first are literals within a character class
    • There is no need to escape them, although escaping makes it clear: [.|(){}+?*^$] or [\.\|\(\)\{\}\+\?\*\^\$]
    • To get a literal dash, put it at the beginning or escape it: [-asdf] or [asdf\-]
  19.

    Escaping characters
    When you want a literal non-alphanumeric character and are in doubt whether to escape it, escape it.
    Wrong (the unescaped $ is read as a metacharacter, not a literal dollar sign): /USD$[0-9]+\.[0-9]{2}/
    Right: /USD\$[0-9]+\.[0-9]{2}/
    Double the backslash to get a literal backslash character: /\\/
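
The difference between the two USD patterns shows up as soon as they are run. The sketch below uses Python's re module with a made-up price string; re.escape is a standard-library helper that escapes metacharacters for you when a literal comes from a variable.

```python
import re

price = "Total: USD$19.99"  # hypothetical sample text

# Unescaped: $ is an end-of-string anchor, so digits can never follow it.
unescaped = re.search(r"USD$[0-9]+\.[0-9]{2}", price)

# Escaped: \$ is a literal dollar sign.
escaped = re.search(r"USD\$[0-9]+\.[0-9]{2}", price)

print(unescaped)
print(escaped.group(0))

# re.escape backslash-escapes the metacharacters in a literal string.
literal = "USD$"
built = re.compile(re.escape(literal) + r"[0-9]+\.[0-9]{2}")
print(built.search(price).group(0))

# A literal backslash is written as a doubled backslash in the pattern.
has_backslash = re.search(r"\\", r"C:\temp") is not None
print(has_backslash)
```
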
  20.

    Quiz time! Write a regex to match an IP address, e.g. 10.9.200.12
    Hint: use the { } quantifier
  21.

    Quiz time! Write a regex to match an IP address, e.g. 10.9.200.12
    Hint: use the { } quantifier
    /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/
    /([0-9]{1,3}\.?){4}/
    The second form is shorter but looser: because the dot is optional, it also matches a plain run of digits such as 1234. Neither form checks that each number is in the range 0 to 255.
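
To compare the two answers, here is a small sketch in Python. The function name is_ipv4 and the idea of pairing the regex with a numeric range check are assumptions added for illustration; the talk itself only presents the two patterns.

```python
import re

strict_shape = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")
compact = re.compile(r"([0-9]{1,3}\.?){4}")


def is_ipv4(text):
    """Check the dotted shape with the regex, then each octet's numeric range."""
    if not strict_shape.fullmatch(text):
        return False
    return all(0 <= int(part) <= 255 for part in text.split("."))


print(bool(strict_shape.fullmatch("10.9.200.12")))
print(bool(compact.fullmatch("10.9.200.12")))
print(bool(compact.fullmatch("1234")))        # the optional dot lets this through
print(bool(strict_shape.fullmatch("1234")))
print(is_ipv4("999.1.1.1"))                   # right shape, octet out of range
```

Keeping the regex responsible only for the shape and doing the range check in code is often easier to read than encoding 0 to 255 in the pattern itself.
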
  22.

    PCRE character classes as metacharacters
    Metacharacters or escaped characters
    \w – word character == [a-zA-Z0-9_]
    \d – digit == [0-9]
    \s – white space == [ \t\r\n\f] (plus vertical tab in recent Perl and PCRE)
    \t – tab
    \n – newline
    \r – carriage return
    \b – word boundary
    \x{0234} – character by hex value (\xhh for a two-digit value)
    Inverses:
    \W == [^\w] == [^a-zA-Z0-9_]
    \D == [^\d] == [^0-9]
    \S == [^\s] == [^ \t\r\n\f]
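
The shorthand classes behave much the same in Python's re module, which makes them easy to try out. In this sketch the sample strings are invented, and the re.ASCII flag is used so that \w, \d and \s are limited to ASCII; without it Python also matches Unicode letters and digits.

```python
import re

text = "user_42 logged in\tat 10:15\n"  # hypothetical sample text

words = re.findall(r"\w+", text, flags=re.ASCII)   # runs of word characters
digits = re.findall(r"\d+", text, flags=re.ASCII)  # runs of digits
spaces = re.findall(r"\s", text, flags=re.ASCII)   # each whitespace character

# \b matches at a word boundary, so only the standalone "in" is found.
whole_word_in = re.findall(r"\bin\b", "login in inside")

# With re.ASCII, \w is the same as the explicit class [a-zA-Z0-9_].
same_as_class = words == re.findall(r"[a-zA-Z0-9_]+", text)

print(words)
print(digits)
print(spaces)
print(whole_word_in)
print(same_as_class)
```
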
  23.

    POSIX character classes
    POSIX character classes are named classes in the form [[:class:]], for example [[:alpha:]].
    alpha   Any alphabetical character ("[A-Za-z]").
    alnum   Any alphanumeric character ("[A-Za-z0-9]").
    ascii   Any character in the ASCII character set.
    blank   A GNU extension, equal to a space or a horizontal tab ("\t").
    cntrl   Any control character.
    digit   Any decimal digit ("[0-9]"), equivalent to "\d".
    graph   Any printable character, excluding a space.
    lower   Any lowercase character ("[a-z]").
    print   Any printable character, including a space.
    punct   Any graphical character excluding "word" characters.
    space   Any whitespace character. "\s" including the vertical tab ("\cK").
    upper   Any uppercase character ("[A-Z]").
    word    A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
    xdigit  Any hexadecimal digit ("[0-9a-fA-F]").
  24.

    Quiz time! Write a regex to match any white space at the beginning of a line: zero or more space or tab characters.
    /^( |\t)*/ => /^\s*/
    Note that \s also matches other whitespace such as newlines; /^[ \t]*/ is the exact character-class equivalent.
  25.

    Quiz time! Write a regex to match an IP address, e.g. 10.9.200.12
    Hint: use the { } quantifier
    /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ => /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
    /([0-9]{1,3}\.?){4}/ => /(\d{1,3}\.?){4}/
  26.

    Subexpressions, groups, and captures
    Parentheses enclose a subexpression, and the match of just that subexpression is saved in a buffer. These saved matches, usually called groups or captures, can be referenced and used later.
    Example: /"(GET|POST) (http[^"]+)"/
  27.

    Subexpressions, groups, and captures
    Depending on your programming language you can then store those groups in variables and do something with them.
    Example in Python:

        import re

        # request_str holds the text to search, e.g. one web server log line
        matches = re.search(r'"(GET|POST) (http[^"]+)"', request_str)
        if matches:
            method = matches.group(1)  # GET or POST
            url = matches.group(2)     # the URL inside the quotes
  28.

    Subexpressions, groups, and captures
    Groups can be nested, in which case group numbers are assigned in the order of the left (opening) parentheses.
    Example: /(https?:\/\/([^\/]+)\/([^?]*)(\?.*)?)/
    How many groups are there?
  29.

    Subexpressions, groups, and captures
    Groups can be nested, in which case group numbers are assigned in the order of the left (opening) parentheses.
    Example: /(https?:\/\/([^\/]+)\/([^?]*)(\?.*)?)/
    How many groups are there? 4
    Group 1 is the entire url
    Group 2 is the hostname
    Group 3 is the url path
    Group 4 is the query string if it exists, and is optional. You will need to check if it exists in your code
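
The numbering of nested groups can be seen by printing each one. This Python sketch uses a variant of the URL pattern in which the path group stops at the first question mark, so the query string can land in its own group; the function name split_url and the example URLs are hypothetical. Python patterns have no / delimiters, so the slashes need no escaping here.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# 1 = whole URL, 2 = host, 3 = path, 4 = optional query string.
url_re = re.compile(r"(https?://([^/]+)/([^?]*)(\?.*)?)")


def split_url(url):
    """Return the four groups as a dict, or None if the URL does not match."""
    m = url_re.search(url)
    if m is None:
        return None
    return {
        "url": m.group(1),
        "host": m.group(2),
        "path": m.group(3),
        "query": m.group(4),  # None when there is no query string
    }


print(url_re.groups)
print(split_url("http://example.com/docs/index.html?page=2"))
print(split_url("https://example.com/docs/"))
```

The second call shows why the optional group has to be checked in code: its value is None when the URL has no query string.
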
  30.

    Subexpressions, groups, and captures
    Things to be aware of with groups/captures
    • Capturing has overhead: the engine copies the matched text into the saved buffers. If you don't really need the captured text, you can improve performance slightly by using the (?:...) notation, which tells the regex engine to group without saving the subexpression match in a buffer.
    Example: /(?:this|that|these)/
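
The effect of (?:...) on group numbering is easy to observe. In this Python sketch the trailing (\w+) group and the sample sentence are additions for illustration; only the alternation comes from the slide.

```python
import re

capturing = re.compile(r"(this|that|these) (\w+)")
non_capturing = re.compile(r"(?:this|that|these) (\w+)")

text = "I want these apples"  # hypothetical sample text

# With a capturing alternation, the noun is group 2.
print(capturing.groups, capturing.search(text).groups())

# With (?:...), the alternation is not saved and the noun becomes group 1.
print(non_capturing.groups, non_capturing.search(text).groups())
```
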
  31.

    Subexpressions, groups, and captures
    You can reference a group from within the same regex that captures it. To reference a group use \1, \2, \3, etc.
    Example: /(\w+) \1/ matches a word followed by a space and the same word again.
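
A common use of a backreference is spotting accidentally doubled words. The sketch below is one way to do that in Python; the \b anchors are an addition to the slide's pattern so that the end of one word followed by a short word (as in "this is") is not reported, and the function name find_doubled_words is hypothetical.

```python
import re

# \1 must match exactly the text that group 1 captured.
doubled = re.compile(r"\b(\w+) \1\b")


def find_doubled_words(text):
    """Return each word that appears twice in a row, separated by one space."""
    return [m.group(1) for m in doubled.finditer(text)]


print(find_doubled_words("Paris in the the spring is is lovely"))
print(find_doubled_words("this is fine"))
```
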
  32.

    Performance concerns
    • If you are only matching a single literal string, it is faster to use a substring function
    • Be careful using dynamic regexes (patterns built from variables) inside loops. They are evaluated and compiled every time. In Perl, the /o modifier tells the engine to compile such a pattern only once; a static pattern like the one below is compiled once regardless, so /o is harmless there.

        foreach my $animal (@zoo) {
            if ($animal =~ /(?:monkey|ape)/o) {
                $primate_count++;
            }
        }
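
The same two ideas carry over to other languages. Here is a rough Python equivalent of the loop above, assuming a made-up zoo list: the pattern is compiled once outside the loop, and a plain substring test replaces the regex where only one literal is needed. Python's re module also caches recently compiled patterns, so the gain from precompiling is usually modest.

```python
import re

zoo = ["monkey", "zebra", "ape", "lion", "spider monkey"]  # hypothetical data

# Compile once, outside the loop, and reuse the compiled object.
primate_re = re.compile(r"(?:monkey|ape)")
primate_count = sum(1 for animal in zoo if primate_re.search(animal))
print(primate_count)

# For a single literal string, a substring test needs no regex at all.
monkey_count = sum(1 for animal in zoo if "monkey" in animal)
print(monkey_count)
```
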
  33.

    Greedy versus Non-greedy matching
    The quantifiers + and * are greedy by default: they match as much as they can.
    Example: /<a href=".*">/
    with the text: <a href="/index.html"><span class="button">Home</span></a>
    The greedy .* runs to the last "> in the text, so the match is <a href="/index.html"><span class="button"> rather than just the opening <a> tag.
  38.

    Greedy versus Non-greedy matching
    To make them non-greedy, simply add ? to the end, like .+? or .*?. A non-greedy quantifier matches as little as possible, extending one character at a time only until the rest of the pattern can match, which prevents it from going too far.
    Example: /<a href=".*?">/
    with the text: <a href="/index.html"><span class="button">Home</span></a>
    Here the match stops at the first ">, giving <a href="/index.html">.
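
The two behaviors can be compared side by side on the same text. This Python sketch also shows a third option, a negated character class, which is an addition not covered on the slides; it avoids the issue by never letting the quantifier cross a quote.

```python
import re

html = '<a href="/index.html"><span class="button">Home</span></a>'

# Greedy: .* runs as far as it can, then backs up to the last ">.
greedy = re.search(r'<a href=".*">', html).group(0)

# Non-greedy: .*? stops at the first "> it can reach.
lazy = re.search(r'<a href=".*?">', html).group(0)

# Alternative: [^"]* cannot cross a double quote at all.
negated = re.search(r'<a href="[^"]*">', html).group(0)

print(greedy)
print(lazy)
print(negated)
```
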