Slides for a presentation given at OpenWest 2016 in Sandy, UT.
© 2015 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Regular Expressions For Fun And Profit
Spencer Christensen | Adobe Analytics SRE
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
- Jamie Zawinski, circa 1997

Some people, when confronted with a problem, think "I know, I'll quote Jamie Zawinski." Now they have two problems.
- Martin Liebach, circa 2009

Regular Expressions are...
You have been invited to become Regex witches and wizards
Patterns and Pattern matching
Describing patterns
Using white, cast on 61 sts. Mark the centre stitch with a piece of coloured yarn.
1st row: Knit to within 1 st of the centre (on the first row this will be 29 sts and every decrease row after that will be one stitch less), Sl2, K1, PSSO, K to end
2nd row: Knit
3rd row: Using red, knit to within 1 st of the centre, Sl2, K1, PSSO, K to end
4th row: Purl
Repeat these four rows, always working rows 1 and 2 in white, and rows 3 and 4 in rainbow stripes. When you have 5 sts left work as follows:
K1, Sl2, K1, PSSO, K1
Next row: Knit
Next row (don't change colours): Sl2, K1, PSSO
Fasten off.

Describing patterns
Poetry and Rhyming patterns
Here's an example of ABAB in action, as written by William Shakespeare:
A  O, if I say, you look upon this verse,
B  When I, perhaps, compounded am with clay,
A  Do not so much as my poor name rehearse,
B  But let your love even with my life decay…

Describing patterns
Rubik's Cube Notation
A single letter by itself means to turn that face clockwise 90 degrees.
A letter followed by an apostrophe means to turn that face counterclockwise 90 degrees.
A letter with the number 2 after it means to turn that face 180 degrees.
e.g. R U R U R U2 R U’ ’

Languages and Symbols
Using codes to represent ideas and expressions.
    if (def[d] && def[d].arg && param) {
        var rw = (d+":"+param).replace(/'|\\/g, '_');
        def.__exp = def.__exp || {};
        def.__exp[rw] = def[d].text.replace(new RegExp("(^|[^\\w$])" + def[d].arg + "([^\\w$])", "g"), "$1" + param + "$2");
        return s + "def.__exp['"+rw+"']";
    }

Regex as a language
Matching hello world:
    /hello world/
Limitations of the hello world example:
● Case sensitive
● No explicit start or end of line
● Only matches a single space character

Regex as a language
Special Characters
● \    Quote the next metacharacter, or escape
● ^    Match the beginning of the line
● .    Match any character (except newline)
● $    Match the end of the string (or before newline at the end of the string)
● |    Alternation
● ()   Grouping
● [ ]  Bracketed character class

Regex as a language
Quantifiers
● *      Match 0 or more times
● +      Match 1 or more times
● ?      Match 1 or 0 times
● {n}    Match exactly n times
● {n,}   Match at least n times
● {n,m}  Match at least n but not more than m times

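A quick way to get a feel for the quantifiers is to try each one against a small string. This is a minimal sketch using Python's standard `re` module; the helper name `matches_fully` and the sample strings are made up for illustration.

```python
import re

def matches_fully(pattern: str, text: str) -> bool:
    """Return True when `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

zero_or_more = matches_fully(r"ab*c", "ac")          # * allows zero b's
one_or_more = matches_fully(r"ab+c", "ac")           # + needs at least one b
optional = matches_fully(r"ab?c", "abbc")            # ? allows at most one b
exactly_two = matches_fully(r"ab{2}c", "abbc")       # {2} needs exactly two b's
at_least_two = matches_fully(r"ab{2,}c", "abbbbc")   # {2,} allows two or more
one_to_two = matches_fully(r"ab{1,2}c", "abbbc")     # {1,2} rejects three b's

print(zero_or_more, one_or_more, optional, exactly_two, at_least_two, one_to_two)
# prints: True False False True True False
```
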
Examples:
    /hello +world/
    /(h|H)ello +(w|W)orld/
    /^(h|H)ello +(w|W)orld$/

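To see what each refinement buys, the three example patterns can be run against a few inputs. A minimal sketch in Python; the sample strings are invented for illustration.

```python
import re

patterns = [
    r"hello +world",               # one or more spaces, lowercase only
    r"(h|H)ello +(w|W)orld",       # also accepts a capital H and W
    r"^(h|H)ello +(w|W)orld$",     # and must span the whole line
]
samples = ["hello   world", "Hello World", "well, Hello World!"]

# Map each (pattern, sample) pair to whether the pattern is found in the sample.
found = {(p, s): re.search(p, s) is not None for p in patterns for s in samples}

for p in patterns:
    print(p, [found[(p, s)] for s in samples])
```

Only the anchored pattern rejects the third sample, because `^` and `$` forbid extra text around the greeting.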
Quiz time!
Write a regex to match any white space at the beginning of a line - zero or more space or tab characters.
    /^( |\t)*/

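One possible use for this answer is pulling the indentation off a line. A minimal sketch in Python; the function name `indent_of` is hypothetical. Because `*` accepts zero repetitions, the pattern matches every line, so the match object is never `None`.

```python
import re

LEADING_WS = re.compile(r"^( |\t)*")

def indent_of(line: str) -> str:
    """Return the run of spaces and tabs at the start of `line`."""
    # group(0) is the whole match; group(1) would hold only the last repetition.
    return LEADING_WS.match(line).group(0)

print(repr(indent_of("\t  x = 1")))   # '\t  '
print(repr(indent_of("x = 1")))       # ''
```
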
Character Classes
[ ] Square brackets contain the possible characters that can match one character.
● [ABCDEF]     matches only the specific literal characters
● [A-Z]        matches all uppercase letters of the alphabet
● [A-Za-z]     matches all upper and lower case letters
● [0-9]        matches all digits
● [0-9A-Fa-f]  matches hexadecimal digits, as in 9a31f

Character Classes
Order of contents within a character class doesn't matter as long as the matching is equivalent
    [abcd] == [dcba]
However, ranges do matter
    [a-Z] != [a-zA-Z]
A range runs from a lower character code to a higher one. In ASCII "a" comes after "Z", so most engines reject [a-Z] as an invalid range.

Character Classes
Special characters within a character class
● Invert a character class with a caret at the beginning: [^a-z]
● Dot, pipe, parens, braces, plus, question mark, star, dollar, and a caret that is not the first character are literals within a character class
● No need to escape them, although escaping makes it clear
    [.|(){}+?*^$]
    [\.\|\(\)\{\}\+\?\*\^\$]
● To get a literal dash, have it at the beginning or escape it
    [-asdf] or [asdf\-]

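The character class rules can be checked directly. A minimal sketch in Python's `re` module, with invented sample strings; the last part shows that Python refuses to compile the reversed range `[a-Z]`.

```python
import re

# A leading caret inverts the class: everything that is not a lowercase letter.
non_lower = re.findall(r"[^a-z]", "abC-9")

# Metacharacters are plain literals inside the brackets.
specials = re.findall(r"[.|(){}+?*^$]", "a.b|c(d)$")

# A dash placed first is a literal dash, not a range.
dashes = re.findall(r"[-asdf]", "a-b")

# A reversed range is rejected when the pattern is compiled.
try:
    re.compile(r"[a-Z]")
    bad_range_rejected = False
except re.error:
    bad_range_rejected = True

print(non_lower, specials, dashes, bad_range_rejected)
```
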
Escaping characters
When you want a literal non-alphanumeric character and are in doubt whether to escape it, escape it.
    /USD$[0-9]+\.[0-9]{2}/    (wrong: the unescaped $ means end of string)
    /USD\$[0-9]+\.[0-9]{2}/   (right: \$ is a literal dollar sign)
Double the backslash to get a literal backslash character: /\\/

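The difference between the two price patterns shows up as soon as they are run. A minimal sketch in Python with a made-up price string; `re.escape` is the standard library helper that adds the backslashes for you.

```python
import re

price = "USD$12.50"

# Unescaped: $ is an end-of-string anchor, so digits can never follow it.
unescaped = re.search(r"USD$[0-9]+\.[0-9]{2}", price)

# Escaped: \$ matches the literal dollar sign.
escaped = re.search(r"USD\$[0-9]+\.[0-9]{2}", price)

# re.escape builds a pattern that matches a string literally.
literal_prefix = re.escape("USD$")

# A doubled backslash in the pattern matches one backslash in the text.
has_backslash = re.search(r"\\", "C:\\temp") is not None

print(unescaped, escaped.group(0), literal_prefix, has_backslash)
```
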
Quiz time!
Write a regex to match an IP address, e.g. 10.9.200.12
Hint: use the { } quantifier
    /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/
    /([0-9]{1,3}\.?){4}/

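Both answers check only the shape of the address: they accept 999.999.999.999, and the shorter one even accepts 1234 because its dot is optional. A minimal sketch of one way to tighten this in Python, assuming the range check is done in code rather than in the regex; the names `IP_SHAPE`, `LOOSE`, and `is_ipv4` are hypothetical.

```python
import re

# Shape check from the first quiz answer, with a group around each octet.
IP_SHAPE = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

# The shorter quiz answer, kept only to show how loose it is.
LOOSE = re.compile(r"([0-9]{1,3}\.?){4}")

def is_ipv4(text: str) -> bool:
    """Return True for four dot-separated numbers, each between 0 and 255."""
    m = IP_SHAPE.fullmatch(text)  # fullmatch requires the whole string to match
    return m is not None and all(int(octet) <= 255 for octet in m.groups())

print(is_ipv4("10.9.200.12"), is_ipv4("999.1.1.1"), is_ipv4("10.9.200"))
print(LOOSE.fullmatch("1234") is not None)
```
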
PCRE character classes as metacharacters
Metacharacters or escaped characters
    \w – word character == [a-zA-Z0-9_]
    \d – digit == [0-9]
    \s – white space, at least [ \t\r\n\f]
    \t – tab
    \n – newline
    \r – carriage return
    \b – word boundary
    \xhh – character with hex code hh, e.g. \x41 is "A" (longer values use braces, e.g. \x{0234})
Inverses:
    \W == [^\w] == [^a-zA-Z0-9_]
    \D == [^\d] == [^0-9]
    \S == [^\s]

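Most of these shorthands also exist in Python's `re` module, which makes them easy to try out. A minimal sketch with invented sample text; note that for Python 3 strings `\w`, `\d`, and `\s` are Unicode-aware unless the `re.ASCII` flag is given, so they can match more than the ASCII sets listed here.

```python
import re

# \b keeps "cat" from matching inside "concat" or "cat9".
whole_words = re.findall(r"\bcat\b", "cat concat cat9 Cat_2")

# \d+ collects runs of digits.
numbers = re.findall(r"\d+", "a1 b22 c333")

# \w+ collects runs of letters, digits, and underscores.
words = re.findall(r"\w+", "foo_bar baz-9")

# \s+ splits on any run of whitespace, including tabs and newlines.
fields = re.split(r"\s+", "a \t b\nc")

print(whole_words, numbers, words, fields)
```
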
POSIX character classes
POSIX character classes are named classes in the form [[:class:]], e.g. [[:alpha:]]
    alpha   Any alphabetical character ("[A-Za-z]").
    alnum   Any alphanumeric character ("[A-Za-z0-9]").
    ascii   Any character in the ASCII character set.
    blank   A GNU extension, equal to a space or a horizontal tab ("\t").
    cntrl   Any control character.
    digit   Any decimal digit ("[0-9]"), equivalent to "\d".
    graph   Any printable character, excluding a space.
    lower   Any lowercase character ("[a-z]").
    print   Any printable character, including a space.
    punct   Any graphical character excluding "word" characters.
    space   Any whitespace character. "\s" including the vertical tab ("\cK").
    upper   Any uppercase character ("[A-Z]").
    word    A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
    xdigit  Any hexadecimal digit ("[0-9a-fA-F]").

Quiz time!
Write a regex to match any white space at the beginning of a line - zero or more space or tab characters.
    /^( |\t)*/ => /^\s*/
Note that \s also matches newlines and other whitespace; /^[ \t]*/ matches only spaces and tabs.

Quiz time!
Write a regex to match an IP address, e.g. 10.9.200.12
Hint: use the { } quantifier
    /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ => /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
    /([0-9]{1,3}\.?){4}/ => /(\d{1,3}\.?){4}/

Subexpressions, groups, and captures
Parentheses enclose a subexpression, and the match of just that subexpression is saved in a buffer. These buffers, sometimes called groups or captures, can be referenced and used later.
Example: /"(GET|POST) (http[^"]+)"/

Subexpressions, groups, and captures
Depending on your programming language you can then use those groups, store them in variables, and do something with them.
Example in Python:
    import re
    # request_str holds one request line, e.g. from an access log
    matches = re.search(r'"(GET|POST) (http[^"]+)"', request_str)
    if matches:
        method = matches.group(1)
        url = matches.group(2)

Subexpressions, groups, and captures
Groups can be nested, in which case group numbers are based on the order of the left parentheses.
Example: /(https?:\/\/([^\/]+)\/([^?]*)(\?.*)?)/
How many groups are there? 4
Group 1 is the entire URL
Group 2 is the hostname
Group 3 is the URL path (written as [^?]* rather than .*, because a greedy .* would also consume the query string)
Group 4 is the query string if it exists, and is optional. You will need to check if it exists in your code

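The group numbering can be checked in Python. A minimal sketch with made-up URLs; it writes the path group as `[^?]*` so that the path cannot swallow the query string, and includes a greedy `.*` variant for comparison. In Python the slashes need no escaping because the pattern is not wrapped in `/.../`.

```python
import re

URL = re.compile(r"(https?://([^/]+)/([^?]*)(\?.*)?)")
GREEDY_URL = re.compile(r"(https?://([^/]+)/(.*)(\?.*)?)")

with_query = URL.search("https://example.com/docs/index.html?lang=en")
without_query = URL.search("http://example.com/a/b")
greedy = GREEDY_URL.search("https://example.com/docs/index.html?lang=en")

print(with_query.groups())
# An optional group that did not take part in the match is None.
print(without_query.group(4))
# With a greedy .* the path group takes the query string too.
print(greedy.group(3), greedy.group(4))
```
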
Subexpressions, groups, and captures
Things to be aware of with groups/captures
● They have overhead copying text to the saved buffers. So if you don't really need the group you can improve performance slightly by using (?:...) notation. This tells the regex engine not to save the subexpression match in a buffer.
Example: /(?:this|that|these)/

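Besides the small performance gain, a non-capturing group also keeps the group numbering clean. A minimal sketch in Python with an invented sentence, comparing the two forms.

```python
import re

text = "take these apples"

# Both sets of parentheses capture, so the noun is group 2.
capturing = re.search(r"(this|that|these) (\w+)", text)

# (?:...) groups without capturing, so the noun becomes group 1.
non_capturing = re.search(r"(?:this|that|these) (\w+)", text)

print(capturing.groups())       # ('these', 'apples')
print(non_capturing.groups())   # ('apples',)
```
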
Subexpressions, groups, and captures
You can reference a group within the same regex the groups are matching. To reference a group use \1, \2, \3, etc.
Example: /(\w+) \1/

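A common use of a backreference is spotting doubled words. A minimal sketch in Python with invented sentences; it also shows that without word boundaries the pattern finds "is is" inside "this is", so the second pattern adds `\b` as one possible refinement.

```python
import re

# The plain pattern finds a repeated word.
doubled = re.search(r"(\w+) \1", "it was the the best of times")

# It also matches across word edges: "is" at the end of "this", then " is".
false_hit = re.search(r"(\w+) \1", "this is fine")

# Word boundaries on both sides restrict the match to whole words.
strict_miss = re.search(r"\b(\w+) \1\b", "this is fine")
strict_hit = re.search(r"\b(\w+) \1\b", "it was the the best of times")

print(doubled.group(0), "|", false_hit.group(0), "|", strict_miss, "|", strict_hit.group(1))
```
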
Performance concerns
● If you are only matching a single literal string, it is faster to use a substring function
● Be careful using dynamic regexes (patterns built from variables) inside loops. They can be evaluated and compiled on every iteration. In Perl, the /o modifier asks the engine to compile the pattern only once; a fully static pattern like the one below is compiled once in any case.
    foreach my $animal (@zoo) {
        if ($animal =~ /(?:monkey|ape)/o) {
            $primate_count++;
        }
    }

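The same idea in Python is to compile the pattern once, outside the loop, with `re.compile`. A minimal sketch mirroring the Perl loop; the `zoo` list is made up. Python's `re` module also keeps an internal cache of recently compiled patterns, so the gain is usually modest, but an explicit compiled object makes the intent clear.

```python
import re

# Compiled once, before the loop starts.
PRIMATE = re.compile(r"(?:monkey|ape)")

zoo = ["spider monkey", "zebra", "ape", "lion"]

primate_count = 0
for animal in zoo:
    if PRIMATE.search(animal):
        primate_count += 1

print(primate_count)  # 2
```
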
Greedy versus Non-greedy matching
The quantifiers + and * are greedy by default: they match as much text as they can.
Example: //
with the text:
class="button">Home

Greedy versus Non-greedy matching
To make them non-greedy simply add ? to the end, like .+? or .*?. A non-greedy quantifier matches as few characters as possible, extending the match one character at a time only until the rest of the pattern can succeed, which prevents it from going too far.
Example: //
with the text:
class="button">Home

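The difference is easiest to see on markup. A minimal sketch in Python; the pattern and the HTML snippet here are stand-ins chosen for illustration, not the example from the original slide.

```python
import re

html = '<a class="button">Home</a>'

# Greedy: .+ runs to the end, then backs up to the last ">".
greedy = re.search(r"<.+>", html).group(0)

# Non-greedy: .+? stops at the first ">" that lets the pattern succeed.
lazy = re.search(r"<.+?>", html).group(0)

print(greedy)  # <a class="button">Home</a>
print(lazy)    # <a class="button">
```
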
Regular Expressions are Magic
Congratulations! Regex Witches and Wizards!
Thanks!