Regular Expressions for Fun and Profit

© 2015 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Regular Expressions For Fun And Profit Spencer Christensen | Adobe Analytics SRE

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski, circa 1997

Some people, when confronted with a problem, think “I know, I’ll quote Jamie Zawinski.” Now they have two problems. - Martin Liebach, circa 2009

Regular Expressions are...

You have been invited to become Regex witches and wizards

Patterns and Pattern matching

Describing patterns

Describing patterns
Using white, cast on 61sts. Mark the centre stitch with a piece of coloured yarn.
1st row: Knit to within 1 st of the centre (on the first row this will be 29 sts and every decrease row after that will be one stitch less), Sl2, K1, PSSO, K to end
2nd row: Knit
3rd row: Using red, knit to within 1 st of the centre, Sl2, K1, PSSO, K to end
4th row: Purl
Repeat these four rows, always working rows 1 and 2 in white, and rows 3 and 4 in rainbow stripes.
When you have 5sts left work as follows: K1, Sl2, K1, PSSO, K1
Next row: Knit
Next row (don't change colours): Sl2, K1, PSSO
Fasten off.

Describing patterns
Poetry and Rhyming patterns
Here's an example of ABAB in action, as written by William Shakespeare:
A  O, if I say, you look upon this verse,
B  When I, perhaps, compounded am with clay,
A  Do not so much as my poor name rehearse,
B  But let your love even with my life decay…

Describing patterns
Rubik's Cube Notation
A single letter by itself means to turn that face clockwise 90 degrees.
A letter followed by an apostrophe means to turn that face counterclockwise 90 degrees.
A letter with the number 2 after it means to turn that face 180 degrees.
e.g. R U R U R U2 R U ’ ’

Languages and Symbols
Using codes to represent ideas and expressions.
if (def[d] && def[d].arg && param) {
    var rw = (d+":"+param).replace(/'|\\/g, '_');
    def.__exp = def.__exp || {};
    def.__exp[rw] = def[d].text.replace(new RegExp("(^|[^\\w$])" + def[d].arg + "([^\\w$])", "g"), "$1" + param + "$2");
    return s + "def.__exp['"+rw+"']";
}

Regex as a language
Matching hello world: /hello world/
Limitations of hello world example:
• Case sensitive
• No explicit start or end of line
• Only matches a single space character

Regex as a language
Special Characters
• \ Quote (escape) the next metacharacter
• ^ Match the beginning of the line
• . Match any character (except newline)
• $ Match the end of the string (or before newline at the end of the string)
• | Alternation
• () Grouping
• [ ] Bracketed Character class

Regex as a language
Quantifiers
• * Match 0 or more times
• + Match 1 or more times
• ? Match 1 or 0 times
• {n} Match exactly n times
• {n,} Match at least n times
• {n,m} Match at least n but not more than m times
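A minimal Python sketch of how each quantifier behaves, assuming Python's built-in re module (the deck itself is engine-neutral). The helper name fully_matches is made up for this illustration; it uses re.fullmatch so that the quantifier alone decides whether the whole string matches.

```python
import re

def fully_matches(pattern, text):
    """Hypothetical helper: True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Each list holds one True/False per sample string, in order.
star = [fully_matches(r"ab*", s) for s in ("a", "ab", "abbb")]         # * : 0 or more
plus = [fully_matches(r"ab+", s) for s in ("a", "ab", "abbb")]         # + : 1 or more
optional = [fully_matches(r"ab?", s) for s in ("a", "ab", "abb")]      # ? : 1 or 0
exactly = [fully_matches(r"ab{2}", s) for s in ("ab", "abb", "abbb")]  # {n}
at_least = [fully_matches(r"ab{2,}", s) for s in ("ab", "abb", "abbbb")]          # {n,}
between = [fully_matches(r"ab{2,3}", s) for s in ("ab", "abb", "abbb", "abbbb")]  # {n,m}
```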

Examples:
/hello +world/
/(h|H)ello +(w|W)orld/
/^(h|H)ello +(w|W)orld$/
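The three examples each lift one limitation of the plain hello world pattern. A short sketch in Python (an assumption; any engine with this syntax should behave the same way) makes the difference visible. The sample strings are invented for illustration.

```python
import re

basic = re.compile(r"hello +world")                # one or more spaces
either_case = re.compile(r"(h|H)ello +(w|W)orld")  # upper or lower case h and w
whole_line = re.compile(r"^(h|H)ello +(w|W)orld$") # nothing else allowed on the line

samples = ["hello world", "hello    world", "Hello World", "say Hello World!"]

# For each sample: (basic matched?, either_case matched?, whole_line matched?)
results = {
    s: (bool(basic.search(s)), bool(either_case.search(s)), bool(whole_line.search(s)))
    for s in samples
}
```

Only the anchored pattern rejects "say Hello World!", because ^ and $ leave no room for the surrounding text.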

Quiz time! Write a regex to match any white space at the beginning of a line: zero or more space or tab characters.

Quiz time! Write a regex to match any white space at the beginning of a line: zero or more space or tab characters.
/^( |\t)*/

Character Classes [ ]
Square brackets contain the possible characters for matching one character.
• [ABCDEF] matches only the specific literal characters
• [A-Z] matches all uppercase letters of the alphabet
• [A-Za-z] matches all upper and lower case letters
• [0-9] matches all digits
• [0-9A-Fa-f] matches hexadecimal digits, as in 9a31f

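Character Classes
Order of contents within a character class doesn't matter as long as the matching is equivalent:
[abcd] == [dcba]
However, ranges do matter:
[a-Z] != [a-zA-Z]
A range runs from the lower character code to the higher one. In ASCII, Z comes before a, so [a-Z] is a backwards range and not a shorthand for all letters.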
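A quick check of these character class rules, sketched with Python's re module (an assumption about the engine; other engines may report a backwards range differently or treat it as empty). Python rejects [a-Z] when the pattern is compiled.

```python
import re

# A hexadecimal number such as 9a31f, built from the class shown earlier.
hex_number = re.compile(r"[0-9A-Fa-f]+")
is_hex = hex_number.fullmatch("9a31f") is not None
not_hex = hex_number.fullmatch("9a31g") is not None

# Order inside the brackets does not matter.
same_set = (re.fullmatch(r"[abcd]+", "dcba") is not None
            and re.fullmatch(r"[dcba]+", "abcd") is not None)

# A backwards range is an error in Python: 'a' (97) sorts after 'Z' (90).
try:
    re.compile(r"[a-Z]")
    bad_range_error = None
except re.error as exc:
    bad_range_error = str(exc)
```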

Character Classes
Special characters within a character class
• Invert the character class with a caret at the beginning: [^a-z]
• Dot, pipe, parens, braces, plus, question mark, star, caret (when not first), and dollar are literals within a character class
• No need to escape them, although escaping makes it clear:
[.|(){}+?*^$]
[\.\|\(\)\{\}\+\?\*\^\$]
• To get a literal dash, put it at the beginning or escape it:
[-asdf] or [asdf\-]
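The rules for special characters inside brackets can be checked with a small sketch, again assuming Python's re module. The variable names are invented for illustration; the escaped class here spells out the parentheses as well so that both classes list the same characters.

```python
import re

unescaped = re.compile(r"[.|(){}+?*^$]")
escaped = re.compile(r"[\.\|\(\)\{\}\+\?\*\^\$]")
punctuation = ".|(){}+?*^$"

# Every listed character is matched literally by both classes.
unescaped_ok = all(unescaped.fullmatch(ch) for ch in punctuation)
escaped_ok = all(escaped.fullmatch(ch) for ch in punctuation)

# Inside brackets the dot is literal, so an ordinary letter does not match.
letter_matches = unescaped.fullmatch("a") is not None

# A leading caret inverts the class.
not_lowercase = re.fullmatch(r"[^a-z]", "A") is not None
lowercase_rejected = re.fullmatch(r"[^a-z]", "a") is None

# Two ways to include a literal dash.
dash_first = re.fullmatch(r"[-asdf]", "-") is not None
dash_escaped = re.fullmatch(r"[asdf\-]", "-") is not None
```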

Escaping characters
When you want a literal non-alphanumeric character and are in doubt whether to escape it, escape it.
/USD$[0-9]+\.[0-9]{2}/ (wrong: the unescaped $ means end of string)
/USD\$[0-9]+\.[0-9]{2}/ (right: \$ is a literal dollar sign)
Double the backslash to get a literal backslash character:
/\\/
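To see why the unescaped version fails, here is a small sketch in Python (assumed engine). The sample strings are made up. With a bare $, the engine looks for end of string followed by digits, which can never happen.

```python
import re

price_text = "Total: USD$19.99"

# Unescaped: $ is an end-of-string anchor, so nothing can follow it.
unescaped = re.search(r"USD$[0-9]+\.[0-9]{2}", price_text)

# Escaped: \$ is a literal dollar sign.
escaped = re.search(r"USD\$[0-9]+\.[0-9]{2}", price_text)

# A doubled backslash in the pattern matches one literal backslash.
backslash = re.search(r"\\", r"C:\temp")
```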

Quiz time! Write a regex to match an IP address, e.g. 10.9.200.12
Hint: use the { } quantifier

Quiz time! Write a regex to match an IP address, e.g. 10.9.200.12
Hint: use the { } quantifier
/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/
/([0-9]{1,3}\.?){4}/
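The two answers are not equivalent, which a short Python sketch (assumed engine, invented sample inputs) can show. The compact form makes each dot optional, so it also accepts strings that are not dotted quads. Neither pattern checks that each number is in the range 0 to 255.

```python
import re

strict = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")
compact = re.compile(r"([0-9]{1,3}\.?){4}")

def accepts(pattern, text):
    """Hypothetical helper: True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None

both_accept_ip = accepts(strict, "10.9.200.12") and accepts(compact, "10.9.200.12")

# With optional dots, four bare digits satisfy the compact pattern.
strict_accepts_digits = accepts(strict, "1234")
compact_accepts_digits = accepts(compact, "1234")

# Out-of-range octets pass both patterns.
strict_accepts_999 = accepts(strict, "999.999.999.999")
```

For validation rather than rough extraction, parsing the four numbers and checking their range in code is usually simpler than encoding the range in the pattern.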

PCRE character classes as metacharacters
Metacharacters or escaped characters
\w – word character == [a-zA-Z0-9_]
\d – digit == [0-9]
\s – white space == [ \t\r\n\f]
\t – tab
\n – newline
\r – carriage return
\b – word boundary
\x{0234} – character with that hex value
Inverses:
\W == [^\w] == [^a-zA-Z0-9_]
\D == [^\d] == [^0-9]
\S == [^\s] == [^ \t\r\n\f]
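A brief sketch of these shorthands in Python, which is an assumption: Python's re module is not PCRE, but it supports the same escapes. In Python 3 the shorthands are Unicode-aware by default, so the re.ASCII flag is used where the ASCII-only definition matters. Sample strings are invented.

```python
import re

# \w covers letters, digits, and underscore, but not the hyphen.
words = re.findall(r"\w+", "user_42 logged-in", flags=re.ASCII)

# \d picks out runs of digits.
octets = re.findall(r"\d+", "10.9.200.12")

# \s matches spaces, tabs, and newlines alike.
fields = re.split(r"\s+", "a \t b\nc")

# \b matches only at word edges, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \D is the inverse of \d: strip everything that is not a digit.
digits_only = re.sub(r"\D", "", "(801) 555-0100")
```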

POSIX character classes
POSIX character classes are named classes in the form [[:class:]], for example [[:alpha:]]
alpha   Any alphabetical character ("[A-Za-z]").
alnum   Any alphanumeric character ("[A-Za-z0-9]").
ascii   Any character in the ASCII character set.
blank   A GNU extension, equal to a space or a horizontal tab ("\t").
cntrl   Any control character.
digit   Any decimal digit ("[0-9]"), equivalent to "\d".
graph   Any printable character, excluding a space.
lower   Any lowercase character ("[a-z]").
print   Any printable character, including a space.
punct   Any graphical character excluding "word" characters.
space   Any whitespace character. "\s" including the vertical tab ("\cK").
upper   Any uppercase character ("[A-Z]").
word    A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
xdigit  Any hexadecimal digit ("[0-9a-fA-F]").

Quiz time! Write a regex to match any white space at the beginning of a line: zero or more space or tab characters.
/^( |\t)*/ => /^\s*/
Note that \s is slightly broader than space or tab: it also matches newlines and carriage returns.
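The difference between the two answers shows up when the text starts with a newline. A small sketch, assuming Python's re module and an invented sample string; the bracketed form [ \t] is added here as an exact equivalent of ( |\t).

```python
import re

text = "\n\t  indented"

# Space-or-tab only: stops immediately at the leading newline.
space_or_tab = re.match(r"^( |\t)*", text).group(0)

# \s also consumes the newline.
any_whitespace = re.match(r"^\s*", text).group(0)

# A character class gives the space-or-tab behaviour without a group.
bracketed = re.match(r"^[ \t]*", "\t  x = 1").group(0)
```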

Quiz time! Write a regex to match an IP address, e.g. 10.9.200.12
Hint: use the { } quantifier
/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ => /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
/([0-9]{1,3}\.?){4}/ => /(\d{1,3}\.?){4}/

Subexpressions, groups, and captures
Parentheses enclose a subexpression, and the text matched by just that subexpression is saved in a buffer. These saved buffers, usually called groups or captures, can be referenced and used later.
Example: /"(GET|POST) (http[^"]+)"/

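Subexpressions, groups, and captures
Depending on your programming language, you can then store those groups in variables and do something with them.
Example in python:
    import re

    # request_str is a hypothetical sample input, e.g. a quoted request from a log line
    request_str = '"GET http://example.com/index.html"'

    matches = re.search(r'"(GET|POST) (http[^"]+)"', request_str)
    if matches:
        method = matches.group(1)  # GET or POST
        url = matches.group(2)     # everything up to the closing quote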

Subexpressions, groups, and captures
Groups can be nested, in which case group numbers are assigned by the order of the left parentheses.
Example: /(https?:\/\/([^\/]+)\/([^?]*)(\?.*)?)/
How many groups are there?

Subexpressions, groups, and captures
Groups can be nested, in which case group numbers are assigned by the order of the left parentheses.
Example: /(https?:\/\/([^\/]+)\/([^?]*)(\?.*)?)/
How many groups are there? 4
Group 1 is the entire url
Group 2 is the hostname
Group 3 is the url path
Group 4 is the query string. It is optional, so you will need to check in your code whether it exists.
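The numbering can be checked with a short Python sketch (assumed engine). It uses a variant of the URL pattern in which the path is written as [^?]* so that a greedy path group cannot swallow the query string; Python needs no escaping for forward slashes. The example URLs are invented.

```python
import re

url_pattern = re.compile(r"(https?://([^/]+)/([^?]*)(\?.*)?)")

with_query = url_pattern.search("https://example.com/docs/index.html?lang=en")
whole_url = with_query.group(1)   # group 1: outermost parentheses
hostname = with_query.group(2)    # group 2: second left parenthesis
path = with_query.group(3)        # group 3
query = with_query.group(4)       # group 4, optional

# When the optional group does not take part in the match, Python returns None.
without_query = url_pattern.search("http://example.com/about")
missing_query = without_query.group(4)
```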

Subexpressions, groups, and captures
Things to be aware of with groups/captures
• Captures have overhead: the engine copies the matched text to the saved buffers. If you don't need the group's contents, you can improve performance slightly with the (?:...) notation, which tells the regex engine not to save the subexpression match in a buffer.
Example: /(?:this|that|these)/
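A small Python sketch (assumed engine, invented sample text) shows the practical effect on group numbering: a non-capturing group still groups the alternatives but does not take up a group number. It makes no attempt to measure the performance difference.

```python
import re

capturing = re.compile(r"(this|that|these) (\w+)")
non_capturing = re.compile(r"(?:this|that|these) (\w+)")

text = "I want these apples"

# With a capturing group, the noun is group 2.
noun_capturing = capturing.search(text).group(2)

# With (?:...), the noun becomes group 1 and nothing is saved for the alternation.
noun_non_capturing = non_capturing.search(text).group(1)
```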

Subexpressions, groups, and captures
You can reference a group within the same regex that captures it (a backreference). To reference a group use \1, \2, \3, etc.
Example: /(\w+) \1/ matches the same word twice in a row, separated by a space.
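A sketch of the backreference in Python (assumed engine, invented sample sentences). It also shows a pitfall of the bare pattern: without word boundaries, \w+ can start in the middle of a word. The \b variant is an addition for illustration, not part of the slide.

```python
import re

plain = re.compile(r"(\w+) \1")
bounded = re.compile(r"\b(\w+) \1\b")

# A genuine doubled word is found by both patterns.
doubled_word = bounded.search("it is is broken").group(1)

# The bare pattern also fires inside "this is", matching the text "is is".
false_positive = plain.search("this is fine").group(0)

# With word boundaries the same sentence no longer matches.
bounded_result = bounded.search("this is fine")
```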

Performance concerns
• If you are only matching a single literal string, it is faster to use a substring function
• Be careful using dynamic regexes (patterns built from variables) inside loops: they can be evaluated and compiled on every iteration. In Perl, the /o modifier tells the engine to compile such a pattern only once.
foreach my $animal (@zoo) {
    if ($animal =~ /(?:monkey|ape)/o) {
        $primate_count++;
    }
}
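The same two ideas can be sketched in Python, which is an assumption since the slide's example is Perl. The pattern is compiled once outside the loop with re.compile, and the literal case uses the in operator instead of a regex. The zoo list is invented; no timing claims are made here.

```python
import re

zoo = ["monkey", "zebra", "ape", "lion"]

# Compile once, outside the loop, then reuse the compiled object.
primate = re.compile(r"(?:monkey|ape)")

primate_count = 0
for animal in zoo:
    if primate.search(animal):
        primate_count += 1

# For a single literal string, a plain substring test needs no regex at all.
zebra_count = sum(1 for animal in zoo if "zebra" in animal)
```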

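Greedy versus Non-greedy matching
The quantifiers + and * are greedy by default: they match as much text as they can.
Example: /<a href=".*">/
with the text: <a href="/index.html"><span class="button">Home</span></a>
Here .* runs on to the last "> in the text, so the match is <a href="/index.html"><span class="button">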

Greedy versus Non-greedy matching
To make them non-greedy, simply add ? to the end, like .+? or .*?. This tells the regex engine to match as little as possible, extending the match one character at a time only until the rest of the pattern can match, which prevents it from going too far.
Example: /<a href=".*?">/
with the text: <a href="/index.html"><span class="button">Home</span></a>
Now the match stops at the first ">, giving <a href="/index.html">
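Running both patterns against the sample text makes the difference concrete. This is a sketch in Python (assumed engine), using straight double quotes in the HTML.

```python
import re

html = '<a href="/index.html"><span class="button">Home</span></a>'

# Greedy: .* grabs as much as it can, then backs off to the last ">.
greedy = re.search(r'<a href=".*">', html).group(0)

# Non-greedy: .*? grows only until the first "> is found.
lazy = re.search(r'<a href=".*?">', html).group(0)
```

An alternative that avoids the question entirely is a negated class such as [^"]*, as used in the earlier (http[^"]+) example.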

Regular Expressions are Magic

Congratulations! Regex Witches and Wizards!

Thanks!
