Why “regular expressions”? ● 1956: mathematical definition of regular sets by Stephen Cole Kleene ● 1968: “Regular Expression Search Algorithm” by Ken Thompson, describing a regular expression compiler ● Regular expressions are adopted by text editors, and the grep command is introduced.
Examples of text matching ● Given an IIS log, keep just the requests to the web app “/PicnicAPI” ● Perform LIKE queries on MongoDB ● Get the dir and basename of a file path ● Get the src attribute of an img tag ● Read a key-value file having “\” line continuations
Generalized problems ● Determine whether a pattern is contained in (matches) a given string ● Extract substrings from a matching string ● Replace one or more substrings ● Generalizable to files and streams
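The three generalized problems can be sketched with Python's standard re module. This is only an illustration: the sample log line and the variable names are hypothetical and do not come from a real IIS log.

```python
import re

# Hypothetical sample text, loosely inspired by the IIS log example.
log_line = "GET /PicnicAPI/orders/42 HTTP/1.1"

# 1. Does the pattern match somewhere in the string?
contains_api = re.search(r"/PicnicAPI", log_line) is not None

# 2. Extract a substring through a capturing group.
match = re.search(r"/orders/(\d+)", log_line)
order_id = match.group(1) if match else None

# 3. Replace every run of digits with a placeholder.
masked = re.sub(r"\d+", "N", log_line)

print(contains_api, order_id, masked)
```

The same calls work line by line on files and streams, since each line is just another string.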
Interactive testing ● http://regex101.com/ - currently provides a free multi-engine test environment, explaining your regex and showing the matches on a text. ● http://rubular.com/ - another regex test environment, targeting Ruby's flavour.
The dualism regex-target The regular expression is applied to a string, to check for a match. Both the regex and the string have their own cursor. Which cursor drives the matching process? [Figure: the text “The quick br” and the regex “qui”, each with its own cursor]
DFA ● Matching is driven by the cursor on the text ● Very fast matching ● Takes longer to compile ● Takes more memory ● Declarative regex ● Always returns the longest possible match.
Traditional NFA ● Matching is driven by the cursor on the regex ● Creates a stack of states, and performs backtracking ● Supports more language constructs ● Imperative regex ● Usually returns the first match found ● Employed by standard Java, .NET, Python, PHP, Perl, Ruby, …
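A small sketch of the “first match found” behaviour, using Python's re module as an example of a backtracking engine. The sample pattern and text are assumptions chosen for illustration.

```python
import re

# Alternatives are tried left to right; the first one that leads to a
# successful match wins, even if a longer match would be possible.
first_found = re.search(r"reg|regex", "regex").group()

# Reordering the alternatives changes the result on the same text.
longest_first = re.search(r"regex|reg", "regex").group()

print(first_found, longest_first)
```

An engine that always returns the longest possible match would report “regex” in both cases, which is why the order of alternatives matters in NFA regexes.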
Our target: NFAs ● DFA engines are less common than NFA engines; their syntax is almost a subset of the NFA syntax, and they are generally simpler ● We will concentrate on NFA regexes
Escaping characters ● Some characters (\, *, ?, +, ., (, ), [, ], {, }, |, ^, $, #) must be escaped when they are used literally ● Escape is performed by prepending “\”. For example: /\?/ to represent a literal “?” ● Where raw strings are not supported, a double escape might be required. In Java, the regex /\\\+/ becomes: “\\\\\\+”.
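The escaping rules can be tried out in Python, which supports raw strings. The sample text below is hypothetical; re.escape is a standard helper that escapes a literal string for use inside a regex.

```python
import re

text = "cost: 3+4?"  # hypothetical sample text

# Raw string: the backslashes reach the regex engine untouched.
literal_match = re.search(r"3\+4\?", text)

# Without a raw string, each backslash must be doubled in the string literal.
doubled = "3\\+4\\?"

# re.escape builds the escaped pattern from the literal text.
escaped = re.escape("3+4?")
escaped_match = re.search(escaped, text)

print(literal_match.group(), doubled == r"3\+4\?", escaped_match.group())
```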
Character classes ● [abc] = “a, b or c in this position” ● [a-z] = “a, b, c, …, z here” ● [A-Za-z] = “A, B, …, Z, a, b, …, z here” ● [A-Za-z0] = “A, …, Z, a, …, z, 0 here” ● [A-Z\-] = [-A-Z] = “A, …, Z or – here” ● What about accents? (é, è, …) And cedilla? ● Know your engine.
Negated character classes ● [^ab] = “Something not a and not b here” ● [^a-z] = “Something not a, b, c, …, z here” ● [^A-Za-c] = “Something not A, …, Z, a, …, c here” ● A negated character class still requires a character in that position: one that does not belong to the specified class.
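The last point is easy to overlook, so here is a minimal sketch in Python. The pattern and the sample words are assumptions picked for illustration.

```python
import re

# "q" followed by a character that is not "u".
q_not_u = re.compile(r"q[^u]")

# There is a character after "q" and it is not "u": match.
found = q_not_u.search("Iraqi")

# "q" is the last character: the negated class has nothing to consume,
# so the regex fails even though no "u" follows.
at_end = q_not_u.search("Iraq")

print(found.group(), at_end)
```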
Common character classes ● \d = a digit ● \D = [^\d] ● \w = a letter, a digit or “_” ● \W = [^\w] ● \s = a space character ● \S = [^\s] ● . = any character except newline
What are letters and spaces? ● The answer depends on the encoding and on your engine. ● In ASCII, usually: – \w = [A-Za-z0-9_] – \s = [\r \n\t\f\v] (includes ASCII-32 common space) ● But what about Latin-1 or Unicode? ● Know your engine
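As one concrete case of “know your engine”, the sketch below shows how Python 3 treats \w on str patterns: Unicode-aware by default, ASCII-only with the re.ASCII flag. Other engines may behave differently; the sample word is an assumption.

```python
import re

word = "caf\u00e9_1"  # "café_1", with a precomposed e-acute

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.findall(r"\w+", word)

# With re.ASCII, \w is restricted to [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", word, flags=re.ASCII)

print(unicode_words, ascii_words)
```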
Unicode character classes ● \uXXXX: matches the Unicode code point whose hex value is XXXX ● Many engines, but not all, also support Unicode categories and scripts, typically via \p ● Many more Unicode-related, non-standard features exist ● Know your engine
Capturing groups ● ( and ) define a capturing group ● Capturing groups are assigned a 1-based index, according to the position of their ( ● /(\w+)bet/ tries to match a string and, if successful, creates a capturing group for the text matching \w+, having index 1 ● If the above regex is applied to “alphabet”, it matches and its group 1 is “alpha”
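The “alphabet” example can be checked in Python, where group 0 is the whole match and group 1 is the first capturing group.

```python
import re

m = re.search(r"(\w+)bet", "alphabet")

whole_match = m.group(0)   # text matched by the entire regex
first_group = m.group(1)   # text matched by (\w+)

print(whole_match, first_group)
```

Note that \w+ first consumes the whole word and then gives characters back until “bet” can match, which is why group 1 ends up as “alpha”.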
Non-capturing groups ● Groups can be used just to clarify precedence: capturing is not always needed ● Skipping capturing can save memory and speed up the matching process ● To define a non-capturing group, use (?: and ). ● Therefore, /(?:\w+)bet/ is just like /\w+bet/: no capturing is performed, and here the grouping does not change precedence either.
Backreferences ● Backreference = the content of a capturing group that becomes part of the regex ● Use \N in your regex, replacing N with the index of the captured group in question ● For example: /(['"])\w+\1/ to pair single and double quotes ● Some engines support named capturing groups and named backreferences
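A short sketch of the quote-pairing example in Python, including the named form that Python's re module offers. The group name q and the sample sentences are assumptions.

```python
import re

# Numbered backreference: \1 must repeat whatever group 1 captured.
quoted = re.compile(r"""(['"])\w+\1""")

paired = quoted.search("say 'hello' now")
mismatched = quoted.search("say 'hello\" now")  # opens with ' but closes with "

# Named variant: (?P<q>...) defines the group, (?P=q) refers back to it.
named = re.compile(r"""(?P<q>['"])\w+(?P=q)""")
named_match = named.search('say "hello" now')

print(paired.group(), mismatched, named_match.group("q"))
```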
Alternation ● Alternatives are separated by | ● For example: /alpha|beta/ means “alpha” or “beta” ● Alternation has very low precedence; its scope is the current group: use grouping to force precedence. ● For example: /A(?:pril|ugust)/ means “A”, followed by “pril” or “ugust”.
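The effect of grouping on the scope of | can be sketched in Python by comparing the grouped regex with an ungrouped variant (the ungrouped pattern is an assumption, added only for contrast).

```python
import re

grouped = re.compile(r"A(?:pril|ugust)")   # "A" + ("pril" or "ugust")
ungrouped = re.compile(r"April|ugust")     # "April" or "ugust"

grouped_august = grouped.fullmatch("August")
grouped_fragment = grouped.search("ugust")      # no leading "A": no match
ungrouped_fragment = ungrouped.search("ugust")  # second branch matches alone

print(grouped_august.group(), grouped_fragment, ungrouped_fragment.group())
```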
Alternation VS char classes ● A character class (asserted or negated) always matches one and only one character ● The branches of an alternation can be strings of any length (at least one character, to be consistent)
Matching in a DFA /nice|cute/ applied to: “Pandas are cute animals” It scans the string, starting from P, and, at every character, tries to apply the regex. In a DFA regex, the engine only chooses which regex components remain valid at a given position of the text cursor.
Matching in NFA ● NFA also keeps a stack of states! ● Each decision point saves a state in the stack ● State = position of the 2 cursors ● If a choice in the regex leads to no match, the engine backtracks (=pops a state from the stack and makes a different choice)
Performance implications ● In an NFA, a failure is returned only when all the regex paths have been explored ● NFA regexes must be written with performance in mind.
Greedy quantifiers ● All quantifiers can be applied to single characters, classes or even groups ● * = any number of occurrences (even 0) ● ? = 0 or 1 occurrences ● + = 1 or more occurrences ● {n} = exactly n occurrences ● {m,n} = m to n occurrences (inclusive) ● {m,} = at least m occurrences
First example of greedy quantifiers ● Let's consider the regex /be?(er|ar)/ ● How is it applied to “I'd like a chocolate bar” ? ● The regex cursor stays on “b” until the text cursor reaches its “b” too ● Then, the following regex paths are tried: – be => b(er) => b(ar)
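The walkthrough above can be reproduced in Python. The second text, “ber”, is an assumption added to show a case where the optional “e” is first taken and then given back.

```python
import re

pattern = re.compile(r"be?(er|ar)")

# "b" is found at "bar"; "e?" matches nothing; "er" fails; "ar" succeeds.
bar = pattern.search("I'd like a chocolate bar")

# Here "e?" first consumes the "e", both branches fail on "r",
# so the engine backtracks, skips "e?" and matches the branch "er".
ber = pattern.search("ber")

print(bar.group(), bar.group(1), ber.group(), ber.group(1))
```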
Greedy quantifiers and backtracking ● Consider the regex /.* are/ ● Applied to: “Pandas are cute animals” ● .* will consume the whole text at first ● However, when reaching the end of the text, it stops matching and the regex cursor goes on.
Greedy quantifiers and backtracking (2) ● Now, “ “ can't match (no more text is available), so the engine backtracks! ● Some backtracking is performed, until the first available space is reached (between “cute” and “animals”) ● The regex cursor moves on to “a”, that matches the “a” in “animals”. But “r” doesn't match “n” => more backtracking!
Greedy quantifiers and backtracking (3) ● The failures and backtracking go on until the space between “are” and “cute”... “a” doesn't match the “c” in “cute” => backtracking, again! ● The next space is ok: it is followed by “are”, that matches the rest of the regex.
Lazy quantifiers and backtracking ● When applying /.*? are/ to “Pandas are cute animals”, what happens? ● The engine must choose whether to apply .*? to “P”. But it's lazy, so the engine chooses to move the regex cursor forward ● The regex cursor goes on to “ “, but it doesn't match “P” so the engine backtracks ● The engine must now take the remaining path – applying .*? to “P”, which is viable
Lazy quantifiers and backtracking (2) ● This goes on until the first space in the text is reached: it matches the space in the regex, so the regex cursor can go on ● The matching process continues until the regex ends ● In this case, the match of greedy and lazy evaluation was the same – but the lazy quantifiers required less backtracking
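Both walkthroughs can be verified in Python. The second text is a hypothetical variation with two occurrences of “ are”, added to show that greedy and lazy quantifiers do not always agree on what is matched.

```python
import re

text = "Pandas are cute animals"

greedy = re.search(r".* are", text).group()
lazy = re.search(r".*? are", text).group()

# Hypothetical second text, containing " are" twice.
longer = "Pandas are cute and koalas are shy"

greedy_longer = re.search(r".* are", longer).group()   # stops at the last " are"
lazy_longer = re.search(r".*? are", longer).group()    # stops at the first " are"

print(greedy, lazy, greedy_longer, lazy_longer, sep=" | ")
```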
Apply or skip? Greedy VS Lazy ● When a quantifier is encountered, the regex engine must choose whether to apply its element to the text or not ● Greedy quantifiers prefer the “apply” path whenever possible ● Lazy quantifiers prefer the “skip” path whenever possible ● Choosing greedy VS lazy quantifiers can impact performance and what is matched but, as long as backtracking is not restricted (for example by atomic groups), not the presence/absence of a match.
Greedy VS Lazy: an example ● Given the text “987”: – /\d{1,3}/ matches the whole “987”: the greedy quantifier tries to consume as much as possible – /\d{1,3}?/ matches just “9”: the lazy quantifier must honour the constraints (at least 1 match), but chooses to skip application whenever possible
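The “987” example, sketched in Python. The last line is an addition: with re.fullmatch the whole text must be matched, so even the lazy quantifier is forced to take all three digits.

```python
import re

greedy = re.search(r"\d{1,3}", "987").group()    # consumes as much as possible
lazy = re.search(r"\d{1,3}?", "987").group()     # stops after the mandatory digit

# When the rest of the match requires it, a lazy quantifier still extends.
lazy_forced = re.fullmatch(r"\d{1,3}?", "987").group()

print(greedy, lazy, lazy_forced)
```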
Atomic grouping ● (?> and ) define an atomic group ● All the states created within an atomic group are removed from the engine's stack as soon as the group closes ● Atomic groups are non-capturing, but can contain capturing groups ● Atomic grouping can alter the match/failure result of a regex, as well as affecting performance
Possessive quantifiers ● Obtained by adding a “+” to greedy quantifiers ● Possessive quantifiers are equivalent to greedy quantifiers wrapped within an atomic group. ● For example: /\d++/ = /(?>\d+)/
Regex flags ● Regex engines can turn features on and off, for customized behaviour ● Enabling and disabling flags usually affects the whole regex, but some engines support flags on just part of it. ● Flag manipulation is engine- and API-dependent ● Every engine has its own flags, but some are common to most engines.
Most common regex flags ● Case insensitive ● Dot-all: . matches any character, including \n ● Multiline anchors: ^ and $ (see later) work on lines instead of the whole text ● Extended: spaces, including newlines, are ignored unless escaped or within a character class; an unescaped # starts a comment that runs to the end of the line. More readable regexes.
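The four flags map directly onto constants of Python's re module. The sketch below uses hypothetical sample text; flag names and spelling differ in other engines.

```python
import re

text = "First line\nsecond LINE"

# Case insensitive
case_insensitive = re.findall(r"line", text, flags=re.IGNORECASE)

# Dot-all: by default "." stops at the newline
dot_default = re.search(r"First.*LINE", text)
dot_all = re.search(r"First.*LINE", text, flags=re.DOTALL)

# Multiline anchors: "^" also matches right after each newline
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Extended (re.VERBOSE): whitespace is ignored and "#" starts a comment
year_month = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    """,
    re.VERBOSE,
)
parts = year_month.search("on 2024-05").groups()

print(case_insensitive, dot_default, dot_all is not None, line_starts, parts)
```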
Common anchors ● ^: the cursor is at the beginning of the text (of a line, in multiline mode) ● $: the cursor is at the end of the text (of a line, in multiline mode). Does it match before or after a final \n? Know your engine ● \A: the cursor is at the beginning of the text ● \Z: the cursor is at the end of the text (whether a final \n is tolerated depends on the engine) ● \b: the cursor is at a word boundary (what's a word boundary? Know your engine)
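A sketch of how the anchors behave in Python's re module specifically; other engines differ, in particular on \Z and on the final newline. The sample texts are assumptions.

```python
import re

text = "one\ntwo\n"

# Without multiline mode, ^ and $ refer to the whole text.
whole_text = re.findall(r"^\w+$", text)

# With re.MULTILINE they work line by line.
per_line = re.findall(r"^\w+$", text, flags=re.MULTILINE)

# In Python, $ also matches just before a final newline...
dollar = re.search(r"two$", text)

# ...while \Z matches only at the very end of the text.
strict_end = re.search(r"two\Z", text)

# \b: word boundary
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(whole_text, per_line, dollar is not None, strict_end, whole_word)
```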
Lookaround ● Lookaround = a regex-based condition on the text cursor. Can be positive (the regex must match) or negative (the regex must fail). ● Lookahead = a lookaround on the text following the cursor ● Lookbehind = a lookaround on the text preceding the cursor.
Lookaround basics ● Their position in the regex matters, as the other characters in the regex consume the text and make the text cursor shift forward. ● On the other hand, lookarounds do not consume text ● Juxtaposed lookarounds all apply, combined by a logical AND, to the position marked by the text cursor
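A minimal sketch of juxtaposed lookaheads acting as a logical AND, plus one lookbehind, in Python. The rule being checked (at least six word characters, one digit, one lowercase letter) is a made-up example.

```python
import re

# Both lookaheads are evaluated at the start of the text and consume nothing;
# only \w{6,} actually moves the text cursor.
rule = re.compile(r"^(?=.*\d)(?=.*[a-z])\w{6,}$")

both = rule.search("abc123")
no_digit = rule.search("abcdef")
no_lower = rule.search("123456")

# Lookbehind: digits only when preceded by "$"; the "$" is not part of the match.
prices = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

print(both.group(), no_digit, no_lower, prices)
```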
Lookaround limitations ● Lookarounds behave like nested regexes having their own stack ● They are also called zero-length assertions ● Lookaheads can be full-fledged regexes ● Lookbehinds are usually much more restricted, depending on the engine
Lookarounds and the stack ● Each lookaround maintains its own stack, that gets deleted at the end of the lookaround. ● An important detail: capturing groups within lookarounds are considered capturing groups of the whole regex => their result is saved.
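The “important detail” can be observed directly in Python: a group captured inside a lookahead remains available after the lookahead has finished, even though the lookahead itself consumed nothing. The sample word is an assumption.

```python
import re

# The lookahead captures the whole word without consuming it;
# the trailing \w then consumes a single character.
m = re.search(r"(?=(\w+))\w", "panda")

consumed = m.group(0)   # what the regex actually consumed
captured = m.group(1)   # what the lookahead's capturing group saved

print(consumed, captured)
```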
Lookahead + Backreference = Atomic group ● Lookaheads are full-fledged regexes with their own stack, which is thrown away. ● This is exactly like an atomic group, but the lookahead does not consume text ● However, capturing groups in a lookahead are stored by the regex => use a backreference to capture that text ● Therefore, for example: /(?=(\d+))\1/ = /(?>\d+)/
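The equivalence can be tried out in Python without relying on native atomic-group support. The sketch uses a named group (n, a hypothetical name) rather than \1, so that the digit following the backreference is not read as part of the group number.

```python
import re

# Plain greedy quantifier: \d+ gives back the final "5" so the literal 5 can match.
plain = re.search(r"\d+5", "12345")

# Emulated atomic group: the lookahead captures all the digits, the
# backreference consumes exactly that text, and nothing can be given back.
atomic_like = re.search(r"(?=(?P<n>\d+))(?P=n)5", "12345")

# When no backtracking is needed, both forms match the same text.
with_unit = re.search(r"(?=(?P<n>\d+))(?P=n)px", "width: 12px")

print(plain.group(), atomic_like, with_unit.group())
```

This also shows how atomic grouping can turn a match into a failure, not just change performance.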
Regexes and C# ● .NET encapsulates regexes in a class, System.Text.RegularExpressions.Regex ● Its constructor accepts the regex and, optionally, global flags ● C# supports verbatim strings (preceded by @), which avoid the over-escaping that is needed in Java.
Regexes and Java ● Java's regex class is java.util.regex.Pattern ● Instead of a constructor, a static method, Pattern.compile(), creates a regex ● It takes the regex and, optionally, the global flags ● In Java, the regex /\\test/ becomes “\\\\test”, because each “\” in the regex must be escaped in Java, too, for a total of 4 “\”.
Regexes in MongoDB ● MongoDB supports regexes ● Just use /regex/ (with slashes and without double quotes) as the right side of an equality assertion in your query ● Important: a regex query can use an index on the field, but the best results are achieved when the regex starts with ^
Regexes in Python ● Python provides the standard module re ● To create a regex, just use re.compile(), which takes, as usual, the regex string and the optional global flags
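As an illustration of re.compile(), here is a sketch of the “dir and basename” task mentioned among the examples. It assumes POSIX-style paths with at least one “/”; the group names and the sample path are hypothetical, and the standard os.path functions remain the more robust choice for real code.

```python
import re

# Greedy .* runs to the end and backtracks to the last "/",
# so "base" receives only the final path component.
path_re = re.compile(r"^(?P<dir>.*)/(?P<base>[^/]+)$")

m = path_re.match("/var/log/app/server.log")
directory = m.group("dir")
basename = m.group("base")

# Flags are passed as the optional second argument.
ext_re = re.compile(r"\.LOG$", re.IGNORECASE)
is_log = ext_re.search(basename) is not None

print(directory, basename, is_log)
```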
Regexes in JavaScript ● In JavaScript, it's quite common to use this notation to create a regex object: var regex = /regexPattern/; var regexWithFlags = /regexPattern/flags; ● Alternatively, the RegExp class can be used
Final notes ● Don't forget that regexes must be kept simple, just like any other construct ● To achieve this result, a good knowledge of the text, as well as of the requirements, is needed. ● Write tests for your regexes
Further references ● “Mastering Regular Expressions” - by Jeffrey E. F. Friedl, published by O'Reilly Media ● http://regex101.com/ ● http://rubular.com/ ● http://www.regular-expressions.info/