Introduction to regular expressions

Slide 1

Slide 1 text

Gianluca Costa Introduction to regular expressions

Slide 2

Slide 2 text

Before starting Regular expressions are a tool: it's up to you to use them wisely. Like every tool, they require: ● Practice ● Tests ● Patience

Slide 3

Slide 3 text

Why “regular expressions”? ● 1956: mathematical definition of regular sets by Stephen Cole Kleene ● 1968: “Regular Expression Search Algorithm” - by Ken Thompson. Description of a regular expression compiler. ● Regular expressions employed in text editors. Introduction of the grep command.

Slide 4

Slide 4 text

Examples of text matching ● Given an IIS log, keep just the requests to the web app “/PicnicAPI” ● Perform LIKE queries on MongoDB ● Get the dir and basename of a file path ● Get the src attribute of an <img> tag ● Read a key-value file having “\” line continuations

Slide 5

Slide 5 text

Generalized problems ● Determine if a pattern is contained in (matches) a given string ● Extract substrings from a matching string ● Replace one or more substrings ● Generalizable to files and streams
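
The three generalized problems can be sketched in a few lines. This is only an illustration, assuming Python 3 and its standard re module; the log line, the path and the variable names are made up for the example.

```python
import re

log_line = "GET /PicnicAPI/items 200"   # hypothetical log line
path = "/var/log/app.log"               # hypothetical file path

# 1. Test: is the pattern contained in the string?
is_picnic_request = re.search(r"/PicnicAPI", log_line) is not None

# 2. Extract: capture the directory and the base name of the path.
m = re.fullmatch(r"(.*)/([^/]+)", path)
directory, basename = m.group(1), m.group(2)

# 3. Replace: collapse every run of whitespace into a single space.
collapsed = re.sub(r"\s+", " ", "too    many   spaces")
```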

Slide 6

Slide 6 text

Regular expressions Regular expressions describe text patterns. For example: “At least 3 digits, but not more than 5”.

Slide 7

Slide 7 text

A simple example /\d{3,5}/ Matches “3482”, but not “Hello”
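
A minimal sketch of this example, assuming Python 3 and its standard re module (other engines expose similar calls under different names). It also shows that "not more than 5" only holds for the whole string when a full match is requested.

```python
import re

digits = re.compile(r"\d{3,5}")

matches_number = digits.search("3482") is not None   # True
matches_word = digits.search("Hello") is not None    # False

# search() looks for the pattern anywhere in the text, so a longer run of
# digits still matches: the engine simply stops after five of them.
first_five = digits.search("123456789").group()      # "12345"

# To reject longer runs, require the pattern to cover the whole string.
too_long_rejected = digits.fullmatch("123456") is None
```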

Slide 8

Slide 8 text

How to apply regexes ● Functions/classes provided by programming languages/frameworks ● Command-line tools (sed, awk, egrep, …) ● Other interfaces (eg: MongoDB queries)

Slide 9

Slide 9 text

Interactive testing ● http://regex101.com/ - currently provides a free multi-engine test environment, explaining your regex and showing the matches on a text. ● http://rubular.com/ - another regex test environment, targeting Ruby's flavour.

Slide 10

Slide 10 text

The dualism regex-target The regular expression is applied to a string, to check for a match. Both the regex and the string have their own cursor. Which cursor drives the matching process? Text: “The quick br” Regex: “qui”

Slide 11

Slide 11 text

Engine types ● DFA ● Traditional NFA ● POSIX NFA ● Hybrid solutions

Slide 12

Slide 12 text

DFA ● Matching is driven by the cursor on the text ● Very fast matching ● Takes longer to compile ● Takes more memory ● Declarative regex ● Always returns the longest possible match.

Slide 13

Slide 13 text

Traditional NFA ● Matching is driven by the cursor on the regex ● Creates a stack of states, and performs backtracking ● Supports more language constructs ● Imperative regex ● Usually returns the first match found ● Employed by standard Java, .NET, Python, PHP, Perl, Ruby, …

Slide 14

Slide 14 text

POSIX NFA ● Very, very similar to traditional NFA, but returns the longest possible match. ● Further performance issues!

Slide 15

Slide 15 text

Hybrid solutions Double engine: first-scan with DFA, then scan with NFA if required by the pattern. Further implementations are possible.

Slide 16

Slide 16 text

Our target: NFAs ● DFAs are less common than NFAs, their syntax is almost a subset and they are generally simpler. ● We will concentrate on NFA regexes

Slide 17

Slide 17 text

Know your engine There are common rules, but several engines. Every engine has its own implementation. You must know your engine. And write tests.

Slide 18

Slide 18 text

Regex basics Literal text, such as /rain/, matches if and only if the string contains that exact sequence somewhere, matched character after character.

Slide 19

Slide 19 text

The first rule of matching Matching starts from the leftmost character. Therefore, when /rain/ is applied to “The rainbow shines after the rain”, the match is the “rain” inside “rainbow”, not the final word.
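
The leftmost rule can be checked directly. A small sketch, assuming Python 3 and its standard re module:

```python
import re

text = "The rainbow shines after the rain"
m = re.search(r"rain", text)

# The reported span points inside "rainbow", the leftmost occurrence.
span = m.span()
following = text[m.end():m.end() + 3]   # the characters right after the match
```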

Slide 20

Slide 20 text

The second rule of matching The engine returns a success if and only if the regex cursor reaches the end of the regex.

Slide 21

Slide 21 text

Escaping characters ● Some characters (\, *, ?, +, ., (, ), [, ], {, }, |, ^, $, #) must be escaped when they are used literally ● Escape is performed by prepending “\”. For example: /\?/ to represent a literal “?” ● Where raw strings are not supported, a double escape might be required. In Java, the regex /\\\+/ becomes: “\\\\\\+”.
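
The following sketch shows the effect of escaping, assuming Python 3 and its standard re module. Python offers raw strings (the r prefix), so the double escape is needed only with ordinary string literals; re.escape() is a helper of that module which escapes a literal for you.

```python
import re

text = "Is 1+1=2? (yes)"

# Unescaped, "+" and "?" act as quantifiers, so the literal text is not found.
unescaped = re.search("1+1=2?", text)

# Escaped by hand, or automatically with re.escape().
by_hand = re.search(r"1\+1=2\?", text).group()
automatic = re.search(re.escape("1+1=2?"), text).group()

# The regex \\\+ (a literal backslash followed by a literal plus):
raw = r"\\\+"            # raw string: written exactly as the regex
doubled = "\\\\\\+"      # ordinary string: every backslash is doubled
same_pattern = raw == doubled
matches_backslash_plus = re.fullmatch(raw, "\\+") is not None
```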

Slide 22

Slide 22 text

Escape sequences ● \r ● \n ● \v ● \f ● \t ● They work just like in C

Slide 23

Slide 23 text

Character classes ● [abc] = “a, b or c in this position” ● [a-z] = “a, b, c, …, z here” ● [A-Za-z] = “A, B, …, Z, a, b, …, z here” ● [A-Za-z0] = “A, …, Z, a, …, z, 0 here” ● [A-Z\-] = [-A-Z] = “A, …, Z or – here” ● What about accents? (é, è, …) And cedilla? ● Know your engine.

Slide 24

Slide 24 text

Negated character classes ● [^ab] = “Something not a and not b here” ● [^a-z] = “Something not a, b, c, …, z here” ● [^A-Za-c] = “Something not A, …, Z, a, …, c here” ● Negating a character set requires the existence of a character in that position, not belonging to the specified class.
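
A short sketch of the last point, assuming Python 3 and its standard re module: a negated class (written with ^ right after the opening bracket) still has to consume one character, so it cannot match at the end of the text.

```python
import re

# Every character that is neither "a" nor "b".
others = re.findall(r"[^ab]", "abcab")

# "q not followed by u" written with a negated class needs a character
# after the "q": at the very end of the text there is none.
at_end = re.search(r"q[^u]", "Iraq")
in_middle = re.search(r"q[^u]", "Iraq is").group()
```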

Slide 25

Slide 25 text

Common character classes ● \d = a digit ● \D = [^\d] ● \w = a letter, a digit or “_” ● \W = [^\w] ● \s = a space character ● \S = [^\s] ● . = any character except newline

Slide 26

Slide 26 text

What are letters and spaces? ● The answer depends on the encoding and on your engine. ● In ASCII, usually: – \w = [A-Za-z0-9_] – \s = [\r \n\t\f\v] (includes ASCII-32 common space) ● But what about Latin-1 or Unicode? ● Know your engine
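
As an example of "know your engine", this is how Python 3's standard re module behaves (other engines may differ): with text patterns, \w and \s are Unicode-aware by default, and the re.ASCII flag restricts them to the ASCII definitions.

```python
import re

# Default (Unicode) behaviour for str patterns.
accent_is_word = re.fullmatch(r"\w", "é") is not None
nbsp_is_space = re.fullmatch(r"\s", "\u00a0") is not None   # no-break space

# Same checks with the ASCII-only definitions.
accent_is_word_ascii = re.fullmatch(r"\w", "é", re.ASCII) is not None
nbsp_is_space_ascii = re.fullmatch(r"\s", "\u00a0", re.ASCII) is not None
```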

Slide 27

Slide 27 text

Unicode character classes ● \uXXXX: matches the Unicode code point whose hex value is XXXX ● There should also be support for Unicode's categories and scripts, especially via \p ● Much more Unicode-related, non-standard features ● Know your engine

Slide 28

Slide 28 text

Capturing groups ● ( and ) define a capturing group ● Capturing groups are assigned a 1-based index, according to the position of their ( ● /(\w+)bet/ tries to match a string and, if successful, creates a capturing group for the text matching \w+, having index 1 ● If the above regex is applied to “alphabet”, it matches and its group 1 is “alpha”
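
A small sketch of group numbering, assuming Python 3 and its standard re module. The second pattern is an extra illustration, not taken from the slides, showing that nested groups are numbered by the position of their opening parenthesis.

```python
import re

m = re.fullmatch(r"(\w+)bet", "alphabet")
whole = m.group(0)    # group 0 is always the whole match
first = m.group(1)    # text matched by (\w+)

# Nested groups: the outer "(" comes first, so it gets index 1.
nested = re.fullmatch(r"((\d+)-(\d+))", "10-20").groups()
```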

Slide 29

Slide 29 text

Non-capturing groups ● Groups can just be used to clarify precedence: capturing is not always needed ● Skipping capturing can save memory and speed up the matching process ● To define a non-capturing group, use (?: and ). ● Therefore, /(?:\w+)bet/ is just like /\w+bet/, as no capturing is performed and this grouping alters precedence without effects.

Slide 30

Slide 30 text

Backreferences ● Backreference = the content of a capturing group that becomes part of the regex ● Use \N in your regex, replacing N with the index of the captured group in question ● For example: /(['"])\w+\1/ matches a word enclosed in a pair of identical quotes, either single or double ● Some engines support named capturing and backreferences
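
A sketch of the quote-pairing example, assuming Python 3 and its standard re module. The named variant uses Python's own syntax, (?P<name>...) and (?P=name); other engines spell named groups differently.

```python
import re

quoted = re.compile(r"""(['"])\w+\1""")

paired = quoted.search("say 'hello' now").group()
mismatched = quoted.search("'hello\"")   # opens with ' but closes with "

# Same idea with a named group and a named backreference.
named = re.compile(r"""(?P<quote>['"])\w+(?P=quote)""")
named_match = named.search('say "hi"').group()
```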

Slide 31

Slide 31 text

Alternation ● Alternatives are separated by | ● For example: /alpha|beta/ means “alpha” or “beta” ● Alternation has very low precedence; its scope is the current group: use grouping to force precedence. ● For example: /A(?:pril|ugust)/ means “A”, followed by “pril” or “ugust”.
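
The precedence of | is easy to get wrong, so here is a quick check, assuming Python 3 and its standard re module. Without the group, the two branches are the whole strings on either side of the bar.

```python
import re

grouped = re.compile(r"A(?:pril|ugust)")
ungrouped = re.compile(r"April|ugust")    # means "April" or "ugust"

grouped_august = grouped.fullmatch("August") is not None
ungrouped_august = ungrouped.fullmatch("August") is not None
ungrouped_ugust = ungrouped.fullmatch("ugust") is not None
```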

Slide 32

Slide 32 text

Alternation VS char classes ● A character class (asserted or negated) always matches one and only one character ● The branches of an alternation can be strings of any length (at least one character, to be consistent)

Slide 33

Slide 33 text

Matching in a DFA /nice|cute/ applied to: “Pandas are cute animals” It scans the string, starting from P, and, at every character, tries to apply the regex. In a DFA regex, the engine only chooses which regex components remain valid at a given position of the text cursor.

Slide 34

Slide 34 text

Matching in NFA ● NFA also keeps a stack of states! ● Each decision point saves a state in the stack ● State = position of the 2 cursors ● If a choice in the regex leads to no match, the engine backtracks (=pops a state from the stack and makes a different choice)
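
To make backtracking concrete, here is a toy matcher written for this explanation. It is not how any real engine is implemented: it only understands literal characters, "." and a greedy "*", and it relies on Python's call stack instead of an explicit stack of states. The function names are invented for the sketch.

```python
def match_here(pattern: str, text: str) -> bool:
    """Return True if pattern matches at the very start of text."""
    if not pattern:
        return True   # the regex cursor reached the end of the regex
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_star(char: str, rest: str, text: str) -> bool:
    """Greedy star: consume as much as possible, then give back."""
    consumed = 0
    while consumed < len(text) and char in (".", text[consumed]):
        consumed += 1
    # Backtracking: try the rest of the regex, giving back one
    # character at a time until it matches or nothing is left.
    while consumed >= 0:
        if match_here(rest, text[consumed:]):
            return True
        consumed -= 1
    return False


def search(pattern: str, text: str) -> bool:
    """Try every starting position, from the leftmost one."""
    return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))
```

For instance, search(".* are", "Pandas are cute animals") succeeds only after match_star has given back most of the text it first consumed.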

Slide 35

Slide 35 text

Backtracking [Diagram: states S1 to S8, connected by steps numbered 1 to 11]

Slide 36

Slide 36 text

Performance implications ● In NFA, a failure is returned only when all the regex paths have been explored ● NFA regexes must be written with performance in mind.

Slide 37

Slide 37 text

Alternation in NFA ● Ordered in most implementations. ● Affects both what is matched and performance. ● Know your engine
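
Python's standard re module is one engine with ordered alternation, so it can be used to illustrate the point: the first branch that leads to an overall match wins, even when a later branch would match more text. The sample words are arbitrary.

```python
import re

short_first = re.search(r"cat|category", "category").group()
long_first = re.search(r"category|cat", "category").group()
```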

Slide 38

Slide 38 text

Greedy quantifiers ● All quantifiers can be applied to single characters, classes or even groups ● * = any number of occurrences (even 0) ● ? = 0 or 1 occurrences ● + = 1 or more occurrences ● {n} = exactly n occurrences ● {m,n} = m to n occurrences (inclusive) ● {m,} = at least m occurrences

Slide 39

Slide 39 text

First example of greedy quantifiers ● Let's consider the regex /be?(er|ar)/ ● How is it applied to “I'd like a chocolate bar” ? ● The regex cursor stays on “b” until the text cursor reaches its “b” too ● Then, the following regex paths are tried: – be => b(er) => b(ar)

Slide 40

Slide 40 text

Greedy quantifiers and backtracking ● Consider the regex /.* are/ ● Applied to: “Pandas are cute animals” ● .* will consume the whole text at first ● However, when reaching the end of the text, it stops matching and the regex cursor goes on.

Slide 41

Slide 41 text

Greedy quantifiers and backtracking (2) ● Now, “ “ can't match (no more text is available), so the engine backtracks! ● Some backtracking is performed, until the first available space is reached (between “cute” and “animals”) ● The regex cursor moves on to “a”, that matches the “a” in “animals”. But “r” doesn't match “n” => more backtracking!

Slide 42

Slide 42 text

Greedy quantifiers and backtracking (3) ● The failures and backtracking go on until the space between “are” and “cute”... “a” doesn't match the “c” in “cute” => backtracking, again! ● The next space is ok: it is followed by “are”, that matches the rest of the regex.

Slide 43

Slide 43 text

Pandas are cute animals! ^__^!

Slide 44

Slide 44 text

Lazy quantifiers ● Quantifiers become lazy if followed by a ? ● *? ● ?? ● +? ● {m,n}? ● {m,}? ● {n} cannot be lazy: it indicates a precise n

Slide 45

Slide 45 text

Lazy quantifiers and backtracking ● When applying /.*? are/ to “Pandas are cute animals”, what happens? ● The engine must choose whether to apply .*? to “P”. But it's lazy, so the engine chooses to move the regex cursor forward ● The regex cursor goes on to “ “, but it doesn't match “P” so the engine backtracks ● The engine must now take the remaining path – applying .*? to “P”, which is viable

Slide 46

Slide 46 text

Lazy quantifiers and backtracking (2) ● This goes on until the first space in the text is reached: it matches the space in the regex, so the regex cursor can go on ● The matching process continues until the regex ends ● In this case, the match of greedy and lazy evaluation was the same – but the lazy quantifiers required less backtracking
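
The walkthrough can be verified with Python 3's standard re module. The second text is an extra example, not from the slides, where the two quantifiers no longer return the same match.

```python
import re

text = "Pandas are cute animals"
greedy = re.search(r".* are", text).group()
lazy = re.search(r".*? are", text).group()

# With two occurrences of " are", greedy stops at the last one
# and lazy at the first one.
longer = "Pandas are cute and koalas are cute"
greedy_longer = re.search(r".* are", longer).group()
lazy_longer = re.search(r".*? are", longer).group()
```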

Slide 47

Slide 47 text

Apply or skip? Greedy VS Lazy ● When a quantifier is encountered, the regex engine must choose whether to apply its element to the text or not ● Greedy quantifiers prefer the “apply” path whenever possible ● Lazy quantifiers prefer the “skip” path whenever possible ● Choosing greedy VS lazy quantifiers can impact performance and what is matched, but not the presence/absence of a match.

Slide 48

Slide 48 text

Greedy VS Lazy: an example ● Given the text “987”: – /\d{1,3}/ matches the whole “987”: the greedy quantifier tries to consume as much as possible – /\d{1,3}?/ matches just “9”: the lazy quantifier must honour the constraints (at least 1 match), but chooses to skip application whenever possible
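
The same example in code, assuming Python 3 and its standard re module. The last line illustrates that laziness changes what is matched but not whether a match exists: when the whole string must be covered, the lazy quantifier is forced to take all three digits.

```python
import re

greedy = re.search(r"\d{1,3}", "987").group()
lazy = re.search(r"\d{1,3}?", "987").group()

# fullmatch() only succeeds if the pattern covers the entire string.
lazy_forced = re.fullmatch(r"\d{1,3}?", "987").group()
```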

Slide 49

Slide 49 text

Atomic grouping ● (?> and ) define an atomic group ● All the states created within an atomic group are removed from the engine's stack as soon as the group closes ● Atomic groups are non-capturing, but can contain capturing groups ● Atomic grouping can alter the match/failure result of a regex, as well as affecting performance

Slide 50

Slide 50 text

Possessive quantifiers ● Obtained by adding a “+” to greedy quantifiers ● Possessive quantifiers are equivalent to greedy quantifiers wrapped within an atomic group. ● For example: /\d++/ = /(?>\d+)/

Slide 51

Slide 51 text

Regex flags ● Regex engines can turn on/off features, for customized behaviour ● Enabling and disabling flags usually affects the whole regex, but some engines support flags on just regions. ● Flag manipulation is engine- and API- dependent ● Every engine has its own flags, but some are definitely common.

Slide 52

Slide 52 text

Most common regex flags ● Case insensitive ● Dot-all: . matches any character, including \n ● Multiline anchors: ^ and $ (see later) work on lines instead of the whole text ● Extended: whitespace, including newlines, is ignored unless escaped or within a character class; a # starts a comment that runs to the end of the line. More readable regexes.
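
As one concrete case, Python 3's standard re module names these flags re.IGNORECASE, re.DOTALL, re.MULTILINE and re.VERBOSE; other engines use different names and APIs. The sample strings are made up for the sketch.

```python
import re

case_insensitive = re.search(r"panda", "PANDA", re.IGNORECASE) is not None

dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None

first_words_default = re.findall(r"^\w+", "one\ntwo")
first_words_multiline = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# Extended mode: layout and comments are ignored, escaped spaces are kept.
quantity = re.compile(r"""
    \d{3,5}    # at least 3 digits, but not more than 5
    \ units    # an escaped space, then the literal word
""", re.VERBOSE)
extended = quantity.search("3482 units").group()
```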

Slide 53

Slide 53 text

Anchors ● Anchors do not consume text: they are basic conditions on the text cursor. ● They must be verified for the regex to match

Slide 54

Slide 54 text

Common anchors ● ^: the cursor is at the beginning of the text (of a line, in multiline mode) ● $: the cursor is at the end of the text (of a line, in multiline mode. And before or after \n? Know your engine). ● \A: the cursor is at the beginning of the text ● \Z: the cursor is at the end of the text ● \b: the cursor is at a word boundary (what's a word boundary? Know your engine)
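
The "know your engine" remarks are not rhetorical. In Python 3's standard re module, for example, $ also matches just before a trailing newline, while \Z matches only at the very end; other engines define \Z differently. A short sketch:

```python
import re

whole_words = re.findall(r"\bcat\b", "cat concat cats")

starts_with_rain = re.search(r"^rain", "The rain") is not None
ends_with_rain = re.search(r"rain$", "The rain") is not None

# Behaviour with a trailing newline, specific to Python's re module.
dollar_before_newline = re.search(r"rain$", "The rain\n") is not None
big_z_before_newline = re.search(r"rain\Z", "The rain\n") is not None
```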

Slide 55

Slide 55 text

Lookaround ● Lookaround = a regex-based condition on the text cursor. Can be positive (the regex must match) or negative (the regex must fail). ● Lookahead = a lookaround on the text following the cursor ● Lookbehind = a lookaround on the text preceding the cursor.

Slide 56

Slide 56 text

Lookaround notation
            Lookbehind     Lookahead
Positive    (?<=regex)     (?=regex)
Negative    (?<!regex)     (?!regex)

Slide 57

Slide 57 text

Lookaround basics ● Their position in the regex matters, as the other characters in the regex consume the text and make the text cursor shift forward. ● On the other hand, lookarounds do not consume text ● Juxtaposed lookarounds all apply to the position marked by the text cursor, combined by a logical AND
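
A few lookarounds in action, assuming Python 3 and its standard re module. The strings and the "at least one digit and one lowercase letter" rule are invented for the illustration.

```python
import re

# Positive lookbehind: digits preceded by "$"; the "$" is not consumed.
price = re.search(r"(?<=\$)\d+", "cost: $42").group()

# Negative lookbehind: whole numbers not preceded by "$".
plain_numbers = re.findall(r"(?<!\$)\b\d+", "$42 and 17")

# Negative lookahead: "foo" not followed by "bar".
foo_end = re.search(r"foo(?!bar)", "foobar foobaz").end()

# Juxtaposed lookaheads are checked at the same position (logical AND):
# at least one digit AND at least one lowercase letter, 6+ characters.
rule = re.compile(r"(?=.*\d)(?=.*[a-z]).{6,}")
accepted = rule.fullmatch("abc123") is not None
rejected = rule.fullmatch("abcdef") is not None
```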

Slide 58

Slide 58 text

Lookaround limitations ● Lookarounds behave like nested regexes having their own stack ● They are also called zero-length assertions ● Lookaheads can be full-fledged regexes ● Lookbehinds are usually much more restricted, depending on the engine

Slide 59

Slide 59 text

Lookarounds and the stack ● Each lookaround maintains its own stack, that gets deleted at the end of the lookaround. ● An important detail: capturing groups within lookarounds are considered capturing groups of the whole regex => their result is saved.

Slide 60

Slide 60 text

Lookahead + Backreference = Atomic group ● Lookaheads are full-fledged regexes with their own stack, which is thrown away. ● This is exactly like an atomic group, but the lookahead does not consume text ● However, capturing groups in a lookahead are stored by the regex => use a backreference to capture that text ● Therefore, for example: /(?=(\d+))\1/ = /(?>\d+)/
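
This equivalence is handy on engines or versions that lack the (?> syntax. The sketch below, assuming Python 3 and its standard re module, shows that the lookahead-plus-backreference form refuses to give digits back, exactly as an atomic group would.

```python
import re

# Plain greedy quantifier: \d+ gives back one digit so the final \d can match.
with_backtracking = re.search(r"\d+\d", "123").group()

# Emulated atomic group: the lookahead captures all the digits, \1 consumes
# them, and no saved state allows a shorter capture, so the final \d fails.
emulated_atomic = re.search(r"(?=(\d+))\1\d", "123")
```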

Slide 61

Slide 61 text

Regexes and C# ● .NET encapsulates regexes in a class, System.Text.RegularExpressions.Regex ● Its constructor accepts the regex and, optionally, global flags ● C# supports verbatim strings (prefixed with @), which avoid the over-escaping required in Java.

Slide 62

Slide 62 text

Regexes and Java ● Java's regex class is java.util.regex.Pattern ● In lieu of a constructor, it's a static method, Pattern.compile(), that creates a regex ● It takes the regex and, optionally, the global flags ● In Java, the regex /\\test/ becomes “\\\\test”, because each “\” in the regex must be escaped in Java, too, for a total of 4 “\”.

Slide 63

Slide 63 text

Regexes in MongoDB ● MongoDB supports regexes ● Just use /regex/ (with slashes and without double quotes) as the right side of an equality assertion in your query ● Important: a regex could hit indexes on a field, but the best results are achieved when the regex starts with ^

Slide 64

Slide 64 text

Regexes in Python ● Python provides the standard module re ● To create a regex, just use re.compile(), that takes, as usual, the regex string and the optional global flags
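
A brief usage sketch of a compiled pattern with Python 3's standard re module; the pattern and the sample strings are arbitrary.

```python
import re

# Compile once, with an optional flag, then reuse the pattern object.
word = re.compile(r"[a-z]+", re.IGNORECASE)

words = word.findall("Pandas ARE cute")
masked = word.sub("x", "a1b2")
spans = [m.span() for m in word.finditer("ab cd")]
```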

Slide 65

Slide 65 text

Regexes in JavaScript ● In JavaScript, it's quite common to use this notation to create a regex object: var regex = /regexPattern/; var regexWithFlags = /regexPattern/flags; ● Alternatively, the RegExp class can be used

Slide 66

Slide 66 text

Final notes ● Don't forget that regexes must be kept simple, just like any other construct ● To achieve this result, a good knowledge of the text, as well as of the requirements, is needed. ● Write tests for your regexes

Slide 67

Slide 67 text

Further references ● “Mastering Regular Expressions” - by Jeffrey E. F. Friedl, published by O'Reilly Media ● http://regex101.com/ ● http://rubular.com/ ● http://www.regular-expressions.info/