Introduction to regular expressions

The slides of my brown-bag session dedicated to introducing regular expressions.

Gianluca Costa

April 24, 2014

Transcript

  1. Gianluca Costa
    Introduction to regular
    expressions

  2. Before starting
    Regular expressions are a tool:
    it's up to you to use them wisely.
    Like every tool, they require:
    Practice
    Tests
    Patience

  3. Why “regular expressions”?

    1956: mathematical definition of regular
    sets by Stephen Cole Kleene

    1968: “Regular Expression Search
    Algorithm” - by Ken Thompson. Description
    of a regular expression compiler.

    Regular expressions employed in text
    editors. Introduction of the grep command.

  4. Examples of text matching

    Given an IIS log, keep just the requests to
    the web app “/PicnicAPI”

    Perform LIKE queries on MongoDB

    Get the dir and basename of a file path

    Get the src attribute of an <img> tag

    Read a key-value file having “\” line
    continuations

  5. Generalized problems

    Determine whether a pattern is contained in
    (matches) a given string

    Extract substrings from a matching string

    Replace one or more substrings

    Generalizable to files and streams
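
The three generalized problems map onto functions that most regex libraries provide. A minimal sketch using Python's standard re module; the sample path and the variable names are made up for illustration:

```python
import re

path = "/var/log/iis/access.log"  # hypothetical sample input

# 1. Does the pattern match somewhere in the string?
contains_log = re.search(r"\.log$", path) is not None

# 2. Extract substrings: split the path into directory and base name.
m = re.fullmatch(r"(.*)/([^/]+)", path)
directory, basename = m.group(1), m.group(2)

# 3. Replace one or more substrings.
renamed = re.sub(r"\.log$", ".txt", path)

print(contains_log, directory, basename, renamed)
```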

  6. Regular expressions
    Regular expressions describe text
    patterns.
    For example:
    “At least 3 digits, but not more than 5”.

  7. A simple example
    /\d{3,5}/
    Matches “3482”, but not “Hello”
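
As a quick check, the same pattern can be tried with Python's re module (chosen here only for illustration). Note that a search alone does not enforce "not more than 5": it happily finds five digits inside a longer run, so the whole string has to be matched to get that guarantee.

```python
import re

pattern = re.compile(r"\d{3,5}")

print(pattern.search("3482"))    # a match object covering "3482"
print(pattern.search("Hello"))   # None: there are no digits at all

# search() looks for the pattern anywhere in the text, so it finds
# the first five digits of a longer run.
print(pattern.search("1234567").group())   # 12345

# fullmatch() requires the whole string to fit the pattern.
print(pattern.fullmatch("1234567"))        # None
```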

  8. How to apply regexes

    Functions/classes provided by
    programming languages/frameworks

    Command-line tools (sed, awk, egrep, …)

    Other interfaces (e.g. MongoDB queries)

  9. Interactive testing

    http://regex101.com/ - currently provides a
    free multi-engine test environment,
    explaining your regex and showing the
    matches on a text.

    http://rubular.com/ - another regex test
    environment, targeting Ruby's flavour.

  10. The dualism regex-target
    The regular expression is applied to a string,
    to check for a match.
    Both the regex and the string have their own
    cursor.
    Which cursor drives the matching
    process?
    Text: “The quick br”    Regex: “qui”

  11. Engine types

    DFA

    Traditional NFA

    POSIX NFA

    Hybrid solutions

  12. DFA

    Matching is driven by the cursor on the text

    Very fast matching

    Takes longer to compile

    Takes more memory

    Declarative regex

    Always returns the longest possible match.

  13. Traditional NFA

    Matching is driven by the cursor on the
    regex

    Creates a stack of states, and performs
    backtracking

    Supports more language constructs

    Imperative regex

    Usually returns the first match found

    Employed by standard Java, .NET, Python,
    PHP, Perl, Ruby, …

  14. POSIX NFA

    Very, very similar to traditional NFA, but
    returns the longest possible match.

    Further performance issues!

  15. Hybrid solutions
    Double engine: first-scan with DFA, then
    scan with NFA if required by the pattern.
    Further implementations are possible.

  16. Our target: NFAs

    DFAs are less common than NFAs, their
    syntax is almost a subset and they are
    generally simpler.

    We will concentrate on NFA regexes

  17. Know your engine
    There are common rules, but several
    engines.
    Every engine has its own implementation.
    You must know your engine. And write tests.

  18. Regex basics
    Literal text, such as /rain/, matches if and
    only if the string contains that sequence
    somewhere, matching character after
    character.

  19. The first rule of matching
    Matching starts from the leftmost character.
    Therefore:
    “The rainbow shines after the rain” /rain/
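
A short Python sketch of this rule (the variable names are illustrative): the first match reported is the "rain" inside "rainbow", not the standalone word at the end.

```python
import re

text = "The rainbow shines after the rain"

m = re.search(r"rain", text)
print(m.start(), m.end())   # 4 8: the "rain" inside "rainbow"

# finditer() lists every match, still scanning from left to right.
starts = [hit.start() for hit in re.finditer(r"rain", text)]
print(starts)               # [4, 29]
```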

  20. The second rule of matching
    The engine returns a success if and only if
    the regex cursor reaches the end of the
    regex.

  21. Escaping characters

    Some characters (\, *, ?, +, ., (, ), [, ], {, }, |,
    ^, $, #) must be escaped when they are
    used literally

    Escape is performed by prepending “\”.
    For example: /\?/ to represent a literal “?”

    Where raw strings are not supported, a
    double escape might be required.
    In Java, the regex /\\\+/ becomes: “\\\\\\+”.
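
For comparison, here is the same regex in Python, which does support raw strings. This is only a sketch: the sample text is made up, and re.escape() is shown as one way to avoid escaping by hand.

```python
import re

# The regex \\\+ means: a literal backslash, then a literal plus sign.
raw = r"\\\+"          # raw string: written exactly like the regex
escaped = "\\\\\\+"    # ordinary string: every backslash doubled, as in Java

print(raw == escaped)                      # True
found = re.search(raw, r"a\+b").group()
print(found)                               # \+

# re.escape() builds a pattern that matches arbitrary text literally.
literal = re.escape("1+1=2?")
print(re.fullmatch(literal, "1+1=2?") is not None)   # True
```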

  22. Escape sequences

    \r

    \n

    \v

    \f

    \t

    They work just like in C

  23. Character classes

    [abc] = “a, b or c in this position”

    [a-z] = “a, b, c, …, z here”

    [A-Za-z] = “A, B, …, Z, a, b, …, z here”

    [A-Za-z0] = “A, …, Z, a, …, z, 0 here”

    [A-Z\-] = [-A-Z] = “A, …, Z or – here”

    What about accents? (é, è, …) And cedilla?

    Know your engine.

  24. Negated character classes

    [^ab] = “Something not a and not b here”

    [^a-z] = “Something not a, b, c, …, z here”

    [^A-Za-c] = “Something not A, …, Z, a, …,
    c here”

    Negating a character set requires the
    existence of a character in that position, not
    belonging to the specified class.
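
The last point is easy to overlook. A small Python illustration, with sample words picked only for the demo: "q followed by something that is not u" fails when nothing at all follows the q.

```python
import re

# q[^u] means "q followed by a character that is not u".
at_end = re.search(r"q[^u]", "Iraq")
print(at_end)                    # None: no character follows the q

followed = re.search(r"q[^u]", "Iraq is")
print(repr(followed.group()))    # 'q ' (the q plus the space)
```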

  25. Common character classes

    \d = a digit

    \D = [^\d]

    \w = a letter, a digit or “_”

    \W = [^\w]

    \s = a space character

    \S = [^\s]

    . = any character except newline

  26. What are letters and spaces?

    The answer depends on the encoding and
    on your engine.

    In ASCII, usually:
    – \w = [A-Za-z0-9_]
    – \s = [\r \n\t\f\v] (includes ASCII-32 common
    space)

    But what about Latin-1 or Unicode?

    Know your engine
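
As one concrete case of "know your engine": in Python 3, patterns applied to str objects are Unicode-aware by default, and the re.ASCII flag restores the ASCII-only definitions. A sketch (other engines make different choices):

```python
import re

word = "caf\u00e9"   # "café", with é as a single code point

unicode_word = re.fullmatch(r"\w+", word) is not None
ascii_word = re.fullmatch(r"\w+", word, re.ASCII) is not None
print(unicode_word, ascii_word)     # True False

# \s widens too, for example to the no-break space U+00A0.
unicode_space = re.fullmatch(r"\s", "\u00a0") is not None
ascii_space = re.fullmatch(r"\s", "\u00a0", re.ASCII) is not None
print(unicode_space, ascii_space)   # True False
```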

  27. Unicode character classes

    \uXXXX: matches the Unicode code point
    whose hex value is XXXX

    There should also be support for Unicode's
    categories and scripts, especially via \p

    Much more Unicode-related, non-standard
    features

    Know your engine

  28. Capturing groups

    ( and ) define a capturing group

    Capturing groups are assigned a 1-based
    index, according to the position of their (

    /(\w+)bet/ tries to match a string and, if
    successful, creates a capturing group for
    the text matching \w+, having index 1

    If the above regex is applied to “alphabet”,
    it matches and its group 1 is “alpha”
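
In Python, the groups of a match can be read as follows. The second pattern and its sample text are additions for illustration, showing how nested groups are numbered.

```python
import re

m = re.search(r"(\w+)bet", "alphabet")
print(m.group(0))   # alphabet (the whole match)
print(m.group(1))   # alpha

# Groups are numbered by the position of their opening parenthesis.
nested = re.search(r"((\d+)-(\d+))", "pages 10-25")
print(nested.groups())   # ('10-25', '10', '25')
```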

  29. Non-capturing groups

    Groups can just be used to clarify
    precedence: capturing is not always
    needed

    Skipping capturing can save memory and
    speed up the matching process

    To define a non-capturing group, use
    (?: and ).

    Therefore, /(?:\w+)bet/ is just like /\w+bet/:
    no capturing is performed and, here, the
    grouping does not change precedence.

  30. Backreferences

    Backreference = the content of a capturing
    group that becomes part of the regex

    Use \N in your regex, replacing N with the
    index of the captured group in question

    For example: /(['"])\w+\1/ to require the
    same quote character (single or double) on
    both sides of a word

    Some engines support named capturing
    and backreferences
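
A Python sketch of the quote-pairing example. The named variant uses Python's own (?P<name>...) and (?P=name) syntax; other engines spell named groups differently.

```python
import re

quoted = re.compile(r"""(['"])\w+\1""")

single = quoted.search("say 'hello' now")
double = quoted.search('say "hello" now')
mixed = quoted.search("say 'hello\" now")
print(single.group())   # 'hello'
print(double.group())   # "hello"
print(mixed)            # None: the two quotes do not pair up

# Named capturing group and named backreference (Python syntax).
named = re.compile(r"""(?P<quote>['"])\w+(?P=quote)""")
print(named.search("say 'hello' now").group("quote"))   # '
```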

  31. Alternation

    Alternatives are separated by |

    For example: /alpha|beta/ means “alpha” or
    “beta”

    Alternation has very low precedence; its
    scope is the current group: use grouping to
    force precedence.

    For example: /A(?:pril|ugust)/ means “A”,
    followed by “pril” or “ugust”.
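
The effect of grouping on alternation can be checked in Python. The ungrouped pattern is added here purely as a contrast:

```python
import re

grouped = re.compile(r"A(?:pril|ugust)")
ungrouped = re.compile(r"April|ugust")   # reads as "April" or "ugust"

print(bool(grouped.fullmatch("April")), bool(grouped.fullmatch("August")))   # True True
print(bool(ungrouped.fullmatch("August")))   # False
print(bool(ungrouped.fullmatch("ugust")))    # True
print(grouped.groups)                        # 0: (?:...) does not capture
```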

  32. Alternation VS char classes

    A character class (asserted or negated)
    always matches one and only one
    character

    The branches of an alternation can be
    strings of any length (at least one
    character, to be consistent)

  33. Matching in a DFA
    /nice|cute/ applied to:
    “Pandas are cute animals”
    It scans the string, starting from P, and, at
    every character, tries to apply the regex.
    In a DFA regex, the engine only chooses
    which regex components remain valid at a
    given position of the text cursor.

  34. Matching in NFA

    NFA also keeps a stack of states!

    Each decision point saves a state in the
    stack

    State = position of the 2 cursors

    If a choice in the regex leads to no match,
    the engine backtracks (=pops a state from
    the stack and makes a different choice)
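
To make the stack of states concrete, here is a toy backtracking matcher in Python. It is a teaching sketch, not how any real engine is implemented: it only understands literal characters, ".", and the greedy quantifiers "*" and "?", and all function names are invented for this example.

```python
def parse(pattern):
    """Split a toy pattern into (char, quantifier) pairs."""
    tokens, i = [], 0
    while i < len(pattern):
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in "*?":
            quant = pattern[i + 1]
        tokens.append((pattern[i], quant))
        i += 2 if quant else 1
    return tokens


def match_at(tokens, text, start):
    """Try to match from a fixed text position; return the end index or None."""
    stack = [(0, start)]       # saved states: (regex cursor, text cursor)
    while stack:
        r, t = stack.pop()     # backtrack: resume the most recent state
        if r == len(tokens):
            return t           # the regex cursor reached the end: success
        char, quant = tokens[r]
        fits = t < len(text) and char in (".", text[t])
        if quant:              # decision point: save the "skip" path
            stack.append((r + 1, t))
        if fits:               # the "apply" path is pushed last, so tried first
            next_r = r if quant == "*" else r + 1
            stack.append((next_r, t + 1))
    return None                # every path was explored: failure


def search(pattern, text):
    """Return (start, end) of the leftmost match, or None."""
    tokens = parse(pattern)
    for start in range(len(text) + 1):
        end = match_at(tokens, text, start)
        if end is not None:
            return start, end
    return None


print(search(".* are", "Pandas are cute animals"))   # (0, 10)
print(search("colou?r", "color"))                     # (0, 5)
print(search("nice", "Pandas are cute animals"))     # None
```

Each stack entry is exactly the pair of cursors described above. Pushing the "skip" path before the "apply" path is what makes the toy quantifiers greedy.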

  35. Backtracking
    [Figure: tree of states S1 to S8, with
    transitions numbered 1 to 11]

  36. Performance implications

    In NFA, a failure is returned only when all
    the regex paths have been explored

    NFA regexes must be written with
    performance in mind.

  37. Alternation in NFA

    Ordered in most implementations.

    Affects what is matched and performance.

    Know your engine
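
Python's re is one engine with ordered alternation: the first branch that allows a match wins, even when a later branch would match more. A small illustration with a made-up input:

```python
import re

text = "January"

short_first = re.search(r"Jan|January", text).group()
long_first = re.search(r"January|Jan", text).group()
print(short_first)   # Jan
print(long_first)    # January
```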

  38. Greedy quantifiers

    All quantifiers can be applied to single
    characters, classes or even groups

    * = any number of occurrences (even 0)

    ? = 0 or 1 occurrences

    + = 1 or more occurrences

    {n} = exactly n occurrences

    {m,n} = m to n occurrences (inclusive)

    {m,} = at least m occurrences
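
Each quantifier can be exercised with a few whole-string checks in Python. The helper whole() is defined only for this sketch.

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"ab*", "a"), whole(r"ab*", "abbb"))                   # True True
print(whole(r"colou?r", "color"), whole(r"colou?r", "colouur"))   # True False
print(whole(r"\d+", ""), whole(r"\d+", "2014"))                    # False True
print(whole(r"\d{4}", "2014"), whole(r"\d{4}", "201"))             # True False
print(whole(r"\d{2,4}", "2"), whole(r"\d{2,4}", "201"))            # False True
print(whole(r"\d{2,}", "20140424"))                                # True

# Quantifiers also apply to groups and classes.
print(whole(r"(?:ab)+[xyz]?", "ababz"))                            # True
```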

  39. First example of greedy quantifiers

    Let's consider the regex /be?(er|ar)/

    How is it applied to
    “I'd like a chocolate bar” ?

    The regex cursor stays on “b” until the text
    cursor reaches its “b” too

    Then, the following regex paths are tried:
    – be => b(er) => b(ar)

  40. Greedy quantifiers and
    backtracking

    Consider the regex /.* are/

    Applied to: “Pandas are cute animals”

    .* will consume the whole text at first

    However, when reaching the end of the
    text, it stops matching and the regex cursor
    goes on.

  41. Greedy quantifiers and
    backtracking (2)

    Now, “ “ can't match (no more text is
    available), so the engine backtracks!

    Some backtracking is performed, until the
    first available space is reached (between
    “cute” and “animals”)

    The regex cursor moves on to “a”, which
    matches the “a” in “animals”. But “r” doesn't
    match “n” => more backtracking!

  42. Greedy quantifiers and
    backtracking (3)

    The failures and backtracking go on until
    the space between “are” and “cute”... “a”
    doesn't match the “c” in “cute” =>
    backtracking, again!

    The next space is ok: it is followed by “are”,
    that matches the rest of the regex.

  43. Pandas are cute animals! ^__^!

  44. Lazy quantifiers

    Quantifiers become lazy if followed by a ?

    *?

    ??

    +?

    {m,n}?

    {m,}?

    {n} cannot be lazy: it indicates a precise n

  45. Lazy quantifiers and backtracking

    When applying /.*? are/ to
    “Pandas are cute animals”, what happens?

    The engine must choose whether to
    apply .*? to “P”. But it's lazy, so the engine
    chooses to move the regex cursor forward

    The regex cursor goes on to “ “, but it
    doesn't match “P” so the engine backtracks

    The engine must now take the remaining
    path – applying .*? to “P”, which is viable

  46. Lazy quantifiers and backtracking
    (2)

    This goes on until the first space in the text
    is reached: it matches the space in the
    regex, so the regex cursor can go on

    The matching process continues until the
    regex ends

    In this case, the match of greedy and lazy
    evaluation was the same – but the lazy
    quantifiers required less backtracking
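
The result of the two walk-throughs can be verified in Python. The second text is an invented variation with two occurrences of " are", where greedy and lazy evaluation no longer agree.

```python
import re

text = "Pandas are cute animals"
greedy = re.search(r".* are", text).group()
lazy = re.search(r".*? are", text).group()
print(greedy == lazy == "Pandas are")   # True

text2 = "Pandas are cute and koalas are sleepy"
greedy2 = re.search(r".* are", text2).group()
lazy2 = re.search(r".*? are", text2).group()
print(greedy2)   # Pandas are cute and koalas are
print(lazy2)     # Pandas are
```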

  47. Apply or skip? Greedy VS Lazy

    When a quantifier is encountered, the
    regex engine must choose whether to
    apply its element to the text or not

    Greedy quantifiers prefer the “apply” path
    whenever possible

    Lazy quantifiers prefer the “skip” path
    whenever possible

    Choosing greedy VS lazy quantifiers can
    impact performance and what is matched,
    but not the presence/absence of a match.

  48. Greedy VS Lazy: an example

    Given the text “987”:
    – /\d{1,3}/ matches the whole “987”: the
    greedy quantifier tries to consume as
    much as possible
    – /\d{1,3}?/ matches just “9”: the lazy
    quantifier must honour the constraints (at
    least 1 match), but chooses to skip
    application whenever possible
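
The same comparison in Python. The third pattern is an extra case added for illustration: a lazy quantifier still grows when the rest of the regex needs it to.

```python
import re

greedy = re.search(r"\d{1,3}", "987").group()
lazy = re.search(r"\d{1,3}?", "987").group()
print(greedy)   # 987
print(lazy)     # 9

# Followed by a literal 7, the lazy quantifier expands as far as needed.
lazy_then_7 = re.search(r"\d{1,3}?7", "987").group()
print(lazy_then_7)   # 987
```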

  49. Atomic grouping

    (?> and ) define an atomic group

    All the states created within an atomic
    group are removed from the engine's stack
    as soon as the group closes

    Atomic groups are non-capturing, but can
    contain capturing groups

    Atomic grouping can alter the match/failure
    result of a regex, as well as affecting
    performance

  50. Possessive quantifiers

    Obtained by adding a “+” to greedy
    quantifiers

    Possessive quantifiers are equivalent to
    greedy quantifiers wrapped within an
    atomic group.

    For example:
    /\d++/ = /(?>\d+)/

  51. Regex flags

    Regex engines can turn on/off features, for
    customized behaviour

    Enabling and disabling flags usually affects
    the whole regex, but some engines support
    flags on just regions.

    Flag manipulation is engine- and API-
    dependent

    Every engine has its own flags, but some
    are definitely common.

  52. Most common regex flags

    Case insensitive

    Dot-all: . matches any character,
    including \n

    Multiline anchors: ^ and $ (see later) work
    on lines instead of the whole text

    Extended: whitespace, including newlines,
    is ignored unless escaped or within a
    character class; a # starts a comment that
    runs to the end of the line. This makes
    regexes more readable.
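
In Python's re module, these four flags are called IGNORECASE, DOTALL, MULTILINE and VERBOSE. A sketch of each, with sample strings chosen for the demo; other engines use different names.

```python
import re

print(bool(re.search(r"panda", "PANDA", re.IGNORECASE)))   # True

print(re.search(r"a.b", "a\nb"))                           # None
print(bool(re.search(r"a.b", "a\nb", re.DOTALL)))          # True

print(re.findall(r"^\w+", "one\ntwo"))                     # ['one']
print(re.findall(r"^\w+", "one\ntwo", re.MULTILINE))       # ['one', 'two']

date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)
print(date.fullmatch("2014-04-24").groups())   # ('2014', '04', '24')

# Flags can also be written inline, at the start of the pattern.
print(bool(re.search(r"(?i)panda", "PANDA")))  # True
```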

  53. Anchors

    Anchors do not consume text: they are
    basic conditions on the text cursor.

    They must be verified for the regex to
    match

  54. Common anchors

    ^: the cursor is at the beginning of the text
    (of a line, in multiline mode)

    $: the cursor is at the end of the text (of a
    line, in multiline mode. And before or
    after \n? Know your engine).

    \A: the cursor is at the beginning of the text

    \Z: the cursor is at the end of the text

    \b: the cursor is at a word boundary (what's
    a word boundary? Know your engine)
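
How Python's re answers these questions is sketched here. The behaviour of $ and \Z around a final newline is specific to this engine; some other engines treat \Z differently, so this illustrates "know your engine" rather than a general rule.

```python
import re

text = "rain\n"
print(bool(re.search(r"rain$", text)))    # True: $ also matches before a final \n
print(bool(re.search(r"rain\Z", text)))   # False: in Python, \Z is the very end only
print(bool(re.search(r"\Arain", text)))   # True

lines = "first\nsecond"
print(re.findall(r"^\w+$", lines))                 # []
print(re.findall(r"^\w+$", lines, re.MULTILINE))   # ['first', 'second']

# \b sits between a word character and a non-word character (or a text edge).
print(re.findall(r"\bcat\b", "cat concat cat."))   # ['cat', 'cat']
```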

  55. Lookaround

    Lookaround = a regex-based condition on
    the text cursor. Can be positive (the regex
    must match) or negative (the regex must
    fail).

    Lookahead = a lookaround on the text
    following the cursor

    Lookbehind = a lookaround on the text
    preceding the cursor.

  56. Lookaround notation
                Lookbehind      Lookahead
    Positive    (?<=regex)      (?=regex)
    Negative    (?<!regex)      (?!regex)

  57. Lookaround basics

    Their position in the regex matters, as the
    other characters in the regex consume the
    text and make the text cursor shift forward.

    On the other hand, lookarounds do not
    consume text

    Juxtaposed lookarounds all apply, combined
    by a logical AND, at the position marked by
    the text cursor
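
A few lookarounds in Python, one per kind, plus a pair of juxtaposed lookaheads. The sample strings and the "strong" rule (at least 6 characters, one digit, one lowercase letter) are invented for this sketch.

```python
import re

# Positive lookbehind and lookahead: digits preceded by "$" and followed by ".".
prices = re.findall(r"(?<=\$)\d+(?=\.)", "cost: $42.50, id 42.")
print(prices)   # ['42']

# Negative lookahead: a "q" not followed by "u". Unlike q[^u], it does not
# need a following character, because it consumes nothing.
print(re.findall(r"q(?!u)", "Iraq quiz"))   # ['q']
print(bool(re.search(r"q(?!u)", "Iraq")))   # True

# Negative lookbehind: "cat" not preceded by "con".
print(re.findall(r"(?<!con)cat", "cat concat"))   # ['cat']

# Juxtaposed lookaheads are ANDed at the same cursor position.
strong = re.compile(r"(?=.*\d)(?=.*[a-z]).{6,}")
print(bool(strong.fullmatch("abc123")), bool(strong.fullmatch("abcdef")))   # True False
```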

  58. Lookaround limitations

    Lookarounds behave like nested regexes
    having their own stack

    They are also called zero-length assertions

    Lookaheads can be full-fledged regexes

    Lookbehinds are usually much more
    restricted, depending on the engine

  59. Lookarounds and the stack

    Each lookaround maintains its own stack,
    which is discarded at the end of the
    lookaround.

    An important detail: capturing groups within
    lookarounds are considered capturing
    groups of the whole regex => their result is
    saved.

  60. Lookahead + Backreference =
    Atomic group

    Lookaheads are full-fledged regexes with
    their own stack, which is thrown away.

    This is exactly like an atomic group, but the
    lookahead does not consume text

    However, capturing groups in a lookahead
    are stored by the regex => use a
    backreference to capture that text

    Therefore, for example:
    /(?=(\d+))\1/ = /(?>\d+)/
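
The emulation can be observed in Python without relying on native atomic-group syntax, which not every engine or version supports. Adding a trailing \d (an addition for this sketch) makes the difference visible: the plain version lets \d+ give back a digit, the emulated atomic version does not.

```python
import re

plain = re.compile(r"\d+\d")
emulated_atomic = re.compile(r"(?=(\d+))\1\d")

print(plain.search("123").group())     # 123: \d+ gives back one digit
print(emulated_atomic.search("123"))   # None: the digits are never given back

# The capture made inside the lookahead is kept, although the lookahead
# itself consumes no text.
m = re.search(r"(?=(\d+))", "abc123")
print(repr(m.group(0)), m.group(1))    # '' 123
```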

  61. Regexes and C#

    .NET encapsulates regexes in a class,
    System.Text.RegularExpressions.Regex

    Its constructor accepts the regex and,
    optionally, global flags

    C# supports verbatim (raw) strings, prefixed
    by @, which avoid the over-escaping
    required in Java.

  62. Regexes and Java

    Java's regex class is java.util.regex.Pattern

    Instead of a constructor, a static method,
    Pattern.compile(), creates a regex

    It takes the regex and, optionally, the global
    flags

    In Java, the regex /\\test/ becomes “\\\\test”,
    because each “\” in the regex must be
    escaped in Java, too, for a total of 4 “\”.

  63. Regexes in MongoDB

    MongoDB supports regexes

    Just use /regex/ (with slashes and without
    double quotes) as the right side of an
    equality assertion in your query

    Important: a regex could hit indexes on a
    field, but the best results are achieved
    when the regex starts with ^

  64. Regexes in Python

    Python provides the standard module re

    To create a regex, just use re.compile(),
    that takes, as usual, the regex string and
    the optional global flags
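
A minimal usage sketch; the pattern and the sample strings are arbitrary:

```python
import re

# re.compile(pattern, flags) returns a reusable pattern object.
word = re.compile(r"[a-z]+", re.IGNORECASE)

print(word.findall("Pandas are CUTE"))   # ['Pandas', 'are', 'CUTE']
print(word.sub("*", "Pandas are CUTE"))  # * * *
print(word.match("123 abc"))             # None: match() only tries position 0
print(word.search("123 abc").group())    # abc
```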

  65. Regexes in JavaScript

    In JavaScript, it's quite common to use this
    notation to create a regex object:
    var regex = /regexPattern/
    var regexWithFlags = /regexPattern/flags

    Alternatively, the RegExp class can be
    used

  66. Final notes

    Don't forget that regexes must be kept
    simple, just like any other construct

    To achieve this result, a good knowledge of
    the text, as well as of the requirements, is
    needed.

    Write tests for your regexes

  67. Further references

    “Mastering Regular Expressions” - by
    Jeffrey E. F. Friedl, published by O'Reilly
    Media

    http://regex101.com/

    http://rubular.com/

    http://www.regular-expressions.info/
