^RegexR4Strn<3$

Ju Liu
April 15, 2014

Regular Expressions are for the Strong of Heart.

Transcript

  1. Regular Expressions are
    for the Strong of Heart
    ^RegexR4Strn<3$

  2. We all know RegExps,
    amirite?

  3. Stephen Kleene invents
    Regular Expressions in 1956.

  4. In 1968, Ken Thompson
    implements Regular Expressions
    to match patterns in text files.
    grep: globally search for a regular expression and print matching lines

  5. So why are they
    called Regular?

  6. ‘Regular’ comes from the Regular Sets
    used by Kleene to describe
    Regular Languages.

  7. WAT

  8. Some CS background

  9. A word is a sequence of symbols.
    The set of symbols we are using is
    called the alphabet.

  10. Given the alphabet
    {0,1,2,3,4,5,6,7,8,9}
    We could make words like
    0 1 42 1337 9001

  11. A language is a subset
    of all possible words.

  12. The language composed of
    James Bond's colleagues' codes
    would have these words
    007 002 006 0099
    But not these words
    59 078 0935

  13. PROBLEM

  14. How can we tell if a
    word belongs to a
    language?

  15. Can we answer this question
    by using a machine with
    finite memory?

  16. If we can examine a word symbol by
    symbol (without requiring arbitrary
    amounts of memory), then we
    call the language regular.

  17. Let’s say we have a language
    with only the word 42

  18. word.length == 2 &&
    word[0] == '4' &&
    word[1] == '2'
    REGULAR

  19. All languages with a finite number
    of words are regular, since
    we can just nest if conditions.

  20. Let’s look at the language with all
    prime numbers between 10 and 99

  21. if word[0] == '1'
        if word[1] == '1' # 11
          return true
        end
        if word[1] == '3' # 13
          return true
        end
        if word[1] == '7' # 17
          return true
        end
        if word[1] == '9' # 19
          return true
        end
      end
      ...
      return false
    REGULAR
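
Because the language is finite, the nested conditions above amount to a lookup in a fixed table. A minimal Ruby sketch, assuming words are strings of digit characters; the constant and method names are made up for illustration:

```ruby
require 'set'

# Every word of the language: the primes between 10 and 99, as strings.
TWO_DIGIT_PRIMES = %w[
  11 13 17 19 23 29 31 37 41 43 47
  53 59 61 67 71 73 79 83 89 97
].to_set.freeze

# A word belongs to the language if and only if it is in the table.
def two_digit_prime?(word)
  TWO_DIGIT_PRIMES.include?(word)
end

puts two_digit_prime?('17') # true
puts two_digit_prime?('21') # false
```

The table never grows with the input, which is the "finite memory" property the earlier slides ask for.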

  22. Even if all finite sets are regular,
    not all regular sets are finite.

  23. For example, there can be infinitely many secret
    agents of the British Intelligence
    007 0042 00300 009292

  24. Formally, a Regular Language
    can be described by an FSM,
    aka Finite State Machine.

  25. start state: if input == 0 then goto state 2
    start state: if input == 1 then fail
    start state: if input == 2 then fail
    start state: if input == 3 then fail
    ...
    state 2: if input == 0 then goto state 3
    state 2: if input == 1 then fail
    state 2: if input == 2 then fail
    state 2: if input == 3 then fail
    ...
    state 3: for any input, accept
    REGULAR
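
The state table above can be sketched as a small transition hash in Ruby. This is only an illustration under assumptions: the state names are invented, and the accepting state is taken to loop on further digits so that codes such as 0042 are accepted.

```ruby
DIGITS = ('0'..'9').to_a.freeze

# state => { input symbol => next state }; a missing entry means "fail".
TRANSITIONS = {
  start:     { '0' => :one_zero },
  one_zero:  { '0' => :two_zeros },
  two_zeros: DIGITS.to_h { |d| [d, :accept] },
  accept:    DIGITS.to_h { |d| [d, :accept] }
}.freeze

# Reads the word one symbol at a time, remembering only the current state.
def agent_code?(word)
  state = :start
  word.each_char do |symbol|
    state = TRANSITIONS.fetch(state)[symbol]
    return false if state.nil?
  end
  state == :accept
end

puts agent_code?('007') # true
puts agent_code?('59')  # false
```

The only memory the machine needs is the `state` variable, however long the word is.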

  26. Alternatively, we can use a
    Regular Grammar.

  27. S → 0 A
    A → 0 B
    B → 0 B
    B → 1 B
    B → 2 B
    B → 3 B
    B → 4 B
    B → 5 B
    B → 6 B
    B → 7 B
    B → 8 B
    B → 9 B
    B → ε
    REGULAR

  28. Or… A Regular Expression.

  29. 00[0-9]+
    REGULAR
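
A quick way to try this expression in Ruby, checked against the sample words from the earlier slides. The `\A` and `\z` anchors are an addition here so that the whole word has to match; the constant name is hypothetical.

```ruby
# \A and \z pin the pattern to the start and end of the word.
AGENT_CODE = /\A00[0-9]+\z/

%w[007 002 0099 59 078 0935].each do |word|
  puts "#{word}: #{word.match?(AGENT_CODE)}"
end
```

The first three words are reported as `true` and the last three as `false`.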

  30. wow
    so short
    very expressive
    how possible
    such regular
    pliz more succint

  31. FUNDAMENTALS

  32. Every character can be interpreted as
    a regular character, which has a literal
    meaning, or as a meta-character,
    which has a special meaning.

  33. So which are the
    metacharacters?

  34. WELL, WE DON’T
    KNOW FOR SURE

  35. There are many different flavours of
    regular expressions, and each has
    its own set of meta-characters.
    UNIX HATERS

  36. The \ character lets us switch
    from the regular meaning to
    the meta-meaning and back.
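
In Ruby's flavour, for instance, the backslash works in both directions: it turns the meta-character `.` into a literal dot, and the regular character `d` into the "digit" class. A small sketch with made-up constant names:

```ruby
ANY_CHAR    = /a.c/  # "." is a meta-character: any character
LITERAL_DOT = /a\.c/ # "\." is a literal dot
LETTER_D    = /d/    # "d" is a regular character
DIGIT       = /\d/   # "\d" is a meta-sequence: any digit

puts 'abc'.match?(ANY_CHAR)    # true
puts 'abc'.match?(LITERAL_DOT) # false
puts 'a.c'.match?(LITERAL_DOT) # true
puts '7'.match?(LETTER_D)      # false
puts '7'.match?(DIGIT)         # true
```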

  37. TRIVIA TIME

  38. In grep’s default regexp engine, () and
    {} are considered literal, because Ken
    Thompson wanted to grep C code.

  39. RUBY METACHARACTERS

  40. ANY CHARACTER

  41. .
    ANY CHARACTER

  42. BOOLEAN OR

  43. gray|grey
    BOOLEAN OR

  44. GROUPING

  45. gr(a|e)y
    GROUPING

  46. CHARACTER SET

  47. gr[ae]y
    CHARACTER SET

  48. gr[^io]y
    NEGATED
    CHARACTER SET

  49. gr[a-z]y
    CHARACTER RANGE
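
The alternation, grouping and character-set patterns from the last few slides can be tried side by side in Ruby. The constant names below are only labels for this sketch:

```ruby
ALTERNATION = /gray|grey/
GROUPING    = /gr(a|e)y/
CHAR_SET    = /gr[ae]y/
NEGATED_SET = /gr[^io]y/
CHAR_RANGE  = /gr[a-z]y/

# The first three patterns accept exactly the same two spellings.
[ALTERNATION, GROUPING, CHAR_SET].each do |pattern|
  puts %w[gray grey].all? { |word| word.match?(pattern) } # true
end

puts 'griy'.match?(NEGATED_SET) # false
puts 'gruy'.match?(NEGATED_SET) # true
puts 'gr4y'.match?(CHAR_RANGE)  # false
puts 'grey'[GROUPING, 1]        # e
```

The last line shows a side effect of grouping: the parentheses also capture the letter that matched.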

  50. [\d] => [0-9]
    DIGITS

  51. [\w] => [0-9a-zA-Z_]
    WORD CHARACTER

  52. WHITESPACE
    [\s] => [ \t\r\n\f\v]

  53. NEGATED SETS
    [\D] => [^\d]
    [\S] => [^\s]
    [\W] => [^\w]
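
A short Ruby sketch of the shorthand classes and their negations in use; the sample sentence and constant names are invented for the example:

```ruby
SAMPLE = "Agent 007 met agent_42\tat 9 pm"

DIGIT_RUNS  = SAMPLE.scan(/\d+/)     # runs of digits
WORDS       = SAMPLE.scan(/\w+/)     # runs of word characters
FIELDS      = SAMPLE.split(/\s+/)    # split on runs of whitespace
ONLY_DIGITS = SAMPLE.gsub(/\D/, '')  # drop everything that is not a digit

p DIGIT_RUNS  # ["007", "42", "9"]
p WORDS       # ["Agent", "007", "met", "agent_42", "at", "9", "pm"]
p ONLY_DIGITS # "007429"
```

Note that `_` counts as a word character, so `agent_42` stays in one piece.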

  54. QUANTIFIERS

  55. colou?r
    0 OR 1

  56. yeah*
    0 OR MORE

  57. foo+bar
    1 OR MORE

  58. ah{3}
    BRACES, EXACT

  59. oh{3,7}
    BRACES, RANGE

  60. aw{2,}
    BRACES, OPEN RANGE
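
The quantifier slides can be checked with a tiny Ruby helper. `whole_word?` is a hypothetical name; it wraps each pattern in `\A...\z` so that the entire word has to match, which makes the counts visible:

```ruby
# True when the pattern matches the word from its first to its last character.
def whole_word?(pattern, word)
  /\A#{pattern}\z/.match?(word)
end

puts whole_word?(/colou?r/, 'color')     # true
puts whole_word?(/colou?r/, 'colour')    # true
puts whole_word?(/yeah*/, 'yea')         # true
puts whole_word?(/foo+bar/, 'fobar')     # false
puts whole_word?(/ah{3}/, 'ahhh')        # true
puts whole_word?(/oh{3,7}/, 'ohh')       # false
puts whole_word?(/aw{2,}/, 'awwwwwwww')  # true
```

In every case the quantifier applies only to the character right before it: `ah{3}` repeats the `h`, not the `ah`.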

  61. ANCHORS

  62. ^begin
    BEGINNING OF LINE

  63. end$
    END OF LINE

  64. \bword\b
    WORD BOUNDARIES
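
In Ruby, `^` and `$` work per line rather than per string, which the following sketch tries to show on a two-line sample (the sample text and constant names are invented):

```ruby
STORY = "begin the story\nthen reach the end"

LINE_STARTS = STORY.scan(/^\w+/) # first word of every line
LINE_ENDS   = STORY.scan(/\w+$/) # last word of every line

p LINE_STARTS # ["begin", "then"]
p LINE_ENDS   # ["story", "end"]

# \b only matches between a word character and a non-word character.
puts 'a word here'.match?(/\bword\b/) # true
puts 'swordfish'.match?(/\bword\b/)   # false
```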

  65. WILD

  66. .*
    ANYTHING

  67. .*?
    NON-GREEDY ANYTHING
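
The difference between the greedy and the non-greedy form is easiest to see on markup. A small Ruby sketch, with an invented sample string:

```ruby
MARKUP = '<b>bold</b> and <i>italic</i>'

GREEDY = MARKUP[/<.*>/]      # runs to the last ">" in the string
LAZY   = MARKUP[/<.*?>/]     # stops at the first ">" it can
TAGS   = MARKUP.scan(/<.*?>/)

puts GREEDY # <b>bold</b> and <i>italic</i>
puts LAZY   # <b>
p TAGS      # ["<b>", "</b>", "<i>", "</i>"]
```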

  68. (\w+) \1
    BACK-REFERENCE
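
One possible use of this back-reference is spotting a word typed twice. In the Ruby sketch below, the `\b` boundaries are an addition to the slide's pattern so that the "is" inside "this" does not count as a repeat; the constant names are hypothetical.

```ruby
DOUBLED_WORD = /\b(\w+) \1\b/

TYPO  = 'this is is a test'
FIXED = TYPO.gsub(DOUBLED_WORD, '\1') # keep a single copy of the word

puts TYPO[DOUBLED_WORD] # is is
puts FIXED              # this is a test
```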

  69. MOAR GROUPS

  70. (?:https?|ftp)://(.*)
    NON-CAPTURING
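
What "non-capturing" buys you shows up in the list of captures. A Ruby sketch comparing the slide's pattern with a capturing variant (the URL and constant names are placeholders):

```ruby
# %r{...} avoids having to escape the slashes.
NON_CAPTURING = %r{(?:https?|ftp)://(.*)}
CAPTURING     = %r{(https?|ftp)://(.*)}

URL = 'https://example.com/path'

p URL.match(NON_CAPTURING).captures # ["example.com/path"]
p URL.match(CAPTURING).captures     # ["https", "example.com/path"]
```

With `(?:...)` the scheme is still grouped for the alternation, but it does not take up a capture slot.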

  71. soft(?=ware)
    POSITIVE
    LOOK-AHEAD

  72. hard(?!ware)
    NEGATIVE
    LOOK-AHEAD

  73. (?<=tender)love
    POSITIVE
    LOOK-BEHIND
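
Look-arounds check the surrounding text without including it in the match. A Ruby sketch of the three patterns shown so far, with made-up constant names:

```ruby
LOOK_AHEAD     = /soft(?=ware)/
NEG_LOOK_AHEAD = /hard(?!ware)/
LOOK_BEHIND    = /(?<=tender)love/

puts 'software'[LOOK_AHEAD]            # soft
p 'softly'[LOOK_AHEAD]                 # nil
puts 'hardware'.match?(NEG_LOOK_AHEAD) # false
puts 'hardly'.match?(NEG_LOOK_AHEAD)   # true
puts 'tenderlove'[LOOK_BEHIND]         # love
puts 'glove'.match?(LOOK_BEHIND)       # false
```

In the first line the match is only `soft`: the `ware` was required, but not consumed.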

  74. (?<!…)
    NEGATIVE
    LOOK-BEHIND

  75. SOME COOL THINGS
    YOU CAN DO WITH
    REGEXES

  76. ^\s*#
    Find all comments

  77. \s+$
    Find all trailing whitespace
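
Both of these patterns can be run over the lines of a file with `grep`. The Ruby sketch below uses a hard-coded array instead of a real file, and assumes the lines have already been stripped of their newline (otherwise the newline itself counts as trailing whitespace):

```ruby
CODE_LINES = [
  '# a comment',
  "puts 'hi'   ",
  '  # an indented comment',
  'x = 1 # code first, so not matched'
].freeze

COMMENTS = CODE_LINES.grep(/^\s*#/)
TRAILING = CODE_LINES.grep(/\s+$/)
CLEANED  = CODE_LINES.map { |line| line.sub(/\s+$/, '') }

p COMMENTS # ["# a comment", "  # an indented comment"]
p TRAILING # ["puts 'hi'   "]
```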

  78. (['"]).*?\1
    Find all single/double quoted
    strings in some blurb of text
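
A Ruby sketch of this one. To get at the text between the quotes, the version below adds a second capture group around `.*?`, which is a small change to the slide's pattern; the sample sentence is invented.

```ruby
QUOTED = /(['"])(.*?)\1/

BLURB = %q(He said "hello" and she said 'bye'.)

# scan returns [quote, body] pairs; keep only the bodies.
BODIES = BLURB.scan(QUOTED).map { |_quote, body| body }

p BODIES                                 # ["hello", "bye"]
p %q("it's").scan(QUOTED).map(&:last)    # ["it's"]
```

The second line is where the back-reference earns its keep: the apostrophe does not end the string, because the closing quote has to be the same character as the opening one.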

  79. ^(?!Hello).*
    Find all lines not beginning
    with “Hello”
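
Applied line by line in Ruby, the negative look-ahead acts as a filter. The sample lines are made up:

```ruby
GREETINGS = ['Hello world', 'Goodbye world', 'Say Hello', ''].freeze

NOT_HELLO = GREETINGS.grep(/^(?!Hello).*/)

p NOT_HELLO # ["Goodbye world", "Say Hello", ""]
```

"Say Hello" is kept because the look-ahead is only checked at the beginning of the line, and the empty line is kept because it does not begin with "Hello" either.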

  80. \w+(?<!ay)\b
    Find all words not ending with ay
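
One way to express "words not ending with ay" in Ruby is a negative look-behind followed by a word boundary. This is a sketch of that idea, not necessarily the exact pattern from the original slide; the sample text is invented.

```ruby
NOT_ENDING_IN_AY = /\w+(?<!ay)\b/

SENTENCE = 'hey day say monkey'

p SENTENCE.scan(NOT_ENDING_IN_AY) # ["hey", "monkey"]
```

The trailing `\b` matters: without it the engine could back off and report "da" as a word that does not end with "ay".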

  81. (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
    [ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\
    \]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\
    [\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\
    \".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^
    \"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\
    ["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?
    =[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|
    (?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|
    (?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
    |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:
    (?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
    [\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\
    \".\[\] \000-\031]+(? :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:
    [^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:
    (?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:
    (?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:
    (?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^
    \[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\
    [([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|
    ( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\
    [\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\
    \".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\
    ["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?
    =[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|
    \Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:
    (?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\
    [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\
    \".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:
    (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.
    (?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
    \r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:
    \r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
    (?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:
    (?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*
    \](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\
    \.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\
    \]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r
    \\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|"(?:
    [^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\
    [ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r
    \n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
    \r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)
    Validate an email
    RFC 822

  82. View Slide

  83. REGEX TIPS

  84. Always use anchors
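
Why anchors matter, sketched in Ruby with a hypothetical "digits only" check. In Ruby, `^` and `$` anchor to a line, while `\A` and `\z` anchor to the whole string:

```ruby
UNANCHORED      = /\d+/
LINE_ANCHORED   = /^\d+$/
STRING_ANCHORED = /\A\d+\z/

SNEAKY = "123\nDROP TABLE agents"

puts 'abc123def'.match?(UNANCHORED) # true, probably not what was meant
puts SNEAKY.match?(LINE_ANCHORED)   # true, the first line is all digits
puts SNEAKY.match?(STRING_ANCHORED) # false
puts '123'.match?(STRING_ANCHORED)  # true
```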

  85. Make your regex
    as specific as possible

  86. Build the regex
    step by step
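
In Ruby, one way to build a regex step by step is to name the pieces and interpolate them into the final pattern, testing each piece on its own first. The URL shape below is a deliberately simplified assumption, and all constant names are hypothetical:

```ruby
SCHEME = /(?:https?|ftp)/   # step 1: the protocol
HOST   = /[a-z0-9.-]+/      # step 2: a very rough host name
PATH   = %r{(?:/\S*)?}      # step 3: an optional path

# Final step: glue the tested pieces together and anchor the result.
SIMPLE_URL = %r{\A#{SCHEME}://#{HOST}#{PATH}\z}

puts 'ftp'.match?(/\A#{SCHEME}\z/)                 # true
puts 'https://example.com/a/b'.match?(SIMPLE_URL)  # true
puts 'gopher://example.com'.match?(SIMPLE_URL)     # false
```

Each interpolated piece keeps its own options, so a part that worked in isolation behaves the same way inside the larger expression.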

  87. RESOURCES

  88. Mastering Regular
    Expressions

  89. regular-expressions.info

  90. rubular.com

  91. regex101.com

  92. regexcrossword.com

  93. regex.alf.nu

  94. Some people, when confronted
    with a problem, think “I know,
    I'll use regular expressions.”
    Now they have two problems.

  95. View Slide

  96. $
