^RegexR4Strn<3$

Ju Liu
April 15, 2014

Regular Expressions are for the Strong of Heart.

Transcript

  1. Regular Expressions are for the Strong of Heart ^RegexR4Strn<3$

  2. We all know RegExps, amirite?

  3. Stephen Kleene invents Regular Expressions in 1956.

  4. In 1968, Ken Thompson implements Regular Expressions to match patterns

    in text files. grep: global search for a regular expression and print matching lines
  5. So why are they called Regular?

  6. ‘Regular’ comes from the Regular Sets used by Kleene to describe

    Regular Languages.
  7. WAT

  8. Some CS background

  9. A word is a sequence of symbols. The set of symbols

    we are using is called the alphabet.
  10. Given the alphabet {0,1,2,3,4,5,6,7,8,9}, we could make words like

    0 1 42 1337 9001
  11. A language is a subset of all possible words.

  12. The language composed of the code numbers of James Bond's colleagues would have

    these words: 007 002 006 0099. But not these words: 59 078 0935
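The definitions on the last few slides (alphabet, word, language) can be sketched in Ruby. This is only an illustration: the names `ALPHABET`, `word?`, `AGENT_CODES` and `in_language?` are made up for the sketch, and the language is assumed to contain just the four sample codes from the slide.

```ruby
require 'set'

# The alphabet from the slides: the ten decimal digits.
ALPHABET = ('0'..'9').to_a.freeze

# A word is any sequence of symbols drawn from the alphabet.
def word?(string)
  string.each_char.all? { |symbol| ALPHABET.include?(symbol) }
end

# A language is a subset of all possible words.
# Assumption: only the four sample codes from the slide belong to it.
AGENT_CODES = Set['007', '002', '006', '0099'].freeze

def in_language?(string)
  word?(string) && AGENT_CODES.include?(string)
end

puts word?('1337')        # true
puts word?('4x2')         # false
puts in_language?('007')  # true
puts in_language?('59')   # false
```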
  13. PROBLEM

  14. How can we tell if a word belongs to a

    language?
  15. Can we answer this question by using a machine with

    finite memory?
  16. If we can examine a word symbol by symbol (without

    requiring arbitrary amounts of memory), then we call the language regular.
  17. Let’s say we have a language with only the word

    42
  18. word.length == 2 && word[0] == '4' && word[1] == '2' REGULAR
  19. All languages with a finite number of words are regular, since we can

    just nest if conditions.
  20. Let’s look at the language with all prime numbers between

    10 and 99
  21. if word[0] == '1'
        return true if word[1] == '1' # 11
        return true if word[1] == '3' # 13
        return true if word[1] == '7' # 17
        return true if word[1] == '9' # 19
      end
      ...
      return false
      REGULAR
  22. Even if all finite sets are regular, not all regular

    sets are finite.
  23. I.e., there can be infinitely many secret agents of British

    Intelligence: 007 0042 00300 009292
  24. Formally, a Regular Language can be described by an FSM,

    aka Finite State Machine.
  25. start state: if input == 0 then goto state 2
      start state: if input == 1 then fail
      start state: if input == 2 then fail
      start state: if input == 3 then fail
      ...
      state 2: if input == 0 then goto state 3
      state 2: if input == 1 then fail
      state 2: if input == 2 then fail
      state 2: if input == 3 then fail
      ...
      state 3: for any input, accept
      REGULAR
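The state machine on this slide can be sketched in Ruby as a loop that remembers nothing except its current state, which is the "finite memory" the earlier slides ask for. The method name `agent_code?` and the state names are invented for this sketch, and it assumes that at least one digit must follow the leading `00`, as in the sample codes.

```ruby
DIGITS = '0123456789'

# Reads the word one symbol at a time; the only memory is `state`.
def agent_code?(word)
  state = :start
  word.each_char do |symbol|
    state =
      case state
      when :start     then symbol == '0' ? :one_zero : :fail
      when :one_zero  then symbol == '0' ? :two_zeros : :fail
      when :two_zeros then DIGITS.include?(symbol) ? :accept : :fail
      when :accept    then DIGITS.include?(symbol) ? :accept : :fail
      else :fail # once failed, stay failed
      end
  end
  state == :accept
end

puts agent_code?('007')   # true
puts agent_code?('0042')  # true
puts agent_code?('59')    # false
puts agent_code?('078')   # false
```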
  26. Alternatively, we can use a Regular Grammar.

  27. S → 0 A
      A → 0 B
      B → 0 B
      B → 1 B
      B → 2 B
      B → 3 B
      B → 4 B
      B → 5 B
      B → 6 B
      B → 7 B
      B → 8 B
      B → 9 B
      B → ε
      REGULAR
  28. Or… A Regular Expression.

  29. 00[0-9]+ REGULAR
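As a rough check, the expression from this slide can be tried in Ruby against the sample words from the earlier slides. The constant name `AGENT_CODE` is made up here, and the `\A` and `\z` anchors are an addition of this sketch (the slide shows the bare pattern), so that the whole word has to match and not just a part of it.

```ruby
# Assumption: \A and \z are added so the pattern must cover the whole word.
AGENT_CODE = /\A00[0-9]+\z/

agents  = %w[007 0042 00300 009292]
others  = %w[59 078 0935]

puts agents.all? { |word| AGENT_CODE.match?(word) }   # true
puts others.none? { |word| AGENT_CODE.match?(word) }  # true

# Without anchors the pattern only needs to match somewhere inside the text.
puts(/00[0-9]+/.match?('x0071y'))                     # true
puts AGENT_CODE.match?('x0071y')                      # false
```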

  30. wow so short very expressive how possible such regular pliz

    more succinct
  31. FUNDAMENTALS

  32. Every character can be interpreted as a regular character, which

    has a literal meaning, or as a meta-character, which has a special meaning.
  33. So which are the metacharacters?

  34. WELL, WE DON’T KNOW FOR SURE

  35. There are many different flavours of regular expressions, and each

    has its own set of meta-characters. UNIX HATERS
  36. The \ character lets us switch from the regular meaning

    to the meta-meaning and back.
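A small Ruby sketch of that switch, using two characters whose default meaning differs: `.` is a meta-character by default and becomes literal when escaped, while `d` is literal by default and becomes a meta-character when escaped. The sample strings are arbitrary.

```ruby
# "." is a meta-character by default; "\." is a literal dot.
puts 'axb'.match?(/a.b/)   # true  (any character between a and b)
puts 'axb'.match?(/a\.b/)  # false (only a real dot is allowed)
puts 'a.b'.match?(/a\.b/)  # true

# "d" is a literal by default; "\d" is the digit meta-character.
puts '7'.match?(/d/)       # false
puts '7'.match?(/\d/)      # true
puts 'd'.match?(/\d/)      # false
```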
  37. TRIVIA TIME

  38. In grep’s default regexp engine, () and {} are considered

    literal, because Ken Thompson wanted to grep C code.
  39. RUBY METACHARACTERS

  40. ANY CHARACTER

  41. . ANY CHARACTER

  42. BOOLEAN OR

  43. gray|grey BOOLEAN OR

  44. GROUPING

  45. gr(a|e)y GROUPING

  46. CHARACTER SET

  47. gr[ae]y CHARACTER SET

  48. gr[^io]y NEGATED CHARACTER SET

  49. gr[a-z]y CHARACTER RANGE

  50. [\d] => [0-9] DIGITS

  51. [\w] => [0-9a-zA-Z_] WORD CHARACTER

  52. [\s] => [ \t\r\n\f\v] WHITESPACE

  53. [\D] => [^\d], [\S] => [^\s], [\W] => [^\w] NEGATED SETS
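The grouping and character-set slides can be tried out in Ruby by filtering a list of candidate words. The word list is invented for the sketch, and every pattern is wrapped in `\A` and `\z` (not shown on the slides) so that a candidate only counts when the whole word matches.

```ruby
# Hypothetical candidate words, chosen to hit each case.
words = %w[gray grey griy gr8y groy]

p words.grep(/\Agr(a|e)y\z/)   # ["gray", "grey"]
p words.grep(/\Agr[ae]y\z/)    # ["gray", "grey"]
p words.grep(/\Agr[^io]y\z/)   # ["gray", "grey", "gr8y"]
p words.grep(/\Agr[a-z]y\z/)   # ["gray", "grey", "griy", "groy"]
p words.grep(/\Agr\dy\z/)      # ["gr8y"]
p words.grep(/\Agr\Dy\z/)      # ["gray", "grey", "griy", "groy"]
p words.grep(/\Agr.y\z/)       # all five words
```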
  54. QUANTIFIERS

  55. colou?r 0 OR 1

  56. yeah* 0 OR MORE

  57. foo+bar 1 OR MORE

  58. ah{3} BRACES, EXACT

  59. oh{3,7} BRACES, RANGE

  60. aw{2,} BRACES, OPEN RANGE
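A quick Ruby sketch of the six quantifier slides. Note that each quantifier applies only to the item right before it, so `yeah*` means `yea` followed by any number of `h`. The helper name `matching` and the candidate words are made up; `\A` and `\z` are added so the whole word must match.

```ruby
# Hypothetical helper: keep only the candidates the pattern matches.
def matching(pattern, candidates)
  candidates.grep(pattern)
end

p matching(/\Acolou?r\z/, %w[color colour colouur])    # ["color", "colour"]
p matching(/\Ayeah*\z/, %w[yea yeah yeahhh yeahyeah])  # ["yea", "yeah", "yeahhh"]
p matching(/\Afoo+bar\z/, %w[fobar foobar fooobar])    # ["foobar", "fooobar"]
p matching(/\Aah{3}\z/, %w[ahh ahhh ahhhh])            # ["ahhh"]
p matching(/\Aoh{3,7}\z/, %w[ohh ohhh ohhhhhhh ohhhhhhhh])
# ["ohhh", "ohhhhhhh"]
p matching(/\Aaw{2,}\z/, %w[aw aww awwwww])            # ["aww", "awwwww"]
```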

  61. ANCHORS

  62. ^begin BEGINNING OF LINE

  63. end$ END OF LINE

  64. \bword\b WORD BOUNDARIES
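In Ruby, `^` and `$` anchor to the beginning and end of each line inside a string, which a two-line sample makes visible. The sample strings below are made up for this sketch.

```ruby
text = "begin here\nthe end"

# ^ matches at the start of every line, $ at the end of every line.
p text.scan(/^\w+/)   # ["begin", "the"]
p text.scan(/\w+$/)   # ["here", "end"]

# \b only matches between a word character and a non-word character,
# so "word" inside "swordfish" is skipped.
p 'a word, swordfish'.scan(/\bword\b/)  # ["word"]
p 'a word, swordfish'.scan(/word/)      # ["word", "word"]
```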

  65. WILD

  66. .* ANYTHING

  67. .*? NON-GREEDY ANYTHING

  68. (\w+) \1 BACK-REFERENCE
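The difference between the greedy and non-greedy forms is easiest to see side by side. A minimal Ruby sketch, with sample strings invented for the purpose:

```ruby
html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as it can, so the match runs to the last ">".
greedy = html[/<.*>/]
# Non-greedy: .*? grabs as little as it can, so the match stops at the first ">".
lazy = html[/<.*?>/]

puts greedy  # <b>bold</b> and <i>italic</i>
puts lazy    # <b>

# Back-reference: \1 must repeat exactly what the first group captured.
repeated = 'hello hello world'[/(\w+) \1/]
puts repeated  # hello hello
```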

  69. MOAR GROUPS

  70. (?:https?|ftp)://(.*) NON-CAPTURING

  71. soft(?=ware) POSITIVE LOOK-AHEAD

  72. hard(?!ware) NEGATIVE LOOK-AHEAD

  73. (?<=tender)love POSITIVE LOOK-BEHIND

  74. (?<!tainted)love NEGATIVE LOOK-BEHIND
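A short Ruby sketch of the five group types. Look-aheads and look-behinds check the surrounding text without including it in the match, and a non-capturing group does not get a group number. The URL and the sample words are placeholders chosen for the sketch.

```ruby
# Non-capturing: (?:...) groups without numbering, so (.*) is group 1.
# %r{...} is used so the slashes need no escaping.
rest = 'https://example.com/docs'[%r{(?:https?|ftp)://(.*)}, 1]
puts rest  # example.com/docs

# Look-ahead: "ware" is checked but is not part of the match.
p 'software'[/soft(?=ware)/]   # "soft"
p 'softball'[/soft(?=ware)/]   # nil
p 'hardship'[/hard(?!ware)/]   # "hard"
p 'hardware'[/hard(?!ware)/]   # nil

# Look-behind: the text before the match is checked but not included.
p 'tenderlove'[/(?<=tender)love/]    # "love"
p 'truelove'[/(?<!tainted)love/]     # "love"
p 'taintedlove'[/(?<!tainted)love/]  # nil
```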

  75. SOME COOL THINGS YOU CAN DO WITH REGEXES

  76. ^\s*# Find all comments

  77. \s+$ Find all trailing whitespace

  78. (['"]).*?\1 Find all single/double quoted strings in some blurb of

    text
  79. ^(?!Hello).* Find all lines not beginning with “Hello”

  80. \w+(?<!ay)\b Find all words not ending with ay
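The five short patterns above can be tried in Ruby roughly like this. All sample inputs are invented. One detail is an assumption of this sketch: lines are split with `chomp: true` first, because `\s` also matches the newline itself and every line would otherwise count as having trailing whitespace.

```ruby
source = "# a comment\n  # indented comment\nputs 1 # not a comment line\nx = 2   \n"
lines = source.lines(chomp: true)

comments = lines.grep(/^\s*#/)
trailing = lines.grep(/\s+$/)
p comments  # ["# a comment", "  # indented comment"]
p trailing  # ["x = 2   "]

# scan returns only the captured group here, so the block collects the
# full match from Regexp.last_match instead.
quoted = []
%q{say "hi" and 'bye' now}.scan(/(['"]).*?\1/) { quoted << Regexp.last_match(0) }
p quoted    # ["\"hi\"", "'bye'"]

not_hello = ['Hello world', 'Goodbye', 'Hello again'].grep(/^(?!Hello).*/)
p not_hello # ["Goodbye"]

not_ay = 'today is a gray sunny day'.scan(/\w+(?<!ay)\b/)
p not_ay    # ["is", "a", "sunny"]
```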

  81. (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(

    ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\ \]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\ \".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^ \"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(? =[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z| (?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z| (?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?: (?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\ \".\[\] \000-\031]+(? :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? 
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?: (?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?: (?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?: (?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^ \[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\ [([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.| ( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\ \".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(? =[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?: (?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\ \".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?: (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\. (?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?: \r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?: (?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)* \](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\ \]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r \\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?: [^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ [ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r \n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*) Validate an email RFC 822
  83. REGEX TIPS

  84. Always use anchors

  85. Make your regex as specific as possible

  86. Build the regex step by step
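For the first tip, one Ruby detail is worth a sketch: `^` and `$` anchor to lines, while `\A` and `\z` anchor to the whole string, so a multi-line input can slip past a check written with `^` and `$`. The variable names and the sample input are made up for the illustration.

```ruby
# Hypothetical user input: garbage on the first line, digits on the second.
input = "not a number\n42"

loose  = /^\d+$/    # anchored to a line
strict = /\A\d+\z/  # anchored to the whole string

puts input.match?(loose)   # true  (the second line is all digits)
puts input.match?(strict)  # false (the string as a whole is not)
puts '42'.match?(strict)   # true
```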

  87. RESOURCES

  88. Mastering Regular Expressions

  89. regular-expressions.info

  90. rubular.com

  91. regex101.com

  92. regexcrossword.com

  93. regex.alf.nu

  94. Some people, when confronted with a problem, think “I know,

    I'll use regular expressions.” Now they have two problems.
  96. $