Slide 1

Slide 1 text

Regular Expressions are for the Strong of Heart ^RegexR4Strn<3$

Slide 2

Slide 2 text

We all know RegExps, amirite?

Slide 3

Slide 3 text

Stephen Kleene invents Regular Expressions in 1956.

Slide 4

Slide 4 text

In 1968, Ken Thompson implements Regular Expressions to match patterns in text files. grep: globally search for a regular expression and print matching lines.

Slide 5

Slide 5 text

So why are they called Regular?

Slide 6

Slide 6 text

‘Regular’ comes from the Regular Sets used by Kleene to describe Regular Languages.

Slide 7

Slide 7 text

WAT

Slide 8

Slide 8 text

Some CS background

Slide 9

Slide 9 text

A word is a sequence of symbols. The set of symbols we are using is called the alphabet.

Slide 10

Slide 10 text

Given the alphabet {0,1,2,3,4,5,6,7,8,9}, we could make words like 0, 1, 42, 1337, 9001.

Slide 11

Slide 11 text

A language is a subset of all possible words.

Slide 12

Slide 12 text

The language composed of the code numbers of James Bond's colleagues would contain these words: 007, 002, 006, 0099. But not these words: 59, 078, 0935.

Slide 13

Slide 13 text

PROBLEM

Slide 14

Slide 14 text

How can we tell if a word belongs to a language?

Slide 15

Slide 15 text

Can we answer this question by using a machine with finite memory?

Slide 16

Slide 16 text

If we can decide whether a word belongs to a language by examining it symbol by symbol (without requiring arbitrary amounts of memory), then we call the language regular.

Slide 17

Slide 17 text

Let’s say we have a language with only the word 42

Slide 18

Slide 18 text

word.length == 2 && word[0] == '4' && word[1] == '2' REGULAR

Slide 19

Slide 19 text

All languages with a finite number of words are regular, since we can just nest if conditions.

Slide 20

Slide 20 text

Let’s look at the language with all prime numbers between 10 and 99

Slide 21

Slide 21 text

if word[0] == '1'
  return true if word[1] == '1' # 11
  return true if word[1] == '3' # 13
  return true if word[1] == '7' # 17
  return true if word[1] == '9' # 19
end
...
return false

REGULAR
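
Since the language is finite, the nested conditions can also be replaced by a plain lookup table. A minimal Ruby sketch, assuming words are strings; the names TWO_DIGIT_PRIMES and two_digit_prime_word? are made up for illustration:

```ruby
require "set"

# Build the finite language once: every prime between 10 and 99, as a word.
TWO_DIGIT_PRIMES = (10..99).select { |number|
  (2..Integer.sqrt(number)).none? { |divisor| (number % divisor).zero? }
}.map(&:to_s).to_set.freeze

# Membership in a finite language is just a lookup in a fixed-size table.
def two_digit_prime_word?(word)
  TWO_DIGIT_PRIMES.include?(word)
end
```

The table has a fixed size no matter which word we test, which is the point of the slide: finite language, finite memory.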

Slide 22

Slide 22 text

Even if all finite sets are regular, not all regular sets are finite.

Slide 23

Slide 23 text

E.g. there can be infinitely many secret agents of British Intelligence: 007, 0042, 00300, 009292

Slide 24

Slide 24 text

Formally, a Regular Language can be described by an FSM, a.k.a. a Finite State Machine.

Slide 25

Slide 25 text

start state: if input == 0 then goto state 2
start state: if input == 1 then fail
start state: if input == 2 then fail
start state: if input == 3 then fail
...

state 2: if input == 0 then goto state 3
state 2: if input == 1 then fail
state 2: if input == 2 then fail
state 2: if input == 3 then fail
...

state 3: for any input, accept

REGULAR
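
The state table above could be sketched in Ruby roughly like this. The method name agent_code? and the state symbols are hypothetical, and the sketch assumes a word is only accepted once at least one digit has been read in state 3:

```ruby
# A minimal finite state machine for "two zeros followed by digits".
# It reads one symbol at a time and remembers only the current state.
def agent_code?(word)
  state = :start
  word.each_char do |symbol|
    # Anything outside the alphabet 0-9 is rejected immediately.
    return false unless symbol.between?("0", "9")

    state =
      case state
      when :start  then symbol == "0" ? :state2 : :fail
      when :state2 then symbol == "0" ? :state3 : :fail
      else :accept # state 3 (and accept): any further digit is fine
      end
    return false if state == :fail
  end
  state == :accept
end
```

Whatever the length of the word, the only thing kept in memory is the current state.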

Slide 26

Slide 26 text

Alternatively, we can use a Regular Grammar.

Slide 27

Slide 27 text

S → 0 A

A → 0 B

B → 0 B
B → 1 B
B → 2 B
B → 3 B
B → 4 B
B → 5 B
B → 6 B
B → 7 B
B → 8 B
B → 9 B
B → ε

REGULAR
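
A grammar like this can be checked mechanically. The following Ruby sketch encodes the productions as a hash and follows them symbol by symbol; GRAMMAR and derivable? are illustrative names, not part of the talk. Note that, exactly as written, B → ε also lets this grammar derive the bare word 00.

```ruby
# Productions of the right-linear grammar, keyed by non-terminal.
# Each production is [terminal, next_non_terminal]; :epsilon means "may stop here".
GRAMMAR = {
  "S" => [["0", "A"]],
  "A" => [["0", "B"]],
  "B" => ("0".."9").map { |digit| [digit, "B"] } + [:epsilon]
}.freeze

# Returns true if the word can be derived from the start symbol S.
def derivable?(word, grammar = GRAMMAR)
  current = ["S"]
  word.each_char do |symbol|
    next_non_terminals = current.flat_map do |non_terminal|
      grammar[non_terminal]
        .select { |production| production != :epsilon && production[0] == symbol }
        .map { |production| production[1] }
    end
    current = next_non_terminals.uniq
    return false if current.empty?
  end
  # The derivation may end only where an epsilon production is available.
  current.any? { |non_terminal| grammar[non_terminal].include?(:epsilon) }
end
```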

Slide 28

Slide 28 text

Or… A Regular Expression.

Slide 29

Slide 29 text

00[0-9]+ REGULAR
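
In Ruby, the same language could be tested like this. AGENT_CODE and agent? are made-up names, and the \A and \z anchors are added here so that the whole word has to match:

```ruby
# The slide's expression, anchored to the start and end of the string.
AGENT_CODE = /\A00[0-9]+\z/

def agent?(word)
  AGENT_CODE.match?(word)
end
```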

Slide 30

Slide 30 text

wow so short very expressive how possible such regular pliz more succint

Slide 31

Slide 31 text

FUNDAMENTALS

Slide 32

Slide 32 text

Every character can be interpreted as a regular character, which has a literal meaning, or as a meta-character, which has a special meaning.

Slide 33

Slide 33 text

So which are the metacharacters?

Slide 34

Slide 34 text

WELL, WE DON’T KNOW FOR SURE

Slide 35

Slide 35 text

There are many different flavours of regular expressions, and each has its own set of meta-characters. UNIX HATERS

Slide 36

Slide 36 text

The \ character lets us switch from the regular meaning to the meta-meaning and back.
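
A small Ruby sketch of the switch in both directions (the constant names are only for illustration):

```ruby
# "." is a meta-character by default; the backslash makes it literal.
ANY_CHARACTER = /a.c/
LITERAL_DOT   = /a\.c/

# "d" is a regular character by default; the backslash gives it a meta-meaning.
LETTER_D = /d/
ANY_DIGIT = /\d/
```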

Slide 37

Slide 37 text

TRIVIA TIME

Slide 38

Slide 38 text

In grep’s default regexp engine, () and {} are considered literal, because Ken Thompson wanted to grep C code.

Slide 39

Slide 39 text

RUBY METACHARACTERS

Slide 40

Slide 40 text

ANY CHARACTER

Slide 41

Slide 41 text

. ANY CHARACTER

Slide 42

Slide 42 text

BOOLEAN OR

Slide 43

Slide 43 text

gray|grey BOOLEAN OR

Slide 44

Slide 44 text

GROUPING

Slide 45

Slide 45 text

gr(a|e)y GROUPING

Slide 46

Slide 46 text

CHARACTER SET

Slide 47

Slide 47 text

gr[ae]y CHARACTER SET

Slide 48

Slide 48 text

gr[^io]y NEGATED CHARACTER SET

Slide 49

Slide 49 text

gr[a-z]y CHARACTER RANGE
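
To see how these variants differ, here is a Ruby sketch that runs each of them against a few sample words. SPELLING_PATTERNS, SAMPLE_WORDS and matching_words are hypothetical names chosen for this example:

```ruby
SPELLING_PATTERNS = {
  alternation:   /gray|grey/,
  grouping:      /gr(a|e)y/,
  character_set: /gr[ae]y/,
  negated_set:   /gr[^io]y/,
  range:         /gr[a-z]y/
}.freeze

SAMPLE_WORDS = %w[gray grey griy gr3y].freeze

# Returns the sample words matched by the given pattern.
def matching_words(pattern, words = SAMPLE_WORDS)
  words.select { |word| pattern.match?(word) }
end
```

The first three patterns accept only gray and grey. The negated set also accepts gr3y, and the range also accepts griy.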

Slide 50

Slide 50 text

[\d] => [0-9] DIGITS

Slide 51

Slide 51 text

[\w] => [0-9a-zA-Z_] WORD CHARACTER

Slide 52

Slide 52 text

WHITESPACE [\s] => [ \t\r\n\f\v]

Slide 53

Slide 53 text

NEGATED SETS [\D] => [^\d] [\S] => [^\s] [\W] => [^\w]
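
A quick Ruby sketch of the shorthand sets at work, using String#scan and String#split. The helper names are invented for this example:

```ruby
# Runs of digits.
def digit_runs(text)
  text.scan(/\d+/)
end

# Runs of word characters.
def word_runs(text)
  text.scan(/\w+/)
end

# Runs of everything that is NOT a word character.
def non_word_runs(text)
  text.scan(/\W+/)
end

# Pieces separated by whitespace.
def whitespace_separated(text)
  text.split(/\s+/)
end
```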

Slide 54

Slide 54 text

QUANTIFIERS

Slide 55

Slide 55 text

colou?r 0 OR 1

Slide 56

Slide 56 text

yeah* 0 OR MORE

Slide 57

Slide 57 text

foo+bar 1 OR MORE

Slide 58

Slide 58 text

ah{3} BRACES, EXACT

Slide 59

Slide 59 text

oh{3,7} BRACES, RANGE

Slide 60

Slide 60 text

aw{2,} BRACES, OPEN RANGE

Slide 61

Slide 61 text

ANCHORS

Slide 62

Slide 62 text

^begin BEGINNING OF LINE

Slide 63

Slide 63 text

end$ END OF LINE

Slide 64

Slide 64 text

\bword\b WORD BOUNDARIES
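
A small Ruby sketch of the three anchors, applied line by line to some sample text (lines_matching is a hypothetical helper):

```ruby
# Returns the lines of the text that the pattern matches.
def lines_matching(pattern, text)
  text.each_line.map(&:chomp).select { |line| pattern.match?(line) }
end
```

For example, with the lines "begin here", "the end", "a word in a sentence" and "swordfish", the pattern \bword\b keeps only "a word in a sentence", because in "swordfish" the word is glued to other word characters.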

Slide 65

Slide 65 text

WILD

Slide 66

Slide 66 text

.* ANYTHING

Slide 67

Slide 67 text

.*? NON-GREEDY ANYTHING
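
The difference is easiest to see between two delimiters. A Ruby sketch, assuming some bracketed markup as sample input (the constant and method names are illustrative):

```ruby
GREEDY_BRACKETS = /\[.*\]/   # grabs as much as possible
LAZY_BRACKETS   = /\[.*?\]/  # grabs as little as possible

# Returns the first substring matched by the pattern, or nil.
def first_match(pattern, text)
  text[pattern]
end
```

On "[b]bold[/b] and [i]italic[/i]" the greedy version swallows the whole string, while the non-greedy one stops at the first closing bracket and returns "[b]".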

Slide 68

Slide 68 text

(\w+) \1 BACK-REFERENCE
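
A typical use is spotting accidentally doubled words. The Ruby sketch below adds \b word boundaries to the slide's pattern (an assumption of this example) so that "this is" is not reported as a doubled "is"; REPEATED_WORD and repeated_words are made-up names:

```ruby
# \1 must match exactly what the first group captured.
REPEATED_WORD = /\b(\w+) \1\b/

# Returns every word that appears twice in a row.
def repeated_words(text)
  text.scan(REPEATED_WORD).flatten
end
```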

Slide 69

Slide 69 text

MOAR GROUPS

Slide 70

Slide 70 text

(?:https?|ftp)://(.*) NON-CAPTURING
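
In Ruby this could look as follows. Because the protocol group is non-capturing, the part after :// is capture number 1. URL_PATTERN and url_rest are hypothetical names, and %r{} is used only to avoid escaping the slashes:

```ruby
URL_PATTERN = %r{(?:https?|ftp)://(.*)}

# Returns everything after the "://", or nil if the text is not such a URL.
def url_rest(text)
  match = URL_PATTERN.match(text)
  match && match[1]
end
```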

Slide 71

Slide 71 text

soft(?=ware) POSITIVE LOOK-AHEAD

Slide 72

Slide 72 text

hard(?!ware) NEGATIVE LOOK-AHEAD

Slide 73

Slide 73 text

(?<=tender)love POSITIVE LOOK-BEHIND
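
Look-arounds check the surrounding text without including it in the match. A Ruby sketch of the three patterns shown so far (the constant names are for illustration only):

```ruby
POSITIVE_LOOKAHEAD  = /soft(?=ware)/     # "soft" only when followed by "ware"
NEGATIVE_LOOKAHEAD  = /hard(?!ware)/     # "hard" only when NOT followed by "ware"
POSITIVE_LOOKBEHIND = /(?<=tender)love/  # "love" only when preceded by "tender"
```

In each case the matched text is just "soft", "hard" or "love"; the "ware" and "tender" parts are only inspected.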

Slide 74

Slide 74 text

(?

Slide 75

Slide 75 text

SOME COOL THINGS YOU CAN DO WITH REGEXES

Slide 76

Slide 76 text

^\s*# Find all comments

Slide 77

Slide 77 text

\s+$ Find all trailing whitespaces
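
Both patterns can be tried on a snippet of source code. A Ruby sketch that works line by line; the helper names are invented for this example:

```ruby
COMMENT_LINE        = /^\s*#/
TRAILING_WHITESPACE = /\s+$/

# Lines that contain nothing but a comment.
def comment_lines(source)
  source.each_line.map(&:chomp).select { |line| COMMENT_LINE.match?(line) }
end

# Lines that end in spaces or tabs.
def lines_with_trailing_whitespace(source)
  source.each_line.map(&:chomp).select { |line| TRAILING_WHITESPACE.match?(line) }
end

# The same source with trailing whitespace removed from every line.
def strip_trailing_whitespace(source)
  source.each_line.map { |line| line.chomp.sub(TRAILING_WHITESPACE, "") }.join("\n")
end
```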

Slide 78

Slide 78 text

(['"]).*?\1 Find all single/double quoted strings in some blurb of text
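
The back-reference makes sure a string is closed by the same kind of quote that opened it. A Ruby sketch that collects all matches by walking through the text (QUOTED_STRING and quoted_strings are hypothetical names):

```ruby
QUOTED_STRING = /(['"]).*?\1/

# Returns every quoted string in the text, quotes included.
def quoted_strings(text)
  found = []
  position = 0
  # Regexp#match accepts a start position, so we resume after each match.
  while (match = QUOTED_STRING.match(text, position))
    found << match[0]
    position = match.end(0)
  end
  found
end
```

Note that an apostrophe inside a double-quoted string is left alone, since only a second double quote can close it.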

Slide 79

Slide 79 text

^(?!Hello).* Find all lines not beginning with “Hello”
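
A Ruby sketch of this pattern used as a line filter (the names are made up for the example):

```ruby
NOT_STARTING_WITH_HELLO = /^(?!Hello).*/

# Keeps only the lines that do not begin with "Hello".
def lines_not_starting_with_hello(text)
  text.each_line.map(&:chomp).select { |line| NOT_STARTING_WITH_HELLO.match?(line) }
end
```

A line such as "Oh, Hello" is kept: the negative look-ahead only inspects the very start of the line.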

Slide 80

Slide 80 text

\w+(?

Slide 81

Slide 81 text

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\ \]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\ \".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^ \"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(? =[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z| (?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z| (?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?: (?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\ \".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?: (?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?: (?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?: (?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^ \[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\ [([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.| ( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\ \".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(? =[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?: (?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\ \".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?: (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\. (?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?: \r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?: (?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)* \](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\ \]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r \\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?: [^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ [ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r \n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*) Validate an email RFC 822

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

REGEX TIPS

Slide 84

Slide 84 text

Always use anchors
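
In Ruby, ^ and $ anchor to lines, while \A and \z anchor to the whole string. A short sketch of why that matters when validating input (the constant names are illustrative):

```ruby
UNANCHORED     = /\d+/       # matches digits anywhere in the input
LINE_ANCHORED  = /^\d+$/     # matches if ANY line consists of digits
FULLY_ANCHORED = /\A\d+\z/   # matches only if the WHOLE string consists of digits
```

With the input "123\nDROP TABLE", only the fully anchored version rejects the string.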

Slide 85

Slide 85 text

Make your regex as specific as possible

Slide 86

Slide 86 text

Build the regex step by step
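
One way to keep the steps visible in Ruby is the x (extended) flag, which ignores whitespace and allows comments inside the pattern. A sketch reusing the secret agent language from earlier; AGENT_CODE_VERBOSE is a made-up name:

```ruby
AGENT_CODE_VERBOSE = /
  \A        # start of the string
  00        # the double-0 prefix
  [0-9]+    # one or more further digits
  \z        # end of the string
/x
```

Each line can be added and tested on its own before moving on to the next.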

Slide 87

Slide 87 text

RESOURCES

Slide 88

Slide 88 text

Mastering Regular Expressions

Slide 89

Slide 89 text

regular-expressions.info

Slide 90

Slide 90 text

rubular.com

Slide 91

Slide 91 text

regex101.com

Slide 92

Slide 92 text

regexcrossword.com

Slide 93

Slide 93 text

regex.alf.nu

Slide 94

Slide 94 text

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

$