Regular Expressions
Write less, Say more
Thursday, December 22, 2011
The Regular Problem
Regular Expressions provide a language that describes text and other languages
It’s a mini language of its own, with syntax rules
Regular Expressions are a power tool for solving text-related problems
The Regexp Story
Started in mathematics
1968: Entered the Unix world through Ken Thompson’s qed
1984: Standardized by Henry Spencer’s implementation
Regular Expressions Alternatives
Shell wildcards, for example: *.txt, x*x[0-9]
Dedicated Perl/C/Java program
Regular Expressions & Unix
Many UNIX tools take regular expressions
grep/egrep filter their input based on regular expressions
more/less/most search uses regular expressions
vi/vim search and replace use regular expressions
Regular Expressions Today
Supported by most modern programming languages and platforms, including:
PHP, Python, Perl, Tcl
JavaScript, ActionScript
Microsoft .NET, Oracle
Java
Objective-C
And more
Regular Expressions
The Rules
Rule #1
A simple character matches itself
Examples
Expression    Meaning
foo           Match only input lines that contain the text ‘foo’
unix          Match only input lines that contain the text ‘unix’
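The grep-style filtering in these examples can be sketched in a few lines. The snippet below is only an illustration: Python's `re` module and the helper name `grep` are assumptions made here, not something the slides prescribe.

```python
import re

def grep(pattern, lines):
    """Return the lines in which the pattern is found anywhere (hypothetical helper)."""
    return [line for line in lines if re.search(pattern, line)]

lines = ["foo bar", "seafood platter", "unix rules", "hello world"]

# Plain characters in the pattern match themselves.
foo_lines = grep("foo", lines)
unix_lines = grep("unix", lines)

print(foo_lines)
print(unix_lines)
```

Note that "seafood platter" is kept as well: a plain pattern matches its characters wherever they appear in the line, not only as a separate word.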
Rule #2
A character class matches a single character from the class
Character Classes
Pattern: [abcdABCD][0123]7
Matches: a07, B27, d17
Character Class Syntax
A class is denoted by [...]
Any sequence of characters can appear inside the square brackets: [012], [abc], [aAbBcZ]
Ranges can also appear inside the square brackets: [0-9], [a-z], [a-zA-Z], [0-9ab]
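As a rough illustration of the class syntax, the following sketch tries two of the classes listed above and one chained pair. It assumes Python's `re` module; the sample strings are made up for the example.

```python
import re

# A class matches exactly one character out of the listed set.
from_set = re.findall(r"[012]", "2011")

# Ranges and single characters can be mixed inside one class.
from_range = re.findall(r"[0-9ab]", "cab42")

# Classes can be chained: one letter followed by one digit.
pair = re.search(r"[a-zA-Z][0-9]", "room B7")

print(from_set, from_range, pair.group(0))
```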
Examples
Expression          Meaning
[0-9][0-9]          Match only input lines that contain two consecutive digits
[Uu][Nn][Ii][Xx]    Match only input lines that contain the word ‘unix’ in any casing
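A short check of the two expressions, assuming Python's `re` module and invented sample lines, shows that `[0-9][0-9]` needs the two digits to be adjacent:

```python
import re

two_digits = re.compile(r"[0-9][0-9]")
unix_any_case = re.compile(r"[Uu][Nn][Ii][Xx]")

# Two digits next to each other are found...
adjacent = two_digits.search("room 42") is not None
# ...but two digits separated by another character are not.
scattered = two_digits.search("a1b2") is not None

# Every letter position accepts either case.
mixed_case = unix_any_case.search("I like UniX") is not None
split_word = unix_any_case.search("un ix") is not None

print(adjacent, scattered, mixed_case, split_word)
```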
Which of these match?
Pattern: hello [ux][012]
hello world
hello unix
hello u2
hello x10
HELLO U2
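One way to check the answers is to run the pattern against each candidate. This is a sketch that assumes Python's `re` module and grep-like semantics (the pattern may be found anywhere in the line); `which_match` is a hypothetical helper name.

```python
import re

def which_match(pattern, candidates):
    """Map each candidate line to True if the pattern is found in it."""
    return {text: re.search(pattern, text) is not None for text in candidates}

results = which_match(
    r"hello [ux][012]",
    ["hello world", "hello unix", "hello u2", "hello x10", "HELLO U2"],
)

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

Only "hello u2" and "hello x10" match. "hello unix" fails because the character after "u" is not one of 0, 1 or 2, and "HELLO U2" fails because matching is case sensitive.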
Predefined Character Classes
\w matches a word character, [0-9a-zA-Z_]; \W matches any other character
\s matches a whitespace character; \S matches any other character
\d matches a digit; \D matches any other character
Cheat sheet: http://www.petefreitag.com/cheatsheets/regex/character-classes/
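The shorthand classes can be tried out as follows. This sketch assumes Python's `re` module; the `re.ASCII` flag is passed so that `\w` and `\d` are limited to the ASCII sets described above (without it, Python 3 also accepts Unicode letters and digits). The sample text is invented.

```python
import re

text = "id_7 = 42\tok"

words = re.findall(r"\w+", text, flags=re.ASCII)   # runs of word characters
digits = re.findall(r"\d", text, flags=re.ASCII)   # single digits
fields = re.split(r"\s+", text)                    # split on runs of whitespace
non_digits = re.findall(r"\D+", "a1b22c")          # runs of anything but digits

print(words, digits, fields, non_digits)
```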
Predefined Character Classes
POSIX classes such as [:digit:] include the brackets as part of the class name, so they must appear inside a bracket expression:
[[:digit:]]
This allows combining several classes in one expression:
[[:digit:][:lower:]]
Rule #3
A quantifier denotes how many times the preceding element may match
Quantifiers
a +
ab
b
aaab
aye cabtain
Quantifiers Syntax
* means match zero or more times
+ means match one or more times
? means match zero or one time
{n,m} means match at least n but no more than m times
{n} means match exactly n times
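Each quantifier listed above can be exercised with a small sketch like the following. It assumes Python's `re` module; `matches_fully` is a hypothetical helper that requires the whole string to match, so the counts are easy to see.

```python
import re

def matches_fully(pattern, text):
    """True when the entire text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

cases = [
    (r"ab*", "a"),         # * allows zero b's
    (r"ab+", "a"),         # + needs at least one b
    (r"ab+", "abbb"),      # + allows many b's
    (r"ab?", "abb"),       # ? allows at most one b
    (r"a{2,3}", "aaa"),    # between 2 and 3 a's
    (r"a{2,3}", "aaaa"),   # 4 a's is too many
    (r"a{2}", "aa"),       # exactly 2 a's
]

results = [matches_fully(pattern, text) for pattern, text in cases]
print(results)
```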
Which of these match?
Pattern: [[:digit:]]{2}-?[[:digit:]]{7}
08-9112232
421121212
054-2201121
Phone: 03-9112121
Bond 007
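The answers can be verified with a short script. This is a sketch under two assumptions: Python's `re` module is used, and because `re` has no POSIX bracket classes, `[[:digit:]]` is written as `[0-9]`. The pattern may be found anywhere in the line, as with grep.

```python
import re

# [0-9] stands in for [[:digit:]], which Python's re does not support.
PHONE = re.compile(r"[0-9]{2}-?[0-9]{7}")

candidates = ["08-9112232", "421121212", "054-2201121", "Phone: 03-9112121", "Bond 007"]
results = {text: PHONE.search(text) is not None for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

"054-2201121" matches because the search may start at the second character ("54-2201121"), and "421121212" matches because the dash is optional. "Bond 007" has too few digits.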
Which of these match?
Pattern: (http://)?w{3}\.[a-z]+\.com
www.google.com
www.ynet.co.il
http://mail.google.com
http://www.home.com
http://www.tel-aviv.com
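A quick way to test these candidates, assuming Python's `re` module and a search anywhere in the line:

```python
import re

URL = re.compile(r"(http://)?w{3}\.[a-z]+\.com")

candidates = [
    "www.google.com",
    "www.ynet.co.il",
    "http://mail.google.com",
    "http://www.home.com",
    "http://www.tel-aviv.com",
]
results = {text: URL.search(text) is not None for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

Only "www.google.com" and "http://www.home.com" match. "www.ynet.co.il" does not end its host name with ".com", "http://mail.google.com" has no "www.", and the dash in "tel-aviv" is outside the class [a-z].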
Backtracking
When the engine encounters a quantifier, it keeps adding matches to the quantified element for as long as possible
If a match failure occurs later on, the engine backtracks: it gives back characters from the quantified element one at a time and retries the rest of the pattern
Backtracking
Examine the expression:
[a-z]*b+c
Input string:
aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc
Backtracking
Examine the expression:
[a-z]*b*c
Input string:
aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc
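The effect of backtracking on the two expressions can be observed by adding capture groups around the quantified elements. This is a sketch that assumes Python's `re` module, which uses a backtracking engine; the groups are added only for inspection and are not part of the original expressions.

```python
import re

text = "aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc"

# [a-z]* first swallows the whole input, then gives characters back
# until the rest of the pattern can match.
plus = re.match(r"([a-z]*)(b+)c", text)
star = re.match(r"([a-z]*)(b*)c", text)

last_b = text.rindex("b")

# With b+, the class must give back every c and one b.
print(len(plus.group(1)), repr(plus.group(2)), plus.end())

# With b*, giving back a single c is enough, because b* may match nothing.
print(len(star.group(1)), repr(star.group(2)), star.end())
```

With `b+`, the class ends up holding everything before the last "b", so the overall match stops at the first "c". With `b*`, the class keeps all but the final character and the match covers the whole input.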
Rule #4
An assertion matches a condition at a position in the input, without consuming any characters
Assertions
^ matches the beginning of a line
$ matches the end of a line
\b matches a word boundary
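The three assertions can be tried with the sketch below. It assumes Python's `re` module, where `^` and `$` match at every line only when the `re.MULTILINE` flag is given; the sample text is invented.

```python
import re

log = "dir one\nfile two\ndata three"

# ^ anchors the match to the start of a line.
starts_with_d = re.findall(r"^d\w*", log, flags=re.MULTILINE)

# $ anchors the match to the end of a line.
ends_with_e = re.findall(r"\w*e$", log, flags=re.MULTILINE)

# \b requires a word boundary on both sides, so "cat" inside
# "concatenate" is skipped.
sentence = "concatenate the cat"
whole_word = re.search(r"\bcat\b", sentence)

print(starts_with_d, ends_with_e, whole_word.start())
```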
Which of these match?
Pattern: ^d
drwxr-xr-x dive
-rwxr-xr-x dive
lrwxr-xr-x dive
drwxr-xr-x /home
-rwxr-xr-x /etc/passwd
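A sketch of how `^d` picks out directory entries from listing lines like these, assuming Python's `re` module and one candidate per line:

```python
import re

listing = [
    "drwxr-xr-x dive",
    "-rwxr-xr-x dive",
    "lrwxr-xr-x dive",
    "drwxr-xr-x /home",
    "-rwxr-xr-x /etc/passwd",
]

# ^d matches only when the very first character of the line is "d".
directories = [line for line in listing if re.search(r"^d", line)]
print(directories)
```

The other lines all contain a "d" somewhere ("dive", "passwd"), but not at the start, so they are rejected.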
Which of these match?
Pattern: ^.$
x
mmm
42
9
...
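The pattern `^.$` accepts lines that consist of exactly one character. A minimal check, assuming Python's `re` module and candidates without trailing newlines:

```python
import re

candidates = ["x", "mmm", "42", "9", "..."]

# ^ and $ pin both ends of the line, leaving room for a single character.
single_char = [text for text in candidates if re.search(r"^.$", text)]
print(single_char)
```

The candidate "..." is rejected as well: an unescaped dot in the pattern means "any one character", and that line has three.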
Captures
Capturing Parens
Use parentheses to capture the text matched by a subexpression
Use \1, \2, etc. to refer back to the captured text
Capturing Parens
Input: Paris in the the spring
Pattern: (\b\w+\b) \1
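The doubled-word example can be reproduced as follows. This sketch assumes Python's `re` module; using `sub` to repair the sentence is an addition for illustration, not something shown on the slide.

```python
import re

doubled = re.compile(r"(\b\w+\b) \1")
sentence = "Paris in the the spring"

# \1 must repeat exactly what group 1 captured.
found = doubled.search(sentence)

# Replacing the whole match with group 1 drops the repeated word.
fixed = doubled.sub(r"\1", sentence)

print(found.group(0), "|", fixed)
```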
Which of these match?
Pattern: (\d)(\d)\2\1
1111
1001
1414
12321
9889
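The pattern looks for four digits in mirrored order: two captured digits followed by the second and then the first again. A sketch for checking the candidates, assuming Python's `re` module and a search anywhere in the line:

```python
import re

MIRRORED = re.compile(r"(\d)(\d)\2\1")

candidates = ["1111", "1001", "1414", "12321", "9889"]
results = {text: MIRRORED.search(text) is not None for text in candidates}

for text, matched in results.items():
    print(f"{text}: {'match' if matched else 'no match'}")
```

"1414" repeats its digits in the same order rather than mirrored, and "12321" is a palindrome of odd length, so no four consecutive digits in it have the required shape.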
Q & A
Regular Expressions
Classes
Quantifiers
Assertions
Captures
Let’s Talk Perl
Define A Regexp
Use qr to define a regular expression:
my $DIGITS_RE = qr {
    ^ \d+ $
}xms;
The x flag allows whitespace and comments inside the pattern, m makes ^ and $ match at every line boundary, and s lets . match a newline as well
Match Against a Regexp
$text =~ $DIGITS_RE
returns true if $text matches the pattern
Can also use inline regexp:
$text =~ /^\d+$/;
Match With Capture
If a regexp has captures, the match operator in list context returns the list of captured groups
my ($key, $value) = $line =~ $CONFIG_LINE;
Search & Replace
Use s/// to perform a search & replace operation
Replace the first occurrence of $PATTERN in $text with the contents of $new:
$text =~ s/$PATTERN/$new/;
Search & Replace
Replace all occurrences of $PATTERN in $text with the contents of $new:
$text =~ s/$PATTERN/$new/g;
Q & A
Thank You
Ynon Perek
ynonperek.com
All Rights Reserved to Ynon Perek