Regular Expressions
Write less, Say more
Thursday, January 26, 2012
The Regular Problem
Regular Expressions provide a language that describes text and other languages
It’s a mini language of its own, with syntax rules
Regular Expressions are a power tool for solving text-related problems
The Regexp Story
Started in Mathematics
1968: Entered the Unix world through Ken Thompson’s qed
1984: Standardized by Henry Spencer’s implementation
Regular Expressions Alternatives
Shell Wildcards: *.txt, x*x[0-9]
Dedicated Perl/C/Java program
Regular Expressions & Unix
Many UNIX tools take regular expressions
grep/egrep filter their input based on regular expressions
more/less/most search uses regular expressions
vi/vim search and replace use regular expressions
Regular Expressions Today
Supported by most programming languages, including:
PHP, Python, Perl, Tcl
JavaScript, ActionScript
Microsoft .NET, Oracle
Java
Objective-C
And more
Regular Expressions
The Rules
Rule #1
A simple character matches itself
Examples
Command       Meaning
egrep foo     Display only input lines that contain the string ‘foo’
egrep unix    Display only input lines that contain the string ‘unix’
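Rule #1 can be tried outside the shell as well. The following is a minimal Python sketch, not the way egrep itself is implemented; the helper name grep and the sample lines are made up for illustration.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up sample input, one string per input line.
sample = ["foo bar", "seafood menu", "unix rules", "Unix Rules", "nothing here"]

print(grep("foo", sample))   # 'foo' may appear anywhere, even inside 'seafood'
print(grep("unix", sample))  # simple characters are case sensitive
```

Note that a simple character sequence matches anywhere in the line, so 'seafood menu' is kept, while 'Unix Rules' is dropped because of the capital U.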
Rule #2
A character class matches a single character from the class
Character Classes
Pattern: [abcdABCD][0123][7]
Matches: a07, B27, d17
Character Class Syntax
A class is denoted by [...]
Can use any character sequence inside the square brackets
[012], [abc], [aAbBcZ]
Can use ranges inside the square brackets
[0-9], [a-z], [a-zA-Z], [0-9ab]
Can negate a class with a leading ^ (matches any character not listed)
[^abc], [^0-9]
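The three forms above (explicit list, range, negation) behave the same way in Python's re module. A small sketch with made-up input strings:

```python
import re

# Explicit list: any one of a, b or c.
listed = re.findall(r"[abc]", "cab ride")
# Range: any one digit.
digits = re.findall(r"[0-9]", "room 42")
# Combined ranges: any one ASCII letter.
letters = re.findall(r"[a-zA-Z]", "R2-D2")
# Range mixed with listed characters.
mixed = re.findall(r"[0-9ab]", "cab 7")
# Negation: any one character that is NOT a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(listed, digits, letters, mixed, non_digits)
```

Each class consumes exactly one character per match, which is why findall returns single-character strings.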
Examples
Command                     Meaning
egrep '[0-9][0-9]'          Display only input lines that include at least two consecutive digits
egrep '[Uu][Nn][Ii][Xx]'    Display only input lines that include the word ‘unix’ in any casing
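The two patterns from this table can be checked in Python. This is only a sketch with invented input strings; it illustrates that [0-9][0-9] needs the two digits to sit next to each other.

```python
import re

two_digits = re.compile(r"[0-9][0-9]")
any_case_unix = re.compile(r"[Uu][Nn][Ii][Xx]")

has_42 = two_digits.search("room 42") is not None      # two adjacent digits
has_split = two_digits.search("a1b2") is not None      # two digits, but not adjacent
upper = any_case_unix.search("I like UNIX") is not None
mixed = any_case_unix.search("uNiX fan") is not None

print(has_42, has_split, upper, mixed)
```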
Which of these match?
Pattern: hello [ux][012]
hello world
hello unix
hello u2
hello x10
HELLO U2
Which of these match?
Pattern: hello [ux][012]
hello world
hello uni0
hello u2
hello x10
HELLO U2
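One way to check the quiz is to run the pattern against each input. The sketch below uses Python's re.search, which, like egrep, looks for the pattern anywhere in the line; the variable names are illustrative only.

```python
import re

pattern = re.compile(r"hello [ux][012]")
inputs = ["hello world", "hello unix", "hello uni0",
          "hello u2", "hello x10", "HELLO U2"]

# Map each input line to True (match) or False (no match).
results = {text: pattern.search(text) is not None for text in inputs}

for text, matched in results.items():
    print(f"{text!r:15} {'match' if matched else 'no match'}")
```

Only 'hello u2' and 'hello x10' match: 'hello unix' and 'hello uni0' fail because 'n' is not in [012], and 'HELLO U2' fails because matching is case sensitive.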
Predefined Character Classes
Note that the brackets are part of the class name, so the correct usage is:
[[:digit:]]
This allows combining classes inside one bracket expression:
[[:digit:][:lower:]]
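Python's re module does not understand POSIX class names such as [:digit:], so trying these patterns there needs a translation step. The sketch below is one possible approach under a loud assumption: the class names are replaced by plain ASCII ranges, which ignores locale. The table POSIX_CLASSES and the helper posix_to_python are hypothetical names, not part of any library.

```python
import re

# Assumed ASCII equivalents for a few POSIX class names (locale is ignored).
POSIX_CLASSES = {
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
    "[:alpha:]": "a-zA-Z",
    "[:alnum:]": "a-zA-Z0-9",
}

def posix_to_python(pattern):
    """Replace POSIX class names inside a bracket expression with ASCII ranges."""
    for name, ranges in POSIX_CLASSES.items():
        pattern = pattern.replace(name, ranges)
    return pattern

single = posix_to_python("[[:digit:]]")
combined = posix_to_python("[[:digit:][:lower:]]")
found = re.findall(combined, "Room 4b")

print(single, combined, found)
```

The outer brackets survive the translation, which mirrors the point of the slide: the inner [:name:] is only the class name, and the outer [...] is the character class itself.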
Rule #3
A quantifier denotes how many times the preceding element may match
Quantifiers
Pattern: a+b
Matches: ab, aaab, aye cabtain (the ‘ab’ inside ‘cabtain’)
Quantifiers Syntax
* means match zero or more times - {0,}
+ means match one or more times - {1,}
? means match zero or one time - {0,1}
{n,m} means match at least n but no more than m times
{n} means match exactly n times
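Each quantifier in the list can be exercised with a whole-string test. The sketch below uses Python's re.fullmatch so that extra characters cannot hide a failure; the helper full and the sample patterns are made up for illustration.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "star_zero": full(r"ab*c", "ac"),             # * allows zero b's
    "star_many": full(r"ab*c", "abbbc"),          # * allows many b's
    "plus_zero": full(r"ab+c", "ac"),             # + needs at least one b
    "plus_one": full(r"ab+c", "abc"),
    "opt_absent": full(r"colou?r", "color"),      # ? allows zero u's
    "opt_present": full(r"colou?r", "colour"),    # ? allows one u
    "opt_twice": full(r"colou?r", "colouur"),     # ? does not allow two
    "range_low": full(r"[0-9]{2,4}", "7"),        # fewer than n
    "range_ok": full(r"[0-9]{2,4}", "123"),
    "range_high": full(r"[0-9]{2,4}", "12345"),   # more than m
    "exact_ok": full(r"[0-9]{3}", "123"),
    "exact_short": full(r"[0-9]{3}", "12"),
    "star_as_braces": full(r"ab{0,}c", "ac"),     # {0,} behaves like *
}

for name, ok in checks.items():
    print(name, ok)
```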
Which of these match?
Pattern: [[:digit:]]{2}-?[[:digit:]]{7}
08-9112232
421121212
054-2201121
Phone: 03-9112121
Bond 007
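The phone number quiz can be checked with a short script. This sketch assumes [0-9] as a stand-in for [[:digit:]], because Python's re module has no POSIX class names; everything else is the pattern from the slide.

```python
import re

# [0-9] is used here as an assumed equivalent of [[:digit:]].
pattern = re.compile(r"[0-9]{2}-?[0-9]{7}")
inputs = ["08-9112232", "421121212", "054-2201121",
          "Phone: 03-9112121", "Bond 007"]

# Map each input to the matched substring, or None when nothing matches.
matched = {}
for text in inputs:
    m = pattern.search(text)
    matched[text] = m.group() if m else None

for text, part in matched.items():
    print(f"{text!r:22} -> {part!r}")
```

Because the pattern is not anchored, '054-2201121' still matches: the engine skips the leading 0 and matches '54-2201121'. Only 'Bond 007' fails, since it has too few digits.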
Which of these match?
Pattern: (http://)?w{3}\.[a-z]+\.com
www.google.com
www.ynet.co.il
http://mail.google.com
http://www.home.com
http://www.tel-aviv.com
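The URL quiz works the same way. The pattern needs no translation for Python's re module, so the sketch below uses it unchanged; the variable names are illustrative.

```python
import re

pattern = re.compile(r"(http://)?w{3}\.[a-z]+\.com")
inputs = ["www.google.com", "www.ynet.co.il", "http://mail.google.com",
          "http://www.home.com", "http://www.tel-aviv.com"]

# True when the pattern is found anywhere in the input.
results = {text: pattern.search(text) is not None for text in inputs}

for text, ok in results.items():
    print(f"{text:26} {'match' if ok else 'no match'}")
```

Three inputs fail for different reasons: 'www.ynet.co.il' has no '.com', 'http://mail.google.com' has no 'www.', and in 'http://www.tel-aviv.com' the hyphen is not in [a-z], so [a-z]+ stops at 'tel' and the following '\.' cannot match.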
Backtracking
When the engine encounters a quantifier, it keeps adding matches to the quantified element for as long as possible (greedy matching)
If a match failure occurs later on, the engine backtracks: it gives back characters from the quantified element and retries
Backtracking
Examine the expression:
[a-z]*b+c
Input string:
aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc
Backtracking
Examine the expression:
[a-z]*b*c
Input string:
aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc
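The effect of backtracking on the two expressions can be observed by looking at what each one finally matches. The sketch below uses Python's re module, which is a backtracking engine; other engines may reach the same result by a different route.

```python
import re

text = "aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc"
first_c = text.index("c")  # position of the first 'c' in the input

# [a-z]* first swallows the whole string, then gives characters back
# until b+ can take one 'b' and the literal c can take a 'c'.
plus = re.search(r"[a-z]*b+c", text)

# Here b* may match nothing, so [a-z]* only has to give back one character.
star = re.search(r"[a-z]*b*c", text)

print(plus.group())
print(star.group())
```

With b+ the match ends at the first 'c', because the engine has to back up all the way to the last 'b'. With b* the match covers the entire input, because backing up by a single character is enough.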
Rule #4
An assertion matches a condition at a position, without consuming input characters
Assertions
^ matches the beginning of a line
$ matches the end of a line
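That assertions match positions rather than characters shows up in the span of the match. A short Python sketch follows; the listing text is invented, and the re.MULTILINE flag is a Python detail needed because the sample is one multi-line string, whereas egrep sees one line at a time.

```python
import re

# ^ and $ match positions, so the matched text is empty.
at_start = re.search(r"^", "abc")
at_end = re.search(r"$", "abc")
print(at_start.span(), at_end.span())

# Made-up two-line listing held in a single string.
listing = "drwxr-xr-x docs\n-rw-r--r-- notes.txt"

# With re.MULTILINE, ^ and $ apply to every line, not only the whole string.
dirs = re.findall(r"^d.*$", listing, re.MULTILINE)
txt_files = re.findall(r"^.*txt$", listing, re.MULTILINE)
print(dirs, txt_files)
```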
Which of these match?
Pattern: ^d
drwxr-xr-x dive
-rwxr-xr-x dive
lrwxr-xr-x dive
drwxr-xr-x /home
-rwxr-xr-x /etc/passwd
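The ^d quiz can be checked line by line, which is close to how egrep processes its input. A minimal Python sketch with the lines from the slide:

```python
import re

pattern = re.compile(r"^d")
lines = ["drwxr-xr-x dive", "-rwxr-xr-x dive", "lrwxr-xr-x dive",
         "drwxr-xr-x /home", "-rwxr-xr-x /etc/passwd"]

# Keep only the lines whose first character is 'd'.
kept = [line for line in lines if pattern.search(line)]
print(kept)
```

The lines containing 'dive' all include a 'd', but only those that start with 'd' pass, because ^ pins the match to the beginning of the line.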
Regular Expressions Variants
Old style (basic) regexps didn’t have + or {}. Therefore, in plain grep, these have to be backslashed to act as quantifiers
Use egrep (or its modern equivalent, grep -E) when possible for cleaner syntax
Q & A
Thank You
Ynon Perek
[email protected]
ynonperek.com