Slide 1

Slide 1 text

Regular Expressions Write less, Say more Thursday, January 26, 2012

Slide 2

Slide 2 text

The Regular Problem

Slide 3

Slide 3 text

The Regular Problem
Regular expressions provide a language that describes text and other languages.
It’s a mini-language of its own, with its own syntax rules.
Regular expressions are a power tool for solving text-related problems.

Slide 4

Slide 4 text

The Regexp Story
Started in Mathematics
1968: Entered the Unix world through Ken Thompson’s qed
1984: Standardized by Henry Spencer’s implementation

Slide 5

Slide 5 text

Regular Expressions Alternatives
Shell wildcards: *.txt, x*x[0-9]
Dedicated perl/c/java program

Slide 6

Slide 6 text

Regular Expressions & Unix
Many UNIX tools take regular expressions:
grep/egrep filter their input based on regular expressions
more/less/most search uses regular expressions
vi/vim search and replace use regular expressions

Slide 7

Slide 7 text

Regular Expressions Today
Supported by most programming languages, including:
PHP, Python, Perl, Tcl
JavaScript, ActionScript, Microsoft .NET, Oracle Java
Objective-C
And more

Slide 8

Slide 8 text

Regular Expressions
The Rules

Slide 9

Slide 9 text

Rule #1
A simple character matches itself

Slide 10

Slide 10 text

Examples
Command       Meaning
egrep foo     Display only input lines that contain the string ‘foo’
egrep unix    Display only input lines that contain the string ‘unix’
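As a rough sketch of what the egrep commands above do, the snippet below filters a list of lines with Python's standard `re` module. Python is an assumption here (the slides only show egrep), and `grep_lines` and the sample input are hypothetical names and data chosen for illustration.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for pattern, similar to egrep."""
    regex = re.compile(pattern)
    # search() looks for the pattern anywhere in the line, not only at the start
    return [line for line in lines if regex.search(line)]

sample = ["foo bar", "seafood", "unix rules", "Linux"]  # made-up input lines

print(grep_lines("foo", sample))   # both lines that contain "foo" as a substring
print(grep_lines("unix", sample))  # "Linux" is not selected: the letters differ
```

Note that "seafood" is selected as well: a simple character sequence matches anywhere in the line, not only as a separate word.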

Slide 11

Slide 11 text

Rule #2
A character class matches a single character from the class

Slide 12

Slide 12 text

Character Classes
One character is taken from each set: abcdABCD, then 0123, then 7
Sample matches: a07, B27, d17

Slide 13

Slide 13 text

Character Class Syntax
A class is denoted by [...]
Can use any character sequence inside the brackets: [012], [abc], [aAbBcZ]
Can use ranges inside the brackets: [0-9], [a-z], [a-zA-Z], [0-9ab]
Can negate a class with a leading ^: [^abc], [^0-9]
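The three forms of class syntax can be tried out with a minimal sketch like the one below. It assumes Python's `re` module, whose bracket syntax for these cases matches what the slide describes; the input strings are made up.

```python
import re

# findall() returns every single character that the class accepts, in order.
listed = re.findall(r"[aAbBcZ]", "abcABCZ")   # explicit list of characters
ranged = re.findall(r"[0-9ab]", "a1c9bz")     # a range plus two extra characters
negated = re.findall(r"[^0-9]", "a1b2")       # anything that is NOT a digit

print(listed)   # 'C' is skipped, it is not in the class
print(ranged)   # 'c' and 'z' are skipped
print(negated)  # only the letters remain
```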

Slide 14

Slide 14 text

Examples
Command                     Meaning
egrep '[0-9][0-9]'          Display only input lines that contain at least two consecutive digits
egrep '[Uu][Nn][Ii][Xx]'    Display only input lines that contain the string ‘unix’ in any casing

Slide 15

Slide 15 text

Which of these match?
Pattern            Input
hello [ux][012]    hello world
hello [ux][012]    hello unix
hello [ux][012]    hello u2
hello [ux][012]    hello x10
hello [ux][012]    HELLO U2

Slide 16

Slide 16 text

Which of these match?
Pattern            Input          Match?
hello [ux][012]    hello world    no
hello [ux][012]    hello uni0     no
hello [ux][012]    hello u2       yes
hello [ux][012]    hello x10      yes (matches ‘hello x1’)
hello [ux][012]    HELLO U2       no (matching is case sensitive)
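The quiz can be checked mechanically. The sketch below assumes Python's `re` module as a stand-in for egrep and runs the pattern against the inputs from both versions of the slide; the `re.IGNORECASE` flag at the end is an addition, shown only to illustrate why the last input fails.

```python
import re

pattern = re.compile(r"hello [ux][012]")
inputs = ["hello world", "hello unix", "hello uni0",
          "hello u2", "hello x10", "HELLO U2"]

# search() succeeds if the pattern occurs anywhere in the text
answers = {text: bool(pattern.search(text)) for text in inputs}
for text, matched in answers.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")

# With case-insensitive matching the upper-case input is accepted too.
upper_ok = bool(re.search(r"hello [ux][012]", "HELLO U2", re.IGNORECASE))
```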

Slide 17

Slide 17 text

Predefined Character Classes
[:alnum:]  [:alpha:]  [:cntrl:]  [:digit:]
[:graph:]  [:lower:]  [:print:]  [:punct:]
[:space:]  [:upper:]  [:xdigit:]
. (matches any single character)
Cheat sheet at: http://www.petefreitag.com/cheatsheets/regex/character-classes/

Slide 18

Slide 18 text

Predefined Character Classes
Note that the brackets are part of the class name, so the correct use is: [[:digit:]]
This allows combining classes: [[:digit:][:lower:]]
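Python's `re` module does not support the POSIX `[:name:]` classes, so a sketch in Python has to translate them first. The helper below, `expand_posix`, is hypothetical and covers only a few class names, using their plain ASCII equivalents (an assumption that ignores locale).

```python
import re

# ASCII equivalents for a few POSIX class names (locale effects are ignored).
POSIX_CLASSES = {
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a bracket expression with its ASCII range."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

expanded = expand_posix("[[:digit:][:lower:]]")
print(expanded)  # the outer brackets stay, the inner names become ranges
```

This also shows why the outer brackets are needed: `[:digit:]` only names the set, and the surrounding `[...]` turns it into a character class.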

Slide 19

Slide 19 text

Rule #3
A quantifier denotes how many times the preceding element may match

Slide 20

Slide 20 text

Quantifiers
Pattern: a+
Sample inputs: ab, b, aaab, aye, cabtain

Slide 21

Slide 21 text

Quantifiers Syntax
* means match zero or more times, same as {0,}
+ means match one or more times, same as {1,}
? means match zero or one time, same as {0,1}
{n,m} means match at least n but no more than m times
{n} means match exactly n times
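The quantifier table can be verified with a short sketch, assuming Python's `re` module, which uses the same quantifier syntax. The helper `full` and the sample strings are hypothetical.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["", "a", "aa", "aaa", "aaaa"]

star = [s for s in samples if full(r"a*", s)]        # zero or more
plus = [s for s in samples if full(r"a+", s)]        # one or more
optional = [s for s in samples if full(r"a?", s)]    # zero or one
bounded = [s for s in samples if full(r"a{2,3}", s)] # between 2 and 3
exact = [s for s in samples if full(r"a{2}", s)]     # exactly 2

# Each shorthand behaves like its brace form on every sample.
same = all(
    full("a*", s) == full("a{0,}", s)
    and full("a+", s) == full("a{1,}", s)
    and full("a?", s) == full("a{0,1}", s)
    for s in samples
)
```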

Slide 22

Slide 22 text

Which of these match?
Pattern                           Input
[[:digit:]]{2}-?[[:digit:]]{7}    08-9112232
[[:digit:]]{2}-?[[:digit:]]{7}    421121212
[[:digit:]]{2}-?[[:digit:]]{7}    054-2201121
[[:digit:]]{2}-?[[:digit:]]{7}    Phone: 03-9112121
[[:digit:]]{2}-?[[:digit:]]{7}    Bond 007

Slide 23

Slide 23 text

Which of these match?
Pattern                           Input                Match?
[[:digit:]]{2}-?[[:digit:]]{7}    08-9112232           yes
[[:digit:]]{2}-?[[:digit:]]{7}    421121212            yes (the dash is optional)
[[:digit:]]{2}-?[[:digit:]]{7}    054-2201121          yes (matches ‘54-2201121’)
[[:digit:]]{2}-?[[:digit:]]{7}    Phone: 03-9112121    yes (matches ‘03-9112121’)
[[:digit:]]{2}-?[[:digit:]]{7}    Bond 007             no
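A sketch of the phone-number quiz, assuming Python's `re` module with `[0-9]` substituted for `[[:digit:]]` (Python does not support the POSIX class names). It also contrasts finding the pattern somewhere in the line, which is what egrep does, with requiring the whole line to match.

```python
import re

phone = re.compile(r"[0-9]{2}-?[0-9]{7}")
inputs = ["08-9112232", "421121212", "054-2201121",
          "Phone: 03-9112121", "Bond 007"]

def first_match(text):
    """Return the matched substring, or None when nothing matches."""
    m = phone.search(text)
    return m.group() if m else None

# What egrep cares about: does the pattern occur anywhere in the line?
found = {text: first_match(text) for text in inputs}

# A stricter check: the entire line must be the phone number.
whole = [text for text in inputs if phone.fullmatch(text)]

for text, hit in found.items():
    print(f"{text!r} -> {hit!r}")
```

The third input is the interesting one: it matches only because the search may start at the second character.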

Slide 24

Slide 24 text

Which of these match?
Pattern                        Input
(http://)?w{3}\.[a-z]+\.com    www.google.com
(http://)?w{3}\.[a-z]+\.com    www.ynet.co.il
(http://)?w{3}\.[a-z]+\.com    http://mail.google.com
(http://)?w{3}\.[a-z]+\.com    http://www.home.com
(http://)?w{3}\.[a-z]+\.com    http://www.tel-aviv.com

Slide 25

Slide 25 text

Which of these match?
Pattern                        Input                      Match?
(http://)?w{3}\.[a-z]+\.com    www.google.com             yes
(http://)?w{3}\.[a-z]+\.com    www.ynet.co.il             no (‘.co’ is not ‘.com’)
(http://)?w{3}\.[a-z]+\.com    http://mail.google.com     no (no ‘www.’)
(http://)?w{3}\.[a-z]+\.com    http://www.home.com        yes
(http://)?w{3}\.[a-z]+\.com    http://www.tel-aviv.com    no (‘-’ is not in [a-z])
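The URL quiz can be checked the same way. This sketch assumes Python's `re` module; the second pattern, which adds `-` to the class, and the unescaped-dot comparison are illustrative additions and not part of the slides.

```python
import re

url = re.compile(r"(http://)?w{3}\.[a-z]+\.com")
inputs = ["www.google.com", "www.ynet.co.il", "http://mail.google.com",
          "http://www.home.com", "http://www.tel-aviv.com"]

matched = [text for text in inputs if url.search(text)]
print(matched)

# Hypothetical variant: allowing '-' in the class accepts the last input.
url_with_dash = re.compile(r"(http://)?w{3}\.[a-z-]+\.com")
dash_ok = bool(url_with_dash.search("http://www.tel-aviv.com"))

# Why the dots are escaped: an unescaped '.' matches any character.
loose = bool(re.search(r"w{3}.[a-z]+.com", "wwwxgooglexcom"))
strict = bool(re.search(r"w{3}\.[a-z]+\.com", "wwwxgooglexcom"))
```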

Slide 26

Slide 26 text

Backtracking
When the engine encounters a quantifier, it keeps adding matches to the quantified element for as long as possible.
If a match failure occurs later on, the engine backtracks: it gives back characters one at a time and tries again.

Slide 27

Slide 27 text

Backtracking
Examine the expression: [a-z]*b+c
Input string: aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc

Slide 28

Slide 28 text

Backtracking
Examine the expression: [a-z]*b*c
Input string: aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc
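The effect of backtracking on the two expressions above can be observed by adding capture groups and looking at what each part ends up holding. This is a sketch that assumes Python's `re` module, which is a backtracking engine; grep implementations may reach the same result by other means. The parentheses are added for inspection only.

```python
import re

text = "aaaaaaaaaaabbbbbbbbbbbbbcccccccccccc"
na, nb = text.count("a"), text.count("b")

# [a-z]* first grabs the whole string, then gives characters back
# until the rest of the pattern can match.
plus = re.match(r"([a-z]*)(b+)c", text)
star = re.match(r"([a-z]*)(b*)c", text)

# With b+, the class must give back every 'c' and one 'b'.
print(len(plus.group(1)), repr(plus.group(2)))

# With b*, giving back a single 'c' is enough, and b* matches nothing.
print(len(star.group(1)), repr(star.group(2)))
```

In the first case the overall match stops after the first `c`; in the second it covers the entire input.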

Slide 29

Slide 29 text

Rule #4
An assertion matches a condition at a position, without consuming input characters

Slide 30

Slide 30 text

Assertions
^ matches the beginning of a line
$ matches the end of a line

Slide 31

Slide 31 text

Which of these match?
Pattern    Input
^d         drwxr-xr-x dive
^d         -rwxr-xr-x dive
^d         lrwxr-xr-x dive
^d         drwxr-xr-x /home
^d         -rwxr-xr-x /etc/passwd

Slide 32

Slide 32 text

Which of these match?
Pattern    Input                     Match?
^d         drwxr-xr-x dive           yes
^d         -rwxr-xr-x dive           no (the ‘d’ is not at the beginning)
^d         lrwxr-xr-x dive           no (the ‘d’ is not at the beginning)
^d         drwxr-xr-x /home          yes
^d         -rwxr-xr-x /etc/passwd    no
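A sketch of the two assertions, assuming Python's `re` module and treating each list element as one input line. The `dive$` pattern is an added illustration of the end-of-line anchor, which the quiz itself does not exercise.

```python
import re

listing = [
    "drwxr-xr-x dive",
    "-rwxr-xr-x dive",
    "lrwxr-xr-x dive",
    "drwxr-xr-x /home",
    "-rwxr-xr-x /etc/passwd",
]

# ^d: a 'd' that sits at the very beginning of the line
dirs = [line for line in listing if re.search(r"^d", line)]

# dive$: the text 'dive' that sits at the very end of the line
ends_with_dive = [line for line in listing if re.search(r"dive$", line)]

# The assertion itself consumes nothing: only the 'd' is part of the match.
consumed = re.search(r"^d", listing[0]).group()
empty = re.search(r"^", listing[0]).group()
```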

Slide 33

Slide 33 text

Regular Expressions Variants
Old-style (basic) regexps didn’t have + or {}.
Therefore, in grep, these have to be backslashed, for example \{2\} (and \+ in GNU grep).
Use egrep when possible for cleaner syntax.

Slide 34

Slide 34 text

Q & A

Slide 35

Slide 35 text

Thank You
Ynon Perek
[email protected]
ynonperek.com