Regular Expressions

Pattern Matching and Text Processing Using Regular Expressions Karmen Blake

Regular Expressions
What is a regular expression?
• A pattern of characters which may or may not predict (match) a given string.
• Use the pattern match/no match for conditional branching
• Scan for a pattern
• String substitutions
• Split strings
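The four uses listed above can be sketched in a few lines of Ruby. The sample text and variable names here are made up for illustration and are not from the original slides.

```ruby
# Hypothetical sample text; the patterns are illustrative only.
line = "order 42 shipped to Portland, OR"

# 1. Conditional branching on match / no match
status = line.match?(/shipped/) ? "done" : "pending"

# 2. Scanning for every occurrence of a pattern
words = line.scan(/[A-Za-z]+/)

# 3. String substitution (replace the first run of digits)
masked = line.sub(/\d+/, "#")

# 4. Splitting a string on a pattern (optional comma, then whitespace)
fields = line.split(/,?\s+/)

puts status
p words
puts masked
p fields
```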

Regular Expressions
Examples of what a "pattern" is:
• The letter a, followed by a digit
• Any uppercase letter, followed by at least one lowercase letter
• Three digits, followed by a hyphen, followed by four digits
• The beginning of a line, followed by one or more whitespace characters
• The character . (period) at the end of a string
• An uppercase letter at the beginning of a word
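As a rough preview, each of these English descriptions could be written as a Ruby regular expression along the following lines. The constant names and sample strings are assumptions for illustration, and other spellings of each pattern are possible.

```ruby
# Constant names are illustrative; each regex mirrors one bullet above.
LETTER_A_THEN_DIGIT  = /a\d/
CAPITALIZED_WORD     = /[A-Z][a-z]+/
PHONE_LIKE           = /\d{3}-\d{4}/
LEADING_WHITESPACE   = /^\s+/
TRAILING_PERIOD      = /\.\z/
WORD_INITIAL_CAPITAL = /\b[A-Z]/

checks = [
  [LETTER_A_THEN_DIGIT,  "a1"],
  [CAPITALIZED_WORD,     "Neo"],
  [PHONE_LIKE,           "333-4444"],
  [LEADING_WHITESPACE,   "  indented"],
  [TRAILING_PERIOD,      "The end."],
  [WORD_INITIAL_CAPITAL, "hello World"]
]

# Print whether each pattern matches its sample string.
checks.each do |pattern, text|
  puts "#{pattern.inspect} matches #{text.inspect}: #{pattern.match?(text)}"
end
```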

Regular Expressions
Regular expression literal
//
//.class in irb gives you Regexp
Between the slashes is where you put patterns
/some-pattern-here/

Regular Expressions
A pattern matching adventure requires two creatures:
• a regular expression
• a string
The regular expression makes its predictions on the string. The predictions either match or don't.

Regular Expressions
"A match made in heaven"
puts "Neo" if /Neo/.match "Neo is in the Matrix"
puts "Neo" if "Neo is in the Matrix".match /Neo/
match returns a MatchData object (more on this later) when the pattern matches. /Neo/.match "Neo is in the Matrix" in irb would show the MatchData object returned. If no match occurs, nil is returned, which Ruby treats as false in a conditional.
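A small sketch of the two outcomes described above: a MatchData object on success and nil on failure. The /Trinity/ pattern is added here only to show the no-match case.

```ruby
sentence = "Neo is in the Matrix"

hit  = /Neo/.match(sentence)      # pattern is present
miss = /Trinity/.match(sentence)  # pattern is absent

p hit.class  # the class of a successful match result
p miss       # nil when nothing matched

# Both values can drive a conditional directly.
puts "Neo" if hit
puts "no Trinity" unless miss
```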

Regular Expressions
puts "Neo" if /Neo/ =~ "Neo is in the Matrix"
puts "Neo" if "Neo is in the Matrix" =~ /Neo/
Using =~ differs in that it returns the index in the string where the match starts (or nil if there is no match).
/Neo/ =~ "Neo is in the Matrix" in irb gives us 0
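To make the return value of =~ concrete, here is a short sketch with one match at the start, one later in the string, and one miss. The extra patterns are illustrative additions.

```ruby
sentence = "Neo is in the Matrix"

at_start = /Neo/ =~ sentence      # match begins at index 0
later    = /Matrix/ =~ sentence   # match begins further into the string
missing  = /Trinity/ =~ sentence  # no match

p at_start
p later
p missing

# 0 is truthy in Ruby, so a match at index 0 still passes an if test.
puts "Neo" if at_start
```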

Regular Expressions
Building a pattern
• Literal characters: "match this character"
• The dot wild card character (.): "match any character"
• Character classes: "match one of these characters"

Regular Expressions
Literal Characters
/a/ matches the string "a" as well as any string containing the letter "a"
Characters such as ^, $, ?, ., /, \, [, ], {, }, (, ), +, * and | need a preceding \ to be matched literally, because they have special meaning in regular expression syntax.
To match a literal ? the regular expression has to look like this: /\?/
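The escaping rule can be checked directly, and Ruby's Regexp.escape can add the backslashes for you when the literal text comes from a variable. The sample strings below are assumptions for illustration.

```ruby
# A hand-escaped question mark matches only a literal "?".
question = /\?/
p question.match?("Really?")
p question.match?("Really")

# Regexp.escape backslash-escapes the special characters in a string.
literal = "1+1=2?"
escaped_pattern = Regexp.new(Regexp.escape(literal))
p escaped_pattern.match?("is 1+1=2? yes")

# Unescaped, + and ? act as quantifiers, so the same text no longer matches.
unescaped_pattern = /1+1=2?/
p unescaped_pattern.match?("is 1+1=2? yes")
```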

Regular Expressions
The wildcard character . (dot)
Matches any single character at that position in the string.
/.pen/
Valid matches:
/.pen/ =~ "the bank is open"
/.pen/ =~ "open"
/.ejected/ =~ "dejected"
/.ejected/ =~ "rejected"
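Two details are worth trying in irb: the dot must consume exactly one character, and by default it does not match a newline unless the m option is given. The sketch below uses made-up variable names.

```ruby
in_sentence  = /.pen/ =~ "the bank is open"  # index where "open" starts
whole_word   = /.pen/ =~ "open"
nothing_left = /.pen/ =~ "pen"               # no character for the dot to consume

# By default the dot does not match a newline; the m option changes that.
newline_default   = /.pen/ =~ "\npen"
newline_multiline = /.pen/m =~ "\npen"

p in_sentence, whole_word, nothing_left, newline_default, newline_multiline
```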

Regular Expressions
Character classes
Explicit list of characters placed inside of square brackets
/[dr]ejected/
Matches either 'd' or 'r', and no other character, followed by 'ejected'
Match
/[dr]ejected/ =~ "dejected"
/[dr]ejected/ =~ "rejected"
No match
/[dr]ejected/ =~ "bejected"

Regular Expressions
Character classes
Range of characters
/[a-z]/
Match any character a through f (upper or lower) or any digit
/[A-Fa-f0-9]/
Negating a character match (^)
/[^A-Fa-f0-9]/
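A quick way to see a range and its negation side by side is to scan the same string with both. The constant names and sample words are illustrative.

```ruby
HEX_DIGIT     = /[A-Fa-f0-9]/   # a-f in either case, or a digit
NON_HEX_DIGIT = /[^A-Fa-f0-9]/  # anything outside that class

hex_in_coffee  = "c0ffee".scan(HEX_DIGIT)
hex_in_zebra   = "zebra".scan(HEX_DIGIT)
other_in_zebra = "zebra".scan(NON_HEX_DIGIT)

p hex_in_coffee
p hex_in_zebra
p other_in_zebra
```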

Regular Expressions
Character classes
Special escape sequences
To match any digit you can do this: /[0-9]/
You can accomplish the same thing with: /\d/
Other useful escape sequences are:
\w matches any digit, alphabetical character, or _
\s matches any whitespace character (space, tab, newline)

Regular Expressions
Character classes
Negated special escape sequences
\D matches any character that is not a digit
\W matches any character other than a digit, alphabetical character, or _
\S matches any non-whitespace character
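Scanning one sample string with each escape sequence and its negated form shows how they partition the text. The sample string is an assumption chosen to include a digit, an underscore, and whitespace.

```ruby
sample = "id_7 = 42\n"

digits      = sample.scan(/\d/)   # each digit
words       = sample.scan(/\w+/)  # runs of letters, digits, or _
spaces      = sample.scan(/\s/)   # each whitespace character
non_digits  = sample.scan(/\D+/)  # runs of anything but digits
non_words   = sample.scan(/\W+/)  # runs of anything but word characters
non_spaces  = sample.scan(/\S+/)  # runs of anything but whitespace

p digits, words, spaces, non_digits, non_words, non_spaces
```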

Regular Expressions
Matching and MatchData: getting beyond yes/no success/failure stuff...
Strings to match:
blake,karmen
blake, karmen
English pattern: "last name followed by a comma followed by optional whitespace followed by first name"
/^[A-Za-z]+,\s*[A-Za-z]+$/
* optional (zero or more)
+ (one or more, or at least one)
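The name pattern can be exercised against a few candidates to see what \s* and the anchors allow. The last two candidates are made-up counterexamples.

```ruby
NAME_FORMAT = /^[A-Za-z]+,\s*[A-Za-z]+$/

candidates = [
  "blake,karmen",    # no space after the comma
  "blake, karmen",   # one space after the comma
  "blake karmen",    # comma missing
  "blake, karmen2"   # digit in the first name
]

# Print whether each candidate fits the "last, first" format.
candidates.each do |candidate|
  puts "#{candidate.inspect}: #{NAME_FORMAT.match?(candidate)}"
end
```

Note that in Ruby ^ and $ anchor to the start and end of a line, not of the whole string; \A and \z are the whole-string anchors if multi-line input is a concern.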

Regular Expressions
MatchData
Parenthetical Groupings - Add parens to a rule and get contents out after evaluation. :-)
This
/^[A-Za-z]+,\s*[A-Za-z]+$/
may turn into
/(^[A-Za-z]+),\s*([A-Za-z]+$)/
Test it in irb:
/(^[A-Za-z]+),\s*([A-Za-z]+$)/.match "blake, karmen"
puts $1 outputs "blake"
puts $2 outputs "karmen"
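Besides $1 and $2, the groups can be read from the MatchData object itself. The sketch below also shows named groups, which are not used in the slides; the group names last and first are assumptions for illustration.

```ruby
name = "blake, karmen"

# Numbered groups: captures returns every group as an array.
numbered = /(^[A-Za-z]+),\s*([A-Za-z]+$)/.match(name)
last, first = numbered.captures

# Named groups (hypothetical names) can be looked up by symbol.
named = /^(?<last>[A-Za-z]+),\s*(?<first>[A-Za-z]+)$/.match(name)

puts "#{first} #{last}"
puts "#{named[:first]} #{named[:last]}"
```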

Regular Expressions
Capturing more data from a match
name = "blake, karmen"
name_format = /(^[A-Za-z]+),\s*([A-Za-z]+$)/
name_match = name_format.match(name) # save match
name_match[0]       # "blake, karmen" entire string
name_match[1]       # "blake" first capture
name_match.begin(1) # 0
name_match.end(1)   # 5
name_match[2]       # "karmen" second capture
name_match.begin(2) # 7
name_match.end(2)   # 13

Regular Expressions
What the heck is this???!!!??
/^x?[yz]{2}.*\z/
You will learn soon my young padawans.
Quantifiers!!
Zero or one
I want to match Mr, Mr., Mrs, Mrs.
English version: the character M, followed by the character r, followed by zero or one of the character s, followed by zero or one of the character '.'

Regular Expressions
? to the rescue
/Mrs?\.?/
Rock and Roll!! Valid matches!
/Mrs?\.?/ =~ "Mr"
/Mrs?\.?/ =~ "Mr."
/Mrs?\.?/ =~ "Mrs"
/Mrs?\.?/ =~ "Mrs."
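Looking at the matched text, rather than only the index, shows what each optional piece consumed. The names after the titles and the "Ms." counterexample are made up for illustration.

```ruby
TITLE = /Mrs?\.?/

samples = ["Mr Jones", "Mrs. Smith", "Ms. Lee"]

# Show the exact text the pattern matched, or note the miss.
samples.each do |text|
  m = TITLE.match(text)
  puts "#{text.inspect} -> #{m ? m[0].inspect : 'no match'}"
end
```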

Regular Expressions
Zero or more *
How do you spell boo?
/boo*/
Rock and Roll!! Valid matches!
/boo*/ =~ "boo"
/boo*/ =~ "booo"
/boo*/ =~ "boooo"
/boo*/ =~ "booooo"
/boo*/ =~ "boooooo"
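Because * allows zero repetitions, /boo*/ means "b, o, then zero or more o" and therefore also accepts "bo", while /booo*/ insists on at least two o's. A short sketch of the difference, with made-up variable names:

```ruby
loose  = /boo*/   # "bo" followed by zero or more "o"
strict = /booo*/  # "boo" followed by zero or more "o"

# * is greedy: it takes as many o's as it can.
longest = loose.match("booooo!")[0]

loose_on_bo  = loose.match("bo")
strict_on_bo = strict.match("bo")

p longest
p loose_on_bo && loose_on_bo[0]
p strict_on_bo
```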

Regular Expressions
One or more +
How many digits?
/\d+/
Rock and Roll!! Valid matches!
/\d+/ =~ "2"
/\d+/ =~ "34"
/\d+/ =~ "3566"

Regular Expressions
Number of repetitions
For example, a basic phone number pattern: 3 digits followed by a hyphen followed by 4 digits
/\d{3}-\d{4}/
Valid match:
/\d{3}-\d{4}/ =~ "333-4444"
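One thing to watch with the phone pattern is that, unanchored, it will find a match inside a longer run of digits. A hedged sketch, where STRICT_PHONE is an illustrative variant using the \A and \z whole-string anchors:

```ruby
PHONE        = /\d{3}-\d{4}/
STRICT_PHONE = /\A\d{3}-\d{4}\z/  # illustrative: the whole string must be the number

found_at      = PHONE =~ "call 333-4444 now"
loose_accepts = PHONE.match?("1333-44445")        # finds 333-4444 inside
strict_ok     = STRICT_PHONE.match?("333-4444")
strict_reject = STRICT_PHONE.match?("1333-44445")

p found_at, loose_accepts, strict_ok, strict_reject
```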

Regular Expressions
Let's get real!
Get ids out of a file
"123 karmen\n234 john\n456 mary".scan(/\d{3}/)
=> ["123", "234", "456"]
Create permalink
"john doe 1234 hello message".gsub(/\s/,"-")
=> "john-doe-1234-hello-message"
Capitalize my string
"a title of a book".gsub(/\b\w/) {|s| s.upcase}
=> "A Title Of A Book"
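These one-liners could be wrapped in small helpers. The method names permalink and titleize are hypothetical, and the helpers go slightly beyond the slide by trimming, lowercasing, and collapsing runs of whitespace into a single hyphen.

```ruby
# Hypothetical helper: lowercase, trim, and join words with hyphens.
def permalink(title)
  title.strip.downcase.gsub(/\s+/, "-")
end

# Hypothetical helper: upcase the first character of every word.
def titleize(text)
  text.gsub(/\b\w/) { |first_char| first_char.upcase }
end

puts permalink("  John  Doe 1234 ")
puts titleize("a title of a book")
```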

Regular Expressions
Let's get real! Grepilicious
Uses regular expressions to extract information out of collections.
["JOHN","Doe","Mary","SWANSON"].find_all {|name| /[a-z]/ =~ name}
OR
["JOHN","Doe","Mary","SWANSON"].grep(/[a-z]/)
Both return a result array:
=> ["Doe", "Mary"]

Regular Expressions
Let's get real! Grepilicious
["JOHN","Doe","Mary","SWANSON"].find_all{|name| name =~ /[a-z]/}.collect{|name| name.upcase}
OR
["JOHN","Doe","Mary","SWANSON"].grep(/[a-z]/) {|name| name.upcase}
Both return array:
=> ["DOE", "MARY"]
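grep works because it tests each element with ===, which a Regexp defines as "does this string match?". The sketch below also shows grep_v, the inverse filter, which is not covered in the slides.

```ruby
names = ["JOHN", "Doe", "Mary", "SWANSON"]
has_lowercase = /[a-z]/

# Regexp#=== is the test that grep applies to each element.
p(has_lowercase === "Doe")
p(has_lowercase === "JOHN")

mixed_case = names.grep(has_lowercase)                       # elements that match
all_caps   = names.grep_v(has_lowercase)                     # elements that do not
shouted    = names.grep(has_lowercase) { |name| name.upcase } # match, then transform

p mixed_case, all_caps, shouted
```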
