Introduction to Regular Expressions by Luke Sneeringer

Regular Expressions Luke Sneeringer @lukesneeringer

Why Regex?
• Regular expressions are a standard mechanism to find a pattern of text within a bigger block of text.
• Also useful for ensuring formatting consistency, or transforming data into a common format.

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
Dragons?

Learning Regex
• Regular expressions actually aren't that scary.
• But starting with the most complicated regular expression imaginable is. :-)

I The "re" module

The re module
• Regular expression parsing is done through Python's re module.
• import re
  re.search(needle, haystack)
• PSA: Don't use re.match, at least while you're learning. Use re.search exclusively.
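A minimal sketch of why the PSA above matters, assuming a made-up haystack string: re.match only tries the pattern at the very start of the string, while re.search scans the whole string.

```python
import re

haystack = 'my number is 867-5309'  # made-up sample text

# re.search scans the whole string for the first place the pattern matches.
found_by_search = re.search('867', haystack)

# re.match only tries at position 0, so it misses a pattern that starts later.
found_by_match = re.match('867', haystack)

print(found_by_search is not None)  # True
print(found_by_match)               # None
```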

The re module
• If your regex matches, re.search will return a "match object".
• Match objects have a few methods you commonly use: group, groups, groupdict.
• If your regex doesn't match, re.search will return None.
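Because re.search returns either a match object or None, calling code usually checks the result before using it. A small sketch of that check; first_match is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(needle, haystack):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(needle, haystack)
    if m is None:
        # Calling m.group() here would raise AttributeError, so bail out first.
        return None
    return m.group()

print(first_match('abc', 'abcdef'))  # abc
print(first_match('xyz', 'abcdef'))  # None
```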

Python "raw" strings
• In Python, you can prefix a string literal with r to get a "raw" string.
• This is useful for regular expressions, as we'll see later, because backslashes in a raw string are left as-is, which makes escaping regexes more sane.
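To illustrate, here is a sketch comparing a plain string with a raw string. It uses \b, the word-boundary class covered later in the talk; in a plain string literal Python turns \b into a backspace character before the re module ever sees it. The sample sentence is made up.

```python
import re

plain = '\bcat\b'  # \b becomes a single backspace character each time
raw = r'\bcat\b'   # \b stays as backslash + b, which re reads as a word boundary

print(len(plain))  # 5
print(len(raw))    # 7

plain_result = re.search(plain, 'the cat sat')  # looks for literal backspaces
raw_result = re.search(raw, 'the cat sat')      # looks for the whole word "cat"

print(plain_result)        # None
print(raw_result.group())  # cat
```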

II Building Blocks

The Basics
The most basic regular expression is just a string of uninteresting characters (Latin letters, Arabic numerals).
>>> re.search('abc', 'abcdef')
<_sre.SRE_Match object at 0x100790cc8>
>>> re.search('xyz', 'abcdef')
>>>

The Basics
• This is simple enough.
• But regular expressions are about matching patterns, not exact text.

Repetition
Certain characters indicate that the preceding character is repeated a number of times.
+      Used one or more times
*      Used zero or more times
?      Used zero or one times (i.e. optional)
{x}    Used exactly x times (e.g. {2})
{x,y}  Used between x and y times, inclusive
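The repetition operators above can be tried one at a time. A minimal sketch with made-up sample strings; each line stores the text the pattern actually matched.

```python
import re

plus = re.search('ab+', 'abbbc').group()          # 'abbb'  (one or more b)
star = re.search('ab*', 'ac').group()             # 'a'     (zero b's is allowed)
optional = re.search('colou?r', 'color').group()  # 'color' (the u is optional)
exact = re.search('a{2}', 'aaaa').group()         # 'aa'    (exactly two)
ranged = re.search('a{2,3}', 'aaaa').group()      # 'aaa'   (takes as many as allowed)

print(plus, star, optional, exact, ranged)
```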

Repetition
• An important note: repetition applies to the single "atomic thing" it follows: ab{2} matches abb, not abab.
• So far, the only "atomic thing" we've seen is a single literal character.
• We're about to look at some more.

Character Sets
Characters enclosed in [] are called a character set. A character set matches any single character in the set.
>>> re.search('gr[ae]y', 'gray')
<_sre.SRE_Match object at 0x100790780>
>>> re.search('gr[ae]y', 'grey')
<_sre.SRE_Match object at 0x100790cc8>
>>> re.search('gr[ae]y', 'grzy')
>>>

Character Sets
Important note: Character sets match only a single occurrence of a character.
gr[ae]y
Matches: gray, grey
Does not match: gry, grzy, graay, graey, Gray, grAy
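The word list on this slide can be checked directly. A short sketch that keeps only the candidates the pattern matches:

```python
import re

candidates = ['gray', 'grey', 'gry', 'grzy', 'graay', 'graey', 'Gray', 'grAy']

# A match object is truthy and None is falsy, so this filters to the matches.
matching = [word for word in candidates if re.search('gr[ae]y', word)]

print(matching)  # ['gray', 'grey']
```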

Character Sets
The hyphen character within a character set is special. It designates a range of characters (usually).
>>> re.search('[0-9]{3}-[0-9]{4}', '867-5309')
<_sre.SRE_Match object at 0x1020b0d30>

Character Sets
• A nice trick: If you want a literal hyphen in a character set, put it at the beginning or end (I prefer the end).
• [a-z0-9-]+ will match one or more characters between a and z (lower-cased only), 0 and 9, and the literal hyphen character.
• Obvious use case: [0-9a-f-]{36} for a Python UUID's string representation.
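A sketch of the UUID use case, using the standard library's uuid module to produce a sample value. The pattern is deliberately loose: it only checks for 36 characters drawn from hex digits and hyphens, not that the hyphens sit in the right places.

```python
import re
import uuid

value = str(uuid.uuid4())  # 36 characters: lowercase hex digits and hyphens

m = re.search('[0-9a-f-]{36}', value)

print(m is not None)       # True
print(m.group() == value)  # True
```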

Character Classes
Certain commonly used sets have shorthand names, called character classes. The phone number expression from the earlier slide could also be written thusly:
[\d]{3}-[\d]{4}

Character Classes
\w  word characters (in Python, this also includes digits and the underscore)
\d  digits
\s  whitespace characters (space, tab, newline, etc.)
\b  word boundary (a position between a word character and a non-word character, not a character itself)
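A quick sketch of each class in use, with made-up sample strings. Note that \b consumes no text: it only requires that a word starts or ends at that position.

```python
import re

word = re.search(r'\w+', 'hello_world42!').group()  # 'hello_world42'
digits = re.search(r'\d+', 'abc 123').group()       # '123'
space = re.search(r'\s+', 'a \t\nb').group()        # ' \t\n'

whole_word = re.search(r'\bcat\b', 'the cat sat')   # matches the word "cat"
inside_word = re.search(r'\bcat\b', 'concatenate')  # None: "cat" is inside a longer word

print(word, digits, repr(space))
print(whole_word.group(), inside_word)
```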

Review
This alone is sufficient for finding simple patterns.
[\w ]+, [A-Z]{2} [A-Z][\d][A-Z] [\d][A-Z][\d]
Montréal, QC H2Z 1X7
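The review pattern can be run against the sample address as-is. This sketch assumes Python 3, where \w matches accented letters such as é in a str pattern by default; the second, non-matching address is made up for contrast.

```python
import re

pattern = r'[\w ]+, [A-Z]{2} [A-Z][\d][A-Z] [\d][A-Z][\d]'

good = re.search(pattern, 'Montréal, QC H2Z 1X7')
bad = re.search(pattern, 'Montréal, QC H2Z1X7')  # missing space in the postal code

print(good.group())  # Montréal, QC H2Z 1X7
print(bad)           # None
```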

III Intermediate Blocks

Negation
• Character sets can also support negation, thereby matching any character that is not in the character set.
• To accomplish this, use a ^ character to begin the character set.

Negation
• Note that a negated character set does not check for the absence of a matching character.
• Rather, a negated character set checks for the presence of a character that does not match a character in the set.
>>> re.search(r'abc[^d]', 'abc')
>>> re.search(r'abc[^d]', 'abcxyz').group()
'abcx'

Special Characters
• ^ matches beginning of string/line.
• $ matches end of string/line.
>>> re.search('abc', 'abcdef')
<_sre.SRE_Match object at 0x100690d30>
>>> re.search('^abc', 'abcdef')
<_sre.SRE_Match object at 0x100690780>
>>> re.search('^abc$', 'abcdef')
>>>
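On the "string/line" distinction: by default ^ matches only at the start of the whole string, and the re.MULTILINE flag makes it also match at the start of each line. A minimal sketch with a made-up two-line string:

```python
import re

text = 'abc\ndef'

# Default: ^ matches only at the very start of the string.
default_match = re.search('^def', text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_match = re.search('^def', text, re.MULTILINE)

print(default_match)            # None
print(multiline_match.group())  # def
```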

IV Introspection

Introspection
• So far, we've only covered the ability to match a block of text; in other words, to see whether it (or a piece of it) conforms to a pattern.
• That's sometimes enough; usually, though, we want to extract text that matches a pattern and store or manipulate it.

Full Match Introspection
• The simplest thing we might want to do is get the full match.
• To do this, use the match object's group method.
>>> m = re.search(r'[\d]{3}-[\d]{3}-[\d]{4}',
...               r'blah 514-867-5309 blah')
>>> m
<_sre.SRE_Match object at 0x100690d30>
>>> m.group()
'514-867-5309'

Backreferences
• Often you want to introspect not only your entire match, but also defined pieces of your match.
• The tool for doing this is called backreferences (more precisely, capturing groups; strictly speaking, a backreference such as \1 refers back to a captured group).
• This is a fancy name for putting subsets of your regex in parentheses.

Backreferences
Consider the previous address example:
^[\w ]+, [A-Z]{2} [A-Z][\d][A-Z] [\d][A-Z][\d]$
Let's add backreferences:
^([\w ]+), ([A-Z]{2}) ([A-Z][\d][A-Z] [\d][A-Z][\d])$

Backreferences
>>> regex = r'^([\w ]+), ([A-Z]{2}) '\
...         r'([A-Z][\d][A-Z] [\d][A-Z][\d])$'
>>> match = re.search(regex, 'Montréal, QC H2Z 1X7')
>>> match.group()
'Montréal, QC H2Z 1X7'
>>> match.groups()
('Montréal', 'QC', 'H2Z 1X7')

Named Backreferences
• Python has added an extension to Perl-compatible regular expressions to have named backreferences.
• Many other regex engines now support named groups as well, though often with slightly different syntax such as (?<foo>...).
• The syntax is to add ?P<foo> immediately after the opening parenthesis of the backreference.

Named Backreferences
Our previous address with plain backreferences:
^([\w ]+), ([A-Z]{2}) ([A-Z][\d][A-Z] [\d][A-Z][\d])$
Let's make those named backreferences (apologies for the line break):
^(?P<city>[\w ]+), (?P<province>[A-Z]{2})
(?P<postal>[A-Z][0-9][A-Z] [0-9][A-Z][0-9])$

Named Backreferences
>>> regex = \
...     r'^(?P<city>[\w ]+), '\
...     r'(?P<province>[A-Z]{2}) '\
...     r'(?P<pc>[A-Z][\d][A-Z] [\d][A-Z][\d])$'
>>> match = re.search(regex, 'Montréal, QC H2Z 1X7')
>>> match.group()
'Montréal, QC H2Z 1X7'
>>> match.groups()
('Montréal', 'QC', 'H2Z 1X7')
>>> match.groupdict()
{'city': 'Montréal', 'province': 'QC', 'pc': 'H2Z 1X7'}
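Individual pieces can also be pulled out one at a time by passing a name or a number to group. A small sketch reusing the regex from this slide, assuming Python 3; named groups keep their positional numbers, so both styles work on the same match.

```python
import re

regex = (
    r'^(?P<city>[\w ]+), '
    r'(?P<province>[A-Z]{2}) '
    r'(?P<pc>[A-Z][\d][A-Z] [\d][A-Z][\d])$'
)
match = re.search(regex, 'Montréal, QC H2Z 1X7')

city = match.group('city')       # look up by name
postal_code = match.group('pc')  # look up by name
province = match.group(2)        # look up by position (groups count from 1)

print(city, province, postal_code)
```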

V The Dragon

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
The Dragon
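Read with the building blocks from this talk, the dragon is: a word boundary, one or more characters from a set, a literal @, one or more characters from another set, an escaped literal dot, two to four letters, and a closing word boundary. A hedged sketch of trying it out; the sample text and address are made up, and because the sets only list A-Z, the re.IGNORECASE flag is assumed here so that lower-case addresses match.

```python
import re

dragon = r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
text = 'Contact: someone@example.com today'  # made-up sample text

# Without a flag, [A-Z] matches upper-case letters only.
case_sensitive = re.search(dragon, text)

# re.IGNORECASE lets [A-Z] match lower-case letters too.
case_insensitive = re.search(dragon, text, re.IGNORECASE)

print(case_sensitive)            # None
print(case_insensitive.group())  # someone@example.com
```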

VI Conclusion

Conclusion
• There is much, much more that can be done with regular expressions. This talk has simply given the building blocks.
• A few topics I haven't covered: alternation ("or"), lookahead, conditionals



More Decks by PyCon 2014
