Introduction to Regular Expressions by Luke Sneeringer

PyCon 2014
April 11, 2014

Transcript

  1. Why Regex?
    • Regular expressions are a standard mechanism to find a pattern of text within a bigger block of text.
    • Also useful for ensuring formatting consistency, or transforming data into a common format.
    @lukesneeringer
  2. Learning Regex
    • Regular expressions actually aren't that scary.
    • But starting with the most complicated regular expression imaginable is. :-)
  3. The re module
    • Regular expression parsing is done through Python's re module.
    • import re
      re.search(needle, haystack)
    • PSA: Don't use re.match, at least while you're learning. Use re.search exclusively. (re.match only looks for a match at the very start of the string, which is an easy source of confusion.)
  4. The re module
    • If your regex matches, re.search will return a "match object".
    • Match objects have a few methods you commonly use: group, groups, groupdict.
    • If your regex doesn't match, re.search will return None.
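The match-or-None behavior described above is usually handled with an explicit None check. A minimal sketch follows; the helper name first_match is made up for illustration and is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found.

    first_match is a hypothetical helper, not something re provides.
    """
    match = re.search(pattern, text)
    if match is None:
        # re.search found nothing, so there is no match object to inspect.
        return None
    return match.group()

print(first_match('def', 'abcdef'))  # def
print(first_match('xyz', 'abcdef'))  # None
```

Calling .group() directly on the result of a failed search would raise AttributeError, which is why the check comes first.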
  5. Python "raw" strings
    • In Python, you can prefix a string literal with r to get a "raw" string.
    • This is useful for regular expressions, as we'll see later, because it makes escaping regexes more sane.
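A short sketch of what the r prefix changes. It borrows the \d and \b escapes, which the slides only introduce later, purely to show how backslashes are treated; the variable names are illustrative.

```python
import re

plain = '\\d+'  # normal literal: the backslash has to be doubled
raw = r'\d+'    # raw literal: the backslash is kept as written
print(plain == raw)  # True

# In a normal literal, '\b' is a single backspace character. In a raw
# literal it stays as backslash + b, which re reads as a word boundary.
print(len('\b'), len(r'\b'))  # 1 2

print(re.search(r'\bcat\b', 'a cat sat').group())  # cat
print(re.search('\bcat\b', 'a cat sat'))           # None
```

The last line fails to match because the pattern is looking for literal backspace characters around "cat".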
  6. The Basics
    The most basic regular expression is just a string of uninteresting characters (Latin letters, Arabic numerals).
        >>> re.search('abc', 'abcdef')
        <_sre.SRE_Match object at 0x100790cc8>
        >>> re.search('xyz', 'abcdef')
        >>>
  7. The Basics
    • This is simple enough.
    • But regular expressions are about matching patterns, not exact text.
  8. Repetition
    Certain characters indicate that a character is repeated a number of times.
        +      Used one or more times
        *      Used zero or more times
        ?      Used zero or one times (i.e. optional)
        {x}    Used x times (e.g. {2})
        {x,y}  Used between x and y times, inclusive
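The repetition operators in the table above can be tried directly against short strings. This is a small sketch; the helper name matched is made up here so that each line prints either the matched text or None.

```python
import re

def matched(pattern, text):
    """Return the text matched by pattern in text, or None (hypothetical helper)."""
    match = re.search(pattern, text)
    return match.group() if match else None

print(matched('ab+', 'abbbc'))       # abbb   (one or more b)
print(matched('ab*', 'ac'))          # a      (zero or more b)
print(matched('colou?r', 'color'))   # color  (the u is optional)
print(matched('ab{2}', 'abb'))       # abb    (exactly two b)
print(matched('ab{2}', 'abab'))      # None   ({2} repeats only the b)
print(matched('ab{2,3}', 'abbbbb'))  # abbb   (between two and three b)
```

In every pattern the operator applies only to the b (or u) immediately before it, never to the whole preceding string.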
  9. Repetition
    • An important note: repetition applies to the single "atomic thing" it follows: ab{2} matches abb, not abab.
    • So far, the only "atomic thing" we've seen is a single literal character.
    • We're about to look at some more.
  10. Character Sets
    Characters enclosed in [] are called a character set. A set matches any single character it contains.
        >>> re.search('gr[ae]y', 'gray')
        <_sre.SRE_Match object at 0x100790780>
        >>> re.search('gr[ae]y', 'grey')
        <_sre.SRE_Match object at 0x100790cc8>
        >>> re.search('gr[ae]y', 'grzy')
        >>>
  11. Character Sets
    Important note: Character sets match only a single occurrence of a character.
        Pattern:         gr[ae]y
        Matches:         gray, grey
        Does not match:  gry, grzy, graay, graey, Gray, grAy
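The candidate strings from this slide can be checked in one pass. A minimal sketch, with variable names chosen only for illustration:

```python
import re

candidates = ['gray', 'grey', 'gry', 'grzy', 'graay', 'graey', 'Gray', 'grAy']

# Keep only the words in which gr[ae]y is found somewhere.
matching = [word for word in candidates if re.search('gr[ae]y', word)]
print(matching)  # ['gray', 'grey']
```

The set consumes exactly one character, so gry (none) and graay (two) both fail, and Gray and grAy fail because matching is case-sensitive by default.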
  12. Character Sets
    The hyphen character within a character set is special. It designates a range of characters (usually).
        >>> re.search('[0-9]{3}-[0-9]{4}', '867-5309')
        <_sre.SRE_Match object at 0x1020b0d30>
  13. Character Sets
    • A nice trick: If you want a literal hyphen in a character set, put it at the beginning or end (I prefer the end).
    • [a-z0-9-]+ will match one or more characters between a and z (lowercase only), 0 and 9, and the literal hyphen character.
    • Obvious use case: [0-9a-f-]{36} for a Python UUID's string representation.
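Both patterns from this slide can be exercised with the standard library's uuid module. This is a sketch; the fixed UUID value and the sample slug text are arbitrary inputs made up for the example.

```python
import re
import uuid

# A fixed, arbitrary UUID so the output is repeatable.
value = str(uuid.UUID('12345678-1234-5678-1234-567812345678'))
print(len(value))  # 36

match = re.search('[0-9a-f-]{36}', value)
print(match.group() == value)  # True

# The trailing hyphen in the set is literal, so hyphenated slugs match too.
slug = re.search('[a-z0-9-]+', 'intro-to-regex-2014 (draft)').group()
print(slug)  # intro-to-regex-2014

# The UUID pattern is loose: it only counts characters from the set.
print(re.search('[0-9a-f-]{36}', '-' * 36) is not None)  # True
```

As the last line shows, the pattern does not verify where the hyphens sit, so it is a quick filter rather than a strict UUID validator.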
  14. Character Classes
    Certain commonly used sets have character classes. These are essentially shorthands that give you common sets of characters. The phone number expression from slide 12 could also be written thusly:
        [\d]{3}-[\d]{4}
  15. Character Classes
        \w  word characters (in Python, this also includes digits and the underscore)
        \d  digits
        \s  whitespace characters (space, tab, newline, etc.)
        \b  word boundaries (the position between a word character and a non-word character; it matches no character itself)
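Each of the four classes can be tried on a short string. A small sketch, with variable names picked for illustration; note that \d works on its own, without the surrounding brackets.

```python
import re

phone = re.search(r'\d{3}-\d{4}', 'call 867-5309').group()
print(phone)  # 867-5309

word = re.search(r'\w+', 'H2Z 1X7').group()
print(word)  # H2Z  (letters and digits both count as word characters)

gap = re.search(r'\s', 'a\tb').group()
print(gap == '\t')  # True  (a tab is whitespace)

# \b consumes nothing; it only requires a word edge at that position.
print(re.search(r'\bcat\b', 'concatenate'))          # None
print(re.search(r'\bcat\b', 'the cat sat').group())  # cat
```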
  16. Review
    This alone is sufficient for finding simple patterns.
        [\w ]+, [A-Z]{2} [A-Z][\d][A-Z] [\d][A-Z][\d]
        Montréal, QC H2Z 1X7
  17. Negation
    • Character sets can also support negation, thereby matching any character that is not in the character set.
    • To accomplish this, use a ^ character to begin the character set. For example, [^0-9] matches any single character that is not a digit.
  18. Negation
    • Note that a negated character set does not check for the absence of a matching character.
    • Rather, a negated character set checks for the presence of a character that does not match a character in the set.
        >>> re.search(r'abc[^d]', 'abc')
        >>> re.search(r'abc[^d]', 'abcxyz').group()
        'abcx'
  19. Special Characters
    • ^ matches the beginning of the string (or of each line, when the re.MULTILINE flag is passed).
    • $ matches the end of the string (or of each line, with re.MULTILINE).
        >>> re.search('abc', 'abcdef')
        <_sre.SRE_Match object at 0x100690d30>
        >>> re.search('^abc', 'abcdef')
        <_sre.SRE_Match object at 0x100690780>
        >>> re.search('^abc$', 'abcdef')
        >>>
  20. Introspection
    • So far, we've only covered the ability to match a block of text: in other words, to see whether it (or a piece of it) conforms to a pattern.
    • That's sometimes enough; usually, though, we want to extract text that matches a pattern and store or manipulate it.
  21. Full Match Introspection
    • The simplest thing we might want to do is get the full match.
    • To do this, use the match object's group method.
        >>> m = re.search(r'[\d]{3}-[\d]{3}-[\d]{4}',
        ...               r'blah 514-867-5309 blah')
        >>> m
        <_sre.SRE_Match object at 0x100690d30>
        >>> m.group()
        '514-867-5309'
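  22. Backreferences
    • Often you want to not only introspect your entire match, but also defined pieces of your match.
    • The tool for doing this is called backreferences.
    • This is a fancy name for putting subsets of your regex in parentheses.
    • Many references call the parenthesized pieces capturing groups, and reserve "backreference" for a later reference to a group's text, such as \1.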
  23. Backreferences
    Consider the previous address example:
        ^[\w ]+, [A-Z]{2} [A-Z][\d][A-Z] [\d][A-Z][\d]$
    Let's add backreferences:
        ^([\w ]+), ([A-Z]{2}) ([A-Z][\d][A-Z] [\d][A-Z][\d])$
  24. Backreferences
        >>> regex = r'^([\w ]+), ([A-Z]{2}) '\
        ...         r'([A-Z][\d][A-Z] [\d][A-Z][\d])$'
        >>> match = re.search(regex, 'Montréal, QC H2Z 1X7')
        >>> match.group()
        'Montréal, QC H2Z 1X7'
        >>> match.groups()
        ('Montréal', 'QC', 'H2Z 1X7')
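Beyond group() and groups(), individual pieces can be pulled out by position. A minimal sketch that reuses the phone number from the earlier slide, with parentheses added around each part; the grouping is this example's choice, not something the slides show.

```python
import re

match = re.search(r'(\d{3})-(\d{3})-(\d{4})', 'blah 514-867-5309 blah')

print(match.group(0))     # 514-867-5309  (group 0 is the full match)
print(match.group(1))     # 514           (groups are numbered from 1)
print(match.group(2, 3))  # ('867', '5309')

# start() and end() give the position of a group within the haystack.
print(match.start(1), match.end(1))  # 5 8
```

Groups are numbered by the position of their opening parenthesis, left to right.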
  25. Named Backreferences
    • Python has added an extension to Perl-compatible regular expressions to have named backreferences.
    • This syntax has since been adopted by a number of other regex engines.
    • The syntax is to add ?P<foo> immediately after the opening parenthesis of the backreference.
  26. Named Backreferences
    Our previous address with plain backreferences:
        ^([\w ]+), ([A-Z]{2}) ([A-Z][\d][A-Z] [\d][A-Z][\d])$
    Let's make those named backreferences (apologies for the line break):
        ^(?P<city>[\w ]+), (?P<province>[A-Z]{2})
        (?P<postal>[A-Z][0-9][A-Z] [0-9][A-Z][0-9])$
  27. Named Backreferences
        >>> regex = \
        ...     r'^(?P<city>[\w ]+), '\
        ...     r'(?P<province>[A-Z]{2}) '\
        ...     r'(?P<pc>[A-Z][\d][A-Z] [\d][A-Z][\d])$'
        >>> match = re.search(regex, 'Montréal, QC H2Z 1X7')
        >>> match.group()
        'Montréal, QC H2Z 1X7'
        >>> match.groups()
        ('Montréal', 'QC', 'H2Z 1X7')
        >>> match.groupdict()
        {'city': 'Montréal', 'province': 'QC',
         'pc': 'H2Z 1X7'}
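One way to put the named groups to work is to wrap the search in a small function that returns the groupdict. This is only a sketch: parse_address is a made-up helper name, and the Toronto line is an extra sample input that does not come from the slides.

```python
import re

# Same pattern as the slide, written with implicit string concatenation.
ADDRESS = (
    r'^(?P<city>[\w ]+), '
    r'(?P<province>[A-Z]{2}) '
    r'(?P<pc>[A-Z][\d][A-Z] [\d][A-Z][\d])$'
)

def parse_address(line):
    """Return a dict of the named pieces, or None if the line does not fit."""
    match = re.search(ADDRESS, line)
    if match is None:
        return None
    return match.groupdict()

parsed = parse_address('Toronto, ON M5V 2T6')
print(parsed['city'], parsed['province'], parsed['pc'])  # Toronto ON M5V 2T6

# A single named group can also be read with group('name').
match = re.search(ADDRESS, 'Montréal, QC H2Z 1X7')
print(match.group('province'))  # QC

print(parse_address('not an address'))  # None
```

Named groups still keep their numbers, so match.group(2) returns the same text as match.group('province').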
  28. Conclusion
    • There is much, much more that can be done with regular expressions. This talk has simply given the building blocks.
    • A few topics I haven't covered: alternation ("or"), lookahead, conditionals