PyConZA 2014: "An introduction to regular expressions in Python" by Adrianna Pińska

Introduction to regular expressions in Python
Adrianna Pińska
October 3, 2014

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
- Jamie Zawinski

What are regular expressions?
Regular expressions are a mini-language used to specify patterns in text. You can use them to:
- find substrings
- search and replace
- split strings
You may have encountered them in sed, awk, grep or your favourite text editor. The core syntax is much the same everywhere, so if you can use them in one language, you can mostly use them anywhere.

But we have string methods for that!
    "She sells sea shells".find("sh")
    "I like dogs!".replace("dogs", "cats")
    "one, two, three".split(", ")
But string methods are designed for constant string patterns. What if you want...
- to find every word that starts with s?
- to replace anything that you like with "cats"?
- to split "one two; three, four,five"?
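As a rough illustration of the first question above, here is a small sketch (written for Python 3; the variable names are only illustrative) contrasting a fixed-substring search with a pattern search:

```python
import re

text = "She sells sea shells"

# str.find locates one fixed substring and returns its index (-1 if absent).
first_sh = text.find("sh")

# A pattern can describe "any word that starts with s or S" instead.
s_words = re.findall(r"\b[sS]\w*", text)

print(first_sh)  # 14
print(s_words)   # ['She', 'sells', 'sea', 'shells']
```

Note that `\b` (a word boundary) and `\w` (a word character) are shortcuts in the same family as `\d`.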

Regex syntax primer
The full specification is outside the scope of this talk, but here are some basics:
    a        a literal character
    .        any single character
    [abc]    one character in this set
    [a-z]    one character in this range
    \d       shortcut for a character class, e.g. any digit
    ?        the previous character 0 or 1 times
    *        the previous character 0 or more times
    +        the previous character 1 or more times
    {m}      the previous character m times
    {m,n}    the previous character m-n times
    (...)    used to capture groups
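The basics listed above can be tried out one at a time. The following is a minimal sketch for Python 3; the helper `matches` and the sample strings are made up for illustration, and `re.fullmatch` is used so that the whole string has to fit the pattern:

```python
import re

def matches(pattern, text):
    """Return True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, a string it accepts, a string it rejects)
examples = [
    (r"c.t", "cat", "ct"),             # . needs exactly one character
    (r"[abc]", "b", "d"),              # one character from a set
    (r"[a-z]", "q", "Q"),              # one character from a range
    (r"\d", "7", "x"),                 # any digit
    (r"colou?r", "color", "colouur"),  # ? means 0 or 1 times
    (r"ab*", "a", "b"),                # * means 0 or more times
    (r"ab+", "abbb", "a"),             # + means 1 or more times
    (r"a{3}", "aaa", "aa"),            # exactly 3 times
    (r"a{2,3}", "aa", "aaaa"),         # between 2 and 3 times
]

for pattern, good, bad in examples:
    print(pattern, matches(pattern, good), matches(pattern, bad))

# Parentheses capture the parts of the match you want to pull out.
low, high = re.fullmatch(r"(\d+)-(\d+)", "10-20").groups()
print(low, high)  # 10 20
```

Every line printed by the loop should end in `True False`.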

The re module
- re is a standard library module which provides regular expression functionality
- Most commonly used functions: match and search, sub, findall and split
- Regular expressions are specified as strings
- You can use raw strings to avoid having to escape backslashes
    "\\d+"
    r"\d+"
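To see that the two spellings really are the same pattern, a quick check like the following can help (a sketch for Python 3; the sample sentence is invented):

```python
import re

escaped = "\\d+"  # backslash escaped for the string literal
raw = r"\d+"      # raw string: the backslash is kept as written

# Both literals produce the same three-character string: \d+
print(escaped == raw, len(raw))  # True 3

numbers = re.findall(raw, "3 apples and 12 pears")
print(numbers)  # ['3', '12']
```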

Match and search
- Used to find a substring in a string
- search matches anywhere in the string
- match matches from the beginning
- Both functions return a match object if a match is found and None otherwise
- You can treat this as a boolean result, or examine the match object for more info

    import re

    if re.match("c.*h", "cuttlefish"):
        print("I found a match!")

    m = re.search("c.*h", "I have a cuttlefish!")
    if m:
        print(m.group())  # cuttlefish
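The difference between the two functions shows up when the same pattern and text are passed to both. A small sketch for Python 3, with illustrative variable names:

```python
import re

text = "I have a cuttlefish!"

at_start = re.match("c.*h", text)   # None: the text does not start with "c"
anywhere = re.search("c.*h", text)  # scans forward until the pattern fits

print(at_start)          # None
print(anywhere.group())  # cuttlefish
print(anywhere.span())   # (9, 19)
```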

Match objects and groups
A MatchObject instance stores information about the match and provides methods for accessing it. This is especially useful if you capture groups in your regex. Put parentheses around parts of the pattern that you want to extract.

    m = re.search("([a-z]*)fish", "I have a cuttlefish!")
    print(m.group())   # "cuttlefish"
    print(m.group(0))  # "cuttlefish"
    print(m.groups())  # ('cuttle',)
    print(m.group(1))  # "cuttle"
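A match object also records where the match was found. The sketch below (Python 3, with an invented sentence) uses two groups and checks for None before touching the result:

```python
import re

m = re.search(r"(\d+) (\w+)", "I caught 3 fish today")

if m is None:
    print("no match")
else:
    print(m.group(0))          # 3 fish  (the whole match)
    print(m.groups())          # ('3', 'fish')
    print(m.group(2))          # fish
    print(m.start(), m.end())  # 9 15
```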

Findall
- Used to find all instances of a pattern in a string
- Returns a list of strings if there are no captured groups or one captured group
- ...or a list of tuples if there are multiple captured groups

    re.findall("[a-z]*fish", "butterfish, monkfish and cuttlefish")
    # ['butterfish', 'monkfish', 'cuttlefish']

    re.findall("([a-z]*)fish", "butterfish, monkfish and cuttlefish")
    # ['butter', 'monk', 'cuttle']

    re.findall("([a-z]{3})([1-9]{3})", "aaa123 bbb456 ccc789")
    # [('aaa', '123'), ('bbb', '456'), ('ccc', '789')]
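The three return shapes can be compared side by side on one input. This is a sketch for Python 3; turning the pairs into a dict is just one possible way to use the tuples:

```python
import re

stock = "aaa123 bbb456 ccc789"

codes = re.findall(r"[a-z]{3}[1-9]{3}", stock)      # no groups: whole matches
letters = re.findall(r"([a-z]{3})[1-9]{3}", stock)  # one group: that group only
pairs = re.findall(r"([a-z]{3})([1-9]{3})", stock)  # two groups: tuples

print(codes)    # ['aaa123', 'bbb456', 'ccc789']
print(letters)  # ['aaa', 'bbb', 'ccc']

# A list of 2-tuples converts directly into a dictionary.
lookup = dict(pairs)
print(lookup["bbb"])  # 456
```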

Greed
By default, expressions like .* will match the longest possible pattern. If you want to match the shortest possible pattern, add ? to make the expression non-greedy.

    re.findall('"(.*)"', '"one", "two", "three"')
    # ['one", "two", "three']

    re.findall('"(.*?)"', '"one", "two", "three"')
    # ['one', 'two', 'three']
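The same ? suffix works on the other repetition operators too. A short sketch for Python 3, using invented sample text:

```python
import re

text = "[red] and [blue]"

greedy = re.findall(r"\[(.*)\]", text)  # runs on to the last closing bracket
lazy = re.findall(r"\[(.*?)\]", text)   # stops at the first closing bracket

print(greedy)  # ['red] and [blue']
print(lazy)    # ['red', 'blue']

# + is greedy as well; +? takes as little as it can get away with.
most = re.match(r"\d+", "2014").group()
least = re.match(r"\d+?", "2014").group()
print(most, least)  # 2014 2
```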

Substitution
- sub replaces a pattern with a string
- Backreferences (\1, \2, ...) in the replacement correspond to captured groups
- This function returns the modified string

    re.sub("(I love )[a-z]*(.*)", r"\1cats\2", "I love hotdogs!")
    # "I love cats!"
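Backreferences can also reorder what was captured, and the optional count argument limits how many replacements are made. A sketch for Python 3 with made-up inputs:

```python
import re

# Swap two captured words by referring to them in the opposite order.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Zawinski, Jamie")
print(swapped)  # Jamie Zawinski

# count caps the number of substitutions (the default replaces every match).
masked = re.sub(r"\d", "#", "a1b2c3", count=2)
print(masked)  # a#b#c3
```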

Substitution with a function
- You can use a function in place of the replacement string
- It takes a match object as a parameter
- It returns a string

    def fixcaps(m):
        before, name, after = m.groups()
        name = name.lower().capitalize()
        return "".join((before, name, after))

    re.sub("(My name is )([A-Za-z]*)(.*)", fixcaps, "My name is alice.")
    # "My name is Alice."

    re.sub("(My name is )([A-Za-z]*)(.*)", fixcaps, "My name is BOB.")
    # "My name is Bob."
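A replacement function is handy whenever the new text has to be computed from the old. As another sketch (Python 3; `double` is a hypothetical helper, not part of the talk):

```python
import re

def double(m):
    """Return the matched number doubled, as a string."""
    return str(int(m.group()) * 2)

result = re.sub(r"\d+", double, "3 cats and 12 dogs")
print(result)  # 6 cats and 24 dogs
```

The function is called once per match, so each number is handled independently.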

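Split
- Splits a string with inconsistent delimiters
- Returns a list of strings
- If you want to keep the delimiters (or parts of them), put them in parentheses

    re.split("[,;]? *", "one two; three, four,five")
    # ['one', 'two', 'three', 'four', 'five']

    re.split("([,;]?) *", "one two; three, four,five")
    # ['one', '', 'two', ';', 'three', ',', 'four', ',', 'five']

Note: these patterns can match the empty string. That is harmless on Python 2, which the outputs here reflect, but Python 3.7 and later also split on empty matches, so there it is safer to use a pattern that always consumes at least one character.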
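On Python 3.7 and later, re.split also splits on empty matches, so a version of this example for Python 3 might use a pattern that always consumes at least one delimiter character. The character class below is one possible choice, not the pattern from the talk:

```python
import re

text = "one two; three, four,five"

# One or more commas, semicolons or spaces count as a single delimiter.
parts = re.split(r"[,; ]+", text)
print(parts)  # ['one', 'two', 'three', 'four', 'five']

# Wrapping the delimiter in parentheses keeps it in the result.
with_delims = re.split(r"([,; ]+)", text)
print(with_delims)
# ['one', ' ', 'two', '; ', 'three', ', ', 'four', ',', 'five']
```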

Flags
Optional flags can be used to modify the behaviour of a regular expression.
- re.IGNORECASE: perform a case-insensitive match
- re.MULTILINE: ^ and $ match beginning and end of each line (they apply to the whole string by default)
- re.DOTALL: dot matches any character, including newline (by default it is excluded)
- re.VERBOSE: allows cleaner layout and annotation of long regular expressions
The flags parameter is a bitmask; you can combine flags with |

    re.match("I like [a-z]*", "I LIKE CATS!", re.IGNORECASE)
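The effect of each flag is easiest to see on a multi-line string. A sketch for Python 3, using an invented three-line text:

```python
import re

text = "first line\nSecond line\nthird line"

# MULTILINE: ^ matches at the start of every line, not just the string.
starts = re.findall(r"^\w+", text, re.MULTILINE)
print(starts)  # ['first', 'Second', 'third']

# Flags combine with |, here MULTILINE plus IGNORECASE.
s_lines = re.findall(r"^s\w+", text, re.MULTILINE | re.IGNORECASE)
print(s_lines)  # ['Second']

# DOTALL: without it, .* cannot cross the newlines.
without = re.match(r"first.*third", text)
with_dotall = re.match(r"first.*third", text, re.DOTALL)
print(without is None, with_dotall is not None)  # True True
```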

Compiled regex objects
- If you're going to use the same regex multiple times, it's more efficient to compile it
- Every module-level function has an equivalent method on the compiled object

    SOMETHINGFISH = re.compile("[a-z]*fish")
    SOMETHINGFISH.findall("butterfish, monkfish and cuttlefish")
    # ['butterfish', 'monkfish', 'cuttlefish']
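A typical use is to compile once at module level and then apply the object to many inputs. The sketch below assumes Python 3; the list of lines is made up:

```python
import re

SOMETHING_FISH = re.compile(r"[a-z]*fish")

lines = [
    "butterfish and chips",
    "no seafood here",
    "a cuttlefish, a monkfish",
]

# The compiled object offers the same operations as the module functions.
found = [SOMETHING_FISH.findall(line) for line in lines]
print(found)  # [['butterfish'], [], ['cuttlefish', 'monkfish']]

# The original pattern string stays available for debugging.
print(SOMETHING_FISH.pattern)  # [a-z]*fish
```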

How to write a good regular expression
- What do you want to match?
- What do you not want to match?
- Your regex needs to be specific, but not too specific.
- What is your input like? How predictable is it? Is it going to change in the future?

Danger! High voltage!
- Parsing arbitrary text is fraught with peril
- Especially if it's provided by a user
- Use tests and include error checking
- Easy to handle: regex not matching
- Harder: regex matching something it shouldn't
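One way to apply this advice is to wrap the regex in a small function that rejects anything unexpected, and then test it with both good and bad input. Everything below is a hypothetical sketch for Python 3: the YYYY-MM-DD format, the `parse_date` name and the range checks are assumptions chosen for illustration.

```python
import re

DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # assumed input format

def parse_date(text):
    """Return (year, month, day) as ints, or raise ValueError."""
    # fullmatch guards against the regex matching only part of the input.
    m = DATE.fullmatch(text.strip())
    if m is None:
        raise ValueError("not a YYYY-MM-DD date: %r" % text)
    year, month, day = (int(part) for part in m.groups())
    # The regex only checks the shape; the values need their own check.
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError("date out of range: %r" % text)
    return year, month, day

for candidate in ["2014-10-03", "2014-13-03", "2014-10-03 and more"]:
    try:
        print(candidate, "->", parse_date(candidate))
    except ValueError as error:
        print(candidate, "-> rejected:", error)
```

Only the first candidate is accepted; the second has the right shape but an impossible month, and the third contains trailing text that a plain search would have let through.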

Pitfalls
- One regex to rule them all
  - Use re.VERBOSE to make it more legible
- Parsing XML with regex: don't do it
  - If you do it, don't tell anyone
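With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment, so a long regex can be laid out one piece per line. A sketch for Python 3; the time format and the name `TIME` are invented for the example:

```python
import re

TIME = re.compile(r"""
    (\d{1,2})   # hours
    :           # separator
    (\d{2})     # minutes
    \s*         # optional whitespace
    (am|pm)?    # optional suffix
    """, re.VERBOSE | re.IGNORECASE)

m = TIME.search("The talk starts at 9:30 AM sharp")
print(m.groups())  # ('9', '30', 'AM')
```

The compact equivalent, `(\d{1,2}):(\d{2})\s*(am|pm)?`, matches the same text but is harder to review.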

Further reading
- https://docs.python.org/2/library/re.html : Python 2 reference
- https://pypi.python.org/pypi/regex : third-party library intended as a re replacement
- http://www.regular-expressions.info/ : not Python-specific
- https://pythex.org/ : interactive regex sandbox for Python
