Pycon ZA
October 03, 2014

PyConZA 2014: "An introduction to regular expressions in Python" by Adrianna Pińska

Regular expressions are a mini-language used for pattern-matching in text. They have been a staple of the computing world for decades: they are implemented in most programming languages, form the core of several utilities, and can be found lurking in the search-and-replace functionality of any sufficiently advanced text editor.

Despite their usefulness, regular expressions have developed a reputation for complexity and a steep learning curve. New programmers are often warned to steer clear of them -- which is a pity, because there are some problems for which they are a quick and elegant solution.

In this talk I aim to demystify regular expressions for the beginner programmer, and to provide a brief guided tour of Python's re module. I hope to encourage more programmers to get to know this useful tool.

Transcript

  1. Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski
  2. What are regular expressions?

    Regular expressions are a mini-language used to specify patterns in text. You can use them to:

    - find substrings
    - search and replace
    - split strings

    You may have encountered them in sed, awk, grep or your favourite text editor. If you can use them in one language, you can use them anywhere.
  3. But we have string methods for that!

        "She sells sea shells".find("sh")
        "I like dogs!".replace("dogs", "cats")
        "one, two, three".split(", ")

    But string methods are designed for constant string patterns. What if you want...

    - to find every word that starts with s?
    - to replace anything that you like with "cats"?
    - to split "one two; three, four,five"?
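
Each of the three questions above can be answered with a single call to the re module. The following is a minimal sketch rather than code from the talk: the `\b` word boundary and the `\w` shorthand are assumptions chosen for illustration, and the sample strings are the ones from the slide.

```python
import re

# Every word that starts with s (either case). \b marks a word boundary.
s_words = re.findall(r"\b[sS]\w*", "She sells sea shells")
print(s_words)

# Replace whatever word follows "I like " with "cats".
liked = re.sub(r"(I like )\w+", r"\1cats", "I like dogs!")
print(liked)

# Split on a comma or semicolon (plus trailing spaces) or on a run of spaces.
parts = re.split(r"[,;] *| +", "one two; three, four,five")
print(parts)
```
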
  4. Regex syntax primer

    The full specification is outside the scope of this talk, but here are some basics:

        a        a literal character
        .        any single character
        [abc]    one character in this set
        [a-z]    one character in this range
        \d       shortcut for a character class, e.g. any digit
        ?        the previous character 0 or 1 times
        *        the previous character 0 or more times
        +        the previous character 1 or more times
        {m}      the previous character m times
        {m,n}    the previous character m-n times
        (...)    used to capture groups
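
To see each piece of syntax in action, it can help to run one tiny pattern per rule through `re.findall`. The sample strings below are invented for illustration and are not from the talk.

```python
import re

# (pattern, sample text, what re.findall should return)
examples = [
    (r"a",         "banana",            ["a", "a", "a"]),            # literal character
    (r"b.n",       "ban bin",           ["ban", "bin"]),             # . any single character
    (r"[abc]",     "cabbage",           ["c", "a", "b", "b", "a"]),  # one character from a set
    (r"[x-z]",     "lazy fox",          ["z", "y", "x"]),            # one character from a range
    (r"\d",        "room 101",          ["1", "0", "1"]),            # any digit
    (r"colou?r",   "color colour",      ["color", "colour"]),        # u zero or one times
    (r"ab*",       "a ab abbb",         ["a", "ab", "abbb"]),        # b zero or more times
    (r"ab+",       "a ab abbb",         ["ab", "abbb"]),             # b one or more times
    (r"\d{3}",     "12 345 6789",       ["345", "678"]),             # exactly three digits
    (r"\d{2,3}",   "1 22 333 4444",     ["22", "333", "444"]),       # two to three digits
    (r"([a-z]+)@", "alice@example.com", ["alice"]),                  # captured group only
]

for pattern, text, expected in examples:
    result = re.findall(pattern, text)
    print(pattern, "on", repr(text), "->", result)
```
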
  5. The re module

    re is a standard library module which provides regular expression functionality.

    - Most commonly used functions: match and search, sub, findall and split
    - Regular expressions are specified as strings
    - You can use raw strings to avoid having to escape backslashes

        "\\d+"
        r"\d+"
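
A short sketch of why raw strings matter. The two spellings of `\d+` are the same string, but an escape such as `\b` silently means something else in an ordinary string literal. The `\b` example is an assumption added for illustration and does not appear in the talk.

```python
import re

escaped = "\\d+"   # ordinary string: the backslash has to be doubled
raw = r"\d+"       # raw string: backslashes are left alone
same = escaped == raw
print(same, len(raw))  # both are the three characters \ d +

digits = re.findall(raw, "3 apples and 12 pears")
print(digits)

# In an ordinary string "\b" is a backspace character, so the regex engine
# never sees a word-boundary escape. The raw string passes \b through intact.
plain_b = re.search("\bcat\b", "a cat")
raw_b = re.search(r"\bcat\b", "a cat")
print(plain_b, raw_b.group())
```
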
  6. Match and search

    Used to find a substring in a string.

    - search matches anywhere in the string
    - match matches from the beginning
    - Both functions return a match object if a match is found and None otherwise
    - You can treat this as a boolean result, or examine the match object for more info

        import re

        if re.match("c.*h", "cuttlefish"):
            print("I found a match!")

        m = re.search("c.*h", "I have a cuttlefish!")
        if m:
            print(m.group())
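
The difference between the two functions is easiest to see by running the same pattern over the same string with both. This sketch reuses the slide's pattern and sentence; the variable names are made up for illustration.

```python
import re

sentence = "I have a cuttlefish!"

# match only tries at position 0, and the sentence starts with "I", not "c".
anchored = re.match("c.*h", sentence)
# search scans forward until the pattern fits somewhere.
anywhere = re.search("c.*h", sentence)
print(anchored)
print(anywhere.group())

# match anchors the start only: trailing text after the match is allowed.
prefix = re.match("c.*h", "cuttlefish!")
print(prefix.group())
```
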
  7. Match objects and groups

    A MatchObject instance stores information about the match and provides methods for accessing it. This is especially useful if you capture groups in your regex. Put parentheses around parts of the pattern that you want to extract.

        import re

        m = re.search("([a-z]*)fish", "I have a cuttlefish!")
        print(m.group())   # "cuttlefish"
        print(m.group(0))  # "cuttlefish"
        print(m.groups())  # ('cuttle',)
        print(m.group(1))  # "cuttle"
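
Besides the matched text, a match object also records where the match was found. A small sketch using the same pattern and sentence; `start`, `end` and `span` are standard match object methods, though the talk does not cover them.

```python
import re

sentence = "I have a cuttlefish!"
m = re.search("([a-z]*)fish", sentence)

whole_span = m.span()              # (start, end) of the whole match
group_span = (m.start(1), m.end(1))  # position of the first captured group
print(whole_span, group_span)

# Slicing the original string with those offsets gives back the matched text.
print(sentence[m.start():m.end()])
print(sentence[m.start(1):m.end(1)])
```
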
  8. Findall

    Used to find all instances of a pattern in a string.

    - Returns a list of strings if there are no captured groups or one captured group
    - ...or a list of tuples if there are multiple captured groups

        import re

        re.findall("[a-z]*fish", "butterfish, monkfish and cuttlefish")
        # ['butterfish', 'monkfish', 'cuttlefish']
        re.findall("([a-z]*)fish", "butterfish, monkfish and cuttlefish")
        # ['butter', 'monk', 'cuttle']
        re.findall("([a-z]{3})([1-9]{3})", "aaa123 bbb456 ccc789")
        # [('aaa', '123'), ('bbb', '456'), ('ccc', '789')]
  9. Greed

    By default, expressions like .* will match the longest possible pattern. If you want to match the shortest possible pattern, add ? to make the expression non-greedy.

        import re

        re.findall('"(.*)"', '"one", "two", "three"')
        # ['one", "two", "three']
        re.findall('"(.*?)"', '"one", "two", "three"')
        # ['one', 'two', 'three']
  10. Substitution

    - re.sub replaces a pattern with a string
    - Backreferences in the replacement correspond to captured groups
    - The function returns the modified string

        import re

        re.sub("(I love )[a-z]*(.*)", r"\1cats\2", "I love hotdogs!")
        # "I love cats!"
  11. Substitution with a function

    - You can use a function in place of the replacement string
    - It takes a match object as a parameter
    - It returns a string

        import re

        def fixcaps(m):
            before, name, after = m.groups()
            name = name.lower().capitalize()
            return "".join((before, name, after))

        re.sub("(My name is )([A-Za-z]*)(.*)", fixcaps, "My name is alice.")
        # "My name is Alice."
        re.sub("(My name is )([A-Za-z]*)(.*)", fixcaps, "My name is BOB.")
        # "My name is Bob."
  12. Split

    - Splits a string with inconsistent delimiters
    - Returns a list of strings
    - If you want to keep the delimiters (or parts of them), put them in parentheses

        import re

        # Results are for Python 2, as used in the talk. Both patterns can match
        # the empty string, and from Python 3.7 re.split also splits on empty
        # matches, so the output differs there.
        re.split("[,;]? *", "one two; three, four,five")
        # ['one', 'two', 'three', 'four', 'five']
        re.split("([,;]?) *", "one two; three, four,five")
        # ['one', '', 'two', ';', 'three', ',', 'four', ',', 'five']
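
The slide's patterns can match the empty string, which Python 2 ignored when splitting but Python 3.7 and later do not. One way to get the same word list on current versions is to write the delimiter so that it always consumes at least one character. This is a sketch of one possible rewrite, not the author's pattern; note that the capturing version reports a space where the original reported an empty string.

```python
import re

text = "one two; three, four,five"

# A comma or semicolon followed by optional spaces, or a run of spaces.
# Every alternative consumes at least one character.
words = re.split(r"[,;] *| +", text)
print(words)

# Capture the first delimiter character to keep it in the result.
with_delims = re.split(r"([,;]| ) *", text)
print(with_delims)
```
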
  13. Flags

    Optional flags can be used to modify the behaviour of a regular expression.

    - re.IGNORECASE – perform a case-insensitive match
    - re.MULTILINE – ^ and $ match beginning and end of each line (they apply to the whole string by default)
    - re.DOTALL – dot matches any character, including newline (by default it is excluded)
    - re.VERBOSE – allows cleaner layout and annotation of long regular expressions
    - The flags parameter is a bitmask; you can combine flags with |

        import re

        re.match("I like [a-z]*", "I LIKE CATS!", re.IGNORECASE)
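
A quick way to see what each flag changes is to run the same pattern with and without it. The two-line sample string below is invented for illustration.

```python
import re

text = "first line\nsecond line"

# ^ matches only at the start of the string unless MULTILINE is set.
default_starts = re.findall(r"^\w+", text)
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(default_starts, multiline_starts)

# . stops at a newline unless DOTALL is set.
default_dot = re.match(r".*", text).group()
dotall_dot = re.match(r".*", text, re.DOTALL).group()
print(repr(default_dot), repr(dotall_dot))

# Flags are combined with |.
combined = re.findall(r"^SECOND", text, re.IGNORECASE | re.MULTILINE)
print(combined)
```
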
  14. Compiled regex objects

    - If you’re going to use the same regex multiple times, it’s more efficient to compile it
    - Every module-level matching function (match, search, findall, sub, split) has an equivalent method on the compiled object

        import re

        SOMETHINGFISH = re.compile("[a-z]*fish")
        SOMETHINGFISH.findall("butterfish, monkfish and cuttlefish")
        # ['butterfish', 'monkfish', 'cuttlefish']
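
To illustrate the second point, the same compiled object can be used for searching, substituting and splitting. A brief sketch reusing the slide's pattern; the replacement word "seafood" is made up.

```python
import re

SOMETHINGFISH = re.compile("[a-z]*fish")
text = "butterfish, monkfish and cuttlefish"

first = SOMETHINGFISH.search(text).group()     # like re.search(pattern, text)
replaced = SOMETHINGFISH.sub("seafood", text)  # like re.sub(pattern, repl, text)
pieces = SOMETHINGFISH.split(text)             # like re.split(pattern, text)
print(first)
print(replaced)
print(pieces)
```
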
  15. How to write a good regular expression

    - What do you want to match?
    - What do you not want to match?
    - Your regex needs to be specific, but not too specific.
    - What is your input like?
    - How predictable is it?
    - Is it going to change in the future?
  16. Danger! High voltage!

    - Parsing arbitrary text is fraught with peril
    - Especially if it’s provided by a user
    - Use tests and include error checking
    - Easy to handle: regex not matching
    - Harder: regex matching something it shouldn’t
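
The advice about tests and error checking might look like this in practice. The sketch is hypothetical: `extract_year` and its pattern are invented for illustration and do not come from the talk. It handles the easy failure (no match) explicitly and tests the harder one (input that must not match).

```python
import re

# Hypothetical helper: find a four-digit year between 1900 and 2099.
# \b is a word boundary, so digits inside a longer number are rejected.
YEAR = re.compile(r"\b(19|20)\d\d\b")

def extract_year(text):
    """Return the first year in text as an int, or None if there is none."""
    m = YEAR.search(text)
    if m is None:  # the easy case: the regex did not match
        return None
    return int(m.group())

# Test what should match...
assert extract_year("PyConZA 2014 was held in October") == 2014
# ...and what should not.
assert extract_year("call 120145") is None
assert extract_year("no digits here") is None
```
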
  17. Pitfalls

    - One regex to rule them all
        - Use re.VERBOSE to make it more legible
    - Parsing XML with regex: don’t do it
        - If you do it, don’t tell anyone
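
As a sketch of what re.VERBOSE buys you, here is the same date pattern written compactly and then laid out with comments. The pattern and sample sentence are assumptions made up for this example.

```python
import re

compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment.
verbose = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

sample = "Talk given on 2014-10-03."
compact_groups = compact.search(sample).groups()
verbose_groups = verbose.search(sample).groups()
print(compact_groups == verbose_groups, verbose_groups)
```
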
  18. Further reading

    - https://docs.python.org/2/library/re.html – Python 2 reference
    - https://pypi.python.org/pypi/regex – third-party library intended as a re replacement
    - http://www.regular-expressions.info/ – not Python-specific
    - https://pythex.org/ – interactive regex sandbox for Python