PyCon Italia - Regex Strikes Back

Regex Strikes Back: Regular Expressions for Text Mining

Who I am
• Hi, I’m Daniel
• Research Engineer & Data Scientist
• #SOReadyToHelp
• Software Engineer (Java 😞) at Free Now 🚕

Who I am (not)

In this talk
• Basic text mining
• Some Python tips
• How to make regular expressions fly

In this talk - Text Mining
• Transforming unstructured (free) text into data
  ◦ Known keywords
  ◦ Vector encoding (one-hot encoding)
• Ontologies, Vocabularies and Custom Dictionaries
  ◦ Keyword Search for Filtering, Cleaning, Encoding

Why regular expressions?
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Why regular expressions?
• Powerful
• Ubiquitous
• Fast?!!

Regular Expressions 101

Regular Expressions 101
    import re
    re.search("pepper", "the sky is blue and the pepper is yellow")
    <re.Match object; span=(24, 30), match='pepper'>

Regular Expressions 101
    re.search("pepper|sky", "the sky is blue and the pepper is yellow")
    <re.Match object; span=(4, 7), match='sky'>
⚡ Filtering
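The filtering use case can be sketched roughly as follows: keep only the documents in which at least one known keyword occurs. The `documents` list and the `keep_matching` helper are illustrative assumptions and do not come from the talk.

```python
import re

# Hypothetical corpus, for illustration only.
documents = [
    "the sky is blue",
    "the grass is green",
    "the pepper is yellow",
]

# Compile once and reuse the pattern for every document.
pattern = re.compile("pepper|sky")


def keep_matching(docs, pattern):
    """Return the documents in which the pattern matches at least once."""
    return [doc for doc in docs if pattern.search(doc)]


filtered = keep_matching(documents, pattern)
print(filtered)
```

`search` stops at the first hit, which is all a yes/no filter needs.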

Regular Expressions 101
    re.findall("pepper|sky", "the sky is blue and the pepper is yellow")
    ['sky', 'pepper']
⚡ One-hot Encoding
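One way the `findall` result could feed a vector encoding is sketched below: one position per known keyword, set to 1 when that keyword occurs in the text. The keyword list (including `grass`) and the `one_hot` helper are assumptions made for this example.

```python
import re

# Hypothetical vocabulary; the position in this list is the vector index.
keywords = ["pepper", "sky", "grass"]

# Escape each keyword so it is matched literally.
pattern = re.compile("|".join(map(re.escape, keywords)))


def one_hot(text, keywords=keywords, pattern=pattern):
    """Return a 0/1 vector with one position per known keyword."""
    found = set(pattern.findall(text))
    return [1 if keyword in found else 0 for keyword in keywords]


vector = one_hot("the sky is blue and the pepper is yellow")
print(vector)
```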

Regular Expressions 101
    res = re.finditer("pepper|sky", "the sky is blue and the pepper is yellow")
    list(res)
    [<re.Match object; span=(4, 7), match='sky'>, <re.Match object; span=(24, 30), match='pepper'>]

    res = re.finditer("pepper|sky", "the sky is blue and the pepper is yellow")
    [(match.start(), match.end(), match.group()) for match in res]
    [(4, 7, 'sky'), (24, 30, 'pepper')]
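The spans reported by `finditer` are what make annotation possible. A minimal sketch, assuming a hypothetical `highlight` helper that wraps every match in brackets:

```python
import re

text = "the sky is blue and the pepper is yellow"


def highlight(text, pattern, left="[", right="]"):
    """Wrap every match in markers, using the spans reported by finditer."""
    pieces = []
    last = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[last:match.start()])  # text before the match
        pieces.append(left + match.group() + right)
        last = match.end()
    pieces.append(text[last:])  # text after the final match
    return "".join(pieces)


highlighted = highlight(text, "pepper|sky")
print(highlighted)
```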

Regular Expressions 101
    re.sub("pepper|sky", "object", "the sky is blue and the pepper is yellow")
    'the object is blue and the object is yellow'

Regular Expressions 101
    lookup = {"pepper": "pimiento", "sky": "cielo"}

    def repl(match, lookup=lookup, default="something"):
        return lookup.get(match.group(), default)

    re.sub("pepper|sky", repl, "the sky is blue and the pepper is yellow")
    'the cielo is blue and the pimiento is yellow'
⚡ Cleaning, normalizing, etc.
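With a larger dictionary, the pattern would probably be generated from the lookup keys rather than typed by hand. The sketch below assumes an extra, hypothetical `skyline` entry to show why key order matters: alternation takes the first branch that matches, so longer keys are placed first.

```python
import re

# "skyline" is a made-up entry added only to illustrate overlapping keys.
lookup = {"pepper": "pimiento", "sky": "cielo", "skyline": "horizonte"}

# Longest keys first: with "sky|skyline" the engine would match "sky"
# inside "skyline" and never try the longer branch.
keys = sorted(lookup, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, keys)))


def translate(text):
    """Replace every known keyword with its entry in the lookup table."""
    return pattern.sub(lambda match: lookup[match.group()], text)


result = translate("the skyline is blue and the pepper is yellow")
print(result)
```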

Regular Expressions 101
    res = re.finditer("pepper|sky", "the PEPPER is yellow", re.IGNORECASE)
    list(res)
    [<re.Match object; span=(4, 10), match='PEPPER'>]

Regular Expressions 101
    import regex
    pattern = '(monkey|monster|dog|cat){1<=e<=2}'
    regex.search(pattern, "This is really a master dag", regex.BESTMATCH)
    <regex.Match object; span=(24, 27), match='dag', fuzzy_counts=(1, 0, 0)>
⚡ Fuzzy Matching

What’s the problem then? 🤔
    import re
    words = [...]
    pattern = re.compile("|".join(words))

What’s the problem then? 🤔
    import re
    [...]
    pattern =
    re.findall(pattern, text)
    110 ms ± 84.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
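A rough way to reproduce this kind of measurement is sketched below. The vocabulary and text are synthetic (random lowercase strings), so the numbers will not match the slide; the sizes and the fixed seed are arbitrary assumptions.

```python
import random
import re
import string
import timeit

random.seed(0)

# Synthetic vocabulary and text, for illustration only.
words = sorted({
    "".join(random.choices(string.ascii_lowercase, k=random.randint(4, 10)))
    for _ in range(1000)
})
text = " ".join(random.choices(words, k=2000))

# One big alternation: word1|word2|...|wordN
pattern = re.compile("|".join(map(re.escape, words)))

matches = pattern.findall(text)

# Average over a few calls; absolute timings depend on the machine.
runs = 3
elapsed = timeit.timeit(lambda: pattern.findall(text), number=runs)
print(f"{len(matches)} matches, {elapsed / runs:.4f} s per call")
```

Growing the `words` list while keeping `text` fixed is a simple way to see how the cost of the plain alternation scales with vocabulary size.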

The Regex Engine (a 🔎 under the hood)

The RegEx Engine
    re.compile("pepper|sky", re.DEBUG)

    BRANCH
      LITERAL 112 'p'
      LITERAL 101 'e'
      LITERAL 112 'p'
      LITERAL 112 'p'
      LITERAL 101 'e'
      LITERAL 114 'r'
    OR
      LITERAL 115 's'
      LITERAL 107 'k'
      LITERAL 121 'y'

The RegEx Engine
    re.compile("baby|bat", re.DEBUG)

    LITERAL 98 'b'
    LITERAL 97 'a'
    BRANCH
      LITERAL 98 'b'
      LITERAL 121 'y'
    OR
      LITERAL 116 't'

The RegEx Engine
    # check if all items share a common prefix
    while True:
        prefix = None
        for item in items:
            if not item:
                break
            if prefix is None:
                prefix = item[0]
            elif item[0] != prefix:
                break
        else:
            # all subitems start with a common "prefix".
            # move it out of the branch
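The loop above can be turned into a small standalone function to see what it does and does not factor out. This is a sketch modeled on the snippet, with plain strings standing in for the parser's internal items; `factor_common_prefix` is a hypothetical name.

```python
def factor_common_prefix(branches):
    """Split off the longest prefix shared by every branch.

    `branches` is a list of strings; returns (prefix, remaining_branches).
    """
    if not branches:
        return "", []
    items = [list(branch) for branch in branches]  # mutable copies
    prefix = []
    while True:
        first = None
        for item in items:
            if not item:
                break  # one branch is exhausted: stop factoring
            if first is None:
                first = item[0]
            elif item[0] != first:
                break  # branches disagree: stop factoring
        else:
            # Every branch starts with the same element: move it out.
            for item in items:
                del item[0]
            prefix.append(first)
            continue
        break
    return "".join(prefix), ["".join(item) for item in items]


print(factor_common_prefix(["baby", "bat"]))
print(factor_common_prefix(["bank", "bat", "baby", "air"]))
```

The second call returns an empty prefix: a prefix is only moved out when every branch shares it, so a single unrelated word such as `air` leaves the other three unfactored.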

trrex
    pip install trrex
trrex = trie + regex

trrex
    ['bank', 'bat', 'baby', 'air']
    '\\b(?:ba(?:nk|by|t)|air)\\b'

trrex
    import trrex as tx
    pattern = tx.make(["baby", "bat", "monkey", "monster"], prefix="", suffix="")
    '(?:mon(?:ster|key)|ba(?:by|t))'
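The "trie + regex" idea can be illustrated with a short pure-Python sketch. This is not trrex's actual implementation: `build_trie`, `trie_to_regex` and `make_pattern` are hypothetical helpers, and the branch order of the generated pattern differs from the output shown above.

```python
import re


def build_trie(words):
    """Nested dicts keyed by character; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}
    return root


def trie_to_regex(node):
    """Serialize a trie node into a regex fragment with shared prefixes."""
    end_here = "" in node
    branches = [
        re.escape(char) + trie_to_regex(child)
        for char, child in sorted(node.items())
        if char != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not end_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # A word may end here while longer words continue: make the rest optional.
    return body + "?" if end_here else body


def make_pattern(words, prefix=r"\b", suffix=r"\b"):
    """Wrap the serialized trie in the given prefix and suffix."""
    return prefix + trie_to_regex(build_trie(words)) + suffix


pattern = make_pattern(["bank", "bat", "baby", "air"])
print(pattern)
print(re.findall(pattern, "a bat and a baby at the bank"))
```

Because shared prefixes appear only once, the engine no longer has to retry `ba` for every word that starts with it.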

RegEx Integrations (the 🍒 on the 🍰)

pandas
    import trrex as tx
    import pandas as pd
    df = pd.DataFrame(["The quick brown fox", "jumps over", "the lazy dog"], columns=["txt"])
    pattern = tx.make(["dog", "fox"])
    df["txt"].str.contains(pattern)

    0     True
    1    False
    2     True
    Name: txt, dtype: bool

pandas
    re         pandas
    search     contains, extract
    findall    findall, extractall
    sub        replace
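The correspondence in the table can be tried out with a short sketch. To keep it independent of trrex, the pattern here is hand-written as a stand-in for a generated one; the sample sentences are assumptions.

```python
import pandas as pd

s = pd.Series(["The quick brown fox", "jumps over", "the lazy dog and the fox"])

# Hand-written stand-ins for a generated pattern.
pattern = r"\b(?:dog|fox)\b"   # non-capturing: for contains, findall, replace
captured = r"\b(dog|fox)\b"    # extract and extractall need a capture group

contains = s.str.contains(pattern)             # re.search  -> boolean per row
first = s.str.extract(captured, expand=False)  # re.search  -> first match per row
found = s.str.findall(pattern)                 # re.findall -> list per row
every = s.str.extractall(captured)             # re.findall -> one row per match
replaced = s.str.replace(pattern, "animal", regex=True)  # re.sub

print(contains.tolist())
print(found.tolist())
print(replaced.tolist())
```

`extractall` returns a DataFrame indexed by the original row and the match number, which is convenient when a row can contain several keywords.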

resources
• trrex repo: https://github.com/mesejo/trex
• flashtext repo: https://github.com/vi3k6i5/flashtext
• SO answer: https://stackoverflow.com/a/42789508/4001592

THANKS! 🎉
