trrex: Efficient Keyword Extraction with Regular Expressions

Who I am
• Hi, I’m Daniel
• Research Engineer & Data Scientist
• #SOReadyToHelp
• Software Engineer (Java 😞) at Free Now

Why regular expressions?
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Why regular expressions?
• Powerful
• Ubiquitous
• Fast?!!

Fast? 🤔
    import re
    long_list_of_words = [...]
    pattern = "|".join(long_list_of_words)
    print(re.search(pattern, "Need to find this keyword fast"))
    # <re.Match object; span=(18, 25), match='keyword'>
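
The flat-union approach above can be tried on a small list. This is a minimal sketch, assuming a hypothetical five-word list in place of the elided one on the slide; `re.escape` is added so that words containing regex metacharacters are matched literally. With a flat alternation, a backtracking engine such as Python's `re` generally tries the alternatives one after another at each position, so search time tends to grow with the number of words.

```python
import re

# Hypothetical stand-in for the elided word list on the slide.
words = ["baby", "bat", "keyword", "monkey", "monster"]

# Escape each word so metacharacters are matched literally,
# then join the alternatives into one flat union pattern.
union_pattern = "|".join(re.escape(word) for word in words)

match = re.search(union_pattern, "Need to find this keyword fast")
print(union_pattern)
print(match)
```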

Fast? 🤔

trrex
    pip install trrex
trrex = trie + regex

trrex
    ['bank', 'bat', 'baby', 'air']
    '\\b(?:ba(?:nk|by|t)|air)\\b'
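
The "trie + regex" idea can be sketched with the standard library alone. The code below is not trrex's implementation, only an illustration under stated assumptions: `build_trie`, `trie_to_regex`, and `make_pattern` are hypothetical helper names, and the order of alternatives inside a group may differ from the pattern trrex produces.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}
    return root


def trie_to_regex(node):
    """Recursively turn a trie node into a regex fragment."""
    end_of_word = "" in node
    branches = [
        re.escape(char) + trie_to_regex(child)
        for char, child in node.items()
        if char != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not end_of_word:
        # A single continuation needs no grouping.
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word also ends here, the continuation is optional.
    return group + "?" if end_of_word else group


def make_pattern(words, prefix=r"\b", suffix=r"\b"):
    """Wrap the trie-derived pattern in word boundaries (assumed defaults)."""
    return prefix + trie_to_regex(build_trie(words)) + suffix


pattern = make_pattern(["bank", "bat", "baby", "air"])
print(pattern)
print(re.findall(pattern, "a baby bat at the bank, not a batch"))
```

Because shared prefixes are factored out, the engine reads the common "ba" once instead of retrying it for every word, which is where the speed-up over a flat union is expected to come from.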

trrex
    import trrex as tx
    pattern = tx.make(["baby", "bat", "monkey", "monster"], prefix="", suffix="")
    # pattern: '(?:mon(?:ster|key)|ba(?:by|t))'

trrex vs union-regex

trrex vs flashtext

RegEx Integrations (the 🍒 on the 🍰)

pandas
    import trrex as tx
    import pandas as pd
    df = pd.DataFrame(["The quick brown fox", "jumps over", "the lazy dog"], columns=["txt"])
    pattern = tx.make(["dog", "fox"])
    df["txt"].str.contains(pattern)
    # 0     True
    # 1    False
    # 2     True
    # Name: txt, dtype: bool

pandas
    re         pandas
    search     contains, extract
    findall    findall, extractall
    sub        replace
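
The `re` column of this mapping can be exercised with the pattern from the earlier slide. This is a small sketch using only the standard library; the sample sentence is made up for illustration. On the pandas side the counterparts are methods of `Series.str`, and note that `str.extract` and `str.extractall` require a capturing group, so a pattern made only of non-capturing groups would first need to be wrapped in parentheses.

```python
import re

# Pattern shown on the earlier slide for ['bank', 'bat', 'baby', 'air'].
pattern = r"\b(?:ba(?:nk|by|t)|air)\b"
text = "The baby saw a bat near the bank"  # made-up sample sentence

first = re.search(pattern, text)        # first match only
found = re.findall(pattern, text)       # every match, as strings
masked = re.sub(pattern, "<kw>", text)  # replace every match

print(first.group())
print(found)
print(masked)
```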

regex
    import trrex as tx
    import regex
    words = ["cat", "dog", "monkey", "monster"]
    pattern = tx.make(words, prefix="", suffix=r'{1<=e<=2}')
    regex.search(pattern, "This is really a master dag", regex.BESTMATCH)
    # <regex.Match object; span=(24, 27), match='dag', fuzzy_counts=(1, 0, 0)>

spacy
    import trrex as tx
    from spacy.lang.en import English
    nlp = English()
    pattern = tx.make(["Amazon", "Apple", "Netflix", "Netlify"])
    nlp.add_pipe("entity_ruler").add_patterns(
        [
            {
                "label": "ORG",
                "pattern": [{"TEXT": {"REGEX": pattern}}],
            },
        ]
    )

spacy
    doc = nlp("Netflix HQ is in Los Gatos.")
    [(ent.text, ent.label_) for ent in doc.ents]
    # [('Netflix', 'ORG')]

repo ⭐ https://github.com/mesejo/trrex ⭐

THANKS! 🎉

Daniel Mesejo
