trrex: Efficient Keyword Extraction with Regular Expressions

These slides introduce the trrex package for efficient keyword extraction with regular expressions. I also show how to integrate it with pandas for text cleaning, how to use it with spaCy to build a gazetteer, and how to perform fuzzy matching with the regex library.

Daniel Mesejo

November 25, 2022

Transcript

  1. Who I am • Hi, I’m Daniel • Research Engineer & Data Scientist • #SOReadyToHelp • Software Engineer (Java 😞) at Free Now
  2. Who I am • Hi, I’m Daniel • Research Engineer & Data Scientist • #SOReadyToHelp • Software Engineer (Java 😞) at Free Now 🚕
  3. Why regular expressions? Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
  4. Fast? 🤔

          import re

          long_list_of_words = [...]
          pattern = "|".join(long_list_of_words)
          print(re.search(pattern, "Need to find this keyword fast"))
  5. Fast? 🤔

          import re

          long_list_of_words = [...]
          pattern = "|".join(long_list_of_words)
          print(re.search(pattern, "Need to find this keyword fast"))

          <re.Match object; span=(18, 25), match='keyword'>
  6. trrex

          import trrex as tx

          pattern = tx.make(["baby", "bat", "monkey", "monster"], prefix="", suffix="")

          '(?:mon(?:ster|key)|ba(?:by|t))'
  7. pandas

          import trrex as tx
          import pandas as pd

          df = pd.DataFrame(["The quick brown fox", "jumps over", "the lazy dog"], columns=["txt"])
          pattern = tx.make(["dog", "fox"])
          df["txt"].str.contains(pattern)
  8. pandas

          import trrex as tx
          import pandas as pd

          df = pd.DataFrame(["The quick brown fox", "jumps over", "the lazy dog"], columns=["txt"])
          pattern = tx.make(["dog", "fox"])
          df["txt"].str.contains(pattern)

          0     True
          1    False
          2     True
          Name: txt, dtype: bool
  9. regex

          import trrex as tx
          import regex

          words = ["cat", "dog", "monkey", "monster"]
          pattern = tx.make(words, prefix="", suffix=r'{1<=e<=2}')
          regex.search(pattern, "This is really a master dag", regex.BESTMATCH)
  10. regex

          import trrex as tx
          import regex

          words = ["cat", "dog", "monkey", "monster"]
          pattern = tx.make(words, prefix="", suffix=r'{1<=e<=2}')
          regex.search(pattern, "This is really a master dag", regex.BESTMATCH)

          <regex.Match object; span=(24, 27), match='dag', fuzzy_counts=(1, 0, 0)>
  11. spacy

          import trrex as tx
          from spacy.lang.en import English

          nlp = English()
          pattern = tx.make(["Amazon", "Apple", "Netflix", "Netlify"])
          nlp.add_pipe("entity_ruler").add_patterns(
              [
                  {
                      "label": "ORG",
                      "pattern": [{"TEXT": {"REGEX": pattern}}],
                  },
              ]
          )
  12. spacy

          doc = nlp("Netflix HQ is in Los Gatos.")
          [(ent.text, ent.label_) for ent in doc.ents]

          [('Netflix', 'ORG')]
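
The slides show the pattern that trrex produces but not how such a pattern can be built. The following is a minimal sketch, assuming a trie-based construction (which the shared prefixes in the slide 6 pattern suggest) and using only the standard library. The helper names `build_trie` and `trie_to_pattern` are hypothetical and are not part of the trrex API, and the order of the alternatives may differ from what trrex emits.

```python
import re


def build_trie(words):
    """Insert each word into a nested-dict trie; the key "" marks end of word."""
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node[""] = True  # end-of-word marker (hypothetical convention)
    return trie


def trie_to_pattern(node):
    """Recursively turn a trie node into a regex fragment with shared prefixes."""
    ends_here = "" in node
    branches = [
        re.escape(char) + trie_to_pattern(child)
        for char, child in sorted(node.items())
        if char != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word ends at this node, the remaining suffix is optional.
    return body + "?" if ends_here else body


words = ["baby", "bat", "monkey", "monster"]
pattern = trie_to_pattern(build_trie(words))
print(pattern)  # (?:ba(?:by|t)|mon(?:key|ster))

# Word boundaries are added by hand here; they are an assumption of this sketch.
print(re.findall(r"\b" + pattern + r"\b", "The baby bat saw a monster and a monkey"))
```

Compared with a plain `"|".join(words)` alternation, the regex engine only has to test each shared prefix once, which is the motivation for compiling the keyword list through a trie.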