
PyCon Italia - Regex Strikes Back

Daniel Mesejo

June 21, 2022

Transcript

  2. Who I am
    • Hi, I’m Daniel
    • Research Engineer & Data Scientist
    • #SOReadyToHelp
    • Software Engineer (Java 😞) at Free Now 🚕
  3. In this talk
    • Basic text mining
    • Some Python tips
    • How to make regular expressions fly
  5. In this talk - Text Mining
    • Transforming unstructured (free) text into data
      ◦ Known keywords
      ◦ Vector encoding (one-hot encoding)
    • Ontologies, Vocabularies and Custom Dictionaries
      ◦ Keyword Search for Filtering, Cleaning, Encoding
  6. Why regular expressions?
    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
  7. Regular Expressions 101
        import re
        re.search("pepper", "the sky is blue and the pepper is yellow")
        <re.Match object; span=(24, 30), match='pepper'>
  9. Regular Expressions 101
        re.search("pepper|sky", "the sky is blue and the pepper is yellow")
        <re.Match object; span=(4, 7), match='sky'>
    ⚡ Filtering
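The filtering use case named on the slide above can be sketched as follows. The keyword list and the documents are hypothetical, and `re.escape` is an addition not shown on the slides; it only matters when a keyword contains regex metacharacters.

```python
import re

# Hypothetical keywords and documents, chosen only for illustration.
keywords = ["pepper", "sky"]
documents = [
    "the sky is blue and the pepper is yellow",
    "nothing to see here",
    "a red pepper",
]

# Join the keywords into one alternation; re.escape protects keywords
# that contain characters such as "." or "+".
pattern = re.compile("|".join(re.escape(word) for word in keywords))

# Keep only the documents in which at least one keyword occurs.
matching = [doc for doc in documents if pattern.search(doc)]
print(matching)
```

`search` stops at the first hit, so it is enough when the only question is whether a document mentions any keyword at all.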
  10. Regular Expressions 101
        re.findall("pepper|sky", "the sky is blue and the pepper is yellow")
        ['sky', 'pepper']
    ⚡ One-hot Encoding
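As a minimal sketch of how the `findall` result above could feed a one-hot style encoding: the vocabulary and the `one_hot` helper are hypothetical names introduced here, not part of the talk.

```python
import re

# Hypothetical fixed vocabulary; its order defines the vector positions.
vocabulary = ["pepper", "sky", "grass"]
pattern = re.compile("|".join(re.escape(word) for word in vocabulary))

def one_hot(text):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    found = set(pattern.findall(text))
    return [1 if word in found else 0 for word in vocabulary]

vector = one_hot("the sky is blue and the pepper is yellow")
print(vector)  # [1, 1, 0]
```

Note that this pattern matches substrings, so "skyline" would also set the "sky" position; adding word boundaries (`\b`) is one way to avoid that.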
  11. Regular Expressions 101
        res = re.finditer("pepper|sky", "the sky is blue and the pepper is yellow")
        list(res)
        [<re.Match object; span=(4, 7), match='sky'>, <re.Match object; span=(24, 30), match='pepper'>]
  13. Regular Expressions 101
        res = re.finditer("pepper|sky", "the sky is blue and the pepper is yellow")
        [(match.start(), match.end(), match.group()) for match in res]
        [(4, 7, 'sky'), (24, 30, 'pepper')]
  14. Regular Expressions 101
        re.sub("pepper|sky", "object", "the sky is blue and the pepper is yellow")
        'the object is blue and the object is yellow'
  17. Regular Expressions 101
        lookup = {"pepper": "pimiento", "sky": "cielo"}
        def repl(match, lookup=lookup, default="something"):
            return lookup.get(match.group(), default)
        re.sub("pepper|sky", repl, "the sky is blue and the pepper is yellow")
        'the cielo is blue and the pimiento is yellow'
    ⚡ Cleaning, normalizing, etc.
  18. Regular Expressions 101
        res = re.finditer("pepper|sky", "the PEPPER is yellow", re.IGNORECASE)
        list(res)
        [<re.Match object; span=(4, 10), match='PEPPER'>]
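The lookup-based substitution and the case-insensitive search from the slides above can be combined. The sketch below is one possible way to do it, under two assumptions that are not in the talk: the dictionary keys are all lowercase, and the pattern is generated from the keys rather than typed by hand.

```python
import re

lookup = {"pepper": "pimiento", "sky": "cielo"}

# Build the pattern from the dictionary keys so the two cannot drift apart.
# Longer keys go first so a longer key wins over any key that is its prefix.
keys = sorted(lookup, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(key) for key in keys), re.IGNORECASE)

def repl(match):
    # With IGNORECASE the matched text may be "PEPPER", so normalize it
    # before the dictionary lookup (keys are assumed to be lowercase).
    return lookup[match.group().lower()]

result = pattern.sub(repl, "the SKY is blue and the Pepper is yellow")
print(result)  # the cielo is blue and the pimiento is yellow
```

Without the `.lower()` call, a match such as "PEPPER" would not be found in the dictionary, which is the situation the `default` argument on the earlier slide falls back on.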
  20. Regular Expressions 101
        import regex
        pattern = '(monkey|monster|dog|cat){1<=e<=2}'
        regex.search(pattern, "This is really a master dag", regex.BESTMATCH)
        <regex.Match object; span=(24, 27), match='dag', fuzzy_counts=(1, 0, 0)>
    ⚡ Fuzzy Matching
  21. What’s the problem then? 🤔
        import re
        words = [...]
        pattern = re.compile("|".join(words))
  22. What’s the problem then? 🤔
        import re
        [...]
        pattern =
        re.findall(pattern, text)
        110 ms ± 84.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
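The word list and text behind the timing above are elided in the transcript, so the following is only a sketch of how a comparable measurement could be set up. The synthetic vocabulary, the text, and their sizes are all assumptions, and the numbers it prints will not match the slide.

```python
import random
import re
import string
import timeit

random.seed(0)  # reproducible synthetic data

def random_word(length=8):
    """Hypothetical helper: a random lowercase word of fixed length."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

# Synthetic stand-ins for the elided `words` and `text`.
words = [random_word() for _ in range(500)]
text = " ".join(random.choices(words, k=250) + [random_word() for _ in range(250)])

# The naive pattern from the slides: one big alternation.
pattern = re.compile("|".join(words))

matches = pattern.findall(text)
seconds = timeit.timeit(lambda: pattern.findall(text), number=3) / 3
print(f"{len(matches)} matches, {seconds * 1000:.1f} ms per call")
```

The cost grows with the size of the vocabulary because the alternatives are tried one after another at each candidate position.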
  23. The RegEx Engine
        re.compile("pepper|sky", re.DEBUG)
        BRANCH
          LITERAL 112 'p'
          LITERAL 101 'e'
          LITERAL 112 'p'
          LITERAL 112 'p'
          LITERAL 101 'e'
          LITERAL 114 'r'
        OR
          LITERAL 115 's'
          LITERAL 107 'k'
          LITERAL 121 'y'
  24. The RegEx Engine
        re.compile("baby|bat", re.DEBUG)
        LITERAL 98 'b'
        LITERAL 97 'a'
        BRANCH
          LITERAL 98 'b'
          LITERAL 121 'y'
        OR
          LITERAL 116 't'
  25. The RegEx Engine
        # check if all items share a common prefix
        while True:
            prefix = None
            for item in items:
                if not item:
                    break
                if prefix is None:
                    prefix = item[0]
                elif item[0] != prefix:
                    break
            else:
                # all subitems start with a common "prefix".
                # move it out of the branch
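The excerpt above stops at the comment. As a self-contained sketch of the same idea, the function below applies the loop to plain strings instead of parsed regex items. `factor_common_prefix` is a hypothetical name, and the guard for an empty input is an addition needed only by this standalone version.

```python
def factor_common_prefix(words):
    """Split words into (prefix shared by all words, remaining suffixes)."""
    if not words:
        return "", []
    items = [list(word) for word in words]
    prefix_chars = []
    while True:
        prefix = None
        for item in items:
            if not item:
                break
            if prefix is None:
                prefix = item[0]
            elif item[0] != prefix:
                break
        else:
            # Every item starts with the same character: move it out.
            for item in items:
                del item[0]
            prefix_chars.append(prefix)
            continue  # check the next character
        break
    return "".join(prefix_chars), ["".join(item) for item in items]

print(factor_common_prefix(["baby", "bat"]))
# ('ba', ['by', 't'])
print(factor_common_prefix(["baby", "bat", "monkey", "monster"]))
# ('', ['baby', 'bat', 'monkey', 'monster'])
```

The second call shows the limitation: a prefix is only factored out when every alternative shares it, so mixed word lists stay a flat alternation.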
  26. trrex
        import trrex as tx
        pattern = tx.make(["baby", "bat", "monkey", "monster"], prefix="", suffix="")
        '(?:mon(?:ster|key)|ba(?:by|t))'
  28. pandas
        import trrex as tx
        import pandas as pd
        df = pd.DataFrame(["The quick brown fox", "jumps over", "the lazy dog"], columns=["txt"])
        pattern = tx.make(["dog", "fox"])
        df["txt"].str.contains(pattern)
        0     True
        1    False
        2     True
        Name: txt, dtype: bool
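The nested pattern shown on the trrex slide has the shape of a trie written out as a regular expression. The sketch below illustrates that idea in plain Python. It is not trrex's actual implementation; `build_trie` and `trie_to_regex` are hypothetical helpers, and the order of the alternatives may differ from what trrex emits.

```python
import re

def build_trie(words):
    """Nested dicts keyed by character; the empty key marks a word end."""
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex that matches exactly its words."""
    if set(node) == {""}:
        return ""  # only an end marker: nothing more to match
    alternatives = []
    optional = False
    for char, child in node.items():
        if char == "":
            optional = True  # a word ends here, so the rest is optional
            continue
        alternatives.append(re.escape(char) + trie_to_regex(child))
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    return body + "?" if optional else body

words = ["baby", "bat", "monkey", "monster"]
pattern = trie_to_regex(build_trie(words))
print(pattern)  # (?:ba(?:by|t)|mon(?:key|ster))

text = "a baby bat met a monster"
print(re.findall(pattern, text))  # ['baby', 'bat', 'monster']
```

Because shared prefixes are matched once per branch rather than once per word, the engine has fewer alternatives to try at each position than with the flat `"|".join(words)` pattern.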