PyCon 2014
April 10, 2014

Memorable Uses For A Regular Expression Library by Alex Perry

Sponsor Workshop session by Google at PyCon 2014.

Transcript

  1. Memorable uses for a Regular Expression library

    Learning the syntax by examples
    Alex Perry
    SRE, Google, Los Angeles
    April 2014
  2. Outline

    • Simple Regular Expressions
    • import re
      ◦ http://docs.python.org/2/library/re.html
    • Parsing
    • import sre
    • Formatting
    • import sre_yield
    • Arithmetic
    • Performance uncertainty
    • import re2
  3. Basic Regular Expressions

    abc         "abc"
    [abc]       "a" "b" "c"
    abc?        "ab" "abc"
    abc*        "ab" "abc" "abcc" ...
    abc+        "abc" "abcc" "abccc" ...
    abc{3,4}    "abccc" "abcccc"
    ab|c\+      "ab" "c+"
    ab.         "ab." "ab1" … "ab\n" (with DOTALL)
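The table on this slide can be checked mechanically. The sketch below is written for Python 3 (the slides use Python 2) and relies on `re.fullmatch`; the `EXAMPLES` table and the helper names are made up for illustration, and the "should not match" strings are additions that do not appear on the slide.

```python
import re

# pattern -> (strings that should match in full, strings that should not)
# The "should not" column is an illustrative addition, not from the slide.
EXAMPLES = {
    r"abc": (["abc"], ["ab", "abcc"]),
    r"[abc]": (["a", "b", "c"], ["ab", ""]),
    r"abc?": (["ab", "abc"], ["abcc"]),
    r"abc*": (["ab", "abc", "abcc"], ["a"]),
    r"abc+": (["abc", "abcc", "abccc"], ["ab"]),
    r"abc{3,4}": (["abccc", "abcccc"], ["abcc", "abccccc"]),
    r"ab|c\+": (["ab", "c+"], ["c", "abc+"]),
    r"ab.": (["ab.", "ab1"], ["ab", "ab\n"]),
}


def check(pattern, flags=0):
    """True if every 'good' string matches in full and no 'bad' one does."""
    good, bad = EXAMPLES[pattern]
    return (all(re.fullmatch(pattern, s, flags) for s in good)
            and not any(re.fullmatch(pattern, s, flags) for s in bad))


def all_examples_hold():
    return all(check(pattern) for pattern in EXAMPLES)


def dot_matches_newline_with_dotall():
    # "." only matches "\n" when the DOTALL flag is set.
    return bool(re.fullmatch(r"ab.", "ab\n", re.DOTALL))
```
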
  4. The standard library - compiling

    >>> import re
    >>> o = re.compile("abc?")
    >>> [bool(o.match(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, True, False]
    >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, True, True]
  5. The standard library - endings

    >>> o = re.compile("^abc?$")
    >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, False, False]
    >>> s = re.compile("i*")  # yes, s also matches the empty string ""
    >>> s.split("oiooiioooiii")  # split ignores that silliness
    ['o', 'oo', 'ooo', '']
    >>> s.sub("x", "oiooiioooiii")  # but sub does not
    'xoxoxoxoxoxox'
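Two ways to sidestep the surprises on this slide, sketched for Python 3. `re.fullmatch` anchors at both ends without writing `^` and `$` (one difference: `$` also matches just before a trailing newline, `fullmatch` does not). Requiring at least one `i` avoids empty matches entirely, which matters because the handling of patterns that can match the empty string has differed between Python versions. The variable names are illustrative.

```python
import re

words = ["a", "ab", "abc", "abcc", "aabcc"]

# fullmatch() must consume the whole string, so no explicit anchors.
anchored = [bool(re.fullmatch("abc?", s)) for s in words]

# "i+" can never match the empty string, unlike "i*".
sep = re.compile("i+")
pieces = sep.split("oiooiioooiii")
replaced = sep.sub("x", "oiooiioooiii")
```
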
  6. Parsing strings easily

    >>> import re
    >>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
    ...                   r"(?P<col>[$]?[0-9]+)")
    >>> m = cell.search("Spreadsheet cell aa$15")
    >>> m
    <_sre.SRE_Match object at 0x7f220a8e9360>
    >>> m.groupdict()
    {'col': '$15', 'row': 'aa'}
  7. Formatting after parsing using a regular expression

    >>> rc = m.groupdict()
    >>> rc
    {'col': '$15', 'row': 'aa'}
    >>> 'It was row %(row)s and column %(col)s' % rc
    'It was row aa and column $15'
    >>> txt = "from a1 2 b$22 as well as 4 $c4"
    >>> f = r"<%(col)s,%(row)s>"
    >>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
    '<1,a>;<$22,b>;<4,$c>'
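The same parse-then-format idea can be written with `str.format` instead of the `%` operator. This is a minimal sketch for Python 3 that reuses the `cell` pattern from the slides; `reformat_cells` and its `template` parameter are hypothetical names introduced here.

```python
import re

# Same pattern as on the slides: a "row" of letters, then a "col" of
# digits, each optionally prefixed by "$".
cell = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")


def reformat_cells(text, template="<{col},{row}>"):
    """Find every cell reference in text and render it with template."""
    return ";".join(template.format(**m.groupdict())
                    for m in cell.finditer(text))
```

Named groups become keyword arguments, so the template can reorder or omit fields without touching the pattern.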
  8. Secret (labs) RE engine - internals

    • Originally separate from module “re”
      ◦ From version 2.0 onwards they’re equivalent
      ◦ Call it “sre” in any backward compatible code

    >>> import sre_parse
    >>> sre_parse.parse("ab|c")
    [('branch', (None, [
        [('literal', 97), ('literal', 98)],
        [('literal', 99)] ]) )]
  9. Secret Regular Expression Yield

    • New module called sre_yield
      ◦ https://github.com/google/sre_yield
    • def Values(regex, flags=0, charset=CHARSET)
      ◦ Examines output from sre_parse.parse()
      ◦ Returns a convenient sequence-like object
    • Sequence has an efficient membership test
      ◦ We were given a regex describing its content
    • Some features (lookahead, etc) still missing
      ◦ Easy to add if sequence can contain None
  10. Iterating over all matching strings

    >>> import sre_yield
    >>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
    ['123', '1234', '14', '149']
    >>> len(sre_yield.Values('.'))
    256
    >>> sre_yield.Values('a*')[5:10]
    ['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']
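For readers without sre_yield installed, the result of the first example can be reproduced by brute force using only the standard library. This is not how sre_yield works (the slides say it examines the output of `sre_parse.parse()`); it is a deliberately naive stand-in that tries every string over a small alphabet. `matching_strings`, `alphabet` and `max_len` are hypothetical names, and the sketch targets Python 3.

```python
import itertools
import re


def matching_strings(pattern, alphabet, max_len):
    """Return, sorted, every string up to max_len over alphabet that the
    pattern matches in full. Exponential in max_len: illustration only."""
    compiled = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return sorted(found)
```

The cost grows as the alphabet size raised to `max_len`, which is the motivation for a library that derives the sequence from the parsed pattern instead.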
  11. What do we do about infinite repetitions

    >>> len(sre_yield.Values('0*'))
    65536
    # Yes, really. The sre library can only specify a maximum of 65534
    >>> a77k = 'a' * 77000
    >>> len(re.compile(r'.{,65534}').match(a77k).group(0))
    65534
    >>> len(re.compile(r'.{,65535}').match(a77k).group(0))
    77000
    >>> len(re.compile(r'.{60000}.{,6000}|.{,60000}')
    ...     .match(a77k).group(0))
    66000
  12. How many matching strings

    >>> import sre_yield
    >>> bits = sre_yield.Values('[01]*')  # All binary nums
    >>> len(bits)  # how many are there?
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: long int too large to convert to int
    >>> bits.__len__() == 2**65536 - 1  # check the answer
    True
    >>> len(str(bits.__len__()))  # Is the number that big?
    19729
    >>> "001001" in bits, "002001" in bits
    (True, False)
  13. Python does understand working with large numbers

    >>> import sre_yield
    >>> anything = sre_yield.Values('.*')
    >>> a = 1
    >>> for _ in xrange(65535): a = a * 256 + 1
    ...
    >>> anything.__len__() == a
    True
    >>> str_a = str(a)  # This does take a while
    >>> len(str_a)
    157825
    >>> str_a[:9], str_a[-9:]
    ('101818453', '945826561')
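The counts on the last two slides are geometric sums: with an alphabet of k characters and between 0 and n repetitions, there are 1 + k + k^2 + ... + k^n strings. The sketch below (Python 3, hypothetical function names) computes that sum both with the loop from the slide and with the closed form (k^(n+1) - 1) / (k - 1), and avoids converting the huge integers to decimal strings.

```python
def count_by_loop(alphabet_size, max_repeats):
    """Same loop as on the slide: start at 1 (the empty string), then
    multiply by the alphabet size and add 1 once per allowed repeat."""
    total = 1
    for _ in range(max_repeats):
        total = total * alphabet_size + 1
    return total


def count_closed_form(alphabet_size, max_repeats):
    """Closed form of the geometric sum; needs alphabet_size > 1."""
    return (alphabet_size ** (max_repeats + 1) - 1) // (alphabet_size - 1)
```

With k = 2 and n = 65535 the closed form gives 2**65536 - 1, the value checked for `[01]*` on the earlier slide.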
  14. But why bother yielding from a regex

    • It can be more compact than a literal list, for example:
        ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
    • That doesn’t get much shorter when rewritten:
        (ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
    • On the other hand, others are more convenient:
        www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
    • Some things would better be machine generated:
        192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
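A hand-compressed pattern is easy to get subtly wrong, so it is worth checking it against the literal list it replaces. The sketch below (Python 3) does that for the region names on this slide; `REGIONS`, `NOT_REGIONS` and `accepts_exactly` are illustrative names, and the negative samples are made-up strings chosen to look plausible.

```python
import re

REGIONS = [
    "ap-northeast-1", "ap-southeast-1", "ap-southeast-2", "eu-west-1",
    "sa-east-1", "us-east-1", "us-west-1", "us-west-2",
]
LITERAL = "|".join(REGIONS)
COMPACT = (r"(ap-(nor|sou)th|sa-|us-)east-1"
           r"|(eu|us)-west-1"
           r"|(us-we|ap-southea)st-2")

# Made-up near misses that neither pattern should accept.
NOT_REGIONS = ["eu-east-1", "ap-northeast-2", "us-west-3", "us-east-2", ""]


def accepts_exactly(pattern):
    """True if pattern fully matches every region and no near miss."""
    return (all(re.fullmatch(pattern, name) for name in REGIONS)
            and not any(re.fullmatch(pattern, name) for name in NOT_REGIONS))
```
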
  15. How fast is the “re” library

    • Implementation uses backtracking, like PCRE
      ◦ So it is fast providing it never guesses wrong
      ◦ Trivial to write an expression that is … slow

    import re
    import timeit

    def test(n):
        t = "a" * n
        r = "a?" * n + t
        return bool(re.match(r, t))

    timeit.timeit(stmt="test(6)", setup="from __main__ import test")
  16. The RE2 library

    • https://code.google.com/p/re2
    • https://github.com/axiak/pyre2
    • RE2 tries all possible code paths in parallel
      ◦ never backtracks, so omits features that need it
    • drops support for backreferences
      ◦ and generalized zero-width assertions
    • Predictable worst case performance for any input
      ◦ Safe to accept untrusted regular expressions

    test(10) takes 4 milliseconds instead of one minute
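For reference, here is what the dropped features look like in the standard `re` module, so it is clear what a pattern must avoid to stay portable to RE2. The sketch uses only the standard library under Python 3; how a given RE2 binding reacts to such patterns is not covered by the slides and is not shown here. The pattern names are illustrative.

```python
import re

# Backreference: \1 must repeat exactly the text captured by group 1.
doubled_word = re.compile(r"(\w+) \1")

# Lookahead: a zero-width assertion that peeks ahead without consuming.
name_before_dot_com = re.compile(r"\w+(?=\.com)")
```
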
  17. Summary

    • Regular expressions are built into Python
      ◦ re_obj = re.compile(pattern)
      ◦ print re_obj.pattern
    • They can parse strings into a dictionary
      ◦ Or iteratively many dictionaries
    • They can compactly represent large lists
      ◦ Without expanding the whole iterator out
    • For reliable performance, use RE2
      ◦ Especially if users are supplying patterns
  18. Questions?

    • mail -s us.pycon.org/2014 \
      ◦ [email protected]
    • Nothing to do with me, but pretty good:
      ◦ http://qntm.org/files/re/re.html