
Memorable Uses For A Regular Expression Library...

PyCon 2014
April 10, 2014

Memorable Uses For A Regular Expression Library by Alex Perry

Sponsor Workshop session by Google at PyCon 2014.


Transcript

  1. Memorable uses for a Regular Expression library

    Learning the syntax by examples
    Alex Perry, SRE, Google, Los Angeles
    April 2014
  2. Outline

    • Simple Regular Expressions
    • import re
      ◦ http://docs.python.org/2/library/re.html
    • Parsing
    • import sre
    • Formatting
    • import sre_yield
    • Arithmetic
    • Performance uncertainty
    • import re2
  3. Basic Regular Expressions

    abc        “abc”
    [abc]      “a” “b” “c”
    abc?       “ab” “abc”
    abc*       “ab” “abc” “abcc” ...
    abc+       “abc” “abcc” “abccc” ...
    abc{3,4}   “abccc” “abcccc”
    ab|c\+     “ab” “c+”
    ab.        “ab.” “ab1” … “ab\n” (with DOTALL)
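
The pattern examples on the slide above can be checked mechanically. This is a minimal sketch for Python 3 (the slides themselves use Python 2). The `EXAMPLES` table and the `check` helper are names made up for this illustration, and `re.fullmatch` is used so that each example string has to match in its entirety.

```python
import re

# Pattern -> strings the slide lists as matching (hypothetical table name).
EXAMPLES = {
    r"abc": ["abc"],
    r"[abc]": ["a", "b", "c"],
    r"abc?": ["ab", "abc"],
    r"abc*": ["ab", "abc", "abcc"],
    r"abc+": ["abc", "abcc", "abccc"],
    r"abc{3,4}": ["abccc", "abcccc"],
    r"ab|c\+": ["ab", "c+"],
    r"ab.": ["ab.", "ab1"],
}


def check(pattern, text, flags=0):
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text, flags) is not None


all_ok = all(check(p, s) for p, strings in EXAMPLES.items() for s in strings)

# "." only matches a newline when the DOTALL flag is set.
newline_default = check(r"ab.", "ab\n")
newline_dotall = check(r"ab.", "ab\n", re.DOTALL)
```

Here `all_ok` and `newline_dotall` come out true, while `newline_default` is false.
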
  4. The standard library - compiling

    >>> import re
    >>> o = re.compile("abc?")
    >>> [bool(o.match(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, True, False]
    >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, True, True]
  5. The standard library - endings

    >>> o = re.compile("^abc?$")
    >>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
    [False, True, True, False, False]
    >>> s = re.compile("i*")       # yes, s also matches ""
    >>> s.split("oiooiioooiii")    # split ignores that silliness
    ['o', 'oo', 'ooo', '']
    >>> s.sub("x", "oiooiioooiii") # but sub does not
    'xoxoxoxoxoxox'
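
The `split` and `sub` results on the slide above depend on how the engine treats empty matches, and that handling was changed in later Python releases (3.7). A hedged sketch for Python 3: requiring at least one `i` with `i+` means the pattern can never match the empty string, so the question does not arise.

```python
import re

# "i+" needs at least one "i", so it can never match the empty string.
s = re.compile("i+")

pieces = s.split("oiooiioooiii")
replaced = s.sub("x", "oiooiioooiii")

print(pieces)    # ['o', 'oo', 'ooo', '']
print(replaced)  # oxooxooox
```
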
  6. Parsing strings easily

    >>> import re
    >>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
    ...                   r"(?P<col>[$]?[0-9]+)")
    >>> m = cell.search("Spreadsheet cell aa$15")
    >>> m
    <_sre.SRE_Match object at 0x7f220a8e9360>
    >>> m.groupdict()
    {'col': '$15', 'row': 'aa'}
  7. Formatting after parsing using a regular expression

    >>> rc = m.groupdict()
    >>> rc
    {'col': '$15', 'row': 'aa'}
    >>> 'It was row %(row)s and column %(col)s' % rc
    'It was row aa and column $15'
    >>> txt = "from a1 2 b$22 as well as 4 $c4"
    >>> f = r"<%(col)s,%(row)s>"
    >>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
    '<1,a>;<$22,b>;<4,$c>'
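
The same parse-then-format idea also works with `str.format`. This is a small sketch for Python 3 that reuses the `cell` pattern and `txt` string from the slides; the `render` helper is a hypothetical name introduced only for this example.

```python
import re

cell = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")
txt = "from a1 2 b$22 as well as 4 $c4"


def render(template, text):
    """Format every match found in `text` using its named groups."""
    return ";".join(
        template.format(**m.groupdict()) for m in cell.finditer(text)
    )


result = render("<{col},{row}>", txt)
print(result)  # <1,a>;<$22,b>;<4,$c>
```

Keeping the template separate from the pattern means the output layout can change without touching the regular expression.
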
  8. Secret (labs) RE engine - internals

    • Originally separate from module “re”
      ◦ From version 2.0 onwards they’re equivalent
      ◦ Call it “sre” in any backward compatible code

    >>> import sre_parse
    >>> sre_parse.parse("ab|c")
    [('branch', (None, [
        [('literal', 97), ('literal', 98)],
        [('literal', 99)]
    ]))]
  9. Secret Regular Expression Yield

    • New module called sre_yield
      ◦ https://github.com/google/sre_yield
    • def Values(regex, flags=0, charset=CHARSET)
      ◦ Examines output from sre_parse.parse()
      ◦ Returns a convenient sequence-like object
    • Sequence has an efficient membership test
      ◦ We were given a regex describing its content
    • Some features (lookahead, etc.) still missing
      ◦ Easy to add if sequence can contain None
  10. Iterating over all matching strings

    >>> import sre_yield
    >>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
    ['123', '1234', '14', '149']
    >>> len(sre_yield.Values('.'))
    256
    >>> sre_yield.Values('a*')[5:10]
    ['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']
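
To see what "all matching strings" means without installing sre_yield, the first result above can be reproduced by brute force. This is only an illustrative sketch for Python 3 and is not how sre_yield works (per the previous slide, it examines the parse tree instead). The `brute_force_values` helper, the candidate alphabet and the length limit are all assumptions chosen for this example.

```python
import itertools
import re


def brute_force_values(pattern, alphabet, max_len):
    """List every string over `alphabet`, up to `max_len` long, that fully matches."""
    compiled = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return sorted(found)


values = brute_force_values(r"1(?P<x>234?|49?)", "12349", 4)
print(values)  # ['123', '1234', '14', '149']
```

The brute-force approach tests 781 candidates to find four strings, which is why generating values directly from the pattern structure is attractive.
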
  11. What do we do about infinite repetitions

    >>> len(sre_yield.Values('0*'))
    65536
    # Yes, really. sre library can only specify 65534 max
    >>> a77k = 'a' * 77000
    >>> len(re.compile(r'.{,65534}').match(a77k).group(0))
    65534
    >>> len(re.compile(r'.{,65535}').match(a77k).group(0))
    77000
    >>> len(re.compile(r'.{60000}.{,6000}|.{,60000}').match(a77k).group(0))
    66000
  12. How many matching strings

    >>> import sre_yield
    >>> bits = sre_yield.Values('[01]*')  # All binary nums
    >>> len(bits)  # how many are there?
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: long int too large to convert to int
    >>> bits.__len__() == 2**65536 - 1  # check the answer
    True
    >>> len(str(bits.__len__()))  # Is the number that big?
    19729
    >>> "001001" in bits, "002001" in bits
    (True, False)
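
The count on the slide above follows from a geometric series: with an alphabet of k symbols and at most m repetitions there are 1 + k + k^2 + ... + k^m strings. A minimal sketch for Python 3, assuming the repetition cap of 65535 implied by the slide's answer; `count_strings` is a hypothetical helper name.

```python
import math


def count_strings(alphabet_size, max_repeat):
    """Count strings of length 0..max_repeat over an alphabet (geometric series)."""
    return (alphabet_size ** (max_repeat + 1) - 1) // (alphabet_size - 1)


small = count_strings(2, 3)           # "", "0", "1", "00", ..., "111"
bits_count = count_strings(2, 65535)  # the '[01]*' case from the slide

# Digit count estimated from the bit length, avoiding a huge str() conversion.
digits = int(bits_count.bit_length() * math.log10(2)) + 1
```

Here `small` is 15, `bits_count` equals `2**65536 - 1`, and `digits` is 19729, in agreement with the session above.
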
  13. Python does understand working with large numbers

    >>> import sre_yield
    >>> anything = sre_yield.Values('.*')
    >>> a = 1
    >>> for _ in xrange(65535): a = a * 256 + 1
    ...
    >>> anything.__len__() == a
    True
    >>> str_a = str(a)  # This does take a while
    >>> len(str_a)
    157825
    >>> str_a[:9], str_a[-9:]
    ('101818453', '945826561')
  14. But why bother yielding from a regex

    • It can be more compact than a literal list, for example:
        ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
    • That doesn’t get much shorter when rewritten:
        (ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
    • On the other hand, others are more convenient:
        www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
    • Some things are better machine generated:
        192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
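
A quick way to trust a hand-compacted pattern is to test it against the literal list it replaces. The sketch below (Python 3) does that for the region pattern and spot-checks the address pattern; the variable names and the two sample addresses are illustrative choices, not taken from the talk.

```python
import re

REGIONS = [
    "ap-northeast-1", "ap-southeast-1", "ap-southeast-2", "eu-west-1",
    "sa-east-1", "us-east-1", "us-west-1", "us-west-2",
]

compact = re.compile(
    r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2"
)

# Every literal region must match, and a made-up region must not.
all_match = all(compact.fullmatch(region) for region in REGIONS)
rejects_other = compact.fullmatch("eu-east-1") is None

private = re.compile(r"192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}")
ok_addr = private.fullmatch("192.168.0.1") is not None
bad_addr = private.fullmatch("192.168.256.1") is not None
```

With these inputs `all_match`, `rejects_other` and `ok_addr` are true and `bad_addr` is false.
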
  15. How fast is the “re” library

    • Implementation uses backtracking, like PCRE
      ◦ So it is fast provided it never guesses wrong
      ◦ Trivial to write an expression that is … slow

    import re
    import timeit

    def test(n):
        t = "a" * n            # text: n copies of "a"
        r = "a?" * n + t       # pattern: n optional "a", then n required "a"
        return bool(re.match(r, t))

    timeit.timeit(stmt="test(6)", setup="from __main__ import test")
  16. The RE2 library

    • https://code.google.com/p/re2
    • https://github.com/axiak/pyre2
    • RE2 tries all possible code paths in parallel
      ◦ never backtracks, so omits features that need it
    • drops support for backreferences
      ◦ and generalized zero-width assertions
    • Predictable worst case performance for any input
      ◦ Safe to accept untrusted regular expressions

    test(10) takes 4 milliseconds instead of one minute
  17. Summary

    • Regular expressions are built into Python
      ◦ re_obj = re.compile(pattern)
      ◦ print re_obj.pattern
    • They can parse strings into a dictionary
      ◦ Or iteratively many dictionaries
    • They can compactly represent large lists
      ◦ Without expanding the whole iterator out
    • For reliable performance, use RE2
      ◦ Especially if users are supplying patterns
  18. Questions?

    • mail -s us.pycon.org/2014 \
      ◦ [email protected]
    • Nothing to do with me, but pretty good:
      ◦ http://qntm.org/files/re/re.html