Memorable Uses For A Regular Expression Library by Alex Perry

Memorable uses for a Regular Expression library
Learning the syntax by examples
Alex Perry
SRE, Google, Los Angeles
April 2014

Outline
• Simple Regular Expressions
• import re
  ◦ http://docs.python.org/2/library/re.html
• Parsing
• import sre
• Formatting
• import sre_yield
• Arithmetic
• Performance uncertainty
• import re2

Basic Regular Expressions
abc         “abc”
[abc]       “a” “b” “c”
abc?        “ab” “abc”
abc*        “ab” “abc” “abcc” ...
abc+        “abc” “abcc” “abccc” ...
abc{3,4}    “abccc” “abcccc”
ab|c\+      “ab” “c+”
ab.         “ab.” “ab1” … “ab\n” (with DOTALL)
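The table can be checked mechanically. The sketch below pairs each pattern with the strings listed for it and uses re.fullmatch, so a pattern only counts if it consumes the whole string. It assumes Python 3.4 or later (the slides themselves use Python 2), and the names EXAMPLES and all_match are illustrative, not part of the talk.

```python
import re

# Each pattern from the table, paired with strings the table lists for it.
EXAMPLES = {
    r"abc": ["abc"],
    r"[abc]": ["a", "b", "c"],
    r"abc?": ["ab", "abc"],
    r"abc*": ["ab", "abc", "abcc"],
    r"abc+": ["abc", "abcc", "abccc"],
    r"abc{3,4}": ["abccc", "abcccc"],
    r"ab|c\+": ["ab", "c+"],
    r"ab.": ["ab.", "ab1"],
}


def all_match(examples, flags=0):
    """True when every listed string is fully matched by its pattern."""
    return all(
        re.fullmatch(pattern, text, flags)
        for pattern, texts in examples.items()
        for text in texts
    )


table_ok = all_match(EXAMPLES)

# "." only matches a newline when the DOTALL flag is set.
newline_default = bool(re.fullmatch(r"ab.", "ab\n"))
newline_dotall = bool(re.fullmatch(r"ab.", "ab\n", re.DOTALL))
```

The last two values show why the table marks “ab\n” separately: it is rejected by default and accepted only with DOTALL.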

The standard library - compiling
>>> import re
>>> o = re.compile("abc?")
>>> [bool(o.match(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, False]
>>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, True]
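The booleans hide which part of each string was matched. A small sketch, using only the standard re module, makes that visible: match() is anchored at the start but not at the end, while search() may begin anywhere.

```python
import re

o = re.compile("abc?")

# match() succeeds on "abcc" by consuming only its first three characters.
prefixes = [o.match(s).group(0) for s in ["ab", "abc", "abcc"]]

# search() may start at any position; span() reports where it matched.
found = o.search("aabcc")
where = (found.group(0), found.span())
```

Here prefixes holds the matched prefix of each string, and where shows that the match inside "aabcc" starts at index 1.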

The standard library - endings
>>> o = re.compile("^abc?$")
>>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, False, False]
>>> s = re.compile("i*")      # yes, that s matches ""
>>> s.split("oiooiioooiii")   # split ignores that silliness
['o', 'oo', 'ooo', '']
>>> s.sub("x", "oiooiioooiii")  # but sub does not
'xoxoxoxoxoxox'
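The split and sub results shown here come from the Python 2 interpreter used for the slides; later Python 3 releases changed how empty matches are handled, so those outputs may differ there. A sketch that sidesteps the question is to require at least one character with "i+". It also shows one subtlety of "$" that the heading hints at: it matches just before a trailing newline, whereas \Z does not.

```python
import re

# One or more "i", so the pattern can never match the empty string.
runs = re.compile("i+")
pieces = runs.split("oiooiioooiii")
replaced = runs.sub("x", "oiooiioooiii")

# "$" also matches just before a final newline; "\Z" only at the very end.
loose = bool(re.compile(r"^abc?$").search("abc\n"))
strict = bool(re.compile(r"\Aabc?\Z").search("abc\n"))
```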

Parsing strings easily
>>> import re
>>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
...                   r"(?P<col>[$]?[0-9]+)")
>>> m = cell.search("Spreadsheet cell aa$15")
>>> m
<_sre.SRE_Match object at 0x7f220a8e9360>
>>> m.groupdict()
{'col': '$15', 'row': 'aa'}

Formatting after parsing using a regular expression
>>> rc = m.groupdict()
>>> rc
{'col': '$15', 'row': 'aa'}
>>> 'It was row %(row)s and column %(col)s' % rc
'It was row aa and column $15'
>>> txt = "from a1 2 b$22 as well as 4 $c4"
>>> f = r"<%(col)s,%(row)s>"
>>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
'<1,a>;<$22,b>;<4,$c>'
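The parse-then-format pattern can be wrapped into two small functions. This is a sketch only: parse_cells and describe are hypothetical helper names, and str.format is used in place of the % operator purely as an illustration of the same idea.

```python
import re

# Same pattern as on the slide, written as a single string.
cell = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")


def parse_cells(text):
    """Return one dict per cell reference found in text."""
    return [m.groupdict() for m in cell.finditer(text)]


def describe(text, template="<{col},{row}>"):
    """Format every parsed cell with the template and join the results."""
    return ";".join(template.format(**d) for d in parse_cells(text))


parsed = parse_cells("from a1 2 b$22 as well as 4 $c4")
summary = describe("from a1 2 b$22 as well as 4 $c4")
```

Keeping the dictionaries as an intermediate step means the same parse can feed several different output templates.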

Secret (labs) RE engine - internals
• Originally separate from module “re”
  ◦ As of version 2.0 onwards they’re equivalent
  ◦ Call it “sre” in any backward compatible code
>>> import sre_parse
>>> sre_parse.parse("ab|c")
[('branch', (None, [[('literal', 97), ('literal', 98)],
                    [('literal', 99)]]))]
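The sre_parse module is an internal detail and may not be importable under that name on newer interpreters. A sketch of an alternative way to see the same parse tree is the documented re.DEBUG flag, which prints the parsed pattern while compiling. The helper name dump_parse is hypothetical, and the exact text printed varies between Python versions, so no output is shown.

```python
import contextlib
import io
import re


def dump_parse(pattern):
    """Return the text that re.DEBUG prints while compiling pattern."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


tree = dump_parse("ab|c")
```

The returned text describes a branch with two alternatives, the literals 97 and 98 ("ab") on one side and the literal 99 ("c") on the other, which is the same structure sre_parse.parse reports.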

Secret Regular Expression Yield
• New module called sre_yield
  ◦ https://github.com/google/sre_yield
• def Values(regex, flags=0, charset=CHARSET)
  ◦ Examines output from sre_parse.parse()
  ◦ Returns a convenient sequence like object
• Sequence has an efficient membership test
  ◦ We were given a regex describing its content
• Some features (lookahead, etc) still missing
  ◦ Easy to add if sequence can contain None
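To make the "sequence like object" idea concrete, here is a minimal sketch for one fixed pattern family. BinaryStrings is a hypothetical class standing in for what Values('[01]{0,max_len}') would represent; it is not sre_yield's implementation. Length and indexing are computed arithmetically, and membership is delegated to the regex, so nothing is ever expanded into a list.

```python
import re


class BinaryStrings(object):
    """Hypothetical stand-in for the strings matching [01]{0,max_len}."""

    def __init__(self, max_len):
        self.max_len = max_len
        self._regex = re.compile(r"[01]{0,%d}\Z" % max_len)

    def __len__(self):
        # 1 + 2 + 4 + ... + 2**max_len strings, counted without listing them.
        return 2 ** (self.max_len + 1) - 1

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Strings of length n occupy indexes 2**n - 1 through 2**(n+1) - 2.
        length = (index + 1).bit_length() - 1
        offset = index - (2 ** length - 1)
        return format(offset, "b").zfill(length) if length else ""

    def __contains__(self, candidate):
        # Membership is just the regex we were given in the first place.
        return self._regex.match(candidate) is not None


b = BinaryStrings(3)
first_seven = [b[i] for i in range(7)]
```

With a small max_len, len(b) works directly; for very large values the built-in len() would overflow, which is why the slides call __len__() explicitly.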

Iterating over all matching strings
>>> import sre_yield
>>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
['123', '1234', '14', '149']
>>> len(sre_yield.Values('.'))
256
>>> sre_yield.Values('a*')[5:10]
['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']

What do we do about infinite repetitions
>>> len(sre_yield.Values('0*'))
65536
# Yes, really. sre library can only specify 65534 max
>>> a77k = 'a' * 77000
>>> len(re.compile(r'.{,65534}').match(a77k).group(0))
65534
>>> len(re.compile(r'.{,65535}').match(a77k).group(0))
77000
>>> len(re.compile(r'.{60000}.{,6000}|.{,60000}').match(a77k).group(0))
66000

How many matching strings
>>> import sre_yield
>>> bits = sre_yield.Values('[01]*')  # All binary nums
>>> len(bits)  # how many are there?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to int
>>> bits.__len__() == 2**65536 - 1  # check the answer
True
>>> len(str(bits.__len__()))  # Is the number that big?
19729
>>> "001001" in bits, "002001" in bits
(True, False)

Python does understand working with large numbers
>>> import sre_yield
>>> anything = sre_yield.Values('.*')
>>> a = 1
>>> for _ in xrange(65535): a = a * 256 + 1
...
>>> anything.__len__() == a
True
>>> str_a = str(a)  # This does take a while
>>> len(str_a)
157825
>>> str_a[:9], str_a[-9:]
('101818453', '945826561')
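Both counts are the same geometric series: with an alphabet of k characters and at most n repetitions there are 1 + k + k² + … + kⁿ = (kⁿ⁺¹ − 1)/(k − 1) strings. The sketch below checks that closed form against the loop from the slide and estimates the digit counts with a logarithm instead of str(), since converting such large integers to text is slow and may be restricted on recent interpreters. The function names are illustrative.

```python
import math


def count_strings(alphabet_size, max_repeat=65535):
    """Strings of length 0..max_repeat over the alphabet (size must be > 1)."""
    k, n = alphabet_size, max_repeat
    return (k ** (n + 1) - 1) // (k - 1)


def by_recurrence(alphabet_size, max_repeat):
    """The same count, built up the way the slide's loop does it."""
    a = 1
    for _ in range(max_repeat):
        a = a * alphabet_size + 1
    return a


def decimal_digits(value):
    """Number of decimal digits, estimated without building the string."""
    return int(math.floor(math.log10(value))) + 1


bits_count = count_strings(2)    # corresponds to '[01]*'
any_count = count_strings(256)   # corresponds to '.*'
```

For k = 2 the closed form reduces to 2**65536 - 1, the value checked on the previous slide.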

But why bother yielding from a regex
• It can be more compact than a literal list, for example:
  ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
• That doesn’t get much shorter when rewritten:
  (ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
• On the other hand, others are more convenient:
  www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
• Some things are better machine generated:
  192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
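Hand-compacted expressions are easy to get subtly wrong, so it is worth checking them against the literal list. The sketch below builds the alternation mechanically with re.escape, confirms that the rewritten region expression (written on one line) accepts the same eight names, and exhaustively tests the octet sub-expression from the last bullet. It assumes Python 3.4 or later for fullmatch; the variable names are illustrative.

```python
import re

REGIONS = [
    "ap-northeast-1", "ap-southeast-1", "ap-southeast-2", "eu-west-1",
    "sa-east-1", "us-east-1", "us-west-1", "us-west-2",
]

# Machine-generated alternation: escape each literal, then join with "|".
literal = re.compile("|".join(re.escape(name) for name in REGIONS))

compact = re.compile(
    r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2"
)

both_accept = all(
    literal.fullmatch(name) and compact.fullmatch(name) for name in REGIONS
)
rejects_unknown = compact.fullmatch("eu-east-1") is None

# The octet part of the 192.168.x.y expression should accept exactly 0..255.
octet = re.compile(r"[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5]")
accepted = [n for n in range(1000) if octet.fullmatch(str(n))]
```

If accepted equals the integers 0 through 255, the octet expression is exact for inputs without leading zeros.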

How fast is the “re” library
• Implementation uses backtracking, like PCRE
  ◦ So it is fast providing it never guesses wrong
  ◦ Trivial to write an expression that is … slow

import re
import timeit

def test(n):
    t = "a" * n
    r = "a?" * n + t
    return bool(re.match(r, t))

timeit.timeit(stmt="test(6)", setup="from __main__ import test")

The RE2 library
• https://code.google.com/p/re2
• https://github.com/axiak/pyre2
• RE2 tries all possible code paths in parallel
  ◦ never backtracks, so omits features that need it
• drops support for backreferences
  ◦ and generalized zero-width assertions
• Predictable worst case performance for any input
  ◦ Safe to accept untrusted regular expressions
test(10) takes 4 milliseconds instead of one minute

Summary
• Regular expressions are built into Python
  ◦ re_obj = re.compile(pattern)
  ◦ print re_obj.pattern
• They can parse strings into a dictionary
  ◦ Or iteratively many dictionaries
• They can compactly represent large lists
  ◦ Without expanding the whole iterator out
• For reliable performance, use RE2
  ◦ Especially if users are supplying patterns

Questions?
• mail -s us.pycon.org/2014 \
  ◦ [email protected]
• Nothing to do with me, but pretty good:
  ◦ http://qntm.org/files/re/re.html

PyCon 2014
