>>> [bool(o.search(s)) for s in ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, False, False]
>>> s = re.compile("i*")        # yes, that s matches “”
>>> s.split("oiooiioooiii")     # split ignores that silliness
['o', 'oo', 'ooo', '']
>>> s.sub("x", "oiooiioooiii")  # but sub does not
'xoxoxoxoxoxox'
m.groupdict()
>>> rc
{'col': '$15', 'row': 'aa'}
>>> 'It was row %(row)s and column %(col)s' % rc
'It was row aa and column $15'
>>> txt = "from a1 2 b$22 as well as 4 $c4"
>>> f = r"<%(col)s,%(row)s>"
>>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
'<1,a>;<$22,b>;<4,$c>'
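The session above relies on a compiled pattern named cell that is defined elsewhere in the deck. As a minimal sketch, assuming a hypothetical pattern with named groups row and col that happens to reproduce the output shown (the original definition may well differ), the same steps look like this in Python 3:

```python
import re

# Hypothetical stand-in for the `cell` pattern; the original is not shown here.
# row: optional "$" followed by letters; col: optional "$" followed by digits.
cell = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")

# One string parsed into one dictionary.
rc = cell.match("aa$15").groupdict()
one = "It was row %(row)s and column %(col)s" % rc

# Many matches parsed into many dictionaries, one per cell reference.
txt = "from a1 2 b$22 as well as 4 $c4"
f = r"<%(col)s,%(row)s>"
many = ";".join(f % m.groupdict() for m in cell.finditer(txt))

print(one)
print(many)
```

Because groupdict() returns an ordinary dictionary keyed by group name, it plugs directly into %-style formatting with named placeholders.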
https://github.com/google/sre_yield
• def Values(regex, flags=0, charset=CHARSET)
  ◦ Examines output from sre_parse.parse()
  ◦ Returns a convenient sequence-like object
• Sequence has an efficient membership test
  ◦ We were given a regex describing its content
• Some features (lookahead, etc.) still missing
  ◦ Easy to add if sequence can contain None
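The membership point above follows from how the sequence is built: since a regex describes its content, testing whether a string belongs to it only needs a match against that regex, not a scan of the elements. A minimal sketch of that idea using only the standard library; the class name RegexMembers is hypothetical, and this is not a description of how sre_yield is implemented:

```python
import re


class RegexMembers:
    """Hypothetical container whose contents are described by a regex."""

    def __init__(self, regex, flags=0):
        self._pattern = re.compile(regex, flags)

    def __contains__(self, candidate):
        # A string is a member exactly when the whole string matches.
        return self._pattern.fullmatch(candidate) is not None


bits = RegexMembers("[01]*")
print("001001" in bits, "002001" in bits)
```

The cost of the test depends on the candidate string and the pattern, not on how many strings the pattern describes.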
sre_yield.Values('[01]*')  # All binary nums
>>> len(bits)  # how many are there?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to int
>>> bits.__len__() == 2**65536 - 1  # check the answer
True
>>> len(str(bits.__len__()))  # Is the number that big?
19729
>>> "001001" in bits, "002001" in bits
(True, False)
>>> anything = sre_yield.Values('.*')
>>> a = 1
>>> for _ in xrange(65535): a = a * 256 + 1
>>> anything.__len__() == a
True
>>> str_a = str(a)  # This does take a while
>>> len(str_a)
157825
>>> str_a[:9], str_a[-9:]
('101818453', '945826561')
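The lengths in these two sessions can be cross-checked with a geometric series: with an alphabet of k characters and at most n repetitions, a starred pattern matches 1 + k + k^2 + ... + k^n strings. The sketch below assumes the repeat cap of 65535 that the numbers above imply and a 256-character alphabet for the dot; the helper names are hypothetical. Digits are counted with math.log10 rather than str() because recent Python 3 releases limit int-to-str conversion of very large integers by default.

```python
import math


def count_strings(alphabet_size, max_repeat):
    # Closed form of 1 + k + k**2 + ... + k**max_repeat (requires k >= 2).
    return (alphabet_size ** (max_repeat + 1) - 1) // (alphabet_size - 1)


def digit_count(n):
    # Decimal digits of a positive integer without building the string.
    # Uses floating point, so it could be off by one for values that sit
    # extremely close to a power of ten.
    return int(math.log10(n)) + 1


bits_len = count_strings(2, 65535)        # strings matched by [01]*
anything_len = count_strings(256, 65535)  # strings matched by .*

print(bits_len == 2**65536 - 1)
print(digit_count(bits_len), digit_count(anything_len))
```

The closed form gives the same value as the slide's loop (a = a * 256 + 1, repeated 65535 times), which is the same sum evaluated by Horner's rule.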
be more compact than a literal list, for example:
  ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
• That doesn’t get much shorter when rewritten:
  (ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
• On the other hand, others are more convenient:
  www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
• Some things are better machine generated:
  192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
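A hand-compressed alternation is easy to get subtly wrong, so it is worth checking it against the literal list. A small sketch of such a check in Python 3, using the region names and the rewritten pattern from the slide; the list of non-members is made up for illustration:

```python
import re

regions = [
    "ap-northeast-1", "ap-southeast-1", "ap-southeast-2", "eu-west-1",
    "sa-east-1", "us-east-1", "us-west-1", "us-west-2",
]

# The literal form: one alternative per region name.
literal = re.compile("|".join(regions))

# The hand-compressed form from the slide.
compact = re.compile(
    r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2"
)

# Illustrative names that neither form should accept.
outsiders = ["eu-west-2", "sa-east-2", "ap-northeast-2", "us-east-1a", ""]

# Both patterns must agree on every name, member or not.
agree = all(
    bool(literal.fullmatch(name)) == bool(compact.fullmatch(name))
    for name in regions + outsiders
)
print(agree, len(literal.pattern), len(compact.pattern))
```

A check like this only covers the names it is given; it does not prove the two patterns equivalent for every possible input.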
fast provided it never guesses wrong
  ◦ Trivial to write an expression that is … slow

import re
import timeit

def test(n):
    t = "a" * n
    r = "a?" * n + t  # n optional a's followed by n required a's
    return bool(re.match(r, t))

timeit.timeit(stmt="test(6)", setup="from __main__ import test")

How fast is the “re” library?
all possible code paths in parallel
  ◦ Never backtracks, so omits features that need it
• Drops support for backreferences
  ◦ and generalized zero-width assertions
• Predictable worst-case performance for any input
  ◦ Safe to accept untrusted regular expressions
test(10) takes 4 milliseconds instead of one minute
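The idea of following all code paths in parallel can be illustrated with a tiny state-set simulation. This is a sketch of the general technique for a deliberately restricted pattern language (a sequence of single characters, each optionally marked as optional), not RE2's actual implementation; match_parallel and pathological are hypothetical names. Unlike re.match, it requires the whole text to match.

```python
def match_parallel(items, text):
    """Full-match text against items, a list of (char, is_optional) pairs."""

    def closure(states):
        # Add every position reachable by skipping optional items.
        seen = set(states)
        stack = list(states)
        while stack:
            i = stack.pop()
            if i < len(items) and items[i][1] and i + 1 not in seen:
                seen.add(i + 1)
                stack.append(i + 1)
        return seen

    # A state is an index into items; len(items) means "pattern finished".
    states = closure({0})
    for ch in text:
        # Advance every live state at once instead of guessing one path.
        states = closure(
            {i + 1 for i in states if i < len(items) and items[i][0] == ch}
        )
        if not states:
            return False
    return len(items) in states


def pathological(n):
    # Same shape as the slow pattern: n optional a's, then n required a's.
    return [("a", True)] * n + [("a", False)] * n


print(match_parallel(pathological(200), "a" * 200))
```

The work per input character is bounded by the number of pattern positions, so the running time grows with the product of pattern length and text length rather than exponentially.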
  ◦ print re_obj.pattern
• They can parse strings into a dictionary
  ◦ Or, iteratively, into many dictionaries
• They can compactly represent large lists
  ◦ Without expanding the whole iterator out
• For reliable performance, use RE2
  ◦ Especially if users are supplying patterns