Slide 1

Slide 1 text

Worry-Free Parsers With Parsley Allen Short [email protected]

Slide 2

Slide 2 text

{ "name": "Sir Robin", "braveness": -1, "companions": [ "Lancelot", "Arthur" ] } Every interesting program has to deal with some kind of structured data. XML, HTML, JSON, CSV, etc. are all common. Here's a nice little JSON object with some strings, numbers, and arrays in it. Python makes this pretty easy to deal with: call json.loads() and you get a nice dictionary back.
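The json.loads() call mentioned above is all it takes for well-formed JSON; here it is applied to the object on the slide:

```python
import json

# Parse the JSON object from the slide into a Python dict.
text = '{"name": "Sir Robin", "braveness": -1, "companions": ["Lancelot", "Arthur"]}'
data = json.loads(text)
# data is now a plain dict: keys are strings, values are Python types.
```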

Slide 3

Slide 3 text

{ name: "Sir Robin", braveness: -1, companions: [ "Lancelot",,"Arthur" ] } Sometimes, though, we have to deal with a format we don't have a premade solution for. Here's some data that isn't JSON. It's valid Javascript, but as you can see, there are no quotes around the keys... and Sir Not Appearing In This Film wound up in the array down there.

Slide 4

Slide 4 text

3 At this point, you have three choices:

Slide 5

Slide 5 text

do it wrong Strangely, this is the most popular option.

Slide 6

Slide 6 text

"HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. The &lt;center&gt; cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty." It's tempting to just throw together some regexes, string manipulation, and guesswork to write something that will extract the information you need. But as a famous post from Andrew Clover on Stack Overflow reminds us, this doesn't always work out as well as we might hope.

Slide 7

Slide 7 text

ocess HTML establishes a breach between this world an he dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but ore corrupt) a mere glimpse of the world of reg​ex parser for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he omes, the pestilent slithy regex-infection wil​l devour you HT​ML parser, application and existence for all time like sual Basic only worse he comes he comes do not fi​ght h com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enlı̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the ong of re̸gular exp​ression parsing will exti​nguish the voice mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ Some minor detail like whitespace might change, or we might have forgotten some case in the syntax that wasn't in our original example, and the code breaks on later inputs.

Slide 8

Slide 8 text

only worse he comes do not fi​ght he s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟u k͏e liq​uid pain, the song of re̸gular exp n parsing will exti​nguish the voices of m n from the sp​here I can see it can you t is beautiful t​he final snuffing of the lie LL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he he c̶̮omes he comes the ich​or perme FACE MY FACE ᵒh god no NO NOO̼O p the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ Trying to fix those complicates the code even more, making later breakages even harder to understand and fix.

Slide 9

Slide 9 text

e sp​here I can see it can e ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final uffing of the lie​s of Man A LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the comes he c̶̮omes he com e ich​or permeates all MY ACE MY FACE ᵒh god no N OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ H̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ Eventually, we spiral into madness and despair.

Slide 10

Slide 10 text

recursive descent So we don't want to do that. Another option is to write a recursive descent parser that manually iterates over each character or token, building a structure as it goes. This is usually the route taken for code that needs the most control over the parser's behavior.
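To make "manually iterates over each character" concrete, here is a minimal hand-written recursive descent parser (an illustrative sketch, not code from the talk) for sums of integers like "1+2+3". Each rule is an ordinary function that consumes characters and returns a value plus the new position:

```python
# Hypothetical example: a tiny recursive descent parser for "1+2+3".
# Each parse_* function is one grammar rule, written by hand.
def parse_number(text, pos):
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if start == pos:
        raise ValueError("expected digit at position %d" % pos)
    return int(text[start:pos]), pos

def parse_sum(text, pos=0):
    # sum := number ('+' number)*
    total, pos = parse_number(text, pos)
    while pos < len(text) and text[pos] == '+':
        value, pos = parse_number(text, pos + 1)
        total += value
    return total, pos

result, _ = parse_sum("1+2+3")  # -> 6
```

Even this toy grammar takes ~20 lines; every new production means another hand-written function, which is where the tedium comes from.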

Slide 11

Slide 11 text

json.loads() Your C compiler is probably written this way. Python's json parser is written this way, and it works quite well. But have any of you looked at the implementation?

Slide 12

Slide 12 text

300 It's pretty tedious reading. Not counting comments or whitespace, the JSON parser is about 300 lines of code.

Slide 13

Slide 13 text

And JSON is a pretty simple format. It fits on a business card. A more complicated format is going to require a lot more code.

Slide 14

Slide 14 text

parser generator Finally, you could use a parser generator. Parser generators were invented to let us write parsers in less code by providing a special syntax for it.

Slide 15

Slide 15 text

yacc bison PLY ANTLR SPARK There are several available. yacc and bison are the classic C versions. PLY, ANTLR, and SPARK are available in Python. If you were in Alex Gaynor's talk, you heard him talk about how parsing was the most tedious and complicated part of writing an interpreter; his code example was using PLY. Now this is where the anxiety usually comes in, because for most people who know about parser generators, this is the first thing it makes them think of:

Slide 16

Slide 16 text

The problem is that parser generators work completely differently from Python. I always felt like I had to turn on a different part of my brain to deal with this kind of stuff. This is a real drag when trying to get new contributors to hack on your parser. Have a look at PEP 306 some time. There are something like 10 or 12 steps in there. I don't even know what this picture means. It's just something that came up on Google Images when I searched for "LALR parser".

Slide 17

Slide 17 text

Because it's so different, most discussions of parsing start with the abstract mathematical underpinnings. The famous Dragon Book does this; it talks about finite state machines and context-free grammars before talking about how it actually relates to something you'd want to do. Furthermore, most parser generators get talked about in the context of writing a compiler, which is really a small part of what they're useful for.

Slide 18

Slide 18 text

? So is there a way to write parsers that work _and_ are readable and debuggable? Fortunately there's been some new research in this area in the last decade.

Slide 19

Slide 19 text

.strip() .split() [:x] == Python already has good tools for basic text manipulation, so we want something that lets you use your knowledge of those.

Slide 20

Slide 20 text

re Regular expressions are a powerful tool for some kinds of text manipulation, but they're too limited to do the entire job. So let's look at something that can give the benefits of both.

Slide 21

Slide 21 text

PEG Parsing expression grammars are a relatively new kind of tool for describing parsers. Unlike the previous generation of parser generator tools, PEGs have a very similar model of execution to Python, while keeping the concise syntax of regular expressions.

Slide 22

Slide 22 text

a(b|c)d+e Here's a very simple regular expression. It matches a, then b or c, then one or more ds, then e.

Slide 23

Slide 23 text

'a' ('b' | 'c') 'd'+ 'e' Here's the PEG expression for the same thing. As you can see, the syntax is very similar. In PEG, spaces mean "and" and the pipe means "or".

Slide 24

Slide 24 text

match('a') and (match('b') or match('c')) and match1OrMore('d') and match('e') So here's a pseudo-Python expression for what that PEG does. ANDs and ORs short-circuit, just like in Python.
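The short-circuiting idea can be made runnable with a small helper class (a hypothetical sketch, not Parsley's actual API): each match method returns a truthy value on success and advances a position, so plain Python `and`/`or` chains behave like the PEG expression.

```python
# Hypothetical Matcher class illustrating PEG-as-short-circuiting-Python.
class Matcher:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def match(self, ch):
        # Consume one expected character; advance only on success.
        if self.pos < len(self.text) and self.text[self.pos] == ch:
            self.pos += 1
            return ch
        return None

    def match_one_or_more(self, ch):
        # Greedily consume repeats; 0 is falsy, so at least one is required.
        count = 0
        while self.match(ch):
            count += 1
        return count

m = Matcher("abdde")
ok = bool(m.match('a') and (m.match('b') or m.match('c'))
          and m.match_one_or_more('d') and m.match('e'))  # -> True
```

If `match('b')` fails it does not advance, so the `or` falls through to `match('c')` at the same position, just like PEG ordered choice.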

Slide 25

Slide 25 text

foo = 'a' ('b' | 'c') 'd'+ 'e' Now let's talk about how PEGs do more than regexes. We're going to give this rule a name, foo.

Slide 26

Slide 26 text

baz = 'b' | 'c' foo = 'a' baz 'd'+ 'e' Rules can call other rules. So here we can move part of foo into a separate rule. PEGs are _expressions_, so they return a value. Here foo returns the last thing it matches, so 'e'.

Slide 27

Slide 27 text

baz = 'b' | 'c' foo = 'a' baz:x 'd'+ 'e' -> x But we can give names to those values, and change the value a rule returns. Now foo returns 'b' or 'c', depending on the input.

Slide 28

Slide 28 text

PEG Operation    Python code
'a'              match('a')
foo              foo()
foo(x y)         foo(x, y)
x y              x and y
x | y            x or y
~x               not x
x*               while x: ...
x+               x; while x: ...
x?               x or None
x:name           name = x
Each piece of syntax corresponds to similar Python syntax.

Slide 29

Slide 29 text

PyParsing LEPL Parsimonious There are already a few PEG tools in Python, like these libraries. I ended up writing my own, for a couple reasons.

Slide 30

Slide 30 text

Parsley Parsley is based on OMeta, a parsing technique developed in 2007, with implementations in Smalltalk and Javascript.

Slide 31

Slide 31 text

Brevity Libraries like PyParsing embed their parser expressions in Python syntax, which requires a few contortions. Having a grammar syntax let me make parsing expressions short, like regexes.

Slide 32

Slide 32 text

Error reporting Nearly all of the parser libraries I’ve used didn’t give you much more than “syntax error”, with no information about where or why.

Slide 33

Slide 33 text

Speed And I wanted it to be fast enough to use for hard problems.

Slide 34

Slide 34 text

300 The JSON parser in Python is 300 lines, like I said. Using Parsley we can do it in 30.

Slide 35

Slide 35 text

30

Slide 36

Slide 36 text

from parsley import makeGrammar
jsonGrammar = r"""
object = token('{') members:m token('}') -> dict(m)
members = (pair:first (token(',') pair)*:rest -> [first] + rest) | -> []
pair = string:k token(':') value:v -> (k, v)
array = token('[') elements:xs token(']') -> xs
elements = (value:first (token(',') value)*:rest -> [first] + rest) | -> []
value = (string | number | object | array
        | token('true') -> True
        | token('false') -> False
        | token('null') -> None)
string = token('"') (escapedChar | ~'"' anything)*:c '"' -> ''.join(c)
escapedChar = '\\' (('"' -> '"') |('\\' -> '\\')
                   |('/' -> '/') |('b' -> '\b')
                   |('f' -> '\f') |('n' -> '\n')
                   |('r' -> '\r') |('t' -> '\t')
                   |('\'' -> '\'') | escapedUnicode)
hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))
number = spaces ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
                                               | -> int(sign + ds)))
digit = :x ?(x in '0123456789') -> x
digits = <digit*>
digit1_9 = :x ?(x in '123456789') -> x
intPart = (digit1_9:first digits:rest -> first + rest) | digit
floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
                      -> float(sign + ds + tail)
exponent = ('e' | 'E') ('+' | '-')? digits
"""
JSONParser = makeGrammar(jsonGrammar, {})
Here are those 30 lines. This looks curiously similar to the business card I showed you earlier.

Slide 37

Slide 37 text

So let’s look at a couple of these rules, just to get a flavor of how PEG parsing for a real format looks.

Slide 38

Slide 38 text

ws = (' ' |'\t' |'\r'|'\n')* This one’s real simple. This matches zero or more whitespace characters.

Slide 39

Slide 39 text

object = ws '{' members:m ws '}' -> dict(m) Here’s the rule for a JSON object. This skips whitespace if it’s present, then matches an open brace. It then calls the members rule and saves the result as m. Then it skips more whitespace and matches a close brace. It then calls dict() on m and returns the dict.
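The `-> dict(m)` action relies on a plain Python fact worth spelling out: dict() accepts a list of (key, value) tuples, which is exactly the shape the members rule produces.

```python
# What the members rule hands back: a list of (key, value) tuples.
# dict() turns that directly into the finished JSON object.
pairs = [("name", "Sir Robin"), ("braveness", -1)]
obj = dict(pairs)  # -> {'name': 'Sir Robin', 'braveness': -1}
```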

Slide 40

Slide 40 text

members = (pair:first (ws ',' pair)*:rest -> [first] + rest) | -> [] Here’s that members rule the previous one called. dict() takes a list of key-value pairs. so members will return one. There are three cases we need to handle here:

Slide 41

Slide 41 text

members = (pair:first (ws ',' pair)*:rest -> [first] + rest) | -> [] First we try to parse a single pair.

Slide 42

Slide 42 text

members = (pair:first (ws ',' pair)*:rest -> [first] + rest) | -> [] If that fails, we return an empty list because there is not even a single key-value pair.

Slide 43

Slide 43 text

members = (pair:first (ws ',' pair)*:rest -> [first] + rest) | -> [] If it succeeds we move on, and try to match a comma and then another pair. We repeat this as many times as possible, due to the star. Using star or plus returns a list of results.

Slide 44

Slide 44 text

members = (pair:first (ws ',' pair)*:rest -> [first] + rest) | -> [] When we run out of pairs we return a list of all the ones we collected.

Slide 45

Slide 45 text

pair = ws string:k ws ':' value:v -> (k, v) The rule for a single pair matches a string, a colon, and some JSON value, and returns them as a 2-tuple.

Slide 46

Slide 46 text

value = ws ( string | number | object | array | 'true' -> True | 'false' -> False | 'null' -> None ) Here’s the rule for a JSON value; we check for any of the JSON types, one by one. I’m going to skip the string, number, and array rules, they’re relatively boring.

Slide 47

Slide 47 text

JSONParser = makeGrammar( jsonGrammar, {}) Here’s how we use this from Python. jsonGrammar is the string containing all these rules we just looked at. This creates a JSONParser class. Each rule is a method.

Slide 48

Slide 48 text

>>> from parsley_json import JSONParser >>> txt = """ ... { ... "name": "Sir Robin", ... "braveness": -1 ... } ... """ >>> JSONParser(txt).object() Using that parser is pretty easy. We create an instance of the parser and call the object rule.

Slide 49

Slide 49 text

{'braveness': -1,'name': 'Sir Robin'} Ta da, we get a dict.

Slide 50

Slide 50 text

>>> from parsley_json import JSONParser >>> txt = """ ... { ... name: "Sir Robin", ... braveness: -1 ... } ... """ >>> JSONParser(txt).object() So let’s see what happens when we invoke this on our non-JSON example.

Slide 51

Slide 51 text

ometa.runtime.ParseError: name: "Sir Robin", ^ Parse error at line 3, column 2: expected one of token '"', or token '}'. trail: [string pair] We get an exception! And it shows us where the parse failed, what it expected, and what rules it called trying to parse it.

Slide 52

Slide 52 text

nqjGrammar = """
jsname = ws <(letter | '_') (letterOrDigit | '_')*>
pair = (jsname | string):k ws ':' value:v -> (k, v)
"""
So now let's write a jsname rule that matches a javascript identifier. We then write a rule that's like our pair rule we looked at before, but it matches either a string or a javascript name.

Slide 53

Slide 53 text

NQJParser = makeGrammar( nqjGrammar, {}, extends=JSONParser) We then _extend_ the existing parser class, creating a Not-Quite-Javascript parser that subclasses JSONParser.

Slide 54

Slide 54 text

>>> p = NQJParser(txt) >>> print p.object() So using our new parser...

Slide 55

Slide 55 text

>>> p = NQJParser(txt) >>> print p.object() {'braveness': -1, 'name': 'Sir Robin'} ... we handle our custom format just fine.

Slide 56

Slide 56 text

http://isometric.sixsided.org/strips/semicolon_of_death/

Slide 57

Slide 57 text

[email protected] ? http://github.com/washort/parsley

Slide 58

Slide 58 text

foo = 'a' ('b' | 'c') 'd'+ 'e'

Slide 59

Slide 59 text

Rule('foo',
     And(Exactly('a'),
         Or(Exactly('b'), Exactly('c')),
         Many1(Exactly('d')),
         Exactly('e')))

Slide 60

Slide 60 text

def rule_foo(self):
    self.exactly('a')
    def _or_1():
        self.exactly('b')
    def _or_2():
        return self.exactly('c')
    self._or(_or_1, _or_2)
    def _many1_3():
        return self.exactly('d')
    self.many1(_many1_3)
    return self.exactly('e')

Slide 61

Slide 61 text

interp = ['+' interp:x interp:y] -> x + y
       | ['*' interp:x interp:y] -> x * y
       | :x -> int(x)

Slide 62

Slide 62 text

interp = ['+' interp:x interp:y] -> x + y
       | ['*' interp:x interp:y] -> x * y
       | :x -> int(x)

[['+', '3', ['*', '5', '2']]]

Slide 63

Slide 63 text

interp = ['+' interp:x interp:y] -> x + y
       | ['*' interp:x interp:y] -> x * y
       | :x -> int(x)

[['+', '3', ['*', '5', '2']]]

13
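The interp rule walks the nested-list tree and evaluates it. A plain-Python rendering of the same recursion (illustrative, not Parsley itself; it takes the inner tree rather than the wrapping list the parser matches) looks like this:

```python
# Plain-Python equivalent of the interp rule: recursively evaluate a
# nested-list expression tree of '+' and '*' nodes.
def interp(node):
    if isinstance(node, list):
        op, left, right = node
        if op == '+':
            return interp(left) + interp(right)
        if op == '*':
            return interp(left) * interp(right)
        raise ValueError("unknown operator: %r" % op)
    return int(node)

result = interp(['+', '3', ['*', '5', '2']])  # -> 13
```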