Worry-Free Parsers with Parsley by Allen Short

Writing effective parsers for structured formats has usually required either tedious or inscrutable code. Here we introduce PEG parsers, which provide a concise and understandable way to write parsers for complex languages, as implemented in the Parsley library.

PyCon 2013

March 17, 2013


Transcript

  1. Worry-Free
    Parsers
    With Parsley
    Allen Short
    [email protected]

  2. {
    "name": "Sir Robin",
    "braveness": -1,
    "companions": [
    "Lancelot",
    "Arthur"
    ]
    }
    Every interesting program has to deal with some kind of structured
    data. XML, HTML, JSON, CSV, etc. are all common. Here's a nice little
    JSON object with some strings and numbers and arrays in it. Python
    makes this pretty easy to deal with; call json.loads() and you get a
    nice dictionary back.
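(Editor's sketch, not from the slides: the round trip the speaker describes, using only the standard library.)

```python
import json

# The JSON object from the slide, as a string.
text = '{"name": "Sir Robin", "braveness": -1, "companions": ["Lancelot", "Arthur"]}'

# json.loads() parses the text and hands back plain Python objects.
data = json.loads(text)
print(data["name"])        # Sir Robin
print(data["companions"])  # ['Lancelot', 'Arthur']
```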


  3. {
    name: "Sir Robin",
    braveness: -1,
    companions: [
    "Lancelot",,"Arthur"
    ]
    }
Sometimes, though, we have to deal with a format we don't have a premade solution for. Here's some data that isn't JSON. It's valid JavaScript, but as you can
see, there are no quotes around the keys... and Sir Not Appearing In This Film wound up in the array down there.


  4. 3
    At this point, you have three
    choices:


  5. do it wrong
    Strangely, this is the most popular
    option.


  6. "HTML is a language of sufficient complexity
    that it cannot be parsed by regular
    expressions. Even Jon Skeet cannot parse
    HTML using regular expressions. Every time
    you attempt to parse HTML with regular
    expressions, the unholy child weeps the
    blood of virgins, and Russian hackers pwn
    your webapp. Parsing HTML with regex
    summons tainted souls into the realm of the
living. The <center> cannot hold it is too late.
    The force of regex and HTML together in the
    same conceptual space will destroy your mind
    like so much watery putty."
    It's tempting to just throw together some regexes and string manipulation and guesswork to write something that will extract the information you need. But as a
    famous post from Andrew Clover on Stack Overflow reminds us,
    this doesn't always work out as well as we might hope.


  7. ocess HTML establishes a breach between this world an
    he dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but
    ore corrupt) a mere glimpse of the world of reg​ex parser
    for HTML will ins​tantly transport a programmer's
    consciousness into a world of ceaseless screaming, he
    omes, the pestilent slithy regex-infection wil​l devour you
    HT​ML parser, application and existence for all time like
    sual Basic only worse he comes he comes do not fi​ght h
    com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enlı̍̈́̂̈́ghtenment,
    HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the
    ong of re̸gular exp​ression parsing will exti​nguish the voice
    mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀
    beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ
    ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the
    ich​or permeates all MY FACE MY FACE ᵒh god no NO
    OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ
    TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
    Some minor detail like whitespace might change, or we might have forgotten some case in the syntax that wasn't in our original example, and the code breaks
    on later inputs.


  8. only worse he comes do not fi​ght he
    s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all
    enment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟u
    k͏e liq​uid pain, the song of re̸gular exp
    n parsing will exti​nguish the voices of m
    n from the sp​here I can see it can you
    t is beautiful t​he final snuffing of the lie
    LL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he
    he c̶̮omes he comes the ich​or perme
    FACE MY FACE ᵒh god no NO NOO̼O
    p the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱
    TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
    Trying to fix those complicates the code even more, making later breakages even harder to understand and fix.


  9. e sp​here I can see it can
    e ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final
    uffing of the lie​s of Man A
    LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the
    comes he c̶̮omes he com
    e ich​or permeates all MY
    ACE MY FACE ᵒh god no N
    OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe
    e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ
    H̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
    Eventually, we spiral into madness and
    despair.


  10. recursive descent
    So we don't want to do that.
    Another option is to write a recursive descent parser that manually iterates over each character or token, building a structure as it goes. This is usually the route
    taken for code that needs the most control over the parser's behavior.
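(Editor's sketch, not from the slides: what the recursive descent style looks like by hand, for a toy format of comma-separated integers in brackets. Every name here is invented for illustration.)

```python
# A hand-written recursive descent parser: one method per grammar rule,
# a position pointer into the text, explicit error checks everywhere.
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError("expected %r at position %d" % (ch, self.pos))
        self.pos += 1

    def number(self):
        start = self.pos
        while self.peek() is not None and self.peek().isdigit():
            self.pos += 1
        if start == self.pos:
            raise SyntaxError("expected digit at position %d" % self.pos)
        return int(self.text[start:self.pos])

    def array(self):
        self.expect('[')
        items = []
        if self.peek() != ']':
            items.append(self.number())
            while self.peek() == ',':
                self.expect(',')
                items.append(self.number())
        self.expect(']')
        return items

print(Parser("[1,2,3]").array())  # [1, 2, 3]
```

Even for this tiny format, the bookkeeping dominates; that is what balloons into hundreds of lines for real grammars.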


  11. json.loads()
    Your C compiler is probably written this way. Python's json parser is written this way, and it works quite well. But have any of you looked at the
    implementation?


  12. 300
    It's pretty tedious reading. Not counting comments or whitespace, the JSON parser is about 300 lines of code.


  13. And JSON is a pretty simple format. It fits on a business card. A more complicated format is going to require a lot more code.


  14. parser generator
    Finally, you could use a parser generator. Parser generators were invented to let us write parsers in less code by providing a special syntax for it.


  15. yacc
    bison
    PLY
    ANTLR
    SPARK
There are several available. yacc and bison are the classic C versions. PLY, ANTLR, and SPARK are available in Python. If you were in Alex Gaynor's talk, you
    heard him talk about how parsing was the most tedious and complicated part of writing an interpreter; his code example was using PLY.
    Now this is where the anxiety usually comes in, because for most people who know about parser generators, this is the first thing it makes them think of:


  16. The problem is that parser generators work completely differently from Python. I always felt like I had to turn on a different part of my brain to deal with this
    kind of stuff. This is a real drag when trying to get new contributors to hack on your parser. Have a look at PEP 306 some time. There’s something like 10 or 12
    steps in there.
    I don’t even know what this picture means. It’s just something that came up on Google Images when I searched for “LALR parser”.


  17. Because it's so different, most discussions of parsing start with the abstract mathematical underpinnings. The famous Dragon Book does this; it talks about
    finite state machines and context-free grammars before talking about how it actually relates to something you'd want to do. Furthermore, most parser generators
    get talked about in the context of writing a compiler, which is really a small part of what they're useful for.


  18. ?
    So is there a way to write parsers that work _and_ are readable and debuggable?
    Fortunately there's been some new research in this area in the last decade.


  19. .strip()
    .split()
    [:x]
    ==
    Python already has good tools for basic text manipulation, so we want something that lets you use your knowledge of those.


  20. re
    Regular expressions are a powerful tool for some kinds of text manipulation, but they're too limited to do the entire job.
    So let's look at something that can give the benefits of both.


  21. PEG
    Parsing expression grammars are a relatively new kind of tool for describing parsers. Unlike the previous generation of parser generator tools, PEGs have a
    very similar model of execution to Python, while keeping the concise syntax of regular expressions.


  22. a(b|c)d+e
Here's a very simple regular expression. It matches a, then b or c, then one or more ds,
then e.
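(Editor's sketch, not from the slides: the same pattern run through Python's re module.)

```python
import re

# a, then b or c, then one or more ds, then e.
pattern = re.compile(r'a(b|c)d+e')

print(bool(pattern.match('abddde')))  # True
print(bool(pattern.match('acde')))    # True
print(bool(pattern.match('aee')))     # False -- no b or c after the a
```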


  23. 'a' ('b' | 'c') 'd'+ 'e'
Here's the PEG expression for the same thing. As you can see, the syntax is very similar. In PEG, spaces mean "and" and the pipe means "or".


  24. match('a') and
    (match('b') or
    match('c'))
    and match1OrMore('d')
    and match('e')
    So here's a pseudo-python expression for what that PEG does. ANDs and ORs short circuit, just like in Python.
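(Editor's sketch, not from the slides: a runnable version of that pseudo-code, with tiny invented matcher helpers over a string and a position pointer. This is not Parsley's API, just an illustration of the short-circuiting.)

```python
# Minimal matcher state: a string and a cursor that advances on success.
class M:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def match(self, ch):
        if self.pos < len(self.text) and self.text[self.pos] == ch:
            self.pos += 1
            return True
        return False

    def match1_or_more(self, ch):
        if not self.match(ch):
            return False
        while self.match(ch):
            pass
        return True

# The PEG 'a' ('b' | 'c') 'd'+ 'e', written as short-circuiting and/or.
def foo(m):
    return (m.match('a') and
            (m.match('b') or m.match('c')) and
            m.match1_or_more('d') and
            m.match('e'))

print(foo(M('acdde')))  # True
print(foo(M('axe')))    # False
```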


  25. foo = 'a' ('b' | 'c') 'd'+ 'e'
    Now let's talk about how PEGs do more than regexes. We're going to give this rule a name,
    foo.


  26. baz = 'b' | 'c'
    foo = 'a' baz 'd'+ 'e'
    Rules can call other rules. So here we can move part of foo into a separate rule.
    PEGs are _expressions_, so they return a value. Here foo returns the last thing it matches, so 'e'.


  27. baz = 'b' | 'c'
    foo = 'a' baz:x 'd'+ 'e' -> x
    But we can give names to those values, and change the value a rule returns. Now foo returns 'b' or 'c', depending on the input.


  28. PEG operation        Python code
     'a'                   match('a')
     foo                   foo()
     foo(x y)              foo(x, y)
     x y                   x and y
     x | y                 x or y
     ~x                    not x
     x*                    while x: ...
     x+                    x; while x: ...
     x?                    x or None
     x:name                name = x
     Each piece of PEG syntax corresponds to similar Python
     syntax.


  29. PyParsing
    LEPL
    Parsimonious
    There are already a few PEG tools in Python, like these libraries. I ended up writing my own, for a couple reasons.


  30. Parsley
Parsley is based on OMeta, a parsing technique developed in Smalltalk and JavaScript in 2007.


  31. Brevity
    Libraries like PyParsing embed their parser expressions in Python syntax, which requires a few contortions. Having a grammar syntax let me make parsing
    expressions short, like regexes.


  32. Error
    reporting
    Nearly all of the parser libraries I’ve used didn’t give you much more than “syntax error”, with no information about where or why.


  33. Speed
    And I wanted it to be fast enough to use for hard
    problems.


  34. 300
    The JSON parser in Python is 300 lines, like I said. Using Parsley we can do it in
    30.


  35. 30


  36. from parsley import makeGrammar
    jsonGrammar = r"""
    object = token('{') members:m token('}') -> dict(m)
    members = (pair:first (token(',') pair)*:rest -> [first] + rest) | -> []
    pair = string:k token(':') value:v -> (k, v)
    array = token('[') elements:xs token(']') -> xs
    elements = (value:first (token(',') value)*:rest -> [first] + rest) | -> []
    value = (string | number | object | array
            | token('true') -> True
            | token('false') -> False
            | token('null') -> None)
    string = token('"') (escapedChar | ~'"' anything)*:c '"' -> ''.join(c)
    escapedChar = '\\' (('"' -> '"') |('\\' -> '\\')
                       |('/' -> '/') |('b' -> '\b')
                       |('f' -> '\f') |('n' -> '\n')
                       |('r' -> '\r') |('t' -> '\t')
                       |('\'' -> '\'') | escapedUnicode)
    hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
    escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))
    number = spaces ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
                                                   | -> int(sign + ds)))
    digit = :x ?(x in '0123456789') -> x
    digits = <digit*>
    digit1_9 = :x ?(x in '123456789') -> x
    intPart = (digit1_9:first digits:rest -> first + rest) | digit
    floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
                          -> float(sign + ds + tail)
    exponent = ('e' | 'E') ('+' | '-')? digits
    """
    JSONParser = makeGrammar(jsonGrammar, {})
    Here are those 30 lines.
    This looks curiously similar to the business card I showed you earlier.


  37. So let’s look at a couple of these rules, just to get a flavor of how PEG parsing for a real format looks.


  38. ws = (' ' |'\t'
    |'\r'|'\n')*
    This one’s real simple. This matches zero or more whitespace
    characters.


  39. object = ws '{'
    members:m
    ws '}'
    -> dict(m)
    Here’s the rule for a JSON object. This skips whitespace if it’s present, then matches an open brace. It then calls the members rule and saves the result as m.
    Then it skips more whitespace and matches a close brace. It then calls dict() on m and returns the dict.


  40. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
Here’s that members rule the previous one called. dict() takes a list of key-value pairs, so members will return one. There are three cases we need to handle
here:


  41. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
    First we try to parse a single
    pair.


  42. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
If that fails, we return an empty list because there is not even a single key-value
pair.


  43. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
If it succeeds we move on, and try to match a comma and then another pair. We repeat this as many times as possible, due to the star. Using star or plus returns a
    list of results.


  44. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
    When we run out of pairs we return a list of all the ones we
    collected.


  45. pair = ws string:k
    ws ':'
    value:v
    -> (k, v)
The rule for a single pair matches a string, a colon, and some JSON value and returns them as a 2-tuple.


  46. value = ws (
    string
    | number
    | object
    | array
    | 'true' -> True
    | 'false' -> False
    | 'null' -> None
    )
    Here’s the rule for a JSON value; we check for any of the JSON types, one by one.
    I’m going to skip the string, number, and array rules, they’re relatively boring.


  47. JSONParser = makeGrammar(
    jsonGrammar, {})
    Here’s how we use this from Python. jsonGrammar is the string containing all these rules we just looked at. This creates a JSONParser class. Each rule is a
    method.


  48. >>> from parsley_json import JSONParser
    >>> txt = """
    ... {
    ... "name": "Sir Robin",
    ... "braveness": -1
    ... }
    ... """
    >>> JSONParser(txt).object()
    Using that parser is pretty easy. We create an instance of the parser and call the object
    rule.


  49. {'braveness': -1,'name': 'Sir Robin'}
    Ta da, we get a
    dict.


  50. >>> from parsley_json import JSONParser
    >>> txt = """
    ... {
    ... name: "Sir Robin",
    ... braveness: -1
    ... }
    ... """
    >>> JSONParser(txt).object()
    So let’s see what happens when we invoke this on our non-JSON
    example.


  51. ometa.runtime.ParseError:
    name: "Sir Robin",
    ^
    Parse error at line 3,
    column 2: expected one of
    token '"', or token '}'.
    trail: [string pair]
    We get an exception! And it shows us where the parse failed, what it expected, and what rules it called trying to parse it.


  52. nqjGrammar = """
jsname = ws <(letter | digit
    | '$' | '_')*>
    pair = (jsname | string):k
    ws ':'
    value:v -> (k, v)
    """
    So now let’s write a jsname rule that matches a javascript identifier. We then write a rule that’s like our pair rule we looked at before, but it matches either a
    string or a javascript name.


  53. NQJParser = makeGrammar(
    nqjGrammar, {},
    extends=JSONParser)
    We then _extend_ the existing parser class, creating a Not-Quite-Javascript parser that subclasses JSONParser.


  54. >>> p = NQJParser(txt)
    >>> print p.object()
    So using our new
    parser...


  55. >>> p = NQJParser(txt)
    >>> print p.object()
    {'braveness': -1,
    'name': 'Sir Robin'}
    ... we handle our custom format just
    fine.


  56. http://isometric.sixsided.org/strips/semicolon_of_death/


  57. [email protected]
    ?
    http://github.com/washort/parsley


  58. foo = 'a' ('b' | 'c') 'd'+ 'e'


  59. Rule('foo',
    And(Exactly('a'),
    Or(Exactly('b'),
    Exactly('c')),
    Many1(Exactly('d')),
    Exactly('e')))


  60. def rule_foo(self):
        self.exactly('a')
        def _or_1():
            return self.exactly('b')
        def _or_2():
            return self.exactly('c')
        self._or(_or_1, _or_2)
        def _many1_3():
            return self.exactly('d')
        self.many1(_many1_3)
        return self.exactly('e')


  61. interp =
    ['+' interp:x interp:y] -> x + y
    | ['*' interp:x interp:y] -> x * y
    | :x -> int(x)


  62. interp =
    ['+' interp:x interp:y] -> x + y
    | ['*' interp:x interp:y] -> x * y
    | :x -> int(x)
    [['+', '3', ['*', '5', '2']]]


  63. interp =
    ['+' interp:x interp:y] -> x + y
    | ['*' interp:x interp:y] -> x * y
    | :x -> int(x)
    [['+', '3', ['*', '5', '2']]]
    13
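(Editor's sketch, not from the slides: the same tree interpretation written as plain Python, for comparison with the three-line Parsley rule above.)

```python
# Walk the nested-list tree: ['+', x, y] adds, ['*', x, y] multiplies,
# and a bare string is an integer literal.
def interp(node):
    if isinstance(node, list):
        op, x, y = node
        if op == '+':
            return interp(x) + interp(y)
        if op == '*':
            return interp(x) * interp(y)
    return int(node)

print(interp(['+', '3', ['*', '5', '2']]))  # 13
```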
