Worry-Free Parsers with Parsley by Allen Short

Writing effective parsers for structured formats has usually required either tedious or inscrutable code. Here we introduce PEG parsers, which provide a concise and understandable way to write parsers for complex languages, as implemented via the Parsley library.

PyCon 2013

March 17, 2013

Transcript

  1. { "name": "Sir Robin",
       "braveness": -1,
       "companions": [ "Lancelot", "Arthur" ] }

    Every interesting program has to deal with some kind of structured data. XML, HTML, JSON, CSV, etc. are all common. Here's a nice little JSON object with some strings and numbers and arrays in it. Python makes this pretty easy to deal with; call json.loads() and you get a nice dictionary back.
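
    As a minimal sketch of that call, using only the standard library (the variable names `text` and `knight` are illustrative, not from the talk):

```python
import json

# The JSON object from the slide, as a Python string.
text = '{ "name": "Sir Robin", "braveness": -1, "companions": [ "Lancelot", "Arthur" ] }'

knight = json.loads(text)        # parse the text into a plain dict
print(knight["name"])            # Sir Robin
print(knight["companions"][1])   # Arthur
```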
  2. { name: "Sir Robin",
       braveness: -1,
       companions: [ "Lancelot",,"Arthur" ] }

    Sometimes, though, we have to deal with a format we don't have a premade solution for. Here's some data that isn't JSON. It's valid Javascript, but as you can see, there are no quotes around the keys... and Sir Not Appearing In This Film wound up in the array down there.
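
    To see why the standard tool is no help here, this small sketch feeds that same text to json.loads() and records whether it was accepted. The names `not_json` and `parsed` are illustrative only.

```python
import json

# The not-quite-JSON data from the slide: bare keys and a doubled comma.
not_json = '{ name: "Sir Robin", braveness: -1, companions: [ "Lancelot",,"Arthur" ] }'

try:
    json.loads(not_json)
    parsed = True
except json.JSONDecodeError as err:
    parsed = False
    # err.lineno and err.colno locate the first character the parser rejected.
    print("not JSON:", err.msg, "at line", err.lineno, "column", err.colno)
```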
  3. "HTML is a language of sufficient complexity that it cannot

    be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty." It's tempting to just throw together some regexes and string manipulation and guesswork to write something that will extract the information you need. But as a famous post from Andrew Clover on Stack Overflow reminds us, this doesn't always work out as well as we might hope.
  4. ocess HTML establishes a breach between this world an he

    dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but ore corrupt) a mere glimpse of the world of reg​ex parser for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he omes, the pestilent slithy regex-infection wil​l devour you HT​ML parser, application and existence for all time like sual Basic only worse he comes he comes do not fi​ght h com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enlı̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the ong of re̸gular exp​ression parsing will exti​nguish the voice mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ Some minor detail like whitespace might change, or we might have forgotten some case in the syntax that wasn't in our original example, and the code breaks on later inputs.
  5. only worse he comes do not fi​ght he s, ̕h̵i​s

    un̨ho͞ly radiańcé destro҉ying all enment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟u k͏e liq​uid pain, the song of re̸gular exp n parsing will exti​nguish the voices of m n from the sp​here I can see it can you t is beautiful t​he final snuffing of the lie LL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he he c̶̮omes he comes the ich​or perme FACE MY FACE ᵒh god no NO NOO̼O p the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ Trying to fix those complicates the code even more, making later breakages even harder to understand and fix.
  6. e sp​here I can see it can e ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it

    is beautiful t​he final uffing of the lie​s of Man A LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the comes he c̶̮omes he com e ich​or permeates all MY ACE MY FACE ᵒh god no N OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ H̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ Eventually, we spiral into madness and despair.
  7. recursive descent So we don't want to do that. Another

    option is to write a recursive descent parser that manually iterates over each character or token, building a structure as it goes. This is usually the route taken for code that needs the most control over the parser's behavior.
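
    To give a flavor of the style, here is a small hand-written recursive descent parser. It is only a sketch for a made-up toy format (nested lists of integers such as "[1, [2, -3], []]"), not code from any real parser, and every function name in it is hypothetical.

```python
def skip_spaces(text, pos):
    """Advance past whitespace and return the new position."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos

def parse_int(text, pos):
    """Parse an optionally negative integer; return (value, new_pos)."""
    start = pos
    if pos < len(text) and text[pos] == "-":
        pos += 1
    digits_start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == digits_start:
        raise ValueError(f"expected a digit at position {pos}")
    return int(text[start:pos]), pos

def parse_list(text, pos):
    """Parse '[' value (',' value)* ']' or '[]'; return (list, new_pos)."""
    pos += 1  # consume "["
    items = []
    pos = skip_spaces(text, pos)
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1
    while True:
        value, pos = parse_value(text, pos)  # recursion handles nesting
        items.append(value)
        pos = skip_spaces(text, pos)
        if pos < len(text) and text[pos] == ",":
            pos += 1
        elif pos < len(text) and text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or ']' at position {pos}")

def parse_value(text, pos):
    """Dispatch on the next character: a list or an integer."""
    pos = skip_spaces(text, pos)
    if pos < len(text) and text[pos] == "[":
        return parse_list(text, pos)
    return parse_int(text, pos)

def parse(text):
    value, pos = parse_value(text, 0)
    if skip_spaces(text, pos) != len(text):
        raise ValueError(f"unexpected trailing input at position {pos}")
    return value

print(parse("[1, [2, -3], []]"))  # [1, [2, -3], []]
```

    Even for a format this small, most of the code is bookkeeping about positions and end-of-input checks.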
  8. json.loads() Your C compiler is probably written this way. Python's

    json parser is written this way, and it works quite well. But have any of you looked at the implementation?
  9. 300 It's pretty tedious reading. Not counting comments or whitespace,

    the JSON parser is about 300 lines of code.
  10. And JSON is a pretty simple format. It fits on

    a business card. A more complicated format is going to require a lot more code.
  11. parser generator Finally, you could use a parser generator. Parser

    generators were invented to let us write parsers in less code by providing a special syntax for it.
  12. yacc bison PLY ANTLR SPARK There's several available. yacc and

    bison are the classic C versions. PLY, ANTLR, and SPARK are available in Python. If you were in Alex Gaynor's talk, you heard him talk about how parsing was the most tedious and complicated part of writing an interpreter; his code example was using PLY. Now this is where the anxiety usually comes in, because for most people who know about parser generators, this is the first thing it makes them think of:
  13. The problem is that parser generators work completely differently from

    Python. I always felt like I had to turn on a different part of my brain to deal with this kind of stuff. This is a real drag when trying to get new contributors to hack on your parser. Have a look at PEP 306 some time. There’s something like 10 or 12 steps in there. I don’t even know what this picture means. It’s just something that came up on Google Images when I searched for “LALR parser”.
  14. Because it's so different, most discussions of parsing start with

    the abstract mathematical underpinnings. The famous Dragon Book does this; it talks about finite state machines and context-free grammars before talking about how it actually relates to something you'd want to do. Furthermore, most parser generators get talked about in the context of writing a compiler, which is really a small part of what they're useful for.
  15. ? So is there a way to write parsers that

    work _and_ are readable and debuggable? Fortunately there's been some new research in this area in the last decade.
  16. .strip() .split() [:x] == Python already has good tools for

    basic text manipulation, so we want something that lets you use your knowledge of those.
  17. re Regular expressions are a powerful tool for some kinds

    of text manipulation, but they're too limited to do the entire job. So let's look at something that can give the benefits of both.
  18. PEG Parsing expression grammars are a relatively new kind of

    tool for describing parsers. Unlike the previous generation of parser generator tools, PEGs have a very similar model of execution to Python, while keeping the concise syntax of regular expressions.
  19. a(b|c)d+e

    Here's a very simple regular expression. It matches a, then b or c, then one or more ds, then e.
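
    For reference, this is how that expression behaves with Python's re module. It is a quick sketch; fullmatch is used so the whole string has to match.

```python
import re

pattern = re.compile(r"a(b|c)d+e")

print(bool(pattern.fullmatch("abde")))    # True
print(bool(pattern.fullmatch("acddde")))  # True
print(bool(pattern.fullmatch("ade")))     # False: neither b nor c
print(bool(pattern.fullmatch("abe")))     # False: no d
```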
  20. 'a' ('b' | 'c') 'd'+ 'e'

    Here's the PEG expression for the same thing. As you can see, the syntax is very similar. In PEG, spaces mean "and" and the pipe means "or".
  21. match('a') and (match('b') or match('c')) and match1OrMore('d') and match('e')

    So here's a pseudo-python expression for what that PEG does. ANDs and ORs short circuit, just like in Python.
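
    That pseudo-Python can be made runnable with a little scaffolding. The sketch below is an assumption about how match and match1OrMore might behave, not Parsley's implementation: the `Matcher` class and the `foo` function are hypothetical, and a failed match returns None so that Python's own and/or short-circuiting does the work.

```python
class Matcher:
    """Tracks a cursor into the input; hypothetical helper for this sketch."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def match(self, literal):
        """Consume `literal` at the cursor; return it, or None on failure."""
        if self.text.startswith(literal, self.pos):
            self.pos += len(literal)
            return literal
        return None

    def match1OrMore(self, literal):
        """Consume one or more copies of `literal`; return them, or None."""
        found = []
        while self.match(literal) is not None:
            found.append(literal)
        return found or None  # an empty list means "no match"

def foo(text):
    m = Matcher(text)
    # Same shape as the slide: evaluation stops at the first failing step.
    return (m.match('a') and (m.match('b') or m.match('c'))
            and m.match1OrMore('d') and m.match('e'))

print(foo("abdde"))  # e
print(foo("ade"))    # None
```

    A real PEG implementation also rewinds the cursor when an alternative fails partway through; that is not needed here because each alternative is a single literal.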
  22. foo = 'a' ('b' | 'c') 'd'+ 'e'

    Now let's talk about how PEGs do more than regexes. We're going to give this rule a name, foo.
  23. baz = 'b' | 'c'
      foo = 'a' baz 'd'+ 'e'

    Rules can call other rules. So here we can move part of foo into a separate rule. PEGs are _expressions_, so they return a value. Here foo returns the last thing it matches, so 'e'.
  24. baz = 'b' | 'c'
      foo = 'a' baz:x 'd'+ 'e' -> x

    But we can give names to those values, and change the value a rule returns. Now foo returns 'b' or 'c', depending on the input.
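
    Written out by hand in plain Python, those two rules might look like the sketch below. The `FooParser` class and `ParseFail` exception are hypothetical names for illustration; the comments show which piece of the grammar each line stands for.

```python
class ParseFail(Exception):
    """Raised when a rule cannot match at the current position."""

class FooParser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def exactly(self, literal):
        """Match one literal or raise ParseFail without consuming input."""
        if not self.text.startswith(literal, self.pos):
            raise ParseFail(f"expected {literal!r} at position {self.pos}")
        self.pos += len(literal)
        return literal

    def baz(self):
        # baz = 'b' | 'c'
        try:
            return self.exactly('b')
        except ParseFail:
            return self.exactly('c')

    def foo(self):
        # foo = 'a' baz:x 'd'+ 'e' -> x
        self.exactly('a')
        x = self.baz()              # baz:x binds the rule's result to a name
        self.exactly('d')           # 'd'+ needs at least one 'd' ...
        while self.text.startswith('d', self.pos):
            self.pos += 1           # ... then consumes any further ones
        self.exactly('e')
        return x                    # -> x

print(FooParser("acdde").foo())  # c
```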
  25. PEG operation    Python code
      'a'              match('a')
      foo              foo()
      foo(x y)         foo(x, y)
      x y              x and y
      x | y            x or y
      ~x               not x
      x*               while x: ...
      x+               x; while x: ...
      x?               x or None
      x:name           name = x

    Each piece of syntax corresponds to similar Python syntax.
  26. PyParsing LEPL Parsimonious There are already a few PEG tools

    in Python, like these libraries. I ended up writing my own, for a couple reasons.
  27. Parsley Parsley is based on OMeta, which is a parser

    technique developed in Smalltalk and in Javascript in 2007.
  28. Brevity Libraries like PyParsing embed their parser expressions in Python

    syntax, which requires a few contortions. Having a grammar syntax let me make parsing expressions short, like regexes.
  29. Error reporting Nearly all of the parser libraries I’ve used

    didn’t give you much more than “syntax error”, with no information about where or why.
  30. 300 The JSON parser in Python is 300 lines, like

    I said. Using Parsley we can do it in 30.
  31. 30

  32. from parsley import makeGrammar
      jsonGrammar = r"""
      object = token('{') members:m token('}') -> dict(m)
      members = (pair:first (token(',') pair)*:rest -> [first] + rest) | -> []
      pair = string:k token(':') value:v -> (k, v)
      array = token('[') elements:xs token(']') -> xs
      elements = (value:first (token(',') value)*:rest -> [first] + rest) | -> []
      value = (string | number | object | array
             | token('true') -> True
             | token('false') -> False
             | token('null') -> None)
      string = token('"') (escapedChar | ~'"' anything)*:c '"' -> ''.join(c)
      escapedChar = '\\' (('"' -> '"') |('\\' -> '\\')
                         |('/' -> '/') |('b' -> '\b')
                         |('f' -> '\f') |('n' -> '\n')
                         |('r' -> '\r') |('t' -> '\t')
                         |('\'' -> '\'') | escapedUnicode)
      hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
      escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))
      number = spaces ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
                                                     | -> int(sign + ds)))
      digit = :x ?(x in '0123456789') -> x
      digits = <digit*>
      digit1_9 = :x ?(x in '123456789') -> x
      intPart = (digit1_9:first digits:rest -> first + rest) | digit
      floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
                          -> float(sign + ds + tail)
      exponent = ('e' | 'E') ('+' | '-')? digits
      """
      JSONParser = makeGrammar(jsonGrammar, {})

    Here are those 30 lines. This looks curiously similar to the business card I showed you earlier.
  33. So let’s look at a couple of these rules, just

    to get a flavor of how PEG parsing for a real format looks.
  34. ws = (' ' |'\t' |'\r'|'\n')*

    This one’s real simple. This matches zero or more whitespace characters.
  35. object = ws '{' members:m ws '}' -> dict(m)

    Here’s the rule for a JSON object. This skips whitespace if it’s present, then matches an open brace. It then calls the members rule and saves the result as m. Then it skips more whitespace and matches a close brace. It then calls dict() on m and returns the dict.
  36. members = (pair:first (ws ',' pair)*:rest -> [first] + rest)
                | -> []

    Here’s that members rule the previous one called. dict() takes a list of key-value pairs, so members will return one. There are three cases we need to handle here:
  37. members = (pair:first (ws ',' pair)*:rest -> [first] + rest)
                | -> []

    First we try to parse a single pair.
  38. members = (pair:first (ws ',' pair)*:rest -> [first] + rest)
                | -> []

    If that fails, we return an empty list because there is not even a single key-value pair.
  39. members = (pair:first (ws ',' pair)*:rest -> [first] + rest)
                | -> []

    If it succeeds we move on, and try to match a comma and then another pair. We repeat this as many times as possible, due to the star. Using star or plus returns a list of results.
  40. members = (pair:first (ws ',' pair)*:rest -> [first] + rest)
                | -> []

    When we run out of pairs we return a list of all the ones we collected.
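
    The same three cases can be spelled out in plain Python. This is a sketch of the control flow only, not what Parsley generates: the `pair` helper is a simplified stand-in that assumes integer values, and all names are hypothetical.

```python
import re

PAIR = re.compile(r'\s*"([^"]*)"\s*:\s*(-?\d+)')  # assumption: integer values only
COMMA = re.compile(r'\s*,')

def pair(text, pos):
    """Return ((key, value), new_pos), or None if no pair starts here."""
    m = PAIR.match(text, pos)
    if m is None:
        return None
    return (m.group(1), int(m.group(2))), m.end()

def members(text, pos=0):
    """members = (pair:first (ws ',' pair)*:rest -> [first] + rest) | -> []"""
    result = pair(text, pos)
    if result is None:
        return [], pos                 # no first pair: the "| -> []" branch
    first, pos = result
    rest = []
    while True:                        # (ws ',' pair)*
        comma = COMMA.match(text, pos)
        if comma is None:
            break
        result = pair(text, comma.end())
        if result is None:
            break                      # back up: leave the comma unconsumed
        item, pos = result
        rest.append(item)
    return [first] + rest, pos         # -> [first] + rest

pairs, end = members('"swallows": 2, "braveness": -1')
print(dict(pairs))  # {'swallows': 2, 'braveness': -1}
```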
  41. pair = ws string:k ws ':' value:v -> (k, v)

    The rule for a single pair matches a string, a colon, and some JSON value and returns them as a 2-tuple.
  42. value = ws ( string | number | object | array
                 | 'true' -> True
                 | 'false' -> False
                 | 'null' -> None )

    Here’s the rule for a JSON value; we check for any of the JSON types, one by one. I’m going to skip the string, number, and array rules, they’re relatively boring.
  43. JSONParser = makeGrammar(jsonGrammar, {})

    Here’s how we use this from Python. jsonGrammar is the string containing all these rules we just looked at. This creates a JSONParser class. Each rule is a method.
  44. >>> from parsley_json import JSONParser
      >>> txt = """
      ... {
      ... "name": "Sir Robin",
      ... "braveness": -1
      ... }
      ... """
      >>> JSONParser(txt).object()

    Using that parser is pretty easy. We create an instance of the parser and call the object rule.
  45. >>> from parsley_json import JSONParser
      >>> txt = """
      ... {
      ... name: "Sir Robin",
      ... braveness: -1
      ... }
      ... """
      >>> JSONParser(txt).object()

    So let’s see what happens when we invoke this on our non-JSON example.
  46. ometa.runtime.ParseError:
      name: "Sir Robin",
      ^
      Parse error at line 3, column 2: expected one of token '"', or token '}'. trail: [string pair]

    We get an exception! And it shows us where the parse failed, what it expected, and what rules it called trying to parse it.
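
    The "where" part of a message like that only needs the character offset at which the parse gave up. The sketch below shows one way to turn an offset into a line, a column, and a caret. It is an illustration under assumptions, not Parsley's own code: `describe_error` is a hypothetical helper, columns are counted from zero, and the two-space indentation in `txt` is assumed.

```python
def describe_error(text, offset, expected):
    """Format a parse failure at character `offset` of `text`."""
    line_no = text.count("\n", 0, offset) + 1
    line_start = text.rfind("\n", 0, offset) + 1   # 0 when on the first line
    line_end = text.find("\n", offset)
    if line_end == -1:
        line_end = len(text)
    column = offset - line_start                   # zero-based (assumption)
    return "\n".join([
        text[line_start:line_end],                 # the offending line
        " " * column + "^",                        # caret under the failure point
        f"Parse error at line {line_no}, column {column}: expected {expected}",
    ])

txt = '\n{\n  name: "Sir Robin",\n  braveness: -1\n}\n'
message = describe_error(txt, txt.index("name"), "a quoted string or '}'")
print(message)
```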
  47. nqjGrammar = """
      jsname = ws <letter (letter | digit | '$' | '_')*>
      pair = (jsname | string):k ws ':' value:v -> (k, v)
      """

    So now let’s write a jsname rule that matches a javascript identifier. We then write a rule that’s like our pair rule we looked at before, but it matches either a string or a javascript name.
  48. NQJParser = makeGrammar(nqjGrammar, {}, extends=JSONParser)

    We then _extend_ the existing parser class, creating a Not-Quite-Javascript parser that subclasses JSONParser.
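
    Since each rule is a method, extending a grammar works like ordinary subclassing: override one rule and inherit the rest. The plain-Python sketch below illustrates that idea only; it does not use Parsley. The classes `PairParser` and `LoosePairParser` are hypothetical, values are assumed to be integers, and the identifier pattern is an ASCII-only approximation of the jsname rule.

```python
import re

class PairParser:
    """Hand-written stand-in for a grammar class: one method per rule."""
    STRING = re.compile(r'\s*"([^"]*)"')
    COLON = re.compile(r'\s*(:)')
    VALUE = re.compile(r'\s*(-?\d+)')   # assumption: integer values only

    def __init__(self, text):
        self.text, self.pos = text, 0

    def _take(self, pattern, what):
        """Match `pattern` at the cursor and return its first group."""
        m = pattern.match(self.text, self.pos)
        if m is None:
            raise ValueError(f"expected {what} at position {self.pos}")
        self.pos = m.end()
        return m.group(1)

    def key(self):
        return self._take(self.STRING, "a quoted string")

    def pair(self):
        k = self.key()
        self._take(self.COLON, "':'")
        return k, int(self._take(self.VALUE, "a number"))

class LoosePairParser(PairParser):
    """Overrides only the key rule; pair() is inherited unchanged."""
    JSNAME = re.compile(r'\s*([A-Za-z][A-Za-z0-9$_]*)')

    def key(self):
        # (jsname | string): try the new alternative, then the parent rule.
        m = self.JSNAME.match(self.text, self.pos)
        if m is None:
            return super().key()
        self.pos = m.end()
        return m.group(1)

print(LoosePairParser('braveness: -1').pair())    # ('braveness', -1)
print(LoosePairParser('"braveness": -1').pair())  # ('braveness', -1)
```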
  49. >>> p = NQJParser(txt)
      >>> print p.object()
      {'braveness': -1, 'name': 'Sir Robin'}

    ... we handle our custom format just fine.
  50. def rule_foo(self):
          self.exactly('a')
          def _or_1():
              return self.exactly('b')
          def _or_2():
              return self.exactly('c')
          self._or(_or_1, _or_2)
          def _many1_3():
              return self.exactly('d')
          self.many1(_many1_3)
          return self.exactly('e')
  51. interp = ['+' interp:x interp:y] -> x + y
             | ['*' interp:x interp:y] -> x * y
             | <digit+>:x -> int(x)
  52. interp = ['+' interp:x interp:y] -> x + y
             | ['*' interp:x interp:y] -> x * y
             | <digit+>:x -> int(x)

      [['+', '3', ['*', '5', '2']]]
  53. interp = ['+' interp:x interp:y] -> x + y
             | ['*' interp:x interp:y] -> x * y
             | <digit+>:x -> int(x)

      [['+', '3', ['*', '5', '2']]]

      13
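
    The interp rule on these last slides matches nested lists rather than text. As a rough plain-Python reading of what it computes (a sketch, not how Parsley evaluates it), each alternative becomes one branch of a recursive function. Here the function takes the tree itself rather than the one-element list wrapping it on the slide.

```python
def interp(node):
    """Evaluate a prefix-notation tree such as ['+', '3', ['*', '5', '2']]."""
    if isinstance(node, str) and node.isdigit():
        return int(node)                         # <digit+>:x -> int(x)
    if isinstance(node, list) and len(node) == 3:
        op, left, right = node
        if op == '+':
            return interp(left) + interp(right)  # ['+' interp:x interp:y] -> x + y
        if op == '*':
            return interp(left) * interp(right)  # ['*' interp:x interp:y] -> x * y
    raise ValueError(f"cannot interpret {node!r}")

print(interp(['+', '3', ['*', '5', '2']]))  # 13
```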