Worry-Free Parsers with Parsley by Allen Short

Writing effective parsers for structured formats has usually required either tedious or inscrutable code. Here we introduce PEG parsers, which provide a concise and understandable way to write parsers for complex languages, as implemented in the Parsley library.

PyCon 2013

March 17, 2013


Transcript

  1. Worry-Free
    Parsers
    With Parsley
    Allen Short
    [email protected]

  2. {
    "name": "Sir Robin",
    "braveness": -1,
    "companions": [
    "Lancelot",
    "Arthur"
    ]
    }
    Every interesting program has to deal with some kind of structured
    data. XML, HTML, JSON, CSV, etc. are all common. Here's a nice little
    JSON object with some strings and numbers and arrays in it. Python
    makes this pretty easy to deal with; call json.loads() and you get a
    nice dictionary back.
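(Editor's sketch, not from the slides: the round trip the speaker describes, using only the standard library.)

```python
import json

# The JSON object from the slide, as a string.
text = '{"name": "Sir Robin", "braveness": -1, "companions": ["Lancelot", "Arthur"]}'

# json.loads() parses the text and hands back plain Python objects.
data = json.loads(text)
print(data["name"])        # Sir Robin
print(data["companions"])  # ['Lancelot', 'Arthur']
```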


  3. {
    name: "Sir Robin",
    braveness: -1,
    companions: [
    "Lancelot",,"Arthur"
    ]
    }
Sometimes, though, we have to deal with a format we don't have a premade solution for. Here's some data that isn't JSON. It's valid JavaScript, but as you can
see, there are no quotes around the keys... and Sir Not Appearing In This Film wound up in the array down there.


  4. 3
    At this point, you have three
    choices:


  5. do it wrong
    Strangely, this is the most popular
    option.


  6. "HTML is a language of sufficient complexity
    that it cannot be parsed by regular
    expressions. Even Jon Skeet cannot parse
    HTML using regular expressions. Every time
    you attempt to parse HTML with regular
    expressions, the unholy child weeps the
    blood of virgins, and Russian hackers pwn
    your webapp. Parsing HTML with regex
    summons tainted souls into the realm of the
living. The <center> cannot hold it is too late.
    The force of regex and HTML together in the
    same conceptual space will destroy your mind
    like so much watery putty."
    It's tempting to just throw together some regexes and string manipulation and guesswork to write something that will extract the information you need. But as a
    famous post from Andrew Clover on Stack Overflow reminds us,
    this doesn't always work out as well as we might hope.


  7. ocess HTML establishes a breach between this world an
    he dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but
    ore corrupt) a mere glimpse of the world of reg​ex parser
    for HTML will ins​tantly transport a programmer's
    consciousness into a world of ceaseless screaming, he
    omes, the pestilent slithy regex-infection wil​l devour you
    HT​ML parser, application and existence for all time like
    sual Basic only worse he comes he comes do not fi​ght h
    com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enlı̍̈́̂̈́ghtenment,
    HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the
    ong of re̸gular exp​ression parsing will exti​nguish the voice
    mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀
    beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ
    ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the
    ich​or permeates all MY FACE MY FACE ᵒh god no NO
    OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ
    TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
    Some minor detail like whitespace might change, or we might have forgotten some case in the syntax that wasn't in our original example, and the code breaks
    on later inputs.


  8. only worse he comes do not fi​ght he
    s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all
    enment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟u
    k͏e liq​uid pain, the song of re̸gular exp
    n parsing will exti​nguish the voices of m
    n from the sp​here I can see it can you
    t is beautiful t​he final snuffing of the lie
    LL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he
    he c̶̮omes he comes the ich​or perme
    FACE MY FACE ᵒh god no NO NOO̼O
    p the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱
    TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
    Trying to fix those complicates the code even more, making later breakages even harder to understand and fix.


  9. e sp​here I can see it can
    e ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final
    uffing of the lie​s of Man A
    LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the
    comes he c̶̮omes he com
    e ich​or permeates all MY
    ACE MY FACE ᵒh god no N
    OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe
    e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ
    H̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
    Eventually, we spiral into madness and
    despair.


  10. recursive descent
    So we don't want to do that.
    Another option is to write a recursive descent parser that manually iterates over each character or token, building a structure as it goes. This is usually the route
    taken for code that needs the most control over the parser's behavior.
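(Editor's sketch, not from the slides: what the recursive descent style looks like by hand, for a toy format of comma-separated integers in brackets. Every name here is invented for illustration.)

```python
# A hand-written recursive descent parser: one method per grammar rule,
# a position pointer into the text, explicit error checks everywhere.
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError("expected %r at position %d" % (ch, self.pos))
        self.pos += 1

    def number(self):
        start = self.pos
        while self.peek() is not None and self.peek().isdigit():
            self.pos += 1
        if start == self.pos:
            raise SyntaxError("expected digit at position %d" % self.pos)
        return int(self.text[start:self.pos])

    def array(self):
        self.expect('[')
        items = []
        if self.peek() != ']':
            items.append(self.number())
            while self.peek() == ',':
                self.expect(',')
                items.append(self.number())
        self.expect(']')
        return items

print(Parser("[1,2,3]").array())  # [1, 2, 3]
```

Even for this tiny format, the bookkeeping dominates; that is what balloons into hundreds of lines for real grammars.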


  11. json.loads()
    Your C compiler is probably written this way. Python's json parser is written this way, and it works quite well. But have any of you looked at the
    implementation?


  12. 300
    It's pretty tedious reading. Not counting comments or whitespace, the JSON parser is about 300 lines of code.


  13. And JSON is a pretty simple format. It fits on a business card. A more complicated format is going to require a lot more code.


  14. parser generator
    Finally, you could use a parser generator. Parser generators were invented to let us write parsers in less code by providing a special syntax for it.


  15. yacc
    bison
    PLY
    ANTLR
    SPARK
There are several available. yacc and bison are the classic C versions. PLY, ANTLR, and SPARK are available in Python. If you were in Alex Gaynor's talk, you
    heard him talk about how parsing was the most tedious and complicated part of writing an interpreter; his code example was using PLY.
    Now this is where the anxiety usually comes in, because for most people who know about parser generators, this is the first thing it makes them think of:


  16. The problem is that parser generators work completely differently from Python. I always felt like I had to turn on a different part of my brain to deal with this
    kind of stuff. This is a real drag when trying to get new contributors to hack on your parser. Have a look at PEP 306 some time. There’s something like 10 or 12
    steps in there.
    I don’t even know what this picture means. It’s just something that came up on Google Images when I searched for “LALR parser”.


  17. Because it's so different, most discussions of parsing start with the abstract mathematical underpinnings. The famous Dragon Book does this; it talks about
    finite state machines and context-free grammars before talking about how it actually relates to something you'd want to do. Furthermore, most parser generators
    get talked about in the context of writing a compiler, which is really a small part of what they're useful for.


  18. ?
    So is there a way to write parsers that work _and_ are readable and debuggable?
    Fortunately there's been some new research in this area in the last decade.


  19. .strip()
    .split()
    [:x]
    ==
    Python already has good tools for basic text manipulation, so we want something that lets you use your knowledge of those.


  20. re
    Regular expressions are a powerful tool for some kinds of text manipulation, but they're too limited to do the entire job.
    So let's look at something that can give the benefits of both.


  21. PEG
    Parsing expression grammars are a relatively new kind of tool for describing parsers. Unlike the previous generation of parser generator tools, PEGs have a
    very similar model of execution to Python, while keeping the concise syntax of regular expressions.


  22. a(b|c)d+e
Here's a very simple regular expression. It matches a, then b or c, then one or more ds,
then e.
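(Editor's sketch, not from the slides: the same pattern run through Python's re module.)

```python
import re

# a, then b or c, then one or more ds, then e.
pattern = re.compile(r'a(b|c)d+e')

print(bool(pattern.match('abddde')))  # True
print(bool(pattern.match('acde')))    # True
print(bool(pattern.match('aee')))     # False -- no b or c after the a
```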


  23. 'a' ('b' | 'c') 'd'+ 'e'
Here's the PEG expression for the same thing. As you can see, the syntax is very similar. In PEG, spaces mean "and" and the pipe means "or".


  24. match('a') and
    (match('b') or
    match('c'))
    and match1OrMore('d')
    and match('e')
    So here's a pseudo-python expression for what that PEG does. ANDs and ORs short circuit, just like in Python.
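(Editor's sketch, not from the slides: a runnable version of that pseudo-code, with tiny invented matcher helpers over a string and a position pointer. This is not Parsley's API, just an illustration of the short-circuiting.)

```python
# Minimal matcher state: a string and a cursor that advances on success.
class M:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def match(self, ch):
        if self.pos < len(self.text) and self.text[self.pos] == ch:
            self.pos += 1
            return True
        return False

    def match1_or_more(self, ch):
        if not self.match(ch):
            return False
        while self.match(ch):
            pass
        return True

# The PEG 'a' ('b' | 'c') 'd'+ 'e', written as short-circuiting and/or.
def foo(m):
    return (m.match('a') and
            (m.match('b') or m.match('c')) and
            m.match1_or_more('d') and
            m.match('e'))

print(foo(M('acdde')))  # True
print(foo(M('axe')))    # False
```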


  25. foo = 'a' ('b' | 'c') 'd'+ 'e'
    Now let's talk about how PEGs do more than regexes. We're going to give this rule a name,
    foo.


  26. baz = 'b' | 'c'
    foo = 'a' baz 'd'+ 'e'
    Rules can call other rules. So here we can move part of foo into a separate rule.
    PEGs are _expressions_, so they return a value. Here foo returns the last thing it matches, so 'e'.


  27. baz = 'b' | 'c'
    foo = 'a' baz:x 'd'+ 'e' -> x
    But we can give names to those values, and change the value a rule returns. Now foo returns 'b' or 'c', depending on the input.


  28. PEG operation        Python code
     'a'                   match('a')
     foo                   foo()
     foo(x y)              foo(x, y)
     x y                   x and y
     x | y                 x or y
     ~x                    not x
     x*                    while x: ...
     x+                    x; while x: ...
     x?                    x or None
     x:name                name = x
     Each piece of PEG syntax corresponds to similar Python
     syntax.


  29. PyParsing
    LEPL
    Parsimonious
    There are already a few PEG tools in Python, like these libraries. I ended up writing my own, for a couple reasons.


  30. Parsley
Parsley is based on OMeta, a parsing technique developed in Smalltalk and JavaScript in 2007.


  31. Brevity
    Libraries like PyParsing embed their parser expressions in Python syntax, which requires a few contortions. Having a grammar syntax let me make parsing
    expressions short, like regexes.


  32. Error
    reporting
    Nearly all of the parser libraries I’ve used didn’t give you much more than “syntax error”, with no information about where or why.


  33. Speed
    And I wanted it to be fast enough to use for hard
    problems.


  34. 300
    The JSON parser in Python is 300 lines, like I said. Using Parsley we can do it in
    30.


  35. 30


  36. from parsley import makeGrammar
    jsonGrammar = r"""
    object = token('{') members:m token('}') -> dict(m)
    members = (pair:first (token(',') pair)*:rest -> [first] + rest) | -> []
    pair = string:k token(':') value:v -> (k, v)
    array = token('[') elements:xs token(']') -> xs
    elements = (value:first (token(',') value)*:rest -> [first] + rest) | -> []
    value = (string | number | object | array
            | token('true') -> True
            | token('false') -> False
            | token('null') -> None)
    string = token('"') (escapedChar | ~'"' anything)*:c '"' -> ''.join(c)
    escapedChar = '\\' (('"' -> '"') |('\\' -> '\\')
                       |('/' -> '/') |('b' -> '\b')
                       |('f' -> '\f') |('n' -> '\n')
                       |('r' -> '\r') |('t' -> '\t')
                       |('\'' -> '\'') | escapedUnicode)
    hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
    escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))
    number = spaces ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
                                                   | -> int(sign + ds)))
    digit = :x ?(x in '0123456789') -> x
    digits = <digit*>
    digit1_9 = :x ?(x in '123456789') -> x
    intPart = (digit1_9:first digits:rest -> first + rest) | digit
    floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
                          -> float(sign + ds + tail)
    exponent = ('e' | 'E') ('+' | '-')? digits
    """
    JSONParser = makeGrammar(jsonGrammar, {})
    Here are those 30 lines.
    This looks curiously similar to the business card I showed you earlier.


  37. So let’s look at a couple of these rules, just to get a flavor of how PEG parsing for a real format looks.


  38. ws = (' ' |'\t'
    |'\r'|'\n')*
    This one’s real simple. This matches zero or more whitespace
    characters.


  39. object = ws '{'
    members:m
    ws '}'
    -> dict(m)
    Here’s the rule for a JSON object. This skips whitespace if it’s present, then matches an open brace. It then calls the members rule and saves the result as m.
    Then it skips more whitespace and matches a close brace. It then calls dict() on m and returns the dict.


  40. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
Here’s that members rule the previous one called. dict() takes a list of key-value pairs, so members will return one. There are three cases we need to handle
here:


  41. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
    First we try to parse a single
    pair.


  42. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
If that fails, we return an empty list because there is not even a single key-value
pair.


  43. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
If it succeeds we move on, and try to match a comma and then another pair. We repeat this as many times as possible, due to the star. Using star or plus returns a
    list of results.


  44. members =
    (pair:first
    (ws ',' pair)*:rest
    -> [first] + rest)
    | -> []
    When we run out of pairs we return a list of all the ones we
    collected.


  45. pair = ws string:k
    ws ':'
    value:v
    -> (k, v)
The rule for a single pair matches a string, a colon, and some JSON value and returns them as a 2-tuple.


  46. value = ws (
    string
    | number
    | object
    | array
    | 'true' -> True
    | 'false' -> False
    | 'null' -> None
    )
    Here’s the rule for a JSON value; we check for any of the JSON types, one by one.
    I’m going to skip the string, number, and array rules, they’re relatively boring.


  47. JSONParser = makeGrammar(
    jsonGrammar, {})
    Here’s how we use this from Python. jsonGrammar is the string containing all these rules we just looked at. This creates a JSONParser class. Each rule is a
    method.


  48. >>> from parsley_json import JSONParser
    >>> txt = """
    ... {
    ... "name": "Sir Robin",
    ... "braveness": -1
    ... }
    ... """
    >>> JSONParser(txt).object()
    Using that parser is pretty easy. We create an instance of the parser and call the object
    rule.


  49. {'braveness': -1,'name': 'Sir Robin'}
    Ta da, we get a
    dict.


  50. >>> from parsley_json import JSONParser
    >>> txt = """
    ... {
    ... name: "Sir Robin",
    ... braveness: -1
    ... }
    ... """
    >>> JSONParser(txt).object()
    So let’s see what happens when we invoke this on our non-JSON
    example.


  51. ometa.runtime.ParseError:
    name: "Sir Robin",
    ^
    Parse error at line 3,
    column 2: expected one of
    token '"', or token '}'.
    trail: [string pair]
    We get an exception! And it shows us where the parse failed, what it expected, and what rules it called trying to parse it.


  52. nqjGrammar = """
jsname = ws <(letter | digit
    | '$' | '_')*>
    pair = (jsname | string):k
    ws ':'
    value:v -> (k, v)
    """
    So now let’s write a jsname rule that matches a javascript identifier. We then write a rule that’s like our pair rule we looked at before, but it matches either a
    string or a javascript name.


  53. NQJParser = makeGrammar(
    nqjGrammar, {},
    extends=JSONParser)
    We then _extend_ the existing parser class, creating a Not-Quite-Javascript parser that subclasses JSONParser.


  54. >>> p = NQJParser(txt)
    >>> print p.object()
    So using our new
    parser...


  55. >>> p = NQJParser(txt)
    >>> print p.object()
    {'braveness': -1,
    'name': 'Sir Robin'}
    ... we handle our custom format just
    fine.


  56. http://isometric.sixsided.org/strips/semicolon_of_death/


  57. [email protected]
    ?
    http://github.com/washort/parsley


  58. foo = 'a' ('b' | 'c') 'd'+ 'e'


  59. Rule('foo',
    And(Exactly('a'),
    Or(Exactly('b'),
    Exactly('c')),
    Many1(Exactly('d')),
    Exactly('e')))


  60. def rule_foo(self):
        self.exactly('a')
        def _or_1():
            return self.exactly('b')
        def _or_2():
            return self.exactly('c')
        self._or(_or_1, _or_2)
        def _many1_3():
            return self.exactly('d')
        self.many1(_many1_3)
        return self.exactly('e')


  61. interp =
    ['+' interp:x interp:y] -> x + y
    | ['*' interp:x interp:y] -> x * y
    | :x -> int(x)


  62. interp =
    ['+' interp:x interp:y] -> x + y
    | ['*' interp:x interp:y] -> x * y
    | :x -> int(x)
    [['+', '3', ['*', '5', '2']]]


  63. interp =
    ['+' interp:x interp:y] -> x + y
    | ['*' interp:x interp:y] -> x * y
    | :x -> int(x)
    [['+', '3', ['*', '5', '2']]]
    13
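(Editor's sketch, not from the slides: the same tree interpretation written as plain Python, for comparison with the three-line Parsley rule above.)

```python
# Walk the nested-list tree: ['+', x, y] adds, ['*', x, y] multiplies,
# and a bare string is an integer literal.
def interp(node):
    if isinstance(node, list):
        op, x, y = node
        if op == '+':
            return interp(x) + interp(y)
        if op == '*':
            return interp(x) * interp(y)
    return int(node)

print(interp(['+', '3', ['*', '5', '2']]))  # 13
```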
