Worry Free Parsers with Parsley

washort
March 17, 2013

Writing effective parsers for structured formats has usually required code that is either tedious or inscrutable. This talk introduces parsing expression grammars (PEGs), as implemented by the Parsley library, which provide a concise and understandable way to write parsers for complex languages.

Transcript

  1. 3

  2. "HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty."
  3. ocess HTML establishes a breach between this world an he

    dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but ore corrupt) a mere glimpse of the world of reg​ex parser for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he omes, the pestilent slithy regex-infection wil​l devour you HT​ML parser, application and existence for all time like sual Basic only worse he comes he comes do not fi​ght h com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enlı̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the ong of re̸gular exp​ression parsing will exti​nguish the voice mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
  4. only worse he comes do not fi​ght he s, ̕h̵i​s

    un̨ho͞ly radiańcé destro҉ying all enment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟u k͏e liq​uid pain, the song of re̸gular exp n parsing will exti​nguish the voices of m n from the sp​here I can see it can you t is beautiful t​he final snuffing of the lie LL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he he c̶̮omes he comes the ich​or perme FACE MY FACE ᵒh god no NO NOO̼O p the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
  5. e sp​here I can see it can e ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it

    is beautiful t​he final uffing of the lie​s of Man A LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the comes he c̶̮omes he com e ich​or permeates all MY ACE MY FACE ᵒh god no N OO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe
  6. 300

  7. ?

  8. re

  9. PEG

  10. PEG Operation    Python code

      'a'              match('a')
      foo              foo()
      foo(x y)         foo(x, y)
      x y              x and y
      x | y            x or y
      ~x               not x
      x*               while x: ...
      x+               x; while x: ...
      x?               x or None
      x:name           name = x
  11. 300

  12. 30

  13. from parsley import makeGrammar

      jsonGrammar = r"""
      object = token('{') members:m token('}') -> dict(m)
      members = (pair:first (token(',') pair)*:rest -> [first] + rest) | -> []
      pair = string:k token(':') value:v -> (k, v)
      array = token('[') elements:xs token(']') -> xs
      elements = (value:first (token(',') value)*:rest -> [first] + rest) | -> []
      value = (string | number | object | array
             | token('true') -> True
             | token('false') -> False
             | token('null') -> None)
      string = token('"') (escapedChar | ~'"' anything)*:c '"' -> ''.join(c)
      escapedChar = '\\' (('"' -> '"')   |('\\' -> '\\')
                         |('/' -> '/')   |('b' -> '\b')
                         |('f' -> '\f')  |('n' -> '\n')
                         |('r' -> '\r')  |('t' -> '\t')
                         |('\'' -> '\'') | escapedUnicode)
      hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
      escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))
      number = spaces ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
                                                     | -> int(sign + ds)))
      digit = :x ?(x in '0123456789') -> x
      digits = <digit*>
      digit1_9 = :x ?(x in '123456789') -> x
      intPart = (digit1_9:first digits:rest -> first + rest) | digit
      floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
                          -> float(sign + ds + tail)
      exponent = ('e' | 'E') ('+' | '-')? digits
      """

      JSONParser = makeGrammar(jsonGrammar, {})
  14. value = ws ( string | number | object | array
                 | 'true'  -> True
                 | 'false' -> False
                 | 'null'  -> None )
  15. >>> from parsley_json import JSONParser
      >>> txt = """
      ... {
      ...   "name": "Sir Robin",
      ...   "braveness": -1
      ... }
      ... """
      >>> JSONParser(txt).object()
  16. >>> from parsley_json import JSONParser
      >>> txt = """
      ... {
      ...   name: "Sir Robin",
      ...   braveness: -1
      ... }
      ... """
      >>> JSONParser(txt).object()
  17. ometa.runtime.ParseError:
        name: "Sir Robin",
        ^
      Parse error at line 3, column 2: expected one of token '"', or token '}'. trail: [string pair]
  18. nqjGrammar = """
      jsname = ws <letter (letter | digit | '$' | '_')*>
      pair = (jsname | string):k ws ':' value:v -> (k, v)
      """
  19. def rule_foo(self):
          self.exactly('a')
          def _or_1():
              return self.exactly('b')
          def _or_2():
              return self.exactly('c')
          self._or(_or_1, _or_2)
          def _many1_3():
              return self.exactly('d')
          self.many1(_many1_3)
          return self.exactly('e')
  20. interp = ['+' interp:x interp:y] -> x + y
             | ['*' interp:x interp:y] -> x * y
             | <digit+>:x -> int(x)
  21. interp = ['+' interp:x interp:y] -> x + y
             | ['*' interp:x interp:y] -> x * y
             | <digit+>:x -> int(x)

      [['+', '3', ['*', '5', '2']]]
  22. interp = ['+' interp:x interp:y] -> x + y
             | ['*' interp:x interp:y] -> x * y
             | <digit+>:x -> int(x)

      [['+', '3', ['*', '5', '2']]]

      13
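
The `interp` rule on the closing slides matches nested lists rather than characters, so it can be read as a small recursive evaluator. The block below is a minimal plain-Python sketch of that same logic, under some assumptions: it is not Parsley's generated code, the name `interp_tree` is hypothetical, and it takes the single tree directly rather than the one-element input list shown on the slide.

```python
def interp_tree(node):
    """Evaluate a prefix-notation tree such as ['+', '3', ['*', '5', '2']].

    Each branch mirrors one alternative of the `interp` rule:
      ['+' interp:x interp:y] -> x + y
      ['*' interp:x interp:y] -> x * y
      <digit+>:x              -> int(x)
    """
    if isinstance(node, list):
        # The list patterns in the rule have exactly three elements.
        if len(node) != 3:
            raise ValueError("expected [op, left, right], got %r" % (node,))
        op, left, right = node
        if op not in ('+', '*'):
            raise ValueError("unknown operator %r" % (op,))
        x = interp_tree(left)
        y = interp_tree(right)
        return x + y if op == '+' else x * y

    # <digit+>: one or more ASCII digits, converted with int().
    if isinstance(node, str) and node and all(c in '0123456789' for c in node):
        return int(node)

    raise ValueError("cannot interpret %r" % (node,))


result = interp_tree(['+', '3', ['*', '5', '2']])
print(result)  # 13, matching the final slide: 3 + (5 * 2)
```

Unlike the grammar, this version dispatches with explicit `if` statements instead of ordered choice, and it raises `ValueError` where the grammar would report a parse error.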