Packrat parsing in python

June 10, 2011

Parsing PEG grammars in python



  1. Introduction to packrat parsing for PEGs (Parsing Expression Grammars) - Gavin

  2. Roadmap

    * Motivation
    * PEG theory
    * pyparsing
    * PyMeta
    * PyPy rlib/parsing
    * Closing
  3. Motivation

    How to parse texts with PEGs:
    * Natural languages (NLTK)
    * Mini languages (DSLs)
    * Structured / unstructured file formats

    4 thoughts:
    i. Aren't structured formats like JSON, XML and HTML well served by existing parsers?
    ii. Parsing log files & configuration files is easy with python.
    iii. Regular expressions are good enough.
    iv. What is wrong with the classical way of writing parsers?
  4. CFG (Context Free Grammars)

    In formal language theory, CFGs are suitable for modeling both natural & computer languages.

    BNF is the de facto notation for describing the syntax of CFGs. Original BNF only supported recursion:

        S → a S
        S → Ɛ

    EBNF supports sequence, decision (choice), repetition and recursion:

        if_stmt ::= "if" expression ":" suite
                    ( "elif" expression ":" suite )*
                    [ "else" ":" suite ]
  5. CFG & Ambiguity

    CFG grammars are potentially ambiguous. Dangling else problem:

        if( x > 5 )
          if( y > 5 )
            console.log("heaven");
        else console.log("limbo");

    [Figure: AST #1, one of the two possible parse trees for this snippet]
  6. CFG & Ambiguity (2)

    [Figure: AST #2, the other possible parse tree for the dangling else snippet]
  7. Definitions

    Parse trees vs AST
    * Parse tree = concrete: keeps whitespace, braces, semicolons; nodes are nonterminals from the grammar.
    * AST = abstract: uses tree nodes specific to language constructs.

    Top-down vs bottom-up
    * Top-down = begin with the start nonterminal, work down the parse tree.
    * Bottom-up = identify terminals, infer nonterminals, climb the parse tree.
  8. Definitions (2)

    Recursive descent parsing
    * A top-down parser constructed from recursive functions.
    * Each function represents a rule in the grammar.

        version ::= <digit> '.' <digit>
        digit   ::= '0' | '1' ... | '9'

        def version(source, position=0):
            digit(source, position)
            period(source, position + 1)
            digit(source, position + 2)

    Run: (pymeta) nose --nocapture -v test_rdp_list.py
  9. Recursive Descent Parsing

        import string

        # this_rule() and ParseError are helpers that are not shown on the slide.
        def digit(source, position):
            fn = (lambda t: t in string.digits, this_rule())
            expect(source, position, fn)

        def period(source, position):
            fn = (lambda t: t == '.', this_rule())
            expect(source, position, fn)

        def expect(source, position, comparator):
            try:
                expecting, msg = comparator
                if not expecting(source[0]):
                    raise ParseError(position, msg)
                source.popleft()  # consume the matched character
            except IndexError:
                raise EOFError(position)
  10. Recursive Descent Parsing (2)

        >>> import collections
        >>> version(collections.deque('1.6'))
        >>> version(collections.deque('1,6'))
        ParseError: (1, 'expected <period>')
        >>> version(collections.deque('1.'))
        EOFError: (2, [('message', 'end of input')])
        >>> version(collections.deque('A.6'))
        ParseError: (0, 'expected <digit>')
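    The recursive descent idea from the last three slides can be sketched as one self-contained Python 3 module. This is only an illustration under assumptions: the ParseError class, the rule_name argument (standing in for the slides' this_rule() helper) and the tuple returned by version() are hypothetical and are not taken from the talk's test_rdp_list.py.

```python
import string
from collections import deque


class ParseError(Exception):
    """Hypothetical error type carrying (position, message)."""


def expect(source, position, predicate, rule_name):
    # Consume one character from the front of the deque if it satisfies
    # the predicate; otherwise report which rule failed and where.
    if not source:
        raise EOFError(position, "end of input")
    if not predicate(source[0]):
        raise ParseError(position, "expected <%s>" % rule_name)
    return source.popleft()


def digit(source, position):
    # digit ::= '0' | '1' ... | '9'
    return expect(source, position, lambda t: t in string.digits, "digit")


def period(source, position):
    return expect(source, position, lambda t: t == ".", "period")


def version(source, position=0):
    # version ::= <digit> '.' <digit>
    major = digit(source, position)
    period(source, position + 1)
    minor = digit(source, position + 2)
    return int(major), int(minor)
```

    Each grammar rule maps to exactly one function, so a failure can name the rule and the input position where the parse stopped.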
  11. Classical method of parsing

    Specific to LALR(1) bottom-up parsers.

    1. Flesh out a grammar in BNF.
    2. Lexical analysis phase: lexer(patterns, stream-of-characters) => stream of tokens
    3. Parsing phase: parser(grammar, stream-of-tokens) => parse tree / AST
    4. Use your parser.

    Photo attribution: http://www.flickr.com/photos/j_aroche/2160902499/
  12. Spectrum of parsing solutions

    * Regex
    * Lex / Yacc parser generators (GNU flex/bison)
    * PEG parsers
    * Handwritten recursive descent parsers
    * ANTLR
  13. PEG

    * Scanner-less.
    * Formalized by Bryan Ford in 2002-2004.
    * Grammar mimics a recursive descent parser (+ backtracking).
    * PEG != EBNF

    A PEG grammar consists of a set of parsing expressions of the form A → e. One expression is denoted the starting expression.

        e1 e2       Sequence
        e1 / e2     Ordered choice
        e+ e? e*    Repetition
        &e !e       Predicates
  14. PEG's ordered choice

        S → "Hitch" / "Hitchens"

    Q. Given an input string of "Hitchens", what is the result of the parse?

    Law #1: Given an input of A, the parsing expression matches a prefix A' of A or fails.
    Law #2: A rule S → M / N will first try to match M. If that fails, backtrack & look for N.

    Answer: Hitch
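    The two laws can be sketched with a pair of tiny parser combinators. The names literal and ordered_choice are hypothetical helpers written for this illustration, not part of pyparsing or PyMeta.

```python
def literal(text):
    # Law #1: match a prefix of the input starting at pos, or fail (None).
    def parse(source, pos=0):
        return text if source.startswith(text, pos) else None
    return parse


def ordered_choice(*alternatives):
    # Law #2: try the alternatives left to right; the first success wins
    # and the remaining alternatives are never consulted.
    def parse(source, pos=0):
        for alternative in alternatives:
            result = alternative(source, pos)
            if result is not None:
                return result
        return None
    return parse


s = ordered_choice(literal("Hitch"), literal("Hitchens"))
reordered = ordered_choice(literal("Hitchens"), literal("Hitch"))

print(s("Hitchens"))          # Hitch
print(reordered("Hitchens"))  # Hitchens
```

    Because the shorter literal shadows the longer one, putting the longer alternative first is the usual way to make both reachable.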
  15. PEG vs CFG

                                            PEG                CFG
        Handles ambiguous grammars          No                 Yes
        Syntax definition philosophy        Analytical         Generative
        Requires a lexical analysis phase?  No                 Yes (lex/yacc)
        Choice alternation                  Ordered (e1/e2)    Commutative
        Left recursion                      No *               Yes

    * Warth et al.: Packrat parsers can support left recursion (2008)
  16. PEG & Packrat parsing

    Context: recursive descent parsing with backtracking.
    Problem: an input substring might be re-parsed during backtracking.

        grammar ::= AB | AC

    Solution: memoization guarantees linear time performance.

    Photo: Neotoma cinerea. Attribution: http://en.wikipedia.org/wiki/File:Neotoma_cinerea.jpg
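    A minimal sketch of the problem and the fix, assuming a toy grammar where A ::= 'a'+ and the two alternatives are A 'b' and A 'c'. The make_parser factory and the call counter are hypothetical; functools.lru_cache stands in for a packrat memo table keyed on (rule, position).

```python
from functools import lru_cache


def make_parser(source, memoize):
    calls = {"A": 0}

    def parse_a(pos):
        # A ::= 'a'+ ; returns the position after the match, or None.
        calls["A"] += 1
        end = pos
        while end < len(source) and source[end] == "a":
            end += 1
        return end if end > pos else None

    if memoize:
        # Packrat step: remember the result of rule A at each position.
        parse_a = lru_cache(maxsize=None)(parse_a)

    def literal(char, pos):
        if pos is not None and pos < len(source) and source[pos] == char:
            return pos + 1
        return None

    def parse_grammar(pos=0):
        # grammar ::= A 'b' / A 'c'
        end = literal("b", parse_a(pos))
        if end is None:
            # Backtrack to pos; without memoization A is parsed again.
            end = literal("c", parse_a(pos))
        return end

    return parse_grammar, calls


plain, plain_calls = make_parser("aaac", memoize=False)
packrat, packrat_calls = make_parser("aaac", memoize=True)
plain_end, packrat_end = plain(), packrat()
print(plain_calls["A"], packrat_calls["A"])  # 2 1
```

    Both parsers accept the same input; the memoized one simply evaluates rule A once per position, which is where the linear-time bound comes from.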
  17. case study #1 : problem statement

    Parse modern Japanese dates in various formats. If the date parses successfully, convert it to its equivalent datetime.date instance.
  18. case study #1 : The four eras

        Era              Period                       Emperor
        HEISEI (平成)    1989 Jan 8 – present         Akihito
        SHOWA (昭和)     1926 Dec 25 – 1989 Jan 7     Hirohito
        TAISHOU (大正)   1912 Jul 30 – 1926 Dec 24    Yoshihito
        MEIJI (明治)     1868 Sep 8 – 1912 Jul 29     Mutsuhito
  19. case study #1 : liberties taken

    1. No support for days-of-the-week tagged onto the end.
    2. Numbers use western digits, not kanji.
    3. Some eras have overlapping days. Ignore.
    4. For the 1st year of an era, no support for gannen.
  20. case study #1 : initial attempt

        from pyparsing import Literal, Word, nums

        year = Literal( u'\u5e74' )
        month = Literal( u'\u6708' )
        day = Literal( u'\u65e5' )
        heisei_era = Literal( u'\u5e73\u6210' )
        integer = Word(nums)    # slide callout: Word(nums, exact=2)
  21. case study #1 : initial attempt (2)

        western_year = integer('yyyy') + year
        imperial_year = heisei_era + western_year
        day_spec = integer.setResultsName('dd') + day
        month_spec = integer('mm') + month
        year_spec = (imperial_year('imperial') | western_year('western'))
        grammar = year_spec + month_spec + day_spec
  22. case study #1 : initial attempt (3)

        result = grammar.parseString(japanese_date)
        print(result.dump())
  23. pyparsing : introduction

    * Easy to use PEG-based text parser.
    * Grammar definitions in python.
    * Framework distributed as one file, pyparsing.py.
    * Runs on both python 2.x & 3.x. Future releases after 1.5.x will be focused on python 3.x only.
    * Not classified as recursive descent!
  24. pyparsing & PEGs : correlation

        PEG         pyparsing
        e1 e2       e1 + e2 == And( [e1, e2] )
        e1 / e2     e1 | e2 == MatchFirst( [e1, e2] )
        e*          ZeroOrMore( e )
        e+          OneOrMore( e )
        e?          Optional( e )
        &e          FollowedBy( e )
        !e          ~e == NotAny( e )
  25. pyparsing : ordered choice

    * MatchFirst will short circuit as soon as a match is found. Not commutative.
    * Avoid alternatives that shadow one another, i.e. literals in which one is a substring of the other.
    * Keywords are different.
  26. pyparsing : backtracking

    * Or forces the parser to make an exhaustive search of the alternatives (match longest).
    * Or might introduce ambiguities. No better than non-PEG parsers.
    * Tweak the order of alternatives & put the most probable (e.g. by frequency of occurrence) first. This avoids wasteful backtracking.
  27. pyparsing : backtracking

    Ballon d'Or 2011 example:

        p1, p2, p3, p4, p5 = map(Literal, ['ronaldo', 'messi', 'park-ji-sung', 'xavi', 'iniesta'])
        first = p2 + p1 + p4
        second = p2 + p1 + p5
        third = p2 + p1 + p3
        grammar = first | second | third
        print(grammar.parseString("messi ronaldo park-ji-sung"))
  28. pyparsing : left factored

        p1, p2, p3, p4, p5 = map(Literal, ['ronaldo', 'messi', 'park-ji-sung', 'xavi', 'iniesta'])
        absolute_certainty = p2 + p1
        too_close_to_call = p4 | p5 | p3
        grammar = absolute_certainty + too_close_to_call
        print(grammar.parseString("messi ronaldo park-ji-sung"))
  29. pyparsing : packrat

    Memoization must be manually turned on:

        ParserElement.enablePackrat()

    Caches:
    a. ParseResults
    b. Exceptions thrown

    Run: python select_parser.py

    Caveat emptor: a grammar whose parse actions have side effects does not always play well with memoization turned on.
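    The caveat can be illustrated without pyparsing. In this sketch functools.lru_cache plays the role of the packrat cache, and record is a hypothetical parse action with a side effect; it is not how pyparsing implements its cache, only a small model of the same interaction.

```python
from functools import lru_cache

log = []


def record(token):
    # Hypothetical parse action with a side effect: it appends to a log.
    log.append(token)
    return token.upper()


@lru_cache(maxsize=None)
def match_word(source, pos):
    # Match a run of letters at pos and fire the action on success.
    end = pos
    while end < len(source) and source[end].isalpha():
        end += 1
    if end == pos:
        return None
    return record(source[pos:end]), end


first = match_word("messi ronaldo", 0)
second = match_word("messi ronaldo", 0)  # served from the cache
print(log)  # ['messi']
```

    The second attempt returns the cached result, so the action's side effect happens once even though the rule was requested twice. Code that relies on the action firing on every attempt will behave differently once memoization is on.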
  30. pyparsing : semantic actions

    In pyparsing parlance, a ParserElement can have zero or more parse actions.

    4 forms of parse actions:
        fn(s,loc,toks)
        fn(loc,toks)
        fn(toks)
        fn()

    Usage:
        ParserElement.setParseAction( *fn )
        ParserElement.addParseAction( *fn )

    Uses:
    1. Perform validation (see ParseException).
    2. Process the matched token(s) & modify them. Returning a value overwrites the matched token(s).
    3. Annotate with custom types (corollary of #2).
  31. case study #1 : Semantic action

        integer = Word(nums).setParseAction(lambda t: int(t[0]))

    All users of the integer expression will inherit the parse action.

        def range_check(toks):
            month = int(toks[0])
            if month <= 0 or month >= 13:
                raise ParseException('month must be in range 1..12')

        month_spec = integer('m').addParseAction(range_check) + month

    Selective assignment of parse actions to copies:

        integer.copy().addParseAction( .. )
        integer( 'result_name' ).addParseAction( .. )

    Show: japan_simple.py
  32. case study #1 : complete solution

    Show: japan_dates.py
    Demo:

        @traceParseAction
        def convert_kanji_year(toks):
            if 'imperial' in toks.keys():
                # imperial year: offset the era year by the era's year zero
                year = toks.imperial.yearZero + toks.imperial.yy
                toks['era'] = toks.imperial.type_
                toks['yyyy'] = year
            elif 'western' in toks.keys():
                year = toks.yyyy
            try:
                toks['modernDate'] = date(year, toks.mm, toks.dd)
            except ValueError as error:
                # an invalid calendar date becomes a parse failure
                raise ParseException(error.args[0])
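    The conversion step (era year plus year zero, then validation by datetime.date) can be tried on its own with the standard library. This sketch swaps pyparsing for a regular expression purely to stay self-contained; ERA_YEAR_ZERO, DATE_RE and to_date are hypothetical names, and only the Heisei era is handled.

```python
import re
from datetime import date

HEISEI, YEAR, MONTH, DAY = u"\u5e73\u6210", u"\u5e74", u"\u6708", u"\u65e5"

# Hypothetical table: era name -> year zero (Heisei 1 is 1989).
ERA_YEAR_ZERO = {HEISEI: 1988}

# Optional era prefix, then year/month/day with western digits only.
DATE_RE = re.compile(
    u"^(?P<era>%s)?(?P<yy>[0-9]+)%s(?P<mm>[0-9]+)%s(?P<dd>[0-9]+)%s$"
    % (HEISEI, YEAR, MONTH, DAY))


def to_date(text):
    match = DATE_RE.match(text)
    if match is None:
        raise ValueError("not a recognised date: %r" % text)
    year = int(match.group("yy"))
    if match.group("era"):
        year += ERA_YEAR_ZERO[match.group("era")]
    # datetime.date raises ValueError for impossible dates (e.g. month 13).
    return date(year, int(match.group("mm")), int(match.group("dd")))


imperial = to_date(HEISEI + u"23" + YEAR + u"6" + MONTH + u"10" + DAY)
western = to_date(u"2011" + YEAR + u"6" + MONTH + u"10" + DAY)
print(imperial == western)  # True
```

    Letting datetime.date reject impossible dates mirrors the try/except in the parse action above, which turns the ValueError into a parse failure.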
  33. case study #2 : problem statement

    Parse Gmail search criteria. Supports a tiny subset of the full grammar:

        from : ( <sender> )
        label : inbox
        -label : sent
        yyyyy
        -yyyyy
        "zzzzz"
        -"zzzzz"
  34. case study #2: example strings

        label : sarawak
        -label : not-urgent
        from : ( bruno manser )
        from : ( bruno.manser@swiss.org )
        from : ( @swiss.org )
        "penan injustice"
        -logging
  35. case study #2: email addresses

        emailfull = Regex(r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")
        emailpartial = Regex(r"@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")
        email = (emailpartial | emailfull)

        squeeze = lambda t: ' '.join( t[0].split() )
        name = ZeroOrMore(Word(alphanums + ' ')).setParseAction( squeeze )
  36. case study #2: email addresses

        opener, closer, colon = map(Suppress, '():')
        enclosed = email | name
        nested = opener + enclosed + closer
        grammar_email = Combine(Suppress('from') + colon + nested)
  37. case study #2: email addresses

        result = grammar_email.parseString('from:(bruno.manser.25@borneo.org)')
        print(result.dump())

        result = grammar.parseString('from:( Marco de Gasperi )')
        print(result.dump())

    Run: nosetests -v testFromTo.py
  38. case study #2: labels

    GOAL: group the excluded and included labels into their own sub-lists.
    E.g. label : fukushima1 -label : aloo-gobi

        hyphen = Suppress('-')
        # slide callout for combine=True: Combine( expr + ZeroOrMore( delim + expr ) )
        label_rhs = delimitedList(Word(alphanums), delim='-', combine=True)
        label_include = Combine( Suppress('label') + colon + label_rhs )
        label_exclude = Combine( hyphen + label_include )
        label_all = MatchFirst([
            label_exclude.setResultsName('labels.exclude', listAllMatches=True),
            label_include('labels.include*')])
        grammar_label = ZeroOrMore( label_all )

    pyparsing 1.5.6
  39. case study #2: labels

        result = grammar_label.parseString('-label:fukushima1 label:onagawa -label:aloo-gobi label:cheese-naan')
        print(result.dump())

    Question. Will this grammar work if the user entered LABEL instead of label?
    Answer. Use CaselessLiteral('label').
  40. case study #2: search strings

    GOAL: group the excluded and included search strings into their own sub-lists.
    E.g. rumi -"jack kerouac"

        key_single = Word(alphanums)
        key_quoted = quotedString.setParseAction(removeQuotes)
        key_included = key_quoted | key_single
        key_excluded = Combine(hyphen + key_included)
        key_all = MatchFirst(
            [key_excluded("key.exclude*"), key_included("key.include*")] )
        grammar_key = ZeroOrMore( key_all )
  41. case study #2: search strings

        result = grammar_key.parseString(' -osama obama -"bin laden" "white house" ')
        print(result.dump())

    Question. If the user entered single instead of double quotes, will it conform to the grammar?
    Answer. Yes
  42. case study #2: Final solution

    Let's compose all the individual pieces together.

        email_all = grammar_email('from*')
        gmail = (ZeroOrMore(email_all | label_all | key_all) + Suppress(restOfLine))

        result = gmail.parseString('love label:writing-tips "bird by bird" from:(Anne Lamott) -"dalai lama" -label:macchu-pichu from:(agnes.obel@sparrow.net) -label:french-guiana -"epictetus" label:yoga "bugle podcast" label from:(@microsoft.com)')
        print(result.dump())

    Slide callout: nested = opener + Group(enclosed) + closer
  43. case study #2: Final solution

        ['love', 'writing-tips', 'bird-by-bird', 'Anne Lamott', 'dalai lama', 'macchu-pichu', '@microsoft.com', 'agnes.obel@sparrow.net', 'french-guiana', 'epictetus', 'yoga', 'bugle podcast', 'label']
        - from: ['Anne Lamott','@microsoft.com', 'agnes.obel@sparrow.net']
        - key.exclude: ['dalai lama','epictetus']
        - key.include: ['love', 'bird by bird', 'bugle podcast', 'label']
        - labels.exclude: ['macchu-pichu', 'french-guiana']
        - labels.include: ['writing-tips','yoga']
  44. pyparsing: Recursion

    A grammar is recursive when there exists a nonterminal which has itself in the right-hand side of the production rule.

        number ::= digit rest
        rest   ::= digit rest | empty

        digit = Word(nums, exact=1).setName('1-digit')
        rest = Forward()
        rest << Optional(digit + rest)
        number = Combine(digit + rest, adjacent=False)('digit-list')
        grammar = number.setParseAction(lambda t: int(t[0])) + Suppress(restOfLine)
  45. case study #3: binary tree

    Parse parentheses notation for binary trees. Convert it to list notation in python.

        (nil,4,nil)
        ((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))

    [Figure: binary tree with nodes 2, 3, 4, 5, 6, 7]
  46. case study #3: recursive solution

    BNF

        node ::= '(' node ',' number ',' node ')' | empty

    Code

        left, right, comma = map(Suppress, '(),')
        empty = CaselessLiteral('nil').setParseAction(replaceWith(None))
        tree = Forward()
        value = Word(nums).setParseAction(lambda t: int(t[0]))
        # bookend() is a helper that is not shown on the slide.
        tree << ((left + Group(tree) + bookend(value) + Group(tree) + right) | empty)
  47. case study #3: recursive solution

    Input :

        "((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))"

    Output :

        [[[None],2,[[None],3,[None]]],4,[[[None],5,[[None],6,[None]]],7,[None]]]

    How to fix it : re-implement Group (as used in Group(tree)) in a TreeGroup class.

        class TreeGroup(TokenConverter):
            def postParse(self, instring, loc, tokenlist):
                if len(tokenlist) == 1 and tokenlist[0] is None:
                    return tokenlist
                else:
                    return [tokenlist]
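    For comparison, the same notation can be handled by a hand-written recursive descent parser using only the standard library. This is a sketch under assumptions: parse_tree is a hypothetical function, it accepts no whitespace, and it represents nil as None directly rather than going through pyparsing's Group.

```python
def parse_tree(text):
    """tree ::= '(' tree ',' number ',' tree ')' | 'nil'"""
    pos = 0

    def expect(char):
        nonlocal pos
        if pos >= len(text) or text[pos] != char:
            raise ValueError("expected %r at position %d" % (char, pos))
        pos += 1

    def number():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if start == pos:
            raise ValueError("expected a number at position %d" % pos)
        return int(text[start:pos])

    def tree():
        nonlocal pos
        if text.startswith("nil", pos):
            pos += 3
            return None
        expect("(")
        left = tree()
        expect(",")
        value = number()
        expect(",")
        right = tree()
        expect(")")
        return [left, value, right]

    result = tree()
    if pos != len(text):
        raise ValueError("unexpected trailing input at position %d" % pos)
    return result


print(parse_tree("(nil,4,nil)"))  # [None, 4, None]
```

    The recursive call inside tree() plays the role that Forward() plays in the pyparsing version.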
  48. pyparsing : left recursion

    pyparsing does not support left recursion.

        term ::= \d+
        expr ::= expr + term | term

        @raises(RecursiveGrammarException)
        def test_left_recursion(self):
            expr.validate()

    pyparsing will raise a RuntimeError with message 'maximum recursion depth exceeded'.
    Eliminate left recursion if you want it to work in pyparsing.
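    One common way to eliminate the left recursion is to rewrite expr ::= expr '+' term | term as expr ::= term ('+' term)* and fold the results to the left. The sketch below shows that rewrite in plain Python; tokenize and parse_expr are hypothetical helpers, not pyparsing API.

```python
import re

TOKEN = re.compile(r"\s*(\d+|\+)")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens):
    # expr ::= term ('+' term)*   (iteration replaces the left recursion)
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise ValueError("expected a number at token %d" % pos)
        value = int(tokens[pos])
        pos += 1
        return value

    tree = term()
    while pos < len(tokens) and tokens[pos] == "+":
        pos += 1
        # Fold to the left so '+' stays left-associative, as in the
        # original left-recursive rule.
        tree = ("+", tree, term())
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos])
    return tree


print(parse_expr(tokenize("1 + 2 + 3")))  # ('+', ('+', 1, 2), 3)
```

    The loop consumes input on every iteration, so the parser never re-enters expr at the same position, which is what caused the unbounded recursion.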
  49. PyMeta : introduction

    OMeta is a language prototyping system (PEG). Implemented in several programming languages.

    * Packrat memoization
    * Grammar: BNF dialect (with host language snippets)
    * Object-oriented: inheritance, overriding rules
    * <anything> consumes one object from the input stream (cf. regex)
    * Built-in rules: <letter> <digit> <letterOrDigit> <token '?'>

    A rule such as

        lowercase ::= <char_range 'a' 'z'>

    corresponds to a method:

        def rule_lowercase(): // ..body..
  50. PEGs & PyMeta

        PEG         PyMeta
        e1 e2       e1 e2
        e1 / e2     e1 | e2
        e*          e*
        e+          e+
        e?          e?
        &e          ~~e
        !e          ~e

    &e and !e are syntactic predicates (unlimited lookahead).
  51. case study #1 : in PyMeta

    Modest goals:
    a) recognize western and Heisei imperial dates
    b) read & parse both imperial.utf8 & western.utf8

    Separate files:
    * common.py : Common rules & utilities
    * western_dates.py : Grammar to recognize western dates
    * era_heisei.py : Grammar to recognize heisei dates
    * japan_date_parser.py : Final grammar
  52. case study #1 : in PyMeta pt A

    common.py

        from pymeta.grammar import OMeta

        baseGrammar = r"""
        # common literals for all ERAs
        year ::= <token u'\u5E74'>
        month ::= <token u'\u6708'>
        day ::= <token u'\u65E5'>

        range_num :min :max ::= <digit>+:m ?((int(join(m)) >= min) & (int(join(m)) <= max)) => m
        rest_of_line ::= <anything>* <token '\n'>? => None
        empty_line ::= <spaces> <rest_of_line> => None
        python_comment ::= <token '#'> <rest_of_line> => None
        """

        JapanCommonParser = OMeta.makeGrammar(baseGrammar, globals(), "JapanCommonParser")

        def join(x):
            return ''.join(x)
  53. case study #1 : in PyMeta pt B

    western_dates.py

        westernGrammar = r"""
        western ::= <spaces> <digit>+:y <year> <range_num 1 12>:m <month>
                    <range_num 1 31>:d <day> <rest_of_line>
                    => westernized(int(join(y)), int(join(m)), int(join(d)))
        grammar ::= <python_comment> | <western>"""

        def westernized(yyyy, mm, dd):
            retval = JapanDate()
            retval['western'] = date(yyyy, mm, dd)
            return retval

        WesternParser = JapanCommonParser.makeGrammar(
            westernGrammar, globals(), 'WesternParser')
  54. case study #1 : in PyMeta pt C

    era_heisei.py

        era_heisei = Era('Heisei', 'Akihito', (u'\u5E73\u6210', u'\u337B'),
                         startDate=date(1989, 1, 8))

        def heisei_year_ok(yy):
            return (yy >= 1 and yy <= era_heisei.maxYearUnit)

        def collect(yy, mm, dd):
            retval = JapanDate()
            retval['imperial'] = date(era_heisei.yearZero + yy, mm, dd)
            retval['era'] = [era_heisei.name, yy]
            return retval
  55. case study #1 : in PyMeta pt C (2)

    era_heisei.py (continued)

        heiseiGrammar = r"""
        hlong ::= <token u'\u5e73\u6210'>
        hshort ::= <token u'\u337b'>
        heisei ::= (<hlong> | <hshort>) <digit>+:y ?(heisei_year_ok(int(join(y))))
                   <year> <range_num 1 12>:m <month> <range_num 1 31>:d <day> <rest_of_line>
                   => collect(int(join(y)), int(join(m)), int(join(d)))
        """

        HeiseiParser = JapanCommonParser.makeGrammar(heiseiGrammar, globals(), 'HeiseiParser')
  56. case study #1 : in PyMeta pt D

    japan_date_parser.py

        finalGrammar = r"""
        # override 'grammar' in WesternParser
        grammar ::= <super> | <heisei> | <empty_line>"""

        class BaseParser(HeiseiParser, WesternParser):
            pass

        BaseParser.globals.update(WesternParser.globals)
        BaseParser.globals.update(HeiseiParser.globals)
        JapanDateParser = BaseParser.makeGrammar(
            finalGrammar, globals(), "JapanDateParser")
  57. case study #1 : in PyMeta pt D (2)

    japan_date_parser.py (continued)

        def parse_file(filename):
            """ iterate through each line """
            .... snipped ...
            parser = JapanDateParser(line)
            result, error = parser.apply('grammar')
            .... snipped ...

        results = parse_file('imperial.utf8')
        results = parse_file('western.utf8')
  58. PyMeta : Left Recursion

    PyMeta can handle left recursion.

        recursiveGrammar = r"""
        num ::= <num>:n <digit>:d => n * 10 + d
              | <digit>
        digit ::= :d ?((d>='0') & (d<='9')) => int(d)"""

    Quiz. Is the following grammar equivalent?

        num ::= <digit>
              | <num>:n <digit>:d => n * 10 + d
  59. PyMeta : Matching objects

        listGrammar = """
        digit ::= :x ?(x.isdigit()) => int(x)
        interp ::= [<digit>:x '+' <digit>:y] => x + y"""

        g = OMeta.makeGrammar(listGrammar, {})
        parser = g( [['600','+','66']] )    # iterable python list
        result, error = parser.apply('interp')

        >>> result
        666
        >>> error
        ParseError(2,[])
  60. PyMeta : Matching objects (2)

    Object graph (e.g. tree)

        import :i ::= <anything>:a ?(a.__class__ == Import)
                      => 'import ' + ', '.join(import_match(a.names))

    The python rewriter project visits the AST created by the compiler module (python 2.x) & regenerates the python statement.

        >>> import compiler
        >>> print(compiler.parse('import ctypes'))
        Module(None, Stmt([Import([('ctypes', None)])]))
  61. pyparsing vs PyMeta

                                         pyparsing                                  PyMeta
        Whitespace sensitive?            No. But turned on via leaveWhitespace()    Yes. Use <spaces> rule to eat whitespace
        Left recursion                   No                                         Yes
        Packrat memoization              Yes. Off by default.                       Yes. Only no-arg rules
        Operates on character streams    Yes                                        Yes
        Operates on object streams       No                                         Yes
        Syntactic predicates             Yes                                        Yes
        Semantic predicates              No (see parse actions)                     Yes
        Semantic actions                 Yes                                        Yes
        Regex support                    Yes                                        No
  62. PyPy rlib/parsing

    Library for generating tokenizers & parsers in RPython. Consists of:
    * regex
    * packrat parser
    * tree structure
    * EBNF parser

    Sample JSON EBNF:

        NUMBER: "\-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][\+\-]?[0-9]+)?";
        value: <STRING> | <NUMBER> | <object> | <array> | <"null"> | <"true"> | <"false">;
        array: ["["] (value [","])* value ["]"];
        entry: STRING [":"] value;

    The resulting parse tree can be transformed or traversed with custom visitors. (dot)
  63. Topics not covered

    • Usage of syntactic predicates
    • Parsing grammars of mathematical expressions in order to preserve operator precedence
    • Handling indents/dedents in order to parse indentation-sensitive languages
      – e.g. coffeescript, python, haskell
  64. Resources

    * pyparsing: http://pyparsing.wikispaces.com/
    * PyMeta: http://www.tinlizzie.org/ometa/
    * PyPy RPython parsing library: http://doc.pypy.org/en/latest/rlib.html
    * http://gitorious.org/python-decompiler/python_rewriter
    * https://github.com/marcua/tweeql

    rubycoder@gmail.com