Packrat parsing in python

Introduction to packrat parsing for PEGs (Parsing Expression Grammars)
Gavin Bong
PyCon APAC 2011, Singapore
Wednesday, 8 June 2011, 21:29

Roadmap (34 mins)
Motivation: 4 mins
PEG theory: 5 mins
pyparsing: 16 mins
PyMeta: 7 mins
PyPy rlib/parsing: 1 min
Closing: 1 min

Motivation
How to parse texts with PEGs:
* Natural languages (NLTK)
* Mini languages (DSLs)
* Structured / unstructured file formats

4 thoughts:
i. Aren't structured formats like JSON, XML, HTML well served by existing parsers?
ii. Parsing log files & configuration files is easy with Python.
iii. Regular expressions are good enough.
iv. What is wrong with the classical way of writing parsers?

CFG (Context Free Grammars)
In formal language theory, a CFG is suitable for modeling both natural & computer languages.
BNF is the de facto notation for describing the syntax of CFGs.
    if_stmt ::= "if" expression ":" suite
                ( "elif" expression ":" suite )*
                [ "else" ":" suite ]
EBNF supports sequence, decision (choice), repetition and recursion.
The original BNF had no repetition operator and only supported recursion:
    S → a S
    S → Ɛ

CFG & Ambiguity
CFG grammars are potentially ambiguous.
Dangling else problem: the else below can attach to either if.
    if( x > 5 )
    if( y > 5 )
    console.log("heaven");
    else console.log("limbo");
[Figure: AST #1, one of the two possible parse trees for this code]

CFG & Ambiguity (2)
[Figure: AST #2, the alternative parse tree for the same code]

Definitions
Parse trees vs AST
* Parse tree = concrete. Nodes are nonterminals from the grammar; keeps whitespace, braces, semicolons.
* AST = abstract. Uses tree nodes specific to language constructs.
Top-down vs bottom-up
* Top-down = begin with the start nonterminal, then work down the parse tree.
* Bottom-up = identify terminals, infer nonterminals, then climb the parse tree.

Definitions (2)
Recursive descent parsing
* A top-down parser constructed from recursive functions.
* Each function represents a rule in the grammar.
    version ::= <digit> '.' <digit>
    digit   ::= '0' | '1' ... | '9'

    def version( source, position=0 ):
        digit( source, position )
        period( source, position + 1 )
        digit( source, position + 2 )
Run: (pymeta) nose --nocapture -v test_rdp_list.py

Recursive Descent Parsing
    import string

    # this_rule() and ParseError are helpers that are not shown on the slide.

    def digit(source, position):
        fn = (lambda t: t in string.digits, this_rule())
        expect(source, position, fn)

    def expect(source, position, comparator):
        try:
            expecting, msg = comparator
            if not expecting(source[0]):
                raise ParseError(position, msg)
            source.popleft()  # consume !
        except IndexError:
            raise EOFError(position)

    def period(source, position):
        fn = (lambda t: t == '.', this_rule())
        expect(source, position, fn)

Recursive Descent Parsing (2)
    >>> import collections
    >>> version(collections.deque('1.6'))
    >>> version(collections.deque('1,6'))
    ParseError: (1, 'expected <period>')
    >>> version(collections.deque('1.'))
    EOFError: (2, [('message', 'end of input')])
    >>> version(collections.deque('A.6'))
    ParseError: (0, 'expected <digit>')
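
The version grammar from the slides can be put together as one self-contained script. This is only a sketch: the slides rely on a this_rule() helper and a ParseError class that are not shown, so here the rule name is passed explicitly (the rule_name parameter is an assumption) and ParseError is a plain Exception subclass. The end-of-input message is also simplified.

```python
import collections
import string


class ParseError(Exception):
    """Raised when the next character does not match the expected rule."""


def expect(source, position, predicate, rule_name):
    # Check the next character against the predicate; consume it on success.
    if not source:
        raise EOFError(position, 'end of input')
    if not predicate(source[0]):
        raise ParseError(position, 'expected <%s>' % rule_name)
    return source.popleft()


def digit(source, position):
    # digit ::= '0' | '1' ... | '9'
    return expect(source, position, lambda t: t in string.digits, 'digit')


def period(source, position):
    return expect(source, position, lambda t: t == '.', 'period')


def version(source, position=0):
    # version ::= <digit> '.' <digit>
    major = digit(source, position)
    period(source, position + 1)
    minor = digit(source, position + 2)
    return int(major), int(minor)


print(version(collections.deque('1.6')))  # (1, 6)
```

Each grammar rule is still one function, and the deque is consumed from the left as the rules succeed, as on the slides.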

Classical method of parsing
Specific to LALR(1) bottom-up parsers
1. Flesh out a grammar in BNF
2. Lexical analysis phase
   lexer( patterns, stream-of-characters ) => stream of tokens
3. Parsing phase
   parser( grammar, stream-of-tokens ) => parse tree / AST
4. Use your parser
Photo attribution: http://www.flickr.com/photos/j_aroche/2160902499/

Spectrum of parsing solutions
* Regex
* Lex / Yacc parser generators (GNU flex/bison)
* PEG parsers
* Handwritten recursive descent parsers
* ANTLR

Other Python parsing toolkits
* PLY
* funcparserlib
* Yapps
http://wiki.python.org/moin/LanguageParsing

PEG
Scanner-less
Formalized by Bryan Ford in 2002-2004
Grammar mimics a recursive descent parser (+ backtracking).
A PEG grammar consists of a set of parsing expressions of the form: A → e
One expression is denoted the starting expression.
    e1 e2       Sequence
    e1 / e2     Ordered choice
    e+ e? e*    Repetition
    &e !e       Predicates
PEG != EBNF

PEG's ordered choice
    S → "Hitch" / "Hitchens"
Q. Given an input string of "Hitchens", what is the result of the parse?
Law #1: Given an input of A, the parsing expression matches a prefix A' of A or fails.
Law #2: A rule S → M / N will try to parse for an M. If that fails, backtrack & look for N.
Answer: Hitch
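
The two laws can be checked with a few lines of plain Python. This is a minimal sketch, not any library's API: literal() and choice() are hypothetical helper names, and a parser here is just a function that returns (matched text, new position) or None on failure.

```python
def literal(text):
    # Return a parser that matches `text` as a prefix of the input at `pos`.
    def parse(source, pos):
        if source.startswith(text, pos):
            return text, pos + len(text)
        return None  # failure
    return parse


def choice(*alternatives):
    # Ordered choice e1 / e2: commit to the first alternative that succeeds.
    def parse(source, pos):
        for alternative in alternatives:
            result = alternative(source, pos)
            if result is not None:
                return result
        return None
    return parse


s = choice(literal('Hitch'), literal('Hitchens'))
print(s('Hitchens', 0))          # ('Hitch', 5)

reordered = choice(literal('Hitchens'), literal('Hitch'))
print(reordered('Hitchens', 0))  # ('Hitchens', 8)
```

With the shorter literal first, the second alternative can never match. Listing the longer literal first is the usual way around that.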

PEG vs CFG
                                      PEG                CFG
Handles ambiguous grammars            No                 Yes
Syntax definition philosophy          Analytical         Generative
Requires a lexical analysis phase?    No                 Yes (lex/yacc)
Choice alternation                    Ordered (e1/e2)    Commutative
Left recursion                        No *               Yes
* Warth et al. (2008): packrat parsers can support left recursion.

PEG & Packrat parsing
Context: recursive descent parsing with backtracking
Problem: an input substring might be re-parsed during backtracking.
    grammar ::= AB | AC
Solution: memoization guarantees linear time performance.
[Photo: Neotoma cinerea, a packrat]
Photo attribution: http://en.wikipedia.org/wiki/File:Neotoma_cinerea.jpg
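
The re-parsing problem for grammar ::= AB | AC can be made visible by counting how often rule A runs. The sketch below makes several assumptions for illustration: A is taken to be 'a'+, B and C are the single characters 'b' and 'c', and functools.lru_cache stands in for the packrat table keyed on (rule, position). make_parser() is a hypothetical helper.

```python
import functools


def make_parser(source, memoize):
    calls = {'A': 0}

    def rule_a(pos):
        # A ::= 'a'+   (returns the end position, or None on failure)
        calls['A'] += 1
        end = pos
        while end < len(source) and source[end] == 'a':
            end += 1
        return end if end > pos else None

    if memoize:
        # Packrat step: remember the result of the rule at each position.
        rule_a = functools.lru_cache(maxsize=None)(rule_a)

    def literal(char, pos):
        if pos is not None and source.startswith(char, pos):
            return pos + 1
        return None

    def grammar(pos=0):
        # grammar ::= A 'b' / A 'c'
        end = literal('b', rule_a(pos))
        if end is None:
            end = literal('c', rule_a(pos))  # backtrack, retry from pos
        return end

    return grammar, calls


for memoize in (False, True):
    grammar, calls = make_parser('aaac', memoize)
    print(memoize, grammar(), calls['A'])
# False 4 2
# True 4 1
```

Without the cache, A is parsed once per alternative. With it, the second alternative reuses the stored result for position 0.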

case study #1 : problem statement
Parse modern Japanese dates in various formats. If the date parses successfully, convert it to its equivalent datetime.date instance.

case study #1 : The four ERAs
Era               Emperor      Period
HEISEI (平成)     Akihito      1989 Jan 8 - present
SHOWA (昭和)      Hirohito     1926 Dec 25 - 1989 Jan 7
TAISHOU (大正)    Yoshihito    1912 Jul 30 - 1926 Dec 24
MEIJI (明治)      Mutsuhito    1868 Sep 8 - 1912 Jul 29

case study #1 : liberties taken
1. No support for days-of-the-week tagged onto the end.
2. Numbers use western digits, not kanji.
3. Some eras have overlapping days. Ignore.
4. For the 1st year of an era, no support for gannen.
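
Before the pyparsing version, the conversion itself can be sketched with the standard library alone. Note that this uses a regular expression, not a PEG, and is only meant to show the era arithmetic: year 1 of an era is the calendar year in which the era began. The kanji era names, ERA_START, DATE_RE and to_date() are all assumptions made for this sketch. It follows the liberties listed above (western digits, no gannen, no check that the date falls inside the era).

```python
import re
from datetime import date

# Era name in kanji -> first day of the era (dates from the era table).
ERA_START = {
    '平成': date(1989, 1, 8),    # Heisei
    '昭和': date(1926, 12, 25),  # Showa
    '大正': date(1912, 7, 30),   # Taishou
    '明治': date(1868, 9, 8),    # Meiji
}

# Either <era><yy> or a western <yyyy>, then 年 (year) 月 (month) 日 (day).
DATE_RE = re.compile(
    r'(?:(?P<era>平成|昭和|大正|明治)(?P<yy>\d+)|(?P<yyyy>\d+))年'
    r'(?P<mm>\d+)月(?P<dd>\d+)日'
)


def to_date(text):
    match = DATE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError('not a recognised date: %r' % text)
    if match.group('era'):
        # Year 1 of an era is the calendar year in which the era started.
        start_year = ERA_START[match.group('era')].year
        year = start_year - 1 + int(match.group('yy'))
    else:
        year = int(match.group('yyyy'))
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, int(match.group('mm')), int(match.group('dd')))


print(to_date('平成23年6月8日'))  # 2011-06-08
print(to_date('2011年6月8日'))    # 2011-06-08
```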

case study #1 : initial attempt
    from pyparsing import Literal, Word, nums

    year = Literal( u'\u5e74' )
    month = Literal( u'\u6708' )
    day = Literal( u'\u65e5' )
    heisei_era = Literal( u'\u5e73\u6210' )
    integer = Word(nums)
Callout: Word(nums, exact=2) restricts the match to exactly two digits.

case study #1 : initial attempt (2)
    western_year = integer('yyyy') + year
    imperial_year = heisei_era + western_year
    day_spec = integer.setResultsName('dd') + day
    month_spec = integer('mm') + month
    year_spec = (imperial_year('imperial') | western_year('western'))
    grammar = year_spec + month_spec + day_spec

case study #1 : initial attempt (3)
    result = grammar.parseString(japanese_date)
    print(result.dump())

pyparsing : introduction
* Easy to use PEG-based text parser
* Grammar definitions in Python
* Framework distributed as one file, pyparsing.py
* Runs on both Python 2.x & 3.x. Future releases after 1.5.x will be focused on Python 3.x only.
* Not classified as recursive descent!

pyparsing : framework overview

pyparsing & PEGs : correlation
PEG         pyparsing
e1 e2       e1 + e2 == And( [e1, e2] )
e1 / e2     e1 | e2 == MatchFirst( [e1, e2] )
e*          ZeroOrMore( e )
e+          OneOrMore( e )
e?          Optional( e )
&e          FollowedBy( e )
!e          ~e == NotAny( e )

pyparsing : framework overview

pyparsing : ordered choice
MatchFirst will short circuit as soon as a match is found. Not commutative.
Shadowing: literals in which one is a substring of the other should be avoided.
Keywords are different.

pyparsing : backtracking
Or forces the parser to make an exhaustive search of the alternatives (match longest).
Or might introduce ambiguities. No better than non-PEG parsers.
Tweak the order of alternatives & put the most probable (e.g. by frequency of occurrence) first. This avoids wasteful backtracking.

pyparsing : backtracking
Ballon d'Or 2011 example
    p1,p2,p3,p4,p5 = map(Literal, ['ronaldo', 'messi', 'park-ji-sung',
                                   'xavi', 'iniesta'])
    first = p2 + p1 + p4
    second = p2 + p1 + p5
    third = p2 + p1 + p3
    grammar = first | second | third
    print(grammar.parseString( "messi ronaldo park-ji-sung" ))

pyparsing : backtracking

pyparsing : left factored
    p1,p2,p3,p4,p5 = map(Literal, ['ronaldo', 'messi', 'park-ji-sung',
                                   'xavi', 'iniesta'])
    absolute_certainty = p2 + p1
    too_close_to_call = p4 | p5 | p3
    grammar = absolute_certainty + too_close_to_call
    print(grammar.parseString( "messi ronaldo park-ji-sung" ))
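
The saving from left factoring can be counted directly. The sketch below does not use pyparsing: literal(), sequence() and choice() are hypothetical stand-ins for Literal, And and MatchFirst, and every literal match attempt is appended to a log so the two grammars can be compared.

```python
def literal(text, log):
    # Match `text` at `pos` after skipping spaces; record every attempt.
    def parse(source, pos):
        log.append(text)
        while pos < len(source) and source[pos] == ' ':
            pos += 1
        return pos + len(text) if source.startswith(text, pos) else None
    return parse


def sequence(*parts):
    # e1 e2 ...: every part must match, one after the other.
    def parse(source, pos):
        for part in parts:
            pos = part(source, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*alternatives):
    # e1 / e2 ...: the first alternative that matches wins.
    def parse(source, pos):
        for alternative in alternatives:
            end = alternative(source, pos)
            if end is not None:
                return end
        return None
    return parse


log = []
names = ['ronaldo', 'messi', 'park-ji-sung', 'xavi', 'iniesta']
p1, p2, p3, p4, p5 = [literal(name, log) for name in names]

naive = choice(sequence(p2, p1, p4),
               sequence(p2, p1, p5),
               sequence(p2, p1, p3))
factored = sequence(p2, p1, choice(p4, p5, p3))


def count_attempts(grammar, source):
    # Return (end position, number of literal match attempts).
    del log[:]
    end = grammar(source, 0)
    return end, len(log)


text = 'messi ronaldo park-ji-sung'
print(count_attempts(naive, text))     # (26, 9)
print(count_attempts(factored, text))  # (26, 5)
```

The unfactored grammar re-matches the common prefix 'messi ronaldo' for every alternative. The factored one matches it once.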

pyparsing : packrat
Memoization must be manually turned on:
    ParserElement.enablePackrat()
Caches:
a. ParseResults
b. Exceptions thrown
Run: python select_parser.py
Caveat emptor: a grammar with parse actions that have side effects does not always play well with memoization turned on.

pyparsing : semantic actions
In pyparsing parlance, a ParserElement can have zero or more parse actions.
4 forms of parse actions:
    fn(s,loc,toks)
    fn(loc,toks)
    fn(toks)
    fn()
Usage:
    ParserElement.setParseAction( *fn )
    ParserElement.addParseAction( *fn )
Uses:
1. Perform validation (see ParseException)
2. Process the matched token(s) & modify them. Returning a value overwrites the matched token(s).
3. Annotate with custom types (corollary of #2)

case study #1 : Semantic action
    integer = Word(nums).setParseAction( lambda t: int(t[0]) )
All users of the integer expression will inherit the parse action.
    def range_check(toks):
        month = int(toks[0])
        if month <= 0 or month >= 13:
            raise ParseException('month must be in range 1..12')

    month_spec = integer('m').addParseAction(range_check) + month
Selective assignment of parse actions to copies:
    integer.copy().addParseAction( .. )
    integer( 'result_name' ).addParseAction( .. )
Show: japan_simple.py

case study #1 : test files
imperial.utf8
western.utf8

case study #1 : complete solution
Show: japan_dates.py
Demo:
    @traceParseAction
    def convert_kanji_year(toks):
        if 'imperial' in toks.keys():
            year = toks.imperial.yearZero + toks.imperial.yy
            toks['era'] = toks.imperial.type_
            toks['yyyy'] = year
        elif 'western' in toks.keys():
            year = toks.yyyy
        try:
            toks['modernDate'] = date(year, toks.mm, toks.dd)
        except ValueError as error:
            raise ParseException(error.args[0])

case study #2 : problem statement
Parse Gmail search criteria. Supports a tiny subset of the full grammar:
    from:(<sender>)
    label:inbox
    -label:sent
    yyyyy
    -yyyyy
    "zzzzz"
    -"zzzzz"

case study #2: example strings
    label:sarawak -label:not-urgent
    from:( bruno manser )
    from:( [email protected] )
    from:( @swiss.org )
    "penan injustice" -logging

case study #2: email addresses
    emailfull = Regex(r"(?P<user>[A-Za-z0-9._%+-]+)@"
                      r"(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")
    emailpartial = Regex(r"@(?P<hostname>[A-Za-z0-9.-]+)\."
                         r"(?P<tld>[A-Za-z]{2,4})")
    email = (emailpartial | emailfull)

    squeeze = lambda t: ' '.join( t[0].split() )
    name = ZeroOrMore(Word(alphanums + ' ')).setParseAction( squeeze )

case study #2: email addresses
    opener,closer,colon = map(Suppress, '():')
    enclosed = email | name
    nested = opener + enclosed + closer
    grammar_email = Combine(Suppress('from') + colon + nested)

case study #2: email addresses
    result = grammar_email.parseString( 'from:([email protected])' )
    print(result.dump())

    result = grammar_email.parseString( 'from:( Marco de Gasperi )' )
    print(result.dump())
Run: nosetests -v testFromTo.py

case study #2: labels
GOAL: group the excluded and included labels into their own sub-lists.
E.g. label:fukushima1 -label:aloo-gobi
    hyphen = Suppress('-')
    label_rhs = delimitedList(Word(alphanums), delim='-', combine=True)
    # with combine=True this is: Combine( expr + ZeroOrMore( delim + expr ) )
    label_include = Combine( Suppress('label') + colon + label_rhs )
    label_exclude = Combine( hyphen + label_include )
    label_all = MatchFirst([
        label_exclude.setResultsName('labels.exclude', listAllMatches=True),
        label_include('labels.include*')])
    grammar_label = ZeroOrMore( label_all )
pyparsing 1.5.6

case study #2: labels
    result = grammar_label.parseString(
        '-label:fukushima1 label:onagawa -label:aloo-gobi label:cheese-naan' )
    print(result.dump())
Question. Will this grammar work if the user entered LABEL instead of label?
Answer. No. Use CaselessLiteral('label').

case study #2: search strings
GOAL: group the excluded and included search strings into their own sub-lists.
E.g. rumi -"jack kerouac"
    key_single = Word(alphanums)
    key_quoted = quotedString.setParseAction(removeQuotes)
    key_included = key_quoted | key_single
    key_excluded = Combine(hyphen + key_included)
    key_all = MatchFirst( [key_excluded("key.exclude*"),
                           key_included("key.include*")] )
    grammar_key = ZeroOrMore( key_all )

case study #2: search strings
    result = grammar_key.parseString( ' -osama obama -"bin laden" "white house" ' )
    print(result.dump())
Question. If the user entered single instead of double quotes, will it conform to the grammar?
Answer. Yes

case study #2: Final solution
Let's compose all the individual pieces together.
Modified rule: nested = opener + Group(enclosed) + closer
    email_all = grammar_email('from*')
    gmail = (ZeroOrMore(email_all | label_all | key_all) +
             Suppress(restOfLine))

    result = gmail.parseString('love label:writing-tips "bird by bird" from:(Anne Lamott) -"dalai lama" -label:macchu-pichu from:([email protected]) -label:french-guiana -"epictetus" label:yoga "bugle podcast" label from:(@microsoft.com)')
    print(result.dump())

case study #2: Final solution
    ['love', 'writing-tips', 'bird-by-bird', 'Anne Lamott', 'dalai lama', 'macchu-pichu', '@microsoft.com', '[email protected]', 'french-guiana', 'epictetus', 'yoga', 'bugle podcast', 'label']
    - from: ['Anne Lamott', '@microsoft.com', '[email protected]']
    - key.exclude: ['dalai lama', 'epictetus']
    - key.include: ['love', 'bird by bird', 'bugle podcast', 'label']
    - labels.exclude: ['macchu-pichu', 'french-guiana']
    - labels.include: ['writing-tips', 'yoga']

pyparsing: Recursion
A grammar is recursive when there exists a nonterminal which has itself in the right-hand side of the production rule.
    number ::= digit rest
    rest   ::= digit rest | empty

    digit = Word(nums, exact=1).setName('1-digit')
    rest = Forward()
    rest << Optional(digit + rest)
    number = Combine(digit + rest, adjacent=False)('digit-list')
    grammar = number.setParseAction(lambda t: int(t[0])) + Suppress(restOfLine)
Run

case study #3: binary tree
Parse parentheses notation for binary trees.
    (nil,4,nil)
    ((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))
[Figure: binary tree with the nodes 2, 3, 4, 5, 6, 7]
Convert it to list notation in Python.

case study #3: recursive solution
BNF
    node ::= '(' node ',' number ',' node ')' | empty
Code
    # bookend() is a helper that is not shown on the slide.
    left, right, comma = map(Suppress, '(),')
    empty = (CaselessLiteral('nil')
             .setParseAction(replaceWith(None)))
    tree = Forward()
    value = Word(nums).setParseAction(lambda t: int(t[0]))
    tree << ((left + Group(tree) + bookend(value) + Group(tree) + right)
             | empty)
Run

case study #3: recursive solution
Input:
    ((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))
Output:
    [[[None],2,[[None],3,[None]]],4,[[[None],5,[[None],6,[None]]],7,[None]]]
How to fix it: re-implement Group(tree) in TreeGroup, so that a lone None is not wrapped in a list.
    class TreeGroup(TokenConverter):
        def postParse(self, instring, loc, tokenlist):
            if len(tokenlist) == 1 and tokenlist[0] is None:
                return tokenlist
            else:
                return [tokenlist]
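
For comparison, the target list notation can also be produced by a hand-written recursive descent parser with no library at all. This is a sketch under assumptions, not the pyparsing solution: parse_tree() is a hypothetical name, and the BNF's empty alternative is taken to be the literal nil, as in the sample input.

```python
def parse_tree(text):
    # node ::= '(' node ',' number ',' node ')' | 'nil'
    text = text.replace(' ', '')

    def expect(char, pos):
        # Consume one expected punctuation character.
        if not text.startswith(char, pos):
            raise ValueError('expected %r at position %d' % (char, pos))
        return pos + 1

    def node(pos):
        # Return (subtree, position after the subtree).
        if text.startswith('nil', pos):
            return None, pos + 3
        pos = expect('(', pos)
        left, pos = node(pos)
        pos = expect(',', pos)
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError('expected a number at position %d' % start)
        value = int(text[start:pos])
        pos = expect(',', pos)
        right, pos = node(pos)
        pos = expect(')', pos)
        return [left, value, right], pos

    tree, end = node(0)
    if end != len(text):
        raise ValueError('unexpected trailing input at position %d' % end)
    return tree


print(parse_tree('(nil,4,nil)'))
# [None, 4, None]
print(parse_tree('((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))'))
# [[None, 2, [None, 3, None]], 4, [[None, 5, [None, 6, None]], 7, None]]
```

An empty subtree comes out as a bare None, not [None], which is the shape the TreeGroup fix is aiming for.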

pyparsing : left recursion
pyparsing does not support left recursion.
    term ::= \d+
    expr ::= expr + term | term

    @raises(RecursiveGrammarException)
    def test_left_recursion(self):
        expr.validate()
Run
pyparsing will raise a RuntimeError with message 'maximum recursion depth exceeded'.
Eliminate left recursion if you want it to work in pyparsing.
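
One common way to eliminate the left recursion is to rewrite expr ::= expr + term | term as expr ::= term ('+' term)* and fold the results to the left while looping. The sketch below uses only the standard library; tokenize() and parse_expr() are hypothetical names, and the tuple-based tree is just one possible result shape.

```python
import re

TOKEN_RE = re.compile(r'\s*(\d+|\+)')


def tokenize(text):
    # Split the input into number and '+' tokens, skipping whitespace.
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError('unexpected character at position %d' % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(text):
    # expr ::= term ('+' term)*      term ::= \d+
    tokens = tokenize(text)
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise ValueError('expected a number at token %d' % pos)
        pos += 1
        return int(tokens[pos - 1])

    tree = term()
    while pos < len(tokens) and tokens[pos] == '+':
        pos += 1
        # Fold to the left, as the left-recursive rule would have done.
        tree = ('+', tree, term())
    if pos != len(tokens):
        raise ValueError('unexpected token %r' % tokens[pos])
    return tree


print(parse_expr('1 + 2 + 3'))  # ('+', ('+', 1, 2), 3)
```

The loop replaces the recursion, and the left fold keeps the left-associative tree shape of the original rule.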

PyMeta : introduction
OMeta is a language prototyping system (PEG). Implemented in several programming languages.
* Packrat memoization
* Grammar: BNF dialect (with host language snippets)
    lowercase ::= <char_range 'a' 'z'>
* Object-oriented: inheritance, overriding rules
    def rule_lowercase():  # ..body..
* <anything> consumes one object from the input stream (c.f. regex).
* Built-in rules: <letter> <digit> <letterOrDigit> <token '?'>

PEGs & PyMeta
PEG         PyMeta
e1 e2       e1 e2
e1 / e2     e1 | e2
e*          e*
e+          e+
e?          e?
Syntactic predicates (unlimited lookahead):
&e          ~~e
!e          ~e

case study #1 : in PyMeta
Modest goals:
a) recognize western and Heisei imperial dates
b) read & parse both imperial.utf8 & western.utf8
Separate files:
    common.py            : Common rules & utilities
    western_dates.py     : Grammar to recognize western dates
    era_heisei.py        : Grammar to recognize heisei dates
    japan_date_parser.py : Final grammar

case study #1 : in PyMeta pt A
common.py
    from pymeta.grammar import OMeta

    baseGrammar = r"""
    # common literals for all ERAs
    year ::= <token u'\u5E74'>
    month ::= <token u'\u6708'>
    day ::= <token u'\u65E5'>

    range_num :min :max ::= <digit>+:m ?(int(join(m)) >= min & int(join(m)) <= max) => m
    rest_of_line ::= <anything>* <token '\n'>? => None
    empty_line ::= <spaces> <rest_of_line> => None
    python_comment ::= <token '#'> <rest_of_line> => None
    """

    JapanCommonParser = OMeta.makeGrammar(baseGrammar, globals(),
                                          "JapanCommonParser")

    def join(x):
        return ''.join(x)

case study #1 : in PyMeta pt B
western_dates.py
    westernGrammar = r"""
    western ::= <spaces> <digit>+:y <year>
                <range_num 1 12>:m <month>
                <range_num 1 31>:d <day> <rest_of_line>
                => westernized( int(join(y)), int(join(m)), int(join(d)))
    grammar ::= <python_comment> | <western>"""

    def westernized(yyyy, mm, dd):
        retval = JapanDate()
        retval['western'] = date(yyyy, mm, dd)
        return retval

    WesternParser = JapanCommonParser.makeGrammar(
        westernGrammar, globals(), 'WesternParser')

case study #1 : in PyMeta pt C
era_heisei.py
    era_heisei = Era('Heisei', 'Akihito', (u'\u5E73\u6210', u'\u337B'),
                     startDate=date(1989,1,8))

    def heisei_year_ok(yy):
        return (yy >= 1 and yy <= era_heisei.maxYearUnit)

    def collect( yy, mm, dd ):
        retval = JapanDate()
        retval['imperial'] = date( era_heisei.yearZero + yy, mm, dd )
        retval['era'] = [ era_heisei.name, yy ]
        return retval

case study #1: in PyMeta pt C (2)
era_heisei.py (continued)
    heiseiGrammar = r"""
    hlong ::= <token u'\u5e73\u6210'>
    hshort ::= <token u'\u337b'>
    heisei ::= (<hlong> | <hshort>) <digit>+:y ?(heisei_year_ok(int(join(y)))) <year>
               <range_num 1 12>:m <month>
               <range_num 1 31>:d <day> <rest_of_line>
               => collect(int(join(y)), int(join(m)), int(join(d)))
    """

    HeiseiParser = JapanCommonParser.makeGrammar(heiseiGrammar,
                                                 globals(), 'HeiseiParser')

case study #1 : in PyMeta pt D
japan_date_parser.py
    finalGrammar = r"""
    # override 'grammar' in WesternParser
    grammar ::= <super> | <heisei> | <empty_line>"""

    class BaseParser(HeiseiParser, WesternParser):
        pass

    BaseParser.globals.update(WesternParser.globals)
    BaseParser.globals.update(HeiseiParser.globals)

    JapanDateParser = BaseParser.makeGrammar(
        finalGrammar, globals(), "JapanDateParser")

case study #1 : in PyMeta pt D (2)
japan_date_parser.py (continued)
    def parse_file(filename):
        """ iterate through each line """
        .... snipped ...
        parser = JapanDateParser(line)
        result, error = parser.apply('grammar')
        .... snipped ...

    results = parse_file('imperial.utf8')
    results = parse_file('western.utf8')
Run

case study #1 : PyMeta output

PyMeta : Left Recursion
PyMeta can handle left recursion.
    recursiveGrammar = r"""
    num ::= <num>:n <digit>:d => n * 10 + d
          | <digit>
    digit ::= :d ?((d>='0') & (d<='9')) => int(d)"""
Run
Quiz. Is the following grammar equivalent?
    num ::= <digit>
          | <num>:n <digit>:d => n * 10 + d

PyMeta : Matching objects
    listGrammar = """
    digit ::= :x ?(x.isdigit()) => int(x)
    interp ::= [<digit>:x '+' <digit>:y] => x + y"""

    g = OMeta.makeGrammar(listGrammar, {})
    parser = g( [['600','+','66']] )    # iterable python list
    result, error = parser.apply('interp')

    >>> result
    666
    >>> error
    ParseError(2,[])

PyMeta : Matching objects (2)
Object graph (e.g. tree)
The python rewriter project visits the AST created by the compiler module (Python 2.x) & regenerates the Python statement.
    >>> import compiler
    >>> print(compiler.parse('import ctypes'))
    Module(None, Stmt([Import([('ctypes', None)])]))

    import :i ::= <anything>:a ?(a.__class__ == Import)
                  => 'import '+', '.join(import_match(a.names))

pyparsing vs PyMeta
                                 pyparsing                                  PyMeta
Whitespace sensitive?            No. But turned on via leaveWhitespace()    Yes. Use <spaces> rule to eat whitespace
Left recursion                   No                                         Yes
Packrat memoization              Yes. Off by default.                       Yes. Only no-arg rules
Operates on character streams    Yes                                        Yes
Operates on object streams       No                                         Yes
Syntactic predicates             Yes                                        Yes
Semantic predicates              No (see parse actions)                     Yes
Semantic actions                 Yes                                        Yes
Regex support                    No                                         Yes

PyPy rlib/parsing
Library for generating tokenizers & parsers in RPython.
Consists of:
* regex / packrat parser
* tree structure / EBNF parser
Sample JSON EBNF:
    NUMBER: "\-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][\+\-]?[0-9]+)?";
    value: <STRING> | <NUMBER> | <object> | <array> | <"null"> |
           <"true"> | <"false">;
    array: ["["] (value [","])* value ["]"];
    entry: STRING [":"] value;
The resulting parse tree can be transformed or traversed with custom visitors. (dot)

Topics not covered
• Usage of syntactic predicates
• Parsing grammars of mathematical expressions in order to preserve operator precedence
• Handling indents/dedents in order to parse indentation-sensitive languages, e.g. coffeescript, python, haskell

Resources
pyparsing: http://pyparsing.wikispaces.com/
PyMeta: http://www.tinlizzie.org/ometa/
PyPy RPython parsing library: http://doc.pypy.org/en/latest/rlib.html
http://gitorious.org/python-decompiler/python_rewriter
https://github.com/marcua/tweeql
[email protected]
