Slide 1

Slide 1 text

Packrat Parsing Sean Cribbs Background Contributions Why I Love It Discussion
Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking
Bryan Ford, MIT Master’s Thesis, 2002
Sean Cribbs, @seancribbs
Papers We Love Chicago #1, 20 August 2014

Slide 2

Slide 2 text

About Me
Engineer at Basho, Conference Junkie
Favorite CS class was Compiler Construction
Creator of neotoma, packrat-parser toolkit for Erlang

Slide 3

Slide 3 text

Outline
Languages, Grammars, Parsing
This Thesis’ Contributions
Why I Love It
Discussion

Slide 4

Slide 4 text

What is parsing?
Languages express information linearly, as sequences of symbols
Applications that use languages must derive higher-level constructs: words, phrases, clauses, sentences, expressions, statements, ...
This is called syntax analysis or parsing

Slide 5

Slide 5 text

Terminology
Language
Grammar
Terminal/Nonterminal
Rule (production, reduction)
Top-down parsing (recursive descent)
Bottom-up parsing (shift/reduce)

Slide 6

Slide 6 text

Who is this?
Figure: © Duncan Rawlinson, CC BY 2.0

Slide 7

Slide 7 text

Chomsky’s Hierarchy
regular
context-free
context-sensitive
recursively enumerable
Figure: © J. Finkelstein, CC BY-SA 3.0

Slide 8

Slide 8 text

Regular
Very popular in practical use via regular expressions
Recognizable via finite-state automata
Used in scanning, aka “lexical analysis”
Figure: Deterministic finite-state automaton (DFA)

Slide 9

Slide 9 text

Context-free
What we usually think of when we discuss “grammars”: tools like yacc/Bison/ANTLR
Unlike regular languages, can express recursive constructs
Recognizable via pushdown automata (stack-based)
Used in parsing, aka “syntax analysis”
    S → aSb | ε                            (1)
    ε | ab | aabb | aaabbb | ...           (2)
    S → a*b*                               (3)
    a | b | ab | aab | aabb | aabbb | ...  (4)
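
The contrast between grammars (1) and (3) can be sketched in a few lines of Haskell. This is only an illustration, not something from the slides: the function names `anbn` and `aStarBStar` are made up here, and grammar (1) is read as allowing the empty string. The recursive language needs the two runs to have equal length (a count, which is what a stack provides), while the regular one only checks the order of the symbols.

```haskell
-- Hypothetical recognizers for the two example languages on this slide.

-- Grammar (1): n 'a's followed by exactly n 'b's (n may be zero).
-- Matching the counts is the part a finite-state automaton cannot do.
anbn :: String -> Bool
anbn s =
  let (as, rest) = span (== 'a') s
      (bs, end)  = span (== 'b') rest
  in null end && length as == length bs

-- Grammar (3): any number of 'a's followed by any number of 'b's.
-- No counting needed, only the order of symbols matters.
aStarBStar :: String -> Bool
aStarBStar s = null (dropWhile (== 'b') (dropWhile (== 'a') s))
```

For instance, "aab" is accepted by `aStarBStar` but rejected by `anbn`.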

Slide 10

Slide 10 text

Problems with CFGs for machine-readable languages
Focused on generating strings, not recognizing them
Often ambiguous, both locally and globally; frequently augmented with manual disambiguation
Generally require a separate scanning step
Predictive and shift/reduce parsing algorithms only work with restricted CFGs

Slide 11

Slide 11 text

This Thesis’ Contributions
1. Makes Top-Down Parsing Language a practical notation
2. Packrat-parsing algorithm
3. A TDPL/Packrat-parser generator for Haskell, Pappy

Slide 12

Slide 12 text

TDPL
Top-Down Parsing Language
Formalism for describing top-down parsing algorithms (aka recursive-descent)
Expresses how to read strings rather than write them
Also known as parsing expression grammar (PEG)
Fundamentally unambiguous via ordered choice

Slide 13

Slide 13 text

PEG/TDPL constructs
    Empty string                ()
    Terminal                    a
    Nonterminal                 A
    Sequence                    e1 e2 e3 ...
    Ordered-choice              e1/e2/e3/...
    Greedy repetition           e*
    Greedy positive repetition  e+
    Optional                    e?
    Followed-By                 &(e)
    Not-Followed-By             !(e)
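
One way to get a feel for these constructs is to encode them as a data type with a small backtracking matcher. The sketch below is an assumption-laden illustration, not the thesis's or Pappy's implementation: the names `Expr`, `Grammar`, and `match` are invented here, and the matcher only reports the unconsumed input rather than building a value.

```haskell
-- Hypothetical encoding of the PEG constructs from the slide.
data Expr
  = Empty             -- ()
  | Term Char         -- a
  | NonTerm String    -- A
  | Seq Expr Expr     -- e1 e2
  | Choice Expr Expr  -- e1 / e2 (ordered)
  | Star Expr         -- e*
  | Plus Expr         -- e+
  | Opt Expr          -- e?
  | And Expr          -- &(e)
  | Not Expr          -- !(e)

-- A grammar maps nonterminal names to their defining expressions.
type Grammar = [(String, Expr)]

-- Match an expression against a prefix of the input.
-- Just rest = success with the remaining input; Nothing = failure.
match :: Grammar -> Expr -> String -> Maybe String
match _ Empty s = Just s
match _ (Term c) (x:xs) | c == x = Just xs
match _ (Term _) _ = Nothing
match g (NonTerm n) s = lookup n g >>= \e -> match g e s
match g (Seq a b) s = match g a s >>= match g b
-- Ordered choice: the second alternative is tried only if the first fails.
match g (Choice a b) s = case match g a s of
  Just r  -> Just r
  Nothing -> match g b s
-- Greedy repetition: keep going while e succeeds and consumes input.
match g (Star e) s = case match g e s of
  Just r | length r < length s -> match g (Star e) r
  _ -> Just s
match g (Plus e) s = match g (Seq e (Star e)) s
match g (Opt e) s = match g (Choice e Empty) s
-- Predicates succeed or fail without consuming any input.
match g (And e) s = fmap (const s) (match g e s)
match g (Not e) s = case match g e s of
  Just _  -> Nothing
  Nothing -> Just s
```

Note how `Choice (Term 'a') (Seq (Term 'a') (Term 'b'))` applied to "ab" stops after the first alternative and leaves "b" unconsumed, which is the ordered-choice behaviour in miniature.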

Slide 14

Slide 14 text

PEG advantages
Tokenization can be done inline (scannerless parsing)
Constructs like “reserved words” can be expressed directly
Associativity is more directly controlled

Slide 15

Slide 15 text

Longest-match disambiguation
The dangling else problem
    IF a THEN IF b THEN c ELSE d

Slide 16

Slide 16 text

Longest-match disambiguation
The dangling else problem
    IF a THEN IF b THEN c ELSE d
    IF a THEN (IF b THEN c) ELSE d
    IF a THEN (IF b THEN c ELSE d)

Slide 17

Slide 17 text

Longest-match disambiguation
The dangling else problem
    IF a THEN IF b THEN c ELSE d
    IF a THEN (IF b THEN c) ELSE d
    IF a THEN (IF b THEN c ELSE d)
Languages usually pick the latter. Ordered-choice makes this disambiguation explicit:
    if_stmt <- "IF" e "THEN" s "ELSE" s /
               "IF" e "THEN" s
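
To see the ordered choice pick the second reading, here is a minimal hand-written sketch of the `if_stmt` rule. It makes several simplifying assumptions that are not in the slides: the input is already split into whitespace-separated words, a condition `e` is a single word, and a non-IF statement is a single non-keyword word. The names `Stmt`, `pIf`, and `pStmt` are hypothetical.

```haskell
-- Hypothetical AST: an IF with an optional ELSE branch, or a one-word statement.
data Stmt
  = If String Stmt (Maybe Stmt)
  | Simple String
  deriving (Eq, Show)

-- if_stmt <- "IF" e "THEN" s "ELSE" s / "IF" e "THEN" s
pIf :: [String] -> Maybe (Stmt, [String])
pIf ts = alt1
  where
    -- First alternative: the form with ELSE is always tried first.
    alt1 = case ts of
      ("IF" : e : "THEN" : r1) -> case pStmt r1 of
        Just (t, "ELSE" : r2) -> case pStmt r2 of
          Just (f, r3) -> Just (If e t (Just f), r3)
          Nothing      -> alt2
        _ -> alt2
      _ -> alt2
    -- Second alternative: only reached when the first one fails.
    alt2 = case ts of
      ("IF" : e : "THEN" : r1) -> case pStmt r1 of
        Just (t, r2) -> Just (If e t Nothing, r2)
        Nothing      -> Nothing
      _ -> Nothing

-- s <- if_stmt / word   (assumed statement rule for this sketch)
pStmt :: [String] -> Maybe (Stmt, [String])
pStmt ts = case pIf ts of
  Just r  -> Just r
  Nothing -> case ts of
    (w : r) | w `notElem` ["IF", "THEN", "ELSE"] -> Just (Simple w, r)
    _ -> Nothing
```

Running `pStmt (words "IF a THEN IF b THEN c ELSE d")` attaches the ELSE to the inner IF, because the inner `if_stmt` tries its ELSE alternative before the outer one gets a chance to claim it.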

Slide 18

Slide 18 text

PEG limitations
Cannot express ambiguous syntax
Some CFGs have no corresponding PEG
Repetition is always greedy; the workaround is to use predicates
Left-recursion is erroneous*

Slide 19

Slide 19 text

Parsing
Consider the PEG:
    Additive  <- Multitive '+' Additive / Multitive
    Multitive <- Primary '*' Multitive / Primary
    Primary   <- '(' Additive ')' / Decimal
    Decimal   <- '0' / ... / '9'

Slide 20

Slide 20 text

Simple Top-Down Parser
Recursive Descent with Backtracking
    data Result v = Parsed v String
                  | NoParse

Slide 21

Slide 21 text

Simple Top-Down Parser
Recursive Descent with Backtracking
    data Result v = Parsed v String
                  | NoParse

    -- calls itself and pMultitive
    pAdditive :: String -> Result Int
    -- calls itself and pPrimary
    pMultitive :: String -> Result Int
    -- calls pAdditive and pDecimal
    pPrimary :: String -> Result Int
    -- consumes a digit
    pDecimal :: String -> Result Int

Slide 22

Slide 22 text

Simple Top-Down Parser
Recursive Descent with Backtracking
    -- Parse an additive-precedence expression
    pAdditive :: String -> Result Int
    pAdditive s = alt1 where
        -- Additive <- Multitive '+' Additive
        alt1 = case pMultitive s of
                 Parsed vleft s' ->
                   case s' of
                     ('+':s'') ->
                       case pAdditive s'' of
                         Parsed vright s''' -> Parsed (vleft + vright) s'''
                         _ -> alt2
                     _ -> alt2
                 _ -> alt2

Slide 23

Slide 23 text

Simple Top-Down Parser
Recursive Descent with Backtracking
        -- continued from previous slide
        -- Additive <- Multitive
        alt2 = case pMultitive s of
                 Parsed v s' -> Parsed v s'
                 NoParse     -> NoParse

Slide 24

Slide 24 text

Simple Top-Down Parser
Recursive Descent with Backtracking
        -- continued from previous slide
        -- Additive <- Multitive
        alt2 = case pMultitive s of
                 Parsed v s' -> Parsed v s'
                 NoParse     -> NoParse
On a plain Multitive expression, we compute pMultitive twice: once in alt1, and again in alt2 after alt1 fails to find a '+'.
Worst-case backtracking can result in exponential parse times: O(2^n)
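
The slides only show `pAdditive`. A complete, runnable version of the naive parser might look like the sketch below. The other three functions are filled in here by analogy with `pAdditive` and the grammar on the earlier slide, so treat them as an assumption rather than the slides' exact code; the `eval` helper is hypothetical and only exists to make the example easy to try.

```haskell
import Data.Char (isDigit, digitToInt)

-- A parse either yields a value plus the remaining input, or fails.
data Result v = Parsed v String | NoParse

-- Additive <- Multitive '+' Additive / Multitive
pAdditive :: String -> Result Int
pAdditive s = alt1
  where
    alt1 = case pMultitive s of
      Parsed vleft s' -> case s' of
        ('+':s'') -> case pAdditive s'' of
          Parsed vright s''' -> Parsed (vleft + vright) s'''
          _ -> alt2
        _ -> alt2
      _ -> alt2
    -- Backtrack: re-parse the same input from scratch.
    alt2 = pMultitive s

-- Multitive <- Primary '*' Multitive / Primary
pMultitive :: String -> Result Int
pMultitive s = alt1
  where
    alt1 = case pPrimary s of
      Parsed vleft s' -> case s' of
        ('*':s'') -> case pMultitive s'' of
          Parsed vright s''' -> Parsed (vleft * vright) s'''
          _ -> alt2
        _ -> alt2
      _ -> alt2
    alt2 = pPrimary s

-- Primary <- '(' Additive ')' / Decimal
pPrimary :: String -> Result Int
pPrimary s = alt1
  where
    alt1 = case s of
      ('(':s') -> case pAdditive s' of
        Parsed v s'' -> case s'' of
          (')':s''') -> Parsed v s'''
          _ -> alt2
        _ -> alt2
      _ -> alt2
    alt2 = pDecimal s

-- Decimal <- '0' / ... / '9'
pDecimal :: String -> Result Int
pDecimal (c:s') | isDigit c = Parsed (digitToInt c) s'
pDecimal _ = NoParse

-- Hypothetical helper: succeed only if the whole input is consumed.
eval :: String -> Maybe Int
eval s = case pAdditive s of
  Parsed v "" -> Just v
  _           -> Nothing
```

With these definitions, `eval "2*(3+4)"` yields `Just 14`. Every `alt2` re-runs a parse that `alt1` already performed on the same input, which is where the repeated work comes from.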

Slide 25

Slide 25 text

Tabular Parsing
Figure 3-2
    column      C1   C2   C3   C4      C5   C6      C7   C8
    pAdditive                  (7,C7)  X    (4,C7)  X    X
    pMultitive            ↑    (3,C5)  X    (4,C7)  X    X
    pPrimary         ←    ?    (3,C5)  X    (4,C7)  X    X
    pDecimal              X    (3,C5)  X    (4,C7)  X    X
    input       2    *    (    3       +    4       )    (end)

Slide 26

Slide 26 text

Tabular Parsing
Figure 3-2
    column      C1   C2   C3   C4      C5   C6      C7   C8
    pAdditive                  (7,C7)  X    (4,C7)  X    X
    pMultitive            ↑    (3,C5)  X    (4,C7)  X    X
    pPrimary         ←    ?    (3,C5)  X    (4,C7)  X    X
    pDecimal              X    (3,C5)  X    (4,C7)  X    X
    input       2    *    (    3       +    4       )    (end)
Linear parse time (proportional to input)
All results are precomputed and can be referred to directly rather than recursing

Slide 27

Slide 27 text

Tabular Parsing
Figure 3-2
    column      C1   C2   C3   C4      C5   C6      C7   C8
    pAdditive                  (7,C7)  X    (4,C7)  X    X
    pMultitive            ↑    (3,C5)  X    (4,C7)  X    X
    pPrimary         ←    ?    (3,C5)  X    (4,C7)  X    X
    pDecimal              X    (3,C5)  X    (4,C7)  X    X
    input       2    *    (    3       +    4       )    (end)
Linear parse time (proportional to input)
All results are precomputed and can be referred to directly rather than recursing
BUT It computes way more results than necessary!
AND The order in which rules are computed must be carefully chosen!

Slide 28

Slide 28 text

Packrat Parsing
Lazy version of the tabular algorithm, computing results only as needed
Computes results in the same order as a recursive-descent parser
BUT it doesn’t compute the same “cell” more than once, unlike the naive backtracking parser

Slide 29

Slide 29 text

Packrat Parsing
Lazy version of the tabular algorithm, computing results only as needed
Computes results in the same order as a recursive-descent parser
BUT it doesn’t compute the same “cell” more than once, unlike the naive backtracking parser
All of this is accomplished with plain algebraic data types in Haskell!

Slide 30

Slide 30 text

Making a Packrat Parser
Modifying our existing one
    -- A column in our table
    data Derivs = Derivs {
        dvAdditive  :: Result Int,
        dvMultitive :: Result Int,
        dvPrimary   :: Result Int,
        dvDecimal   :: Result Int,
        dvChar      :: Result Char
    }

Slide 31

Slide 31 text

Making a Packrat Parser
Modifying our existing one
    -- A column in our table
    data Derivs = Derivs {
        dvAdditive  :: Result Int,
        dvMultitive :: Result Int,
        dvPrimary   :: Result Int,
        dvDecimal   :: Result Int,
        dvChar      :: Result Char
    }

    -- Modify the Result type
    data Result v = Parsed v Derivs
                  | NoParse

Slide 32

Slide 32 text

Making a Packrat Parser
Modifying our existing one
    -- A column in our table
    data Derivs = Derivs {
        dvAdditive  :: Result Int,
        dvMultitive :: Result Int,
        dvPrimary   :: Result Int,
        dvDecimal   :: Result Int,
        dvChar      :: Result Char
    }

    -- Modify the Result type
    data Result v = Parsed v Derivs
                  | NoParse

    -- And the parsing functions
    pAdditive  :: Derivs -> Result Int
    pMultitive :: Derivs -> Result Int
    pPrimary   :: Derivs -> Result Int
    pDecimal   :: Derivs -> Result Int

Slide 33

Slide 33 text

Rewriting pAdditive
    pAdditive :: Derivs -> Result Int
    pAdditive d = alt1 where
        -- Additive <- Multitive '+' Additive
        alt1 = case dvMultitive d of
                 Parsed vleft d' ->
                   case dvChar d' of
                     Parsed '+' d'' ->
                       case dvAdditive d'' of
                         Parsed vright d''' -> Parsed (vleft + vright) d'''
                         _ -> alt2
                     _ -> alt2
                 _ -> alt2
        -- Additive <- Multitive
        alt2 = dvMultitive d

Slide 34

Slide 34 text

Tie-Up the Structure with data-recursion
    -- Create a result matrix for an input string
    parse :: String -> Derivs
    parse s = d where
        d    = Derivs add mult prim dec chr
        add  = pAdditive d
        mult = pMultitive d
        prim = pPrimary d
        dec  = pDecimal d
        chr  = case s of
                 (c:s') -> Parsed c (parse s')
                 []     -> NoParse
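
Putting the pieces from the last few slides together gives a complete packrat parser. The sketch below fills in `pMultitive`, `pPrimary`, and `pDecimal` by analogy with the rewritten `pAdditive`, so those three are an assumption rather than code shown on the slides; `eval` is a hypothetical convenience wrapper. It relies on Haskell's lazy record fields: each field of a `Derivs` is a thunk that is evaluated at most once.

```haskell
import Data.Char (isDigit, digitToInt)

-- A successful parse now points at the Derivs column for the remaining input.
data Result v = Parsed v Derivs | NoParse

-- One column of the table: every nonterminal's result at one input position.
data Derivs = Derivs
  { dvAdditive  :: Result Int
  , dvMultitive :: Result Int
  , dvPrimary   :: Result Int
  , dvDecimal   :: Result Int
  , dvChar      :: Result Char
  }

-- Build the (lazy) result matrix; d refers to itself through its fields.
parse :: String -> Derivs
parse s = d
  where
    d    = Derivs add mult prim dec chr
    add  = pAdditive d
    mult = pMultitive d
    prim = pPrimary d
    dec  = pDecimal d
    chr  = case s of
      (c:s') -> Parsed c (parse s')
      []     -> NoParse

-- Additive <- Multitive '+' Additive / Multitive
pAdditive :: Derivs -> Result Int
pAdditive d = alt1
  where
    alt1 = case dvMultitive d of
      Parsed vleft d' -> case dvChar d' of
        Parsed '+' d'' -> case dvAdditive d'' of
          Parsed vright d''' -> Parsed (vleft + vright) d'''
          _ -> alt2
        _ -> alt2
      _ -> alt2
    -- Reuses the memoized field instead of re-parsing.
    alt2 = dvMultitive d

-- Multitive <- Primary '*' Multitive / Primary
pMultitive :: Derivs -> Result Int
pMultitive d = alt1
  where
    alt1 = case dvPrimary d of
      Parsed vleft d' -> case dvChar d' of
        Parsed '*' d'' -> case dvMultitive d'' of
          Parsed vright d''' -> Parsed (vleft * vright) d'''
          _ -> alt2
        _ -> alt2
      _ -> alt2
    alt2 = dvPrimary d

-- Primary <- '(' Additive ')' / Decimal
pPrimary :: Derivs -> Result Int
pPrimary d = alt1
  where
    alt1 = case dvChar d of
      Parsed '(' d' -> case dvAdditive d' of
        Parsed v d'' -> case dvChar d'' of
          Parsed ')' d''' -> Parsed v d'''
          _ -> alt2
        _ -> alt2
      _ -> alt2
    alt2 = dvDecimal d

-- Decimal <- '0' / ... / '9'
pDecimal :: Derivs -> Result Int
pDecimal d = case dvChar d of
  Parsed c d' | isDigit c -> Parsed (digitToInt c) d'
  _ -> NoParse

-- Hypothetical helper: succeed only if the whole input is consumed.
eval :: String -> Maybe Int
eval s = case dvAdditive (parse s) of
  Parsed v rest -> case dvChar rest of
    NoParse -> Just v
    _       -> Nothing
  NoParse -> Nothing
```

The only structural difference from the naive parser is that recursive calls such as `pMultitive s` become field lookups such as `dvMultitive d`, so a cell that has already been forced is simply read back.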

Slide 35

Slide 35 text

Derivs expanded
[Figure 3-3: Derivs data structure produced by parsing the string '2*(3+4)'; rows: dvAdditive, dvMultitive, dvPrimary, dvDecimal, dvChar]
Excerpt from the thesis page shown on the slide: "... properly preserves sharing relationships during evaluation, the arrows in the diagram will literally correspond to pointers in the heap, and a given cell in the structure will never be evaluated twice. Shaded boxes represent cells that would never be evaluated at all in the likely case that the dvAdditive result in the leftmost column is the only value ultimately needed by the application. This illustration should make it clear why this algorithm can run in O(n) time under a lazy evaluator for an input string of length n. The top-level parse function is the only ..."
Figure: Parsing “2 * (3 + 4)”

Slide 36

Slide 36 text

mindblown.hs

Slide 37

Slide 37 text

Limitations of Packrat Parsing
Language must be unambiguous
Limited state
Space consumption is large

Slide 38

Slide 38 text

Stuff I’m not covering but is still awesome
Monadic Packrat Parsing
Error Handling
Maintaining Input Position Information
Stateful Parsers
Pappy, the parser generator

Slide 39

Slide 39 text

Why I Love It
Simple principles, powerful results
Written in a very thorough, approachable style
Very honest about tradeoffs

Slide 40

Slide 40 text

Why I Love It
Simple principles, powerful results
Written in a very thorough, approachable style
Very honest about tradeoffs
I hardly use regular expressions for anything complicated anymore.
neotoma is almost more popular than yecc in Erlang open-source projects

Slide 41

Slide 41 text

Further Reading
B. Ford, The Packrat Parsing and Parsing Expression Grammars Page
http://bford.info/packrat/

Slide 42

Slide 42 text

Discussion
Questions? Comments? Ideas?