Sean Cribbs on Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking

Bryan Ford's 2002 master's thesis is remarkable in that it breaks decades of compiler-construction dogma with some simple principles and a compelling alternative to the complexity of parsing with context-free grammars (CFGs). He reveals a forgotten class of grammars, top-down parsing language (TDPL) and some extensions known as parsing expression grammars (PEGs), that directly correspond to the parsers that implement them. His primary contribution, however, is applying modern functional programming techniques of laziness and algebraic data structures to make TDPL/PEG parsers computationally efficient.

Presented at Papers We Love Chicago #1: http://www.meetup.com/Papers-We-Love-Chicago/events/196391192/.

Papers_We_Love

August 20, 2014

Transcript

  1. Packrat Parsing Sean Cribbs Background Contributions Why I Love It

    Discussion Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking Bryan Ford MIT Master’s Thesis 2002 Sean Cribbs @seancribbs Papers We Love Chicago #1 20 August 2014
  2. About Me

    Engineer at Basho, conference junkie
    Favorite CS class was Compiler Construction
    Creator of neotoma, a packrat-parser toolkit for Erlang
  3. Outline

    Languages, Grammars, Parsing
    This Thesis’ Contributions
    Why I Love It
    Discussion
  4. What is parsing?

    Languages express information linearly, as sequences of symbols
    Applications that use languages must derive higher-level constructs: words, phrases, clauses, sentences, expressions, statements, ...
    This is called syntax analysis or parsing
  5. Terminology

    Language
    Grammar
    Terminal/Nonterminal
    Rule (production, reduction)
    Top-down parsing (recursive descent)
    Bottom-up parsing (shift/reduce)
  6. Who is this?

    Figure: © Duncan Rawlinson, CC BY 2.0
  7. Chomsky’s Hierarchy

    regular
    context-free
    context-sensitive
    recursively enumerable

    Figure: © J. Finkelstein, CC BY-SA 3.0
  8. Regular

    Very popular in practical use via regular expressions
    Recognizable via finite-state automata
    Used in scanning, aka “lexical analysis”

    Figure: Deterministic finite-state automaton (DFA)
  9. Context-free

    What we usually think of when we discuss “grammars”: tools like yacc/Bison/ANTLR
    Unlike regular languages, can express recursive constructs
    Recognizable via pushdown automata (stack-based)
    Used in parsing, aka “syntax analysis”

        S → aSb | ε                                  (1)
        ε | ab | aabb | aaabbb | ...                 (2)
        S → a*b*                                     (3)
        a | b | ab | aab | aabb | aabbb | ...        (4)
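    A minimal Haskell sketch of the contrast between the two grammars on this slide. It assumes the second alternative of rule (1) is the empty string, so that rule generates the strings with n a's followed by exactly n b's, while rule (3) is the regular pattern a*b*. The names matchS, inContextFree and inRegular are illustrative and not from the talk.

```haskell
-- Consume the longest prefix derivable from S and return the rest.
-- Assumption: rule (1) is  S -> a S b | (empty string).
-- The recursion plays the role of the pushdown automaton's stack.
matchS :: String -> String
matchS input@('a' : rest) =
  case matchS rest of
    ('b' : rest') -> rest'   -- S -> a S b
    _             -> input   -- fall back to the empty alternative
matchS input = input         -- empty alternative: consume nothing

-- A string is in the context-free language if one S consumes all of it.
inContextFree :: String -> Bool
inContextFree = null . matchS

-- The regular language a*b* needs no recursion and no counting:
-- skip the a's, then everything left must be b's.
inRegular :: String -> Bool
inRegular = all (== 'b') . dropWhile (== 'a')
```

    Under these assumptions "aab" is accepted by inRegular but rejected by inContextFree, because only the recursive version checks that the counts of a's and b's agree.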
  10. Problems with CFGs for machine-readable languages

    Focused on generating strings, not recognizing them
    Often ambiguous, both locally and globally; frequently augmented with manual disambiguation
    Generally require a separate scanning step
    Predictive and shift/reduce parsing algorithms only work with restricted CFGs
  11. This Thesis’ Contributions

    1. Makes Top-Down Parsing Language a practical notation
    2. Packrat-parsing algorithm
    3. A TDPL/Packrat-parser generator for Haskell, Pappy
  12. TDPL: Top-Down Parsing Language

    Formalism for describing top-down parsing algorithms (aka recursive descent)
    Expresses how to read strings rather than write them
    Also known as parsing expression grammar (PEG)
    Fundamentally unambiguous via ordered choice
  13. PEG/TDPL constructs

    Empty string                 ()
    Terminal                     a
    Nonterminal                  A
    Sequence                     e1 e2 e3 ...
    Ordered choice               e1 / e2 / e3 / ...
    Greedy repetition            e*
    Greedy positive repetition   e+
    Optional                     e?
    Followed-by                  &(e)
    Not-followed-by              !(e)
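    One way to make these constructs concrete is a small interpreter with one constructor per construct. This is only a sketch: the names Expr, Grammar and match are hypothetical, the grammar is a plain association list, and results are reduced to "remaining input or failure" with no semantic values.

```haskell
-- One constructor per construct listed on the slide.
data Expr
  = Empty              -- ()
  | Term Char          -- a
  | NonTerm String     -- A
  | Seq Expr Expr      -- e1 e2
  | Choice Expr Expr   -- e1 / e2
  | Star Expr          -- e*
  | Plus Expr          -- e+
  | Opt Expr           -- e?
  | And Expr           -- &(e)
  | Not Expr           -- !(e)

-- Rule name paired with its defining expression.
type Grammar = [(String, Expr)]

-- Return the unconsumed input on success, Nothing on failure.
match :: Grammar -> Expr -> String -> Maybe String
match _ Empty s = Just s
match _ (Term c) (x : xs) | x == c = Just xs
match _ (Term _) _ = Nothing
-- An undefined nonterminal simply fails.
match g (NonTerm n) s = lookup n g >>= \e -> match g e s
match g (Seq a b) s = match g a s >>= match g b
-- Ordered choice: the second branch is tried only if the first fails.
match g (Choice a b) s =
  case match g a s of
    Nothing -> match g b s
    ok      -> ok
-- Greedy repetition; stop if e fails or consumes nothing.
match g (Star e) s =
  case match g e s of
    Just s' | length s' < length s -> match g (Star e) s'
    _                              -> Just s
match g (Plus e) s = match g (Seq e (Star e)) s
match g (Opt e) s = match g (Choice e Empty) s
-- Predicates look ahead without consuming input.
match g (And e) s = s <$ match g e s
match g (Not e) s =
  case match g e s of
    Nothing -> Just s
    Just _  -> Nothing
```

    With this sketch, match [] (Seq (Star (Term 'a')) (Term 'a')) "aaa" fails: the greedy repetition takes every 'a' and a PEG never gives one back, which is why predicates are needed to stop a repetition early.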
  14. PEG advantages

    Tokenization can be done inline (scannerless parsing)
    Constructs like “reserved words” can be expressed directly
    Associativity is more directly controlled
  17. Longest-match disambiguation

    The dangling else problem

        IF a THEN IF b THEN c ELSE d

        IF a THEN (IF b THEN c) ELSE d
        IF a THEN (IF b THEN c ELSE d)

    Languages usually pick the latter. Ordered choice makes this disambiguation explicit:

        if_stmt <- "IF" e "THEN" s "ELSE" s
                 / "IF" e "THEN" s
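    The effect of putting the ELSE alternative first can be sketched with a tiny hand-written parser over whitespace-separated tokens. The Stmt type and pStmt function are hypothetical names for illustration; conditions and simple statements are single words, and the shared "IF" e "THEN" s prefix is parsed once instead of being retried as a literal PEG interpreter would.

```haskell
-- A statement is either a single word or an IF with an optional ELSE branch.
data Stmt
  = Simple String
  | If String Stmt (Maybe Stmt)
  deriving (Eq, Show)

keywords :: [String]
keywords = ["IF", "THEN", "ELSE"]

-- s <- if_stmt / simple
-- Returns the parsed statement and the remaining tokens.
pStmt :: [String] -> Maybe (Stmt, [String])
pStmt ("IF" : cond : "THEN" : rest) = do
  (thenS, rest') <- pStmt rest
  case rest' of
    -- First alternative: "IF" e "THEN" s "ELSE" s
    ("ELSE" : rest'')
      | Just (elseS, rest''') <- pStmt rest'' ->
          Just (If cond thenS (Just elseS), rest''')
    -- Second alternative: "IF" e "THEN" s
    _ -> Just (If cond thenS Nothing, rest')
pStmt (w : rest)
  | w `notElem` keywords = Just (Simple w, rest)
pStmt _ = Nothing
```

    Parsing words "IF a THEN IF b THEN c ELSE d" with pStmt attaches the ELSE to the inner IF: the inner statement tries its ELSE alternative first and claims the ELSE before the outer statement ever sees it.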
  18. PEG limitations

    Cannot express ambiguous syntax
    Some CFGs have no corresponding PEG
    Repetition is always greedy; the workaround is to use predicates
    Left-recursion is erroneous*
  19. Parsing

    Consider the PEG:

        Additive  <- Multitive '+' Additive / Multitive
        Multitive <- Primary '*' Multitive / Primary
        Primary   <- '(' Additive ')' / Decimal
        Decimal   <- '0' / ... / '9'
  21. Simple Top-Down Parser: Recursive Descent with Backtracking

        data Result v = Parsed v String
                      | NoParse

        -- calls itself and pMultitive
        pAdditive :: String -> Result Int
        -- calls itself and pPrimary
        pMultitive :: String -> Result Int
        -- calls pAdditive and pDecimal
        pPrimary :: String -> Result Int
        -- consumes a digit
        pDecimal :: String -> Result Int
  22. Simple Top-Down Parser: Recursive Descent with Backtracking

        -- Parse an additive-precedence expression
        pAdditive :: String -> Result Int
        pAdditive s = alt1 where

            -- Additive <- Multitive '+' Additive
            alt1 = case pMultitive s of
                     Parsed vleft s' ->
                       case s' of
                         ('+':s'') ->
                           case pAdditive s'' of
                             Parsed vright s''' -> Parsed (vleft + vright) s'''
                             _ -> alt2
                         _ -> alt2
                     _ -> alt2
  24. Simple Top-Down Parser: Recursive Descent with Backtracking

        -- continued from previous slide

            -- Additive <- Multitive
            alt2 = case pMultitive s of
                     Parsed v s' -> Parsed v s'
                     NoParse -> NoParse

    On a plain Multitive expression, we compute that twice!
    Worst-case backtracking can result in exponential parse times: O(2^n)
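    For reference, here is a complete, compilable sketch of the naive backtracking parser. pAdditive follows the slides; pMultitive, pPrimary and pDecimal are filled in by analogy with it and with the grammar, so treat their exact shape as an assumption. The eval wrapper is a hypothetical helper added for experimenting.

```haskell
import Data.Char (digitToInt, isDigit)

-- A parsed value plus the remaining input, or failure.
data Result v = Parsed v String | NoParse

-- Additive <- Multitive '+' Additive / Multitive
pAdditive :: String -> Result Int
pAdditive s = alt1 where
  alt1 = case pMultitive s of
    Parsed vleft s' -> case s' of
      ('+' : s'') -> case pAdditive s'' of
        Parsed vright s''' -> Parsed (vleft + vright) s'''
        _ -> alt2
      _ -> alt2
    _ -> alt2
  alt2 = pMultitive s   -- re-parses the same input from scratch

-- Multitive <- Primary '*' Multitive / Primary
pMultitive :: String -> Result Int
pMultitive s = alt1 where
  alt1 = case pPrimary s of
    Parsed vleft s' -> case s' of
      ('*' : s'') -> case pMultitive s'' of
        Parsed vright s''' -> Parsed (vleft * vright) s'''
        _ -> alt2
      _ -> alt2
    _ -> alt2
  alt2 = pPrimary s     -- re-parses the same input from scratch

-- Primary <- '(' Additive ')' / Decimal
pPrimary :: String -> Result Int
pPrimary s = alt1 where
  alt1 = case s of
    ('(' : s') -> case pAdditive s' of
      Parsed v (')' : s'') -> Parsed v s''
      _ -> alt2
    _ -> alt2
  alt2 = pDecimal s

-- Decimal <- '0' / ... / '9'
pDecimal :: String -> Result Int
pDecimal (c : s') | isDigit c = Parsed (digitToInt c) s'
pDecimal _ = NoParse

-- Hypothetical helper: succeed only if the whole input is consumed.
eval :: String -> Maybe Int
eval s = case pAdditive s of
  Parsed v "" -> Just v
  _ -> Nothing
```

    In this sketch each extra level of parentheses around a digit roughly quadruples the work, since pAdditive calls pMultitive twice on the same input and pMultitive in turn calls pPrimary twice. Inputs such as "((((((((((7))))))))))" make the exponential growth easy to observe.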
  27. Tabular Parsing

    Figure 3-2

    column      C1  C2  C3  C4      C5  C6      C7  C8
    pAdditive               (7,C7)  X   (4,C7)  X   X
    pMultitive          ↑   (3,C5)  X   (4,C7)  X   X
    pPrimary        ←   ?   (3,C5)  X   (4,C7)  X   X
    pDecimal            X   (3,C5)  X   (4,C7)  X   X
    input       2   *   (   3       +   4       )   (end)

    Linear parse time (proportional to input)
    All results are precomputed and can be referred to directly rather than recursing
    BUT it computes way more results than necessary!
    AND the order in which rules are computed must be carefully chosen!
  29. Packrat Parsing

    Lazy version of the tabular algorithm, computing results only as needed
    Computes results in the same order as a recursive-descent parser
    BUT it doesn’t compute the same “cell” more than once, unlike the naive backtracking parser
    All of this is accomplished with plain algebraic data types in Haskell!
  32. Making a Packrat Parser: Modifying our existing one

        -- A column in our table
        data Derivs = Derivs {
                        dvAdditive  :: Result Int,
                        dvMultitive :: Result Int,
                        dvPrimary   :: Result Int,
                        dvDecimal   :: Result Int,
                        dvChar      :: Result Char
                      }

        -- Modify the Result type
        data Result v = Parsed v Derivs
                      | NoParse

        -- And the parsing functions
        pAdditive  :: Derivs -> Result Int
        pMultitive :: Derivs -> Result Int
        pPrimary   :: Derivs -> Result Int
        pDecimal   :: Derivs -> Result Int
  33. Rewriting pAdditive

        pAdditive :: Derivs -> Result Int
        pAdditive d = alt1 where

            -- Additive <- Multitive '+' Additive
            alt1 = case dvMultitive d of
                     Parsed vleft d' ->
                       case dvChar d' of
                         Parsed '+' d'' ->
                           case dvAdditive d'' of
                             Parsed vright d''' -> Parsed (vleft + vright) d'''
                             _ -> alt2
                         _ -> alt2
                     _ -> alt2

            -- Additive <- Multitive
            alt2 = dvMultitive d
  34. Tie up the structure with data recursion

        -- Create a result matrix for an input string
        parse :: String -> Derivs
        parse s = d where
            d    = Derivs add mult prim dec chr
            add  = pAdditive d
            mult = pMultitive d
            prim = pPrimary d
            dec  = pDecimal d
            chr  = case s of
                     (c:s') -> Parsed c (parse s')
                     [] -> NoParse
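    Putting the pieces from the last few slides together gives a complete packrat parser for the example grammar. Derivs, Result, pAdditive and parse follow the slides; pMultitive, pPrimary and pDecimal are filled in by analogy and should be read as assumptions, and eval is a hypothetical wrapper for trying it out.

```haskell
import Data.Char (digitToInt, isDigit)

-- A parsed value plus the Derivs for the remaining input, or failure.
data Result v = Parsed v Derivs | NoParse

-- One column of the table: every rule's result at one input position.
data Derivs = Derivs
  { dvAdditive  :: Result Int
  , dvMultitive :: Result Int
  , dvPrimary   :: Result Int
  , dvDecimal   :: Result Int
  , dvChar      :: Result Char
  }

-- Additive <- Multitive '+' Additive / Multitive
pAdditive :: Derivs -> Result Int
pAdditive d = alt1 where
  alt1 = case dvMultitive d of
    Parsed vleft d' -> case dvChar d' of
      Parsed '+' d'' -> case dvAdditive d'' of
        Parsed vright d''' -> Parsed (vleft + vright) d'''
        _ -> alt2
      _ -> alt2
    _ -> alt2
  alt2 = dvMultitive d   -- reuses the memoized cell, no re-parse

-- Multitive <- Primary '*' Multitive / Primary
pMultitive :: Derivs -> Result Int
pMultitive d = alt1 where
  alt1 = case dvPrimary d of
    Parsed vleft d' -> case dvChar d' of
      Parsed '*' d'' -> case dvMultitive d'' of
        Parsed vright d''' -> Parsed (vleft * vright) d'''
        _ -> alt2
      _ -> alt2
    _ -> alt2
  alt2 = dvPrimary d

-- Primary <- '(' Additive ')' / Decimal
pPrimary :: Derivs -> Result Int
pPrimary d = alt1 where
  alt1 = case dvChar d of
    Parsed '(' d' -> case dvAdditive d' of
      Parsed v d'' -> case dvChar d'' of
        Parsed ')' d''' -> Parsed v d'''
        _ -> alt2
      _ -> alt2
    _ -> alt2
  alt2 = dvDecimal d

-- Decimal <- '0' / ... / '9'
pDecimal :: Derivs -> Result Int
pDecimal d = case dvChar d of
  Parsed c d' | isDigit c -> Parsed (digitToInt c) d'
  _ -> NoParse

-- Build the lazy result matrix; each field is a thunk evaluated at most once.
parse :: String -> Derivs
parse s = d where
  d    = Derivs add mult prim dec chr
  add  = pAdditive d
  mult = pMultitive d
  prim = pPrimary d
  dec  = pDecimal d
  chr  = case s of
    (c : s') -> Parsed c (parse s')
    []       -> NoParse

-- Hypothetical helper: succeed only if the whole input is consumed.
eval :: String -> Maybe Int
eval s = case dvAdditive (parse s) of
  Parsed v rest -> case dvChar rest of
    NoParse -> Just v
    _       -> Nothing
  NoParse -> Nothing
```

    The design choice worth noticing is that the parsing functions never call each other directly. They only read fields of a Derivs value, so laziness gives memoization for free: the second alternative of each rule looks up a cell that the first alternative already forced.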
  35. Derivs expanded

    Figure: Parsing “2 * (3 + 4)” (Figure 3-3 of the thesis: Derivs data structure produced by parsing the string ‘2*(3+4)’, with rows dvAdditive, dvMultitive, dvPrimary, dvDecimal, dvChar)

    Thesis text visible on the slide: ... properly preserves sharing relationships during evaluation, the arrows in the diagram will literally correspond to pointers in the heap, and a given cell in the structure will never be evaluated twice. Shaded boxes represent cells that would never be evaluated at all in the likely case that the dvAdditive result in the leftmost column is the only value ultimately needed by the application. This illustration should make it clear why this algorithm can run in O(n) time under a lazy evaluator for an input string of length n. The top-level parse function is the only ...
  36. Limitations of Packrat Parsing

    Language must be unambiguous
    Limited state
    Space consumption is large
  37. Stuff I’m not covering but is still awesome

    Monadic Packrat Parsing
    Error Handling
    Maintaining Input Position Information
    Stateful Parsers
    Pappy, the parser generator
  39. Why I Love It

    Simple principles, powerful results
    Written in a very thorough, approachable style
    Very honest about tradeoffs
    I hardly use regular expressions for anything complicated anymore.
    neotoma is almost more popular than yecc in Erlang open-source projects
  40. Further Reading

    B. Ford, The Packrat Parsing and Parsing Expression Grammars Page, http://bford.info/packrat/
  41. Discussion

    Questions? Comments? Ideas?