Making Parsers Extensible

Here are the slides of my presentation at SLE 2015, in which I explain that current parsers fail at a variety of tasks, and how to make parsers extensible so that users may overcome these hurdles by writing custom extensions.

Nicolas Laurent

October 27, 2015

Transcript

  1. What’s in a grammar formalism?
    • A fixed set of matchers
    • primitive matchers (characters, tokens)
    • combinators (sequence, choice, Kleene star)
    • Recursion
  2. Things you can’t do in CFG/PEG
    • Longest-match choice
    • Mixed repetition matching: regular (CFG) & greedy (PEG)
    • Dependent grammars
    • Things we haven't imagined yet
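    To make the first bullet concrete, here is a minimal sketch of a longest-match choice next to a PEG-style ordered choice. The `Recognizer` interface and the class names are hypothetical and invented for illustration; they are not taken from the talk or from Autumn.

```java
// Hypothetical minimal interface: returns the end position of the match, or -1 on failure.
interface Recognizer {
    int match(String input, int pos);
}

// Primitive matcher: matches a fixed string.
final class Literal implements Recognizer {
    private final String text;
    Literal(String text) { this.text = text; }
    public int match(String input, int pos) {
        return input.startsWith(text, pos) ? pos + text.length() : -1;
    }
}

// PEG-style ordered choice: the first alternative that matches wins.
final class OrderedChoice implements Recognizer {
    private final Recognizer[] alternatives;
    OrderedChoice(Recognizer... alternatives) { this.alternatives = alternatives; }
    public int match(String input, int pos) {
        for (Recognizer alternative : alternatives) {
            int end = alternative.match(input, pos);
            if (end != -1) return end;
        }
        return -1;
    }
}

// Longest-match choice: try every alternative and keep the furthest end position.
final class LongestChoice implements Recognizer {
    private final Recognizer[] alternatives;
    LongestChoice(Recognizer... alternatives) { this.alternatives = alternatives; }
    public int match(String input, int pos) {
        int best = -1;
        for (Recognizer alternative : alternatives) {
            best = Math.max(best, alternative.match(input, pos));
        }
        return best;
    }
}
```

    With the alternatives "if" and "ifdef" in that order, the ordered choice stops after "if" on the input "ifdef X", while the longest-match choice consumes "ifdef".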
  3. Dependent Grammars
    A dependent grammar depends on the sentence being checked.
    • Balanced XML tags (<foo>…</foo>): XML
    • typedef hack (x*x;): C/C++
    • Length-prefixed fields: net protocols
    • User-defined operators: OCaml
    • Significant indentation: Python
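    As an illustration of the length-prefixed case, the sketch below parses a field written as a decimal length, a colon, then exactly that many characters (for instance "5:hello"). The field format and the `LengthPrefixed` class are assumptions made up for this example. What it shows is that how much input the payload consumes depends on a value read earlier in the same sentence.

```java
// Hypothetical field format, assumed for illustration: <decimal length> ':' <payload>.
final class LengthPrefixed {

    /** Returns the payload starting at pos, or null if the field is malformed. */
    static String parseField(String input, int pos) {
        int i = pos;
        long length = 0;
        // Read the decimal length prefix.
        while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
            length = length * 10 + (input.charAt(i) - '0');
            if (length > input.length()) return null; // cannot possibly fit
            i++;
        }
        if (i == pos) return null;                                       // no digits
        if (i >= input.length() || input.charAt(i) != ':') return null;  // missing separator
        int start = i + 1;
        // The amount of input consumed here depends on the value just parsed.
        if (start + length > input.length()) return null;
        return input.substring(start, (int) (start + length));
    }
}
```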
  4. How we do it
    • Custom matchers (parsing expressions)
    • Grammar transformations
    • Custom parse state
  5. A grammar is a graph of parsing expressions
    A ::= B | C | w
    B ::= x y A
    C ::= z w
    [Figure: the grammar drawn as a graph with nodes A, B, C, x, y, z, w]
  6. Parse State
    • void parse(Parser parser, ParseState state);
    • state = input, output, or both
    • examples: input position, parse tree, precedence level, significant indentation level, XML tag names
  7. Why the fuss?
    • What to do with state modifications that happened in branches discarded through backtracking?
    • Identifying and rolling back state changes
    • Things that are both inputs and outputs (position, indentation level)
  8. Parse State Rules
    • Divided between committed and uncommitted state
    • Roughly, committed = input & uncommitted = output
    • When invoking an expression: no uncommitted state
    • When returning from an expression: committed state the same as when invoked.
  9. Input Position as Parse State
    • committed: int start
    • uncommitted: int end
    • when invoking: (start0 = start) == end
    • when returning: start == start0
  10. void parse(Parser parser, ParseState state) {
          int start0 = state.start;
          for (ParsingExpression operand : operands) {
              operand.parse(parser, state);
              if (state.end != -1) {
                  state.start = state.end;  // operand matched: advance the input position
              } else {
                  state.start = start0;     // operand failed: restore the position
                  state.end = -1;           // and report failure
                  return;
              }
          }
          state.start = start0;             // success: start is restored, end holds the result
      }
  11. void parse(Parser parser, ParseState state) {
          Snapshot snapshot = state.snapshot();
          for (ParsingExpression operand : operands) {
              operand.parse(parser, state);
              if (state.succeeded()) {
                  state.commit();           // fold the operand's output into the committed state
              } else {
                  state.restore(snapshot);  // discard all changes made since the snapshot
                  state.fail(this);
                  return;
              }
          }
          state.uncommit(snapshot);         // success: committed state as when invoked, output kept
      }
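    One way to read the two versions of the sequence code together is to give the state operations a position-only implementation. The sketch below is a guess at what such a state could look like, assuming the snapshot is just the saved start position. `PositionState`, `Expr`, `Lit` and `Seq` are hypothetical names, and the method bodies are inferred from the slides rather than copied from Autumn.

```java
// Hypothetical position-only parse state; method bodies inferred from the slides.
final class PositionState {
    final String input;
    int start = 0;  // committed: position at which the current expression was invoked
    int end = 0;    // uncommitted: position where the last match ended, -1 after a failure

    PositionState(String input) { this.input = input; }

    int snapshot()          { return start; }
    boolean succeeded()     { return end != -1; }
    void commit()           { start = end; }              // fold the output into the input
    void restore(int snap)  { start = snap; end = snap; } // drop everything since the snapshot
    void uncommit(int snap) { start = snap; }             // reset the input, keep the output
    void fail()             { end = -1; }
}

interface Expr {
    void parse(PositionState state);
}

// Primitive matcher: matches a fixed string at the committed position.
final class Lit implements Expr {
    private final String text;
    Lit(String text) { this.text = text; }
    public void parse(PositionState s) {
        if (s.input.startsWith(text, s.start)) s.end = s.start + text.length();
        else s.fail();
    }
}

// Sequence combinator, written against the state operations only.
final class Seq implements Expr {
    private final Expr[] operands;
    Seq(Expr... operands) { this.operands = operands; }
    public void parse(PositionState s) {
        int snap = s.snapshot();
        for (Expr operand : operands) {
            operand.parse(s);
            if (s.succeeded()) {
                s.commit();
            } else {
                s.restore(snap);
                s.fail();
                return;
            }
        }
        s.uncommit(snap);
    }
}
```

    Under these assumptions, the sequence never touches `start` or `end` directly, so adding another kind of state would only change `PositionState`.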
  12. I came here to learn about parsing expressions, but all I got was a talk about extensibility :(
  13. Left-Recursion Algorithm
    Assume a left-recursive node R invoked at position p, and a map M: (node, position) -> output.
    1. M[R,p] = failure
    2. Parse the expression; when encountering left-recursion, return M[R,p].
    3. If more input was consumed than in M[R,p], overwrite M[R,p] and go to (2). Else, finish.
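    The three steps can be sketched for a single left-recursive rule, E ::= E '-' digit | digit. This is a standalone illustration, not Autumn's implementation: the `LeftRecursionDemo` class, its `Result` type and the example grammar are assumptions, and because there is only one left-recursive node the map is keyed by position alone.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: seed growing for the rule  E ::= E '-' digit | digit
final class LeftRecursionDemo {

    // Output of a parse: end position and a parenthesized tree. null stands for failure.
    static final class Result {
        final int end;
        final String tree;
        Result(int end, String tree) { this.end = end; this.tree = tree; }
    }

    private final String input;
    // The map M, restricted to the single node E: position -> current output.
    private final Map<Integer, Result> seeds = new HashMap<>();

    LeftRecursionDemo(String input) { this.input = input; }

    Result parseE(int pos) {
        if (seeds.containsKey(pos)) return seeds.get(pos); // left-recursion: return M[R,p]
        seeds.put(pos, null);                              // step 1: M[R,p] = failure
        while (true) {
            Result attempt = parseBody(pos);               // step 2: parse the expression
            Result seed = seeds.get(pos);
            boolean grew = attempt != null && (seed == null || attempt.end > seed.end);
            if (!grew) break;                              // step 3: no more input consumed
            seeds.put(pos, attempt);                       // step 3: overwrite and try again
        }
        return seeds.remove(pos);
    }

    private Result parseBody(int pos) {
        Result left = parseE(pos);                         // the left-recursive invocation
        if (left != null && charAt(left.end) == '-') {
            Result right = digit(left.end + 1);
            if (right != null) {
                return new Result(right.end, "(" + left.tree + "-" + right.tree + ")");
            }
        }
        return digit(pos);                                 // second alternative
    }

    private Result digit(int pos) {
        char c = charAt(pos);
        return (c >= '0' && c <= '9') ? new Result(pos + 1, String.valueOf(c)) : null;
    }

    private char charAt(int pos) {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }
}
```

    On the input "1-2-3" the seed grows from 1 to (1-2) to ((1-2)-3), which gives the left-associative tree that the rule describes.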
  14. Left-Recursion: Custom Parsing Expression
    • Performs the left-recursion algorithm
    • Wraps one node in the recursive cycle (via a graph transformation)
  15. Left-Recursion: Custom State
    • The M map: (node, position) -> output
    • In reality: a node -> output map tied to the current input position
    • commit(): discards the map when advancing the input position
    • snapshot(), uncommit(), restore(): restore the old map
  16. Summary
    • Parser extensibility is useful. (formalism extensions, dependent grammars, optimizations)
    • It can be implemented with custom matchers, graph transformations and custom parse state.
    • Shown to solve practical problems
  17. Awesome! Where do I get this stuff?
    http://github.com/norswap/autumn
    • Combinator library (~ grammar interpreter)
    • Java
    • User manual soon, pinky promise