
Parsing: How Does it Work?

Bucharest FP

April 20, 2016

Transcript

  1. Parsing: How Does it Work?

    Ionuț G. Stan, Bucharest FP, April 2016
  2. The Plan

    • Compilers Overview
    • Vehicle Language: MiniML
    • Parsing
      • Lexer
      • Parser
  3. Compilers Overview

  4. Compilers Overview (diagram: Compiler)

  5. Compilers Overview (diagram: Source Code, Compiler)

  6. Compilers Overview (diagram: Source Code, Compiler)

    Source Code: (fn a => a) 2
  7. Compilers Overview (diagram: Source Code, Compiler, Target Language)

    Source Code: (fn a => a) 2
  8. Compilers Overview (diagram: Source Code, Compiler, Target Language)

    Source Code:     (fn a => a) 2
    Target Language: (function(a){return a;})(2);
  9. Compilers Overview (diagram: Source Code, Compiler, Target Language)

  10. Parsing (diagram: Compiler, Parser)

  11. Abstract Syntax Tree (diagram: Compiler, Parser, Abstract Syntax Tree (AST))

  12. Abstract Syntax Tree (AST)

    APP
      FUN a
        VAR a
      INT 2
  13. Code Generation (diagram: Parser, AST, CodeGen)

    APP
      FUN a
        VAR a
      INT 2
  14. Many Intermediate Phases (diagram: Parser, AST, ..., CodeGen)

  15. Type Checking (diagram: Parser, AST, Type Checker, Typed AST, ..., CodeGen)

  16. Last Year's Talk (diagram: Parser, AST, Type Checker, Typed AST, ..., CodeGen, with a "Last Year" marker)

  17. Today's Talk (diagram: Parser, AST, Type Checker, Typed AST, ..., CodeGen, with a "Today" marker)

  18. Vehicle Language MiniML

  19. MiniML

    1. Integers: 1, 2, 3, ...
    2. Identifiers (letters only): foo, bar, baz, etc.
    3. Booleans: true and false
    4. Anonymous functions (lambdas): fn a => a
    5. Function application: inc 42
    6. If expressions: if cond then t else f
    7. Addition and subtraction: a + b, a - c
    8. Parenthesized expressions: (expr)
  20. MiniML

    9. Single-binding let blocks:

        let
          val name = ...
        in
          ...
        end
  21. MiniML: Small Example

    let
      val inc = fn a => a + 1
    in
      inc 42
    end
  22. Parsing

  23. Parsing

    • Purpose: recover structure from text
    • Traditionally divided into two phases:
      • Lexing: groups characters into words (tokens)
      • Parsing: groups words into phrases (AST)
    • Other names for lexer: scanner or tokenizer
    • Scannerless parsers exist too
    • Why lexer + parser then? Mostly efficiency
  24. Lexer

  25. Lexer (diagram: Lexer)

  26. Lexer

    Input: (,f,n, ,a, ,=,>, ,a,), ,2

    • Expects a stream of characters or bytes
    • Groups them into atomic semantic units: tokens
  27. Lexer

    Input:  (,f,n, ,a, ,=,>, ,a,), ,2
    Output: (,fn,a,=>,a,),2

    • Expects a stream of characters or bytes
    • Groups them into atomic semantic units: tokens
  28. Lexer

    val tokens = source.split(" ")

    • Grouping can be thought of as "split by space"
    • Why not exactly that?
  29. Lexer

    • Grouping can be thought of as "split by space"
    • Why not exactly that?
  30. Lexer

    • Grouping can be thought of as "split by space"
    • Why not exactly that? Consider this:

        val sum = 1 + 2

        val sum=1+2

        val str = "spaces matter here"

        val str = "spaces /* matter */ here"
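The question on the slide can be tried out directly. Below is a minimal sketch that takes "split by space" literally; the object name `NaiveSplit` and the method `tokens` are made up for this illustration and are not from the talk's code.

```scala
object NaiveSplit {
  // The "split by space" idea, taken literally (hypothetical helper).
  def tokens(source: String): List[String] =
    source.split(" ").toList
}
```

With generous spacing, `NaiveSplit.tokens("val sum = 1 + 2")` gives the six words one would hope for. Without spaces, `NaiveSplit.tokens("val sum=1+2")` gives only `List("val", "sum=1+2")`, so `sum`, `=`, `1`, `+` and `2` are never separated. The string literal case breaks the other way: `"spaces matter here"` is cut into three pieces even though it should be a single token.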
  31. Lexical Grammar

    • Lists the rules for grouping characters into tokens
    • Rules specified using regular expressions
    • Easy to implement with a RegExp library
    • Not extremely difficult without one, either
    • Or use a generator, e.g., lex, flex, alex, etc.
  32. MiniML: Lexical Grammar

    • integers: 0|[1-9][0-9]*
    • identifiers: [a-zA-Z]+
    • symbols: =>, =, +, -, (, )
    • keywords: if, then, else, let, val, in, end, fn, true, false
  37. Token Representation

    sealed trait Token

    object Token {
      case class INT(value: Int) extends Token
      case class VAR(value: String) extends Token
      case object IF extends Token
      case object THEN extends Token
      case object ELSE extends Token
      case object FN extends Token
      case object DARROW extends Token
      case object LET extends Token
      case object VAL extends Token
      case object EQUAL extends Token
      case object IN extends Token
      case object END extends Token
      case object LPAREN extends Token
      case object RPAREN extends Token
      case object ADD extends Token
      case object SUB extends Token
      case object TRUE extends Token
      case object FALSE extends Token
    }
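To connect the lexical grammar with this token type, here is a minimal sketch of a regex-based lexer. It is not the code from the talk's repository: the object name `LexerSketch`, the `tokenize` function and the helper tables are assumptions made for this example. The `Token` type is repeated from the slide so the block compiles on its own.

```scala
object LexerSketch {
  sealed trait Token
  object Token {
    case class INT(value: Int) extends Token
    case class VAR(value: String) extends Token
    case object IF extends Token
    case object THEN extends Token
    case object ELSE extends Token
    case object FN extends Token
    case object DARROW extends Token
    case object LET extends Token
    case object VAL extends Token
    case object EQUAL extends Token
    case object IN extends Token
    case object END extends Token
    case object LPAREN extends Token
    case object RPAREN extends Token
    case object ADD extends Token
    case object SUB extends Token
    case object TRUE extends Token
    case object FALSE extends Token
  }
  import Token._

  // Patterns taken from the lexical grammar slide.
  private val IntPattern = "0|[1-9][0-9]*".r
  private val IdPattern = "[a-zA-Z]+".r

  // Hypothetical lookup tables; "=>" is listed before "=" so the longer symbol wins.
  private val symbols: List[(String, Token)] = List(
    "=>" -> DARROW, "=" -> EQUAL, "+" -> ADD, "-" -> SUB, "(" -> LPAREN, ")" -> RPAREN
  )
  private val keywords: Map[String, Token] = Map(
    "if" -> IF, "then" -> THEN, "else" -> ELSE, "let" -> LET, "val" -> VAL,
    "in" -> IN, "end" -> END, "fn" -> FN, "true" -> TRUE, "false" -> FALSE
  )

  def tokenize(source: String): List[Token] = {
    @annotation.tailrec
    def loop(rest: String, acc: List[Token]): List[Token] =
      if (rest.isEmpty) acc.reverse
      else if (rest.head.isWhitespace) loop(rest.tail, acc) // whitespace only separates tokens
      else {
        val number = IntPattern.findPrefixOf(rest)
        val word = IdPattern.findPrefixOf(rest)
        val symbol = symbols.find { case (text, _) => rest.startsWith(text) }
        (number, word, symbol) match {
          case (Some(digits), _, _) =>
            // toInt throws on literals that do not fit in an Int; ignored in this sketch.
            loop(rest.drop(digits.length), INT(digits.toInt) :: acc)
          case (_, Some(letters), _) =>
            // A run of letters is a keyword if it is in the table, otherwise an identifier.
            loop(rest.drop(letters.length), keywords.getOrElse(letters, VAR(letters)) :: acc)
          case (_, _, Some((text, token))) =>
            loop(rest.drop(text.length), token :: acc)
          case _ =>
            throw new IllegalArgumentException(s"unexpected character: ${rest.head}")
        }
      }
    loop(source, Nil)
  }
}
```

For the running example, `LexerSketch.tokenize("(fn a => a) 2")` yields `LPAREN, FN, VAR("a"), DARROW, VAR("a"), RPAREN, INT(2)`, and `val sum=1+2` is split correctly even without spaces. Slicing the string with `drop` on every step is simple but not efficient; a real lexer would more likely track an index into the input.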
  38. Parser

  39. Parser

    Input: (,fn,a,=>,a,),2

    Output:

        APP
          FUN a
            VAR a
          INT 2

  40. Syntactic Grammar

    • A grammar tells how tokens can be used together in phrases
    • Lexically and syntactically correct: val a = 1
    • Lexically correct, but syntactically incorrect: val val val
    • You need a grammar before writing any code, so…
      • Either take it from somewhere, or…
      • Write it yourself, or…
      • End up with something like PHP
  41. PHP before 5.4

    function wat() {
      return array('W', 'A', 'T');
    }

    echo wat()[0]; // syntax error, unexpected '['
  42. MiniML: Syntactic Grammar

    <EXP> ::= <EXP> <EXP>            ; function application
            | <EXP> "+" <EXP>
            | <EXP> "-" <EXP>
            | "fn" <VAR> "=>" <EXP>
            | "if" <EXP> "then" <EXP> "else" <EXP>
            | "let" "val" <VAR> "=" <EXP> "in" <EXP> "end"
            | "(" <EXP> ")"
            | <BOOL>
            | <INT>
            | <VAR>
  43. AST Representation

    sealed trait Absyn

    object Absyn {
      case class APP(fn: Absyn, arg: Absyn) extends Absyn
      case class ADD(a: Absyn, b: Absyn) extends Absyn
      case class SUB(a: Absyn, b: Absyn) extends Absyn
      case class IF(test: Absyn, yes: Absyn, no: Absyn) extends Absyn
      case class FN(param: String, body: Absyn) extends Absyn
      case class LET(binding: String, value: Absyn, body: Absyn) extends Absyn
      case class BOOL(value: Boolean) extends Absyn
      case class INT(value: Int) extends Absyn
      case class VAR(name: String) extends Absyn
    }
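As an illustration of how the two MiniML examples from earlier slides might look in this representation, here is a small sketch. The wrapper object `AstExamples` and the value names are made up for the example; the `Absyn` type is repeated from the slide so the block stands alone. The second value assumes that a lambda body extends as far to the right as possible, which the grammar on the slide does not spell out.

```scala
object AstExamples {
  sealed trait Absyn
  object Absyn {
    case class APP(fn: Absyn, arg: Absyn) extends Absyn
    case class ADD(a: Absyn, b: Absyn) extends Absyn
    case class SUB(a: Absyn, b: Absyn) extends Absyn
    case class IF(test: Absyn, yes: Absyn, no: Absyn) extends Absyn
    case class FN(param: String, body: Absyn) extends Absyn
    case class LET(binding: String, value: Absyn, body: Absyn) extends Absyn
    case class BOOL(value: Boolean) extends Absyn
    case class INT(value: Int) extends Absyn
    case class VAR(name: String) extends Absyn
  }
  import Absyn._

  // (fn a => a) 2
  // The diagram slides label the lambda node FUN; the Scala type calls it FN.
  val identityApplied: Absyn =
    APP(FN("a", VAR("a")), INT(2))

  // let val inc = fn a => a + 1 in inc 42 end
  // Assumption: the lambda body is the whole of `a + 1`.
  val incExample: Absyn =
    LET("inc", FN("a", ADD(VAR("a"), INT(1))), APP(VAR("inc"), INT(42)))
}
```

Note that parentheses leave no trace in the tree: grouping is expressed purely by nesting, which is why there is no node for `"(" <EXP> ")"`.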
  44. Parsing Strategies (a)

    • Two styles:
      • Top-down parsing: builds AST from root
      • Bottom-up parsing: builds AST from leaves
    • Top-down is easy to write by hand
    • Bottom-up is not, but it's used by generators
    • Parser generators: YACC, ANTLR, Bison, etc.
  45. Parsing Strategies (b)

    • Today: recursive descent parser (top-down style)
    • Very popular (e.g., Clang uses it for C/C++/Obj-C)
    • Idea: each grammar production becomes a function
    • Productions may be mutually recursive; functions too
    • This is the main difference compared to regexes
    • Parser combinators are an abstraction over this idea
  46. Recursive Descent Parser

    <braces> ::= <round> | <square> | ""

    <round>  ::= "(" <braces> ")"
    <square> ::= "[" <braces> "]"

    def braces() = ???

    def round() = ???

    def square() = ???
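One possible way to fill in the three stubs is sketched below. The signatures are an assumption, since the slide leaves them as `???`: each function here takes the remaining input and returns what is left after a successful match, or `None` on failure. The helpers `expect` and `accepts` and the object name `BracesParser` are made up for this example. It only recognizes input and does not build a tree.

```scala
object BracesParser {
  // <braces> ::= <round> | <square> | ""
  def braces(input: List[Char]): Option[List[Char]] = input match {
    case '(' :: _ => round(input)
    case '[' :: _ => square(input)
    case _        => Some(input) // the empty alternative consumes nothing
  }

  // <round> ::= "(" <braces> ")"
  def round(input: List[Char]): Option[List[Char]] = input match {
    case '(' :: rest => braces(rest).flatMap(after => expect(')', after))
    case _           => None
  }

  // <square> ::= "[" <braces> "]"
  def square(input: List[Char]): Option[List[Char]] = input match {
    case '[' :: rest => braces(rest).flatMap(after => expect(']', after))
    case _           => None
  }

  // Hypothetical helper: consume exactly one expected character.
  private def expect(c: Char, input: List[Char]): Option[List[Char]] = input match {
    case head :: rest if head == c => Some(rest)
    case _                         => None
  }

  // Hypothetical entry point: the whole string must be consumed.
  def accepts(source: String): Boolean =
    braces(source.toList).exists(_.isEmpty)
}
```

`BracesParser.accepts("([[]])")` is true, while `"(]"` and `"("` are rejected. `"()()"` is rejected as well, because this grammar describes a single nest of brackets and not a sequence of them. The functions call each other in the same pattern as the productions refer to each other, which is the mutual recursion mentioned on the previous slide.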
  47. Recursive Descent Parser

    • Has a few disadvantages:
      • Can't handle left-recursive grammars
      • Can't handle infix expressions very well:
        • precedence
        • associativity
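To see why left recursion is a problem: a production such as `<EXP> ::= <EXP> "-" <EXP>` would become a function whose first action is to call itself, so it never consumes any input and never returns. A common workaround, sketched below, is to rewrite the rule as `<exp> ::= <atom> (("+" | "-") <atom>)*` and fold the operands to the left in a loop. This is not necessarily what the talk's code does, and the token and tree types here are cut-down stand-ins invented for the example, not MiniML's.

```scala
object InfixSketch {
  // Hypothetical, minimal token and tree types for this example only.
  sealed trait Token
  case class NUM(value: Int) extends Token
  case object PLUS extends Token
  case object MINUS extends Token

  sealed trait Exp
  case class Lit(value: Int) extends Exp
  case class Add(a: Exp, b: Exp) extends Exp
  case class Sub(a: Exp, b: Exp) extends Exp

  // <exp> ::= <atom> (("+" | "-") <atom>)*
  // Returns the parsed tree together with the tokens that were not consumed.
  def exp(tokens: List[Token]): (Exp, List[Token]) = {
    @annotation.tailrec
    def loop(left: Exp, remaining: List[Token]): (Exp, List[Token]) =
      remaining match {
        case PLUS :: more =>
          val (right, after) = atom(more)
          loop(Add(left, right), after) // the tree built so far becomes the left operand
        case MINUS :: more =>
          val (right, after) = atom(more)
          loop(Sub(left, right), after)
        case _ =>
          (left, remaining)
      }

    val (first, afterFirst) = atom(tokens)
    loop(first, afterFirst)
  }

  // <atom> ::= NUM
  def atom(tokens: List[Token]): (Exp, List[Token]) = tokens match {
    case NUM(n) :: rest => (Lit(n), rest)
    case other =>
      throw new IllegalArgumentException(s"expected a number, got: ${other.headOption}")
  }
}
```

For the tokens of `1 - 2 - 3` this produces `Sub(Sub(Lit(1), Lit(2)), Lit(3))`, i.e. `(1 - 2) - 3`, which is the left-associative reading. Precedence is usually handled in the same spirit, with one such function per precedence level, each calling the next tighter level where this sketch calls `atom`.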
  48. Code: github.com/igstan/bucharestfp-021

  49. Homework!

    • Write a lexer for JSON
    • Write a recursive descent parser for JSON
    • It's easier than today's vehicle language, I promise!
    • Specification: json.org
    • Should we try a coding dojo for this?
  50. Thank You!

  51. Questions