Slide 1

Slide 1 text

Parsing: How Does it Work? Ionuț G. Stan — Bucharest FP — April 2016

Slide 2

Slide 2 text

The Plan
• Compilers Overview
• Vehicle Language: MiniML
• Parsing
• Lexer
• Parser

Slide 3

Slide 3 text

Compilers Overview

Slide 4

Slide 4 text

Compilers Overview
[Diagram: Compiler]

Slide 5

Slide 5 text

Compilers Overview
[Diagram: Source Code -> Compiler]

Slide 6

Slide 6 text

Compilers Overview
[Diagram: Source Code -> Compiler]
Source code: (fn a => a) 2

Slide 7

Slide 7 text

Compilers Overview
[Diagram: Source Code -> Compiler -> Target Language]
Source code: (fn a => a) 2

Slide 8

Slide 8 text

Compilers Overview
[Diagram: Source Code -> Compiler -> Target Language]
Source code: (fn a => a) 2
Target language: (function(a){return a;})(2);

Slide 9

Slide 9 text

Compilers Overview
[Diagram: zoom into the Compiler box]

Slide 10

Slide 10 text

Parsing
[Diagram: inside the Compiler: Parser]

Slide 11

Slide 11 text

Abstract Syntax Tree
[Diagram: inside the Compiler: Parser -> Abstract Syntax Tree (AST)]

Slide 12

Slide 12 text

Abstract Syntax Tree
[Diagram: inside the Compiler: Parser -> Abstract Syntax Tree (AST)]
AST:
  APP
    FUN a
      VAR a
    INT 2

Slide 13

Slide 13 text

Code Generation
[Diagram: inside the Compiler: Parser -> Abstract Syntax Tree (AST) -> CodeGen]
AST:
  APP
    FUN a
      VAR a
    INT 2

Slide 14

Slide 14 text

Many Intermediate Phases
[Diagram: inside the Compiler: Parser -> AST -> ... -> CodeGen]

Slide 15

Slide 15 text

Type Checking
[Diagram: inside the Compiler: Parser -> AST -> Type Checker -> Typed AST -> ... -> CodeGen]

Slide 16

Slide 16 text

Last Year's Talk
[Diagram: inside the Compiler: Parser -> AST -> Type Checker -> Typed AST -> ... -> CodeGen, with one phase labeled "Last Year"]

Slide 17

Slide 17 text

Today's Talk
[Diagram: inside the Compiler: Parser -> AST -> Type Checker -> Typed AST -> ... -> CodeGen, with one phase labeled "Today"]

Slide 18

Slide 18 text

Vehicle Language MiniML

Slide 19

Slide 19 text

MiniML
1. Integers: 1, 2, 3, ...
2. Identifiers (letters only): foo, bar, baz, etc.
3. Booleans: true and false
4. Anonymous functions (lambdas): fn a => a
5. Function application: inc 42
6. If expressions: if cond then t else f
7. Addition and subtraction: a + b, a - c
8. Parenthesized expressions: (expr)

Slide 20

Slide 20 text

MiniML
9. Single-binding let blocks:

   let
     val name = ...
   in
     ...
   end

Slide 21

Slide 21 text

MiniML: Small Example

let
  val inc = fn a => a + 1
in
  inc 42
end

Slide 22

Slide 22 text

Parsing

Slide 23

Slide 23 text

Parsing
• Purpose: recover structure from text
• Traditionally divided into two phases:
  • Lexing: groups characters into words (tokens)
  • Parsing: groups words into phrases (AST)
• Other names for lexer: scanner or tokenizer
• Scannerless parsers exist too
• Why lexer + parser then? Mostly efficiency

Slide 24

Slide 24 text

Lexer

Slide 25

Slide 25 text

Lexer
[Diagram: Lexer]

Slide 26

Slide 26 text

Lexer
[Diagram: characters -> Lexer]
Input characters: (,f,n, ,a, ,=,>, ,a,), ,2
• Expects a stream of characters or bytes
• Groups them into atomic semantic units: tokens

Slide 27

Slide 27 text

Lexer
[Diagram: characters -> Lexer -> tokens]
Input characters: (,f,n, ,a, ,=,>, ,a,), ,2
Output tokens: (,fn,a,=>,a,),2
• Expects a stream of characters or bytes
• Groups them into atomic semantic units: tokens

Slide 28

Slide 28 text

Lexer
• Grouping can be thought of as "split by space"

  val tokens = source.split(" ")

• Why not exactly that?

Slide 29

Slide 29 text

Lexer
• Grouping can be thought of as "split by space"
• Why not exactly that?

Slide 30

Slide 30 text

Lexer
• Grouping can be thought of as "split by space"
• Why not exactly that? Consider this:

  val sum = 1 + 2

  val sum=1+2

  val str = "spaces matter here"

  val str = "spaces /* matter */ here"
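To make the problem concrete, here is a small sketch of what a naive split produces for the inputs above. The names `SplitBySpace` and `naive` are made up for this illustration and are not from the talk's code.

```scala
object SplitBySpace {
  // Naive "lexer": split the source on single spaces (hypothetical helper).
  def naive(source: String): List[String] = source.split(" ").toList

  def main(args: Array[String]): Unit = {
    // Works only because every token happens to be surrounded by spaces.
    println(naive("val sum = 1 + 2"))
    // List(val, sum, =, 1, +, 2)

    // Same program without optional spaces: "sum=1+2" stays one chunk.
    println(naive("val sum=1+2"))
    // List(val, sum=1+2)

    // The string literal is torn apart at its inner spaces.
    println(naive("val str = \"spaces matter here\""))
    // List(val, str, =, "spaces, matter, here")
  }
}
```

Splitting cannot tell a space that separates tokens from a space that belongs to a token, and it finds no boundary at all where the spaces were omitted.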

Slide 31

Slide 31 text

Lexical Grammar
• Lists the rules for grouping characters into tokens
• Rules specified using regular expressions
• Easy to implement with a RegExp library
• Not extremely difficult without one, either
• Or use a generator, e.g., lex, flex, alex, etc.

Slide 32

Slide 32 text

MiniML: Lexical Grammar
• integers: 0|[1-9][0-9]*
• identifiers: [a-zA-Z]+
• symbols: =>, =, +, -, (, )
• keywords: if, then, else, let, val, in, end, fn, true, false
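As a rough sketch, the rules above can be written with Scala's standard `scala.util.matching.Regex`. The object `LexicalGrammar`, the value names, and `classify` are assumptions made for this example. It classifies a single, already isolated lexeme and does not scan a whole source string.

```scala
import scala.util.matching.Regex

object LexicalGrammar {
  // One regular expression per rule of the lexical grammar.
  val IntegerRx: Regex    = "0|[1-9][0-9]*".r
  val IdentifierRx: Regex = "[a-zA-Z]+".r
  val SymbolRx: Regex     = "=>|=|\\+|-|\\(|\\)".r

  // Keywords look like identifiers, so they are checked first.
  val Keywords: Set[String] =
    Set("if", "then", "else", "let", "val", "in", "end", "fn", "true", "false")

  // A Regex used as a pattern must match the entire string.
  def classify(lexeme: String): String = lexeme match {
    case kw if Keywords(kw) => "keyword"
    case IntegerRx()        => "integer"
    case IdentifierRx()     => "identifier"
    case SymbolRx()         => "symbol"
    case _                  => "error"
  }
}
```

For instance, `LexicalGrammar.classify("fn")` yields `"keyword"`, while `classify("007")` yields `"error"` because the integer rule forbids leading zeros.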

Slide 37

Slide 37 text

Token Representation

sealed trait Token

object Token {
  case class INT(value: Int) extends Token
  case class VAR(value: String) extends Token
  case object IF extends Token
  case object THEN extends Token
  case object ELSE extends Token
  case object FN extends Token
  case object DARROW extends Token
  case object LET extends Token
  case object VAL extends Token
  case object EQUAL extends Token
  case object IN extends Token
  case object END extends Token
  case object LPAREN extends Token
  case object RPAREN extends Token
  case object ADD extends Token
  case object SUB extends Token
  case object TRUE extends Token
  case object FALSE extends Token
}
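A minimal hand-written lexer for the MiniML lexical grammar could look like the sketch below. It is not the code from the talk's repository. The Token definitions are repeated so the block compiles on its own, and `LexerSketch`, `tokenize`, and the private helpers are hypothetical names.

```scala
import scala.annotation.tailrec

object LexerSketch {
  sealed trait Token
  object Token {
    case class INT(value: Int) extends Token
    case class VAR(value: String) extends Token
    case object IF extends Token
    case object THEN extends Token
    case object ELSE extends Token
    case object FN extends Token
    case object DARROW extends Token
    case object LET extends Token
    case object VAL extends Token
    case object EQUAL extends Token
    case object IN extends Token
    case object END extends Token
    case object LPAREN extends Token
    case object RPAREN extends Token
    case object ADD extends Token
    case object SUB extends Token
    case object TRUE extends Token
    case object FALSE extends Token
  }
  import Token._

  private val keywords: Map[String, Token] = Map(
    "if" -> IF, "then" -> THEN, "else" -> ELSE, "fn" -> FN, "let" -> LET,
    "val" -> VAL, "in" -> IN, "end" -> END, "true" -> TRUE, "false" -> FALSE
  )

  private def isDigit(c: Char): Boolean = c >= '0' && c <= '9'
  private def isLetter(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

  // Walks the source left to right, emitting one token per step.
  // Simplifications: leading zeros are accepted and large integers
  // may overflow Int; a real lexer would report both.
  def tokenize(source: String): List[Token] = {
    @tailrec
    def loop(i: Int, acc: List[Token]): List[Token] =
      if (i >= source.length) acc.reverse
      else source.charAt(i) match {
        case c if c.isWhitespace => loop(i + 1, acc)
        case '(' => loop(i + 1, LPAREN :: acc)
        case ')' => loop(i + 1, RPAREN :: acc)
        case '+' => loop(i + 1, ADD :: acc)
        case '-' => loop(i + 1, SUB :: acc)
        // Try the longer symbol "=>" before falling back to "=".
        case '=' if source.startsWith("=>", i) => loop(i + 2, DARROW :: acc)
        case '=' => loop(i + 1, EQUAL :: acc)
        case c if isDigit(c) =>
          val digits = source.drop(i).takeWhile(isDigit)
          loop(i + digits.length, INT(digits.toInt) :: acc)
        case c if isLetter(c) =>
          val word = source.drop(i).takeWhile(isLetter)
          // Keywords win over identifiers with the same spelling.
          loop(i + word.length, keywords.getOrElse(word, VAR(word)) :: acc)
        case c =>
          throw new IllegalArgumentException(s"Unexpected character '$c' at offset $i")
      }
    loop(0, Nil)
  }
}
```

With this sketch, `LexerSketch.tokenize("(fn a => a) 2")` returns `List(LPAREN, FN, VAR(a), DARROW, VAR(a), RPAREN, INT(2))`, and `"val sum=1+2"` lexes the same way with or without spaces around the symbols.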

Slide 38

Slide 38 text

Parser

Slide 39

Slide 39 text

Parser
[Diagram: tokens -> Parser -> AST]
Input tokens: (,fn,a,=>,a,),2
Output AST:
  APP
    FUN a
      VAR a
    INT 2

Slide 40

Slide 40 text

Syntactic Grammar
• A grammar tells how tokens can be used together in phrases
• Lexically and syntactically correct: val a = 1
• Lexically correct, but syntactically incorrect: val val val
• You need a grammar before writing any code, so…
  • Either take it from somewhere, or…
  • Write it yourself, or…
  • End up with something like PHP

Slide 41

Slide 41 text

PHP before 5.4

function wat() {
  return array('W', 'A', 'T');
}

echo wat()[0]; // syntax error, unexpected '['

Slide 42

Slide 42 text

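MiniML: Syntactic Grammar

<expr> ::= <expr> <expr> ; function application
         | <expr> "+" <expr>
         | <expr> "-" <expr>
         | "fn" <id> "=>" <expr>
         | "if" <expr> "then" <expr> "else" <expr>
         | "let" "val" <id> "=" <expr> "in" <expr> "end"
         | "(" <expr> ")"
         | <int>
         | <bool>
         | <id>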

Slide 43

Slide 43 text

AST Representation

sealed trait Absyn

object Absyn {
  case class APP(fn: Absyn, arg: Absyn) extends Absyn
  case class ADD(a: Absyn, b: Absyn) extends Absyn
  case class SUB(a: Absyn, b: Absyn) extends Absyn
  case class IF(test: Absyn, yes: Absyn, no: Absyn) extends Absyn
  case class FN(param: String, body: Absyn) extends Absyn
  case class LET(binding: String, value: Absyn, body: Absyn) extends Absyn
  case class BOOL(value: Boolean) extends Absyn
  case class INT(value: Int) extends Absyn
  case class VAR(name: String) extends Absyn
}
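To connect this representation to the earlier examples, the sketch below builds by hand the trees a parser would be expected to produce for two MiniML programs from this talk. The wrapper object `AstExamples` and the value names are invented for the illustration. The Absyn definitions are repeated so the block compiles on its own. The second tree assumes the usual reading where a lambda body extends as far to the right as possible.

```scala
object AstExamples {
  sealed trait Absyn
  object Absyn {
    case class APP(fn: Absyn, arg: Absyn) extends Absyn
    case class ADD(a: Absyn, b: Absyn) extends Absyn
    case class SUB(a: Absyn, b: Absyn) extends Absyn
    case class IF(test: Absyn, yes: Absyn, no: Absyn) extends Absyn
    case class FN(param: String, body: Absyn) extends Absyn
    case class LET(binding: String, value: Absyn, body: Absyn) extends Absyn
    case class BOOL(value: Boolean) extends Absyn
    case class INT(value: Int) extends Absyn
    case class VAR(name: String) extends Absyn
  }
  import Absyn._

  // Source: (fn a => a) 2
  val identityApp: Absyn =
    APP(FN("a", VAR("a")), INT(2))

  // Source: let val inc = fn a => a + 1 in inc 42 end
  // Assumes "fn a => a + 1" means fn a => (a + 1).
  val letInc: Absyn =
    LET(
      "inc",
      FN("a", ADD(VAR("a"), INT(1))),
      APP(VAR("inc"), INT(42))
    )
}
```

Parentheses from the source do not appear in the tree. They only steer the parser, and the nesting of the nodes carries the grouping afterwards.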

Slide 44

Slide 44 text

Parsing Strategies (a)
• Two styles:
  • Top-down parsing: builds AST from root
  • Bottom-up parsing: builds AST from leaves
• Top-down is easy to write by hand
• Bottom-up is not, but it's used by generators
• Parser generators: YACC, ANTLR, Bison, etc.

Slide 45

Slide 45 text

Parsing Strategies (b)
• Today: recursive descent parser (top-down style)
• Very popular (e.g., Clang uses it for C/C++/Obj-C)
• Idea: each grammar production becomes a function
• Productions may be mutually recursive; functions too
• This is the main difference compared to regexes
• Parser combinators are an abstraction over this idea

Slide 46

Slide 46 text

Recursive Descent Parser

<braces> ::= <round> | <square> | ""
<round>  ::= "(" <braces> ")"
<square> ::= "[" <braces> "]"

def braces() = ???

def round() = ???

def square() = ???
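One possible way to fill in the three stubs is sketched below as a recognizer that only answers whether the input matches the grammar. Unlike the stubs on the slide, these functions take the input and a position as parameters. That signature, the helper `expect`, and the wrapper `accepts` are assumptions of this sketch.

```scala
object Brackets {
  // Each function receives the current position and returns the position
  // right after the phrase it recognized, or None on a syntax error.

  // <braces> ::= <round> | <square> | ""
  def braces(s: String, i: Int): Option[Int] =
    if (i < s.length && s(i) == '(') round(s, i)
    else if (i < s.length && s(i) == '[') square(s, i)
    else Some(i) // the empty alternative consumes nothing

  // <round> ::= "(" <braces> ")"
  def round(s: String, i: Int): Option[Int] =
    for {
      afterOpen  <- expect(s, i, '(')
      afterInner <- braces(s, afterOpen)
      afterClose <- expect(s, afterInner, ')')
    } yield afterClose

  // <square> ::= "[" <braces> "]"
  def square(s: String, i: Int): Option[Int] =
    for {
      afterOpen  <- expect(s, i, '[')
      afterInner <- braces(s, afterOpen)
      afterClose <- expect(s, afterInner, ']')
    } yield afterClose

  // Consumes exactly one expected character.
  def expect(s: String, i: Int, c: Char): Option[Int] =
    if (i < s.length && s(i) == c) Some(i + 1) else None

  // The whole input must be consumed for it to be accepted.
  def accepts(s: String): Boolean = braces(s, 0).contains(s.length)
}
```

`Brackets.accepts("([])")` is true and `Brackets.accepts("(]")` is false. Each production maps to one function, and the mutual recursion between `braces`, `round`, and `square` mirrors the recursion in the grammar. Note that `"()()"` is rejected, because this grammar only describes nesting, not sequences.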

Slide 47

Slide 47 text

Recursive Descent Parser
• Has a few disadvantages:
  • Can't handle left-recursive grammars
  • Can't handle infix expressions very well:
    • precedence
    • associativity
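A common workaround for the left-recursion problem, sketched here under simplifying assumptions, is to rewrite a left-recursive rule as a loop. A function for `expr ::= expr "-" num | num` would call itself before consuming any input and never terminate, so the sketch parses `num (("+" | "-") num)*` instead and folds the operands to the left. The types `Expr`, `Num`, `Add`, `Sub` and the pre-split token list are hypothetical and separate from the talk's `Absyn`. Precedence is not addressed, since both operators share one level.

```scala
import scala.annotation.tailrec

object LeftAssoc {
  sealed trait Expr
  case class Num(value: Int) extends Expr
  case class Add(a: Expr, b: Expr) extends Expr
  case class Sub(a: Expr, b: Expr) extends Expr

  // Input is already split into tokens, e.g. List("1", "-", "2", "-", "3").
  def parse(tokens: List[String]): Expr = {
    // `left` accumulates the tree built so far, so every new operator
    // wraps everything to its left: this yields left associativity.
    @tailrec
    def loop(left: Expr, rest: List[String]): Expr = rest match {
      case "+" :: n :: tail => loop(Add(left, Num(n.toInt)), tail)
      case "-" :: n :: tail => loop(Sub(left, Num(n.toInt)), tail)
      case Nil              => left
      case other            =>
        throw new IllegalArgumentException(s"Unexpected tokens: $other")
    }

    tokens match {
      case n :: rest => loop(Num(n.toInt), rest)
      case Nil       => throw new IllegalArgumentException("Empty input")
    }
  }
}
```

`LeftAssoc.parse(List("1", "-", "2", "-", "3"))` produces `Sub(Sub(Num(1), Num(2)), Num(3))`, that is (1 - 2) - 3. A naive right-recursive rewrite would instead group it as 1 - (2 - 3), which changes the result.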

Slide 48

Slide 48 text

Code
github.com/igstan/bucharestfp-021

Slide 49

Slide 49 text

Homework!
• Write a lexer for JSON
• Write a recursive descent parser for JSON
• It's easier than today's vehicle language, I promise!
• Specification: json.org
• Should we try a coding dojo for this?

Slide 50

Slide 50 text

Thank You!

Slide 51

Slide 51 text

Questions and !