Parsing: How Does it Work?
Ionuț G. Stan — Bucharest FP — April 2016
Slide 2
Slide 2 text
• Compilers Overview
• Vehicle Language — MiniML
• Parsing
• Lexer
• Parser
The Plan
Slide 3
Slide 3 text
Compilers Overview
Slide 4
Slide 4 text
Compiler
Compilers Overview
Slide 5
Slide 5 text
Source Code
Compiler
Compilers Overview
Slide 6
Slide 6 text
Source Code
Compiler
(fn a => a) 2
Compilers Overview
Slide 7
Slide 7 text
Source Code Target Language
Compiler
(fn a => a) 2
Compilers Overview
Slide 8
Slide 8 text
Source Code: (fn a => a) 2
Compiler
Target Language: (function(a){return a;})(2);
Compilers Overview
Slide 9
Slide 9 text
Compilers Overview
Compiler (zoomed-in view)
Slide 10
Slide 10 text
Parsing
Compiler (zoomed-in view)
Parser
Slide 11
Slide 11 text
Abstract Syntax Tree
Compiler (zoomed-in view)
Parser
Abstract Syntax Tree (AST)
Slide 12
Slide 12 text
Abstract Syntax Tree
Compiler (zoomed-in view)
Parser
APP
  FUN a
    VAR a
  INT 2
Abstract Syntax Tree (AST)
Slide 13
Slide 13 text
Code Generation
Compiler (zoomed-in view)
Parser CodeGen
APP
  FUN a
    VAR a
  INT 2
Abstract Syntax Tree (AST)
Slide 14
Slide 14 text
Many Intermediate Phases
de T
Compiler
) 2 (functi
Parser CodeGen
...
AST
Slide 15
Slide 15 text
Type Checking
Compiler (zoomed-in view)
Parser CodeGen
Type Checker
AST
Typed AST
...
Slide 16
Slide 16 text
Last Year's Talk
Compiler (zoomed-in view)
Parser CodeGen
Type Checker
AST
Typed AST
Last Year
...
Slide 17
Slide 17 text
Today's Talk
Compiler (zoomed-in view)
Parser CodeGen
Type Checker
AST
Typed AST
Today
...
Slide 18
Slide 18 text
Vehicle Language
MiniML
Slide 19
Slide 19 text
1. Integers: 1, 2, 3, ...
2. Identifiers (letters only): foo, bar, baz, etc.
3. Booleans: true and false
4. Anonymous functions (lambdas): fn a => a
5. Function application: inc 42
6. If expressions: if cond then t else f
7. Addition and subtraction: a + b, a - c
8. Parenthesized expressions: (expr)
MiniML
Slide 20
Slide 20 text
9. Single-binding let blocks:
let
val name = ...
in
...
end
MiniML
Slide 21
Slide 21 text
let
val inc = fn a => a + 1
in
inc 42
end
MiniML — Small Example
Slide 22
Slide 22 text
Parsing
Slide 23
Slide 23 text
• Purpose: recover structure from text
• Traditionally divided into two phases:
• Lexing: groups characters into words (tokens)
• Parsing: groups words into phrases (AST)
• Other names for lexer: scanner or tokenizer
• Scannerless parsers exist too
• Why lexer + parser then? Mostly efficiency
Parsing
Slide 24
Slide 24 text
Lexer
Slide 25
Slide 25 text
Lexer
Lexer
Slide 26
Slide 26 text
Lexer
Lexer
(,f,n, ,a, ,=,>, ,a,), ,2
• Expects a stream of characters or bytes
• Groups them into atomic semantic units: tokens
Slide 27
Slide 27 text
Lexer
Lexer
(,f,n, ,a, ,=,>, ,a,), ,2
(,fn,a,=>,a,),2
• Expects a stream of characters or bytes
• Groups them into atomic semantic units: tokens
Slide 28
Slide 28 text
val tokens = source.split(" ")
Lexer
• Grouping can be thought of as "split by space"
• Why not exactly that?
Slide 29
Slide 29 text
Lexer
• Grouping can be thought of as "split by space"
• Why not exactly that?
Slide 30
Slide 30 text
val sum = 1 + 2

val sum=1+2

val str = "spaces matter here"

val str = "spaces /* matter */ here"
Lexer
• Grouping can be thought of as "split by space"
• Why not exactly that? Consider this:
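A minimal sketch of why "split by space" falls short on the inputs above. The object and method names (SplitBySpace, naive) are made up for illustration; only the standard String.split call is used.

```scala
object SplitBySpace {
  // Naive "lexer": cut the source text at every single space character.
  def naive(source: String): List[String] = source.split(" ").toList

  def main(args: Array[String]): Unit = {
    // Works only because the author happened to put spaces everywhere.
    println(naive("val sum = 1 + 2"))
    // No spaces around the operators: "sum=1+2" stays glued together.
    println(naive("val sum=1+2"))
    // The string literal is cut into three pieces.
    println(naive("val str = \"spaces matter here\""))
  }
}
```

A real lexer has to decide token boundaries from the characters themselves, not from the whitespace between them.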
Slide 31
Slide 31 text
• Lists the rules for grouping characters into tokens
• Rules specified using regular expressions
• Easy to implement with a RegExp library
• Not extremely difficult without one, either
• Or use a generator, e.g., lex, flex, alex, etc.
Lexical Grammar
Token Representation
// One case per kind of token in MiniML's lexical grammar.
sealed trait Token

object Token {
  // Tokens that carry a value
  case class INT(value: Int) extends Token
  case class VAR(value: String) extends Token

  // Keywords and punctuation
  case object IF extends Token
  case object THEN extends Token
  case object ELSE extends Token
  case object FN extends Token
  case object DARROW extends Token // =>
  case object LET extends Token
  case object VAL extends Token
  case object EQUAL extends Token  // =
  case object IN extends Token
  case object END extends Token
  case object LPAREN extends Token // (
  case object RPAREN extends Token // )
  case object ADD extends Token    // +
  case object SUB extends Token    // -
  case object TRUE extends Token
  case object FALSE extends Token
}
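A possible hand-written lexer producing these tokens, as a sketch only. It scans characters directly rather than using regular expressions, and it is not the code from the talk's repository. The names LexerSketch, tokenize, spanEnd and keywords are assumptions made for this example. The Token type is repeated so the block compiles on its own. Comments and string literals are not handled, since MiniML as listed earlier has neither.

```scala
object LexerSketch {
  sealed trait Token
  object Token {
    case class INT(value: Int) extends Token
    case class VAR(value: String) extends Token
    case object IF extends Token
    case object THEN extends Token
    case object ELSE extends Token
    case object FN extends Token
    case object DARROW extends Token
    case object LET extends Token
    case object VAL extends Token
    case object EQUAL extends Token
    case object IN extends Token
    case object END extends Token
    case object LPAREN extends Token
    case object RPAREN extends Token
    case object ADD extends Token
    case object SUB extends Token
    case object TRUE extends Token
    case object FALSE extends Token
  }
  import Token._

  // Letter-only words that are reserved; any other word is an identifier.
  private val keywords: Map[String, Token] = Map(
    "if" -> IF, "then" -> THEN, "else" -> ELSE, "fn" -> FN,
    "let" -> LET, "val" -> VAL, "in" -> IN, "end" -> END,
    "true" -> TRUE, "false" -> FALSE
  )

  def tokenize(source: String): List[Token] = {
    // Index of the first character at or after `from` that fails `p`.
    def spanEnd(from: Int, p: Char => Boolean): Int = {
      var j = from
      while (j < source.length && p(source.charAt(j))) j += 1
      j
    }

    def loop(i: Int, acc: List[Token]): List[Token] =
      if (i >= source.length) acc.reverse
      else source.charAt(i) match {
        case c if c.isWhitespace => loop(i + 1, acc) // whitespace only separates
        case '(' => loop(i + 1, LPAREN :: acc)
        case ')' => loop(i + 1, RPAREN :: acc)
        case '+' => loop(i + 1, ADD :: acc)
        case '-' => loop(i + 1, SUB :: acc)
        case '=' =>
          // One character of lookahead tells "=>" apart from "=".
          if (i + 1 < source.length && source.charAt(i + 1) == '>') loop(i + 2, DARROW :: acc)
          else loop(i + 1, EQUAL :: acc)
        case c if c.isDigit =>
          val end = spanEnd(i, _.isDigit)
          loop(end, INT(source.substring(i, end).toInt) :: acc)
        case c if c.isLetter =>
          val end = spanEnd(i, _.isLetter)
          val word = source.substring(i, end)
          loop(end, keywords.getOrElse(word, VAR(word)) :: acc)
        case c =>
          throw new IllegalArgumentException(s"Unexpected character '$c' at offset $i")
      }

    loop(0, Nil)
  }
}
```

With this sketch, LexerSketch.tokenize("val sum=1+2") yields the same tokens as the spaced-out version, which is exactly what "split by space" could not do.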
Slide 38
Slide 38 text
Parser
Slide 39
Slide 39 text
Parser
Parser
APP
  FUN a
    VAR a
  INT 2
(,fn,a,=>,a,),2
Slide 40
Slide 40 text
Syntactic Grammar
• A grammar tells how tokens can be used together in phrases
• Lexically and syntactically correct: val a = 1
• Lexically correct, but syntactically incorrect: val val val
• You need a grammar before writing any code, so…
• Either take it from somewhere, or…
• Write it yourself, or…
• End up with something like PHP
Slide 41
Slide 41 text
PHP before 5.3
function wat() {
  return array('W', 'A', 'T');
}

echo wat()[0]; // syntax error, unexpected '['
AST Representation
// Absyn = abstract syntax: one case per kind of MiniML expression.
sealed trait Absyn

object Absyn {
  case class APP(fn: Absyn, arg: Absyn) extends Absyn                      // fn arg
  case class ADD(a: Absyn, b: Absyn) extends Absyn                         // a + b
  case class SUB(a: Absyn, b: Absyn) extends Absyn                         // a - b
  case class IF(test: Absyn, yes: Absyn, no: Absyn) extends Absyn          // if test then yes else no
  case class FN(param: String, body: Absyn) extends Absyn                  // fn param => body
  case class LET(binding: String, value: Absyn, body: Absyn) extends Absyn // let val binding = value in body end
  case class BOOL(value: Boolean) extends Absyn
  case class INT(value: Int) extends Absyn
  case class VAR(name: String) extends Absyn
}
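To make the representation concrete, here is how the two MiniML programs shown earlier could be written as Absyn values. This is a sketch: the wrapper object AstSketch and the value names are invented for the example, and only the node types these two programs need are repeated so that the block compiles on its own.

```scala
object AstSketch {
  sealed trait Absyn
  object Absyn {
    case class APP(fn: Absyn, arg: Absyn) extends Absyn
    case class ADD(a: Absyn, b: Absyn) extends Absyn
    case class FN(param: String, body: Absyn) extends Absyn
    case class LET(binding: String, value: Absyn, body: Absyn) extends Absyn
    case class INT(value: Int) extends Absyn
    case class VAR(name: String) extends Absyn
  }
  import Absyn._

  // (fn a => a) 2
  val identityApplied: Absyn = APP(FN("a", VAR("a")), INT(2))

  // let
  //   val inc = fn a => a + 1
  // in
  //   inc 42
  // end
  val smallExample: Absyn =
    LET("inc", FN("a", ADD(VAR("a"), INT(1))), APP(VAR("inc"), INT(42)))
}
```

Parentheses and keywords such as in and end leave no trace in the tree: the nesting of the nodes carries that information.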
Slide 44
Slide 44 text
Parsing Strategies (a)
• Two styles:
• Top-down parsing: builds AST from root
• Bottom-up parsing: builds AST from leaves
• Top-down is easy to write by hand
• Bottom-up is hard to write by hand, but many parser generators use it
• Parser generators: YACC and Bison (bottom-up), ANTLR (top-down), etc.
Slide 45
Slide 45 text
Parsing Strategies (b)
• Today: recursive descent parser (top-down style)
• Very popular: Clang, for example, uses it for C/C++/Obj-C
• Idea: each grammar production becomes a function
• Productions may be mutually recursive; functions too
• Recursion is the main difference compared to regexes
• Parser combinators are an abstraction over this idea
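A sketch of "each grammar production becomes a function", for a small subset of MiniML: integers, identifiers, lambdas, application and parentheses. The three-production grammar in the comments is an assumption made for this example, not the grammar used in the talk. The names RecursiveDescentSketch, expr, app, atom, startsAtom and parse are likewise invented. Each function takes the remaining tokens and returns the parsed node together with the tokens it did not consume.

```scala
object RecursiveDescentSketch {
  sealed trait Token
  object Token {
    case class INT(value: Int) extends Token
    case class VAR(value: String) extends Token
    case object FN extends Token
    case object DARROW extends Token
    case object LPAREN extends Token
    case object RPAREN extends Token
  }

  sealed trait Absyn
  object Absyn {
    case class APP(fn: Absyn, arg: Absyn) extends Absyn
    case class FN(param: String, body: Absyn) extends Absyn
    case class INT(value: Int) extends Absyn
    case class VAR(name: String) extends Absyn
  }

  type Result = (Absyn, List[Token])

  // expr ::= "fn" VAR "=>" expr | app
  def expr(tokens: List[Token]): Result = tokens match {
    case Token.FN :: Token.VAR(param) :: Token.DARROW :: rest =>
      val (body, rest2) = expr(rest)
      (Absyn.FN(param, body), rest2)
    case _ => app(tokens)
  }

  // app ::= atom atom*   (application associates to the left)
  def app(tokens: List[Token]): Result = {
    def loop(fn: Absyn, rest: List[Token]): Result =
      if (startsAtom(rest)) {
        val (arg, rest2) = atom(rest)
        loop(Absyn.APP(fn, arg), rest2)
      } else (fn, rest)

    val (head, rest) = atom(tokens)
    loop(head, rest)
  }

  // True when the next token can begin an atom.
  def startsAtom(tokens: List[Token]): Boolean = tokens match {
    case (Token.INT(_) | Token.VAR(_) | Token.LPAREN) :: _ => true
    case _ => false
  }

  // atom ::= INT | VAR | "(" expr ")"
  def atom(tokens: List[Token]): Result = tokens match {
    case Token.INT(n) :: rest => (Absyn.INT(n), rest)
    case Token.VAR(name) :: rest => (Absyn.VAR(name), rest)
    case Token.LPAREN :: rest =>
      // Mutual recursion: atom calls expr, which may call atom again.
      expr(rest) match {
        case (inner, Token.RPAREN :: rest2) => (inner, rest2)
        case (_, other) => sys.error(s"expected ')' but found: ${other.headOption}")
      }
    case other => sys.error(s"unexpected token: ${other.headOption}")
  }

  // Entry point: the whole token list must be consumed.
  def parse(tokens: List[Token]): Absyn = expr(tokens) match {
    case (ast, Nil) => ast
    case (_, leftover) => sys.error(s"unconsumed tokens: $leftover")
  }
}
```

Feeding it the tokens of (fn a => a) 2 gives APP(FN("a", VAR("a")), INT(2)). The call from atom back into expr is the recursion that lets parentheses nest to any depth.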
Recursive Descent Parser
• Has a few disadvantages:
• Can't handle left-recursive grammars
• Can't handle infix expressions very well:
• precedence
• associativity
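One common workaround for the left-recursion problem, sketched for + and - over integer literals only. A production such as expr ::= expr "-" term, turned literally into a function, would call itself before consuming any token and never terminate. Rewriting it as term followed by a loop keeps the tree left-associative. The grammar and all names here (InfixSketch, term, expr) are assumptions made for this example, and precedence between different operators is not addressed.

```scala
object InfixSketch {
  sealed trait Token
  object Token {
    case class INT(value: Int) extends Token
    case object ADD extends Token
    case object SUB extends Token
  }

  sealed trait Absyn
  object Absyn {
    case class ADD(a: Absyn, b: Absyn) extends Absyn
    case class SUB(a: Absyn, b: Absyn) extends Absyn
    case class INT(value: Int) extends Absyn
  }

  type Result = (Absyn, List[Token])

  // term ::= INT
  def term(tokens: List[Token]): Result = tokens match {
    case Token.INT(n) :: rest => (Absyn.INT(n), rest)
    case other => sys.error(s"expected an integer but found: ${other.headOption}")
  }

  // Left-recursive form (not directly implementable as a function):
  //   expr ::= expr "+" term | expr "-" term | term
  // Equivalent form used here:
  //   expr ::= term (("+" | "-") term)*
  def expr(tokens: List[Token]): Result = {
    // `left` accumulates everything parsed so far, so each new operator
    // wraps the previous result: 1 - 2 + 3 becomes (1 - 2) + 3.
    def loop(left: Absyn, rest: List[Token]): Result = rest match {
      case Token.ADD :: more =>
        val (right, rest2) = term(more)
        loop(Absyn.ADD(left, right), rest2)
      case Token.SUB :: more =>
        val (right, rest2) = term(more)
        loop(Absyn.SUB(left, right), rest2)
      case _ => (left, rest)
    }

    val (first, rest) = term(tokens)
    loop(first, rest)
  }
}
```

Parsing the tokens of 1 - 2 + 3 this way produces ADD(SUB(INT(1), INT(2)), INT(3)), the left-associative reading.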
Slide 48
Slide 48 text
github.com/igstan/bucharestfp-021
Code
Slide 49
Slide 49 text
Homework!
• Write a lexer for JSON
• Write a recursive descent parser for JSON
• It's easier than today's vehicle language, I promise!
• Specification: json.org
• Should we try a coding dojo for this?