Let's Write a Parser

Ionuț G. Stan

May 19, 2016

Transcript

  3. 7.

    About Me

    • Software Developer at Eloquentix
    • I work mostly with Scala
    • I like FP, programming languages, compilers
    • I started the Bucharest FP meet-up group
    • I occasionally blog on igstan.ro
  4. 8.
  11. 20.

    Vehicle Language: µML

    1. Integers: 1, 23, 456, etc.
    2. Identifiers (only letters): inc, cond, a, etc.
    3. Booleans: true and false
    4. Single-argument anonymous functions: fn a => a
    5. Function application: inc 42
    6. If expressions: if cond then t else f
    7. Addition and subtraction: a + b, a - b
    8. Parenthesized expressions: (a + b)
  12. 31.

    Abstract Syntax Tree

    [Diagram: inside the compiler, the Parser produces an Abstract Syntax Tree (AST): APP(FUN a (VAR a), INT 2).]
  13. 32.

    Code Generation

    [Diagram: the same compiler pipeline with a CodeGen stage added next to the Parser and the AST.]
  14. 35.

    Last Year's Talk

    [Diagram: compiler pipeline with Parser, Type Checker, and CodeGen stages, connected by AST and Typed AST; a "Last Year" label marks the part covered by last year's talk.]
  15. 36.

    Today's Talk

    [Diagram: the same pipeline with Parser, Type Checker, and CodeGen stages; a "Today" label marks the part covered by this talk.]
  16. 46.

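    Lexing

    [Diagram: inside the compiler, the Lexer produces the tokens ( fn a => a ) 2, and the Parser turns them into the AST APP(FUN a (VAR a), INT 2).]
  17. 47.

    Parsing

    [Diagram: inside the compiler, the Lexer produces the tokens ( fn a => a ) 2, and the Parser turns them into the AST APP(FUN a (VAR a), INT 2).]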
  23. 53.
  24. 54.

    Lexing

    [Diagram: Lexer, Tokens ( fn a => a ) 2, Parser.]

    • Expects a stream of characters or bytes
    • Groups them into semantically atomic units: tokens!
    • These are the words of the language!
    • What are the rules for grouping them, though?
  26. 57.

    Lexing

    • Grouping can be thought of as "split by space"
    • Why not exactly that, though? Consider:

      val sum = 1 + 2
      val sum=1+2
      val str = "spaces matter here"
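To see why plain whitespace splitting falls short, it helps to run it on the three lines from this slide. The following is a minimal Python sketch; `split_by_space` is a hypothetical helper name chosen for illustration, not something from the talk.

```python
def split_by_space(source: str) -> list[str]:
    """Naive 'lexer': split the input on runs of whitespace."""
    return source.split()


samples = [
    'val sum = 1 + 2',
    'val sum=1+2',
    'val str = "spaces matter here"',
]

for sample in samples:
    print(split_by_space(sample))

# The first line splits into the six expected words.
# The second keeps 'sum=1+2' glued together as a single word.
# The third breaks the string literal into three separate pieces.
```

The second and third samples are the interesting ones: missing spaces merge several tokens into one, and spaces inside a string literal split one token into many.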
  33. 65.

    Lexing

    • We need rules for grouping characters into tokens
    • These rules form the lexical grammar
    • Can be defined using regular expressions
    • Conducive to easy and efficient implementations
    • Using a RegExp library
    • By hand isn't hard either, just a little cumbersome
    • Lexer generators: Lex, Flex, Alex, ANTLR, etc.
    • Lexing is what you need for syntax definition files
  34. 66.

    µML — Lexical Grammar

    integers     0|[1-9][0-9]*
    identifiers  [a-zA-Z]+
    symbols      (, ), +, -, =, =>
    keywords     if, then, else, let, val, in, end, fn, true, false
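The lexical grammar above is small enough to try out with a regular expression library. The sketch below uses Python's `re` module; the `Token` type, the kind names (`INT`, `IDENT`, `SYMBOL`, `KEYWORD`), and the choice to classify keywords after matching identifiers are assumptions made for this illustration, not the talk's implementation.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # "INT", "IDENT", "SYMBOL" or "KEYWORD" (names assumed for this sketch)
    text: str


KEYWORDS = {"if", "then", "else", "let", "val", "in", "end", "fn", "true", "false"}

# One alternative per rule of the lexical grammar, plus whitespace to skip.
# "=>" is listed before "=" so the longer symbol wins.
TOKEN_RE = re.compile(
    r"""
      (?P<INT>0|[1-9][0-9]*)
    | (?P<IDENT>[a-zA-Z]+)
    | (?P<SYMBOL>=>|[()+\-=])
    | (?P<SPACE>\s+)
    """,
    re.VERBOSE,
)


def tokenize(source: str) -> list[Token]:
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "SPACE":
            continue  # whitespace separates tokens but is not a token itself
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"  # keywords look like identifiers, so reclassify them
        tokens.append(Token(kind, text))
    return tokens
```

With this sketch, `tokenize("val sum=1+2")` and `tokenize("val sum = 1 + 2")` yield the same six tokens, which is the behaviour that "split by space" could not provide.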
  39. 71.
  40. 72.
  46. 79.

    Parsing

    • The lexer recognizes valid words in the language
    • Not all combinations of valid words form valid phrases in a language
    • Syntactically correct: val a = 1
    • Syntactically incorrect: val val val
    • We must define the structure of phrases
    • A syntactical grammar achieves that
  47. 81.
  50. 84.

    Parsing

    • Regular expressions are not powerful enough
    • REs can't recognize nested structures
    • Because they use a finite amount of memory
    • Nesting needs a stack to remember the upper structures you're traversing
    • Syntactical grammars express nesting using recursion
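The role of the stack can be illustrated with the simplest nested structure in µML, parentheses. The following Python sketch checks whether parentheses are properly nested; `parens_balanced` is a hypothetical name, and the function only looks at parentheses, ignoring everything else in the input.

```python
def parens_balanced(source: str) -> bool:
    """Return True if every ')' closes an earlier '(' and nothing stays open."""
    stack = []  # offsets of the '(' characters that are still open
    for offset, ch in enumerate(source):
        if ch == "(":
            stack.append(offset)  # remember the enclosing structure
        elif ch == ")":
            if not stack:
                return False  # a ')' with no matching '('
            stack.pop()  # leave the innermost open structure
    return not stack  # anything left on the stack was never closed
```

The stack can grow as deep as the input nests, which is exactly the unbounded memory that a regular expression, in the formal sense, does not have.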
  51. 85.
  63. 99.

    µML — Syntactical Grammar

    expr = int
         | var
         | bool
         | ( expr )
         | fn var => expr
         | if expr then expr else expr
         | let val var = expr in expr end
         | expr oper expr
         | expr expr

    bool = true | false
    oper = + | -

    Here, blue symbols represent tokens coming from the lexer, not keywords.
  69. 106.

    Introducing Precedence

    • Function application has higher precedence than infix expressions in ML
    • double 1 + 2 = (double 1) + 2
    • double 1 + 2 ≠ double (1 + 2)
    • A rule's alternatives don't encode precedence
    • Grammars convey this by chaining rules in order of precedence
    • This doesn't scale well to many infix operators
    • Use a special parser for that, e.g., the Shunting Yard algorithm
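For readers who have not met the Shunting Yard algorithm mentioned in the last bullet, here is a minimal Python sketch of its core idea: a precedence table replaces the chain of grammar rules. It handles only left-associative binary operators over already-lexed operands, with no parentheses or function application. The `*` operator is hypothetical (µML has only `+` and `-`) and is included just to show two precedence levels; `parse_infix` and the tuple-shaped tree are also assumptions of this sketch.

```python
# Higher number binds tighter. "*" is NOT part of µML; it is here for illustration.
PRECEDENCE = {"+": 1, "-": 1, "*": 2}


def parse_infix(tokens: list[str]):
    """Turn a flat operand/operator sequence into a nested (op, left, right) tree."""
    operands = []   # finished subtrees
    operators = []  # operators still waiting for their right operand

    def reduce_top():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for tok in tokens:
        if tok in PRECEDENCE:
            # Operators of higher or equal precedence on the stack bind first;
            # "equal" is what makes the operators left-associative.
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[tok]:
                reduce_top()
            operators.append(tok)
        else:
            operands.append(tok)

    while operators:
        reduce_top()
    return operands[0]
```

Adding another operator in this style means adding one entry to the table, whereas the rule-chaining approach would need a new grammar rule per precedence level.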
  70. 107.

    Introducing Precedence

    expr = int
         | var
         | bool
         | ( expr )
         | fn var => expr
         | if expr then expr else expr
         | let val var = expr in expr end
         | expr oper expr
         | expr expr

    bool = true | false
    oper = + | -
  71. 108.

    Introducing Precedence

    expr   = infix
           | fn var => expr
           | if expr then expr else expr

    infix  = app
           | infix oper infix

    app    = atomic
           | app atomic

    atomic = int
           | var
           | bool
           | ( expr )
           | let val var = expr in expr end

    bool   = true | false
    oper   = + | -
  78. 118.

    Parsing Strategies

    • Two styles:
      • Top-down parsing: builds tree from the root
      • Bottom-up parsing: builds tree from the leaves
    • Top-down is easy to write by hand
    • Bottom-up is not, but it's used by generators
    • Parser generators: YACC, ANTLR, Bison, etc.
  85. 126.

    Recursive Descent Parser

    • The simplest known parsing strategy; amenable to hand-coding
    • Builds the tree top to bottom, from root to leaves, hence Descent
    • Parallels the structure of the grammar
    • Main idea: each grammar production becomes a function
    • Recursion in the grammar translates to recursion in the code, hence Recursive
    • Recursion is the main difference compared to regexes; it needs a stack
    • Very popular, e.g., Clang uses it for C/C++/Obj-C
    • Parser combinators are an abstraction over this idea
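The "each production becomes a function" idea can be sketched on a deliberately tiny subset of µML with just two rules: `expr = fn var => expr | atomic` and `atomic = int | var | ( expr )`. This Python sketch assumes the input is already a list of token strings; the class name `SubsetParser`, the tuple-shaped tree, and the node names are choices made for this illustration.

```python
class SubsetParser:
    def __init__(self, tokens: list[str]):
        self.tokens = tokens
        self.pos = 0  # index of the next unread token

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, text: str) -> None:
        if self.peek() != text:
            raise SyntaxError(f"expected {text!r}, got {self.peek()!r}")
        self.pos += 1

    def var(self) -> str:
        tok = self.peek()
        if tok is None or not tok.isalpha() or tok == "fn":
            raise SyntaxError(f"expected identifier, got {tok!r}")
        self.pos += 1
        return tok

    # expr = fn var => expr | atomic
    def expr(self):
        if self.peek() == "fn":
            self.pos += 1
            param = self.var()
            self.expect("=>")
            return ("FUN", param, self.expr())  # recursion in the rule, recursion in the code
        return self.atomic()

    # atomic = int | var | ( expr )
    def atomic(self):
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            inner = self.expr()  # nested expression: the call stack remembers the open "("
            self.expect(")")
            return inner
        if tok is not None and tok.isdigit():
            self.pos += 1
            return ("INT", int(tok))
        return ("VAR", self.var())
```

Each method mirrors one rule of the subset grammar, and the first token is enough to choose between a rule's alternatives.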
  86. 127.
  90. 132.

    Removing Left-Recursion

    • The current grammar has a problem
    • But it's only a problem for our current parsing strategy; others can easily cope with it
    • The problem is that some rules are left-recursive, i.e., the rule itself appears as the leftmost symbol of one of its own alternatives
    • This is problematic for a recursive descent parser because the structure of function calls follows the structure of rule definitions
    • That means infinite recursion in the parser, which isn't good
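The infinite recursion is easy to reproduce. The Python sketch below translates the left-recursive alternative `app = app atomic` directly into a function, the way a naive recursive descent parser would; `parse_app` and `recursion_blows_up` are hypothetical names used only for this demonstration.

```python
def parse_app(tokens: list[str], pos: int = 0):
    # Alternative `app = app atomic`: first parse an `app` at the same position.
    # No token has been consumed yet, so this call repeats with identical
    # arguments and never reaches the lines below.
    func, pos = parse_app(tokens, pos)
    arg = tokens[pos]
    return ("APP", func, arg), pos + 1


def recursion_blows_up(tokens: list[str]) -> bool:
    """Report whether parsing hits Python's recursion limit."""
    try:
        parse_app(tokens)
    except RecursionError:
        return True
    return False
```

In Python the runaway recursion ends in a `RecursionError`; in other languages the same mistake typically shows up as a stack overflow.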
  91. 133.

    Left-Recursive Grammar

    expr   = infix
           | fn var => expr
           | if expr then expr else expr

    infix  = app
           | infix oper infix

    app    = atomic
           | app atomic

    atomic = int
           | var
           | bool
           | ( expr )
           | let val var = expr in expr end

    bool   = true | false
    oper   = + | -
  93. 135.

    Left-Recursive Grammar

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | infix oper infix

    app   = atomic
          | app atomic
  94. 136.

    Left-Recursive Grammar

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | infix oper infix

    app   = atomic
          | atomic atomic
          | atomic atomic atomic
          | atomic atomic atomic atomic
          ...
  95. 137.

    Left-Recursive Grammar

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | infix oper infix

    app   = atomic
          | atomic atomic
          | atomic (atomic atomic)
          | atomic (atomic (atomic atomic))
          ...
  96. 138.

    Left-Recursive Grammar

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | infix oper infix

    app   = atomic { app }
  97. 139.

    Removing Left-Recursion

    expr   = infix
           | fn var => expr
           | if expr then expr else expr

    infix  = app
           | infix oper infix

    app    = atomic { app }

    atomic = int
           | var
           | bool
           | ( expr )
           | let val var = expr in expr end

    bool   = true | false
    oper   = + | -
  98. 140.

    Removing Left-Recursion

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | infix oper infix
  99. 141.

    Removing Left-Recursion

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | app oper infix
  102. 144.

    Removing Left-Recursion

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | app oper infix
          | app oper app oper infix
          | app oper app oper app oper infix
          ...
  103. 145.

    Removing Left-Recursion

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app
          | app (oper infix)
          | app (oper app (oper infix))
          | app (oper app (oper app (oper infix)))
          ...
  104. 146.

    Removing Left-Recursion

    expr  = infix
          | fn var => expr
          | if expr then expr else expr

    infix = app { oper infix }
  105. 147.

    Removing Left-Recursion

    expr   = infix
           | fn var => expr
           | if expr then expr else expr

    infix  = app { oper infix }

    app    = atomic { app }

    12 14 13 (12 14) 13

    atomic = int
           | var
           | bool
           | ( expr )
           | let val var = expr in expr end

    bool   = true | false
    oper   = + | -
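With left recursion gone, the grammar can be turned into a recursive descent parser rule by rule. The Python sketch below is one possible reading, not the talk's implementation: it treats the braces as repetition and folds both infix operators and function application to the left (an assumption about the intended associativity), uses a compact regex tokenizer, and represents the tree as tuples. All names (`tokenize`, `Parser`, `parse`, the node tags) are made up for this example.

```python
import re

KEYWORDS = {"if", "then", "else", "let", "val", "in", "end", "fn", "true", "false"}
TOKEN_RE = re.compile(r"\s*(=>|[()+\-=]|0|[1-9][0-9]*|[a-zA-Z]+)")

def tokenize(source):
    tokens, pos, source = [], 0, source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, text):
        if self.peek() != text:
            raise SyntaxError(f"expected {text!r}, got {self.peek()!r}")
        self.pos += 1

    def is_var(self, tok):
        return tok is not None and tok.isalpha() and tok not in KEYWORDS

    def var(self):
        if not self.is_var(self.peek()):
            raise SyntaxError(f"expected identifier, got {self.peek()!r}")
        return self.advance()

    # expr = infix | fn var => expr | if expr then expr else expr
    def expr(self):
        if self.peek() == "fn":
            self.advance()
            param = self.var()
            self.expect("=>")
            return ("FUN", param, self.expr())
        if self.peek() == "if":
            self.advance()
            cond = self.expr()
            self.expect("then")
            yes = self.expr()
            self.expect("else")
            return ("IF", cond, yes, self.expr())
        return self.infix()

    # infix: an app followed by zero or more (oper app), folded to the left
    def infix(self):
        left = self.app()
        while self.peek() in ("+", "-"):
            op = self.advance()
            left = (op, left, self.app())
        return left

    # app: an atomic followed by zero or more atomics, folded to the left
    def app(self):
        func = self.atomic()
        while self.starts_atomic(self.peek()):
            func = ("APP", func, self.atomic())
        return func

    def starts_atomic(self, tok):
        if tok is None:
            return False
        return tok in ("(", "let", "true", "false") or tok.isdigit() or self.is_var(tok)

    # atomic = int | var | bool | ( expr ) | let val var = expr in expr end
    def atomic(self):
        tok = self.peek()
        if tok == "(":
            self.advance()
            inner = self.expr()
            self.expect(")")
            return inner
        if tok == "let":
            self.advance()
            self.expect("val")
            name = self.var()
            self.expect("=")
            value = self.expr()
            self.expect("in")
            body = self.expr()
            self.expect("end")
            return ("LET", name, value, body)
        if tok in ("true", "false"):
            return ("BOOL", self.advance() == "true")
        if tok is not None and tok.isdigit():
            return ("INT", int(self.advance()))
        return ("VAR", self.var())

def parse(source):
    parser = Parser(tokenize(source))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Under these assumptions, `parse("(fn a => a) 2")` yields an `APP` node whose children are a `FUN` node and `INT 2`, and `parse("double 1 + 2")` groups as `(double 1) + 2`, matching the precedence discussed earlier.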
  106. 149.

    Homework

    • Write a lexer for JSON
    • Write a recursive descent parser for JSON
    • It's way easier than today's vehicle language
    • I promise!
    • Specification: json.org
  107. 150.
  108. 151.