Combinator Parsing

A deep dive into Graham Hutton's "Higher-Order Functions for Parsing" (Journal of Functional Programming, 1992), with a Haskell implementation.

Swanand Pagnis

January 20, 2018
Transcript

  1. Combinator Parsing By Swanand Pagnis

  2. Higher-Order Functions for Parsing By Graham Hutton

  3. • Abstract & Introduction
    • Build a parser, one fn at a time
    • Moving beyond toy parsers
  4. Abstract

  5. In combinator parsing, the text of parsers resembles BNF notation.

    We present the basic method, and a number of extensions. We address the special problems presented by whitespace, and parsers with separate lexical and syntactic phases. In particular, a combining form for handling the “offside rule” is given. Other extensions to the basic method include an “into” combining form with many useful applications, and a simple means by which combinator parsers can produce more informative error messages.
  6. • Combinators that resemble BNF notation
    • Whitespace handling and the "Offside Rule"
    • "Into" combining form for advanced parsing
    • Strategy for better error messages
  7. Introduction

  8. Primitive Parsers
    • Take input
    • Process one character
    • Return results and unused input
  9. Combinators
    • Combine primitives
    • Define building blocks
    • Return results and unused input
  10. Lexical analysis and syntax
    • Combine the combinators
    • Define lexical elements
    • Return results and unused input
  11. input: "from:swiggy to:me" output: [("f", "rom:swiggy to:me")]

  12. input: "42 => ans" output: [("4", "2 => ans")]

  13. rule: 'a' followed by 'b' input: "abcdef" output: [(('a','b'),"cdef")]

  14. rule: 'a' followed by 'b' input: "abcdef" output: [(('a','b'),"cdef")] Combinator

  15. Language choice

  16. Suggested: Lazy Functional Languages

  17. Miranda: Author's choice

  18. Haskell: An obvious choice.

  19. Racket: Another obvious choice.

  20. Ruby: for learning

  21. OCaml: Functional, but not lazy.

  22. Haskell

  23. Simple when you stick to fundamental FP
    • Higher-order functions
    • Immutability
    • Recursive problem solving
    • Algebraic types
  24. Let's build a parser, one fn at a time

  25. type Parser a b = [a] -> [(b, [a])]

  26. Types help with abstraction
    • We'll be dealing with parsers and combinators
    • Parsers are functions: they accept input and return results
    • Combinators accept parsers and return parsers
  27. A parser is a function that accepts an input and

    returns parsed results and the unused input for each result
  28. Parser is a function type that accepts a list of

    type a and returns all possible results as a list of tuples of type (b, [a])
  29. (Parser Char Number)
    input: "42 it is!"          -- the input [a] is a [Char]
    output: [(42, " it is!")]   -- b is a Number
  30. type Parser a b = [a] -> [(b, [a])]

  31. Primitive Parsers

  32. succeed :: b -> Parser a b
    succeed v inp = [(v, inp)]
  33. Always succeeds. Returns "v" for all inputs.

  34. failure :: Parser a b
    failure inp = []
  35. Always fails. Returns "[]" for all inputs.

  36. satisfy :: (a -> Bool) -> Parser a a
    satisfy p [] = failure []
    satisfy p (x:xs)
      | p x = succeed x xs       -- if p(x) is true
      | otherwise = failure []
  37. (satisfy, as above) Guard Clauses, if you want to Google
  38. literal :: Eq a => a -> Parser a a
    literal x = satisfy (== x)
  39. match_3 = (literal '3')
    match_3 "345" -- => [('3',"45")]
    match_3 "456" -- => []
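The four primitives above fit into a few lines of plain Haskell. What follows is a minimal sketch that only collects the slide definitions into one loadable unit; nothing beyond the Prelude is assumed, and `match_3` is the example parser from the slides.

```haskell
-- A parser maps an input list to every possible (result, unused input) pair.
type Parser a b = [a] -> [(b, [a])]

-- Always succeeds with v and consumes nothing.
succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

-- Always fails: the empty list means "no parse".
failure :: Parser a b
failure _ = []

-- Consumes one item if it passes the predicate.
satisfy :: (a -> Bool) -> Parser a a
satisfy _ [] = failure []
satisfy p (x:xs)
  | p x       = succeed x xs
  | otherwise = failure xs

-- Matches exactly one given item.
literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)

match_3 :: Parser Char Char
match_3 = literal '3'
```

Representing failure as an empty list (rather than, say, `Maybe`) is what later lets `alt` return several alternative parses at once.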
  40. succeed failure satisfy literal

  41. Combinators

  42. match_3_or_4 = match_3 `alt` match_4
    match_3_or_4 "345" -- => [('3',"45")]
    match_3_or_4 "456" -- => [('4',"56")]
  43. alt :: Parser a b -> Parser a b -> Parser a b
    (p1 `alt` p2) inp = p1 inp ++ p2 inp
  44. (p1 `alt` p2) inp = p1 inp ++ p2 inp
    List concatenation
  45. (match_3 `and_then` match_4) "345" -- => [(('3','4'),"5")]

  47. and_then :: Parser a b -> Parser a c -> Parser a (b, c)
    (p1 `and_then` p2) inp =
      [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]
  48. (and_then, as above) List comprehensions
  49. (v11, out11) (v12, out12) (v13, out13) … (v21, out21) (v22,

    out22) … (v31, out31) (v32, out32) … p1 p2
  50. ((v11, v21), out21) ((v11, v22), out22) …

  51. (match_3 `and_then` match_4) "345" -- => [(('3','4'),"5")]
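To see choice and sequencing working together, here is a small self-contained sketch of `alt` and `and_then` as defined on the slides. The compact `literal` below is written directly with a guard instead of via `satisfy`, purely to keep the example short; `match_4` is assumed to be defined like `match_3`.

```haskell
type Parser a b = [a] -> [(b, [a])]

-- Compact literal: succeed only when the first item equals x.
literal :: Eq a => a -> Parser a a
literal x (y:ys) | x == y = [(y, ys)]
literal _ _ = []

-- Choice: collect the results of both parsers on the same input.
alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

-- Sequencing: run p2 on whatever each result of p1 left unused.
and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

match_3, match_4 :: Parser Char Char
match_3 = literal '3'
match_4 = literal '4'
```

With these definitions, `(match_3 `alt` match_4) "456"` yields `[('4',"56")]` and `(match_3 `and_then` match_4) "345"` yields `[(('3','4'),"5")]`, as on the slides.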

  52. Manipulating values

  53. match_3 = (literal '3')
    match_3 "345" -- => [('3',"45")]
    match_3 "456" -- => []
  54. (number "42") "42 => answer" -- => [(42, " answer")]

  55. (keyword "for") "for i in 1..42" -- => [(:for, " i in 1..42")]
  56. using :: Parser a b -> (b -> c) -> Parser a c
    (p `using` f) inp = [(f v, out) | (v, out) <- p inp ]
  57. ((string "3") `using` float) "3" -- => [(3.0, "")]

  58. Levelling up

  59. many :: Parser a b -> Parser a [b]
    many p = ((p `and_then` many p) `using` cons) `alt` (succeed [])
  60. 0 or more

  61. (many (literal 'a')) "aab" -- => [("aa","b"),("a","ab"),("","aab")]

  62. (many (literal 'a')) "xyz" -- => [("","xyz")]

  63. some :: Parser a b -> Parser a [b]
    some p = ((p `and_then` many p) `using` cons)
  64. 1 or more

  65. (some (literal 'a')) "aab" -- => [("aa","b"),("a","ab")]

  66. (some (literal 'a')) "xyz" -- => []
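The slides use `cons` without defining it. A minimal sketch, assuming `cons` is simply list construction on a pair (which is what the types of `and_then` and `using` require), makes `many` and `some` runnable:

```haskell
type Parser a b = [a] -> [(b, [a])]

succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

literal :: Eq a => a -> Parser a a
literal x (y:ys) | x == y = [(y, ys)]
literal _ _ = []

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

-- Transform every result value with f, leaving the unused input alone.
using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

-- Assumed helper: (:) on a pair, so it can follow `and_then`.
cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

-- Zero or more repetitions of p, longest match first.
many :: Parser a b -> Parser a [b]
many p = ((p `and_then` many p) `using` cons) `alt` succeed []

-- One or more repetitions of p.
some :: Parser a b -> Parser a [b]
some p = (p `and_then` many p) `using` cons
```

Because `alt` keeps every alternative, `many` returns all prefixes that match, not only the longest one; the longest comes first because the recursive branch is the left operand of `alt`.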

  67. positive_integer = some (satisfy Data.Char.isDigit)
    negative_integer = ((literal '-') `and_then` positive_integer) `using` cons
    positive_decimal = (positive_integer `and_then` (((literal '.') `and_then` positive_integer) `using` cons)) `using` join
    negative_decimal = ((literal '-') `and_then` positive_decimal) `using` cons
  68. number :: Parser Char [Char]
    number = negative_decimal `alt` positive_decimal `alt` negative_integer `alt` positive_integer
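The `number` parser relies on two helpers the slides do not show. The sketch below assumes `cons` is `(:)` on a pair and `join` is `(++)` on a pair, which is what the surrounding types call for; everything else follows the slide definitions.

```haskell
import Data.Char (isDigit)

type Parser a b = [a] -> [(b, [a])]

succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

satisfy :: (a -> Bool) -> Parser a a
satisfy p (x:xs) | p x = [(x, xs)]
satisfy _ _ = []

literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

-- Assumed helpers: pair versions of (:) and (++).
cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

join :: ([a], [a]) -> [a]
join (xs, ys) = xs ++ ys

many, some :: Parser a b -> Parser a [b]
many p = ((p `and_then` many p) `using` cons) `alt` succeed []
some p = (p `and_then` many p) `using` cons

positive_integer, negative_integer, positive_decimal, negative_decimal,
  number :: Parser Char String
positive_integer = some (satisfy isDigit)
negative_integer = (literal '-' `and_then` positive_integer) `using` cons
positive_decimal =
  (positive_integer `and_then`
    ((literal '.' `and_then` positive_integer) `using` cons)) `using` join
negative_decimal = (literal '-' `and_then` positive_decimal) `using` cons

-- Decimals are tried before integers so the longest match comes first.
number = negative_decimal `alt` positive_decimal
           `alt` negative_integer `alt` positive_integer
```

Note that `number` returns the matched text, not a numeric value; converting it is the job of `using`.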
  69. word :: Parser Char [Char]
    word = some (satisfy isLetter)
  70. string :: (Eq a) => [a] -> Parser a [a]
    string [] = succeed []
    string (x:xs) = (literal x `and_then` string xs) `using` cons
  71. (string "begin") "begin end" -- => [("begin"," end")]

  72. xthen :: Parser a b -> Parser a c -> Parser a c
    p1 `xthen` p2 = (p1 `and_then` p2) `using` snd
  73. thenx :: Parser a b -> Parser a c -> Parser a b
    p1 `thenx` p2 = (p1 `and_then` p2) `using` fst
  74. ret :: Parser a b -> c -> Parser a c
    p `ret` v = p `using` (const v)
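A short, self-contained sketch of how `string`, `xthen`, `thenx` and `ret` behave in practice. The definitions follow the slides; `cons` is again assumed to be `(:)` on a pair, and the example inputs are made up for illustration.

```haskell
type Parser a b = [a] -> [(b, [a])]

succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

literal :: Eq a => a -> Parser a a
literal x (y:ys) | x == y = [(y, ys)]
literal _ _ = []

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

-- Assumed helper: (:) on a pair.
cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

-- Match an exact sequence of items.
string :: Eq a => [a] -> Parser a [a]
string [] = succeed []
string (x:xs) = (literal x `and_then` string xs) `using` cons

-- Sequence two parsers, keeping only the second result.
xthen :: Parser a b -> Parser a c -> Parser a c
p1 `xthen` p2 = (p1 `and_then` p2) `using` snd

-- Sequence two parsers, keeping only the first result.
thenx :: Parser a b -> Parser a c -> Parser a b
p1 `thenx` p2 = (p1 `and_then` p2) `using` fst

-- Replace whatever p produced with a fixed value.
ret :: Parser a b -> c -> Parser a c
p `ret` v = p `using` const v
```

For example, `(literal '(' `xthen` literal 'a') "(a)"` keeps only the `'a'`, and `(string "true" `ret` True) "true!"` turns matched text into a Haskell `Bool`.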
  75. succeed, failure, satisfy, literal, alt, and_then, using, many, some,
    string, word, number, xthen, thenx, ret
  76. Expression Parser & Evaluator

  77. data Expr = Const Double
              | Expr `Add` Expr
              | Expr `Sub` Expr
              | Expr `Mul` Expr
              | Expr `Div` Expr
  78. (Const 3) `Mul` ((Const 6) `Add` (Const 1)) -- => "3*(6+1)"
  79. parse "3*(6+1)" -- => (Const 3) `Mul` ((Const 6) `Add` (Const 1))
  80. (Const 3) `Mul` ((Const 6) `Add` (Const 1)) -- => 21
  81. BNF Notation
    expn ::= expn + expn | expn − expn | expn ∗ expn | expn / expn | digit+ | (expn)
  82. Improving a little:
    expn ::= term + term | term − term | term
    term ::= factor ∗ factor | factor / factor | factor
    factor ::= digit+ | (expn)
  83. Parsers that resemble BNF

  84. addition = ((term `and_then` ((literal '+') `xthen` term)) `using` plus)

  85. subtraction = ((term `and_then` ((literal '-') `xthen` term)) `using` minus)

  86. multiplication = ((factor `and_then` ((literal '*') `xthen` factor)) `using` times)

  87. division = ((factor `and_then` ((literal '/') `xthen` factor)) `using` divide)

  88. parenthesised_expn = ((nibble (literal '(')) `xthen` ((nibble expn) `thenx` (nibble (literal ')'))))
  89. value xs = Const (numval xs)
    plus (x,y) = x `Add` y
    minus (x,y) = x `Sub` y
    times (x,y) = x `Mul` y
    divide (x,y) = x `Div` y
  90. expn = addition `alt` subtraction `alt` term

  91. term = multiplication `alt` division `alt` factor

  92. factor = (number `using` value) `alt` parenthesised_expn

  93. expn "12*(5+(7-2))"
    -- => [ (Const 12.0 `Mul` (Const 5.0 `Add` (Const 7.0 `Sub` Const 2.0)),""), … ]
  94. value xs = Const (numval xs)
    plus (x,y) = x `Add` y
    minus (x,y) = x `Sub` y
    times (x,y) = x `Mul` y
    divide (x,y) = x `Div` y
  95. value = numval
    plus (x,y) = x + y
    minus (x,y) = x - y
    times (x,y) = x * y
    divide (x,y) = x / y
  96. expn "12*(5+(7-2))" -- => [(120.0,""), (12.0,"*(5+(7-2))"), (1.0,"2*(5+(7-2))")]

  97. expn "(12+1)*(5+(7-2))" -- => [(130.0,""), (13.0,"*(5+(7-2))")]
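Putting the pieces together, the evaluator variant can be sketched as one self-contained file. Two simplifications are assumed here and are not in the slides: numbers are unsigned integers only, and `nibble` is left out because whitespace has not been introduced yet. `numval` is taken to be `read`.

```haskell
import Data.Char (isDigit)

type Parser a b = [a] -> [(b, [a])]

succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

satisfy :: (a -> Bool) -> Parser a a
satisfy p (x:xs) | p x = [(x, xs)]
satisfy _ _ = []

literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

xthen :: Parser a b -> Parser a c -> Parser a c
p1 `xthen` p2 = (p1 `and_then` p2) `using` snd

thenx :: Parser a b -> Parser a c -> Parser a b
p1 `thenx` p2 = (p1 `and_then` p2) `using` fst

cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

many, some :: Parser a b -> Parser a [b]
many p = ((p `and_then` many p) `using` cons) `alt` succeed []
some p = (p `and_then` many p) `using` cons

-- Simplification: unsigned integers only.
number :: Parser Char String
number = some (satisfy isDigit)

-- Evaluation functions: compute values instead of building a tree.
value :: String -> Double
value = read

plus, minus, times, divide :: (Double, Double) -> Double
plus (x, y) = x + y
minus (x, y) = x - y
times (x, y) = x * y
divide (x, y) = x / y

-- The grammar, written so that each rule mirrors its BNF production.
expn, term, factor, parenthesised_expn :: Parser Char Double
expn = addition `alt` subtraction `alt` term
  where
    addition    = (term `and_then` (literal '+' `xthen` term)) `using` plus
    subtraction = (term `and_then` (literal '-' `xthen` term)) `using` minus
term = multiplication `alt` division `alt` factor
  where
    multiplication = (factor `and_then` (literal '*' `xthen` factor)) `using` times
    division       = (factor `and_then` (literal '/' `xthen` factor)) `using` divide
factor = (number `using` value) `alt` parenthesised_expn
parenthesised_expn = literal '(' `xthen` (expn `thenx` literal ')')
```

The first result with an empty remainder is the complete parse; the other entries are partial parses that stopped early.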

  98. Moving beyond toy parsers

  99. Whitespace?

  100. white = (literal ' ') `alt` (literal '\t') `alt` (literal '\n')
  101. white = many (any literal " \t\n")

  102. /\s*/

  103. any p = foldr (alt.p) failure

  104. any p [x1,x2,...,xn] = (p x1) `alt` (p x2) `alt` ... `alt` (p xn)
  105. white = many (any literal " \t\n")

  106. nibble p = white `xthen` (p `thenx` white)

  107. The parser (nibble p) has the same behaviour as parser p, except that it eats up any whitespace in the input string before or afterwards
  108. (nibble (literal 'a')) " a  " -- => [('a',""),('a'," "),('a',"  ")]
  109. symbol = nibble.string

  110. symbol "$fold" " $fold " -- => [("$fold", ""), ("$fold", " ")]
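The whitespace helpers can be tried out with the sketch below. One naming change is an assumption on my part: the combinator is called `any_of` (the name the lexer slides use later) because `any` would clash with the Prelude function of the same name.

```haskell
type Parser a b = [a] -> [(b, [a])]

succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

failure :: Parser a b
failure _ = []

literal :: Eq a => a -> Parser a a
literal x (y:ys) | x == y = [(y, ys)]
literal _ _ = []

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

xthen :: Parser a b -> Parser a c -> Parser a c
p1 `xthen` p2 = (p1 `and_then` p2) `using` snd

thenx :: Parser a b -> Parser a c -> Parser a b
p1 `thenx` p2 = (p1 `and_then` p2) `using` fst

cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

many :: Parser a b -> Parser a [b]
many p = ((p `and_then` many p) `using` cons) `alt` succeed []

string :: Eq a => [a] -> Parser a [a]
string [] = succeed []
string (x:xs) = (literal x `and_then` string xs) `using` cons

-- any_of p [x1, ..., xn] = p x1 `alt` ... `alt` p xn `alt` failure
any_of :: (x -> Parser a b) -> [x] -> Parser a b
any_of p = foldr (alt . p) failure

-- Zero or more spaces, tabs or newlines.
white :: Parser Char String
white = many (any_of literal " \t\n")

-- Run p, discarding whitespace on both sides.
nibble :: Parser Char b -> Parser Char b
nibble p = white `xthen` (p `thenx` white)

symbol :: String -> Parser Char String
symbol = nibble . string
```

The multiple results come from `many` inside `white`: each way of consuming the trailing whitespace yields its own entry.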
  111. The Offside Rule

  112. w = x + y
           where
             x = 10
             y = 15 - 5
       z = w * 2
  114. When obeying the offside rule, every token must lie either

    directly below, or to the right of its first token
  115. i.e. A weak indentation policy

  116. The Offside Combinator

  117. type Pos a = (a, (Integer, Integer))

  118. prelex "3 + \n  2 * (4 + 5)"
    -- => [('3',(0,0)), ('+',(0,2)), ('2',(1,2)), ('*',(1,4)), … ]
  119. satisfy :: (a -> Bool) -> Parser a a
    satisfy p [] = failure []
    satisfy p (x:xs)
      | p x = succeed x xs       -- if p(x) is true
      | otherwise = failure []
  120. satisfy :: (a -> Bool) -> Parser (Pos a) a
    satisfy p [] = failure []
    satisfy p (x:xs)
      | p a = succeed a xs       -- if p(a) is true
      | otherwise = failure []
      where (a, (r, c)) = x
  122. offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [(v, inpOFF) | (v, []) <- (p inpON)]
      where
        inpON = takeWhile (onside (head inp)) inp
        inpOFF = drop (length inpON) inp
        onside (a, (r, c)) (b, (r', c')) = r' >= r && c' >= c
  123. offside :: Parser (Pos a) b -> Parser (Pos a) b
  124. offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [(v, inpOFF) | (v, []) <- (p inpON)]
  125. (nibble (literal 'a')) " a  " -- => [('a',""),('a'," "),('a',"  ")]
  126. offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [(v, inpOFF) | (v, []) <- (p inpON)]
  127. offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [(v, inpOFF) | (v, []) <- (p inpON)]
      where
        inpON = takeWhile (onside (head inp)) inp
  128. offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [(v, inpOFF) | (v, []) <- (p inpON)]
      where
        inpON = takeWhile (onside (head inp)) inp
        inpOFF = drop (length inpON) inp
  129. offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [(v, inpOFF) | (v, []) <- (p inpON)]
      where
        inpON = takeWhile (onside (head inp)) inp
        inpOFF = drop (length inpON) inp
        onside (a, (r, c)) (b, (r', c')) = r' >= r && c' >= c
  130. (3 + 2 * (4 + 5)) + (8 * 10)
    (3 + 2 * (4 + 5)) + (8 * 10)
  131. (offside expn) (prelex inp_1)
    -- => [(21.0,[('+',(2,0)),('(',(2,2)),('8',(2,3)),('*',(2,5)),('1',(2,7)),('0',(2,8)),(')',(2,9))])]
    (offside expn) (prelex inp_2)
    -- => [(101.0,[])]
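The slides never show `prelex`, so the sketch below assumes one: it tags every character with a (row, column) pair, starts a new row at each newline, and advances to the next multiple of 8 on a tab. `everything` and `demoTokens` are hypothetical names used only to demonstrate `offside`; whitespace is filtered out by hand so that only real tokens take part in the column comparison.

```haskell
import Data.Char (isSpace)

type Parser a b = [a] -> [(b, [a])]
type Pos a = (a, (Integer, Integer))

-- Assumed implementation: attach (row, column) to every character.
prelex :: String -> [Pos Char]
prelex = go (0, 0)
  where
    go _ [] = []
    go (r, c) (x:xs)
      | x == '\n' = (x, (r, c)) : go (r + 1, 0) xs
      | x == '\t' = (x, (r, c)) : go (r, (c `div` 8 + 1) * 8) xs
      | otherwise = (x, (r, c)) : go (r, c + 1) xs

-- Run p only on the tokens that are onside relative to the first token,
-- and accept only results that consume all of them.
offside :: Parser (Pos a) b -> Parser (Pos a) b
offside p inp = [ (v, inpOFF) | (v, []) <- p inpON ]
  where
    inpON  = takeWhile (onside (head inp)) inp
    inpOFF = drop (length inpON) inp
    onside (_, (r, c)) (_, (r', c')) = r' >= r && c' >= c

-- Hypothetical demo parser: consume every token it is given.
everything :: Parser (Pos a) [a]
everything inp = [(map fst inp, [])]

-- 'a' sits at column 2; 'd' on the last line is at column 0, so it is offside.
demoTokens :: [Pos Char]
demoTokens = filter (not . isSpace . fst) (prelex "  a b\n   c\nd")
```

Here `offside everything demoTokens` hands `a`, `b` and `c` to the inner parser and returns the `d` token as unused input.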
  132. Quick recap before we

  133. ∅
    |> succeed, failure
    |> satisfy, literal
    |> alt, and_then, using
    |> many, some
    |> string, thenx, xthen, ret
    |> expression parser & evaluator
    |> any, nibble, symbol
    |> prelex, offside
  134. Practical parsers

  135. Syntactical analysis Lexical analysis Parse trees

  136. type Parser a b = [a] -> [(b, [a])]
    type Pos a = (a, (Integer, Integer))
  137. data Tag = Ident | Number | Symbol | Junk deriving (Show, Eq)
    type Token = (Tag, [Char])
  138. (Symbol, "if") (Number, "123")

  139. Parse the string with parser p, and apply tag t to the result
  140. (p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) <- p inp]
      where (x, (r,c)) = head inp
  141. (p `tok` t) inp = [ ((<token>,<pos>),<unused input>) | (xs, out) <- p inp]
      where (x, (r,c)) = head inp
  142. (p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) <- p inp]
      where (x, (r,c)) = head inp
  143. ((string "where") `tok` Symbol) inp -- => ((Symbol,"where"), (r, c))

  144. many ((p1 `tok` t1) `alt` (p2 `tok` t2) `alt` ... `alt` (pn `tok` tn))
  145. [(p1, t1), (p2, t2), …, (pn, tn)]

  146. lex = many.(foldr op failure)
      where (p, t) `op` xs = (p `tok` t) `alt` xs
  148. lex = many.(foldr op failure)
      where (p, t) `op` xs = (p `tok` t) `alt` xs
  149. -- Rightmost computation
    cn = (pn `tok` tn) `alt` failure

  150. -- Followed by
    (pn-1 `tok` tn-1) `alt` cn

  151. many ((p1 `tok` t1) `alt` (p2 `tok` t2) `alt` ... `alt` (pn `tok` tn))
  152. lexer = lex [ ((some (any_of literal " \n\t")), Junk),
                    ((string "where"), Symbol),
                    (word, Ident),
                    (number, Number),
                    ((any_of string ["(", ")", "="]), Symbol)]
  155. head (lexer (prelex "where x = 10"))
    -- => ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])
  156. (head.lexer.prelex) "where x = 10"
    -- => ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])
  157. (head.lexer.prelex) "where x = 10"
    -- => ([((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10)) ],[])
    Function composition
  158. length ((lexer.prelex) "where x = 10") -- => 198

  159. Conflicts? Ambiguity?

  160. In this case, "where" is a source of conflict. It can be a symbol or an identifier.
  161. lexer = lex [ {- 1 -} ((some (any_of literal " \n\t")), Junk),
                    {- 2 -} ((string "where"), Symbol),
                    {- 3 -} (word, Ident),
                    {- 4 -} (number, Number),
                    {- 5 -} ((any_of string ["(",")","="]), Symbol)]
  162. Higher priority, higher precedence

  163. Removing Junk

  164. strip :: [(Pos Token)] -> [(Pos Token)]
    strip = filter ((/= Junk).fst.fst)
  165. ((/= Junk).fst.fst) ((Symbol,"where"),(0,0)) -- => True
    ((/= Junk).fst.fst) ((Junk,"where"),(0,0)) -- => False
  166. (fst.head.lexer.prelex) "where x = 10"
    -- => [((Symbol,"where"),(0,0)), ((Junk," "),(0,5)), ((Ident,"x"),(0,6)), ((Junk," "),(0,7)), ((Symbol,"="),(0,8)), ((Junk," "),(0,9)), ((Number,"10"),(0,10))]
  167. (strip.fst.head.lexer.prelex) "where x = 10"
    -- => [((Symbol,"where"),(0,0)), ((Ident,"x"),(0,6)), ((Symbol,"="),(0,8)), ((Number,"10"),(0,10))]
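The lexer slides can be assembled into one runnable sketch. Several things here are assumptions rather than the speaker's code: `prelex` is a single-line simplification (row is always 0), `word` and `number` are inlined as plain letter and digit runs, and Prelude's own `lex` is hidden so the slide's name can be reused.

```haskell
import Prelude hiding (lex)
import Data.Char (isDigit, isLetter)

type Parser a b = [a] -> [(b, [a])]
type Pos a = (a, (Integer, Integer))

data Tag = Ident | Number | Symbol | Junk deriving (Show, Eq)
type Token = (Tag, [Char])

succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

failure :: Parser a b
failure _ = []

-- Position-aware satisfy: test the item, drop its position.
satisfy :: (a -> Bool) -> Parser (Pos a) a
satisfy p ((a, _) : xs) | p a = [(a, xs)]
satisfy _ _ = []

literal :: Eq a => a -> Parser (Pos a) a
literal x = satisfy (== x)

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

many, some :: Parser a b -> Parser a [b]
many p = ((p `and_then` many p) `using` cons) `alt` succeed []
some p = (p `and_then` many p) `using` cons

string :: Eq a => [a] -> Parser (Pos a) [a]
string [] = succeed []
string (x:xs) = (literal x `and_then` string xs) `using` cons

any_of :: (x -> Parser a b) -> [x] -> Parser a b
any_of p = foldr (alt . p) failure

-- Simplified prelex (assumption): single-line input, so the row is always 0.
prelex :: String -> [Pos Char]
prelex s = zip s [ (0, c) | c <- [0 ..] ]

-- Tag the text matched by p and record where it started.
tok :: Parser (Pos Char) [Char] -> Tag -> Parser (Pos Char) (Pos Token)
(p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) <- p inp ]
  where (_, (r, c)) = head inp

lex :: [(Parser (Pos Char) [Char], Tag)] -> Parser (Pos Char) [Pos Token]
lex = many . foldr op failure
  where (p, t) `op` xs = (p `tok` t) `alt` xs

-- Earlier entries win conflicts: "where" is a Symbol before it is an Ident.
lexer :: Parser (Pos Char) [Pos Token]
lexer = lex
  [ (some (any_of literal " \n\t"), Junk)
  , (string "where", Symbol)
  , (some (satisfy isLetter), Ident)
  , (some (satisfy isDigit), Number)
  , (any_of string ["(", ")", "="], Symbol) ]

strip :: [Pos Token] -> [Pos Token]
strip = filter ((/= Junk) . fst . fst)
```

Taking `head` of the lexer's results selects the parse in which every token is as long as possible and the earliest matching table entry is used.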
  168. Syntax Analysis

  169. characters |> lexical analysis |> tokens

  170. tokens |> syntax analysis |> parse trees

  171. f x y = add a b
               where
                 a = 25
                 b = sub x y
       answer = mult (f 3 7) 5
  172. (same script) Script
  173. (same script) Definition
  174. (same script) Body
  175. (same script) Expression
  176. (same script) Definition
  177. (same script) Primitives
  178. data Script = Script [Def]
    data Def = Def Var [Var] Expn
    data Expn = Var Var | Num Double | Expn `Apply` Expn | Expn `Where` [Def]
    type Var = [Char]
  179. prog = (many defn) `using` Script

  180. defn = ( (some (kind Ident)) `and_then` ((lit "=") `xthen` (offside body))) `using` defnFN
  181. body = ( expr `and_then` (((lit "where") `xthen` (some defn)) `opt` [])) `using` bodyFN
  182. expr = (some prim) `using` (foldl1 Apply)

  183. prim = ((kind Ident) `using` Var) `alt` ((kind Number) `using` numFN) `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
  184. -- only allow a kind of tag
    kind :: Tag -> Parser (Pos Token) [Char]
    kind t = (satisfy ((== t).fst)) `using` snd
    -- only allow a given symbol
    lit :: [Char] -> Parser (Pos Token) [Char]
    lit xs = (literal (Symbol, xs)) `using` snd
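The grammar above refers to `opt`, `defnFN`, `bodyFN` and `numFN` without showing them. The following sketch gives plausible definitions inferred from how they are used and from the data types on the slides; treat them as assumptions, not the speaker's code. The `deriving` clauses are added only so the values can be compared and printed.

```haskell
type Parser a b = [a] -> [(b, [a])]

type Var = [Char]
data Script = Script [Def] deriving (Show, Eq)
data Def = Def Var [Var] Expn deriving (Show, Eq)
data Expn = Var Var | Num Double | Expn `Apply` Expn | Expn `Where` [Def]
  deriving (Show, Eq)

-- Assumed: try p, and additionally offer the default v without consuming input.
opt :: Parser a b -> b -> Parser a b
(p `opt` v) inp = p inp ++ [(v, inp)]

-- Assumed: the first identifier is the name being defined, the rest are arguments.
defnFN :: ([Var], Expn) -> Def
defnFN (f : args, e) = Def f args e
defnFN ([], _) = error "defnFN: a definition needs a name"

-- Assumed: wrap the expression in Where only when local definitions exist.
bodyFN :: (Expn, [Def]) -> Expn
bodyFN (e, []) = e
bodyFN (e, ds) = e `Where` ds

-- Assumed: numval is read, so "25" becomes Num 25.0.
numFN :: String -> Expn
numFN = Num . read
```

The empty-list case of `defnFN` should be unreachable in the grammar, since `some (kind Ident)` never yields an empty list.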
  191. Orange functions are for transforming values.

  192. Use data constructors to generate parse trees

  193. Use evaluation functions to evaluate and generate a value

  194. f x y = add a b
               where
                 a = 25
                 b = sub x y
       answer = mult (f 3 7) 5
  195. Script [
      Def "f" ["x","y"] (
        ((Var "add" `Apply` Var "a") `Apply` Var "b")
        `Where` [
          Def "a" [] (Num 25.0),
          Def "b" [] ((Var "sub" `Apply` Var "x") `Apply` Var "y")]),
      Def "answer" [] (
        (Var "mult" `Apply` ((Var "f" `Apply` Num 3.0) `Apply` Num 7.0))
        `Apply` Num 5.0)]
  196. Strategy for writing parsers

  197. 1. Identify components i.e. Lexical elements

  198. lexer = lex [ ((some (any_of literal " \n\t")), Junk),
                    ((string "where"), Symbol),
                    (word, Ident),
                    (number, Number),
                    ((any_of string ["(", ")", "="]), Symbol)]
  199. 2. Structure these elements a.k.a. syntax

  200. defn = ((some (kind Ident)) `and_then` ((lit "=") `xthen` (offside body))) `using` defnFN
    body = (expr `and_then` (((lit "where") `xthen` (some defn)) `opt` [])) `using` bodyFN
    expr = (some prim) `using` (foldl1 Apply)
    prim = ((kind Ident) `using` Var) `alt` ((kind Number) `using` numFN) `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
  201. 3. BNF notation is very helpful

  202. 4. TDD in the absence of types

  203. Where to, next?

  204. Monadic Parsers Graham Hutton, Erik Meijer

  205. Introduction to FP Philip Wadler

  206. The Dragon Book If your interest is in compilers

  207. Libraries?

  208. Haskell: Parsec, Megaparsec. ✨
    OCaml: Angstrom. ✨
    Ruby: rparsec, or roll your own
    Elixir: Combine, ExParsec
    Python: Parsec. ✨
  209. Thank you!

  210. Twitter: @_swanand GitHub: @swanandp