Combinator Parsing

A deep dive into Graham Hutton's "Higher Order Functions for Parsing" (Journal of Functional Programming, 1992), with a Haskell implementation.

Swanand Pagnis

January 20, 2018

Transcript

  1. 3.

    • Abstract & Introduction
    • Build a parser, one fn at a time
    • Moving beyond toy parsers
  2. 5.

    In combinator parsing, the text of parsers resembles BNF notation.

    We present the basic method, and a number of extensions. We address the special problems presented by whitespace, and parsers with separate lexical and syntactic phases. In particular, a combining form for handling the “offside rule” is given. Other extensions to the basic method include an “into” combining form with many useful applications, and a simple means by which combinator parsers can produce more informative error messages.
  3. 6.

    • Combinators that resemble BNF notation
    • Whitespace handling through "Offside Rule"
    • "Into" combining form for advanced parsing
    • Strategy for better error messages
  4. 10.

    • Lexical analysis and syntax
    • Combine the combinators
    • Define lexical elements
    • Return results and unused input
  5. 22.
  6. 23.

    Simple when we stick to fundamental FP:
    • Higher order functions
    • Immutability
    • Recursive problem solving
    • Algebraic types
  7. 26.

    Types help with abstraction
    • We'll be dealing with parsers and combinators
    • Parsers are functions, they accept input and return results
    • Combinators accept parsers and return parsers
  8. 27.

    A parser is a function that accepts an input and returns parsed results and the unused input for each result
  9. 28.

    Parser is a function type that accepts a list of type a and returns all possible results as a list of tuples of type (b, [a])
  10. 29.

    (Parser Char Number)
    input:  "42 it is!"         -- a is Char, so the input is a [Char]
    output: [(42, " it is!")]   -- b is a Number
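The type described on these slides can be sketched as follows. The definitions of `succeed` and `failure` are assumptions here (the slides use them but do not show them), and `fortyTwo` is a hypothetical toy parser added only for illustration.

```haskell
-- A parser takes a list of input symbols of type `a` and returns every
-- possible parse: a result of type `b` paired with the unconsumed input.
type Parser a b = [a] -> [(b, [a])]

-- Assumed primitive: always succeeds with `v` and consumes nothing.
succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

-- Assumed primitive: always fails; the empty list means "no parse".
failure :: Parser a b
failure _ = []

-- Hypothetical toy parser: recognises exactly the prefix "42".
fortyTwo :: Parser Char Int
fortyTwo ('4':'2':rest) = succeed 42 rest
fortyTwo inp            = failure inp
```

Applied to "42 it is!", `fortyTwo` returns a single result paired with the leftover text; applied to anything else it returns the empty list.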
  11. 36.

    satisfy :: (a -> Bool) -> Parser a a
    satisfy p []     = failure []
    satisfy p (x:xs)
      | p x       = succeed x xs  -- if p(x) is true
      | otherwise = failure []
  12. 37.

    satisfy :: (a -> Bool) -> Parser a a
    satisfy p []     = failure []
    satisfy p (x:xs)
      | p x       = succeed x xs  -- if p(x) is true
      | otherwise = failure []

    Guard Clauses, if you want to Google
  13. 38.

    literal :: Eq a => a -> Parser a a
    literal x = satisfy (== x)
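A minimal runnable sketch of `satisfy` and `literal`, assuming the usual definitions of `succeed` and `failure` (they are not shown on the slides):

```haskell
type Parser a b = [a] -> [(b, [a])]

-- Assumed primitives, not shown on the slides.
succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

failure :: Parser a b
failure _ = []

-- Consume one symbol if it satisfies the predicate.
satisfy :: (a -> Bool) -> Parser a a
satisfy _ []     = failure []
satisfy p (x:xs)
  | p x       = succeed x xs
  | otherwise = failure []

-- Consume one symbol if it equals the given one.
literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)
```

For example, `literal 'a' "abc"` yields one result, the matched symbol paired with "bc", while `literal 'a' "xbc"` yields the empty list.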
  14. 43.

    alt :: Parser a b -> Parser a b -> Parser a b
    (p1 `alt` p2) inp = p1 inp ++ p2 inp
  15. 44.
  16. 46.
  17. 47.

    and_then :: Parser a b -> Parser a c -> Parser a (b, c)
    (p1 `and_then` p2) inp =
      [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]
  18. 48.

    and_then :: Parser a b -> Parser a c -> Parser a (b, c)
    (p1 `and_then` p2) inp =
      [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

    List comprehensions
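To see choice and sequencing side by side, here is a small self-contained sketch. `aOrB` and `aThenB` are hypothetical example parsers, and `literal` is written out directly so the block stands alone.

```haskell
type Parser a b = [a] -> [(b, [a])]

-- Stand-alone version of literal so this block needs nothing else.
literal :: Eq a => a -> Parser a a
literal x (y:ys) | x == y = [(y, ys)]
literal _ _               = []

-- Choice: collect the results of both parsers on the same input.
alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

-- Sequencing: run p2 on whatever input each p1 result left over.
and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

-- Hypothetical examples.
aOrB :: Parser Char Char
aOrB = literal 'a' `alt` literal 'b'

aThenB :: Parser Char (Char, Char)
aThenB = literal 'a' `and_then` literal 'b'
```

Because failure is just the empty list, `and_then` needs no special error handling: if `p1` produces nothing, the comprehension produces nothing.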
  19. 49.

    [Diagram: p1 yields (v11, out11), (v12, out12), (v13, out13), …; p2 then runs on each remaining input, yielding (v21, out21), (v22, out22), … and (v31, out31), (v32, out32), …]
  20. 56.

    using :: Parser a b -> (b -> c) -> Parser a c
    (p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]
  21. 59.

    many :: Parser a b -> Parser a [b]
    many p = ((p `and_then` many p) `using` cons) `alt` (succeed [])
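A runnable sketch of `using` and `many` together. The helper `cons` is used on the slide but not defined there; the uncurried `(:)` below is an assumption about its meaning.

```haskell
type Parser a b = [a] -> [(b, [a])]

succeed :: b -> Parser a b
succeed v inp = [(v, inp)]

literal :: Eq a => a -> Parser a a
literal x (y:ys) | x == y = [(y, ys)]
literal _ _               = []

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

-- Transform every result value with f, leaving the unused input alone.
using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

-- Assumed helper: (:) on a pair.
cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

-- Zero or more repetitions of p, longest match first.
many :: Parser a b -> Parser a [b]
many p = ((p `and_then` many p) `using` cons) `alt` succeed []
```

Note that `many (literal 'a') "aab"` returns every prefix match, not only the longest: "aa", then "a", then the empty match. The longest parse comes first because the recursive alternative is listed before `succeed []`.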
  22. 60.
  23. 63.

    some :: Parser a b -> Parser a [b]
    some p = ((p `and_then` many p) `using` cons)
  24. 64.
  25. 67.

    positive_integer = some (satisfy Data.Char.isDigit)
    negative_integer = ((literal '-') `and_then` positive_integer) `using` cons
    positive_decimal = (positive_integer `and_then`
                         (((literal '.') `and_then` positive_integer) `using` cons))
                       `using` join
    negative_decimal = ((literal '-') `and_then` positive_decimal) `using` cons
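The number parsers can be tried out with the sketch below. The helpers `cons` and `join` are not defined on the slides; they are assumed here to be the uncurried `(:)` and `(++)`, and the type signatures are added for clarity.

```haskell
import Data.Char (isDigit)

type Parser a b = [a] -> [(b, [a])]

satisfy :: (a -> Bool) -> Parser a a
satisfy p (x:xs) | p x = [(x, xs)]
satisfy _ _            = []

literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

-- Assumed helpers: (:) and (++) on pairs.
cons :: (a, [a]) -> [a]
cons (x, xs) = x : xs

join :: ([a], [a]) -> [a]
join (xs, ys) = xs ++ ys

many, some :: Parser a b -> Parser a [b]
many p = some p `alt` (\inp -> [([], inp)])
some p = (p `and_then` many p) `using` cons

positive_integer, negative_integer :: Parser Char String
positive_integer = some (satisfy isDigit)
negative_integer = (literal '-' `and_then` positive_integer) `using` cons

positive_decimal, negative_decimal :: Parser Char String
positive_decimal =
  (positive_integer `and_then`
    ((literal '.' `and_then` positive_integer) `using` cons)) `using` join
negative_decimal = (literal '-' `and_then` positive_decimal) `using` cons
```

Each parser returns the matched text as a string; taking the head of the result list picks the longest match.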
  26. 68.
  27. 70.

    string :: (Eq a) => [a] -> Parser a [a]
    string []     = succeed []
    string (x:xs) = (literal x `and_then` string xs) `using` cons
  28. 72.

    xthen :: Parser a b -> Parser a c -> Parser a c
    p1 `xthen` p2 = (p1 `and_then` p2) `using` snd
  29. 73.

    thenx :: Parser a b -> Parser a c -> Parser a b
    p1 `thenx` p2 = (p1 `and_then` p2) `using` fst
  30. 74.

    ret :: Parser a b -> c -> Parser a c
    p `ret` v = p `using` (const v)
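The four helpers `string`, `xthen`, `thenx` and `ret` can be exercised with a short sketch. `bracketed` and `true` are hypothetical example parsers, not taken from the talk.

```haskell
type Parser a b = [a] -> [(b, [a])]

satisfy :: (a -> Bool) -> Parser a a
satisfy p (x:xs) | p x = [(x, xs)]
satisfy _ _            = []

literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

-- Match an exact sequence of symbols.
string :: Eq a => [a] -> Parser a [a]
string []     = \inp -> [([], inp)]
string (x:xs) = (literal x `and_then` string xs) `using` (\(y, ys) -> y : ys)

-- Sequence two parsers, keeping only the second / first result.
xthen :: Parser a b -> Parser a c -> Parser a c
p1 `xthen` p2 = (p1 `and_then` p2) `using` snd

thenx :: Parser a b -> Parser a c -> Parser a b
p1 `thenx` p2 = (p1 `and_then` p2) `using` fst

-- Replace the result of p with a fixed value.
ret :: Parser a b -> c -> Parser a c
p `ret` v = p `using` const v

-- Hypothetical examples: one symbol in parentheses, and a keyword.
bracketed :: Parser Char Char
bracketed = literal '(' `xthen` (satisfy (/= ')') `thenx` literal ')')

true :: Parser Char Bool
true = string "true" `ret` True
```

`xthen` and `thenx` are what let a grammar mention punctuation without carrying it into the result, as `bracketed` shows.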
  31. 77.

    data Expr = Const Double
              | Expr `Add` Expr
              | Expr `Sub` Expr
              | Expr `Mul` Expr
              | Expr `Div` Expr
  32. 81.

    BNF Notation

    expn ::= expn + expn
           | expn − expn
           | expn ∗ expn
           | expn / expn
           | digit+
           | (expn)
  33. 82.

    Improving a little:

    expn   ::= term + term | term − term | term
    term   ::= factor ∗ factor | factor / factor | factor
    factor ::= digit+ | (expn)
  34. 89.

    value xs     = Const (numval xs)
    plus (x,y)   = x `Add` y
    minus (x,y)  = x `Sub` y
    times (x,y)  = x `Mul` y
    divide (x,y) = x `Div` y
  35. 93.

    expn "12*(5+(7-2))"
    -- => [ (Const 12.0 `Mul` (Const 5.0 `Add` (Const 7.0 `Sub` Const 2.0)),""), … ]
  36. 94.

    value xs     = Const (numval xs)
    plus (x,y)   = x `Add` y
    minus (x,y)  = x `Sub` y
    times (x,y)  = x `Mul` y
    divide (x,y) = x `Div` y
  37. 95.

    value        = numval
    plus (x,y)   = x + y
    minus (x,y)  = x - y
    times (x,y)  = x * y
    divide (x,y) = x / y
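The point of the last two slides is that the same grammar can build a syntax tree or compute a value, depending only on the semantic functions. A sketch of that idea follows. `expnWith` is a hypothetical wrapper that takes the five semantic functions as arguments, `ast` and `eval` are hypothetical names, and `read` stands in for `numval`, which the slides do not define.

```haskell
import Data.Char (isDigit)

type Parser a b = [a] -> [(b, [a])]

data Expr = Const Double
          | Expr `Add` Expr | Expr `Sub` Expr
          | Expr `Mul` Expr | Expr `Div` Expr
  deriving (Show, Eq)

satisfy :: (a -> Bool) -> Parser a a
satisfy p (x:xs) | p x = [(x, xs)]
satisfy _ _            = []

literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

xthen :: Parser a b -> Parser a c -> Parser a c
p1 `xthen` p2 = (p1 `and_then` p2) `using` snd

thenx :: Parser a b -> Parser a c -> Parser a b
p1 `thenx` p2 = (p1 `and_then` p2) `using` fst

many, some :: Parser a b -> Parser a [b]
many p = some p `alt` (\inp -> [([], inp)])
some p = (p `and_then` many p) `using` (\(x, xs) -> x : xs)

-- The "improved" grammar, parameterised over its semantic functions.
expnWith :: (String -> r) -> ((r, r) -> r) -> ((r, r) -> r)
         -> ((r, r) -> r) -> ((r, r) -> r) -> Parser Char r
expnWith value plus minus times divide = expn
  where
    expn = ((term `and_then` (literal '+' `xthen` term)) `using` plus)
           `alt` ((term `and_then` (literal '-' `xthen` term)) `using` minus)
           `alt` term
    term = ((factor `and_then` (literal '*' `xthen` factor)) `using` times)
           `alt` ((factor `and_then` (literal '/' `xthen` factor)) `using` divide)
           `alt` factor
    factor = (some (satisfy isDigit) `using` value)
             `alt` (literal '(' `xthen` (expn `thenx` literal ')'))

-- Build a syntax tree.
ast :: Parser Char Expr
ast = expnWith (Const . read) (uncurry Add) (uncurry Sub) (uncurry Mul) (uncurry Div)

-- Evaluate directly while parsing.
eval :: Parser Char Double
eval = expnWith read (uncurry (+)) (uncurry (-)) (uncurry (*)) (uncurry (/))
```

With these definitions, the first result of `ast "12*(5+(7-2))"` is the tree shown on the earlier slide, and the first result of `eval` on the same input is 120.0 with no input left over.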
  38. 102.
  39. 107.

    The parser (nibble p) has the same behaviour as parser p, except that it eats up any white-space in the input string before or afterwards
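One way `nibble` could be written with the combinators built so far is sketched below. This is an assumption about its shape, not the talk's code; `white` is a hypothetical helper that matches any run of spaces, tabs and newlines.

```haskell
type Parser a b = [a] -> [(b, [a])]

satisfy :: (a -> Bool) -> Parser a a
satisfy p (x:xs) | p x = [(x, xs)]
satisfy _ _            = []

literal :: Eq a => a -> Parser a a
literal x = satisfy (== x)

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

xthen :: Parser a b -> Parser a c -> Parser a c
p1 `xthen` p2 = (p1 `and_then` p2) `using` snd

thenx :: Parser a b -> Parser a c -> Parser a b
p1 `thenx` p2 = (p1 `and_then` p2) `using` fst

many :: Parser a b -> Parser a [b]
many p = ((p `and_then` many p) `using` (\(x, xs) -> x : xs))
         `alt` (\inp -> [([], inp)])

-- Hypothetical helper: zero or more white-space characters.
white :: Parser Char String
white = many (satisfy (`elem` " \t\n"))

-- Run p, discarding white-space on either side.
nibble :: Parser Char b -> Parser Char b
nibble p = white `xthen` (p `thenx` white)
```

Since `many` returns the longest match first, the first result of `nibble p` is the one where all surrounding white-space has been consumed.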
  40. 112.

    w = x + y
      where
        x = 10
        y = 15 - 5
    z = w * 2
  41. 113.

  42. 114.

    When obeying the offside rule, every token must lie either directly below, or to the right of its first token
  43. 118.

    prelex "3 + \n 2 * (4 + 5)"
    -- => [('3',(0,0)), ('+',(0,2)), ('2',(1,2)), ('*',(1,4)), … ]
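A minimal sketch of what a `prelex` function might look like: it tags each character with a (row, column) pair. The body is an assumption (the slides show only its output), and details such as tab width or how white-space is reported may differ from the talk's implementation.

```haskell
type Pos a = (a, (Integer, Integer))

-- Attach a (row, column) position to every character of the input.
-- Assumptions: rows and columns start at 0, a newline moves to the
-- start of the next row, and a tab advances to the next multiple of 8.
prelex :: String -> [Pos Char]
prelex = go (0, 0)
  where
    go _ [] = []
    go (r, c) (x:xs)
      | x == '\n' = (x, (r, c)) : go (r + 1, 0) xs
      | x == '\t' = (x, (r, c)) : go (r, ((c `div` 8) + 1) * 8) xs
      | otherwise = (x, (r, c)) : go (r, c + 1) xs
```

For instance, `prelex "a\nb"` tags 'a' with (0,0), the newline with (0,1) and 'b' with (1,0).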
  44. 119.

    satisfy :: (a -> Bool) -> Parser a a
    satisfy p []     = failure []
    satisfy p (x:xs)
      | p x       = succeed x xs  -- if p(x) is true
      | otherwise = failure []
  45. 120.

    satisfy :: (a -> Bool) -> Parser (Pos a) a
    satisfy p []     = failure []
    satisfy p (x:xs)
      | p a       = succeed a xs  -- if p(a) is true
      | otherwise = failure []
      where (a, (r, c)) = x
  46. 121.

  47. 122.

    offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [ (v, inpOFF) | (v, []) <- (p inpON) ]
      where
        inpON  = takeWhile (onside (head inp)) inp
        inpOFF = drop (length inpON) inp
        onside (a, (r, c)) (b, (r', c')) = r' >= r && c' >= c
  48. 124.

    offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [ (v, inpOFF) | (v, []) <- (p inpON) ]
  49. 126.

  50. 127.

    offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [ (v, inpOFF) | (v, []) <- (p inpON) ]
      where
        inpON = takeWhile (onside (head inp)) inp
  51. 128.

    offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [ (v, inpOFF) | (v, []) <- (p inpON) ]
      where
        inpON  = takeWhile (onside (head inp)) inp
        inpOFF = drop (length inpON) inp
  52. 129.

    offside :: Parser (Pos a) b -> Parser (Pos a) b
    offside p inp = [ (v, inpOFF) | (v, []) <- (p inpON) ]
      where
        inpON  = takeWhile (onside (head inp)) inp
        inpOFF = drop (length inpON) inp
        onside (a, (r, c)) (b, (r', c')) = r' >= r && c' >= c
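To watch `offside` split the input, here is a self-contained sketch that runs it over a hand-written list of positioned characters. The token list in the usage note and the parser passed to `offside` are hypothetical; the supporting combinators are restated with the position-aware `satisfy`.

```haskell
type Parser a b = [a] -> [(b, [a])]
type Pos a = (a, (Integer, Integer))

-- Position-aware satisfy: test the symbol, drop its position.
satisfy :: (a -> Bool) -> Parser (Pos a) a
satisfy p ((a, _):xs) | p a = [(a, xs)]
satisfy _ _                 = []

alt :: Parser a b -> Parser a b -> Parser a b
(p1 `alt` p2) inp = p1 inp ++ p2 inp

and_then :: Parser a b -> Parser a c -> Parser a (b, c)
(p1 `and_then` p2) inp =
  [ ((v1, v2), out2) | (v1, out1) <- p1 inp, (v2, out2) <- p2 out1 ]

using :: Parser a b -> (b -> c) -> Parser a c
(p `using` f) inp = [ (f v, out) | (v, out) <- p inp ]

many, some :: Parser a b -> Parser a [b]
many p = some p `alt` (\inp -> [([], inp)])
some p = (p `and_then` many p) `using` (\(x, xs) -> x : xs)

-- Run p only on the tokens that are onside with respect to the first
-- token, and require p to consume all of them.
offside :: Parser (Pos a) b -> Parser (Pos a) b
offside p inp = [ (v, inpOFF) | (v, []) <- p inpON ]
  where
    inpON  = takeWhile (onside (head inp)) inp
    inpOFF = drop (length inpON) inp
    onside (_, (r, c)) (_, (r', c')) = r' >= r && c' >= c
```

Given the tokens `[('a',(0,2)), ('b',(1,2)), ('c',(1,4)), ('d',(2,0))]`, the first three are onside relative to 'a' and 'd' is not, so `offside (some (satisfy (const True)))` returns "abc" with the 'd' token left as unused input. The pattern `(v, [])` in the comprehension is what discards partial parses of the onside portion.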
  53. 130.

    (3 + 2 * (4 + 5)) + (8 *

    10) (3 + 2 * (4 + 5)) + (8 * 10)
  54. 133.

    ∅
    |> succeed, fail
    |> satisfy, literal
    |> alt, and_then, using
    |> many, some
    |> string, thenx, xthen, return
    |> expression parser & evaluator
    |> any, nibble, symbol
    |> prelex, offside
  55. 136.

    type Parser a b = [a] -> [(b, [a])]
    type Pos a = (a, (Integer, Integer))
  56. 137.

    data Tag = Ident | Number | Symbol | Junk
      deriving (Show, Eq)
    type Token = (Tag, [Char])
  57. 140.

    (p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) <- p inp ]
      where (x, (r, c)) = head inp
  58. 141.

    (p `tok` t) inp = [ ((<token>, <pos>), <unused input>) | (xs, out) <- p inp ]
      where (x, (r, c)) = head inp
  59. 142.

    (p `tok` t) inp = [ (((t, xs), (r, c)), out) | (xs, out) <- p inp ]
      where (x, (r, c)) = head inp
  60. 147.
  61. 152.

    lexer = lex [ ((some (any_of literal " \n\t")), Junk),
                  ((string "where"), Symbol),
                  (word, Ident),
                  (number, Number),
                  ((any_of string ["(", ")", "="]), Symbol) ]
  62. 153.

  63. 154.

  64. 155.

    head (lexer (prelex "where x = 10"))
    -- => ([ ((Symbol,"where"),(0,0)),
    --       ((Ident,"x"),(0,6)),
    --       ((Symbol,"="),(0,8)),
    --       ((Number,"10"),(0,10)) ], [])
  65. 160.

    In this case, "where" is a source of conflict. It can be a symbol or an identifier.
  66. 161.

    lexer = lex [ {- 1 -} ((some (any_of literal " \n\t")), Junk),
                  {- 2 -} ((string "where"), Symbol),
                  {- 3 -} (word, Ident),
                  {- 4 -} (number, Number),
                  {- 5 -} ((any_of string ["(",")","="]), Symbol) ]
  67. 166.

    (fst.head.lexer.prelex) "where x = 10"
    -- => [ ((Symbol,"where"),(0,0)), ((Junk," "),(0,5)),
    --      ((Ident,"x"),(0,6)),      ((Junk," "),(0,7)),
    --      ((Symbol,"="),(0,8)),     ((Junk," "),(0,9)),
    --      ((Number,"10"),(0,10)) ]
  68. 171.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5
  69. 172.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5

    Script
  70. 173.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5

    Definition
  71. 174.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5

    Body
  72. 175.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5

    Expression
  73. 176.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5

    Definition
  74. 177.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5

    Primitives
  75. 178.

    data Script = Script [Def]
    data Def    = Def Var [Var] Expn
    data Expn   = Var Var
                | Num Double
                | Expn `Apply` Expn
                | Expn `Where` [Def]
    type Var    = [Char]
  76. 180.
  77. 183.

    prim = ((kind Ident) `using` Var)
           `alt` ((kind Number) `using` numFN)
           `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
  78. 184.

    -- only allow a kind of tag
    kind :: Tag -> Parser (Pos Token) [Char]
    kind t = (satisfy ((== t).fst)) `using` snd

    -- only allow a given symbol
    lit :: [Char] -> Parser (Pos Token) [Char]
    lit xs = (literal (Symbol, xs)) `using` snd
  79. 186.
  80. 189.

    prim = ((kind Ident) `using` Var)
           `alt` ((kind Number) `using` numFN)
           `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
  81. 190.

    data Script = Script [Def]
    data Def    = Def Var [Var] Expn
    data Expn   = Var Var
                | Num Double
                | Expn `Apply` Expn
                | Expn `Where` [Def]
    type Var    = [Char]
  82. 194.

    f x y = add a b
      where
        a = 25
        b = sub x y
    answer = mult (f 3 7) 5
  83. 195.

    Script [
      Def "f" ["x","y"] (
        ((Var "add" `Apply` Var "a") `Apply` Var "b")
        `Where` [
          Def "a" [] (Num 25.0),
          Def "b" [] ((Var "sub" `Apply` Var "x") `Apply` Var "y")]),
      Def "answer" [] (
        (Var "mult" `Apply` ((Var "f" `Apply` Num 3.0) `Apply` Num 7.0))
        `Apply` Num 5.0)]
  84. 198.

    lexer = lex [ ((some (any_of literal " \n\t")), Junk),
                  ((string "where"), Symbol),
                  (word, Ident),
                  (number, Number),
                  ((any_of string ["(", ")", "="]), Symbol) ]
  85. 200.

    defn = ((some (kind Ident)) `and_then` ((lit "=") `xthen` (offside body)))
           `using` defnFN
    body = (expr `and_then` (((lit "where") `xthen` (some defn)) `opt` []))
           `using` bodyFN
    expr = (some prim) `using` (foldl1 Apply)
    prim = ((kind Ident) `using` Var)
           `alt` ((kind Number) `using` numFN)
           `alt` ((lit "(") `xthen` (expr `thenx` (lit ")")))
  86. 207.
  87. 208.

    Haskell: Parsec, Megaparsec. ✨
    OCaml: Angstrom. ✨
    Ruby: rparsec, or roll your own
    Elixir: Combine, ExParsec
    Python: Parsec. ✨
  88. 209.