Time Flies like an Arrow, Fruit Flies like a Banana: Parsers for Great Good

Hsing-Hui Hsu
December 11, 2015

RubyKaigi 2015
Tokyo, Japan


Transcript

  1. Hsing-Hui Hsu 徐行慧 @SoManyHs github.com/Elffers

  2. Time flies like an arrow; Fruit flies like a banana: Parsers for Great Good. Hsing-Hui Hsu 徐行慧 @SoManyHs

  3. or: How I Accidentally a Computer Science, and So Can You!

  4. None
  5. parser.rb

  6. parser.rb …wat.

  7. None
  8. parser.y

  9. parser.y

  10. Let’s play Mad Libs!

  11. The young man _________(verb).

  12. The young man drank.

  13. The young man drank. subject + verb (intr.)

  14. The young man ________(verb) ________(noun - direct object).

  15. The young man drank (verb) ________(noun - direct object).

  16. The young man drank sake.

  17. The young man drank sake. subject + verb + object

  18. The young man _________(noun).

  19. The young man the boat. ⛵

  20. The young man the boat. subject + direct object

  21. The young man the boat. subject + direct object: NOT GRAMMATICAL!

  22. The young man the boat. subject + verb + direct object

  23. Garden Path Sentences

  24. Time flies like an arrow; fruit flies like a banana. [[Time] [flies [like [an arrow]]]]; [[fruit flies] [like [a banana]]]. 時は矢のように過ぎ去る; ミバエはバナナを好む。 [[時は][[[矢]のように]過ぎ去る]]; [[ミバエは][[バナナ]を好む]]。

  25. The prime number few. [[The prime] [number [few]]]. 原始人は少ない数しか数えられない。 [[原始人は][[少ない数しか]数えられない]]。

  26. The man who hunts ducks out on weekends. [[The man who] [hunts [ducks out [on weekends]]]]. 男は週末ごとに狩りをしにこっそり出かける。 [[男は][[[週末ごとに]狩りをしに]こっそり出かける]]。

  27. The woman who whistles tunes pianos. [[The [woman who] [whistles]] [tunes [pianos]]]. この口笛を吹く女はピアノの調律をする。 [[この[[口笛を吹く]女は]] [[ピアノ]の調律をする]]。

  28. 先生がお酒を飲んだ生徒を注意した。 The teacher advised the student who had been drinking not to drink. The opening 先生がお酒を飲んだ reads as “The teacher drank sake”, but お酒を飲んだ (drank sake) actually describes 生徒 (student), and the teacher is the one doing 注意した (advising).

  29. GRANDMOTHER OF EIGHT MAKES HOLE IN ONE 8人の孫を持つお婆さんがホールインワンを達成する

  30. COMPLAINTS ABOUT NBA REFEREES GROWING UGLY

  31. MILK DRINKERS ARE TURNING TO POWDER

  32. Grammar Rules

  33. Sentence = Subject + Predicate Predicate = Verb + Stuff

  34. None
  35. (Extended) Backus-Naur Form:
    • Metalanguage notation used to describe a language by a set of production rules
    • Each rule is expressed with terminal and non-terminal symbols
  36. (Extended) Backus-Naur Form: production (a.k.a. rewrite) rules are expressed as:
    Left-hand side → Right-hand side
    Non-terminal → sequence of terminals and non-terminals
  37. BNF for English Sentences
    Sentence → Subj Pred
    Pred → Verb Stuff
  38. “The young man drank sake” / “The young man the boat”
    1. S → NP VP
    2. NP → Art NP
    3. NP → Adj N
    4. NP → N
    5. VP → V NP
    6. Art → “The”
    7. Art → “a”
    8. Adj → “young”
    9. N → “man” | “young” | “boat” | “sake”
    10. V → “man” | “drank”
  39. Non-terminals = {S, NP, VP, N, V, Art, Adj}
    Terminals = {“the”, “a”, “young”, “man”, “boat”, “sake”, “drank”}
  40. The young man the boat
    S → NP VP
    ⇒ Art N VP
    ⇒ The young VP
    ⇒ The young V NP
    ⇒ The young man NP
    ⇒ The young man Art N
    ⇒ The young man the N
    ⇒ The young man the boat
  41. The young man drank sake
    S → NP VP
    ⇒ Art NP VP
    ⇒ The NP VP
    ⇒ The Adj N VP
    ⇒ The young N VP
    ⇒ The young man VP
    ⇒ The young man V NP
    ⇒ The young man drank NP
    ⇒ The young man drank N
    ⇒ The young man drank sake
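
The two derivations above can be checked mechanically. The following is a minimal Ruby sketch, not code from the talk: it stores the ten rules from slide 38 in a hash and tries every alternative, which is what lets it recover from the garden path. The names `GRAMMAR`, `ends_for` and `grammatical?` are made up for illustration, and terminals are lowercased for simplicity.

```ruby
# Toy grammar from the slides; each non-terminal maps to its alternatives.
# Anything that is not a key of the hash is treated as a terminal word.
GRAMMAR = {
  "S"   => [%w[NP VP]],
  "NP"  => [%w[Art NP], %w[Adj N], %w[N]],
  "VP"  => [%w[V NP]],
  "Art" => [%w[the], %w[a]],
  "Adj" => [%w[young]],
  "N"   => [%w[man], %w[young], %w[boat], %w[sake]],
  "V"   => [%w[man], %w[drank]],
}.freeze

# Returns every position at which `symbol` can end when it starts at
# `pos` in `words`. Keeping all candidates (instead of the first one)
# is how "young" can be tried as an adjective and then as a noun.
def ends_for(symbol, words, pos)
  return (words[pos] == symbol ? [pos + 1] : []) unless GRAMMAR.key?(symbol)

  GRAMMAR[symbol].flat_map do |alternative|
    alternative.reduce([pos]) do |positions, part|
      positions.flat_map { |p| ends_for(part, words, p) }
    end
  end.uniq
end

# A sentence is grammatical if S can span all of its words.
def grammatical?(sentence)
  words = sentence.downcase.split
  ends_for("S", words, 0).include?(words.length)
end

puts grammatical?("The young man the boat")
puts grammatical?("The young man drank sake")
puts grammatical?("The boat the man")
```

Because the grammar has no rule for an intransitive verb phrase, this sketch rejects “The young man drank”; that is a property of the ten rules as written, not of the approach.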
  42. But what does this have to do with computers?

  43. Source code → Lexer → Tokens → Parser → Syntax Tree → Compiler → Native Code

  44. Input → Lexer → Tokens → Parser → Output

  45. Lexing (Tokenizing)

  46. Math!
    • Addition: 3 + 7
    • Subtraction: 3 - 7
    • Multiplication: 3 * 7
  47. Math Rules
    1. Expr → Num Op Num
    2. Num → /\d+/
    3. Op → /[+*-]/
  48. require 'strscan'
    # Token::Num, Token::Op and ParseError are defined elsewhere
    def tokenize input
      ss = StringScanner.new input
      tokens = []
      while not ss.eos?
        case
        when ss.scan(/\d+/)
          token = Token::Num.new(ss.matched.to_i)
          tokens.push token
        when ss.scan(/[+*-]/)
          token = Token::Op.new(ss.matched)
          tokens.push token
        when ss.scan(/\s+/)
          # ignore whitespace
        else
          raise ParseError
        end
      end
      tokens
    end
  49. tokenize("3 + 7") => [Num(3), Op(+), Num(7)]
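
For readers who want to run the tokenizer, here is a self-contained sketch along the same lines. It is an assumption-laden variant rather than the speaker's code: `Token` is a plain `Struct` standing in for the `Token::Num` and `Token::Op` classes, `ParseError` is defined locally, and a branch for parentheses is added so that the later `2 * (3 + 7)` example also tokenizes.

```ruby
require "strscan"

# Hypothetical token type; the slides use Token::Num and Token::Op instead.
Token = Struct.new(:type, :value)

# Defined locally so the sketch is runnable on its own.
class ParseError < StandardError; end

def tokenize(input)
  ss = StringScanner.new(input)
  tokens = []
  until ss.eos?
    if ss.scan(/\d+/)
      tokens << Token.new(:num, ss.matched.to_i)
    elsif ss.scan(/[+*-]/)
      tokens << Token.new(:op, ss.matched)
    elsif ss.scan(/[()]/)
      tokens << Token.new(:paren, ss.matched)
    elsif ss.scan(/\s+/)
      next # whitespace separates tokens but produces none
    else
      raise ParseError, "unexpected character #{ss.getch.inspect}"
    end
  end
  tokens
end

p tokenize("3 + 7").map(&:value)
p tokenize("2 * (3 + 7)").map(&:value)
```

`StringScanner#scan` only matches at the current scan position and advances past the match, so the order of the branches decides which token type wins when several patterns could apply.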

  50. Parser.parse(tokens) => Tree + 3 7

  51. class Parser
      def initialize tokens
        @tokens = tokens
      end
      def parse
        left = @tokens.get
        head = @tokens.get
        right = @tokens.get
        Parser::Tree.new(head, left, right)
      end
    end
  52. Slightly harder math

  53. 2 * (3 + 7)

  54. Slightly Harder Math Rules
    1. Expr → Num Op Expr | (Expr) | Num
    2. Num → /\d+/
    3. Op → /[+*-]/
  55. tokenize("2 * (3 + 7)") => [2, *, (, 3, +, 7, )]
  56. Syntax tree (prefix order): * 2 (+ 3 7)

  57. Top Down (with 1-token lookahead)

  58. Input: 2 * (3 + 7) | Current token: (none) | Next token: (none)
  59. Input: 2 * (3 + 7) | Current token: 2 | Next token: (none)
  60. Input: 2 * (3 + 7) | Current token: 2 | Next token: *
  61. Input: 2 * (3 + 7) | Current token: 2 | Next token: * | Rule: Expr → Num Op Expr
  62. Input: 2 * (3 + 7) | Current token: 2 | Next token: * | Rule: Expr → Num Op Expr | Tree nodes: 2
  63. Input: * (3 + 7) | Current token: (none) | Next token: (none) | Tree nodes: 2
  64. Input: * (3 + 7) | Current token: * | Next token: (none) | Tree nodes: 2
  65. Input: * (3 + 7) | Current token: * | Next token: ( | Tree nodes: 2
  66. Input: * (3 + 7) | Current token: * | Next token: ( | Rule: Expr → Num Op Expr; *Expr → (Expr) | Tree nodes: 2
  67. Input: * (3 + 7) | Current token: * | Next token: ( | Rule: Expr → Num Op Expr; *Expr → (Expr) | Tree nodes: * 2 Expr
  68. Input: (3 + 7) | Current token: (none) | Next token: (none) | Tree nodes: * 2 Expr
  69. Input: (3 + 7) | Current token: ( | Next token: (none) | Tree nodes: * 2 Expr
  70. Input: (3 + 7) | Current token: ( | Next token: 3 | Tree nodes: * 2 Expr
  71. Input: (3 + 7) | Current token: ( | Next token: 3 | Rule: Expr → (Expr); *Expr → Num; *Expr → Num Op Expr | Tree nodes: * 2 Expr
  72. Input: (3 + 7) | Current token: ( | Next token: 3 | Rule: Expr → (Expr); *Expr → Num; *Expr → Num Op Expr | Tree nodes: * 2 (Expr)
  73. Input: 3 + 7) | Current token: (none) | Next token: (none) | Tree nodes: * 2 (Expr)
  74. Input: 3 + 7) | Current token: 3 | Next token: (none) | Tree nodes: * 2 (Expr)
  75. Input: 3 + 7) | Current token: 3 | Next token: (none) | Rule: Expr → Num | Tree nodes: * 2 (Expr)
  76. Input: 3 + 7) | Current token: 3 | Next token: (none) | Rule: Expr → Num Op Expr; Expr → Num | Tree nodes: * 2 (Expr)
  77. Input: 3 + 7) | Current token: 3 | Next token: + | Rule: Expr → Num Op Expr; Expr → Num | Tree nodes: * 2 (Expr)
  78. Input: 3 + 7) | Current token: 3 | Next token: + | Rule: Expr → Num Op Expr | Tree nodes: * 2 (Expr)
  79. Input: 3 + 7) | Current token: 3 | Next token: + | Rule: Expr → Num Op Expr | Tree nodes: * 2 (Expr) 3
  80. Input: + 7) | Current token: (none) | Next token: (none) | Tree nodes: * 2 Expr 3
  81. Input: + 7) | Current token: + | Next token: (none) | Tree nodes: * 2 Expr 3
  82. Input: + 7) | Current token: + | Next token: 7 | Tree nodes: * 2 Expr 3
  83. Input: + 7) | Current token: + | Next token: 7 | Rule: Expr → Num Op Expr | Tree nodes: * 2 Expr 3
  84. Input: + 7) | Current token: + | Next token: 7 | Rule: Expr → Num Op Expr | Tree nodes: * 2 Expr 3 +
  85. Input: 7) | Current token: (none) | Next token: (none) | Tree nodes: * 2 + 3
  86. Input: 7) | Current token: 7 | Next token: (none) | Tree nodes: * 2 + 3
  87. Input: 7) | Current token: 7 | Next token: ) | Tree nodes: * 2 + 3
  88. Input: 7) | Current token: 7 | Next token: ) | Rule: Expr → Num; Expr → (Expr) | Tree nodes: * 2 + 3
  89. Input: 7) | Current token: 7 | Next token: ) | Rule: Expr → Num; Expr → (Expr) | Tree nodes: * 2 + 3 7
  90. “2 * (3 + 7)”
    2 * (3 + 7)
    Num * (3 + 7)
    Expr * (3 + 7)
    Expr Op (3 + 7)
    Expr Op (Expr)
    Expr Op Expr
    Expr
  91. Recursive Descent
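
The top-down walk through `2 * (3 + 7)` can be sketched as a small recursive descent parser. This is an illustrative sketch under stated assumptions, not the speaker's implementation: the class name `RecursiveDescent`, the helper names, and the choice of nested arrays `[op, left, right]` for the tree are all made up here, and the tokenizer is a one-line `String#scan` that silently skips anything it does not recognize.

```ruby
# Tokens are plain Ruby values: Integers for Num, Strings for everything else.
def tokenize(input)
  input.scan(/\d+|[-+*()]/).map { |t| t.match?(/\A\d+\z/) ? t.to_i : t }
end

class RecursiveDescent
  OPS = %w[+ - *].freeze

  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  def parse
    tree = expr
    raise ArgumentError, "unexpected token #{current.inspect}" if current
    tree
  end

  private

  def current
    @tokens[@pos]
  end

  # The single token of lookahead used to choose between the Num rules.
  def lookahead
    @tokens[@pos + 1]
  end

  def advance
    token = current
    @pos += 1
    token
  end

  # Expr -> Num Op Expr | ( Expr ) | Num
  def expr
    if current == "("
      advance
      inner = expr
      raise ArgumentError, "expected )" unless advance == ")"
      inner
    elsif current.is_a?(Integer) && OPS.include?(lookahead)
      left = advance
      op = advance
      [op, left, expr]
    elsif current.is_a?(Integer)
      advance
    else
      raise ArgumentError, "unexpected token #{current.inspect}"
    end
  end
end

# Walks the tree, applying each operator with send, as the racc action
# on the later slide does.
def evaluate(tree)
  return tree if tree.is_a?(Integer)

  op, left, right = tree
  evaluate(left).send(op, evaluate(right))
end

tree = RecursiveDescent.new(tokenize("2 * (3 + 7)")).parse
p tree
p evaluate(tree)
```

Each grammar alternative becomes one branch of `expr`, and the recursion in the grammar becomes recursion in the method. Because the rules are right-recursive and `(Expr)` cannot be followed by an operator, an input such as `(3 + 7) * 2` is rejected by this grammar as written.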

  92. Problems with Recursive Descent parsers
    • Inefficient
    • Possibility of infinite recursion, e.g. Expr → Expr Op Expr
    • Limitations on grammar rules
  93. Bottom-Up (with a stack to remember things)

  94. Bottom-Up (with a stack to remember things) a.k.a. shift-reduce parsing

  95. Input: 2 * (3 + 7) | Stack: (empty)
  96. Input: 2 * (3 + 7) | Stack: 2
  97. Input: 2 * (3 + 7) | Rule: Num → 2 | Stack: 2
  98. Input: 2 * (3 + 7) | Rule: Num → 2 | Stack: Num(2)
  99. Input: 2 * (3 + 7) | Rule: Num → 2 | Stack: Num(2)
  100. Input: * (3 + 7) | Stack: Num(2)
  101. Input: * (3 + 7) | Stack: Num(2) *
  102. Input: * (3 + 7) | Rule: Op → * | Stack: Num(2) *
  103. Input: * (3 + 7) | Rule: Op → * | Stack: Num(2) Op(*)
  104. Input: * (3 + 7) | Rule: Op → * | Stack: Num(2) Op(*)
  105. Input: (3 + 7) | Stack: Num(2) Op(*)
  106. Input: (3 + 7) | Stack: Num(2) Op(*) (
  107. Input: (3 + 7) | Rule: Expr → (Expr) | Stack: Num(2) Op(*) (
  108. Input: 3 + 7) | Stack: Num(2) Op(*) (
  109. Input: 3 + 7) | Stack: Num(2) Op(*) ( 3
  110. Input: 3 + 7) | Rule: Num → 3 | Stack: Num(2) Op(*) ( 3
  111. Input: 3 + 7) | Rule: Num → 3 | Stack: Num(2) Op(*) ( Num(3)
  112. Input: 3 + 7) | Rule: Num → 3 | Stack: Num(2) Op(*) ( Num(3)
  113. Input: + 7) | Stack: Num(2) Op(*) ( Num(3)
  114. Input: + 7) | Stack: Num(2) Op(*) ( Num(3) +
  115. Input: + 7) | Rule: Op → + | Stack: Num(2) Op(*) ( Num(3) +
  116. Input: + 7) | Rule: Op → + | Stack: Num(2) Op(*) ( Num(3) Op(+)
  117. Input: + 7) | Rule: Op → + | Stack: Num(2) Op(*) ( Num(3) Op(+)
  118. Input: 7) | Stack: Num(2) Op(*) ( Num(3) Op(+)
  119. Input: 7) | Stack: Num(2) Op(*) ( Num(3) Op(+) 7
  120. Input: 7) | Rule: Num → 7; Expr → Num | Stack: Num(2) Op(*) ( Num(3) Op(+) 7
  121. Input: 7) | Rule: Num → 7; Expr → Num | Stack: Num(2) Op(*) ( Num(3) Op(+) Expr(7)
  122. Input: 7) | Rule: Num → 7; Expr → Num | Stack: Num(2) Op(*) ( Num(3) Op(+) Expr(7)
  123. Input: ) | Stack: Num(2) Op(*) ( Num(3) Op(+) Expr(7)
  124. Input: ) | Rule: Expr → Num Op Expr | Stack: Num(2) Op(*) ( Num(3) Op(+) Expr(7)
  125. Input: ) | Rule: Expr → Num Op Expr | Stack: Num(2) Op(*) ( Num(3) Op(+) Expr(7)
  126. Input: ) | Rule: Expr → Num Op Expr | Stack: Num(2) Op(*) ( Expr(+ 3 7)
  127. Input: ) | Rule: Expr → Num Op Expr | Stack: Num(2) Op(*) ( Expr(+ 3 7)
  128. Input: ) | Rule: Expr → Num Op Expr | Stack: Num(2) Op(*) ( Expr(+ 3 7)
  129. Input: ) | Stack: Num(2) Op(*) ( Expr(+ 3 7)
  130. Input: ) | Stack: Num(2) Op(*) ( Expr(+ 3 7) )
  131. Input: ) | Rule: Expr → (Expr) | Stack: Num(2) Op(*) ( Expr(+ 3 7) )
  132. Input: <eos> | Rule: Expr → (Expr) | Stack: Num(2) Op(*) ( Expr(+ 3 7) )
  133. Input: <eos> | Rule: Expr → (Expr) | Stack: Num(2) Op(*) ( Expr(+ 3 7) )
  134. Input: <eos> | Rule: Expr → (Expr) | Stack: Num(2) Op(*) Expr(+ 3 7)
  135. Input: <eos> | Rule: Expr → Num Op Expr | Stack: Num(2) Op(*) Expr(+ 3 7)
  136. Input: <eos> | Rule: Expr → Num Op Expr | Stack: Num(2) Op(*) Expr(+ 3 7)
  137. Input: <eos> | Rule: Expr → Num Op Expr | Stack: Expr(* 2 (+ 3 7))
  138. Input: <eos> | Rule: Expr → Num Op Expr | Stack: Expr(* 2 (+ 3 7))
  139. Top-down: Recursive Descent Bottom-up: Shift-reduce
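
The bottom-up walk can likewise be sketched as a hand-written shift-reduce loop. This is a simplified illustration, not a generated LR parser and not code from the talk: the function names are made up, the `Num → digit` and `Op → operator` reductions are folded into the shift step, and a single token of lookahead decides when `Expr → Num` applies.

```ruby
def tokenize(input)
  input.scan(/\d+|[-+*()]/).map { |t| t.match?(/\A\d+\z/) ? t.to_i : t }
end

OPERATORS = %w[+ - *].freeze

# Each stack entry is a pair [grammar_symbol, value].
# Tries one reduction on the top of the stack; returns true if one applied.
def reduce_once(stack, lookahead)
  top_symbols = stack.last(3).map(&:first)
  if top_symbols == [:num, :op, :expr]
    # Expr -> Num Op Expr
    left, op, right = stack.pop(3).map(&:last)
    stack.push([:expr, [op, left, right]])
  elsif top_symbols == ["(", :expr, ")"]
    # Expr -> ( Expr )
    _open, inner, _close = stack.pop(3).map(&:last)
    stack.push([:expr, inner])
  elsif stack.last && stack.last.first == :num && !OPERATORS.include?(lookahead)
    # Expr -> Num, only when no operator follows the number
    stack.push([:expr, stack.pop.last])
  else
    return false
  end
  true
end

def shift_reduce_parse(input)
  tokens = tokenize(input)
  stack = []
  until tokens.empty?
    token = tokens.shift
    # Shift: classify the token as it lands on the stack.
    symbol = if token.is_a?(Integer) then :num
             elsif OPERATORS.include?(token) then :op
             else token
             end
    stack.push([symbol, token])
    # Reduce for as long as some rule matches the top of the stack.
    loop { break unless reduce_once(stack, tokens.first) }
  end
  unless stack.map(&:first) == [:expr]
    raise ArgumentError, "could not reduce #{stack.map(&:first).inspect}"
  end
  stack.first.last
end

p shift_reduce_parse("2 * (3 + 7)")
```

The stack goes through the same states as the slides: `Num Op ( Num Op Expr` collapses to `Num Op ( Expr`, the closing parenthesis triggers `Expr → (Expr)`, and the final `Num Op Expr` reduces to a single `Expr`.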

  140. mo’ rules, mo’ problems

  141. Yacc/Racc/bison to the rescue!

  142. Grammar.y → Parser Generator → Parser → Output

  143. class CalcParser
      options no_result_var
    rule
      expr : NUM OP NUM { val[0].send(val[1], val[2]) }
    end
    # tokenizer goes here
  144. $ racc calc.y => calc.tab.rb

  145. class CalcParser < Racc::Parser
    module_eval(<<'...end calc.y/module_eval...', 'calc.y', 10)
    #tokenizer deleted for space reasons
    ...end calc.y/module_eval...
    ##### State transition tables begin ###
    racc_action_table = [ 2, 3, 4, 5, 6 ]
    racc_action_check = [ 0, 1, 2, 3, 4 ]
    racc_action_pointer = [ -2, 1, -1, 3, 2, nil, nil ]
    racc_action_default = [ -2, -2, -2, -2, -2, 7, -1 ]
    racc_goto_table = [ 1 ]
    racc_goto_check = [ 1 ]
  146. $ echo "2 * (3 + 7)" > calc.rb
    $ ruby -y calc.rb
  147. $ ruby -y calc.rb
    Starting parse
    Entering state 0
    Reducing stack by rule 1 (line 855):
    -> $$ = nterm $@1 ()
    Stack now 0
    Entering state 2
    Reading a token: Next token is token tINTEGER ()
    Shifting token tINTEGER ()
    Entering state 41
    Reducing stack by rule 499 (line 4302):
    $1 = token tINTEGER ()
    -> $$ = nterm numeric ()
    Stack now 0 2
    Entering state 109
    Reducing stack by rule 448 (line 3811)
  148. $ man ruby

  149. $ man ruby

  150. $ man ruby

  151. But HHH, when would I use a parser?

  152. • String validation (URLs, email addresses) • Emails • Logfiles • Formatted document

  153. No, but really, why would I use a parser? I can just use regexes!

  154. (?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])) validate this!

  155. Language Hierarchy (Chomsky)
    Type 0: Unrestricted (natural languages)
    Type 1: Context-sensitive (<hand wave>)
    Type 2: Context-free (computer languages)
    Type 3: Regular (regular expressions)
  156. Regular languages
    • Left-hand side = single non-terminal
    • Right-hand side = terminal, sometimes with a non-terminal EITHER preceding OR following
    e.g. A → x
    A → Bx
    A → nil
  157. Context-free languages
    • Presence of a stack to remember if a symbol has occurred before (e.g. shift-reduce)
    • More flexible grammar rules: right-hand side can be a sequence of terminals and non-terminals
  158. Most “languages” aren’t regular!

  159. “ab” language
    Valid sentences: • ab • aabb • aaaaabbbbb • aⁿbⁿ
    Invalid sentences: • aaaaaa • abb • aab • ababab
  160. A → “a”
    B → “b”
    S → AB
    AB → AABB
    …
  161. What’s the grammar?

  162. S → aSb
    S → nil
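
The two rules translate almost directly into a recursive check. The following is a minimal sketch (the method name `ab_sentence?` is made up for illustration) that tests the valid and invalid sentences listed earlier, followed by the obvious regular expression for comparison.

```ruby
# S -> a S b
# S -> nil
def ab_sentence?(string)
  return true if string.empty? # S -> nil
  return false unless string.start_with?("a") && string.end_with?("b")

  ab_sentence?(string[1..-2])  # S -> a S b: strip one "a" and one "b"
end

%w[ab aabb aaaaabbbbb aaaaaa abb aab ababab].each do |sentence|
  puts "#{sentence}: #{ab_sentence?(sentence)}"
end

# The pattern a+b+ cannot count, so it accepts unbalanced input.
puts "aab".match?(/\Aa+b+\z/)
```

The recursion depth plays the role of the stack: each pending call remembers one `a` that still needs its matching `b`, which is exactly what a regular expression of the classical kind has no way to track.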

  163. RFC 2617 See https://github.com/drbrain/net-http-digest_auth

  164. Bundler vs Rubygems parser (for resolving gem dependencies in Gemfile.lock)

  165. Bundler: regular expression matching
    NAME_VERSION = '(?! )(.*?)(?: \(([^-]*)(?:-(.*))?\))?'
    NAME_VERSION_2 = /^ {2}#{NAME_VERSION}(!)?$/
    def parse_dependency(line)
      if line =~ NAME_VERSION_2
        name = $1
        version = $2
        pinned = $4
        # …
        @dependencies << dep
      end
      # No error handling for corrupted Lockfiles
    end
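
To see what "no error handling for corrupted Lockfiles" can mean in practice, the sketch below reuses the two constants exactly as they appear on the slide and feeds them a few lines. The helper `name_and_version` and the sample lines are made up for illustration; they are not taken from Bundler or from a real lockfile.

```ruby
NAME_VERSION = '(?! )(.*?)(?: \(([^-]*)(?:-(.*))?\))?'
NAME_VERSION_2 = /^ {2}#{NAME_VERSION}(!)?$/

# Hypothetical helper: returns [name, version] or nil when nothing matches.
def name_and_version(line)
  match = NAME_VERSION_2.match(line)
  match && [match[1], match[2]]
end

p name_and_version("  rake (10.4.2)")                 # well-formed line
p name_and_version("  nokogiri (1.6.6-x86-mingw32)")  # version plus platform
p name_and_version("  rake (10.4.2")                  # closing parenthesis lost
```

The damaged third line still matches: the optional version group is skipped and the lazy name capture grows to swallow `rake (10.4.2`, so the caller gets a strange gem name and no version instead of an error. A token-based parser can instead report that it expected a closing parenthesis.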
  166. Rubygems: Recursive Descent parser
    def parse_DEPENDENCIES
      while not @tokens.empty? and :text == peek.type do
        token = get :text
        requirements = []
        case peek[0]
        when :l_paren then
          get :l_paren
          loop do
            op = get(:requirement).value
            version = get(:text).value
            # Meaningful ParseError raised for unexpected tokens
            ...
  167. Conclusion
    • It doesn’t take much to break regular expressions
    • Parsers are awesome! More accurate! Faster!
    • …but hard to write.
    • Good thing we have parser generators!
  168. • Recursive Descent parser: http://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/
    • Shift-reduce parser: http://cons.mit.edu/sp14/ocw/L03.pdf
    • Constructing Language Processors for Little Languages, Randy M. Kaplan (ISBN-13: 978-0471597537)
    • Ruby Under a Microscope, Pat Shaughnessy (ISBN-13: 978-1593275273)
    • Parser generators:
    • ANTLR (http://www.antlr.org/)
    • http://theorangeduck.com/page/you-could-have-invented-parser-combinators
  169. Hsing-Hui Hsu 徐行慧 @SoManyHs github.com/Elffers