
Time Flies like an Arrow, Fruit Flies like a Banana: Parsers for Great Good

Hsing-Hui Hsu
December 11, 2015

RubyKaigi 2015
Tokyo, Japan

Transcript

  1. Time flies like an arrow; fruit flies like a banana: Parsers for Great Good
    Hsing-Hui Hsu 徐行慧 @SoManyHs
  2. Time flies like an arrow; fruit flies like a banana.
    時は矢のように過ぎ去る; ミバエはバナナを好む。
    [[Time] [flies [like [an arrow]]]]; [[fruit flies] [like [a banana]]].
    [[時は][[[矢]のように]過ぎ去る]]; [[ミバエは][[バナナ]を好む]]。
  3. The man who hunts ducks out on weekends.
    男は週末ごとに狩りをしにこっそり出かける。
    [[The man who] [hunts [ducks out [on weekends]]]].
    [[男は][[[週末ごとに]狩りをしに]こっそり出かける]]。
  4. The woman who whistles tunes pianos.
    この口笛を吹く女はピアノの調律をする。
    [[The [woman who] [whistles]] [tunes [pianos]]].
    [[この[[口笛を吹く]女は]] [[ピアノ]の調律をする]]。
  5. 先生がお酒を飲んだ生徒を注意した。
    The teacher advised the student who had been drinking not to drink.
    Read only as far as 先生がお酒を飲んだ, the sentence appears to say “The teacher drank sake”. In fact お酒を飲んだ (drank sake) is describing 生徒 (student), and the teacher is actually doing 注意した (advising).
  6. (Extended) Backus-Naur Form:
    • Metalanguage notation used to describe a language by a set of production rules
    • Each rule is expressed with terminal and non-terminal symbols
  7. (Extended) Backus-Naur Form: production (a.k.a. rewrite) rules are expressed as:
    Left-hand side → Right-hand side
    Non-terminal → sequence of terminals and non-terminals
  8. “The young man drank sake” / “The young man the boat”
    1. S → NP VP
    2. NP → Art NP
    3. NP → Adj N
    4. NP → N
    5. VP → V NP
    6. Art → “The”
    7. Art → “a”
    8. Adj → “young”
    9. N → “man” | “young” | “boat” | “sake”
    10. V → “man” | “drank”
  9. Non-terminals = {S, NP, VP, N, V, Art, Adj}
    Terminals = {“the”, “a”, “young”, “man”, “boat”, “sake”, “drank”}
  10. The young man the boat
    S → NP VP
    ⇒ Art N VP
    ⇒ The young VP
    ⇒ The young V NP
    ⇒ The young man NP
    ⇒ The young man Art N
    ⇒ The young man the N
    ⇒ The young man the boat
  11. The young man drank sake
    S → NP VP
    ⇒ Art NP VP
    ⇒ The NP VP
    ⇒ The Adj N VP
    ⇒ The young N VP
    ⇒ The young man VP
    ⇒ The young man V NP
    ⇒ The young man drank NP
    ⇒ The young man drank N
    ⇒ The young man drank sake
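The two derivations above can be checked mechanically. The following is a minimal sketch, assuming the ten production rules from the slides are stored in a hash; the names `GRAMMAR`, `ends_for`, and `sentence?` are hypothetical and not from the talk. It tries every alternative for each non-terminal and reports whether the whole sentence can be derived from S.

```ruby
# Toy grammar from the slides: symbols are non-terminals, strings are terminals.
# (Hypothetical encoding; terminals are lowercased for matching.)
GRAMMAR = {
  S:   [[:NP, :VP]],
  NP:  [[:Art, :NP], [:Adj, :N], [:N]],
  VP:  [[:V, :NP]],
  Art: [["the"], ["a"]],
  Adj: [["young"]],
  N:   [["man"], ["young"], ["boat"], ["sake"]],
  V:   [["man"], ["drank"]]
}.freeze

# Returns every position at which `symbol` can end when it starts at `pos`.
def ends_for(symbol, words, pos)
  if symbol.is_a?(String)
    return words[pos] == symbol ? [pos + 1] : []
  end
  GRAMMAR.fetch(symbol).flat_map do |rhs|
    # Thread the set of reachable positions through each part of the rule.
    rhs.reduce([pos]) do |positions, part|
      positions.flat_map { |p| ends_for(part, words, p) }
    end
  end.uniq
end

# True when the whole word sequence can be derived from S.
def sentence?(text)
  words = text.downcase.split
  ends_for(:S, words, 0).include?(words.length)
end

puts sentence?("The young man the boat")
puts sentence?("The young man drank sake")
puts sentence?("The man young")
```

Because the sketch explores all alternatives, it accepts both readings of "young" and "man", which is exactly what makes "The young man the boat" derivable.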
  12. Math Rules
    1. Expr → Num Op Num
    2. Num → /\d+/
    3. Op → /[+*-]/
  13. require 'strscan'

      # Token::Num, Token::Op, and ParseError are assumed to be defined elsewhere.
      def tokenize input
        ss = StringScanner.new input
        tokens = []
        while not ss.eos?
          case
          when ss.scan(/\d+/)
            token = Token::Num.new(ss.matched.to_i)
            tokens.push token
          when ss.scan(/[+*-]/)
            token = Token::Op.new(ss.matched)
            tokens.push token
          when ss.scan(/\s+/)
            # ignore whitespace
          else
            raise ParseError
          end
        end
        tokens
      end
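  14. class Parser
        # tokens is assumed to respond to #get, returning the next token.
        def initialize tokens
          @tokens = tokens
        end

        # Reads Num Op Num and builds a tree with the operator as the head.
        def parse
          left = @tokens.get
          head = @tokens.get
          right = @tokens.get
          Parser::Tree.new(head, left, right)
        end
      end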
  15. Slightly Harder Math Rules
    1. Expr → Num Op Expr | (Expr) | Num
    2. Num → /\d+/
    3. Op → /[+*-]/
  16. Remaining input: 2 * (3 + 7). Current token: 2. Next token: *.
    Rule: Expr → Num Op Expr
  17. Remaining input: 2 * (3 + 7). Current token: 2. Next token: *.
    Rule: Expr → Num Op Expr
    Tree: 2
  18. Remaining input: * (3 + 7). Current token: *. Next token: (.
    Rule: Expr → Num Op Expr; *Expr → (Expr)
    Tree: 2
  19. Remaining input: * (3 + 7). Current token: *. Next token: (.
    Rule: Expr → Num Op Expr; *Expr → (Expr)
    Tree: * 2 Expr
  20. Remaining input: (3 + 7). Current token: (. Next token: 3.
    Rule: Expr → (Expr); *Expr → Num; *Expr → Num Op Expr
    Tree: * 2 Expr
  21. Remaining input: (3 + 7). Current token: (. Next token: 3.
    Rule: Expr → (Expr); *Expr → Num; *Expr → Num Op Expr
    Tree: * 2 Expr (Expr)
  22. Remaining input: 3 + 7). Current token: 3.
    Rule: Expr → Num
    Tree: * 2 (Expr)
  23. Remaining input: 3 + 7). Current token: 3.
    Rule: Expr → Num Op Expr; Expr → Num
    Tree: * 2 (Expr)
  24. Remaining input: 3 + 7). Current token: 3. Next token: +.
    Rule: Expr → Num Op Expr; Expr → Num
    Tree: * 2 (Expr)
  25. Remaining input: 3 + 7). Current token: 3. Next token: +.
    Rule: Expr → Num Op Expr
    Tree: * 2 (Expr)
  26. Remaining input: 3 + 7). Current token: 3. Next token: +.
    Rule: Expr → Num Op Expr
    Tree: * 2 (Expr) 3
  27. Remaining input: 7). Current token: 7. Next token: ).
    Rule: Expr → Num; Expr → (Expr)
    Tree: * 2 + 3
  28. Remaining input: 7). Current token: 7. Next token: ).
    Rule: Expr → Num; Expr → (Expr)
    Tree: * 2 + 3 7
  29. “2 * (3 + 7)”
    2 * (3 + 7)
    Num * (3 + 7)
    Expr * (3 + 7)
    Expr Op (3 + 7)
    Expr Op (Expr)
    Expr Op Expr
    Expr
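The walk-through above can be sketched as a small recursive descent parser for the "Slightly Harder Math" rules. This is an illustrative sketch, not the talk's code: `SketchParser`, `tokenize`, and `evaluate` are hypothetical names, tokens are plain Integers and Strings rather than `Token` objects, and the tree is a nested array of the form `[op, left, right]`.

```ruby
require 'strscan'

# Split the input into Integers, operators, and parentheses.
def tokenize(input)
  ss = StringScanner.new(input)
  tokens = []
  until ss.eos?
    if ss.scan(/\d+/)
      tokens << ss.matched.to_i
    elsif ss.scan(/[-+*()]/)
      tokens << ss.matched
    elsif ss.scan(/\s+/)
      # skip whitespace
    else
      raise ArgumentError, "unexpected input at offset #{ss.pos}"
    end
  end
  tokens
end

class SketchParser
  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  def parse
    tree = expr
    raise ArgumentError, "unexpected token #{peek.inspect}" unless peek.nil?
    tree
  end

  private

  # Look at the current token without consuming it.
  def peek
    @tokens[@pos]
  end

  # Consume and return the current token.
  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end

  # Expr -> Num Op Expr | ( Expr ) | Num
  def expr
    if peek == "("
      advance
      inner = expr
      raise ArgumentError, "expected )" unless advance == ")"
      inner
    elsif peek.is_a?(Integer)
      left = advance
      # The next token decides between Num Op Expr and a bare Num.
      return left unless %w[+ - *].include?(peek)
      op = advance
      [op, left, expr]
    else
      raise ArgumentError, "unexpected token #{peek.inspect}"
    end
  end
end

# Walk the tree, applying each operator to its evaluated children.
def evaluate(tree)
  return tree if tree.is_a?(Integer)
  op, left, right = tree
  evaluate(left).send(op, evaluate(right))
end

tree = SketchParser.new(tokenize("2 * (3 + 7)")).parse
p tree
p evaluate(tree)
```

Since the only recursive rule is `Expr → Num Op Expr`, the tree nests to the right and the grammar has no notion of operator precedence; `2 * 3 + 7` would be grouped as `2 * (3 + 7)`.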
  30. Problems with Recursive Descent parsers
    • Inefficient
    • Possibility of infinite recursion, e.g. Expr → Expr Op Expr
    • Limitations on grammar rules
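To make the infinite-recursion problem concrete, here is a hedged sketch of what happens when the left-recursive rule `Expr → Expr Op Expr` is translated directly into a method. The name `left_recursive_expr` and the depth guard of 1,000 calls are assumptions added for illustration; without the guard, Ruby would eventually raise `SystemStackError`.

```ruby
# Naive translation of Expr -> Expr Op Expr: the method calls itself
# before consuming any token, so the position never advances.
def left_recursive_expr(tokens, pos, depth = 0)
  if depth >= 1_000
    raise "gave up after #{depth} nested calls at position #{pos}"
  end
  left_recursive_expr(tokens, pos, depth + 1)
  # The Op and the right-hand Expr would be parsed here, but this
  # point is never reached.
end

begin
  left_recursive_expr([2, "+", 3], 0)
rescue RuntimeError => e
  puts e.message
end
```

The position is still 0 when the guard fires, which shows that no input was consumed at any level of the recursion.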
  31. Remaining input: 7).
    Rule: Num → 7; Expr → Num
    Stack: 3 2 * + Expr ( Op Num Num Op
  32. Remaining input: 7).
    Rule: Num → 7; Expr → Num
    Stack: 3 2 * + 7 Expr ( Op Num Num Op
  33. Remaining input: ).
    Rule: Expr → Num Op Expr
    Stack: 3 2 * + 7 Op Num Expr ( Op Num
  34. Remaining input: ).
    Rule: Expr → Num Op Expr
    Stack: 3 2 * + 7 Op Num Expr ( Op Num
  35. Remaining input: <eos>.
    Rule: Expr → Num Op Expr
    Stack: 2 * + 3 7 Op Num Expr
  36. Remaining input: <eos>.
    Rule: Expr → Num Op Expr
    Stack: 2 * + 3 7 Op Num Expr
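The stack frames above can be approximated with a hand-written shift-reduce loop. This is a sketch under stated assumptions, not the table-driven algorithm a parser generator emits: `lex`, `reduce_once`, and `shift_reduce` are hypothetical names, stack entries are `[kind, value]` pairs, and each reduction computes the value directly instead of building a tree.

```ruby
require 'strscan'

# Turn the input into [kind, value] pairs.
def lex(input)
  ss = StringScanner.new(input)
  tokens = []
  until ss.eos?
    if ss.scan(/\d+/)
      tokens << [:num, ss.matched.to_i]
    elsif ss.scan(/[-+*]/)
      tokens << [:op, ss.matched]
    elsif ss.scan(/\(/)
      tokens << [:lparen, "("]
    elsif ss.scan(/\)/)
      tokens << [:rparen, ")"]
    elsif ss.scan(/\s+/)
      # skip whitespace
    else
      raise ArgumentError, "unexpected input at offset #{ss.pos}"
    end
  end
  tokens
end

# Try to apply one grammar rule to the top of the stack.
# Returns true if a reduction happened.
def reduce_once(stack, lookahead)
  kinds = stack.map(&:first)
  if kinds.last(3) == [:num, :op, :expr]
    # Expr -> Num Op Expr
    (_, left), (_, op), (_, right) = stack.pop(3)
    stack.push([:expr, left.send(op, right)])
  elsif kinds.last(3) == [:lparen, :expr, :rparen]
    # Expr -> ( Expr )
    _, inner, _ = stack.pop(3)
    stack.push([:expr, inner.last])
  elsif kinds.last == :num && (lookahead.nil? || lookahead.first != :op)
    # Expr -> Num, only when no operator follows
    stack.push([:expr, stack.pop.last])
  else
    return false
  end
  true
end

def shift_reduce(input)
  tokens = lex(input)
  stack = []
  loop do
    next if reduce_once(stack, tokens.first)  # reduce while possible
    break if tokens.empty?
    stack.push(tokens.shift)                  # otherwise shift
  end
  unless stack.map(&:first) == [:expr]
    raise ArgumentError, "could not reduce input to a single Expr"
  end
  stack.first.last
end

p shift_reduce("2 * (3 + 7)")
```

The one-token lookahead in the `Num` case is what keeps the loop from reducing `3` to `Expr` too early when `+` follows, matching the order of reductions in the slides.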
  37. class CalcParser
        options no_result_var
      rule
        expr : NUM OP NUM { val[0].send(val[1], val[2]) }
      end
      # tokenizer goes here
  38. class CalcParser < Racc::Parser
      module_eval(<<'...end calc.y/module_eval...', 'calc.y', 10)
      #tokenizer deleted for space reasons
      ...end calc.y/module_eval...
      ##### State transition tables begin ###
      racc_action_table = [ 2, 3, 4, 5, 6 ]
      racc_action_check = [ 0, 1, 2, 3, 4 ]
      racc_action_pointer = [ -2, 1, -1, 3, 2, nil, nil ]
      racc_action_default = [ -2, -2, -2, -2, -2, 7, -1 ]
      racc_goto_table = [ 1 ]
      racc_goto_check = [ 1 ]
  39. $ ruby -y calc.rb
      Starting parse
      Entering state 0
      Reducing stack by rule 1 (line 855):
      -> $$ = nterm $@1 ()
      Stack now 0
      Entering state 2
      Reading a token: Next token is token tINTEGER ()
      Shifting token tINTEGER ()
      Entering state 41
      Reducing stack by rule 499 (line 4302):
      $1 = token tINTEGER ()
      -> $$ = nterm numeric ()
      Stack now 0 2
      Entering state 109
      Reducing stack by rule 448 (line 3811)
  40. Language Hierarchy (Chomsky)
    Type 0: Unrestricted (natural languages)
    Type 1: Context-sensitive (<hand wave>)
    Type 2: Context-free (computer languages)
    Type 3: Regular (regular expressions)
  41. Regular languages
    • Left-hand side = single non-terminal
    • Right-hand side = terminal, sometimes with a non-terminal EITHER preceding OR following
    e.g.
    A → x
    A → Bx
    A → nil
  42. Context-free languages
    • Presence of a stack to remember if a symbol has occurred before (e.g. shift-reduce)
    • More flexible grammar rules: right-hand side can be a sequence of terminals and non-terminals
  43. “ab” language
    Valid sentences:
    • ab
    • aabb
    • aaaaabbbbb
    • aⁿbⁿ
    Invalid sentences:
    • aaaaaa
    • abb
    • aab
    • ababab
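The role of the stack in recognizing the "ab" language can be shown with a short sketch. The method name `a_n_b_n?` is hypothetical, and the sketch assumes n ≥ 1, since "ab" is the shortest valid sentence listed. Each "a" is pushed and each "b" pops one, so the counts must match exactly.

```ruby
# Accepts n "a"s followed by n "b"s (n >= 1), using an explicit stack.
def a_n_b_n?(string)
  stack = []
  seen_b = false
  string.each_char do |char|
    case char
    when "a"
      return false if seen_b        # an "a" after a "b" breaks the pattern
      stack.push(char)
    when "b"
      seen_b = true
      return false if stack.empty?  # more "b"s than "a"s
      stack.pop
    else
      return false                  # any other character is invalid
    end
  end
  seen_b && stack.empty?            # leftover "a"s mean too few "b"s
end

%w[ab aabb aaaaabbbbb aaaaaa abb aab ababab].each do |sentence|
  puts "#{sentence}: #{a_n_b_n?(sentence)}"
end
```

A counter would do the same job for this particular language; the stack is used here only to mirror the slide's description of context-free languages.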
  44. Bundler: regular expression matching
      NAME_VERSION = '(?! )(.*?)(?: \(([^-]*)(?:-(.*))?\))?'
      NAME_VERSION_2 = /^ {2}#{NAME_VERSION}(!)?$/

      def parse_dependency(line)
        if line =~ NAME_VERSION_2
          name = $1
          version = $2
          pinned = $4
          # …
          @dependencies << dep
        end
        # No error handling for corrupted Lockfiles
      end
  45. Rubygems: Recursive Descent parser
      def parse_DEPENDENCIES
        while not @tokens.empty? and :text == peek.type do
          token = get :text
          requirements = []
          case peek[0]
          when :l_paren then
            get :l_paren
            loop do
              op = get(:requirement).value
              version = get(:text).value
              # Meaningful ParseError raised for unexpected tokens
              ...
  46. Conclusion
    It doesn’t take much to break regular expressions.
    Parsers are awesome! More accurate! Faster!
    …but hard to write.
    Good thing we have parser generators!
  47. • Recursive Descent parser: http://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/
    • Shift-reduce parser: http://cons.mit.edu/sp14/ocw/L03.pdf
    • Constructing Language Processors for Little Languages, Randy M. Kaplan (ISBN-13: 978-0471597537)
    • Ruby Under a Microscope, Pat Shaughnessy (ISBN-13: 978-1593275273)
    • Parser generators:
      • ANTLR (http://www.antlr.org/)
      • http://theorangeduck.com/page/you-could-have-invented-parser-combinators