Slide 1

Slide 1 text

Hsing-Hui Hsu 徐行慧 @SoManyHs github.com/Elffers

Slide 2

Slide 2 text

Time flies like an arrow; Fruit flies like a banana: Parsers for Great Good
Hsing-Hui Hsu 徐行慧 @SoManyHs

Slide 3

Slide 3 text

or: How I Accidentally a Computer Science, and So Can You!

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

parser.rb

Slide 6

Slide 6 text

parser.rb …wat.

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

parser.y

Slide 9

Slide 9 text

parser.y

Slide 10

Slide 10 text

Let’s play Mad Libs!

Slide 11

Slide 11 text

The young man _________(verb).

Slide 12

Slide 12 text

The young man drank.

Slide 13

Slide 13 text

The young man drank. subject + verb (intr.)

Slide 14

Slide 14 text

The young man ________(verb) ________(noun - direct object).

Slide 15

Slide 15 text

The young man drank (verb) ________(noun - direct object).

Slide 16

Slide 16 text

The young man drank sake.

Slide 17

Slide 17 text

The young man drank sake. subject + verb + object

Slide 18

Slide 18 text

The young man _________(noun).

Slide 19

Slide 19 text

The young man the boat. ⛵

Slide 20

Slide 20 text

The young man the boat. subject + direct object

Slide 21

Slide 21 text

The young man the boat. subject + direct object NOT GRAMMATICAL!

Slide 22

Slide 22 text

The young man the boat. subject + verb + direct object

Slide 23

Slide 23 text

Garden Path Sentences

Slide 24

Slide 24 text

Time flies like an arrow; fruit flies like a banana.
時は矢のように過ぎ去る; ミバエはバナナを好む。
[[Time] [flies [like [an arrow]]]]; [[fruit flies] [like [a banana]]].
[[時は][[[矢]のように]過ぎ去る]]; [[ミバエは][[バナナ]を好む]]。

Slide 25

Slide 25 text

The prime number few.
原始人は少ない数しか数えられない。
[[The prime] [number [few]]].
[[原始人は][[少ない数しか]数えられない]]。

Slide 26

Slide 26 text

The man who hunts ducks out on weekends.
男は週末ごとに狩りをしにこっそり出かける。
[[The man who] [hunts [ducks out [on weekends]]]].
[[男は][[[週末ごとに]狩りをしに]こっそり出かける]]。

Slide 27

Slide 27 text

The woman who whistles tunes pianos.
この口笛を吹く女はピアノの調律をする。
[[The [woman who] [whistles]] [tunes [pianos]]].
[[この[[口笛を吹く]女は]] [[ピアノ]の調律をする]]。

Slide 28

Slide 28 text

先生がお酒を飲んだ生徒を注意した。
"The teacher cautioned the student who had drunk sake."
Read left to right, 先生がお酒を飲んだ first looks like "The teacher drank sake", but お酒を飲んだ (drank sake) is describing 生徒 (student), and the teacher is actually doing 注意した (cautioning).

Slide 29

Slide 29 text

GRANDMOTHER OF EIGHT MAKES HOLE IN ONE
8人の孫を持つお婆さんがホールインワンを達成する

Slide 30

Slide 30 text

COMPLAINTS ABOUT NBA REFEREES GROWING UGLY

Slide 31

Slide 31 text

MILK DRINKERS ARE TURNING TO POWDER

Slide 32

Slide 32 text

Grammar Rules

Slide 33

Slide 33 text

Sentence = Subject + Predicate Predicate = Verb + Stuff

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

(Extended) Backus-Naur Form:
• Metalanguage notation used to describe a language by a set of production rules
• Each rule is expressed with terminal and non-terminal symbols

Slide 36

Slide 36 text

(Extended) Backus-Naur Form:
Production (a.k.a. rewrite) rules are expressed as:
 Left-hand side → Right-hand side
 Non-terminal → sequence of terminals and non-terminals

Slide 37

Slide 37 text

BNF for English Sentences
Sentence → Subj Pred
Pred → Verb Stuff

Slide 38

Slide 38 text

“The young man drank sake” / “The young man the boat”
1. S → NP VP
2. NP → Art NP
3. NP → Adj N
4. NP → N
5. VP → V NP
6. Art → “The”
7. Art → “a”
8. Adj → “young”
9. N → “man” | “young” | “boat” | “sake”
10. V → “man” | “drank”

Slide 39

Slide 39 text

Non-terminals = {S, NP, VP, N, V, Art, Adj}
Terminals = {“the”, “a”, “young”, “man”, “boat”, “sake”, “drank”}

Slide 40

Slide 40 text

The young man the boat
S → NP VP
⇒ Art N VP
⇒ The young VP
⇒ The young V NP
⇒ The young man NP
⇒ The young man Art N
⇒ The young man the N
⇒ The young man the boat

Slide 41

Slide 41 text

The young man drank sake
S → NP VP
⇒ Art NP VP
⇒ The NP VP
⇒ The Adj N VP
⇒ The young N VP
⇒ The young man VP
⇒ The young man V NP
⇒ The young man drank NP
⇒ The young man drank N
⇒ The young man drank sake
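
The two derivations above can be checked mechanically. The following is a minimal sketch, not code from the talk: it encodes the ten rules as a Ruby hash and tries every alternative for each non-terminal. The names GRAMMAR, ends and grammatical? are made up for illustration, and words are lower-cased so that “The” and “the” are treated alike.

```ruby
# Toy grammar from the slides; symbols are non-terminals, strings are terminals.
GRAMMAR = {
  S:   [[:NP, :VP]],
  NP:  [[:Art, :NP], [:Adj, :N], [:N]],
  VP:  [[:V, :NP]],
  Art: [["the"], ["a"]],
  Adj: [["young"]],
  N:   [["man"], ["young"], ["boat"], ["sake"]],
  V:   [["man"], ["drank"]]
}.freeze

# Returns every position where `symbol` can end if it starts at `pos`.
def ends(symbol, words, pos)
  return (words[pos] == symbol ? [pos + 1] : []) if symbol.is_a?(String)

  GRAMMAR.fetch(symbol).flat_map do |rhs|
    # Thread the set of reachable positions through each symbol of the rule.
    rhs.reduce([pos]) do |positions, sym|
      positions.flat_map { |p| ends(sym, words, p) }
    end
  end.uniq
end

# A sentence is grammatical if S can span all of its words.
def grammatical?(sentence)
  words = sentence.downcase.split
  ends(:S, words, 0).include?(words.length)
end

puts grammatical?("The young man drank sake")
puts grammatical?("The young man the boat")
puts grammatical?("The young man")
```

Because “young” can be an Adj or an N, and “man” can be an N or a V, the recognizer has to keep both readings alive, which is the same ambiguity that makes the garden path sentence hard for a human reader.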

Slide 42

Slide 42 text

But what does this have to do with computers?

Slide 43

Slide 43 text

Source code → Lexer → Tokens → Parser → Syntax Tree → Compiler → Native Code

Slide 44

Slide 44 text

Input → Lexer → Tokens → Parser → Output

Slide 45

Slide 45 text

Lexing (Tokenizing)

Slide 46

Slide 46 text

Math!
• Addition: 3 + 7
• Subtraction: 3 - 7
• Multiplication: 3 * 7

Slide 47

Slide 47 text

Math Rules
1. Expr → Num Op Num
2. Num → /\d+/
3. Op → /[+*-]/

Slide 48

Slide 48 text

require 'strscan'

def tokenize input
  ss = StringScanner.new input
  tokens = []
  while not ss.eos?
    case
    when ss.scan(/\d+/)
      token = Token::Num.new(ss.matched.to_i)
      tokens.push token
    when ss.scan(/[+*-]/)
      token = Token::Op.new(ss.matched)
      tokens.push token
    when ss.scan(/\s+/)
      # ignore whitespace
    else
      raise ParseError
    end
  end
  tokens
end

Slide 49

Slide 49 text

tokenize("3 + 7") => [Num(3), Op(+), Num(7)]

Slide 50

Slide 50 text

Parser.new(tokens).parse => Tree
  +
 / \
3   7

Slide 51

Slide 51 text

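class Parser
  def initialize tokens
    @tokens = tokens
  end

  # Expr → Num Op Num: take the three tokens in order.
  # @tokens is the Array returned by tokenize, so shift removes the next token.
  def parse
    left = @tokens.shift
    head = @tokens.shift
    right = @tokens.shift
    Parser::Tree.new(head, left, right)
  end
end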
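
The slide uses Parser::Tree without showing it. As a rough sketch of how the pieces could fit together, the block below assumes a hypothetical Tree struct with an evaluate method, and feeds the parser plain Ruby values (Integers and operator Symbols) instead of the Token objects from the tokenizer.

```ruby
# Hypothetical tree node: an operator (head) with a left and a right operand.
Tree = Struct.new(:head, :left, :right) do
  # Applies the operator to the operands, e.g. 3.send(:+, 7).
  def evaluate
    left.send(head, right)
  end
end

class Parser
  def initialize(tokens)
    @tokens = tokens.dup # copy so the caller's array is left untouched
  end

  # Expr → Num Op Num
  def parse
    left  = @tokens.shift
    head  = @tokens.shift
    right = @tokens.shift
    Tree.new(head, left, right)
  end
end

tree = Parser.new([3, :+, 7]).parse
puts tree.evaluate
```

Putting the operator at the head of the node is what lets evaluation fall out of the tree shape: the parser decides structure, and a separate step decides meaning.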

Slide 52

Slide 52 text

Slightly harder math

Slide 53

Slide 53 text

2 * (3 + 7)

Slide 54

Slide 54 text

Slightly Harder Math Rules
1. Expr → Num Op Expr
        | (Expr)
        | Num
2. Num → /\d+/
3. Op → /[+*-]/

Slide 55

Slide 55 text

tokenize("2 * (3 + 7)") => [2, *, (, 3, +, 7, )]
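
To produce that token list, the earlier tokenizer needs to recognize parentheses as well. The following is one possible sketch, not the talk's code: it drives StringScanner from a table of patterns, and the token type names (:lparen, :rparen) and the [type, value] pair format are assumptions made for this example.

```ruby
require "strscan"

class ParseError < StandardError; end

# Token patterns tried in order; the type names are made up for this sketch.
TOKEN_PATTERNS = {
  num:    /\d+/,
  op:     /[+*-]/,
  lparen: /\(/,
  rparen: /\)/
}.freeze

def tokenize(input)
  ss = StringScanner.new(input)
  tokens = []
  until ss.eos?
    next if ss.skip(/\s+/) # whitespace separates tokens but is not one itself

    # find stops at the first pattern that matches at the current position
    type, _ = TOKEN_PATTERNS.find { |_, pattern| ss.scan(pattern) }
    raise ParseError, "unexpected character at position #{ss.pos}" unless type

    tokens << [type, type == :num ? ss.matched.to_i : ss.matched]
  end
  tokens
end

p tokenize("2 * (3 + 7)").map(&:last)
```

Keeping the patterns in a table means a new token type is one more entry rather than one more branch.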

Slide 56

Slide 56 text

  *
 / \
2   +
   / \
  3   7

Slide 57

Slide 57 text

Top Down (with 1-token lookahead)

Slide 58

Slide 58 text

2 * (3 + 7) Current Token Next token

Slide 59

Slide 59 text

2 * (3 + 7) Current Token Next token 2

Slide 60

Slide 60 text

2 * (3 + 7) Current Token Next token 2 *

Slide 61

Slide 61 text

2 * (3 + 7) Current Token Next token 2 * Rule: Expr → Num Op Expr

Slide 62

Slide 62 text

2 * (3 + 7) Current Token Next token 2 * Rule: Expr → Num Op Expr 2

Slide 63

Slide 63 text

2 Current Token Next token * (3 + 7)

Slide 64

Slide 64 text

2 Current Token Next token * * (3 + 7)

Slide 65

Slide 65 text

2 Current Token Next token * ( * (3 + 7)

Slide 66

Slide 66 text

2 Rule: Expr → Num Op Expr *Expr → (Expr) Current Token Next token * ( * (3 + 7)

Slide 67

Slide 67 text

2 Rule: Expr → Num Op Expr *Expr → (Expr) * 2 Expr Current Token Next token * ( * (3 + 7)

Slide 68

Slide 68 text

(3 + 7) Current Token Next token * 2 Expr

Slide 69

Slide 69 text

(3 + 7) Current Token Next token ( * 2 Expr

Slide 70

Slide 70 text

(3 + 7) Current Token Next token ( 3 * 2 Expr

Slide 71

Slide 71 text

(3 + 7) Current Token Next token ( 3 * 2 Expr Rule: Expr → (Expr) *Expr → Num
 *Expr → Num Op Expr

Slide 72

Slide 72 text

(3 + 7) Current Token Next token ( 3 * 2 Expr (Expr) Rule: Expr → (Expr) *Expr → Num
 *Expr → Num Op Expr

Slide 73

Slide 73 text

3 + 7) Current Token Next token * 2 (Expr)

Slide 74

Slide 74 text

3 + 7) Current Token Next token 3 * 2 (Expr)

Slide 75

Slide 75 text

3 + 7) Current Token Next token 3 * 2 (Expr) Expr → Num Rule:

Slide 76

Slide 76 text

3 + 7) Current Token Next token 3 * 2 (Expr) Expr → Num Op Expr Expr → Num Rule:

Slide 77

Slide 77 text

3 + 7) Current Token Next token 3 + * 2 (Expr) Expr → Num Op Expr Expr → Num Rule:

Slide 78

Slide 78 text

3 + 7) Current Token Next token 3 + * 2 (Expr) Expr → Num Op Expr Rule:

Slide 79

Slide 79 text

3 + 7) Current Token Next token 3 + * 2 (Expr) 3 Expr → Num Op Expr Rule:

Slide 80

Slide 80 text

+ 7) Current Token Next token * 2 Expr 3

Slide 81

Slide 81 text

+ 7) Current Token Next token + * 2 Expr 3

Slide 82

Slide 82 text

+ 7) Current Token Next token + 7 * 2 Expr 3

Slide 83

Slide 83 text

+ 7) Current Token Next token + 7 Rule:
 Expr → Num Op Expr * 2 Expr 3

Slide 84

Slide 84 text

+ 7) Current Token Next token + 7 Rule:
 Expr → Num Op Expr * 2 Expr 3 +

Slide 85

Slide 85 text

7) Current Token Next token * 2 + 3

Slide 86

Slide 86 text

7) Current Token Next token 7 * 2 + 3

Slide 87

Slide 87 text

7) Current Token Next token 7 ) * 2 + 3

Slide 88

Slide 88 text

7) Current Token Next token 7 ) * 2 + 3 Rule:
 Expr → Num Expr → (Expr)

Slide 89

Slide 89 text

7) Current Token Next token 7 ) * 2 + 3 7 * 2 + 3 Rule:
 Expr → Num Expr → (Expr)

Slide 90

Slide 90 text

“2 * (3 + 7)”
2 * (3 + 7)
Num * (3 + 7)
Expr * (3 + 7)
Expr Op (3 + 7)
Expr Op (Expr)
Expr Op Expr
Expr

Slide 91

Slide 91 text

Recursive Descent
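
The walk-through above, written as code, might look like the following. This is a minimal sketch under a few assumptions: tokens are plain Ruby values (Integers for Num, Strings for operators and parentheses), the tree is a nested [op, left, right] array, and the names ExprParser and evaluate are invented for this example. Each grammar alternative becomes a branch, and one token of lookahead chooses between Expr → Num and Expr → Num Op Expr.

```ruby
# Recursive descent over: Expr → Num Op Expr | ( Expr ) | Num
class ExprParser
  OPS = %w[+ - *].freeze

  def initialize(tokens)
    @tokens = tokens.dup
  end

  def parse
    tree = expr
    raise ArgumentError, "unexpected token #{@tokens.first.inspect}" unless @tokens.empty?
    tree
  end

  private

  def expr
    case @tokens.first
    when Integer
      left = @tokens.shift
      # Lookahead: no operator next means the rule was Expr → Num.
      return left unless OPS.include?(@tokens.first)
      op = @tokens.shift
      [op, left, expr] # Expr → Num Op Expr, recursing for the right side
    when "("
      @tokens.shift
      inner = expr # Expr → ( Expr )
      raise ArgumentError, "expected )" unless @tokens.shift == ")"
      inner
    else
      raise ArgumentError, "unexpected token #{@tokens.first.inspect}"
    end
  end
end

# Walks the nested [op, left, right] arrays produced by the parser.
def evaluate(tree)
  return tree if tree.is_a?(Integer)
  op, left, right = tree
  evaluate(left).send(op, evaluate(right))
end

tree = ExprParser.new([2, "*", "(", 3, "+", 7, ")"]).parse
p tree
p evaluate(tree)
```

Because the grammar recurses on the right, this sketch groups 10 - 3 - 2 as 10 - (3 - 2). Rewriting the rule as Expr → Expr Op Num to fix that would make expr call itself before consuming any input, which is the infinite recursion problem.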

Slide 92

Slide 92 text

Problems with Recursive Descent parsers
• Inefficient
• Possibility of infinite recursion, e.g. Expr → Expr Op Expr
• Limitations on grammar rules

Slide 93

Slide 93 text

Bottom-Up (with a stack to remember things)

Slide 94

Slide 94 text

Bottom-Up (with a stack to remember things) a.k.a. shift-reduce parsing

Slide 95

Slide 95 text

2 * (3 + 7) Stack

Slide 96

Slide 96 text

2 * (3 + 7) Stack 2

Slide 97

Slide 97 text

2 * (3 + 7) Rule: Num → 2 Stack 2

Slide 98

Slide 98 text

2 * (3 + 7) Rule: Num → 2 Stack 2 Num

Slide 99

Slide 99 text

2 * (3 + 7) Rule: Num → 2 Stack 2 2 Num

Slide 100

Slide 100 text

* (3 + 7) Stack 2 Num

Slide 101

Slide 101 text

* (3 + 7) Stack * 2 Num

Slide 102

Slide 102 text

* (3 + 7) Rule: Op → * Stack * 2 Num

Slide 103

Slide 103 text

* (3 + 7) Rule: Op → * Stack * Op 2 Num

Slide 104

Slide 104 text

* (3 + 7) Rule: Op → * Stack * Op 2 * Num

Slide 105

Slide 105 text

(3 + 7) Stack Op 2 * Num

Slide 106

Slide 106 text

(3 + 7) Stack ( Op 2 * Num

Slide 107

Slide 107 text

(3 + 7) Rule: Expr → (Expr) Stack ( Op 2 * Num

Slide 108

Slide 108 text

3 + 7) Stack 2 * ( Op Num

Slide 109

Slide 109 text

3 + 7) Stack 3 2 * ( Op Num

Slide 110

Slide 110 text

3 + 7) Rule: Num → 3 Stack 3 2 * ( Op Num

Slide 111

Slide 111 text

3 + 7) Rule: Num → 3 Stack 3 Num 2 * ( Op Num

Slide 112

Slide 112 text

3 + 7) Rule: Num → 3 Stack 3 Num 3 2 * ( Op Num

Slide 113

Slide 113 text

+ 7) Stack 3 2 * ( Op Num Num

Slide 114

Slide 114 text

+ 7) Stack + 3 2 * ( Op Num Num

Slide 115

Slide 115 text

+ 7) Rule: Op → + Stack + 3 2 * ( Op Num Num

Slide 116

Slide 116 text

+ 7) Rule: Op → + Stack + Op 3 2 * ( Op Num Num

Slide 117

Slide 117 text

+ 7) Rule: Op → + Stack + Op 3 2 * + ( Op Num Num

Slide 118

Slide 118 text

7) Stack 3 2 * + ( Op Num Num Op

Slide 119

Slide 119 text

7) Stack 3 2 * + 7 ( Op Num Num Op

Slide 120

Slide 120 text

7) Rule:
 Num → 7 Expr → Num Stack 3 2 * + 7 ( Op Num Num Op

Slide 121

Slide 121 text

7) Rule:
 Num → 7 Expr → Num Stack 3 2 * + Expr ( Op Num Num Op

Slide 122

Slide 122 text

7) Rule:
 Num → 7 Expr → Num Stack 3 2 * + 7 Expr ( Op Num Num Op

Slide 123

Slide 123 text

) Stack 3 2 * + 7 Op Num Expr ( Op Num

Slide 124

Slide 124 text

) Rule: Expr → Num Op Expr Stack 3 2 * + 7 Op Num Expr ( Op Num

Slide 125

Slide 125 text

) Rule: Expr → Num Op Expr Stack 3 2 * + 7 Op Num Expr ( Op Num

Slide 126

Slide 126 text

) Rule: Expr → Num Op Expr Stack 3 2 * + 7 Expr ( Op Num

Slide 127

Slide 127 text

) Rule: Expr → Num Op Expr Stack 3 2 * + 7 Expr ( Op Num

Slide 128

Slide 128 text

) Rule: Expr → Num Op Expr Stack 2 * Expr + 3 7 ( Op Num

Slide 129

Slide 129 text

) Stack 2 * + 3 7 ( Op Num Expr

Slide 130

Slide 130 text

) Stack 2 * ) + 3 7 ( Op Num Expr

Slide 131

Slide 131 text

) Rule: Expr → (Expr) Stack 2 * ) + 3 7 ( Op Num Expr

Slide 132

Slide 132 text

Rule: Expr → (Expr) Stack 2 * ( Expr ) + 3 7 Op Num

Slide 133

Slide 133 text

Rule: Expr → (Expr) Stack 2 * + 3 7 Op Num ( Expr )

Slide 134

Slide 134 text

Rule: Expr → (Expr) Stack 2 * + 3 7 Op Num Expr

Slide 135

Slide 135 text

Stack 2 * + 3 7 Rule: Expr → Num Op Expr Op Num Expr

Slide 136

Slide 136 text

Stack 2 * + 3 7 Rule: Expr → Num Op Expr Op Num Expr

Slide 137

Slide 137 text

Stack 2 * + 3 7 Rule: Expr → Num Op Expr Expr

Slide 138

Slide 138 text

Stack 2 * + 3 7 Rule: Expr → Num Op Expr Expr
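
The stack walk-through above can be approximated in a few lines of Ruby. This is a hand-rolled sketch for this one grammar, not how a generated parser works internally (those are driven by state tables). Assumptions: tokens are Integers and Strings, each stack entry is a [symbol, tree] pair, numbers and operators are labeled Num and Op as they are shifted, and the function names are made up.

```ruby
OPS = %w[+ - *].freeze

# Shift tokens onto the stack one at a time, reducing after every shift.
def shift_reduce(tokens)
  input = tokens.dup
  stack = []
  until input.empty?
    token = input.shift
    stack << case token
             when Integer then [:Num, token]
             when *OPS    then [:Op, token]
             else              [token, token] # parentheses stay as themselves
             end
    reduce(stack, input.first)
  end
  raise ArgumentError, "could not reduce to a single Expr" unless stack.map(&:first) == [:Expr]
  stack.first.last
end

# Applies grammar rules to the top of the stack until none matches.
def reduce(stack, lookahead)
  loop do
    symbols = stack.map(&:first)
    if symbols.last == :Num && !OPS.include?(lookahead)
      # Expr → Num, only when no operator follows
      stack << [:Expr, stack.pop.last]
    elsif symbols.last(3) == [:Num, :Op, :Expr]
      # Expr → Num Op Expr
      (_, left), (_, op), (_, right) = stack.pop(3)
      stack << [:Expr, [op, left, right]]
    elsif symbols.last(3) == ["(", :Expr, ")"]
      # Expr → ( Expr )
      _, (_, inner), _ = stack.pop(3)
      stack << [:Expr, inner]
    else
      break
    end
  end
end

p shift_reduce([2, "*", "(", 3, "+", 7, ")"])
```

The lookahead check in the first branch plays the role of the decision on the slide where 7 becomes an Expr only because the next token is a closing parenthesis.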

Slide 139

Slide 139 text

Top-down: Recursive Descent
Bottom-up: Shift-reduce

Slide 140

Slide 140 text

mo’ rules, mo’ problems

Slide 141

Slide 141 text

Yacc/Racc/bison to the rescue!

Slide 142

Slide 142 text

Grammar.y Parser Generator Parser Output

Slide 143

Slide 143 text

class CalcParser
  options no_result_var
rule
  expr : NUM OP NUM { val[0].send(val[1], val[2]) }
end

# tokenizer goes here

Slide 144

Slide 144 text

$ racc calc.y => calc.tab.rb

Slide 145

Slide 145 text

class CalcParser < Racc::Parser

module_eval(<<'...end calc.y/module_eval...', 'calc.y', 10)
# tokenizer deleted for space reasons
...end calc.y/module_eval...
##### State transition tables begin ###

racc_action_table = [ 2, 3, 4, 5, 6 ]
racc_action_check = [ 0, 1, 2, 3, 4 ]
racc_action_pointer = [ -2, 1, -1, 3, 2, nil, nil ]
racc_action_default = [ -2, -2, -2, -2, -2, 7, -1 ]
racc_goto_table = [ 1 ]
racc_goto_check = [ 1 ]

Slide 146

Slide 146 text

$ echo "2 * (3 + 7)" > calc.rb
$ ruby -y calc.rb

Slide 147

Slide 147 text

$ ruby -y calc.rb
Starting parse
Entering state 0
Reducing stack by rule 1 (line 855):
-> $$ = nterm $@1 ()
Stack now 0
Entering state 2
Reading a token: Next token is token tINTEGER ()
Shifting token tINTEGER ()
Entering state 41
Reducing stack by rule 499 (line 4302):
$1 = token tINTEGER ()
-> $$ = nterm numeric ()
Stack now 0 2
Entering state 109
Reducing stack by rule 448 (line 3811)

Slide 148

Slide 148 text

$ man ruby

Slide 149

Slide 149 text

$ man ruby

Slide 150

Slide 150 text

$ man ruby

Slide 151

Slide 151 text

But HHH, when would I use a parser?

Slide 152

Slide 152 text

String validation (URLs, email addresses)
Emails
Logfiles
Formatted documents

Slide 153

Slide 153 text

No, but really, why would I use a parser? I can just use regexes!

Slide 154

Slide 154 text

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
validate this!

Slide 155

Slide 155 text

Language Hierarchy (Chomsky)
Type 0: Unrestricted (natural languages)
Type 1: Context-sensitive
Type 2: Context-free (computer languages)
Type 3: Regular (regular expressions)

Slide 156

Slide 156 text

Regular languages
• Left-hand side = single non-terminal
• Right-hand side = a terminal, sometimes with a non-terminal EITHER preceding OR following it

e.g. A → x
     A → Bx
     A → nil

Slide 157

Slide 157 text

Context-free languages
• Presence of a stack to remember if a symbol has occurred before (e.g. shift-reduce)
• More flexible grammar rules: the right-hand side can be any sequence of terminals and non-terminals

Slide 158

Slide 158 text

Most “languages” aren’t regular!

Slide 159

Slide 159 text

“ab” language
Valid sentences:
• ab
• aabb
• aaaaabbbbb
• aⁿbⁿ
Invalid sentences:
• aaaaaa
• abb
• aab
• ababab

Slide 160

Slide 160 text

A → “a”
B → “b”
S → AB
AB → AABB
…

Slide 161

Slide 161 text

What’s the grammar?

Slide 162

Slide 162 text

S → aSb
S → nil
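
These two rules translate almost directly into a recursive method. The following is a small sketch (the method name ab_sentence? is made up): each call corresponds to one use of S, peeling an “a” off the front and a “b” off the back until nothing is left. Note that S → nil also makes the empty string a valid sentence.

```ruby
# Recognizer that mirrors S → a S b | nil, one recursive call per S.
def ab_sentence?(string)
  return true if string.empty? # S → nil
  return false unless string.start_with?("a") && string.end_with?("b")

  ab_sentence?(string[1..-2])  # S → a S b: strip the outer pair, check the inner S
end

%w[ab aabb aaaaabbbbb aaaaaa abb aab ababab].each do |s|
  puts "#{s}: #{ab_sentence?(s)}"
end
```

The recursion depth is what counts the “a”s so they can be matched against the “b”s, which is the job the stack does in a context-free parser.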

Slide 163

Slide 163 text

RFC 2617 See https://github.com/drbrain/net-http-digest_auth

Slide 164

Slide 164 text

Bundler vs Rubygems parser (for resolving gem dependencies in Gemfile.lock)

Slide 165

Slide 165 text

Bundler: regular expression matching

NAME_VERSION = '(?! )(.*?)(?: \(([^-]*)(?:-(.*))?\))?'
NAME_VERSION_2 = /^ {2}#{NAME_VERSION}(!)?$/

def parse_dependency(line)
  if line =~ NAME_VERSION_2
    name = $1
    version = $2
    pinned = $4
    # …
    @dependencies << dep
  end
  # No error handling for corrupted Lockfiles
end

Slide 166

Slide 166 text

Rubygems: Recursive Descent parser

def parse_DEPENDENCIES
  while not @tokens.empty? and :text == peek.type do
    token = get :text
    requirements = []
    case peek[0]
    when :l_paren then
      get :l_paren
      loop do
        op = get(:requirement).value
        version = get(:text).value
        # Meaningful ParseError raised for unexpected tokens
        ...

Slide 167

Slide 167 text

Conclusion
It doesn’t take much to break regular expressions
Parsers are awesome! More accurate! Faster!
…but hard to write.
Good thing we have parser generators!

Slide 168

Slide 168 text

• Recursive Descent parser: http://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/
• Shift-reduce parser: http://cons.mit.edu/sp14/ocw/L03.pdf
• Constructing Language Processors for Little Languages, Randy M. Kaplan (ISBN-13: 978-0471597537)
• Ruby Under a Microscope, Pat Shaughnessy (ISBN-13: 978-1593275273)
• Parser generators:
  • ANTLR (http://www.antlr.org/)
  • http://theorangeduck.com/page/you-could-have-invented-parser-combinators

Slide 169

Slide 169 text

Hsing-Hui Hsu 徐行慧 @SoManyHs github.com/Elffers