What is Parser

by yui-knk

Slide 1

Slide 1 text

September 7, 2024 Fukuoka RubyistKaigi 04 @yui-knk Yuichiro Kaneko What is Parser

Slide 2

Slide 2 text

About Me The world is now in the great age of parsers. People are setting sail into the vast sea of parsers. - RubyKaigi 2023 LT- Yuichiro Kaneko https://twitter.com/kakutani/status/1657762294431105025/

Slide 3

Slide 3 text

Self Introduction • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf (Twitter) • Treasure Data • Engineering Manager of Applications Backend

Slide 4

Slide 4 text

In OSS world • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf (Twitter) • CRuby committer, mainly developing the parser generator and the parser • Lrama, an LALR(1) parser generator (2023, Ruby 3.3) • The Bison Slayer • Ripper Rearchitecture (2024, Ruby 3.4) • Code positions to RNode (2018, Ruby 2.6) • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6)

Slide 5

Slide 5 text

What is Parser Parser generator consists of three parts, Frontend, Backend and Code Generator. Each component is independent from others so that we need to touch only necessary components when new feature is enhanced. - BuriKaigi 2024 in Toyama -

Slide 6

Slide 6 text

What parser does • Parser gives the structure to input string • Really ? Class Method Method Assignment @name Call name capitalize

Slide 7

Slide 7 text

What parser does • Parser gives the structure to input string bytes 636c61737320477265657465720a2020646566 20696e697469616c697a65286e616d65290a20 202020406e616d65203d206e616d652e636170 6974616c697a650a2020656e640a0a2020646 5662073616c7574650a2020202070757473202 248656c6c6f20237b406e616d657d21220a202 0656e640a656e640a

Slide 8

Slide 8 text

What lexer does • Cut bytes into chunks (tokens) 636c61737320477265657465720a2020646566 20696e697469616c697a65286e616d65290a20 202020406e616d65203d206e616d652e636170 6974616c697a650a2020656e640a0a2020646 5662073616c7574650a2020202070757473202 248656c6c6f20237b406e616d657d21220a202 0656e640a656e640a class Greeter def

Slide 9

Slide 9 text

Parser & Lexer • Lexer generates tokens from bytes • Parser gives structure to tokens Class Method Method Assignment @name Call name capitalize class Greeter def Lexer Parser …
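As a rough illustration of the lexer step, here is a minimal sketch that cuts the first bytes of the hex dump from the earlier slides into tokens. The token table and the `tokenize` helper are assumptions made up for this example; the real Ruby lexer is far more involved.

```ruby
require "strscan"

# Hypothetical token table for this sketch only.
TOKEN_PATTERNS = [
  [:keyword, /class\b|def\b|end\b/],
  [:ivar,    /@[A-Za-z_]\w*/],
  [:const,   /[A-Z]\w*/],
  [:ident,   /[a-z_]\w*/],
  [:punct,   /[=.()]/],
].freeze

# Cut the input bytes into [type, text] chunks, skipping whitespace.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/)
    # Try each pattern in order; the first one that matches wins.
    type, _pattern = TOKEN_PATTERNS.find { |_, pattern| scanner.scan(pattern) }
    raise "unexpected byte at offset #{scanner.pos}" unless type
    tokens << [type, scanner.matched]
  end
  tokens
end

# The first bytes of the hex dump decode to "class Greeter\n  def".
bytes = ["636c61737320477265657465720a2020646566"].pack("H*")
p tokenize(bytes)
```

The parser would then consume these `[type, text]` pairs instead of raw bytes.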

Slide 10

Slide 10 text

Very easy.

Slide 11

Slide 11 text

Very easy, right ?

Slide 12

Slide 12 text

I will guide you to the frontline

Slide 13

Slide 13 text

Theory of formal language Defeat calamities by more powerful theory, abstraction and Refactoring - RubyKaigi 2024 in Okinawa -

Slide 14

Slide 14 text

Language • A (formal) language is a subset of words • Some words belong to Ruby language • Others don’t

Slide 15

Slide 15 text

Ruby • Even though these programs are transcendental and intricate, they belong to the Ruby language. https://github.com/tric/trick2022/blob/master/01-tompng/entry.rb https://github.com/tric/trick2022/blob/master/06-mame/entry.rb

Slide 16

Slide 16 text

Not Ruby • At a glance this looks like Ruby code, but it does not belong to the Ruby language.

Slide 17

Slide 17 text

Grammar • The (Ruby) language is an infinite set of words • A grammar is a finite set of rules that defines the language Grammar Language …

Slide 18

Slide 18 text

Grammar • Grammar provides structure to the language [Figure: two syntax trees over the operators + and * and the operands 1, 2, 3, one labeled Correct and one labeled Wrong]

Slide 19

Slide 19 text

Grammar class and automaton • Chomsky hierarchy • Four formal grammar classes form a hierarchy • Each grammar class corresponds to a class of automata • Regular: finite-state automaton • Context-free: non-deterministic pushdown automaton • Context-sensitive: linear-bounded non-deterministic Turing machine • Recursively enumerable: Turing machine

Slide 20

Slide 20 text

Use appropriate grammar class • With great power comes great difficulties • With a context-sensitive grammar, production rules are more difficult to read and design than with a context-free grammar, because multiple terminals and nonterminals appear on the left-hand side
Production rules for {a^n b^n c^n : n ≥ 1}
S → abc | aSBc
cB → Bc
bB → bb
Generate “aabbcc”: S → aSBc → aabcBc → aabBcc → aabbcc
Generate “aaabbbccc”: S → aSBc → aaSBcBc → aaabcBcBc → aaabcBBcc → aaabBcBcc → aaabBBccc → aaabbBccc → aaabbbccc
https://ja.wikipedia.org/wiki/%E6%96%87%E8%84%88%E4%BE%9D%E5%AD%98%E6%96%87%E6%B3%95
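The derivations above can be replayed mechanically. The following is a minimal sketch, assuming a fixed strategy for choosing which rule to apply next; `derive` is a hypothetical helper, and for n ≥ 3 it may apply the rules in a different order than the slide while reaching the same word.

```ruby
# Production rules from the slide: S → abc | aSBc, cB → Bc, bB → bb.
# Derive a^n b^n c^n by choosing S → aSBc (n - 1) times, then S → abc,
# and then applying the two context-sensitive rules until no "B" is left.
def derive(n)
  steps = ["S"]
  current = "S"
  (n - 1).times { steps << (current = current.sub("S", "aSBc")) }
  steps << (current = current.sub("S", "abc"))
  while current.include?("B")
    # Prefer bB → bb; otherwise move a B to the left with cB → Bc.
    current = current.include?("bB") ? current.sub("bB", "bb") : current.sub("cB", "Bc")
    steps << current
  end
  steps
end

puts derive(2).join(" → ")
```

Even this tiny language needs rules whose left-hand side depends on neighbouring symbols, which is the readability cost the slide points at.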

Slide 21

Slide 21 text

Context-free grammar (CFG) • Context-free grammar is readable • Then you can read it and try it CFG Single nonterminal appears

Slide 22

Slide 22 text

if + class • The code raises NoMethodError, but it is syntactically valid $ ruby -c test.rb Syntax OK $ ruby test.rb test.rb:3:in '': undefined method '+' for nil (NoMethodError) end + class C ^

Slide 23

Slide 23 text

Context-free grammar (CFG) • Context-free grammars are widely used in programming languages • To be accurate, deterministic context-free languages (DCFL) • DCFLs are a proper subset of the context-free languages • An LR parser analyses a DCFL in linear time

Slide 24

Slide 24 text

In Chomsky hierarchy Context-free Context-sensitive Recursively enumerable Linear-bounded non-deterministic Turing machine Non-deterministic pushdown automaton Turing machine DCFL Regular Finite-state automaton Deterministic pushdown automaton

Slide 25

Slide 25 text

Why LR parser? • LR parser • Can handle a large range of languages • Major parser algorithm • To be precise, LR-attributed grammar • I believe a grammar that is easy for humans is close to an LR grammar • LL parser • Has less power than LR parser • PEG • It’s difficult to create an error-tolerant parser • A rule failure doesn’t imply a parsing failure like in context free grammars

Slide 26

Slide 26 text

How to create parser? • Use parser generator • Lrama (CRuby) • Bison (Perl, PHP, PostgreSQL) • ANTLR (Hive, Trino) • Hand written parser • Go, Rust, C# • Prism

Slide 27

Slide 27 text

Why LR parser generator is the best? • LR parser generator gives accurate feedback for grammar • BNF is very declarative • No gap between grammar and parser implementation • LR parser is based on theory of computer science

Slide 28

Slide 28 text

RubyKaigi 2024 • Check slides and video for more detail • https://rubykaigi.org/2024/presentations/spikeolaf.html

Slide 29

Slide 29 text

Actually context-free grammar? • Whether the Ruby grammar is a CFG or not is sometimes discussed • This is a trick used in TRICK 2022 • This is NOT CFG because the existence of a variable affects how the following code is parsed https://www.slideshare.net/mametter/trick-2022-results

Slide 30

Slide 30 text

However • Current LR parser can parse such codes • Ruby committers have hacked parser but NOT hacked LR parser algorithm • There must be some tricks somewhere

Slide 31

Slide 31 text

LR-attributed grammar (LR ଐੑจ๏) • The key concept is LR-attributed grammar • LR parser can handle LR-attributed grammar

Slide 32

Slide 32 text

Attribute Grammar (属性文法) • Attribute grammars were invented by Donald Knuth and Peter Wegner • The original paper is Knuth, Donald E. (1968) "Semantics of context-free languages" • “An attribute grammar is a formal way to supplement a formal grammar with semantic information processing.” • https://en.wikipedia.org/wiki/Attribute_grammar

Slide 33

Slide 33 text

Static semantic analysis • Use cases • Check variable declarations and usages • Type checking • Check control flow
function f1() { var i = 1; i + j; } → Error: Not declared variable “j” is used
function f1() { var i = 1; var j = 2; i + j; }

Slide 34

Slide 34 text

Check variable declarations and usages • This language has a semantic rule: “a variable should be declared before it is used” • Represent the semantic rule formally in a grammar

Slide 35

Slide 35 text

decl: 'var' ident '=' integer ';' {{ decl.var_list[ident.value] = integer.value }}
  Add an identifier to a list
expr: ident_1 '+' ident_2 ';' {{
  vars = expr.var_list
  raise "#{ident_1.value} is not declared" unless vars[ident_1.value]
  raise "#{ident_2.value} is not declared" unless vars[ident_2.value]
  expr.value = vars[ident_1.value] + vars[ident_2.value]
}}
  Check identifiers are declared • Use identifier’s value
decls: decls decl {{ decls.var_list = decls.var_list.merge(decl.var_list) }}
  Merge identifier lists to one
decls: decl {{ decls.var_list = decl.var_list }}
  Copy identifier list
func_body: decls expr {{ expr.var_list = decls.var_list }}
  Pass identifier list to expr so that we can access identifier list in expr
* Only important production rules and semantic rules
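The semantic rules above can be approximated by hand in plain Ruby. This is only a sketch: the `eval_*` helpers and the array-based input shape are assumptions for illustration, not output of any attribute evaluator generator, and the declaration check uses `key?` rather than the truthiness test in the slide.

```ruby
# Semantic rule for `decl`: add one identifier to a fresh variable list.
def eval_decl(name, value)
  { name => value }
end

# Semantic rules for `decls`: merge the lists of all declarations (synthesized).
def eval_decls(decls)
  decls.reduce({}) { |var_list, (name, value)| var_list.merge(eval_decl(name, value)) }
end

# Semantic rule for `expr`: look both identifiers up in the list that
# `func_body` passed down (inherited), then add their values.
def eval_expr(left, right, var_list)
  [left, right].each do |name|
    raise "#{name} is not declared" unless var_list.key?(name)
  end
  var_list[left] + var_list[right]
end

# Semantic rule for `func_body`: hand the declarations' list to `expr`.
def eval_func_body(decls, expr)
  eval_expr(*expr, eval_decls(decls))
end

# function f1() { var i = 1; var j = 2; i + j; }
p eval_func_body([["i", 1], ["j", 2]], ["i", "j"])

# function f1() { var i = 1; i + j; }
begin
  eval_func_body([["i", 1]], ["i", "j"])
rescue RuntimeError => e
  puts e.message
end
```

The variable list flows upward out of the declarations and then downward into the expression, which is the synthesized/inherited split discussed on the later slides.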

Slide 36

Slide 36 text

Syntax Tree • Create syntax tree from input string decls decls func_body expr + decl j = 2 decl i = 1 function f1() { var i = 1; var j = 2; i + j; } ident i ident j

Slide 37

Slide 37 text

Analyze dependency of the variable list • In “expr”, “ident_1” and “ident_2” need variable list of “expr” decls decls func_body expr + decl j = 2 decl i = 1 expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list … expr.value = vars[ident_1.value] + vars[ident_2.value] }} ident i ident j

Slide 38

Slide 38 text

Analyze dependency of the variable list • In “func_body”, “expr” need variable list of “decls” decls decls func_body expr + decl j = 2 decl i = 1 func_body: decls expr {{ expr.var_list = decls.var_list }} ident i ident j

Slide 39

Slide 39 text

Analyze dependency of the variable list • In “decls”, “decls” need variable list of “decls” and “decl” decls decls func_body expr + decl j = 2 decl i = 1 decls: decls decl {{ decls.var_list = decls.var_list.merge( decl.var_list ) }} ident i ident j

Slide 40

Slide 40 text

Analyze dependency of the variable list • In “decls”, “decls” need variable list of “decls” and “decl” decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j decls: decls decl {{ decls.var_list = decls.var_list.merge( decl.var_list ) }}

Slide 41

Slide 41 text

Create attribute evaluator • Inverse dependency direction to get attribute evaluator decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; }

Slide 42

Slide 42 text

How attribute evaluator works • Visit “i = 1” then update the list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1}

Slide 43

Slide 43 text

How attribute evaluator works • Visit “j = 2” then update the list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2}

Slide 44

Slide 44 text

How attribute evaluator works • Visit “i + j” with the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2}

Slide 45

Slide 45 text

How attribute evaluator works • Resolve “i” and “j” with the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2} (4) list = {i: 1, j: 2} (5) list = {i: 1, j: 2}

Slide 46

Slide 46 text

Semantically invalid code • Failed to resolve “j” because it’s not declared decls func_body expr + decl i = 1 ident i ident j function f1() { var i = 1; // var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1} (3) list = {i: 1} (4) Error!!!

Slide 47

Slide 47 text

Automatic generation • Automatic generation of attribute evaluators from semantic rules has been studied [Figure: a grammar file with attributes is fed to a parser generator and an attribute evaluator generator, which produce the parser and the attribute evaluator of the program]

Slide 48

Slide 48 text

Inherited & Synthesized • Attributes are divided into two groups • Inherited Attribute (継承属性): an attribute calculated from the parent and siblings • Synthesized Attribute (合成属性): an attribute calculated from the children
decl: 'var' ident '=' integer ';' {{ decl.var_list[ident.value] = integer.value }}
  var_list is a synthesized attribute
decls: decls decl {{ decls.var_list = decls.var_list.merge(decl.var_list) }}
  var_list is a synthesized attribute
expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list … expr.value = vars[ident_1.value] + vars[ident_2.value] }}
  var_list is an inherited attribute

Slide 49

Slide 49 text

Inherited & Synthesized • In decls, var list is Synthesized Attribute • In expr, var list is Inherited Attribute • Inherited Attribute allows to pass from parent to children decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j (ˑ) list = {i: 1} (ˑ) list = {i: 1, j: 2} (˒) list = {i: 1, j: 2} (˒) list = {i: 1, j: 2} (˒) list = {i: 1, j: 2}

Slide 50

Slide 50 text

Attribute grammar can be complex • Dependency is a graph not tree • It may be circular • It may require exponential time for calculation • Subset of attribute grammar • L-attributed grammar • LR-attributed grammar • S-attributed grammar

Slide 51

Slide 51 text

How LR parser works • The mental model of an LR parser is that several automatons are managed by a stack • Generate an automaton from each rule
program : class_def
class_def : "class" id body "end"
body : method_def
method_def : "def" id "end"
M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 P1 P2 method_def def end id class_def C4 class id body end

Slide 52

Slide 52 text

How LR parser works • At the beginning, one automaton exists on the stack class A def m end end P1 P2 class_def

Slide 53

Slide 53 text

How LR parser works • The parser reads “class”, then a new automaton is pushed onto the stack class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end

Slide 54

Slide 54 text

How LR parser works • The parser reads “A”, then the state of the current automaton is updated class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end

Slide 55

Slide 55 text

How LR parser works • The parser reads “def”, then new automatons are pushed onto the stack class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end

Slide 56

Slide 56 text

How LR parser works • The parser reads “m” and “end”, then the current automaton reaches the accepting state class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end

Slide 57

Slide 57 text

How LR parser works • Pop the current automaton, then move the state of the next automaton to “B2” • The next automaton also reaches the accepting state class A def m end end P1 P2 class_def B1 B2 C1 C2 C3 C5 method_def C4 class id body end

Slide 58

Slide 58 text

How LR parser works • Pop the current automaton, then move the state of the next automaton to “C4” class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end

Slide 59

Slide 59 text

How LR parser works • The parser reads “end”, then the current automaton reaches the accepting state class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end

Slide 60

Slide 60 text

How LR parser works • Pop the current automaton, then move the state of the next automaton to “P2” • It reaches the accepting state and no input is left, so the program is accepted class A def m end end P1 P2 class_def
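The walk-through above can be mimicked with a stack of `[rule, position]` pairs. This is a sketch of the mental model only, not of a table-driven LR parser: `accepts?` is a hypothetical helper, it works on token types rather than text, and it pushes an automaton as soon as a nonterminal is expected, which is only safe here because each nonterminal has exactly one rule.

```ruby
# One rule per nonterminal, as on the slide. A symbol is a nonterminal
# when it appears as a key of RULES; everything else is a token type.
RULES = {
  "program"    => ["class_def"],
  "class_def"  => ["class", "id", "body", "end"],
  "body"       => ["method_def"],
  "method_def" => ["def", "id", "end"],
}.freeze

def accepts?(tokens)
  tokens = tokens.dup
  stack = [["program", 0]]            # each entry: [rule name, position in the rule]
  loop do
    name, pos = stack.last
    expected = RULES[name][pos]
    if expected.nil?
      # The current automaton reached its accepting state: pop it and
      # advance the automaton below, or finish if the stack is empty.
      stack.pop
      return tokens.empty? if stack.empty?
      stack.last[1] += 1
    elsif RULES.key?(expected)
      stack.push([expected, 0])       # a nonterminal is expected: push its automaton
    elsif tokens.first == expected
      tokens.shift                    # consume the token and advance the state
      stack.last[1] += 1
    else
      return false
    end
  end
end

# Token types for: class A def m end end
p accepts?(%w[class id def id end end])
```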

Slide 61

Slide 61 text

How LR parser works (2) • Program has one method definition (defn) or one singleton method definition (defs)
program: defn | defs
defn: "def" id "end"
defs: "def" "self" "." id "end"
M1 M2 M3 M4 S1 S2 S3 S5 P1 P2 def end id defn / defs S4 def S6 self . id end

Slide 62

Slide 62 text

How LR parser works (2) • At the beginning, one automaton exists on the stack P1 P2 defn / defs def m end

Slide 63

Slide 63 text

How LR parser works (2) • The parser reads “def”, then … • Which automaton should the parser push onto the stack? M1 M2 M3 M4 S1 S2 S3 S5 P1 P2 def end id defn / defs S4 def S6 self . id end def m end Option 1 Option 2

Slide 64

Slide 64 text

How LR parser works (2) • Merge these two automatons to one automaton M1 M2 M3 M4 S1 S2 S3 S5 def end id S4 def S6 self . id end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end

Slide 65

Slide 65 text

How LR parser works (2) • The parser reads “def”, then pushes the new merged automaton onto the stack P1 P2 defn / defs def m end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end

Slide 66

Slide 66 text

How LR parser works (2) • An LR parser can postpone the decision of which rule applies def m end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def self.m end
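A merged automaton can be sketched as a single transition table. The state names below are invented for this example, since the slide text does not say which of D1 to D8 is which; the point is that after “def” one shared state covers both rules, and the choice between defn and defs is made only when the next token arrives.

```ruby
# Transition table of the merged automaton for
#   defn: "def" id "end"
#   defs: "def" "self" "." id "end"
# State names are hypothetical labels chosen for this sketch.
MERGED = {
  [:start, "def"]      => :after_def,   # shared by defn and defs
  [:after_def, "id"]   => :defn_name,
  [:defn_name, "end"]  => :defn,
  [:after_def, "self"] => :defs_self,
  [:defs_self, "."]    => :defs_dot,
  [:defs_dot, "id"]    => :defs_name,
  [:defs_name, "end"]  => :defs,
}.freeze

# Run the automaton over token types and report which rule was recognized.
def classify(tokens)
  state = tokens.reduce(:start) do |current, token|
    MERGED.fetch([current, token]) { return :error }
  end
  %i[defn defs].include?(state) ? state : :error
end

p classify(%w[def id end])          # def m end
p classify(%w[def self . id end])   # def self.m end
```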

Slide 67

Slide 67 text

LR-attributed grammar • An LR-attributed grammar is an attribute grammar which an LR parser can evaluate while it parses the code • Condition #1: All attribute dependencies run in the left-to-right direction • Condition #2: All inherited attributes in the same state have unique values

Slide 68

Slide 68 text

#1: left-to-right direction • “in_class” & “in_def” inherited attributes can be handled by an LR parser • A class cannot be defined in def scope • The variable list can also be handled class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end in_class = true in_def = true

Slide 69

Slide 69 text

#2: Unique values for the same state • “in_def” inherited attribute can be decided just after “def” D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def m end def self.m end in_def = true

Slide 70

Slide 70 text

#2: Unique values for the same state • “in_singleton_def” inherited attribute can’t be decided just after “def” • The attribute doesn’t exist in Ruby • The attribute can be decided just after “self” D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def m end def self.m end in_singleton_def = false in_singleton_def = true in_singleton_def = false in_singleton_def = true

Slide 71

Slide 71 text

LR-attributed grammar • LR-attributed grammar enables LR parsing to handle inherited attributes • Inherited attributes carry contexts from top to bottom • In short, an LR parser can manage contexts with some limitations • The direction of context flow is left-to-right • I believe this is reasonable because we read code from top to bottom and left to right • If multiple production rules are expected, the value of the contexts should be unique • I believe this is reasonable because it reduces the cognitive cost for humans

Slide 72

Slide 72 text

Leverage Theory • It’s not currently popular to generate attribute evaluators from semantic rules • However, attribute grammar theory tells us what parsers can and cannot do

Slide 73

Slide 73 text

Summary • What parsers do • Parser distinguishes valid input and invalid input • Parser gives structure to the input correctly • Grammar defines the boundary of the language and structure of the language • Use appropriate grammar class • With great power comes great difficulties • LR-attributed grammar is the foundational theory of the current Ruby parser

Slide 74

Slide 74 text

Programing Language Ruby Users can focus on writing grammar - RubyKaigi 2023 in Matsumoto -

Slide 75

Slide 75 text

Use case oriented approach • The parser’s use case is Ruby • Understand the characteristics of Ruby syntax to understand which aspects of the parser are important

Slide 76

Slide 76 text

Simple Syntax !! https://github.com/ruby/ruby

Slide 77

Slide 77 text

https://www.ruby-lang.org/en/about/

Slide 78

Slide 78 text

Ruby Syntax • Ruby syntax is designed for programmers not for machines • What are the key properties of a good programming language for programmers? • However sometimes it’s difficult for programmers to understand which sentences are connected

Slide 79

Slide 79 text

Grammar rule conflict • It’s not unusual to design a grammar whose rules conflict • Example: “Dangling else” • https://en.wikipedia.org/wiki/Dangling_else
// Rules
if a then s
if b then s1 else s2
// Code
if a then if b then s else s2
// #1
(if a then (if b then s else s2))
// #2
(if a then (if b then s) else s2)

Slide 80

Slide 80 text

Grammar with conflict • For example, infix operators cause conflicts • In many languages, + has lower precedence than * because of the arithmetic we know [Figure: two candidate syntax trees, #1 and #2, over the operators + and * and the operands 1, 2, 3]
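How a precedence declaration picks one of the two trees can be sketched with precedence climbing. This is a hand-written illustration rather than what an LR parser generator emits; `parse_expr` and `evaluate` are hypothetical helpers, and the precedence numbers are arbitrary.

```ruby
# Parse a flat token list such as [1, :+, 2, :*, 3] into a nested array,
# using precedence climbing. `prec` maps each operator to its precedence.
# Note: `tokens` is consumed while parsing.
def parse_expr(tokens, prec, min_prec = 0)
  left = tokens.shift
  while (op = tokens.first) && prec[op] >= min_prec
    tokens.shift
    # Operands of tighter-binding operators are grouped first (left-associative).
    right = parse_expr(tokens, prec, prec[op] + 1)
    left = [op, left, right]
  end
  left
end

# Evaluate the nested array produced by parse_expr.
def evaluate(tree)
  return tree unless tree.is_a?(Array)
  op, left, right = tree
  evaluate(left).public_send(op, evaluate(right))
end

usual   = parse_expr([1, :+, 2, :*, 3], { :+ => 1, :* => 2 })
swapped = parse_expr([1, :+, 2, :*, 3], { :+ => 2, :* => 1 })
p usual, evaluate(usual)
p swapped, evaluate(swapped)
```

The same token sequence yields two different trees and two different values, so the precedence table is a language design decision rather than a property of the parser.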

Slide 81

Slide 81 text

Grammar without conflict • For example, Polish Notation has no conflict

Slide 82

Slide 82 text

Polish Notation + () • Polish Notation + () seems to be a good idea • But Ruby didn’t choose this direction

Slide 83

Slide 83 text

Conflict is a design matter • https://bugs.ruby-lang.org/issues/19392 • Endless method definition with “or”

Slide 84

Slide 84 text

Conflict is a design matter • We can change the precedence locally • Then it’s not a limitation of the parser but a matter of grammar design • In the discussion, consistency of precedence between “=” & “and” is kept

Slide 85

Slide 85 text

Change precedence in some scopes • I implemented “change precedence declaration” as PoC • Within { … }, + has higher precedence than * https://github.com/ruby/lrama/pull/254

Slide 86

Slide 86 text

Flash point of conflict • If the rule’s start and end are clear, the chance of conflict will decrease • Informally: I consider whether the left context is powerful enough to minimize the rule candidates • In the change precedence case • The appearance of “{“ on the left is powerful enough to distinguish normal expressions from inverse-precedence expressions • The appearance of “}” determines the end of inverse-precedence expressions

Slide 87

Slide 87 text

Case #1: Method definition • Start is clear because method definition always starts with “def” • End is clear because method definition always ends with “end”

Slide 88

Slide 88 text

Case #1: Method definition • Start is clear because method definition always starts with “def” • End is clear because method definition always ends with “end” until Ruby 2.7.0 • Endless method definition is introduced from Ruby 3.0.0

Slide 89

Slide 89 text

Case #2: Modifier • As explained, infix operators cause conflicts • Design the precedence based on human cognitive ability • E.g. ‘+’ < ‘*’ • Modifiers have similar characteristics to infix operators

Slide 90

Slide 90 text

Case #3: parentheses • Parentheses are great • Start is clear because the rule starts with “(” • End is clear because the rule ends with “)” • Why do you omit parentheses ???

Slide 91

Slide 91 text

Polish Notation + () • What do you think about Polish Notation + () ? +

Slide 92

Slide 92 text

Ruby Syntax complexities • “The Big Five parse.y calamities” in RubyKaigi 2024 • Today’s topic is “Lex State” https://speakerdeck.com/yui_knk/the-grand-strategy-of-ruby-parser?slide=58

Slide 93

Slide 93 text

What’s Lex State ? • The state of the lexer • In textbooks, lexer and parser are completely separate components • However, the two are tightly coupled in Ruby • Sometimes it’s called “Monstrous lex_state”

Slide 94

Slide 94 text

Why lex_state is needed • In general, a lexer checks the input text in longest-match manner, otherwise the longer token never matches • E.g. check “||” first, then check “|”

Slide 95

Slide 95 text

Why lex_state is needed • However in some cases, the shorter token should be returned • “||” in block parameter position is two “|” tokens

Slide 96

Slide 96 text

EXPR_BEG or not • If lex state is EXPR_BEG then “|” is returned, otherwise “||” is returned • A lot of conditional branches based on lex state • Too complicated “|” “||” Check lex state
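The branch described above can be sketched as a toy lexer for pipes only. `next_pipe_token` and its `expr_beg` keyword are hypothetical names for this example; the flag stands in for the EXPR_BEG check and is not the interface of the real lexer.

```ruby
require "strscan"

# Return the next pipe token. When the (hypothetical) expr_beg flag is set,
# the shorter token wins; otherwise the usual longest match applies.
def next_pipe_token(scanner, expr_beg:)
  if expr_beg
    scanner.scan(/\|/)
  else
    scanner.scan(/\|\||\|/)
  end
end

# After "do", the lexer is at the beginning of an expression.
p next_pipe_token(StringScanner.new("|| end"), expr_beg: true)
# Between two operands, "||" is the logical-or operator.
p next_pipe_token(StringScanner.new("|| b"), expr_beg: false)
```

Every token with this kind of ambiguity needs its own branch on the lex state, which is where the complexity on the following slides comes from.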

Slide 97

Slide 97 text

Monstrous lex_state • Ruby’s lexer has 13 state bits!

Slide 98

Slide 98 text

Why it’s terrible • “All bugfixes are incompatibilities” • 36:00 ~ https://rubykaigi.org/2019/presentations/nagachika.html

Slide 99

Slide 99 text

Fixing a bug caused other bugs • Fixing [Bug #10653] caused [Bug #11456] and [Bug #11849]

Slide 100

Slide 100 text

‘:’ is difficult • All of them include ‘:’ … ?
true ? 1.tap do |n| p n end : 0
{foo: ("" rescue "")}
{ label:<<-DOC
Some text for a heredoc goes here
DOC
}

Slide 101

Slide 101 text

Fix [Bug #10653] • By the way, I guess not r51617 but r51616 fixed the issue, right? • The error was unexpected “keyword_do_cond” and COND_PUSH(1) is called after ‘?’ • There is a space between “end” and ‘:’

Slide 102

Slide 102 text

Fix [Bug #10653] • Anyway, r51617 changed the logic from managing where label is disallowed (EXPR_VALUE) to where label is allowed (EXPR_LABEL) {foo: ("" rescue "")} { label:<<-DOC Some text for a heredoc goes here DOC } label label

Slide 103

Slide 103 text

[Bug #11456] • Lex state is “EXPR_ARG|EXPR_LABELED” after label is tokenized • Then it’s NOT IS_BEG() • tLPAREN_ARG is returned • Only expr is allowed after tLPAREN_ARG !!! • expr doesn’t allow modifier rescue
{foo: ("" rescue "")}
case '(':
  if (IS_BEG()) { c = tLPAREN; }
  else if (IS_SPCARG(-1)) { c = tLPAREN_ARG; }
  paren_nest++;
  COND_PUSH(0);
  CMDARG_PUSH(0);
  lex_state = EXPR_BEG|EXPR_LABEL;
  return c;
primary: tLPAREN_ARG expr rparen
Before: EXPR_LABELARG
After: EXPR_ARG|EXPR_LABELED

Slide 104

Slide 104 text

Fix [Bug #11456] • r51624 fixed the bug by adding “EXPR_ARG|EXPR_LABELED” to IS_BEG() https://github.com/ruby/ruby/commit/0958af2ad4e83400f35c296e9ed9cf021b1675b4

Slide 105

Slide 105 text

[Bug #11849] • Lex state is “EXPR_ARG|EXPR_LABELED” after label is tokenized • Then it’s IS_ARG()
{ label:<<-DOC
Some text for a heredoc goes here
DOC
}
case '<':
  last_state = lex_state;
  c = nextc();
  if (c == '<' &&
      !IS_lex_state(EXPR_DOT | EXPR_CLASS) &&
      !IS_END() &&
      (!IS_ARG() || space_seen)) {
      int token = heredoc_identifier();
      if (token) return token;
  }
  ...

Slide 106

Slide 106 text

Fix [Bug #11849] • r53214 fixed the bug by adding “EXPR_LABELED” check https://github.com/ruby/ruby/commit/9d5abbff9754589483938dc539226c2ad4895140

Slide 107

Slide 107 text

Ruby Syntax changes • 3.2.0 (2022-12-25): Anonymous rest and keyword rest arguments can now be passed as arguments • 3.1.0 (2021-12-25): Anonymous block argument

Slide 108

Slide 108 text

Ruby Syntax changes • 3.0.0 (2020-12-25): Endless method definition • 2.7.0 (2019-12-25): Pattern matching, beginless range • 2.6.0 (2018-12-25): Endless range

Slide 109

Slide 109 text

What will happen by the change? • Proposal for existing grammar • https://bugs.ruby-lang.org/issues/18080

Slide 110

Slide 110 text

Summary • Use Case: Ruby • Ruby syntax is designed for programmers not for machines • Ruby syntax changes • Parser needs to • have theory and mechanisms which mitigate implementation complexities • give the language designer feedback about syntax changes

Slide 111

Slide 111 text

Fight with implementation complexities We have not leveraged the potential of LR parser - RubyKaigi 2023 in Matsumoto -

Slide 112

Slide 112 text

Monstrous lex_state • Ruby’s lexer has 13 state bits!

Slide 113

Slide 113 text

Parser & Lexer • Assume parser and lexer can be separated Class Method Method Assignment @name Call name capitalize class Greeter def Lexer Parser …

Slide 114

Slide 114 text

Parser & Lexer • However, the lexer depends on the parser in Ruby • The lexer generates different tokens depending on the parser state • Tokens with same length but different identity • Tokens with different length • By the way, the parser knows which kinds of tokens it can accept in each parser state

Slide 115

Slide 115 text

PSLR(1) • It seems a good idea to integrate the parser and lexer, then manage the states on the parser side • Joel E. Denny. “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, May 2010. • https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=1519&context=all_dissertations • PSLR stands for Pseudo-Scannerless Minimal LR

Slide 116

Slide 116 text

PSLR(1) • > Nevertheless, traditional scanner and parser generators attempt to generate loosely coupled scanners and parsers, so the user must maintain these tightly coupled scanner and parser specifications separately but consistently. • > Scanner and parser specifications would be significantly more maintainable if all sub-language transitions were instead computed from a grammar by a parser generator and recognized automatically by the scanner using the parser’s stack.

Slide 117

Slide 117 text

Sub-languages and scopes • The example comes from “Figure 2.6: Scoped Declarations” of the paper • C++0x • and ‘>>’ has higher precedence than ‘>’, vice versa in Lc Lc Lt Y>1)>> x4; Lc Lt Lp
Lc : main C++0x language
Lt : template argument list sub-language
Lp : parenthesized expression sub-sub-language
%lex-prec ’>’ -< ’>>’ for Lc and Lp
%lex-prec ’>>’ -< ’>’ for Lt
Y>1)>> x4;

Slide 118

Slide 118 text

Sub-languages and scopes • In Ruby case, ‘|’ has higher precedence than ‘||’ in Lbp obj.m do || end Lrb Lbp Lrb : main ruby Lbp : block parameters %lex-prec ’|’ -< ’||’ for Lrb %lex-prec ’||’ -< ’|’ for Lbp

Slide 119

Slide 119 text

Scanner conflict • Identity conflict: Tokens with same length but different identity • E.g. do, do_cond, do_block, do_LAMBDA • Length conflict: Tokens with different length • E.g. ‘|’, ‘||’

Slide 120

Slide 120 text

How to specify Sub-languages scopes • Specify nonterminals as a scope of sub-languages • See: “3.7 Scoped Declarations” obj.m do |var| expr end method_call brace_block do |var| expr end k_do do_body k_end |var| expr opt_block_ param bodystmt | var | block_ param

Slide 121

Slide 121 text

How to specify Sub-languages scopes • Within “opt_block_param”, ‘|’ has higher precedence than ‘||’
%nterm opt_block_param { %lex-prec '||' < '|' }
%%
primary: method_call brace_block
brace_block: k_do do_body k_end
do_body: opt_block_param bodystmt
opt_block_param: block_param_def
block_param_def: '|' opt_bv_decl '|' | '|' block_param opt_bv_decl '|'

Slide 122

Slide 122 text

How it works • Collect the tokens that come before “opt_block_param” -> “do” • Collect the tokens that are the last token of “opt_block_param” -> ‘|’ • The parser updates the lexer precedence to sub-language mode after “do” and restores it after the second ‘|’
primary: method_call brace_block
brace_block: k_do do_body k_end
do_body: opt_block_param bodystmt
opt_block_param: block_param_def
block_param_def: '|' opt_bv_decl '|' | '|' … '|'

Slide 123

Slide 123 text

How it works • Some states are marked • ‘||’ is separated into two ‘|’ in marked states
primary: method_call ● brace_block
brace_block: ● k_do do_body k_end
brace_block: k_do ● do_body k_end
do_body: ● opt_block_param bodystmt
opt_block_param: ● block_param_def
block_param_def: ● '|' block_param '|'
block_param_def: '|' ● block_param '|'
block_param_def: '|' block_param ● '|'
block_param_def: '|' block_param '|' ●
do_body: opt_block_param ● bodystmt
...
primary: method_call brace_block ●

Slide 124

Slide 124 text

Scope conflict • If contradictory lexer precedences are defined, the parser state has a scope conflict • Split the state again so that no state has contradictory lexer precedences • In this case, the states can be separated because one follows “{” and the other follows “do”
%nterm opt_block_param { %lex-prec '||' < '|' }
%nterm brace_body { %lex-prec '||' > '|' }
%%
brace_block: k_do do_body k_end
do_body: opt_block_param bodystmt
brace_block: '{' brace_body '}'
brace_body: opt_block_param compstmt

Slide 125

Slide 125 text

IELR • IELR can split such a state • IELR is more powerful than LALR • PSLR is an extension of IELR • Both PSLR and IELR were invented by Joel E. Denny

Slide 126

Slide 126 text

ʮφϯτΧLRʯΛ੔ཧ͢Δ / Clarifying LR Algorithms https://speakerdeck.com/junk0612/clarifying-lr-algorithms?slide=5

Slide 127

Slide 127 text

RubyKaigi 2024 • Check slides and video for more detail • https://rubykaigi.org/2024/presentations/junk0612.html

Slide 128

Slide 128 text

Reconsider lex_state • Reconsider block parameter syntax • “||” is not accepted after “do” • “||” is not accepted after “var” • No scanner conflict • It’s enough for the parser to pass the acceptable token list to the lexer
obj.m do |var| expr end
// After “do”
do_body: … ● opt_block_param bodystmt
opt_block_param: ● none
opt_block_param: ● block_param_def
block_param_def: ● '|' opt_bv_decl '|'
block_param_def: ● '|' block_param opt_bv_decl '|'
// After “var”
$@23: ε ● ['=']
f_eq: ● $@23 '='
f_opt_primary_value: f_arg_asgn ● f_eq primary_value
f_arg_item: f_arg_asgn ● ['|', '\n', ',', ';']
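Passing the acceptable token list from parser to lexer can be sketched as follows. `CANDIDATES` and `next_token` are hypothetical names for this illustration; in a real implementation the acceptable set would come from the parser's state table rather than from a hand-written array.

```ruby
require "strscan"

# Candidate tokens in longest-first order (hypothetical, pipes only).
CANDIDATES = { "||" => /\|\|/, "|" => /\|/ }.freeze

# Return the first candidate that the parser can accept in its current
# state and that matches the input; `acceptable` is supplied by the parser.
def next_token(scanner, acceptable)
  CANDIDATES.each do |name, pattern|
    next unless acceptable.include?(name)
    return name if scanner.scan(pattern)
  end
  nil
end

# After "do" only '|' is acceptable, so "||" is read as two '|' tokens.
after_do = StringScanner.new("||")
p [next_token(after_do, ["|"]), next_token(after_do, ["|"])]

# In a state that accepts both, the longest match still wins.
p next_token(StringScanner.new("||"), ["||", "|"])
```

With this shape the lexer needs no state of its own for pipes, because the decision is driven entirely by what the parser can accept.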

Slide 129

Slide 129 text

Reconsider modifier if • “if” will be • keyword_if if the lex state is EXPR_BEG • modifier_if if the lex state is not EXPR_BEG “if” keyword_if modifier_if if cond then … … if cond EXPR_BEG ! EXPR_BEG

Slide 130

Slide 130 text

Checking states table • If I checked correctly, no state accepts both keyword_if and modifier_if • If a state accepts keyword_if, it doesn’t accept modifier_if • If a state accepts modifier_if, it doesn’t accept keyword_if • The current state always knows how to handle “if”
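
A check like this can be automated once the states table is available as data. The sketch below assumes a hypothetical, hand-written excerpt (`STATES`) that maps state numbers to the tokens each state accepts; the state numbers and token sets are invented and are not taken from the real parse.y output.

```ruby
# Hypothetical excerpt of a states table: state number => acceptable tokens.
STATES = {
  0  => [:keyword_if, :tINTEGER, :tIDENTIFIER],
  12 => [:modifier_if, :"+", :"\n"],
  37 => [:keyword_if, :tINTEGER],
}.freeze

# Returns the numbers of states that accept both kinds of "if".
def ambiguous_if_states(states)
  states.select do |_number, tokens|
    tokens.include?(:keyword_if) && tokens.include?(:modifier_if)
  end.keys
end

p ambiguous_if_states(STATES) # prints []
```

An empty result means every state in the table knows which “if” it expects.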

Slide 131

Slide 131 text

EXPR and if • After the operator, state is EXPR_BEG. Then “if … end” is accepted • After the number, state is EXPR_END. Then modifier if is accepted • It’s clear which type of if can be written 1 + 2 BEG END BEG END 1 + if true; 1 else 2 end 1 + 2 if true EXPR_BEG ! EXPR_BEG
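
The state transitions on this slide can be observed with Ripper, which ships with CRuby. The sketch below uses `Ripper.lex`, whose entries carry the lex state after each token; the helper name `lex_states` is made up for the example, and the two `eval` calls simply run the expressions from the slide.

```ruby
require "ripper"

# Pair each non-space token with the lex state the lexer is in after it.
def lex_states(source)
  Ripper.lex(source)
        .reject { |_pos, type, _tok, _state| type == :on_sp }
        .map { |_pos, _type, tok, state| [tok, state.to_s] }
end

lex_states("1 + 2").each { |tok, state| puts "#{tok.inspect} -> #{state}" }

# Both forms from the slide are valid Ruby and use a different kind of "if".
puts eval("1 + if true; 1 else 2 end") # keyword if in operand position
puts eval("1 + 2 if true")             # modifier if after a complete expression
```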

Slide 132

Slide 132 text

Hypothesis • #1: In Ruby, the end of nonterminal symbol is powerful enough to distinguish which tokens are accepted • #2: A lot of token types can be determined on parser side • If so, sub-language model is not the best mental model in Ruby

Slide 133

Slide 133 text

I forgot command-like control syntax • Tweak parse.y to replace modifier_if with keyword_if • These grammar rules have conflicts return if … return (if …) (return) if … keyword_if modifier_if

Slide 134

Slide 134 text

Insight • modifier_if or keyword_if • It’s clear in a sentence with an operator • It’s not clear just after control syntax • If the relation between modifier_if and keyword_if is specified, the parser informs us of conflicts • How conflicts are resolved in the language is an important insight when new syntax is added

Slide 135

Slide 135 text

Summary • In Ruby, how to extract tokens depends on the surrounding sentences • lex_state is complicated • Need to mitigate the complexities for further syntax extensions • Tight communication between scanner and parser will reduce the complexities • Explicit declaration of conflict resolution records what the language designer decided • Able to refer to past decisions when a similar pattern appears

Slide 136

Slide 136 text

Give feedback to the language designer It’s fun to hack parser generator - RubyKaigi 2024 LT in Okinawa -

Slide 137

Slide 137 text

What will happen by the change? • Proposal for existing grammar • https://bugs.ruby-lang.org/issues/18080

Slide 138

Slide 138 text

It’s possible to implement • > but nobu said it's hard to support because of parse.y limitation. • No, it’s possible!! • https://github.com/yui-knk/ruby/tree/bugs_18080

Slide 139

Slide 139 text

Need to consider these patterns • There is an argument or not • The arguments are surrounded by parentheses or not • There is a block or not • The symbol of pattern matching, `in` or `=>`

Slide 140

Slide 140 text

Need to consider these patterns • There is one combination which is suspicious

Slide 141

Slide 141 text

Conflicts with existing grammar • There is no block • The arguments are not surrounded by parentheses • The symbol of pattern matching is `=>`
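
The pattern space from the preceding slides is small enough to enumerate mechanically. The sketch below is only an illustration: the `AXES` table and its key names are made up, and it assumes that “arguments not surrounded by parentheses” implies an argument is present.

```ruby
# The four axes listed on the slides; key names are hypothetical.
AXES = {
  argument:    [true, false],
  parentheses: [true, false],
  block:       [true, false],
  symbol:      ["in", "=>"],
}.freeze

# Every combination of the four axes as a hash.
first_axis, *other_axes = AXES.values
combinations = first_axis.product(*other_axes).map do |values|
  AXES.keys.zip(values).to_h
end

# The combination the slide singles out as conflicting.
suspicious = combinations.select do |c|
  c[:argument] && !c[:parentheses] && !c[:block] && c[:symbol] == "=>"
end

puts combinations.size # prints 16
suspicious.each { |c| p c }
```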

Slide 142

Slide 142 text

LR parser generator knows this issue • S/R or R/R conflict detection is a friend for programming language designer

Slide 143

Slide 143 text

Why is this issue difficult to detect? • Need to check all combinations of grammar rules • Discussion of grammar and implementation of parser are localized

Slide 144

Slide 144 text

Combination of grammar rules • A lot of rules are optional • Argument is optional • Parentheses around arguments are optional • Block is optional • (The symbol of pattern matching, `in` or `=>`) • Need to discuss grammar rules as a group • E.g. “a == b”, “1 + 2” and “1..2” are in the same “arg” group • If “arg” rules change, need to consider the impact on “expr” and “stmt” too

Slide 145

Slide 145 text

Localized discussion & implementation • Examples in a ticket are simple • Parser implementation is a combination of parts • Parser generator: combination of rules • Recursive Descent Parser: combination of functions, e.g. “parse_pattern_matching”, “parse_arguments”

Slide 146

Slide 146 text

Localized discussion & implementation • Localized discussion and implementation are good practice • Divide the difficulties • However, it requires a mechanism to integrate these parts • LR parser generator has such a mechanism: conflict detection • Hand-written parser doesn’t have such a mechanism • Parser generator works as checker/linter for grammar • Cannot keep the grammar sound without help from computer science

Slide 147

Slide 147 text

RubyKaigi 2024 • Check slides and video for more detail • https://rubykaigi.org/2024/presentations/spikeolaf.html

Slide 148

Slide 148 text

Lexer level conflict • Current parser doesn’t warn about lexer level conflicts • Because the parser doesn’t know the relationship between keyword_if and modifier_if • However, it conflicts at some points from the programmer’s viewpoint • The detection is helpful for syntax discussion return if … return (if …) (return) if … keyword_if modifier_if

Slide 149

Slide 149 text

Endless range • Endless range literal is cutting-edge syntax • A traditional range ends with EXPR_END, whereas an endless range ends with EXPR_BEG 1 ... 2 BEG END BEG END 1 ... BEG END BEG

Slide 150

Slide 150 text

Endless range • Concerning lex state sensitive tokens • ‘%’: is interpreted as a start of % string literal if EXPR_BEG • ‘||’: is divided into two ‘|’ if EXPR_BEG
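
The state sensitivity of ‘%’ can be observed with Ripper, which ships with CRuby. The sketch below compares how `Ripper.lex` tokenizes ‘%’ after a number and after ‘=’; the helper name `token_types` is made up for the example.

```ruby
require "ripper"

# List [event type, token text] pairs for a piece of source code.
def token_types(source)
  Ripper.lex(source).map { |_pos, type, tok, _state| [type, tok] }
end

# After an integer the state is EXPR_END, so "%" is lexed as an operator.
p token_types("1 % 2").find { |_type, tok| tok.start_with?("%") }

# After "=" the state is EXPR_BEG, so "%w[" starts a % literal.
p token_types("x = %w[a b]").find { |_type, tok| tok.start_with?("%") }
```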

Slide 151

Slide 151 text

Endless range • However, it might not matter • ‘..’ and ‘...’ have relatively low precedence

Slide 152

Slide 152 text

Endless range • “and” also doesn’t matter • “and” has lower precedence than “...”

Slide 153

Slide 153 text

Endless range • “rescue” might matter ? • Parser generator could help [Feature #12912] discussion more

Slide 154

Slide 154 text

Last battle with Space • I think space and newline are the most mysterious part of Ruby syntax • “space_seen” variable • ‘\n’ token and tIGNORED_NL token • How to include space and newline in the parser context is an open problem

Slide 155

Slide 155 text

Summary • It’s difficult for humans to understand all the combinations • Cannot keep the grammar sound without help from computer science • PSLR is a key concept for checking soundness of lexer state sensitive grammar

Slide 156

Slide 156 text

What parser generates I want to know truth of syntax tree design - Osaka RubyKaigi 04 -

Slide 157

Slide 157 text

Recap Osaka RubyKaigi 04 https://yui-knk.hatenablog.com/entry/2024/08/23/113543

Slide 158

Slide 158 text

Use cases of Syntax Tree and its design in 10 mins

Slide 159

Slide 159 text

Syntax Tree • Parser generates Syntax Tree for other libraries and components • Compiler, Type System, LSP, Linter and Code Formatter • Therefore what’s the use case of Syntax Tree ? • How to satisfy the use cases ?

Slide 160

Slide 160 text

Use cases • They want to execute codes • compile.c • Type System • They want to analyze codes • LSP (ruby-lsp) • Linter & Code Formatter (RuboCop)

Slide 161

Slide 161 text

Code analysis • Need token information (Syntax Highlight) • Need to analyze comments (LSP DocumentLink) • Need to walk through parent node from child node (LSP SelectionRange) • Want to rewrite codes (LSP & Code Formatter) • This is the most difficult use case, right now

Slide 162

Slide 162 text

Code rewriting • Style::IfInsideElse Cop • Unnest if inside if https://github.com/rubocop/rubocop/blob/v1.65.1/lib/rubocop/cop/style/if_inside_else.rb#L10-L29

Slide 163

Slide 163 text

How RuboCop rewrites code
Original: if condition_a action_a else if condition_b action_b else action_c end end
1. else to elsif: if condition_a action_a elsif condition_b if condition_b action_b else action_c end end
2. Delete “if condition_b”: if condition_a action_a elsif condition_b action_b action_b else action_c end end
3. Delete end: if condition_a action_a elsif condition_b action_b action_b else action_c end
4. Delete duplicated “action_b”: if condition_a action_a elsif condition_b action_b else action_c end

Slide 164

Slide 164 text

Problem of TreeRewriter #1 • Implementation is complex • TreeRewriter doesn’t edit source code directly • Create TreeRewriter::Action instances, store the actions, then apply the changes at once Action :replace (2, 0)-(2, 4) “elsif condition_b” Action :replace (3, 2)-(3, 16) “action_b” Action :replace (7, 0)-(7, 6) “” Action :replace (4, 0)-(4, 13) “”
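
The “store actions, then apply at once” approach can be sketched in plain Ruby. This is not the parser gem’s `TreeRewriter`: `Action` and `DeferredRewriter` below are hypothetical stand-ins, ranges are character offsets instead of (line, column) pairs, and non-overlapping ranges are assumed.

```ruby
# Hypothetical stand-in for a rewriter action: a range in the ORIGINAL
# source plus the text that should replace it.
Action = Struct.new(:range, :replacement)

class DeferredRewriter
  def initialize(source)
    @source = source
    @actions = []
  end

  # Record an edit without touching the source yet.
  def replace(range, replacement)
    @actions << Action.new(range, replacement)
    self
  end

  # Apply all recorded edits from the end of the source backwards, so that
  # earlier offsets stay valid while later text is being changed.
  def process
    result = @source.dup
    @actions.sort_by { |action| -action.range.begin }.each do |action|
      result[action.range] = action.replacement
    end
    result
  end
end

source = "if a\n  x\nelse\n  y\nend\n"
else_at = source.index("else")
x_at = source.index("x")

rewriter = DeferredRewriter.new(source)
rewriter.replace(x_at...(x_at + 1), "action_x")
rewriter.replace(else_at...(else_at + 4), "elsif b")
puts rewriter.process
```

Because every range refers to the original text, the order in which actions are recorded does not matter; only the application order does.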

Slide 165

Slide 165 text

Why Action is needed #1 • It’s costly to edit string every time • In both cases, need to move/copy sub-strings after “else” if condition_a action_a else action_b end if condition_a action_a action_b end Delete else if condition_a action_a else action_b end if condition_a action_a elsif action_b end Replace with elsif

Slide 166

Slide 166 text

Why Action is needed #2 • Directly editing the code affects the rest of nodes if condition_a action_a else action_b end Parser::Source::Buffer if condition_a action_a action_b end Parser::Source::Buffer NODE_VCALL action_b Range (3, 2)-(3, 10) Delete else

Slide 167

Slide 167 text

Problem of TreeRewriter #2 • Rewriting operations are complicated • Need to understand current status of each step
Original: if condition_a action_a else if condition_b action_b else action_c end end
1. else to elsif: if condition_a action_a elsif condition_b if condition_b action_b else action_c end end
2. Delete “if condition_b”: if condition_a action_a elsif condition_b action_b action_b else action_c end end
3. Delete end: if condition_a action_a elsif condition_b action_b action_b else action_c end
4. Delete duplicated “action_b”: if condition_a action_a elsif condition_b action_b else action_c end

Slide 168

Slide 168 text

Rewriting Syntax Tree • Can leverage Tree Structure • Change NODE_IF to NODE_ELSIF then delete NODE_ELSE NODE_IF condition_a action_a NODE_ELSE NODE_IF condition_b action_b NODE_ELSE action_c NODE_IF condition_a action_a NODE_ELSIF condition_b action_b action_c

Slide 169

Slide 169 text

Generate source code from Syntax Tree • Once the syntax tree is rewritten, render source code from it • However, AST doesn’t have spaces, newlines and so on… NODE_IF condition_a action_a NODE_ELSIF condition_b action_b action_c

Slide 170

Slide 170 text

Other difficulties #1 • Need to pass range information for the new node • Calculation is still based on a text oriented approach • Cannot fully leverage the syntax tree transformation if condition_a action_a else if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b else action_c end Start is the same as else The last line is “action_c line - 1” The last column is “tail of action_c - 2”

Slide 171

Slide 171 text

Other difficulties #2 • All nodes following the updated node are affected NODE_CLASS condition_a action_a NODE_ELSIF NODE_DEF NODE_IF NODE_DEF expr1 expr2 expr2 expr2 condition_b action_b action_c Update!!

Slide 172

Slide 172 text

Migration Tool • Code rewriting is not only for LSP and code formatter • But also migration tool like Transpec http://yujinakayama.me/transpec/

Slide 173

Slide 173 text

How to solve the problem

Slide 174

Slide 174 text

Concrete Syntax Tree • Concrete Syntax Tree (CST) for code restoration • CST preserves information which AST omits, e.g. spaces, newlines, parentheses • AST focuses on semantics, CST focuses on Syntax • Implementation • Introduce data structure for token • Keep information on token which lexer omitted • Node has child nodes and tokens

Slide 175

Slide 175 text

Slide 176

Slide 176 text

Trivia • Trivia is information which lexer omits • Spaces, Newlines, comments and so on Trivia (comment) Trivia (spaces) Trivia (new line)

Slide 177

Slide 177 text

Node, Token and Trivia • Token has trailing trivia and leading trivia • Node holds nodes and tokens NODE_IF IF cond action_a END Token NODE Legend space (1) NL (1) + space (2) NL (1) Trivia

Slide 178

Slide 178 text

Syntax Tree to code • Dump codes with Depth-first search to get the whole codes Token NODE Legend NODE_IF IF cond action_a END space (1) NL (1) + space (2) NL (1) Trivia
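
A depth-first dump over nodes, tokens and trivia can be sketched as follows. The `Token` and `Node` structs and the `tok` helper are hypothetical names for the example, and attaching “NL + 2 spaces” as trailing trivia of `cond` is an assumption about which token owns that trivia.

```ruby
# A token keeps the trivia (spaces, newlines, comments) around its text.
Token = Struct.new(:text, :leading, :trailing) do
  def to_source
    "#{leading}#{text}#{trailing}"
  end
end

# A node holds child nodes and tokens in source order.
Node = Struct.new(:type, :children) do
  # Depth-first traversal: concatenating the children restores the code.
  def to_source
    children.map(&:to_source).join
  end
end

def tok(text, leading: "", trailing: "")
  Token.new(text, leading, trailing)
end

tree = Node.new(:NODE_IF, [
  tok("if", trailing: " "),                         # space (1)
  Node.new(:cond, [tok("cond", trailing: "\n  ")]), # NL (1) + space (2)
  Node.new(:action_a, [tok("action_a", trailing: "\n")]), # NL (1)
  tok("end"),
])

puts tree.to_source
```

Because no character of the input is dropped, an unmodified tree dumps back to exactly the original source.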

Slide 179

Slide 179 text

Red Green Tree • Red Green Tree is an editable Syntax Tree • Introduced by the C# compiler (Roslyn) • Swift (SwiftSyntax) and rust-analyzer (LSP) use this • Represent Syntax Tree with Red Node and Green Node • Let’s read swift-syntax • https://github.com/swiftlang/swift-syntax

Slide 180

Slide 180 text

Red Green Tree • Green Node • has reference to child elements • has width • Red Node • has reference to parent elements • has offset Token Green NODE Legend Red NODE NODE_IF width: 90 IF width: 3 NODE_IF width: 56 condition_a width: 11 action_a width: 11 NODE_ELSE width: 61 END width: 4 ELSE width: 5 NODE_IF offset: 0 NODE_ELSE offset: 25 NODE_IF offset: 30
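
The split between green and red nodes can be sketched in a few lines of Ruby. This is a simplified illustration, not the Roslyn or swift-syntax implementation: `GreenToken`, `GreenNode` and `RedNode` are hypothetical classes, widths are counted in characters, and red nodes are rebuilt on every call instead of being cached.

```ruby
# Green side: position independent, knows only children and width.
GreenToken = Struct.new(:kind, :text) do
  def width
    text.length
  end

  def children
    []
  end
end

GreenNode = Struct.new(:kind, :children) do
  def width
    children.sum(&:width)
  end
end

# Red side: a thin wrapper that adds parent and absolute offset.
class RedNode
  attr_reader :green, :parent, :offset

  def initialize(green, parent = nil, offset = 0)
    @green = green
    @parent = parent
    @offset = offset
  end

  # Offsets are derived on demand from the widths of preceding siblings.
  def children
    pos = offset
    green.children.map do |child|
      red = RedNode.new(child, self, pos)
      pos += child.width
      red
    end
  end
end

green = GreenNode.new(:NODE_IF, [
  GreenToken.new(:kw_if, "if "),
  GreenToken.new(:ident, "cond\n"),
  GreenToken.new(:ident, "  act\n"),
  GreenToken.new(:kw_end, "end\n"),
])

root = RedNode.new(green)
root.children.each do |child|
  puts "#{child.green.kind} offset=#{child.offset} width=#{child.green.width}"
end
```

Since green nodes store no positions, an edit only replaces the green nodes on the path to the root; offsets of everything after the edit are recomputed from widths rather than updated in place.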

Slide 181

Slide 181 text

Recap • Execute codes: compile.c • Analyze codes: LSP, Linter & Code Formatter • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • Concrete Syntax Tree !! • Editable Syntax Tree • Red Green Tree !!

Slide 182

Slide 182 text

Problem • How to keep edited Syntax Tree correct ? • If it’s not correct, hopefully want to auto correct • Syntax Tree rewriting can create Syntax Tree which parser never generates + 1 2 3 * * + 1 2 3 parse Rewrite Dump

Slide 183

Slide 183 text

Open Problem • Simple approach: parse the dumped code and compare its syntax tree with the syntax tree before the dump • The grammar might know which syntax trees the parser can generate Grammar file Parser generator Parser Syntax Tree Checker
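
The simple approach can be sketched with Ripper, which ships with CRuby. The sketch assumes a toy tree format of nested `[operator, left, right]` arrays over integers and a deliberately naive `dump` that never emits parentheses; `dump`, `from_sexp` and `round_trips?` are hypothetical helpers for the example.

```ruby
require "ripper"

# Naive dump: prints operands and operator, never adds parentheses.
def dump(tree)
  return tree.to_s if tree.is_a?(Integer)

  op, left, right = tree
  "#{dump(left)} #{op} #{dump(right)}"
end

# Convert the subset of Ripper's S-expression we need back to the toy format.
def from_sexp(sexp)
  case sexp[0]
  when :binary then [sexp[2], from_sexp(sexp[1]), from_sexp(sexp[3])]
  when :@int   then Integer(sexp[1])
  else raise ArgumentError, "unsupported node: #{sexp[0]}"
  end
end

# Dump the tree, parse the dump, and compare with the tree before the dump.
def round_trips?(tree)
  first_statement = Ripper.sexp(dump(tree))[1][0]
  from_sexp(first_statement) == tree
end

puts round_trips?([:+, 1, [:*, 2, 3]]) # prints true
puts round_trips?([:*, [:+, 1, 2], 3]) # prints false
```

The second tree is one the parser never produces for its own dump “1 + 2 * 3”, so the comparison catches it, at the cost of a full reparse.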

Slide 184

Slide 184 text

Summary • Research Syntax Tree use case • Code rewriting is the most difficult use case, right now • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • How to keep edited Syntax Tree correct • Open Problem

Slide 185

Slide 185 text

Conclusion The world is now in the great age of parsers. People are setting sail into the vast sea of parsers. - RubyKaigi 2023 LT in Matsumoto -

Slide 186

Slide 186 text

Summary • Grammar defines the boundary of the language and structure of the language • What parsers do • Parser distinguishes valid input and invalid input • Parser gives structure to the input correctly • LR-attributed grammar is the foundational theory of the current Ruby parser

Slide 187

Slide 187 text

Slide 188

Slide 188 text

Slide 189

Slide 189 text

What is Grammar • The ruler of lexer, parser and syntax tree • By grammar, we can reveal what Ruby is • It’s very interesting to expose the secret of Ruby syntax from grammar • I want to reveal what is the key of Ruby’s programmer friendly syntax • Hypothesis: • Programmers • can understand expression beginning and ending • feel it’s natural that conditional branch follows flow control keywords, like “return” • can understand space sensitive grammar • sometimes fail to understand precedence of non-arithmetic operators