Slide 1

Slide 1 text

Ruby's Line Breaks April 16, 2025 in RubyKaigi 2025 @yui-knk Yuichiro Kaneko

Slide 2

Slide 2 text

✦ Yuichiro Kaneko ✦ yui-knk (GitHub) / spikeolaf (Twitter) ✦ Treasure Data ✦ Engineering Manager of Applications Backend About me

Slide 3

Slide 3 text

PR: We are Gold sponsor!

Slide 4

Slide 4 text

Day 3: Drink up 🍻

Slide 5

Slide 5 text

TD and Ruby committers twitter: @nalsh GitHub: @nurse twitter: @k_tsj GitHub: @k-tsj twitter: @spikeolaf GitHub: @yui-knk twitter: @mineroaoki GitHub: @aamine twitter: @nahi GitHub: @nahi Applications Backend

Slide 6

Slide 6 text

Attendees from TD @spikeolaf @nalsh @k_tsj @makimoto @citystar (GH) @ybiquitous @yokoto @tomog105 Anne Ju

Slide 7

Slide 7 text

✦ Yuichiro Kaneko ✦ yui-knk (GitHub) / spikeolaf (Twitter) ✦ CRuby committer, mainly developing the parser generator and the parser ✦ Lrama LALR(1) parser generator (2023, Ruby 3.3) ✦ The Bison Slayer ✦ The parser monster ✦ The dawn bringer of the parser world ✦ Ripper Rearchitecture (2024, Ruby 3.4) ✦ Code positions to RNode (2018, Ruby 2.6) ✦ RubyVM::AbstractSyntaxTree (2018, Ruby 2.6) About me

Slide 8

Slide 8 text

Introduction

Slide 9

Slide 9 text

✦ This code consists of two method dispatches ✦ Calls `#p` method with 1 ✦ Calls `#p` method with 2 ✦ In Ruby, you can use line breaks to separate meaningful chunks of code Line Breaks in Ruby Grammar

Slide 10

Slide 10 text

✦ This code consists of one method dispatch ✦ `Integer#+` is called for `1` with `2` Line Breaks in Ruby Grammar

Slide 11

Slide 11 text

✦ There is a significant difference between these two pieces of code ✦ #1: Due to the line break, the first and second lines are interpreted as distinct method calls ✦ #2: The two lines are combined into a single line of code ✦ In other words, the second code behaves as if the line break is ignored Line Breaks are the question
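
The slide images with the actual code are not part of this transcript. As a minimal sketch of the contrast described above, the snippet below evaluates small stand-in strings (the strings and variable names are assumptions, not the slide's code, and bare values replace `p` so that nothing is printed):

```ruby
# Hypothetical stand-ins for the slide snippets.

# The line break ends the first statement, so these are two statements;
# eval returns the value of the last one.
ends_statement = eval("1\n2")

# `1 +` cannot end a statement, so the line break is ignored: this is 1 + 2.
joined_by_plus = eval("1 +\n2")

# `1` is already complete, so the next line is a separate statement `+ 2`.
split_before_op = eval("1\n+ 2")

puts ends_statement   # 2
puts joined_by_plus   # 3
puts split_before_op  # 2
```

Only the position of the line break differs between the last two cases, yet one is a single addition and the other is two statements.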

Slide 12

Slide 12 text

✦ Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be completed Principles of Line Breaks in Ruby https://x.com/tanaka_akr/status/1870679443376947467

Slide 13

Slide 13 text

✦ The first line of this code becomes a complete statement as a method call without arguments ✦ Therefore this code is two method calls, not a single method call with an argument ✦ The principle holds true #1: method call w/o args !=

Slide 14

Slide 14 text

✦ The first line of this code isn’t a complete statement because a syntax error is raised ✦ Therefore this code is one method call with two arguments ✦ The principle holds true #2: method call w/ args == Syntax Error

Slide 15

Slide 15 text

✦ “1 +” isn’t a complete statement ✦ Therefore this code is interpreted as “1 + 2” ✦ The principle holds true #3: binary operator Syntax Error ==

Slide 16

Slide 16 text

✦ “-” isn’t a complete statement ✦ Therefore this code is interpreted as ‘-“str”’ ✦ The principle holds true #4: unary operator Syntax Error ==

Slide 17

Slide 17 text

✦ Because the ternary operator's scope includes the expression following the colon, the statement cannot be split in the middle of that expression ✦ The principle holds true #5: ternary operator == Syntax Error
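
The five cases above all ask the same question: is the first line a complete statement by itself? A rough way to check this is to hand only the first line to Ripper and see whether it parses. The helper name `complete?` and the sample first lines are assumptions made for illustration, not the code shown on the slides:

```ruby
require "ripper"

# Hypothetical helper: true when `src` parses as a complete program.
# Ripper.sexp returns nil when the source contains a syntax error.
def complete?(src)
  !Ripper.sexp(src).nil?
end

# First lines only, one per case #1 to #5 (illustrative stand-ins).
first_lines = [
  "p",           # 1: method call without arguments
  "p 1,",        # 2: method call with a trailing comma
  "1 +",         # 3: binary operator
  "-",           # 4: unary operator
  "true ? 1 :"   # 5: ternary operator
]

completeness = first_lines.map { |line| complete?(line) }

first_lines.zip(completeness).each do |line, ok|
  puts "#{line.inspect}: #{ok ? 'complete, the line break ends the statement' : 'incomplete, parsing continues on the next line'}"
end
```

Only the first case is complete on its own, which matches the `!=` on slide 13 and the `==` on slides 14 to 17.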

Slide 18

Slide 18 text

✦ So far, the hypothesis seems to be correct ✦ Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be completed ✦ However, are there truly no counterexamples in every part of Ruby's grammar …? No counterexamples exist …?

Slide 19

Slide 19 text

✦ The relationships of Language, Grammar, Automaton and Parser ✦ Investigate how line breaks are treated ✦ Verify the hypothesis ✦ Understand the principles of line breaks in Ruby grammar Today’s topics

Slide 20

Slide 20 text

Language, Grammar, Automaton and Parser

Slide 21

Slide 21 text

✦ A (formal) language is a subset of all possible words ✦ Some words belong to the Ruby language ✦ Others don’t What is Language?

Slide 22

Slide 22 text

✦ Even though these programs are transcendental, imbroglio code (TRICK), they belong to the Ruby language. Ruby https://github.com/tric/trick2022/blob/master/01-tompng/entry.rb https://github.com/tric/trick2022/blob/master/06-mame/entry.rb

Slide 23

Slide 23 text

✦ At a glance, this looks like Ruby code, but it doesn’t belong to the Ruby language Not Ruby

Slide 24

Slide 24 text

✦ (Ruby) language is an infinite set of words ✦ Grammar is a finite set of rules which define the language What is Grammar? Grammar Language …

Slide 25

Slide 25 text

✦ Grammar provides structure to the language Grammar and structure + 1 2 3 * * + 1 2 3 Correct Wrong

Slide 26

Slide 26 text

✦ A finite automaton is a 5-tuple (Q, Σ, δ, q0, F) ✦ Q is a finite set of states ✦ Σ is a finite set of input symbols ✦ δ is a transition function, from state to state by input symbol ✦ q0 is an initial state ✦ F is a set of accepting states What is Automaton?

Slide 27

Slide 27 text

✦ Let's consider a vending machine that sells juice at ¥110 ✦ Put in a 100-yen coin and 10-yen coin ✦ Press the “purchase” button Vending machine

Slide 28

Slide 28 text

✦ In the case of a vending machine, ✦ There are 3 inputs, 100-yen coin, 10-yen coin and pressing the “purchase” button ✦ Initial state is that no coins are inserted ✦ Accepting state is that 100-yen coin and 10-yen coins are inserted then the “purchase” button is pressed Vending machine as automaton q0 ¥100 q1 q2 ¥10 q3 q4 ¥100 ¥10 “purchase"

Slide 29

Slide 29 text

✦ Only limited inputs are accepted in each state ✦ For example, on q1 ✦ ¥10 coin is accepted ✦ ¥100 coin is not accepted ✦ “purchase” button is not accepted Vending machine as automaton q0 ¥100 q1 q2 ¥10 q3 q4 ¥100 ¥10 “purchase”
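
The vending machine can be written down directly as a transition table. The sketch below is one possible reading of the diagram: q1 follows the ¥100 coin (as stated for q1 above), while the roles of q2, q3 and q4 and all Ruby names are assumptions made for illustration.

```ruby
# Transition function δ as a nested hash: state -> input -> next state.
# Inputs missing from a state's hash are not accepted in that state.
VENDING_TRANSITIONS = {
  q0: { yen100: :q1, yen10: :q2 },  # initial state: no coins inserted
  q1: { yen10: :q3 },               # ¥100 inserted: only ¥10 is accepted
  q2: { yen100: :q3 },              # ¥10 inserted: only ¥100 is accepted (assumed)
  q3: { purchase: :q4 },            # ¥110 inserted: only the button is accepted
  q4: {}                            # accepting state
}.freeze

VENDING_ACCEPTING = [:q4].freeze

# Runs the automaton from q0 and reports whether it ends in an accepting state.
def vending_accepts?(inputs)
  state = :q0
  inputs.each do |input|
    state = VENDING_TRANSITIONS.fetch(state)[input]
    return false if state.nil?      # this input is not accepted in this state
  end
  VENDING_ACCEPTING.include?(state)
end

puts vending_accepts?([:yen100, :yen10, :purchase])  # true
puts vending_accepts?([:yen100, :yen100])            # false
```

The five parts of the tuple map onto the code: the hash keys are Q, the input symbols are Σ, the hash itself is δ, `:q0` is the initial state, and `VENDING_ACCEPTING` is F.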

Slide 30

Slide 30 text

✦ There are known methods to convert a Non-deterministic Finite Automaton (NFA) to a Deterministic Finite Automaton (DFA) ✦ It is known that the minimum DFA is unique ✦ It's possible to combine two automata into a single automaton Theory of Automaton

Slide 31

Slide 31 text

✦ If you attended Fujinami-san's talk this morning, you likely heard in detail about research on automata Theory of Automaton https://speakerdeck.com/makenowjust/make-parsers-compatible-using-automata-learning

Slide 32

Slide 32 text

✦ @.bookstore at RubyKaigi 2025 ✦ “計算理論の基礎 [原著第3版] 1.オートマトンと言語” ✦ “白と黒のとびら: オートマトンと形式言語をめぐる冒険” ✦ “正規表現技術入門――最新エンジン実装と理論的背景” Books for Automaton

Slide 33

Slide 33 text

✦ Is the vending machine related to a parser? ✦ No, a parser is not a vending machine, but a parser is an automaton Parser as finite-state automaton

Slide 34

Slide 34 text

✦ Let’s think about a grammar which only includes a single class definition ✦ “class”, identifier (id) then “end” ✦ This can be represented as an automaton with four states, taking “class”, id and “end” as inputs Parser for very simple grammar class A end program : "class" id "end" P1 P2 P3 P4 class id end Grammar Language Automaton

Slide 35

Slide 35 text

✦ For example, on P1 ✦ “class” is accepted then goes to state P2 ✦ id is not accepted then syntax error ✦ “end” is not accepted then syntax error How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton

Slide 36

Slide 36 text

✦ For example, on P2 ✦ “class” is not accepted then syntax error ✦ id is accepted then goes to state P3 ✦ “end” is not accepted then syntax error How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton

Slide 37

Slide 37 text

✦ For example, on P4 ✦ “class” is not accepted then syntax error ✦ id is not accepted then syntax error ✦ “end” is not accepted then syntax error ✦ End of Input is accepted then the parsing process completed without errors How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton
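
The walk through P1, P2 and P4 above can be condensed into a small table-driven recognizer. This is a sketch of the four-state automaton for `program : "class" id "end"`; the method name and the error message format are made up for illustration.

```ruby
# Transitions of the four-state automaton: state -> token -> next state.
CLASS_TRANSITIONS = {
  P1: { class: :P2 },
  P2: { id: :P3 },
  P3: { end: :P4 }
}.freeze

# Returns :accepted, or a String describing the syntax error.
def parse_simple_class(tokens)
  state = :P1
  tokens.each do |token|
    next_state = CLASS_TRANSITIONS.dig(state, token)
    return "syntax error: unexpected #{token} in #{state}" if next_state.nil?
    state = next_state
  end
  # Only P4 accepts End of Input.
  state == :P4 ? :accepted : "syntax error: unexpected end of input in #{state}"
end

puts parse_simple_class([:class, :id, :end])  # accepted
puts parse_simple_class([:class, :end])       # syntax error: unexpected end in P2
```

Every token that has no entry for the current state is a syntax error, which is exactly the "is not accepted then syntax error" bullets above.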

Slide 38

Slide 38 text

✦ Let’s allow arbitrary levels of nesting for class definitions Parser for complex grammar class A class B … end end program : class_def class_def : "class" id body "end" body : class_def | /* empty */ Grammar Language Automaton B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end B1 /* empty */

Slide 39

Slide 39 text

✦ When parsing nested class definitions, a new automaton corresponding to the class definition is created ✦ The input has been parsed up to “class A”, so the second automaton is in C3 Automaton with a stack class A class B end end Input C1 C2 C3 C5 P1 P2 class_def C4 class id body end

Slide 40

Slide 40 text

✦ To parse “class B end” as the body of class A, create new automatons Automaton with a stack class A class B end end Input B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end C1 C2 C3 C5 C4 class id body end

Slide 41

Slide 41 text

✦ After reading to “class B end”, the bottom automaton enters the accepting state Automaton with a stack class A class B end end Input B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end C1 C2 C3 C5 C4 class id body end

Slide 42

Slide 42 text

✦ When the automaton reaches the accepting state, it is popped from the stack, and the original automaton transitions to the next state (C4) Automaton with a stack class A class B end end Input C1 C2 C3 C5 P1 P2 class_def C4 class id body end

Slide 43

Slide 43 text

✦ Parser is an automaton that takes tokens such as “class”, id, and “end” as input ✦ A finite automaton with a stack is called a pushdown automaton ✦ By using a stack, a pushdown automaton can handle languages with infinite nesting ✦ An LR parser is implemented as a pushdown automaton LR Parser as pushdown automaton

Slide 44

Slide 44 text

✦ LR parsers have two main operations ✦ Shift: Moves the automaton to the next state ✦ Reduce: Pops the automaton that has reached the accepting state from the stack LR parser actions C1 C2 C3 C5 C4 C1 C2 C3 C5 C4 Shift Reduce

Slide 45

Slide 45 text

✦ How does an LR parser choose the correct automaton when multiple automatons are applicable? ✦ “body : class_def” is correct for the left, “body : /* empty */” is correct for the right Choose correct automaton class A end B1 B2 C1 C2 C3 C5 class_def C4 class id body end C1 C2 C3 C5 C4 class id body end class A class B end end B1 /* empty */

Slide 46

Slide 46 text

✦ Determine the correct rule by looking at the next token ✦ In the right case, next token is “end” ✦ In the case of the empty string rule, “end” can be shifted after the automaton is popped ✦ The set of tokens that can follow a certain rule is called a lookahead set Lookahead set class A end C1 C2 C3 C5 C4 class id body end B1 /* empty */ Next token is “end” Match with new token Input
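
To make the stack and the lookahead concrete, here is a simplified recognizer for the nested-class grammar. It is not an LR parser and not how Lrama generates code; it only mimics the ideas above: each open `class_def` is one stack entry, and the next token decides between `body : class_def` and `body : /* empty */`. All names are assumptions.

```ruby
# Recognizes:
#   program   : class_def
#   class_def : "class" id body "end"
#   body      : class_def | /* empty */
def nested_class?(tokens)
  stack = []   # one entry per class_def automaton that is still open
  pos = 0
  loop do
    # A class_def automaton starts by reading "class" and id.
    return false unless tokens[pos] == :class && tokens[pos + 1] == :id
    stack.push(:class_def)
    pos += 2

    # Lookahead: "class" selects `body : class_def`, so push another automaton.
    next if tokens[pos] == :class

    # Otherwise `body : /* empty */` is chosen; each "end" completes the
    # automaton on top of the stack, which is then popped.
    while tokens[pos] == :end && !stack.empty?
      stack.pop
      pos += 1
    end
    return stack.empty? && pos == tokens.length
  end
end

puts nested_class?([:class, :id, :end])                     # true
puts nested_class?([:class, :id, :class, :id, :end, :end])  # true
puts nested_class?([:class, :id, :class, :id, :end])        # false
```

The stack depth is not bounded by the code, which is what lets this handle any nesting level, unlike the fixed four-state automaton.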

Slide 47

Slide 47 text

✦ Chomsky hierarchy ✦ Four classes of formal grammars form a hierarchy ✦ There are correspondences between grammars and automatons Grammar class and automaton ✦ Regular: Finite-state automaton ✦ Context-free: Non-deterministic pushdown automaton ✦ Context-sensitive: Linear-bounded non-deterministic Turing machine ✦ Recursively enumerable: Turing machine

Slide 48

Slide 48 text

✦ Grammar determines the scope of the language ✦ Grammar can be converted to an automaton ✦ Parser is an automaton that takes tokens as input Language, Grammar, Automaton and Parser class A … end program : class_def class_def : "class" id body "end" … Grammar Language Automaton = Parser C1 C2 C3 C5 C4 class id body end Determine Convert

Slide 49

Slide 49 text

How are Line Breaks treated?

Slide 50

Slide 50 text

✦ A lexer is the component that divides input string into meaningful chunks, which are called tokens Parser and Lexer C1 C2 C3 C5 P1 P2 class_def C4 class id body end class A class B end end Input class A class B end end Tokens Parser Lexer

Slide 51

Slide 51 text

✦ It's important to understand that the lexer intelligently handles line breaks, sometimes ignoring them and sometimes not #1: Lexer ignores Line Breaks method_1 arg 1 + 2 Input method_1 ‘\n’ arg 1 + 2 Tokens Not ignored Ignored

Slide 52

Slide 52 text

✦ On the other hand, regarding the grammar, statements are separated by line breaks (‘\n’) by a rule Grammar for statements

Slide 53

Slide 53 text

✦ Lexer returns or ignores ‘\n’ based on lex state How lexer works Returns ‘\n’ token Ignores ‘\n’ Returns ‘\n’ token Checks lex state
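
One way to watch this decision from Ruby is `Ripper.lex`, which reports a line break the lexer returns as `:on_nl` and one it skips as `:on_ignored_nl`. The helper name `newline_events` is an assumption, and the input is the `method_1` / `arg 1 +` / `2` example from slide 51, with its line breaks assumed to fall where that slide marks them as "Not ignored" and "Ignored".

```ruby
require "ripper"

# Collects only the line-break events, in source order.
# Each element of Ripper.lex is [[line, column], event, token, state].
def newline_events(src)
  Ripper.lex(src)
        .map { |_pos, event, _token, _state| event }
        .select { |event| event == :on_nl || event == :on_ignored_nl }
end

# First line break: after a complete method call, so it is returned.
# Second line break: after the binary `+`, so it is ignored.
slide_51_events = newline_events("method_1\narg 1 +\n2")

p slide_51_events
```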

Slide 54

Slide 54 text

✦ 13 lex state flags exist! ✦ EXPR_BEG: BEGinning of expression ✦ EXPR_END: END of expression ✦ EXPR_ENDARG: END of ARGument ✦ EXPR_ENDFN: END of Function NAME ✦ EXPR_ARG: ARGument ✦ EXPR_CMDARG: CoMmanD ARGument ✦ EXPR_MID: MIDdle of expression ✦ EXPR_FNAME: immediately after “def” keyword, might be Function NAME ✦ EXPR_DOT: immediately after DOT (dot includes ‘.’ ‘&.’ ‘::’) ✦ EXPR_CLASS: immediately after “class” keyword ✦ EXPR_LABEL: label is possible, label is `a:` ✦ EXPR_LABELED: immediately after label ✦ EXPR_FITEM: just before fitem. fitem is the token after undef or alias ✦ Only the typical meanings are listed; there are also exceptions They are lex state flags!

Slide 55

Slide 55 text

✦ EXPR_CLASS means immediately after the “class” keyword ✦ The lexer ignores line breaks when EXPR_CLASS, so this code works without any issues EXPR_CLASS ==

Slide 56

Slide 56 text

✦ EXPR_DOT means immediately after DOT ✦ Dot includes ‘.’ ‘&.’ ‘::’ ✦ The lexer ignores line breaks when EXPR_DOT, so this code works without any issues EXPR_DOT ==
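
The code for these two slides is not in the transcript, so the following is only a guess at its shape: a line break right after `class` and a line break right after a dot. The class name `LineBreakDemo` and the string receiver are made up.

```ruby
require "ripper"

# A line break directly after the `class` keyword (EXPR_CLASS).
class_src = "class\nLineBreakDemo\nend"

# A line break directly after the dot (EXPR_DOT).
dot_src = "'abc'.\nupcase"

# Ripper.sexp returns nil on a syntax error, so non-nil means it parsed.
class_parses = !Ripper.sexp(class_src).nil?
dot_value = eval(dot_src)

puts class_parses  # true
puts dot_value     # ABC
```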

Slide 57

Slide 57 text

✦ Ruby's grammar has the concepts of EXPR_BEG and EXPR_END ✦ In the '1 + 2' example, the EXPR_BEG and EXPR_END states are repeated like this EXPR_BEG and EXPR_END 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END

Slide 58

Slide 58 text

✦ Lexer ignores line breaks when EXPR_BEG ✦ Therefore, this code is equivalent to “1 + 2” EXPR_BEG 1 + 2 Input 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END ‘\n’ 1 + 2 ==

Slide 59

Slide 59 text

✦ Lexer emits line break tokens when EXPR_END ✦ Therefore, in this code, it's treated as two lines of code: “1”, and “+2”, rather than "1 + 2” EXPR_END 1 + 2 Input 1 ‘\n’ 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END 1 + 2 != + EXPR_BEG
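
The alternation of EXPR_BEG and EXPR_END can be observed with `Ripper.lex`, whose fourth field is the lex state after each token. The helper names below are assumptions; the state is classified with `allbits?` against `Ripper::EXPR_BEG` and `Ripper::EXPR_END` rather than by its printed name.

```ruby
require "ripper"

# Maps a Ripper lex state to a coarse label for this illustration.
def beg_or_end(state)
  if state.allbits?(Ripper::EXPR_BEG)
    :EXPR_BEG
  elsif state.allbits?(Ripper::EXPR_END)
    :EXPR_END
  else
    :other
  end
end

# Returns [token, state-after-token] pairs, skipping whitespace tokens.
def states_after_tokens(src)
  Ripper.lex(src)
        .reject { |_pos, event, _token, _state| event == :on_sp }
        .map { |_pos, _event, token, state| [token, beg_or_end(state)] }
end

one_plus_two = states_after_tokens("1 + 2")

p one_plus_two
# [["1", :EXPR_END], ["+", :EXPR_BEG], ["2", :EXPR_END]]
```

A line break placed after `+` therefore arrives while the state is EXPR_BEG and is ignored, while one placed after `1` arrives in EXPR_END and is emitted.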

Slide 60

Slide 60 text

✦ Lex states can be thought of as automata that take tokens as input ✦ While lexers are typically described as automata that take characters as input, Ruby's lexer is an automaton in a dual sense Lexer is an automaton Automaton for characters ‘|’ ‘|’ ‘=’ |= || | ||= Otherwise ‘=’ Otherwise Automaton for tokens BEG END tINTEGER +, \n +@
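
The character-level automaton in the figure distinguishes `|`, `|=`, `||` and `||=` by reading at most three characters. A sketch of just that part follows; it deliberately ignores lex state, so it is not how Ruby's lexer actually decides, and the method name is hypothetical.

```ruby
# Returns the longest bar operator at the start of `src`, or nil if `src`
# does not start with '|'.
def scan_bar_operator(src)
  return nil unless src[0] == "|"

  case src[1]
  when "="
    "|="                               # '|' then '='
  when "|"
    src[2] == "=" ? "||=" : "||"       # '|' '|' then '=' or anything else
  else
    "|"                                # '|' followed by anything else
  end
end

puts scan_bar_operator("||= 1")  # ||=
puts scan_bar_operator("| x")    # |
```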

Slide 61

Slide 61 text

✦ What is this exception case? How lexer works again What is this? Returns ‘\n’ token Understand!

Slide 62

Slide 62 text

✦ ‘\n’ is needed after mandatory keyword arguments for method definition without parentheses ✦ Ref: [Bug #9669] Method definition w/o parentheses ‘\n’ is needed !=

Slide 63

Slide 63 text

✦ Even though the lexer intelligently ignores line breaks, there are situations where that doesn't work ✦ Lex state is EXPR_END after “arg”, so the lexer emits line breaks ✦ For example, “1+ 2” is “arg” ✦ In such cases, the line break token needs to be written in the grammar #2: Grammar allows Line Breaks ‘\n’ is needed in grammar rules EXPR_END EXPR_END

Slide 64

Slide 64 text

✦ In a 2006 mailing list discussion, they were talking about ignoring line breaks where expressions cannot end ✦ [ruby-dev: 29206] Another evidence https://public-inbox.org/ruby-dev/[email protected]/

Slide 65

Slide 65 text

✦ Personally, I feel it's too difficult; how can we maintain this… ✦ Then I asked other committers what they think about lex state ✦ “This is impossible to understand, isn't it?” by usa ✦ “I'm keeping my distance because it's scary” by ko1 ✦ “It's hard to believe anyone could completely understand and work with all of lex state” by akr Monstrous Lex state

Slide 66

Slide 66 text

✦ lex_state changes so intricately based on the script's reading position that it seems beyond human comprehension. How could we describe this poetically or in a literary sense? Q&A session with Gemini

Slide 67

Slide 67 text

✦ “The hidden flow of Qi within the code” ✦ Reason: An unseen force that forms the semantic flow and structure of the code, directing its interpretation, with a touch of Eastern mysticism Lex state is Qi

Slide 68

Slide 68 text

✦ These are all nice, but they describe lex_state quite favorably. I perceive it more as a form of malevolent black magic. So could you consider it from that perspective?

Slide 69

Slide 69 text

✦ “An indescribable state transition that erodes the developer's sanity” ✦ Nuance: A madness that transcends human comprehension, a forbidden insanity that drains the mind upon close inspection. (With a Cthulhu Mythos vibe.) An indescribable state transition

Slide 70

Slide 70 text

✦ “A doorway to chaos lurking in the parser's abyss” ✦ Nuance: The unfathomable eeriness of its intricate mechanism, which leads to the deepest, incomprehensible realms of language interpretation (abyss and chaos). A doorway to chaos

Slide 71

Slide 71 text

✦ The independence of the parser and lexer ✦ Even though we've seen so many examples of the parser and lexer cooperating What makes it so difficult?

Slide 72

Slide 72 text

✦ The state immediately before the binary plus operator needs to be EXPR_END ✦ Grammar rule for the binary plus operator expression is “arg : arg + arg” ✦ Therefore, lex state should be EXPR_END when it comes to the end of arg Where should be EXPR_END? 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END arg + arg EXPR_BEG EXPR_END EXPR_BEG EXPR_END

Slide 73

Slide 73 text

✦ That means that it's only necessary to transition to EXPR_END for all the tokens that come at the end of the 'arg' rule ✦ In relation to the First set, could we refer to them as the Last set? ✦ These tokens can be the last token of arg ✦ tINTEGER ✦ tSTRING_END ✦ “end” Last set of rules

Slide 74

Slide 74 text

✦ To determine the last set, expand the rule until its final symbol is a token ✦ “arg” includes “primary” ✦ “primary” can be expanded to tINTEGER and so on ✦ “primary” ends with ‘]’, “end” and so on ✦ Need to set the state to EXPR_END immediately after all the tokens obtained in this way How to know Last set arg : arg ‘+’ arg | arg '-' arg | … | primary primary : literal | tLBRACK aref_args ‘]' | k_if … k_end | … simple_numeric : tINTEGER | tFLOAT | tRATIONAL | tIMAGINARY literal : numeric | symbol
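
The expansion described above can be computed as a fixed point: the last set of a rule is the union, over its alternatives, of the last set of each alternative's final symbol. The grammar below is a heavily cut-down, hypothetical excerpt (four rules, no empty alternatives), not Ruby's real grammar, and `last_sets` is a name chosen here.

```ruby
require "set"

# Keys are nonterminals; any symbol that is not a key is treated as a token.
# Empty alternatives are not handled in this sketch.
SIMPLE_GRAMMAR = {
  arg:     [[:arg, "+", :arg], [:arg, "-", :arg], [:primary]],
  primary: [[:literal], [:tLBRACK, :aref_args, "]"], [:k_if, :expr, :k_end]],
  literal: [[:numeric]],
  numeric: [[:tINTEGER], [:tFLOAT]]
}.freeze

# Computes, for every nonterminal, the set of tokens that can come last.
def last_sets(grammar)
  sets = Hash.new { |hash, key| hash[key] = Set.new }
  changed = true
  while changed
    changed = false
    grammar.each do |lhs, alternatives|
      alternatives.each do |rhs|
        tail = rhs.last
        # Nonterminal tail: inherit its last set. Token tail: the token itself.
        additions = grammar.key?(tail) ? sets[tail].to_a : [tail]
        before = sets[lhs].size
        sets[lhs].merge(additions)
        changed ||= sets[lhs].size != before
      end
    end
  end
  sets
end

arg_last_set = last_sets(SIMPLE_GRAMMAR)[:arg]
p arg_last_set.to_a
```

For this toy grammar the last set of `arg` is `]`, `k_end`, `tINTEGER` and `tFLOAT`; these are the tokens after which the lex state would have to become EXPR_END.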

Slide 75

Slide 75 text

✦ Now, let's focus on the tokens themselves ✦ For example, ‘^’ (caret) ✦ This is a binary operator used to calculate the exclusive OR for integers ✦ As it's a binary operator, similar to ‘+’, it should transition to EXPR_BEG immediately after ‘^’ The context of the token

Slide 76

Slide 76 text

✦ The caret is now used not only as a binary operator but also as a unary operator, specifically as a pin operator in pattern matching ✦ For now, even in the unary operator case, simply transitioning to EXPR_BEG works well ✦ However, if we change the lex state transition for the caret in the future, we'll need to consider both cases ✦ Lex state management is also difficult because a single token can be used in grammatically very different places ‘^’ is not always binary operator

Slide 77

Slide 77 text

✦ Another reason why lex state is so difficult is that a state is used for various purposes ✦ For example, EXPR_BEG is not only used to determine whether to ignore or emit line break but also whether to treat two bars as a set or as separate bars Lex state overloading This is “||” (!EXPR_BEG) This is ‘|’ and ‘|’ (EXPR_BEG)

Slide 78

Slide 78 text

✦ Fixing [Bug #10653] caused [Bug #11456] and [Bug #11849] ✦ “All bug fixes are incompatibilities” in RubyKaigi 2019 Fixing a bug caused other bugs

Slide 79

Slide 79 text

✦ Ruby's parser handles line breaks by having the lexer decide whether to ignore or emit it, depending on the context ✦ This context-dependent behavior is managed by something called lex state ✦ It’s clear that lex state is incredibly hard to manage, and it can be described as “a doorway to chaos” A doorway to chaos lurking in the parser's abyss

Slide 80

Slide 80 text

Day 0: Interview for nobu Nakada-san, what do you think about Lex state? Well, that's a necessary evil, isn't it?

Slide 81

Slide 81 text

Day 0: Interview for nobu Could you tell me how much you know about Lex state behavior? Hmm, I'm at about 50% understanding at the moment.

Slide 82

Slide 82 text

Day 0: Interview for nobu What's your usual method when you need to modify things related to Lex I check all the areas that might be affected, but the last part is just a feeling.

Slide 83

Slide 83 text

Overcome chaos to restore order

Slide 84

Slide 84 text

✦ Can subtraction be written in every place where addition can be written in Ruby? ✦ Yes, it's clear from the grammar ✦ Not only subtraction but also all binary operators can be checked by the grammar Grammar as Order arg : arg '+' arg | arg '-' arg | arg '*' arg | arg '/' arg | arg '^' arg | arg tCMP arg ... Grammar

Slide 85

Slide 85 text

✦ Is it possible to write the same element for both the default value of a formal argument (“expr1”) and the actual argument (“expr2”)? ✦ Yes, it's also clear from the grammar that both of them allow “arg_value” Grammar as Order f_args : f_arg ',' f_optarg(arg_value) ',' … call_args : args opt_block_arg args : arg_value Grammar

Slide 86

Slide 86 text

✦ What are the problems if lex state is chaotic? ✦ Today's theme is the fundamental principles of line breaks in Ruby's grammar ✦ However, some parts of Ruby's grammar are controlled by lex state. This means it's difficult to find fundamental principles from chaos Problems with being chaotic program : class_def class_def : "class" id body "end" … Grammar Automaton = Parser Lex state “A doorway to chaos” Order Chaos Chaos

Slide 87

Slide 87 text

✦ Defeat chaos by modeling lex state ✦ Modeling is simplifying an object by focusing on its important properties to make it easier to comprehensively understand the structure of the object What’s modeling and why

Slide 88

Slide 88 text

✦ Parser and lex states are automatons that take tokens as input ✦ Parser is an automaton that takes tokens such as “class”, id, and “end” as input ✦ Lex states can be thought of as automata that take tokens as input Parser and Lex State Parser Lex state BEG END tINTEGER +, \n +@ A1 A2 A3 primary + A4 primary P1 P2 tINTEGER

Slide 89

Slide 89 text

✦ It's possible to combine two automata into a single automaton ✦ Therefore, it's possible to build a new automaton from these two ✦ For example, the state after reading '+' is the A3 state in the parser, and the lex state is EXPR_BEG Combine automatons Parser BEG END tINTEGER +, \n +@ A1 A2 A3 primary + A4 primary BEG END BEG END Lex state
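
Combining the two automata amounts to a product construction: the combined state is a pair (parser state, lex state), and one token advances both halves at once. The tables below are a toy version of the figure. Feeding `tINTEGER` directly to the parser half instead of a reduced `primary` is a simplification, and all names are assumptions.

```ruby
# Parser half: A1 -tINTEGER-> A2 -'+'-> A3 -tINTEGER-> A4
PARSER_TABLE = {
  A1: { tINTEGER: :A2 },
  A2: { :+ => :A3 },
  A3: { tINTEGER: :A4 }
}.freeze

# Lex state half: an integer moves to EXPR_END, a binary '+' moves back to EXPR_BEG.
LEX_TABLE = {
  EXPR_BEG: { tINTEGER: :EXPR_END },
  EXPR_END: { :+ => :EXPR_BEG }
}.freeze

# Advances both automata with the same token; nil if either half rejects it.
def step_product(pair, token)
  parser_state, lex_state = pair
  next_parser = PARSER_TABLE.dig(parser_state, token)
  next_lex = LEX_TABLE.dig(lex_state, token)
  return nil if next_parser.nil? || next_lex.nil?
  [next_parser, next_lex]
end

# Returns every combined state visited, starting from [A1, EXPR_BEG].
def run_product(tokens)
  trace = [[:A1, :EXPR_BEG]]
  tokens.each do |token|
    next_pair = step_product(trace.last, token)
    break if next_pair.nil?
    trace << next_pair
  end
  trace
end

product_trace = run_product([:tINTEGER, :+, :tINTEGER])
p product_trace
# [[:A1, :EXPR_BEG], [:A2, :EXPR_END], [:A3, :EXPR_BEG], [:A4, :EXPR_END]]
```

The third entry is the pairing named on the slide: after reading '+', the parser is in A3 and the lex state is EXPR_BEG.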

Slide 90

Slide 90 text

✦ Extend Lrama’s grammar for describing lexer state Describe lex state transition on grammar file Type of states Initial state Aliases Transitions

Slide 91

Slide 91 text

✦ To extract the lex state transitions from an existing implementation, simply focus on the tokens the lexer returns How to extract lex state transition EXPR_ARG if IS_AFTER_OPERATOR EXPR_BEG if ! IS_AFTER_OPERATOR

Slide 92

Slide 92 text

✦ As the lex state is updated within certain grammar rules, introduce new syntax to the grammar that allows specifying lex state ✦ “%ls” stands for Lexer State Transition in grammar Transitions

Slide 93

Slide 93 text

✦ Run the lrama command once the lexer state is written ✦ Lex state transitions for each token are shown ✦ For example, the “end” token has two transitions ✦ If “end” is a method name, it transitions to EXPR_ENDFN ✦ Otherwise it transitions to EXPR_END Lex state for tokens EXPR_ENDFN EXPR_END

Slide 94

Slide 94 text

✦ Lex state transitions for each grammar rule are shown ✦ Method definitions always result in the EXPR_END state ✦ Even though the lex state after the “end” token can be EXPR_ENDFN or EXPR_END Lex state for rules EXPR_END

Slide 95

Slide 95 text

✦ By combining lex state transitions for each token and each rule, the lex state transitions for each parser state become clear ✦ For example, the parser immediately after reading the binary operator ‘+’ is always in the EXPR_BEG state Lex state for parser state EXPR_END

Slide 96

Slide 96 text

✦ To test the hypothesis about line breaks, let's look for counterexamples ✦ Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be completed Verify the hypothesis https://x.com/tanaka_akr/status/1870679443376947467

Slide 97

Slide 97 text

✦ Grammar allows ‘\n’ for reduce action but lexer ignores ‘\n’ ✦ The lexer state can be EXPR_BEG ✦ Then ‘\n’ is ignored ✦ The grammar state’s lookahead set includes ‘\n’ token #1. Unexpectedly ignores ‘\n’

Slide 98

Slide 98 text

✦ After the dots of an endless range, it becomes EXPR_BEG, and the following line break is ignored ✦ Therefore the code below is not an endless range and “b” but a normal range Endless range == != Based on the principles
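
A small check of this behavior, using integer literals in place of the slide's operands (the slide's own code is not in the transcript, so `1..` followed by `2` on the next line is an assumed stand-in):

```ruby
require "ripper"

range_src = "1..\n2"

# The line break right after `..` shows up as :on_ignored_nl, not :on_nl.
range_events = Ripper.lex(range_src).map { |_pos, event, _token, _state| event }

# Because the line break is ignored, the two lines form one ordinary range.
range_value = eval(range_src)

p range_events
p range_value  # 1..2
```

If the line break had ended the statement, the first line would be the endless range `1..` and the second line a separate statement `2`.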

Slide 99

Slide 99 text

✦ After the “*” of arguments, it becomes EXPR_BEG, and the following line break is ignored ✦ Therefore the code below is not anonymous arguments ✦ “**” and “&” behave the same way Anonymous arguments == != Based on the principles

Slide 100

Slide 100 text

✦ Since both endless ranges and anonymous arguments were added later, the current behavior of ignoring line breaks is reasonable from a compatibility perspective Intentional or not?

Slide 101

Slide 101 text

✦ Grammar allows ‘\n’ for shift action but lexer ignores ‘\n’ ✦ The lexer state can be EXPR_BEG ✦ Then ‘\n’ is ignored ✦ The grammar state shifts ‘\n’ token #2. Unexpectedly ignores ‘\n’

Slide 102

Slide 102 text

✦ After the ‘(’ of ‘not’, it becomes EXPR_BEG | EXPR_LABEL, and the following line break is ignored ✦ Because this ‘\n’ is optional, it doesn't cause any problems if the lexer ignores it Line Breaks before ‘)’ '\n'? ')' ==

Slide 103

Slide 103 text

✦ Grammar doesn’t allow ‘\n’ but lexer emits ‘\n’ ✦ The state can’t be EXPR_BEG ✦ Then ‘\n’ is emitted ✦ The state’s lookahead set doesn’t include the ‘\n’ token, nor does the state shift the ‘\n’ token #3. Unexpectedly emits ‘\n’

Slide 104

Slide 104 text

✦ A syntax error occurs specifically when a line break is placed between global variables within the alias definition ✦ ‘\n’ is ignored when EXPR_FNAME but emitted when EXPR_END Line Breaks in “alias” Syntax Error unexpected '\n' Ignore ‘\n’ Emit ‘\n’

Slide 105

Slide 105 text

✦ When it's not a global variable, lex state is explicitly set to EXPR_FNAME | EXPR_FITEM then ‘\n’ is ignored ✦ I think this is not intentional Line Breaks in “alias”
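
The difference between the two kinds of alias can be probed with Ripper, which parses without executing anything. The alias names are invented for this sketch, and the expected results follow the behavior described on these two slides rather than an independent check:

```ruby
require "ripper"

# Hypothetical helper: Ripper.sexp returns nil when parsing fails.
def parses?(src)
  !Ripper.sexp(src).nil?
end

# Method names: the line break between them is ignored.
method_alias_with_break = parses?("alias new_to_s\nto_s")

# Global variables: fine on one line, but a line break between them
# is emitted as a token that the grammar does not expect.
gvar_alias_one_line   = parses?("alias $new_out $stdout")
gvar_alias_with_break = parses?("alias $new_out\n$stdout")

puts method_alias_with_break
puts gvar_alias_one_line
puts gvar_alias_with_break
```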

Slide 106

Slide 106 text

✦ A syntax error occurs when a line break is placed after “BEGIN” or “END” ✦ Is this intentional? BEGIN and END Syntax Error unexpected '\n' Syntax Error unexpected '\n'

Slide 107

Slide 107 text

✦ Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be completed ✦ Exception: A statement doesn’t terminate for endless range and anonymous arguments (#1) ✦ Hypothesis: Ignoring line breaks where expressions cannot end ✦ Exception: A line break is emitted for global variable alias, BEGIN and END (#3) Verify the hypothesis

Slide 108

Slide 108 text

✦ It's difficult to understand the fundamental principles of grammar with the chaos of lex state ✦ The order of grammar is necessary to face chaos ✦ Use the fact that both the parser and lex state are automata that take tokens as input, and that we can create a new automaton by combining two automata ✦ To validate the principles regarding line breaks in Ruby grammar, we searched for exceptions and found some ✦ However, for the most part, the hypothesis seems to be correct Chaos and Order

Slide 109

Slide 109 text

Conclusion

Slide 110

Slide 110 text

✦ I don't want to use the new features of Lrama to manage lex state ✦ I want to remove lex state completely ✦ In principle, a statement terminates if a line break is present at a point where a statement can be completed ✦ Therefore, the parser simply needs to send instructions to the lexer like 'ignore line breaks' or ‘return line break as token' depending on the current parser state Next step arg + arg Ignore ‘\n’ Emit ‘\n’ Ignore ‘\n’ Lexer Emit ‘\n’ Parser

Slide 111

Slide 111 text

✦ Furthermore, it needs to be able to handle the exceptions we've identified this time by writing instructions on the grammar ✦ Why describe it in the grammar? Because it's the grammar that decides the language and the parser Next step arg : arg tDOT2 arg | arg tDOT2 %ignore-token('\n') arg + Ignore ‘\n’ Emit ‘\n’ Ignore ‘\n’ Lexer Parser

Slide 112

Slide 112 text

✦ Look at line breaks in Ruby's grammar, a character that's surprisingly interesting when you consider it ✦ To understand how line breaks are treated in current Ruby, it is necessary to understand the behavior of lex state, which is a doorway to chaos ✦ Bringing lex state into the ordered systems of grammar and automata makes it possible to understand lex state behavior Ruby's Line Breaks

Slide 113

Slide 113 text

✦ In principle, a statement terminates if a line break is present at a point where a statement can be completed ✦ Exception ✦ A statement doesn’t terminate for endless range and anonymous arguments ✦ A line break is emitted for global variable alias, BEGIN and END Principle and exceptions

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

Day 2: 11:10 - 11:40

Slide 116

Slide 116 text

Day 2: 16:20 - 16:50

Slide 117

Slide 117 text

✦ akr, nurse and other committers ✦ ESM, Inc. Parser Club ✦ Dragon Book study group ✦ LR parser gangs ✦ Contributors and supporters for Lrama and parse.y ✦ sakahukamaki Acknowledgements

Slide 118

Slide 118 text

Thank you !!!