Ruby's Line Breaks

Ruby's Line Breaks April 16, 2025 in RubyKaigi 2025 @yui-knk
Yuichiro Kaneko

✦ Yuichiro Kaneko ✦ yui-knk (GitHub) / spikeolaf (Twitter) ✦
Treasure Data ✦ Engineering Manager of Applications Backend About me

PR: We are Gold sponsor!

Day 3: Drink up 🍻

TD and Ruby committers twitter: @nalsh GitHub: @nurse twitter: @k_tsj
GitHub: @k-tsj twitter: @ spikeolaf GitHub: @yui-knk twitter: @mineroaoki GitHub: @aamine twitter: @nahi GitHub: @nahi Applications Backend

Attendees from TD @spikeolaf @nalsh @k_tsj @makimoto @ citystar (GH)
@ybiquitous @yokoto @tomog105 Anne Ju

✦ Yuichiro Kaneko ✦ yui-knk (GitHub) / spikeolaf (Twitter) ✦
CRuby committer, mainly develop parser generator and parser ✦ Lrama LALR (1) parser generator (2023, Ruby 3.3) ✦ The Bison Slayer ✦ The parser monster ✦ The dawn bringer of the parser world ✦ Ripper Rearchitecture (2024, Ruby 3.4) ✦ Code positions to RNode (2018, Ruby 2.6) ✦ RubyVM::AbstractSyntaxTree (2018, Ruby 2.6) About me

Introduction

✦ This code consists two method dispatches ✦ Calls `#p`
method with 1 ✦ Calls `#p` method with 2 ✦ In Ruby, you can use line breaks to separate meaningful chunks of code Line Breaks in Ruby Grammar

✦ This code consists one method dispatch ✦ `Integer#+` is
called for `1` with `2` Line Breaks in Ruby Grammar

✦ There are signi fi cant difference between these two
codes ✦ #1: Due to the line break, the fi rst and second lines are interpreted as distinct method calls ✦ #2: The two lines are combined into a single line of code ✦ In other words, the second code behaves as if the line break is ignored Line Breaks are the question

✦ Hypothesis: In principle, a statement terminates if a line
break is present at a point where a statement can be completed Principles of Line Breaks in Ruby https://x.com/tanaka_akr/status/1870679443376947467

✦ The fi rst line of this code becomes a
complete statement as a method call without arguments ✦ Therefore this code is two method calls, not a single method call with an argument ✦ The principle holds true #1: method call w/o args !=

✦ The fi rst line of this code isn’t a
complete statement because syntax error is raised ✦ Therefore this code is one method call with two arguments ✦ The principle holds true #2: method call w/ args == Syntax Error

✦ “1 +” isn’t a complete statement ✦ Therefore this
code is interpreted as “1 + 2” ✦ The principle holds true #3: binary operator Syntax Error ==

✦ “-” isn’t a complete statement ✦ Therefore this code
is interpreted as ‘-“str”’ ✦ The principle holds true #4: unary operator Syntax Error ==

✦ Because the ternary operator's scope includes the expression following
the colon, it will not be able to be separated in the middle of that expression ✦ The principle holds true #5: ternary operator == Syntax Error

✦ So far, the hypothesis seems to be correct ✦
Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be complete ✦ However, are there truly no counterexamples in every part of Ruby's grammar …? No counterexamples exist …?

✦ The relationships of Language, Grammar, Automaton and Parser ✦
Investigate how line breaks are treated ✦ Verify the hypothesis ✦ Understand the principles of line breaks in Ruby grammar Today’s topics

Language, Grammar, Automaton and Parser

✦ A (formal) language is a subset of words ✦
Some words belong to Ruby language ✦ Others don’t What is Language?

✦ Even so these codes are transcendental and imbroglio codes
(TRICK), they belong to Ruby language. Ruby https://github.com/tric/trick2022/blob/master/01-tompng/entry.rb https://github.com/tric/trick2022/blob/master/06-mame/entry.rb

✦ At a glance, this code seems Ruby code, however
it doesn’t belong to Ruby language Not Ruby

✦ (Ruby) language is a in fi nite set of
words ✦ Grammar is a fi nite set of rules which de fi ne language What is Grammar? Grammar Language …

✦ Grammar provides structure to the language Grammar and structure
+ 1 2 3 * * + 1 2 3 Correct Wrong

✦ Finite automaton is 5-tuple ✦ is a fi nite
set of states ✦ is a fi nite set of input symbols ✦ is a transition function, from state to state by input symbol ✦ is an initial state ✦ is a set of accepting states (Q, Σ, δ, q0 , F) Q Σ δ q0 F What is Automaton?

✦ Let's consider a vending machine that sells juice at
¥110 ✦ Put in a 100-yen coin and 10-yen coin ✦ Press the “purchase” button Vending machine

✦ In the case of a vending machine, ✦ There
are 3 inputs, 100-yen coin, 10-yen coin and pressing the “purchase” button ✦ Initial state is that no coins are inserted ✦ Accepting state is that 100-yen coin and 10-yen coins are inserted then the “purchase” button is pressed Vending machine as automaton q0 ¥100 q1 q2 ¥10 q3 q4 ¥100 ¥10 “purchase"

✦ Only limited inputs are accepted in each state ✦
For example, on ✦ ¥10 coin is accepted ✦ ¥100 coin is not accepted ✦ “purchase” button is not accepted q1 Vending machine as automaton q0 ¥100 q1 q2 ¥10 q3 q4 ¥100 ¥10 “purchase"

✦ There are known methods to convert a Non-deterministic Finite
Automaton (NFA) to a Deterministic Finite Automaton (DFA) ✦ It is known that the minimum DFA is unique ✦ It's possible to combine two automata into a single automaton Theory of Automaton

✦ If you attended Fujinami-san's talk this morning, you likely
heard precisely about research on automaton Theory of Automaton https://speakerdeck.com/makenowjust/make-parsers-compatible-using-automata-learning

✦ @.bookstore at RubyKaigi 2025 ✦ “ܭࢉཧ࿦ͷجૅɹ[ݪஶୈ3൛] 1.ΦʔτϚτϯͱݴޠ” ✦ “നͱࠇͷͱͼΒ:
ΦʔτϚτϯͱܗࣜݴޠΛΊ͙Δ๯ݥ” ✦ “ਖ਼نදݱٕज़ೖ໳――࠷৽Τϯδϯ࣮૷ͱཧ࿦తഎܠ” Books for Automaton

✦ Is the vending machine related to a parser? ✦
No, parser is not a vending machine, but parser is an automaton Parser as fi nite-state automaton

✦ Let’s think about a grammar which only includes single
class de fi nition ✦ “class”, identi fi er (id) then “end” ✦ This can be represented as an automaton with four states, taking “class” id and “end” as inputs Parser for very simple grammar class A end program : "class" id "end" P1 P2 P3 P4 class id end Grammar Language Automaton

✦ For example, on P1 ✦ “class” is accepted then
goes to state P2 ✦ id is not accepted then syntax error ✦ “end” is not accepted then syntax error How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton

✦ For example, on P2 ✦ “class” is not accepted
then syntax error ✦ id is accepted then goes to state P3 ✦ “end” is not accepted then syntax error How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton

✦ For example, on P4 ✦ “class” is not accepted
then syntax error ✦ id is not accepted then syntax error ✦ “end” is not accepted then syntax error ✦ End of Input is accepted then the parsing process completed without errors How the parser works program : "class" id "end" P1 P2 P3 P4 class id end Grammar Automaton

✦ Let’s allow arbitrary levels of nesting for class de
fi nitions Parser for complex grammar class A class B … end end program : class_def class_def : "class" id body "end" body : class_def | /* empty */ Grammar Language Automaton B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end B1 /* empty */

✦ When parsing nested class de fi nitions, a new
automaton corresponding to the class de fi nition is created ✦ Parsed until “class A”, so that the second automaton is on C3 Automaton with a stack class A class B end end Input C1 C2 C3 C5 P1 P2 class_def C4 class id body end

✦ To parse “class B end” as the body of
class A, create new automatons Automaton with a stack class A class B end end Input B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end C1 C2 C3 C5 C4 class id body end

✦ After reading to “class B end”, the bottom automaton
enters the accepting state Automaton with a stack class A class B end end Input B1 B2 C1 C2 C3 C5 P1 P2 class_def class_def C4 class id body end C1 C2 C3 C5 C4 class id body end

✦ When the automaton reaches the accepting state, it is
popped from the stack, and the original automaton transitions to the next state (C4) Automaton with a stack class A class B end end Input C1 C2 C3 C5 P1 P2 class_def C4 class id body end

✦ Parser is an automaton that takes tokens such as
“class”, id, and “end” as input ✦ A fi nite automaton with a stack is called a pushdown automaton ✦ By using a stack, pushdown automaton can handle languages with in fi nite nesting ✦ Implementation of LR parser is pushdown automaton LR Parser as pushdown automaton

✦ LR parsers have two main operations ✦ Shift: Moves
the automaton to the next state ✦ Reduce: Pops the automaton that has reached the accepting state from the stack LR parser actions C1 C2 C3 C5 C4 C1 C2 C3 C5 C4 Shift Reduce

✦ How does LR parser choose the correct automaton when
multiple automatons are applicable? ✦ “body : class_def” is correct for left,“body : /* empty */” is correct for right Chose correct automaton class A end B1 B2 C1 C2 C3 C5 class_def C4 class id body end C1 C2 C3 C5 C4 class id body end class A class B end end B1 /* empty */

✦ Determine the correct rule by looking at the next
token ✦ In the right case, next token is “end” ✦ In the case of the empty string rule, “end” can be shifted after the automaton is popped ✦ The set of tokens that can follow a certain rule is called a lookahead set Lookahead set class A end C1 C2 C3 C5 C4 class id body end B1 /* empty */ Next token is “end” Match with new token Input

✦ Chomsky hierarchy ✦ Four formal grammar classes consist hierarchy
✦ There are correspondences between grammars and automatons Grammar class and automaton Regular Context-free Context-sensitive Recursively enumerable Linear-bounded non-deterministic Turing machine Non-deterministic pushdown automaton Finite-state automaton Turing machine

✦ Grammar determines the scope of the language ✦ Grammar
can be converted to an automaton ✦ Parser is an automaton that takes tokens as input Language, Grammar, Automaton and Parser class A … end program : class_def class_def : "class" id body "end" … Grammar Language Automaton = Parser C1 C2 C3 C5 C4 class id body end Determine Convert

How Line Breaks are treated?

✦ A lexer is the component that divides input string
into meaningful chunks, which are called tokens Parser and Lexer C1 C2 C3 C5 P1 P2 class_def C4 class id body end class A class B end end Input class A class B end end Tokens Parser Lexer

✦ It's important to understand that the lexer intelligently handles
line breaks, sometimes ignoring them and sometimes not #1: Lexer ignores Line Breaks method_1 arg 1 + 2 Input method_1 ‘\n’ arg 1 + 2 Tokens Not ignored Ignored

✦ On the other hand, regarding the grammar, statements are
separated by line breaks (‘\n’) by a rule Grammar for statements

✦ Lexer returns or ignores ‘\n’ based on lex state
How lexer works Returns ‘\n’ token Ignores ‘\n’ Returns ‘\n’ token Checks lex state

✦ 13 lex state fl ags exits! ✦ EXPR_BEG: BEGinning
of expression ✦ EXPR_END: END of expression ✦ EXPR_ENDARG: END of ARGument ✦ EXPR_ENDFN: END of Function NAME ✦ EXPR_ARG: ARGument ✦ EXPR_CMDARG: CoMmanD ARGument ✦ EXPR_MID: MIDdle of expression ✦ EXPR_FNAME: immediate after “def” keyword, might be Function NAME ✦ EXPR_DOT: immediate after DOT (dot includes ‘.’ ‘&.’ ‘::’) ✦ EXPR_CLASS: immediate after “class” keyword ✦ EXPR_LABEL: label is possible, label is `a:` ✦ EXPR_LABELED: immediate after label ✦ EXPR_FITEM: just before fi tem. fi tem is token after undef or alias ✦ Only written the typical meanings, there are also exceptions They are lex state fl ags !

✦ EXPR_CLASS means immediate after “class” keyword ✦ The lexer
ignores line breaks when EXPR_CLASS, so this code works without any issues EXPR_CLASS ==

✦ EXPR_DOT means immediate after DOT ✦ Dot includes ‘.’
‘&.’ ‘::’ ✦ The lexer ignores line breaks when EXPR_DOT, so this code works without any issues EXPR_DOT ==

✦ Ruby's grammar has the concepts of EXPR_BEG and EXPR_END
✦ In the '1 + 2' example, the EXPR_BEG and EXPR_END states are repeated like this EXPR_BEG and EXPR_END 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END

✦ Lexer ignores line breaks when EXPR_BEG ✦ Therefore, this
code is equivalent to “1 + 2” EXPR_BEG 1 + 2 Input 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END ‘\n’ 1 + 2 ==

✦ Lexer emits line break tokens when EXPR_END ✦ Therefore,
in this code, it's treated as two lines of code: “1”, and “+2”, rather than "1 + 2” EXPR_END 1 + 2 Input 1 ‘\n’ 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END 1 + 2 != + EXPR_BEG

✦ Lex states can be thought of as automaton that
take tokens as input ✦ While lexers are typically described as automaton that take characters as input, Ruby's lexer is an automaton in a dual sense Lexer is an automaton Automaton for characters ‘|’ ‘|’ ‘=’ |= || | ||= Otherwise ‘=’ Otherwise Automaton for tokens BEG END tINTEGER +, \n +@

✦ What is this exception case? How lexer works again
What is this? Returns ‘\n’ token Understand!

✦ ‘\n’ is needed after mandatory keyword arguments for method
de fi nition without parentheses ✦ Ref: [Bug #9669] Method de fi nition w/o parentheses ‘\n’ is needed !=

✦ Even though the lexer intelligently ignores line breaks, there
are situations where that doesn't work ✦ Lex state is EXPR_END after “arg” then lexer emits line breaks ✦ For example, “1+ 2” is “arg” ✦ In such cases, need to write line break token in the grammar #2: Grammar allows Line Breaks ‘\n’ is needed in grammar rules EXPR_END EXPR_END

✦ In a 2006 mailing list discussion, they were talking
about ignoring line breaks where expressions cannot end ✦ [ruby-dev: 29206] Another evidence https://public-inbox.org/ruby-dev/[email protected]/

✦ Personally, I feel it's too dif fi cult, why
we can maintain this… ✦ Then I asked other committers what they think about lex state ✦ “This is impossible to understand, isn't it?” by usa ✦ “I'm keeping my distance because it's scary” by ko1 ✦ “It's hard to believe anyone could completely understand and work with all of lex state” by akr Monstrous Lex state

✦ lex_stateͬͯεΫϦϓτͷͲ͜ΛಡΜͰΔ͔ʹΑͬͯෳࡶʹঢ়ଶ ͕มΘͬͯී௨ͷਓؒʹ௥͍੾ΕΔ΋ͷ͡Όͳ͍ؾ͕͢ΔΜ͚ͩ Ͳɺ͜ΕΛࢻతͱ͍͏͔จֶతʹදݱ͢ΔͳΒԿͯݺ΂͹͍͍ʁ ✦ lex_state changes so intricately based
on the script's reading position that it seems beyond human comprehension. How could we describe this poetically or in a literary sense? Q&A session with Gemini

✦ ʮίʔυʹӅ͞ΕͨʰؾʱͷྲྀΕʯ ✦ ཧ༝: ໨ʹ͸ݟ͑ͳ͍͚ΕͲɺίʔυશମͷҙຯͷྲྀΕ΍ߏ଄ʢؾ ຺ʣΛܗ࡞ΓɺղऍΛํ޲෇͚͍ͯΔɻ౦༸ࢥ૝తͳਆൿੑΛࠐ Ίͯɻ ✦ “The hidden
fl ow of Qi within the code” ✦ Reason: An unseen force that forms the semantic fl ow and structure of the code, directing its interpretation, with a touch of Eastern mysticism Lex state is Qi

✦ ͲΕ΋ૉఢͰ͕͢ɺlex_stateΛׂͱ͍͍΋ͷͱͱΒ͑ͯΔදݱ͕ ଟ͍ͱࢥ͍·͢ɻͲͪΒ͔ͱ͍͏ͱअѱͳࠇຐज़ͷྨͳؾ͕͢Δ ͷͰɺͦ͏͍͏ํ޲Ͱߟ͑ͯ΋Β͑·͔͢ ✦ While these are nice, but
they describe lex_state quite favorably. I perceive it as a form of malevolent black magic. So could you consider it from that perspective?

✦ ʮ։ൃऀͷཧੑΛḝΉɺ໊ঢ়͕͍ͨ͠ঢ়ଶભҠʯ ✦ χϡΞϯεɿ ͦͷෳࡶ͕͞ਓؒͷཧղΛ௒͑ɺਂ௥͍͢Δͱਫ਼ਆ Λফ໣ͤ͞ΔΑ͏ͳɺ৮Εͯ͸ͳΒͳ͍ڰؾɻʢΫτΡϧϑਆ࿩త ͳڹ͖ʣ ✦ “An indescribable
state transition that erodes the developer's sanity” ✦ Nuance: A madness that transcends human comprehension, a forbidden insanity that drains the mind upon close inspection. (With a Cthulhu Mythos vibe.) An indescribable state transition

✦ ʮύʔαʔͷਂ෵ʹજΉɺࠞಱʢΧΦεʣ΁ͷ൶ʯ ✦ χϡΞϯεɿ ͦͷෳࡶͳ࢓૊Έ͕ɺݴޠղऍͷ࠷΋Ԟਂ͘ɺਓؒ ʹ͸ཧղෆೳͳྖҬʢਂ෵ɾࠞಱʣ΁ͱܨ͕͍ͬͯΔɺఈ஌Εͳ ͍ෆؾຯ͞ɻ ✦ “A doorway
to chaos lurking in the parser's abyss” ✦ Nuance: The unfathomable eeriness of its intricate mechanism, which leads to the deepest, incomprehensible realms of language interpretation (abyss and chaos). A doorway to chaos

✦ The independence of the parser and lexer ✦ Even
though we've seen so many examples of the parser and lexer cooperating What makes it so dif fi cult?

✦ It needs to set the state immediately before the
binary plus operator to be EXPR_END ✦ Grammar rule for the binary plus operator expression is “arg : arg + arg” ✦ Therefore, lex state should be EXPR_END when it comes to the end of arg Where should be EXPR_END? 1 + 2 EXPR_BEG EXPR_END EXPR_BEG EXPR_END arg + arg EXPR_BEG EXPR_END EXPR_BEG EXPR_END

✦ That means that it's only necessary to transition to
EXPR_END for all the tokens that come at the end of the 'arg' rule ✦ In relation to the First set, could we refer to them as the Last set? ✦ These tokens can be the last token of arg ✦ tINTEGER ✦ tSTRING_END ✦ “end” Last set of rules

✦ To determine the last set, expand the rule until
its fi nal symbol is a token ✦ “arg” includes “primary” ✦ “primary” can be expanded to tINTEGER and so on ✦ “primary” ends with ‘]’, “end” and so on ✦ Need to set the state to EXPR_END immediately after all the tokens obtained in this way How to know Last set arg : arg ‘+’ arg | arg '-' arg | … | primary primary : literal | tLBRACK aref_args ‘]' | k_if … k_end | … simple_numeric : tINTEGER | tFLOAT | tRATIONAL | tIMAGINARY literal : numeric | symbol

✦ Now, let's focus on the tokens themselves ✦ For
example, ‘^’ (caret) ✦ This is a binary operator used to calculate the exclusive OR for integers ✦ As it's a binary operator, similar to ‘+’, it should transition to EXPR_BEG immediately after ‘^’ The context of the token

✦ The caret is now used not only as a
binary operator but also as a unary operator, speci fi cally as a pin operator in pattern matching ✦ For now, even in the unary operator case, simply transitioning to EXPR_BEG works well ✦ However, if we change the lex state transition for the caret in the future, we'll need to consider both cases ✦ Lex state management is also dif fi cult because a single token can be used in grammatically very different places ‘^’ is not always binary operator

✦ Another reason why lex state is so dif fi
cult is that a state is used for various purposes ✦ For example, EXPR_BEG is not only used to determine whether to ignore or emit line break but also whether to treat two bars as a set or as separate bars Lex state overloading This is “||” (!EXPR_BEG) This is ‘|’ and ‘|’ (EXPR_BEG)

✦ Fixing [Bug #10653] caused [Bug #11456] and [Bug #11849]
✦ “All bug fi xes are incompatibilities” in RubyKaigi 2019 Fixing a bug caused other bugs

✦ Ruby's parser handles line breaks by having the lexer
decide whether to ignore or emit it, depending on the context ✦ This context-dependent behavior is managed by something called lex state ✦ It’s clear that lex state is incredibly hard to manage, and it can be described as “a doorway to chaos” A doorway to chaos lurking in the parser's abyss

Day 0: Interview for nobu Nakada-san, what do you think
about Lex state? Well, that's a necessary evil, isn't it?

Day 0: Interview for nobu Could you tell me how
much you know about Lex state behavior? Hmm, I'm at about 50% understanding at the moment.

Day 0: Interview for nobu What's your usual method when
you need to modify things related to Lex I check all the areas that might be affected, but the last part is just a feeling.

Overcome chaos to restore order

✦ Can subtraction be written in every place where addition
can be written in Ruby? ✦ Yes, it's clear from the grammar ✦ Not only subtraction but also all binary operators can be checked by the grammar Grammar as Order arg : arg '+' arg | arg '-' arg | arg '*' arg | arg '/' arg | arg '^' arg | arg tCMP arg ... Grammar

✦ Is it possible to write the same element for
both the default value of a formal argument (“expr1”) and the actual argument (“expr2”)? ✦ Yes, it's also clear from the grammar that both of them allow “arg_value” Grammar as Order f_args : f_arg ',' f_optarg(arg_value) ',' … call_args : args opt_block_arg args : arg_value Grammar

✦ What are the problems if lex state is chaotic?
✦ Today's theme is the fundamental principles of line breaks in Ruby's grammar ✦ However, some parts of Ruby's grammar are controlled by lex state. This means it's dif fi cult to fi nd fundamental principles from chaos Problems with being chaotic program : class_def class_def : "class" id body "end" … Grammar Automaton = Parser Lex state “A doorway to chaos” Order Chaos Chaos

✦ Defeat chaos by modeling lex state ✦ Modeling is
simplifying an object by focusing on its important properties to make it easier to comprehensively understand the structure of the object What’s modeling and why

✦ Parser and lex states are automatons that take tokens
as input ✦ Parser is an automaton that takes tokens such as “class”, id, and “end” as input ✦ Lex states can be thought of as automaton that take tokens as input Parser and Lex State Parser Lex state BEG END tINTEGER +, \n +@ A1 A2 A3 primary + A4 primary P1 P2 tINTEGER

✦ It's possible to combine two automata into a single
automaton ✦ Therefore, it's possible to build a new automaton from these two ✦ For example, the state after reading '+' is the A3 state in the parser, and the lex state is EXPR BEG Combine automatons Parser BEG END tINTEGER +, \n +@ A1 A2 A3 primary + A4 primary BEG END BEG END Lex state

✦ Extend Lrama’s grammar for describing lexer state Describe lex
state transition on grammar fi le Type of states Initial state Aliases Transitions

✦ To extract the lex state transitions from an existing
implementation, simply focus on the tokens the lexer returns How to extract lex state transition EXPR_ARG if IS_AFTER_OPERATOR EXPR_BEG if ! IS_AFTER_OPERATOR

✦ As the lex state is updated within certain grammar
rules, introduce new syntax to the grammar that allows specifying lex state ✦ “%ls” stands for Lexer State Transition in grammar Transitions

✦ Runs lrama command once lexer state is written ✦
Lex state transitions for each token is shown ✦ For example, “end” token has two transitions ✦ If “end” is method name, transits to EXPR_ENDFN ✦ Otherwise transits to EXPR_END Lex state for tokens EXPR_ENDFN EXPR_END

✦ Lex state transitions for each grammar rule is shown
✦ Method de fi nitions always result in the EXPR_END state ✦ Even so lex state after “end” token can be EXPR_ENDFN or EXPR_END Lex state for rules EXPR_END

✦ By combining lex state transitions for each token and
each rule, the lex state transitions for each parser state becomes clear ✦ For example, the parser immediately after reading the binary operator ‘+’ is always in the EXPR_BEG state Lex state for parser state EXPR_END

✦ To test the hypothesis about line breaks, let's look
for counterexamples ✦ Hypothesis: In principle, a statement terminates if a line break is present at a point where a statement can be completed Verify the hypothesis https://x.com/tanaka_akr/status/1870679443376947467

✦ Grammar allows ‘\n’ for reduce action but lexer ignores
‘\n’ ✦ The lexer state can be EXPR_BEG ✦ Then ‘\n’ is ignored ✦ The grammar state’s lookahead set includes ‘\n’ token #1. Unexpectedly ignores ‘\n’

✦ After the dots of an endless range, it becomes
EXPR_BEG, and the following line break is ignored ✦ Therefore the code below is not an endless range and “b” but an normal range Endless range == != Based on the principles

✦ After the “*” of arguments, it becomes EXPR_BEG, and
the following line break is ignored ✦ Therefore the code below is not anonymous arguments ✦ “**” and “&” are same Anonymous arguments == != Based on the principles

✦ Since both endless ranges and anonymous arguments are added
later, the current behavior of ignoring line breaks is reasonable from compatibility perspective Intentional or not?

✦ Grammar allows ‘\n’ for shift action but lexer ignores
‘\n’ ✦ The lexer state can be EXPR_BEG ✦ Then ‘\n’ is ignored ✦ The grammar state shifts ‘\n’ token #2. Unexpectedly ignores ‘\n’

✦ After the ‘(’ of ‘not’, it becomes EXPR_BEG |
EXPR_LABEL, and the following line break is ignored ✦ Because this ‘\n’ is optional, it doesn't cause any problems if the lexer ignores it Line Breaks before ‘)’ '\n'? ')' ==

✦ Grammar doesn’t allow ‘\n’ but lexer emits ‘\n’ ✦
The state can’t be EXPR_BEG ✦ Then ‘\n’ is emitted ✦ The state’s lookahead set doesn’t include ‘\n’ token nor shirt ‘\n’ token #3. Unexpectedly emits ‘\n’

✦ A syntax error occurs speci fi cally when a
line break is placed between global variables within the alias de fi nition ✦ ‘\n’ is ignored when EXPR_FNAME but emitted when EXPR_END Line Breaks in “alias” Syntax Error unexpected '\n' Ignore ‘\n’ Emit ‘\n’

✦ When it's not a global variable, lex state is
explicitly set to EXPR_FNAME | EXPR_FITEM then ‘\n’ is ignored ✦ I think this is not intentional Line Breaks in “alias”

✦ A syntax error occurs when a line break is
placed after “BEGIN” or “END” ✦ Is this intentional? BEGIN and END Syntax Error unexpected '\n' Syntax Error unexpected '\n'

✦ Hypothesis: In principle, a statement terminates if a line
break is present at a point where a statement can be completed ✦ Exception: A statement doesn’t terminate for endless range and anonymous arguments (#1) ✦ Hypothesis: Ignoring line breaks where expressions cannot end ✦ Exception: A line break is emitted for global variable alias, BEGIN and END (#3) Verify the hypothesis

✦ It's dif fi cult to understand the fundamental principles
of grammar with the chaos of lex state ✦ The order of grammar is necessary to face chaos ✦ Using the fact that both the parser and lex state are automata that take tokens as input, and that we can create a new automaton by combining two automata ✦ To validate the principles regarding line breaks in Ruby grammar, search for exceptions, and fi nd some exceptions ✦ However, for the most part,the hypothesis seems to be correct Chaos and Order

Conclusion

✦ I don't want to use the new features of
Lrama to manage lex state ✦ I want to remove lex state completely ✦ In principle, a statement terminates if a line break is present at a point where a statement can be completed ✦ Therefore, the parser simply needs to send instructions to the lexer like 'ignore line breaks' or ‘return line break as token' depending on the current parser state Next step arg + arg Ignore ‘\n’ Emit ‘\n’ Ignore ‘\n’ Lexer Emit ‘\n’ Parser

✦ Furthermore, it needs to be able to handle the
exceptions we've identi fi ed this time by writing instructions on the grammar ✦ Why describe it in the grammar? Because it's the grammar that decides the language and the parser Next step arg : arg tDOT2 arg | arg tDOT2 %ignore-token('\n') arg + Ignore ‘\n’ Emit ‘\n’ Ignore ‘\n’ Lexer Parser

✦ Look at line breaks in Ruby's grammar, a character
that's surprisingly interesting when you consider it ✦ To understand how line breaks are treated in current Ruby, it is necessary to understand the behavior of lex state, which is a doorway to chaos ✦ Bringing lex state into the ordered systems of grammar and automata makes it possible to understand lex state behavior Ruby's Line Breaks

✦ In principle, a statement terminates if a line break
is present at a point where a statement can be completed ✦ Exception ✦ A statement doesn’t terminate for endless range and anonymous arguments ✦ A line break is emitted for global variable alias, BEGIN and END Principle and exceptions

Day 2: 11:10 - 11:40

Day 2: 16:20 - 16:50

✦ akr, nurse and other committers ✦ ESM, Inc. Parser
Club ✦ Dragon Book study group ✦ LR parser gangs ✦ Contributors and supporters for Lrama and parse.y ✦ sakahukamaki Acknowledgements

Thank you !!!

Ruby's Line Breaks

Ruby's Line Breaks

More Decks by yui-knk

Other Decks in Programming

Featured

Transcript