Liberating Ruby's Parser from Lexer Hacks

RubyKaigi 2026, Hakodate, Hokkaido
Hakodate Citizen Hall, 22 April 2026
DATE: 2026.04.22 SPEAKER: @ydah

Ruby's grammar is ambiguous. …well, that's not quite right.

Where is Ruby's grammar ambiguous, and why?

lex_state: The Mechanism Hiding the Answers
13 bit flags, ~100 calls to SET_LEX_STATE scattered across parse.y.
It has been resolving Ruby's grammatical ambiguities while hiding their structure.

What I Found When I Opened the Box
Dismantled lex_state using a technique called PSLR.
Ruby's grammatical ambiguities fall into six distinct layers: Truly grammatical / LALR limitations / Semantic problems.
SET_LEX_STATE: ~100 → 0

Yudai Takada @ydah https://ydah.net/

CRuby committer, Lrama committer

SmartHR Product Engineer

Otsu City Traditional Performing Arts Hall Saturday, July 18, 2026

SECTION 01: The Opaque Box
What lex_state has been hiding

In most languages, lexing is Context-Free
The lexer splits input into tokens without caring about context.

But Ruby Is Different: *
Multiplication    a * b        Token: '*'
Splat             foo *args    Token: tSTAR
Rest Parameter    def f(*a)    Token: tSTAR

Same story: {
Hash literal     {a: 1}          Token: tLBRACE
Block            foo { |x| x }   Token: '{'
Command block    foo(1) { x }    Token: tLBRACE_ARG

And: <<
Left shift / append    a << b       Token: tLSHFT
Heredoc start          <<HEREDOC    Token: tSTRING_BEG

lex_state: 13 bit flags
    enum lex_state_bits {
        EXPR_BEG_bit,      /* ignore newline, +/- is a sign. */
        EXPR_END_bit,      /* newline significant, +/- is an operator. */
        EXPR_ENDARG_bit,   /* ditto, and unbound braces. */
        EXPR_ENDFN_bit,    /* ditto, and unbound braces. */
        EXPR_ARG_bit,      /* newline significant, +/- is an operator. */
        EXPR_CMDARG_bit,   /* newline significant, +/- is an operator. */
        EXPR_MID_bit,      /* newline significant, +/- is an operator. */
        EXPR_FNAME_bit,    /* ignore newline, no reserved words. */
        EXPR_DOT_bit,      /* right after `.', `&.' or `::', no reserved words. */
        EXPR_CLASS_bit,    /* immediate after `class', no here document. */
        EXPR_LABEL_bit,    /* flag bit, label is allowed. */
        EXPR_LABELED_bit,  /* flag bit, just after a label. */
        EXPR_FITEM_bit,    /* symbol literal as FNAME. */
        EXPR_MAX_STATE
    };
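One way to watch these flags change on a stock CRuby is the Ripper standard library, whose `Ripper.lex` reports the lexer state recorded with each token. The snippet below is only an observation aid, not part of the talk's implementation; the helper name `show_lex_states` is made up here, and the state names printed depend on the Ruby version in use.

```ruby
require "ripper"

# Returns [event, text, state-name] for every token Ripper reports.
# `show_lex_states` is a hypothetical helper name, not a Ripper API.
def show_lex_states(source)
  Ripper.lex(source).map do |(_pos, event, text, state)|
    [event, text, state.to_s]
  end
end

["a * b", "foo *args", "def f(*a); end"].each do |src|
  puts src
  show_lex_states(src).each do |event, text, state|
    puts "  #{event} #{text.inspect} #{state}"
  end
end
```

Comparing the state shown just before each `*` gives a feel for how the same character is reached under different flags.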

lex_state Tells You What, Not Why
"We're in EXPR_BEG, so * is a splat." OK. But why is this EXPR_BEG? Why is * a splat here and multiplication there?
lex_state gives the verdict but doesn't distinguish the cause.

3 technical problems
01 Tight Coupling: The lexer depends on parser internals. Changes ripple unpredictably.
02 Poor Maintainability: One fix can break something else. Regression-prone.
03 Hidden Semantics: Language semantics are buried in lexer-state logic.

Among the ambiguities lex_state resolves, aren't some of them not actually ambiguous in the grammar?

lex_state: Everything Crammed into One Box
Genuinely grammatical ambiguity
LALR compression artifacts
Semantic issues
lex_state doesn't distinguish between them. It treats all three the same way.

Open the Box: See the True Causes
That was the motivation behind PSLR.

SECTION 02: The Lens
PSLR Theory and the 6 Layers of Ambiguity

The Information Gap (Traditional LR Parser)
Lexer → Parser: one-way flow.
The lexer doesn't know where the parser is in the grammar.
lex_state was the workaround: manual flags as a bridge.

PSLR Flips the Relationship
PSLR(1) = Pseudo-Scannerless Minimal LR(1)
Shares the parser's LALR state with the lexer.
The lexer can ask before returning a token: "If I return this token, will the parser accept it?"
One-way flow → two-way flow

The Theoretical Foundation: PSLR(1)
The Pseudo-Scanner: Only recognizes tokens the current parser state accepts.
A Known Problem: When parser states are merged, pseudo-scanner behavior can break.
Coming Later: This becomes Layer 5 in this talk.

PSLR: The Core Idea
If the parser's LALR state already knows which tokens could come next, there's no need for the lexer to maintain its own state.

A Byproduct
The original goal: remove lex_state using PSLR(1).
The real value: removing lex_state revealed different kinds of ambiguity, layer by layer, and that gave me a map.

LALR Parser States = Item Sets
In an LALR parser, a parser state = a set of positions inside grammar rules.
Key point: it often already contains useful information about the next token.

Example: State after keyword_super
    primary → keyword_super • '(' call_args rparen
    primary → keyword_super •
    command_call → keyword_super • command_args
From this state, both '(' (leading to '*') and command_args (leading to tSTAR) could follow.

Disambiguating {
The parser's action selection after receiving a token, repurposed as a question before returning it. This is the core of PSLR.
3 candidates: tLBRACE (hash), '{' (block), tLBRACE_ARG (command block).
Look up the action table for each. If exactly one is accepted, that's the answer.
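The candidate check described above can be modelled in a few lines of Ruby. This is a minimal sketch with an invented action table: the state names (`:after_assign`, `:after_call`, and so on), the table contents, and the helper names are all hypothetical and do not come from parse.y or Lrama.

```ruby
# Hypothetical action table: parser state => tokens that have an action.
# State names and entries are invented for illustration only.
BRACE_ACTION_TABLE = {
  after_assign:  [:tLBRACE, :tIDENTIFIER, :tINTEGER],
  after_call:    [:"{", :"."],
  after_cmd_arg: [:tLBRACE_ARG, :","],
  merged:        [:tLBRACE, :"{"],   # two candidates survive here
}.freeze

BRACE_CANDIDATES = [:tLBRACE, :"{", :tLBRACE_ARG].freeze

# Toy stand-in for the action-table lookup.
def brace_state_accepts?(state, token)
  BRACE_ACTION_TABLE.fetch(state).include?(token)
end

# Returns the single accepted candidate, or nil when zero or several
# candidates are accepted and another layer has to decide.
def disambiguate_brace(state)
  accepted = BRACE_CANDIDATES.select { |tok| brace_state_accepts?(state, tok) }
  accepted.size == 1 ? accepted.first : nil
end

p disambiguate_brace(:after_assign)   # one candidate accepted
p disambiguate_brace(:merged)         # ambiguous at this level
```

Returning nil for the ambiguous case keeps the sketch honest about its limits: a single table lookup only settles the question when exactly one candidate survives.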

Auto-Generated: yy_state_accepts_token()
    int
    yy_state_accepts_token(int yystate, int yychar)
    {
        yysymbol_kind_t yytoken = YYTRANSLATE(yychar);
        int yyn = yypact[yystate];
        /* The state has only a default action: no per-token entry. */
        if (yypact_value_is_default(yyn)) return 0;
        yyn += yytoken;
        /* Out of range, or the slot belongs to another token. */
        if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken) return 0;
        yyn = yytable[yyn];
        /* Non-positive entry: accepted unless it is the error marker. */
        if (yyn <= 0) return !yytable_value_is_error(yyn);
        return 1;
    }

Layer 01 It Wasn't Ambiguous at All

Layer 1 Example: + after def
    def +(other)
      @value + other.value
    end
State S after shifting def:
yy_state_accepts_token(S, '+') → 0 (not accepted as binary addition)
Accepted only as a method name (fname category).
Traditionally: SET_LEX_STATE(EXPR_FNAME) told the lexer "next is a method name." But the action table already knew.

Layer 1 Discovery
These ambiguities were not ambiguous in the grammar at all. lex_state was unnecessarily mediating.

But the Direct Check Isn't Always Enough (Layer 1)
parse.y has many empty rules: opt_block_param → ε, opt_rescue → ε, $@N → ε
Default reduction kicks in → per-token entries absent → false negatives

Layer 02 Hidden by LALR Default Reduction

Layer 2: Empty Reduction Tracing
    int
    yy_state_eventually_accepts_token(int yystate, int yychar)
    {
        int visited[YYNSTATES] = {0};
        for (;;) {
            if (visited[yystate]) return 0;
            visited[yystate] = 1;
            if (yy_state_accepts_token(yystate, yychar)) return 1;
            int yyn = yydefact[yystate];
            if (yyn == 0 || yyr2[yyn] != 0) return 0;  // non-empty → stop
            // GOTO after ε-reduction
            ...
        }
    }
Traces chains of empty reductions (RHS length = 0).
No stack access needed.
Stops at non-empty reductions (→ Layer 3).
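To make the tracing loop concrete, here is a small Ruby model of it. It is a sketch under stated assumptions: the tables are toy data shaped after the `do` example in this talk, the state names `s0` to `s3` are hypothetical, and the real generated tables are far more compact than these hashes.

```ruby
# Toy tables; every state name and entry is hypothetical.
EPS_ACCEPTS = { s0: [], s1: [], s2: [:tSTAR], s3: [] }.freeze

# Default reduction per state as [lhs symbol, rhs length]; nil when absent.
EPS_DEFAULT_REDUCTION = {
  s0: [:"$@42", 0],
  s1: [:opt_block_param, 0],
  s2: nil,
  s3: [:expr, 2],            # non-empty: tracing must stop here
}.freeze

# After an empty reduction nothing is popped, so GOTO is looked up on
# the same state that performed the reduction.
EPS_GOTO = {
  [:s0, :"$@42"] => :s1,
  [:s1, :opt_block_param] => :s2,
}.freeze

def eventually_accepts?(state, token)
  visited = {}
  loop do
    return false if visited[state]          # guard against cycles
    visited[state] = true
    return true if EPS_ACCEPTS.fetch(state).include?(token)
    lhs, rhs_len = EPS_DEFAULT_REDUCTION.fetch(state)
    return false if lhs.nil? || rhs_len != 0
    state = EPS_GOTO.fetch([state, lhs])
  end
end

p eventually_accepts?(:s0, :tSTAR)
p eventually_accepts?(:s0, :"*")
```

Because an empty reduction consumes no stack entries, the loop only needs the current state, which is why this layer works without access to the parser stack.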

Layer 2 Example: * after do
    [1, 2, 3].each do
      *a, b = [4, 5, 6]
    end
State S0 after do.
Layer 1: 0 (default reduction).
Layer 2:
    S0 → $@42 → ε → GOTO → S1
    S1 → opt_block_param → ε → GOTO → S2
    yy_state_accepts_token(S2, tSTAR) → 1 ✓ / '*' → 0 ✗

Layer 2 Discovery
Mid-rule actions and optional clauses create ε-chains. Hidden by LALR's compression.

Layer 03 Beyond LALR(1) Lookahead Depth

Layer 3: Stack Traversal
    int
    yy_state_deep_accepts_token(int yystate, int yychar,
                                const short *stack_base, const short *stack_top)
    {
        // ...
        if (rhs_len > 0) {
            // Non-empty: unwind the stack
            const short *target = stack_top - (stack_consumed + rhs_len);
            int uncovered_state = *target;
            // GOTO from uncovered_state ...
        }
    }
YYSETSTATE_CONTEXT macro delivers stack pointers to the lexer.
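The stack-unwinding step can be illustrated with a Ruby model that simulates default reductions on a copy of the state stack. Everything named here is an assumption made for the illustration: the states, the `:name` nonterminal, the tables, and the `max_steps` guard are not taken from parse.y.

```ruby
# Toy tables; all names are hypothetical.
DEEP_ACCEPTS = {
  after_value: [:"*"],
  after_fname: [:"("],
}.freeze

# Default reductions as [lhs symbol, rhs length].
DEEP_DEFAULT = { after_ident: [:name, 1] }.freeze

# GOTO is looked up on the state uncovered after popping the RHS.
DEEP_GOTO = {
  [:after_kw,  :name] => :after_value,
  [:after_def, :name] => :after_fname,
}.freeze

def deep_accepts?(stack, token, max_steps: 100)
  stack = stack.dup                       # never touch the real stack
  max_steps.times do
    top = stack.last
    return true if DEEP_ACCEPTS.fetch(top, []).include?(token)
    lhs, rhs_len = DEEP_DEFAULT[top]
    return false if lhs.nil?
    stack.pop(rhs_len)                    # unwind the reduced symbols
    next_state = DEEP_GOTO[[stack.last, lhs]]
    return false if next_state.nil?
    stack.push(next_state)
  end
  false
end

p deep_accepts?([:start, :after_kw, :after_ident], :"*")
p deep_accepts?([:start, :after_def, :after_ident], :"*")
```

The two calls share the same top state and differ only in what lies beneath it, which is the situation where a lookup on the current state alone cannot answer the question.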

Layers 1–3: Implementation Problems
Layer 1: lex_state's unnecessary mediation
Layer 2: Hidden by LALR default reduction
Layer 3: Beyond LALR(1) lookahead depth
The grammar itself isn't ambiguous. Looking at the parser state correctly gives the answer.

Layer 04 The Lexing/Parsing Boundary Is Blurred

Layer 4: The tLABEL Problem
f(a: 1) → a: is tLABEL
f(a :foo) → a is tIDENTIFIER, : is tSYMBEG
Single-token checks (Layers 1–3) cannot resolve this; input patterns are needed.

Pseudo-scan: Token Patterns → Scanner FSA
Each regex → NFA → combined NFA → single DFA
    %token-pattern tIDENTIFIER /[a-z_][a-zA-Z0-9_]*/
    %token-pattern tLABEL /[a-z_][a-zA-Z0-9_]*:/
DFA:
    state 0 --[a-z_]--> state 1 (accepting: tIDENTIFIER)
    state 1 --[:]--> state 2 (accepting: tLABEL)
Output: yy_scanner_transition[fsa_state][256] in C.

The scanner_accepts Table
(Parser state × FSA accepting state) → which token to return.
Parser state A (command arg position):
    FSA accept 1 (tIDENTIFIER) → tIDENTIFIER
    FSA accept 2 (tLABEL) → tLABEL ✓
Parser state B (end of expression):
    FSA accept 1 → tIDENTIFIER
    FSA accept 2 → nil ✗

Runtime: yy_pseudo_scan()
    int
    yy_pseudo_scan(int parser_state, const char *input, int *match_length)
    {
        // ... (Walks input character by character)
        // At each accepting state, consults scanner_accepts.
        // Longest-match semantics.
    }
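The body of the scan loop is elided on the slide, so the following Ruby sketch shows one way the three comments could fit together. It hand-codes the two-pattern DFA for tIDENTIFIER and tLABEL rather than generating it, and the parser state names `:command_arg` and `:expr_end` as well as the table layout are assumptions for illustration.

```ruby
# Hand-written DFA for the two patterns:
#   0 --[a-z_]--> 1, 1 --[a-zA-Z0-9_]--> 1, 1 --[:]--> 2
# Returns the next FSA state, or nil when there is no transition.
def fsa_step(state, ch)
  case state
  when 0 then ch.match?(/[a-z_]/) ? 1 : nil
  when 1
    if ch.match?(/[a-zA-Z0-9_]/) then 1
    elsif ch == ":" then 2
    end
  end
end

# Hypothetical scanner_accepts table:
# parser state => { FSA accepting state => token to return }.
SCANNER_ACCEPTS = {
  command_arg: { 1 => :tIDENTIFIER, 2 => :tLABEL },
  expr_end:    { 1 => :tIDENTIFIER },
}.freeze

# Walks the input, remembering the longest prefix whose accepting state
# is allowed in this parser state. Returns [token, match_length] or nil.
def pseudo_scan(parser_state, input)
  state = 0
  best = nil
  input.each_char.with_index do |ch, i|
    state = fsa_step(state, ch)
    break if state.nil?
    token = SCANNER_ACCEPTS.fetch(parser_state)[state]
    best = [token, i + 1] if token
  end
  best
end

p pseudo_scan(:command_arg, 'name: "Ruby"')
p pseudo_scan(:expr_end, 'name: "Ruby"')
```

In the second call the DFA still reaches the tLABEL state, but the table has no entry for it in that parser state, so the earlier tIDENTIFIER match is what survives.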

Layer 4 Example: name: in Method Arguments
    foo(name: "Ruby", version: 3)
Pseudo-scan trace:
    n→a→m→e → FSA state 1 (tIDENTIFIER accepting). pbest=tIDENTIFIER
    : → FSA state 2 (tLABEL accepting). scanner_accepts → tLABEL ✓
    (space) → no transition → scan ends
Result: name: as tLABEL (match_length=5)

Layer 4 Discovery
The boundary between lexing and parsing is blurred. Grammatical decisions embedded inside token-level concerns.

Layer 05 An Algorithmic Side Effect

Layer 5: LALR State Merging Corrupts scanner_accepts
State A: only tDIVIDE. State B: only tREGEXP_BEG.
Merging → both allowed → pseudo-scan can't disambiguate.
Lrama addresses this with: Lexer Context classification, Inadequacy detection, State splitting.

Lexer Context Classification
Context Classifier: checks the symbol left of • → stores a bitmask.
Inadequacy Detection: compares expected vs actual scanner profiles.
State Splitting: separate by context → restore correct profiles.
    %lexer-context BEG keyword_if keyword_unless tLPAREN ...
    %lexer-context CMDARG tIDENTIFIER tFID tCONSTANT ...
    %lexer-context END tINTEGER tFLOAT keyword_end ')' ']' ...

Layer 5 Example: * in BEG vs END
    if *a == [1, 2]   # BEG: tSTAR (splat)
    x = a * b         # END: '*' (multiplication)
Context Classifier: keyword_if → BEG. expr → END.
Inadequacy: expected ≠ actual → State splitting
Runtime: parser_pslr_context_is(p, YY_CTX_BEG) → tSTAR.

Layer 5 Example: /, Regexp vs Division
    x = /pattern/   # BEG: tREGEXP_BEG
    y = a / b       # END: '/' (division)
After = → BEG. After a → END.
Merged → both accepted.
Inadequacy → state splitting → BEG → tREGEXP_BEG, END → '/'.
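The merge, detect, split sequence can be pictured with a toy model in Ruby. This is not how Lrama represents states: here a merged state is reduced to the list of lexer contexts that flowed into it, and the profile table and helper names are hypothetical.

```ruby
# Expected scanner profile per lexer context (hypothetical data
# following the regexp vs division case).
CONTEXT_PROFILE = {
  BEG: [:tREGEXP_BEG],
  END: [:"/"],
}.freeze

# A merged LALR state is modelled as the contexts that flowed into it;
# its actual profile is the union of their expected profiles.
def actual_profile(contexts)
  contexts.flat_map { |ctx| CONTEXT_PROFILE.fetch(ctx) }.uniq
end

# Inadequate when some context's expected profile differs from the
# profile the merged state actually has.
def inadequate?(contexts)
  merged = actual_profile(contexts).sort
  contexts.any? { |ctx| CONTEXT_PROFILE.fetch(ctx).sort != merged }
end

# Splitting gives each context its own state with its own profile.
def split_state(contexts)
  contexts.map { |ctx| [ctx, CONTEXT_PROFILE.fetch(ctx)] }.to_h
end

merged_state = [:BEG, :END]
p actual_profile(merged_state)
p inadequate?(merged_state)
p split_state(merged_state)
```

The point of the model is only the comparison step: a state is flagged when what it accepts after merging is broader than what any one of its contexts expected.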

Layer 5 Example: tLABEL Inadequacy
    foo a: 1      # id • args → tLABEL accepted
    x = a + b     # id • → tLABEL not accepted
Actual (merged): FSA accept 2 → tLABEL ← Path A leaked
State splitting:
    state 42 {A} → tLABEL ✓
    state 1889 {B} → tLABEL ✗

Layer 5 Discovery
Ambiguity caused by LALR state merging: an algorithmic side effect.

Layer 06 Semantic Ambiguity

Layer 6: last_token_type
Not 13 flags. Four values. The true minimum extracted from lex_state.
    #define LAST_TOKEN_OTHER  0
    #define LAST_TOKEN_LVAR   1
    #define LAST_TOKEN_VALUE  2
    #define LAST_TOKEN_METHOD 3
Unsolvable by all PSLR layers. References CRuby's local variable table (runtime info).

Layer 6 Example: <<, Left Shift vs Heredoc
Both tIDENTIFIER. LALR states identical. Layers 1–5 all fail.
    m = 1
    puts m << 0     # m << 0 (left shift)
    puts foo << 0   # foo(<<0) (heredoc argument)

    if (LAST_TOKEN_IS_VALUE(p)) return tLSHFT;
    // LVAR → left shift
    // METHOD → heredoc
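The decision above depends on information the grammar does not carry, namely whether the preceding identifier is a known local variable. A minimal Ruby sketch of that dependency follows; the helper names, the symbol `:heredoc_start`, and the assumption that both LVAR and VALUE count as "value" are illustrative guesses rather than the actual CRuby logic.

```ruby
# Classifies an identifier using a local-variable table, modelled here
# as a plain array of names (a stand-in for CRuby's real table).
def classify_last_token(name, local_vars)
  local_vars.include?(name) ? :LAST_TOKEN_LVAR : :LAST_TOKEN_METHOD
end

# Picks the token for `<<` from the last token type.
# Assumption: LVAR and VALUE are both treated as a value.
def token_for_lshift(last_token_type)
  case last_token_type
  when :LAST_TOKEN_LVAR, :LAST_TOKEN_VALUE then :tLSHFT
  when :LAST_TOKEN_METHOD then :heredoc_start
  else nil   # LAST_TOKEN_OTHER is not covered by this sketch
  end
end

locals = ["m"]                            # after `m = 1`
p token_for_lshift(classify_last_token("m", locals))
p token_for_lshift(classify_last_token("foo", locals))
```

Both identifiers would reach the lexer as the same token kind in the same parser state, so only the contents of `locals` separate the two outcomes.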

Layer 6 Example: &, Bitwise AND vs Block Arg
    m = proc {}
    bar m &n     # m & n (bitwise AND)
    bar foo &n   # foo(&n) (block argument)

    if (LAST_TOKEN_IS_VALUE(p)) {
        return warn_balanced('&', "&", "argument prefix");  // AND
    }

Layer 6 Example: (, Space + Parenthesis
    m = 1
    foo m (1)     # foo(m, (1)). m = local var
    foo bar (1)   # foo(bar(1)). bar = method

Layer 6 Discovery
Semantic ambiguities invisible to the grammar.
Local vars and method calls looking identical = Ruby's intentional design choice.

SECTION 03: The Map
The Ambiguity Map of Ruby's Grammar

The Six-Layer Ambiguity Map
01 Direct action table: def +, ][, if :
02 Empty reduction tracing: do *, begin -, { &
03 Stack traversal: yield *, return **, rescue =>
04 Pseudo-scan: name:, a :foo, {key:
05 Lexer Context + split: if * / a *, tLABEL
06 last_token_type: m << / foo <<, m ( / bar (

Layers 1–6: Three Categories of Ambiguity
Layers 1–3, Implementation Problems: Grammar isn't ambiguous.
Layers 4–5, Design Boundary / Algorithm: Lexing/parsing boundary; LALR side effects.
Layer 6, Language Design: Semantic ambiguity grammar cannot capture.

lex_state mixed all of these → PSLR separated them.

3 implications
01 Most of lex_state Was Unnecessary: Layers 1–3 are automatically resolvable.
02 Guidelines for New Syntax: Perspectives offered by the 6 Layers. Layer 6 → where language design decisions are needed.
03 parse.y as a Specification: Remove the hidden second grammar → parse.y alone tells the full story. Contributors can read and understand Ruby's syntax.

Deleting lex_state was the means.
The real achievement: the structure of Ruby's grammatical ambiguity became visible.

3 Takeaways
01 Theory is a lens, not a destination: Separating what PSLR can solve from what it can't reveals each ambiguity's true nature. Theory gives the structure of the questions.
02 Replace chaos with hierarchy: In parse.y, we moved from ad-hoc hacks to a clear decision cascade. By testing the cheapest, most certain conditions first, we built a discipline that makes a massive system manageable.
03 Solving a grammar gives you the power to design it: The ambiguity map = a tool for Ruby's future syntax design. Predict and control which layer a new feature affects.

Thank you!
