Liberating Ruby's Parser from Lexer Hacks

Slide 1

Slide 1 text

RubyKaigi 2026, Hakodate, Hokkaido. Hakodate Citizen Hall, 22 April 2026. Liberating Ruby's Parser from Lexer Hacks. DATE: 2026.04.22 SPEAKER: @ydah

Slide 2

Slide 2 text

Ruby's grammar is ambiguous. …well, that's not quite right.

Slide 3

Slide 3 text

Where is Ruby's grammar ambiguous, and why?

Slide 4

Slide 4 text

lex_state: The Mechanism Hiding the Answers. 13 bit flags. ~100 calls to SET_LEX_STATE scattered across parse.y. It has been resolving Ruby's grammatical ambiguities while hiding their structure.

Slide 5

Slide 5 text

What I Found When I Opened the Box. Dismantled lex_state using a technique called PSLR. Ruby's grammatical ambiguities fall into six distinct layers: truly grammatical / LALR limitations / semantic problems. SET_LEX_STATE calls: ~100 → 0.

Slide 6

Slide 6 text

Yudai Takada @ydah https://ydah.net/

Slide 7

Slide 7 text

CRuby committer, Lrama committer

Slide 8

Slide 8 text

SmartHR Product Engineer

Slide 9

Slide 9 text

Otsu City Traditional Performing Arts Hall Saturday, July 18, 2026

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

01 SECTION 01 The Opaque Box What lex_state has been hiding

Slide 12

Slide 12 text

In most languages, lexing is context-free: the lexer splits input into tokens without caring about context.

Slide 13

Slide 13 text

IN RUBY CODE
But Ruby Is Different: *
Multiplication: a * b → Token: '*'
Splat: foo *args → Token: tSTAR
Rest parameter: def f(*a) → Token: tSTAR

Slide 14

Slide 14 text

IN RUBY CODE
Same story: {
Hash literal: {a: 1} → Token: tLBRACE
Block: foo { |x| x } → Token: '{'
Command block: foo(1) { x } → Token: tLBRACE_ARG

Slide 15

Slide 15 text

IN RUBY CODE
And: <<
Left shift / append: a << b → Token: tLSHFT
Heredoc start: <<HEREDOC → Token: tSTRING_BEG

Slide 16

Slide 16 text

lex_state: 13 bit flags
enum lex_state_bits {
    EXPR_BEG_bit,       /* ignore newline, +/- is a sign. */
    EXPR_END_bit,       /* newline significant, +/- is an operator. */
    EXPR_ENDARG_bit,    /* ditto, and unbound braces. */
    EXPR_ENDFN_bit,     /* ditto, and unbound braces. */
    EXPR_ARG_bit,       /* newline significant, +/- is an operator. */
    EXPR_CMDARG_bit,    /* newline significant, +/- is an operator. */
    EXPR_MID_bit,       /* newline significant, +/- is an operator. */
    EXPR_FNAME_bit,     /* ignore newline, no reserved words. */
    EXPR_DOT_bit,       /* right after `.', `&.' or `::', no reserved words. */
    EXPR_CLASS_bit,     /* immediate after `class', no here document. */
    EXPR_LABEL_bit,     /* flag bit, label is allowed. */
    EXPR_LABELED_bit,   /* flag bit, just after a label. */
    EXPR_FITEM_bit,     /* symbol literal as FNAME. */
    EXPR_MAX_STATE
};
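To make the role of these flags concrete, here is a minimal Ruby sketch of a lexer decision driven by a bit-flag state. The flag names mirror the enum above; the `classify_star` helper and its single rule are simplified assumptions for illustration, not the logic in parse.y.

```ruby
# Toy model of lex_state: each state is one bit, in the enum's order.
BITS = %i[EXPR_BEG EXPR_END EXPR_ENDARG EXPR_ENDFN EXPR_ARG EXPR_CMDARG
          EXPR_MID EXPR_FNAME EXPR_DOT EXPR_CLASS EXPR_LABEL EXPR_LABELED
          EXPR_FITEM].freeze

# FLAG[:EXPR_BEG] is bit 0, FLAG[:EXPR_END] is bit 1, and so on.
FLAG = BITS.each_with_index.map { |name, bit| [name, 1 << bit] }.to_h.freeze

# Hypothetical helper: treats '*' as a splat only when the state says an
# expression is about to begin. The real lexer checks more conditions.
def classify_star(lex_state)
  beg_like = FLAG[:EXPR_BEG] | FLAG[:EXPR_MID] | FLAG[:EXPR_CLASS]
  (lex_state & beg_like).zero? ? "'*'" : "tSTAR"
end

puts classify_star(FLAG[:EXPR_BEG])                      # e.g. after `(` or `=`
puts classify_star(FLAG[:EXPR_END])                      # e.g. after an operand
puts classify_star(FLAG[:EXPR_BEG] | FLAG[:EXPR_LABEL])  # flags can combine
```

The sketch shows the shape of the problem: the flag gives a verdict for `*`, but nothing in it records which grammar rule made the state "BEG-like" in the first place.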

Slide 17

Slide 17 text

lex_state tells you what, not why. "We're in EXPR_BEG, so * is a splat." OK. But why is this EXPR_BEG? Why is * a splat here and multiplication there? lex_state gives the verdict but doesn't distinguish the cause.

Slide 18

Slide 18 text

3 technical problems
01 Tight Coupling: The lexer depends on parser internals. Changes ripple unpredictably.
02 Poor Maintainability: One fix can break something else. Regression-prone.
03 Hidden Semantics: Language semantics are buried in lexer-state logic.

Slide 19

Slide 19 text

Among the ambiguities lex_state resolves, aren't some not actually ambiguous in the grammar at all?

Slide 20

Slide 20 text

lex_state: Everything Crammed into One Box. Genuinely grammatical ambiguity, LALR compression artifacts, and semantic issues: lex_state doesn't distinguish between them. It treats all three the same way.

Slide 21

Slide 21 text

Open the Box: See the True Causes. That was the motivation behind PSLR.

Slide 22

Slide 22 text

02 SECTION 02 The Lens PSLR Theory and the 6 Layers of Ambiguity

Slide 23

Slide 23 text

The Information Gap. Traditional LR parser: Lexer → Parser is a one-way flow, so the lexer doesn't know where the parser is in the grammar. lex_state was the workaround: manual flags as a bridge.

Slide 24

Slide 24 text

PSLR Flips the Relationship. PSLR(1) = Pseudo-Scannerless Minimal LR(1). It shares the parser's LALR state with the lexer, so the lexer can ask before returning a token: "If I return this token, will the parser accept it?" One-way flow → two-way flow.

Slide 25

Slide 25 text

The Theoretical Foundation: PSLR(1). The Pseudo-Scanner: only recognizes tokens the current parser state accepts. A Known Problem: when parser states are merged, pseudo-scanner behavior can break. Coming Later: this becomes Layer 5 in this talk.

Slide 26

Slide 26 text

PSLR: The Core Idea. If the parser's LALR state already knows which tokens could come next, there's no need for the lexer to maintain its own state.

Slide 27

Slide 27 text

A Byproduct. The original goal: remove lex_state using PSLR(1). The real value: removing lex_state revealed different kinds of ambiguity, layer by layer, and that gave me a map.

Slide 28

Slide 28 text

LALR Parser States = Item Sets. In an LALR parser, a parser state = a set of positions inside grammar rules. Key point: it often already contains useful information about the next token.

Slide 29

Slide 29 text

Example: State after keyword_super
primary → keyword_super • '(' call_args rparen
primary → keyword_super •
command_call → keyword_super • command_args
From this state, both '(' (leading to '*') and command_args (leading to tSTAR) could follow.

Slide 30

Slide 30 text

Disambiguating {: the parser's action selection after receiving a token, repurposed as a question before returning it. This is the core of PSLR. 3 candidates: tLBRACE (hash), '{' (block), tLBRACE_ARG (command block). Look up the action table for each. If exactly one is accepted, that's the answer.
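The candidate check can be sketched in a few lines of Ruby. This is a toy model only: the `ACTIONS` table, its state numbers, and its token sets are invented for illustration, and `state_accepts_token?` is a stand-in for the generated lookup rather than the real one.

```ruby
# Hypothetical action table: parser state => tokens with a non-error action.
ACTIONS = {
  10 => %w[tLBRACE tIDENTIFIER tINTEGER],  # an expression may start here
  20 => %w['{' keyword_do],                # right after a call with parens
  30 => %w[tLBRACE_ARG tLBRACE],           # made-up state where two fit
}.freeze

BRACE_CANDIDATES = %w[tLBRACE '{' tLBRACE_ARG].freeze

# Stand-in for the table lookup: is there an action for this token?
def state_accepts_token?(state, token)
  ACTIONS.fetch(state, []).include?(token)
end

# Ask before returning a token: which candidates would the parser accept?
# Exactly one means the question is settled; otherwise return nil so that
# a later layer can decide.
def disambiguate_brace(state)
  accepted = BRACE_CANDIDATES.select { |tok| state_accepts_token?(state, tok) }
  accepted.size == 1 ? accepted.first : nil
end

p disambiguate_brace(10)
p disambiguate_brace(20)
p disambiguate_brace(30)
```

Returning nil rather than guessing keeps the direct check conservative, which is what lets the remaining layers pick up the cases it cannot settle.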

Slide 31

Slide 31 text

Auto-Generated: yy_state_accepts_token()
int yy_state_accepts_token(int yystate, int yychar)
{
    yysymbol_kind_t yytoken = YYTRANSLATE(yychar);
    int yyn = yypact[yystate];                    /* base offset for this state */
    if (yypact_value_is_default(yyn)) return 0;   /* default action only */
    yyn += yytoken;
    if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken) return 0;
    yyn = yytable[yyn];
    if (yyn <= 0) return !yytable_value_is_error(yyn);   /* reduce or error */
    return 1;                                     /* shift */
}
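For readers unfamiliar with the packed table layout, the same lookup can be walked through in Ruby on a tiny hand-made table. Every number below is invented; the constant names only echo the C identifiers, and tokens are assumed to be symbol numbers already (the YYTRANSLATE step is skipped).

```ruby
# Toy packed tables in the same shape as the generated ones.
YYPACT_NINF  = -10   # marks "this state has only a default action"
YYTABLE_NINF = -99   # marks an explicit error entry
YYLAST  = 4
YYPACT  = [0, 3, YYPACT_NINF].freeze          # per-state base offset
YYCHECK = [-1, 1, 2, 0, 1].freeze             # which token owns each slot
YYTABLE = [0, 5, -3, 7, YYTABLE_NINF].freeze  # >0 shift, <0 reduce

# Ruby transliteration of the lookup: true if `state` has an explicit
# non-error action for `token`.
def state_accepts_token?(state, token)
  n = YYPACT[state]
  return false if n == YYPACT_NINF           # default reduction only
  n += token
  return false if n < 0 || n > YYLAST || YYCHECK[n] != token
  n = YYTABLE[n]
  return n != YYTABLE_NINF if n <= 0         # reduce is fine, error is not
  true                                       # shift
end

(0..2).each do |state|
  accepted = (0..2).select { |tok| state_accepts_token?(state, tok) }
  puts "state #{state}: accepts tokens #{accepted.inspect}"
end
```

State 2 in this toy table accepts nothing through the direct check because it only has a default action, which is exactly the false negative that Layer 2 addresses.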

Slide 32

Slide 32 text

Layer 01 It Wasn't Ambiguous at All

Slide 33

Slide 33 text

Layer 1 Example: + after def
def +(other)
  @value + other.value
end
State S after shifting def: yy_state_accepts_token(S, '+') → 0 (not accepted as binary addition). Accepted only as a method name (fname category).
Traditionally: SET_LEX_STATE(EXPR_FNAME) told the lexer "next is a method name." But the action table already knew.

Slide 34

Slide 34 text

Layer 1 Discovery These ambiguities were not ambiguous in the grammar at all. lex_state was unnecessarily mediating.

Slide 35

Slide 35 text

Layer 1: But the Direct Check Isn't Always Enough. parse.y has many empty rules: opt_block_param → ε, opt_rescue → ε, $@N → ε. Default reduction kicks in → per-token entries absent → false negatives.

Slide 36

Slide 36 text

Layer 02 Hidden by LALR Default Reduction

Slide 37

Slide 37 text

Layer 2: Empty Reduction Tracing
int yy_state_eventually_accepts_token(int yystate, int yychar)
{
    int visited[YYNSTATES] = {0};
    for (;;) {
        if (visited[yystate]) return 0;
        visited[yystate] = 1;
        if (yy_state_accepts_token(yystate, yychar)) return 1;
        int yyn = yydefact[yystate];
        if (yyn == 0 || yyr2[yyn] != 0) return 0;  // non-empty → stop
        // GOTO after ε-reduction ...
    }
}
Traces chains of empty reductions (RHS length = 0). No stack access needed. Stops at non-empty reductions (→ Layer 3).
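The elided GOTO step can be filled in on a toy grammar to see the whole loop run. The Ruby sketch below is an assumption-laden model: the tables, state numbers, and rule names are invented, and hashes stand in for the packed arrays the C code reads.

```ruby
# Toy tables; every state number and rule here is hypothetical.
ACCEPTS = { 2 => %w[tSTAR tIDENTIFIER] }.freeze              # explicit actions
DEFAULT = { 0 => :mid_rule, 1 => :opt_block_param }.freeze   # default reductions
RHS_LEN = { mid_rule: 0, opt_block_param: 0 }.freeze         # both are empty
GOTO    = { [0, :mid_rule] => 1, [1, :opt_block_param] => 2 }.freeze

def state_accepts_token?(state, token)
  ACCEPTS.fetch(state, []).include?(token)
end

# Follow chains of empty default reductions. An empty reduction pops
# nothing, so the GOTO is taken from the current state and no stack
# access is needed.
def state_eventually_accepts_token?(state, token)
  visited = {}
  loop do
    return false if visited[state]                   # cycle guard
    visited[state] = true
    return true if state_accepts_token?(state, token)
    rule = DEFAULT[state]
    return false if rule.nil? || RHS_LEN[rule] != 0  # non-empty: stop here
    state = GOTO.fetch([state, rule])                # state after the ε-reduction
  end
end

p state_accepts_token?(0, "tSTAR")
p state_eventually_accepts_token?(0, "tSTAR")
p state_eventually_accepts_token?(0, "'*'")
```

In this model the direct check on state 0 fails, while the traced check reaches state 2 through two empty reductions and finds the token there.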

Slide 38

Slide 38 text

Layer 2 Example: * after do
[1, 2, 3].each do
  *a, b = [4, 5, 6]
end
State S0 after do. Layer 1: 0 (default reduction). Layer 2:
S0 → $@42 → ε → GOTO → S1
S1 → opt_block_param → ε → GOTO → S2
yy_state_accepts_token(S2, tSTAR) → 1 ✓ / '*' → 0 ✗

Slide 39

Slide 39 text

Layer 2 Discovery Mid-rule actions and optional clauses create ε-chains. Hidden by LALR's compression.

Slide 40

Slide 40 text

Layer 03 Beyond LALR(1) Lookahead Depth

Slide 41

Slide 41 text

Layer 3: Stack Traversal
int yy_state_deep_accepts_token(int yystate, int yychar,
                                const short *stack_base,
                                const short *stack_top)
{
    // ...
    if (rhs_len > 0) {
        // Non-empty: unwind the stack
        const short *target = stack_top - (stack_consumed + rhs_len);
        int uncovered_state = *target;
        // GOTO from uncovered_state ...
    }
}
YYSETSTATE_CONTEXT macro delivers stack pointers to the lexer.
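A rough Ruby model of the stack unwinding follows. It is a simplification under stated assumptions: the tables and state numbers are invented, and instead of the pointer arithmetic in the C fragment it simulates the reductions on a copy of the state stack, which is easier to read but not what the generated code does.

```ruby
# Hypothetical tables for a single non-empty default reduction.
ACCEPTS = { 7 => %w[tSTAR] }.freeze
RULES   = { r1: { lhs: :x, rhs_len: 2 } }.freeze
DEFAULT = { 5 => :r1 }.freeze
GOTO    = { [0, :x] => 7 }.freeze

# Simulate default reductions on a copy of the parser's state stack and
# ask whether the token becomes acceptable afterwards.
def deep_accepts_token?(stack, token, max_steps: 32)
  sim = stack.dup                      # never mutate the real stack
  max_steps.times do
    state = sim.last
    return true if ACCEPTS.fetch(state, []).include?(token)
    rule = RULES[DEFAULT[state]]
    return false if rule.nil?          # no default reduction: give up
    sim.pop(rule[:rhs_len])            # unwind: uncover the state below the RHS
    return false if sim.empty?
    target = GOTO[[sim.last, rule[:lhs]]]
    return false if target.nil?
    sim.push(target)                   # GOTO from the uncovered state
  end
  false
end

stack = [0, 3, 5]
p deep_accepts_token?(stack, "tSTAR")
p deep_accepts_token?(stack, "'*'")
```

Here the reduction pops states 5 and 3, uncovers state 0, and the GOTO on `x` lands in state 7, where the token is finally visible.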

Slide 42

Slide 42 text

Layers 1–3: Implementation Problems
Layer 1: lex_state's unnecessary mediation
Layer 2: Hidden by LALR default reduction
Layer 3: Beyond LALR(1) lookahead depth
The grammar itself isn't ambiguous. Looking at the parser state correctly gives the answer.

Slide 43

Slide 43 text

Layer 04 The Lexing/Parsing Boundary Is Blurred

Slide 44

Slide 44 text

Layer 4: The tLABEL Problem
f(a: 1) → a: is tLABEL
f(a :foo) → a is tIDENTIFIER, : is tSYMBEG
Single-token checks (Layers 1–3) cannot resolve this; input patterns are needed.

Slide 45

Slide 45 text

Pseudo-scan: Token Patterns → Scanner FSA
%token-pattern tIDENTIFIER /[a-z_][a-zA-Z0-9_]*/
%token-pattern tLABEL /[a-z_][a-zA-Z0-9_]*:/
Each regex → NFA → combined NFA → single DFA
DFA:
state 0 --[a-z_]--> state 1 (accepting: tIDENTIFIER)
state 1 --[:]--> state 2 (accepting: tLABEL)
Output: yy_scanner_transition[fsa_state][256] in C.
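The DFA for these two patterns is small enough to build by hand. The Ruby sketch below fills in a 256-column transition table directly rather than going through the regex → NFA → DFA construction; `TRANSITION`, `ACCEPTING`, and `run_dfa` are illustrative names, not the generated ones.

```ruby
# One row per DFA state, one column per byte value. DEAD means no transition.
DEAD = -1
TRANSITION = Array.new(3) { Array.new(256, DEAD) }

# state 0 --[a-z_]--> state 1
(("a".."z").to_a + ["_"]).each { |c| TRANSITION[0][c.ord] = 1 }

# state 1 --[a-zA-Z0-9_]--> state 1
(("a".."z").to_a + ("A".."Z").to_a + ("0".."9").to_a + ["_"]).each do |c|
  TRANSITION[1][c.ord] = 1
end

# state 1 --[:]--> state 2
TRANSITION[1][":".ord] = 2

ACCEPTING = { 1 => :tIDENTIFIER, 2 => :tLABEL }.freeze

# Run the DFA over the whole input; return the accepted token or nil.
def run_dfa(input)
  state = 0
  input.each_byte do |b|
    state = TRANSITION[state][b]
    return nil if state == DEAD
  end
  ACCEPTING[state]
end

p run_dfa("name")
p run_dfa("name:")
p run_dfa("Name")
```

Because identifiers and labels share a prefix, one combined automaton recognizes both; which of the two accepting states is honored is a separate decision.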

Slide 46

Slide 46 text

The scanner_accepts Table
(Parser state × FSA accepting state) → which token to return.
Parser state A (command arg position): FSA accept 1 (tIDENTIFIER) → tIDENTIFIER; FSA accept 2 (tLABEL) → tLABEL ✓
Parser state B (end of expression): FSA accept 1 → tIDENTIFIER; FSA accept 2 → nil ✗

Slide 47

Slide 47 text

Runtime: yy_pseudo_scan()
int yy_pseudo_scan(int parser_state, const char *input, int *match_length)
{
    // ... (Walks input character by character)
    // At each accepting state, consults scanner_accepts.
    // Longest-match semantics.
}

Slide 48

Slide 48 text

Layer 4 Example: name: in Method Arguments
foo(name: "Ruby", version: 3)
Pseudo-scan trace:
n→a→m→e → FSA state 1 (tIDENTIFIER accepting). pbest=tIDENTIFIER
: → FSA state 2 (tLABEL accepting). scanner_accepts → tLABEL ✓
(space) → no transition → scan ends
Result: name: as tLABEL (match_length=5)
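The trace can be reproduced with a compact Ruby model of the pseudo-scan. This is a sketch, not the generated scanner: `step` is a hand-written stand-in for the DFA, the parser states `:A` and `:B` are the two hypothetical states from the scanner_accepts slide, and the return value is a `[token, length]` pair instead of an out-parameter.

```ruby
# Stand-in for the DFA: 0 = start, 1 = identifier, 2 = label.
# Returns the next state, or nil when there is no transition.
def step(state, ch)
  case state
  when 0
    ch.match?(/[a-z_]/) ? 1 : nil
  when 1
    if ch.match?(/[a-zA-Z0-9_]/) then 1
    elsif ch == ":" then 2
    end
  end
end

# (parser state, FSA accepting state) => token to return, or nil.
SCANNER_ACCEPTS = {
  [:A, 1] => :tIDENTIFIER, [:A, 2] => :tLABEL,   # command arg position
  [:B, 1] => :tIDENTIFIER, [:B, 2] => nil,       # end of expression
}.freeze

# Longest match: keep walking and remember the last accepted token.
def pseudo_scan(parser_state, input)
  state = 0
  best = nil
  input.each_char.with_index(1) do |ch, len|
    state = step(state, ch)
    break if state.nil?
    token = SCANNER_ACCEPTS[[parser_state, state]]
    best = [token, len] if token
  end
  best
end

p pseudo_scan(:A, 'name: "Ruby"')
p pseudo_scan(:B, 'name: "Ruby"')
```

In parser state A the scan keeps going past the identifier and settles on the five-character label; in state B the label is not allowed, so the best match stays at the four-character identifier.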

Slide 49

Slide 49 text

Layer 4 Discovery The boundary between lexing and parsing is blurred. Grammatical decisions embedded inside token-level concerns.

Slide 50

Slide 50 text

Layer 05 An Algorithmic Side Effect

Slide 51

Slide 51 text

Layer 5: LALR State Merging Corrupts scanner_accepts. State A: only tDIVIDE. State B: only tREGEXP_BEG. Merging → both allowed → pseudo-scan can't disambiguate. Lrama addresses this with: Lexer Context classification, Inadequacy detection, State splitting.

Slide 52

Slide 52 text

Lexer Context Classification
Context Classifier: checks symbol left of • → stores bitmask.
Inadequacy Detection: compares expected vs actual scanner profiles.
State Splitting: separate by context → restore correct profiles.
%lexer-context BEG keyword_if keyword_unless tLPAREN ...
%lexer-context CMDARG tIDENTIFIER tFID tCONSTANT ...
%lexer-context END tINTEGER tFLOAT keyword_end ')' ']' ...
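The interplay of merging, inadequacy detection, and splitting can be illustrated with a toy model in Ruby. This is not Lrama's algorithm: the `State` struct, the core names, and the rule "a core is inadequate when any member expected a narrower profile than the merged one" are assumptions made to show the idea.

```ruby
# Hypothetical pre-merge states: same core, different lexer context.
State = Struct.new(:core, :context, :profile)

PRE_MERGE = [
  State.new(:core_x, :END, [:tDIVIDE]),
  State.new(:core_x, :BEG, [:tREGEXP_BEG]),
  State.new(:core_y, :END, [:tINTEGER]),
].freeze

# LALR-style merge: states with the same core collapse into one, and
# their scanner profiles are unioned.
def merge(states)
  states.group_by(&:core).transform_values do |group|
    group.flat_map(&:profile).uniq
  end
end

# Inadequacy detection: compare each expected profile with the actual
# (merged) profile of its core.
def inadequate_cores(states)
  actual = merge(states)
  states.reject { |s| s.profile.sort == actual[s.core].sort }
        .map(&:core).uniq
end

# State splitting: re-separate only the inadequate cores by context.
def split(states)
  bad = inadequate_cores(states)
  states.select { |s| bad.include?(s.core) }
        .to_h { |s| [[s.core, s.context], s.profile] }
end

p merge(PRE_MERGE)
p inadequate_cores(PRE_MERGE)
p split(PRE_MERGE)
```

Only the core whose merged profile grew is split, so states that were never in conflict keep the compactness that merging bought.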

Slide 53

Slide 53 text

Layer 5 Example: * in BEG vs END
if *a == [1, 2]   # BEG: tSTAR (splat)
x = a * b         # END: '*' (multiplication)
Context Classifier: keyword_if → BEG. expr → END.
Inadequacy: expected ≠ actual → State splitting
Runtime: parser_pslr_context_is(p, YY_CTX_BEG) → tSTAR.

Slide 54

Slide 54 text

Layer 5 Example: / (Regexp vs Division)
x = /pattern/   # BEG: tREGEXP_BEG
y = a / b       # END: '/' (division)
After = → BEG. After a → END. Merged → both accepted.
Inadequacy → state splitting → BEG → tREGEXP_BEG, END → '/'.

Slide 55

Slide 55 text

Layer 5 Example: tLABEL Inadequacy
foo a: 1    # id • args → tLABEL accepted
x = a + b   # id • → tLABEL not accepted
Actual (merged): FSA accept 2 → tLABEL ← Path A leaked
State splitting: state 42 {A} → tLABEL ✓, state 1889 {B} → tLABEL ✗

Slide 56

Slide 56 text

Layer 5 Discovery Ambiguity caused by LALR state merging — an algorithmic side effect.

Slide 57

Slide 57 text

Layer 06 Semantic Ambiguity

Slide 58

Slide 58 text

Layer 6: last_token_type
Not 13 flags. Four values. The true minimum extracted from lex_state.
#define LAST_TOKEN_OTHER  0
#define LAST_TOKEN_LVAR   1
#define LAST_TOKEN_VALUE  2
#define LAST_TOKEN_METHOD 3
Unsolvable by all PSLR layers. References CRuby's local variable table (runtime info).

Slide 59

Slide 59 text

Layer 6 Example: << (Left Shift vs Heredoc)
Both tIDENTIFIER. LALR states identical. Layers 1–5 all fail.
m = 1
puts m <<0     # m << 0 (left shift)
puts foo <<0   # foo(<<0) (heredoc argument)
if (LAST_TOKEN_IS_VALUE(p)) return tLSHFT; // LVAR → left shift
// METHOD → heredoc
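The decision depends on information outside the grammar, which a small Ruby sketch makes visible. The constants mirror the four values on the last_token_type slide; the helper names, the use of a plain array as the local variable table, and the assumption that LVAR counts as a value for this check are all hypothetical.

```ruby
LAST_TOKEN_OTHER  = 0
LAST_TOKEN_LVAR   = 1
LAST_TOKEN_VALUE  = 2
LAST_TOKEN_METHOD = 3

# Hypothetical classifier: an identifier found in the local variable
# table is LVAR; any other identifier is treated as a method name.
def last_token_type(identifier, local_vars)
  local_vars.include?(identifier) ? LAST_TOKEN_LVAR : LAST_TOKEN_METHOD
end

# Assumption: LVAR and VALUE both mean "a value precedes the operator".
def value?(type)
  type == LAST_TOKEN_LVAR || type == LAST_TOKEN_VALUE
end

# Decide what `<<` means after the given identifier.
def classify_lshift(identifier, local_vars)
  value?(last_token_type(identifier, local_vars)) ? :tLSHFT : :heredoc_start
end

locals = ["m"]                     # as if `m = 1` had been seen earlier
p classify_lshift("m", locals)     # identifier is a known local variable
p classify_lshift("foo", locals)   # identifier is not a local variable
```

Both calls see the same token sequence; only the contents of the variable table differ, which is why no amount of parser-state inspection can settle this case.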

Slide 60

Slide 60 text

Layer 6 Example: & (Bitwise AND vs Block Arg)
m = proc {}
bar m &n     # m & n (bitwise AND)
bar foo &n   # foo(&n) (block argument)
if (LAST_TOKEN_IS_VALUE(p)) {
    return warn_balanced('&', "&", "argument prefix"); // AND
}

Slide 61

Slide 61 text

Layer 6 Example: ( (Space + Parenthesis)
m = 1
foo m (1)     # foo(m, (1)). m = local var
foo bar (1)   # foo(bar(1)). bar = method

Slide 62

Slide 62 text

Layer 6 Discovery Semantic ambiguities invisible to the grammar. Local vars and method calls looking identical = Ruby's intentional design choice.

Slide 63

Slide 63 text

03 SECTION 03 The Map The Ambiguity Map of Ruby's Grammar

Slide 64

Slide 64 text

ARCHITECTURE: The Six-Layer Ambiguity Map
01 Direct action table: def +, ][, if :
02 Empty reduction tracing: do *, begin -, { &
03 Stack traversal: yield *, return **, rescue =>
04 Pseudo-scan: name:, a :foo, {key:
05 Lexer Context + split: if * / a *, tLABEL
06 last_token_type: m << / foo <<, m ( / bar (
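One way to picture the map as running code is a cascade in which each layer either answers or passes. The Ruby sketch below is purely illustrative: it models only three of the six layers, and the context keys (`:accepted`, `:context`, `:last_is_value`) and layer bodies are invented, not taken from parse.y.

```ruby
CONTEXT_TOKENS = { BEG: "tSTAR", END: "'*'" }.freeze

# Each layer takes a context hash and returns a token, or nil for
# "cannot decide". Cheaper, more certain checks come first.
LAYERS = [
  [:direct_action_table, lambda { |ctx|
    accepted = ctx.fetch(:accepted, [])
    accepted.size == 1 ? accepted.first : nil
  }],
  [:lexer_context, lambda { |ctx| CONTEXT_TOKENS[ctx[:context]] }],
  [:last_token_type, lambda { |ctx|
    next nil if ctx[:last_is_value].nil?
    ctx[:last_is_value] ? "'*'" : "tSTAR"
  }],
].freeze

# Walk the layers in order; the first one with an answer wins.
def resolve(ctx)
  LAYERS.each do |name, layer|
    token = layer.call(ctx)
    return [name, token] if token
  end
  nil
end

p resolve({ accepted: ["tSTAR"] })
p resolve({ accepted: ["tSTAR", "'*'"], context: :END })
p resolve({ accepted: ["tSTAR", "'*'"], last_is_value: false })
```

Recording which layer produced the answer is the useful part: it is the same information the map provides about where a given piece of syntax gets its meaning.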

Slide 65

Slide 65 text

Layers 1–6: Three Categories of Ambiguity
Layers 1–3, Implementation Problems: Grammar isn't ambiguous.
Layers 4–5, Design Boundary / Algorithm: Lexing/parsing boundary; LALR side effects.
Layer 6, Language Design: Semantic ambiguity grammar cannot capture.

Slide 66

Slide 66 text

lex_state mixed all of these → PSLR separated them.

Slide 67

Slide 67 text

3 implications
01 Most of lex_state Was Unnecessary: Layers 1–3 automatically resolvable.
02 Guidelines for New Syntax: Perspectives offered by 6 Layers. Layer 6 → where language design decisions are needed.
03 parse.y as a Specification: Remove the hidden second grammar → parse.y alone tells the full story. Contributors can read and understand Ruby's syntax.

Slide 68

Slide 68 text

Deleting lex_state was the means. The real achievement: the structure of Ruby's grammatical ambiguity became visible.

Slide 69

Slide 69 text

3 Takeaways
01 Theory is a lens, not a destination: Separating what PSLR can solve from what it can't reveals each ambiguity's true nature. Theory gives the structure of the questions.
02 Replace chaos with hierarchy: In parse.y, we moved from ad-hoc hacks to a clear decision cascade. By testing the cheapest, most certain conditions first, we built a discipline that makes a massive system manageable.
03 Solving a grammar gives you the power to design it: The ambiguity map = a tool for Ruby's future syntax design. Predict and control which layer a new feature affects.

Slide 70

Slide 70 text

Thank you!