Liberating Ruby's Parser from Lexer Hacks

ydah
April 22, 2026

Slides for the RubyKaigi 2026 talk "Liberating Ruby's Parser from Lexer Hacks"
https://rubykaigi.org/2026/presentations/ydah_.html #rubykaigi #rubykaigi_a

Transcript

  1. KAMEN RIDER ZEZTZ THEME V2.0 TECH TALK DATE: 2026.04.11 SPEAKER:

    ZEZTZ DESIGN TEAM RubyKaigi 2026 — Hakodate, Hokkaido Hakodate Citizen Hall 22 April 2026 Liberating Ruby's Parser from Lexer Hacks DATE: 2026.04.22 SPEAKER: @ydah
  2. lex_state The Mechanism Hiding the Answers 13 bit flags ~100

    calls to SET_LEX_STATE scattered across parse.y It has been resolving Ruby's grammatical ambiguities — while hiding their structure
  3. What I Found When I Opened the Box Dismantled lex_state

    using a technique called PSLR Ruby's grammatical ambiguities fall into six distinct layers Truly grammatical / LALR limitations / Semantic problems SET_LEX_STATE ~100 → 0
  4. In most languages, lexing is Context-Free The lexer splits input

    into tokens without caring about context.
  5. IN RUBY CODE But Ruby Is Different: *

    Multiplication: a * b → Token: '*'
    Splat: foo *args → Token: tSTAR
    Rest parameter: def f(*a) → Token: tSTAR
  6. IN RUBY CODE Same story: {

    Hash literal: {a: 1} → Token: tLBRACE
    Block: foo { |x| x } → Token: '{'
    Command block: foo(1) { x } → Token: tLBRACE_ARG
  7. IN RUBY CODE And: <<

    Left shift / append: a << b → Token: tLSHFT
    Heredoc start: <<HEREDOC → Token: tSTRING_BEG
  8. lex_state: 13 bit flags

    enum lex_state_bits {
        EXPR_BEG_bit,      /* ignore newline, +/- is a sign. */
        EXPR_END_bit,      /* newline significant, +/- is an operator. */
        EXPR_ENDARG_bit,   /* ditto, and unbound braces. */
        EXPR_ENDFN_bit,    /* ditto, and unbound braces. */
        EXPR_ARG_bit,      /* newline significant, +/- is an operator. */
        EXPR_CMDARG_bit,   /* newline significant, +/- is an operator. */
        EXPR_MID_bit,      /* newline significant, +/- is an operator. */
        EXPR_FNAME_bit,    /* ignore newline, no reserved words. */
        EXPR_DOT_bit,      /* right after `.', `&.' or `::', no reserved words. */
        EXPR_CLASS_bit,    /* immediate after `class', no here document. */
        EXPR_LABEL_bit,    /* flag bit, label is allowed. */
        EXPR_LABELED_bit,  /* flag bit, just after a label. */
        EXPR_FITEM_bit,    /* symbol literal as FNAME. */
        EXPR_MAX_STATE
    };
  9. Tells you what — Not why "We're in EXPR_BEG, so

    * is a splat." — OK. But why is this EXPR_BEG? Why is * a splat here and multiplication there? lex_state gives the verdict but doesn't distinguish the cause. lex_state
  10. 01 Tight Coupling The lexer depends on parser internals. Changes

    ripple unpredictably. 02 Poor Maintainability One fix can break something else. Regression-prone. 03 Hidden Semantics Language semantics are buried in lexer-state logic. 3 technical problems
  11. lex_state Everything Crammed into One Box Genuinely grammatical ambiguity LALR

    compression artifacts Semantic issues lex_state doesn't distinguish between them. It treats all three the same way.
  12. The Information Gap Lexer → Parser: one-way flow Lexer doesn't

    know where the parser is in the grammar Traditional LR Parser lex_state was the workaround — manual flags as a bridge
  13. PSLR Flips the Relationship PSLR(1) = Pseudo-Scannerless Minimal LR(1) Shares

    the parser's LALR state with the lexer The lexer can ask before returning a token: "If I return this token, will the parser accept it?" One-way flow → two-way flow
  14. The Theoretical Foundation The Pseudo-Scanner: Only recognizes tokens the current

    parser state accepts. A Known Problem: When parser states are merged, pseudo-scanner behavior can break. PSLR(1) Coming Later: This becomes Layer 5 in this talk.
  15. PSLR: The Core Idea If the parser's LALR state already

    knows which tokens could come next, there's no need for the lexer to maintain its own state.
  16. A Byproduct The original goal: remove lex_state using PSLR(1). Removing

    lex_state revealed different kinds of ambiguity, layer by layer — and that gave me a map. The real value
  17. LALR Parser States = Item Sets A parser state =

    a set of positions inside grammar rules. Key point: it often already contains useful information about the next token. In LALR parser
  18. primary → keyword_super • '(' call_args rparen primary → keyword_super

    • command_call → keyword_super • command_args From this state, both '(' (leading to '*') and command_args (leading to tSTAR) could follow. Example: State after keyword_super
  19. Disambiguating {

    The parser's action selection after receiving a token, repurposed as a question before returning it. This is the core of PSLR. 3 candidates: tLBRACE (hash), '{' (block), tLBRACE_ARG (command block). Look up the action table for each. If exactly one is accepted, that's the answer.
  20. Auto-Generated: yy_state_accepts_token()

    int yy_state_accepts_token(int yystate, int yychar) {
        yysymbol_kind_t yytoken = YYTRANSLATE(yychar);
        int yyn = yypact[yystate];
        if (yypact_value_is_default(yyn)) return 0;
        yyn += yytoken;
        if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken) return 0;
        yyn = yytable[yyn];
        if (yyn <= 0) return !yytable_value_is_error(yyn);
        return 1;
    }
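
The question-before-returning idea from slides 19 and 20 can be tried out on a toy table. The following is a minimal Ruby sketch, not the generated code: the `ACTIONS` hash, the state names, and `disambiguate_brace` are hypothetical stand-ins, whereas the real function reads the packed `yypact`/`yytable`/`yycheck` arrays.

```ruby
# Toy action table: parser state => { token => action }.
# All state names and entries are hypothetical; the real parser
# consults packed yypact/yytable/yycheck arrays instead.
ACTIONS = {
  expr_start:    { tLBRACE: [:shift, 10] },      # tLBRACE opens a hash literal
  after_call:    { "{": [:shift, 20] },          # '{' opens a block
  after_command: { tLBRACE_ARG: [:shift, 30] }   # tLBRACE_ARG opens a command block
}.freeze

BRACE_CANDIDATES = [:tLBRACE, :"{", :tLBRACE_ARG].freeze

# Counterpart of yy_state_accepts_token: does the state have an action
# for this token?
def state_accepts_token?(state, token)
  ACTIONS.fetch(state, {}).key?(token)
end

# Return the single candidate the state accepts, or nil when zero or
# several candidates are acceptable and a later layer has to decide.
def disambiguate_brace(state)
  accepted = BRACE_CANDIDATES.select { |tok| state_accepts_token?(state, tok) }
  accepted.size == 1 ? accepted.first : nil
end
```

Returning `nil` instead of guessing mirrors the cascade described later in the talk: a layer answers only when it is certain.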
  21. Layer 1 Example: + after def State S after shifting

    def: def +(other) @value + other.value end. yy_state_accepts_token(S, '+') → 0 (not accepted as binary addition). Accepted only as a method name (fname category). Traditionally: SET_LEX_STATE(EXPR_FNAME) told the lexer "next is a method name." But the action table already knew.
  22. Layer 1 Discovery These ambiguities were not ambiguous in the

    grammar at all. lex_state was unnecessarily mediating.
  23. But the Direct Check Isn't Always Enough parse.y has many

    empty rules: opt_block_param → ε, opt_rescue → ε, $@N → ε Default reduction kicks in → per-token entries absent → false negatives Layer 1
  24. Layer 2: Empty Reduction Tracing

    int yy_state_eventually_accepts_token(int yystate, int yychar) {
        int visited[YYNSTATES] = {0};
        for (;;) {
            if (visited[yystate]) return 0;
            visited[yystate] = 1;
            if (yy_state_accepts_token(yystate, yychar)) return 1;
            int yyn = yydefact[yystate];
            if (yyn == 0 || yyr2[yyn] != 0) return 0; // non-empty → stop
            // GOTO after ε-reduction ...
        }
    }

    Traces chains of empty reductions (RHS length = 0). No stack access needed. Stops at non-empty reductions (→ Layer 3).
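
The loop above leaves the GOTO step elided. As a rough illustration of the whole trace, here is a minimal Ruby sketch over a toy grammar. Every table in it (`ACCEPTS`, `DEFAULT_REDUCTION`, `RHS_LENGTH`, `GOTO`) and every state or rule name is a hypothetical stand-in and is not taken from parse.y.

```ruby
require "set"

# Hypothetical tables for a toy grammar (not taken from parse.y).
ACCEPTS           = { s2: [:tSTAR] }.freeze                       # per-token entries
DEFAULT_REDUCTION = { s0: :midrule, s1: :opt_block_param }.freeze # state => rule
RHS_LENGTH        = { midrule: 0, opt_block_param: 0 }.freeze     # both rules are empty
GOTO              = { [:s0, :midrule] => :s1,
                      [:s1, :opt_block_param] => :s2 }.freeze

# Follow default reductions of empty rules until a state with a
# per-token entry for `token` is found. An empty rule pops nothing,
# so the GOTO is taken from the current state and no stack is needed.
def eventually_accepts?(state, token)
  visited = Set.new
  loop do
    return false unless visited.add?(state)       # already seen: give up
    return true if ACCEPTS.fetch(state, []).include?(token)

    rule = DEFAULT_REDUCTION[state]
    return false if rule.nil? || RHS_LENGTH.fetch(rule) != 0  # non-empty: stop
    state = GOTO.fetch([state, rule])
  end
end
```

With these tables, starting from `:s0` the trace reaches `:s2` after two empty reductions, which is the shape of the `do *` example on the next slide.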
  25. Layer 2 Example: * after do

    [1, 2, 3].each do *a, b = [4, 5, 6] end

    State S0 after do. Layer 1: 0 (default reduction). Layer 2:
    S0 → $@42 → ε → GOTO → S1
    S1 → opt_block_param → ε → GOTO → S2
    yy_state_accepts_token(S2, tSTAR) → 1 ✓ / '*' → 0 ✗
  26. Layer 3: Stack Traversal

    int yy_state_deep_accepts_token(int yystate, int yychar, const short *stack_base, const short *stack_top) {
        // ...
        if (rhs_len > 0) { // Non-empty: unwind the stack
            const short *target = stack_top - (stack_consumed + rhs_len);
            int uncovered_state = *target;
            // GOTO from uncovered_state ...
        }
    }

    YYSETSTATE_CONTEXT macro delivers stack pointers to the lexer.
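
To make the unwinding step concrete, here is a minimal Ruby sketch that simulates non-empty default reductions on a copy of the state stack. It is a toy under stated assumptions: `RULES`, `DEFAULT_RULE`, `GOTO_TABLE`, `SHIFTABLE`, the state names, and the `max_steps` guard are all hypothetical and do not come from the generated parser.

```ruby
# Hypothetical tables for a toy grammar (not taken from parse.y).
RULES        = { wrap_item: { lhs: :item, rhs_len: 1 } }.freeze
DEFAULT_RULE = { after_word: :wrap_item }.freeze          # state => default rule
GOTO_TABLE   = { [:list_open, :item] => :after_item }.freeze
SHIFTABLE    = { after_item: [:comma] }.freeze            # per-token entries

# Simulate default reductions on a copy of the state stack until a
# state with a per-token entry for `token` shows up.
def deep_accepts?(stack, token, max_steps: 32)
  stack = stack.dup                         # never touch the parser's real stack
  max_steps.times do
    state = stack.last
    return true if SHIFTABLE.fetch(state, []).include?(token)

    rule = RULES[DEFAULT_RULE[state]]
    return false if rule.nil?               # no default reduction: reject

    stack.pop(rule[:rhs_len])               # unwind the states covered by the RHS
    return false if stack.empty?
    stack.push(GOTO_TABLE.fetch([stack.last, rule[:lhs]]))  # GOTO from uncovered state
  end
  false
end
```

Working on a duplicate of the stack keeps the check free of side effects, which matters because the lexer asks the question before the parser has actually consumed the token.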
  27. Layer1-3 Implementation Problems Layer 1: lex_state's unnecessary mediation Layer 2:

    Hidden by LALR default reduction Layer 3: Beyond LALR(1) lookahead depth The grammar itself isn't ambiguous. Looking at the parser state correctly gives the answer.
  28. Layer4 The tLABEL Problem f(a: 1) → a: is tLABEL

    f(a :foo) → a is tIDENTIFIER, : is tSYMBEG Single-token checks (Layers 1–3) cannot resolve this — need input patterns.
  29. Pseudo-scan: Token Patterns → Scanner FSA

    Each regex → NFA → combined NFA → single DFA.

    %token-pattern tIDENTIFIER /[a-z_][a-zA-Z0-9_]*/
    %token-pattern tLABEL /[a-z_][a-zA-Z0-9_]*:/

    DFA: state 0 --[a-z_]--> state 1 (accepting: tIDENTIFIER); state 1 --[:]--> state 2 (accepting: tLABEL).
    Output: yy_scanner_transition[fsa_state][256] in C.
  30. The scanner_accepts Table

    (Parser state × FSA accepting state) → which token to return.
    Parser state A (command arg position): FSA accept 1 (tIDENTIFIER) → tIDENTIFIER; FSA accept 2 (tLABEL) → tLABEL ✓
    Parser state B (end of expression): FSA accept 1 → tIDENTIFIER; FSA accept 2 → nil ✗
  31. Runtime: yy_pseudo_scan()

    int yy_pseudo_scan(int parser_state, const char *input, int *match_length) {
        // ... (Walks input character by character)
        // At each accepting state, consults scanner_accepts.
        // Longest-match semantics.
    }
  32. Layer 4 Example: name: in Method Arguments Pseudo-scan trace: n→a→m→e

    → FSA state 1 (tIDENTIFIER accepting). pbest=tIDENTIFIER : → FSA state 2 (tLABEL accepting). scanner_accepts → tLABEL ✓ (space) → no transition → scan ends foo(name: "Ruby", version: 3) Result: name: as tLABEL (match_length=5)
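
The trace above can be reproduced with a small Ruby model of the pseudo-scan. This is a sketch under assumptions: `fsa_step` hand-codes the two token patterns instead of using a generated transition table, and the parser state names `:command_arg` and `:expr_end` as well as the contents of `SCANNER_ACCEPTS` are hypothetical.

```ruby
# Hand-coded FSA for two patterns (a stand-in for the generated table):
#   tIDENTIFIER /[a-z_][a-zA-Z0-9_]*/   and   tLABEL /[a-z_][a-zA-Z0-9_]*:/
# Returns the next FSA state, or nil when there is no transition.
def fsa_step(state, ch)
  case state
  when 0 then ch.match?(/[a-z_]/) ? 1 : nil
  when 1
    if ch.match?(/[a-zA-Z0-9_]/) then 1
    elsif ch == ":" then 2
    end
  end
end

FSA_ACCEPTING = { 1 => :tIDENTIFIER, 2 => :tLABEL }.freeze

# Hypothetical (parser state x FSA accept) table: which token to return.
SCANNER_ACCEPTS = {
  command_arg: { tIDENTIFIER: :tIDENTIFIER, tLABEL: :tLABEL },
  expr_end:    { tIDENTIFIER: :tIDENTIFIER }
}.freeze

# Walk the input, remembering the longest match the parser state allows.
# Returns [token, match_length] or nil.
def pseudo_scan(parser_state, input)
  state = 0
  best = nil
  input.each_char.with_index(1) do |ch, length|
    state = fsa_step(state, ch)
    break if state.nil?

    candidate = FSA_ACCEPTING[state]
    token = candidate && SCANNER_ACCEPTS.fetch(parser_state)[candidate]
    best = [token, length] if token
  end
  best
end
```

For the input `name: "Ruby"`, the `:command_arg` state yields `[:tLABEL, 5]`, matching the slide's match_length=5, while `:expr_end` falls back to `[:tIDENTIFIER, 4]` because that state filters out the tLABEL accept.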
  33. Layer 4 Discovery The boundary between lexing and parsing is

    blurred. Grammatical decisions embedded inside token-level concerns.
  34. Layer 5: LALR State Merging Corrupts scanner_accepts

    State A: only tDIVIDE. State B: only tREGEXP_BEG. Merging → both allowed → pseudo-scan can't disambiguate. Lrama addresses this with: Lexer Context classification, Inadequacy detection, State splitting.
  35. Lexer Context Classification

    Context Classifier: checks symbol left of • → stores bitmask. Inadequacy Detection: compares expected vs actual scanner profiles. State Splitting: separate by context → restore correct profiles.

    %lexer-context BEG keyword_if keyword_unless tLPAREN ...
    %lexer-context CMDARG tIDENTIFIER tFID tCONSTANT ...
    %lexer-context END tINTEGER tFLOAT keyword_end ')' ']' ...
  36. Layer 5 Example: * in BEG vs END

    if *a == [1, 2]  # BEG: tSTAR (splat)
    x = a * b        # END: '*' (multiplication)

    Context Classifier: keyword_if → BEG. expr → END. Inadequacy: expected ≠ actual → State splitting. Runtime: parser_pslr_context_is(p, YY_CTX_BEG) → tSTAR.
  37. Layer 5 Example: / as Regexp vs Division

    x = /pattern/  # BEG: tREGEXP_BEG
    y = a / b      # END: '/' (division)

    After = → BEG. After a → END. Merged → both accepted. Inadequacy → state splitting → BEG → tREGEXP_BEG, END → '/'.
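
The merge, detect, split sequence for `/` can be illustrated with a deliberately tiny Ruby model. It is not Lrama's algorithm: `STATE_A`, `STATE_B`, and the three helper methods are hypothetical, and a real state carries item sets rather than a bare list of accepted tokens.

```ruby
# Two hypothetical pre-merge states; each remembers its lexer context
# and the scanner profile (accepted tokens) it expects.
STATE_A = { context: :BEG, accepts: [:tREGEXP_BEG] }.freeze
STATE_B = { context: :END, accepts: [:"/"] }.freeze

# LALR-style merge: the accepted-token sets are simply unioned.
def merge_states(states)
  { accepts: states.flat_map { |s| s[:accepts] }.uniq }
end

# Inadequacy: some original state expected a different profile than
# the merged state now offers.
def inadequate?(merged, states)
  states.any? { |s| s[:accepts].sort != merged[:accepts].sort }
end

# State splitting: regroup the originals by lexer context so that each
# resulting state gets its own profile back.
def split_by_context(states)
  states.group_by { |s| s[:context] }
        .transform_values { |group| merge_states(group)[:accepts] }
end
```

In this model the merged state accepts both tokens, `inadequate?` flags it, and splitting by context restores one token per context, which is the outcome the slide describes.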
  38. Layer 5 Example: tLABEL Inadequacy foo a: 1 # id

    • args → tLABEL accepted x = a + b # id • → tLABEL not accepted Actual (merged): FSA accept 2 → tLABEL ← Path A leaked State splitting: state 42 {A} → tLABEL ✓ state 1889 {B} → tLABEL ✗
  39. Layer 6: last_token_type

    Not 13 flags. Four values. The true minimum extracted from lex_state.

    #define LAST_TOKEN_OTHER 0
    #define LAST_TOKEN_LVAR 1
    #define LAST_TOKEN_VALUE 2
    #define LAST_TOKEN_METHOD 3

    Unsolvable by all PSLR layers. References CRuby's local variable table (runtime info).
  40. Layer 6 Example: << (Left Shift vs Heredoc)

    Both tIDENTIFIER. LALR states identical. Layers 1–5 all fail.

    m = 1
    puts m <<0      # m << 0 (left shift)
    puts foo <<0    # foo(<<0) (heredoc argument)

    if (LAST_TOKEN_IS_VALUE(p)) return tLSHFT; // LVAR → left shift
    // METHOD → heredoc
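
Because this decision depends on the local variable table rather than on the grammar, it can be modelled in a few lines of Ruby. The sketch below is an assumption-laden toy: both method names are hypothetical, and treating `:lvar` and `:value` as "value" is a guess, since the talk does not show the definition of `LAST_TOKEN_IS_VALUE`.

```ruby
require "set"

# Classify an identifier by consulting the local variable table,
# which is information the grammar itself does not have.
def last_token_type(name, local_vars)
  local_vars.include?(name) ? :lvar : :method
end

# Assumption: :lvar and :value count as "value" here; the real
# LAST_TOKEN_IS_VALUE macro is not shown in the talk.
def token_for_double_less_than(last_type)
  %i[lvar value].include?(last_type) ? :tLSHFT : :tSTRING_BEG
end

# Convenience wrapper: which token should `<<` become after `name`?
def shift_or_heredoc(name, local_vars)
  token_for_double_less_than(last_token_type(name, local_vars))
end
```

With `m` recorded as a local variable, `<<` after `m` becomes tLSHFT, while after an unknown name such as `foo` it starts a heredoc (tSTRING_BEG, as on slide 7).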
  41. Layer 6 Example: & (Bitwise AND vs Block Arg)

    if (LAST_TOKEN_IS_VALUE(p)) {
        return warn_balanced('&', "&", "argument prefix"); // AND
    }

    m = proc {}
    bar m &n      # m & n (bitwise AND)
    bar foo &n    # foo(&n) (block argument)
  42. Layer 6 Example: ( — Space + Parenthesis m =

    1 foo m (1) # foo(m, (1)). m = local var foo bar (1) # foo(bar(1)). bar = method
  43. Layer 6 Discovery Semantic ambiguities invisible to the grammar. Local

    vars and method calls looking identical = Ruby's intentional design choice.
  44. ARCHITECTURE The Six-Layer Ambiguity Map

    01 Direct action table: def +, ][, if :
    02 Empty reduction tracing: do *, begin -, { &
    03 Stack traversal: yield *, return **, rescue =>
    04 Pseudo-scan: name:, a :foo, {key:
    05 Lexer Context + split: if * / a *, tLABEL
    06 last_token_type: m << / foo <<, m ( / bar (
  45. Layers 1-6: Three Categories of Ambiguity

    Layers 1-3, Implementation Problems: Grammar isn't ambiguous.
    Layers 4-5, Design Boundary / Algorithm: Lexing/parsing boundary; LALR side effects.
    Layer 6, Language Design: Semantic ambiguity grammar cannot capture.
  46. 01 Most of lex_state Was Unnecessary Layers 1–3 automatically resolvable.

    02 Guidelines for New Syntax Perspectives offered by 6 Layers Layer 6 → where language design decisions are needed. 03 parse.y as a Specification Remove the hidden second grammar → parse.y alone tells the full story. Contributors can read and understand Ruby's syntax. 3 implications
  47. Deleting lex_state was the means The real achievement: the structure

    of Ruby's grammatical ambiguity became visible.
  48. 01 Theory is a lens, not a destination Separating what

    PSLR can solve from what it can't reveals each ambiguity's true nature. Theory gives the structure of the questions. 02 Replace chaos with hierarchy In parse.y, we moved from ad-hoc hacks to a clear decision cascade. By testing the cheapest, most certain conditions first, we built a discipline that makes a massive system manageable. 03 Solving a grammar gives you the power to design it The ambiguity map = a tool for Ruby's future syntax design. Predict and control which layer a new feature affects. 3 Takeaways