What is Parser

yui-knk
September 07, 2024

Transcript

  1. About Me The world is now in the great age

    of parsers. People are setting sail into the vast sea of parsers. - RubyKaigi 2023 LT- Yuichiro Kaneko https://twitter.com/kakutani/status/1657762294431105025/
  2. Self Introduction • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf

    (Twitter) • Treasure Data • Engineering Manager of Applications Backend
  3. In OSS world • Yuichiro Kaneko • yui-knk (GitHub) /

    spikeolaf (Twitter) • CRuby committer, mainly developing the parser generator and parser • Lrama, an LALR(1) parser generator (2023, Ruby 3.3) • The Bison Slayer • Ripper Rearchitecture (2024, Ruby 3.4) • Code positions to RNode (2018, Ruby 2.6) • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6)
  4. What is Parser Parser generator consists of three parts, Frontend,

    Backend and Code Generator. Each component is independent of the others, so we only need to touch the necessary components when a new feature is added. - BuriKaigi 2024 in Toyama -
  5. What parser does • Parser gives the structure to input

    string • Really ? Class Method Method Assignment @name Call name capitalize
  6. What parser does • Parser gives the structure to input

    bytes, not a string: 636c61737320477265657465720a2020646566 20696e697469616c697a65286e616d65290a20 202020406e616d65203d206e616d652e636170 6974616c697a650a2020656e640a0a2020646 5662073616c7574650a2020202070757473202 248656c6c6f20237b406e616d657d21220a202 0656e640a656e640a
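To make the "input is bytes" point concrete, the first group of hex digits on this slide can be decoded back into source text. This is only a small sketch using Ruby's `Array#pack`; the constant names are made up for the example.

```ruby
# First group of hex digits from the slide (hypothetical constant name).
FIRST_CHUNK_HEX = "636c61737320477265657465720a2020646566"

# "H*" turns a hex string (high nibble first) into raw bytes.
DECODED = [FIRST_CHUNK_HEX].pack("H*")

# Show the decoded bytes with escapes visible.
puts DECODED.inspect
```

The decoded bytes are the opening of the Greeter class definition that the later slides tokenize.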
  7. What lexer does • Cut bytes into chunks (tokens) 636c61737320477265657465720a2020646566

    20696e697469616c697a65286e616d65290a20 202020406e616d65203d206e616d652e636170 6974616c697a650a2020656e640a0a2020646 5662073616c7574650a2020202070757473202 248656c6c6f20237b406e616d657d21220a202 0656e640a656e640a class Greeter def
  8. Parser & Lexer • Lexer generates tokens from bytes •

    Parser gives structure to tokens Class Method Method Assignment @name Call name capitalize class Greeter def Lexer Parser …
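The lexer half of this picture can be sketched with Ruby's `StringScanner`. This is a toy under stated assumptions: the token names (`:keyword`, `:ident`, `:ivar`, `:punct`) and the `tokenize` helper are hypothetical, it handles only a tiny subset of Ruby, and CRuby's real lexer works quite differently.

```ruby
require "strscan"

# Hypothetical keyword list; real Ruby has many more keywords.
KEYWORDS = %w[class def end].freeze

# Cut the source into [type, text] chunks, skipping whitespace.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next
    elsif (text = scanner.scan(/@[A-Za-z_]\w*/))
      tokens << [:ivar, text]
    elsif (text = scanner.scan(/[A-Za-z_]\w*/))
      tokens << [KEYWORDS.include?(text) ? :keyword : :ident, text]
    elsif (text = scanner.scan(/[=.()]/))
      tokens << [:punct, text]
    else
      raise ArgumentError, "unexpected character #{scanner.peek(1).inspect}"
    end
  end
  tokens
end

source = "class Greeter\n  def initialize(name)\n    @name = name.capitalize\n  end\nend\n"
tokenize(source).each { |type, text| puts "#{type}: #{text}" }
```

A parser would then consume these chunks and build the Class / Method / Assignment structure shown on the slide.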
  9. Theory of formal language Defeat calamities by more powerful theory,

    abstraction and Refactoring - RubyKaigi 2024 in Okinawa -
  10. Language • A (formal) language is a subset of words

    • Some words belong to Ruby language • Others don’t
  11. Ruby • Even though this code is transcendental and convoluted,

    it still belongs to the Ruby language. https://github.com/tric/trick2022/blob/master/01-tompng/entry.rb https://github.com/tric/trick2022/blob/master/06-mame/entry.rb
  12. Not Ruby • At a glance, this code looks like Ruby

    code, but it doesn’t belong to the Ruby language.
  13. Grammar • The (Ruby) language is an infinite set of words

    • A grammar is a finite set of rules that defines the language Grammar Language …
  14. Grammar class and automaton • Chomsky hierarchy • Four formal

    grammar classes form a hierarchy • Grammars and automata correspond to each other • Regular: finite-state automaton • Context-free: non-deterministic pushdown automaton • Context-sensitive: linear-bounded non-deterministic Turing machine • Recursively enumerable: Turing machine
  15. Use appropriate grammar class • With great power comes great

    difficulties • A context-sensitive grammar is harder to read, and its production rules are harder to design, than a context-free grammar • Production rules for {a^n b^n c^n : n ≥ 1}: S → abc | aSBc, cB → Bc, bB → bb (multiple terminals and nonterminals appear on the left-hand side) • Generate “aabbcc”: S → aSBc → aabcBc → aabBcc → aabbcc • Generate “aaabbbccc”: S → aSBc → aaSBcBc → aaabcBcBc → aaabcBBcc → aaabBcBcc → aaabBBccc → aaabbBccc → aaabbbccc • https://ja.wikipedia.org/wiki/%E6%96%87%E8%84%88%E4%BE%9D%E5%AD%98%E6%96%87%E6%B3%95
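The derivations on this slide can be replayed mechanically. The following is a sketch only: `derive` is a hypothetical helper that applies the three production rules with plain string substitution in one fixed order, so for larger n the intermediate steps may appear in a different order than on the slide, while the final word is the same.

```ruby
# Production rules from the slide: S → abc | aSBc, cB → Bc, bB → bb.
# Returns every intermediate sentential form, starting from "S".
def derive(n)
  raise ArgumentError, "n must be >= 1" if n < 1

  steps = ["S"]
  # Apply S → aSBc (n - 1) times, then finish with S → abc.
  (n - 1).times { steps << steps.last.sub("S", "aSBc") }
  steps << steps.last.sub("S", "abc")
  # Move every B to the left of the c's with cB → Bc.
  steps << steps.last.sub("cB", "Bc") while steps.last.include?("cB")
  # Turn each B into b with bB → bb.
  steps << steps.last.sub("bB", "bb") while steps.last.include?("bB")
  steps
end

puts derive(2).join(" → ")
puts derive(3).last
```

Note how each rewriting step has to look at neighbouring symbols (`cB`, `bB`), which is exactly what makes such rules harder to read than context-free ones.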
  16. Context-free grammar (CFG) • Context-free grammar is readable • Then

    you can read it and try it out • In a CFG, a single nonterminal appears on the left-hand side of each rule
  17. if + class • The code raises NoMethodError, however it’s

    syntactically valid $ ruby -c test.rb Syntax OK $ ruby test.rb test.rb:3:in '<main>': undefined method '+' for nil (NoMethodError) end + class C ^
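The difference between "syntactically valid" and "runs without error" can be checked from Ruby itself. The snippet below is a hedged sketch: the slide does not show the full `test.rb`, so `SOURCE` is a hypothetical reconstruction of the `end + class C` pattern, and `syntax_ok?` / `runtime_error_for` are made-up helper names. It relies on `Ripper.sexp` returning `nil` for source with a syntax error.

```ruby
require "ripper"

# Hypothetical reconstruction; the exact test.rb from the slide is not shown in full.
SOURCE = <<~RUBY
  if false
  end + class C
  end
RUBY

# Ripper.sexp returns nil when the source cannot be parsed.
def syntax_ok?(source)
  !Ripper.sexp(source).nil?
end

# Evaluate the source and return the NoMethodError it raises, or nil.
def runtime_error_for(source)
  eval(source)
  nil
rescue NoMethodError => e
  e
end

puts "syntax ok: #{syntax_ok?(SOURCE)}"
puts "runtime error: #{runtime_error_for(SOURCE).class}"
```

Both `if false ... end` and an empty class body evaluate to `nil`, so the parser accepts the program and the failure only shows up when `+` is sent to `nil`.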
  18. Context-free grammar (CFG) • Context-free grammar is widely used in

    programming languages • To be accurate, deterministic context-free languages (DCFL) • DCFLs are a proper subset of the context-free languages • An LR parser analyzes a DCFL in linear time
  19. In Chomsky hierarchy Context-free Context-sensitive Recursively enumerable Linear-bounded non-deterministic Turing

    machine Non-deterministic pushdown automaton Turing machine DCFL Regular Finite-state automaton Deterministic pushdown automaton
  20. Why LR parser? • LR parser • Can handle large

    range of languages • Major parser algorithm • To be precise, LR-attributed grammar • I believe a grammar that is easy for humans is close to an LR grammar • LL parser • Has less power than LR parser • PEG • It’s difficult to create an error-tolerant parser • A rule failure doesn’t imply a parsing failure like in context-free grammars
  21. How to create parser? • Use parser generator • Lrama

    (CRuby) • Bison (Perl, PHP, PostgreSQL) • ANTLR (Hive, Trino) • Hand written parser • Go, Rust, C# • Prism
  22. Why LR parser generator is the best? • LR parser

    generator gives accurate feedback for grammar • BNF is very declarative • No gap between grammar and parser implementation • LR parser is based on theory of computer science
  23. RubyKaigi 2024 • Check slides and video for more detail

    • https://rubykaigi.org/2024/presentations/spikeolaf.html
  24. Actually context-free grammar? • Whether Ruby’s grammar is a CFG

    is sometimes debated • This is a trick used in TRICK 2022 • This is NOT a CFG because the existence of the variable affects the code that follows https://www.slideshare.net/mametter/trick-2022-results
  25. However • The current LR parser can parse such code •

    Ruby committers have hacked the parser but have NOT hacked the LR parser algorithm • There must be some tricks somewhere
  26. LR-attributed grammar (LR 属性文法) • The key concept is LR-attributed

    grammar • LR parser can handle LR-attributed grammar
  27. Attribute Grammar (属性文法) • Attribute grammars were invented by Donald

    Knuth and Peter Wegner • Original paper is Knuth, Donald E. (1968) "Semantics of context-free languages" • “An attribute grammar is a formal way to supplement a formal grammar with semantic information processing.” • https://en.wikipedia.org/wiki/Attribute_grammar
  28. Static semantic analysis • Use cases • Check variable declarations

    and usages • Type checking • Check control flow • Invalid: function f1() { var i = 1; i + j; } → Error: undeclared variable “j” is used • Valid: function f1() { var i = 1; var j = 2; i + j; }
  29. Check variable declarations and usages • This language has a

    semantic rule: “a variable must be declared before it is used” • Represent this semantic rule formally in a grammar
  30. decl: 'var' ident '=' integer ';' {{ decl.var_list[ident.value] = integer.value }} (Add an identifier to a list)

    expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list; raise "#{ident_1.value} is not declared" unless vars[ident_1.value]; raise "#{ident_2.value} is not declared" unless vars[ident_2.value]; expr.value = vars[ident_1.value] + vars[ident_2.value] }} (Check identifiers are declared, then use the identifiers’ values)
    decls: decls decl {{ decls.var_list = decls.var_list.merge(decl.var_list) }} (Merge identifier lists into one)
    decls: decl {{ decls.var_list = decl.var_list }} (Copy identifier list)
    func_body: decls expr {{ expr.var_list = decls.var_list }} (Pass identifier list to expr so that we can access the identifier list in expr)
    * Only important production rules and semantic rules are shown
  31. Syntax Tree • Create syntax tree from input string decls

    decls func_body expr + decl j = 2 decl i = 1 function f1() { var i = 1; var j = 2; i + j; } ident i ident j
  32. Analyze dependency of the variable list • In “expr”, “ident_1”

    and “ident_2” need variable list of “expr” decls decls func_body expr + decl j = 2 decl i = 1 expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list … expr.value = vars[ident_1.value] + vars[ident_2.value] }} ident i ident j
  33. Analyze dependency of the variable list • In “func_body”, “expr”

    need variable list of “decls” decls decls func_body expr + decl j = 2 decl i = 1 func_body: decls expr {{ expr.var_list = decls.var_list }} ident i ident j
  34. Analyze dependency of the variable list • In “decls”, “decls”

    need variable list of “decls” and “decl” decls decls func_body expr + decl j = 2 decl i = 1 decls: decls decl {{ decls.var_list = decls.var_list.merge( decl.var_list ) }} ident i ident j
  35. Analyze dependency of the variable list • In “decls”, “decls”

    need variable list of “decls” and “decl” decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j decls: decls decl {{ decls.var_list = decls.var_list.merge( decl.var_list ) }}
  36. Create attribute evaluator • Invert the dependency direction to get the attribute

    evaluator decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; }
  37. How attribute evaluator works • Visit “i = 1” then

    update the list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1}
  38. How attribute evaluator works • Visit “j = 2” then

    update the list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2}
  39. How attribute evaluator works • Visit “i + j” with

    the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2}
  40. How attribute evaluator works • Resolve “i” and “j” with

    the variable list decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j function f1() { var i = 1; var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1, j: 2} (3) list = {i: 1, j: 2} (4) list = {i: 1, j: 2} (5) list = {i: 1, j: 2}
  41. Semantically invalid code • Failed to resolve “j” because it’s

    not declared decls func_body expr + decl i = 1 ident i ident j function f1() { var i = 1; // var j = 2; i + j; } (1) list = {i: 1} (2) list = {i: 1} (3) list = {i: 1} (4) Error!!!
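The evaluation order walked through on the slides above can be sketched as a tiny hand-written attribute evaluator. This is an illustration under assumptions, not generated code: the `Decl`, `Expr` and `FuncBody` structs and the helper names are hypothetical, and the syntax tree is built by hand rather than by a parser.

```ruby
# Hypothetical tree node types for the toy language on the slides.
Decl = Struct.new(:name, :value)
Expr = Struct.new(:left, :right)
FuncBody = Struct.new(:decls, :expr)

# Synthesized attribute: the variable list is built bottom-up from the decls.
def synthesize_var_list(decls)
  decls.reduce({}) { |list, decl| list.merge(decl.name => decl.value) }
end

# Inherited attribute: expr receives the variable list from its parent.
def evaluate_expr(expr, var_list)
  [expr.left, expr.right].sum do |name|
    var_list.fetch(name) { raise NameError, "#{name} is not declared" }
  end
end

# func_body passes the list synthesized by decls down to expr.
def evaluate(func_body)
  var_list = synthesize_var_list(func_body.decls)
  evaluate_expr(func_body.expr, var_list)
end

valid = FuncBody.new([Decl.new("i", 1), Decl.new("j", 2)], Expr.new("i", "j"))
puts evaluate(valid)

invalid = FuncBody.new([Decl.new("i", 1)], Expr.new("i", "j"))
begin
  evaluate(invalid)
rescue NameError => e
  puts "Error: #{e.message}"
end
```

The first call mirrors the valid `f1` (both identifiers resolve), and the second mirrors the slide where `var j = 2;` is commented out.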
  42. Automatic generation • Auto-generation of attribute evaluators from semantic rules

    has been studied • Diagram: Grammar file with attributes, Parser generator, Attribute evaluator generator, Parser, Attribute evaluator, Program
  43. Inherited & Synthesized • Attributes are divided into two groups

    • Inherited Attribute (継承属性): Attribute calculated based on a parent and siblings • Synthesized Attribute (合成属性): Attribute calculated based on children • decl: 'var' ident '=' integer ';' {{ decl.var_list[ident.value] = integer.value }} (var_list is a synthesized attribute) • decls: decls decl {{ decls.var_list = decls.var_list.merge(decl.var_list) }} (var_list is a synthesized attribute) • expr: ident_1 '+' ident_2 ';' {{ vars = expr.var_list … expr.value = vars[ident_1.value] + vars[ident_2.value] }} (var_list is an inherited attribute)
  44. Inherited & Synthesized • In decls, var list is Synthesized

    Attribute • In expr, var list is Inherited Attribute • An inherited attribute lets values be passed from parent to children decls decls func_body expr + decl j = 2 decl i = 1 ident i ident j (synthesized) list = {i: 1} (synthesized) list = {i: 1, j: 2} (inherited) list = {i: 1, j: 2} (inherited) list = {i: 1, j: 2} (inherited) list = {i: 1, j: 2}
  45. Attribute grammar can be complex • Dependency is a graph

    not tree • It may be circular • It may require exponential time for calculation • Subset of attribute grammar • L-attributed grammar • LR-attributed grammar • S-attributed grammar
  46. How LR parser works • Mental model of LR parser

    is that some automatons are managed by a stack • Generate automatons from each rule program : class_def class_def : "class" id body "end" body : method_def method_def : "def" id "end" M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 P1 P2 method_def def end id class_def C4 class id body end
  47. How LR parser works • At the beginning, one automaton

    exists on the stack class A def m end end P1 P2 class_def
  48. How LR parser works • Parser read “class” then new

    automaton is pushed onto the stack class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end
  49. How LR parser works • Parser read “A” then current

    automaton state is updated class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end
  50. How LR parser works • Parser read “def” then new

    automatons are pushed onto the stack class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end
  51. How LR parser works • Parser read “m” and “end”

    then current automaton reaches to the accepting state class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end
  52. How LR parser works • Pop the current automaton then

    move next automaton state to “B2” • Next automaton also reaches to the accepting state class A def m end end P1 P2 class_def B1 B2 C1 C2 C3 C5 method_def C4 class id body end
  53. How LR parser works • Pop the current automaton then

    move next automaton state to “C4” class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end
  54. How LR parser works • Parser read “end” then current

    automaton reaches to the accepting state class A def m end end P1 P2 class_def C1 C2 C3 C5 C4 class id body end
  55. How LR parser works • Pop the current automaton then

    move next automaton state to “P2” • The automaton reaches the accepting state and no input is left, so the program is accepted class A def m end end P1 P2 class_def
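The push / advance / pop cycle above is what a shift-reduce parser does with its stack. The sketch below is a deliberate simplification: instead of generated automaton states it keeps grammar symbols on the stack and reduces greedily whenever the top of the stack matches a rule's right-hand side. That happens to work for this four-rule grammar but is not a general LR algorithm; `RULES`, `reduce_once` and `accepts?` are hypothetical names.

```ruby
# The four rules from the slides, as [left-hand side, right-hand side].
RULES = [
  [:program,    [:class_def]],
  [:class_def,  [:class, :id, :body, :end]],
  [:body,       [:method_def]],
  [:method_def, [:def, :id, :end]],
].freeze

# Replace a matching right-hand side on top of the stack with its left-hand side.
def reduce_once(stack)
  RULES.each do |lhs, rhs|
    next unless stack.last(rhs.size) == rhs

    stack.pop(rhs.size)
    stack.push(lhs)
    return true
  end
  false
end

# Shift each token, then reduce as long as some rule matches.
def accepts?(tokens)
  stack = []
  tokens.each do |token|
    stack.push(token)
    nil while reduce_once(stack)
  end
  stack == [:program]
end

# "class A def m end end", already tokenized.
puts accepts?(%i[class id def id end end])
```

Reducing `def id end` to `method_def` corresponds to popping the finished automaton, and the resulting symbol advances the automaton underneath it.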
  56. How LR parser works (2) • Program has one method

    definition (defn) or one singleton method definition (defs) program: defn | defs defn: "def" id "end" defs: "def" "self" "." id "end" M1 M2 M3 M4 S1 S2 S3 S5 P1 P2 def end id defn / defs S4 def S6 self . id end
  57. How LR parser works (2) • At the beginning, one

    automaton exists on the stack P1 P2 defn / defs def m end
  58. How LR parser works (2) • Parser read “def” then

    … • Which automatons the parser should put to the stack? M1 M2 M3 M4 S1 S2 S3 S5 P1 P2 def end id defn / defs S4 def S6 self . id end def m end Option 1 Option 2
  59. How LR parser works (2) • Merge these two automatons

    to one automaton M1 M2 M3 M4 S1 S2 S3 S5 def end id S4 def S6 self . id end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end
  60. How LR parser works (2) • Parser read “def” then

    push new merged automaton on the stack P1 P2 defn / defs def m end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end
  61. How LR parser works (2) • LR Parser can postpone

    the decision of automaton def m end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def self.m end
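The merged automaton can be written down as a transition table. This is a sketch with hypothetical state names (the slide's D1 to D8 labels are not mapped one-to-one here): after reading `def` there is a single shared state, and the choice between defn and defs is only made when the next token arrives.

```ruby
# Hypothetical state names; one shared state follows "def".
TRANSITIONS = {
  start:     { "def" => :after_def },
  after_def: { "id" => :defn_id, "self" => :defs_self },
  defn_id:   { "end" => :defn_done },
  defs_self: { "." => :defs_dot },
  defs_dot:  { "id" => :defs_id },
  defs_id:   { "end" => :defs_done },
}.freeze

# Accepting states and the rule each one recognizes.
ACCEPTING = { defn_done: :defn, defs_done: :defs }.freeze

# Returns :defn, :defs, or nil when the tokens match neither rule.
def recognize(tokens)
  state = :start
  tokens.each do |token|
    state = TRANSITIONS.dig(state, token)
    return nil if state.nil?
  end
  ACCEPTING[state]
end

p recognize(%w[def id end])
p recognize(%w[def self . id end])
```

Because both rules share the `after_def` state, the recognizer never has to guess between "Option 1" and "Option 2" right after `def`.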
  62. LR-attributed grammar • An LR-attributed grammar is an attribute grammar that an

    LR parser can evaluate while it parses code • Condition #1: All attribute dependencies run left to right • Condition #2: All inherited attributes in the same state have unique values
  63. #1: left-to-right direction • “in_class” & “in_def” inherited attributes can

    be handled by LR parser • Class can not be defined in def scope • Variable list also can be handled class A def m end end P1 P2 class_def M1 M2 M3 M4 B1 B2 C1 C2 C3 C5 method_def def end id C4 class id body end in_class = true in_def = true
  64. #2: Unique values for the same state • “in_def” inherited

    attribute can be decided just after “def” D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def m end def self.m end in_def = true
  65. #2: Unique values for the same state • “in_singleton_def” inherited

    attribute can’t be decided just after “def” • The attribute doesn’t exist in Ruby • The attribute can be decided just after “self” D1 D2 D5 D7 D6 D8 def D3 D4 self . id end id end def m end def self.m end in_singleton_def = false in_singleton_def = true in_singleton_def = false in_singleton_def = true
  66. LR-attributed grammar • LR-attributed grammar enables LR parsing to handle

    inherited attributes • Inherited attributes carry context from top to bottom • In short, an LR parser can manage context with some limitations • The direction of context flow is left-to-right • I believe this is reasonable because we read code top-to-bottom and left-to-right • If multiple production rules are expected, the value of the context should be unique • I believe this is reasonable because it reduces the cognitive cost for humans
  67. Leverage Theory • It’s not currently popular to generate attribute

    evaluators from semantic rules • However, attribute grammar theory tells us what parsers can and cannot do
  68. Summary • What parsers do • Parser distinguishes valid input

    and invalid input • Parser gives structure to the input correctly • Grammar defines the boundary of the language and structure of the language • Use appropriate grammar class • With great power comes great difficulties • LR-attributed grammar is the foundational theory of the current Ruby parser
  69. Use case oriented approach • The parser’s use case is

    Ruby • Understand Ruby syntax characteristics to understand what are important aspects of parser
  70. Ruby Syntax • Ruby syntax is designed for programmers not

    for machines • What are the key properties of a good programming language for programmers? • However, it is sometimes difficult for programmers to understand which sentences are connected
  71. Grammar rule conflict • It’s not unusual to

    design a grammar whose rules conflict • Example: “Dangling else” • https://en.wikipedia.org/wiki/Dangling_else • Rules: “if a then s” and “if b then s1 else s2” • Code: if a then if b then s else s2 • Parse #1: (if a then (if b then s else s2)) • Parse #2: (if a then (if b then s) else s2)
  72. Grammar with conflict • For example, infix operators are a

    cause of conflict • In many languages, + has lower precedence than *, following the arithmetic we know • Two candidate trees for “1 + 2 * 3”: #1 (+ 1 (* 2 3)), #2 (* (+ 1 2) 3)
  73. Polish Notation + () • Polish Notation + () seems

    to be a good idea • But Ruby didn’t choose this direction
  74. Conflict is a design matter • We can change

    the precedence locally • So it’s not a limitation of the parser but a matter of grammar design • In the discussion, the consistency of precedence between “=” and “and” is kept
  75. Change precedence in some scopes • I implemented “change precedence

    declaration” as PoC • Within { … }, + has higher precedence than * https://github.com/ruby/lrama/pull/254
  76. Flash point of conflict • If the rule’s

    start and end are clear, the chance of conflict decreases • Informally: I consider whether the left context is powerful enough to minimize the rule candidates • In the change-precedence case • The appearance of “{” on the left is powerful enough to distinguish normal expressions from inverse-precedence expressions • The appearance of “}” determines the end of inverse-precedence expressions
  77. Case #1: Method definition • Start is clear

    because method definition always starts with “def” • End is clear because method definition always ends with “end”
  78. Case #1: Method definition • Start is clear

    because method definition always starts with “def” • End is clear because method definition always ends with “end” until Ruby 2.7.0 • Endless method definition was introduced in Ruby 3.0.0
  79. Case #2: Modifier • As explained, infix operators

    are a cause of conflict • Design the precedence based on human cognitive ability • E.g. ‘+’ < ‘*’ • Modifiers have characteristics similar to infix operators
  80. Case #3: parentheses • Parentheses are great • Start is

    clear because the rule starts with “(” • End is clear because the rule ends with “)” • Why do you omit parentheses ???
  81. Ruby Syntax complexities • “The Big Five parse.y calamities” in

    RubyKaigi 2024 • Today’s topic is “Lex State” https://speakerdeck.com/yui_knk/the-grand-strategy-of-ruby-parser?slide=58
  82. What’s Lex State ? • The state of lexer •

    In textbooks, lexer and parser are completely separated components • However, the two are tightly coupled in Ruby • Sometimes it’s called “Monstrous lex_state”
  83. Why lex_state is needed • In general, a lexer checks input

    text with longest-match priority; otherwise the longer token would never match • E.g. Check “||” then check “|”
  84. Why lex_state is needed • However in some cases, shorter

    token should be returned • “||” in block parameter position is two “|” tokens
  85. EXPR_BEG or not • If lex state is EXPR_BEG then

    “|” is returned, otherwise “||” is returned • A lot of conditional branches based on lex state • Too complicated “|” “||” Check lex state
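The branch described on this slide can be sketched in a few lines. This is a toy model, not CRuby's lexer: `next_pipe_token` and the `:expr_beg` / `:expr_end` symbols are hypothetical stand-ins for the real `lex_state` flags.

```ruby
# Decide which pipe token starts the input, given a simplified lex state.
# Longest match ("||") wins unless the state says an expression begins here.
def next_pipe_token(input, lex_state)
  if input.start_with?("||") && lex_state != :expr_beg
    "||"
  elsif input.start_with?("|")
    "|"
  end
end

# Right after "do", a block parameter list may begin: split "||" into "|" "|".
p next_pipe_token("|| end", :expr_beg)
# After an operand, "||" is the logical OR operator.
p next_pipe_token("|| b", :expr_end)
```

Every token whose meaning depends on position needs a branch like this, which is where the "lot of conditional branches" comes from.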
  86. Why it’s terrible • “All bugfixes are incompatibilities” • 36:00

    ~ https://rubykaigi.org/2019/presentations/nagachika.html
  87. Fixing a bug caused other bugs • Fixing [Bug #10653]

    caused [Bug #11456] and [Bug #11849]
  88. ‘:’ is difficult • All of them include ‘:’ … ?

    true ? 1.tap do |n| p n end : 0 {foo: ("" rescue "")} { label:<<-DOC Some text for a heredoc goes here DOC }
  89. Fix [Bug #10653] • By the way, I guess not

    r51617 but r51616 fixed the issue, right? • The error was unexpected “keyword_do_cond” and COND_PUSH(1) is called after ‘?’ • There is a space between “end” and ‘:’
  90. Fix [Bug #10653] • Anyway, r51617 changed the logic from

    managing where label is disallowed (EXPR_VALUE) to where label is allowed (EXPR_LABEL) {foo: ("" rescue "")} { label:<<-DOC Some text for a heredoc goes here DOC } label label
  91. [Bug #11456] • Lex state is “EXPR_ARG|EXPR_LABELED” after label is

    tokenized • Then it’s NOT IS_BEG() • tLPAREN_ARG is returned • Only expr is allowed after tLPAREN_ARG !!! • expr doesn’t allow modifier rescue {foo: ("" rescue "")} case '(': if (IS_BEG()) { c = tLPAREN; } else if (IS_SPCARG(-1)) { c = tLPAREN_ARG; } paren_nest++; COND_PUSH(0); CMDARG_PUSH(0); lex_state = EXPR_BEG|EXPR_LABEL; return c; primary: tLPAREN_ARG expr rparen Before: EXPR_LABELARG After: EXPR_ARG|EXPR_LABELED
  92. Fix [Bug #11456] • r51624 fixed the bug by adding

    “EXPR_ARG|EXPR_LABELED” to IS_BEG() https://github.com/ruby/ruby/commit/0958af2ad4e83400f35c296e9ed9cf021b1675b4
  93. [Bug #11849] • Lex state is “EXPR_ARG|EXPR_LABELED” after label is

    tokenized • Then it’s IS_ARG() { label:<<-DOC Some text for a heredoc goes here DOC } case '<': last_state = lex_state; c = nextc(); if (c == '<' && !IS_lex_state(EXPR_DOT | EXPR_CLASS) && !IS_END() && (!IS_ARG() || space_seen)) { int token = heredoc_identifier(); if (token) return token; } ...
  94. Fix [Bug #11849] • r53214 fixed the bug by adding

    “EXPR_LABELED” check https://github.com/ruby/ruby/commit/9d5abbff9754589483938dc539226c2ad4895140
  95. Ruby Syntax changes • 3.2.0 (2022-12-25): Anonymous rest and keyword

    rest arguments can now be passed as arguments • 3.1.0 (2021-12-25): Anonymous block argument
  96. Ruby Syntax changes • 3.0.0 (2020-12-25): Endless method definition •

    2.7.0 (2019-12-25): Pattern matching, beginless range • 2.6.0 (2018-12-25): Endless range
  97. What will happen by the change? • Proposal for existing

    grammar • https://bugs.ruby-lang.org/issues/18080
  98. Summary • Use Case: Ruby • Ruby syntax is designed

    for programmers not for machines • Ruby syntax changes • Parser needs to • have theory and mechanism which mitigate implementation complexities • give the language designer feedback about syntax changes
  99. Parser & Lexer • Assume parser and lexer can be

    separated Class Method Method Assignment @name Call name capitalize class Greeter def Lexer Parser …
  100. Parser & Lexer • However lexer depends on parser in

    Ruby • Lexer generates different tokens depending on the parser state • Tokens with same length but different identity • Tokens with different length • By the way, the parser itself knows which kinds of tokens it can accept in each parser state
  101. PSLR(1) • It seems a good idea to integrate parser and

    lexer, then manage the states on the parser side • Joel E. Denny. “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, May 2010. • https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=1519&context=all_dissertations • PSLR stands for Pseudo-Scannerless Minimal LR
  102. PSLR(1) • > Nevertheless, traditional scanner and parser generators attempt

    to generate loosely coupled scanners and parsers, so the user must maintain these tightly coupled scanner and parser specifications separately but consistently. • > Scanner and parser specifications would be significantly more maintainable if all sub-language transitions were instead computed from a grammar by a parser generator and recognized automatically by the scanner using the parser’s stack.
  103. Sub-languages and scopes • The example comes from “Figure 2.6:

    Scoped Declarations” of the paper • C++0x • In Lc and Lp, ‘>>’ has higher precedence than ‘>’, and vice versa in Lt • Y<X<(6>>1)>> x4; • Lc : main C++0x language • Lt : template argument list sub-language • Lp : parenthesized expression sub-sub-language • %lex-prec '>' -< '>>' for Lc and Lp • %lex-prec '>>' -< '>' for Lt
  104. Sub-languages and scopes • In Ruby case, ‘|’ has higher

    precedence than ‘||’ in Lbp obj.m do || end Lrb Lbp Lrb : main ruby Lbp : block parameters %lex-prec ’|’ -< ’||’ for Lrb %lex-prec ’||’ -< ’|’ for Lbp
  105. Scanner conflict • Identity conflict: Tokens with same

    length but different identity • E.g. do, do_cond, do_block, do_LAMBDA • Length conflict: Tokens with different length • E.g. ‘|’, ‘||’
  106. How to specify Sub-languages scopes • Specify nonterminals as a

    scope of sub-languages • See: “3.7 Scoped Declarations” obj.m do |var| expr end method_call brace_block do |var| expr end k_do do_body k_end |var| expr opt_block_ param bodystmt | var | block_ param
  107. How to specify Sub-languages scopes • Within “opt_block_param”, ‘|’

    has higher precedence than ‘||’ %nterm opt_block_param { %lex-prec '||' < '|' } %% primary: method_call brace_block brace_block: k_do do_body k_end do_body: opt_block_param bodystmt opt_block_param: block_param_def block_param_def: '|' opt_bv_decl '|' | '|' block_param opt_bv_decl '|'
  108. How it works • Collecting tokens before “opt_block_param” -> “do”

    • Collecting tokens which are the last token of “opt_block_param” -> ‘|’ • The parser updates the lexer precedence to sub-language mode after “do” and restores it after the second ‘|’ primary: method_call brace_block brace_block: k_do do_body k_end do_body: opt_block_param bodystmt opt_block_param: block_param_def block_param_def: '|' opt_bv_decl '|' | '|' … '|'
  109. How it works • Some states are marked • ‘||’

    is separated to two ‘|’ in marked states primary: method_call • brace_block brace_block: • k_do do_body k_end brace_block: k_do • do_body k_end do_body: • opt_block_param bodystmt opt_block_param: • block_param_def block_param_def: • ‘|' block_param ‘|' block_param_def: ‘|’ • block_param ‘|' block_param_def: ‘|’ block_param • ‘|' block_param_def: ‘|’ block_param ‘|' • do_body: opt_block_param • bodystmt ... primary: method_call brace_block •
  110. Scope conflict • If contradictory lexer precedences are

    defined, the parser state has a scope conflict • Split the state again so that no state has contradictory lexer precedences • In this case, the states can be separated because one follows “{” and the other follows “do” %nterm opt_block_param { %lex-prec '||' < '|' } %nterm brace_body { %lex-prec '||' > '|' } %% brace_block: k_do do_body k_end do_body: opt_block_param bodystmt brace_block: '{' brace_body '}' brace_body: opt_block_param compstmt
  111. IELR • IELR can split such state • IELR is

    more powerful than LALR • PSLR is an extension of IELR • Both PSLR and IELR were invented by Joel E. Denny
  112. RubyKaigi 2024 • Check slides and video for more detail

    • https://rubykaigi.org/2024/presentations/junk0612.html
  113. Reconsider lex_state • Reconsider block parameter syntax • “||” is

    not accepted after “do” • “||” is not accepted after “var” • No scanner conflict • It’s enough for parser to pass acceptable token list to lexer obj.m do |var| expr end // After “do” do_body: … • opt_block_param bodystmt opt_block_param: • none opt_block_param: • block_param_def block_param_def: • '|' opt_bv_decl ‘|' block_param_def: • '|' block_param opt_bv_decl ‘|' // After “var” $@23: ε • [‘='] f_eq: • $@23 ‘=' f_opt_primary_value: f_arg_asgn • f_eq primary_value f_arg_item: f_arg_asgn • ['|', '\n', ',', ';']
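The idea of passing the acceptable token list from parser to lexer can be sketched as follows. This is a hypothetical helper, not Lrama's or CRuby's API: `next_token` simply returns the longest acceptable token that the input starts with.

```ruby
# Pick the longest token that both matches the input and is acceptable
# in the current parser state (hypothetical interface).
def next_token(input, acceptable_tokens)
  acceptable_tokens.select { |token| input.start_with?(token) }.max_by(&:length)
end

# After "do", the parser state accepts '|' but not '||'.
p next_token("|| end", ["|"])
# In an operator position, '||' is acceptable and the longest match wins.
p next_token("|| b", ["||", "&&"])
```

With this shape the lexer needs no state of its own for the pipe case, since the states after “do” and after “var” never accept “||”.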
  114. Reconsider modifier if • “if” will be •

    keyword_if if the lex state is EXPR_BEG • modifier_if if the lex state is not EXPR_BEG • EXPR_BEG: “if cond then …” (keyword_if) • ! EXPR_BEG: “… if cond” (modifier_if)
  115. Checking states table • If I checked correctly, no state

    accepts both keyword_if and modifier_if • If the state accepts keyword_if, it doesn’t accept modifier_if • If the state accepts modifier_if, it doesn’t accept keyword_if • Always current state knows how to handle “if”
  116. EXPR and if • After the operator, state is EXPR_BEG.

    Then “if … end” is accepted • After the number, state is EXPR_END. Then modifier if is accepted • It’s clear which type of if can be written 1 + 2 BEG END BEG END 1 + if true; 1 else 2 end 1 + 2 if true EXPR_BEG ! EXPR_BEG
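Both forms from this slide are ordinary Ruby and can be evaluated directly. The constant names below are made up for the example.

```ruby
# After the operator "+", an expression is expected, so "if" opens "if ... end".
KEYWORD_FORM = 1 + if true; 1 else 2 end

# After the number 2, an expression has just ended, so "if" is a modifier.
MODIFIER_FORM = (1 + 2 if true)

p KEYWORD_FORM
p MODIFIER_FORM
```

The same `if` spelling yields two different structures, and the position alone (after an operator or after an operand) is enough to tell them apart.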
  117. Hypothesis • #1: In Ruby, the end of nonterminal symbol

    is powerful enough to distinguish which tokens are accepted • #2: A lot of token types can be determined on parser side • If so, sub-language model is not the best mental model in Ruby
  118. I forgot command-like control syntax • Tweak parse.y to

    replace modifier_if with keyword_if • These grammar rules conflict • “return if …” can be read as “return (if …)” (keyword_if) or “(return) if …” (modifier_if)
  119. Insight • modifier_if or keyword_if • It’s clear in a

    sentence with operator • It’s not clear just after control syntax • If the relation between modifier_if and keyword_if is specified, the parser can inform us of conflicts • How conflicts are resolved in the language is an important insight when new syntax is added
  120. Summary • In Ruby, how to extract token depends on

    the surrounding sentences • lex_state is complicated • Need to mitigate the complexities for further syntax extensions • Tight communication between scanner and parser will reduce the complexities • Explicit declaration of conflict resolution records what the language designer decided • Able to refer to the past decisions when a similar pattern appears
  121. Give feedback to the language designer It’s fun to hack

    parser generator - RubyKaigi 2024 LT in Okinawa -
  122. What will happen by the change? • Proposal for existing

    grammar • https://bugs.ruby-lang.org/issues/18080
  123. It’s possible to implement • > but nobu said it's

    hard to support because of parse.y limitation. • No, it’s possible!! • https://github.com/yui-knk/ruby/tree/bugs_18080
  124. Need to consider these patterns • There is an argument

    or not • The arguments are surrounded by parentheses or not • There is block or not • The symbol of pattern matching, `in` or `=>`
  125. Conflicts with existing grammar • There is no

    block • The arguments are not surrounded by parentheses • The symbol of pattern matching is `=>`
  126. LR parser generator knows this issue • S/R or R/R

    conflict detection is a friend for programming language designer
  127. Why is this issue difficult to detect?

    • Need to check all combination of grammar rules • Discussion of grammar and implementation of parser are localized
  128. Combination of grammar rules • A lot of rules are

    optional • Argument is optional • Parentheses around arguments are optional • Block is optional • (The symbol of pattern matching, `in` or `=>`) • Need to discuss grammar rules as group • E.g. “a == b”, “1 + 2” and “1..2” are in same “arg” group • If you change “arg” rules, you need to consider the impact on “expr” and “stmt” too
  129. Localized discussion & implementation • Examples in a ticket are

    simple • Parser implementation is a combination of parts • Parser generator: combination of rules • Recursive Descent Parser: combination of functions, e.g. “parse_pattern_matching”, “parse_arguments”
  130. Localized discussion & implementation • Localized discussion and implementation are

    good practice • Divide the difficulties • However, it requires a mechanism to integrate these parts • LR parser generator has such a mechanism: conflict detection • Hand-written parser doesn’t have such a mechanism • Parser generator works as checker/linter for grammar • Cannot keep soundness of grammar without the help from computer science
  131. RubyKaigi 2024 • Check slides and video for more detail

    • https://rubykaigi.org/2024/presentations/spikeolaf.html
  132. Lexer level conflict • Current parser doesn’t warn about

    lexer level conflicts • Because parser doesn’t know the relationship between keyword_if and modifier_if • However, it conflicts at some points from the programmer’s viewpoint • The detection is helpful for syntax discussion return if … return (if …) (return) if … keyword_if modifier_if
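
The two readings on this slide can be tried directly in Ruby. The following is only a small sketch; the method names `early_exit` and `pick` are made up for illustration and do not come from the talk.

```ruby
# Reading 1: `if` is taken as modifier_if, so the statement is `(return) if flag`.
def early_exit(flag)
  return if flag
  :reached
end

# Reading 2: parentheses force `if` to be keyword_if, so the whole
# `if ... end` expression becomes the returned value.
def pick(flag)
  return (if flag then :yes else :no end)
end

RESULTS = [early_exit(true), early_exit(false), pick(true), pick(false)]
p RESULTS
```

Ruby silently chooses the first reading for a bare `return if …`, which is the kind of decision a conflict report would make visible.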
  133. Endless range • Endless range literal is cutting-edge syntax •

    Traditional range ends with EXPR_END however endless range ends with EXPR_BEG 1 … 2 BEG END BEG END 1 … BEG END BEG
  134. Endless range • Concerning lex state sensitive tokens • ‘%’:

    is interpreted as a start of % string literal if EXPR_BEG • ‘||’: is divided into two ‘|’ if EXPR_BEG
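
As a rough illustration of the `%` case, the standard `ripper` library can show which scanner event the same character produces in two positions. The helper `percent_event` is a hypothetical name introduced here, not part of Ripper.

```ruby
require "ripper"

# Return the scanner event of the first token whose text starts with "%".
# Each element of Ripper.lex is [[line, column], event, text, state].
def percent_event(source)
  Ripper.lex(source).find { |token| token[2].start_with?("%") }[1]
end

AFTER_VALUE  = percent_event("x = 10 % 3")    # `%` follows a value
AT_BEGINNING = percent_event("x = %w[a b]")   # `%` where an expression begins

p AFTER_VALUE, AT_BEGINNING
```

In the first source `%` is scanned as an operator, in the second as the opening of a `%w` literal, even though the byte is the same.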
  135. Endless range • However it might not matter • ‘..’

    and ‘…’ have relatively low precedence
  136. Endless range • “rescue” might matter ? • Parser generator

    could help [Feature #12912] discussion more
  137. Last battle with Space • I think space and newline

    are the most mysterious part of Ruby syntax • “space_seen” variable • ‘\n’ token and tIGNORED_NL token • How to include space and newline into parser context is an open problem
  138. Summary • It’s difficult for humans to understand the combination

    • Cannot keep soundness of grammar without the help from computer science • PSLR is a key concept for checking soundness of lexer state sensitive grammar
  139. What parser generates I want to know truth of syntax

    tree design - Osaka RubyKaigi 04 -
  140. Syntax Tree • Parser generates Syntax Tree for other libraries

    and components • Compiler, Type System, LSP, Linter and Code Formatter • Therefore what’s the use case of Syntax Tree ? • How to satisfy the use cases ?
  141. Use cases • They want to execute codes • compile.c

    • Type System • They want to analyze codes • LSP (ruby-lsp) • Linter & Code Formatter (RuboCop)
  142. Code analysis • Need token information (Syntax Highlight) • Need

    to analyze comments (LSP DocumentLink) • Need to walk through parent node from child node (LSP SelectionRange) • Want to rewrite codes (LSP & Code Formatter) • This is the most difficult use case, right now
  143. Code rewriting • Style::IfInsideElse Cop • Unnest if inside if

    https://github.com/rubocop/rubocop/blob/v1.65.1/lib/rubocop/cop/style/if_inside_else.rb#L10-L29
  144. How RuboCop rewrites code if condition_a action_a else if condition_b

    action_b else action_c end end if condition_a action_a elsif condition_b if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end if condition_a action_a elsif condition_b action_b else action_c end 1. else to elsif 2. Delete “if condition_b” 3. Delete end 4. Delete duplicated “action_b”
  145. Problem of TreeRewriter #1 • Implementation is complex • TreeRewriter

    doesn’t edit source code directly • Create TreeRewriter::Action instances, store the actions then apply the changes at once Action. :replace (2, 0)-(2, 4) “elsif condition_b” Action. :replace (3, 2)-(3, 16) “action_b” Action. :replace (7, 0)-(7, 6) “” Action. :replace (4, 0)-(4, 13) “”
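
The "store the actions, then apply them at once" idea can be sketched in a few lines of Ruby. This is not the `parser` gem's TreeRewriter API: `Action` and `apply_actions` are hypothetical names, and the sketch uses character offsets instead of (line, column) pairs.

```ruby
# Hypothetical stand-in for a rewrite action:
# replace source[start...stop] with text.
Action = Struct.new(:start, :stop, :text)

# Apply every stored action in one pass. Working from the end of the
# buffer towards the beginning means an edit never shifts the offsets
# of the actions that are still pending.
def apply_actions(source, actions)
  result = source.dup
  actions.sort_by { |action| -action.start }.each do |action|
    result[action.start...action.stop] = action.text
  end
  result
end

SOURCE = "if a\n  x\nelse\n  y\nend\n"

# Both ranges are computed against the ORIGINAL source.
else_at = SOURCE.index("else")
y_at    = SOURCE.index("y")
ACTIONS = [
  Action.new(else_at, else_at + 4, "elsif b"),
  Action.new(y_at, y_at + 1, "z"),
]

REWRITTEN = apply_actions(SOURCE, ACTIONS)
puts REWRITTEN
```

Because every range refers to the untouched source, the caller does not need to track how earlier edits moved later text.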
  146. Why Action is needed #1 • It’s costly to edit

    string every time • In both cases, need to move/copy sub-strings after “else” if condition_a action_a else action_b end if condition_a action_a action_b end Delete else if condition_a action_a else action_b end if condition_a action_a elsif action_b end Replace with elsif
  147. Why Action is needed #2 • Directly editing the code

    affects the rest of the nodes if condition_a action_a else action_b end Parser::Source::Buffer if condition_a action_a action_b end Parser::Source::Buffer NODE_VCALL action_b Range (3, 2)-(3, 10) Delete else
  148. Problem of TreeRewriter #2 • Rewriting operations are complicated •

    Need to understand current status of each step if condition_a action_a else if condition_b action_b else action_c end end if condition_a action_a elsif condition_b if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end end if condition_a action_a elsif condition_b action_b action_b else action_c end if condition_a action_a elsif condition_b action_b else action_c end 1. else to elsif 2. Delete “if condition_b” 3. Delete end 4. Delete duplicated “action_b”
  149. Rewriting Syntax Tree • Can leverage Tree Structure • Change

    NODE_IF to NODE_ELSIF then delete NODE_ELSE NODE_IF condition_a action_a NODE_ELSE NODE_IF condition_b action_b NODE_ELSE action_c NODE_IF condition_a action_a NODE_ELSIF condition_b action_b action_c
  150. Generate source code from Syntax Tree • Once the syntax tree is

    rewritten, render source code from the Syntax Tree • However, AST doesn’t have spaces, newlines and so on… NODE_IF condition_a action_a NODE_ELSIF condition_b action_b action_c
  151. Other difficulties #1 • Need to pass range

    information for new node • Calculation is still based on a text oriented approach • Cannot fully leverage the syntax tree transformation if condition_a action_a else if condition_b action_b else action_c end end if condition_a action_a elsif condition_b action_b else action_c end Start is the same as else The last line is “action_c line - 1” The last column is “tail of action_c - 2”
  152. Other difficulties #2 • All nodes following the

    updated node are affected NODE_CLASS condition_a action_a NODE_ELSIF NODE_DEF NODE_IF NODE_DEF expr1 expr2 expr2 expr2 condition_b action_b action_c Update!!
  153. Migration Tool • Code rewriting is not only for LSP

    and code formatter • But also migration tool like Transpec http://yujinakayama.me/transpec/
  154. Concrete Syntax Tree • Concrete Syntax Tree (CST) for code

    restoration • CST preserves information which AST omits, e.g. spaces, newlines, parentheses • AST focuses on semantics, CST focuses on Syntax • Implementation • Introduce data structure for token • Keep information on token which lexer omitted • Node has child nodes and tokens
  155. Concrete Syntax Tree • Concrete Syntax Tree (CST) for code

    restoration • CST preserves information which AST omits, e.g. spaces, newlines, parentheses • AST focuses on semantics, CST focuses on Syntax • Implementation • Introduce data structure for token • Keep information on token which lexer omitted • Node has child nodes and tokens Syntax Tree having complete information of source code
  156. Trivia • Trivia is information which lexer omits • Spaces,

    Newlines, comments and so on Trivia (comment) Trivia (spaces) Trivia (new line)
  157. Node, Token and Trivia • Token has trailing trivia and

    leading trivia • Node holds nodes and tokens NODE_IF IF cond action_a END Token NODE Legend space (1) NL (1) + space (2) NL (1) Trivia
  158. Syntax Tree to code • Dump codes with Depth-first search

    to get the whole codes Token NODE Legend NODE_IF IF cond action_a END space (1) NL (1) + space (2) NL (1) Trivia
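
A minimal sketch of this dump, assuming made-up `Token` and `Node` structs (each token carries its leading and trailing trivia, each node holds child nodes and tokens in source order):

```ruby
# Hypothetical CST shapes, for illustration only.
Token = Struct.new(:leading, :text, :trailing)
Node  = Struct.new(:type, :children)

# Depth-first traversal that concatenates trivia and token text.
def dump(element)
  case element
  when Token then "#{element.leading}#{element.text}#{element.trailing}"
  when Node  then element.children.map { |child| dump(child) }.join
  end
end

TREE = Node.new(:NODE_IF, [
  Token.new("", "if", " "),                                  # space (1)
  Node.new(:cond, [Token.new("", "cond", "\n")]),            # NL (1)
  Node.new(:action_a, [Token.new("  ", "action_a", "\n")]),  # space (2) + NL (1)
  Token.new("", "end", "\n"),
])

DUMPED = dump(TREE)
print DUMPED
```

Since no whitespace is thrown away, the dump reproduces the original text exactly.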
  159. Red Green Tree • Red Green Tree is editable Syntax

    Tree • Invented for C# (Roslyn) • Swift (SwiftSyntax) and rust-analyzer (LSP) use this • Represent Syntax Tree with Red Node and Green Node • Let’s read swift-syntax • https://github.com/swiftlang/swift-syntax
  160. Red Green Tree • Green Node • has reference to

    child elements • has width • Red Node • has reference to parent elements • has offset Token Green NODE Legend Red NODE NODE_IF width: 90 IF width: 3 NODE_IF width: 56 condition_a width: 11 action_a width: 11 NODE_ELSE width: 61 END width: 4 ELSE width: 5 NODE_IF offset: 0 NODE_ELSE offset: 25 NODE_IF offset: 30
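
The split between the two node kinds can be sketched as follows. The classes `Green` and `Red` are a simplified assumption, not the Roslyn or swift-syntax implementation, and the widths are illustrative values for a small tree rather than the ones in the diagram.

```ruby
# Green node: immutable, knows its children and its width,
# but not where it sits in the file.
class Green
  attr_reader :kind, :children, :width

  def initialize(kind, children: [], width: nil)
    @kind = kind
    @children = children.freeze
    @width = width || children.sum(&:width)
    freeze
  end
end

# Red node: a thin wrapper built on demand that adds the parent
# reference and the absolute offset.
class Red
  attr_reader :green, :parent, :offset

  def initialize(green, parent = nil, offset = 0)
    @green, @parent, @offset = green, parent, offset
  end

  # Offsets are derived by accumulating the widths of earlier siblings.
  def children
    position = offset
    green.children.map do |child|
      red = Red.new(child, self, position)
      position += child.width
      red
    end
  end
end

GREEN_ROOT = Green.new(:NODE_IF, children: [
  Green.new(:IF, width: 3),
  Green.new(:condition_a, width: 11),
  Green.new(:action_a, width: 11),
  Green.new(:END, width: 4),
])
RED_ROOT = Red.new(GREEN_ROOT)
OFFSETS = RED_ROOT.children.map { |red| [red.green.kind, red.offset] }
p OFFSETS
```

Green nodes store only widths, so an edit can reuse every untouched green subtree; positions are recomputed lazily through red nodes.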
  161. Recap • Execute codes: compile.c • Analyze codes: LSP, Linter

    & Code Formatter • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • Concrete Syntax Tree !! • Editable Syntax Tree • Red Green Tree !!
  162. Problem • How to keep edited Syntax Tree correct ?

    • If it’s not correct, ideally want to auto-correct it • Syntax Tree rewriting can create a Syntax Tree which the parser never generates + 1 2 3 * * + 1 2 3 parse Rewrite Dump
  163. Open Problem • Simple approach: Parse the dumped code and

    compare its syntax tree with the syntax tree before dump • Grammar might know which syntax trees the parser can generate Grammar file Parser generator Parser Syntax Tree Checker
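
The simple approach can be sketched with a toy grammar of integers, `+` and `*`. Everything here (`Bin`, `dump`, `parse`, `round_trips?`) is a hypothetical miniature, not Ruby's parser; the dump deliberately never adds parentheses so that the failure mode is visible.

```ruby
# Hypothetical binary-operator node; integers are leaves.
Bin = Struct.new(:op, :left, :right)

# Naive dump: writes operands and operator without adding parentheses.
def dump(tree)
  tree.is_a?(Bin) ? "#{dump(tree.left)} #{tree.op} #{dump(tree.right)}" : tree.to_s
end

# Minimal parser for space-separated integers with `+` and `*`,
# where `*` binds tighter than `+`.
def parse(source)
  tokens = source.split
  tree = parse_product(tokens)
  while tokens.first == "+"
    tokens.shift
    tree = Bin.new("+", tree, parse_product(tokens))
  end
  tree
end

def parse_product(tokens)
  tree = Integer(tokens.shift)
  while tokens.first == "*"
    tokens.shift
    tree = Bin.new("*", tree, Integer(tokens.shift))
  end
  tree
end

# The check: a tree is accepted if parsing its dump reproduces it.
def round_trips?(tree)
  parse(dump(tree)) == tree
end

PARSED    = parse("1 + 2 * 3")                    # (+ 1 (* 2 3))
REWRITTEN = Bin.new("*", Bin.new("+", 1, 2), 3)   # (* (+ 1 2) 3)

p round_trips?(PARSED)
p round_trips?(REWRITTEN)
```

The rewritten tree dumps to the same text as the parsed one, so re-parsing yields a different tree and the check rejects it. A dumper that knew the grammar could instead insert parentheses to auto-correct.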
  164. Summary • Research Syntax Tree use case • Code rewriting

    is the most difficult use case, right now • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • How to keep edited Syntax Tree correct • Open Problem
  165. Conclusion The world is now in the great age of

    parsers. People are setting sail into the vast sea of parsers. - RubyKaigi 2023 LT in Matsumoto -
  166. Summary • Grammar defines the boundary of the language and

    structure of the language • What parsers do • Parser distinguishes valid input and invalid input • Parser gives structure to the input correctly • LR-attributed grammar is the foundational theory of the current Ruby parser
  167. Summary • Use Case: Ruby • Ruby syntax is designed

    for programmers not for machines • Ruby syntax changes • Parser needs to • have theory and mechanism which mitigate implementation complexities • give the language designer feedback about syntax changes • PSLR is a key concept for lexer state sensitive grammar
  168. Summary • Research Syntax Tree use case • Code rewriting

    is the most difficult use case, right now • Text oriented code rewriting is difficult • Syntax Tree rewriting • Generate codes from Syntax Tree • How to keep edited Syntax Tree correct • Open Problem
  169. What is Grammar • The ruler of lexer, parser and

    syntax tree • By grammar, we can reveal what Ruby is • It’s very interesting to expose the secret of Ruby syntax from grammar • I want to reveal what is the key of Ruby’s programmer friendly syntax • Hypothesis: • Programmers • can understand expression beginning and ending • feel it’s natural that conditional branch follows flow control keywords, like “return” • can understand space sensitive grammar • sometimes fail to understand the precedence of non-arithmetic operators