
The future vision of Ruby Parser


yui-knk

May 11, 2023

Transcript

  1. The future vision of
    Ruby Parser
    May 11, 2023 at RubyKaigi 2023


    @yui-knk


    Yuichiro Kaneko


  2. About me
    • Yuichiro Kaneko


    • yui-knk (GitHub) / spikeolaf (Twitter)


    • Treasure Data


    • Engineering Manager of Applications Backend


  3. PR: We are Gold sponsor!


  4. TD and Ruby committers
    twitter: @nalsh / GitHub: @nurse
    twitter: @k_tsj / GitHub: @k-tsj
    twitter: @spikeolaf / GitHub: @yui-knk
    twitter: @mineroaoki / GitHub: @aamine
    twitter: @nahi / GitHub: @nahi
    Applications Backend


  5. Attendees from TD
    @spikeolaf
    @nalsh
    @k_tsj
    @frsyuki
    @takkanm
    @makimoto
    @citystar (GH)
    @chezou
    @ybiquitous
    @hkdnet
    @a_ksi19
    @exoego


  6. About me
    • Yuichiro Kaneko


    • yui-knk (GitHub) / spikeolaf (Twitter)


    • CRuby committer, mainly developing parser-related features


    • Code positions to RNode (2018, Ruby 2.6)


    • Coverage


    • Error reporting


    • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6)


    • keep_tokens option (2022, Ruby 3.2)


    • error_tolerant option (2022, Ruby 3.2)


  7. Introduction
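    “The idea at the foundation of LR parsing is a simple one (as many
    excellent ideas are): apply the analysis technique for regular languages
    repeatedly to parse a broad class of context-free grammars.”


    Atsushi Ohori (大堀 淳), “LR構文解析の原理”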


  8. Parser in Ruby
    • Converting input script into Abstract Syntax Tree


    • CRuby’s parser is an LALR parser


    • CRuby uses GNU Bison to generate the parser code


  9. History of parser generator
    • 1965: Donald E. Knuth invents LR parsing. “On the translation of
    languages from left to right”


    • 1975: Yacc is published


    • 1985: GNU Bison initial release


    • 1989: Berkeley Yacc initial release


    • 2006: GCC migrates its parser from Bison to hand-written
    recursive-descent parsers (C++ was 2004)


    • 2015: Go migrates its parser from Bison to hand-written
    recursive-descent parsers


  10. What’s the problem?
    • Usability


    • Error-tolerant parser


    • Maintainability


    • “parse.y” seems difficult


    • Universal Parser


    • It’s a tough task to write your own parser for Ruby


    From “Ruby Committers vs The World” 2022 and 2021


  11. Today’s talk
    • Usability


    • Maintainability


    • Universal Parser


    • Me


  12. Usability
    For example, the Java error recovery approach in
    the Eclipse IDE is 5KLoC long, making it only slightly
    smaller than a modern version of Berkeley Yacc



  13. Parser’s responsibility
    • Check if the input is valid Ruby code


    • If valid


    • Build internal representation (AST) for subsequent process,
    “compile.c”


    • If invalid


    • Report Syntax Error


  14. Why is an error-tolerant parser needed?
    • An LSP server requires the parser to parse invalid code as far as possible


    • Just raising a syntax error is not enough in this case


  15. Parser’s responsibility
    • Check if the input is valid Ruby code


    • If valid


    • Build internal representation (AST) for subsequent process
    (“compile.c”)


    • If invalid


    • Report Syntax Error


    • Build AST as far as possible (New!)


  16. Python’s approach
    • CPython uses a PEG parser


    • Try “A”


    • Try “B” if “A” fails


    • Try “C” if “B” fails
    https://devguide.python.org/internals/parser
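
The ordered choice on this slide can be sketched in a few lines. This is only an illustration of the idea, not CPython's generated parser: the helper names `literal`, `sequence` and `choice` and the toy grammar are assumptions made up for the example.

```python
from typing import Callable, List, Optional, Tuple

Tokens = List[str]
# A rule takes (tokens, position) and returns (node, new_position), or None on failure.
Result = Optional[Tuple[object, int]]
Rule = Callable[[Tokens, int], Result]


def literal(expected: str) -> Rule:
    """Match exactly one token."""
    def rule(tokens: Tokens, pos: int) -> Result:
        if pos < len(tokens) and tokens[pos] == expected:
            return expected, pos + 1
        return None
    return rule


def sequence(*rules: Rule) -> Rule:
    """Match every rule in order; fail as soon as one of them fails."""
    def rule(tokens: Tokens, pos: int) -> Result:
        nodes = []
        for r in rules:
            result = r(tokens, pos)
            if result is None:
                return None
            node, pos = result
            nodes.append(node)
        return nodes, pos
    return rule


def choice(*alternatives: Rule) -> Rule:
    """Ordered choice: try "A", then "B" if "A" fails, then "C" if "B" fails."""
    def rule(tokens: Tokens, pos: int) -> Result:
        for alt in alternatives:
            result = alt(tokens, pos)  # every alternative restarts at the same position
            if result is not None:
                return result
        return None
    return rule


# Hypothetical toy grammar: stmt <- "if" "x" "then" / "if" "x" / "x"
stmt = choice(
    sequence(literal("if"), literal("x"), literal("then")),
    sequence(literal("if"), literal("x")),
    literal("x"),
)
```

Calling `stmt(["if", "x"], 0)` fails on the first alternative and succeeds on the second. A failing alternative is therefore not a parse failure by itself; only the failure of every alternative is.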


  17. Python’s approach
    • CPython defines valid cases and invalid cases


    • A rule failure doesn’t imply a parsing failure like in context free
    grammars
    https://github.com/python/cpython/blob/889b0b9bf95651fc05ad532cc4a66c0f8ff29fc2/Grammar/python.gram


  18. Rust/Go’s approach
    • Both of them use hand-written parsers


    • Go skips one or more tokens until a token in the “followlist” appears
    https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L1032

    https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L321
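
A minimal sketch of the "skip until a follow token" idea, under stated assumptions: the toy statement grammar (`NAME = NUMBER ;`) and the function names `skip_until` and `parse_statements` are invented for illustration and are not taken from the Go parser.

```python
from typing import List, Set, Tuple


def skip_until(tokens: List[str], pos: int, followlist: Set[str]) -> Tuple[int, List[str]]:
    """Advance until a token in followlist (or the end of input) is reached.

    Returns the new position and the tokens that were thrown away.
    """
    skipped = []
    while pos < len(tokens) and tokens[pos] not in followlist:
        skipped.append(tokens[pos])
        pos += 1
    return pos, skipped


def parse_statements(tokens: List[str]) -> Tuple[List[List[str]], List[str]]:
    """Parse 'NAME = NUMBER ;' statements, resynchronising at ';' after an error."""
    statements, errors = [], []
    pos = 0
    while pos < len(tokens):
        chunk = tokens[pos:pos + 4]
        ok = (
            len(chunk) == 4
            and chunk[0].isidentifier()
            and chunk[1] == "="
            and chunk[2].isdigit()
            and chunk[3] == ";"
        )
        if ok:
            statements.append(chunk[:3])
            pos += 4
            continue
        # Error: discard input up to the next ';', then step over it.
        pos, skipped = skip_until(tokens, pos, {";"})
        errors.append("skipped: " + " ".join(skipped))
        if pos < len(tokens):
            pos += 1
    return statements, errors
```

With the input `a = 1 ; b = = 2 ; c = 3 ;` the parser keeps the first and third statements and reports the second as skipped. The skipped tokens are simply lost, and the author of the parser has to decide where each follow list applies.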


  19. Rust/Go’s approach
    • rust-analyzer also skips specified tokens
    https://github.com/rust-lang/rust-analyzer/blob/21e61bee8b74e93f14205f4a6c316db08f811e38/crates/parser/src/grammar.rs#L302

    https://github.com/rust-lang/rust-analyzer/blob/21e61bee8b74e93f14205f4a6c316db08f811e38/crates/parser/src/parser.rs#L227


  20. Bison’s approach
    • Bison’s panic-mode error recovery discards part of the input so that it
    can parse the rest of the input.


  21. In short
    • PEG requires developers to implement all error cases


    • A hand-written parser requires developers to cover all error cases


    • Panic-mode error recovery loses part of the input, sometimes the part
    that matters most for completion


  22. Our approach
    • Error recovery based on Insert/Delete/Shift operations


    • Fischer, Corchuelo, CPCT+


    • Lukas Diekmann and Laurence Tratt. “Don’t Panic! Better,
    Fewer, Syntax Errors for LR Parsers”, July 2020


    • https://arxiv.org/pdf/1804.07133.pdf


  23. How does it work?
    Diekmann and Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”


  24. Insert/Delete token
    • Insert tokens until the input script becomes valid


    • Delete tokens until the input script becomes valid


    • Mixing both operations is also acceptable
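
The insert/delete idea can be illustrated with a brute-force search. This sketch is a deliberate simplification and not the CPCT+ algorithm from the paper: it edits the whole token list and re-checks it, rather than searching over LR parser configurations, and it omits the shift operation. The toy grammar, `VOCABULARY`, `is_valid` and `cheapest_repair` are all assumptions for the example.

```python
from collections import deque
from typing import List, Optional, Tuple

VOCABULARY = ["(", ")", "x"]


def is_valid(tokens: List[str]) -> bool:
    """Recogniser for the toy grammar S -> "(" S ")" | "x"."""
    depth = 0
    while depth < len(tokens) and tokens[depth] == "(":
        depth += 1
    rest = tokens[depth:]
    return rest == ["x"] + [")"] * depth


Repair = Tuple[str, int, str]  # (operation, position, token)


def cheapest_repair(tokens: List[str], max_cost: int = 3) -> Optional[List[Repair]]:
    """Breadth-first search over insert/delete edits; every edit costs 1."""
    queue = deque([(tuple(tokens), [])])
    seen = {tuple(tokens)}
    while queue:
        current, repairs = queue.popleft()
        if is_valid(list(current)):
            return repairs  # BFS order guarantees no cheaper repair exists
        if len(repairs) == max_cost:
            continue
        candidates = []
        for i in range(len(current)):
            candidates.append((current[:i] + current[i + 1:], ("delete", i, current[i])))
        for i in range(len(current) + 1):
            for tok in VOCABULARY:
                candidates.append((current[:i] + (tok,) + current[i:], ("insert", i, tok)))
        for new_tokens, repair in candidates:
            if new_tokens not in seen:
                seen.add(new_tokens)
                queue.append((new_tokens, repairs + [repair]))
    return None
```

For the input `( x` there are two repairs of cost 1 (delete `(` or insert `)` at the end); the search returns whichever it reaches first. Positions in a repair refer to the token list as it stands when that edit is applied.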


  25. How does an LR parser work?
    • (LA)LR converts production rules into a DFA (Deterministic
    Finite Automaton)


    • An (LA)LR parser is implemented as a PDA (Pushdown Automaton)


    • The stack holds the DFA states
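
To make the "stack of DFA states" concrete, here is a minimal table-driven LR parser. The toy grammar and the hand-derived `ACTION` and `GOTO` tables are assumptions for illustration; they are unrelated to the tables Bison generates from “parse.y”.

```python
from typing import Dict, List, Tuple

# Toy grammar (an assumption for illustration):
#   rule 1: S -> "(" S ")"
#   rule 2: S -> "x"
RULES = {1: ("S", 3), 2: ("S", 1)}  # rule number -> (left-hand side, right-hand side length)

# ACTION and GOTO tables derived by hand from the LR(0) automaton of the grammar.
ACTION: Dict[Tuple[int, str], Tuple[str, int]] = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", 0),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO: Dict[Tuple[int, str], int] = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens: List[str]) -> List[int]:
    """Return the rule numbers in reduction order, or raise SyntaxError."""
    stack = [0]                  # the stack holds automaton states
    reductions = []
    tokens = tokens + ["$"]      # end-of-input marker
    pos = 0
    while True:
        action = ACTION.get((stack[-1], tokens[pos]))
        if action is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")
        kind, arg = action
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]  # pop one state per right-hand-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
            reductions.append(arg)
        else:                    # accept
            return reductions
```

The driver loop never looks at the grammar directly, only at the current state and the next token. That is why recovery logic written against states and tokens does not need to know the details of a particular grammar.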


  26. How does the approach work?
    (2) k_if • expr_value …
    on “true” → true •
    on “def” → k_def • def_name …


  27. How does the approach work?
    (3) k_if expr_value • then …
    on “then”, “;” or “\n” → (4) k_if expr_value then • …


  28. How does the approach work?
    (4) k_if expr_value then • compstmt if_tail k_end
    (7) k_if expr_value then compstmt if_tail k_end •
    compstmt and if_tail are optional


  29. How does the approach work?
    Diekmann and Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”


  30. Summary
    • There are several paths to recover from an error


    • Find the cheapest path to repair it


    • Only the path-finding logic needs to be implemented


    • No need to care about the details of the grammar


  31. Problems with extending Bison
    • Bison = Bison command + template file (“yacc.c”)


    • The template file is an implementation detail


    • The installed version of Bison depends on the environment


    • Extending the Bison template is not easy
    https://github.com/akimd/bison/blob/25b3d0e1a3f97a33615099e4b211f3953990c203/data/skeletons/yacc.c#L1640


  32. Lrama LALR (1) parser generator
    • https://github.com/yui-knk/lrama


    • 100% Ruby implementation


    • Will be installed in the ruby/ruby tool directory


    • The input file is a Bison-format file (“parse.y”)


    • The output is an LALR parser written in C


    • Generates a 100% compatible C file for Ruby 3.0.5, 3.1.0 and 3.2.0


    • https://bugs.ruby-lang.org/issues/19637


  33. Error Recovery by Bison
    • Does not work well for this case


    • This example is provided by @tompng


  34. Error Recovery by Lrama
    • https://github.com/yui-knk/lrama/tree/error_recovery


  35. Summary
    • Parser’s responsibility is increasing


    • PEG and hand-written parsers need to be aware of the details of the
    grammar


    • Bison’s panic mode loses part of the input


    • Token-based error recovery is flexible; there is no need to know the
    details of the grammar


    • We can ride on DFA theory if we use an LR parser


    • Defeated the first boss !


  36. Maintainability
    In this work we demonstrate that, contrary
    to the prevailing consensus, we can have
    the best of both worlds: for a very general,
    practical class of grammars—a strict
    superset of Knuth’s canonical LR—we can
    generate parsers automatically, and such
    that the resulting parser code, as well as
    the generation procedure itself, is highly
    efficient.


    “Practical LR Parser Generation”


  37. Why is “parse.y” difficult?
    1. “parse.y” is large (about 15,000 lines)


    2. LALR is difficult, e.g. S/R conflicts and R/R conflicts


    3. Bison doesn’t provide syntactic sugar such as option or list


    4. It’s a mixture of the parser and Ripper


    5. Parser and Lexer are tightly-coupled


  38. Monstrous lexer state
    • enum lex_state_e state: Has 13 different types of state.


    • int paren_nest: Nest level of (, [, {. Used for parsing -> {}.


    • int lpar_beg: Stores paren_nest when parsing of a lambda starts.


    • int brace_nest: Nest level of {. Used for parsing “#{var}”.


    • stack_type cond_stack: Used for parsing condition like “while ...
    do”


    • stack_type cmdarg_stack: Used for parsing command call like
    “foo 1, 2 do”


  39. 1. Resolve conflicts
    • Ruby uses 4 different “do”


    • lambda


    • condition (while)


    • command call


    • method call


  40. 1. Resolve conflicts
    • It’s neither a joke nor a metaphor: CRuby literally has 4 different “do”


  41. 1. Resolve conflicts
    • “do” is a cause of shift/reduce conflicts


    • “do” never appears in the condition of while, until and so on


  42. 1. Resolve conflicts
    • Matz daily 2004/04/26


    • https://matz.rubyist.net/20040426.html#p02


    1. Write two full sets of rules, one with do and one without do.


  43. 1. Resolve conflicts
    • Matz daily 2004/04/26


    • https://matz.rubyist.net/20040426.html#p02


    1. Write two full sets of rules, one with do and one without do.


    2. Hack the lexer so that it returns different tokens for the same “do”
    string depending on the context (= state)


    • CRuby selected the latter
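
The lexer hack can be pictured as follows. This is a much-reduced model written for illustration: the field names echo the lexer state listed earlier in the deck, but the `LexerState` class, the `token_for_do` function and the exact conditions are assumptions, and the real lexer checks more state than this.

```python
from dataclasses import dataclass


@dataclass
class LexerState:
    """A reduced stand-in for the lexer state (hypothetical model)."""
    paren_nest: int = 0    # current nesting level of (, [, {
    lpar_beg: int = -1     # paren_nest recorded when a lambda "->" started, -1 if none
    cond_stack: int = 0    # bit stack: lowest bit set while inside "while ... do"
    cmdarg_stack: int = 0  # bit stack: lowest bit set while inside command-call arguments


def token_for_do(state: LexerState) -> str:
    """Pick one of four token kinds for the same source text "do"."""
    if state.lpar_beg == state.paren_nest:
        return "keyword_do_LAMBDA"  # -> (x) do ... end
    if state.cond_stack & 1:
        return "keyword_do_cond"    # while cond do ... end
    if state.cmdarg_stack & 1:
        return "keyword_do_block"   # foo 1, 2 do ... end
    return "keyword_do"             # foo(1, 2) do ... end
```

The grammar then only ever sees four distinct terminals, so the conflict disappears from the parser's point of view. The cost is that the parser actions must keep this state up to date, which is the tight coupling between parser and lexer.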


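  44. Matz daily
    • > yacc’s declarative grammar cannot express conditions. A grammar that
    says “do not apply this rule under this condition” is not possible. Well,
    given the nature of LALR(1) that is natural in a sense, so calling it a
    weak point is not appropriate. More precisely, it is not a “weak point”
    but at most a “request” or a “wish”.


    • In short: yacc doesn’t support conditions on rules, so we cannot omit
    some rules when some conditions are met.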


    • Matz daily 2004/04/26


    • https://matz.rubyist.net/20040426.html#p02


  45. Nonterminal attributes
    • Correct, Bison doesn’t have such a feature. However, that does
    not mean it’s impossible!


    • Joe Zimmerman “Practical LR Parser Generation”, Sep 2022


    • https://arxiv.org/pdf/2209.08383.pdf


    • “Nonterminal attributes”


  46. Nonterminal attributes
    1. Define attributes


    2. “do” is allowed at the top level,
    but not allowed in a
    while condition


    3. “k_do” is allowed when
    DO_ALLOWED is true
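
One way to picture what an attribute such as DO_ALLOWED buys is to expand the grammar into one copy per attribute value. The sketch below is an assumption made for illustration only: the toy grammar, the `expr!` marker and the `specialise` function are hypothetical and do not reflect how Lrama or the “Practical LR Parser Generation” paper implement nonterminal attributes.

```python
from typing import List, Optional, Tuple

# One production: (left-hand side, right-hand side, required DO_ALLOWED value or None).
Production = Tuple[str, List[str], Optional[bool]]

# Hypothetical toy grammar, not the one in parse.y.
# "expr!" marks a position where DO_ALLOWED is forced to False (a while condition).
GRAMMAR: List[Production] = [
    ("stmt", ["k_while", "expr!", "k_do", "stmt", "k_end"], None),
    ("stmt", ["expr"], None),
    ("expr", ["call"], None),
    ("expr", ["call", "k_do", "stmt", "k_end"], True),  # only when DO_ALLOWED is true
]
NONTERMINALS = {"stmt", "expr"}


def specialise(grammar: List[Production]) -> List[Tuple[str, List[str]]]:
    """Expand every nonterminal into one copy per value of DO_ALLOWED."""
    expanded = []
    for do_allowed in (True, False):
        for lhs, rhs, condition in grammar:
            if condition is not None and condition != do_allowed:
                continue  # this production is unavailable in this context
            new_rhs = []
            for symbol in rhs:
                if symbol.endswith("!"):
                    new_rhs.append(f"{symbol[:-1]}[DO_ALLOWED=False]")
                elif symbol in NONTERMINALS:
                    new_rhs.append(f"{symbol}[DO_ALLOWED={do_allowed}]")
                else:
                    new_rhs.append(symbol)
            expanded.append((f"{lhs}[DO_ALLOWED={do_allowed}]", new_rhs))
    return expanded
```

In the expanded grammar, `expr[DO_ALLOWED=False]` has no production containing `k_do`, so a “do” after a while condition can only belong to the `while` itself. The condition lives in the grammar instead of in lexer state.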


  47. Nonterminal attributes
    https://github.com/yui-knk/lrama/blob/e4a708d2f080f8e9ca8b082ac038fd6658d31077


  48. Nonterminal attributes
    • It works well for “while”
    https://github.com/yui-knk/lrama/blob/e4a708d2f080f8e9ca8b082ac038fd6658d31077/sample/nonterminal_attributes_3_2_0.output


  49. Conflicts
    • E.g. Conflict on endless method definition


    • Lexer hack introduced ambiguities


  50. GAME OVER
    CONTINUE?

    YES

    NO


  51. GAME OVER
    CONTINUE?

    YES

    NO


  52. Rethink “do”
    • There are 4 “do”


  53. Rethink “do”
    • They have precedence


    • I don’t recommend writing such code


  54. Rethink “do”
    • (, [, { reset the “context”


    • Need to care about “context”


  55. Already have hints (1)
    • Precedence is already solved by “Operator Precedence”


    • https://www.gnu.org/software/bison/manual/html_node/Precedence.html


  56. Already have hints (2)
    • Nonterminal attributes carry “context”


  57. Nonterminal attributes for conflict resolution
    1. Define attributes


    2. “do” in f_larglist has lower precedence than “do” in
    lambda_body -> “do” is reduced


    3. “do” in () is shifted


    4. “do” at the top level is shifted


  58. Nonterminal attributes for conflict resolution
    https://github.com/yui-knk/lrama/tree/nonterminal_attributes
    No conflict !!


  59. What happens behind the scenes
    • Generate two states for one state


  60. Summary
    • The difficulty of “parse.y” comes from the tight coupling between
    parser and lexer


    • Nonterminal attributes solve part of the problem


    • Nonterminal attributes for precedence solve the “do” overload


    • We have not leveraged the potential of LR parsers


    • Defeated the second boss !!


  61. Universal Parser
    We can solve any problem by introducing
    an extra level of indirection.



  62. Why is a Universal Parser needed?
    • Everyone wants to use the CRuby parser


    • mruby, PicoRuby: Other Ruby implementations in C


    • JRuby, TruffleRuby, ruruby: Other Ruby implementations not in C


    • sorbet, typeprof: Tools


    • Implementing a 100% compatible Ruby parser is a bit difficult


    • Maintaining a parser for each version is difficult


  63. Why isn’t it a Universal Parser?
    Ruby
    lexer & parser


  64. Why isn’t it a Universal Parser?
    • The CRuby parser depends on other CRuby functionalities !!!
    (Diagram: inside Ruby, the lexer & parser depend on GC, RString, RArray,
    RHash, rb_mRubyVMFrozenCore and struct rb_iseq_struct *)


  65. The road to Universal Parser
    1. Passing required functions as function pointers


    2. Linking functions into a parser shared library


    1. parse.o: Generated from “parse.y”


    2. node2.o: AST/Node code separated from “node.c”


    3. st2.o: Copy of “st.c” with unnecessary code removed


    • https://github.com/yui-knk/ruby/tree/universal-parser
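
The first option, passing the required functions in from outside, can be sketched as plain dependency injection. Everything here is hypothetical: `ParserConfig`, its three callbacks and `parse_word_list` are invented for the example and are not the interface of the universal-parser branch.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ParserConfig:
    """Callbacks supplied by the host; the parser core never touches host internals."""
    alloc_string: Callable[[str], Any]
    alloc_array: Callable[[List[Any]], Any]
    report_error: Callable[[str], None]


def parse_word_list(source: str, config: ParserConfig) -> Any:
    """Parse a word list such as "%w[a b c]" using only the injected callbacks."""
    if not (source.startswith("%w[") and source.endswith("]")):
        config.report_error(f"not a word list: {source}")
        return None
    words = source[3:-1].split()
    return config.alloc_array([config.alloc_string(w) for w in words])
```

A host that represents arrays as tuples passes `alloc_array=tuple`, while another host can pass its own constructors; the parser code stays the same. Every callback is also one more entry in the interface that has to be kept small and stable.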


  66. Done!!!
    • https://github.com/yui-knk/ruby/tree/universal-parser


    • https://github.com/yui-knk/my-ruby-parser




  67. However…
    • 209 functions


  68. Sort out the interface
    • Memory management


    • malloc, realloc, free …


    • They should be in the interface


    • imemo


    • tmpbuf_auto_free_pointer, tmpbuf_set_ptr


    • CRuby internal, let’s remove the dependency


  69. Sort out the interface
    • Literal Object


    • Do not create an object; keep it as a “string” instead.
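
A small sketch of keeping a literal as text, assuming a hypothetical `IntegerLiteral` node and helper functions invented for this example; it only covers decimal integers with `_` separators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegerLiteral:
    """AST node that stores the literal exactly as written, not a host object."""
    text: str


def parse_integer_literal(source: str) -> IntegerLiteral:
    """Validate the token and keep its text; no host integer is created here."""
    digits = source.replace("_", "")
    well_formed = (
        digits.isascii()
        and digits.isdigit()
        and not source.startswith("_")
        and not source.endswith("_")
        and "__" not in source
    )
    if not well_formed:
        raise SyntaxError(f"invalid integer literal: {source!r}")
    return IntegerLiteral(text=source)


def to_host_value(node: IntegerLiteral) -> int:
    """Conversion happens later, on the host side (here: a Python int)."""
    return int(node.text.replace("_", ""))
```

With this split, the parser needs no object allocation for literals, and each host decides how and when to turn `"1_000"` into its own integer representation.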


  70. Sort out the interface
    • The parser manipulates objects


    • The parser needs to know the structure of objects


    • Need to pass functions


  71. Sort out the interface
    • Add “negative” flag


    • Add NODE_NEG


  72. Sort out the interface
    • AST transformation


    • Move it to “compile.c”


  73. Summary
    • Universal Parser is required for tools and other Ruby
    implementations


    • 209 functions is the starting line


    • A lot of subtasks remain to make the interface user-friendly


    • Defeated the third boss !!!


  74. Conclusion
    The future is not laid out on a track. It is
    something that we can decide, and to the
    extent that we do not violate any known
    laws of the universe, we can probably make
    it work the way that we want to.


    Alan Curtis Kay


  75. Conclusion
    • An LR parser can solve 3 major problems: Usability, Maintainability
    and Universal Parser


    • We can ride on DFA theory when we use an LR parser


    • We have not leveraged the potential of LR parsers


    • The Lrama LALR parser generator is infrastructure for the Ruby parser


  76. Dragon Book shows…
    • Usability


    • Maintainability


    • Universal Parser



  77. Dragon Book shows…
    • Usability


    • Maintainability


    • Universal Parser


    • LALR Parser Generation


  78. History of parser generator
    • 2006: GCC migrates its parser from Bison to hand-written
    recursive-descent parsers (C++ was 2004)


    • 2015: Go migrates its parser from Bison to hand-written
    recursive-descent parsers


  79. History of parser generator
    • 2006: GCC migrates its parser from Bison to hand-written
    recursive-descent parsers (C++ was 2004)


    • 2015: Go migrates its parser from Bison to hand-written
    recursive-descent parsers


    • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers”


    • 2022: “Practical LR Parser Generation”


  80. History of parser generator
    • 2006: GCC migrates its parser from Bison to hand-written
    recursive-descent parsers (C++ was 2004)


    • 2015: Go migrates its parser from Bison to hand-written
    recursive-descent parsers


    • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers”


    • 2022: “Practical LR Parser Generation”


    • 2023: “The future vision of Ruby Parser” in RubyKaigi 2023


  81. New era of LR parser


  82. The future vision of Parser
    • The LR parser is the basic building block


    • Expanding grammar DSL to leverage LR parser


    • Moving lexer’s logic into parser grammar rules


    • Multiple parser algorithms for multiple purposes


    • Users can focus on writing grammar


  83. Next Steps
    • Migrate CRuby parser generator from GNU Bison to Lrama


    • Install Lrama into CRuby


    • Usability (Error-tolerant parser)


    • Integrate Lrama error-tolerant functions with CRuby


    • Maintainability


    • Use Nonterminal attributes precedence


    • Universal Parser


    • Sort out the interface then merge the PR


  84. Need your help !
    • For LR parser experts


    • Any feedback is welcome


    • For developers who will use Universal Parser and AST


    • Share your use cases with me


    • For developers who are interested in implementing Universal Parser


    • Let me know


  85. Acknowledgements
    • @mame, @ko1 and other committers


    • @nurse, I could not have defeated all 3 bosses without your support


  86. References
    • Jeffrey Kegler. “Parsing: a timeline”, Sep 2014. http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/09/chron.html


    • Lukas Diekmann and Laurence Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”, July 2020. https://arxiv.org/pdf/1804.07133.pdf


    • Joe Zimmerman. “Practical LR Parser Generation”, Sep 2022. https://arxiv.org/pdf/2209.08383.pdf


    • Atsushi Ohori (大堀 淳). “LR構文解析の原理”, Feb 2014. https://www.jstage.jst.go.jp/article/jssst/31/1/31_1_30/_pdf/-char/ja


    • Matz. “[OSS]yaccの弱点(その2)” [Weak points of yacc (part 2)], Matz daily 2004/04/26. https://matz.rubyist.net/20040426.html#p02


    • Eugene Wallingford. “ALAN KAY'S TALKS AT OOPSLA”, Knowing and Doing 2004/11/06. http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2004-11.html#e2004-11-06T21_03_42.htm


  87. References
    • yui-knk. “Ruby Parser開発日誌 (1)” [Ruby Parser development diary (1)], かねこにっき 2022/12/11. https://yui-knk.hatenablog.com/entry/2022/12/11/154502


    • yui-knk. “Ruby Parser開発日誌 (2)”, かねこにっき 2023/01/08. https://yui-knk.hatenablog.com/entry/2023/01/08/190105


    • yui-knk. “Ruby Parser開発日誌 (3)”, かねこにっき 2023/01/11. https://yui-knk.hatenablog.com/entry/2023/01/11/220223


    • yui-knk. “Ruby Parser開発日誌 (4)”, かねこにっき 2023/01/14. https://yui-knk.hatenablog.com/entry/2023/01/14/144131


  88. References
    • yui-knk. “Ruby Parser開発日誌 (5) - Lrama LALR (1) parser generatorを実装した” [Implemented the Lrama LALR (1) parser generator], かねこにっき 2023/03/13. https://yui-knk.hatenablog.com/entry/2023/03/13/101951


    • yui-knk. “Ruby Parser開発日誌 (6) - parse.yのMaintainabilityの話” [On the maintainability of parse.y], かねこにっき 2023/04/04. https://yui-knk.hatenablog.com/entry/2023/04/04/190413


    • yui-knk. “Ruby Parser開発日誌 (7) - doについて考える” [Thinking about do], かねこにっき 2023/04/09. https://yui-knk.hatenablog.com/entry/2023/04/09/123723


    • yui-knk. “Ruby Parser開発日誌 (8) - Universal Parserへの道” [The road to Universal Parser], かねこにっき 2023/05/01. https://yui-knk.hatenablog.com/entry/2023/05/01/174828


  89. Thank you !!!
