The future vision of Ruby Parser

yui-knk

May 11, 2023

Transcript

  1. The future vision of Ruby Parser May 11, 2023 in

    RubyKaigi 2023 @yui-knk Yuichiro Kaneko
  2. About me • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf

    (Twitter) • Treasure Data • Engineering Manager of Applications Backend
  3. TD and Ruby committers twitter: @nalsh GitHub: @nurse twitter: @k_tsj

    GitHub: @k-tsj twitter: @spikeolaf GitHub: @yui-knk twitter: @mineroaoki GitHub: @aamine twitter: @nahi GitHub: @nahi Applications Backend
  4. Attendees from TD @spikeolaf @nalsh @k_tsj @frsyuki @takkanm @makimoto

    @citystar (GH) @chezou @ybiquitous @hkdnet @a_ksi19 @exoego
  5. About me • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf

    (Twitter) • CRuby committer, mainly developing parser-related features • Code positions to RNode (2018, Ruby 2.6) • Coverage • Error reporting • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6) • keep_tokens option (2022, Ruby 3.2) • error_tolerant option (2022, Ruby 3.2)
  6. Parser in Ruby • Converts the input script into an Abstract Syntax

    Tree • CRuby’s parser is an LALR parser • CRuby uses GNU Bison to generate the parser code
  7. History of parser generator • 1965: Donald E. Knuth invents

    LR parsing. “On the translation of languages from left to right” • 1975: Yacc is published • 1985: GNU Bison initial release • 1989: Berkeley Yacc initial release • 2006: GCC migrates its parser from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers
  8. What’s the problem? • Usability • Error-tolerant parser • Maintainability

    • “parse.y” seems difficult • Universal Parser • It’s a tough task to write your own parser for Ruby From “Ruby Committers vs The World” 2022 and 2021
  9. Usability For example, the Java error recovery approach in the

    Eclipse IDE is 5KLoC long, making it only slightly smaller than a modern version of Berkeley Yacc
  10. Parser’s responsibility • Check if the input is valid Ruby

    code • If valid • Build internal representation (AST) for subsequent process, “compile.c” • If invalid • Report Syntax Error
  11. Why is an error-tolerant parser needed? • LSP requires the parser to

    parse invalid code as far as possible • Just raising a syntax error is not enough in this case
  12. Parser’s responsibility • Check if the input is valid Ruby

    code • If valid • Build internal representation (AST) for subsequent process (“compile.c”) • If invalid • Report Syntax Error • Build AST as far as possible (New!)
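The error_tolerant option listed on slide 5 is the existing entry point for this new responsibility. A minimal sketch, assuming CRuby 3.2 or later; the helper name `tolerant_parse` is hypothetical, and the rescue clause is only there so the snippet degrades to nil on Rubies or implementations that lack the option:

```ruby
# Hypothetical helper around the error_tolerant option (assumes CRuby 3.2+).
def tolerant_parse(source)
  RubyVM::AbstractSyntaxTree.parse(source, error_tolerant: true)
rescue StandardError, ScriptError
  # Older Rubies reject the keyword; other implementations may lack RubyVM.
  nil
end

broken = "def greet(name)\n  puts name\n" # the closing `end` is missing
ast = tolerant_parse(broken)
puts(ast ? "got an AST root of type #{ast.type}" : "error-tolerant parsing is unavailable here")
```

Without the option, the same input raises SyntaxError and no tree is produced, which is the situation an LSP server cannot work with.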
  13. Python’s approach • CPython uses PEG parser • Try “A”

    • Try “B” if “A” fails • Try “C” if “B” fails https://devguide.python.org/internals/parser
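The ordered choice described above can be sketched with a few combinators. This is an illustrative toy, not CPython's generated parser; the helper names `token`, `sequence`, `ordered_choice` and the three toy rules are assumptions made for the example:

```ruby
# A parser here is a lambda: (tokens, pos) -> next position, or nil on failure.
def token(expected)
  ->(tokens, pos) { tokens[pos] == expected ? pos + 1 : nil }
end

# Runs the parsers one after another; stops as soon as one of them fails.
def sequence(*parsers)
  lambda do |tokens, pos|
    parsers.reduce(pos) { |current, parser| current && parser.call(tokens, current) }
  end
end

# PEG ordered choice: try "A", then "B", then "C", each from the same position.
def ordered_choice(*alternatives)
  lambda do |tokens, pos|
    result = nil
    alternatives.find { |alt| result = alt.call(tokens, pos) }
    result
  end
end

def statement
  rule_a = sequence(token("if"), token("x"), token("end"))
  rule_b = sequence(token("while"), token("x"), token("end"))
  rule_c = token("x")
  ordered_choice(rule_a, rule_b, rule_c)
end

p statement.call(%w[while x end], 0) # prints 3
p statement.call(%w[if x], 0)        # prints nil
```

The second call shows the limitation that matters for error tolerance: when every alternative fails, the result is just nil, so useful behaviour on invalid input has to be written as extra rules.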
  14. Python’s approach • CPython defines valid cases and

    invalid cases • A rule failure doesn’t imply a parsing failure like in context free grammars https://github.com/python/cpython/blob/889b0b9bf95651fc05ad532cc4a66c0f8ff29fc2/Grammar/python.gram
  15. Rust/Go’s approach • Both of them use hand-written parser •

    Go skips one or more tokens until one of “followlist” appears https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L1032 
 https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L321
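Skipping to a follow list can be sketched on a toy grammar of `name = number ;` statements. This is a simplified illustration rather than Go's actual code; `FOLLOW_LIST`, `skip_until_follow` and `parse_assignments` are hypothetical names, and unlike a real parser the sketch always consumes the synchronizing token:

```ruby
FOLLOW_LIST = [";"].freeze # hypothetical synchronizing tokens

# Advances pos until it points at a token in the follow list (or at the end).
def skip_until_follow(tokens, pos, follow_list)
  pos += 1 until pos >= tokens.size || follow_list.include?(tokens[pos])
  pos
end

# Parses `name = number ;` statements and returns [assignments, errors].
def parse_assignments(tokens)
  pos = 0
  parsed = []
  errors = []
  while pos < tokens.size
    name, eq, value, semi = tokens[pos, 4]
    if /\A[a-z]+\z/.match?(name.to_s) && eq == "=" &&
       /\A\d+\z/.match?(value.to_s) && semi == ";"
      parsed << [name, value.to_i]
      pos += 4
    else
      errors << "syntax error near token #{pos} (#{tokens[pos].inspect})"
      pos = skip_until_follow(tokens, pos, FOLLOW_LIST)
      pos += 1 # consume the synchronizing token itself, if there is one
    end
  end
  [parsed, errors]
end

parsed, errors = parse_assignments(%w[a = 1 ; b = = 2 ; c = 3 ;])
p parsed      # prints [["a", 1], ["c", 3]]
p errors.size # prints 1
```

The broken middle statement is dropped entirely, but the statement after it is still parsed, which is the trade-off of this style of recovery.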
  16. Rust/Go’s approach • rust-analyzer also skips specified tokens

    https://github.com/rust-lang/rust-analyzer/blob/21e61bee8b74e93f14205f4a6c316db08f811e38/crates/parser/src/grammar.rs#L302 
 https://github.com/rust-lang/rust-analyzer/blob/21e61bee8b74e93f14205f4a6c316db08f811e38/crates/parser/src/parser.rs#L227
  17. Bison’s approach • Bison panic-mode error recovery discards part of

    input so that it can parse the rest of the input.
  18. In short • PEG requires developers to implement all error

    cases • A hand-written parser requires developers to cover all error cases • Panic-mode error recovery loses part of the input, which is sometimes the part most important for completion
  19. Our approach • Insert/Delete/Shift operations based error recovery • Fischer,

    Corchuelo, CPCT+ • Lukas Diekmann and Laurence Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”, July 2020 • https://arxiv.org/pdf/1804.07133.pdf
  20. Insert/Delete token • Insert tokens until the input script becomes

    valid • Delete tokens until the input script becomes valid • Mixing both operations is also acceptable
  21. How does an LR parser work? • (LA)LR converts production rules into

    DFA (Deterministic Finite Automaton) • (LA)LR parser is implemented as PDA (Pushdown Automaton) • The stack manages states of DFA
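The stack of DFA states can be made concrete with a hand-built table for a tiny grammar. This is a sketch under stated assumptions: the grammar (S -> "(" S ")" | "x"), the state numbers and the tables were worked out by hand for illustration and are not taken from parse.y or from Bison output:

```ruby
# Rule number => left-hand side and number of right-hand-side symbols.
RULES = {
  1 => { lhs: :S, rhs_size: 3 }, # S -> "(" S ")"
  2 => { lhs: :S, rhs_size: 1 }  # S -> "x"
}.freeze

# ACTION[state][lookahead]; "$" marks the end of the input.
ACTION = {
  0 => { "(" => [:shift, 2], "x" => [:shift, 3] },
  1 => { "$" => [:accept] },
  2 => { "(" => [:shift, 2], "x" => [:shift, 3] },
  3 => { ")" => [:reduce, 2], "$" => [:reduce, 2] },
  4 => { ")" => [:shift, 5] },
  5 => { ")" => [:reduce, 1], "$" => [:reduce, 1] }
}.freeze

# GOTO[[state, nonterminal]]: the state entered after a reduction.
GOTO = { [0, :S] => 1, [2, :S] => 4 }.freeze

def lr_accepts?(tokens)
  input = tokens + ["$"]
  stack = [0] # the pushdown stack holds DFA states
  pos = 0
  loop do
    kind, arg = ACTION.fetch(stack.last).fetch(input[pos], [:error])
    case kind
    when :shift
      stack.push(arg)
      pos += 1
    when :reduce
      rule = RULES.fetch(arg)
      stack.pop(rule[:rhs_size]) # drop one state per right-hand-side symbol
      stack.push(GOTO.fetch([stack.last, rule[:lhs]]))
    when :accept
      return true
    else
      return false # no table entry: a syntax error in the current state
    end
  end
end

p lr_accepts?(%w[( ( x ) )]) # prints true
p lr_accepts?(%w[( x])       # prints false
```

At the point of failure the stack still records exactly which states the parser passed through, and that is the information token-based error recovery works from.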
  22. How does the approach work? (2) k_if • expr_value … true •

    k_def • def_name … … “true” “def” …
  23. How does the approach work? (3) k_if expr_value • then … “then”

    “;” “\n” (4) k_if expr_value then • … (4) k_if expr_value then • … (4) k_if expr_value then • …
  24. How does the approach work? (4) k_if expr_value then • compstmt if_tail

    k_end (7) k_if expr_value then compstmt if_tail k_end • compstmt and if_tail are optional
  25. Summary • There are some paths to recover an error

    • Find the cheapest path to repair it • Only the path-finding logic needs to be implemented • No need to care about the details of the grammar
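The idea of searching for the cheapest repair can be sketched as a breadth-first search over insert and delete operations. This is a deliberately naive illustration, not CPCT+ and not Lrama's implementation: it edits the whole token list and asks a validity oracle, whereas the real algorithms search from the parser state at the error point. The toy grammar, `VOCABULARY` and all method names are assumptions:

```ruby
require "set"

VOCABULARY = %w[if true then end x].freeze

# Toy grammar: program -> stmt+ ; stmt -> "x" | "if" "true" "then" stmt* "end"
def parse_stmt(tokens, pos)
  return pos + 1 if tokens[pos] == "x"
  return nil unless tokens[pos, 3] == %w[if true then]

  pos += 3
  while (next_pos = parse_stmt(tokens, pos))
    pos = next_pos
  end
  tokens[pos] == "end" ? pos + 1 : nil
end

def valid_program?(tokens)
  pos = parse_stmt(tokens, 0)
  pos = parse_stmt(tokens, pos) while pos && pos < tokens.size
  pos == tokens.size
end

# Every token list reachable with one delete or one insert, plus a label.
def neighbors(tokens)
  deletions = tokens.each_index.map do |i|
    [tokens[0...i] + tokens[(i + 1)..-1], "delete #{tokens[i]} at #{i}"]
  end
  insertions = (0..tokens.size).flat_map do |i|
    VOCABULARY.map { |tok| [tokens[0...i] + [tok] + tokens[i..-1], "insert #{tok} at #{i}"] }
  end
  deletions + insertions
end

# Breadth-first search: every operation costs 1, so the first hit is cheapest.
def cheapest_repair(tokens, max_cost: 3)
  frontier = [[tokens, []]]
  seen = Set[tokens]
  0.upto(max_cost) do |cost|
    hit = frontier.find { |candidate, _edits| valid_program?(candidate) }
    return { tokens: hit[0], edits: hit[1], cost: cost } if hit

    frontier = frontier.flat_map do |candidate, edits|
      neighbors(candidate).filter_map do |repaired, edit|
        [repaired, edits + [edit]] if seen.add?(repaired)
      end
    end
  end
  nil
end

p cheapest_repair(%w[if true])[:tokens] # prints ["if", "true", "then", "end"]
```

For the input `if true`, the cheapest repair inserts `then` and `end`, matching the walk-through on the preceding slides where compstmt and if_tail are optional. The search itself contains no grammar-specific error rules.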
  26. Problems with extending Bison • Bison = Bison command +

    template file (“yacc.c”) • The template file is an implementation detail • The installed version of Bison depends on the environment • Extending the Bison template is not easy https://github.com/akimd/bison/blob/25b3d0e1a3f97a33615099e4b211f3953990c203/data/skeletons/yacc.c#L1640
  27. Lrama LALR(1) parser generator • https://github.com/yui-knk/lrama • 100% Ruby

    implementation • Will be installed in the ruby/ruby tool directory • Input file is a Bison format file (“parse.y”) • Output is an LALR parser written in C • Generates a 100% compatible C file for Ruby 3.0.5, 3.1.0, 3.2.0 • https://bugs.ruby-lang.org/issues/19637
  28. Error Recovery by Bison • Does not work well for

    this case • This example is provided by @tompng
  29. Summary • Parser’s responsibility is increasing • PEG and hand-written

    parser need to be aware of the details of the grammar • Bison’s panic mode loses part of the input • Token-based error recovery is flexible, no need to know the details of the grammar • We can ride on DFA’s theory if we use LR parser • Defeated the first boss !
  30. Maintainability In this work we demonstrate that, contrary to the

    prevailing consensus, we can have the best of both worlds: for a very general, practical class of grammars—a strict superset of Knuth’s canonical LR—we can generate parsers automatically, and such that the resulting parser code, as well as the generation procedure itself, is highly efficient. “Practical LR Parser Generation”
  31. Why is “parse.y” difficult? 1. “parse.y” is large

    (about 15,000 lines) 2. LALR is difficult, e.g. S/R conflict, R/R conflict 3. Bison doesn’t provide syntax sugar like option, list 4. It’s a mixture of parser and ripper 5. Parser and Lexer are tightly-coupled
  32. Monstrous lexer state • enum lex_state_e state: Has 13 different

    types of state. • int paren_nest: Nest level of (, [, {. Used for parsing -> {}. • int lpar_beg: Stores paren_nest when parsing lambda starts. • int brace_nest: Nest level of {. Used for parsing “#{var}”. • stack_type cond_stack: Used for parsing condition like “while ... do” • stack_type cmdarg_stack: Used for parsing command call like “foo 1, 2 do”
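Fields such as cond_stack can be pictured as a stack of booleans packed into one integer. The class below is a hypothetical Ruby model written for illustration; the class name and methods are assumptions, and the real fields are C integers manipulated by macros in parse.y:

```ruby
# Hypothetical model of a lexer bit stack (e.g. cond_stack or cmdarg_stack).
class BitStack
  def initialize
    @bits = 0
  end

  # Shift the existing flags left and store the new flag in the lowest bit.
  def push(flag)
    @bits = (@bits << 1) | (flag ? 1 : 0)
  end

  # Drop the most recent flag.
  def pop
    @bits >>= 1
  end

  # True when the most recently pushed flag is set.
  def top?
    @bits.odd?
  end
end

cond = BitStack.new
cond.push(true)  # entering the condition of `while`
cond.push(false) # entering parentheses inside that condition
p cond.top?      # prints false
cond.pop         # leaving the parentheses
p cond.top?      # prints true
```

Each such field is state the lexer must keep in sync with the grammar by hand, which is part of what makes the coupling hard to follow.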
  33. 1. Resolve conflicts • Ruby uses 4 different

    “do” • lambda • condition (while) • command call • method call
  34. • It’s neither a joke nor a metaphor, CRuby literally has

    4 different “do” 1. Resolve conflicts
  35. 1. Resolve conflicts • “do” is a cause

    of shift/reduce conflicts • “do” never appears in the condition of while, until and so on
  36. 1. Resolve conflicts • Matz daily 2004/04/26 •

    https://matz.rubyist.net/20040426.html#p02 1. Write two full sets of rules, one with do and another without do.
  37. 1. Resolve conflicts • Matz daily 2004/04/26 •

    https://matz.rubyist.net/20040426.html#p02 1. Write two full sets of rules, one with do and another without do. 2. Hack the lexer so that it returns different tokens for the same “do” string based on the context (= state) • CRuby selected the latter
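The lexer hack can be sketched as a function from lexer state to token kind. This is a simplified model, not the code in parse.y: the `LexerState` fields are hypothetical stand-ins for the state variables listed on slide 32, and the four token names are assumptions modelled on the four kinds of “do” named on slide 33:

```ruby
# Hypothetical, flattened view of the lexer state relevant to "do".
LexerState = Struct.new(:lambda_beginning, :in_condition, :in_command_args,
                        keyword_init: true)

# Returns a different token kind for the same "do" string depending on state.
def classify_do(state)
  return :keyword_do_LAMBDA if state.lambda_beginning # -> do ... end
  return :keyword_do_cond   if state.in_condition     # while cond do ... end
  return :keyword_do_block  if state.in_command_args  # foo 1, 2 do ... end
  :keyword_do                                         # foo do ... end
end

p classify_do(LexerState.new(in_condition: true))    # prints :keyword_do_cond
p classify_do(LexerState.new(in_command_args: true)) # prints :keyword_do_block
p classify_do(LexerState.new)                        # prints :keyword_do
```

The grammar then sees four distinct terminals and has no conflict to resolve, at the cost of pushing the decision into state that lives outside the grammar file's rules.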
  38. Nonterminal attributes • Correct, Bison doesn’t have such a feature. However,

    that does not mean it’s impossible! • Joe Zimmerman “Practical LR Parser Generation”, Sep 2022 • https://arxiv.org/pdf/2209.08383.pdf • “Nonterminal attributes”
  39. Nonterminal attributes 1. Define attributes 2. “do” is

    allowed in top level, however not allowed in while condition 3. “k_do” is allowed when DO_ALLOWED is true
  40. Conflicts • E.g. Conflict on endless

    method definition • Lexer hack introduced ambiguities
  41. Already have hints (1) • Precedence is already solved by

    “Operator Precedence” • https://www.gnu.org/software/bison/manual/html_node/Precedence.html
  42. Nonterminal attributes for conflict resolution 1. Define

    attributes 2. “do” in f_larglist has less precedence than “do” in lambda_body -> “do” is reduced 3. “do” in () is shifted 4. “do” in top level is shifted
  43. Summary • “parse.y” difficulty comes from tight coupling

    between parser and lexer • Nonterminal attributes solve part of the problems • Nonterminal attributes for precedence solve the “do” overload • We have not leveraged the potential of LR parser • Defeated the second boss !!
  44. Why Universal Parser is needed? • Everyone wants to use

    CRuby parser • mruby, PicoRuby: Other Ruby implementations in C • JRuby, TruffleRuby, ruruby: Other Ruby implementations not in C • sorbet, typeprof: Tools • Implementing a 100% compatible Ruby parser is a bit difficult • Managing a parser for each version is difficult
  45. Why isn’t it a Universal Parser? • CRuby parser depends on

    other CRuby functionalities !!! lexer & parser GC RString RArray RHash … rb_mRubyVMFrozenCore struct rb_iseq_struct * Ruby
  46. The road to Universal Parser 1. Passing required functions as

    function pointers 2. Linking functions into a parser shared library 1. parse.o: Generated from “parse.y” 2. node2.o: Separate AST/Node code from “node.c” 3. st2.o: Copy “st.c” and remove unnecessary code • https://github.com/yui-knk/ruby/tree/universal-parser
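The first option, passing the functions the parser needs from outside, can be illustrated with a Ruby analogue of a function-pointer table. The actual work is in C; everything named here (`HostFunctions`, `TinyParser`, the two host tables) is hypothetical and only shows the shape of the idea:

```ruby
# Hypothetical table of host-provided functions, standing in for a C struct
# of function pointers.
HostFunctions = Struct.new(:new_string, :intern, keyword_init: true)

# The parser only calls through the table, never into a runtime directly.
class TinyParser
  def initialize(host)
    @host = host
  end

  def parse_literal(token)
    case token
    when /\A"(.*)"\z/ then [:STR, @host.new_string.call(Regexp.last_match(1))]
    when /\A:(\w+)\z/ then [:SYM, @host.intern.call(Regexp.last_match(1))]
    else raise ArgumentError, "unsupported token: #{token.inspect}"
    end
  end
end

# A runtime-like host builds real objects; a tool-like host keeps plain data.
RUNTIME_HOST = HostFunctions.new(
  new_string: ->(text) { text.dup },
  intern: ->(text) { text.to_sym }
)
TOOL_HOST = HostFunctions.new(
  new_string: ->(text) { { kind: "string", text: text } },
  intern: ->(text) { { kind: "symbol", text: text } }
)

p TinyParser.new(RUNTIME_HOST).parse_literal(":foo") # prints [:SYM, :foo]
p TinyParser.new(TOOL_HOST).parse_literal(":foo")
```

The same parser code serves both hosts, so the size and shape of the table becomes the interface that has to be sorted out.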
  47. Sort out the interface • Memory management • malloc, realloc,

    free … • They should be in the interface • imemo • tmpbuf_auto_free_pointer, tmpbuf_set_ptr • CRuby internal, let’s remove the dependency
  48. Sort out the interface • Literal Object • Do not

    create object, but keep it as “string” instead.
  49. Sort out the interface • Parser manipulates object • Parser

    needs to know structure of objects • Need to pass functions
  50. Summary • Universal Parser is required for tools and other

    Ruby implementations • 209 functions is a starting line • A lot of sub tasks to make the interface user-friendly • Defeated the third boss !!!
  51. Conclusion The future is not laid out on a track.

    It is something that we can decide, and to the extent that we do not violate any known laws of the universe, we can probably make it work the way that we want to. Alan Curtis Kay
  52. Conclusion • LR parser can solve 3 major problems, Usability/Maintainability/

    Universal Parser • We can ride on DFA’s theory when we use LR parser • We have not leveraged the potential of LR parser • Lrama LALR parser generator is an infrastructure for Ruby parser
  53. History of parser generator • 2006: GCC migrates its parser

    from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers
  54. History of parser generator • 2006: GCC migrates its parser

    from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers” • 2022: “Practical LR Parser Generation”
  55. History of parser generator • 2006: GCC migrates its parser

    from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers” • 2022: “Practical LR Parser Generation” • 2023: “The future vision of Ruby Parser” in RubyKaigi 2023
  56. The future vision of Parser • LR parser is the

    basic building block • Extending the grammar DSL to leverage LR parser • Moving lexer’s logic into parser grammar rules • Multiple parser algorithms for multiple purposes • Users can focus on writing grammar
  57. Next Steps • Migrate CRuby parser generator from GNU Bison

    to Lrama • Install Lrama into CRuby • Usability (Error-tolerant parser) • Integrate Lrama error-tolerant functions with CRuby • Maintainability • Use Nonterminal attributes precedence • Universal Parser • Sort out the interface then merge the PR
  58. Need your help ! • For experts in LR

    parsers • Any feedback is welcome • For developers who will use Universal Parser and AST • Share your use cases with me • For developers who are interested in implementing Universal Parser • Let me know
  59. Acknowledgements • @mame, @ko1 and other committers • @nurse, I

    could not have defeated all 3 bosses without your support
  60. References • Jeffrey Kegler. “Parsing: a timeline”, Sep 2014. http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/09/chron.html

    • Lukas Diekmann and Laurence Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”, July 2020. https://arxiv.org/pdf/1804.07133.pdf • Joe Zimmerman. “Practical LR Parser Generation”, Sep 2022. https://arxiv.org/pdf/2209.08383.pdf • Atsushi Ohori (大堀 淳). “LR構文解析の原理” (Principles of LR parsing), Feb 2014. https://www.jstage.jst.go.jp/article/jssst/31/1/31_1_30/_pdf/-char/ja • Matz. “[OSS]yaccの弱点(その2)” (Weaknesses of yacc, part 2), Matz daily 2004/04/26. https://matz.rubyist.net/20040426.html#p02 • Eugene Wallingford. “ALAN KAY'S TALKS AT OOPSLA”, Knowing and Doing 2004/11/06. http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2004-11.html#e2004-11-06T21_03_42.htm
  61. References • yui-knk. “Ruby Parser開発日誌 (1)” (Ruby Parser development diary (1)), かねこにっき 2022/12/11. https://yui-knk.hatenablog.com/entry/2022/12/11/154502

    • yui-knk. “Ruby Parser開発日誌 (2)” (Ruby Parser development diary (2)), かねこにっき 2023/01/08. https://yui-knk.hatenablog.com/entry/2023/01/08/190105 • yui-knk. “Ruby Parser開発日誌 (3)” (Ruby Parser development diary (3)), かねこにっき 2023/01/11. https://yui-knk.hatenablog.com/entry/2023/01/11/220223 • yui-knk. “Ruby Parser開発日誌 (4)” (Ruby Parser development diary (4)), かねこにっき 2023/01/14. https://yui-knk.hatenablog.com/entry/2023/01/14/144131
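  62. References • yui-knk. “Ruby Parser開発日誌 (5) - Lrama LALR (1) parser generatorを実装した”

    (Ruby Parser development diary (5): Implemented the Lrama LALR(1) parser generator), かねこにっき 2023/03/13. https://yui-knk.hatenablog.com/entry/2023/03/13/101951 • yui-knk. “Ruby Parser開発日誌 (6) - parse.yのMaintainabilityの話” (Ruby Parser development diary (6): On the maintainability of parse.y), かねこにっき 2023/04/04. https://yui-knk.hatenablog.com/entry/2023/04/04/190413 • yui-knk. “Ruby Parser開発日誌 (7) - doについて考える” (Ruby Parser development diary (7): Thinking about do), かねこにっき 2023/04/09. https://yui-knk.hatenablog.com/entry/2023/04/09/123723 • yui-knk. “Ruby Parser開発日誌 (8) - Universal Parserへの道” (Ruby Parser development diary (8): The road to Universal Parser), かねこにっき 2023/05/01. https://yui-knk.hatenablog.com/entry/2023/05/01/174828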