Ruby Parser progress report

yui-knk
August 19, 2023

Transcript

  1. About me
    • Yuichiro Kaneko
    • yui-knk (GitHub) / spikeolaf (Twitter)
    • The author of the ruby/lrama LALR parser generator
    • CRuby committer, mainly developing parser-related features
    • Code positions to RNode (2018, Ruby 2.6)
    • Coverage
    • Error reporting
    • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6)
    • keep_tokens & error_tolerant option (2022, Ruby 3.2)
  2. Parser generator
    • Lrama creates “parse.c” from “parse.y”
    • A parser generator creates a parser from a DSL file
    • [Diagram: parse.y → Lrama → parse.c]
  3. Brief history of releases
    • v0.5.4. Counterexamples by @yui-knk
    • v0.5.3. Merge error_recovery branch by @alitaso345
    • v0.5.2. Named references by @junk0612
    • v0.5.1. Add RBS and Steep check by @Little-Rubyist
    • v0.5.0. Add stdin mode by @nobu
    • v0.4.0. First ruby/ruby migrated version (<- RubyKaigi 2023)
  4. 15 Contributors !!!
    • Many of them are not CRuby committers
    • For some of them, Lrama is the first OSS project they have sent a PR to :)
  5. What’s the problem?
    • Error-tolerant parser
    • For LSP
    • Universal Parser
    • It’s a tough task to write your own parser for Ruby
    • Maintainability
    • “parse.y” seems difficult
  6. Interface is implemented
    • Runtime configuration
    • YYERROR_RECOVERY_ENABLED: Enable/Disable the feature
    • YYMAXREPAIR: How deep to search for recovery candidates
    • https://github.com/ruby/lrama/pull/74
  7. What’s next?
    • Integrate it with Ruby
    • I’m now fighting with memory-related bugs…
  8. Why LR parser is the best
    • BNF is a good interface
    • BNF is common knowledge
    • https://dev.mysql.com/doc/refman/8.0/en/select.html
    • https://datatracker.ietf.org/doc/html/rfc3986
  9. Why LR parser is the best
    • BNF is a good interface
    • BNF is common knowledge
    • Benefits derived from LR parser
    • Auto detection of grammar rule conflicts
    • Error-tolerant parser without detailed grammar knowledge
  10. Why is “parse.y” difficult?
    1. “parse.y” is large (about 15,000 lines)
    2. LALR is difficult, e.g. S/R conflict, R/R conflict
    3. Bison doesn’t provide syntax sugar like option, list
    4. It’s a mixture of parser and ripper
    5. Parser and Lexer are tightly-coupled
  11. Still not enough
    • There might be a better syntax to direct how to solve the conflict
  12. How to support new syntax
    • kaneko: How do you come up with new syntax?
    • nobu: I remember the BNF of Ruby, so it’s easy to find new syntax ideas. Once I come up with an idea, I check it by changing “parse.y”
    • kaneko: ??
    • nobu: ??
  13. An easier approach
    • (New) syntax for reducing the indent of nested modules
    • Finding syntax which is not valid now
    • “module X in Consts”
  14. IDE
    • Report file is primitive and static
    • A more interactive tool is better, e.g. irb commands, LSP
  15. Mixture of parser and ripper
    • One of the complexities of Ripper is that we use only one semantic value stack to manage both (1) semantic values and (2) returned values of callback methods
  16. Mixture of parser and ripper
    • Implement user-defined stack for Lrama
    • Can manage VALUE, the result of a callback method call, on another stack
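The two-stack idea from slides 15 and 16 can be pictured with a toy shift/reduce driver. This is only a sketch under assumptions: `TwoStackParser`, `shift`, `reduce`, and the `on_token` / `on_binary` event names are hypothetical and are not Lrama's or Ripper's actual interface. It shows the parser's own semantic values and the callback results living on separate stacks that are pushed and popped in step.

```ruby
# Toy shift/reduce driver with two parallel stacks (all names are hypothetical).
class TwoStackParser
  attr_reader :values, :callback_results

  def initialize(&callback)
    @callback = callback     # stands in for Ripper-style callback methods
    @values = []             # semantic values the parser itself relies on
    @callback_results = []   # return values of the callbacks, kept apart
  end

  # Shift: push the token and the callback's result for it, one per stack.
  def shift(token)
    @values.push(token)
    @callback_results.push(@callback.call(:on_token, [token]))
  end

  # Reduce: pop `count` entries from both stacks and push one result on each.
  def reduce(rule, count)
    children = @values.pop(count)
    results = @callback_results.pop(count)
    @values.push([rule, *children])
    @callback_results.push(@callback.call(rule, results))
  end
end

parser = TwoStackParser.new { |event, args| "#{event}(#{args.join(', ')})" }
%w[1 + 2].each { |token| parser.shift(token) }
parser.reduce(:on_binary, 3)

p parser.values.last            # the parser's semantic value
p parser.callback_results.last  # what the callbacks returned
```

Because the callback results never touch `values`, the parser's own actions can keep reading semantic values regardless of what a callback returns.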
  17. Parser and Lexer are tightly-coupled
    • Today I focus on “enum lex_state_e state”.
    • “1 || 2”. “||” is tOROP
    • “a do || end”. “||” is ‘|’ and ‘|’
    • Lexer checks EXPR_BEG to decide which token to generate
    • I have been wondering why such communication is needed, because the parser knows the current situation. The parser knows it never accepts “||” after “a do”.
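The “||” case above can be modelled in a few lines. This is a simplified sketch, not CRuby's lexer: `lex_double_pipe` is a hypothetical helper, and `:EXPR_END` is used here only as a stand-in for "any state other than EXPR_BEG".

```ruby
# Toy model of the lex-state check (hypothetical helper, not CRuby's code).
def lex_double_pipe(lex_state)
  if lex_state == :EXPR_BEG
    # At the beginning of an expression (e.g. right after `do`), "||" is an
    # empty block-parameter list: two separate '|' tokens.
    ["|", "|"]
  else
    # After a value (e.g. `1`), "||" is the logical-or operator.
    [:tOROP]
  end
end

p lex_double_pipe(:EXPR_END) # state assumed after "1"
p lex_double_pipe(:EXPR_BEG) # state assumed after "a do"
```

The same two characters produce different tokens, and the only thing that distinguishes the cases is state that the lexer has to be told about.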
  18. Pseudo-Scannerless Minimal LR(1)
    • Joel Denny “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, 2010
    • “vector<list<string>> v;”. “>>” is an example of this paper
    • > Nevertheless, traditional scanner and parser generators attempt to generate loosely coupled scanners and parsers, so the user must maintain these tightly coupled scanner and parser specifications separately but consistently.
    • > Scanner and parser specifications would be significantly more maintainable if all sub-language transitions were instead computed from a grammar by a parser generator and recognized automatically by the scanner using the parser’s stack.
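A loose illustration of the idea, not the algorithm from the paper: the scanner is handed the set of tokens the parser can accept in its current state and picks the first reading that fits. `READINGS`, `scan`, and the token names `:rshift`, `:identifier`, and `:comma` are all hypothetical.

```ruby
# Candidate readings for an ambiguous piece of text, longest token first.
# The table and token names are illustrative assumptions.
READINGS = {
  "||" => [[:tOROP], ["|", "|"]],
  ">>" => [[:rshift], [">", ">"]],
}.freeze

# Pick the first reading whose leading token the parser can accept right now.
def scan(text, acceptable)
  READINGS.fetch(text).find { |tokens| acceptable.include?(tokens.first) }
end

p scan("||", [:tOROP, "+"])        # after "1": an operator is acceptable
p scan("||", ["|", :identifier])   # after "a do": only '|' is acceptable
p scan(">>", [">", :comma])        # closing nested template arguments
```

In this picture no separate lex state is needed, because the acceptable set comes straight from the parser.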
  19. Lex state
    • State of the lexer by which the lexer determines which token type is generated
    • Parser updates lex state
    • Lexer updates and uses lex state
    • Example: keyword_if & modifier_if
  20. Move lexer logic to parser
    • Only the parser updates lex state
    • Lexer has very little logic
  21. Automaton & Automaton
    • LR parser manages automaton
    • Lex state is also automaton
    • automaton + automaton = automaton
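One way to read “automaton + automaton = automaton” is the standard product construction, sketched here for two small deterministic automata. Everything in the example is an assumption made for illustration: the `product` helper, the two tiny transition tables, and the state names. A real LR parser also has a stack, which this sketch leaves out.

```ruby
# A deterministic automaton is modelled as { state => { input => next_state } }.
# product(a, b) runs both in lockstep: its states are pairs of states, and an
# input is allowed only when both automata have a transition for it.
def product(a, b)
  result = {}
  a.each do |sa, a_edges|
    b.each do |sb, b_edges|
      edges = {}
      (a_edges.keys & b_edges.keys).each do |input|
        edges[input] = [a_edges[input], b_edges[input]]
      end
      result[[sa, sb]] = edges
    end
  end
  result
end

# Toy "parser" automaton: a value must be followed by an operator.
parser_automaton = {
  start: { value: :after_value },
  after_value: { op: :start },
}
# Toy "lex state" automaton: the state depends only on the last token.
lex_state_automaton = {
  EXPR_BEG: { value: :EXPR_END, op: :EXPR_BEG },
  EXPR_END: { value: :EXPR_END, op: :EXPR_BEG },
}

combined = product(parser_automaton, lex_state_automaton)
combined.each { |state, edges| puts "#{state.inspect} => #{edges.inspect}" }
```

The combined table is again a plain transition table, so the lex state can be tracked as part of the parser's state instead of living in the lexer.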
  22. In the future
    • Lexer (C code) is a source of Ruby-specific knowledge, so developers need to learn that knowledge
    • A scannerless parser can remove such Ruby-specific knowledge.
    • Once scannerless parsers are covered by computer science textbooks, developers can manage “parse.y” with just textbook knowledge
  23. Why is “parse.y” difficult?
    1. “parse.y” is large (about 15,000 lines)
    2. LALR is difficult, e.g. S/R conflict, R/R conflict
      • Counterexamples
      • More hints for Conflict Resolution & new syntax discussion
    3. Bison doesn’t provide syntax sugar like option, list
      • option, list and so on
    4. It’s a mixture of parser and ripper
      • User defined stack
    5. Parser and Lexer are tightly-coupled
      • Moving lexer logic into parser
      • Scannerless LR
      • IELR(1)
      • Lex state management in parser
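To make the “option, list” sugar concrete, here is a minimal sketch of how such shorthands could be expanded into plain rules that a Bison-style grammar already accepts. The helper names, the generated rule names, and the left-recursive shape of the list rule are assumptions for illustration, not Lrama's actual output.

```ruby
# Expand option(X) into a rule that matches X or nothing.
def expand_option(symbol)
  name = "option_#{symbol}"
  "#{name}: /* empty */ | #{symbol} ;"
end

# Expand list(X) into a left-recursive rule that matches zero or more X.
def expand_list(symbol)
  name = "list_#{symbol}"
  "#{name}: /* empty */ | #{name} #{symbol} ;"
end

puts expand_option("arg")
puts expand_list("stmt")
```

Without the sugar, each of these helper rules has to be written out by hand for every symbol that needs it.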
  24. Summary
    • Error-tolerant parser
    • Next action: Integration with Ruby
    • Universal Parser
    • Next action: Remove Ruby object from AST node
  25. Summary
    • Maintainability
    • Ruby-specific knowledge from parse.y
    • Scannerless LR
    • Lex state management by parser
    • Improve Language Designer’s Developer Experience
    • Fine-grained conflict resolution
    • Guide for how to resolve conflicts
  26. References
    • Yuichiro Kaneko “The future vision of Ruby Parser”, 2023 https://rubykaigi.org/2023/presentations/spikeolaf.html
    • Lrama LALR parser generator https://github.com/ruby/lrama
    • Chinawat Isradisaikul and Andrew C. Myers. “Finding Counterexamples from Parsing Conflicts”, 2015 https://www.cs.cornell.edu/andru/papers/cupex/cupex.pdf
    • Menhir Reference Manual (version 20230608) “5.2 Parameterizing rules” http://gallium.inria.fr/~fpottier/menhir/manual.html#sec32
    • Joel Denny “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, 2010 https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=1519&context=all_dissertations
  27. References
    • Joel E. Denny “The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution”, 2010 https://core.ac.uk/download/pdf/82047055.pdf