
Ruby Parser progress report

yui-knk
August 19, 2023



Transcript

  1. Ruby Parser
    progress report
    August 19, 2023


    @yui-knk


    Yuichiro Kaneko


  2. The Bison Slayer
    https://twitter.com/kakutani/status/1657762294431105025


  3. About me
    • Yuichiro Kaneko


    • yui-knk (GitHub) / spikeolaf (Twitter)


    • The author of ruby/lrama LALR parser generator


    • CRuby committer, mainly developing parser-related features


    • Code positions to RNode (2018, Ruby 2.6)


    • Coverage


    • Error reporting


    • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6)


    • keep_tokens & error_tolerant option (2022, Ruby 3.2)


  4. Ruby Method Karuta
    https://twitter.com/spikeolaf/status/1658104271580045312


  5. Parser generator
    • Lrama creates “parse.c” from “parse.y”


    • Parser generator creates parser from DSL file
    parse.y -> Lrama -> parse.c


  6. Lrama v0.5.4 released 🎉


  7. Brief history of release
    • v0.5.4. Counterexamples by @yui-knk


    • v0.5.3. Merge error_recovery branch by @alitaso345


    • v0.5.2. Named references by @junk0612


    • v0.5.1. Add RBS and Steep check by @Little-Rubyist


    • v0.5.0. Add stdin mode by @nobu


    • v0.4.0. First ruby/ruby migrated version (<- RubyKaigi 2023)


  8. 15 Contributors !!!
    • Many of them are not CRuby committers


    • For some of them, Lrama is the first OSS project they sent a PR to :)


  9. What’s the problem?
    • Error-tolerant parser


    • For LSP


    • Universal Parser


    • It’s a tough task to write your own parser for Ruby


    • Maintainability


    • “parse.y” seems difficult


  10. Error-tolerant parser


  11. Progress
    • error_recovery branch has been merged
    https://github.com/ruby/lrama/pull/44


  12. Interface is implemented
    • Runtime configuration


    • YYERROR_RECOVERY_ENABLED: Enable/Disable the feature


    • YYMAXREPAIR: How deep to search for recovery candidates
    https://github.com/ruby/lrama/pull/74
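
To illustrate what a bounded search for recovery candidates can look like, here is a minimal sketch in Ruby. It is not Lrama's algorithm: the balanced-parentheses recognizer, the `find_repair` helper, and the `max_repair` keyword are all hypothetical, with `max_repair` standing in for the role the slide gives to YYMAXREPAIR.

```ruby
# Toy recognizer (assumption): accepts token lists with balanced parentheses.
def accepts?(tokens)
  depth = 0
  tokens.each do |t|
    depth += 1 if t == "("
    depth -= 1 if t == ")"
    return false if depth < 0
  end
  depth.zero?
end

# Breadth-first search over single-token insertions and deletions.
# max_repair bounds the number of edits tried (hypothetical stand-in
# for YYMAXREPAIR). Returns the list of edits, or nil if none is found.
def find_repair(tokens, max_repair:, alphabet: ["(", ")"])
  return [] if accepts?(tokens)
  frontier = [[tokens, []]]
  max_repair.times do
    next_frontier = []
    frontier.each do |toks, ops|
      # Try inserting each candidate token at each position.
      (0..toks.size).each do |i|
        alphabet.each do |t|
          cand = toks.dup.insert(i, t)
          new_ops = ops + [[:insert, i, t]]
          return new_ops if accepts?(cand)
          next_frontier << [cand, new_ops]
        end
      end
      # Try deleting each existing token.
      toks.each_index do |i|
        cand = toks.dup
        cand.delete_at(i)
        new_ops = ops + [[:delete, i]]
        return new_ops if accepts?(cand)
        next_frontier << [cand, new_ops]
      end
    end
    frontier = next_frontier
  end
  nil
end

p find_repair(["(", "x"], max_repair: 1)
```

A larger bound finds repairs for more badly damaged input at the cost of a search space that grows quickly with each extra edit, which is why such a limit is useful as a runtime setting.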


  13. What’s next?
    • Integrate it with Ruby


    • I’m now fighting with memory related bugs…


  14. Universal Parser


  15. Universal Parser mode
    • Development has been started
    https://github.com/ruby/ruby/pull/7927


  16. Progress
    • char(s) functions are reimplemented
    https://github.com/ruby/ruby/pull/8044


  17. What’s next?
    • Replacing Ruby objects in AST node with Strings


  18. Maintainability


  19. Why LR parser is the best
    • BNF is good interface


    • BNF is common knowledge
    https://dev.mysql.com/doc/refman/8.0/en/select.html
    https://datatracker.ietf.org/doc/html/rfc3986


  20. Why LR parser is the best
    • BNF is good interface


    • BNF is common knowledge


    • Benefits derived from LR parser


    • Auto detection of grammar rule conflicts


    • Error-tolerant parser without detailed grammar knowledge


  21. Why “parse.y” is difficult?
    1. “parse.y” is large (about 15,000 lines)


    2. LALR is difficult, e.g. S/R conflict, R/R conflict


    3. Bison doesn’t provide syntax sugar like option, list


    4. It’s a mixture of parser and ripper


    5. Parser and Lexer are tightly-coupled


  22. S/R conflict, R/R conflict
    • Counterexamples are implemented as of v0.5.4


  23. How to resolve conflict
    • Add %prec to resolve conflict
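
As a rough model of what %prec changes, the sketch below mimics the yacc-style rule for resolving a shift/reduce conflict: compare the precedence of the rule (taken from its last terminal, or from the token named by %prec) with the precedence of the lookahead token, and fall back to associativity on a tie. The table, the token names, and the `resolve` helper are illustrative assumptions, not code from Lrama or parse.y.

```ruby
# Illustrative precedence table: a higher level binds tighter.
# "UMINUS" plays the role of a pseudo-token that a rule names via %prec.
PRECEDENCE = {
  "+"      => { level: 1, assoc: :left },
  "*"      => { level: 2, assoc: :left },
  "**"     => { level: 3, assoc: :right },
  "UMINUS" => { level: 4, assoc: :right },
}.freeze

# Decide between shifting the lookahead and reducing by the rule.
# rule_prec_token is the token that gives the rule its precedence.
def resolve(rule_prec_token, lookahead)
  rule = PRECEDENCE[rule_prec_token]
  la = PRECEDENCE[lookahead]
  # Without precedence on both sides the conflict is left unresolved.
  return :unresolved if rule.nil? || la.nil?
  return :reduce if rule[:level] > la[:level]
  return :shift if rule[:level] < la[:level]

  # Same level: associativity decides.
  case la[:assoc]
  when :left then :reduce
  when :right then :shift
  else :error # non-associative operators reject the input
  end
end

puts resolve("+", "*")      # after "a + b" with "*" ahead
puts resolve("UMINUS", "*") # after "- a" (rule tagged %prec UMINUS) with "*" ahead
```

The second call shows the effect of %prec: the unary minus rule ends in an operand, so without the tag it would have no useful precedence of its own, while the tag makes it win against `*`.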


  24. Not enough
    • We can show how to resolve it (sometimes).


  25. Still not enough
    • There might be a better syntax to direct how to solve the conflict


  26. How to support new syntax
    • kaneko: How do you come up with new syntax?


    • nobu: I remember the BNF of Ruby, so it’s easy to find new syntax
    ideas. Once I come up with an idea, I check it by changing “parse.y”


    • kaneko: ??


    • nobu: ??


  27. An easier approach
    • (New) syntax for reducing indent of nested module


    • Finding syntax which is not valid now


    • “module X in Consts”


  28. module’s case
    • There is no “`in`” after M


  29. class’s case
    • There is “`in`” after “class C < D”
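
One quick way to check whether a candidate syntax is "not valid now" is to feed it to the current parser. The sketch below uses Ripper from Ruby's standard library, whose `Ripper.sexp` returns nil on a syntax error. The `valid_syntax?` helper and the sample sources are assumptions made for illustration, and the result depends on the Ruby version running it.

```ruby
require "ripper"

# True when the running Ruby already accepts the source.
# Ripper.sexp returns nil if the source has a syntax error.
def valid_syntax?(source)
  !Ripper.sexp(source).nil?
end

# Candidate spellings for the "in" idea, applied to module and class.
candidates = {
  "module form" => "module X in Consts\nend\n",
  "class form"  => "class C < D in Consts\nend\n",
}

candidates.each do |label, src|
  status = valid_syntax?(src) ? "already valid" : "currently a syntax error"
  puts "#{label}: #{status}"
end
```

A form that is currently a syntax error is free to receive a new meaning, while a form that already parses would change the meaning of existing code.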


  30. IDE
    • Report file is primitive and static


    • More interactive tool is better, e.g. irb commands, LSP


  31. Maintainability ->


    Language Designer’s
    Developer Experience


  32. Only primitive syntax
    • https://github.com/ruby/racc/pull/222


    • Menhir provides “Parameterizing rules”
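
Syntax sugar such as option and list can be understood as a mechanical expansion into plain BNF rules. The sketch below shows one such expansion in Ruby. The `expand` and `to_bnf` helpers and the generated nonterminal names are hypothetical and only follow the general shape of Menhir's `option(X)` and `list(X)`. This is not how Lrama or Racc implement it.

```ruby
# Expand option(X) / list(X) into plain rules.
# Returns [nonterminal_name, rules]; each rule is [lhs, rhs_symbols].
def expand(kind, sym)
  name = "#{kind}_#{sym}"
  rules =
    case kind
    when :option then [[name, []], [name, [sym]]]
    when :list   then [[name, []], [name, [sym, name]]]
    else raise ArgumentError, "unknown parameterized rule: #{kind}"
    end
  [name, rules]
end

# Render rules in a yacc-like notation, one line per nonterminal.
def to_bnf(rules)
  rules.group_by(&:first).map do |lhs, alts|
    rhs = alts.map { |_, syms| syms.empty? ? "/* empty */" : syms.join(" ") }
    "#{lhs}: #{rhs.join(' | ')} ;"
  end.join("\n")
end

_, list_rules = expand(:list, "stmt")
puts to_bnf(list_rules)
# prints: list_stmt: /* empty */ | stmt list_stmt ;
```

The grammar author writes only `list(stmt)`, and the generator produces the helper nonterminal, so the same boilerplate no longer has to be repeated by hand for every repeated or optional element.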


  33. mixture of parser and ripper
    • One of the complexities of Ripper is that we use only one semantic
    value stack to manage both (1) semantic values and (2) values
    returned by callback methods


  34. mixture of parser and ripper
    • Implement User defined stack for Lrama


    • Can manage VALUE, the result of callback method call, on
    another stack
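
The idea of a second stack can be sketched with a toy shift/reduce driver in Ruby. The `TwoStackParser` and `Printer` classes and their method names are assumptions made for illustration and do not reflect Lrama's actual interface. One stack holds semantic values, the other holds whatever the callback methods return.

```ruby
# Toy driver that keeps two parallel stacks.
class TwoStackParser
  attr_reader :values, :callbacks

  def initialize(handler)
    @handler = handler # object with Ripper-style event methods
    @values = []       # (1) semantic values, e.g. AST nodes
    @callbacks = []    # (2) values returned by callback methods
  end

  def shift(token)
    @values.push(token)
    @callbacks.push(@handler.on_token(token))
  end

  # Pop `arity` entries from both stacks and push the reduced results.
  def reduce(event, arity)
    vals = @values.pop(arity)
    cbs = @callbacks.pop(arity)
    @values.push([event, *vals])
    @callbacks.push(@handler.public_send(:"on_#{event}", *cbs))
  end
end

# Example handler: builds a string instead of an AST.
class Printer
  def on_token(tok)
    tok.to_s
  end

  def on_binary(left, op, right)
    "(#{left} #{op} #{right})"
  end
end

parser = TwoStackParser.new(Printer.new)
parser.shift(1)
parser.shift(:+)
parser.shift(2)
parser.reduce(:binary, 3)
p parser.values    # prints [[:binary, 1, :+, 2]]
p parser.callbacks # prints ["(1 + 2)"]
```

Because the two kinds of values never share a slot, the parser actions and the callback results can evolve independently.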


  35. Hand written parser -> Racc
    https://github.com/ruby/lrama/pull/62


  36. Parser and Lexer are tightly-coupled
    • Today I focus on “enum lex_state_e state”.


    • “1 || 2”. “||” is tOROP


    • “a do || end”. “||” is ‘|’ and ‘|’


    • Lexer checks EXPR_BEG to decide which token to generate


    • I have been wondering why such communication is needed,
    because the parser knows the current situation. The parser knows it
    never accepts “||” after “a do”.
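
The state-dependent treatment of “||” can be mimicked with a very small toy lexer in Ruby. It is a simplified sketch, not the logic in parse.y: it splits on whitespace, knows only two states, and the `toy_lex` helper is hypothetical. The token names are borrowed from the slide.

```ruby
# Toy lexer: the meaning of "||" depends on the current lex state.
# Returns an array of [token_type, text] pairs.
def toy_lex(source)
  state = :EXPR_BEG
  source.split.flat_map do |word|
    case word
    when "||"
      # At the beginning of an expression (e.g. right after "do"),
      # "||" is an empty block parameter list: two '|' tokens.
      toks = state == :EXPR_BEG ? [[:"|", "|"], [:"|", "|"]] : [[:tOROP, "||"]]
      state = :EXPR_BEG
      toks
    when "do"
      state = :EXPR_BEG
      [[:keyword_do, word]]
    when "end"
      state = :EXPR_END
      [[:keyword_end, word]]
    when /\A\d+\z/
      state = :EXPR_END
      [[:tINTEGER, word]]
    else
      state = :EXPR_END
      [[:tIDENTIFIER, word]]
    end
  end
end

p toy_lex("1 || 2").map(&:first)      # prints [:tINTEGER, :tOROP, :tINTEGER]
p toy_lex("a do || end").map(&:first) # prints [:tIDENTIFIER, :keyword_do, :|, :|, :keyword_end]
```

The state variable here duplicates information that a parser already has from its own stack, which is the redundancy the slide questions.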


  37. Pseudo-Scannerless Minimal LR(1)
    • Joel Denny “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the
    Deterministic Parsing of Composite Languages”, 2010


    • “vector<vector<int>> v;”. “>>” is an example of this paper


    • > Nevertheless, traditional scanner and parser generators attempt to
    generate loosely coupled scanners and parsers, so the user must
    maintain these tightly coupled scanner and parser specifications
    separately but consistently.


    • > Scanner and parser specifications would be significantly more
    maintainable if all sub-language transitions were instead computed
    from a grammar by a parser generator and recognized
    automatically by the scanner using the parser’s stack.


  38. Lex state
    • State of the lexer by which lexer determines which token type is
    generated


    • Parser updates lex state


    • Lexer updates and uses lex state


    • Example: keyword_if & modifier_if


  39. Move lexer logic to parser
    • Only parser updates lex state


    • Lexer has very little logic


  40. Automaton & Automaton
    • LR parser manages automaton


    • Lex state is also automaton


    • automaton + automaton = automaton
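
The claim that two automata combine into one can be illustrated with the textbook product construction. The sketch below, in Ruby, treats both the parser side and the lex state side as plain finite automata over the same token alphabet, which is a simplification. The state names, the token names, and the `product` helper are made up for the example.

```ruby
# A DFA is { start:, transitions: { [state, symbol] => next_state } }.
# The product runs both automata in lockstep; its states are pairs.
def product(a, b)
  start = [a[:start], b[:start]]
  symbols = (a[:transitions].keys + b[:transitions].keys).map(&:last).uniq
  transitions = {}
  seen = { start => true }
  queue = [start]
  until queue.empty?
    state = queue.shift
    sa, sb = state
    symbols.each do |sym|
      na = a[:transitions][[sa, sym]]
      nb = b[:transitions][[sb, sym]]
      next if na.nil? || nb.nil? # both sides must be able to move
      nxt = [na, nb]
      transitions[[state, sym]] = nxt
      next if seen[nxt]
      seen[nxt] = true
      queue << nxt
    end
  end
  { start: start, transitions: transitions }
end

# Toy "parser" automaton for: ident do | | ...
parser = {
  start: :stmt,
  transitions: {
    [:stmt, :ident]       => :call,
    [:call, :do]          => :block_start,
    [:block_start, :pipe] => :params,
    [:params, :pipe]      => :body,
  },
}

# Toy lex state automaton.
lex_state = {
  start: :EXPR_BEG,
  transitions: {
    [:EXPR_BEG, :ident] => :EXPR_END,
    [:EXPR_END, :do]    => :EXPR_BEG,
    [:EXPR_BEG, :pipe]  => :EXPR_BEG,
  },
}

combined = product(parser, lex_state)
combined[:transitions].each { |(from, sym), to| puts "#{from.inspect} --#{sym}--> #{to.inspect}" }
```

Each combined state records both the parser position and the lex state, so a single automaton carries the information that is otherwise passed back and forth between lexer and parser.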


  41. In the future
    • Lexer (C code) is a source of Ruby specific knowledge. Then
    developers need to learn the knowledge


    • Scannerless parser can remove such Ruby specific knowledge.


    • Once scannerless parsers are covered by computer science
    textbooks, developers can manage “parse.y” with just textbook
    knowledge


  42. Why “parse.y” is difficult?
    1. “parse.y” is large (about 15,000 lines)


    2. LALR is difficult, e.g. S/R conflict, R/R conflict


    • Counterexamples


    • More hints for Conflict Resolution & new syntax discussion


    3. Bison doesn’t provide syntax sugar like option, list


    • option, list and so on


    4. It’s a mixture of parser and ripper


    • User defined stack


    5. Parser and Lexer are tightly-coupled


    • Moving lexer logic into parser


    • Scannerless LR


    • IELR(1)


    • Lex state management in parser


  43. Summary
    • Error-tolerant parser


    • Next action: Integration with Ruby


    • Universal Parser


    • Next action: Remove Ruby object from AST node


  44. Summary
    • Maintainability


    • Ruby specific knowledge from parse.y


    • Scannerless LR


    • Lex state management by parser


    • Improve Language Designer’s Developer Experience


    • Fine grained conflict resolution


    • Guide for how to resolve conflict


  45. Thank you!!


  46. References
    • Yuichiro Kaneko “The future vision of Ruby Parser”, 2023
    https://rubykaigi.org/2023/presentations/spikeolaf.html


    • Lrama LALR parser generator https://github.com/ruby/lrama


    • Chinawat Isradisaikul and Andrew C. Myers. “Finding Counterexamples from
    Parsing Conflicts”, 2015
    https://www.cs.cornell.edu/andru/papers/cupex/cupex.pdf


    • Menhir Reference Manual (version 20230608) “5.2 Parameterizing rules”
    http://gallium.inria.fr/~fpottier/menhir/manual.html#sec32


    • Joel Denny “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic
    Parsing of Composite Languages”, 2010
    https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=1519&context=all_dissertations


  47. References
    • Joel E. Denny “The IELR(1) algorithm for generating minimal LR(1) parser tables
    for non-LR(1) grammars with conflict resolution”, 2010
    https://core.ac.uk/download/pdf/82047055.pdf
