Slide 1

Slide 1 text

Ruby Parser progress report August 19, 2023 @yui-knk Yuichiro Kaneko

Slide 2

Slide 2 text

The Bison Slayer https://twitter.com/kakutani/status/1657762294431105025

Slide 3

Slide 3 text

About me • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf (Twitter) • The author of ruby/lrama LALR parser generator • CRuby committer, mainly develop parser related features • Code positions to RNode (2018, Ruby 2.6) • Coverage • Error reporting • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6) • keep_tokens & error_tolerant option (2022, Ruby 3.2)

Slide 4

Slide 4 text

Ruby Method Karuta https://twitter.com/spikeolaf/status/1658104271580045312

Slide 5

Slide 5 text

Parser generator • Lrama creates “parse.c” from “parse.y” • Parser generator creates parser from DSL file [Diagram: parse.y → Lrama → parse.c]

Slide 6

Slide 6 text

Lrama v0.5.4 released 🎉

Slide 7

Slide 7 text

Brief history of releases • v0.5.4. Counterexamples by @yui-knk • v0.5.3. Merge error_recovery branch by @alitaso345 • v0.5.2. Named references by @junk0612 • v0.5.1. Add RBS and Steep check by @Little-Rubyist • v0.5.0. Add stdin mode by @nobu • v0.4.0. First ruby/ruby migrated version (<- RubyKaigi 2023)

Slide 8

Slide 8 text

15 Contributors !!! • Many of them are not CRuby committers • For some of them, Lrama is the first OSS project they have sent a PR to :)

Slide 9

Slide 9 text

What’s the problem? • Error-tolerant parser • For LSP • Universal Parser • It’s a tough task to write your own parser for Ruby • Maintainability • “parse.y” seems difficult

Slide 10

Slide 10 text

Error-tolerant parser

Slide 11

Slide 11 text

Progress • error_recovery branch has been merged https://github.com/ruby/lrama/pull/44

Slide 12

Slide 12 text

Interface is implemented • Runtime configuration • YYERROR_RECOVERY_ENABLED: Enable/Disable the feature • YYMAXREPAIR: How deep to search for recovery candidates https://github.com/ruby/lrama/pull/74
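To give a feel for what a search depth limit such as YYMAXREPAIR controls, here is a toy sketch of a depth-limited repair search. It is not Lrama's implementation: the balanced-parentheses recognizer stands in for a real parser, and `MAX_REPAIR`, `accepts?` and `find_repair` are hypothetical names made up for this example.

```ruby
# Toy depth-limited repair search. NOT Lrama's algorithm.
MAX_REPAIR = 3 # hypothetical stand-in for YYMAXREPAIR

# Recognizer for balanced parentheses, standing in for a real parser.
def accepts?(tokens)
  depth = 0
  tokens.each do |t|
    depth += (t == "(" ? 1 : -1)
    return false if depth < 0
  end
  depth.zero?
end

# Breadth-first search over single-token insertions and deletions.
# Returns the shortest list of repairs that makes the input acceptable,
# or nil when no repair is found within max_repair steps.
def find_repair(tokens, max_repair: MAX_REPAIR)
  queue = [[tokens, []]]
  seen = { tokens => true }
  until queue.empty?
    current, repairs = queue.shift
    return repairs if accepts?(current)
    next if repairs.size >= max_repair # depth limit reached

    (0..current.size).each do |i|
      candidates = ["(", ")"].map do |t|
        [current[0...i] + [t] + current[i..-1], [:insert, i, t]]
      end
      if i < current.size
        candidates << [current[0...i] + current[(i + 1)..-1], [:delete, i, current[i]]]
      end
      candidates.each do |candidate, op|
        next if seen[candidate]
        seen[candidate] = true
        queue << [candidate, repairs + [op]]
      end
    end
  end
  nil
end

p find_repair(%w[( (])                # two repairs are needed
p find_repair(%w[( (], max_repair: 1) # limit too small, so nil
```

A larger limit finds repairs for more badly broken input, at the cost of a search space that grows quickly with depth.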

Slide 13

Slide 13 text

What’s next? • Integrate it with Ruby • I’m now fighting with memory-related bugs…

Slide 14

Slide 14 text

Universal Parser

Slide 15

Slide 15 text

Universal Parser mode • Development has been started https://github.com/ruby/ruby/pull/7927

Slide 16

Slide 16 text

Progress • char(s) functions are reimplemented https://github.com/ruby/ruby/pull/8044

Slide 17

Slide 17 text

What’s next? • Replacing Ruby objects in AST node with Strings

Slide 18

Slide 18 text

Maintainability

Slide 19

Slide 19 text

Why LR parser is the best • BNF is good interface • BNF is common knowledge https://dev.mysql.com/doc/refman/8.0/en/select.html https://datatracker.ietf.org/doc/html/rfc3986

Slide 20

Slide 20 text

Why LR parser is the best • BNF is good interface • BNF is common knowledge • Benefits derived from LR parser • Auto detection of grammar rule conflicts • Error-tolerant parser without detailed grammar knowledge

Slide 21

Slide 21 text

Why “parse.y” is difficult? 1. “parse.y” is large (about 15,000 lines) 2. LALR is difficult, e.g. S/R conflict, R/R conflict 3. Bison doesn’t provide syntax sugar like option, list 4. It’s a mixture of parser and ripper 5. Parser and Lexer are tightly-coupled

Slide 22

Slide 22 text

S/R conflict, R/R conflict • Counterexamples are implemented as of v0.5.4

Slide 23

Slide 23 text

How to resolve a conflict • Add %prec to resolve the conflict
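As a reminder of what %prec feeds into: yacc-style generators settle a shift/reduce conflict by comparing the precedence of the completed rule (taken from its last terminal, or from the token named by %prec) with the precedence of the lookahead token. A minimal sketch of that comparison follows. The `PRECEDENCE` table and `resolve_conflict` are illustrative names, not part of Lrama or Bison.

```ruby
# Precedence table: higher level binds tighter. Entries are illustrative.
PRECEDENCE = {
  "==" => [0, :nonassoc],
  "+"  => [1, :left],
  "*"  => [2, :left],
  "**" => [3, :right],
}

# Decide a shift/reduce conflict between a completed rule and the lookahead.
# rule_token is the token whose precedence the rule carries
# (its last terminal, or the one named by %prec).
def resolve_conflict(rule_token, lookahead)
  rule_level, = PRECEDENCE.fetch(rule_token)
  lookahead_level, assoc = PRECEDENCE.fetch(lookahead)
  return :reduce if rule_level > lookahead_level
  return :shift if rule_level < lookahead_level

  # Same level: associativity decides.
  case assoc
  when :left  then :reduce
  when :right then :shift
  else :error # %nonassoc makes the sequence a syntax error
  end
end

p resolve_conflict("*", "+")   # rule "e * e" done, "+" ahead
p resolve_conflict("+", "*")   # rule "e + e" done, "*" ahead
p resolve_conflict("**", "**") # right-associative operator
```

Attaching %prec to a rule amounts to changing which table entry `rule_token` refers to.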

Slide 24

Slide 24 text

Not enough • We can show how to resolve it (sometimes).

Slide 25

Slide 25 text

Still not enough • There might be a better syntax to direct how to solve the conflict

Slide 26

Slide 26 text

How to support new syntax • kaneko: How do you come up with new syntax? • nobu: I remember the BNF of Ruby, so it’s easy to find new syntax ideas. Once I come up with an idea, I check it by changing “parse.y” • kaneko: ?? • nobu: ??

Slide 27

Slide 27 text

An easier approach • (New) syntax for reducing indent of nested module • Finding syntax which is not valid now • “module X in Consts”

Slide 28

Slide 28 text

module’s case • There is no “`in`” after M

Slide 29

Slide 29 text

class’s case • There is “`in`” after “class C < D”
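One rough way to probe whether a candidate syntax is still free is to ask the current parser whether it rejects the snippet. The sketch below uses Ripper from the standard library, relying on `Ripper.sexp` returning nil on a parse error. `valid_syntax?` is a helper name made up for this example, and the result for the class case depends on the Ruby version, so no output is claimed here.

```ruby
require "ripper"

# True when the current Ruby parser accepts the source.
# Ripper.sexp returns nil when it hits a syntax error.
def valid_syntax?(source)
  !Ripper.sexp(source).nil?
end

snippets = [
  "module M\nend",
  "module X in Consts\nend", # candidate new syntax from the slides
  "class C < D in E\nend",   # `in` can already follow the superclass expression
]

snippets.each do |src|
  puts "#{src.lines.first.chomp.ljust(22)} valid? #{valid_syntax?(src)}"
end
```

This only tells you that a snippet is rejected today. The parser state, as shown in the report file, is still needed to see why.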

Slide 30

Slide 30 text

IDE • Report file is primitive and static • More interactive tool is better, e.g. irb commands, LSP

Slide 31

Slide 31 text

Maintainability -> Language Designer’s Developer Experience

Slide 32

Slide 32 text

Only primitive syntax • https://github.com/ruby/racc/pull/222 • Menhir provides “Parameterizing rules”
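To make "syntax sugar like option, list" concrete, here is a small sketch of how a generator could desugar Menhir-style `option(X)` and `list(X)` into plain BNF rules. The generated nonterminal names (`option_X`, `list_X`), the left-recursive expansion of `list`, and the sample `args` rule are assumptions for illustration, not what Lrama, Racc or Menhir actually emit.

```ruby
# Replace option(X) / list(X) by a generated nonterminal and record its rules.
# Generated names such as "list_comma_arg" are hypothetical.
def expand(symbol, generated)
  m = symbol.match(/\A(option|list)\((\w+)\)\z/)
  return symbol unless m

  kind, arg = m[1], m[2]
  name = "#{kind}_#{arg}"
  generated[name] ||=
    if kind == "option"
      [[], [arg]]       # option_X: %empty | X
    else
      [[], [name, arg]] # list_X: %empty | list_X X
    end
  name
end

# grammar maps each nonterminal to a list of alternatives (arrays of symbols).
def desugar(grammar)
  generated = {}
  plain = grammar.transform_values do |alternatives|
    alternatives.map { |rhs| rhs.map { |sym| expand(sym, generated) } }
  end
  plain.merge(generated)
end

grammar = {
  "args" => [["arg", "list(comma_arg)", "option(block_arg)"]],
}

desugar(grammar).each do |lhs, alternatives|
  bodies = alternatives.map { |rhs| rhs.empty? ? "%empty" : rhs.join(" ") }
  puts "#{lhs}: #{bodies.join(' | ')}"
end
```

Without such sugar, every optional or repeated element needs its own hand-written helper rule in “parse.y”.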

Slide 33

Slide 33 text

mixture of parser and ripper • One of the complexities of Ripper is that we use only one semantic value stack to manage both (1) semantic values and (2) returned values of callback methods

Slide 34

Slide 34 text

mixture of parser and ripper • Implement User defined stack for Lrama • Can manage VALUE, the result of callback method call, on another stack

Slide 35

Slide 35 text

Hand written parser -> Racc https://github.com/ruby/lrama/pull/62

Slide 36

Slide 36 text

Parser and Lexer are tightly-coupled • Today I focus on “enum lex_state_e state”. • “1 || 2”. “||” is tOROP • “a do || end”. “||” is ‘|’ and ‘|’ • Lexer checks EXPR_BEG to decide which token is generated • I have been wondering why such communication is needed, because the parser knows the current situation. The parser knows it never accepts “||” after “a do”.
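The “||” example can be sketched with a toy lexer whose output depends on a lex state. The state names mirror `lex_state_e` values, but the transitions are heavily simplified and `tokenize` is a made-up function, not the logic in “parse.y”.

```ruby
# Toy lexer: the tokens emitted for "||" depend on the current lex state.
# Only two states are modelled; the real lexer has many more.
def tokenize(source)
  state = :EXPR_BEG
  tokens = []
  source.scan(/\|\||\w+|\S/) do |text|
    case text
    when "||"
      if state == :EXPR_BEG
        # At the beginning of an expression (e.g. right after `do`),
        # "||" is an empty block parameter list: '|' and '|'.
        tokens << ["|", "|"] << ["|", "|"]
      else
        tokens << [:tOROP, text]
      end
      state = :EXPR_BEG
    when "do"
      tokens << [:keyword_do, text]
      state = :EXPR_BEG
    when "end"
      tokens << [:keyword_end, text]
      state = :EXPR_END
    when /\A\d+\z/
      tokens << [:tINTEGER, text]
      state = :EXPR_END
    when /\A\w+\z/
      tokens << [:tIDENTIFIER, text]
      state = :EXPR_END
    else
      tokens << [text, text]
      state = :EXPR_BEG
    end
  end
  tokens
end

p tokenize("1 || 2").map(&:first)      # => [:tINTEGER, :tOROP, :tINTEGER]
p tokenize("a do || end").map(&:first) # => [:tIDENTIFIER, :keyword_do, "|", "|", :keyword_end]
```

Note that the lexer both reads and updates `state` here, which is the coupling the slide questions.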

Slide 37

Slide 37 text

Pseudo-Scannerless Minimal LR(1) • Joel Denny “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, 2010 • “vector<vector<int>> v;”. “>>” is an example from this paper • > Nevertheless, traditional scanner and parser generators attempt to generate loosely coupled scanners and parsers, so the user must maintain these tightly coupled scanner and parser specifications separately but consistently. • > Scanner and parser specifications would be significantly more maintainable if all sub-language transitions were instead computed from a grammar by a parser generator and recognized automatically by the scanner using the parser’s stack.

Slide 38

Slide 38 text

Lex state • State of the lexer by which lexer determines which token type is generated • Parser updates lex state • Lexer updates and uses lex state • Example: keyword_if & modifier_if

Slide 39

Slide 39 text

Move lexer logic to parser • Only parser updates lex state • Lexer has very little logic
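A minimal sketch of this direction: the lexer keeps no state of its own and only tries the token kinds that the current parser state can accept. `PARSER_TABLE` is a hypothetical hand-written table for a tiny language, not something generated by Lrama, and `:pipe` stands for the ‘|’ token.

```ruby
require "strscan"

# Token patterns in priority order (longer operators first). Names are illustrative.
TOKEN_PATTERNS = [
  [:tOROP, /\|\|/],
  [:pipe, /\|/],
  [:keyword_do, /do\b/],
  [:keyword_end, /end\b/],
  [:tINTEGER, /\d+/],
  [:tIDENTIFIER, /[a-z_]\w*/],
]

# Hypothetical parser table: for each parser state, the acceptable token
# kinds and the state reached after shifting each of them.
PARSER_TABLE = {
  start:        { tINTEGER: :after_expr, tIDENTIFIER: :after_expr },
  after_expr:   { tOROP: :start, keyword_do: :after_do },
  after_do:     { pipe: :in_params, keyword_end: :done },
  in_params:    { pipe: :after_params, tIDENTIFIER: :in_params },
  after_params: { keyword_end: :done },
  done:         {},
}

# The lexer is stateless: only the parser state selects the token kind.
def parse(source)
  scanner = StringScanner.new(source)
  state = :start
  tokens = []
  loop do
    scanner.skip(/\s+/)
    break if scanner.eos?

    acceptable = PARSER_TABLE.fetch(state)
    kind, = TOKEN_PATTERNS.find { |k, re| acceptable.key?(k) && scanner.scan(re) }
    raise "unexpected input at #{scanner.pos} in state #{state}" unless kind

    tokens << [kind, scanner.matched]
    state = acceptable.fetch(kind)
  end
  tokens
end

p parse("1 || 2").map(&:first)      # => [:tINTEGER, :tOROP, :tINTEGER]
p parse("a do || end").map(&:first) # => [:tIDENTIFIER, :keyword_do, :pipe, :pipe, :keyword_end]
```

After `a do` the table simply has no entry for tOROP, so “||” is read as two ‘|’ tokens without any EXPR_BEG check in the lexer.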

Slide 40

Slide 40 text

Automaton & Automaton • LR parser manages automaton • Lex state is also automaton • automaton + automaton = automaton
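The claim "automaton + automaton = automaton" matches the textbook product construction for finite automata. The sketch below treats both the parser's state machine and the lex state as small DFAs over the same token alphabet. That is a simplification, since a real LR parser also has a stack, and both example automata are made up.

```ruby
# A DFA here is { start: state, delta: { [state, symbol] => state } }.
# product builds the reachable part of the product automaton, whose
# states are pairs [state_of_a, state_of_b].
def product(a, b, alphabet)
  start = [a[:start], b[:start]]
  delta = {}
  seen = { start => true }
  pending = [start]
  until pending.empty?
    pair = pending.shift
    alphabet.each do |sym|
      nxt = [a[:delta][[pair[0], sym]], b[:delta][[pair[1], sym]]]
      next if nxt.include?(nil) # one side has no transition on sym

      delta[[pair, sym]] = nxt
      unless seen[nxt]
        seen[nxt] = true
        pending << nxt
      end
    end
  end
  { start: start, states: seen.keys, delta: delta }
end

# Made-up parser automaton over token kinds.
parser = {
  start: :s0,
  delta: { [:s0, :expr] => :s1, [:s1, :op] => :s0, [:s1, :do] => :s2, [:s2, :expr] => :s1 },
}
# Made-up lex state automaton over the same token kinds.
lex_state = {
  start: :EXPR_BEG,
  delta: { [:EXPR_BEG, :expr] => :EXPR_END, [:EXPR_END, :op] => :EXPR_BEG, [:EXPR_END, :do] => :EXPR_BEG },
}

combined = product(parser, lex_state, [:expr, :op, :do])
p combined[:states] # => [[:s0, :EXPR_BEG], [:s1, :EXPR_END], [:s2, :EXPR_BEG]]
```

In the combined automaton each state already records the lex state, so nothing needs to be tracked separately by the lexer.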

Slide 41

Slide 41 text

In the future • Lexer (C code) is a source of Ruby-specific knowledge, so developers need to learn that knowledge • Scannerless parser can remove such Ruby-specific knowledge. • Once scannerless parsers are covered by computer science textbooks, developers can manage “parse.y” with just textbook knowledge

Slide 42

Slide 42 text

Why “parse.y” is difficult? 1. “parse.y” is large (about 15,000 lines) 2. LALR is difficult, e.g. S/R conflict, R/R conflict • Counterexamples • More hints for Conflict Resolution & new syntax discussion 3. Bison doesn’t provide syntax sugar like option, list • option, list and so on 4. It’s a mixture of parser and ripper • User defined stack 5. Parser and Lexer are tightly-coupled • Moving lexer logic into parser • Scannerless LR • IELR(1) • Lex state management in parser

Slide 43

Slide 43 text

Summary • Error-tolerant parser • Next action: Integration with Ruby • Universal Parser • Next action: Remove Ruby object from AST node

Slide 44

Slide 44 text

Summary • Maintainability • Ruby specific knowledge from parse.y • Scannerless LR • Lex state management by parser • Improve Language Designer’s Developer Experience • Fine grained conflict resolution • Guide for how to resolve conflict

Slide 45

Slide 45 text

Thank you!!

Slide 46

Slide 46 text

References • Yuichiro Kaneko “The future vision of Ruby Parser”, 2023 https://rubykaigi.org/2023/presentations/spikeolaf.html • Lrama LALR parser generator https://github.com/ruby/lrama • Chinawat Isradisaikul and Andrew C. Myers. “Finding Counterexamples from Parsing Conflicts”, 2015 https://www.cs.cornell.edu/andru/papers/cupex/cupex.pdf • Menhir Reference Manual (version 20230608) “5.2 Parameterizing rules” http://gallium.inria.fr/~fpottier/menhir/manual.html#sec32 • Joel Denny “PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages”, 2010 https://tigerprints.clemson.edu/cgi/viewcontent.cgi?article=1519&context=all_dissertations

Slide 47

Slide 47 text

References • Joel E. Denny “The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution”, 2010 https://core.ac.uk/download/pdf/82047055.pdf