Slide 1

Slide 1 text

The future vision of Ruby Parser May 11, 2023 in RubyKaigi 2023 @yui-knk Yuichiro Kaneko

Slide 2

Slide 2 text

About me • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf (Twitter) • Treasure Data • Engineering Manager of Applications Backend

Slide 3

Slide 3 text

PR: We are a Gold sponsor!

Slide 4

Slide 4 text

TD and Ruby committers • twitter: @nalsh, GitHub: @nurse • twitter: @k_tsj, GitHub: @k-tsj • twitter: @spikeolaf, GitHub: @yui-knk • twitter: @mineroaoki, GitHub: @aamine • twitter: @nahi, GitHub: @nahi Applications Backend

Slide 5

Slide 5 text

Attendees from TD @spikeolaf @nalsh @k_tsj @frsyuki @takkanm @makimoto @citystar (GH) @chezou @ybiquitous @hkdnet @a_ksi19 @exoego

Slide 6

Slide 6 text

About me • Yuichiro Kaneko • yui-knk (GitHub) / spikeolaf (Twitter) • CRuby committer, mainly developing parser-related features • Code positions to RNode (2018, Ruby 2.6) • Coverage • Error reporting • RubyVM::AbstractSyntaxTree (2018, Ruby 2.6) • keep_tokens option (2022, Ruby 3.2) • error_tolerant option (2022, Ruby 3.2)

Slide 7

Slide 7 text

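Introduction The idea that forms the basis of LR parsing is, as with many excellent ideas, a simple one: “repeatedly apply the parsing technique for regular languages to parse a wide class of context-free grammars.” Atsushi Ohori (大堀 淳), “LR構文解析の原理” (“Principles of LR Parsing”)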

Slide 8

Slide 8 text

Parser in Ruby • Converts an input script into an Abstract Syntax Tree • CRuby’s parser is an LALR parser • CRuby uses GNU Bison to generate the parser code

Slide 9

Slide 9 text

History of parser generator • 1965: Donald E. Knuth invents LR parsing. “On the translation of languages from left to right” • 1975: Yacc is published • 1985: GNU Bison initial release • 1989: Berkeley Yacc initial release • 2006: GCC migrates its parser from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers

Slide 10

Slide 10 text

What’s the problem? • Usability • Error-tolerant parser • Maintainability • “parse.y” seems difficult • Universal Parser • It’s a tough task to write your own parser for Ruby From “Ruby Committers vs The World” 2022 and 2021

Slide 11

Slide 11 text

Today’s talk • Usability • Maintainability • Universal Parser • Me

Slide 12

Slide 12 text

Usability For example, the Java error recovery approach in the Eclipse IDE is 5KLoC long, making it only slightly smaller than a modern version of Berkeley Yacc

Slide 13

Slide 13 text

Parser’s responsibility • Check if the input is valid Ruby code • If valid • Build an internal representation (AST) for the subsequent process (“compile.c”) • If invalid • Report a Syntax Error

Slide 14

Slide 14 text

Why is an error-tolerant parser needed? • LSP requires the parser to parse invalid code as far as possible • Just raising a syntax error is not enough in this case

Slide 15

Slide 15 text

Parser’s responsibility • Check if the input is valid Ruby code • If valid • Build an internal representation (AST) for the subsequent process (“compile.c”) • If invalid • Report a Syntax Error • Build the AST as far as possible (New!)

Slide 16

Slide 16 text

Python’s approach • CPython uses PEG parser • Try “A” • Try “B” if “A” fails • Try “C” if “B” fails https://devguide.python.org/internals/parser
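The "try A, then B, then C" behaviour is the ordered choice of PEG. A minimal sketch of it follows, assuming a toy token-list input; the helper names (`token`, `sequence`, `ordered_choice`) and the toy `stmt` rule are invented for illustration and are not CPython's actual parser API.

```python
from typing import Callable, List, Optional

Tokens = List[str]
# A parser takes (tokens, position) and returns the new position, or None on failure.
Parser = Callable[[Tokens, int], Optional[int]]


def token(expected: str) -> Parser:
    """Match exactly one token."""
    def parse(tokens: Tokens, pos: int) -> Optional[int]:
        if pos < len(tokens) and tokens[pos] == expected:
            return pos + 1
        return None
    return parse


def sequence(*parsers: Parser) -> Parser:
    """Match all parsers one after another."""
    def parse(tokens: Tokens, pos: int) -> Optional[int]:
        current: Optional[int] = pos
        for p in parsers:
            current = p(tokens, current)
            if current is None:
                return None
        return current
    return parse


def ordered_choice(*alternatives: Parser) -> Parser:
    """Try each alternative in order; the first one that succeeds wins."""
    def parse(tokens: Tokens, pos: int) -> Optional[int]:
        for alt in alternatives:
            result = alt(tokens, pos)
            if result is not None:
                return result
        return None
    return parse


# Toy rule (hypothetical): stmt <- "if" "x" / "while" "x" / "x"
stmt = ordered_choice(
    sequence(token("if"), token("x")),
    sequence(token("while"), token("x")),
    token("x"),
)
```

Because alternatives are tried in a fixed order, a failing alternative is not an error by itself; only the failure of every alternative is.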

Slide 17

Slide 17 text

Python’s approach • CPython defines valid cases and invalid cases • A rule failure doesn’t imply a parsing failure like in context free grammars https://github.com/python/cpython/blob/889b0b9bf95651fc05ad532cc4a66c0f8ff29fc2/Grammar/python.gram

Slide 18

Slide 18 text

Rust/Go’s approach • Both of them use hand-written parser • Go skips one or more tokens until one of “followlist” appears https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L1032 
 https://github.com/golang/go/blob/157aae6eed1c092fd9e8ead3527185378eb828e1/src/cmd/compile/internal/syntax/parser.go#L321
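The skipping strategy can be sketched as follows. This is a simplified illustration under my own assumptions (a toy grammar where every statement is `name ;`), not a port of the Go parser; the function names are hypothetical.

```python
from typing import List, Set


def advance(tokens: List[str], pos: int, followlist: Set[str]) -> int:
    """Skip tokens from pos until one in followlist (or the end of input) is reached."""
    while pos < len(tokens) and tokens[pos] not in followlist:
        pos += 1
    return pos


def parse_statements(tokens: List[str]) -> List[str]:
    """Toy grammar: every statement is `name ;`. Anything else is an error."""
    results = []
    pos = 0
    while pos < len(tokens):
        if (pos + 1 < len(tokens)
                and tokens[pos].isidentifier()
                and tokens[pos + 1] == ";"):
            results.append(f"stmt({tokens[pos]})")
            pos += 2
        else:
            # Syntax error: resynchronize at the next ";" and keep parsing.
            results.append("error")
            pos = advance(tokens, pos, {";"})
            pos += 1  # step over the ";" itself, if there is one
    return results
```

The parser keeps going after the error, but everything between the error position and the synchronization token is thrown away.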

Slide 19

Slide 19 text

Rust/Go’s approach • rust-analyzer also skips specified tokens https://github.com/rust-lang/rust-analyzer/blob/21e61bee8b74e93f14205f4a6c316db08f811e38/crates/parser/src/grammar.rs#L302 
 https://github.com/rust-lang/rust-analyzer/blob/21e61bee8b74e93f14205f4a6c316db08f811e38/crates/parser/src/parser.rs#L227

Slide 20

Slide 20 text

Bison’s approach • Bison’s panic-mode error recovery discards part of the input so that it can parse the rest of the input.

Slide 21

Slide 21 text

In short • PEG requires developers to implement all error cases • A hand-written parser requires developers to cover all error cases • Panic-mode error recovery loses part of the input, which is sometimes the most important part for completion

Slide 22

Slide 22 text

Our approach • Error recovery based on Insert/Delete/Shift operations • Fischer, Corchuelo, CPCT+ • Lukas Diekmann and Laurence Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”, July 2020 • https://arxiv.org/pdf/1804.07133.pdf

Slide 23

Slide 23 text

How does it work? Diekmann and Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”

Slide 24

Slide 24 text

Insert/Delete token • Insert tokens until the input script becomes valid • Delete tokens until the input script becomes valid • Mixing both operations is also acceptable

Slide 25

Slide 25 text

How does an LR parser work? • (LA)LR converts production rules into a DFA (Deterministic Finite Automaton) • An (LA)LR parser is implemented as a PDA (Pushdown Automaton) • The stack manages the states of the DFA
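A minimal sketch of that pushdown automaton, assuming a toy grammar (`S -> "(" S ")" | "x"`) whose tables I built by hand; the grammar, table contents and names are illustrative assumptions, not taken from parse.y or from generated Bison output.

```python
from typing import Dict, List, Tuple

# Toy grammar (an assumption for illustration):
#   rule 1: S -> "(" S ")"
#   rule 2: S -> "x"
# rule number -> (left-hand side, length of right-hand side)
RULES: Dict[int, Tuple[str, int]] = {1: ("S", 3), 2: ("S", 1)}

# ACTION and GOTO encode the DFA built from the production rules.
ACTION: Dict[Tuple[int, str], Tuple[str, int]] = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", 0),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO: Dict[Tuple[int, str], int] = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens: List[str]) -> bool:
    """Return True if tokens form a sentence of the toy grammar."""
    stack = [0]                      # stack of DFA states
    rest = list(tokens) + ["$"]      # "$" marks the end of input
    pos = 0
    while True:
        action = ACTION.get((stack[-1], rest[pos]))
        if action is None:
            return False             # syntax error: no transition
        kind, arg = action
        if kind == "shift":
            stack.append(arg)        # push the next state, consume the token
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]      # pop one state per right-hand-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True              # accept
```

The state on top of the stack is all the parser needs in order to know which tokens are acceptable next, which is what the error recovery described in the following slides relies on.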

Slide 26

Slide 26 text

How does the approach work? [Figure: parser states; “•” marks the current position inside a rule] (2) k_if • expr_value … ; true • ; k_def • def_name … ; transitions on “true”, “def”, …

Slide 27

Slide 27 text

How does the approach work? [Figure: parser states; “•” marks the current position inside a rule] (3) k_if expr_value • then … ; transitions on “then”, “;”, “\n” ; (4) k_if expr_value then • …

Slide 28

Slide 28 text

How does the approach work? [Figure: parser states; “•” marks the current position inside a rule] (4) k_if expr_value then • compstmt if_tail k_end ; (7) k_if expr_value then compstmt if_tail k_end • ; compstmt and if_tail are optional

Slide 29

Slide 29 text

How does the approach work? Diekmann and Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”

Slide 30

Slide 30 text

Summary • There are several paths to recover from an error • Find the cheapest path to repair it • Only the path-finding logic needs to be implemented • No need to care about the details of the grammar
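The "cheapest path" idea can be sketched as a uniform-cost search over insert and delete operations. This is a deliberately naive illustration: it edits the whole token sequence and re-checks validity with a toy checker, whereas the algorithms in the cited paper search from the parser state at the error point. The toy grammar, the unit costs and every name below are my own assumptions.

```python
import heapq
from typing import List, Optional, Tuple

ALPHABET = ("(", ")", "x")            # terminals of the toy grammar
Repair = Tuple[str, int, str]         # (operation, position, token)


def is_valid(tokens: Tuple[str, ...]) -> bool:
    """Toy grammar: S -> "(" S ")" | "x", i.e. n "(" then "x" then n ")"."""
    n = len(tokens)
    if n % 2 == 0:
        return False
    half = n // 2
    return (tokens[:half] == ("(",) * half
            and tokens[half] == "x"
            and tokens[half + 1:] == (")",) * half)


def cheapest_repair(
    tokens: List[str], max_cost: int = 3
) -> Optional[Tuple[List[Repair], Tuple[str, ...]]]:
    """Find a minimum-cost list of insertions/deletions that makes tokens valid."""
    start = tuple(tokens)
    queue = [(0, start, ())]          # (cost so far, tokens, repairs so far)
    seen = {start}
    while queue:
        cost, current, repairs = heapq.heappop(queue)
        if is_valid(current):
            return list(repairs), current
        if cost == max_cost:
            continue                  # give up on this path
        candidates = []
        for i in range(len(current) + 1):
            for tok in ALPHABET:
                candidates.append(
                    (current[:i] + (tok,) + current[i:], ("insert", i, tok)))
        for i in range(len(current)):
            candidates.append(
                (current[:i] + current[i + 1:], ("delete", i, current[i])))
        for new_tokens, op in candidates:
            if new_tokens not in seen:
                seen.add(new_tokens)
                heapq.heappush(queue, (cost + 1, new_tokens, repairs + (op,)))
    return None
```

Nothing in `cheapest_repair` knows what the grammar looks like; it only asks whether a candidate is acceptable, which is the point of the last bullet above.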

Slide 31

Slide 31 text

Problems with extending Bison • Bison = Bison command + template file (“yacc.c”) • The template file is an implementation detail • The installed version of Bison depends on the environment • Extending the Bison template is not easy https://github.com/akimd/bison/blob/25b3d0e1a3f97a33615099e4b211f3953990c203/data/skeletons/yacc.c#L1640

Slide 32

Slide 32 text

Lrama LALR(1) parser generator • https://github.com/yui-knk/lrama • 100% Ruby implementation • Will be installed in the ruby/ruby tool directory • Input file is a Bison-format file (“parse.y”) • Output is an LALR parser written in C • Generates a 100% compatible C file for Ruby 3.0.5, 3.1.0, 3.2.0 • https://bugs.ruby-lang.org/issues/19637

Slide 33

Slide 33 text

Error Recovery by Bison • Does not work well for this case • This example is provided by @tompng

Slide 34

Slide 34 text

Error Recovery by Lrama • https://github.com/yui-knk/lrama/tree/error_recovery

Slide 35

Slide 35 text

Summary • Parser’s responsibility is increasing • PEG and hand-written parsers need to be aware of the details of the grammar • Bison’s panic mode loses part of the input • Token-based error recovery is flexible, with no need to know the details of the grammar • We can ride on DFA theory if we use an LR parser • Defeated the first boss !

Slide 36

Slide 36 text

Maintainability In this work we demonstrate that, contrary to the prevailing consensus, we can have the best of both worlds: for a very general, practical class of grammars -- a strict superset of Knuth’s canonical LR -- we can generate parsers automatically, and such that the resulting parser code, as well as the generation procedure itself, is highly efficient. “Practical LR Parser Generation”

Slide 37

Slide 37 text

Why is “parse.y” difficult? 1. “parse.y” is large (about 15,000 lines) 2. LALR is difficult, e.g. S/R (shift/reduce) conflicts, R/R (reduce/reduce) conflicts 3. Bison doesn’t provide syntactic sugar such as option and list 4. It’s a mixture of parser and Ripper 5. Parser and lexer are tightly coupled

Slide 38

Slide 38 text

Monstrous lexer state • enum lex_state_e state: Has 13 different types of state. • int paren_nest: Nest level of (, [, {. Used for parsing -> {}. • int lpar_beg: Stores paren_nest when parsing of a lambda starts. • int brace_nest: Nest level of {. Used for parsing “#{var}”. • stack_type cond_stack: Used for parsing conditions like “while ... do” • stack_type cmdarg_stack: Used for parsing command calls like “foo 1, 2 do”
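To make the two `stack_type` fields more concrete, here is a minimal model of a stack of flags packed into one integer. To my knowledge parse.y manipulates `cond_stack` and `cmdarg_stack` through macros such as COND_PUSH, COND_POP and COND_P in this spirit, but the Python class and its method names are a simplified assumption, not CRuby code.

```python
class BitStack:
    """A stack of booleans packed into one integer; the top is the lowest bit."""

    def __init__(self) -> None:
        self.bits = 0

    def push(self, flag: bool) -> None:
        # Shift the existing flags up and store the new one in the lowest bit.
        self.bits = (self.bits << 1) | int(flag)

    def pop(self) -> None:
        # Drop the lowest bit, restoring the previous top.
        self.bits >>= 1

    def top(self) -> bool:
        return bool(self.bits & 1)


# Usage sketch: entering the condition of `while` pushes True, an opening
# parenthesis pushes False (a new nesting level), and closing it pops.
cond_stack = BitStack()
cond_stack.push(True)    # while ...
cond_stack.push(False)   # (
inside_paren = cond_stack.top()
cond_stack.pop()         # )
back_in_condition = cond_stack.top()
```

The lexer consults the top flag when it meets “do”, which is how lexer state ends up deciding which token the parser sees.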

Slide 39

Slide 39 text

1. Resolve conflicts • Ruby uses 4 different “do” • lambda • condition (while) • command call • method call

Slide 40

Slide 40 text

1. Resolve conflicts • It’s neither a joke nor a metaphor: CRuby literally has 4 different “do” tokens

Slide 41

Slide 41 text

1. Resolve conflicts • “do” is a cause of shift/reduce conflicts • “do” never appears in the condition of while, until and so on

Slide 42

Slide 42 text

1. Resolve conflicts • Matz daily 2004/04/26 • https://matz.rubyist.net/20040426.html#p02 1. Write two full sets of rules, one with do and the other without do.

Slide 43

Slide 43 text

1. Resolve conflicts • Matz daily 2004/04/26 • https://matz.rubyist.net/20040426.html#p02 1. Write two full sets of rules, one with do and the other without do. 2. Hack the lexer so that it returns different tokens for the same “do” string depending on the context (= state) • CRuby selected the latter
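The second option, the lexer hack, can be sketched as below. The four token names are the ones I believe parse.y uses for “do”; the `LexerState` fields and the order of the checks are hypothetical simplifications for illustration, not CRuby's real lexer logic.

```python
from dataclasses import dataclass


@dataclass
class LexerState:
    """Hypothetical, heavily simplified view of the lexer's context."""
    in_lambda_head: bool = False    # just read `->` and its parameter list
    in_condition: bool = False      # inside the condition of while/until/for
    in_command_args: bool = False   # inside arguments of a command call like `foo 1, 2`


def token_for_do(state: LexerState) -> str:
    """Return a different token for the same "do" string depending on context."""
    if state.in_lambda_head:
        return "keyword_do_LAMBDA"
    if state.in_condition:
        return "keyword_do_cond"
    if state.in_command_args:
        return "keyword_do_block"
    return "keyword_do"
```

The grammar then never sees an ambiguous “do”, at the price of the lexer having to track what the parser is currently doing.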

Slide 44

Slide 44 text

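Matz daily • > yacc’s declarative grammar cannot express conditions. A grammar such as “do not apply this rule under this condition” is not possible. Well, considering the nature of LALR(1) this is natural in a sense, so calling it a weakness is not appropriate. More precisely, it is not a “weakness” but at most a “request” or a “wish”. • > yacc doesn’t support conditions for rules, we can not omit some rules when some conditions are met. • Matz daily 2004/04/26 • https://matz.rubyist.net/20040426.html#p02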

Slide 45

Slide 45 text

Nonterminal attributes • Correct, Bison doesn’t have such a feature. However, that does not mean it’s impossible! • Joe Zimmerman “Practical LR Parser Generation”, Sep 2022 • https://arxiv.org/pdf/2209.08383.pdf • “Nonterminal attributes”

Slide 46

Slide 46 text

Nonterminal attributes 1. Define attributes 2. “do” is allowed at the top level, but not allowed in a while condition 3. “k_do” is allowed when DO_ALLOWED is true
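One way to picture these three steps is to expand an attributed grammar into a plain grammar in which every nonterminal exists once per attribute value. The sketch below does this for a single boolean attribute; the toy rules, the `[DO]`/`[NO_DO]` naming and the data structures are my own assumptions and do not reflect Lrama's syntax or implementation.

```python
from typing import List, NamedTuple, Optional, Tuple


class Symbol(NamedTuple):
    name: str
    set_attr: Optional[bool] = None   # None: inherit DO_ALLOWED from the parent


class Rule(NamedTuple):
    lhs: str
    rhs: Tuple[Symbol, ...]
    requires_do_allowed: bool = False  # rule usable only when DO_ALLOWED is true


# Toy grammar (hypothetical): the while condition switches DO_ALLOWED off,
# the loop body switches it back on.
RULES = [
    Rule("program", (Symbol("stmt"),)),
    Rule("stmt", (Symbol("k_while"), Symbol("expr", False), Symbol("k_do"),
                  Symbol("stmt", True), Symbol("k_end"))),
    Rule("stmt", (Symbol("expr"),)),
    Rule("expr", (Symbol("ident"),)),
    Rule("expr", (Symbol("ident"), Symbol("k_do"), Symbol("stmt"), Symbol("k_end")),
         requires_do_allowed=True),
]


def specialize(rules: List[Rule], start: str) -> List[Tuple[str, Tuple[str, ...]]]:
    """Expand attributed rules into plain rules, one copy per attribute value."""
    nonterminals = {r.lhs for r in rules}

    def name(sym: str, allowed: bool) -> str:
        return f"{sym}[{'DO' if allowed else 'NO_DO'}]"

    result = []
    pending = [(start, True)]          # DO_ALLOWED is true at the top level
    done = set()
    while pending:
        item = pending.pop()
        if item in done:
            continue
        done.add(item)
        lhs, allowed = item
        for rule in rules:
            if rule.lhs != lhs or (rule.requires_do_allowed and not allowed):
                continue               # rule filtered out by the attribute
            rhs = []
            for sym in rule.rhs:
                if sym.name in nonterminals:
                    child = allowed if sym.set_attr is None else sym.set_attr
                    rhs.append(name(sym.name, child))
                    pending.append((sym.name, child))
                else:
                    rhs.append(sym.name)
            result.append((name(lhs, allowed), tuple(rhs)))
    return result


PLAIN_RULES = specialize(RULES, "program")
```

In the expanded grammar `expr[NO_DO]` has no rule containing `k_do`, so a “do” that follows a while condition can only belong to the `while` itself.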

Slide 47

Slide 47 text

Nonterminal attributes https://github.com/yui-knk/lrama/blob/e4a708d2f080f8e9ca8b082ac038fd6658d31077

Slide 48

Slide 48 text

Nonterminal attributes • It works well for “while” https://github.com/yui-knk/lrama/blob/e4a708d2f080f8e9ca8b082ac038fd6658d31077/sample/nonterminal_attributes_3_2_0.output

Slide 49

Slide 49 text

Conflicts • E.g. a conflict on endless method definition • The lexer hack introduced ambiguities

Slide 50

Slide 50 text

GAME OVER CONTINUE? 
 YES 
 NO

Slide 51

Slide 51 text

GAME OVER CONTINUE? 
 YES 
 NO

Slide 52

Slide 52 text

Rethink “do” • There are 4 “do”

Slide 53

Slide 53 text

Rethink “do” • They have precedences • I don’t recommend writing such code

Slide 54

Slide 54 text

Rethink “do” • (, [, { reset the “context” • Need to care about “context”

Slide 55

Slide 55 text

Already have hints (1) • Precedence is already solved by “Operator Precedence” • https://www.gnu.org/software/bison/manual/html_node/Precedence.html

Slide 56

Slide 56 text

Already have hints (2) • Nonterminal attributes carry “context”

Slide 57

Slide 57 text

Nonterminal attributes for conflict resolution 1. Define attributes 2. “do” in f_larglist has lower precedence than “do” in lambda_body -> “do” is reduced 3. “do” in () is shifted 4. “do” at the top level is shifted

Slide 58

Slide 58 text

Nonterminal attributes for conflict resolution https://github.com/yui-knk/lrama/tree/nonterminal_attributes No conflict !!

Slide 59

Slide 59 text

What happens behind the scenes • Generate two states for one state

Slide 60

Slide 60 text

Summary • The difficulty of “parse.y” comes from the tight coupling between parser and lexer • Nonterminal attributes solve part of the problem • Nonterminal attributes for precedence solve the “do” overload • We have not yet leveraged the potential of LR parsers • Defeated the second boss !!

Slide 61

Slide 61 text

Universal Parser We can solve any problem by introducing an extra level of indirection.

Slide 62

Slide 62 text

Why is a Universal Parser needed? • Everyone wants to use the CRuby parser • mruby, PicoRuby: Other Ruby implementations in C • JRuby, TruffleRuby, ruruby: Other Ruby implementations in non-C languages • sorbet, typeprof: Tools • Implementing a 100% compatible Ruby parser is a bit difficult • Managing a parser for each version is difficult

Slide 63

Slide 63 text

Why isn’t it a Universal Parser? [Figure: the lexer & parser inside Ruby]

Slide 64

Slide 64 text

Why isn’t it a Universal Parser? • The CRuby parser depends on other CRuby functionality !!! [Figure: the lexer & parser inside Ruby, depending on GC, RString, RArray, RHash, …, rb_mRubyVMFrozenCore, struct rb_iseq_struct *]

Slide 65

Slide 65 text

The road to Universal Parser 1. Passing required functions as function pointers 2. Linking functions into a parser shared library
 1. parse.o: Generated from “parse.y”
 2. node2.o: Separate AST/Node code from “node.c”
 3. st2.o: Copy “st.c” and remove unnecessary code
• https://github.com/yui-knk/ruby/tree/universal-parser
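Step 1 is a form of dependency injection: the parser never calls the host runtime directly, it only calls functions it was handed. The sketch below models the C function pointers as Python callables; `ParserConfig`, its three fields and the toy parser are hypothetical and are not the interface of the universal-parser branch.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ParserConfig:
    """Functions the host runtime hands to the parser (hypothetical names)."""
    new_string: Callable[[str], Any]
    new_array: Callable[[List[Any]], Any]
    report_error: Callable[[str], None]


def parse_word_list(source: str, config: ParserConfig) -> Any:
    """Toy parser for comma separated words, e.g. "a, b, c"."""
    items = []
    for word in source.split(","):
        word = word.strip()
        if not word:
            config.report_error("empty element")
            continue
        items.append(config.new_string(word))
    return config.new_array(items)


# Host A builds ordinary Python objects.
errors_a: List[str] = []
host_a = ParserConfig(new_string=str, new_array=list, report_error=errors_a.append)

# Host B builds its own tagged representation from the same parser code.
errors_b: List[str] = []
host_b = ParserConfig(
    new_string=lambda s: ("str", s),
    new_array=lambda xs: ("ary", tuple(xs)),
    report_error=errors_b.append,
)

result_a = parse_word_list("a, b,,c", host_a)
result_b = parse_word_list("a, b", host_b)
```

The same parser code serves both hosts; the cost is that every function the parser needs becomes part of the interface, which is where the count on the following slides comes from.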

Slide 66

Slide 66 text

Done!!! • https://github.com/yui-knk/ruby/tree/universal-parser • https://github.com/yui-knk/my-ruby-parser

Slide 67

Slide 67 text

However… • 209 functions

Slide 68

Slide 68 text

Sort out the interface • Memory management • malloc, realloc, free … • They should be in the interface • imemo • tmpbuf_auto_free_pointer, tmpbuf_set_ptr • CRuby internal, let’s remove the dependency

Slide 69

Slide 69 text

Sort out the interface • Literal Object • Do not create object, but keep it as “string” instead.
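A minimal sketch of that idea, assuming an integer literal: the parser validates the token and stores only its source text, and a later stage (standing in for “compile.c”) creates the runtime object. The node class and both function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegerLiteralNode:
    """AST node holding the literal's source text, not a runtime object."""
    text: str


def parse_integer_literal(source: str) -> IntegerLiteralNode:
    """Parser side: check the syntax, but do not create an integer object."""
    if not re.fullmatch(r"[0-9]+(_[0-9]+)*", source):
        raise ValueError(f"invalid integer literal: {source!r}")
    return IntegerLiteralNode(source)


def compile_literal(node: IntegerLiteralNode) -> int:
    """Compiler side: the host runtime builds its own object from the text."""
    return int(node.text.replace("_", ""))
```

With this split the parser no longer needs the host's object constructors for literals, so those functions can drop out of the interface.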

Slide 70

Slide 70 text

Sort out the interface • Parser manipulates objects • Parser needs to know the structure of objects • Need to pass functions

Slide 71

Slide 71 text

Sort out the interface • Add “negative” flag • Add NODE_NEG

Slide 72

Slide 72 text

Sort out the interface • AST transformation • Move it to “compile.c”

Slide 73

Slide 73 text

Summary • Universal Parser is required for tools and other Ruby implementations • 209 functions is the starting line • A lot of subtasks remain to make the interface user-friendly • Defeated the third boss !!!

Slide 74

Slide 74 text

Conclusion The future is not laid out on a track. It is something that we can decide, and to the extent that we do not violate any known laws of the universe, we can probably make it work the way that we want to. Alan Curtis Kay

Slide 75

Slide 75 text

Conclusion • An LR parser can solve 3 major problems: Usability, Maintainability and Universal Parser • We can ride on DFA theory when we use an LR parser • We have not yet leveraged the potential of LR parsers • The Lrama LALR parser generator is an infrastructure for the Ruby parser

Slide 76

Slide 76 text

Dragon Book shows… • Usability • Maintainability • Universal Parser

Slide 77

Slide 77 text

Dragon Book shows… • Usability • Maintainability • Universal Parser • LALR Parser Generation

Slide 78

Slide 78 text

History of parser generator • 2006: GCC migrates its parser from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers

Slide 79

Slide 79 text

History of parser generator • 2006: GCC migrates its parser from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers” • 2022: “Practical LR Parser Generation”

Slide 80

Slide 80 text

History of parser generator • 2006: GCC migrates its parser from Bison to hand-written recursive-descent parsers (C++ was 2004) • 2015: Go migrates its parser from Bison to hand-written recursive-descent parsers • 2020: “Don't Panic! Better, Fewer, Syntax Errors for LR Parsers” • 2022: “Practical LR Parser Generation” • 2023: “The future vision of Ruby Parser” in RubyKaigi 2023

Slide 81

Slide 81 text

New era of LR parser

Slide 82

Slide 82 text

The future vision of Parser • The LR parser is the basic building block • Extending the grammar DSL to leverage the LR parser • Moving the lexer’s logic into parser grammar rules • Multiple parser algorithms for multiple purposes • Users can focus on writing the grammar

Slide 83

Slide 83 text

Next Steps • Migrate CRuby parser generator from GNU Bison to Lrama • Install Lrama into CRuby • Usability (Error-tolerant parser) • Integrate Lrama error-tolerant functions with CRuby • Maintainability • Use Nonterminal attributes precedence • Universal Parser • Sort out the interface then merge the PR

Slide 84

Slide 84 text

Need your help ! • For experts in LR parsing • Any feedback is welcome • For developers who will use Universal Parser and AST • Share your use cases with me • For developers who are interested in implementing Universal Parser • Let me know

Slide 85

Slide 85 text

Acknowledgements • @mame, @ko1 and other committers • @nurse, I could not have defeated all 3 bosses without your support

Slide 86

Slide 86 text

References • Jeffrey Kegler. “Parsing: a timeline”, Sep 2014. http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/09/chron.html • Lukas Diekmann and Laurence Tratt. “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers”, July 2020. https://arxiv.org/pdf/1804.07133.pdf • Joe Zimmerman. “Practical LR Parser Generation”, Sep 2022. https://arxiv.org/pdf/2209.08383.pdf • Atsushi Ohori (大堀 淳). “LR構文解析の原理” [Principles of LR Parsing], Feb 2014. https://www.jstage.jst.go.jp/article/jssst/31/1/31_1_30/_pdf/-char/ja • Matz. “[OSS]yaccの弱点(その2)” [Weaknesses of yacc (part 2)], Matz daily 2004/04/26. https://matz.rubyist.net/20040426.html#p02 • Eugene Wallingford. “ALAN KAY'S TALKS AT OOPSLA”, Knowing and Doing 2004/11/06. http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2004-11.html#e2004-11-06T21_03_42.htm

Slide 87

Slide 87 text

References • yui-knk “Ruby Parser開発日誌 (1)” [Ruby Parser Development Journal (1)], かねこにっき 2022/12/11. https://yui-knk.hatenablog.com/entry/2022/12/11/154502 • yui-knk “Ruby Parser開発日誌 (2)” [Ruby Parser Development Journal (2)], かねこにっき 2023/01/08. https://yui-knk.hatenablog.com/entry/2023/01/08/190105 • yui-knk “Ruby Parser開発日誌 (3)” [Ruby Parser Development Journal (3)], かねこにっき 2023/01/11. https://yui-knk.hatenablog.com/entry/2023/01/11/220223 • yui-knk “Ruby Parser開発日誌 (4)” [Ruby Parser Development Journal (4)], かねこにっき 2023/01/14. https://yui-knk.hatenablog.com/entry/2023/01/14/144131

Slide 88

Slide 88 text

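References • yui-knk “Ruby Parser開発日誌 (5) - Lrama LALR (1) parser generatorを実装した” [Ruby Parser Development Journal (5) - Implemented the Lrama LALR(1) parser generator], かねこにっき 2023/03/13. https://yui-knk.hatenablog.com/entry/2023/03/13/101951 • yui-knk “Ruby Parser開発日誌 (6) - parse.yのMaintainabilityの話” [Ruby Parser Development Journal (6) - On the maintainability of parse.y], かねこにっき 2023/04/04. https://yui-knk.hatenablog.com/entry/2023/04/04/190413 • yui-knk “Ruby Parser開発日誌 (7) - doについて考える” [Ruby Parser Development Journal (7) - Thinking about do], かねこにっき 2023/04/09. https://yui-knk.hatenablog.com/entry/2023/04/09/123723 • yui-knk “Ruby Parser開発日誌 (8) - Universal Parserへの道” [Ruby Parser Development Journal (8) - The road to Universal Parser], かねこにっき 2023/05/01. https://yui-knk.hatenablog.com/entry/2023/05/01/174828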

Slide 89

Slide 89 text

Thank you !!!