RubyConf Taiwan / Understanding Parser Generators surrounding Ruby with Contributing Lrama

Slide 1

Slide 1 text

Understanding Parser Generators surrounding Ruby with Contributing Lrama
Junichi Kobayashi (@junk0612), ESM, Inc.
RubyConf Taiwan 2023, National Taipei University of Education
2023/12/15 (Fri.)

Slide 2

Slide 2 text

Junichi Kobayashi (@junk0612)

Slide 3

Slide 3 text

Junichi Kobayashi (@junk0612)
● Working as a Rails engineer at ESM, Inc.
  ○ Agile division and RubyxAgile group
  ○ A member of Parser Club
● Hobbies
  ○ Parsers
  ○ Games (rhythm games / sim games / tabletop games)
  ○ Speed cubes

Slide 4

Slide 4 text

ESM, Inc.
● An IT development company in Japan
● The sponsor of this presentation
● "ESM" stands for "Eiwa System Management"
  ○ Japanese: "永和システムマネジメント"

Slide 5

Slide 5 text

永和

Slide 6

Slide 6 text

永和區

Slide 7

Slide 7 text

永和豆漿大王

Slide 8

Slide 8 text

I'm from 永和

Slide 9

Slide 9 text

I'm from 永和 I visited 永和

Slide 10

Slide 10 text

I'm from 永和 I visited 永和 I ate at 永和

Slide 11

Slide 11 text

Today's topics
● Basic Knowledge of Parsing
● My Contributions to Lrama
● Understanding Internal Structure of Lrama through Implementation
● Future Endeavors

Slide 12

Slide 12 text

Basic Knowledge of Parsing

Slide 13

Slide 13 text

Basic Knowledge of Parsing
● Components of Parsing
  ○ Lexer
  ○ Parser
  ○ Parser Generator
● Terms of Programming Language Processor
  ○ Formal Language
  ○ Context Free Grammar
  ○ Backus-Naur Form

Slide 14

Slide 14 text

Components of Parsing
● Lexer
  ○ Program that splits text into tokens (Tokenization)
● Parser
  ○ Program that constructs a structure from a token stream
    ■ Compilers: source code -> Abstract Syntax Tree (AST)
    ■ JSON or CSV parsers: text -> some data structure
● Parser Generator
  ○ Program that generates a parser from grammar files
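
The lexer described above can be sketched in a few lines of Ruby. This is only an illustration for the arithmetic examples used later in the talk, not code from CRuby or Lrama; the method name tokenize and the [kind, text] token shape are assumptions made for this sketch.

```ruby
require "strscan"

# Hypothetical lexer: splits an arithmetic expression into [kind, text] tokens.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace separates tokens but is not a token itself
    elsif (text = scanner.scan(/\d/))
      tokens << [:digit, text]
    elsif (text = scanner.scan(%r{[-+*/()]}))
      tokens << [:symbol, text]
    else
      raise ArgumentError, "unexpected character at offset #{scanner.pos}"
    end
  end
  tokens
end

p tokenize("( 2 + 7 ) / 3")
```

The resulting token stream is what a parser would consume next to build a structure such as an AST.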

Slide 15

Slide 15 text

Components of Parsing
[Diagram] .rb file → CRuby environment → Ruby VM

Slide 16

Slide 16 text

Components of Parsing
[Diagram, inside the CRuby environment] .rb file → Lexer → Token Stream → Parser → AST → Generator → Byte codes → Ruby VM

Slide 17

Slide 17 text

Components of Parsing
[Diagram, inside the CRuby environment] .rb file → Lexer → Token Stream → Parser → AST → Generator → Byte codes → Ruby VM
[Diagram, added] grammar → Parser Generator → Parser

Slide 18

Slide 18 text

Components of Parsing
[Diagram, inside the Parser Generator used for CRuby] grammar → Lexer → Token Stream → Parser → AST → Generator → Parser file

Slide 19

Slide 19 text

Terms of Programming Language Processor

Slide 20

Slide 20 text

Terms of PL Processor
● Formal Language
  ○ The field of linguistics that deals with "language" in a mathematical and set-theoretical way
  ○ Considers how a language is represented as text
    ■ Does not consider the semantics
    ■ e.g., English is represented as sequences of letters, interspersed with symbols and spaces
  ○ Composed of Symbols and Grammar

Slide 21

Slide 21 text

Terms of PL Processor
● Context Free Grammar (CFG)
  ○ A kind of formal grammar whose rules are written as follows:
    ■ rule: A B C ... | D E F ...
    ■ This notation is called Backus-Naur Form (BNF)
  ○ Almost all programming languages belong to this category
  ○ Used in the grammar file, which is the input of a Parser Generator

Slide 22

Slide 22 text

Terms of PL Processor
● Context Free Grammar (CFG)
  ○ Nonterminal Symbol
    ■ A symbol that can be replaced by other symbols
  ○ Terminal Symbol
    ■ A symbol that appears in input text

Slide 23

Slide 23 text

CFG and BNF
number: digit digit
digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Slide 24

Slide 24 text

CFG and BNF
number: digit digit
digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Language for 2-digit integers (00 to 99)
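
To see why this grammar describes exactly the strings 00 to 99, the rules can be expanded mechanically: every nonterminal is replaced by each of its alternatives until only terminals remain. The Ruby sketch below does this; GRAMMAR and expand are names invented for this illustration, and the approach only terminates for grammars without recursion, such as this one.

```ruby
# The grammar from the slide as data: symbols are nonterminals, strings are terminals.
GRAMMAR = {
  number: [[:digit, :digit]],
  digit: ("0".."9").map { |character| [character] }
}.freeze

# Returns every string that can be derived from the given symbol.
def expand(symbol)
  return [symbol] if symbol.is_a?(String) # a terminal stands for itself

  GRAMMAR.fetch(symbol).flat_map do |alternative|
    alternative
      .map { |part| expand(part) }
      .reduce([""]) { |prefixes, suffixes| prefixes.product(suffixes).map(&:join) }
  end
end

sentences = expand(:number)
puts sentences.size                # prints 100
puts sentences.first(3).join(" ")  # prints 00 01 02
```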

Slide 25

Slide 25 text

CFG and BNF
expression: digit '+' digit
          | digit '-' digit
          | digit '*' digit
          | digit '/' digit
digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Slide 26

Slide 26 text

CFG and BNF
expression: digit '+' digit
          | digit '-' digit
          | digit '*' digit
          | digit '/' digit
digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Language for 2-term arithmetic expressions with 1-digit integers

Slide 27

Slide 27 text

CFG and BNF
expression: digit
          | expression '+' digit
          | expression '-' digit
          | expression '*' digit
          | expression '/' digit
          | '(' expression ')'
digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Operation of Parser
expression: digit
          | expression '+' digit
          | expression '-' digit
          | expression '*' digit
          | expression '/' digit
          | '(' expression ')'
digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
e.g., ( 2 + 7 ) / 3
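
As a rough illustration of what a parser for this grammar produces, here is a small hand-written Ruby parser. Note the assumptions: it uses recursive descent with a loop in place of the left-recursive rules, which is not the LR algorithm that Bison or Lrama generate, and the method names and the nested-array AST shape are invented for this sketch.

```ruby
# expression: digit | expression OP digit | '(' expression ')'
# The left recursion is rewritten as "head, then any number of (OP digit)".
def parse_expression(tokens)
  left = parse_head(tokens)
  while %w[+ - * /].include?(tokens.first)
    operator = tokens.shift
    right = parse_digit(tokens) # the grammar only allows a digit on the right
    left = [operator.to_sym, left, right]
  end
  left
end

# The leftmost part of an expression: a digit or a parenthesized expression.
def parse_head(tokens)
  return parse_digit(tokens) unless tokens.first == "("

  tokens.shift
  inner = parse_expression(tokens)
  raise ArgumentError, "expected ')'" unless tokens.shift == ")"
  inner
end

def parse_digit(tokens)
  token = tokens.shift
  raise ArgumentError, "expected a digit, got #{token.inspect}" unless token&.match?(/\A\d\z/)
  Integer(token)
end

def parse(source)
  tokens = source.delete(" ").chars # every token here is a single character
  ast = parse_expression(tokens)
  raise ArgumentError, "unexpected #{tokens.first.inspect}" unless tokens.empty?
  ast
end

p parse("( 2 + 7 ) / 3") # prints [:/, [:+, 2, 7], 3]
```

The nested arrays play the role of the AST: the division node has the parenthesized addition as its left child and 3 as its right child.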

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Slide 46

Slide 46 text

Summary
● The Lexer and Parser analyze the program and create its data structures when source code is executed or compiled
● A program that generates a parser from a grammar file is called a Parser Generator
● The input grammar file of a Parser Generator is written as a CFG
● BNF is one of the notations for CFGs

Slide 47

Slide 47 text

My Contributions to Lrama

Slide 48

Slide 48 text

Contribution Details
● Implement "Named References" in Lrama

Slide 49

Slide 49 text

Contribution Details
● Implement "Named References" in Lrama 🤔

Slide 50

Slide 50 text

Lrama
● A Parser Generator built with Ruby as a replacement for Bison
  ○ https://github.com/ruby/lrama
● Presented at RubyKaigi 2023 by Yuichiro Kaneko
  ○ https://youtu.be/IhfDsLx784g?si=kO1q6mLpTa1bIRYL
● Used in the CRuby 3.3 build process
  ○ You can try it now by building HEAD of Ruby
  ○ Ruby's behavior is NOT changed

Slide 51

Slide 51 text

Benefits of Using Lrama
● No longer dependent on Bison's version
  ○ Since Bison versions vary among users, it's necessary to assume that older versions may be installed
  ○ As a result, new Bison features could not be used even when they were introduced
● Allows for the implementation of Ruby-specific features
  ○ Parsing unfinished code for LSP
  ○ Making the complex parse.y more readable

Slide 52

Slide 52 text

Ruby-Specific Features
● Parameterizing Rules
  ○ https://github.com/ruby/lrama/pull/181
  ○ Rules for specific patterns can be created by attaching a suffix to a symbol
    ■ sym*: represents a list of 0 or more syms
    ■ sym+: represents a list of 1 or more syms
    ■ sym?: represents an optional sym (it appears or not)
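
One way to read these suffixes is as shorthand for ordinary BNF rules that would otherwise be written by hand. The Ruby sketch below spells out such an expansion. It is an illustration only: expand_suffix and the generated rule names (option_, nonempty_list_, list_) are assumptions of this sketch and do not describe how Lrama implements the feature. %empty is Bison's notation for an empty right-hand side.

```ruby
# Hypothetical expansion of a suffixed symbol into plain rules.
# Returns the name of the generated nonterminal and its rules as strings.
def expand_suffix(symbol, suffix)
  case suffix
  when "?"
    name = "option_#{symbol}"
    [name, ["#{name}: %empty", "#{name}: #{symbol}"]]
  when "+"
    name = "nonempty_list_#{symbol}"
    [name, ["#{name}: #{symbol}", "#{name}: #{name} #{symbol}"]]
  when "*"
    name = "list_#{symbol}"
    [name, ["#{name}: %empty", "#{name}: #{name} #{symbol}"]]
  else
    raise ArgumentError, "unknown suffix: #{suffix.inspect}"
  end
end

name, rules = expand_suffix("arg", "*")
puts name
puts rules
```

Writing arg* in a rule then stands for a reference to the generated nonterminal, so the grammar author no longer needs to write the list rules manually.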

Slide 53

Slide 53 text

Other parser tools surrounding Ruby
● Prism
  ○ Hand-written parser for CRuby
  ○ Built as a replacement for the CRuby parser generated by Bison
  ○ Compatible with both JRuby and TruffleRuby

Slide 54

Slide 54 text

Other parser tools surrounding Ruby
● Bison
  ○ A next-generation parser generator developed by GNU, following Yacc
  ○ Used for generating the CRuby parser from parse.y
● Racc
  ○ A parser generator developed by Minero Aoki
  ○ Used in the Parser gem (a RuboCop dependency) and others

Slide 55

Slide 55 text

Can Bison be replaced with Racc?
● It's impractical: although the generation algorithms are the same, few parts can be shared, and creating a new tool costs less
  ○ Input file grammar
    ■ Bison: Yacc-like / Racc: Original
  ○ Generated parser's language
    ■ Bison: C / Racc: Ruby

Slide 56

Slide 56 text

Named References
● A feature of Bison
● Symbol names can be used as References in Action

Slide 57

Slide 57 text

Named References 🤔
● A feature of Bison
● Nonterminal symbol can be used as References in Action

Slide 58

Slide 58 text

Structure of Bison Grammar File
%{
Prologue (~ 1500 Lines)
%}
Bison declarations (~ 200 Lines)
%%
Grammar rules (~ 4500 Lines)
%%
Epilogue (~ 8300 Lines)
Line counts in () are those of CRuby's parse.y

Slide 59

Slide 59 text

Structure of Bison Grammar File
%{
Prologue (~ 1500 Lines)
%}
Bison declarations (~ 200 Lines)
%%
Grammar rules (~ 4500 Lines) ← Today's topic
%%
Epilogue (~ 8300 Lines)

Slide 60

Slide 60 text

Structure of Grammar rules
rule_name: rule rule ... rule { action }
         | rule rule ... rule { action }
expression: NUMBER '+' expression { $$ = $1 + $3 }
          | NUMBER '-' expression { $$ = $1 - $3 }
          | '(' expression ')' { $$ = $2 }

Slide 61

Slide 61 text

Structure of Grammar rules
rule_name: rule rule ... rule { action }
         | rule rule ... rule { action }
expression: NUMBER '+' expression { $$ = $1 + $3 } ←
          | NUMBER '-' expression { $$ = $1 - $3 } ← Today's topic
          | '(' expression ')' { $$ = $2 } ←

Slide 62

Slide 62 text

Action
● A parser generated by Bison, if left unmodified, only tells you whether the input adheres to the grammar or not
  ○ It does not create an AST, nor does it save any information necessary for subsequent processing
● You can write programs in {} following each grammar rule
● In an action, $n (semantic value) or @n (location) refers to the n-th symbol of the rule
  ○ This feature is known as (Numbered) References

Slide 63

Slide 63 text

An Example of Actions
expression: NUMBER '+' expression { $$ = $1 + $3 }
The return value when this rule is matched is the sum of the 1st and 3rd elements

Slide 64

Slide 64 text

An Example of Actions
expression: NUMBER '+' expression { $$ = $1 + $3 }
The return value when this rule is matched is the sum of the 1st and 3rd elements

Slide 65

Slide 65 text

An Example of Actions
expression: NUMBER '+' expression { $$ = $1 + $3 }
The return value when this rule is matched is the sum of the 1st and 3rd elements

Slide 66

Slide 66 text

An Example of Actions
expression: NUMBER '+' expression { $$ = $1 + $3 }
The return value when this rule is matched is the sum of the 1st and 3rd elements
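
What $$, $1, and $3 mean at run time can be modeled with a stack of values. The Ruby sketch below is a simplified model, not the C code that Bison or Lrama emit; Rule, ADD, and reduce are names invented for this illustration.

```ruby
# A rule is modeled by the number of symbols on its right-hand side and its action.
Rule = Struct.new(:rhs_size, :action)

# Models: expression: NUMBER '+' expression { $$ = $1 + $3 }
ADD = Rule.new(3, ->(values) { values[1] + values[3] })

# When a rule matches, the values of its symbols are on top of the stack.
def reduce(value_stack, rule)
  values = value_stack.pop(rule.rhs_size)       # values of $1..$n, in order
  numbered = [nil, *values]                     # shift so that numbered[1] is $1
  value_stack.push(rule.action.call(numbered))  # the result plays the role of $$
end

stack = [2, "+", 7]
reduce(stack, ADD)
p stack # prints [9]
```

The three values that belonged to NUMBER, '+', and expression are replaced by a single value for the whole expression, which is what assigning to $$ expresses.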

Slide 67

Slide 67 text

Named References
● Issues in Numbered References
  ○ Hard to understand due to lack of declarativeness
  ○ If the grammar changes, the action must be rewritten because references are specified by position number
● Named References were developed to resolve these issues, enabling the use of values through referencing nonterminal symbol names

Slide 68

Slide 68 text

Named References
expression: NUMBER '+' expression { $$ = $1 + $3 }
          | NUMBER '-' expression { $$ = $1 - $3 }
          | '(' expression ')' { $$ = $2 }
expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest }
                  | NUMBER '-' expression[rest] { $result = $NUMBER - $rest }
                  | '(' expression[inside-exp] ')' { $result = $[inside-exp] }

Slide 69

Slide 69 text

Named References
expression: NUMBER '+' expression { $$ = $1 + $3 }
          | NUMBER '-' expression { $$ = $1 - $3 }
          | '(' expression ')' { $$ = $2 }
expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest }
                  | NUMBER '-' expression[rest] { $result = $NUMBER - $rest }
                  | '(' expression[inside-exp] ')' { $result = $[inside-exp] }
Values can be referenced by rule names prefixed with $

Slide 70

Slide 70 text

Named References
expression: NUMBER '+' expression { $$ = $1 + $3 }
          | NUMBER '-' expression { $$ = $1 - $3 }
          | '(' expression ')' { $$ = $2 }
expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest }
                  | NUMBER '-' expression[rest] { $result = $NUMBER - $rest }
                  | '(' expression[inside-exp] ')' { $result = $[inside-exp] }
Enclosing a name in [] in rule descriptions assigns an alias

Slide 71

Slide 71 text

Named References
expression: NUMBER '+' expression { $$ = $1 + $3 }
          | NUMBER '-' expression { $$ = $1 - $3 }
          | '(' expression ')' { $$ = $2 }
expression[result]: NUMBER '+' expression[rest] { $result = $NUMBER + $rest }
                  | NUMBER '-' expression[rest] { $result = $NUMBER - $rest }
                  | '(' expression[inside-exp] ')' { $result = $[inside-exp] }
If rule names or aliases contain symbol characters (such as -), enclose them in [] where they are referenced
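
The relationship between the two notations can be made concrete by translating named references back into numbered ones. The Ruby sketch below does that for a single rule. It is an illustration, not Lrama's implementation: Sym and resolve_named_references are invented names, and the sketch assumes that a symbol with an alias is addressable only through that alias.

```ruby
# A grammar symbol with an optional alias, e.g. expression[rest].
Sym = Struct.new(:name, :alias_name) do
  def reference_name
    alias_name || name
  end
end

# Rewrites $name and $[name] in an action into $$ or $n.
def resolve_named_references(lhs, rhs, action)
  action.gsub(/\$(?:\[([^\]]+)\]|([A-Za-z_]\w*))/) do
    name = Regexp.last_match(1) || Regexp.last_match(2)
    candidates = []
    candidates << "$$" if lhs.reference_name == name
    rhs.each_with_index do |sym, index|
      candidates << "$" + (index + 1).to_s if sym.reference_name == name
    end
    raise ArgumentError, "unknown reference: #{name}" if candidates.empty?
    raise ArgumentError, "ambiguous reference: #{name}" if candidates.size > 1
    candidates.first
  end
end

lhs = Sym.new("expression", "result")
rhs = [Sym.new("NUMBER"), Sym.new("'+'"), Sym.new("expression", "rest")]
puts resolve_named_references(lhs, rhs, "$result = $NUMBER + $rest") # prints $$ = $1 + $3
```

Without the aliases, $expression would match both the left-hand side and the third symbol, which is why the sketch reports such a reference as ambiguous.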

Slide 72

Slide 72 text

Slide 73

Slide 73 text

Summary
● Implement Named References in Lrama
  ○ Lrama is a parser generator built with Ruby as a replacement for Bison
  ○ Named References is a feature of Bison, allowing the use of symbol names as References within Actions

Slide 74

Slide 74 text

Understanding Internal Structure of Lrama through Implementation

Slide 75

Slide 75 text

Structure of Parser Generator
[Diagram, inside the Parser Generator] grammar → Lexer → Token Stream → Parser → AST → Generator → Parser file

Slide 76

Slide 76 text

Structure of Lrama
[Diagram, inside Lrama] parse.y → Lexer → Token Stream → Parser → AST → Output → parse.c

Slide 77

Slide 77 text

Implementation Approach
● Numbered References were already implemented
  ○ If the symbol names can be associated with calls within Actions, the part for generating code from the association already exists
  ○ Decided to associate symbol names with calls, taking inspiration from the implementation of Numbered References

Slide 78

Slide 78 text

Write Test First

Slide 79

Slide 79 text

Which file should I edit?
[Diagram, inside Lrama] parse.y → Lexer → Token Stream → Parser → AST → Output → parse.c

Slide 80

Slide 80 text

Which file should I edit?
[Diagram, inside Lrama] parse.y → Lexer → Token Stream → Parser → AST → Output → parse.c

Slide 81

Slide 81 text

Which file should I edit?
[Diagram, inside Lrama] parse.y → Lexer → Token Stream → Parser → AST → Output → parse.c

Slide 82

Slide 82 text

Which file should I edit?
[Diagram, inside Lrama] parse.y → Lexer → Token Stream → Parser → AST → Output → parse.c

Slide 83

Slide 83 text

Reason for Editing the Lexer
● The lexer knows the parsing context
  ○ Information about what is currently being parsed
● When loading Actions, it checks the location of the rule just read to determine which symbols are being referenced, completing the association process
● Transferring this entire process to the parser is extremely challenging and does not align with the original design

Slide 84

Slide 84 text

Who knows parsing context
● Parsing context is required when tokenizing
  ○ Ignoring comments
  ○ Not raising ParseError when a newline appears inside a HereDoc
  ○ etc.
● Who should know it? (parser or lexer)

Slide 85

Slide 85 text

Who knows parsing context
● If the lexer knows it
  ○ Possible to tokenize all the input in one go by changing its own state, allowing the process to be divided into phases
● If the parser knows it
  ○ Since the parser will receive the next token from the lexer based on its state, the lexer can focus on tokenization
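
The first option, a lexer that tracks context by changing its own state, can be sketched as follows. This Ruby example tokenizes one grammar rule and switches into an "action" mode when it meets {. It is a simplified illustration and not Lrama's lexer: tokenize_rule and scan_action are invented names, and braces inside strings or comments are not handled.

```ruby
require "strscan"

# Hypothetical context-aware lexer for a single grammar rule.
def tokenize_rule(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next
    elsif scanner.check(/\{/)
      # Context switch: everything up to the matching } is one action token.
      tokens << [:action, scan_action(scanner)]
    elsif (text = scanner.scan(/[A-Za-z_]\w*/))
      tokens << [:identifier, text]
    elsif (text = scanner.scan(/'[^']'/))
      tokens << [:char, text]
    elsif (text = scanner.scan(/[:|]/))
      tokens << [:punctuation, text]
    else
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
  end
  tokens
end

# Action mode: the lexer itself counts brace depth, so nested { } stay in one token.
def scan_action(scanner)
  depth = 0
  text = +""
  until scanner.eos?
    char = scanner.getch
    text << char
    depth += 1 if char == "{"
    depth -= 1 if char == "}"
    return text if depth.zero?
  end
  raise ArgumentError, "unterminated action"
end

p tokenize_rule("expression: NUMBER '+' expression { $$ = $1 + $3 }")
```

Because the lexer decides on its own when an action starts and ends, the whole input can be tokenized before parsing begins, which matches the phase separation described above.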

Slide 86

Slide 86 text

Pull Request
https://github.com/ruby/lrama/pull/41

Slide 87

Slide 87 text

Future Work
● Implement Bison's features
  ○ Generating IELR parser
● Todo list for Lrama written by Yuichiro Kaneko
  ○ https://docs.google.com/document/d/1EAZzYMXBOdzK-6mMIj2YNJxZZRVcpJxE7-4zXbHn8JA/edit?usp=sharing
  ○ (Japanese only)

Slide 88

Slide 88 text

The world is now in the great age of parsers. People are setting sail into the vast sea of parsers.
――RubyKaigi 2023 LT Yuichiro Kaneko