From LALR to IELR: A Lrama's Next Step

Junichi Kobayashi (@junk0612), ESM, Inc.
RubyKaigi 2024, Naha Cultural Arts Theater NAHArt, 2024/05/17 (Fri.)

Junichi Kobayashi
• X / GitHub: @junk0612
• Working at ESM, Inc.
  ◦ Work as a Rails engineer
  ◦ A member of Parser Club
• Hobbies
  ◦ Parsers
  ◦ Rhythm games, board games
  ◦ Haiku

Contributor of Lrama

Committer of Lrama (previously Contributor)
• Became a committer at the Official Party!!!

Night Cruise Sponsor

Attendees from ESM
• Attendee: @kasumi8pon
• Attendee: @mhirata
• Attendee: @sfjwr
• Karaoke: @fugakkbn
• Speaker: @koic
• Attendee: @wai-doi
• Speaker: @junk0612
• Attendee: @haruguchi
• Attendee: @htkymtks
• LT: @S.H.

Overview of Lrama
"Wake up. Wake up, my dear child... Today is a very important day. It is the day you go to the castle for the first time, isn't it?" ―The Hero's mother

Lrama
• An LALR parser generator written in Ruby
  ◦ https://github.com/ruby/lrama
• Presented at RubyKaigi 2023 by Yuichiro Kaneko
  ◦ https://youtu.be/IhfDsLx784g?si=kO1q6mLpTa1bIRYL
• Used in the CRuby 3.3 build process
  ◦ Runs on BASERUBY when building Ruby

Basis of LR Parser https://yui-knk.hatenablog.com/entry/2023/12/06/082203

My Contributions to Lrama
• Slide (JP, presented at Osaka RubyKaigi 03): https://speakerdeck.com/junk0612/lrama-henokontoribiyusiyonwotong-sitexue-bu-ruby-nopasazieneretashi-qing
• Slide (EN, presented at RubyConf Taiwan 2023): https://speakerdeck.com/junk0612/understanding-parser-generators-surrounding-ruby-with-contributing-lrama

My Contributions to Lrama
https://www.ruby-lang.org/en/news/2023/12/25/ruby-3-3-0-released/

What We Aim to Solve
"Oh, Enix! Descendant of the hero Loto! I have been awaiting your arrival." ―King Lars XVI

What We Aim to Solve
• CRuby's lexer state is hard to manage

A Motivating Example
$ ruby -e 'p 1.. || 2'
#=> ???

A Motivating Example
$ ruby -e 'p 1.. || 2'
-e:1: syntax error, unexpected '|'
p 1.. || 2
-e: compile error (SyntaxError)

Exec with Parser Info

Scanned Tokens
$ ruby -y -e 'p 1.. || 2' | rg 'Next token' | uniq
Next token is token "local variable or method" (1.0-1.1: p)
Next token is token "integer literal" (1.2-1.3: 1)
Next token is token ".." (1.3-1.5: )
Next token is token '|' (1.6-1.7: )
Next token is token '|' (1.7-1.8: )

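yylex in parse.y
  case '|':
    if ((c = nextc(p)) == '|') {
        (...snip...)
        /* In the BEG state, push the second '|' back and return a single '|' */
        if (IS_lex_state_for(last_state, EXPR_BEG)) {
            c = '|';
            pushback(p, '|');
            return c;
        }
        /* Otherwise the two characters form the '||' operator */
        return tOROP;
    }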
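The branch above can be mirrored in a few lines of Ruby to make the dependence on the lexer state explicit. This is only a toy sketch, not CRuby's lexer: `lex_pipe` and the simplified states `:BEG` / `:END` are hypothetical stand-ins for `yylex` and `lex_state`.

```ruby
# Toy model of the '|' branch of yylex (hypothetical names, not CRuby's API).
#   src:   source string
#   pos:   index of a '|' character in src
#   state: simplified lexer state, :BEG or :END (assumption of this sketch)
# Returns [token, position after the token].
def lex_pipe(src, pos, state)
  raise ArgumentError, "expected '|' at #{pos}" unless src[pos] == "|"

  if src[pos + 1] == "|"
    # In the BEG state the second '|' is pushed back: only one '|' is consumed.
    return ["|", pos + 1] if state == :BEG

    # Otherwise both characters form the logical-or operator.
    return [:tOROP, pos + 2]
  end

  ["|", pos + 1]
end

# After `1..` the state is BEG, so `||` is cut into two '|' tokens.
p lex_pipe("p 1.. || 2", 6, :BEG)
# After an operand the state is END, so `||` stays one operator.
p lex_pipe("a || b", 2, :END)
```

The same two characters produce different tokens purely because of `state`, which is why the motivating example ends in a syntax error.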

lex_state
$ ruby -y -e 'p 1.. || 2' | rg 'Next token|lex_state' | uniq
lex_state: NONE -> BEG at line 2195
lex_state: BEG -> CMDARG at line 10384
Next token is token "local variable or method" (1.0-1.1: p)
lex_state: CMDARG -> END at line 9649
lex_state: END -> END at line 8930
Next token is token "integer literal" (1.2-1.3: 1)
lex_state: END -> BEG at line 10872
Next token is token ".." (1.3-1.5: )
lex_state: BEG -> BEG at line 10789
Next token is token '|' (1.6-1.7: )
lex_state: BEG -> BEG|LABEL at line 10808
Next token is token '|' (1.7-1.8: )

Problems of Changing Lexer
• Hard to identify the scope of impact of a change
  ◦ parse.y is over 16k LOC
  ◦ lex_state is not the only variable that determines lexer behavior

Problems of Changing Lexer
• The cost of changes in future maintenance will almost certainly increase
  ◦ parse.y is always at its most readable right now
  ◦ (Lrama tries to decrease these costs)

Parser and Lexer
• In theory, the parser and the lexer are considered separable
• In practice, however, the state of the lexer is affected by the state of the parser
  ◦ The same sequence of characters can have different meanings depending on where it is written, so the tokens to be cut out of it differ

Examples
• ||
  ◦ a || b -> '||'
  ◦ ary.each {|| do_something } -> '|' '|'
• <<-
  ◦ p <<-HEREDOC -> '<<-'
  ◦ [] <<-1 -> '<<' '-'
• %s{a}
  ◦ p %s{a} -> '%s{' 'a' '}'
  ◦ 1 %s{a} -> '%' 's' '{' 'a' '}'

Scannerless Parser
"After all, I believe that the power gained through training is something to be used for the sake of others." ―Avan

Ruby Parser Roadmap https://docs.google.com/presentation/d/1E4v9WPHBLjtvkN7QqulHPGJzKkwIweVfcaMsIQ984_Q/edit?usp=sharing

Scannerless Parser
• Performs tokenization and parsing in a single step
• Infers the acceptable tokens for each state from the grammar file
  ◦ Lexer state management is done by the computer, not the developer
[Figure: Scannerless Parser, Lexer, Parser]

PSLR
• Pseudo-Scannerless Minimal LR
• https://tigerprints.clemson.edu/all_dissertations/519/
• A pseudo lexer tokenizes only the tokens that are acceptable in the current context
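The idea of tokenizing only what the current context accepts can be sketched as follows. This is a minimal illustration under stated assumptions, not the PSLR(1) algorithm itself: `pseudo_lex` is a hypothetical helper, tokens are plain literal strings, the set of acceptable tokens is passed in by hand instead of being derived from parser tables, and choosing the longest acceptable match is an assumption of this sketch.

```ruby
# Hypothetical pseudo lexer: among the tokens the current parser state
# accepts, return the longest one that matches the input at `pos`.
# Returns nil when no acceptable token matches.
def pseudo_lex(src, pos, acceptable)
  acceptable
    .select { |lexeme| src[pos, lexeme.length] == lexeme }
    .max_by(&:length)
end

# In an operator position both '||' and '|' are acceptable: longest match wins.
p pseudo_lex("a || b", 2, ["||", "|"])
# At the start of block parameters only '|' is acceptable, so '||' is never produced.
p pseudo_lex("{|| x }", 1, ["|"])
# After a command name a heredoc start is acceptable.
p pseudo_lex("p <<-HEREDOC", 2, ["<<-", "<<"])
# After an operand only the binary operators are acceptable.
p pseudo_lex("[] <<-1", 3, ["<<", "-"])
```

Because the acceptable set comes from the parser state, no hand-maintained lexer state is needed to tell these cases apart.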


PSLR
• LALR is not powerful enough for the parser part of PSLR
• PSLR requires an LR algorithm with the same parsing capability as Canonical LR
• Canonical LR and LALR differ in the grammars they can parse

Canonical LR
• Performs a reduction by a rule when the next token is in that rule's lookahead set
• Handles the widest range of grammars among the LR methods
• Creates a huge number of states
• Requires large space complexity

LALR (LookAhead LR)
• Merges states that have the same core in the Canonical LR automaton
• Can parse slightly fewer grammars than Canonical LR
• Compared to Canonical LR, merging the states may cause some conflicts
  ◦ These are called "Mysterious Conflicts" in the GNU Bison manual
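A small worked example shows how merging can introduce a conflict. The grammar below is the textbook LR(1)-but-not-LALR(1) case and is an assumption of this sketch (it does not come from the slides): S -> a A d | b B d | a B e | b A e, A -> c, B -> c. The helper names `conflict_tokens` and `merge_isocores` are hypothetical, and a state is modeled only as a Hash from a completed item to its lookahead set.

```ruby
require "set"

# Canonical LR(1) states reached after reading "a c" and "b c".
# Both have the same core (the same two completed items) but different lookaheads.
state_after_a_c = { "A -> c ." => Set["d"], "B -> c ." => Set["e"] }
state_after_b_c = { "A -> c ." => Set["e"], "B -> c ." => Set["d"] }

# Tokens on which more than one reduction is possible (reduce/reduce conflicts).
def conflict_tokens(state)
  counts = Hash.new(0)
  state.each_value { |lookaheads| lookaheads.each { |t| counts[t] += 1 } }
  counts.select { |_token, n| n > 1 }.keys.sort
end

# LALR-style merge: same core, union of the lookahead sets of each item.
def merge_isocores(s1, s2)
  s1.merge(s2) { |_item, la1, la2| la1 | la2 }
end

merged = merge_isocores(state_after_a_c, state_after_b_c)

p conflict_tokens(state_after_a_c) # no conflict before merging
p conflict_tokens(state_after_b_c) # no conflict before merging
p conflict_tokens(merged)          # conflicts appear only after merging
```

Neither Canonical LR state has a conflict, yet the merged state cannot decide between reducing to A or to B on `d` or `e`.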

IELR Overview
"And, Enix. No matter how far apart we are, we're friends, right?" ―Kiefer

IELR
• Inadequacy Elimination LR
• https://www.sciencedirect.com/science/article/pii/S0167642309001191
• Developed as part of the preliminary work for the PSLR paper
• Bridges the gap between LALR and Canonical LR while keeping LALR's strength of small space complexity

IELR Concepts
• Create a parser table for LALR
• Recompute the lookahead sets for each state, starting from the start state
• Using the original lookahead sets and the recomputed ones, verify that the state merging did not cause any Mysterious Conflicts
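The verification step can be sketched as a comparison between the merged LALR lookaheads of a state and the lookaheads recomputed separately for each way of reaching it. This is a deliberate simplification and an assumption of this sketch: the IELR(1) paper tracks inadequacies with annotations rather than comparing whole lookahead sets, and `conflict_tokens` / `mysterious_conflicts` are hypothetical helpers, not Lrama methods.

```ruby
require "set"

# Reduce/reduce conflict tokens of one state (completed item => lookahead Set).
def conflict_tokens(state)
  counts = Hash.new(0)
  state.each_value { |lookaheads| lookaheads.each { |t| counts[t] += 1 } }
  counts.select { |_token, n| n > 1 }.keys.sort
end

# A conflict is treated as "mysterious" when it exists in the merged state
# but in none of the per-path recomputed states, i.e. merging created it.
def mysterious_conflicts(merged, per_path)
  genuine = per_path.flat_map { |state| conflict_tokens(state) }.uniq
  conflict_tokens(merged) - genuine
end

# Lookaheads recomputed per incoming path, and the merged LALR lookaheads.
path_a = { "A -> c ." => Set["d"], "B -> c ." => Set["e"] }
path_b = { "A -> c ." => Set["e"], "B -> c ." => Set["d"] }
lalr   = { "A -> c ." => Set["d", "e"], "B -> c ." => Set["d", "e"] }

# Conflicts caused only by merging: this state would need to be split.
p mysterious_conflicts(lalr, [path_a, path_b])

# A conflict already present on one path is a real grammar conflict;
# splitting cannot remove it, so it is not reported.
ambiguous = { "A -> c ." => Set["d"], "B -> c ." => Set["d"] }
p mysterious_conflicts(ambiguous, [ambiguous])
```

Only states with a non-empty result need splitting, which is how the approach stays close to LALR's table size while removing the conflicts that merging introduced.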

Implement IELR parser
"So I'm counting on you, Hero. Let me see it too: the miracle of a hero who takes down the Dark Lord." ―Camus

IELR Implementation in Lrama
def split_states
  (...snip...)
  # Worklist of [state, shift, next_state] entries, seeded from the start state
  transition_queue = []
  @states.first.transitions.each do |shift, next_state|
    transition_queue << [@states.first, shift, next_state]
  end

  # Process entries in FIFO order, enqueueing the transitions of each successor
  until transition_queue.empty?
    state, shift, next_state = transition_queue.shift
    compute_state(state, shift, next_state)
    next_state.transitions.each do |sh, next_st|
      transition_queue << [next_state, sh, next_st]
    end
  end
end

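IELR Implementation in Lrama
def compute_state(state, shift, next_state)
  k = propagate_lookaheads(state, next_state)
  # Look for an isocore of next_state that is compatible with k
  s = @ielr_isocores[next_state].find {|st| compatible?(st, k) }

  if s.nil?
    # No compatible isocore: split the state
    split_state(@ielr_isocores[next_state].last)
  elsif !@lookaheads_recomputed[s]
    # First time s is reached: record k as its lookahead set
    @item_lookahead_set[s] = k
    @lookaheads_recomputed[s] = true
  else
    # Redirect the transition to s and merge k into its lookaheads
    state.update_transition(shift, s)
    merge_lookaheads(s, k)
  end
end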

Pull Request
https://github.com/ruby/lrama/pull/398

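Conclusion
"The love and courage of people will never fade away. If I should ever fall into darkness, when that time comes, please take this sword in your hand..." ―The Holy Dragon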

Conclusion
• Managing lex_state is difficult
• To solve this, we want Lrama to support PSLR parser generation
• We implemented IELR, which is a prerequisite for PSLR
• IELR bridges the gap in parsing capability between LALR and Canonical LR

Future Work
• Merge the PR
• Support generating PSLR parsers
• Refactor parse.y and throw lex_state away

Acknowledgements
• @yui-knk, @ydah
  ◦ #lramafriends #LR_parser_gangs
• @koic, @S.H.
  ◦ #esm_parser_club
• My Wife
  ◦ #a_beautiful_life

References
• Lrama Repository: https://github.com/ruby/lrama
• Yuichiro Kaneko, "The future vision of Ruby parser", May 2023: https://youtu.be/IhfDsLx784g?si=kO1q6mLpTa1bIRYL
• Yuichiro Kaneko, "Ruby Parser開発日誌(14) - LR parser完全に理解した", Dec 2023: https://yui-knk.hatenablog.com/entry/2023/12/06/082203
• Junichi Kobayashi, "Lrama へのコントリビューションを通して学ぶ Ruby のパーサジェネレータ事情", Sep 2023: https://speakerdeck.com/junk0612/lrama-henokontoribiyusiyonwotong-sitexue-bu-ruby-nopasazieneretashi-qing
• Junichi Kobayashi, "Understanding Parser Generator surrounding Ruby with Contributing Lrama", Dec 2023: https://speakerdeck.com/junk0612/understanding-parser-generators-surrounding-ruby-with-contributing-lrama
• Ruby 3.3.0 Release Note: https://www.ruby-lang.org/en/news/2023/12/25/ruby-3-3-0-released/
• Yuichiro Kaneko, Ruby Parser Roadmap: https://docs.google.com/presentation/d/1E4v9WPHBLjtvkN7QqulHPGJzKkwIweVfcaMsIQ984_Q/edit?usp=sharing
• Joel E. Denny, "PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages", May 2010: https://tigerprints.clemson.edu/all_dissertations/519/
• Joel E. Denny and Brian A. Malloy, "The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution": https://www.sciencedirect.com/science/article/pii/S0167642309001191

Presentations around Parsers

That's one small patch for Lrama, one giant leap for Ruby.
