From LALR to IELR: A Lrama's Next Step

RubyKaigi 2024 day 3
https://rubykaigi.org/2024/presentations/junk0612.html

References:
Lrama Repository: https://github.com/ruby/lrama
Yuichiro Kaneko, "The future vision of Ruby parser", May 2023: https://youtu.be/IhfDsLx784g?si=kO1q6mLpTa1bIRYL
Yuichiro Kaneko, "Ruby Parser開発日誌(14) - LR parser完全に理解した", Dec 2023: https://yui-knk.hatenablog.com/entry/2023/12/06/082203
Junichi Kobayashi, "Lrama へのコントリビューションを通して学ぶ Ruby のパーサジェネレータ事情", Sep 2023: https://speakerdeck.com/junk0612/lrama-henokontoribiyusiyonwotong-sitexue-bu-ruby-nopasazieneretashi-qing
Junichi Kobayashi, "Understanding Parser Generator surrounding Ruby with Contributing Lrama", Dec 2023: https://speakerdeck.com/junk0612/understanding-parser-generators-surrounding-ruby-with-contributing-lrama
Ruby 3.3.0 Release Note: https://www.ruby-lang.org/en/news/2023/12/25/ruby-3-3-0-released/
Yuichiro Kaneko, Ruby Parser Roadmap: https://docs.google.com/presentation/d/1E4v9WPHBLjtvkN7QqulHPGJzKkwIweVfcaMsIQ984_Q/edit?usp=sharing
Joel E. Denny, "PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages", May 2010: https://tigerprints.clemson.edu/all_dissertations/519/
Joel E. Denny and Brian A. Malloy, "The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution": https://www.sciencedirect.com/science/article/pii/S0167642309001191

Junichi Kobayashi

May 17, 2024

Transcript

  1. From LALR to IELR: A Lrama's Next Step
    Junichi Kobayashi (@junk0612), ESM, Inc.
    RubyKaigi 2024, Naha Cultural Arts Theater NAHArt, 2024/05/17 (Fri.)
  2. Junichi Kobayashi
    • X / GitHub: @junk0612
    • Working at ESM, Inc.
      ◦ Work as a Rails engineer
      ◦ A Member of Parser Club
    • Hobbies
      ◦ Parsers
      ◦ Rhythm games, Board games
      ◦ Haiku
  3. Attendees from ESM
    • Attendee @kasumi8pon
    • Attendee @mhirata
    • Attendee @sfjwr
    • Karaoke @fugakkbn
    • Speaker @koic
    • Attendee @wai-doi
    • Speaker @junk0612
    • Attendee @haruguchi
    • Attendee @htkymtks
    • @S.H. LT
  4. Lrama
    • A LALR parser generator built with Ruby
      ◦ https://github.com/ruby/lrama
    • Presented at RubyKaigi 2023 by Yuichiro Kaneko
      ◦ https://youtu.be/IhfDsLx784g?si=kO1q6mLpTa1bIRYL
    • Used in the CRuby 3.3 build process
      ◦ Uses BASERUBY when building Ruby
  5. My Contributions to Lrama
    Slide (JP, presented in Osaka RubyKaigi 03): https://speakerdeck.com/junk0612/lrama-henokontoribiyusiyonwotong-sitexue-bu-ruby-nopasazieneretashi-qing
    Slide (EN, presented in RubyConf Taiwan 2023): https://speakerdeck.com/junk0612/understanding-parser-generators-surrounding-ruby-with-contributing-lrama
  6. A Motivating Example
    $ ruby -e 'p 1.. || 2'
    -e:1: syntax error, unexpected '|'
    p 1.. || 2
    -e: compile error (SyntaxError)
  8. Scanned Tokens
    $ ruby -y -e 'p 1.. || 2' | rg 'Next token' | uniq
    Next token is token "local variable or method" (1.0-1.1: p)
    Next token is token "integer literal" (1.2-1.3: 1)
    Next token is token ".." (1.3-1.5: )
    Next token is token '|' (1.6-1.7: )
    Next token is token '|' (1.7-1.8: )
  9. yylex in parse.y
    case '|':
      if ((c = nextc(p)) == '|') {
        (...snip...)
        if (IS_lex_state_for(last_state, EXPR_BEG)) {
          c = '|';
          pushback(p, '|');
          return c;
        }
        return tOROP;
      }
  10. lex_state
    $ ruby -y -e 'p 1.. || 2' | rg 'Next token|lex_state' | uniq
    lex_state: NONE -> BEG at line 2195
    lex_state: BEG -> CMDARG at line 10384
    Next token is token "local variable or method" (1.0-1.1: p)
    lex_state: CMDARG -> END at line 9649
    lex_state: END -> END at line 8930
    Next token is token "integer literal" (1.2-1.3: 1)
    lex_state: END -> BEG at line 10872
    Next token is token ".." (1.3-1.5: )
    lex_state: BEG -> BEG at line 10789
    Next token is token '|' (1.6-1.7: )
    lex_state: BEG -> BEG|LABEL at line 10808
    Next token is token '|' (1.7-1.8: )
  12. Problems of Changing Lexer
    • Hard to identify the scope of impact of a change
      ◦ parse.y is over 16k LOC
      ◦ lex_state is not the only variable that determines lexer behavior
  13. Problems of Changing Lexer
    • The cost of future maintenance will almost certainly increase with each change
      ◦ parse.y is always at its most readable right now
      ◦ (Lrama tries to decrease these costs)
  14. Parser and Lexer
    • Academically, the parser and the lexer are considered separable
    • In reality, however, the state of the lexer is affected by the state of the parser
      ◦ The same sequence of symbols can have different meanings depending on where it is written, so the tokens to be cut out differ
  15. Examples
    • ||
      ◦ a || b -> '||'
      ◦ ary.each {|| do_something } -> '|' '|'
    • <<-
      ◦ p <<-HEREDOC -> '<<-'
      ◦ [] <<-1 -> '<<' '-'
    • %s{a}
      ◦ p %s{a} -> '%s{' 'a' '}'
      ◦ 1 %s{a} -> '%' 's' '{' 'a' '}'
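The context dependence of `||` can be illustrated with a toy lexer. This is a minimal sketch, not CRuby's lexer: `toy_lex` and its two states `:beg` and `:end` are hypothetical stand-ins for the real `lex_state` values, and it only knows identifiers, integers, `..`, `||` and a few single-character tokens.

```ruby
require "strscan"

# Toy lexer with two hypothetical states:
#   :beg means an operand may start here (loosely like EXPR_BEG),
#   :end means an operand has just finished (loosely like EXPR_END).
def toy_lex(src)
  ss = StringScanner.new(src)
  state = :beg
  tokens = []
  until ss.eos?
    if ss.skip(/\s+/)
      next
    elsif (t = ss.scan(/[a-z_]\w*|\d+/))
      tokens << t
      state = :end
    elsif ss.scan(/\.\./)
      tokens << ".."
      state = :beg
    elsif ss.scan(/\|\|/)
      if state == :beg
        # At the beginning of an expression, "||" is read as two '|' tokens.
        tokens << "|" << "|"
      else
        tokens << "||"
        state = :beg
      end
    elsif (t = ss.scan(/[{}.|]/))
      tokens << t
      state = (t == "}" ? :end : :beg)
    else
      raise ArgumentError, "unexpected character at #{ss.pos}"
    end
  end
  tokens
end

p toy_lex("a || b")
p toy_lex("ary.each {|| do_something }")
p toy_lex("p 1.. || 2")
```

In this sketch the endless range `1..` leaves the lexer in `:beg`, so the following `||` is split into two `'|'` tokens, which mirrors the token stream shown for the motivating example.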
  16. Scannerless Parser
    • Performs tokenization and parsing in a single step
    • Infers the acceptable tokens for each state from the grammar file
      ◦ Lexer state management is done by the computer, not the developer
    [Diagram labels: Scannerless Parser, Lexer, Parser]
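The idea of letting the parser state decide which tokens the lexer may produce can be sketched as follows. This is an illustration under assumptions, not Lrama's or PSLR's actual mechanism: the table `ACTIONS`, the state names, `PATTERNS` and `next_token` are all hypothetical. A generator would derive the table from the grammar; here it is written by hand.

```ruby
# Hypothetical table: parser state => token names that have an action there.
ACTIONS = {
  operand_expected: ["|", "ident"],
  operator_expected: ["||", "|"],
}.freeze

# Hypothetical token patterns, keyed by token name.
PATTERNS = {
  "||" => /\|\|/,
  "|" => /\|/,
  "ident" => /[a-z_]\w*/,
}.freeze

# Try only the tokens acceptable in the current parser state and
# return [name, text] for the longest match at pos, or nil if none match.
def next_token(src, pos, parser_state)
  matches = ACTIONS.fetch(parser_state).filter_map do |name|
    m = PATTERNS.fetch(name).match(src, pos)
    [name, m[0]] if m && m.begin(0) == pos
  end
  matches.max_by { |_name, text| text.length }
end

p next_token("|| x", 0, :operand_expected)
p next_token("|| x", 0, :operator_expected)
```

Because `"||"` is not in the acceptable set for `:operand_expected`, the same input yields a single `'|'` there and a `'||'` in `:operator_expected`, with no hand-written state flag in the lexer.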
  17. PSLR
    • LALR has insufficient parsing capabilities for the parser part of PSLR
    • It requires an LR algorithm with the same parsing capability as Canonical LR
    • There is a difference in parsable grammars between Canonical LR and LALR
  18. Canonical LR
    • Performs a reduction on a rule when the next token is included in the lookahead set
    • Widest range of languages that can be parsed by the LR method
    • A huge number of states is created
    • Large space complexity required
  19. LALR (LookAhead LR)
    • Merges states with the same core from the Canonical LR automaton
    • Slightly fewer languages can be parsed than with Canonical LR
    • Compared to Canonical LR, merging the states may cause some conflicts
      ◦ These are called "Mysterious Conflicts" in the GNU Bison documentation
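A small worked example of how merging can introduce a conflict, using the textbook grammar `S -> a A d | a B e | b A e | b B d`, `A -> c`, `B -> c` (this grammar is not from the talk). After reading `a c` and after reading `b c`, Canonical LR has two states with the same core but different lookahead sets. The sketch below is a simplified model in which a state is a hash from completed item to lookahead set; the helper names are hypothetical.

```ruby
require "set"

# Two Canonical LR states that share the core { A -> c . , B -> c . }.
after_a_c = { "A -> c ." => Set["d"], "B -> c ." => Set["e"] }
after_b_c = { "A -> c ." => Set["e"], "B -> c ." => Set["d"] }

# Tokens on which two different reductions would both apply.
def reduce_reduce_conflicts(state)
  state.values.combination(2).flat_map { |la1, la2| (la1 & la2).to_a }.uniq.sort
end

# LALR-style merge: union the lookahead sets item by item.
def merge_isocores(s1, s2)
  s1.merge(s2) { |_item, la1, la2| la1 | la2 }
end

p reduce_reduce_conflicts(after_a_c)
p reduce_reduce_conflicts(after_b_c)
p reduce_reduce_conflicts(merge_isocores(after_a_c, after_b_c))
```

Neither state has a conflict on its own, but the merged state can reduce by either `A -> c` or `B -> c` on both `d` and `e`, which is the kind of conflict that appears only because of the merge.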
  20. IELR
    • Inadequacy Elimination LR
    • https://www.sciencedirect.com/science/article/pii/S0167642309001191
    • Part of the preliminary work for the PSLR paper
    • Bridges the gap between LALR and Canonical LR while keeping LALR's strength of small space complexity
  21. IELR Concepts
    • Create a parser table for LALR
    • Recompute the lookahead sets for each state from the start state
    • Verify that the state merge did not cause any Mysterious Conflicts, using the original lookahead set and the recomputed lookahead set
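The verification step can be sketched in a heavily simplified form. The check below is an assumption made for illustration and is neither the compatibility criterion from the IELR paper nor Lrama's implementation: two lookahead assignments for the same core are treated as mergeable only if merging adds no reduce/reduce conflict that neither had before. All names (`conflict_tokens`, `merged`, `mergeable?`, `assign_isocores`) are hypothetical.

```ruby
require "set"

# A state is modelled as: completed item => lookahead set.
def conflict_tokens(state)
  state.values.combination(2).flat_map { |la1, la2| (la1 & la2).to_a }.uniq.sort
end

def merged(s1, s2)
  s1.merge(s2) { |_item, la1, la2| la1 | la2 }
end

# Simplified stand-in for a compatibility check:
# merging must not create a conflict that neither state already had.
def mergeable?(existing, incoming)
  before = (conflict_tokens(existing) | conflict_tokens(incoming)).sort
  conflict_tokens(merged(existing, incoming)) == before
end

# Merge each recomputed state into the first mergeable isocore,
# or keep it as a separate (split) state.
def assign_isocores(recomputed_states)
  recomputed_states.each_with_object([]) do |incoming, isocores|
    index = isocores.index { |iso| mergeable?(iso, incoming) }
    if index
      isocores[index] = merged(isocores[index], incoming)
    else
      isocores << incoming
    end
  end
end

states = [
  { "A -> c ." => Set["d"], "B -> c ." => Set["e"] },
  { "A -> c ." => Set["e"], "B -> c ." => Set["d"] },
  { "A -> c ." => Set["d"], "B -> c ." => Set["e"] },
]
p assign_isocores(states).size
```

In this model the first and third states share one isocore while the second stays split, so states are duplicated only where a merge would otherwise introduce a conflict.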
  22. IELR Implementation in Lrama
    def split_states
      (...snip...)
      transition_queue = []
      @states.first.transitions.each do |shift, next_state|
        transition_queue << [@states.first, shift, next_state]
      end
      until transition_queue.empty?
        state, shift, next_state = transition_queue.shift
        compute_state(state, shift, next_state)
        next_state.transitions.each do |sh, next_st|
          transition_queue << [next_state, sh, next_st]
        end
      end
    end
  23. IELR Implementation in Lrama
    def compute_state(state, shift, next_state)
      k = propagate_lookaheads(state, next_state)
      s = @ielr_isocores[next_state].find {|st| compatible?(st, k) }
      if s.nil?
        split_state(@ielr_isocores[next_state].last)
      elsif !@lookaheads_recomputed[s]
        @item_lookahead_set[s] = k
        @lookaheads_recomputed[s] = true
      else
        state.update_transition(shift, s)
        merge_lookaheads(s, k)
      end
    end
  24. Conclusion
    • Managing lex_state is difficult
    • Solve it by adding PSLR parser generation support to Lrama
    • Implement IELR, which is a prerequisite for PSLR
    • IELR bridges the gap in parsing capability between LALR and Canonical LR
  25. Future Work
    • Merge the PR
    • Support generating PSLR parsers
    • Refactor parse.y and throw lex_state away
  26. References
    • Lrama Repository: https://github.com/ruby/lrama
    • Yuichiro Kaneko, "The future vision of Ruby parser", May 2023: https://youtu.be/IhfDsLx784g?si=kO1q6mLpTa1bIRYL
    • Yuichiro Kaneko, "Ruby Parser開発日誌(14) - LR parser完全に理解した", Dec 2023: https://yui-knk.hatenablog.com/entry/2023/12/06/082203
    • Junichi Kobayashi, "Lrama へのコントリビューションを通して学ぶ Ruby のパーサジェネレータ事情", Sep 2023: https://speakerdeck.com/junk0612/lrama-henokontoribiyusiyonwotong-sitexue-bu-ruby-nopasazieneretashi-qing
  27. References
    • Junichi Kobayashi, "Understanding Parser Generator surrounding Ruby with Contributing Lrama", Dec 2023: https://speakerdeck.com/junk0612/understanding-parser-generators-surrounding-ruby-with-contributing-lrama
    • Ruby 3.3.0 Release Note: https://www.ruby-lang.org/en/news/2023/12/25/ruby-3-3-0-released/
    • Yuichiro Kaneko, Ruby Parser Roadmap: https://docs.google.com/presentation/d/1E4v9WPHBLjtvkN7QqulHPGJzKkwIweVfcaMsIQ984_Q/edit?usp=sharing
  28. References
    • Joel E. Denny, "PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages", May 2010: https://tigerprints.clemson.edu/all_dissertations/519/
    • Joel E. Denny and Brian A. Malloy, "The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution": https://www.sciencedirect.com/science/article/pii/S0167642309001191