Making the RBS Parser Faster

Transcript

Making the RBS Parser Faster Soutaro Matsumoto (@soutaro)  Shopify, Inc.

Soutaro Matsumoto • Senior Software Engineer at Shopify Ruby DX team • A Ruby core committer • RBS designer • Steep developer

RBS 4.0 & Steep 2.0 • Inline RBS declaration support 🎉 • singleton(T)[S] type • Generics lower bounds: T > S • ruby-rbs and ruby-rbs-sys crates • steep check -e "1 + true" • Several type narrowing updates • steep query • steep server

Inline RBS Type Declaration (Experimental) • Type of attributes • Method type

Inline RBS Type Declaration (Experimental) • RBS and Steep directly support the feature, without using the rbs-inline gem • Enable it by adding inline: true to the check calls in your Steepfile
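As a rough illustration of the bullet above, a Steepfile with the option enabled might look like the following. The target name `:app` and the `lib` and `sig` paths are placeholders, and the placement of the option follows the slide wording rather than a specific Steep release.

```ruby
# Steepfile (sketch; the target name and paths are hypothetical)
target :app do
  signature "sig"
  check "lib", inline: true  # opt in to the experimental inline RBS declarations
end
```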

steep query • steep query lets you navigate your codebase from the command line • Output is in the same JSON format as LSP
$ steep query hover signature_service:3:15
$ steep query definition Steep::Typing

The RBS Parser • RBS parser translates source text into an AST • Using RBS always starts from parsing

RBS::Parser in RBS gem • The pure C parser

The New Parser Architecture • The new parser constructs a pure C AST and translates it into Ruby objects • The source code → C Struct AST → Ruby Object AST

🤔 • We can ignore the issue because parsing accounts for only a small portion of total tool execution time • The new parser scans the input twice. It is inevitable. • We could parallelize parsing for better performance if needed.

Small Inputs • Small inputs are more common with inline RBS declarations, as used in both RBS and Sorbet • We have many small RBS inputs in one file for annotations

Start Profiling • Best practice: profile the code before you start optimizing • I started with some random optimizations (didn't work) • We use Instruments, a profiler bundled in Xcode on macOS

Looks like there is a chance to optimize the lexer

Lexer • Lexer -- lexical analyzer or tokenizer -- groups characters into tokens • Parser operates on tokens rather than individual characters • Characters: class Integer < Numeric • Tokens: kCLASS tUIDENT pLT tUIDENT
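To make the characters-to-tokens step concrete, here is a minimal sketch in Ruby using the standard `strscan` library. The token names come from the slide, but the rule table is a tiny hypothetical subset and is not how the re2c-generated RBS lexer is written.

```ruby
require "strscan"

# Hypothetical three-rule subset; the real lexer has many more token kinds.
RULES = [
  [:kCLASS,  /class\b/],
  [:tUIDENT, /[A-Z]\w*/],
  [:pLT,     /</],
].freeze

# Groups the characters of `source` into [type, text] pairs.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/) # whitespace separates tokens
    type, _pattern = RULES.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected character at byte #{scanner.pos}" unless type
    tokens << [type, scanner.matched]
  end
  tokens
end

tokens = tokenize("class Integer < Numeric")
```

The parser then only ever sees the four pairs in `tokens`, never the 23 individual characters.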

re2c • RBS uses re2c lexer generator, which supports Unicode • re2c Definition → Generated C code

type t = "函館" • t (116): Moves 1 byte • 函 (20989): Moves 3 bytes

🤔 • rbs_peek calculates UTF-8 codepoint • rbs_skip calculates encoding-dependent character width
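The two pieces of work named above, the codepoint and the byte width, can be sketched in Ruby as follows. This is an illustrative decoder for well-formed UTF-8 only; `peek_utf8` is a made-up name and does not mirror the C signatures of `rbs_peek` or `rbs_skip`.

```ruby
# Decodes the UTF-8 character that starts at byte offset `pos`.
# Returns [codepoint, byte_width]. Assumes the input is well-formed UTF-8.
def peek_utf8(source, pos)
  first = source.getbyte(pos)
  return [nil, 0] if first.nil? # end of input

  # The leading byte tells us the width and carries the top bits.
  width, codepoint =
    if    first < 0x80 then [1, first]
    elsif first < 0xE0 then [2, first & 0x1F]
    elsif first < 0xF0 then [3, first & 0x0F]
    else                    [4, first & 0x07]
    end

  # Each continuation byte contributes its low six bits.
  (1...width).each do |i|
    codepoint = (codepoint << 6) | (source.getbyte(pos + i) & 0x3F)
  end
  [codepoint, width]
end

source = "type t = \"\u51FD\u9928\""
ascii_result = peek_utf8(source, 0)   # the leading "t"
kanji_result = peek_utf8(source, 10)  # the first character inside the quotes
```

Every multi-byte character pays for the bit shuffling in the loop even though, as the later slides argue, the parser never uses the resulting value.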

Encoding of RBS Source Text • There was no specification of the encoding of RBS source text • Define an encoding spec 💪 • Follow Ruby's spec • It supports multi-byte encodings, but they must be ASCII-compatible • UTF-8, SJIS, EUC-JP are fine, UTF-16 and UTF-32 are not
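Ruby itself exposes the ASCII-compatibility property through `Encoding#ascii_compatible?`, so the rule on this slide can be checked directly. The helper name below is hypothetical, and the little-endian variants stand in for UTF-16 and UTF-32.

```ruby
# True when an encoding keeps ASCII characters as the same single bytes,
# which is the property the slide requires of RBS source encodings.
def ascii_compatible_encoding?(name)
  Encoding.find(name).ascii_compatible?
end

accepted = %w[UTF-8 Shift_JIS EUC-JP].select { |name| ascii_compatible_encoding?(name) }
rejected = %w[UTF-16LE UTF-32LE].reject { |name| ascii_compatible_encoding?(name) }
```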

The Problem • Our lexer generator doesn't support encodings other than Unicode • Do we need to implement the conversion from supported encodings to Unicode? 😫

We don't need actual codepoints for parsing • Only comments and string literal types allow non-ASCII characters
type language = "ルビー"
type language = "ラスト"
# 集合を表現するクラス
class Set[T] end
# ルルルルルルルルルル
class Set[T] end

Skip Codepoint Calculation • If the next character is multi-byte, returning any Unicode codepoint works • If it's a single-byte character, it's an ASCII character • The actual comment/string content can be fetched from the buffer
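The three bullets can be sketched in Ruby like this. It is an assumption-laden illustration: `peek`, `string_literal_bytes`, and the placeholder value are invented for the example, and the character width is obtained through Ruby's own encoding support rather than the parser's C encoding table.

```ruby
# Any non-ASCII codepoint works as a stand-in; this value is an arbitrary choice.
PLACEHOLDER_CODEPOINT = 0x3042

# Returns [codepoint, byte_width] without decoding multi-byte characters.
def peek(source, pos, encoding)
  byte = source.getbyte(pos)
  return [nil, 0] if byte.nil?      # end of input
  return [byte, 1] if byte < 0x80   # single byte means ASCII
  char = source.byteslice(pos, 4).force_encoding(encoding)[0]
  [PLACEHOLDER_CODEPOINT, char.bytesize]
end

# Scans a double-quoted literal starting at `start` and returns its content
# by slicing the original buffer, so the real characters are preserved.
def string_literal_bytes(source, start, encoding)
  pos = start + 1 # skip the opening quote
  loop do
    code, width = peek(source, pos, encoding)
    raise ArgumentError, "unterminated string literal" if code.nil?
    break if code == '"'.ord
    pos += width
  end
  source.byteslice(start + 1, pos - start - 1)
end

utf8 = "type language = \"\u30EB\u30D3\u30FC\""
sjis = utf8.encode(Encoding::Shift_JIS)
from_utf8 = string_literal_bytes(utf8, 16, Encoding::UTF_8)
from_sjis = string_literal_bytes(sjis, 16, Encoding::Shift_JIS)
```

Advancing by the full character width matters for Shift_JIS, where the second byte of a character can fall in the ASCII range and would otherwise be misread as a token.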

Lexer Updated • Correctly handles non-UTF-8 input • Add character byte count to the lexer struct, and use it in rbs_peek to skip second encoding->char_width call • 16% improvement with core RBS benchmark
Benchmarking RBS(4.0.0.dev.4 in Gemfile-unicode) parsing with 86 files (2148952 bytes)...
✅ 30.108 i/s (33.254 ms/i) (±3.321%)
Benchmarking RBS(4.0.0.dev.4 in Gemfile-base) parsing with 86 files (2148952 bytes)...
✅ 25.854 i/s (38.759 ms/i) (±3.868%)
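One way to picture the "character byte count" bullet is a lexer that remembers the width it computed while peeking, so that skipping does not ask the encoding again. The Ruby class below is only a sketch of that idea; `SketchLexer`, `@last_width`, and `width_calls` are hypothetical names, not fields of the real C lexer struct.

```ruby
class SketchLexer
  PLACEHOLDER = 0x3042 # arbitrary stand-in for any multi-byte character

  attr_reader :pos, :width_calls

  def initialize(source, encoding)
    @source = source
    @encoding = encoding
    @pos = 0
    @last_width = 0   # cached byte count of the character at @pos
    @width_calls = 0  # counts the expensive width lookups
  end

  # Looks at the current character and caches its byte width.
  def peek
    byte = @source.getbyte(@pos)
    return nil if byte.nil?
    @last_width = (byte < 0x80 ? 1 : char_width) if @last_width.zero?
    byte < 0x80 ? byte : PLACEHOLDER
  end

  # Advances using the cached width instead of recomputing it.
  def skip
    peek if @last_width.zero?
    @pos += @last_width
    @last_width = 0
  end

  private

  # Stand-in for the encoding-dependent width lookup.
  def char_width
    @width_calls += 1
    @source.byteslice(@pos, 4).force_encoding(@encoding)[0].bytesize
  end
end

lexer = SketchLexer.new("a\u30EBb", Encoding::UTF_8)
consumed = 0
while lexer.peek
  lexer.skip
  consumed += 1
end
```

With the cache, the single multi-byte character in the input triggers one width lookup instead of one for peeking and another for skipping.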
