Slide 1

Slide 1 text

Is Ruby's Multi-Encoding Overhead Heavy?
2026-04-22
Mari Imaizumi @ima1zumi

Slide 2

Slide 2 text

Introduction
• Mari Imaizumi @ima1zumi
• Working at STORES, Inc.
• Ruby committer 💎

Slide 3

Slide 3 text

Motivation
• In 2026, text is mostly UTF-8 (or ASCII)
• Ruby's M17N is CSI (Character Set Independent): each String carries an encoding tag
• Most languages are UCS: the internal form is always Unicode (Python / Java)
• How much does CSI actually cost?
• Scope: per-char String ops only. String#encode, IO conversion, and Regexp are out of scope.
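To make the "encoding tag" idea concrete, here is a minimal Ruby sketch (an illustration added to this transcript, not taken from the slides). The same three bytes are counted differently depending only on the tag the String carries; the variable names are arbitrary.

```ruby
# The same three bytes, first tagged as binary, then re-tagged as UTF-8.
raw  = "\xE3\x81\x82".b                         # ASCII-8BIT (BINARY) tag
utf8 = raw.dup.force_encoding(Encoding::UTF_8)  # same bytes, UTF-8 tag

# force_encoding only swaps the tag; the bytes themselves are untouched.
same_bytes = (raw.bytes == utf8.bytes)

puts raw.encoding == Encoding::BINARY  # the tag lives on the String object
puts raw.length                        # counted per byte under the binary tag
puts utf8.length                       # counted per character under the UTF-8 tag
puts same_bytes
```

Per-char operations such as `length` have to consult the tag, which is the layer the experiments in this talk poke at.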

Slide 4

Slide 4 text

Question
• Keep CSI and poke at the encoding layer: does speed change?
  • If faster: what was heavy?
  • If not: why not?

Slide 5

Slide 5 text

Question
• Experiments:
  1. Shrink the number of encodings
  2. Replace indirect calls with direct UTF-8 calls
• All numbers below are ratios vs. master (9dd446f18e).
• Nx faster = N× master speed. Nx slower = 1/N master speed.
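The ratio convention can be written down as a small helper. This is a hypothetical sketch: `ratio_label` and the iterations-per-second inputs are made up for illustration and are not part of the talk's tooling.

```ruby
# Hypothetical helper: turns two iterations-per-second figures into the
# "Nx faster" / "Nx slower" labels used on the slides.
def ratio_label(master_ips, patched_ips)
  if patched_ips >= master_ips
    # Nx faster: the patched build runs at N times master speed.
    format("%.2fx faster", patched_ips / master_ips.to_f)
  else
    # Nx slower: the patched build runs at 1/N of master speed.
    format("%.2fx slower", master_ips / patched_ips.to_f)
  end
end

puts ratio_label(100, 989)  # made-up inputs: patched is the faster build
puts ratio_label(133, 100)  # made-up inputs: patched is the slower build
```

With this convention "1.33x slower" means the patched build reaches about 75% of master's throughput, not 67%.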

Slide 6

Slide 6 text

Benchmark
prelude: |
  utf8_short = "こんにちは世界!"
  utf8_long = "こんにちは世界!" * 100
  ascii_long = "Hello, World! " * 100
  mixed = "Hello こんにちは " * 100
  # For each_char / chars - iterates character by character through encoding layer
  utf8_chars = "あいうえおかきくけこ" * 100
  # --- Regression-focused cases for rb_str_inspect UTF-8 fast path ---
  # Tiny strings - overhead of the added fast-path branch / coderange check
  str_empty = ""
  str_1ascii = "a"
  str_1utf8 = "あ"
  # Non-UTF-8 encodings: must take the generic path - should NOT regress
  binary_long = ("Hello, World! " * 100).b
  us_ascii_long = ("Hello, World! " * 100).force_encoding("US-ASCII")
  # eucjp_long = ("こんにちは世界!" * 100).encode("EUC-JP")
  # sjis_long = ("こんにちは世界!" * 100).encode("Shift_JIS")
  # utf16le_long = ("こんにちは世界!" * 100).encode("UTF-16LE")
  # Broken UTF-8 (coderange BROKEN → generic path)
  broken_utf8 = ("\xFF".b * 1000).force_encoding("UTF-8")
  ...

Slide 7

Slide 7 text

Benchmark
benchmark:
  # Baseline (kept from original)
  inspect_utf8_short: utf8_short.inspect
  inspect_utf8_long: utf8_long.inspect
  inspect_ascii_long: ascii_long.inspect
  inspect_mixed: mixed.inspect
  # Tiny strings (branch overhead)
  inspect_empty: str_empty.inspect
  inspect_1char_ascii: str_1ascii.inspect
  inspect_1char_utf8: str_1utf8.inspect
  # Non-UTF-8 encodings (generic path - regression check)
  inspect_binary_long: binary_long.inspect
  inspect_us_ascii_long: us_ascii_long.inspect
  # inspect_eucjp_long: eucjp_long.inspect
  # inspect_sjis_long: sjis_long.inspect
  # inspect_utf16le_long: utf16le_long.inspect
  # Broken UTF-8 (generic path)
  inspect_broken_utf8: broken_utf8.inspect
  # Escape-heavy (no bulk-skip possible)
  inspect_all_newlines: newlines_long.inspect
  inspect_control_chars: ctrl_long.inspect
  inspect_all_quotes: quotes_long.inspect
  ...
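For a quick look at a couple of these cases without the benchmark tooling, a rough plain-Ruby harness might look like the sketch below. It is an assumption-laden stand-in, not the setup used for the numbers in this talk: `iterations_per_second` is a hypothetical helper, it does no warm-up, and it makes no attempt to control for GC or noise.

```ruby
# Rough iterations-per-second measurement using only the monotonic clock.
# Hypothetical helper for illustration; a real benchmark tool does far more.
def iterations_per_second(min_seconds: 0.2, batch: 100)
  count = 0
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    batch.times { yield }   # run the measured block in batches
    count += batch
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    return count / elapsed if elapsed >= min_seconds
  end
end

# Two of the cases from the prelude above.
ascii_long  = "Hello, World! " * 100
binary_long = ascii_long.b

{ "inspect_ascii_long" => ascii_long,
  "inspect_binary_long" => binary_long }.each do |name, str|
  ips = iterations_per_second { str.inspect }
  puts format("%-22s %12.0f i/s", name, ips)
end
```

Running the same script under two Ruby builds and dividing the figures gives a ratio comparable in spirit to the ones on the later slides.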

Slide 8

Slide 8 text

1. Shrink encodings
• Goal: 103 → 3 (UTF-8, ASCII-8BIT, US-ASCII)
• Benchmarks
  • 1.00x~1.05x slower or faster
  • No change
• Why: non-builtin encodings are dynamically loaded. If unused, they never touch memory, so they add zero hot-path cost.

Slide 9

Slide 9 text

2. Direct calls
• Patched `ONIGENC_PRECISE_MBC_ENC_LEN` etc. with `__builtin_expect(encoding_index == 1, 1)` → call `utf8_mbc_enc_len` directly for UTF-8
• Result:
  • inspect_binary_long (ASCII-8BIT 1400B .inspect): 1.33x slower
  • inspect_us_ascii_long (US-ASCII 1400B .inspect): 1.22x slower
  • valid_encoding_utf8 (.valid_encoding?): 1.16x slower
  • Rest: noise
• Got slower (especially on non-UTF-8 strings).

Slide 10

Slide 10 text

2. Direct calls
• Why:
  1. Hot paths already skip the indirect call (String#length, String#+)
  2. The branch predictor handles the rest, because the encoding is stable
  3. Non-UTF-8 pays a dead compare, then takes the indirect call anyway
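Point 3 can be illustrated with a toy Ruby model that counts operations rather than time. Everything here is hypothetical (`TABLE`, `char_len_indirect`, `char_len_direct_first`, the simplified lead-byte rule); the real patch lives in C macros, and this sketch only mirrors the shape "compare first, fall back to the table".

```ruby
# Simplified UTF-8 length from a lead byte (assumes a valid lead byte).
UTF8_LEN = lambda do |lead|
  if lead < 0x80 then 1
  elsif lead < 0xE0 then 2
  elsif lead < 0xF0 then 3
  else 4
  end
end
SINGLE_BYTE_LEN = ->(_lead) { 1 }

# Hypothetical stand-in for the per-encoding function table.
TABLE = {
  Encoding::UTF_8    => UTF8_LEN,
  Encoding::BINARY   => SINGLE_BYTE_LEN,
  Encoding::US_ASCII => SINGLE_BYTE_LEN,
}.freeze

# Baseline: always one indirect call through the table.
def char_len_indirect(enc, lead, stats)
  stats[:indirect] += 1
  TABLE.fetch(enc).call(lead)
end

# Patched shape: compare against UTF-8 first, otherwise fall back.
def char_len_direct_first(enc, lead, stats)
  stats[:compare] += 1
  if enc == Encoding::UTF_8
    stats[:direct] += 1
    UTF8_LEN.call(lead)
  else
    char_len_indirect(enc, lead, stats)  # the compare above was wasted work
  end
end

binary_stats = Hash.new(0)
char_len_direct_first(Encoding::BINARY, 0x48, binary_stats)
utf8_stats = Hash.new(0)
char_len_direct_first(Encoding::UTF_8, 0xE3, utf8_stats)
p binary_stats  # one compare plus one indirect call
p utf8_stats    # one compare plus one direct call
```

In this model a non-UTF-8 string never does less work than the baseline, which is consistent with the direction of the slowdowns reported for ASCII-8BIT and US-ASCII.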

Slide 11

Slide 11 text

So what actually works?
• The encoding layer itself has no overhead
• String#inspect's bottleneck = per-char work itself (3 indirect calls per char)
  • Reduce the whole thing
  • Add UTF-8-specific fast paths
• See also: byroot's blog: https://byroot.github.io/ruby/performance/2026/04/18/faster-paths.html

Slide 12

Slide 12 text

What I built
• String#inspect UTF-8 fast path
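The general shape of such a fast path can be sketched in Ruby. This is a toy model under stated assumptions, not the actual C patch: `toy_inspect` and `ESCAPE_FOR_BYTE` are hypothetical names, only three escapes are handled, and everything else (non-ASCII, control characters, `#`) is handed to the generic `String#inspect`. The idea it illustrates is the bulk skip hinted at in the benchmark comments: find the next byte that needs escaping and copy the whole run before it in one step, instead of doing per-char work.

```ruby
# Toy model of a bulk-skip fast path. Not the actual C patch.
ESCAPE_FOR_BYTE = { 0x0A => '\n', 0x22 => '\"', 0x5C => '\\\\' }.freeze

def toy_inspect(str)
  # Generic path: anything that is not printable ASCII plus "\n".
  # '#' is excluded too, since String#inspect may escape it before '{', '$', '@'.
  return str.inspect unless str.ascii_only?
  return str.inspect if str.include?("#") || str.match?(/[^\x20-\x7E\n]/)

  out = +'"'
  pos = 0
  while pos < str.bytesize
    # Next byte that needs an escape; everything before it is copied in bulk.
    stop = str.index(/["\\\n]/, pos) || str.bytesize
    out << str.byteslice(pos, stop - pos)
    break if stop == str.bytesize
    out << ESCAPE_FOR_BYTE.fetch(str.getbyte(stop))
    pos = stop + 1
  end
  out << '"'
end

ascii_long = "Hello, World! " * 100
sparse     = ("a" * 99 + "\n") * 14
puts toy_inspect(ascii_long) == ascii_long.inspect
puts toy_inspect(sparse) == sparse.inspect
```

A string made entirely of newlines or quotes gets no benefit from this shape, because every run between escapes is empty.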

Slide 13

Slide 13 text

Problem

Slide 14

Slide 14 text

Patch
WIP: https://github.com/ima1zumi/ruby/tree/enc2

Slide 15

Slide 15 text

Results (String#inspect)
• Pure-ASCII at 1400 bytes: 7–10x faster
  • inspect_ascii_long (`"Hello, World! "` × 100, UTF-8): 9.89x faster
  • inspect_us_ascii_long (same bytes, US-ASCII): 7.74x faster
  • inspect_binary_long (same bytes, ASCII-8BIT): 7.03x faster
  • inspect_sparse_escape (`"a" × 99 + "\n"` × 14, 1400B): 6.26x faster

Slide 16

Slide 16 text

Results (String#inspect)
• Mixed ASCII + multibyte: also improves
  • inspect_mixed (`"Hello こんにちは "` × 100): 1.71x faster
  • inspect_utf8_long (`"こんにちは世界!"` × 100): 1.17x faster
• Non-UTF-8 encodings (inspect_eucjp_long / sjis_long / utf16le_long, Japanese re-encoded): noise (±3%, generic path unchanged, no regression)

Slide 17

Slide 17 text

Takeaway
• Ruby's CSI is not a heavy abstraction: the encoding layer itself is cheap
• But there's still room to bolt UTF-8 / ASCII fast paths on top of CSI: `inspect` gets up to 10x faster
• The same trick should work anywhere per-char indirect calls still live in the loop