Is Ruby's Multi-Encoding Overhead Heavy?

by ima1zumi

Slide 1

Slide 1 text

Is Ruby's Multi-Encoding Overhead Heavy?
2026-04-22
Mari Imaizumi @ima1zumi

Slide 2

Slide 2 text

Introduction

- Mari Imaizumi @ima1zumi
- Working at STORES, Inc.
- Ruby committer 💎

Slide 3

Slide 3 text

Motivation

- In 2026, text is mostly UTF-8 (or ASCII).
- Ruby's M17N is CSI (Character Set Independent): each String carries an encoding tag.
- Most languages are UCS: the internal form is always Unicode (Python / Java).
- How much does CSI actually cost?
- Scope: per-char String ops only. String#encode, IO conversion, and Regexp are out of scope.
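To make the "encoding tag" idea concrete, here is a minimal Ruby sketch. It retags the same three bytes and shows that the character count changes while the bytes do not. The helper name `encoding_tag_demo` is made up for this illustration and does not come from the talk.

```ruby
# Hypothetical helper: shows that the encoding is a tag on the String,
# not a property of the bytes themselves.
def encoding_tag_demo
  raw  = "\xE3\x81\x82".b                         # 3 bytes, tagged as binary
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)  # same bytes, retagged as UTF-8

  {
    same_bytes:    raw.bytes == utf8.bytes,  # retagging does not touch the bytes
    binary_length: raw.length,               # binary: one character per byte
    utf8_length:   utf8.length,              # UTF-8: the 3 bytes form one character
    utf8_valid:    utf8.valid_encoding?
  }
end

p encoding_tag_demo
```

Every per-char operation has to consult that tag to know where one character ends and the next begins, and that lookup is the cost the talk measures.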

Slide 4

Slide 4 text

Question

- Keep CSI but poke the encoding layer: does speed change?
- If faster: what was heavy?
- If not: why not?

Slide 5

Slide 5 text

Question

- Experiments:
  1. Shrink the number of encodings
  2. Replace indirect calls with direct UTF-8 calls
- All numbers below are ratios vs. master (9dd446f18e).
- "Nx faster" = N× master speed. "Nx slower" = 1/N of master speed.
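The ratio convention above can be written as a small Ruby helper. This is only a sketch: `describe_ratio` is a hypothetical name, and the throughput numbers in the usage lines are invented round values, not measurements from the talk.

```ruby
# Hypothetical helper: turns two throughput numbers (iterations per second)
# into the "Nx faster" / "Nx slower" wording used in these slides.
def describe_ratio(master_ips, patched_ips)
  ratio = patched_ips.to_f / master_ips
  if ratio >= 1.0
    format("%.2fx faster", ratio)        # N times master speed
  else
    format("%.2fx slower", 1.0 / ratio)  # 1/N of master speed
  end
end

puts describe_ratio(1000, 2000)  # patched build runs at twice master speed
puts describe_ratio(1000, 500)   # patched build runs at half master speed
```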

Slide 6

Slide 6 text

Benchmark

prelude: |
  utf8_short = "こんにちは世界！"
  utf8_long = "こんにちは世界！" * 100
  ascii_long = "Hello, World! " * 100
  mixed = "Hello こんにちは " * 100

  # For each_char / chars - iterates character by character through encoding layer
  utf8_chars = "あいうえおかきくけこ" * 100

  # --- Regression-focused cases for rb_str_inspect UTF-8 fast path ---

  # Tiny strings: overhead of the added fast-path branch / coderange check
  str_empty = ""
  str_1ascii = "a"
  str_1utf8 = "あ"

  # Non-UTF-8 encodings: must take the generic path, should NOT regress
  binary_long = ("Hello, World! " * 100).b
  us_ascii_long = ("Hello, World! " * 100).force_encoding("US-ASCII")
  # eucjp_long = ("こんにちは世界！" * 100).encode("EUC-JP")
  # sjis_long = ("こんにちは世界！" * 100).encode("Shift_JIS")
  # utf16le_long = ("こんにちは世界！" * 100).encode("UTF-16LE")

  # Broken UTF-8 (coderange BROKEN → generic path)
  broken_utf8 = ("\xFF".b * 1000).force_encoding("UTF-8")
  ...

Slide 7

Slide 7 text

Benchmark

benchmark:
  # Baseline (kept from original)
  inspect_utf8_short: utf8_short.inspect
  inspect_utf8_long: utf8_long.inspect
  inspect_ascii_long: ascii_long.inspect
  inspect_mixed: mixed.inspect

  # Tiny strings (branch overhead)
  inspect_empty: str_empty.inspect
  inspect_1char_ascii: str_1ascii.inspect
  inspect_1char_utf8: str_1utf8.inspect

  # Non-UTF-8 encodings (generic path, regression check)
  inspect_binary_long: binary_long.inspect
  inspect_us_ascii_long: us_ascii_long.inspect
  # inspect_eucjp_long: eucjp_long.inspect
  # inspect_sjis_long: sjis_long.inspect
  # inspect_utf16le_long: utf16le_long.inspect

  # Broken UTF-8 (generic path)
  inspect_broken_utf8: broken_utf8.inspect

  # Escape-heavy (no bulk-skip possible)
  inspect_all_newlines: newlines_long.inspect
  inspect_control_chars: ctrl_long.inspect
  inspect_all_quotes: quotes_long.inspect
  ...
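For readers who want to try one of these cases without the full benchmark setup, a rough stand-in using only Ruby's monotonic clock might look like the following. It is a sketch, not the harness used for the slides: `time_inspect` and the iteration count are assumptions, and a single wall-clock run is far noisier than a proper benchmark tool.

```ruby
# Hypothetical helper: wall-clock seconds spent calling String#inspect
# `iterations` times on `str`.
def time_inspect(str, iterations: 10_000)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { str.inspect }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Same setup strings as in the prelude.
ascii_long = "Hello, World! " * 100
utf8_long  = "こんにちは世界！" * 100

{ "inspect_ascii_long" => ascii_long, "inspect_utf8_long" => utf8_long }.each do |name, str|
  printf("%-20s %.4fs\n", name, time_inspect(str))
end
```

Running the same script under two Ruby builds and dividing the timings gives a ratio comparable in spirit to the numbers on the later slides.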

Slide 8

Slide 8 text

1. Shrink encodings

- Goal: 103 → 3 (UTF-8, ASCII-8BIT, US-ASCII)
- Benchmarks:
  - 1.00x~1.05x slower or faster
  - No change
- Why: non-builtin encodings are dynamically loaded. If unused, they never touch memory, so they add zero hot-path cost.

Slide 9

Slide 9 text

2. Direct calls

- Patched `ONIGENC_PRECISE_MBC_ENC_LEN` etc. with `__builtin_expect(encoding_index == 1, 1)` → call `utf8_mbc_enc_len` directly for UTF-8
- Result:
  - inspect_binary_long (ASCII-8BIT 1400B .inspect): 1.33x slower
  - inspect_us_ascii_long (US-ASCII 1400B .inspect): 1.22x slower
  - valid_encoding_utf8 (.valid_encoding?): 1.16x slower
  - Rest: noise
- Got slower (especially on non-UTF-8 strings).

Slide 10

Slide 10 text

2. Direct calls

- Why:
  1. Hot paths already skip the indirect call (String#length, String#+)
  2. The CPU's branch predictor handles the rest, because the encoding is stable
  3. Non-UTF-8 pays a dead compare, then takes the indirect call anyway
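The third point is easier to see in code. The following is a Ruby model of the control flow only: the real change lives in C macros, none of the names below exist in Ruby's source, and a Ruby lambda call is itself indirect, so this says nothing about actual timings. It assumes each byte passed in is the lead byte of a valid character.

```ruby
# Hypothetical model of the C-level dispatch; all names are invented.
# Character length from the lead byte, assuming it starts a valid character.
UTF8_CHAR_LEN   = ->(lead) { lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4 }
SINGLE_BYTE_LEN = ->(_lead) { 1 }

# Stand-in for the per-encoding function table.
CHAR_LEN_TABLE = {
  Encoding::UTF_8    => UTF8_CHAR_LEN,
  Encoding::BINARY   => SINGLE_BYTE_LEN,
  Encoding::US_ASCII => SINGLE_BYTE_LEN
}.freeze

# Generic path: always one table lookup plus one indirect call.
def char_len_indirect(enc, lead)
  CHAR_LEN_TABLE.fetch(enc).call(lead)
end

# Shape of experiment 2: test for UTF-8 first and call its routine directly.
def char_len_direct(enc, lead)
  return UTF8_CHAR_LEN.call(lead) if enc == Encoding::UTF_8  # compare paid by every caller
  CHAR_LEN_TABLE.fetch(enc).call(lead)                       # non-UTF-8 still goes indirect
end

p char_len_direct(Encoding::UTF_8, "あ".bytes.first)
p char_len_direct(Encoding::BINARY, "a".b.bytes.first)
```

Both functions return the same answers. The only difference is that `char_len_direct` adds a comparison in front of the lookup, which is pure extra work for binary and US-ASCII strings.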

Slide 11

Slide 11 text

So what actually works?

- The encoding layer itself has no measurable overhead
- String#inspect's bottleneck = the per-char work itself (3 indirect calls per char)
  → reduce the whole thing
- Add UTF-8-specific fast paths
- See also: byroot's blog: https://byroot.github.io/ruby/performance/2026/04/18/faster-paths.html

Slide 12

Slide 12 text

What I built

- String#inspect UTF-8 fast path
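The general shape of such a fast path can be sketched in Ruby. This is a simplified illustration under stated assumptions, not the actual patch (which is in C and linked as WIP in this deck): it only takes the fast path when the whole string is plain printable ASCII, and it conservatively sends any string containing `#` down the generic path. Both method names are hypothetical.

```ruby
# Bytes that String#inspect emits unchanged: printable ASCII except
# the double quote and backslash (always escaped) and '#' (escaped in
# some contexts, so excluded here to stay conservative).
def plain_inspect_byte?(byte)
  byte >= 0x20 && byte <= 0x7E && byte != 0x22 && byte != 0x5C && byte != 0x23
end

# Hypothetical helper, not the patch itself.
def inspect_with_ascii_fast_path(str)
  if str.ascii_only? && str.each_byte.all? { |b| plain_inspect_byte?(b) }
    %("#{str}")   # fast path: copy the bytes in bulk between two quotes
  else
    str.inspect   # generic path: per-character work through the encoding layer
  end
end

sample = "Hello, World! " * 100
p inspect_with_ascii_fast_path(sample) == sample.inspect
```

The design point is that one cheap check up front replaces per-character dispatch for the common case, while every other input keeps its existing behavior.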

Slide 13

Slide 13 text

Problem

Slide 14

Slide 14 text

Patch

WIP: https://github.com/ima1zumi/ruby/tree/enc2

Slide 15

Slide 15 text

Results (String#inspect)

- Pure-ASCII at 1400 bytes: 7–10x faster
  - inspect_ascii_long (`"Hello, World! "` × 100, UTF-8): 9.89x faster
  - inspect_us_ascii_long (same bytes, US-ASCII): 7.74x faster
  - inspect_binary_long (same bytes, ASCII-8BIT): 7.03x faster
  - inspect_sparse_escape (`"a" × 99 + "\n"` × 14, 1400B): 6.26x faster

Slide 16

Slide 16 text

Results (String#inspect)

- Mixed ASCII + multibyte: also improves
  - inspect_mixed (`"Hello こんにちは "` × 100): 1.71x faster
  - inspect_utf8_long (`"こんにちは世界！"` × 100): 1.17x faster
- Non-UTF-8 encodings (inspect_eucjp_long / sjis_long / utf16le_long, Japanese re-encoded): noise (±3%, generic path unchanged, no regression)

Slide 17

Slide 17 text

Takeaway

- Ruby's CSI is not a heavy abstraction: the encoding layer itself is cheap
- But there's still room to bolt UTF-8 / ASCII fast paths on top of CSI: `inspect` gets up to 10x faster
- The same trick should work anywhere per-char indirect calls still live in the loop