The Three-Encoding Problem

Kevin Menard kevin.menard@shopify.com GitHub: @nirvdrum — Twitter: @nirvdrum Mastodon: @nirvdrum@ruby.social

• High performance implementation of Ruby
• Focuses on peak performance
• Designed to optimize idiomatic Ruby
• Intended to be compatible with CRuby
• Even runs native extensions

Leveling Up

Ruby Strings

    # encoding: US-ASCII
    s = 'abc'

    s.chars    #=> ["a", "b", "c"]
    s.bytes    #=> [97, 98, 99]
    s.encoding #=> #<Encoding:US-ASCII>

Encodings

Encodings: A Mapping Function
• encoding(bytes) => characters
• Takes in a sequence of bytes
• Outputs a sequence of characters

Terminology
• Encoding
• Character
• Code Point
• Code Unit

Encodings: Revised
• encoding(code point) => character
• encoding(bytes) => code point

    # encoding: US-ASCII
    s = 'abc'
    s.bytes      #=> [97, 98, 99]
    s.codepoints #=> [97, 98, 99]
    s.chars      #=> ["a", "b", "c"]
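
The two-step mapping can be sketched in Ruby itself. This is only an illustration: the helper names `decode_codepoints` and `render_chars` are hypothetical, and UTF-8 is assumed as the encoding.

```ruby
# Hypothetical helpers illustrating the two-step mapping; UTF-8 is assumed.

# Step 1: encoding(bytes) => code points
def decode_codepoints(bytes)
  bytes.pack('C*').force_encoding(Encoding::UTF_8).codepoints
end

# Step 2: encoding(code point) => character
def render_chars(codepoints)
  codepoints.map { |codepoint| codepoint.chr(Encoding::UTF_8) }
end

bytes      = [116, 114, 195, 168, 115]
codepoints = decode_codepoints(bytes) # [116, 114, 232, 115]
chars      = render_chars(codepoints) # ["t", "r", "è", "s"]
```

Five bytes become four code points because the pair 195, 168 decodes to the single code point 232.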

A Crash Course in Encoding History

American Standard Code for Information Interchange
• Defines 128 “characters” (95 printable); everything fits in 7 bits
• a-z (lowercase)
• A-Z (uppercase)
• 0-9 (digits)
• .,?!:; (punctuation)
• $%@# (symbols)
• Tab, space, newline, carriage return (white space + control)

ASCII in Ruby

    # encoding: US-ASCII
    s = 'abc'
    s.encoding                   #=> #<Encoding:US-ASCII>
    s.encoding.ascii_compatible? #=> true
    s.ascii_only?                #=> true
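
The 7-bit claim is easy to check by hand. A minimal sketch, assuming a hypothetical helper named `seven_bit?` that mirrors what `ascii_only?` reports for strings in ASCII-compatible encodings:

```ruby
# Hypothetical helper: true when every byte fits in 7 bits (0..127).
def seven_bit?(str)
  str.bytes.all? { |byte| byte < 128 }
end

# Printable ASCII runs from space (32) through tilde (126).
printable = (0..127).select { |byte| byte.between?(32, 126) }
printable.size                          #=> 95

seven_bit?('abc')                       #=> true
seven_bit?('abc') == 'abc'.ascii_only?  #=> true
```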

Moving Beyond ASCII

    # encoding: UTF-8
    s = 'très'
    s.bytes      #=> [116, 114, 195, 168, 115]
    s.codepoints #=> [116, 114, 232, 115]
    s.chars      #=> ["t", "r", "è", "s"]

Unicode
• Popular encoding system
• Supports 150K+ characters in versioned releases
• Includes writing systems for many world languages plus things like emoji
• The 128 ASCII code points map to the same code point values in Unicode
• Multiple ways to represent the same characters (non-normalized)
• Writing systems are not unambiguously resolvable, so there are cultural differences and disagreements that Unicode resolves unsatisfactorily for some
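
The "multiple ways to represent the same characters" point can be seen with è. A small sketch using `String#unicode_normalize` from Ruby's standard library; the variable names are illustrative only:

```ruby
composed   = "\u00E8"  # è as a single code point
decomposed = "e\u0300" # e followed by a combining grave accent

composed.codepoints                            #=> [232]
decomposed.codepoints                          #=> [101, 768]
composed == decomposed                         #=> false

# Normalizing to NFC composes the pair into the single code point.
decomposed.unicode_normalize(:nfc) == composed #=> true
```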

Unicode Transformation Formats

Name      Code Unit Size  Byte Order     Min Char Size          Max Char Size
UTF-8     8 bits          N/A            1 byte                 4 bytes
UTF-16    16 bits         <BOM>          2 bytes / 1 code unit  4 bytes / 2 code units
UTF-16BE  16 bits         Big Endian     2 bytes                4 bytes
UTF-16LE  16 bits         Little Endian  2 bytes                4 bytes
UTF-32    32 bits         <BOM>          4 bytes / 1 code unit  4 bytes / 1 code unit
UTF-32BE  32 bits         Big Endian     4 bytes                4 bytes
UTF-32LE  32 bits         Little Endian  4 bytes                4 bytes

BOM = Byte Order Mark
Adapted from the Unicode UTF FAQ: https://unicode.org/faq/utf_bom.html
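
The minimum and maximum sizes in the table can be spot-checked from Ruby. A minimal sketch, assuming a hypothetical helper `utf_byte_sizes`; it only covers the BOM-free variants so that no byte order mark inflates the counts:

```ruby
# Hypothetical helper: byte size of one character in several UTFs.
def utf_byte_sizes(char)
  %w[UTF-8 UTF-16LE UTF-32LE].to_h do |name|
    [name, char.encode(name).bytesize]
  end
end

utf_byte_sizes('a')         # UTF-8: 1, UTF-16LE: 2, UTF-32LE: 4
utf_byte_sizes("\u{1F600}") # UTF-8: 4, UTF-16LE: 4, UTF-32LE: 4
```

The emoji U+1F600 needs two 16-bit code units in UTF-16, which is where the 4-byte maximum comes from.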

Ruby Encodings

Ruby VS the World
• Unlike many languages, Ruby does not have a unified internal string encoding
• E.g., JavaScript strings are UTF-16 and Python strings are always Unicode
• Other languages must convert data into their internal encoding
• Ruby does no such conversion, meaning multiple encodings can be used at the same time
• Consequently, Ruby works very efficiently with legacy data
• By not tying itself implicitly to Unicode, Ruby is more inclusive

Encoding.list.size #=> 103

ASCII Compatibility

    s = 'abc'
    us_ascii  = s.encode('US-ASCII')
    utf_8     = s.encode('UTF-8')
    shift_jis = s.encode('Shift_JIS')

    us_ascii.codepoints  #=> [97, 98, 99]
    utf_8.codepoints     #=> [97, 98, 99]
    shift_jis.codepoints #=> [97, 98, 99]

    us_ascii.bytes  #=> [97, 98, 99]
    utf_8.bytes     #=> [97, 98, 99]
    shift_jis.bytes #=> [97, 98, 99]

Encoding.list.count(&:ascii_compatible?) #=> 90

ASCII Incompatibility

    s = 'abc'
    us_ascii = s.encode('US-ASCII')
    utf_16le = s.encode('UTF-16LE')
    utf_32le = s.encode('UTF-32LE')

    us_ascii.codepoints #=> [97, 98, 99]
    utf_16le.codepoints #=> [97, 98, 99]
    utf_32le.codepoints #=> [97, 98, 99]

    us_ascii.bytes #=> [97, 98, 99]
    utf_16le.bytes #=> [97, 0, 98, 0, 99, 0]
    utf_32le.bytes #=> [97, 0, 0, 0, 98, 0, 0, 0, 99, 0, 0, 0]
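
One way to see which encodings fall on the incompatible side is to filter `Encoding.list`. A short sketch; the exact list depends on the Ruby build, so only two well-known members are checked:

```ruby
incompatible = Encoding.list.reject(&:ascii_compatible?)

incompatible.include?(Encoding::UTF_16LE) #=> true
incompatible.include?(Encoding::UTF_8)    #=> false

# In a non-ASCII-compatible encoding, even plain ASCII text
# is not reported as "ASCII only".
'abc'.encode('UTF-16LE').ascii_only?      #=> false
```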

Transcoding: A Machine for Converting Characters from One Encoding to Another

Transcoding

    s = 'abc'
    s.encoding #=> #<Encoding:UTF-8>
    s.bytes    #=> [97, 98, 99]

    s.encode('US-ASCII').bytes #=> [97, 98, 99]
    s.encode('UTF-16LE').bytes #=> [97, 0, 98, 0, 99, 0]
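
Transcoding changes the bytes but keeps the code points. A minimal sketch, assuming a hypothetical helper `same_text?` and assuming both strings are valid and convertible to UTF-8:

```ruby
# Hypothetical helper: compare two strings by their characters, not their bytes.
def same_text?(left, right)
  left.encode(Encoding::UTF_8) == right.encode(Encoding::UTF_8)
end

utf_8    = "tr\u00E8s"
utf_16le = utf_8.encode('UTF-16LE')

utf_8.bytes == utf_16le.bytes           #=> false
utf_8.codepoints == utf_16le.codepoints #=> true
utf_8 == utf_16le                       #=> false
same_text?(utf_8, utf_16le)             #=> true
```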

Transcoding Error: Undefined Conversion

    'très'.encode('US-ASCII')
    # -e:1:in `encode': U+00E8 from UTF-8 to US-ASCII (Encoding::UndefinedConversionError)
    #   from -e:1:in `<main>'

Transcoding Error: Invalid Byte Sequence

    "très\xe3\x81\xff".encode('UTF-16LE')
    # -e:1:in `encode': "\xE3\x81" followed by "\xFF" on UTF-8 (Encoding::InvalidByteSequenceError)
    #   from -e:1:in `<main>'

Handling Transcoding Errors
• Encoding::UndefinedConversionError:

    'très'.encode('US-ASCII', undef: :replace)
    #=> "tr?s"

• Encoding::InvalidByteSequenceError:

    "très\xe3\x81\xff".encode('UTF-16LE', invalid: :replace)
    #=> "tr\u00E8s\uFFFD\uFFFD" (très��)
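
Both options can be combined, or the errors can be rescued. A minimal sketch; `try_encode` and `to_ascii_lossy` are hypothetical helper names, and the `?` replacement is an arbitrary choice:

```ruby
# Hypothetical helper: transcode, returning nil instead of raising
# on the two errors discussed here.
def try_encode(str, encoding)
  str.encode(encoding)
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  nil
end

# Hypothetical helper: lossy conversion that replaces what it cannot convert.
def to_ascii_lossy(str)
  str.encode('US-ASCII', invalid: :replace, undef: :replace, replace: '?')
end

try_encode("tr\u00E8s", 'US-ASCII') #=> nil
to_ascii_lossy("tr\u00E8s")         #=> "tr?s"
```

Replacement loses data, so rescuing and surfacing the failure is often the safer default.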

Encoding Override: Don’t Do This at Home… (Please)

    s = 'très'
    s.bytes #=> [116, 114, 195, 168, 115]

    s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"
    s.bytes #=> [116, 114, 195, 168, 115]

    s.force_encoding('UTF-8')
    s.bytes #=> [116, 114, 195, 168, 115]

Broken Strings

    s = 'très'
    s.encoding #=> #<Encoding:UTF-8>
    s.size     #=> 4

    s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"
    s.encoding        #=> #<Encoding:US-ASCII>
    s.valid_encoding? #=> false
    s.size            #=> 5
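
A broken string can be detected with `valid_encoding?` before it causes trouble downstream. A short sketch of checking, relabeling, and scrubbing; it assumes the original encoding is known to be UTF-8:

```ruby
s = "tr\u00E8s".dup          # UTF-8, 4 characters, 5 bytes
s.force_encoding('US-ASCII') # same bytes, wrong label
s.valid_encoding?            #=> false

# Relabeling is safe only when the original encoding is known.
s.force_encoding('UTF-8')
s.valid_encoding?            #=> true
s.size                       #=> 4

# When bytes really are damaged, String#scrub swaps them for a replacement.
"abc\xFF".scrub('?')         #=> "abc?"
```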

Binary Strings
• Ruby strings = byte arrays + encoding
• Clever idea… let’s make an encoding that doesn’t map to any characters
• ASCII-8BIT is born
• Aliased as BINARY, which is more descriptive, but not the default name
• Binary strings are ASCII-compatible but probably shouldn’t be
• Can never be broken; any byte value is valid in an arbitrary byte array
• Useful when reading from I/O or another data source where data is incomplete or unknown

ASCII-8BIT / BINARY Example

    require 'stringio'

    s = 'façade'
    io = StringIO.new(s)

    buf = io.read
    buf.encoding #=> #<Encoding:UTF-8>

    io.rewind
    buf = io.read(s.bytesize)
    buf.encoding #=> #<Encoding:ASCII-8BIT>

ASCII-8BIT / BINARY: Caveat Utilitor

    require 'stringio'

    CHUNK_SIZE = 4
    s = '_façade'
    buf = String.new

    StringIO.open(s) do |io|
      until io.eof?
        chunk = io.read(CHUNK_SIZE) # Chunk is ASCII-8BIT / BINARY.
        yield chunk # Hopefully, callee knows it's not given a real string.
      end
    end

String Parlor Trick

Let’s Create Some Strings

    s1 = ''
    s2 = String.new

Are They Equal?

    s1 == s2 #=> true

Let’s Proceed to Write Some Bytes

    s1 << 0x80
    s2 << 0x80

Still Equal?

    s1 == s2 #=> false

Existential Dread

    s1.bytes #=> [194, 128]
    s2.bytes #=> [128]

Et Tu, Brute?

    ''.encoding         #=> #<Encoding:UTF-8>
    String.new.encoding #=> #<Encoding:ASCII-8BIT>
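
The trick disappears once the encoding is stated explicitly. A minimal sketch using the `encoding:` keyword of `String.new`; the variable names are illustrative:

```ruby
utf_8  = String.new(encoding: Encoding::UTF_8) # empty, explicitly UTF-8
binary = String.new                            # empty, ASCII-8BIT by default

utf_8 << 0x80  # appends the code point U+0080, encoded as two bytes
binary << 0x80 # appends the single byte 0x80

utf_8.bytes  #=> [194, 128]
binary.bytes #=> [128]
```

Appending an Integer means "append this code point" in UTF-8 and "append this byte" in a binary string.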

Encoding Compatibility & Negotiation

Encoding Compatibility
• Since Ruby doesn’t normalize encodings, it needs to handle strings with different encodings interacting
• Forcing the user to deal with that is not in the spirit of Ruby
• Ruby’s notion of compatibility is fairly complex
• Most encodings (90 of 103) are ASCII-compatible, so ASCII data is trivially compatible

Encoding.compatible?(obj1, obj2)
• Checks the compatibility of two objects.
• If the objects are both strings, they are compatible when they are concatenatable.
• The encoding of the concatenated string will be returned if they are compatible, nil if they are not.

Encoding.compatible? with Encodings

    US_ASCII = Encoding::US_ASCII
    UTF_8    = Encoding::UTF_8
    BINARY   = Encoding::BINARY

    # Compatible encodings:
    Encoding.compatible?(US_ASCII, UTF_8) #=> #<Encoding:UTF-8>
    Encoding.compatible?(UTF_8, US_ASCII) #=> #<Encoding:UTF-8>

    # Incompatible encodings:
    Encoding.compatible?(UTF_8, BINARY) #=> nil
    Encoding.compatible?(BINARY, UTF_8) #=> nil

Encoding.compatible? with Strings

    us_ascii = 'abc'.encode('US-ASCII')
    utf_8    = 'abc'.encode('UTF-8')
    binary   = 'abc'.encode('BINARY')

    # Argument order matters.
    Encoding.compatible?(us_ascii, utf_8) #=> #<Encoding:US-ASCII>
    Encoding.compatible?(utf_8, us_ascii) #=> #<Encoding:UTF-8>

Encoding.compatible? with Strings

    us_ascii = 'abc'.encode('US-ASCII')
    utf_8    = 'abc'.encode('UTF-8')
    binary   = 'abc'.encode('BINARY')

    # Strings can be compatible even if their
    # encodings aren't.
    Encoding.compatible?(utf_8, binary) #=> #<Encoding:UTF-8>
    Encoding.compatible?(utf_8.encoding, binary.encoding) #=> nil

Encoding.compatible? with Strings

    utf_8    = 'abc'.encode('UTF-8')
    utf_16le = 'abc'.encode('UTF-16LE')

    # Incompatible encodings return nil.
    Encoding.compatible?(utf_8, utf_16le) #=> nil

    # ... unless one of the strings is empty.
    Encoding.compatible?(utf_8.dup.clear, utf_16le) #=> #<Encoding:UTF-16LE>
    Encoding.compatible?(utf_8, utf_16le.dup.clear) #=> #<Encoding:UTF-8>
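
Because compatibility is defined in terms of concatenation, `Encoding.compatible?` can guard a concatenation. A minimal sketch, assuming a hypothetical helper `concat_text` and assuming the right-hand string can be transcoded into the left-hand encoding:

```ruby
# Hypothetical helper: concatenate, transcoding the right side when needed.
def concat_text(left, right)
  return left + right if Encoding.compatible?(left, right)

  left + right.encode(left.encoding)
end

utf_8    = "tr\u00E8s"
utf_16le = 'abc'.encode('UTF-16LE')

# utf_8 + utf_16le would raise Encoding::CompatibilityError.
result = concat_text(utf_8, utf_16le)
result.encoding #=> #<Encoding:UTF-8>
```

Transcoding the right side is a design choice; an application might prefer to fail loudly instead.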

Encodings: Not Just for Strings

Other Objects with Encodings
• Symbols
• Regexps
• IO

    s = 'très'
    s.encoding             #=> Encoding::UTF_8
    s.to_sym.encoding      #=> Encoding::UTF_8
    s.to_sym.to_s.encoding #=> Encoding::UTF_8

    s = 'abc'
    s.encoding             #=> Encoding::UTF_8
    s.to_sym.encoding      #=> Encoding::US_ASCII
    s.to_sym.to_s.encoding #=> Encoding::US_ASCII

Numerics, Too!

    3.to_s.encoding   #=> #<Encoding:US-ASCII>
    3.5.to_s.encoding #=> #<Encoding:US-ASCII>

How many encodings does your application use?

The Three-Encoding Problem
• Default string encoding: UTF-8
• Symbols and Numerics converting to string: US-ASCII
• Working with IO: ASCII-8BIT / BINARY
• Regexps also have nuanced encoding interactions not discussed here
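
All three encodings show up in a few lines of ordinary code. A small sketch, assuming the source file uses Ruby's default UTF-8 source encoding; the `samples` hash is illustrative only:

```ruby
require 'stringio'

samples = {
  literal: 'abc',                      # UTF-8 (default source encoding)
  symbol:  :abc.to_s,                  # US-ASCII
  integer: 3.to_s,                     # US-ASCII
  io_read: StringIO.new('abc').read(3) # ASCII-8BIT / BINARY
}

encodings = samples.values.map(&:encoding).uniq
encodings.size #=> 3
```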

Performance Impact
• Encoding negotiation is very cheap when encodings are the same
• Otherwise, deriving the resulting encoding runs through a convoluted list of rules
• Ruby optimizes based on encoding properties
• Fixed-width encodings like US-ASCII are faster than variable-width encodings
• ASCII-compatible encodings can be faster if the string is ASCII only
• Ruby caches discovered properties of a string in something called a “code range”
• Notes if the string is ASCII only or if it’s broken
• Cache must be conservative, since it affects results
• If unknown, must perform a full linear scan through the string
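
The code range is internal to CRuby and not exposed to Ruby code, but its known states can be approximated with public predicates. A rough sketch; the helper `code_range_of` and its symbol names are hypothetical, and each call may trigger the linear scan described above if nothing is cached yet:

```ruby
# Hypothetical helper approximating the states a code range can record.
def code_range_of(str)
  return :seven_bit if str.ascii_only?

  str.valid_encoding? ? :valid : :broken
end

code_range_of('abc')        #=> :seven_bit
code_range_of("tr\u00E8s")  #=> :valid
code_range_of("abc\xFF")    #=> :broken
```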

Behavioral Impact
• All parties must agree on encoding, including any external ones
• Databases need to agree on both encoding and collation
• Paths that usually succeed could suddenly fail when non-ASCII data appears
• If working with user-generated data, you need to plan for non-ASCII
• Incorrectly dealing with encodings could corrupt or lose data
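
The "usually succeeds, suddenly fails" case is easy to reproduce by mixing binary data with user text. A minimal sketch; `concat_ok?` is a hypothetical helper and the sample values are made up for illustration:

```ruby
# Hypothetical helper: reports whether two strings can be concatenated.
def concat_ok?(left, right)
  left + right
  true
rescue Encoding::CompatibilityError
  false
end

binary_prefix = "\xFF".b # binary data containing a high byte

concat_ok?(binary_prefix, 'abc')       #=> true
concat_ok?(binary_prefix, "tr\u00E8s") #=> false
```

The same code path works for every ASCII-only user name and raises on the first one with an accent.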

Ruby 3.2 Changes

Encoding-related Performance Improvements
• String code range is derived rather than scanned in more places now
• Avoids visiting every byte in a string
• Reduces Copy on Write faults
• String concatenation is now faster
• Operations now optimize for most common encodings, not just encoding properties
• Even more room for optimization!

We made it!

What We’ve Learned
• What encodings are and how Ruby uses them
• Ruby’s various controls for setting encodings on different objects
• A bit of history of how we got where we are
• How encodings can impact performance and alter behavior

Helpful Resources
• Code Ranges: A Deeper Look at Ruby Strings
  https://shopify.engineering/code-ranges-ruby-strings
• PR to preserve code range when resizing without truncation
  https://github.com/ruby/ruby/pull/6178
• PR to reduce encoding negotiation for common string concat cases
  https://github.com/ruby/ruby/pull/6120
• PR to cheaply derive code range for String#b calls
  https://github.com/ruby/ruby/pull/6183

Thank you for your time Kevin Menard kevin.menard@shopify.com Twitter: @nirvdrum
GitHub: @nirvdrum Mastodon: @nirvdrum@ruby.social

Image Licenses
• Ruby logo: © 2006, Yukihiro Matsumoto.
  Licensed under CC BY-SA 2.5: https://creativecommons.org/licenses/by-sa/2.5/
• TruffleRuby logo: © 2017 Talkdesk, Inc.
  Licensed under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
• Rails logo is in the public domain
  CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
• YJIT logo: © 2021, Shopify, Inc.
• Tapioca logo: © Shopify, Inc.
• Sorbet logo: © Stripe
• “2021” picture: © 2021, Matthew Henry
  https://burst.shopify.com/photos/the-year-2021-in-black-ink
• All other images generated by DALL-E 2 with prompts by Kevin Menard © 2022
