The Three-Encoding Problem

Slide 1

Slide 1 text

The Three-Encoding Problem
Kevin Menard
[email protected]
GitHub: @nirvdrum
Twitter: @nirvdrum
Mastodon: @[email protected]

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

• High-performance implementation of Ruby
• Focuses on peak performance
• Designed to optimize idiomatic Ruby
• Intended to be compatible with CRuby
• Even runs native extensions

Slide 4

Slide 4 text

The Three-Encoding Problem

Slide 5

Slide 5 text

Leveling Up

Slide 6

Slide 6 text

Ruby Strings

Slide 7

Slide 7 text

# encoding: US-ASCII
s = 'abc'

Slide 8

Slide 8 text

# encoding: US-ASCII
s = 'abc'
s.bytes #=> [97, 98, 99]

Slide 9

Slide 9 text

# encoding: US-ASCII
s = 'abc'
s.encoding #=> #<Encoding:US-ASCII>

Slide 10

Slide 10 text

s = 'abc'
s.bytes [97, 98, 99]
s.encoding #<Encoding:US-ASCII>

Slide 11

Slide 11 text

s = 'abc'
s.bytes [97, 98, 99] + s.encoding #<Encoding:US-ASCII> = s.chars ["a", "b", "c"]

Slide 12

Slide 12 text

Encodings

Slide 13

Slide 13 text

Encodings: A Mapping Function
• encoding(bytes) => characters
• Takes in a sequence of bytes
• Outputs a sequence of characters

Slide 14

Slide 14 text

Terminology
• Encoding
• Character
• Code Point
• Code Unit

Slide 15

Slide 15 text

Encodings: Revised
• encoding(code point) => character
• encoding(bytes) => code point
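
The two-step mapping above can be sketched in Ruby by reading the same bytes under two different encodings. This is only an illustration: the byte values and variable names are chosen for the example and are not from the talk.

```ruby
# The same two bytes, interpreted under two different encodings.
raw = [195, 168].pack('C*') # pack('C*') returns a binary string: 0xC3 0xA8

as_utf8   = raw.dup.force_encoding(Encoding::UTF_8)
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

# Step 1: encoding(bytes) => code points
utf8_codepoints   = as_utf8.codepoints   # a single code point, U+00E8
latin1_codepoints = as_latin1.codepoints # two code points, U+00C3 and U+00A8

# Step 2: encoding(code point) => character
utf8_chars   = as_utf8.chars
latin1_chars = as_latin1.chars.map { |c| c.encode(Encoding::UTF_8) }
```

The bytes never change; only the encoding attached to them decides how many code points and characters come out.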

Slide 16

Slide 16 text

# encoding: US-ASCII
s = 'abc'
s.bytes #=> [97, 98, 99]
s.codepoints #=> [97, 98, 99]
s.chars #=> ["a", "b", "c"]

Slide 17

Slide 17 text

A Crash Course in Encoding History

Slide 18

Slide 18 text

American Standard Code for Information Interchange
• Defines 128 “characters” (95 printable); everything fits in 7 bits
• a-z (lowercase)
• A-Z (uppercase)
• 0-9 (digits)
• .,?!:; (punctuation)
• $%@# (symbols)
• Tab, space, newline, carriage return (white space + control)
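
The counts on this slide can be checked with a few lines of Ruby. This is a rough sketch; the variable names are made up for the illustration.

```ruby
# Every ASCII code point fits in 7 bits: 0..127.
ascii_range = (0..0x7F)
all_seven_bit = ascii_range.all? { |code| code < 2**7 }

# Printable characters run from space (0x20) through tilde (0x7E).
printable = (0x20..0x7E).map { |code| code.chr(Encoding::US_ASCII) }

# The rest are control characters: 0x00..0x1F plus DEL (0x7F).
control_count = ascii_range.size - printable.size
```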

Slide 19

Slide 19 text

ASCII in Ruby
# encoding: US-ASCII
s = 'abc'
s.encoding #=> #<Encoding:US-ASCII>
s.encoding.ascii_compatible? #=> true
s.ascii_only? #=> true

Slide 20

Slide 20 text

Moving Beyond ASCII
# encoding: UTF-8
s = 'très'
s.bytes #=> [116, 114, 195, 168, 115]
s.codepoints #=> [116, 114, 232, 115]
s.chars #=> ["t", "r", "è", "s"]

Slide 21

Slide 21 text

Unicode
• Popular encoding system
• Supports 150K+ characters in versioned releases
• Includes writing systems for many world languages plus things like emoji
• The 128 ASCII code points map to the same code point values in Unicode
• Multiple ways to represent the same characters (non-normalized)
• Writing systems are not unambiguously resolvable, so there are cultural differences and disagreements that Unicode resolves unsatisfactorily for some
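
The "multiple ways to represent the same characters" point is easy to see with the è from the earlier example. A minimal sketch, assuming Ruby's built-in String#unicode_normalize and the NFC form as the normalization target:

```ruby
precomposed = "\u00E8"  # è as a single code point (U+00E8)
decomposed  = "e\u0300" # e followed by a combining grave accent (U+0300)

# The two strings render the same but are not equal byte-for-byte.
same_before = (precomposed == decomposed)

# Normalizing to NFC composes the pair into the single code point.
normalized = decomposed.unicode_normalize(:nfc)
same_after = (precomposed == normalized)
```

Ruby does not normalize on its own, so any comparison of user-visible text has to decide whether and when to do this.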

Slide 22

Slide 22 text

Unicode Transformation Formats
Name      Code Unit Size  Byte Order     Min Char Size           Max Char Size
UTF-8     8 bits          N/A            1 byte                  4 bytes
UTF-16    16 bits         <BOM>          2 bytes / 1 code unit   4 bytes / 2 code units
UTF-16BE  16 bits         Big Endian     2 bytes                 4 bytes
UTF-16LE  16 bits         Little Endian  2 bytes                 4 bytes
UTF-32    32 bits         <BOM>          4 bytes / 1 code unit   4 bytes / 1 code unit
UTF-32BE  32 bits         Big Endian     4 bytes                 4 bytes
UTF-32LE  32 bits         Little Endian  4 bytes                 4 bytes
Adapted from the Unicode UTF FAQ: https://unicode.org/faq/utf_bom.html
BOM = Byte Order Marker
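
The minimum and maximum sizes in the table can be observed by transcoding one ASCII character and one character outside the Basic Multilingual Plane. This is a sketch: the emoji is an arbitrary pick, and only the explicit little-endian variants are used so that no BOM is involved.

```ruby
ascii_char = 'a'
emoji      = "\u{1F600}" # a code point above U+FFFF

# For each format, record [bytes for 'a', bytes for the emoji].
sizes = %w[UTF-8 UTF-16LE UTF-32LE].to_h do |name|
  [name, [ascii_char.encode(name).bytesize, emoji.encode(name).bytesize]]
end
```

UTF-8 and UTF-16 are variable-width, so the two numbers differ; UTF-32 always spends four bytes.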

Slide 23

Slide 23 text

Ruby Encodings

Slide 24

Slide 24 text

Ruby vs. the World
• Unlike many languages, Ruby does not have a unified internal string encoding
• E.g., JavaScript uses UTF-16 and Python uses UTF-8 for all string storage
• Other languages must convert data into their internal encoding
• Ruby does no such conversion, meaning multiple encodings can be used at the same time
• Consequently, Ruby works very efficiently with legacy data
• By not tying itself implicitly to Unicode, Ruby is more inclusive

Slide 25

Slide 25 text

Encoding.list.size #=> 103

Slide 26

Slide 26 text

ASCII Compatibility
s = 'abc'
us_ascii = s.encode('US-ASCII')
utf_8 = s.encode('UTF-8')
shift_jis = s.encode('Shift_JIS')
us_ascii.codepoints #=> [97, 98, 99]
utf_8.codepoints #=> [97, 98, 99]
shift_jis.codepoints #=> [97, 98, 99]
us_ascii.bytes #=> [97, 98, 99]
utf_8.bytes #=> [97, 98, 99]
shift_jis.bytes #=> [97, 98, 99]

Slide 27

Slide 27 text

Encoding.list.count(&:ascii_compatible?) #=> 90
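
To see which encodings fall on the other side of that count, the list can be partitioned. A small sketch; the exact totals depend on the Ruby version, so no numbers are assumed here.

```ruby
compatible, incompatible = Encoding.list.partition(&:ascii_compatible?)

# Names of the encodings that cannot represent ASCII bytes as-is.
incompatible_names = incompatible.map(&:name)
```

The UTF-16 and UTF-32 families show up in `incompatible_names`.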

Slide 28

Slide 28 text

ASCII Incompatibility
s = 'abc'
us_ascii = s.encode('US-ASCII')
utf_16le = s.encode('UTF-16LE')
utf_32le = s.encode('UTF-32LE')
us_ascii.codepoints #=> [97, 98, 99]
utf_16le.codepoints #=> [97, 98, 99]
utf_32le.codepoints #=> [97, 98, 99]
us_ascii.bytes #=> [97, 98, 99]
utf_16le.bytes #=> [97, 0, 98, 0, 99, 0]
utf_32le.bytes #=> [97, 0, 0, 0, 98, 0, 0, 0, 99, 0, 0, 0]

Slide 29

Slide 29 text

Transcoding: A Machine for Converting Characters from One Encoding to Another

Slide 30

Slide 30 text

Transcoding
s = 'abc'
s.encoding #=> #<Encoding:UTF-8>
s.bytes #=> [97, 98, 99]
s.encode('US-ASCII').bytes #=> [97, 98, 99]
s.encode('UTF-16LE').bytes #=> [97, 0, 98, 0, 99, 0]

Slide 31

Slide 31 text

Transcoding Error: Undefined Conversion
'très'.encode('US-ASCII')
# -e:1:in `encode': U+00E8 from UTF-8 to US-ASCII (Encoding::UndefinedConversionError)
# from -e:1:in `<main>'

Slide 32

Slide 32 text

Transcoding Error: Invalid Byte Sequence
"très\xe3\x81\xff".encode('UTF-16LE')
# -e:1:in `encode': "\xE3\x81" followed by "\xFF" on UTF-8 (Encoding::InvalidByteSequenceError)
# from -e:1:in `<main>'

Slide 33

Slide 33 text

Handling Transcoding Errors
• Encoding::UndefinedConversionError:
'très'.encode('US-ASCII', undef: :replace) #=> "tr?s"
• Encoding::InvalidByteSequenceError:
"très\xe3\x81\xff".encode('UTF-16LE', invalid: :replace) #=> "tr\u00E8s\uFFFD\uFFFD" (très��)

Slide 34

Slide 34 text

Encoding Override: Don’t Do This at Home… (Please)
s = 'très'
s.bytes #=> [116, 114, 195, 168, 115]
s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"
s.bytes #=> [116, 114, 195, 168, 115]
s.force_encoding('UTF-8')
s.bytes #=> [116, 114, 195, 168, 115]

Slide 35

Slide 35 text

Broken Strings
s = 'très'
s.encoding #=> #<Encoding:UTF-8>
s.size #=> 4
s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"
s.encoding #=> #<Encoding:US-ASCII>
s.valid_encoding? #=> false
s.size #=> 5

Slide 36

Slide 36 text

Binary Strings
• Ruby strings = byte arrays + encoding
• Clever idea… let’s make an encoding that doesn’t map to any characters
• ASCII-8BIT is born
• Aliased as BINARY, which is more descriptive, but not the default name
• Binary strings are ASCII-compatible but probably shouldn’t be
• Can never be broken; any byte value is valid in an arbitrary byte array
• Useful when reading from I/O or other data source where data is incomplete or unknown
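
A few of these properties can be checked directly. The byte values below are arbitrary and the variable names are only for illustration.

```ruby
bytes = [0xFF, 0xFE, 0x00, 0x80].pack('C*') # pack('C*') yields a binary string

# Any byte sequence is valid as BINARY...
binary_valid = bytes.valid_encoding?

# ...while the same bytes are broken when labeled as UTF-8.
utf8_valid = bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?

# BINARY is an alias for the same encoding object as ASCII-8BIT.
same_object = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

# And it still reports itself as ASCII-compatible.
ascii_compatible = Encoding::BINARY.ascii_compatible?
```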

Slide 37

Slide 37 text

ASCII-8BIT / BINARY Example
require 'stringio'
s = 'façade'
io = StringIO.new(s)
buf = io.read
buf.encoding #=> #<Encoding:UTF-8>
io.rewind
buf = io.read(s.bytesize)
buf.encoding #=> #<Encoding:ASCII-8BIT>

Slide 38

Slide 38 text

ASCII-8BIT / BINARY: Caveat Utilitor
require 'stringio'
CHUNK_SIZE = 4
s = '_façade'
buf = String.new
StringIO.open(s) do |io|
  until io.eof?
    chunk = io.read(CHUNK_SIZE) # Chunk is ASCII-8BIT / BINARY.
    yield chunk # Hopefully, callee knows it's not given a real string.
  end
end
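
Why the callee needs to know: with a 4-byte chunk, the two-byte "ç" is split across chunks. The sketch below collects the chunks instead of yielding them, then reassembles the bytes before relabeling them as UTF-8. `CHUNK_BYTES` and the other names are hypothetical, and joining before calling force_encoding is one possible approach rather than the talk's recommendation.

```ruby
require 'stringio'

CHUNK_BYTES = 4 # hypothetical chunk size; small enough to split the "ç"
source = "_fa\u00E7ade"

chunks = []
StringIO.open(source) do |io|
  chunks << io.read(CHUNK_BYTES) until io.eof?
end

# The first chunk ends with half of "ç", so it is not valid UTF-8 by itself.
first_as_utf8 = chunks.first.dup.force_encoding(Encoding::UTF_8)

# Joining the binary chunks first restores the complete byte sequence.
rebuilt = chunks.join.force_encoding(Encoding::UTF_8)
```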

Slide 39

Slide 39 text

String Parlor Trick

Slide 40

Slide 40 text

Let’s Create Some Strings
s1 = ''
s2 = String.new

Slide 41

Slide 41 text

Are They Equal?
s1 == s2 #=> true

Slide 42

Slide 42 text

Let’s Proceed to Write Some Bytes
s1 << 0x80
s2 << 0x80

Slide 43

Slide 43 text

Still Equal?
s1 == s2 #=> false

Slide 44

Slide 44 text

Existential Dread
s1.bytes #=> [194, 128]
s2.bytes #=> [128]

Slide 45

Slide 45 text

Et Tu, Brute?
''.encoding #=> #<Encoding:UTF-8>
String.new.encoding #=> #<Encoding:ASCII-8BIT>
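
One way to sidestep the surprise, offered as a suggestion rather than something from the slides, is to pass the encoding to String.new explicitly. The variable names are illustrative.

```ruby
default  = String.new                            # ASCII-8BIT / BINARY
explicit = String.new(encoding: Encoding::UTF_8) # UTF-8, requested explicitly

# With an Integer argument, << appends a code point in the receiver's encoding.
default  << 0x80 # appends the single byte 0x80
explicit << 0x80 # appends U+0080, which UTF-8 stores as two bytes
```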

Slide 46

Slide 46 text

Encoding Compatibility & Negotiation

Slide 47

Slide 47 text

Encoding Compatibility
• Since Ruby doesn’t normalize encodings, it needs to handle strings with different encodings interacting
• Forcing the user to deal with that is not in the spirit of Ruby
• Ruby’s notion of compatibility is fairly complex
• 90% of encodings are ASCII-compatible so ASCII data is trivially compatible

Slide 48

Slide 48 text

Encoding.compatible?(obj1, obj2)
• Checks the compatibility of two objects.
• If the objects are both strings, they are compatible when they are concatenatable.
• The encoding of the concatenated string will be returned if they are compatible, nil if they are not.

Slide 49

Slide 49 text

Encoding.compatible? with Encodings
US_ASCII = Encoding::US_ASCII
UTF_8 = Encoding::UTF_8
BINARY = Encoding::BINARY
# Compatible encodings:
Encoding.compatible?(US_ASCII, UTF_8) #=> #<Encoding:UTF-8>
Encoding.compatible?(UTF_8, US_ASCII) #=> #<Encoding:UTF-8>
# Incompatible encodings:
Encoding.compatible?(UTF_8, BINARY) #=> nil
Encoding.compatible?(BINARY, UTF_8) #=> nil

Slide 50

Slide 50 text

Encoding.compatible? with Strings
us_ascii = 'abc'.encode('US-ASCII')
utf_8 = 'abc'.encode('UTF-8')
binary = 'abc'.encode('BINARY')
# Argument order matters.
Encoding.compatible?(us_ascii, utf_8) #=> #<Encoding:US-ASCII>
Encoding.compatible?(utf_8, us_ascii) #=> #<Encoding:UTF-8>

Slide 51

Slide 51 text

Encoding.compatible? with Strings
us_ascii = 'abc'.encode('US-ASCII')
utf_8 = 'abc'.encode('UTF-8')
binary = 'abc'.encode('BINARY')
# Strings can be compatible even if their
# encodings aren't.
Encoding.compatible?(utf_8, binary) #=> #<Encoding:UTF-8>
Encoding.compatible?(utf_8.encoding, binary.encoding) #=> nil

Slide 52

Slide 52 text

Encoding.compatible? with Strings
utf_8 = 'abc'.encode('UTF-8')
utf_16le = 'abc'.encode('UTF-16LE')
# Incompatible encodings return nil.
Encoding.compatible?(utf_8, utf_16le) #=> nil

Slide 53

Slide 53 text

Encoding.compatible? with Strings
utf_8 = 'abc'.encode('UTF-8')
utf_16le = 'abc'.encode('UTF-16LE')
# Incompatible encodings return nil.
# ... unless one of the strings is empty.
Encoding.compatible?(utf_8.dup.clear, utf_16le) #=> #<Encoding:UTF-16LE>
Encoding.compatible?(utf_8, utf_16le.dup.clear) #=> #<Encoding:UTF-8>
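
What this negotiation means in practice shows up at concatenation time. A small sketch with illustrative variable names: a compatible pair yields a string in the negotiated encoding, and an incompatible pair raises Encoding::CompatibilityError.

```ruby
utf_8        = "tr\u00E8s"                     # UTF-8 with a non-ASCII character
utf_16le     = 'abc'.encode(Encoding::UTF_16LE)
ascii_binary = 'abc'.b                         # BINARY, but ASCII-only content

# Compatible: the result takes the negotiated encoding.
negotiated = Encoding.compatible?(utf_8, ascii_binary)
joined     = utf_8 + ascii_binary

# Incompatible: Ruby raises instead of guessing.
error =
  begin
    utf_8 + utf_16le
    nil
  rescue Encoding::CompatibilityError => e
    e
  end
```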

Slide 54

Slide 54 text

Encodings: Not Just for Strings

Slide 55

Slide 55 text

Other Objects with Encodings
• Symbols
• Regexps
• IO
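
Regexps and IO are not shown in code on the slides, so here is a short sketch of both. It assumes a writable temp directory; the file name prefix and the Latin-1 sample bytes are made up for the example.

```ruby
require 'tempfile'

# Regexps carry an encoding, much like symbols do.
ascii_regexp   = /abc/
unicode_regexp = Regexp.new("tr\u00E8s")

# IO objects have an external encoding (the bytes on disk) and, optionally,
# an internal encoding that data is transcoded to when read.
latin1_bytes = [116, 114, 232, 115].pack('C*') # "très" in ISO-8859-1

text = nil
Tempfile.create('encoding-demo') do |file|
  file.binmode
  file.write(latin1_bytes)
  file.flush
  # 'r:EXTERNAL:INTERNAL' asks Ruby to transcode while reading.
  text = File.open(file.path, 'r:ISO-8859-1:UTF-8', &:read)
end
```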

Slide 56

Slide 56 text

s = 'très'
s.encoding #=> Encoding::UTF_8
s.to_sym.encoding #=> Encoding::UTF_8
s.to_sym.to_s.encoding #=> Encoding::UTF_8

Slide 57

Slide 57 text

s = 'abc'
s.encoding #=> Encoding::UTF_8
s.to_sym.encoding #=> Encoding::US_ASCII
s.to_sym.to_s.encoding #=> Encoding::US_ASCII

Slide 58

Slide 58 text

Numerics, Too!
3.to_s.encoding #=> #<Encoding:US-ASCII>
3.5.to_s.encoding #=> #<Encoding:US-ASCII>

Slide 59

Slide 59 text

How many encodings does your application use?

Slide 60

Slide 60 text

The Three-Encoding Problem
• Default string encoding: UTF-8
• Symbols and Numerics converting to string: US-ASCII
• Working with IO: ASCII-8BIT / BINARY
• Regexps also have nuanced encoding interactions not discussed
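
Putting the three sources side by side makes the problem visible in a few lines. This is a sketch: it assumes the snippet is saved as a file with Ruby's default UTF-8 source encoding, and the variable names are illustrative.

```ruby
require 'stringio'

literal  = 'abc'                       # string literal: source encoding
from_sym = :abc.to_s                   # symbol converted to a string
from_num = 42.to_s                     # numeric converted to a string
from_io  = StringIO.new('abc').read(3) # length-limited read from IO

encodings = [literal, from_sym, from_num, from_io].map(&:encoding).uniq
```

All four strings hold nothing but ASCII, yet `encodings` contains more than one entry.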

Slide 61

Slide 61 text

Performance Impact
• Encoding negotiation is very cheap when encodings are the same
• Otherwise, deriving the resulting encoding runs through a convoluted list of rules
• Ruby optimizes based on encoding properties
• Fixed-width encodings like US-ASCII are faster than variable-width encodings
• ASCII-compatible encodings can be faster if string is ASCII only
• Ruby caches discovered properties of a string in something called a “code range”
• Notes if string is ASCII only or if it’s broken
• Cache must be conservative, since it affects results
• If unknown, must perform full linear scan through string
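
The code range itself is internal and not exposed to Ruby code, but the two facts it records are visible through ordinary String methods. A rough sketch with made-up sample strings:

```ruby
ascii   = 'abc'
unicode = "tr\u00E8s"
broken  = [116, 114, 195].pack('C*').force_encoding(Encoding::UTF_8) # truncated "è"

# ascii_only? and valid_encoding? answer the questions the code range caches.
summary = [ascii, unicode, broken].map do |s|
  { ascii_only: s.ascii_only?, valid: s.valid_encoding? }
end
```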

Slide 62

Slide 62 text

Behavioral Impact
• All parties must agree on encoding, including any external ones
• Databases need to agree on both encoding and collation
• Paths that usually succeed could suddenly fail when non-ASCII data appears
• If working with user-generated data, you need to plan for non-ASCII
• Incorrectly dealing with encodings could corrupt or lose data
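
One way to plan for bad input, offered as a sketch rather than a recommendation from the talk, is to validate at the boundary and choose explicitly what happens to bytes that do not decode. String#scrub is used here; the input bytes and the "?" replacement are arbitrary.

```ruby
# User-supplied bytes: "très" in UTF-8 followed by a stray 0xFF byte.
input = [116, 114, 195, 168, 115, 255].pack('C*').force_encoding(Encoding::UTF_8)

# Check before trusting the data...
valid = input.valid_encoding?

# ...and replace whatever does not decode. Note that this discards data.
cleaned = input.scrub('?')
```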

Slide 63

Slide 63 text

Ruby 3.2 Changes

Slide 64

Slide 64 text

Encoding-related Performance Improvements
• String code range is derived rather than scanned in more places now
• Avoids visiting every byte in a string
• Reduces Copy on Write faults
• String concatenation is now faster
• Operations now optimize for most common encodings, not just encoding properties
• Even more room for optimization!

Slide 65

Slide 65 text

We made it!

Slide 66

Slide 66 text

What We’ve Learned
• What encodings are and how Ruby uses them
• Ruby’s various controls for setting encodings on different objects
• A bit of history of how we got where we are
• How encodings can impact performance and alter behavior

Slide 67

Slide 67 text

Helpful Resources
• Code Ranges: A Deeper Look at Ruby Strings
• https://shopify.engineering/code-ranges-ruby-strings
• PR to preserve code range when resizing without truncation
• https://github.com/ruby/ruby/pull/6178
• PR to reduce encoding negotiation for common string concat cases
• https://github.com/ruby/ruby/pull/6120
• PR to cheaply derive code range for String#b calls
• https://github.com/ruby/ruby/pull/6183

Slide 68

Slide 68 text

Thank you for your time
Kevin Menard
[email protected]
Twitter: @nirvdrum
GitHub: @nirvdrum
Mastodon: @[email protected]

Slide 69

Slide 69 text

Image Licenses
• Ruby logo: © 2006, Yukihiro Matsumoto.
• Licensed under CC BY-SA 2.5: https://creativecommons.org/licenses/by-sa/2.5/
• TruffleRuby logo: © 2017 Talkdesk, Inc.
• Licensed under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
• Rails logo is in the public domain
• CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
• YJIT logo: © 2021, Shopify, Inc.
• Tapioca logo: © Shopify, Inc.
• Sorbet logo: © Stripe
• “2021” picture: © 2021, Matthew Henry https://burst.shopify.com/photos/the-year-2021-in-black-ink
• All other images generated by DALL-E 2 with prompts by Kevin Menard © 2022