The Three-Encoding Problem

Kevin Menard
November 16, 2022

You’ve probably heard of UTF-8 and know about strings, but did you know that Ruby supports more than 100 other encodings? In fact, your application probably uses three encodings without you realizing it. Moreover, encodings apply to more than just strings. In this talk, we’ll take a look at Ruby’s fairly unique approach to encodings and better understand the impact they have on the correctness and performance of our applications. We’ll take a look at the rich encoding APIs Ruby provides and by the end of the talk, you won’t just reach for force_encoding when you see an encoding exception.

Transcript

  1. • High performance implementation of Ruby
    • Focuses on peak performance
    • Designed to optimize idiomatic Ruby
    • Intended to be compatible with CRuby
    • Even runs native extensions
  2. Encodings: A Mapping Function
    • encoding(bytes) => characters
    • Takes in a sequence of bytes
    • Outputs a sequence of characters
  3. # encoding: US-ASCII
    s = 'abc'
    s.bytes      #=> [97, 98, 99]
    s.codepoints #=> [97, 98, 99]
    s.chars      #=> ["a", "b", "c"]
  4. American Standard Code for Information Interchange
    • Defines 128 “characters” (95 printable); everything fits in 7 bits
    • a-z (lowercase)
    • A-Z (uppercase)
    • 0-9 (digits)
    • .,?!:; (punctuation)
    • $%@# (symbols)
    • Tab, space, newline, carriage return (white space + control)
  5. ASCII in Ruby
    # encoding: US-ASCII
    s = 'abc'
    s.encoding                   #=> #<Encoding:US-ASCII>
    s.encoding.ascii_compatible? #=> true
    s.ascii_only?                #=> true
  6. Moving Beyond ASCII
    # encoding: UTF-8
    s = 'très'
    s.bytes      #=> [116, 114, 195, 168, 115]
    s.codepoints #=> [116, 114, 232, 115]
    s.chars      #=> ["t", "r", "è", "s"]
  7. Unicode
    • Popular encoding system
    • Supports 150K+ characters in versioned releases
    • Includes writing systems for many world languages plus things like emoji
    • The 128 ASCII code points map to the same code point values in Unicode
    • Multiple ways to represent the same characters (non-normalized)
    • Writing systems are not unambiguously resolvable, so there are cultural differences and disagreements that Unicode solves unsatisfactorily for some
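The "multiple ways to represent the same characters" point can be seen directly in Ruby. This is a minimal sketch, not part of the slides; the helper names `composed_e_grave`, `decomposed_e_grave`, and `same_after_nfc?` are made up for illustration, and `String#unicode_normalize` is the only library call it relies on.

```ruby
# "è" as one precomposed code point (U+00E8).
def composed_e_grave
  "\u00E8"
end

# "è" as "e" followed by a combining grave accent (U+0300).
def decomposed_e_grave
  "e\u0300"
end

# Hypothetical helper: compare two strings after normalizing both to NFC.
def same_after_nfc?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

composed_e_grave.codepoints            #=> [232]
decomposed_e_grave.codepoints          #=> [101, 768]
composed_e_grave == decomposed_e_grave #=> false
same_after_nfc?(composed_e_grave, decomposed_e_grave) #=> true
```

Both strings render the same way, but plain `==` compares bytes, so they only match after normalization.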
  8. Unicode Transformation Formats
    Name      Code Unit Size  Byte Order     Min Char Size          Max Char Size
    UTF-8     8 bits          N/A            1 byte                 4 bytes
    UTF-16    16 bits         <BOM>          2 bytes / 1 code unit  4 bytes / 2 code units
    UTF-16BE  16 bits         Big Endian     2 bytes                4 bytes
    UTF-16LE  16 bits         Little Endian  2 bytes                4 bytes
    UTF-32    32 bits         <BOM>          4 bytes / 1 code unit  4 bytes / 1 code unit
    UTF-32BE  32 bits         Big Endian     4 bytes                4 bytes
    UTF-32LE  32 bits         Little Endian  4 bytes                4 bytes
    Adapted from the Unicode UTF FAQ: https://unicode.org/faq/utf_bom.html
    BOM = Byte Order Marker
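The minimum and maximum character sizes in the table can be checked from Ruby. The sketch below is illustrative only: `UTF_ENCODINGS` and `utf_sizes` are hypothetical names, and the BOM-dependent `UTF-16` and `UTF-32` variants are left out on purpose so that only the character's own bytes are counted.

```ruby
# Endian-specific UTF encodings only; no BOM is written for these.
UTF_ENCODINGS = %w[UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE].freeze

# Hypothetical helper: byte size of one character in each encoding.
def utf_sizes(char)
  UTF_ENCODINGS.to_h { |name| [name, char.encode(name).bytesize] }
end

utf_sizes('a')
#=> {"UTF-8"=>1, "UTF-16LE"=>2, "UTF-16BE"=>2, "UTF-32LE"=>4, "UTF-32BE"=>4}

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
# two code units (a surrogate pair) and UTF-8 needs four bytes.
utf_sizes("\u{1F600}").values #=> [4, 4, 4, 4, 4]
```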
  9. Ruby VS the World
    • Unlike many languages, Ruby does not have a unified internal string encoding
    • E.g., JavaScript uses UTF-16 and Python uses UTF-8 for all string storage
    • Other languages must convert data into their internal encoding
    • Ruby does no such conversion, meaning multiple encodings can be used at the same time
    • Consequently, Ruby works very efficiently with legacy data
    • By not tying itself implicitly to Unicode, Ruby is more inclusive
  10. ASCII Compatibility
    s = 'abc'
    us_ascii = s.encode('US-ASCII')
    utf_8 = s.encode('UTF-8')
    shift_jis = s.encode('Shift_JIS')
    us_ascii.codepoints  #=> [97, 98, 99]
    utf_8.codepoints     #=> [97, 98, 99]
    shift_jis.codepoints #=> [97, 98, 99]
    us_ascii.bytes  #=> [97, 98, 99]
    utf_8.bytes     #=> [97, 98, 99]
    shift_jis.bytes #=> [97, 98, 99]
  11. ASCII Incompatibility
    s = 'abc'
    us_ascii = s.encode('US-ASCII')
    utf_16le = s.encode('UTF-16LE')
    utf_32le = s.encode('UTF-32LE')
    us_ascii.codepoints #=> [97, 98, 99]
    utf_16le.codepoints #=> [97, 98, 99]
    utf_32le.codepoints #=> [97, 98, 99]
    us_ascii.bytes #=> [97, 98, 99]
    utf_16le.bytes #=> [97, 0, 98, 0, 99, 0]
    utf_32le.bytes #=> [97, 0, 0, 0, 98, 0, 0, 0, 99, 0, 0, 0]
  12. Transcoding
    s = 'abc'
    s.encoding #=> #<Encoding:UTF-8>
    s.bytes    #=> [97, 98, 99]
    s.encode('US-ASCII').bytes #=> [97, 98, 99]
    s.encode('UTF-16LE').bytes #=> [97, 0, 98, 0, 99, 0]
  13. Transcoding Error: Undefined Conversion
    'très'.encode('US-ASCII')
    # -e:1:in `encode': U+00E8 from UTF-8 to US-ASCII (Encoding::UndefinedConversionError)
    #   from -e:1:in `<main>'
  14. Transcoding Error: Invalid Byte Sequence
    "très\xe3\x81\xff".encode('UTF-16LE')
    # -e:1:in `encode': "\\xE3\\x81" followed by "\\xFF" on UTF-8 (Encoding::InvalidByteSequenceError)
    #   from -e:1:in `<main>'
  15. Handling Transcoding Errors
    • Encoding::UndefinedConversionError:
    'très'.encode('US-ASCII', undef: :replace) #=> "tr?s"
    • Encoding::InvalidByteSequenceError:
    "très\xe3\x81\xff".encode('UTF-16LE', invalid: :replace) #=> "tr\u00E8s\uFFFD\uFFFD" (très��)
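Beyond `undef: :replace`, `String#encode` also accepts a `fallback:` option that supplies a substitute per character. The following is a sketch under assumptions: the `ASCII_FALLBACKS` table and the `to_ascii` helper are hypothetical, and the single mapping in the table is only an example, not a complete transliteration scheme.

```ruby
# Hypothetical lookup table: characters with an acceptable ASCII stand-in.
ASCII_FALLBACKS = { "\u00E8" => 'e' }.freeze

# Transcode to US-ASCII. Characters with no ASCII equivalent are looked up
# in the table and replaced with "?" when the table has no entry.
def to_ascii(str)
  str.encode('US-ASCII', fallback: ->(char) { ASCII_FALLBACKS.fetch(char, '?') })
end

to_ascii("tr\u00E8s")          #=> "tres"
to_ascii("tr\u00E8s").encoding #=> #<Encoding:US-ASCII>
to_ascii("\u00F1")             #=> "?"
```

The fallback is only consulted for characters that are undefined in the target encoding; invalid byte sequences in the source still need `invalid: :replace`.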
  16. Encoding Override: Don’t Do this at Home… (Please)
    s = 'très'
    s.bytes #=> [116, 114, 195, 168, 115]
    s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"
    s.bytes #=> [116, 114, 195, 168, 115]
    s.force_encoding('UTF-8')
    s.bytes #=> [116, 114, 195, 168, 115]
  17. Broken Strings
    s = 'très'
    s.encoding #=> #<Encoding:UTF-8>
    s.size     #=> 4
    s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"
    s.encoding        #=> #<Encoding:US-ASCII>
    s.valid_encoding? #=> false
    s.size            #=> 5
  18. Binary Strings
    • Ruby strings = byte arrays + encoding
    • Clever idea… let’s make an encoding that doesn’t map to any characters
    • ASCII-8BIT is born
    • Aliased as BINARY, which is more descriptive, but not the default name
    • Binary strings are ASCII-compatible but probably shouldn’t be
    • Can never be broken; any byte value is valid in an arbitrary byte array
    • Useful when reading from I/O or other data source where data is incomplete or unknown
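The "can never be broken" property can be illustrated with a truncated UTF-8 sequence. This is a small sketch rather than anything from the slides; `TRUNCATED_UTF8` and `binary?` are names invented for the example.

```ruby
# "tr" plus only the first byte of the two-byte UTF-8 sequence for "è".
TRUNCATED_UTF8 = "tr\xC3"

# Hypothetical helper: is the string tagged as BINARY (ASCII-8BIT)?
def binary?(str)
  str.encoding == Encoding::BINARY
end

TRUNCATED_UTF8.valid_encoding?   #=> false (broken as UTF-8)

# String#b returns a copy with the same bytes, tagged as BINARY.
as_binary = TRUNCATED_UTF8.b
binary?(as_binary)               #=> true
as_binary.valid_encoding?        #=> true (any byte is valid in BINARY)
as_binary.bytes                  #=> [116, 114, 195]

Encoding::BINARY.ascii_compatible? #=> true
```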
  19. ASCII-8BIT / BINARY Example
    require 'stringio'

    s = 'façade'
    io = StringIO.new(s)
    buf = io.read
    buf.encoding #=> #<Encoding:UTF-8>
    io.rewind
    buf = io.read(s.bytesize)
    buf.encoding #=> #<Encoding:ASCII-8BIT>
  20. ASCII-8BIT / BINARY: Caveat Utilitor
    require 'stringio'

    CHUNK_SIZE = 4
    s = '_façade'
    buf = String.new

    StringIO.open(s) do |io|
      until io.eof?
        chunk = io.read(CHUNK_SIZE) # Chunk is ASCII-8BIT / BINARY.
        yield chunk # Hopefully, callee knows it's not given a real string.
      end
    end
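The caveat is that a fixed-size read can cut a multi-byte character in half. The sketch below, which assumes the data is known to be UTF-8, collects the binary chunks and only tags the result once everything has been read; `read_chunks` and `reassemble` are hypothetical helpers, not part of the talk.

```ruby
require 'stringio'

CHUNK_SIZE = 4

# Read the source in fixed-size pieces; each piece is ASCII-8BIT / BINARY.
def read_chunks(str, chunk_size = CHUNK_SIZE)
  chunks = []
  StringIO.open(str) do |io|
    chunks << io.read(chunk_size) until io.eof?
  end
  chunks
end

# Concatenate the raw bytes first, then tag them with the encoding the
# caller already knows the data to be in (an assumption of this sketch).
def reassemble(chunks, encoding)
  buf = String.new # an empty BINARY string
  chunks.each { |chunk| buf << chunk }
  buf.force_encoding(encoding)
end

chunks = read_chunks("_fa\u00E7ade")
chunks.map(&:bytesize) #=> [4, 4]

# The two bytes of "ç" straddle the chunk boundary, so the first chunk
# on its own is not valid UTF-8.
chunks.first.dup.force_encoding('UTF-8').valid_encoding? #=> false

reassemble(chunks, Encoding::UTF_8) == "_fa\u00E7ade"    #=> true
```

Here `force_encoding` is used only to label bytes whose encoding is already known, which is different from using it to silence an encoding exception.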
  21. Encoding Compatibility
    • Since Ruby doesn’t normalize encodings, it needs to handle strings with different encodings interacting
    • Forcing the user to deal with that is not in the spirit of Ruby
    • Ruby’s notion of compatibility is fairly complex
    • 90% of encodings are ASCII-compatible so ASCII data is trivially compatible
  22. Encoding.compatible?(obj1, obj2)
    • Checks the compatibility of two objects.
    • If the objects are both strings they are compatible when they are concatenatable.
    • The encoding of the concatenated string will be returned if they are compatible, nil if they are not.
  23. Encoding.compatible? with Encodings
    US_ASCII = Encoding::US_ASCII
    UTF_8 = Encoding::UTF_8
    BINARY = Encoding::BINARY

    # Compatible encodings:
    Encoding.compatible?(US_ASCII, UTF_8) #=> #<Encoding:UTF-8>
    Encoding.compatible?(UTF_8, US_ASCII) #=> #<Encoding:UTF-8>

    # Incompatible encodings:
    Encoding.compatible?(UTF_8, BINARY) #=> nil
    Encoding.compatible?(BINARY, UTF_8) #=> nil
  24. Encoding.compatible? with Strings
    us_ascii = 'abc'.encode('US-ASCII')
    utf_8 = 'abc'.encode('UTF-8')
    binary = 'abc'.encode('BINARY')

    # Argument order matters.
    Encoding.compatible?(us_ascii, utf_8) #=> #<Encoding:US-ASCII>
    Encoding.compatible?(utf_8, us_ascii) #=> #<Encoding:UTF-8>
  25. Encoding.compatible? with Strings
    us_ascii = 'abc'.encode('US-ASCII')
    utf_8 = 'abc'.encode('UTF-8')
    binary = 'abc'.encode('BINARY')

    # Strings can be compatible even if their
    # encodings aren't.
    Encoding.compatible?(utf_8, binary) #=> #<Encoding:UTF-8>
    Encoding.compatible?(utf_8.encoding, binary.encoding) #=> nil
  26. Encoding.compatible? with Strings
    utf_8 = 'abc'.encode('UTF-8')
    utf_16le = 'abc'.encode('UTF-16LE')

    # Incompatible encodings return nil.
    Encoding.compatible?(utf_8, utf_16le) #=> nil
  27. Encoding.compatible? with Strings
    utf_8 = 'abc'.encode('UTF-8')
    utf_16le = 'abc'.encode('UTF-16LE')

    # Incompatible encodings return nil.
    # ... unless one of the strings is empty.
    Encoding.compatible?(utf_8.dup.clear, utf_16le) #=> #<Encoding:UTF-16LE>
    Encoding.compatible?(utf_8, utf_16le.dup.clear) #=> #<Encoding:UTF-8>
  28. The Three-Encoding Problem
    • Default string encoding: UTF-8
    • Symbols and Numerics converting to string: US-ASCII
    • Working with IO: ASCII-8BIT / BINARY
    • Regexps also have nuanced encoding interactions not discussed
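The three encodings can show up within a few lines of ordinary code. The sketch below assumes the default UTF-8 source encoding and uses a `StringIO` as a stand-in for real I/O; `three_encodings` is a hypothetical helper written for this illustration.

```ruby
require 'stringio'

# Collect the encoding of strings that come from four everyday sources.
def three_encodings
  {
    literal: 'abc'.encoding,                      # string literal
    symbol:  :abc.to_s.encoding,                  # ASCII-only Symbol#to_s
    integer: 42.to_s.encoding,                    # Integer#to_s
    io:      StringIO.new('abc').read(3).encoding # read with a byte length
  }
end

encodings = three_encodings
encodings[:literal] == Encoding::UTF_8    #=> true
encodings[:symbol]  == Encoding::US_ASCII #=> true
encodings[:integer] == Encoding::US_ASCII #=> true
encodings[:io]      == Encoding::BINARY   #=> true
```

All four strings hold plain ASCII text, yet they carry three different encodings, which is why Ruby has to negotiate a result encoding whenever they are combined.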
  29. Performance Impact
    • Encoding negotiation is very cheap when encodings are the same
    • Otherwise, deriving the resulting encoding runs through a convoluted list of rules
    • Ruby optimizes based on encoding properties
    • Fixed width encodings like US-ASCII are faster than variable-width encodings
    • ASCII-compatible encodings can be faster if string is ASCII only
    • Ruby caches discovered properties of a string in something called a “code range”
    • Notes if string is ASCII only or if it’s broken
    • Cache must be conservative, since it affects results
    • If unknown, must perform full linear scan through string
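The code range itself is internal to the interpreter, but the properties it caches are visible through `String#ascii_only?` and `String#valid_encoding?`. The following is a rough approximation for illustration only: `code_range_summary` and its three symbols are hypothetical and are not an API Ruby exposes.

```ruby
# Hypothetical helper: classify a string by the properties a code range
# records, using only public String methods.
def code_range_summary(str)
  if !str.valid_encoding?
    :broken     # contains bytes that are invalid for its encoding
  elsif str.ascii_only?
    :seven_bit  # every character is ASCII
  else
    :valid      # valid, but contains non-ASCII characters
  end
end

code_range_summary('abc')       #=> :seven_bit
code_range_summary("tr\u00E8s") #=> :valid
code_range_summary("tr\xC3")    #=> :broken
```

When the cached value is unknown, answering either question requires scanning the string's bytes, which is the linear cost the slide refers to.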
  30. Behavioral Impact
    • All parties must agree on encoding, including any external
    • Databases need to agree on both encoding and collation
    • Paths that usually succeed could suddenly fail when non-ASCII data appears
    • If working with user-generated data, you need to plan for non-ASCII
    • Incorrectly dealing with encodings could corrupt or lose data
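One way to plan for non-ASCII and malformed input is to validate at the boundary where bytes enter the application. This is a sketch under a stated assumption, namely that incoming bytes are meant to be UTF-8; `sanitize_utf8` is a hypothetical helper, and replacing bad bytes is only one policy (rejecting the input is another).

```ruby
# Hypothetical boundary helper: take raw bytes that are expected to be
# UTF-8, label them as such, and replace any invalid sequences with
# U+FFFD so later string operations do not raise.
def sanitize_utf8(raw)
  str = raw.dup.force_encoding(Encoding::UTF_8)
  str.valid_encoding? ? str : str.scrub("\uFFFD")
end

# Well-formed input passes through unchanged.
sanitize_utf8("tr\xC3\xA8s".b) == "tr\u00E8s"  #=> true

# A stray byte is replaced rather than propagated.
sanitize_utf8("abc\xFF".b) == "abc\uFFFD"      #=> true
```

Replacement loses the original bytes, so a caller that must preserve data exactly would keep the binary string and report the problem instead.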
  31. Encoding-related Performance Improvements
    • String code range is derived rather than scanned in more places now
    • Avoids visiting every byte in a string
    • Reduces Copy on Write faults
    • String concatenation is now faster
    • Operations now optimize for most common encodings, not just encoding properties
    • Even more room for optimization!
  32. What We’ve Learned
    • What encodings are and how Ruby uses them
    • Ruby’s various controls for setting encodings on different objects
    • A bit of history of how we got where we are
    • How encodings can impact performance and alter behavior
  33. Helpful Resources
    • Code Ranges: A Deeper Look at Ruby Strings
    • https://shopify.engineering/code-ranges-ruby-strings
    • PR to preserve code range when resizing without truncation
    • https://github.com/ruby/ruby/pull/6178
    • PR to reduce encoding negotiation for common string concat cases
    • https://github.com/ruby/ruby/pull/6120
    • PR to cheaply derive code range for String#b calls
    • https://github.com/ruby/ruby/pull/6183
  34. Image Licenses
    • Ruby logo: © 2006, Yukihiro Matsumoto.
    • Licensed under CC BY-SA 2.5: https://creativecommons.org/licenses/by-sa/2.5/
    • TruffleRuby logo: © 2017 Talkdesk, Inc.
    • Licensed under CC BY 4.0: https://creativecommons.org/licenses/by/4.0/
    • Rails logo is in the public domain
    • CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
    • YJIT logo: © 2021, Shopify, Inc.
    • Tapioca logo: © Shopify, Inc.
    • Sorbet logo: © Stripe
    • “2021” picture: © 2021, Matthew Henry https://burst.shopify.com/photos/the-year-2021-in-black-ink
    • All other images generated by DALL-E 2 with prompts by Kevin Menard © 2022