Rethinking Strings

Ruby strings serve two distinct purposes: the representation of textual data and the representation of binary data. These two use cases generally require different operations, but today they're both accessible via String. Combining the two creates a discoverability issue and can be error-prone. Many String operations have no logical meaning for arbitrary binary data. Having to use strings with a special encoding to pass binary data around is a non-obvious solution and hampers Ruby's usability. Moreover, binary data can sometimes look like ASCII text, which may help build false trust in code with logic errors. Such errors are nuanced and difficult to debug.

This talk takes a high-level look at Ruby's strings and encodings, highlighting potentially problematic areas and suggesting ways to improve. While the emphasis is on the logical interface for text and binary data, we'll also look at the performance ramifications of the current design and how that might improve as well.

Kevin Menard

May 13, 2023
Transcript

  1. You String me Round
     • Primary way to communicate data to users
     • Web templates
     • Email processing
     • Log output
     • Common exchange format for API calls
     • Configuration data
  2. ByteList: TruffleRuby Strings 1.0
     • Straightforward port of CRuby strings to TruffleRuby
     • Pulled in Ruby source from Rubinius, reimplement C++ in Java
     • Use JRuby’s ByteList to support efficient substring
     • Worked… but not great for JIT compilation
  3. Ropes: TruffleRuby Strings Reloaded (2.0)
     • Reimplement with ropes
     • Presented at RubyKaigi in 2016
     • Uses trees of lazy operations rather than byte manipulation
     • Persistent data structure; easy to reuse
     • Overhead for the tree, but could share memory easier
     [Diagram: a ConcatNode with two children, LeafNode "abc" and LeafNode "xyz"]
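The rope idea on this slide can be sketched in a few lines of plain Ruby. This is a toy illustration only, not TruffleRuby's implementation: the `LeafNode` and `ConcatNode` names are borrowed from the slide's diagram, while `flatten_into` and `flatten` are hypothetical helpers introduced here.

```ruby
# Toy rope: leaves hold bytes; concat nodes only record their two children.
class LeafNode
  attr_reader :bytes

  def initialize(bytes)
    @bytes = bytes.b.freeze # immutable binary copy of the input
  end

  def bytesize
    @bytes.bytesize
  end

  # Copy this leaf's bytes into the output buffer.
  def flatten_into(buffer)
    buffer << @bytes
  end
end

class ConcatNode
  attr_reader :left, :right, :bytesize

  def initialize(left, right)
    @left = left
    @right = right
    # Derived from the children's cached sizes; no bytes are copied here.
    @bytesize = left.bytesize + right.bytesize
  end

  # Bytes are only materialized when someone asks for the flat result.
  def flatten_into(buffer)
    @left.flatten_into(buffer)
    @right.flatten_into(buffer)
  end
end

# Hypothetical helper: walk the tree once and build the flat byte string.
def flatten(rope)
  rope.flatten_into(String.new(encoding: Encoding::BINARY))
end

abc = LeafNode.new("abc")
xyz = LeafNode.new("xyz")
rope = ConcatNode.new(abc, xyz)
longer = ConcatNode.new(rope, abc) # reuses the existing nodes, no copying

flatten(rope).bytes == "abcxyz".bytes # => true
longer.bytesize                       # => 9
```

Because nodes are never mutated, `rope` can be shared by any number of larger ropes, which is the persistence and memory-sharing property the slide mentions.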
  4. Rope Performance
     • Lazy operations
     • ERB does a lot of concatenations but only needs end result
     • Immutability
     • Cache code range and string length
     • Can cheaply derive new code ranges and string lengths
     • String#setbyte presents pathological case
     • Setting each byte turns tree into linked list
  5. TruffleString: TruffleRuby Strings Revolutions (3.0)
     • Blended approach
     • Ropes generally, but mutable byte storage when needed (work in progress)
     • More storage strategies
     • Improved efficiency in polyglot applications
  6. String API
     Text-oriented
     • Case conversion
     • Substring/index
     • Strip whitespace
     • Concatenation
     • Transcode
     Byte-oriented
     • Buffer pre-allocation
     • Direct byte setting
     • Byte indices
     • Byte slicing
     • Encoding override
     Modern? Legacy?
  7. Strings Today
     • Support mutable & immutable operations
     • Or imperative and functional if you’d prefer
     • Encoding-aware without single system encoding
     • Encoding can be changed:
     • System-wide (at start-up or runtime)
     • On a per-file basis
     • On a per-String basis
     • Automatic encoding conversion (transcoding)
  8. The Meaning of Bytes
     encode(bytes) → code point
     • Takes in a sequence of bytes
     • Outputs a code point
     encode(code points) → character
     • Takes in a sequence of code points
     • Outputs a grapheme cluster (“character”)
     encode(encode(bytes)) → character
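The three layers on this slide (bytes, code points, grapheme clusters) can be observed directly with core String methods. The sample string below is an assumption chosen for illustration: an "e" followed by a combining accent, so that each layer has a different length.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT; the \u escape makes this UTF-8.
s = "e\u0301"

s.bytes                  # => [101, 204, 129]  three bytes
s.codepoints             # => [101, 769]       two code points
s.grapheme_clusters.size # => 1                one user-perceived character
s.length                 # => 2                String#length counts code points
```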
  9. Bytes, Bytes, Baby
     bytes = [0xC2, 0xA9]
     s = bytes.pack("C*")

     s.dup.force_encoding("UTF-8") # => ©
     s.dup.force_encoding("Shift_JIS") # => ツゥ
  10. Default Encodings
      Name                 Usage
      UTF-8                General purpose strings
      US-ASCII             Symbols, regex, and numbers
      ASCII-8BIT (BINARY)  Data read from I/O and binary data
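A quick way to see these defaults is to ask a few objects for their encoding. This is a small sketch; the variable names are illustrative. String literals are left out because their encoding follows the source file's encoding, which is UTF-8 by default but can be changed.

```ruby
symbol_enc = :name.encoding       # ASCII-only symbol: US-ASCII
number_enc = 42.to_s.encoding     # Integer#to_s: US-ASCII
regexp_enc = /abc/.encoding       # ASCII-only regex literal: US-ASCII
binary_enc = String.new.encoding  # String.new with no arguments: ASCII-8BIT
forced_enc = "abc".b.encoding     # String#b: ASCII-8BIT

symbol_enc == Encoding::US_ASCII # => true
binary_enc == Encoding::BINARY   # => true (BINARY is an alias of ASCII-8BIT)
```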
  11. Code Range Values
      Name     Usage
      Unknown  String needs to be scanned
      7 Bit    • Encoding is ASCII-compatible
               • All bytes are 0x00 - 0x7F (ASCII-only)
      Valid    • Bytes map correctly in encoding
               • String is not ASCII-only
      Broken   Bytes don’t map correctly in encoding
  12. Code Range Values
      Name     Usage
      Unknown  -
      7 Bit    str.ascii_only?
      Valid    str.valid_encoding? && !str.ascii_only?
      Broken   !str.valid_encoding?
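The table above maps onto a small classifier. The `code_range` method and its symbol return values are hypothetical names for this sketch; Ruby does not expose the internal code range directly, and the "Unknown" state (not yet scanned) cannot be observed from Ruby code, so it is omitted.

```ruby
# Hypothetical helper: classify a string using only public String predicates.
def code_range(str)
  if str.ascii_only?
    :seven_bit
  elsif str.valid_encoding?
    :valid
  else
    :broken
  end
end

# A lone 0xC3 lead byte labelled as UTF-8 is an incomplete sequence.
truncated = [0xC3].pack("C").force_encoding(Encoding::UTF_8)

code_range("abc")        # => :seven_bit
code_range("tr\u00E8s")  # => :valid
code_range(truncated)    # => :broken
code_range("\xFF".b)     # => :valid (every byte is valid in BINARY)
```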
  13. How can we Represent Binary Data?
      • Bytes are just integer values in 0x00 - 0xFF
      • We could use an Array of Integer
      • …but it’s not space efficient

      require 'objspace'

      ObjectSpace.memsize_of([0x2A] * 1) # => 160
      ObjectSpace.memsize_of([0x2A] * 1024) # => 8,232
      ObjectSpace.memsize_of([0x2A] * 1024 ** 2) # => 8,388,648
  14. How can we Represent Binary Data?
      • Let’s use Ruby 1.8 strings

      require 'objspace'

      ObjectSpace.memsize_of('*'.b * 1) # => 40
      ObjectSpace.memsize_of('*'.b * 1024) # => 1,065
      ObjectSpace.memsize_of('*'.b * 1024 ** 2) # => 1,048,617

      ('*' * 1024 ** 2).bytesize # => 1,048,576
      # (Δ: 41 bytes)
  17. require 'zlib'
      compressed = Zlib::Deflate.deflate('aasy')
      # => "x\x9CKL,\xAE\x04\x00\x04\n\x01\xAF"

      compressed.encoding # => #<Encoding:ASCII-8BIT>
      compressed.ascii_only? # => false
      compressed.valid_encoding? # => true

      compressed.lines
      # => ["x\x9CKL,\xAE\x04\x00\x04\n", "\x01\xAF"]

      compressed.succ
      # => "x\x9CKM,\xAE\x04\x00\x04\n\x01\xAF"
  21. Working with Binary Strings
      s1 = ''
      s2 = String.new

      s1 == s2 # => true

      s1 << 0x80
      s2 << 0x80

      s1 == s2 # => false

      s1.bytes # => [194, 128]
      s2.bytes # => [128]
  22. Working with Binary Strings
      • String.new ≠ '' (empty string literal)
      • String.new defaults to ASCII-8BIT/BINARY
      • '' uses default internal encoding (UTF-8 in this case)
      • String#<< takes codepoint values, not bytes
      • 0x80 is a codepoint that takes two bytes in UTF-8
      • 0x80 is a “codepoint” that takes 1 byte in BINARY
  23. What About IO::Buffer?
      • Can store and efficiently work with binary data
      • … but people think it’s only for IO
      • It still relies on ASCII-8BIT/BINARY encodings at boundaries
      • Supports more than bytes, making it harder to use
      • Not exposed to users
      • File.read(<n>) does not return a buffer
      • Zlib::Deflate.deflate does not return a buffer
  24. Broken Strings
      • A String’s bytes have no mapping in the associated encoding
      • Ruby hands you a broken string
      • IO.read
      • File.read
      • You break the string yourself
      • String#setbyte
      • String#force_encoding
  28. Scaling Images in Ruby
      image = File.read(File.expand_path("~/ruby.png"))
      image.encoding # => #<Encoding:UTF-8>
      image.ascii_only? # => false

      image.upcase # => input string invalid (ArgumentError)

      image.valid_encoding? # => false

      image.force_encoding(Encoding::BINARY)
      image.valid_encoding? # => true

      bigger_image = image.upcase # => Victory!
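For comparison, a file can be read as binary from the start with `File.binread`, which avoids the broken UTF-8 string entirely. The sketch below uses a temporary file and a made-up four-byte payload in place of `~/ruby.png`. It also shows why the "Victory!" above is hollow: case conversion still runs on a binary string and silently rewrites any byte that happens to fall in the ASCII letter range.

```ruby
require 'tempfile'

# Stand-in payload (an assumption for this sketch): 0x89 followed by "png".
payload = [0x89, 0x70, 0x6E, 0x67].pack("C*")

data = Tempfile.create("blob") do |file|
  file.binmode
  file.write(payload)
  file.flush
  File.binread(file.path) # always returns an ASCII-8BIT (BINARY) string
end

data.encoding == Encoding::BINARY # => true
data.valid_encoding?              # => true
data.bytes                        # => [137, 112, 110, 103]

# No error is raised, but the "image" bytes have been changed.
data.upcase.bytes                 # => [137, 80, 78, 71]
```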
  29. Broken Strings
      s = 'très'
      s.encoding #=> #<Encoding:UTF-8>
      s.size #=> 4

      s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"

      s.encoding #=> #<Encoding:US-ASCII>
      s.valid_encoding? #=> false
      s.size #=> 5
  30. Caveats
      • Backwards compatibility is important
      • Phased-in approach
      • Issue a warning
      • Raise an exception (controllable via flag)
      • Deprecate, but maybe never remove
  31. Add a ByteList type
      • Much smaller API for working with a list of bytes
      • None of the text-oriented API
      • Simplifies String to work only with text
      • More efficient
      • No encoding compatibility checks appending bytes
      • Safer
      • << would not be a code point
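To make the proposal concrete, here is one possible shape for such a type, written as a thin wrapper over a BINARY string. Everything in it is an assumption for illustration: the slide names the type but does not specify its API, so the methods below (`<<`, `[]`, `size`, `each`, `to_s`) are hypothetical choices, not a proposed standard.

```ruby
# Hypothetical sketch of a byte-only container; not an existing Ruby class.
class ByteList
  include Enumerable

  def initialize(bytes = [])
    @data = String.new(encoding: Encoding::BINARY)
    bytes.each { |byte| self << byte }
  end

  # Appends exactly one byte; the argument is never treated as a code point.
  def <<(byte)
    unless byte.is_a?(Integer) && byte.between?(0, 255)
      raise ArgumentError, "expected an Integer in 0..255, got #{byte.inspect}"
    end
    @data << byte # on a BINARY string this appends a single byte
    self
  end

  def [](index)
    @data.getbyte(index)
  end

  def size
    @data.bytesize
  end

  def each(&block)
    @data.each_byte(&block)
  end

  # Explicit boundary back to the String world.
  def to_s
    @data.dup
  end
end

list = ByteList.new([0x80])
list << 0xFF
list.to_a # => [128, 255]
list.size # => 2
```

None of the text-oriented methods (`upcase`, `lines`, `succ`) exist on this class, so the mistakes shown on the earlier slides cannot be written against it.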
  32. Deprecate String#force_encoding
      • It’s unfortunately misused and has a narrow use case
      • The primary use case is reading from IO and overriding the encoding
      • Users actually use it to:
      • “Fix” encoding exceptions (Checking after with String#valid_encoding? never happens)
      • As a faster transcoding operation
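The difference between transcoding and relabelling is easy to demonstrate. In this sketch the sample word and the choice of ISO-8859-1 as the target are assumptions for illustration. `String#encode` converts the bytes, while `String#force_encoding` keeps the bytes and only changes the label, which here produces a string that is valid but wrong.

```ruby
text = "tr\u00E8s" # "très" in UTF-8: bytes 116, 114, 195, 168, 115

# Transcoding: the bytes are converted to the target encoding.
transcoded = text.encode(Encoding::ISO_8859_1)
transcoded.bytes # => [116, 114, 232, 115]

# Relabelling: same bytes, new label. Nothing is converted.
relabeled = text.dup.force_encoding(Encoding::ISO_8859_1)
relabeled.bytes           # => [116, 114, 195, 168, 115]
relabeled.valid_encoding? # => true, so a validity check would not catch this
relabeled.length          # => 5

# Converting back shows the corruption: two characters where "è" used to be.
mojibake = relabeled.encode(Encoding::UTF_8)
mojibake == "tr\u00C3\u00A8s" # => true
```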
  33. Ruby strings are complex
      • Difficult to optimize
      • Difficult for Rubyists to work with
      • API makes it too easy to corrupt data
      • Automatic transcoding masks errors
  34. Broken strings are often a logic error
      • Lazily discovering them masks error origin
      • Users often corrupt data trying to “fix” them
      • Also hinder optimizations
      • Inefficient UTF-8 character boundary detection
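One way to avoid discovering a broken string far from where it was created is to validate at the boundary where bytes become text. The `text_from_bytes` helper below is hypothetical, and raising `ArgumentError` is an arbitrary choice for this sketch. It is not something the talk prescribes, only an illustration of failing at the point of origin.

```ruby
# Hypothetical helper: label bytes with an encoding and check validity at once.
def text_from_bytes(bytes, encoding = Encoding::UTF_8)
  text = bytes.dup.force_encoding(encoding)
  unless text.valid_encoding?
    raise ArgumentError, "invalid #{encoding} byte sequence"
  end
  text
end

good = [0x74, 0x72, 0xC3, 0xA8, 0x73].pack("C*") # the UTF-8 bytes of "très"
bad  = [0x74, 0x72, 0xC3].pack("C*")             # truncated mid-character

text_from_bytes(good) == "tr\u00E8s" # => true

begin
  text_from_bytes(bad)
rescue ArgumentError => e
  e.message # => "invalid UTF-8 byte sequence"
end
```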
  35. A proper binary data interface benefits all
      • Simpler string API
      • Fixes a leaky abstraction
      • Easier target for text-oriented optimizations
      • Normalizes encoding logic
      • No more special handling for one specific encoding
      • Faster & simpler binary data handling
      • Far less error-prone for Rubyists
      • Easier target for byte-oriented optimizations
  36. Resources
      • Code Ranges: A Deeper Look at Strings
        https://shopify.engineering/code-ranges-ruby-strings
      • Specializing Ropes for Ruby
        https://dl.acm.org/doi/abs/10.1145/3237009.3237026
      • A Tale of Two String Representations
        RubyKaigi 2016
      • Rename ASCII-8BIT encoding to BINARY
        Ruby Feature #18576
      • Feature Request: Byte Arrays for Ruby 3
        Ruby Feature #13166