Slide 1

Slide 1 text

Rethinking Strings Kevin Menard 2023-05-13 [email protected]

Slide 2

Slide 2 text

About Me 2021 2008 2014

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

You String me Round
• Primary way to communicate data to users
  • Web templates
  • Email processing
  • Log output
• Common exchange format for API calls
• Configuration data

Slide 6

Slide 6 text

…Right Round
• Metaprogramming
  • send, __send__
  • instance_variable_{get,set}
  • method_missing
  • respond_to?, respond_to_missing?

Slide 7

Slide 7 text

Rethinking Strings

Slide 8

Slide 8 text

ByteList: TruffleRuby Strings 1.0
• Straightforward port of CRuby strings to TruffleRuby
• Pulled in Ruby source from Rubinius, reimplemented C++ in Java
• Used JRuby’s ByteList to support efficient substring
• Worked… but not great for JIT compilation

Slide 9

Slide 9 text

Ropes: TruffleRuby Strings Reloaded (2.0)
• Reimplement with ropes
• Presented at RubyKaigi in 2016
• Uses trees of lazy operations rather than byte manipulation
• Persistent data structure; easy to reuse
• Overhead for the tree, but could share memory more easily
[Figure: a ConcatNode joining LeafNode "abc" and LeafNode "xyz"]

Slide 10

Slide 10 text

Rope Performance
• Lazy operations
  • ERB does a lot of concatenations but only needs end result
• Immutability
  • Cache code range and string length
  • Can cheaply derive new code ranges and string lengths
• String#setbyte presents pathological case
  • Setting each byte turns tree into linked list
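The rope idea can be sketched in a few lines of Ruby. This is a minimal illustration only: the class names LeafNode and ConcatNode come from the slide's diagram, but the methods (`flatten_into`, `flatten`) are hypothetical and this is not TruffleRuby's implementation. It shows concatenation as a lazy tree node, with the length cached and derived from the children.

```ruby
# Hypothetical rope sketch; not TruffleRuby's actual classes or API.
class LeafNode
  attr_reader :length

  def initialize(bytes)
    @bytes = bytes.dup.freeze
    @length = @bytes.bytesize # cached once; the node never changes
  end

  # Copies this leaf's bytes into the output buffer.
  def flatten_into(buffer)
    buffer << @bytes
  end
end

class ConcatNode
  attr_reader :length

  def initialize(left, right)
    @left = left
    @right = right
    @length = left.length + right.length # derived from children, no scan
  end

  # Walks the tree left to right, appending each leaf.
  def flatten_into(buffer)
    @left.flatten_into(buffer)
    @right.flatten_into(buffer)
  end
end

# Materializes the bytes only when the final result is needed.
def flatten(rope)
  rope.flatten_into(String.new(capacity: rope.length))
end

abc = LeafNode.new("abc")
xyz = LeafNode.new("xyz")
rope = ConcatNode.new(abc, xyz)    # no bytes copied yet
twice = ConcatNode.new(rope, rope) # the same subtree is shared, not duplicated
result = flatten(twice)
```

Because nodes are immutable, `rope` can appear twice inside `twice` without copying. The cost is the extra tree overhead the slide mentions.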

Slide 11

Slide 11 text

TruffleString: TruffleRuby Strings Revolutions (3.0)
• Blended approach
  • Ropes generally, but mutable byte storage when needed (work in progress)
• More storage strategies
• Improved efficiency in polyglot applications

Slide 12

Slide 12 text

A History of Strings Better than a time machine!

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

String API
Text-oriented
• Case conversion
• Substring/index
• Strip whitespace
• Concatenation
• Transcode
Byte-oriented
• Buffer pre-allocation
• Direct byte setting
• Byte indices
• Byte slicing
• Encoding override
Modern? Legacy?

Slide 16

Slide 16 text

Strings Today
• Support mutable & immutable operations
  • Or imperative and functional if you’d prefer
• Encoding-aware without single system encoding
• Encoding can be changed:
  • System-wide (at start-up or runtime)
  • On a per-file basis
  • On a per-String basis
• Automatic encoding conversion (transcoding)

Slide 17

Slide 17 text

Encodings

Slide 18

Slide 18 text

The Meaning of Bytes
encode(bytes) → code point
• Takes in a sequence of bytes
• Outputs a code point
encode(code points) → character
• Takes in a sequence of code points
• Outputs a grapheme cluster (“character”)
encode(encode(bytes)) → character
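The three layers (bytes, code points, grapheme clusters) can be observed directly with core String methods. The sample string below is an assumption chosen for illustration: the letter "e" followed by U+0301 COMBINING ACUTE ACCENT, written with \u escapes so the literal is UTF-8 whatever the source file's encoding.

```ruby
s = "e\u0301" # "e" plus a combining acute accent

bytes      = s.bytes             # => [101, 204, 129]
codepoints = s.codepoints        # => [101, 769]
clusters   = s.grapheme_clusters # one cluster holding both code points

s.bytesize      # => 3
s.length        # => 2 (String#length counts code points)
clusters.length # => 1 (what a reader perceives as one character)
```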

Slide 19

Slide 19 text

Bytes, Bytes, Baby
    bytes = [0xC2, 0xA9]
    s = bytes.pack("C*")
    s.dup.force_encoding("UTF-8") # => ©
    s.dup.force_encoding("Shift_JIS") # => ツゥ

Slide 20

Slide 20 text

Encoding.list.size # => 103

Slide 21

Slide 21 text

Encoding.name_list.size # => 175

Slide 22

Slide 22 text

Default Encodings
Name | Usage
UTF-8 | General purpose strings
US-ASCII | Symbols, regex, and numbers
ASCII-8BIT (BINARY) | Data read from I/O and binary data
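A quick way to see these defaults is to ask a few values for their encoding. This is a small sketch rather than an exhaustive survey. The variable names are illustrative, and the pipe is only a stand-in for "data read from I/O" (reading with an explicit length returns binary data).

```ruby
# \u escapes make the literal UTF-8 regardless of the source file's encoding.
general = "caf\u00E9".encoding # UTF-8
symbol  = :name.encoding       # US-ASCII (ASCII-only symbol)
number  = 42.to_s.encoding     # US-ASCII
regex   = /ruby/.encoding      # US-ASCII (ASCII-only pattern)
fresh   = String.new.encoding  # ASCII-8BIT (BINARY)

# Reading a fixed number of bytes from an IO yields a binary string.
reader, writer = IO.pipe
writer.write("abc")
writer.close
from_io = reader.read(3).encoding # ASCII-8BIT (BINARY)
reader.close
```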

Slide 23

Slide 23 text

ASCII-8BIT

Slide 24

Slide 24 text

ASCII-8BIT ≠ US-ASCII

Slide 25

Slide 25 text

Encoding::ASCII_8BIT.names # => ["ASCII-8BIT", "BINARY"]

Slide 26

Slide 26 text

Encoding::ASCII_8BIT.ascii_compatible? # => true

Slide 27

Slide 27 text

'abc'.b.ascii_only? # => true

Slide 28

Slide 28 text

'abc'.b.encoding # => #<Encoding:ASCII-8BIT>

Slide 29

Slide 29 text

Code Ranges

Slide 30

Slide 30 text

Code Range Values
Name | Usage
Unknown | String needs to be scanned
7 Bit | Encoding is ASCII-compatible; all bytes are 0x00 - 0x7F (ASCII-only)
Valid | Bytes map correctly in encoding; String is not ASCII-only
Broken | Bytes don’t map correctly in encoding

Slide 31

Slide 31 text

Code Range Values
Name | Ruby equivalent
Unknown | (none)
7 Bit | str.ascii_only?
Valid | str.valid_encoding? && !str.ascii_only?
Broken | !str.valid_encoding?
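The mapping from code range to public String methods can be written as a small classifier. This is only a sketch: `code_range` and its symbol names are hypothetical, and the real code range is an internal flag that Ruby does not expose.

```ruby
# Hypothetical helper that reconstructs the code range from public methods.
def code_range(str)
  if str.ascii_only?
    :seven_bit
  elsif str.valid_encoding?
    :valid
  else
    :broken
  end
end

ascii  = code_range("abc")                                   # => :seven_bit
valid  = code_range("tr\u00E8s")                             # => :valid
broken = code_range("\xC3".dup.force_encoding("UTF-8"))      # => :broken
binary = code_range("\xC3".b)                                # => :valid
```

The last line shows why binary data is awkward here: every byte sequence is valid in BINARY, so a binary string can never be reported as broken.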

Slide 32

Slide 32 text

Binary Data

Slide 33

Slide 33 text

How can we Represent Binary Data?
• Bytes are just integer values in 0x00 - 0xFF
• We could use an Array of Integer
• …but it’s not space efficient
    require 'objspace'
    ObjectSpace.memsize_of([0x2A] * 1) # => 160
    ObjectSpace.memsize_of([0x2A] * 1024) # => 8,232
    ObjectSpace.memsize_of([0x2A] * 1024 ** 2) # => 8,388,648

Slide 34

Slide 34 text

How can we Represent Binary Data?
• Let’s use Ruby 1.8 strings
    require 'objspace'
    ObjectSpace.memsize_of('*'.b * 1) # => 40
    ObjectSpace.memsize_of('*'.b * 1024) # => 1,065
    ObjectSpace.memsize_of('*'.b * 1024 ** 2) # => 1,048,617
    ('*' * 1024 ** 2).bytesize # => 1,048,576
    # (Δ: 41 bytes)

Slide 35

Slide 35 text

    require 'zlib'
    compressed = Zlib::Deflate.deflate('aasy')
    # => "x\x9CKL,\xAE\x04\x00\x04\n\x01\xAF"
    compressed.encoding # => #<Encoding:ASCII-8BIT>
    compressed.ascii_only? # => false
    compressed.valid_encoding? # => true

Slide 36

Slide 36 text

    require 'zlib'
    compressed = Zlib::Deflate.deflate('aasy')
    # => "x\x9CKL,\xAE\x04\x00\x04\n\x01\xAF"
    compressed.encoding # => #<Encoding:ASCII-8BIT>
    compressed.ascii_only? # => false
    compressed.valid_encoding? # => true
    compressed.lines
    # => ["x\x9CKL,\xAE\x04\x00\x04\n", "\x01\xAF"]

Slide 37

Slide 37 text

    require 'zlib'
    compressed = Zlib::Deflate.deflate('aasy')
    # => "x\x9CKL,\xAE\x04\x00\x04\n\x01\xAF"
    compressed.encoding # => #<Encoding:ASCII-8BIT>
    compressed.ascii_only? # => false
    compressed.valid_encoding? # => true
    compressed.lines
    # => ["x\x9CKL,\xAE\x04\x00\x04\n", "\x01\xAF"]
    compressed.succ
    # => "x\x9CKM,\xAE\x04\x00\x04\n\x01\xAF"

Slide 38

Slide 38 text

Binary Data ≠ Strings

Slide 39

Slide 39 text

Working with Binary Strings
    s1 = ''
    s2 = String.new

Slide 40

Slide 40 text

Working with Binary Strings
    s1 = ''
    s2 = String.new
    s1 == s2 # => true

Slide 41

Slide 41 text

Working with Binary Strings
    s1 = ''
    s2 = String.new
    s1 == s2 # => true
    s1 << 0x80
    s2 << 0x80

Slide 42

Slide 42 text

Working with Binary Strings
    s1 = ''
    s2 = String.new
    s1 == s2 # => true
    s1 << 0x80
    s2 << 0x80
    s1 == s2 # => false

Slide 43

Slide 43 text

Working with Binary Strings
    s1 = ''
    s2 = String.new
    s1 == s2 # => true
    s1 << 0x80
    s2 << 0x80
    s1 == s2 # => false
    s1.bytes # => [194, 128]
    s2.bytes # => [128]

Slide 44

Slide 44 text

Working with Binary Strings
• String.new ≠ '' (empty string literal)
  • String.new defaults to ASCII-8BIT/BINARY
  • '' uses the source encoding (UTF-8 in this case)
• String#<< takes codepoint values, not bytes
  • 0x80 is a codepoint that takes two bytes in UTF-8
  • 0x80 is a “codepoint” that takes 1 byte in BINARY

Slide 45

Slide 45 text

What About IO::Buffer?
• Can store and efficiently work with binary data
• … but people think it’s only for IO
  • It still relies on ASCII-8BIT/BINARY encodings at boundaries
  • Supports more than bytes, making it harder to use
• Not exposed to users
  • File.read() does not return a buffer
  • Zlib::Deflate.deflate does not return a buffer
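For readers who have not used it, here is a brief sketch of IO::Buffer holding raw bytes. It assumes Ruby 3.1 or later, where the class is available but experimental (a warning may be printed). The two-byte buffer and its values are illustrative.

```ruby
# IO::Buffer is experimental; its API may still change.
buffer = IO::Buffer.new(2)
buffer.set_value(:U8, 0, 0x01) # write one unsigned byte at offset 0
buffer.set_value(:U8, 1, 0x02) # write one unsigned byte at offset 1

first_byte    = buffer.get_value(:U8, 0)  # => 1
big_endian    = buffer.get_value(:U16, 0) # => 258 (0x0102)
little_endian = buffer.get_value(:u16, 0) # => 513 (0x0201)

# Crossing back into String land produces a binary-encoded String.
as_string = buffer.get_string
```

The typed accessors illustrate the "supports more than bytes" point: the same two bytes read as one, 258, or 513 depending on the type symbol. The final line shows the BINARY encoding reappearing at the boundary.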

Slide 46

Slide 46 text

Broken Strings

Slide 47

Slide 47 text

Broken Strings
• A String’s bytes have no mapping in the associated encoding
• Ruby hands you a broken string
  • IO.read
  • File.read
• You break the string yourself
  • String#setbyte
  • String#force_encoding

Slide 48

Slide 48 text

Scaling Images in Ruby
    image = File.read(File.expand_path("~/ruby.png"))
    image.encoding # => #<Encoding:UTF-8>
    image.ascii_only? # => false

Slide 49

Slide 49 text

Scaling Images in Ruby
    image = File.read(File.expand_path("~/ruby.png"))
    image.encoding # => #<Encoding:UTF-8>
    image.ascii_only? # => false
    image.upcase # => input string invalid (ArgumentError)

Slide 50

Slide 50 text

Scaling Images in Ruby
    image = File.read(File.expand_path("~/ruby.png"))
    image.encoding # => #<Encoding:UTF-8>
    image.ascii_only? # => false
    image.upcase # => input string invalid (ArgumentError)
    image.valid_encoding? # => false

Slide 51

Slide 51 text

Scaling Images in Ruby
    image = File.read(File.expand_path("~/ruby.png"))
    image.encoding # => #<Encoding:UTF-8>
    image.ascii_only? # => false
    image.upcase # => input string invalid (ArgumentError)
    image.valid_encoding? # => false
    image.force_encoding(Encoding::BINARY)
    image.valid_encoding? # => true

Slide 52

Slide 52 text

Scaling Images in Ruby
    image = File.read(File.expand_path("~/ruby.png"))
    image.encoding # => #<Encoding:UTF-8>
    image.ascii_only? # => false
    image.upcase # => input string invalid (ArgumentError)
    image.valid_encoding? # => false
    image.force_encoding(Encoding::BINARY)
    image.valid_encoding? # => true
    bigger_image = image.upcase # => Victory!

Slide 53

Slide 53 text

Broken Strings
    s = 'très'
    s.encoding #=> #<Encoding:UTF-8>
    s.size #=> 4
    s.force_encoding('US-ASCII') #=> "tr\xC3\xA8s"
    s.encoding #=> #<Encoding:US-ASCII>
    s.valid_encoding? #=> false
    s.size #=> 5

Slide 54

Slide 54 text

Can we fix this in Ruby?

Slide 55

Slide 55 text

Proposals for Ruby 3.3+

Slide 56

Slide 56 text

Caveats
• Backwards compatibility is important
• Phased-in approach
  • Issue a warning
  • Raise an exception (controllable via flag)
  • Deprecate, but maybe never remove

Slide 57

Slide 57 text

1. Invert ASCII-8BIT and BINARY

Slide 58

Slide 58 text

    'abc'.b.encoding
    # => #<Encoding:BINARY>
Instead of
    'abc'.b.encoding
    # => #<Encoding:ASCII-8BIT>

Slide 59

Slide 59 text

2. Remove ASCII compatibility from BINARY

Slide 60

Slide 60 text

3. Eagerly handle broken strings

Slide 61

Slide 61 text

> Break apart the text and byte-oriented APIs

Slide 62

Slide 62 text

4. Return IO::Buffer in IO calls

Slide 63

Slide 63 text

5. Add a ByteList type

Slide 64

Slide 64 text

Add a ByteList type
• Much smaller API for working with a list of bytes
  • None of the text-oriented API
  • Simplifies String to work only with text
• More efficient
  • No encoding compatibility checks when appending bytes
• Safer
  • << would not be a code point
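To make the proposal concrete, here is one possible shape for such a type. Everything in it is hypothetical: Ruby has no ByteList class, the method set is a guess at a "much smaller API", and backing it with a binary String is only a convenience for the sketch.

```ruby
# Hypothetical ByteList; not part of Ruby. Backed by a binary String
# purely for illustration.
class ByteList
  include Enumerable

  def initialize(bytes = [])
    @storage = String.new(encoding: Encoding::BINARY)
    bytes.each { |byte| self << byte }
  end

  # Appends exactly one byte. The argument is never treated as a code point.
  def <<(byte)
    unless byte.is_a?(Integer) && byte.between?(0, 255)
      raise ArgumentError, "expected a byte in 0..255, got #{byte.inspect}"
    end
    @storage << byte
    self
  end

  # Returns the byte at the given index, or nil when out of range.
  def [](index)
    @storage.getbyte(index)
  end

  def size
    @storage.bytesize
  end

  def each(&block)
    @storage.each_byte(&block)
  end
end

list = ByteList.new([0x2A])
list << 0x80
contents = list.to_a # => [42, 128]
```

Unlike `'' << 0x80`, appending 0x80 here always adds a single byte, and there is no upcase, lines, or succ to call by accident.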

Slide 65

Slide 65 text

> Make it harder to break strings

Slide 66

Slide 66 text

6. Deprecate String#setbyte

Slide 67

Slide 67 text

7. Deprecate #force_encoding

Slide 68

Slide 68 text

Deprecate String#force_encoding
• It’s unfortunately misused and has a narrow use case
• The primary use case is reading from IO and overriding the encoding
• Users actually use it:
  • To “fix” encoding exceptions (checking afterwards with String#valid_encoding? never happens)
  • As a faster transcoding operation
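The difference between relabelling and transcoding is easy to see with a short example. The input here is an assumption for illustration: the word "très" as ISO-8859-1 bytes, standing in for data read from IO.

```ruby
latin1_bytes = "tr\xE8s".b # "très" in ISO-8859-1, as raw bytes

# Narrow, legitimate use: relabel bytes whose real encoding is known...
text = latin1_bytes.dup.force_encoding(Encoding::ISO_8859_1)
# ...then transcode. The bytes change; the text does not.
utf8 = text.encode(Encoding::UTF_8)
utf8.bytes           # => [116, 114, 195, 168, 115]
utf8.valid_encoding? # => true

# Misuse: relabel as UTF-8 to make an exception go away.
broken = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
broken.bytes           # => [116, 114, 232, 115] (unchanged)
broken.valid_encoding? # => false
```

force_encoding never touches the bytes, which is why it looks like a fast transcode and why it so often leaves a broken string behind.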

Slide 69

Slide 69 text

Re-cap

Slide 70

Slide 70 text

Ruby strings are complex
• Difficult to optimize
• Difficult for Rubyists to work with
  • API makes it too easy to corrupt data
  • Automatic transcoding masks errors

Slide 71

Slide 71 text

Broken strings are often a logic error
• Lazily discovering them masks error origin
• Users often corrupt data trying to “fix” them
• Also hinder optimizations
  • Inefficient UTF-8 character boundary detection
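One way to surface a broken string where it originates, rather than at some later upcase or concatenation, is to validate at the boundary. The helper below is a sketch under that assumption. `text!` and `BrokenStringError` are hypothetical names, not an existing or proposed Ruby API.

```ruby
# Hypothetical error class and helper, for illustration only.
class BrokenStringError < StandardError; end

# Labels raw bytes with an encoding and fails immediately if they do not fit.
def text!(bytes, encoding)
  str = bytes.dup.force_encoding(encoding)
  return str if str.valid_encoding?

  raise BrokenStringError, "bytes are not valid #{str.encoding.name}"
end

good = text!("tr\xC3\xA8s".b, Encoding::UTF_8) # valid UTF-8 for "très"

failure = begin
  text!("tr\xE8s".b, Encoding::UTF_8) # ISO-8859-1 bytes, not UTF-8
rescue BrokenStringError => e
  e
end
```

The error is raised at the point where the bytes were mislabelled, so the backtrace points at the origin of the problem.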

Slide 72

Slide 72 text

A proper binary data interface benefits all
• Simpler string API
  • Fixes a leaky abstraction
  • Easier target for text-oriented optimizations
• Normalizes encoding logic
  • No more special handling for one specific encoding
• Faster & simpler binary data handling
  • Far less error-prone for Rubyists
  • Easier target for byte-oriented optimizations

Slide 73

Slide 73 text

Resources
• Code Ranges: A Deeper Look at Strings
  https://shopify.engineering/code-ranges-ruby-strings
• Specializing Ropes for Ruby
  https://dl.acm.org/doi/abs/10.1145/3237009.3237026
• A Tale of Two String Representations
  RubyKaigi 2016
• Rename ASCII-8BIT encoding to BINARY
  Ruby Feature #18576
• Feature Request: Byte Arrays for Ruby 3
  Ruby Feature #13166

Slide 74

Slide 74 text

Thank you for your time [email protected] @nirvdrum @[email protected]

Slide 75

Slide 75 text

No content