WTF-8

Leigh Caplan
February 04, 2014

Recently, I had to fix a pernicious encoding bug in a Rails app I work on. Along the way, I learned a lot about string encodings, how they work, *why* they work, and how Ruby handles them. Join me as I share the fruits of my journey of discovery with you!

Transcript

  1. class Event
       def self.add(format_string, user_name)
         doc = Nokogiri::XML::DocumentFragment.parse(format_string)
         subject = doc.css('subject').first
         subject.inner_html = subject.inner_html.gsub("$0", user_name)
         create(:body => doc.to_s)
       end
     end
  2. current_user.name = "Leigh Caplan"
     "The user <subject>Leigh Caplan</subject> viewed the page."
     current_user.name = "José Valim"
     "The user <subject></subject> viewed the page."
  3. describe Event do
       describe ".add" do
         it "should preserve special characters in names" do
           event = Event.add(
             "The user <subject>$0</subject> viewed the page.",
             "José Valim"
           )
           event.body.should =~ /José/
         end
       end
     end
  4. Way to represent characters as binary data
     "hello"
     # => 104 101 108 108 111
     # => 1101000 1100101 1101100 1101100 1101111
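The numbers on the slide above can be reproduced in Ruby. This is a minimal sketch using only core `String` and `Integer` methods; the variable names are illustrative and not taken from the deck.

```ruby
text = "hello"

# Each character is stored as a number (here, one byte per character).
codes = text.bytes

# The same numbers written in base 2, as on the slide.
bits = codes.map { |code| code.to_s(2) }

puts codes.join(" ")  # 104 101 108 108 111
puts bits.join(" ")   # 1101000 1100101 1101100 1101100 1101111
```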
  5. ASCII
     ‣ Invented in the 1960s, prominent from the 1970s onward
     ‣ 7-bit encoding: 128 characters
     ‣ 0-31 and 127 (DEL): control characters
     ‣ 32-126: printable characters (0-9, a-z, A-Z, etc.)
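The ASCII ranges can be checked from Ruby with a short sketch. It assumes the usual split of 95 printable characters (32 through 126) and 33 control characters (0 through 31, plus DEL at 127); the variable names are illustrative.

```ruby
# Codes 32 through 126 are the printable ASCII characters.
printable = (32..126).map(&:chr).join

# Codes 0 through 31, plus 127 (DEL), are control characters.
control_codes = (0..31).to_a + [127]

# Every byte of an ASCII string fits in 7 bits, so it is below 128.
seven_bit = "hello".bytes.all? { |byte| byte < 128 }

puts printable.length      # 95
puts control_codes.length  # 33
puts "A".ord               # 65
puts seven_bit             # true
```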
  6. Unicode
     ‣ Character mapping, not an encoding
     ‣ Attempt to establish a universal character set
     ‣ Aims to cover every writing system, historic as well as modern
     ‣ Disambiguation of character definitions
     ‣ System of mapping characters to an intermediate list of numbers (code points)
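To make the split between code points and encodings concrete, here is a small sketch. It writes the non-ASCII characters as `\u` escapes so it does not depend on the source file's encoding; the variable names are illustrative.

```ruby
name  = "Jos\u00e9"  # U+00E9 is the code point for "é"
heart = "\u2764"     # U+2764 is the heart used later in the deck

# Code points are the intermediate numbers Unicode assigns to characters.
utf8_points  = name.codepoints
utf16_points = name.encode("UTF-16BE").codepoints

# The code points are the same in both encodings; only the bytes differ.
same_points = utf8_points == utf16_points
same_bytes  = name.bytes == name.encode("UTF-16BE").bytes

puts utf8_points.join(" ")        # 74 111 115 233
puts heart.ord.to_s(16)           # 2764
puts same_points                  # true
puts same_bytes                   # false
puts [0x2764].pack("U") == heart  # true
```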
  7. UTF-16
     ‣ 2 bytes (16 bits) per code unit
     ‣ 65,536 characters fit in a single unit; the rest take a surrogate pair (4 bytes)
     ‣ Inefficient storage for primarily European strings
     ‣ Endianness matters, byte order mark optional
  8. "hello"
     # => FEFF 0068 0065 006C 006C 006F
     # => FFFE 6800 6500 6C00 6C00 6F00
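The two byte sequences on the slide above are the big-endian and little-endian forms of the same text, each with a leading byte order mark. A sketch that reproduces them, assuming the mark is written explicitly as U+FEFF; `hex_units` is a hypothetical helper, not something from the deck.

```ruby
# Hypothetical helper: read the raw bytes two at a time as big-endian
# 16-bit numbers and print each one as four hex digits.
def hex_units(string)
  string.unpack("n*").map { |unit| format("%04X", unit) }.join(" ")
end

with_bom = "\uFEFFhello"  # U+FEFF is the byte order mark

big_endian    = hex_units(with_bom.encode("UTF-16BE"))
little_endian = hex_units(with_bom.encode("UTF-16LE"))

puts big_endian     # FEFF 0068 0065 006C 006C 006F
puts little_endian  # FFFE 6800 6500 6C00 6C00 6F00
```

A reader that sees `FFFE` where it expected `FEFF` knows the bytes are in the opposite order from the one it assumed.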
  9. UTF-8
     ‣ Single byte for code points 0-127
     ‣ Variable length (up to 4 bytes under RFC 3629; the original design allowed up to 6)
     ‣ Most common encoding found on the Web as of 2008 (reported by Google)
     ‣ Backwards-compatible with ASCII
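The variable-length behavior is easy to observe from Ruby. This sketch measures a few sample characters chosen for illustration (they are not from the deck) and checks that ASCII text has identical bytes in both encodings.

```ruby
samples = {
  "a"       => "a",          # U+0061,  1 byte
  "e-acute" => "\u00e9",     # U+00E9,  2 bytes
  "heart"   => "\u2764",     # U+2764,  3 bytes
  "emoji"   => "\u{1F600}",  # U+1F600, 4 bytes
}

# bytesize counts bytes, while length would count characters.
sizes = samples.transform_values(&:bytesize)
sizes.each { |label, size| puts "#{label}: #{size} byte(s)" }

# Backwards compatibility: ASCII-only text is byte-for-byte the same.
ascii_bytes = "hello".encode("US-ASCII").bytes
utf8_bytes  = "hello".encode("UTF-8").bytes
puts ascii_bytes == utf8_bytes  # true
```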
  10. In Ruby <=1.8
      ‣ Strings were treated as raw bytes
      ‣ Concatenating strings of different encodings would result in corrupt strings
      ‣ Counting chars in multi-byte encoded strings was inaccurate
  11. In Ruby 1.9+
      ‣ Strings keep their native encoding
      ‣ Strings are “tagged” with an encoding
      ‣ String methods are encoding-aware
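A short sketch of what tagging means in practice on Ruby 1.9 and later. It contrasts `force_encoding` (changes only the tag) with `encode` (converts the bytes), and shows that mixing incompatible strings raises an error. The variable names are illustrative.

```ruby
name = "Jos\u00e9"  # a UTF-8 string: 4 characters, 5 bytes

puts name.encoding  # UTF-8
puts name.length    # 4
puts name.bytesize  # 5

# force_encoding only changes the tag; the bytes stay the same,
# so the binary view counts 5 one-byte characters.
raw = name.dup.force_encoding("ASCII-8BIT")
puts raw.length     # 5

# encode transcodes: the bytes change so the characters stay the same.
latin1 = name.encode("ISO-8859-1")
puts latin1.bytesize  # 4

# Two non-ASCII strings with different tags cannot be concatenated.
mixing_error = begin
  name + latin1
  nil
rescue Encoding::CompatibilityError => e
  e
end
puts mixing_error.class  # Encoding::CompatibilityError
```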
  12. Source file encodings
      ‣ Assumed to be US-ASCII unless explicitly specified (Ruby 1.9; Ruby 2.0+ assumes UTF-8)
      ‣ Any source file w/ non-ASCII characters needs an encoding comment
      ‣ Determines encoding of all string literals in file
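For reference, a sketch of what a source file with an encoding comment looks like. It assumes the file itself is saved as UTF-8; on Ruby 2.0 and later the comment is redundant for UTF-8 files, but it is harmless.

```ruby
# encoding: utf-8
# The magic comment above has to sit on the first line of the file
# (or the second, after a shebang) to take effect.

name = "José"

puts __ENCODING__   # the encoding Ruby assumed for this source file
puts name.encoding  # string literals in the file inherit that encoding
```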
  13. Convert encoding
      string = "omg ❤"
      string.encode('ISO-8859-1')
      # Encoding::UndefinedConversionError: U+2764 from UTF-8 to ISO-8859-1
      string.encode('ISO-8859-1', :undef => :replace)
      # "omg ?"
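The conversion on the slide above can be run as a standalone script. This sketch rescues the error so both outcomes are visible, and adds the `:replace` option to pick a different substitute; the `"<3"` replacement is an illustrative choice, not something from the deck.

```ruby
string = "omg \u2764"  # same text as the slide, written with an escape

# Without options, a character with no ISO-8859-1 equivalent raises.
error = begin
  string.encode("ISO-8859-1")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
puts error.class  # Encoding::UndefinedConversionError

# :undef => :replace swaps the unmappable character for "?".
replaced = string.encode("ISO-8859-1", undef: :replace)
puts replaced     # omg ?

# :replace chooses the substitute string.
custom = string.encode("ISO-8859-1", undef: :replace, replace: "<3")
puts custom       # omg <3
```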
  14. Default external
      ‣ Encoding used by default when reading from files and streams
      ‣ Global to entire Ruby process
      ‣ Doesn't affect string literals!
      ‣ Set automatically using computer's locale, but overridable using assignment
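A sketch of how the default external encoding shows up when reading a file. It writes a temporary file containing ISO-8859-1 bytes, and it assumes `Encoding.default_internal` is unset (the usual case for a plain Ruby script). The file name prefix and variable names are illustrative.

```ruby
require "tempfile"

latin1_bytes = "Jos\xE9".b  # "José" as raw ISO-8859-1 bytes

file = Tempfile.new("default_external_demo")
file.binmode
file.write(latin1_bytes)
file.close

# With no encoding given, the data is tagged with Encoding.default_external.
# The bytes themselves are not converted.
tagged_by_default = File.read(file.path)

# An explicit external encoding overrides the process-wide default.
tagged_as_latin1 = File.read(file.path, encoding: "ISO-8859-1")

literal = "plain literal"

file.unlink

puts Encoding.default_external
puts tagged_by_default.encoding
puts tagged_as_latin1.encoding  # ISO-8859-1
puts literal.encoding           # follows the source file, not default_external
```

On a machine with a UTF-8 locale, the first read produces a string tagged UTF-8 whose bytes are not valid UTF-8, which is the kind of mismatch that surfaces later as an encoding error.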
  15. Default internal
      ‣ Normally Ruby leaves data from files and streams in its native encoding.
      ‣ Encoding.default_internal causes transcoding to specified encoding (rather than just tagging)
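The difference between tagging and transcoding can be seen per file, without touching the process-wide setting. This sketch uses the `external:internal` form of the file mode string as a stand-in for `Encoding.default_internal`, and assumes `Encoding.default_internal` itself is unset. The file name prefix and variable names are illustrative.

```ruby
require "tempfile"

file = Tempfile.new("default_internal_demo")
file.binmode
file.write("Jos\xE9".b)  # "José" as raw ISO-8859-1 bytes
file.close

# "r:ISO-8859-1" only tags the data; the bytes are left alone.
tagged = File.open(file.path, "r:ISO-8859-1", &:read)

# "r:ISO-8859-1:UTF-8" reads as ISO-8859-1 and transcodes to UTF-8,
# the per-file counterpart of setting Encoding.default_internal.
transcoded = File.open(file.path, "r:ISO-8859-1:UTF-8", &:read)

file.unlink

puts tagged.encoding      # ISO-8859-1
puts tagged.bytesize      # 4
puts transcoded.encoding  # UTF-8
puts transcoded.bytesize  # 5
```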
  16. class Event
        def self.add(format_string, user_name)
          binding.pry # Debugging!
          doc = Nokogiri::XML::DocumentFragment.parse(format_string)
          subject = doc.css('subject').first
          subject.inner_html = subject.inner_html.gsub("$0", user_name)
          create(:body => doc.to_s)
        end
      end
  17. # encoding: utf-8
      describe Event do
        describe ".add" do
          it "should preserve special characters in names" do
            event = Event.add(
              "The user <subject>$0</subject> viewed the page.",
              "José Valim"
            )
            event.body.should =~ /José/
          end
        end
      end
  18. Source file encodings
      ‣ Assumed to be US-ASCII unless explicitly specified (Ruby 1.9; Ruby 2.0+ assumes UTF-8)
      ‣ Any source file w/ non-ASCII characters needs an encoding comment
      ‣ Determines encoding of all string literals in file
  19. describe Event do
        describe ".add" do
          it "should preserve special characters in names" do
            format_string = "The user <subject>$0</subject> viewed the page."
            event = Event.add(
              format_string.force_encoding('US-ASCII'),
              "José Valim"
            )
            event.body.should =~ /José/
          end
        end
      end
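The transcript stops at the reproduction, so the following is only a guess at one possible guard, not the fix the talk arrived at. It assumes the trouble starts with the format string being tagged US-ASCII and normalizes the tag to UTF-8 before any substitution happens. `ensure_utf8` is a hypothetical helper, and the sketch deliberately leaves Nokogiri out so that it runs with core Ruby alone.

```ruby
# Hypothetical helper, not from the deck: return a UTF-8 copy of the text.
def ensure_utf8(text)
  text.encode(Encoding::UTF_8)
end

# Same setup as the spec above: an ASCII-only template tagged US-ASCII.
format_string = "The user <subject>$0</subject> viewed the page.".dup
format_string.force_encoding("US-ASCII")

normalized = ensure_utf8(format_string)
body = normalized.gsub("$0", "Jos\u00e9 Valim")

puts format_string.encoding  # US-ASCII
puts normalized.encoding     # UTF-8
puts body
```

Whether this is enough for `Event.add` depends on how Nokogiri treats the input encoding, which this sketch does not exercise.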