Ruby, Unicode and UTF-8

This talk introduces you to Unicode and UTF-8 encodings and techniques, how to work with these techniques in the Ruby language, and considerations of them in your full application stack.

Allen Fair

June 13, 2012

Transcript

  1. This talk...
    ✦ Unicode Basics
    ✦ “The Genome Project for Linguists”
    ✦ Ruby Unicode Usage
    ✦ Unicode in the Application Stack
    Tuesday, June 12, 2012
  2. Encodings
    ✦ 7-Bit: ASCII
    ✦ 8-Bit: Latin-* (ISO-8859-*)
    ✦ 16-Bit: Asian Languages (CJK)
    ✦ 31-Bit: Unicode for Global Writing Systems (the original UCS design; code points are now capped at U+10FFFF, about 21 bits)
  3. Unicode Glossary
    ✦ Character (smallest component of a writing system)
    ✦ Diacritic Mark (accent, umlaut, etc.)
    ✦ Code Point (Unicode value)
    ✦ Ideograph (Pictograph, symbol)
    ✦ Grapheme (what the end user sees as one character)
  4. UTF-8 Encoding
    ✦ Variable Width (1-6 Bytes in the original design; RFC 3629 limits it to 1-4)
    ✦ Backwards Compatible with ASCII
    ✦ Code Point Values Compatible with Latin-1 (U+0000..U+00FF)
  5. UTF-8 Format
    0xxxxxxx
    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    x = character encoding bit position
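
The bit layouts above can be applied by hand. The following is a minimal sketch, not part of the original talk: `utf8_bytes` is a hypothetical helper name, it covers only the 1 to 4 byte forms that current UTF-8 allows, and it does not reject surrogate code points.

```ruby
# Encode one Unicode code point into UTF-8 bytes by filling the
# "x" positions of the matching bit pattern (1 to 4 byte forms only).
def utf8_bytes(codepoint)
  case codepoint
  when 0x0000..0x007F
    [codepoint]                                     # 0xxxxxxx
  when 0x0080..0x07FF
    [0b11000000 | (codepoint >> 6),                 # 110xxxxx
     0b10000000 | (codepoint & 0b111111)]           # 10xxxxxx
  when 0x0800..0xFFFF
    [0b11100000 | (codepoint >> 12),                # 1110xxxx
     0b10000000 | ((codepoint >> 6) & 0b111111),    # 10xxxxxx
     0b10000000 | (codepoint & 0b111111)]           # 10xxxxxx
  when 0x10000..0x10FFFF
    [0b11110000 | (codepoint >> 18),                # 11110xxx
     0b10000000 | ((codepoint >> 12) & 0b111111),   # 10xxxxxx
     0b10000000 | ((codepoint >> 6) & 0b111111),    # 10xxxxxx
     0b10000000 | (codepoint & 0b111111)]           # 10xxxxxx
  else
    raise ArgumentError, "code point out of range: #{codepoint}"
  end
end
```

The result can be compared against Ruby's own encoder, for example `[0xE9].pack("U").bytes`.
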
  6. Unicode “E-Acute”: é
    Name: Latin small e with acute
    Decomposition: e (U+0065) + Acute (U+0301)
    Unicode: U+00E9 (Dec 233)
    Latin-1: 233
    HTML: &eacute; &#233; &#xe9;
    UTF-8: 0xC3A9, 11000011:10101001
    Error! “Ã©” (UTF-8 bytes read as Latin-1)
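
The values on this slide can be reproduced in Ruby. This is a small sketch with hypothetical helper names (`hex_bytes`, `misread_as_latin1`); it assumes Ruby 1.9 or later, where strings carry an encoding.

```ruby
E_ACUTE = "\u00e9" # é as a single precomposed code point

# Bytes of the string in a given encoding, as uppercase hex pairs.
def hex_bytes(str, encoding)
  str.encode(encoding).bytes.map { |b| format("%02X", b) }
end

# Reinterpret the UTF-8 bytes as Latin-1, then transcode back to UTF-8.
# This reproduces the classic mojibake mistake.
def misread_as_latin1(str)
  str.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end
```

Calling `hex_bytes(E_ACUTE, "UTF-8")` and `hex_bytes(E_ACUTE, "ISO-8859-1")` shows the two-byte and one-byte forms side by side, and `misread_as_latin1(E_ACUTE)` shows what a reader sees when the encoding label is wrong.
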
  7. Working with “é”
    ✦ a = "\u00e9" #=> é
    ✦ a.chars.count #=> 1
    ✦ b = "e\u0301" #=> é
    ✦ b.chars.count #=> 2
    ✦ a == b #=> false
  8. Normalization: Making “é” == “é”
    ✦ Normalization decomposes precomposed characters
    ✦ "\u00e9" #=> "e\u0301"
    ✦ UnicodeUtils.nfkd(a) == UnicodeUtils.nfkd(b)
    ✦ Can recompose them to identical code points
    ✦ UnicodeUtils.nfkc(a)
    ✦ We can now compare, sort, etc.
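
The slide uses the unicode_utils gem. As a sketch of the same idea without the gem, the example below assumes Ruby 2.2 or later, where `String#unicode_normalize` is built in. It uses the canonical NFC form rather than the compatibility forms (NFKD/NFKC) named on the slide, which is enough for this comparison; `same_text?` is a hypothetical helper name.

```ruby
COMPOSED   = "\u00e9"  # é as one code point
DECOMPOSED = "e\u0301" # e followed by a combining acute accent

# Compare two strings after bringing both to the same normalization form.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end
```

`COMPOSED == DECOMPOSED` is false because the code points differ, while `same_text?(COMPOSED, DECOMPOSED)` compares the normalized forms.
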
  9. Removing Accents
    ✦ For URL slugs, searching, etc.
    ✦ Decompose (Letter+Marks)
    ✦ Strip Marks after Letters
    UnicodeUtils.nfkd(data).gsub(/(\p{Letter})\p{Mark}+/,'\\1')
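
A sketch of the same decompose-then-strip approach, assuming Ruby 2.2 or later so that the built-in `String#unicode_normalize` can stand in for `UnicodeUtils.nfkd`. The names `strip_accents` and `slugify` are hypothetical, and the slug rules are only an illustration.

```ruby
# Decompose to letter + combining marks, then drop the marks that
# follow a letter (same regular expression as on the slide).
def strip_accents(text)
  text.unicode_normalize(:nfkd).gsub(/(\p{Letter})\p{Mark}+/, '\1')
end

# Illustrative slug: strip accents, lowercase, and collapse every run
# of characters outside a-z and 0-9 into a single hyphen.
def slugify(text)
  strip_accents(text).downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
end
```
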
  10. Byte Order Mark
    ✦ At Start of File, indicates
      ✦ Encoding
      ✦ Endianness
    ✦ Deprecated?
    ✦ Bytes (UTF-8): 0xEF,0xBB,0xBF
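
A minimal sketch of detecting and removing the UTF-8 byte order mark from raw input, assuming Ruby 2.0 or later for `String#b`. The helper name `strip_utf8_bom` is hypothetical.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b # the three BOM bytes as a binary string

# Return the input as a UTF-8 string without a leading BOM.
def strip_utf8_bom(bytes)
  data = bytes.b # binary copy, so the comparison is byte-for-byte
  data = data.byteslice(3..-1) if data.start_with?(UTF8_BOM)
  data.force_encoding("UTF-8")
end
```

When reading files, Ruby can also do this at open time with the mode string `"r:bom|utf-8"`, which skips the mark if it is present.
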
  11. Environment & Locale
    ✦ locale command:
      ✦ LANG="en_US.UTF-8"
      ✦ LC_COLLATE="en_US.UTF-8"
      ✦ LC_CTYPE="en_US.UTF-8"
    ✦ Environment Variables
  12. ICONV: Conversion
    ✦ From Character Set (iso-8859-1)
    ✦ To Character Set (UTF-8)
    ✦ -c option, or UTF-8//IGNORE
    ✦ Latin-1//TRANSLIT (transliteration)
    ✦ Unix: `iconv -f enc -t enc <file1 >file2`
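
Inside Ruby 1.9 and later, similar conversions can be done with `String#encode` instead of shelling out to iconv. The sketch below is an illustration with hypothetical helper names; it assumes Ruby 2.1 or later for `String#scrub`, and the replace-with-"?" behavior is only a rough stand-in for iconv's options, not an equivalent of `//TRANSLIT`.

```ruby
# Label raw bytes as Latin-1 and transcode them to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Convert to ASCII, replacing anything that cannot be represented.
def to_ascii_lossy(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
end

# Drop byte sequences that are not valid UTF-8 (similar in spirit to iconv -c).
def drop_invalid_utf8(bytes)
  bytes.dup.force_encoding("UTF-8").scrub("")
end
```
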
  13. Ruby Encoding Conundrum
    ✦ Ruby is developed in Japan
    ✦ Shift-JIS Encoding is popular there
    ✦ Different versions of Shift-JIS have conflicting code points
    ✦ Some codes do not match 1:1 between Unicode and Shift-JIS
  14. Ruby 1.8
    ✦ # encoding: UTF-8 (first line, or second line after a shebang)
    ✦ $KCODE = 'UTF-8'
    ✦ require 'jcode' (EUC/SJIS)
    ✦ require 'iconv'
    ✦ gems: unicode, rchardet
  15. Ruby 1.8
    ✦ Strings are bytes, not characters
    ✦ Methods splitting strings can fail
    ✦ Regular Expression Suffix: /regex/u
      ✦ Expands \w word character set
      ✦ Respects multibyte characters
    ✦ Iconv.conv(to, from, str)
  16. Ruby 1.9
    ✦ $KCODE deprecated
    ✦ Each string has an .encoding
    ✦ Can mix encodings in the app
    ✦ File Encodings
    ✦ gems: unicode, unicode_utils, rchardet19
  17. Ruby 1.9 Unicode gems
    ✦ Unicode.downcase(string)
    ✦ upcase, capitalize, etc.
    ✦ String#downcase only does ASCII
    ✦ UnicodeUtils.nfkd(string), nfkc()
    ✦ UnicodeUtils.canonical_equivalents?()
  18. 1.9 File Encodings
    ✦ ruby -E external:internal
    ✦ File.open(name, mode, external_encoding: "iso-8859-2", internal_encoding: "UTF-8")
    ✦ Does not discover or translate data
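
A short sketch of the `File.open` form above: the external encoding describes the bytes on disk, and the internal encoding is what the program receives. `read_as_utf8` is a hypothetical helper name, and the caller still has to know the file's real encoding, since Ruby does not detect it.

```ruby
# Read a file whose bytes are in `external` and return a UTF-8 string.
# Ruby transcodes from the external to the internal encoding while reading.
def read_as_utf8(path, external)
  File.open(path, "r", external_encoding: external, internal_encoding: "UTF-8") do |file|
    file.read
  end
end
```

The mode-string shorthand `File.open(path, "r:iso-8859-1:utf-8")` expresses the same pair of encodings.
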
  19. Ruby 1.9 Unicode Regular Expressions
    ✦ Matches “code points” not graphemes
    ✦ /./ matches a “code point”
    ✦ /\u0034/ matches this code point
    ✦ /\p{name}/ matches by property name
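
The difference between code points and graphemes shows up with a decomposed “é”. This sketch assumes Ruby 2.0 or later, where the regular expression engine supports `\X` for a grapheme cluster; the helper names are hypothetical.

```ruby
DECOMPOSED_E_ACUTE = "e\u0301" # "e" followed by a combining acute accent

# /./ consumes one code point per match.
def code_point_matches(str)
  str.scan(/./)
end

# \X consumes one extended grapheme cluster per match.
def grapheme_matches(str)
  str.scan(/\X/)
end
```

On `DECOMPOSED_E_ACUTE`, the first helper returns two matches and the second returns one, which is why matching on `/./` can split a letter from its accent.
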
  20. Character Access
    ✦ str.each_byte {}
    ✦ str.each_char {}
    ✦ str.each_codepoint {}
    ✦ str.bytesize
    ✦ str.size
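
The methods listed above give different views of the same string. A minimal sketch, where `describe` is a hypothetical helper that collects them into a hash:

```ruby
# Collect the byte, character, and code point views of a string.
def describe(str)
  {
    bytes:      str.each_byte.to_a,      # raw bytes in the string's encoding
    chars:      str.each_char.to_a,      # one-character strings
    codepoints: str.each_codepoint.to_a, # integer code point values
    bytesize:   str.bytesize,            # length in bytes
    size:       str.size                 # length in characters (code points)
  }
end
```

For a UTF-8 string such as `"e\u0301\u00e9"`, `bytesize` and `size` differ because the combining accent and the precomposed é each take two bytes.
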
  21. Unicode Properties
    ✦ \p{Letter}, \p{Lowercase_Letter}
    ✦ \P{Punctuation}, \p{^CurrencySymbol}
    ✦ \P{name} and \p{^name} to negate
    ✦ \p{Hebrew}, \p{Latin}
    ✦ \p{InBasic_Latin} (U+0000..U+007F)
      ✦ Not implemented (1.9.3)
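
A few of these properties in use, as a sketch with hypothetical helper names. It uses the short category name `\p{Sc}` for currency symbols and leaves out the block form (`\p{In...}`), which the slide marks as not implemented in 1.9.3.

```ruby
# All characters with the general category Letter.
def letters_in(str)
  str.scan(/\p{Letter}/)
end

# \P{...} negates the property: everything that is not a letter.
def non_letters_in(str)
  str.scan(/\P{Letter}/)
end

# Currency symbols, using the short general category name Sc.
def currency_symbols_in(str)
  str.scan(/\p{Sc}/)
end

# Classify a single character by script property.
def script_of(char)
  case char
  when /\A\p{Hebrew}\z/ then :hebrew
  when /\A\p{Latin}\z/  then :latin
  else :other
  end
end
```
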
  22. Application Stack
    ✦ Database (SQL, NoSQL)
    ✦ File System
    ✦ Mail Server
    ✦ Web Server
    ✦ Search Engine
    ✦ Javascript and Libraries
  23. Files
    ✦ Files have no Encoding setting
    ✦ Specify UTF-8 encodings when you can
    ✦ file command or rchardet to determine
  24. Int’l Domain Names
    ✦ http://‘→❄→‚→‗→☺→’→☹" " .ws/
    ✦ Beware Unicode IDN Spoofing
    ✦ Punycode
    ✦ http://xn--55gaaaaaa281gfaqg86dja792anqa.ws/
    ✦ gem install punycode4r
  25. Email Addresses
    ✦ Domain to Punycode
    ✦ Local part is locally defined
    ✦ No standard encoding of local part
    ✦ SMTP does not support non-7-bit data
    ✦ So: Stick to 7-bit encoding
  26. Unicode & SMTP
    ✦ MIME Words:
      ✦ Subject: =?iso-8859-1?Q?=A1Hola,_se=F1or!?=
      ✦ Subject: ¡Hola, señor!
    ✦ Content-Type: text/html; charset=UTF-8
    ✦ Bad data can be sent (Usually Spam)
    ✦ Don’t trust your results, inspect them!
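
To show how a MIME encoded word maps back to text, here is a minimal sketch that decodes a single word of the form `=?charset?Q?...?=` or `=?charset?B?...?=`. `decode_mime_word` is a hypothetical helper that handles only one well-formed word; real mail should go through a mail library.

```ruby
# Decode one MIME encoded word (RFC 2047 style) into a UTF-8 string.
# Returns the input unchanged if it does not look like an encoded word.
def decode_mime_word(word)
  match = /\A=\?([^?]+)\?([QqBb])\?([^?]*)\?=\z/.match(word)
  return word unless match

  charset, scheme, payload = match.captures
  bytes =
    if scheme.upcase == "Q"
      # Q encoding: "_" stands for a space, "=XX" is a hex-encoded byte.
      payload.tr("_", " ").unpack("M").first
    else
      # B encoding: plain Base64.
      payload.unpack("m").first
    end
  bytes.force_encoding(charset).encode("UTF-8")
end
```

Applied to the Subject value on the slide, this should give back the readable Spanish greeting.
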
  27. URL Encoding
    ✦ URLs are ASCII, not Unicode :-(
    ✦ Searching for 你好世界 ...
    ✦ http://www.google.com/?q=%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
    ✦ http://www.reddit.com/search?q=%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
    ✦ You broke Reddit!
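
The percent-encoded query above is the UTF-8 bytes of the search text, one `%XX` per byte. A short sketch using Ruby's standard `uri` library; `search_url` is a hypothetical helper name.

```ruby
require "uri"

# Build a search URL with the query percent-encoded as UTF-8 bytes.
def search_url(base, query)
  "#{base}?#{URI.encode_www_form(q: query)}"
end

# Turn a percent-encoded component back into a UTF-8 string.
def decode_query(component)
  URI.decode_www_form_component(component)
end
```
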
  28. Unicode HTTP/HTML
    HTTP
      Content-Type: text/html; charset=UTF-8
      Accept-Charset: utf-8
    HTML
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <script type="text/javascript" src="javascript.js" charset="utf-8"></script>
      <form accept-charset="utf-8" action='/posts' method='post'>
        <input name="_utf8" type="hidden" value="&#x2603;">
      </form>
    Snowman ☃
  29. Search/Autocomplete
    ✦ Search for “pinata” finds “piñata”
    ✦ Keep canonical values of main fields
    ✦ Remove “combining marks” (accents)
    ✦ Transliterate where applicable
    ✦ Downcase where applicable
    ✦ Consider search index stemming rules
    ✦ Respect Internationalization of Names
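
One way to read the points above: store a canonical search key next to each display value and compare keys, not raw text. The sketch below assumes Ruby 2.4 or later (built-in normalization and Unicode-aware `downcase`); `search_key` and `find_matches` are hypothetical names, and stemming and transliteration are left out.

```ruby
# Canonical key: decompose, drop combining marks, lowercase.
def search_key(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mark}/, "").downcase
end

# Return the display values whose canonical key contains the query's key.
def find_matches(values, query)
  wanted = search_key(query)
  values.select { |value| search_key(value).include?(wanted) }
end
```

In a real application the key would be computed once and stored in an indexed column or search field, not recomputed on every query.
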
  30. Unicode and web API
    ✦ API calls bypass forms, snowmen, etc.
    ✦ Can submit in random encodings!
    ✦ Be friendly to your clients, expect the unexpected
    ✦ Transform API input with rchardet19
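
The slide recommends rchardet19 for detecting the encoding of API input. As a simpler stand-in that uses only core Ruby, the sketch below accepts input that is already valid UTF-8 and otherwise assumes a fallback encoding. This is a different and cruder technique than charset detection; `normalize_api_input` and the Latin-1 default are assumptions for illustration.

```ruby
# Return the input as UTF-8. Valid UTF-8 passes through unchanged;
# anything else is assumed to be in the fallback encoding and transcoded.
def normalize_api_input(bytes, fallback: "ISO-8859-1")
  utf8 = bytes.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?

  bytes.dup.force_encoding(fallback).encode("UTF-8")
end
```

Because every byte sequence is valid Latin-1, the fallback never raises, but it can still guess wrong. Logging the cases where the fallback is taken makes it possible to inspect them later.
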