Ruby, Unicode and UTF-8

Ruby, Unicode & UTF-8
Allen Fair, BIGLIST.com
@allenfair
http://github.com/afair
Tuesday, June 12, 2012

This talk...
✦ Unicode Basics
✦ “The Genome Project for Linguists”
✦ Ruby Unicode Usage
✦ Unicode in the Application Stack

Encodings
✦ 7-Bit: ASCII
✦ 8-Bit: Latin-* (ISO-8859-*)
✦ 16-Bit: Asian Languages (CJK)
✦ 31-Bit: Unicode for Global Writing Systems (the original ISO 10646 range; Unicode is now capped at U+10FFFF, which fits in 21 bits)

Unicode Glossary
✦ Character (smallest component of a writing system)
✦ Diacritic Mark (accent, umlaut, etc.)
✦ Code Point (Unicode value)
✦ Ideograph (Pictograph, symbol)
✦ Grapheme (what the end user sees as one character)

UTF-8 Encoding
✦ Variable Width (1-6 bytes in the original design; 1-4 bytes since RFC 3629)
✦ Backwards Compatible with ASCII
✦ Code point values U+0000-U+00FF match Latin-1 (the encoded bytes differ above 0x7F)

UTF-8 Format
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
x = character encoding bit position
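
The bit layout above can be applied by hand. This is a minimal sketch, not how Ruby encodes strings internally: `utf8_bytes` is a hypothetical helper name, and it only covers the 1- to 4-byte forms, since the 5- and 6-byte forms are no longer valid UTF-8. It does not reject surrogate code points.

```ruby
# Encode one Unicode code point into UTF-8 bytes by hand.
# utf8_bytes is a hypothetical helper written for this illustration.
def utf8_bytes(codepoint)
  case codepoint
  when 0x0000..0x007F
    [codepoint]                                  # 0xxxxxxx
  when 0x0080..0x07FF
    [0xC0 | (codepoint >> 6),                    # 110xxxxx
     0x80 | (codepoint & 0x3F)]                  # 10xxxxxx
  when 0x0800..0xFFFF
    [0xE0 | (codepoint >> 12),                   # 1110xxxx
     0x80 | ((codepoint >> 6) & 0x3F),           # 10xxxxxx
     0x80 | (codepoint & 0x3F)]                  # 10xxxxxx
  when 0x10000..0x10FFFF
    [0xF0 | (codepoint >> 18),                   # 11110xxx
     0x80 | ((codepoint >> 12) & 0x3F),          # 10xxxxxx
     0x80 | ((codepoint >> 6) & 0x3F),           # 10xxxxxx
     0x80 | (codepoint & 0x3F)]                  # 10xxxxxx
  else
    raise ArgumentError, "not a Unicode code point: #{codepoint}"
  end
end

e_acute = utf8_bytes(0x00E9)
puts e_acute.map { |b| format("%08b", b) }.join(" ")  # 11000011 10101001
puts e_acute == [0x00E9].pack("U").bytes              # true: matches Ruby's own encoder
```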

Unicode “E-Acute”: é
Name: Latin small letter e with acute
Decomposition: e (U+0065) + Acute (U+0301)
Unicode: U+00E9 (Dec 233)
Latin-1: 233
HTML: &eacute; &#233; &#xE9;
UTF-8: 0xC3A9, 11000011:10101001
Error! “Ã©” (the UTF-8 bytes displayed as Latin-1)
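
The “Ã©” error can be reproduced in a few lines. This is a sketch for Ruby 1.9 or later; the variable names are only illustrative.

```ruby
e_acute = "\u00e9"                       # é as a UTF-8 string
utf8    = e_acute.bytes                  # the two bytes 0xC3 0xA9

# Mislabel the UTF-8 bytes as Latin-1, then convert for display.
mojibake = e_acute.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# The correct direction: a real Latin-1 byte 233 transcoded to UTF-8.
latin1    = [0xE9].pack("C").force_encoding("ISO-8859-1")
roundtrip = latin1.encode("UTF-8")

puts utf8.map { |b| format("0x%02X", b) }.join(" ")  # 0xC3 0xA9
puts mojibake                                        # Ã©
puts roundtrip == e_acute                            # true
```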

Working with “é”
✦ a = "\u00e9" #=> é
✦ a.chars.count #=> 1
✦ b = "e\u0301" #=> é
✦ b.chars.count #=> 2
✦ a == b #=> false

Normalization: Making “é” == “é”
✦ Normalization decomposes all glyphs
✦ "\u00e9" #=> "e\u0301"
✦ UnicodeUtils.nfkd(a) == UnicodeUtils.nfkd(b)
✦ Can recompose them to identical code points
✦ UnicodeUtils.nfkc(a)
✦ We can now compare, sort, etc.
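
The same comparison can be sketched without a gem. This assumes Ruby 2.2 or later, where `String#unicode_normalize` is in the standard library; it stands in for the UnicodeUtils calls on the slide and is not the talk's own code.

```ruby
composed   = "\u00e9"    # one code point
decomposed = "e\u0301"   # letter plus combining acute accent

same_raw  = composed == decomposed                              # false
same_nfkd = composed.unicode_normalize(:nfkd) ==
            decomposed.unicode_normalize(:nfkd)                 # true
same_nfkc = decomposed.unicode_normalize(:nfkc) == composed     # true

puts same_raw, same_nfkd, same_nfkc
puts composed.unicode_normalize(:nfkd).codepoints
             .map { |c| format("U+%04X", c) }.join(" ")         # U+0065 U+0301
```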

Removing Accents
✦ For URL slugs, searching, etc.
✦ Decompose (Letter+Marks)
✦ Strip Marks after Letters
UnicodeUtils.nfkd(data).gsub(/(\p{Letter})\p{Mark}+/,'\\1')
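
A gem-free sketch of the same idea, assuming Ruby 2.2+ for `String#unicode_normalize`. The helper names `strip_accents` and `slugify` are hypothetical, and the slug rules (lowercase ASCII letters and digits joined by hyphens) are an assumption for illustration.

```ruby
# Decompose, then drop combining marks that follow a letter.
def strip_accents(text)
  text.unicode_normalize(:nfkd).gsub(/(\p{Letter})\p{Mark}+/, '\1')
end

# Build a simple ASCII URL slug from accented text.
def slugify(text)
  strip_accents(text)
    .downcase
    .gsub(/[^a-z0-9]+/, "-")   # runs of other characters become one hyphen
    .gsub(/\A-+|-+\z/, "")     # trim hyphens at either end
end

puts strip_accents("pi\u00f1ata caf\u00e9")       # pinata cafe
puts slugify("Cr\u00e8me Br\u00fbl\u00e9e!")      # creme-brulee
```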

Byte Order Mark
✦ At Start of File, indicates
  ✦ Encoding
  ✦ Endianness
✦ Deprecated?
✦ Bytes (UTF-8): 0xEF,0xBB,0xBF
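
A short sketch of how a UTF-8 byte order mark shows up when reading a file, and how Ruby's `BOM|UTF-8` open mode skips it. The temporary file and its contents are made up for the example.

```ruby
require "tempfile"

raw, stripped = Tempfile.create("bom-demo") do |file|
  file.binmode
  file.write("\xEF\xBB\xBFhello".b)   # BOM bytes followed by ASCII text
  file.flush

  [
    File.read(file.path, encoding: "UTF-8"),        # BOM kept as U+FEFF
    File.open(file.path, "r:BOM|UTF-8", &:read)     # BOM consumed on open
  ]
end

puts raw.start_with?("\uFEFF")   # true
puts raw.size                    # 6 (BOM plus five letters)
puts stripped                    # hello
```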

Environment & Locale
✦ locale command:
  ✦ LANG="en_US.UTF-8"
  ✦ LC_COLLATE="en_US.UTF-8"
  ✦ LC_CTYPE="en_US.UTF-8"
✦ Environment Variables

Unix File Command
✦ file --brief --mime-encoding filename
✦ #=> iso-8859-1

ICONV: Conversion
✦ From Character Set (iso-8859-1)
✦ To Character Set (UTF-8)
✦ -c option, or UTF-8//IGNORE, to drop characters that cannot be converted
✦ Latin-1//TRANSLIT (transliteration)
✦ Unix: iconv -f enc -t enc <file1 >file2
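
In Ruby 1.9 and later, `String#encode` covers the same ground as the iconv command. This sketch assumes the input really is Latin-1, and its `replace: ""` option only approximates `-c`/`//IGNORE`. It does not attempt transliteration.

```ruby
# "piñata" as raw Latin-1 bytes, labelled with its real encoding.
latin1 = [0x70, 0x69, 0xF1, 0x61, 0x74, 0x61].pack("C*")
                                             .force_encoding("ISO-8859-1")

utf8 = latin1.encode("UTF-8")                      # from Latin-1, to UTF-8

# Drop characters the target cannot represent (similar to iconv -c).
ascii_dropped = utf8.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")

# Or mark them instead of dropping them.
ascii_marked = utf8.encode("US-ASCII", undef: :replace, replace: "?")

puts utf8            # piñata
puts ascii_dropped   # piata
puts ascii_marked    # pi?ata
```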

Ruby Encoding Conundrum
✦ Ruby is developed in Japan
✦ Shift-JIS Encoding is popular there
✦ Different versions of Shift-JIS have conflicting code points
✦ Some codes do not match 1:1 between Unicode and Shift-JIS

Ruby 1.8
✦ # encoding: UTF-8 (second line)
✦ $KCODE = 'UTF-8'
✦ require 'jcode' (EUC/SJIS)
✦ require 'iconv'
✦ gems: unicode, rchardet

Ruby 1.8
✦ Strings are bytes, not characters
✦ Methods splitting strings can fail
✦ Regular Expression Suffix: /regex/u
  ✦ Expands \w word character set
  ✦ Respects multibyte characters
✦ Iconv.conv(to, from, str)

Ruby 1.9
✦ $KCODE deprecated
✦ Each string has an .encoding
✦ Can mix encodings in the app
✦ File Encodings
✦ gems: unicode, unicode_utils, rchardet19

Ruby 1.9 Unicode gems
✦ Unicode.downcase(string)
✦ upcase, capitalize, etc.
✦ String#downcase only does ASCII (in 1.9)
✦ UnicodeUtils.nfkd(string), nfkc()
✦ UnicodeUtils.canonical_equivalents?()
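
For comparison, a sketch of how this looks on newer Rubies. It assumes Ruby 2.4 or later, where `String#downcase` handles Unicode by default and the `:ascii` option reproduces the 1.9 behaviour described above.

```ruby
word = "\u00c9COLE"                    # ÉCOLE

full_unicode = word.downcase           # Ruby 2.4+: É is lowercased too
ascii_only   = word.downcase(:ascii)   # only A-Z change, as in Ruby 1.9

puts full_unicode   # école
puts ascii_only     # École
```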

1.9 File Encodings
✦ ruby -E external:internal
✦ File.open(name, mode, external_encoding: "iso-8859-2", internal_encoding: "UTF-8")
✦ Does not discover the encoding; transcodes only between the encodings you declare
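
A small sketch of the `File.open` options in use. The file contents are invented for the example, and the external encoding has to be declared correctly because Ruby does not detect it.

```ruby
require "tempfile"

text = Tempfile.create("latin2-demo") do |file|
  file.binmode
  file.write([0x5A, 0xF3, 0xB3, 0xE6].pack("C*"))   # "Zółć" in ISO-8859-2
  file.flush

  # Read ISO-8859-2 bytes from disk, hand UTF-8 strings to the program.
  File.open(file.path, "r",
            external_encoding: "ISO-8859-2",
            internal_encoding: "UTF-8") { |f| f.read }
end

puts text            # Zółć
puts text.encoding   # UTF-8
puts text.bytesize   # 7 (Z is one byte, the other three take two each)
```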

Ruby 1.9 Unicode Regular Expressions
✦ Matches “code points” not graphemes
✦ /./ matches a “code point”
✦ /\u0034/ matches this code point
✦ /\p{name}/ matches by property name

Character Access
✦ str.each_byte {}
✦ str.each_char {}
✦ str.each_codepoint {}
✦ str.bytesize
✦ str.size
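
These accessors give different views of the same string. The sample string is made up for illustration. The last line uses `grapheme_clusters`, which assumes Ruby 2.5 or later and is not part of the 1.9 API listed above.

```ruby
str = "e\u0301\u00e9"   # decomposed é followed by precomposed é

bytes      = str.each_byte.to_a        # raw UTF-8 bytes
chars      = str.each_char.to_a        # one entry per code point
codepoints = str.each_codepoint.to_a   # code points as integers

puts bytes.map { |b| format("%02X", b) }.join(" ")   # 65 CC 81 C3 A9
puts str.bytesize                                    # 5
puts str.size                                        # 3
puts codepoints.map { |c| format("U+%04X", c) }.join(" ")  # U+0065 U+0301 U+00E9
puts str.grapheme_clusters.size                      # 2 (what the reader sees)
```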

Unicode Properties
✦ \p{Letter}, \p{Lowercase_Letter}
✦ \P{Punctuation}, \p{^CurrencySymbol}
✦ \P{name} and \p{^name} to negate
✦ \p{Hebrew}, \p{Latin}
✦ \p{InBasic_Latin} (U+0000..U+007F)
  ✦ Not implemented (1.9.3)
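
A few of these properties in use. The sample text is invented, and the sketch sticks to general-category and script properties, leaving block properties out.

```ruby
text = "Caf\u00e9 \u05e9\u05dc\u05d5\u05dd $5, \u20ac7!"   # Café, a Hebrew word, $5, €7!

letters  = text.scan(/\p{Letter}+/)            # runs of letters in any script
hebrew   = text.scan(/\p{Hebrew}+/)            # runs in the Hebrew script only
currency = text.scan(/\p{Currency_Symbol}/)    # $ and the euro sign
no_punct = text.gsub(/\p{Punctuation}/, "")    # drops the comma and the "!"

# \P{...} and \p{^...} are two spellings of the same negation.
same_negation = text.scan(/\P{Letter}/) == text.scan(/\p{^Letter}/)

puts letters.inspect
puts hebrew.inspect
puts currency.inspect
puts no_punct
puts same_negation   # true
```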

Application Stack
✦ Database (SQL, NoSQL)
✦ File System
✦ Mail Server
✦ Web Server
✦ Search Engine
✦ JavaScript and Libraries

Files
✦ Files have no Encoding setting
✦ Specify UTF-8 encodings when you can
✦ file command or rchardet to determine

Int’l Domain Names
✦ http://‘→❄→‚→‗→☺→’→☹" " .ws/
✦ Beware Unicode IDN Spoofing
✦ Punycode
✦ http://xn--55gaaaaaa281gfaqg86dja792anqa.ws/
✦ gem install punycode4r

Email Addresses
✦ Domain to Punycode
✦ Local part is locally defined
✦ No standard encoding of local part
✦ SMTP does not support non-7-bit data
✦ So: Stick to 7-bit encoding

Unicode & SMTP
✦ MIME Words:
  ✦ Subject: =?iso-8859-1?Q?=A1Hola,_se=F1or!?=
  ✦ Subject: ¡Hola, señor!
✦ Content-Type: text/html; charset=UTF-8
✦ Bad data can be sent (Usually Spam)
✦ Don’t trust your results, inspect them!
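
A MIME encoded word like the one above can be decoded by hand. This is a minimal sketch: `decode_q_word` is a hypothetical helper that handles a single Q-encoded word only (no B encoding, no multi-word headers), and it assumes the named charset is one Ruby knows. Real mail code should use a mail library.

```ruby
# Decode one "=?charset?Q?payload?=" word into a UTF-8 string.
def decode_q_word(word)
  match = /\A=\?([^?]+)\?[Qq]\?([^?]*)\?=\z/.match(word)
  return word unless match                       # not an encoded word: pass through

  charset, payload = match.captures
  bytes = payload.tr("_", " ").unpack1("M")      # "_" means space, "=XX" is a hex byte
  bytes.force_encoding(charset).encode("UTF-8")  # label with the charset, then convert
end

puts decode_q_word("=?iso-8859-1?Q?=A1Hola,_se=F1or!?=")   # ¡Hola, señor!
```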

URL Encoding
✦ URLs are ASCII, not Unicode :-(
✦ Searching for 你好世界 ...
✦ http://www.google.com/?q=%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
✦ http://www.reddit.com/search?q=%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
✦ You broke Reddit!
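
Percent-encoding of the UTF-8 bytes, sketched with Ruby's standard `uri` library. The query string is the one encoded in the URLs above.

```ruby
require "uri"

query   = "\u4f60\u597d\u4e16\u754c"                # the four characters being searched for
encoded = URI.encode_www_form_component(query)      # each UTF-8 byte becomes %XX
decoded = URI.decode_www_form_component(encoded)    # back to a UTF-8 string
url     = "http://www.google.com/?" + URI.encode_www_form(q: query)

puts encoded            # %E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
puts decoded == query   # true
puts url
```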

Unicode HTTP/HTML
HTTP
Content-Type: text/html; charset=UTF-8
Accept-Charset: utf-8
HTML
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script type="text/javascript" src="javascript.js" charset="utf-8"></script>
<form accept-charset="utf-8" action='/posts' method='post'>
<input name="_utf8" type="hidden" value="☃">
</form>
Snowman: ☃

Search/Autocomplete
✦ Search for “pinata” finds “piñata”
✦ Keep canonical values of main fields
✦ Remove “combining marks” (accents)
✦ Transliterate where applicable
✦ Downcase where applicable
✦ Consider search index stemming rules
✦ Respect Internationalization of Names

Unicode and web API
✦ API calls bypass forms, snowmen, etc.
✦ Can submit in random encodings!
✦ Be friendly to your clients, expect the unexpected
✦ Transform API input with rchardet19
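
Without a detection gem, a rough fallback is to accept input that is already valid UTF-8 and treat everything else as one assumed legacy encoding. This is a sketch only: `to_utf8` is a hypothetical helper, the ISO-8859-1 fallback is an assumption, and unlike rchardet19 it does no statistical detection.

```ruby
# Return a UTF-8 string for raw bytes received from an API client.
def to_utf8(raw, fallback: "ISO-8859-1")
  text = raw.dup.force_encoding("UTF-8")
  return text if text.valid_encoding?               # already good UTF-8

  raw.dup.force_encoding(fallback).encode("UTF-8")  # assume the fallback encoding
end

puts to_utf8("caf\u00e9".b)    # café (was valid UTF-8 all along)
puts to_utf8("caf\xE9".b)      # café (Latin-1 byte 0xE9 converted)
```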

Ruby UTF-8 Examples
✦ 1.9: https://gist.github.com/2877685
✦ 1.8: https://gist.github.com/2911107

✦ http://www.joelonsoftware.com/articles/Unicode.html
✦ http://blog.grayproductions.net/articles/understanding_m17n
✦ http://www.regular-expressions.info/unicode.html
✦ http://training.perl.com/tcpc/OSCON2011/gbu/gbu.html (Unicode: Good, Bad & Ugly)
✦ http://www.w3.org/International/wiki/Personal_names
Thanks!
