This talk introduces you to Unicode and UTF-8 encodings and techniques, how to work with these techniques in the Ruby language, and considerations of them in your full application stack.
10xxxxxx 10xxxxxx 111110xx 10xxxxx 10xxxxxx 10xxxxxx 10xxxxxx 1111110x 10xxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx x = character encoding bit position Tuesday, June 12, 2012
UnicodeUtils.nfkd(a)== UnicodeUtils.nfkd(b) ✦ Can recompose them to identical code points ✦ UnicodeUtils.nfkc(a) ✦ We can now compare, sort, etc. Normalization: Making “é” == “é” Tuesday, June 12, 2012
Shift-JIS Encoding is popular there ✦ Different versions of Shift-JIS have conflicting code points ✦ Some codes do not match 1:1 between Unicode and Shift-JIS Tuesday, June 12, 2012
✦ Subject: ¡Hola, señor! ✦ Content-Type: text/html; charset=UTF-8 ✦ Bad data can be sent (Usually Spam) ✦ Don’t trust your results, inspect it! Tuesday, June 12, 2012
values of main fields ✦ Remove “combining marks” (accents) ✦ Transliterate where applicable ✦ Downcase where applicable ✦ Consider search index stemming rules ✦ Respect Internationalization of Names Tuesday, June 12, 2012
etc. ✦ Can submit in random encodings! ✦ Be friendly to your clients, expect the unexpected ✦ Transform API input with rchardet19 Tuesday, June 12, 2012