Everything you always wanted to know about UTF-8 (but never dared to ask)

Presented on October 28th 2013 at the International PHP Conference, Munich, Germany.
http://phpconference.com/2013/en
---------------------------------------------------------------
For any application with even the remotest ambition of international use, the only way to go is UTF-8. Even without that ambition, UTF-8 may bring you more benefits than you currently realize. Unfortunately, most developers run into problems implementing UTF-8 at one point or another and get discouraged. That ends now! In this talk I will cover UTF-8 from the basic linguistics, through client-side aspects, to all the steps you need to take to tackle the most common (and some more obscure) issues when using UTF-8 in a database-driven web application.

Juliette Reinders Folmer

October 28, 2013

Transcript

  1. Juliette Reinders Folmer | Advies en zo
    Everything you always wanted to know about UTF-8*
    * But never dared to ask
  2. “Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.” Mark Pilgrim, April 2004
    “Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.” J. Graham, April 2004
  3. Some common misconceptions
    • Unicode !== UTF-8
    • UTF-8 !== internationalization
    • UTF-8 !== charset
  4. Why worry about it anyway?
    • It’s all about being prepared:
      – Company/client gets taken over by a foreign company
      – Mergers
      – Expansion to other regions
      – Local users/employees from other origins
    • Code efficiency
    • Cost
    • Global market
  5. Some language statistics
    • 7105 ‘living’ languages
    • +/- 308 languages with > 1 million speakers
    • Nr 1 language in the world? Mandarin Chinese
    • Nr 2? Spanish
    Did you know:
    • That Germany has 27* officially recognized languages?
    • That the country with the most languages is Papua New Guinea (836)?
    * Alemannic, Bavarian, Danish, Frankish, Eastern Frisian, Northern Frisian, Standard German, Kabardian, Kölsch, Limburgish, Luxembourgeois, Mainfränkisch, Pfaelzisch, Plautdietsch, Polish, Balkan Romani, Sinte Romani, Vlax Romani, Saterfriesisch, Low Saxon, Upper Saxon, Lower Sorbian, Upper Sorbian, Swabian, Westphalien, Yeniche, Western Yiddish.
    Source: Ethnologue 2013
  6. Top 20 languages in the world*
    Rank  Language                   Total speakers (M)  Script
     1    Mandarin Chinese           1,197               Vernacular Chinese
     2    Spanish                      406               Roman
     3    English                      335               Roman
     4    Hindi                        260               Devanagari
     5    Arabic (standard)            223               Arabic
     6    Portuguese                   202               Roman
     7    Bengali                      193               Bengali
     8    Russian                      162               Cyrillic
     9    Japanese                     122               Hiragana, Katakana, and Kanji
    10    Javanese                      84               Javanese
    11    German (standard)             84               Roman
    12    Lahnda (Western Punjabi)      83               Lahnda, Arabic
    13    Telugu                        74               Telugu
    14    Marathi                       72               Devanagari
    15    Tamil                         69               Tamil
    16    French                        69               Roman
    17    Vietnamese                    68               Roman
    18    Korean                        66               Korean (Hangul)
    19    Urdu                          63               Arabic
    20    Italian                       61               Roman
    * Source: Ethnologue 2013
  7. About writing systems
    • There are approximately 180* writing systems in active use
    • Most are used (with or without extensions) for several languages
    • Some languages use more than one writing system
    • Numerous other writing systems for ceremonial or religious use
    • Or for fun ;-)
    * Source: Omniglot
  8. Writing system resources
    • Info on languages: http://www.ethnologue.com/
    • Info on writing systems: http://www.omniglot.com/
    • Which characters are used in language X? http://www.eki.ee/letter/
    • Info on Latin extensions for African languages: http://www.bisharat.net/A12N/
    • And of course: http://en.wikipedia.org/wiki/Writing_system
  9. On character sets and encoding
    “A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points.” (W3C)
    “The character encoding reflects the way these abstract characters are mapped to bytes for manipulation in a computer.” (W3C)
  10. Unicode
    “Unicode is a computing industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.” (Wikipedia)
    • Unicode code charts: http://www.unicode.org/charts/
  11. UTF
    • UTF = Unicode Transformation Format
    • UTF-8 is one of the character encodings for implementing Unicode
    • Alternatives are UTF-7 (legacy), UTF-16, UTF-32
    • UTF-8 is (backward) compatible with ASCII, UTF-16/32 are not.
    * Image source: W3C
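To make the ASCII compatibility concrete, here is a minimal sketch (not from the talk) that prints how many bytes UTF-8 uses for a few characters. The sample characters and the helper name `utf8_byte_length()` are illustrative assumptions; the strings are written as byte escapes so the result does not depend on how the source file itself is saved.

```php
<?php
// Hypothetical helper: strlen() counts bytes, which is exactly what we want here.
function utf8_byte_length( $string ) {
    return strlen( $string );
}

$samples = array(
    'A (U+0041)'      => "A",                // ASCII: the same single byte in ASCII and UTF-8
    'e-acute (U+00E9)' => "\xC3\xA9",        // 2 bytes
    'euro (U+20AC)'    => "\xE2\x82\xAC",    // 3 bytes
    'G clef (U+1D11E)' => "\xF0\x9D\x84\x9E", // outside the Basic Multilingual Plane: 4 bytes
);

foreach ( $samples as $label => $bytes ) {
    echo $label . ': ' . utf8_byte_length( $bytes ) . " byte(s)\n";
}
```

Only the first sample is plain ASCII, which is why existing ASCII text is already valid UTF-8 without any conversion.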
  12. Advantages of UTF-8
    • Backward compatible with ASCII
    • UTF-8 can encode any Unicode character
    • XML parsers are required to support UTF-8 and UTF-16
    • UTF-8 and UTF-16 are the standards for having Unicode in HTML. UTF-8 is preferred.
    • Can be fairly reliably recognized, with a small chance of confusion.
    • Sorting UTF-8 strings as arrays of unsigned bytes will result in the same order as sorting on Unicode code point.
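The sorting property in the last bullet can be checked with a small sketch. The sample characters are arbitrary assumptions, and `strcmp()` is used because it compares strings byte by byte; nothing here is taken from the talk itself.

```php
<?php
// Code point => UTF-8 bytes (written as escapes so the file encoding is irrelevant).
$chars = array(
    0x20AC  => "\xE2\x82\xAC",     // euro sign
    0x0041  => "A",
    0x1D11E => "\xF0\x9D\x84\x9E", // musical symbol G clef
    0x00E9  => "\xC3\xA9",         // e with acute
    0x007A  => "z",
);

// Order 1: sort the code points numerically.
$by_code_point = array_keys( $chars );
sort( $by_code_point, SORT_NUMERIC );

// Order 2: sort the encoded strings byte by byte, keeping the code point keys.
$by_bytes = $chars;
uasort( $by_bytes, 'strcmp' );

// Compare the two orders.
var_dump( array_keys( $by_bytes ) === $by_code_point );
```

Note that this is code point order, not a linguistically correct sort order; that is what collations are for.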
  13. So, what’s the problem?
    • Everything defaults to non-UTF-8: mostly the Latin writing system, ISO-8859-1 or US-ASCII
    So, what’s the solution?
    • Be EXPLICIT everywhere (and I don’t mean $%&@-explicit)
  14. We’ll be covering:
    • Dependency on user’s computer setup
    • Client side code – HTML, CSS, JS
    • Communication with the client
    • Server side code – PHP
    • Communicating with a MySQL database
    • MySQL
    • Communicating with files
    • Other common issues
  16. User’s computer
    Potential issues:
    • Extended language support?
    • Code pages?
    • Font?
    • Browsers, browsers, browsers
  17. Characteristics of text
    Characteristic        English example          Arabic example
    Sample text           “What I really love”     “ار ا يذ ا ا “
    Language              English                  Arabic
    Writing system        Roman/Latin              Arabic
    Writing direction     Left to right            Right to left
    Writing direction     Top to bottom            Top to bottom
    Character (sub)set    Basic Latin              Arabic
    Character encoding    UTF-16 (can vary)        UTF-16 (...)
    Font                  Arial                    Arial
    Meaning
  18. Languages:
    • English • Greek • Ukrainian • Mandarin Chinese • Japanese • Hindi • Korean • Kannada • Punjabi Gurmukhi • Tamil • Tigre • Myanmar • Arabic • Farsi • Hebrew
  19. About fonts
    • Unicode versus non-Unicode fonts: be aware & be wary!
    • Few fonts are capable of handling a wide range of Unicode characters. Examples: Arial Unicode MS, Bitstream Cyberbit, Code2000, GNU Unifont
  20. Fonts used:
    • Verdana • Verdana • Verdana • SimSun • MS Mincho • Code2000 • Batang • Arial Unicode MS • Lohit Punjabi • Latha • GS GeezMahtemUnicode • WinInnwa • Arial Unicode MS • Arial • Arial Unicode MS
  21. These are the same phrases converted to the Verdana font. WinInnwa is a non-Unicode compliant font...
  22. These are the same phrases again, now converted to the Arial Unicode MS font. The characters of the Ge’ez script (Ethiopic range) are not included in Arial Unicode MS.
  23. Useful font-related resources:
    • Gallery of Unicode fonts – find fonts per writing system/language: http://www.wazu.jp/
    • Unicode test pages and more: http://www.alanwood.net/unicode/
    • Unicode typefaces: http://en.wikipedia.org/wiki/Unicode_typefaces
    • Font frequency on computers: http://www.codestyle.org/
  24. Some font viewers:
    • Free and easy Font viewer: http://www.styopkin.com/details_free_and_easy_fonts_viewer.html
    • ListFont: http://www.heiner-eichmann.de/software/listfont/listfont.htm
  26. Client side – HTML
    • Use meta-headers:
        <meta charset="utf-8">
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    • Tell the browser (and the search engines) the language too if you can:
        <html dir="[DIRECTION – ltr/rtl]" lang="[LANGCODE – e.g. de-DE or zh-cn for Mandarin Chinese]">
        <meta name="language" content="[LANGCODE]">
        <meta http-equiv="Content-Language" content="[LANGNAME - German]">
  27. Client side – XHTML
    • Add an XML declaration at the top:
        <?xml version="1.0" encoding="utf-8"?>
        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
        <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="[LANGCODE]" lang="[LANGCODE]">
    • Be aware of quirks mode vs standards mode
  28. Client side – CSS
    • Add an encoding to the CSS file – must be on the very first line of the file!
        @charset "utf-8";
    • Use Unicode compliant fonts and tell the browser which to use with CSS:
        <p lang="zh-cn">我很喜欢</p>
        P[LANG|="zh"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
        P[LANG="zh-cn"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
        P[LANG|="ar"] { font-family: Arial, "Arial Unicode MS", sans-serif; direction: rtl; }
  29. Useful client side resources:
    • W3C on best practices for internationalization: http://www.w3.org/International/techniques/authoring-html
    • ISO language codes: http://www.sil.org/iso639-3/codes.asp
    • ISO country codes: http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements
    • W3C on language tags in HTML and XML: http://www.w3.org/International/articles/language-tags/
  31. Putting files on the server
    • Save the file(s) encoded as UTF-8
    • Don’t forget to upload the files as binary rather than ASCII!
  32. Client-server communication: sending data to the client
    • Send an HTTP header:
        header( 'Content-Type: text/html; charset=utf-8' );
    • Or set it in .htaccess / Apache’s httpd.conf
    Settings in .htaccess overrule the HTML meta headers!
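A minimal sketch of sending that header from PHP. The helper name `content_type_header()` and its defaults are assumptions made for illustration; the `headers_sent()` guard is there because `header()` only works before any output has been sent.

```php
<?php
// Hypothetical helper that builds the header line; name and defaults are illustrative.
function content_type_header( $mime = 'text/html', $charset = 'utf-8' ) {
    return 'Content-Type: ' . $mime . '; charset=' . $charset;
}

// Must run before any output, including a stray blank line or a BOM at the top of a file.
if ( ! headers_sent() ) {
    header( content_type_header() );
}
```

Keeping the charset in one place like this makes it harder for individual pages to fall back to the server default by accident.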
  33. Client-server communication: .htaccess examples
    • # Maps file extensions to a character encoding. Especially useful in content negotiation situations. (httpd.conf)
        AddCharset utf-8 .utf8
    • # Pass the default character encoding for content-type text/plain and text/html
      # Syntax: AddDefaultCharset On|Off|charset
        AddDefaultCharset UTF-8
      # AddDefaultCharset On => iso-8859-1
    • # Add a default character encoding per file extension
        AddType 'text/html; charset=UTF-8' html
    • # Identify the encoding for a particular file:
        <Files ~ "events\.html">
            ForceType 'text/html; charset=UTF-8'
        </Files>
  35. Client-server communication: receiving data from the client
    • Make sure you send information to the server in the correct encoding
    • This is especially important for user input, i.e. forms!!!
        <form accept-charset="utf-8">
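Even with `accept-charset` set, the server cannot assume that what arrives is well-formed UTF-8. The following is a hedged sketch of checking a submitted value before using it; the helper name `utf8_input()` and the form field `name` are hypothetical.

```php
<?php
// Hypothetical helper: returns the value if it is well-formed UTF-8, otherwise null.
function utf8_input( array $source, $key ) {
    if ( ! isset( $source[ $key ] ) || ! is_string( $source[ $key ] ) ) {
        return null;
    }
    // With the /u modifier, preg_match() returns false on malformed UTF-8
    // and 1 for the empty pattern on any well-formed string.
    return ( preg_match( '//u', $source[ $key ] ) === 1 ) ? $source[ $key ] : null;
}

$name = utf8_input( $_POST, 'name' ); // 'name' is an assumed form field
```

What to do with rejected input (refuse it, or try to convert it from a guessed encoding) is an application decision that this sketch leaves open.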
  37. PHP
    • Currently not very friendly for UTF-8
    • PHP6 development dormant
    • Some PHP extensions come to the rescue: MBstring, iconv, ~Intl
    • There are also some nifty function collections / classes available to help you. Take note of: http://sourceforge.net/projects/phputf8
  38. PHP UTF-8 safe functions
    Safe:
    • explode()
    • str_replace()
    • PHP5+ ~htmlentities()
    NOT safe:
    • Everything else
    MUST READ: http://www.phpwact.org/php/i18n/utf-8
  39. Test for well-formedness
    function utf8_compliant( $string ) {
        if ( strlen( $string ) == 0 ) {
            return true;
        }
        // With the /u modifier preg_match() fails on malformed UTF-8.
        return ( preg_match( '/^.{1}/us', $string, $array ) == 1 );
    }
  40. PCRE
    • PCRE can be relatively UTF-8 safe if compiled with Unicode support. Use:
        preg_match( '/^.+$/u', $string );
    • Test whether PCRE has been compiled with Unicode support:
        if ( preg_match( '/^.{1}$/u', "ñ", $UTF8_ar ) != 1 ) {
            trigger_error( 'PCRE is not compiled with UTF-8 support', E_USER_ERROR );
        }
  41. Handling text
    • You don’t need htmlentities() anymore. Use htmlspecialchars() instead:
        $html = htmlspecialchars( $utf8_string, ENT_COMPAT, 'UTF-8' );
    • strlen() will count bytes, so use:
        function utf8_strlen( $string ) {
            return strlen( utf8_decode( $string ) );
        }
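Since utf8_decode() is deprecated as of PHP 8.2, here is one possible alternative that needs no extension, sketched under the assumption that the input is already well-formed UTF-8: count every byte that is not a continuation byte (bit pattern 10xxxxxx). The function name `utf8_strlen_bytes()` is hypothetical, and "characters" here means code points.

```php
<?php
// Hypothetical alternative to utf8_strlen(): every code point has exactly one
// lead byte, so counting the non-continuation bytes counts the code points.
// Assumes $string is already well-formed UTF-8.
function utf8_strlen_bytes( $string ) {
    $count  = 0;
    $length = strlen( $string );
    for ( $i = 0; $i < $length; $i++ ) {
        if ( ( ord( $string[ $i ] ) & 0xC0 ) !== 0x80 ) {
            $count++;
        }
    }
    return $count;
}

$word = "M\xC3\xBCller"; // "Müller" written with byte escapes
echo strlen( $word ) . ' bytes, ' . utf8_strlen_bytes( $word ) . " characters\n";
```

Where the mbstring extension is available, `mb_strlen( $string, 'UTF-8' )` does the same job.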
  42. MBstring extension
    • Multibyte aware implementations of some of the most common PHP string functions, the POSIX extended regex extension and the mail function.
    • Mbstring supports many different character sets, most importantly UTF-8.
    • Allows for conversion between character sets and implements some level of encoding detection.
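A short sketch of the multibyte-aware counterparts of strlen(), substr() and strtoupper(). The sample word is an arbitrary assumption, and the `function_exists()` guard is there because mbstring is an optional extension.

```php
<?php
$word = "\xC3\x85ngstr\xC3\xB6m"; // "Ångström" written with byte escapes

if ( function_exists( 'mb_strlen' ) ) {
    // Pass the encoding explicitly instead of relying on mb_internal_encoding().
    $length = mb_strlen( $word, 'UTF-8' );       // counts characters, not bytes
    $first  = mb_substr( $word, 0, 1, 'UTF-8' ); // first character, both of its bytes
    $upper  = mb_strtoupper( $word, 'UTF-8' );   // also upper-cases the non-ASCII letters
    echo $length . ' characters, first: ' . $first . ', upper: ' . $upper . "\n";
}
```

Passing 'UTF-8' on every call is in line with the "be explicit everywhere" advice from earlier in the talk.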
  43. Iconv extension
    • Bundled since PHP 5+.
    • Main purpose of iconv: converting between different character sets.
    • From PHP 5+, iconv has implementations of some common string functions, but is slower than mbstring for UTF-8.
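A minimal conversion sketch with iconv(). The sample string and the assumption that the source data is ISO-8859-1 are illustrative; in practice you need to know the real source encoding.

```php
<?php
$latin1 = "caf\xE9"; // "café" in ISO-8859-1: the e-acute is the single byte 0xE9

if ( function_exists( 'iconv' ) ) {
    // iconv( from, to, string ): the e-acute becomes the two bytes C3 A9.
    $utf8 = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );
    echo bin2hex( $latin1 ) . ' -> ' . bin2hex( $utf8 ) . "\n";
}
```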
  44. Intl extension
    • Bundled since PHP 5.3+, but not always enabled.
    • Modules:
      – Collator
      – Number Formatter
      – Message Formatter
      – Normalizer
      – Locale
  46. Communicating with MySQL
    • The connection between PHP and MySQL defaults to a latin1 connection.
    • The first query you should run after making your connection:
        mysql_query( 'SET NAMES "utf8" [COLLATE "collation_name"]' );
      OR
        mysql_query( 'SET CHARACTER SET utf8' );
    • PHP 5.2+:
        mysql_set_charset( 'utf8', $conn );
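The mysql_* functions shown on this slide were removed in PHP 7, so the sketch below uses PDO and mysqli instead. This is a swapped-in technique, not the talk's own code. Host, database name and credentials are placeholders, the helper name `mysql_utf8_dsn()` is hypothetical, and `utf8mb4` is chosen because MySQL's `utf8` stores at most three bytes per character.

```php
<?php
// Hypothetical helper; host, database name and charset are placeholders.
function mysql_utf8_dsn( $host, $dbname, $charset = 'utf8mb4' ) {
    return 'mysql:host=' . $host . ';dbname=' . $dbname . ';charset=' . $charset;
}

$dsn = mysql_utf8_dsn( 'localhost', 'example_db' );
echo $dsn . "\n";

// Connecting needs a running server and real credentials, so it is left commented out:
// $pdo = new PDO( $dsn, 'example_user', 'example_password' );
//
// The mysqli counterpart of mysql_set_charset():
// $mysqli = new mysqli( 'localhost', 'example_user', 'example_password', 'example_db' );
// $mysqli->set_charset( 'utf8mb4' );
```

Setting the charset through the driver rather than with a hand-written `SET NAMES` query also lets the client library know the encoding, which matters for correct string escaping.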
  47. We’ll be covering: MySQL (4.1+)
  48. Finding out current settings
    • Find out how your system is set up:
        mysql> SHOW VARIABLES LIKE 'character_set%';
        mysql> SHOW VARIABLES LIKE 'collation%';
        +--------------------------+-------------------+
        | Variable_name            | Value             |
        +--------------------------+-------------------+
        | character_set_client     | latin1            |
        | character_set_connection | latin1            |
        | character_set_database   | latin1            |
        | character_set_results    | latin1            |
        | character_set_server     | latin1            |
        | character_set_system     | utf8              |
        | collation_connection     | latin1_swedish_ci |
        | collation_database       | latin1_swedish_ci |
        | collation_server         | latin1_general_ci |
        +--------------------------+-------------------+
  49. Finding out what’s available
    • To find out which character encodings are available and what their default collation is:
        SHOW CHARACTER SET;
    • To find out which collations are available:
        SHOW COLLATION LIKE 'utf8%';
  50. Setting up a server
    • Add the following to your /etc/my.cnf file:
        [mysqld]
        ...
        default-character-set=utf8
        default-collation=utf8_general_ci
      (MySQL 5.5 and later use character-set-server and collation-server instead.)
    • If you are the only user you could even do (MySQL 5.x and later; not executed for super-user logins):
        init_connect='SET NAMES utf8'
  51. Setting up databases & tables
    • Make sure that the database, the tables and the text columns are all in UTF-8:
        (CREATE | ALTER) DATABASE / TABLE ... ( ... ) [DEFAULT] CHARACTER SET utf8 [[DEFAULT] COLLATE collation]
    • Don’t forget field widths
  52. Choosing the collation
    • Collation == sort order
    • Guideline to the collations:
      – _ci = case insensitive
      – _cs = case sensitive
      – _bin = binary
    • Test!
    Image source: Mysql.com
  53. Collation resources
    • Collation charts: http://www.collation-charts.org/
    • Unicode collation charts: http://www.unicode.org/charts/uca/
    • Examples of collation choice effects: http://dev.mysql.com/doc/refman/5.7/en/charset-collation-effect.html
  54. Converting an existing database
    • Using MySQL’s CONVERT function you can migrate ‘old’ data:
        INSERT INTO utf8table (utf8column)
        SELECT CONVERT(latin1field USING utf8) FROM latin1table;
    • For a complete PHP script to convert your database: http://www.phpwact.org/php/i18n/utf-8/mysql
  55. Querying a database
    • You can specify the collation to use for a specific query:
        SELECT k FROM t1 ORDER BY k COLLATE utf8_spanish_ci;
    • You can even use it in the WHERE clause:
        SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
  56. Common issue
    • If you run into the following error message when running a query:
        Illegal mix of collations (utf8_bin,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation
      you may want to try to make your query more explicit with a character string literal:
        SELECT * FROM table WHERE col = _utf8'xyz';
  58. GetText / .po files
    • Poedit understands all encodings supported by the operating system and works in Unicode internally: http://www.poedit.net/
  60. BOM! Run for your life!
    • Symptoms:  (the BOM bytes displayed as Latin-1) or an extra blank line
    • BOM = Byte Order Mark. The character is the ZERO WIDTH NO-BREAK SPACE (U+FEFF). If not placed at the top, it shouldn’t give any problems.
    • In UTF-16/32 the BOM is necessary to determine the byte order. In UTF-8 it is not.
  61. BOM squad
    • Some browsers may display the BOM.
    • ‘Headers already sent’ problems. It can also cause problems when in front of “#!” in a shell script.
    • Check your editor settings.
    • Open the file in (another) editor; if you can see the BOM, manually delete it and save the file. Sometimes even just opening the file and saving it again as UTF-8 will solve it.
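For files or strings you cannot fix in an editor, the UTF-8 BOM can also be stripped in code. This is a minimal sketch; the helper name `strip_utf8_bom()` is hypothetical, and the BOM is matched as the three bytes EF BB BF.

```php
<?php
// Hypothetical helper: strips a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom( $contents ) {
    if ( strncmp( $contents, "\xEF\xBB\xBF", 3 ) === 0 ) {
        return substr( $contents, 3 );
    }
    return $contents;
}

$with_bom = "\xEF\xBB\xBFHello";
echo bin2hex( substr( $with_bom, 0, 3 ) ) . "\n"; // the three BOM bytes in hex
echo strip_utf8_bom( $with_bom ) . "\n";
```

Only a BOM at the very start is removed; a U+FEFF elsewhere in the text is left alone, in line with the remark that it should not cause problems there.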
  62. Useful BOM resources:
    • W3C on the BOM character: http://www.w3.org/International/questions/qa-utf8-bom
    • Web-based BOM testing: http://people.w3.org/rishida/utils/bomtester/
    • Unicode Consortium on the BOM character: http://www.unicode.org/unicode/faq/utf_bom.html#bom1
  63. We’ve covered:
    • Dependency on user’s computer setup
    • Client side code – HTML, CSS, JS
    • Communication with the client
    • Server side code – PHP
    • Communicating with a MySQL database
    • MySQL
    • Communicating with files
    • Other common issues
  64. Keep in touch! (I’m self-employed, you can hire me ;-) )
    Juliette Reinders Folmer
    Email: [email protected]
    Web: http://www.adviesenzo.nl/
    LinkedIn: http://nl.linkedin.com/in/julietterf
    Twitter: http://twitter.com/jrf_nl
    GitHub: http://github.com/jrfnl/
    Please rate this talk on joined.in/9519
    Endorsements and recommendations on LinkedIn are much appreciated too!