Everything you always wanted to know about UTF-8 (but never dared to ask)

Presented on June 24th 2014 at the PHP Tour, Lyon, France.
http://afup.org/pages/phptourlyon2014/
http://www.joind.in/11233
---------------------------------------------------------------
For any application with even the remotest ambition of international use, the only way to go is to use UTF-8. And even without that ambition, using UTF-8 might still bring you more benefits than you currently realize. Unfortunately most developers at one point or another run into problems implementing UTF-8 and get discouraged. That ends now! In this talk I will cover UTF-8 from the basic linguistics, through client-side aspects to all the steps you need to take to tackle the most common (and some more obscure) issues when using UTF-8 in a database driven web application.
---------------------------------------------------------------
Links:

Slide 2:
http://intertwingly.net/blog/2004/04/25/utf-8-musings#c1082919794
http://intertwingly.net/blog/2004/04/25/utf-8-musings#c1082929502

Slide 5/6:
http://www.ethnologue.com/

Slide 7:
http://www.omniglot.com/

Slide 8:
http://en.wikipedia.org/wiki/Writing_system

Slide 11:
http://geek-and-poke.com/

Slide 12:
http://www.unicode.org/charts/

Slide 39:
http://sourceforge.net/projects/phputf8

Slide 40:
http://www.phpwact.org/php/i18n/utf-8

Slide 44:
http://www.php.net/regexp.reference.unicode

Slide 46:
http://www.php.net/mbstring

Slide 47:
http://www.php.net/iconv

Slide 48:
http://www.php.net/intl

Slide 57:
http://www.phpwact.org/php/i18n/utf-8/mysql

Slide 62:
http://www.poedit.net/

---------------------------------------------------------------
Other interesting links:

http://www.eki.ee/letter/
http://www.bisharat.net/A12N/

http://www.wazu.jp/
http://www.alanwood.net/unicode/
http://en.wikipedia.org/wiki/Unicode_typefaces

http://www.styopkin.com/details_free_and_easy_fonts_viewer.html
http://www.heiner-eichmann.de/software/listfont/listfont.htm

http://www.w3.org/International/techniques/authoring-html
http://www.sil.org/iso639-3/codes.asp
http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements
http://www.w3.org/International/articles/language-tags/

http://httpd.apache.org/docs/2.4/mod/mod_charset_lite.html

http://www.collation-charts.org/
http://www.unicode.org/charts/uca/
http://dev.mysql.com/doc/refman/5.7/en/charset-collation-effect.html

http://www.w3.org/International/questions/qa-utf8-bom
http://people.w3.org/rishida/utils/bomtester/
http://www.unicode.org/unicode/faq/utf_bom.html#bom1

Juliette Reinders Folmer

June 24, 2014

Transcript

  1. Disclaimer

    <rant> As you never dared to ask, how am I supposed to know what you wanted to know ? Do you think I’m a mind-reader or something ? Anyways, my lawyer advised me against trying to mind-read, so I’m just going to guess and hope I get it right. Don’t come and complain afterwards that I didn’t tell you the things you wanted to know. At the end of the day: if you wanted to get your questions answered, you had better have the courage to ask them.</rant>
    By: Juliette Reinders Folmer @jrf_nl
  2. “Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.”

    Mark Pilgrim, April 2004
    “Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.”
    J. Graham, April 2004
  3. Some common misconceptions • Unicode !== UTF-8 • UTF-8 !==

    internationalization • UTF-8 !== charset
  4. Why worry about it anyway ? • Local is an

    illusion, always think global: – Company/Client gets taken over by a foreign company – Mergers – Expansion to other regions – Local users/employees from other origins • Code efficiency • Cost Helgi Þormar Þorbjörnsson
  5. Some language statistics

    • 7105 ‘living’ languages
    • +/- 308 languages with > 1 million speakers
    • Nr 1 language in the world ? Mandarin Chinese
    • Nr 2 ? Spanish
    Did you know:
    • That France has more than 9 officially recognized languages ? *
    • That the country with the most languages is Papua New Guinea (837) ?
    * Alsatian, Catalan, Corsican, Breton, French, Gallo, Occitan, Tahitian, some languages of New Caledonia
    Source: Ethnologue 2013
  6. Top 20 languages in the world *

    Rank  Language                  Total speakers (M)  Script
    1     Mandarin Chinese          1,197               Vernacular Chinese
    2     Spanish                   414                 Roman
    3     English                   335                 Roman
    4     Hindi                     260                 Devanagari
    5     Arabic (standard)         237                 Arabic
    6     Portuguese                203                 Roman
    7     Bengali                   193                 Bengali
    8     Russian                   167                 Cyrillic
    9     Japanese                  122                 Hiragana, Katakana, and Kanji
    10    Javanese                  84                  Javanese
    11    Lahnda (Western Punjabi)  83                  Lahnda, Arabic
    12    German (standard)         78                  Roman
    13    Korean                    77                  Korean (Hangul)
    14    French                    75                  Roman
    15    Telugu                    74                  Telugu
    16    Marathi                   72                  Devanagari
    17    Turkish                   71                  Roman
    18    Tamil                     69                  Tamil
    19    Vietnamese                68                  Roman
    20    Urdu                      64                  Arabic

    * Source: Ethnologue 2013/14
  7. About writing systems

    • There are approximately 180 writing systems in active use *
    • Most are used (with or without extensions) for several languages
    • Some languages use more than one writing system
    • Numerous other writing systems for ceremonial or religious use
    • Or for fun ;-)
    * Source: Omniglot
  8. Distribution of writing systems Source: Wikipedia

  9. There Ain't No Such Thing As Plain Text

  10. On character sets and encoding 11000111 10111010 * Ǻ Encoding

    UTF-8 Charset
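
The bit pattern on the slide can be reproduced by hand. The sketch below is not from the talk: `codepoint_to_utf8()` and `bytes_to_bits()` are hypothetical helper names, and the code assumes PHP 7 or later. It encodes the code point U+01FA (Ǻ) as UTF-8 and prints the resulting bytes as bits.

```php
<?php
// Encode a single Unicode code point as UTF-8 bytes (hypothetical helper).
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Render a byte string as space-separated groups of eight bits.
function bytes_to_bits(string $bytes): string
{
    $groups = [];
    foreach (str_split($bytes) as $byte) {
        $groups[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return implode(' ', $groups);
}

echo bytes_to_bits(codepoint_to_utf8(0x01FA)), "\n"; // 11000111 10111010
```

The character set (Unicode) assigns the number 0x01FA to the character; the encoding (UTF-8) decides which bytes represent that number.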
  11. © Geek and Poke

  12. Unicode Unicode is a computing industry standard for the consistent

    encoding, representation and handling of text expressed in most of the world's writing systems. (Wikipedia) • Unicode Code charts: http://www.unicode.org/charts/
  13. UTF

    • UTF = Unicode Transformation Format
    • UTF-8 is one of the character encodings for implementing Unicode
    • Alternatives are UTF-7 (legacy), UTF-16, UTF-32
    • UTF-8 is (backward) compatible with ASCII, UTF-16/32 are not.
    * Image source: W3C
  14. Advantages of UTF-8 • Backward compatible with ASCII • UTF-8

    can encode any Unicode character • XML requires UTF-8 or UTF-16 • UTF-8 and UTF-16 are the standards for having Unicode in HTML. UTF-8 is preferred. • Can be fairly reliably recognized with small chance of confusion. • Sorting UTF-8 as arrays of unsigned bytes will result in same order as sorting on Unicode code point.
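
The last point, that byte order and code point order agree, can be checked with a small experiment. This is only a sketch under assumptions: `first_codepoint()` is a hypothetical helper that expects well-formed UTF-8, the sample characters are arbitrary, and the code needs PHP 7 or later for the `\u{...}` escapes.

```php
<?php
// Decode the first character of a well-formed UTF-8 string to its code point.
function first_codepoint(string $s): int
{
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {
        return $b0;
    }
    if ($b0 < 0xE0) {
        return (($b0 & 0x1F) << 6) | (ord($s[1]) & 0x3F);
    }
    if ($b0 < 0xF0) {
        return (($b0 & 0x0F) << 12) | ((ord($s[1]) & 0x3F) << 6) | (ord($s[2]) & 0x3F);
    }
    return (($b0 & 0x07) << 18) | ((ord($s[1]) & 0x3F) << 12)
         | ((ord($s[2]) & 0x3F) << 6) | (ord($s[3]) & 0x3F);
}

// Euro sign, "z", musical G clef, "é", "A": one to four bytes each.
$chars = ["\u{20AC}", "z", "\u{1D11E}", "\u{00E9}", "A"];

$byByte = $chars;
usort($byByte, 'strcmp');   // strcmp() compares raw bytes

$byCodepoint = $chars;
usort($byCodepoint, function (string $a, string $b): int {
    return first_codepoint($a) <=> first_codepoint($b);
});

var_dump($byByte === $byCodepoint); // bool(true)
```

Note that code point order is not the same as a linguistically correct sort order; that is what collations are for.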
  15. So, what’s the problem ? • Everything defaults to non-UTF-8

    Mostly latin, ISO-8859-1 or US-ASCII So, what’s the solution ? • Be EXPLICIT everywhere (and I don’t mean $%&@-explicit)
  16. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  17. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  18. User’s computer Potential issues: • Extended language support ? •

    Code pages ? • Font ? • Browsers, browsers, browsers
  19. Characteristics of text • Language • Writing system • Writing

    direction • Writing direction • Character (sub)set • Character encoding • Font • Meaning • English • Roman/Latin • Left to right • Top to bottom • Basic Latin • UTF-16 (can vary) • Arial “What I really love” “ار ا يذ ا ا “ • Arabic • Arabic • Right to left • Top to bottom • Arabic • UTF-16 (...) • Arial
  20. Languages: •English •Greek •Ukrainian •Mandarin Chinese •Japanese •Hindi •Korean •Kannada

    •Punjabi Gurmuki •Tamil •Tigre •Myanmar •Arabic •Farsi •Hebrew
  21. About Fonts • Unicode versus non-unicode fonts Be aware &

    be wary ! • Few fonts capable of handling a wide range of Unicode characters. Examples: Arial Unicode MS, Bitstream Cyberbit, Code2000, GNU Unifont
  22. Fonts used: •Verdana •Verdana •Verdana •SimSun •MS Mincho •Code2000 •Batang

    •Arial Unicode MS •Lohit Punjabi •Latha •GS GeezMahtemUnicode •WinInnwa •Arial Unicode MS •Arial •Arial Unicode MS
  23. These are the same phrases converted to the Verdana font.

    WinInnwa is a non-Unicode compliant font...
  24. To stress the importance of Unicode and Unicode-compliant fonts:

  25. These are the same phrases again, now converted to the

    Arial Unicode MS font. The Ge’ez script Character (sub)set (Ethiopic range) are not included in Arial Unicode MS.
  26. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  27. Client side • Always declare the character encoding * Image

    source: W3C
  28. Client side – HTML

    • Use meta-headers:
      <meta http-equiv="Charset" content="utf-8">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    • Tell the browser (and the search engines) the language too if you can:
      <html dir="[DIRECTION – ltr/rtl]" lang="[LANGCODE – eg de-DE or zh-cn for Mandarin Chinese]">
      <meta name="language" content="[LANGCODE]">
      <meta http-equiv="Content-Language" content="[LANGNAME - German]">
  29. Client side – XHTML

    • Add an XML declaration at the top
      <?xml version="1.0" encoding="utf-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="[LANGCODE]" lang="[LANGCODE]">
    • Be aware of quirksmode vs standards mode (IE 6)
  30. Client side – CSS

    • Add an encoding to the CSS file – must be on the very first line of the file!
      @charset "utf-8";
    • Use Unicode compliant fonts and tell the browser which to use with CSS
      <p lang="zh-cn">我很喜欢</p>
      P[LANG|="zh"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
      P[LANG="zh-cn"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
      P[LANG|="ar"] { font-family: Arial, "Arial Unicode MS", sans-serif; direction: rtl; }
  31. Client-side - Javascript • Specify content-type AND charset headers! •

    Can use Unicode escape sequences \uHHHH
  32. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  33. Putting files on the server • Save the file(s) as

    encoded in UTF-8 • Don’t forget to upload the file as binary rather than ascii !
  34. URL’s • Can you see a difference ?

  35. Client-server communication Receiving data from the client • Make sure

    you send information to the server in the correct encoding • This is especially important for user-input !!! <form accept-charset="utf-8">
  36. Client-server communication: Sending data to the client

    • Send an HTTP header:
      header( 'Content-Type: text/html; charset=utf-8' );
    • .htaccess / Apache’s httpd.conf
    Settings in .htaccess overrule HTML headers !
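
Putting the header and the in-document declaration together could look roughly like the sketch below. It is an illustration, not the talk's code: `render_utf8_page()` is a hypothetical helper, the page content is made up, the file is assumed to be saved as UTF-8, and the shorter HTML5 `<meta charset>` form is used in place of the `http-equiv` variant.

```php
<?php
// Build a minimal HTML5 page that declares UTF-8 (hypothetical helper).
function render_utf8_page(string $lang, string $title, string $body): string
{
    // Escape for HTML and tell htmlspecialchars() the input is UTF-8.
    $e = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    };

    return "<!DOCTYPE html>\n"
         . '<html lang="' . $e($lang) . "\">\n"
         . "<head>\n<meta charset=\"utf-8\">\n"
         . '<title>' . $e($title) . "</title>\n</head>\n"
         . '<body><p>' . $e($body) . "</p></body>\n</html>\n";
}

// The HTTP header has to be sent before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo render_utf8_page('de-DE', 'Grüße', 'Müller & Söhne');
```

Declaring the encoding in both places is deliberate: the HTTP header takes precedence in the browser, while the meta element still applies when the page is saved and opened from disk.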
  37. Client-server communication: .htaccess examples

    • # Maps file extensions to a character encoding. Especially useful in content negotiation situations. (httpd.conf)
      AddCharset utf-8 .utf8
    • # Pass the default character encoding for content-type text/plain and text/html
      AddDefaultCharset On|Off|charset
      AddDefaultCharset UTF-8
      AddDefaultCharset On => iso-8859-1
    • # Add a default character encoding per file extension
      AddType 'text/html; charset=UTF-8' html
    • # Identify the encoding for a particular file:
      <Files ~ "events\.html">
      ForceType 'text/html; charset=UTF-8'
      </Files>
  38. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  39. PHP • Currently not very friendly for UTF-8 • PHP6

    development dormant • Some PHP extensions come to the rescue: MBstring iconv ~Intl • There are also some nifty function collections / classes available to help you. Take note of: http://sourceforge.net/projects/phputf8
  40. PHP UTF-8 safe functions Safe: • explode() • str_replace() •

    PHP5+ ~htmlentities() NOT Safe: • Everything else MUST READ: http://www.phpwact.org/php/i18n/utf-8
  41. Danger zone • setlocale() – And all functions which use

    locale • strtoupper() / strtolower() • number_format() / money_format() • ucfirst() / ucwords() • strftime() • Gettext extension • Filter extension text functions • Ctype
  42. Test for well-formedness

    function utf8_compliant( $string ) {
        if ( strlen( $string ) == 0 ) {
            return true;
        }
        return ( preg_match( '/^.{1}/us', $string , $array ) == 1 );
    }
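
A short usage sketch for the check on this slide, with the function restated in slightly tightened form so the block runs on its own. The sample byte strings are made up for illustration.

```php
<?php
// With the /u modifier PCRE validates the subject as UTF-8 before matching;
// preg_match() returns false for malformed input instead of 1.
function utf8_compliant(string $string): bool
{
    if ($string === '') {
        return true;
    }
    return preg_match('/^./us', $string) === 1;
}

var_dump(utf8_compliant("caf\xC3\xA9")); // bool(true):  valid two-byte sequence for é
var_dump(utf8_compliant("caf\xE9"));     // bool(false): lone ISO-8859-1 byte
var_dump(utf8_compliant("\xC3\x28"));    // bool(false): invalid continuation byte
```

A check like this belongs at the point where data enters the application (form input, uploaded files, API payloads), so that everything further down can assume valid UTF-8.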
  43. Handling text

    • You don’t need htmlentities() anymore. Use htmlspecialchars() instead:
      $html = htmlspecialchars($utf8_string, ENT_COMPAT, 'UTF-8');
    • strlen() will count bytes, so use:
      function utf8_strlen( $string ){
          return strlen( utf8_decode( $string ) );
      }
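
To make the byte/character difference concrete, here is a sketch that counts characters with PCRE rather than with utf8_decode(), which is deprecated as of PHP 8.2. The function name is reused from the slide, but this body is a swapped-in alternative and not the talk's implementation; PHP 7 or later is assumed.

```php
<?php
// Count code points instead of bytes. With the /u modifier every match of
// "." is one UTF-8 character; preg_match_all() returns the number of matches,
// or false when the input is not valid UTF-8.
function utf8_strlen(string $string)
{
    return preg_match_all('/./us', $string);
}

$word = "\u{00F1}and\u{00FA}";   // "ñandú"

echo strlen($word), "\n";        // 7 (bytes)
echo utf8_strlen($word), "\n";   // 5 (characters)

// htmlspecialchars() only converts &, <, > and quotes, so the UTF-8
// characters pass through untouched.
echo htmlspecialchars("<b>$word</b>", ENT_COMPAT, 'UTF-8'), "\n";
```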
  44. PCRE

    • PCRE can be relatively UTF-8 safe if compiled with Unicode. Use:
      preg_match('/^.+$/u', $string);
    • Test whether PCRE has been compiled with Unicode support:
      if( preg_match('/^.{1}$/u',"ñ", $UTF8_ar) != 1 ){
          trigger_error('PCRE is not compiled with UTF-8 support',E_USER_ERROR);
      }
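
Once the `/u` modifier is in place, Unicode character properties become usable in patterns. The following sketch uses a made-up sample string and assumes PHP 7 or later with a PCRE build that includes Unicode support.

```php
<?php
$name = "\u{00DE}ormar";   // "Þormar"

// \p{Lu} = uppercase letter, \p{Ll} = lowercase letter, in any script.
var_dump(preg_match('/^\p{Lu}\p{Ll}+$/u', $name)); // int(1)

// An ASCII-only character class does not know about Þ.
var_dump(preg_match('/^[A-Z][a-z]+$/', $name));    // int(0)

// Split into characters rather than bytes.
$chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n";  // 6
echo strlen($name), "\n";  // 7
```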
  45. utf8_encode() & utf8_decode() • Only useful for converting between ISO-8859-1

    and UTF-8.
  46. MBstring extension • Multibyte aware implementations of some of the

    most common PHP string functions, the POSIX extended regex extension and the mail function. • Mbstring supports many different character sets, most importantly UTF-8. • Allows for conversion between character sets and implements some level of encoding detection.
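
A few of the multibyte-aware counterparts in action. This is only a sketch: it assumes the mbstring extension is loaded (hence the guard) and PHP 7 or later, and the sample text is arbitrary.

```php
<?php
// mbstring is an optional extension, so check before relying on it.
if (extension_loaded('mbstring')) {
    mb_internal_encoding('UTF-8');

    $text = "\u{00E9}t\u{00E9}";          // "été"

    echo strlen($text), "\n";             // 5 (bytes)
    echo mb_strlen($text), "\n";          // 3 (characters)
    echo mb_substr($text, 0, 2), "\n";    // "ét"
    echo mb_strtoupper($text), "\n";      // "ÉTÉ"

    // Validation: 0xC3 followed by 0x28 is not well-formed UTF-8.
    var_dump(mb_check_encoding("\xC3\x28", 'UTF-8')); // bool(false)
}
```

Setting the internal encoding once, or passing the encoding argument on every call, avoids depending on whatever default the server happens to be configured with.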
  47. Iconv extension • Bundled since PHP 5+. • Main purpose

    of iconv : converting between different character sets. • From PHP 5+, iconv has implementations of some common string functions, but is slower than mbstring for UTF-8. • Great for dealing with files, filtering streams and handling output buffers.
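
Converting legacy data is the typical iconv job. The sketch below assumes the iconv extension is available (hence the guard) and that the legacy data really is ISO-8859-1; the sample bytes are made up.

```php
<?php
if (function_exists('iconv')) {
    // "café" as a legacy ISO-8859-1 system stores it: é is the single byte 0xE9.
    $latin1 = "caf\xE9";

    // After conversion é is the two-byte UTF-8 sequence 0xC3 0xA9.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

    echo bin2hex($latin1), "\n";             // 636166e9
    echo bin2hex($utf8), "\n";               // 636166c3a9
    echo iconv_strlen($utf8, 'UTF-8'), "\n"; // 4
}
```

Unlike utf8_encode(), the same call works for any source encoding iconv knows about, so it does not silently assume ISO-8859-1.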
  48. Intl extension • Bundled since PHP 5.3+, but not always

    enabled. • Wrapper around the excellent ICU library • Modules: – Collator – Number Formatting – Currency Formatting – Message Formatter (replaces gettext) – Normalizer – Locale – Convertors – Transliterators – Spoof checker – And more...
  49. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  50. Communicating with MySQL

    • The connection between PHP and MySQL defaults to a latin1 connection.
    • The first query you should run after making your connection:
      mysqli_query( $conn, 'SET NAMES "utf8" [COLLATE "collation_name"]' );
      OR
      mysqli_query( $conn, 'SET CHARACTER SET utf8' );
    • PHP 5.2+:
      mysqli_set_charset( $conn, 'utf8' );
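
Wrapped into a helper, the connection setup might look like this sketch. `open_utf8_connection()` is a hypothetical name, the host, credentials and database name are placeholders supplied by the caller, and the mysqli extension is assumed to be installed.

```php
<?php
// Open a mysqli connection and switch it to UTF-8 before running any query.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $conn = mysqli_connect($host, $user, $pass, $db);
    if ($conn === false) {
        throw new RuntimeException('Connection failed: ' . mysqli_connect_error());
    }

    // Unlike a plain "SET NAMES" query, mysqli_set_charset() also informs the
    // client library, so mysqli_real_escape_string() uses the right encoding.
    if (!mysqli_set_charset($conn, 'utf8')) {
        throw new RuntimeException('Could not set charset: ' . mysqli_error($conn));
    }

    return $conn;
}
```

One caveat worth knowing: MySQL's `utf8` character set stores at most three bytes per character. From MySQL 5.5.3 on, `utf8mb4` covers the full Unicode range and can be passed to `mysqli_set_charset()` in the same way.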
  51. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues 4.1+
  52. Finding out current settings

    • Find out how your system is set up:
      mysql> SHOW VARIABLES LIKE 'character_set%';
      mysql> SHOW VARIABLES LIKE 'collation%';
      +--------------------------+-------------------+
      | Variable_name            | Value             |
      +--------------------------+-------------------+
      | character_set_client     | latin1            |
      | character_set_connection | latin1            |
      | character_set_database   | latin1            |
      | character_set_results    | latin1            |
      | character_set_server     | latin1            |
      | character_set_system     | utf8              |
      | collation_connection     | latin1_swedish_ci |
      | collation_database       | latin1_swedish_ci |
      | collation_server         | latin1_general_ci |
      +--------------------------+-------------------+
  53. Finding out what’s available • To find out which character

    encodings are available and what their default collation is: SHOW CHARACTER SET; • To find out which collations are available: SHOW COLLATION LIKE 'utf8%';
  54. Setting up a server

    • Add the following to your /etc/my.cnf file:
      [mysqld]
      ...
      default-character-set=utf8
      default-collation=utf8_general_ci
      (from MySQL 5.5 on, the server options are named character-set-server and collation-server)
    • If you are the only user you could even do: (MySQL 5.x and later) (not executed for super-user logins)
      init_connect='SET NAMES utf8'
  55. Setting up databases & tables • Make sure that both

    database, tables as well as text columns are in UTF-8: (CREATE | ALTER) DATABASE / TABLE ... ( ... ) [DEFAULT] CHARACTER SET utf8 [[DEFAULT] COLLATE collation] • Don’t forget Field widths
  56. Choosing the collation • Collation == Sort order • Guideline

    to the collations: _ci = case insensitive _cs = case sensitive _bin = binary • Test ! Image Source: Mysql.com
  57. Converting an existing database • Using MySQL’s CONVERT function you

    can migrate ‘old’ data: INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table; • For a complete php script to convert your database: http://www.phpwact.org/php/i18n/utf-8/mysql
  58. Querying a database • You can specify the collation to

    use for a specific query: SELECT k FROM t1 ORDER BY k COLLATE utf8_spanish_ci; • You can even use it in the WHERE clause: SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
  59. Common issue • If you run into the following error

    message when running a query: Illegal mix of collations (utf8_bin,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation You may want to try and make your query more explicit with a character string literal: SELECT * FROM table WHERE col = _utf8'xyz';
  60. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  61. Working with files • Use the b flag in fopen()

    • Don’t use unicode in filenames
  62. .po files • Poedit understands all encodings supported by operating

    system and works in Unicode internally: http://www.poedit.net/
  63. We’ll be covering: •Dependency on user’s computer setup •Client side

    code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  64. BOM ! Run for your life !

    • ï»¿ or extra blank line
    • BOM = Byte Order Mark. The character is the ZERO WIDTH NO-BREAK SPACE (U+FEFF). If not placed at the top, it shouldn’t give any problems.
    • In UTF-16/32 the BOM is necessary to determine the byte order. In UTF-8 it is not.
  65. BOM squad

    • Some browsers may display the BOM.
    • ‘headers already sent’ problems. It can also cause problems when in front of “#!” in a shell script.
    • Check your editor settings
    • Open the file in (another) editor; if you can see the BOM, manually delete it and save the file. Sometimes even just opening the file and saving it again as UTF-8 will solve it.
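
When files arrive from sources you don't control, the BOM can also be removed in code. A minimal sketch, assuming PHP 7 or later; `strip_utf8_bom()` and `read_without_bom()` are hypothetical helper names.

```php
<?php
// The UTF-8 encoding of U+FEFF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Remove a leading UTF-8 byte order mark, if present.
function strip_utf8_bom(string $data): string
{
    if (strncmp($data, UTF8_BOM, 3) === 0) {
        return substr($data, 3);
    }
    return $data;
}

// Read a file (file_get_contents() is binary-safe) and drop the BOM.
function read_without_bom(string $path): string
{
    $data = file_get_contents($path);
    if ($data === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return strip_utf8_bom($data);
}

echo bin2hex(strip_utf8_bom(UTF8_BOM . 'hi')), "\n"; // 6869
```

For PHP source files themselves this does not help: there the BOM is output before the script runs, so the fix has to happen in the editor.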
  66. We’ve covered: •Dependency on user’s computer setup •Client side code

    – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues
  67. Keep in touch! (I’m self-employed, you can hire me ;-) )

    Juliette Reinders Folmer
    Email: juliette@adviesenzo.nl
    Web: http://www.adviesenzo.nl/
    LinkedIn: http://nl.linkedin.com/in/julietterf
    Twitter: http://twitter.com/jrf_nl
    GitHub: http://github.com/jrfnl/
    Please rate this talk on joind.in/11233
    Slides: speakerdeck.com/jrf
    Endorsements and recommendations on LinkedIn are much appreciated too!
  68. Anything else you never dared to ask before ?