
Everything you always wanted to know about UTF-8 for WordPress

Presented on April 15th 2013 at the WordPress Amsterdam Meetup, Amsterdam, The Netherlands.
http://www.meetup.com/WordPress-Amsterdam/events/108154562/

Juliette Reinders Folmer

April 15, 2013

Transcript

  1. Disclaimer

    <rant> As you never dared to ask, how am I supposed to know what you wanted to know? Do you think I’m a mind-reader or something? Anyways, my lawyer advised me against trying to mind-read, so I’m just going to guess and hope I get it right. Don’t come and complain afterwards that I didn’t tell you the things you wanted to know. At the end of the day: if you wanted to get your questions answered, you had better have the courage to ask them. </rant>
    By: Juliette Reinders Folmer
  2. “Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.” Mark Pilgrim, April 2004

    “Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.” J. Graham, April 2004
  3. Some common misconceptions

    • Unicode !== UTF-8: UTF-8 is one of the character encodings for Unicode
    • UTF-8 !== internationalization: using UTF-8 prepares your application to receive input in different scripts, independently of the language
  4. Why worry about it anyway?

    • It’s all about being prepared:
      – Company/client gets taken over by a foreign company
      – Mergers
      – Expansion to other regions
      – Users/employees with international names
    • Code efficiency
    • Code flexibility: you don’t know who will use it when you publish themes and plugins
  5. Some language statistics

    • There are 7299 languages in the world
    • 6352 ‘living’ languages
    • +/- 390 languages with > 1 million speakers
    • Nr 1 language in the world? Mandarin Chinese
    • Nr 2? Spanish

    Did you know:
    • That the Netherlands has 14* officially recognized languages and 2 more under consideration?
    • That the country with the most languages is Papua New Guinea? (820)

    * Dutch, Achterhoeks, Drents, Western Frisian, Gronings, Limburgish, Sinte Romani, Vlax Romani, Sallands, Stellingwerfs, Twents, Veluws, Western Yiddish, Zeeuws.
  6. Top 20 languages in the world *

    Rank  Language            1st language  2nd language  Total speakers  Script
                              (million)     (million)     (million)
    1     Mandarin Chinese    873           178           1,051           Vernacular Chinese
    2     English             309           199           508             Roman
    3     Arabic (standard)   206           146           452             Arabic
    4     Spanish             322           60            382             Roman
    5     Hindi               181           120           301             Devanagari
    6     Russian             145           110           255             Cyrillic
    7     Portuguese          178           15            193             Roman
    8     Bengali             171           -             172             Bengali
    9     Urdu                61            104           165             Arabic
    10    Indonesian          23            140           163             Roman and Arabic
    11    Japanese            122           1             123             Hiragana, Katakana, and Kanji
    12    German (standard)   95            28            123             Roman
    13    French              65            50            115             Roman
    14    Wu Chinese          77            -             77              Vernacular Chinese
    15    Javanese            76            -             76              Javanese
    16    Telugu              70            5             75              Telugu
    17    Tamil               66            8             74              Tamil
    18    Marathi             68            0.3           68              Devanagari
    19    Vietnamese          67            -             68              Roman
    20    Korean              67            -             68              Korean (Hangul)

    * Based on http://www.ethnologue.com/
  7. Some writing system stats

    • There are approximately ?? writing systems in active use (63)
    • Most of these writing systems are used (with or without extensions) for several languages
    • There are numerous other writing systems for ceremonial or religious use
    • Languages are sometimes written in more than one writing system. Some examples:
      – Punjabi (Gurmukhi, Shahmukhi – a variant of the Arabic/Persian script)
      – Serbian (Roman, Cyrillic)
  8. Useful writing system resources

    • Info on languages: http://www.ethnologue.com/
    • Info on writing systems: http://www.omniglot.com/
    • Which characters are used in language X? http://www.eki.ee/letter/
    • Info on Latin extensions for African languages: http://www.bisharat.net/A12N/
    • General language info: http://en.wikipedia.org/
  9. On character sets and encoding

    • “A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points.” (W3C)
    • “The character encoding reflects the way these abstract characters are mapped to bytes for manipulation in a computer.” (W3C)

    Unfortunately, the terms ‘character set’ and ‘character encoding’ seem to have confused some of our colleagues in years gone by, which is why the term ‘charset’ is often used where ‘charenc’ is meant or would be more logical.
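To make the distinction between a code point and its encoded bytes concrete, here is a minimal sketch in PHP. It assumes PHP 7.0 or later (for the `\u{...}` string escape); the three sample characters and the helper name `utf8_bytes_hex()` are arbitrary choices for illustration, not something from the slides. The code point is the number Unicode assigns to the character, the hex string is how UTF-8 maps that number to bytes.

```php
<?php
// Each entry maps a Unicode code point (the "coded character set" view)
// to the character itself, written with PHP 7's \u{...} escape.
$samples = [
    'U+0041' => "\u{0041}", // LATIN CAPITAL LETTER A
    'U+00F1' => "\u{00F1}", // LATIN SMALL LETTER N WITH TILDE
    'U+20AC' => "\u{20AC}", // EURO SIGN
];

// Hypothetical helper: the bytes of a string as space-separated hex pairs.
function utf8_bytes_hex(string $char): string
{
    return implode(' ', str_split(bin2hex($char), 2));
}

foreach ($samples as $codePoint => $char) {
    // strlen() counts bytes, which is exactly what we want to show here.
    printf("%s => %d byte(s): %s\n", $codePoint, strlen($char), utf8_bytes_hex($char));
}
```

One code point can therefore take one, two, three or four bytes in UTF-8, which is why byte-oriented string functions need care.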
  10. Unicode

    • “Unicode is a computing industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.” (Wikipedia)
    • Tries to capture ‘all’ writing systems
    • Currently contains > 100,000 characters
    • Synchronous with the UCS (Universal Character Set) as defined by ISO/IEC 10646
    • Useful resources: Unicode code charts: http://www.unicode.org/charts/
  11. UTF-8

    • UTF = Unicode Transformation Format
    • Is one of the character encodings for implementing Unicode
    • Alternatives are UTF-7 (legacy), UTF-16, UTF-32
    • UTF-8 is (backward) compatible with ASCII, UTF-16/32 are not.

    * Image source: W3C
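The ASCII compatibility claim can be checked with a few lines of PHP. This is only a sketch: `ascii_to_utf16le()` is a hypothetical, deliberately naive helper that works for ASCII-only input, and the sample string is an arbitrary choice.

```php
<?php
// Naive UTF-16LE encoder for ASCII-only input (illustration only):
// every ASCII byte becomes one 16-bit little-endian code unit.
function ascii_to_utf16le(string $ascii): string
{
    $out = '';
    foreach (str_split($ascii) as $byte) {
        $out .= pack('v', ord($byte));
    }
    return $out;
}

$ascii = 'WordPress';

// A pure ASCII string is already valid UTF-8: the /u modifier makes
// PCRE reject the subject if it is not well-formed UTF-8.
var_dump(preg_match('//u', $ascii) === 1);

// The bytes are identical in ASCII and in UTF-8 ...
echo bin2hex($ascii), "\n";

// ... but not in UTF-16, where every letter is followed by a 00 byte.
echo bin2hex(ascii_to_utf16le($ascii)), "\n";
```

This is the practical reason existing ASCII files and protocols keep working when a site moves to UTF-8, while a move to UTF-16 would change every byte sequence.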
  12. So, what’s the problem?

    • Everything defaults to non-UTF-8
    • Mostly Latin writing system, ISO-8859-1 or US-ASCII

    So, what’s the solution?
    • Be EXPLICIT everywhere
    • Don’t reinvent the wheel – use the functions already available in WP
  13. WordPress & UTF-8

    • Default charset is already set to UTF-8
    • Lots of things already sorted, like database setup and connection:
      /** Database Charset to use in creating database tables. */
      define( 'DB_CHARSET', 'utf8' );
      /** The Database Collate type. Don't change this if in doubt. */
      define( 'DB_COLLATE', '' );
    • Functions available for text handling
    • BUT... you still need to know what to use when to get it right
  14. We’ll be covering:

    • Dependency on user’s computer setup
    • Client side code – HTML, CSS, JS
    • Communication with the client
    • Server side code – PHP
    • Communicating with a MySQL database
    • MySQL
    • Communicating with files
    • Other common issues
  16. User’s computer

    Potential issues:
    • Does the user have extended language support installed?
    • Are the relevant code pages installed?
    • Which fonts does the user have installed?
    • Browsers, browsers, browsers
    • "auto-detect character encoding" feature in browsers
  17. Characteristics of text

    Characteristic        “What I really love”    “ار ا يذ ا ا “
    Language              English                 Arabic
    Writing system        Roman/Latin             Arabic
    Writing direction     Left to right           Right to left
    Writing direction     Top to bottom           Top to bottom
    Character (sub)set    Basic Latin             Arabic
    Character encoding    UTF-16 (can vary)       UTF-16 (...)
    Font                  Arial                   Arial
    Meaning
  18. Languages:

    • English • Greek • Ukrainian • Mandarin Chinese • Japanese • Hindi • Korean • Kannada • Punjabi (Gurmukhi) • Tamil • Tigre • Myanmar • Arabic • Farsi • Hebrew
  19. About Fonts

    • Especially for writing systems which have only been added to Unicode in the last 10 years, there is a wide range of non-Unicode fonts available. Be aware & be wary!
    • There are only a few fonts capable of handling a wide range of Unicode characters. Examples: Arial Unicode MS, Bitstream Cyberbit, Code2000, GNU Unifont
    • Most frequently encountered fonts are Latin-based or specialized.
  20. Fonts used:

    • Verdana • Verdana • Verdana • SimSun • MS Mincho • Code2000 • Batang • Arial Unicode MS • Lohit Punjabi • Latha • GS GeezMahtemUnicode • WinInnwa • Arial Unicode MS • Arial • Arial Unicode MS
  21. These are the same phrases converted to the Verdana font.

    What happened to Myanmar? WinInnwa is a non-Unicode compliant font...
  22. These are the same phrases again, now converted to the Arial Unicode MS font.

    What happened to Tigre? The Ge’ez script character (sub)set (Ethiopic range) is not included in Arial Unicode MS.
  23. Useful font-related resources:

    • Gallery of Unicode fonts – find fonts per writing system/language: http://www.wazu.jp/
    • Unicode test pages and more: http://www.alanwood.net/unicode/
    • Unicode typefaces: http://en.wikipedia.org/wiki/Unicode_typefaces
    • Font frequency on computers: http://www.codestyle.org/
    • Free and easy Font viewer: http://www.styopkin.com/details_free_and_easy_fonts_viewer.html
    • ListFont: http://www.heiner-eichmann.de/software/listfont/listfont.htm
  25. Client side – HTML

    • Use meta-headers:
      <meta charset="utf-8">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    • Tell the browser (and the search engines) the language too if you can:
      <html dir="[DIRECTION – ltr/rtl]" lang="[LANGCODE – eg nl-NL or zh-cn for Mandarin Chinese]">
      <meta name="language" content="[LANGCODE]">
      <meta http-equiv="Content-Language" content="[LANGNAME - eg Dutch]">
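As a rough sketch of being explicit from the server side, the PHP below builds the start of an HTML document with the encoding, language and direction filled in. The function name `render_document_start()` and its parameters are hypothetical; in a WordPress theme the template tags `language_attributes()` and `bloginfo( 'charset' )` normally take care of this.

```php
<?php
// Build the opening of an HTML document with explicit encoding,
// language and direction. Function and parameter names are illustrative.
function render_document_start(string $lang, string $dir = 'ltr', string $charset = 'utf-8'): string
{
    // Escape every value that ends up inside an attribute.
    $esc = function (string $value): string {
        return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
    };

    return "<!DOCTYPE html>\n"
        . '<html dir="' . $esc($dir) . '" lang="' . $esc($lang) . "\">\n"
        . "<head>\n"
        . '<meta http-equiv="Content-Type" content="text/html; charset=' . $esc($charset) . "\">\n";
}

echo render_document_start('nl-NL');     // Dutch, left to right
echo render_document_start('ar', 'rtl'); // Arabic, right to left
```

Keeping the charset in one place like this makes it harder for the meta tag, the HTTP header and the actual file encoding to drift apart.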
  26. Client side – XHTML

    • Add an XML declaration at the top:
      <?xml version="1.0" encoding="utf-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="[LANGCODE]" lang="[LANGCODE]">
    • Be aware of quirks mode vs standards mode
  27. Client side – CSS

    • Add an encoding to the CSS file; it must be on the very first line of the file!
      @charset "utf-8";
    • Use Unicode compliant fonts and tell the browser which to use with CSS:
      <p lang="zh-cn">我很喜欢</p>
      P[LANG|="zh"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
      P[LANG="zh-cn"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
      P[LANG|="ar"] { font-family: Arial, "Arial Unicode MS", sans-serif; direction: rtl; }
  28. Useful client side resources:

    • W3C on best practices for internationalization: http://www.w3.org/International/techniques/authoring-html
    • ISO language codes: http://www.sil.org/iso639-3/codes.asp
    • ISO country codes: http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements
    • W3C on Language tags in HTML and XML: http://www.w3.org/International/articles/language-tags/
  30. Putting files on the server

    • Save the file(s) as encoded in UTF-8
    • On that note: don’t forget to upload the file as binary rather than ASCII! You can normally check/change this in the FTP program’s options.
  31. URLs

    • Can you see a difference?
    • As a best practice, I’d advise against using UTF-8 characters in URLs
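One way to see why non-ASCII characters in URLs are awkward is to look at what actually travels over the wire. The sketch below uses PHP's `rawurlencode()`; the sample path segment is an arbitrary choice and PHP 7.0+ is assumed for the `\u{...}` escape.

```php
<?php
// A path segment containing a non-ASCII character (sample value).
$segment = "caf\u{00E9}"; // "café"

// rawurlencode() percent-encodes each byte of the UTF-8 sequence,
// so the single character "é" becomes the two escapes %C3%A9.
$encoded = rawurlencode($segment);
echo $encoded, "\n";

// Decoding gives the original bytes back.
var_dump(rawurldecode($encoded) === $segment);
```

If the same segment had been sent in ISO-8859-1 instead, the escape would have been a single `%E9`, so both ends must agree on the encoding for the URL to resolve.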
  32. Client-server communication: sending data to the client

    • Send an HTTP header with the character encoding:
      header('Content-Type: text/html; charset=utf-8');
    • Did you know that you can also use .htaccess / Apache’s httpd.conf to set the file encoding defaults? And the default language?
    • Settings in .htaccess overrule HTML headers!
    • Often not the best solution, but can be helpful as an interim solution.
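A minimal sketch of sending the header safely from PHP follows. The helper name `content_type_header()` is hypothetical; `header()` and `headers_sent()` are standard PHP functions. The check matters because headers can only be sent before any output has gone to the client.

```php
<?php
// Hypothetical helper: build the Content-Type header line with an explicit charset.
function content_type_header(string $mime = 'text/html', string $charset = 'utf-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mime, $charset);
}

// Headers can only be sent before any output, so check first.
if (!headers_sent()) {
    header(content_type_header());
}
```

The same helper can be reused for other response types, for example `content_type_header('application/json')` for an AJAX endpoint.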
  34. Client-server communication: receiving data from the client

    • Make sure you send information to the server in the correct encoding
    • This is especially important for user input, i.e. forms!
      <form accept-charset="utf-8">
    • If you don’t specify the character encoding in the form tag, you are at the browser’s mercy...
  36. PHP

    • Currently not very friendly for UTF-8
    • PHP6 will (supposedly ;-) be fully UTF-8 prepared, but development is currently dormant
    • Some PHP extensions come to the rescue: mbstring, iconv
    • There are also some nifty function collections / classes available to help you.
    • Use WP built-in text handling functions whenever possible!
  37. PHP UTF-8 safe functions

    Safe:
    • explode()
    • str_replace()
    • PHP5+ ~htmlentities()

    NOT safe:
    • Everything else

    MUST READ: http://www.phpwact.org/php/i18n/utf-8
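The difference between safe and unsafe functions can be demonstrated in a few lines. This is a sketch under assumptions: the sample text is arbitrary, `is_valid_utf8()` is a hypothetical helper that relies on PCRE's `/u` modifier rejecting malformed input, and PHP 7.0+ is assumed for the `\u{...}` escape.

```php
<?php
// Hypothetical helper: /u makes PCRE reject malformed UTF-8,
// in which case preg_match() returns false instead of 1.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$text = "\u{00F1}and\u{00FA},ping\u{00FC}ino"; // "ñandú,pingüino"

// explode() splits on a complete byte sequence, so the pieces stay valid.
$parts = explode(',', $text);
var_dump(is_valid_utf8($parts[0]));

// str_replace() also matches complete byte sequences.
$replaced = str_replace("\u{00F1}", 'n', $text);
var_dump(is_valid_utf8($replaced));

// substr() counts bytes: taking 1 byte cuts the 2-byte "ñ" in half.
$broken = substr($text, 0, 1);
var_dump(is_valid_utf8($broken));

// strrev() reverses bytes, scrambling every multi-byte character.
var_dump(is_valid_utf8(strrev($text)));
```

The first two checks succeed and the last two fail, which is the pattern to expect from any function that works on byte offsets rather than on whole byte sequences.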
  38. WP functions (1)

    • esc_html(), esc_attr(), esc_js(), esc_url(), esc_textarea() etc
    • sanitize_text_field(), sanitize_title(), sanitize_meta() etc function group
    • wp_html_excerpt($string, $length) – name is misleading! Will strip HTML, but is UTF-8 safe.
    • WP3.3+ wp_trim_words( $text, $num_words = 55, $more = null )
  39. WP functions (2)

    • seems_utf8($str)
    • wp_check_invalid_utf8( $string, $strip = false )
    • utf8_uri_encode( $utf8_string, $length = 0 )
    • convert_chars($content)

    MUST READ: http://codex.wordpress.org/Data_Validation
  40. PCRE

    • PCRE can be relatively UTF-8 safe if compiled with Unicode support. Use:
      preg_match('/^.+$/u', $string);
      Warning: if you forget the u, your data might become corrupt!
      Warning: the original UTF-8 design allowed for 5/6 byte character sequences, but these are not used in Unicode. This can cause problems.
    • Test whether PCRE has been compiled with Unicode support:
      if ( preg_match('/^.{1}$/u', "ñ", $UTF8_ar) != 1 ) {
          trigger_error('PCRE is not compiled with UTF-8 support', E_USER_ERROR);
      }
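What the `u` modifier changes is easiest to see by counting matches of the dot. A small sketch, assuming PHP 7.0+ for the `\u{...}` escape; the sample character is an arbitrary choice.

```php
<?php
$text = "\u{00F1}"; // "ñ": one character, two bytes in UTF-8

// Without /u the dot matches single bytes.
$byteMatches = preg_match_all('/./', $text);

// With /u the dot matches whole UTF-8 characters.
$charMatches = preg_match_all('/./u', $text);

printf("without /u: %d match(es), with /u: %d match(es)\n", $byteMatches, $charMatches);
```

Any replacement or split built on the byte-wise pattern can therefore land in the middle of a character, which is how data gets corrupted when the modifier is forgotten.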
  41. Handling text gotchas

    • utf8_encode() & utf8_decode(): only useful for converting between ISO-8859-1 and UTF-8.
    • strlen() will count bytes, so use:
      function utf8_strlen($string) { return strlen(utf8_decode($string)); }
    • CType for validation is dependent on the locale. Be aware of this if/when you use ctype for validation.
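An alternative way to count characters that does not rely on `utf8_decode()` (which is deprecated as of PHP 8.2) is to let PCRE do the counting. This is a sketch, not the approach from the slides: `utf8_char_count()` is a hypothetical helper, and where the mbstring extension is available `mb_strlen( $text, 'UTF-8' )` does the same job.

```php
<?php
// Hypothetical helper: count characters (code points) rather than bytes.
// Returns null when the input is not valid UTF-8.
function utf8_char_count(string $text): ?int
{
    // /u treats pattern and subject as UTF-8; /s lets the dot match newlines too.
    $count = preg_match_all('/./us', $text);
    return $count === false ? null : $count;
}

$word = "\u{00F1}and\u{00FA}"; // "ñandú"

echo strlen($word), "\n";          // number of bytes
echo utf8_char_count($word), "\n"; // number of characters
```

Returning `null` for malformed input keeps a corrupted string from silently producing a plausible-looking length.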
  42. We’ll be covering:

    • Dependency on user’s computer setup
    • Client side code – HTML, CSS, JS
    • Communication with the client
    • Server side code – PHP
    • Communicating with a MySQL database: not an issue, handled by WordPress!
    • MySQL
    • Communicating with files
    • Other common issues
  43. We’ll be covering:

    • Dependency on user’s computer setup
    • Client side code – HTML, CSS, JS
    • Communication with the client
    • Server side code – PHP
    • Communicating with a MySQL database
    • MySQL 4.1+
    • Communicating with files
    • Other common issues
  44. Setting up databases & tables

    • When you set up theme/plugin specific tables, make sure that text columns are in UTF-8:
      (CREATE | ALTER) TABLE ... ( ... ) [DEFAULT] CHARACTER SET utf8 [COLLATE collation]
    • Keep in mind that field widths will need to take multi-byte characters into account
    • MySQL metadata is stored as UTF-8
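The CREATE TABLE syntax above can be assembled in PHP along the following lines. Everything here is illustrative: `charset_collate_clause()` is a hypothetical helper, and the table and column names are made up. Inside WordPress, `$wpdb->get_charset_collate()` returns a comparable clause based on the DB_CHARSET and DB_COLLATE settings, so plugin code would normally use that instead.

```php
<?php
// Hypothetical helper: build the "DEFAULT CHARACTER SET ... COLLATE ..."
// clause that goes at the end of a CREATE TABLE statement.
function charset_collate_clause(string $charset = 'utf8', string $collate = ''): string
{
    $clause = 'DEFAULT CHARACTER SET ' . $charset;
    if ($collate !== '') {
        $clause .= ' COLLATE ' . $collate;
    }
    return $clause;
}

// Table and column names are made up for the example.
$sql = "CREATE TABLE myplugin_notes (\n"
     . "  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,\n"
     . "  note VARCHAR(255) NOT NULL,\n"
     . "  PRIMARY KEY (id)\n"
     . ") " . charset_collate_clause('utf8', 'utf8_general_ci') . ";";

echo $sql, "\n";
```

Leaving the collation empty, as the default DB_COLLATE setting does, lets MySQL pick the default collation for the chosen character set.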
  45. Querying a database

    • You can specify the collation to use for a specific query:
      SELECT k FROM t1 ORDER BY k COLLATE utf8_spanish_ci;
    • You can even use it in the WHERE clause:
      SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
    • More info at: http://dev.mysql.com/doc/refman/5.1/en/charset-collate.html
  47. GetText / .po files

    • Poedit understands all encodings supported by the operating system and works in Unicode internally: http://www.poedit.net/
    • Set the character encoding in the Project settings:
  49. BOM! Run for your life!

    • ï»¿ or extra blank line (normally at the top of file/page)
    • BOM = Byte Order Mark. The character is the ZERO WIDTH NON-BREAKING SPACE. If not placed at the top, it shouldn’t give any problems.
    • In UTF-16/32 the BOM is necessary to determine the byte order. In UTF-8 it is not.
    • Sometimes added by editors and incorrectly interpreted by browsers.
  50. BOM squad

    • Some browsers may display the BOM.
    • Can cause ‘headers already sent’ problems. It can also cause problems when in front of “#!” in a shell script.
    • Check whether your editor has a setting you can (un)check for (not) adding the BOM / UTF-8 signature
    • Open the file in (another) editor; if you can see the BOM, manually delete it and save the file. Sometimes even just opening the file and saving it again as UTF-8 will solve it.
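Checking for a BOM can also be automated. The sketch below detects and strips a UTF-8 BOM from a string, for instance the contents of a file read with `file_get_contents()`. The constant and function names are hypothetical; the three bytes EF BB BF are the UTF-8 encoding of U+FEFF.

```php
<?php
// The UTF-8 encoding of U+FEFF, i.e. the UTF-8 BOM / signature.
const UTF8_BOM = "\xEF\xBB\xBF";

// True when the string starts with a UTF-8 BOM.
function has_utf8_bom(string $contents): bool
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

// Return the string without a leading UTF-8 BOM, if there is one.
function strip_utf8_bom(string $contents): string
{
    return has_utf8_bom($contents) ? substr($contents, 3) : $contents;
}

$withBom = UTF8_BOM . '<html>';
var_dump(has_utf8_bom($withBom));
var_dump(has_utf8_bom(strip_utf8_bom($withBom)));
```

Running a check like this over theme and plugin files is a quick way to find the file behind an unexplained ‘headers already sent’ warning.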
  51. Useful BOM resources:

    • W3C on the BOM character: http://www.w3.org/International/questions/qa-utf8-bom
    • Web-based BOM testing: http://people.w3.org/rishida/utils/bomtester/
    • Unicode Consortium on BOM character: http://www.unicode.org/unicode/faq/utf_bom.html#bom1
  52. We’ve covered:

    • Dependency on user’s computer setup
    • Client side code – HTML, CSS, JS
    • Communication with the client
    • Server side code – PHP
    • Communicating with a MySQL database
    • MySQL
    • Communicating with files
    • Other common issues
  53. Keep in touch!

    Juliette Reinders Folmer
    Email: [email protected]
    Web: http://www.adviesenzo.nl/
    LinkedIn: http://nl.linkedin.com/in/julietterf
    WordPress: http://wordpress.org/support/profile/jrf