
Everything you always wanted to know about UTF-8 (but never dared to ask)

Presented on June 24th 2014 at the PHP Tour, Lyon, France.
http://afup.org/pages/phptourlyon2014/
http://www.joind.in/11233
---------------------------------------------------------------
For any application with even the remotest ambition of international use, the only way to go is to use UTF-8. And even without that ambition, using UTF-8 might still bring you more benefits than you currently realize. Unfortunately most developers at one point or another run into problems implementing UTF-8 and get discouraged. That ends now! In this talk I will cover UTF-8 from the basic linguistics, through client-side aspects to all the steps you need to take to tackle the most common (and some more obscure) issues when using UTF-8 in a database driven web application.
---------------------------------------------------------------
Links:

Slide 2:
http://intertwingly.net/blog/2004/04/25/utf-8-musings#c1082919794
http://intertwingly.net/blog/2004/04/25/utf-8-musings#c1082929502

Slide 5/6:
http://www.ethnologue.com/

Slide 7:
http://www.omniglot.com/

Slide 8:
http://en.wikipedia.org/wiki/Writing_system

Slide 11:
http://geek-and-poke.com/

Slide 12:
http://www.unicode.org/charts/

Slide 39:
http://sourceforge.net/projects/phputf8

Slide 40:
http://www.phpwact.org/php/i18n/utf-8

Slide 44:
http://www.php.net/regexp.reference.unicode

Slide 46:
http://www.php.net/mbstring

Slide 47:
http://www.php.net/iconv

Slide 48:
http://www.php.net/intl

Slide 57:
http://www.phpwact.org/php/i18n/utf-8/mysql

Slide 62:
http://www.poedit.net/

---------------------------------------------------------------
Other interesting links:

http://www.eki.ee/letter/
http://www.bisharat.net/A12N/

http://www.wazu.jp/
http://www.alanwood.net/unicode/
http://en.wikipedia.org/wiki/Unicode_typefaces

http://www.styopkin.com/details_free_and_easy_fonts_viewer.html
http://www.heiner-eichmann.de/software/listfont/listfont.htm

http://www.w3.org/International/techniques/authoring-html
http://www.sil.org/iso639-3/codes.asp
http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements
http://www.w3.org/International/articles/language-tags/

http://httpd.apache.org/docs/2.4/mod/mod_charset_lite.html

http://www.collation-charts.org/
http://www.unicode.org/charts/uca/
http://dev.mysql.com/doc/refman/5.7/en/charset-collation-effect.html

http://www.w3.org/International/questions/qa-utf8-bom
http://people.w3.org/rishida/utils/bomtester/
http://www.unicode.org/unicode/faq/utf_bom.html#bom1

Juliette Reinders Folmer

June 24, 2014

Transcript


  1. ‡ Disclaimer: As you never dared to ask, how am I supposed to know what you wanted to
    know ? Do you think I’m a mind-reader or something ? Anyways, my lawyer advised me against
    trying to mind-read, so I’m just going to guess and hope I get it right.
    Don’t come and complain afterwards that I didn’t tell you the things you wanted to know. At the
    end of the day: if you wanted to get your questions answered, you had better have the courage to
    ask them.
    By: Juliette Reinders Folmer
    @jrf_nl


  2. “Internationalization is like parenting: a
    lifelong cycle of hardship in which no
    cumulative knowledge is gained.”
    Mark Pilgrim, April 2004
    “Mark believes that because Unicode is
    harder than not-Unicode people will always
    create systems that fail to use Unicode and
    so break in unpleasant ways only after they
    are widely enough deployed that I18N
    becomes an issue.”
    J. Graham, April 2004


  3. Some common misconceptions
    • Unicode !== UTF-8
    • UTF-8 !== internationalization
    • UTF-8 !== charset


  4. Why worry about it anyway ?
    • Local is an illusion, always think global:
    – Company/Client gets taken over by a foreign
    company
    – Mergers
    – Expansion to other regions
    – Local users/employees from other origins
    • Code efficiency
    • Cost
    Helgi Þormar Þorbjörnsson


  5. Some language statistics
    • 7105 ‘living’ languages
    • +/- 308 languages with > 1 million speakers
    • Nr 1 language in the world ? Mandarin Chinese
    • Nr 2 ? Spanish
    Did you know:
    • That France has more than 9 officially recognized languages ? *
    • That the country with the most languages is Papua New Guinea ? (837)
    * Alsatian, Catalan, Corsican, Breton, French, Gallo, Occitan, Tahitian, some
    languages of New Caledonia
    Source: Ethnologue 2013


  6. Top 20 languages in the world *
    Rank  Language                  Total speakers (M)  Script
    1     Mandarin Chinese          1,197               Vernacular Chinese
    2     Spanish                   414                 Roman
    3     English                   335                 Roman
    4     Hindi                     260                 Devanagari
    5     Arabic (standard)         237                 Arabic
    6     Portuguese                203                 Roman
    7     Bengali                   193                 Bengali
    8     Russian                   167                 Cyrillic
    9     Japanese                  122                 Hiragana, Katakana, and Kanji
    10    Javanese                  84                  Javanese
    11    Lahnda (Western Punjabi)  83                  Lahnda, Arabic
    12    German (standard)         78                  Roman
    13    Korean                    77                  Korean (Hangul)
    14    French                    75                  Roman
    15    Telugu                    74                  Telugu
    16    Marathi                   72                  Devanagari
    17    Turkish                   71                  Roman
    18    Tamil                     69                  Tamil
    19    Vietnamese                68                  Roman
    20    Urdu                      64                  Arabic
    * Source: Ethnologue 2013/14


  7. About writing systems
    • There are approximately 180 * writing
    systems in active use
    • Most are used (with or without extensions) for
    several languages
    • Some languages use more than one writing
    system
    • Numerous other writing systems for
    ceremonial or religious use
    • Or for fun ;-)
    * Source: Omniglot


  8. Distribution of writing systems
    Source: Wikipedia


  9. There Ain't No
    Such Thing As
    Plain Text


  10. On character sets and encoding
    Charset: Ǻ
    Encoding (UTF-8): 11000111 10111010 *
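
The relationship between a character and its encoded bytes can be checked with a few lines of PHP. This is only a sketch: `utf8_bits()` is a hypothetical helper name, and the character Ǻ (U+01FA) is written as its two UTF-8 bytes so the result does not depend on how the source file is saved.

```php
<?php
// Hypothetical helper: render each byte of a string as eight binary digits.
function utf8_bits($string)
{
    $bits = array();
    for ($i = 0; $i < strlen($string); $i++) {
        $bits[] = sprintf('%08b', ord($string[$i]));
    }
    return implode(' ', $bits);
}

// U+01FA (Ǻ) written as its two UTF-8 bytes.
$char = "\xC7\xBA";

echo utf8_bits($char), "\n"; // 11000111 10111010
echo utf8_bits('A'), "\n";   // 01000001 (plain ASCII stays a single byte)
```

The same character would produce different bytes in UTF-16 or UTF-32, which is why the character set and the encoding are listed separately on the slide.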


  11. © Geek and Poke


  12. Unicode
    Unicode is a computing industry
    standard for the consistent encoding,
    representation and handling of text
    expressed in most of the world's
    writing systems.
    (Wikipedia)
    • Unicode Code charts:
    http://www.unicode.org/charts/


  13. UTF
    • UTF = Unicode Transformation Format
    • UTF-8 is one of the character encodings for
    implementing Unicode
    • Alternatives are UTF-7 (legacy), UTF-16, UTF-32
    • UTF-8 is (backward) compatible with ASCII,
    UTF-16/32 are not.
    * Image source: W3C


  14. Advantages of UTF-8
    • Backward compatible with ASCII
    • UTF-8 can encode any Unicode character
    • Every XML parser is required to support UTF-8 and UTF-16
    • UTF-8 and UTF-16 are the standards for having
    Unicode in HTML. UTF-8 is preferred.
    • Can be fairly reliably recognized with small
    chance of confusion.
    • Sorting UTF-8 as arrays of unsigned bytes will
    result in same order as sorting on Unicode code
    point.
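
The sorting property in the last bullet can be illustrated with a small sketch. The four sample characters are my own choice, written as raw bytes, and `strcmp()` is used because it compares strings byte by byte.

```php
<?php
// Four characters as raw UTF-8 bytes, listed in reverse code point order:
// U+1D11E (musical G clef), U+20AC (euro sign), U+00E9 (e acute), U+007A (z).
$chars = array("\xF0\x9D\x84\x9E", "\xE2\x82\xAC", "\xC3\xA9", "z");

// strcmp() compares byte by byte, treating the bytes as unsigned values.
usort($chars, 'strcmp');

// Print the sorted byte sequences as hex.
echo implode(' ', array_map('bin2hex', $chars)), "\n";
// 7a c3a9 e282ac f09d849e
```

The byte order that comes out (z, é, €, G clef) is also the code point order U+007A, U+00E9, U+20AC, U+1D11E. Note that this is code point order, not the alphabetical order a user of a given language would expect; that is what collations are for.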


  15. So, what’s the problem ?
    • Everything defaults to non-UTF-8
    Mostly latin, ISO-8859-1 or US-ASCII
    So, what’s the solution ?
    • Be EXPLICIT everywhere (and I don’t mean
    $%&@-explicit)


  16. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  17. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  18. User’s computer
    Potential issues:
    • Extended language support ?
    • Code pages ?
    • Font ?
    • Browsers, browsers, browsers


  19. Characteristics of text
    Characteristic        Example 1               Example 2
    Language              English                 Arabic
    Writing system        Roman/Latin             Arabic
    Writing direction     Left to right           Right to left
    Writing direction     Top to bottom           Top to bottom
    Character (sub)set    Basic Latin             Arabic
    Character encoding    UTF-16 (can vary)       UTF-16 (...)
    Font                  Arial                   Arial
    Meaning               “What I really love”    “ار ا يذ ا ا “


  20. Languages:
    •English
    •Greek
    •Ukrainian
    •Mandarin Chinese
    •Japanese
    •Hindi
    •Korean
    •Kannada
    •Punjabi Gurmukhi
    •Tamil
    •Tigre
    •Myanmar
    •Arabic
    •Farsi
    •Hebrew


  21. About Fonts
    • Unicode versus non-unicode fonts
    Be aware & be wary !
    • Few fonts capable of handling a wide range of
    Unicode characters.
    Examples:
    Arial Unicode MS, Bitstream Cyberbit,
    Code2000, GNU Unifont


  22. Fonts used:
    •Verdana
    •Verdana
    •Verdana
    •SimSun
    •MS Mincho
    •Code2000
    •Batang
    •Arial Unicode MS
    •Lohit Punjabi
    •Latha
    •GS GeezMahtemUnicode
    •WinInnwa
    •Arial Unicode MS
    •Arial
    •Arial Unicode MS


  23. These are the
    same phrases
    converted to the
    Verdana font.
    WinInnwa is a
    non-Unicode
    compliant font...


  24. To stress the importance of Unicode
    and Unicode-compliant fonts:


  25. These are the same
    phrases again, now
    converted to the
    Arial Unicode MS
    font.
    The Ge’ez script
    character (sub)set
    (Ethiopic range)
    is not included in
    Arial Unicode MS.


  26. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  27. Client side
    • Always declare the character encoding
    * Image source: W3C


  28. Client side – HTML
    • Use meta-headers:
    <meta http-equiv="Content-Type"
    content="text/html; charset=utf-8">
    • Tell the browser (and the search engines) the
    language too if you can, for example with the lang
    attribute on the html element.


  29. Client side – XHTML
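    • Add an XML declaration at the top
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">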


  30. Client side – CSS
    • Add an encoding to the CSS file – must be
    on the very first line of the file!
    @charset "utf-8";
    • Use Unicode compliant fonts and tell the
    browser which to use with CSS
    我很喜欢
    p[lang|="zh"] { font-family: SimSun, "MS Song",
    "Adobe Song Std L", sans-serif; }
    p[lang="zh-cn"] { font-family: SimSun, "MS Song",
    "Adobe Song Std L", sans-serif; }
    p[lang|="ar"] { font-family: Arial, "Arial Unicode MS",
    sans-serif; direction: rtl; }


  31. Client-side - Javascript
    • Specify content-type AND charset headers!
    • Can use Unicode escape sequences \uHHHH


  32. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  33. Putting files on the server
    • Save the file(s) as encoded in UTF-8
    • Don’t forget to upload the file as binary rather
    than ASCII !


  34. URLs
    • Can you see a difference ?


  35. Client-server communication
    Receiving data from the client
    • Make sure you send information to the server in
    the correct encoding
    • This is especially important for user-input !!!
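
On the receiving end, user input can be checked for well-formed UTF-8 before it is used. The following is a minimal sketch, not a complete input filter: `get_utf8_field()` is a hypothetical helper name, and it assumes PCRE was built with UTF-8 support so that the `/u` modifier is available.

```php
<?php
// Hypothetical helper: return the named field if it is a well-formed
// UTF-8 string, otherwise null.
function get_utf8_field(array $post, $name)
{
    if (!isset($post[$name]) || !is_string($post[$name])) {
        return null;
    }
    // With the /u modifier, preg_match() returns false on invalid UTF-8,
    // so only a return value of exactly 1 means the input is well-formed.
    return preg_match('//u', $post[$name]) === 1 ? $post[$name] : null;
}

$valid   = array('name' => "Helgi \xC3\x9Eormar"); // Þ encoded as UTF-8
$invalid = array('name' => "Helgi \xDEormar");     // Þ encoded as ISO-8859-1

var_dump(get_utf8_field($valid, 'name') !== null);   // bool(true)
var_dump(get_utf8_field($invalid, 'name') !== null); // bool(false)
```

In a real request handler the array passed in would be `$_POST` or `$_GET`; plain arrays are used here so the sketch runs on its own.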


  36. Client-server communication
    Sending data to the client
    • Send an HTTP header:
    header( 'Content-Type: text/html; charset=utf-8' );
    • .htaccess/ Apache’s httpd.conf
    Settings in .htaccess overrule HTML headers !
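
Putting the HTTP header and the HTML declarations together could look roughly like this. It is a sketch under assumptions: `render_utf8_page()` is a hypothetical helper, the language code and sample text are placeholders, and the source file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: build a minimal HTML page that declares its
// encoding and language in the markup.
function render_utf8_page($lang, $title, $body)
{
    // Escape with the encoding stated explicitly.
    $esc = function ($text) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    };

    return "<!DOCTYPE html>\n"
        . '<html lang="' . $esc($lang) . "\">\n"
        . "<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n"
        . '<title>' . $esc($title) . "</title>\n"
        . "</head>\n"
        . '<body><p>' . $esc($body) . "</p></body>\n"
        . "</html>\n";
}

// The HTTP header has to be sent before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo render_utf8_page('zh', 'UTF-8 demo', '我很喜欢');
```

Declaring the encoding in both places is deliberate: the HTTP header is what the browser sees first, while the meta element still applies when the page is saved to disk and opened without a server.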


  37. Client-server communication
    .htaccess examples
    • # Maps file extensions to a character encoding. Especially
    useful in content negotiation situations. (httpd.conf)
    AddCharset utf-8 .utf8
    • # Pass the default character encoding for content-type
    text/plain and text/html
    AddDefaultCharset On|Off|charset
    AddDefaultCharset UTF-8
    AddDefaultCharset On => iso-8859-1
    • # Add a default character encoding per file extension
    AddType 'text/html; charset=UTF-8' html
    • # Identify the encoding for a particular file:

    ForceType 'text/html; charset=UTF-8'


  38. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  39. PHP
    • Currently not very friendly for UTF-8
    • PHP6 development dormant
    • Some PHP extensions come to the rescue:
    MBstring
    iconv
    ~Intl
    • There are also some nifty function collections /
    classes available to help you.
    Take note of:
    http://sourceforge.net/projects/phputf8


  40. PHP UTF-8 safe functions
    Safe:
    • explode()
    • str_replace()
    • PHP5+ ~htmlentities()
    NOT Safe:
    • Everything else
    MUST READ: http://www.phpwact.org/php/i18n/utf-8


  41. Danger zone
    • setlocale()
    – And all functions which use locale
    • strtoupper() / strtolower()
    • number_format() / money_format()
    • ucfirst() / ucwords()
    • strftime()
    • Gettext extension
    • Filter extension text functions
    • Ctype
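
For the case-conversion functions in this list, one possible way around the problem is the multibyte counterpart with the encoding passed explicitly. The sketch below assumes the mbstring extension is loaded. The sample word is written as raw bytes, and no result is shown for `strtoupper()` because its behaviour on non-ASCII bytes depends on the PHP version and locale.

```php
<?php
// "été" as raw UTF-8 bytes (é is C3 A9).
$word = "\xC3\xA9t\xC3\xA9";

// mb_strtoupper() is told the encoding and converts per character,
// independent of the current locale.
$upper = mb_strtoupper($word, 'UTF-8');

echo bin2hex($upper), "\n"; // c38954c389, the bytes of "ÉTÉ"
```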


  42. Test for well-formedness
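    // Returns true when $string is empty or well-formed UTF-8.
    function utf8_compliant( $string ) {
        if ( strlen( $string ) == 0 ) {
            return true;
        }
        // With the /u modifier preg_match() fails on invalid UTF-8.
        return ( preg_match( '/^.{1}/us', $string ) == 1 );
    }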


  43. Handling text
    • You don’t need htmlentities() anymore.
    Use htmlspecialchars() instead:
    $html = htmlspecialchars( $utf8_string, ENT_COMPAT, 'UTF-8' );
    • strlen() will count bytes, so use:
    function utf8_strlen( $string ) {
        return strlen( utf8_decode( $string ) );
    }
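
The difference between counting bytes and counting characters is easy to see in a short sketch. It assumes the mbstring extension is loaded and uses `mb_strlen()` in place of the `utf8_decode()` trick, which may be the safer choice on current PHP since `utf8_decode()` is deprecated as of PHP 8.2.

```php
<?php
// "Müller" as raw UTF-8 bytes: ü takes two bytes (C3 BC).
$name = "M\xC3\xBCller";

echo strlen($name), "\n";             // 7 (bytes)
echo mb_strlen($name, 'UTF-8'), "\n"; // 6 (characters)

// The same applies to cutting strings: mb_substr() counts characters,
// so it never splits a multibyte character in half.
echo bin2hex(mb_substr($name, 0, 2, 'UTF-8')), "\n"; // 4dc3bc, the bytes of "Mü"
```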


  44. PCRE
    • PCRE can be relatively UTF-8 safe if compiled
    with Unicode.
    Use: preg_match('/^.+$/u', $string);
    • Test whether PCRE has been compiled with
    Unicode support:
    if ( preg_match( '/^.{1}$/u', "ñ", $UTF8_ar ) != 1 ) {
        trigger_error( 'PCRE is not compiled with UTF-8 support', E_USER_ERROR );
    }
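
Once the `/u` modifier is available, Unicode character properties can be used in patterns as well. The following sketch uses a hypothetical helper, `looks_like_name()`, and an illustrative rule of my own (letters, combining marks, spaces, apostrophes and hyphens); it assumes PCRE was built with Unicode property support.

```php
<?php
// Hypothetical helper: true if $name consists only of letters (\p{L}),
// combining marks (\p{M}), spaces, apostrophes and hyphens, in any script.
function looks_like_name($name)
{
    return preg_match('/^[\p{L}\p{M}\' -]+$/u', $name) === 1;
}

var_dump(looks_like_name("Helgi \xC3\x9Eormar")); // bool(true), Þ counts as a letter
var_dump(looks_like_name("R2-D2"));               // bool(false), digits are not letters
```

A byte-oriented class such as `[a-zA-Z]` would have rejected the first name, which is the kind of bug the talk is warning about.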


  45. utf8_encode() & utf8_decode()
    • Only useful for converting between
    ISO-8859-1 and UTF-8.


  46. MBstring extension
    • Multibyte aware implementations of some of the
    most common PHP string functions, the POSIX
    extended regex extension and the mail function.
    • Mbstring supports many different character sets,
    most importantly UTF-8.
    • Allows for conversion between character sets
    and implements some level of encoding
    detection.


  47. Iconv extension
    • Bundled since PHP 5+.
    • Main purpose of iconv : converting between
    different character sets.
    • From PHP 5+, iconv has implementations of
    some common string functions, but is slower
    than mbstring for UTF-8.
    • Great for dealing with files, filtering streams and
    handling output buffers.
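
A conversion with iconv could be sketched as follows, assuming the iconv extension is loaded. The input is a made-up example of legacy Latin-1 data, written as raw bytes.

```php
<?php
// "Müller" as it would arrive from a Latin-1 source: ü is the single byte FC.
$latin1 = "M\xFCller";

// iconv( from, to, string ) converts between character encodings.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($utf8), "\n";                         // 4dc3bc6c6c6572
echo strlen($latin1), ' -> ', strlen($utf8), "\n"; // 6 -> 7
```

The string grows by one byte because ü needs two bytes in UTF-8. Converting data that is already UTF-8 a second time would garble it, so it helps to know the source encoding before converting.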


  48. Intl extension
    • Bundled since PHP 5.3+, but not always
    enabled.
    • Wrapper around the excellent ICU library
    • Modules:
    – Collator
    – Number Formatting
    – Currency Formatting
    – Message Formatter
    (replaces gettext)
    – Normalizer
    – Locale
    – Converters
    – Transliterators
    – Spoof checker
    – And more...


  49. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  50. Communicating with MySQL
    • The connection between PHP and MySQL
    defaults to a latin1 connection.
    • The first query you should run after making your
    connection:
    mysqli_query( $conn, 'SET NAMES "utf8" [COLLATE "collation_name"]' );
    OR
    mysqli_query( $conn, 'SET CHARACTER SET utf8' );
    • PHP 5.2+:
    mysqli_set_charset( $conn, 'utf8' );


  51. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL 4.1+
    •Communicating with files
    •Other common issues


  52. Finding out current settings
    • Find out how your system is set up:
    mysql> SHOW VARIABLES LIKE 'character_set%';
    mysql> SHOW VARIABLES LIKE 'collation%';
    +--------------------------+-------------------+
    | Variable_name            | Value             |
    +--------------------------+-------------------+
    | character_set_client     | latin1            |
    | character_set_connection | latin1            |
    | character_set_database   | latin1            |
    | character_set_results    | latin1            |
    | character_set_server     | latin1            |
    | character_set_system     | utf8              |
    | collation_connection     | latin1_swedish_ci |
    | collation_database       | latin1_swedish_ci |
    | collation_server         | latin1_general_ci |
    +--------------------------+-------------------+


  53. Finding out what’s available
    • To find out which character encodings are
    available and what their default collation is:
    SHOW CHARACTER SET;
    • To find out which collations are available:
    SHOW COLLATION LIKE 'utf8%';


  54. Setting up a server
    • Add the following to your /etc/my.cnf file:
    [mysqld]
    ...
    character-set-server=utf8
    collation-server=utf8_general_ci
    (MySQL 5.5 and later no longer accept the older
    default-character-set / default-collation server options)
    • If you are the only user you could even do:
    (MySQL 5.x and later)
    (not executed for super-user logins)
    init_connect='SET NAMES utf8'


  55. Setting up databases & tables
    • Make sure that the database, the tables and the
    text columns are all in UTF-8:
    (CREATE | ALTER) DATABASE / TABLE ... (
    ...
    ) [DEFAULT] CHARACTER SET utf8
    [[DEFAULT] COLLATE collation]
    • Don’t forget Field widths


  56. Choosing the collation
    • Collation == Sort order
    • Guideline to the collations:
    _ci = case insensitive
    _cs = case sensitive
    _bin = binary
    • Test !
    Image Source: Mysql.com


  57. Converting an existing database
    • Using MySQL’s CONVERT function you can
    migrate ‘old’ data:
    INSERT INTO utf8table (utf8column)
    SELECT CONVERT(latin1field USING utf8)
    FROM latin1table;
    • For a complete php script to convert your
    database:
    http://www.phpwact.org/php/i18n/utf-8/mysql


  58. Querying a database
    • You can specify the collation to use for a
    specific query:
    SELECT k
    FROM t1
    ORDER BY k COLLATE utf8_spanish_ci;
    • You can even use it in the WHERE clause:
    SELECT *
    FROM t1
    WHERE k LIKE _latin1 'Müller' COLLATE
    latin1_german2_ci;


  59. Common issue
    • If you run into the following error message when
    running a query:
    Illegal mix of collations
    (utf8_bin,IMPLICIT) and
    (latin1_swedish_ci,COERCIBLE) for operation
    You may want to try and make your query more
    explicit with a character string literal:
    SELECT *
    FROM table
    WHERE col = _utf8'xyz';


  60. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  61. Working with files
    • Use the b flag in fopen()
    • Don’t use Unicode in filenames
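
Reading and writing with the b flag could look like the sketch below. The temporary file name prefix is a placeholder and the source file is assumed to be saved as UTF-8. The b flag asks PHP not to translate any bytes on the way in or out, so the multibyte sequences arrive untouched.

```php
<?php
$text = "我很喜欢\n";

// Create a temporary file with a plain ASCII name.
$path = tempnam(sys_get_temp_dir(), 'utf8');

// Write in binary mode.
$out = fopen($path, 'wb');
fwrite($out, $text);
fclose($out);

// Read it back, again in binary mode.
$in = fopen($path, 'rb');
$read = stream_get_contents($in);
fclose($in);
unlink($path);

var_dump($read === $text); // bool(true): the bytes came back unchanged
```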


  62. .po files
    • Poedit understands all encodings supported by
    the operating system and works in Unicode
    internally: http://www.poedit.net/


  63. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  64. BOM ! Run for your life !
    • ï»¿ or extra blank line
    • BOM = Byte Order Mark
    The character is U+FEFF ZERO WIDTH NO-BREAK
    SPACE. If not placed at the top, it shouldn’t give any
    problems.
    • In UTF-16/32 the BOM is necessary to
    determine the byte order. In UTF-8 it is not.
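
When a BOM arrives in data you do not control (an uploaded file, for instance), it can be removed in code. This is a minimal sketch and `strip_utf8_bom()` is a hypothetical helper name. In UTF-8 the BOM is always the three bytes EF BB BF.

```php
<?php
// Hypothetical helper: remove a UTF-8 byte order mark from the start
// of a string, if there is one.
function strip_utf8_bom($string)
{
    if (substr($string, 0, 3) === "\xEF\xBB\xBF") {
        // The cast keeps the result a string on old PHP versions, where
        // substr() returned false when nothing was left.
        return (string) substr($string, 3);
    }
    return $string;
}

echo strlen(strip_utf8_bom("\xEF\xBB\xBF<?xml")), "\n"; // 5
echo strlen(strip_utf8_bom("<?xml")), "\n";             // 5
```

Only a BOM at the very start is removed; the same three bytes further on in the text are left alone.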


  65. Bom squad
    • Some browsers may display the BOM.
    • ‘headers already sent’ problems.
    It can also cause problems when in front of “#!” in a shell
    script.
    • Check your editor settings
    • Open the file in (another) editor, if you can see
    the BOM manually delete it and save the file.
    Sometimes even just opening the file and saving it again
    as UTF-8 will solve it.


  66. We’ve covered:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  67. Keep in touch!
    (I’m self-employed, you can hire me ;-) )
    Juliette Reinders Folmer
    Email: [email protected]
    Web: http://www.adviesenzo.nl/
    LinkedIn: http://nl.linkedin.com/in/julietterf
    Twitter: http://twitter.com/jrf_nl
    GitHub: http://github.com/jrfnl/
    Please rate this talk on joind.in/11233
    Slides: speakerdeck.com/jrf
    Endorsements and recommendations on
    LinkedIn are much appreciated too!


  68. Anything else you never
    dared to ask before ?
