
Everything you always wanted to know about UTF-8 (but never dared to ask)


Presented on October 28th 2013 at the International PHP Conference, Munich, Germany.
http://phpconference.com/2013/en
---------------------------------------------------------------
For any application with even the remotest ambition of international use, the only way to go is to use UTF-8. And even without that ambition, using UTF-8 might still bring you more benefits than you currently realize. Unfortunately most developers at one point or another run into problems implementing UTF-8 and get discouraged. That ends now! In this talk I will cover UTF-8 from the basic linguistics, through client-side aspects to all the steps you need to take to tackle the most common (and some more obscure) issues when using UTF-8 in a database driven web application.

Juliette Reinders Folmer

October 28, 2013

Transcript

  1. Juliette Reinders Folmer | Advies en zo
    Everything you always wanted to know
    about UTF-8 *
    * But never dared to ask


  2. “Internationalization is like parenting: a
    lifelong cycle of hardship in which no
    cumulative knowledge is gained.”
    Mark Pilgrim, April 2004
    “Mark believes that because Unicode is
    harder than not-Unicode people will always
    create systems that fail to use Unicode and
    so break in unpleasant ways only after they
    are widely enough deployed that I18N
    becomes an issue.”
    J. Graham, April 2004


  3. Some common misconceptions
    • Unicode !== UTF-8
    • UTF-8 !== internationalization
    • UTF-8 !== charset


  4. Why worry about it anyway ?
    • It’s all about being prepared:
    – Company/Client gets taken over by a foreign
    company
    – Mergers
    – Expansion to other regions
    – Local users/employees from other origins
    • Code efficiency
    • Cost
    • Global market


  5. Some language statistics
    • 7105 ‘living’ languages
    • +/- 308 languages with > 1 million speakers
    • Nr 1 language in the world ? Mandarin Chinese
    • Nr 2 ? Spanish
    Did you know:
    • That Germany has 27* officially
    recognized languages ?
    • That the country with the most
    languages is Papua New
    Guinea (836) ?
    * Alemannic, Bavarian, Danish, Frankish,
    Eastern Frisian, Northern Frisian,
    Standard German, Kabardian, Kölsch,
    Limburgish, Luxembourgeois,
    Mainfränkisch, Pfaelzisch, Plautdietsch,
    Polish, Balkan Romani, Sinte Romani,
    Vlax Romani, Saterfriesisch, Low Saxon,
    Upper Saxon, Lower Sorbian, Upper
    Sorbian, Swabian, Westphalien, Yeniche,
    Western Yiddish.
    Source: Ethnologue 2013


  6. Top 20 languages in the world *
    Rank  Language                  Total speakers (M)  Script
    1     Mandarin Chinese          1,197               Vernacular Chinese
    2     Spanish                   406                 Roman
    3     English                   335                 Roman
    4     Hindi                     260                 Devanagari
    5     Arabic (standard)         223                 Arabic
    6     Portuguese                202                 Roman
    7     Bengali                   193                 Bengali
    8     Russian                   162                 Cyrillic
    9     Japanese                  122                 Hiragana, Katakana, and Kanji
    10    Javanese                  84                  Javanese
    11    German (standard)         84                  Roman
    12    Lahnda (Western Punjabi)  83                  Lahnda, Arabic
    13    Telugu                    74                  Telugu
    14    Marathi                   72                  Devanagari
    15    Tamil                     69                  Tamil
    16    French                    69                  Roman
    17    Vietnamese                68                  Roman
    18    Korean                    66                  Korean (Hangul)
    19    Urdu                      63                  Arabic
    20    Italian                   61                  Roman
    * Source: Ethnologue 2013


  7. About writing systems
    • There are approximately 180 * writing
    systems in active use
    • Most are used (with or without extensions) for
    several languages
    • Some languages use more than one writing
    system
    • Numerous other writing systems for
    ceremonial or religious use
    • Or for fun ;-)
    * Source: Omniglot


  8. Distribution of writing systems
    Source: Wikipedia


  9. Writing system resources
    • Info on languages:
    http://www.ethnologue.com/
    • Info on writing systems:
    http://www.omniglot.com/
    • Which characters are used in language X ?
    http://www.eki.ee/letter/
    • Info on Latin extensions for African languages
    http://www.bisharat.net/A12N/
    • And of course:
    http://en.wikipedia.org/wiki/Writing_system


  10. There Ain't No Such Thing As
    Plain Text


  11. On character sets and encoding
    “A coded character set is a set of characters
    for which a unique number has been
    assigned to each character. Units of a coded
    character set are known as code points.”
    (W3C)
    “The character encoding reflects the way
    these abstract characters are mapped to
    bytes for manipulation in a computer.” (W3C)


  12. Unicode
    “Unicode is a computing industry standard
    allowing computers to consistently
    represent and manipulate text expressed
    in most of the world's writing systems.”
    (Wikipedia)
    • Unicode Code charts:
    http://www.unicode.org/charts/


  13. UTF
    • UTF = Unicode Transformation Format
    • UTF-8 is one of the character encodings for
    implementing Unicode
    • Alternatives are UTF-7 (legacy), UTF-16, UTF-32
    • UTF-8 is (backward) compatible with ASCII,
    UTF-16/32 are not.
    * Image source: W3C
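
The ASCII compatibility claim can be checked directly. This is a small Python sketch (not PHP, and the sample string is an assumption chosen for illustration): pure ASCII text produces identical bytes in ASCII and UTF-8, but not in UTF-16.

```python
text = "plain ASCII text"  # illustrative sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, written without a BOM

print(utf8_bytes == ascii_bytes)   # same byte sequence
print(utf16_bytes == ascii_bytes)  # different: two bytes per character here
print(len(ascii_bytes), len(utf8_bytes), len(utf16_bytes))
```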


  14. Advantages of UTF-8
    • Backward compatible with ASCII
    • UTF-8 can encode any Unicode character
    • XML parsers must support UTF-8 and UTF-16
    • UTF-8 and UTF-16 are the standards for
    having Unicode in HTML. UTF-8 is preferred.
    • Can be fairly reliably recognized with small
    chance of confusion.
    • Sorting UTF-8 as arrays of unsigned bytes
    will result in the same order as sorting by
    Unicode code point.
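
The sorting property in the last bullet can be illustrated with a short Python sketch (Python stands in for PHP here, and the word list is made up). Python compares `str` values code point by code point, so sorting the strings directly gives the code point order to compare against.

```python
# Illustrative words covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
words = ["zebra", "Z\u00fcrich", "\u00e9clair", "\u65e5\u672c", "apple", "\U0001f600"]

by_code_point = sorted(words)  # str comparison is by code point
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # unsigned byte order

print(by_utf8_bytes == by_code_point)
```

Note that this is code point order, not a linguistically correct order; that is what collations are for.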


  15. So, what’s the problem ?
    • Everything defaults to non-UTF-8
    Mostly an encoding for the Latin writing system: ISO-8859-1 or US-ASCII
    So, what’s the solution ?
    • Be EXPLICIT everywhere (and I don’t mean
    $%&@-explicit)


  16. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  17. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  18. User’s computer
    Potential issues:
    • Extended language support ?
    • Code pages ?
    • Font ?
    • Browsers, browsers, browsers


  19. Characteristics of text
    • Language
    • Writing system
    • Writing direction
    • Writing direction
    • Character (sub)set
    • Character encoding
    • Font
    • Meaning
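    Example: “What I really love”
    • Language: English
    • Writing system: Roman/Latin
    • Writing direction: Left to right
    • Writing direction: Top to bottom
    • Character (sub)set: Basic Latin
    • Character encoding: UTF-16 (can vary)
    • Font: Arial
    Example: “ار ا يذ ا ا “
    • Language: Arabic
    • Writing system: Arabic
    • Writing direction: Right to left
    • Writing direction: Top to bottom
    • Character (sub)set: Arabic
    • Character encoding: UTF-16 (...)
    • Font: Arial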


  20. Languages:
    •English
    •Greek
    •Ukrainian
    •Mandarin Chinese
    •Japanese
    •Hindi
    •Korean
    •Kannada
    •Punjabi Gurmuki
    •Tamil
    •Tigre
    •Myanmar
    •Arabic
    •Farsi
    •Hebrew


  21. About Fonts
    • Unicode versus non-Unicode fonts
    Be aware & be wary !
    • Few fonts capable of handling a wide range
    of Unicode characters.
    Examples:
    Arial Unicode MS, Bitstream Cyberbit,
    Code2000, GNU Unifont


  22. Fonts used:
    •Verdana
    •Verdana
    •Verdana
    •SimSun
    •MS Mincho
    •Code2000
    •Batang
    •Arial Unicode MS
    •Lohit Punjabi
    •Latha
    •GS GeezMahtemUnicode
    •WinInnwa
    •Arial Unicode MS
    •Arial
    •Arial Unicode MS


  23. These are the same
    phrases converted
    to the Verdana font.
    WinInnwa is a
    non-Unicode
    compliant font...


  24. To stress the importance of Unicode and
    Unicode-compliant fonts:


  25. These are the same
    phrases again, now
    converted to the Arial
    Unicode MS font.
    The Ge’ez script
    character (sub)set
    (Ethiopic range)
    is not included in
    Arial Unicode MS.


  26. Useful font-related resources:
    • Gallery of Unicode fonts – find fonts per
    writing system/language:
    http://www.wazu.jp/
    • Unicode test pages and more:
    http://www.alanwood.net/unicode/
    • Unicode typefaces:
    http://en.wikipedia.org/wiki/Unicode_typefaces
    • Font frequency on computers:
    http://www.codestyle.org/


  27. Some font viewers:
    • Free and easy Font viewer:
    http://www.styopkin.com/details_free_and_easy_fonts_viewer.html
    • ListFont:
    http://www.heiner-eichmann.de/software/listfont/listfont.htm


  28. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  29. Client side
    • Always declare the character encoding
    * Image source: W3C


  30. Client side – HTML
    • Use meta-headers:
    <meta http-equiv="Content-Type"
    content="text/html; charset=utf-8">
    • Tell the browser (and the search engines) the
    language too if you can:


  31. Client side – XHTML
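    • Add an XML declaration at the top
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">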


  32. Client side – CSS
    • Add an encoding to the CSS file – must be
    on the very first line of the file!
    @charset "utf-8";
    • Use Unicode compliant fonts and tell the
    browser which to use with CSS
    我很喜欢
    P[LANG|="zh"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
    P[LANG="zh-cn"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
    P[LANG|="ar"] { font-family: Arial, "Arial Unicode MS", sans-serif; direction: rtl; }


  33. Useful client side resources:
    • W3C on best practices for internationalization:
    http://www.w3.org/International/techniques/authoring-html
    • ISO language codes:
    http://www.sil.org/iso639-3/codes.asp
    • ISO country codes:
    http://www.iso.org/iso/country_codes/iso_3166_code_lists/country_names_and_code_elements
    • W3C on Language tags in HTML and XML:
    http://www.w3.org/International/articles/language-tags/


  34. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  35. Putting files on the server
    • Save the file(s) encoded as UTF-8
    • Don’t forget to upload the file as binary rather
    than ASCII !


  36. URLs
    • Can you see a difference ?


  37. Client-server communication
    Sending data to the client
    • Send an HTTP header:
    header( 'Content-Type: text/html; charset=utf-8' );
    • .htaccess / Apache’s httpd.conf
    Settings in .htaccess overrule the HTML meta headers !
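
To see what the receiving side does with such a header, here is a hedged Python sketch (not PHP) that reads the charset parameter from a `Content-Type` value using the standard library's `email.message.Message`. The helper name `charset_from_content_type` and the ISO-8859-1 fallback are assumptions made for illustration; real clients apply their own fallback rules.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the declared charset (lowercased), or a fallback if none is declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The second call shows why being explicit matters: without a declared charset the consumer has to guess.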


  38. Client-server communication
    .htaccess examples
    • # Maps file extensions to a character encoding. Especially
    useful in content negotiation situations. (httpd.conf)
    AddCharset utf-8 .utf8
    • # Pass the default character encoding for content-type
    text/plain and text/html
    AddDefaultCharset On|Off|charset
    AddDefaultCharset UTF-8
    AddDefaultCharset On => iso-8859-1
    • # Add a default character encoding per file extension
    AddType 'text/html; charset=UTF-8' html
    • # Identify the encoding for a particular file:

    ForceType 'text/html; charset=UTF-8'


  39. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  40. Client-server communication
    Receiving data from the client
    • Make sure you send information to the server
    in the correct encoding
    • This is especially important for user-input,
    i.e. Forms!!!


  41. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  42. PHP
    • Currently not very friendly for UTF-8
    • PHP6 development dormant
    • Some PHP extensions come to the rescue:
    MBstring
    iconv
    ~Intl
    • There are also some nifty function collections
    / classes available to help you.
    Take note of:
    http://sourceforge.net/projects/phputf8


  43. PHP UTF-8 safe functions
    Safe:
    • explode()
    • str_replace()
    • PHP5+ ~htmlentities()
    NOT Safe:
    • Everything else
    MUST READ: http://www.phpwact.org/php/i18n/utf-8


  44. Test for well-formedness
    function utf8_compliant( $string ) {
        if ( strlen( $string ) == 0 ) {
            return true;
        }
        // With the /u modifier, preg_match() fails (returns false)
        // when the subject is not valid UTF-8.
        return ( preg_match( '/^.{1}/us', $string, $array ) == 1 );
    }
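
For comparison, the same well-formedness test can be sketched in Python (a stand-in for the PHP above, with a hypothetical function name). It relies on the strict UTF-8 decoder rejecting malformed input instead of on a regular expression.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # error handling is 'strict' by default
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed_utf8("\u00f1".encode("utf-8")))  # n-tilde encoded as UTF-8
print(is_well_formed_utf8(b"\xf1"))                   # the same letter in ISO-8859-1
```

As in the PHP version, an empty string counts as well formed.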


  45. PCRE
    • PCRE can be relatively UTF-8 safe if compiled
    with Unicode.
    Use: preg_match( '/^.+$/u', $string );
    • Test whether PCRE has been compiled with
    Unicode support:
    if ( preg_match( '/^.{1}$/u', "ñ", $UTF8_ar ) != 1 ) {
        trigger_error( 'PCRE is not compiled with UTF-8 support', E_USER_ERROR );
    }


  46. Handling text
    • You don’t need htmlentities() anymore.
    Use htmlspecialchars() instead:
    $html = htmlspecialchars( $utf8_string, ENT_COMPAT, 'UTF-8' );
    • strlen() will count bytes, so use:
    function utf8_strlen( $string ) {
        return strlen( utf8_decode( $string ) );
    }
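
The difference between counting bytes and counting characters, and why the `utf8_decode()` trick works, can be sketched in Python (not PHP; the sample text is an assumption). Encoding to Latin-1 with replacement turns every unrepresentable character into a single `?`, so the result has one byte per code point.

```python
text = "Zo\u00eb \u65e5\u672c"  # illustrative: "Zoe" with diaeresis, a space, two kanji

char_count = len(text)                  # number of code points
byte_count = len(text.encode("utf-8"))  # what a byte-counting strlen() reports
# Same idea as strlen( utf8_decode( ... ) ): one byte per code point.
via_latin1 = len(text.encode("latin-1", errors="replace"))

print(char_count, byte_count, via_latin1)
```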


  47. utf8_encode() & utf8_decode()
    • Only useful for converting between
    ISO-8859-1 and UTF-8.
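
What these two functions do can be approximated in Python (a sketch, not PHP; the helper names are hypothetical). It illustrates why they only help for ISO-8859-1: anything outside that range is lost on the way back.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Rough equivalent of utf8_encode(): treat input as ISO-8859-1, emit UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(data: bytes) -> bytes:
    """Rough equivalent of utf8_decode(): unrepresentable characters become '?'."""
    return data.decode("utf-8").encode("iso-8859-1", errors="replace")


print(latin1_to_utf8(b"\xe9"))                    # e-acute survives the conversion
print(utf8_to_latin1("\u20ac".encode("utf-8")))   # the euro sign does not
```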


  48. MBstring extension
    • Multibyte aware implementations of some of
    the most common PHP string functions, the
    POSIX extended regex extension and the
    mail function.
    • Mbstring supports many different character
    sets, most importantly UTF-8.
    • Allows for conversion between character sets
    and implements some level of encoding
    detection.


  49. Iconv extension
    • Bundled with PHP 5 and later.
    • Main purpose of iconv: converting between
    different character sets.
    • As of PHP 5, iconv has implementations of
    some common string functions, but is slower
    than mbstring for UTF-8.


  50. Intl extension
    • Bundled since PHP 5.3, but not always
    enabled.
    • Modules:
    – Collator
    – Number Formatter
    – Message Formatter
    – Normalizer
    – Locale
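
Of these modules, the Normalizer is the easiest to illustrate. The sketch below uses Python's `unicodedata` module as a stand-in for PHP's intl extension, with made-up sample strings: the same visible character can be stored as one code point or as a base letter plus a combining mark, and normalization makes the two comparable.

```python
import unicodedata

composed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # different code points
print(unicodedata.normalize("NFC", decomposed) == composed)  # equal after composing
print(unicodedata.normalize("NFD", composed) == decomposed)  # equal after decomposing
```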


  51. Useful resources
    • http://www.php.net/mbstring
    • http://www.php.net/iconv
    • http://www.php.net/intl
    • http://www.php.net/regexp.reference.unicode
    • http://www.phpwact.org/php/i18n
    • http://sourceforge.net/projects/phputf8


  52. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  53. Communicating with MySQL
    • The connection between PHP and MySQL
    defaults to a latin1 connection.
    • The first query you should run after making your
    connection:
    mysql_query( 'SET NAMES "utf8" [COLLATE "collation_name"]' );
    OR
    mysql_query( 'SET CHARACTER SET utf8' );
    • PHP 5.2+:
    mysql_set_charset( 'utf8', $conn );
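
What goes wrong when the connection charset is left at latin1 can be simulated without a database. In this Python sketch (not PHP) the `latin-1` codec stands in for MySQL's latin1, which is close to but not identical with ISO-8859-1, and the sample name is an assumption.

```python
original = "M\u00fcller"           # illustrative name with u-umlaut

sent = original.encode("utf-8")    # bytes a UTF-8 application sends
misread = sent.decode("latin-1")   # how a latin1 connection interprets them

# The damage is reversible only as long as nothing re-encodes the text again.
repaired = misread.encode("latin-1").decode("utf-8")

print(ascii(misread))
print(repaired == original)
```

Each non-ASCII character turns into two, which is the familiar garbled output seen when the connection encoding is not set.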


  54. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL 4.1+
    •Communicating with files
    •Other common issues


  55. Finding out current settings
    • Find out how your system is set up:
    mysql> SHOW VARIABLES LIKE 'character_set%';
    mysql> SHOW VARIABLES LIKE 'collation%';
    +--------------------------+-------------------+
    | Variable_name            | Value             |
    +--------------------------+-------------------+
    | character_set_client     | latin1            |
    | character_set_connection | latin1            |
    | character_set_database   | latin1            |
    | character_set_results    | latin1            |
    | character_set_server     | latin1            |
    | character_set_system     | utf8              |
    | collation_connection     | latin1_swedish_ci |
    | collation_database       | latin1_swedish_ci |
    | collation_server         | latin1_general_ci |
    +--------------------------+-------------------+


  56. Finding out what’s available
    • To find out which character encodings are
    available and what their default collation is:
    SHOW CHARACTER SET;
    • To find out which collations are available:
    SHOW COLLATION LIKE 'utf8%';


  57. Setting up a server
    • Add the following to your /etc/my.cnf file:
    [mysqld]
    ...
    default-character-set=utf8
    default-collation=utf8_general_ci
    (MySQL 5.5 and later: use character-set-server and
    collation-server instead)
    • If you are the only user you could even do
    (MySQL 5.x and later; not executed for super-user logins):
    init_connect='SET NAMES utf8'




  58. Setting up databases & tables
    • Make sure that the database, the tables and the
    text columns are all in UTF-8:
    (CREATE | ALTER) DATABASE / TABLE ... (
    ...
    ) [DEFAULT] CHARACTER SET utf8
    [[DEFAULT] COLLATE collation]
    • Don’t forget field widths


  59. Choosing the collation
    • Collation == Sort order
    • Guideline to the collations:
    _ci = case insensitive
    _cs = case sensitive
    _bin = binary
    • Test !
    Image Source: Mysql.com
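
The effect of the `_bin` and `_ci` suffixes can be approximated outside the database. This Python sketch is only an analogy: byte order stands in for a `_bin` collation and `str.casefold` for a `_ci` one, whereas real MySQL collations also apply language-specific and accent rules. The names are made up.

```python
names = ["zebra", "Apple", "apple", "Zebra"]  # illustrative data

# Roughly like a _bin collation: compare the raw bytes.
binary_order = sorted(names, key=lambda s: s.encode("utf-8"))

# Roughly like a _ci collation: ignore case when comparing.
case_insensitive_order = sorted(names, key=str.casefold)

print(binary_order)
print(case_insensitive_order)
print("Apple".casefold() == "apple".casefold())  # equal under a _ci-style comparison
```

The collation affects equality tests as well as ordering, which is why the slide's advice to test matters.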


  60. Collation Resources
    • Collation charts:
    http://www.collation-charts.org/
    • Unicode collation charts:
    http://www.unicode.org/charts/uca/
    • Examples of collation choice effects:
    http://dev.mysql.com/doc/refman/5.7/en/charset-collation-effect.html


  61. Converting an existing database
    • Using MySQL’s CONVERT function you can
    migrate ‘old’ data:
    INSERT INTO utf8table (utf8column)
    SELECT CONVERT(latin1field USING utf8)
    FROM latin1table;
    • For a complete PHP script to convert your
    database:
    http://www.phpwact.org/php/i18n/utf-8/mysql


  62. Querying a database
    • You can specify the collation to use for a
    specific query:
    SELECT k
    FROM t1
    ORDER BY k COLLATE utf8_spanish_ci;
    • You can even use it in the WHERE clause:
    SELECT *
    FROM t1
    WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;


  63. Common issue
    • If you run into the following error message when
    running a query:
    Illegal mix of collations
    (utf8_bin,IMPLICIT) and
    (latin1_swedish_ci,COERCIBLE) for operation
    You may want to try and make your query more
    explicit with a character string literal:
    SELECT *
    FROM table
    WHERE col = _utf8'xyz';


  64. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  65. GetText / .po files
    • Poedit understands all encodings supported
    by the operating system and works in Unicode
    internally: http://www.poedit.net/


  66. We’ll be covering:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  67. BOM ! Run for your life !
    • ï»¿ or extra blank line
    • BOM = Byte Order Mark
    The character is the ZERO WIDTH NO-BREAK SPACE
    (U+FEFF). If not placed at the top, it shouldn’t give any
    problems.
    • In UTF-16/32 the BOM is necessary to
    determine the byte order. In UTF-8 it is not.
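
The points on this slide can be checked with a few lines of Python (a sketch, not PHP; the variable names are illustrative). It shows the three bytes the BOM becomes in UTF-8, what those bytes look like when read as Latin-1, and why UTF-16 needs a byte order hint while UTF-8 does not.

```python
import codecs

bom_utf8 = "\ufeff".encode("utf-8")      # U+FEFF encoded as UTF-8
bom_as_latin1 = bom_utf8.decode("latin-1")  # what a Latin-1 reader displays

# In UTF-16 the same character has two possible byte orders.
a_little_endian = "A".encode("utf-16-le")
a_big_endian = "A".encode("utf-16-be")

print(bom_utf8 == codecs.BOM_UTF8)
print(ascii(bom_as_latin1))
print(a_little_endian, a_big_endian)
```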


  68. BOM squad
    • Some browsers may display the BOM.
    • ‘headers already sent’ problems.
    It can also cause problems when in front of “#!” in a shell
    script.
    • Check your editor settings
    • Open the file in (another) editor, if you can see
    the BOM manually delete it and save the file.
    Sometimes even just opening the file and saving it again
    as UTF-8 will solve it.
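
If an editor is not at hand, the BOM can also be removed programmatically. This is a minimal Python sketch (not PHP) with a hypothetical helper name; it only strips a leading UTF-8 BOM and leaves all other bytes untouched.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"<?php echo 'hi';"
print(strip_utf8_bom(with_bom))
# Python's 'utf-8-sig' codec does the same while decoding.
print(with_bom.decode("utf-8-sig"))
```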


  69. Useful BOM resources:
    • W3C on the BOM character:
    http://www.w3.org/International/questions/qa-utf8-bom
    • Web-based BOM-testing:
    http://people.w3.org/rishida/utils/bomtester/
    • Unicode Consortium on BOM character:
    http://www.unicode.org/unicode/faq/utf_bom.html#bom1


  70. We’ve covered:
    •Dependency on user’s computer setup
    •Client side code – HTML, CSS, JS
    •Communication with the client
    •Server side code – PHP
    •Communicating with a MySQL database
    •MySQL
    •Communicating with files
    •Other common issues


  71. Keep in touch!
    (I’m self-employed, you can hire me ;-) )
    Juliette Reinders Folmer
    Email: [email protected]
    Web: http://www.adviesenzo.nl/
    LinkedIn: http://nl.linkedin.com/in/julietterf
    Twitter: http://twitter.com/jrf_nl
    GitHub: http://github.com/jrfnl/
    Please rate this talk on joined.in/9519
    Endorsements and recommendations on
    LinkedIn are much appreciated too!


  72. Anything else you never
    dared to ask before ?
