
Internationalization & Localization


The slides corresponding to the talk presented in NijmegenPHP @2015-09-16

Camilo Sperberg

September 16, 2015


Transcript

  1. Internationalization and
    Localization:
    The Basics®
    NijmegenPHP
    2015 - 09 - 16
    Camilo Sperberg / @unreal4u


  2. Disclaimer
    Don't be offended by anything said or shown in this talk!


  3. What is internationalization?
    Internationalization (i18n):
    Preparing your application to be localized
    (Pro-tip: How? Watch this talk!)
    Localization (L10n):
    Translating, adapting icons and other elements for a specific region

  4. ….
    Most discussed thing in RFCs
    [Figure "LOTR RFC i18n": the language-tag RFCs drawn next to the Baggins family tree. RFC 1766 (Mar. 1995) is replaced by RFC 3066 (Jan. 2001), which is replaced by RFC 4646 and RFC 4647 (both Sep. 2006); RFC 4646 is replaced by RFC 5646 (Sep. 2009).]

  5. i18n and PHP: needed extensions
    ‣ For gettext: php-gettext
    ‣ Needs gettext
    ‣ https://www.gnu.org/software/gettext/
    ‣ For intl: php-intl
    ‣ Needs ICU4C
    ‣ http://site.icu-project.org/
    ‣ mb_* functions: multibyte support


  6. L10n: Definition
    "A set of parameters that defines the user's language, country and any special variant
    preferences that the user wants to see in their user interface"
    Wikipedia
    What it means: rules for a specific region
    ‣ Current code standard: RFC 5646 (BCP 47), which replaced RFC 4646
    ‣ Language in lowercase (ISO 639-1), hyphen, region in uppercase (ISO 3166-1 alpha-2)

  7. L10n: Code standards
    nl-NL
    Dutch as SUISHE*® in the Netherlands
    nl-BE
    Dutch as SUISHE*® in Belgium
    * Spoken, Used, Interpreted, Seen, Heard, Etc


  8. L10n: Code standards
    pt-BR
    Portuguese as SUISHE*® in Brazil
    pt-PT
    Portuguese as SUISHE*® in Portugal
    * Spoken, Used, Interpreted, Seen, Heard, Etc

  9. L10n: Code standards
    * Spoken, Used, Interpreted, Seen, Heard, Etc
    es-ES, es-CL, es-AR, es-PE, es-*
    Spanish as SUISHE*® in Spain, Chile,
    Argentina, Peru, etc.


  10. L10n: Code standards
    de-CH-1901
    German as used in Switzerland, in the traditional (1901) orthography

  11. L10n: Code standards
    sl-IT-nedis
    Slovenian as used in Italy, Nadiza dialect

  12. L10n: Code standards
    hy-Latn-IT-arevela
    Eastern Armenian written in Latin script, as used in Italy
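    The tags on the last few slides can be taken apart programmatically. A minimal sketch, assuming the intl extension (php-intl) is installed; the variable names are illustrative only:

```php
<?php
// Assumes the intl extension (php-intl) is loaded.
$tag = 'hy-Latn-IT-arevela';

// parseLocale() splits a language tag into its subtags.
$parts = \Locale::parseLocale($tag);

printf("language: %s\n", $parts['language']); // hy
printf("script:   %s\n", $parts['script']);   // Latn
printf("region:   %s\n", $parts['region']);   // IT

// The same pieces are also available one at a time.
$language = \Locale::getPrimaryLanguage($tag);
$script   = \Locale::getScript($tag);
$region   = \Locale::getRegion($tag);
```

    Simple tags such as nl-NL return only the language and region parts; the script and variant keys are absent when the tag does not carry them.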

  13. L10n: Determining which to use
    Introducing the \Locale object!
    Best practice tips:
    ‣ Get it from your $_GET or $_POST
    ‣ Get it from the headers
    ‣ Get it based on the IP address

  14. i18n: Detecting locale
    public function getLocaleFromClient() {
        // Try the request parameters first, then the Accept-Language header,
        // and only fall back to an IP-based guess when both come back empty.
        $this->locale = $this->getLocaleFromGetRequest();
        if (empty($this->locale)) {
            $this->locale = $this->getLocaleFromHeaders();
            if (empty($this->locale)) {
                $this->locale = $this->getLocaleFromIP();
            }
        }

        return $this->locale;
    }

  15. i18n: Detecting locale
    public function getLocaleFromHeaders() {
        $this->locale = '';

        if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
            // acceptFromHttp() picks the best locale from the Accept-Language header
            $preferredLocale = \Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
            $this->locale = $this->_checkLocale($preferredLocale);
        }

        return $this->locale;
    }
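    The slides never show _checkLocale(). The sketch below is a hypothetical stand-in, not the author's implementation: it assumes the application keeps a list of supported locales (the `$supported` parameter is invented for this example) and only handles plain language-region tags, not script or variant subtags.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical replacement for the talk's _checkLocale() helper.
 * Returns a supported locale for $candidate, or '' when nothing fits.
 */
function checkLocale(string $candidate, array $supported): string
{
    // Normalize "nl_be" or "NL-be" to the "nl-BE" shape used in the slides.
    $parts = preg_split('/[-_]/', trim($candidate));
    $language = strtolower($parts[0] ?? '');
    $region = strtoupper($parts[1] ?? '');
    $normalized = $region !== '' ? $language . '-' . $region : $language;

    // Exact match wins.
    if (in_array($normalized, $supported, true)) {
        return $normalized;
    }

    // Otherwise fall back to the first supported locale with the same language.
    foreach ($supported as $locale) {
        if (strpos($locale, $language . '-') === 0) {
            return $locale;
        }
    }

    return '';
}

$supported = ['nl-NL', 'en-US'];
echo checkLocale('nl_BE', $supported) . PHP_EOL; // nl-NL
echo checkLocale('en-US', $supported) . PHP_EOL; // en-US
```

    Returning an empty string for unknown locales keeps the empty() checks in getLocaleFromClient() working, so the next detection step can take over.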

  16. \Locale: cool functions!
    Use … to …
    ‣ \Locale::getDisplayLanguage(): create a list of languages
    ‣ \Locale::getDisplayRegion(): display the country where the locale is used
    ‣ \Locale::acceptFromHttp(): parse the browser's Accept-Language header
    ‣ \Locale::setDefault(): set the default locale to use
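    A short sketch of these functions in use, assuming the intl extension is loaded. The locale list and the header value are made up for the example:

```php
<?php
// Assumes the intl extension (php-intl) is loaded.
$locales = ['nl-NL', 'pt-BR', 'es-CL'];

foreach ($locales as $locale) {
    // The second argument is the locale in which the names themselves are shown.
    printf(
        "%s: %s (%s)\n",
        $locale,
        \Locale::getDisplayLanguage($locale, 'en'),
        \Locale::getDisplayRegion($locale, 'en')
    );
}

// Pick a locale from a hard-coded, illustrative Accept-Language header value.
$preferred = \Locale::acceptFromHttp('nl-NL,nl;q=0.8,en;q=0.5');
echo $preferred . PHP_EOL;
```

    Passing the user's own locale as the second argument instead of 'en' yields the names in the user's language, which is usually what a language picker needs.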

  17. L10n: what to look out for?
    ‣ Warning! Locale is not just a translation
    ‣ What's normal for you can be strange to
    others
    "Diversity is amazing, both in appearance and
    thoughts, please respect different opinions,
    agree to disagree and live in harmony."
    @michellesanver

  18. Images and gestures
    ‣ Pointing fingers can be offensive in Arab countries
    ‣ The "peace" sign (made with the palm facing inward) is offensive in Australia, Ireland, New Zealand, South Africa and the United Kingdom

  19. Translations - Semantics (L10n)
    echo $numberResults.' results found within a '.$range.'km range';
    If translated directly into Spanish, it will sound like this:
    ➡ "Found 123 results within range 5km"


  20. Translations - Semantics (i18n)
    ‣ Solvable by using printf() with numbered placeholders
    printf(
        '%1$d results found within a %2$d km range',
        $numberResults,
        $range
    );
    ‣ The translator decides where each variable is printed
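    To see why numbered placeholders matter, here is a small sketch. The `$translations` array and the Spanish wording are illustrative assumptions, not taken from the talk:

```php
<?php
// Hypothetical translation table; the keys and the Spanish wording are illustrative.
$translations = [
    'en' => '%1$d results found within a %2$d km range',
    'es' => 'En un radio de %2$d km se encontraron %1$d resultados',
];

$numberResults = 123;
$range = 5;

foreach ($translations as $language => $pattern) {
    // The argument order in code never changes; only the pattern does.
    echo $language . ': ' . sprintf($pattern, $numberResults, $range) . PHP_EOL;
}
// en: 123 results found within a 5 km range
// es: En un radio de 5 km se encontraron 123 resultados
```

    The Spanish pattern prints the second argument first without any change to the calling code.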

  21. Translations - Semantics (L10n)
    es-ES:
    Carro
    Coche
    es-CL:
    Auto


  22. This is a “carro” or
    “coche” in es-CL:


  23. Translations - Semantics (L10n)
    British | American
    Holiday | Vacation
    Football | Soccer
    American Football | Football
    Flat | Apartment
    Garden | Yard
    Rubbish | Garbage / Trash

  24. Translations - Semantics
    Interactive question!
    What's wrong with the following list?
    Smart Phones
    Mobile phones
    Barcode Scanners
    Control Remotes
    See
    Printers

  25. Translations - Semantics
    Captain here: "Watch" was translated as the verb instead of the noun

  26. 26


  27. Translations - Plural forms (L10n)
    Count | English | Polish
    0 | Apples | Jabłek
    1 | Apple | Jabłko
    2 .. 4 | Apples | Jabłka
    5 .. 21 | Apples | Jabłek
    22 .. 24 | Apples | Jabłka
    25 .. 31 | Apples | Jabłek
    More complex cases do exist!
    ‣ Slovenian: 4 plural forms
    There are also languages with a single form for every count
    ‣ Japanese
    ‣ Vietnamese
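    The Polish column above follows a rule that fits in a few lines. The sketch below uses the rule commonly found in gettext "Plural-Forms" headers for Polish; the function name and the `$forms` array are invented for this example:

```php
<?php
declare(strict_types=1);

/**
 * Index of the Polish plural form for $n:
 * 0 = singular, 1 = "few" (2-4, 22-24, ...), 2 = "many" (everything else).
 */
function polishPluralIndex(int $n): int
{
    if ($n === 1) {
        return 0;
    }

    $lastDigit = $n % 10;
    $lastTwo = $n % 100;

    // 2..4 take the "few" form, except for 12..14.
    if ($lastDigit >= 2 && $lastDigit <= 4 && ($lastTwo < 10 || $lastTwo >= 20)) {
        return 1;
    }

    return 2;
}

$forms = ['jabłko', 'jabłka', 'jabłek'];

foreach ([0, 1, 3, 5, 12, 21, 22, 25] as $count) {
    printf("%d %s\n", $count, $forms[polishPluralIndex($count)]);
}
```

    In a real application this rule lives in the translation catalogue rather than in code, which is why a plural-aware tool is needed.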

  28. Translations - Plural forms (i18n)
    Gettext!
    ‣ Supports plural forms
    ‣ Is cached in RAM (pros and cons)
    ‣ Very easy to edit (poEdit)
    ‣ Can be separated into modules
    ‣ Produces compiled language files

  29. Translations - Plural forms (i18n)
    ‣ \MessageFormatter can also help
    ‣ Can do pretty amazing stuff
    ‣ I personally don't have experience with it
    $fmt = new \MessageFormatter(
        'en_GB',
        'Peter has {0, plural, =0{no cat} =1{a cat} other{# cats}}'
    );
    echo $fmt->format(array(0)).PHP_EOL;

    $fmt = new \MessageFormatter(
        'nl_NL',
        'Peter heeft {0, plural, =0{geen kat} =1{een kat} other{# katten}}'
    );
    echo $fmt->format(array(0)).PHP_EOL;

    // Outputs:
    // Peter has no cat
    // Peter heeft geen kat

  30. Number formatting - L10n
    ‣ Numbers can be written in many different notations
    ‣ Corollary: nobody really knows how a number should be formatted
    Interactive question!
    How do you format the following negative number, in Euros, here in the Netherlands?
    1234,57

  31. Number formatting - L10n
    Format | nl-NL | fr-FR | pt-BR | hi-IN | ps-AR
    Decimal | "-1.234,57" | "-1 234,57" | "-1.234,57" | "-१,२३४.५७" | "-۱٬۲۳۴٫۵۷"
    Currency | "€ 1.234,57-" | "-1 234,57 €" | "(€1.234,57)" | "-€ १,२३४.५७" | "-۱٬۲۳۴٫۵۷ €"
    Percent | "25%" | "25 %" | "25%" | "२५%" | "۲۵٪"

  32. Number formatting: i18n
    $locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR',];

    foreach ($locales as $myLocale) {
        $numberFormatter = new \NumberFormatter($myLocale, \NumberFormatter::DECIMAL);
        $percentFormatter = new \NumberFormatter($myLocale, \NumberFormatter::PERCENT);
        $currencyFormatter = new \NumberFormatter($myLocale, \NumberFormatter::CURRENCY);
        printf('Locale: %s'.PHP_EOL, $myLocale);
        printf('[DEC]-1.234,57: "%s" :: ', $numberFormatter->format(-1234.57));
        printf('[PER]25%%: "%s" :: ', $percentFormatter->format(0.25));
        printf('[CUR]1.234,57-: "%s"'.PHP_EOL, $currencyFormatter->formatCurrency(-1234.57, 'EUR'));
        // To print in the currency of the loaded locale, pass
        // $currencyFormatter->getTextAttribute(\NumberFormatter::CURRENCY_CODE) instead of 'EUR'
    }

  33. Number formatting: i18n

  34. 34
    unreal4u-MBP:localization unreal4u$ php numbers.php
    Locale: nl-NL
    [DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "€ 1.234,57-"
    All this in Dutch AKA Nederlands
    --------------------------------------------------------------------------------
    Locale: fr-FR
    [DEC]-1.234,57: "-1 234,57" :: [PER]25%: "25 %" :: [CUR]1.234,57-: "-1 234,57 €"
    All this in French AKA français
    --------------------------------------------------------------------------------
    Locale: pt-BR
    [DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "(€1.234,57)"
    All this in Portuguese AKA português
    --------------------------------------------------------------------------------
    Locale: hi-IN
    [DEC]-1.234,57: "-१,२३४.५७" :: [PER]25%: "२५%" :: [CUR]1.234,57-: "-€ १,२३४.५७"
    All this in Hindi AKA हिन्दी
    --------------------------------------------------------------------------------
    Locale: ps-AR
    [DEC]-1.234,57: "-۱٬۲۳۴٫۵۷" :: [PER]25%: "۲۵٪" :: [CUR]1.234,57-: "-۱٬۲۳۴٫۵۷ €"
    All this in Pashto AKA پښتو
    --------------------------------------------------------------------------------


  35. Some \NumberFormatter problems
    ‣ \NumberFormatter::DURATION isn't implemented in many locales
    ‣ echo $fmt->format(12345) -> 3 hours, 25 minutes, 45 seconds
    ‣ "Easy" to implement using getPattern() and setPattern()
    ‣ Documentation exists, but is not optimal

  36. Date and time formatting - L10n
    ‣ Dates have 3 different notations
    ‣ YYYY-MM-DD (1,660M)
    ‣ DD-MM-YYYY (4,810M)
    ‣ MM-DD-YYYY (320M)
    ‣ "It's complicated" (457M)
    ‣ Contrary to numbers, everybody knows how a date is formatted
    Interactive question!
    What is the value of the following date?
    05-03-13

  37. Date and time formatting - L10n
    ➡ YMD (1660)
    ➡ YMD and DMY (287)
    ➡ DMY (3295)
    ➡ DMY and MDY (130)
    ➡ MDY (320)
    ➡ YMD and DMY and MDY (40)
    https://en.wikipedia.org/wiki/Date_format_by_country

  38. Date and time formatting: i18n
    ‣ Use PHP's \*Date* related classes, like ALWAYS!
    ‣ Incredibly versatile and powerful
    ‣ Especially in combination with locales
    ‣ Always work in UTC, let the \*Date* classes do the rest
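    A minimal sketch of the "work in UTC, convert on display" advice, using only PHP's built-in date classes. The timestamp is the one used elsewhere in the talk; the two target timezones are picked for illustration:

```php
<?php
// Keep the stored value in UTC...
$utc = new \DateTimeImmutable('2015-05-23 21:34:09', new \DateTimeZone('UTC'));

// ...and convert only at the moment of displaying it.
$amsterdam = $utc->setTimezone(new \DateTimeZone('Europe/Amsterdam'));
$seoul = $utc->setTimezone(new \DateTimeZone('Asia/Seoul'));

echo $utc->format('Y-m-d H:i:s') . PHP_EOL;       // 2015-05-23 21:34:09
echo $amsterdam->format('Y-m-d H:i:s') . PHP_EOL; // 2015-05-23 23:34:09
echo $seoul->format('Y-m-d H:i:s') . PHP_EOL;     // 2015-05-24 06:34:09
```

    All three objects describe the same instant; only the presentation differs, so comparisons and storage stay unambiguous.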

  39. Date and time formatting: i18n
    Format | nl-NL | fr-FR | hi-IN | ps-AR
    Short | "23-05-15" | "23/05/15" | "२३-५-१५" | "۲۰۱۵/۵/۲۳"
    Medium | "23 mei 2015" | "23 mai 2015" | "२३-०५-२०१५" | "۲۳ ۲۰۱۵ یم"
    With time | "23 mei 2015 01:34:09" | "23 mai 2015 01:34:09" | "२३-०५-२०१५ १:३४:०९ पूर्वाह्न" | "۲۳ ۱:۳۴:۰۹ ۲۰۱۵ یم"

  40. Date and time formatting: i18n
    $locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR',];

    $printDate = new \DateTime('23-05-2015 01:34:09', new \DateTimeZone('UTC'));

    foreach ($locales as $myLocale) {
        $shortDateObject = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::SHORT, \IntlDateFormatter::SHORT);
        $mediumDateObject = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::MEDIUM, \IntlDateFormatter::MEDIUM);
        printf(
            'Locale: %s, Short: "%s", Medium "%s"'.PHP_EOL,
            $myLocale,
            $shortDateObject->format($printDate),
            $mediumDateObject->format($printDate)
        );
    }

  41. i18n/L10n and OS
    ‣ Variety in L10n is almost infinite
    ‣ Automating i18n and L10n is better
    ‣ The operating system plays an important role
    ‣ Why reinvent a very very very complicated wheel if it already exists?

  42. View Slide

  43. Timezones - L10n
    ‣ PHP has full support for timezones
    ‣ 39 (40?) official timezones
    ‣ Multiple timezones in one locale
    ‣ nl-NL: Europe/Amsterdam
    ‣ es-CL: America/Santiago and Pacific/Easter
    ‣ en-US: Has 4 timezones in the contiguous states alone
    ‣ ru-RU: Has 11 timezones
    Interactive question!
    What time is it now in
    Seoul (ko-KR)?


  44. Timezones: i18n
    Caution! Calls to ICU library can get pretty expensive!
    ‣ With a known locale, get all timezones
    ‣ If there's only one, instantiate \DateTimeZone
    ‣ More than 1? Get precise timezoneId and DST settings (cache them!)
    ‣ Now calculate the offset of a timezone for the view

  45. Timezones: i18n - Check validity
    public function isValidTimeZone($timeZoneName='') {
        if (!is_string($timeZoneName)) {
            $timeZoneName = '';
        }

        try {
            // The constructor throws when the identifier is unknown
            new \DateTimeZone($timeZoneName);
            return true;
        } catch (\Exception $e) {
            return false;
        }
    }

  46. Timezones: i18n - Get timezone candidates
    /**
     * $region is defined as \Locale::getRegion($currentLocale)
     */
    private function _setTimezoneCandidates($region='') {
        if (!empty($region)) {
            $this->_timezoneCandidates = \DateTimeZone::listIdentifiers(\DateTimeZone::PER_COUNTRY, $region);
            // Exactly one candidate: the timezone is unambiguous, so set it right away
            if (!empty($this->_timezoneCandidates) && count($this->_timezoneCandidates) == 1) {
                $this->setTimezone($this->_timezoneCandidates[0]);
            }
        }
    }

  47. Timezones: i18n - Set timezone
    public function setTimezone($timeZoneName='UTC') {
        if (!$this->isValidTimeZone($timeZoneName)) {
            $timeZoneName = 'UTC';
        }

        $this->timezone = new \DateTimeZone($timeZoneName);
        $this->timezoneId = $this->timezone->getName();
        // Limit the lookup to "now": without arguments, getTransitions() starts at
        // the oldest known transition, so index 0 would not describe the current state
        $now = time();
        $transitions = $this->timezone->getTransitions($now, $now);
        $this->timezoneInDST = $transitions[0]['isdst'];

        return $this->timezoneId;
    }

  48. Timezones: i18n - Display
    $theDate = new \DateTime('23-05-2015 21:34:09', new \DateTimeZone('UTC'));
    $dateObject = \IntlDateFormatter::create(
        'ko-KR', // $this->_currentLocale
        \IntlDateFormatter::MEDIUM,
        \IntlDateFormatter::MEDIUM,
        $this->timezoneId // Asia/Seoul
    );
    echo $dateObject->format($theDate);

  49. What's the time in Seoul then?
    Result?
    UTC 23-05-2015 21:34:09 is
    2015. 5. 24. 오전 6:34:09
    in ko-KR (Offset: +9 hours)
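    The offset shown on this slide can be computed rather than looked up. A sketch under the assumption that the region code 'KR' came from \Locale::getRegion('ko-KR'); it is hard-coded here so the example needs only PHP's built-in date classes:

```php
<?php
// Timezone candidates for a region code, e.g. the result of \Locale::getRegion('ko-KR').
$candidates = \DateTimeZone::listIdentifiers(\DateTimeZone::PER_COUNTRY, 'KR');

$theDate = new \DateTimeImmutable('2015-05-23 21:34:09', new \DateTimeZone('UTC'));

foreach ($candidates as $timezoneId) {
    $timezone = new \DateTimeZone($timezoneId);
    // getOffset() returns seconds east of UTC at the given moment, DST included.
    $offsetHours = $timezone->getOffset($theDate) / 3600;
    printf("%s: %+g hours from UTC\n", $timezoneId, $offsetHours);
}
```

    Because the offset depends on the moment in time, it should be calculated for the date being displayed, not once per timezone.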

  50. Encoding and charsets - L10n
    ‣ Difficult, often misunderstood subject
    ‣ Difficult to debug
    ‣ First step of debugging is knowing what encoding you are working with
    ‣ Convert to an appropriate charset with iconv()
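    The two steps above, sketched with a hard-coded sample string. This assumes the mbstring and iconv extensions are loaded, and that the source charset (ISO-8859-1 here) is already known:

```php
<?php
// Assumes the mbstring and iconv extensions are loaded.
// "Bølla" encoded as ISO-8859-1: the ø is the single byte 0xF8.
$latin1 = "B\xF8lla";

// Step 1: find out what you are dealing with. This is not valid UTF-8.
$isUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Step 2: convert from the known source charset to UTF-8.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

var_dump($isUtf8);                           // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
var_dump(bin2hex($utf8));                    // string(12) "42c3b86c6c61"
```

    Note that mb_check_encoding() can only reject invalid input; it cannot tell which single-byte charset a string was written in, so the source charset still has to come from metadata or context.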

  51. Encoding in PHP
    ‣ Internal work always in UTF-8, EVERYWHERE
    ‣ Include some basic stuff so that PHP also knows that it has to work in UTF-8
    ‣ Don't forget to send the browser information as well
    mb_internal_encoding('UTF-8');
    header('Content-Type: text/html; charset=UTF-8'); // text/html is an example; send the actual media type
    ‣ Lots of small things to consider, and they vary from case to case

  52. Encoding in PHP: mails
    Caution with the imap extension! Has some problems with UTF-7
    Always encode the "To" (CC, BCC) and "Subject" fields
    Code | Output
    "=?utf-8?B?5L2p5ae/?= " | 佩姿
    "=?iso-8859-1?Q?B=F8lla?=, med =?iso-8859-1?Q?=F8l?= i baggen " | Bølla , med øl i baggen
    "=?utf-7?Q?Petra_M+APw-ller?=" | Petra Müller
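    A sketch of decoding and encoding such header values without the imap extension. It assumes the iconv extension is loaded; `encodeHeaderValue()` is a hypothetical helper written for this example, not part of PHP or of the author's library:

```php
<?php
declare(strict_types=1);

// Assumes the iconv extension is loaded.
// Decoding: iconv_mime_decode() turns RFC 2047 encoded-words into plain text.
$decodedB = iconv_mime_decode('=?utf-8?B?5L2p5ae/?=', 0, 'UTF-8');
$decodedQ = iconv_mime_decode('=?iso-8859-1?Q?B=F8lla?=', 0, 'UTF-8');

/**
 * Hypothetical helper: wraps a UTF-8 value in a single Base64 encoded-word.
 * Suitable for short values only, since one encoded-word may not exceed 75 characters.
 */
function encodeHeaderValue(string $utf8Value): string
{
    return '=?UTF-8?B?' . base64_encode($utf8Value) . '?=';
}

$subject = encodeHeaderValue("Petra M\xC3\xBCller"); // "Petra Müller" in UTF-8
echo 'Subject: ' . $subject . PHP_EOL;
```

    Longer values need to be split over several encoded-words, which is where a tested library becomes the safer choice.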

  53. Encoding in PHP: mails
    ‣ Buggy functions
    ‣ imap_rfc822_parse_adrlist()
    ‣ imap_mime_header_decode()
    ‣ Others?
    ‣ Check out https://github.com/unreal4u/string-operations/ for replacement functions

  54. Databases and
    L10n / i18n


  55. But before we begin…


  56. View Slide

  57. Database and encodings/charsets
    CHARSET
    COLLATION


  58. Practical use of charset
    Rule of thumb: pick the charset that best fits the data being stored
    md5/SHA1-like strings should be ASCII-encoded
    (Why? It helps the db engine to better predict its memory allocation)
    Interactive question!
    What charset should be used to save the following string?
    f5d39e997c5d7e4e2a3ef49973f61fb2

  59. Practical use of charset
    CREATE TABLE `t1` (
        `md5HashCalculation` CHAR(32) CHARSET ASCII COLLATE ascii_bin DEFAULT NULL
    );

  60. Differences between TEXT and [VAR]CHAR
    ‣ [VAR]CHAR(255) holds up to 255 characters
    ‣ TINYTEXT can hold up to 255 bytes
    ‣ UTF-8 characters take 1 to 4 bytes each (MySQL's utf8 charset stores at most 3; utf8mb4 stores all 4)
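    The difference between characters and bytes is easy to check from PHP. A small sketch, assuming the mbstring extension is loaded; the sample characters are written as byte escapes so the example does not depend on the file's own encoding:

```php
<?php
// Assumes the mbstring extension is loaded.
$samples = [
    'ascii a'  => 'a',
    'n-tilde'  => "\xC3\xB1",         // ñ
    'kanji'    => "\xE6\xBC\xA2",     // 漢
    'emoji'    => "\xF0\x9F\x98\x80", // U+1F600
];

foreach ($samples as $label => $char) {
    // mb_strlen() counts characters, strlen() counts bytes.
    printf("%s: %d character, %d byte(s)\n", $label, mb_strlen($char, 'UTF-8'), strlen($char));
}

// 255 characters is not the same as 255 bytes:
$name = str_repeat("\xC3\xB1", 255);
printf("%d characters, %d bytes\n", mb_strlen($name, 'UTF-8'), strlen($name));
// 255 characters, 510 bytes
```

    The last string fits in a VARCHAR(255) column but is twice the size a TINYTEXT column can hold.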

  61. Indexes and charsets
    When working with Unicode characters, performance can be indirectly and negatively impacted
    ‣ Too big (and complex) of a topic for now
    ‣ Use EXPLAIN to understand underlying decisions of MySQL (in some cases)
    ‣ Don't bother with micro-optimization either

  62. COLLATION
    ‣ Used to order data in a "natural" way
    ‣ Different languages have different rules
    CREATE TABLE `spanishCollation` (
        `name01` VARCHAR(15) COLLATE utf8_spanish_ci,
        `name02` VARCHAR(15) COLLATE utf8_spanish2_ci
    ) DEFAULT CHARSET utf8;
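    The effect of a language-specific collation can also be observed outside the database. The sketch below uses PHP's \Collator class from the intl extension as a stand-in for what MySQL does with utf8_spanish_ci; it illustrates the idea and is not a description of MySQL's internals. The word list is made up:

```php
<?php
// Assumes the intl extension (php-intl) is loaded.
$names = ['oso', "\xC3\xB1and\xC3\xBA", 'nube']; // oso, ñandú, nube

// Byte-wise sort: "ñ" starts with byte 0xC3, so it lands after every ASCII letter.
$byteOrder = $names;
sort($byteOrder, SORT_STRING);

// Spanish collation: "ñ" is a letter of its own, between "n" and "o".
$spanishOrder = $names;
$collator = new \Collator('es');
$collator->sort($spanishOrder);

echo implode(', ', $byteOrder) . PHP_EOL;    // nube, oso, ñandú
echo implode(', ', $spanishOrder) . PHP_EOL; // nube, ñandú, oso
```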

  63. Some notes on Collation
    "*_ci" stands for case-insensitive
    Watch out with utf8_general_ci and utf8_unicode_ci!
    ➡ utf8_general_ci has some problems with Hebrew and some Cyrillic characters
    ➡ It's generally faster (7~12%)
    ➡ But utf8_unicode_ci is more compatible

  64. Collation and performance
    ‣ Performance penalty: order in another collation
    ‣ It will have to do a filesort
    ‣ Which is MySQL's way of saying "quicksort"
    ‣ [Partial] keys can help avoid this quicksort operation

  65. General database localization
    ‣ Not recommended: translation on database level
    ‣ If absolutely needed, investigate EAV model
    ‣ PRO: Quick, simple and cheap
    ‣ CON: Queries may become complex

  66. Your own L10n database
    ‣ Does the locale use the metric or imperial system (either British or American)?
    ‣ What type of rounding is used in that locale?
    ‣ Optional: custom number and currency pattern to override any default rules
    ‣ The preferred timezone (user based, not L10n based)
    ‣ Direction of text

  67. Fonts
    ‣ Easily overlooked, yet very important
    ‣ Web-safe fonts are generally safe to use
    ‣ Don't forget to test multibyte characters
    ‣ 2 bytes: ñÖÑú - ӬģĽ
    ‣ 3 bytes: 漢字 - ♥၍₶
    ‣ 4+bytes: -
    ‣ Example: Mamá vive en Föllinge en el bosque del Ñañdú.¿Enredado? ¡Deberías! (SimSun-ExtB)

  68. JavaScript considerations
    ‣ Always use the native Date() object
    ‣ Has support for timezones
    ‣ No native support for i18n in JavaScript
    ‣ http://i18next.com is able to save the day!

  69. Finally: Who am I?
    Want to know more? My name is Camilo Sperberg
    http://twitter.com/unreal4u

  70. Finally: Who am I?
    ‣ Blog: http://blog.unreal4u.com/ (Spanish)
    ‣ Rate and comment: https://joind.in/talk/view/15219
    ‣ Please, it's the only way this talk (and others) can be improved
    ‣ Slides are ready to be downloaded:
    ‣ http://unreal4u.com/talks/
    ‣ https://speakerdeck.com/unreal4u

  71. Thanks!

  72. Nice reads and more information
    ‣ https://github.com/triplepoint/php-units-of-measure
    ‣ https://github.com/unreal4u/localization
    ‣ http://www.w3.org/International/articles/language-tags/
    ‣ http://php.net/manual/en/book.intl.php
    ‣ http://www.sitepoint.com/localizing-php-applications-1/
    ‣ http://www.utf8-chartable.de/