
I18n & L10n - 010PHP


These are the slides as presented at 010PHP on 14 Apr 2016

Camilo Sperberg

April 14, 2016


Transcript

  1. Internationalis(z)ation and Localis(z)ation: The Basics®
    010PHP
    2016 - 04 - 14
    Camilo Sperberg / @unreal4u


  2. Main index
    ‣ Definitions and requirements
    ‣ The \Locale object
    ‣ What constitutes a locale
    ‣ Encodings
    ‣ Database
    ‣ Other considerations


  3. What is internationalisation?
    Internationalisation (i18n):
    Preparing your application to be localized
    (Pro-tip: How? Watch this talk!)
    Localisation (L10n):
    Translating the application and adapting icons and other elements to a specific region


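  4. Most discussed thing in RFCs
    [Figure "LOTR RFC i18n": the Baggins family tree shown next to the lineage of the language-tag RFCs]
    Baggins family (each one "son of" the previous):
    ‣ Balbo Baggins (1167 - 1258), married to Berylla Boffin (1172 - 1265)
    ‣ Mungo Baggins (1207 - 1300), married to Laura Grubb
    ‣ Bungo Baggins (1246 - 1326), married to Belladonna Took
    ‣ Bilbo Baggins (1290 - ?)
    RFCs (each one "replaces" the previous):
    ‣ RFC 1766 (Mar. 1995)
    ‣ RFC 3066 (Jan. 2001)
    ‣ RFC 4646 and RFC 4647 (Sep. 2006)
    ‣ RFC 5646 (Sep. 2009)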


  5. i18n and PHP: recommended extensions
    ‣ For gettext: php-gettext
      ‣ Needs gettext
      ‣ https://www.gnu.org/software/gettext/
    ‣ For intl: php-intl
      ‣ Needs ICU4C
      ‣ http://site.icu-project.org/
    ‣ mb_* functions: multibyte support


  6. L10n: Definition
    A locale is "a set of parameters that defines the user's language, country and any special variant preferences that the user wants to see in their user interface" (Wikipedia)
    AKA: rules for a specific region
    ‣ Current code standard: RFC 5646 (BCP 47), which replaced RFC 4646
    ‣ Language in lowercase (ISO 639-1), hyphen, region in uppercase (ISO 3166-1 alpha-2)


  7. L10n: Code standards
    nl-NL
    Dutch as SUISHE*® in the Netherlands
    nl-BE
    Dutch as SUISHE*® in Belgium
    * Spoken, Used, Interpreted, Seen, Heard, Etc


  8. L10n: Code standards
    pt-BR
    Portuguese as SUISHE*® in Brazil
    pt-PT
    Portuguese as SUISHE*® in Portugal
    * Spoken, Used, Interpreted, Seen, Heard, Etc


  9. L10n: Code standards
    es-ES, es-CL, es-AR, es-PE, es-*
    Spanish as SUISHE*® in Spain, Chile, Argentina, Peru, etc.
    * Spoken, Used, Interpreted, Seen, Heard, Etc


  10. L10n: Code standards
    de-CH-1901
    German as used in Switzerland, written in the traditional (1901) orthography


  11. L10n: Code standards
    hy-Latn-IT-arevela
    Eastern Armenian written in Latin script, as used in Italy

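The subtags of such a code can be pulled apart with the \Locale class. This is a minimal sketch, assuming the intl extension is loaded; the tag is the one from the slide above and the array keys are only illustrative names.

```php
<?php
declare(strict_types=1);

// The composite tag from the slide: language, script, region, variant
$tag = 'hy-Latn-IT-arevela';

$parts = [
    'language' => \Locale::getPrimaryLanguage($tag),   // ISO 639 language subtag
    'script'   => \Locale::getScript($tag),            // ISO 15924 script subtag
    'region'   => \Locale::getRegion($tag),            // ISO 3166-1 region subtag
    'variants' => \Locale::getAllVariants($tag) ?? [], // zero or more variant subtags
];

printf('language: %s' . PHP_EOL, $parts['language']);
printf('script:   %s' . PHP_EOL, $parts['script']);
printf('region:   %s' . PHP_EOL, $parts['region']);
printf('variants: %s' . PHP_EOL, implode(', ', $parts['variants']));
```

The same calls work for the simpler tags (nl-NL, pt-BR); script and variants then simply come back empty.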

  12. L10n: Determining which to use
    Introducing the \Locale object!
    Best practice tips:
    ‣ Get it from your $_GET or $_POST
    ‣ Get it from the headers
    ‣ Get it based on IP


  13. i18n: Detecting locale
    public function getLocaleFromClient() {
        $this->locale = $this->getLocaleFromGetRequest();
        if (empty($this->locale)) {
            $this->locale = $this->getLocaleFromHeaders();
            if (empty($this->locale)) {
                $this->locale = $this->getLocaleFromIP();
            }
        }

        return $this->locale;
    }


  14. i18n: Detecting locale
    public function getLocaleFromHeaders() {
        $this->locale = '';

        if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
            // acceptFromHttp() returns false when the header cannot be parsed
            $preferredLocale = \Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
            $this->locale = $this->_checkLocale($preferredLocale);
        }

        return $this->locale;
    }

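The slides do not show _checkLocale(). One possible shape for such a check is sketched below as a plain function; the name checkLocale(), the list of supported locales and the fallback behaviour are all assumptions made for illustration, not the author's implementation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for _checkLocale(): maps a detected locale onto
 * one the application actually supports, or returns the fallback.
 */
function checkLocale($candidate, array $supported, string $fallback = ''): string
{
    // \Locale::acceptFromHttp() can return false, so guard against non-strings
    if (!is_string($candidate) || $candidate === '') {
        return $fallback;
    }

    // Normalise ICU style ("nl_BE") to language-tag style ("nl-BE")
    $normalised = str_replace('_', '-', $candidate);

    // First pass: exact match, ignoring case
    foreach ($supported as $locale) {
        if (strcasecmp($locale, $normalised) === 0) {
            return $locale;
        }
    }

    // Second pass: first supported locale that shares the language subtag
    $language = strtolower(explode('-', $normalised)[0]);
    foreach ($supported as $locale) {
        if (strtolower(explode('-', $locale)[0]) === $language) {
            return $locale;
        }
    }

    return $fallback;
}

$supported = ['nl-NL', 'nl-BE', 'es-CL'];
echo checkLocale('nl_BE', $supported), PHP_EOL; // exact match after normalising
echo checkLocale('es-AR', $supported), PHP_EOL; // same language, other region
```

Returning an empty string on failure keeps the empty() checks in getLocaleFromClient() working, so the next detection step gets its turn.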

  15. \Locale: cool functions!
    Use                              To…
    ‣ \Locale::getDisplayLanguage()  Create a list of languages
    ‣ \Locale::getDisplayRegion()    Display the country where the locale is used
    ‣ \Locale::acceptFromHttp()      Parse browser Accept header
    ‣ \Locale::setDefault()          Set the default localization to use


  16. L10n: what to look out for?
    ‣ Warning! Locale is not just a translation
    ‣ What's normal for you can be strange to others


  17. Images and gestures
    ‣ Pointing fingers can be offensive in Arab countries
    ‣ The "peace" sign is offensive in Australia, Ireland, New Zealand, South Africa and the United Kingdom


  18. Translations - Semantics (L10n)
    echo $numberResults.' results found within a '.$range.'km range';
    Concatenation fixes the word order. Translated piece by piece into Spanish, it reads like this:
    ➡ "Found 123 results within range 5km"


  19. Translations - Semantics (i18n)
    ‣ Solvable by using printf() with numbered placeholders
    printf(
        '%1$d results found within a %2$d km range',
        $numberResults,
        $range
    );
    ‣ Translator decides where to print out variables

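A small sketch of how numbered placeholders let the translation reorder the values. The $catalogue array, the resultsMessage() helper and the Spanish wording are hypothetical and only serve the illustration; in practice the templates would come from gettext or a similar source.

```php
<?php
declare(strict_types=1);

// Hypothetical message catalogue; the Spanish wording is only an illustration
$catalogue = [
    'en' => '%1$d results found within a %2$d km range',
    'es' => 'En un radio de %2$d km se encontraron %1$d resultados',
];

function resultsMessage(array $catalogue, string $language, int $numberResults, int $range): string
{
    // Fall back to English when the language has no translation
    $template = $catalogue[$language] ?? $catalogue['en'];

    // The argument order never changes; the template decides where each value lands
    return sprintf($template, $numberResults, $range);
}

echo resultsMessage($catalogue, 'en', 123, 5), PHP_EOL;
echo resultsMessage($catalogue, 'es', 123, 5), PHP_EOL;
```

The calling code stays identical for every language, which is the point of the slide: the word order becomes the translator's decision.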

  20. Translations - Semantics (L10n)
    es-ES:
    Carro
    Coche
    es-CL:
    Auto


  21. This is a “carro” or “coche” in es-CL:


  22. Translations - Semantics (L10n)
    British             American
    Holiday             Vacation
    Football            Soccer
    American Football   Football
    Flat                Apartment
    Garden              Yard
    Rubbish             Garbage / Trash


  23. Translations - Semantics
    Interactive question!
    Suppose an electronics shop: what's wrong with the following list?
    Smart Phones
    Mobile phones
    Barcode Scanners
    Control Remotes
    See
    Printers


  24. Translations - Semantics
    Captain here: "Watch" was translated as the verb instead of the noun


  25. Translations - Plural forms (L10n)
    Count       English   Polish
    0           Apples    Jabłek
    1           Apple     Jabłko
    2 .. 4      Apples    Jabłka
    5 .. 21     Apples    Jabłek
    22 .. 24    Apples    Jabłka
    25 .. 31    Apples    Jabłek
    More complex cases do exist!
    ‣ Slovenian: 4 plural forms
    There are also languages with a single form for all counts
    ‣ Japanese
    ‣ Vietnamese

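The Polish column follows a rule that can be written down explicitly. The sketch below uses the rule commonly found in the Plural-Forms header of Polish gettext catalogues; the function name polishPluralIndex() and the $forms array are illustrative names, and in a real application gettext or ICU would make this choice.

```php
<?php
declare(strict_types=1);

/**
 * Returns the index of the Polish plural form for a whole number:
 * 0 = singular, 1 = "few" (2-4, 22-24, ...), 2 = "many" (0, 5-21, 25-31, ...).
 */
function polishPluralIndex(int $n): int
{
    if ($n === 1) {
        return 0;
    }

    $lastDigit = $n % 10;
    $lastTwoDigits = $n % 100;

    // 2..4 as last digit, except for the teens (12, 13, 14)
    if ($lastDigit >= 2 && $lastDigit <= 4 && ($lastTwoDigits < 10 || $lastTwoDigits >= 20)) {
        return 1;
    }

    return 2;
}

$forms = ['Jabłko', 'Jabłka', 'Jabłek'];

foreach ([0, 1, 2, 5, 12, 21, 22, 25] as $count) {
    printf('%d %s' . PHP_EOL, $count, $forms[polishPluralIndex($count)]);
}
```

English needs only the "is it 1?" test, which is why plural handling written with a simple if/else breaks as soon as a language like Polish is added.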

  26. Translations - Plural forms (i18n)
    Gettext!
    ‣ Supports plural forms
    ‣ Is cached in RAM (pros and cons)
    ‣ Very easy to edit (poEdit)
    ‣ Can be separated into modules
    ‣ Produces compiled language files


  27. Translations - Plural forms (i18n)
    ‣ \MessageFormatter can also help
    ‣ Can do pretty amazing stuff
    ‣ I personally don't have experience with it
    $fmt = new MessageFormatter(
        'en_GB',
        'Peter has {0, plural, =0{no cat} =1{a cat} other{# cats}}'
    );
    echo $fmt->format(array(0)).PHP_EOL;
    $fmt = new MessageFormatter(
        'nl_NL',
        'Peter heeft {0, plural, =0{geen kat} =1{een kat} other{# katten}}'
    );
    echo $fmt->format(array(0)).PHP_EOL;

    // Outputs:
    // Peter has no cat
    // Peter heeft geen kat


  28. Number formatting - L10n
    ‣ Numbers have a lot of different notations
    ‣ Corollary: nobody really knows how a number should be formatted
    Interactive question!
    How do you format the following negative number, in Euros, here in the Netherlands?
    1234,57


  29. Number formatting - L10n
    Style      nl-NL          fr-FR          pt-BR          hi-IN            ps-AR
    Decimal    "-1.234,57"    "-1 234,57"    "-1.234,57"    "-१,२३४.५७"      "-۱٬۲۳۴٫۵۷"
    Currency   "€ 1.234,57-"  "-1 234,57 €"  "(€1.234,57)"  "-€ १,२३४.५७"    "-۱٬۲۳۴٫۵۷ €"
    Percent    "25%"          "25 %"         "25%"          "२५%"            "۲۵٪"


  30. Number formatting: i18n
    $locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR'];

    foreach ($locales as $myLocale) {
        $numberFormatter = new \NumberFormatter($myLocale, \NumberFormatter::DECIMAL);
        $percentFormatter = new \NumberFormatter($myLocale, \NumberFormatter::PERCENT);
        $currencyFormatter = new \NumberFormatter($myLocale, \NumberFormatter::CURRENCY);
        printf('Locale: %s'.PHP_EOL, $myLocale);
        printf('[DEC]-1.234,57: "%s" :: ', $numberFormatter->format(-1234.57));
        printf('[PER]25%%: "%s" :: ', $percentFormatter->format(0.25));
        printf('[CUR]1.234,57-: "%s"'.PHP_EOL, $currencyFormatter->formatCurrency(-1234.57, 'EUR'));
        // To print in the currency of the loaded locale, pass
        // $currencyFormatter->getTextAttribute(\NumberFormatter::CURRENCY_CODE) as last argument
    }



  32. Output of numbers.php
    unreal4u-MBP:localization unreal4u$ php numbers.php
    Locale: nl-NL
    [DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "€ 1.234,57-"
    All this in Dutch AKA Nederlands
    --------------------------------------------------------------------------------
    Locale: fr-FR
    [DEC]-1.234,57: "-1 234,57" :: [PER]25%: "25 %" :: [CUR]1.234,57-: "-1 234,57 €"
    All this in French AKA français
    --------------------------------------------------------------------------------
    Locale: pt-BR
    [DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "(€1.234,57)"
    All this in Portuguese AKA português
    --------------------------------------------------------------------------------
    Locale: hi-IN
    [DEC]-1.234,57: "-१,२३४.५७" :: [PER]25%: "२५%" :: [CUR]1.234,57-: "-€ १,२३४.५७"
    All this in Hindi AKA िहन्दी
    --------------------------------------------------------------------------------
    Locale: ps-AR
    [DEC]-1.234,57: "-١،٢٣۴٫۵٧" :: [PER]25%: "٢۵٪" :: [CUR]1.234,57-: "-١،٢٣۴٫۵٧ €"
    All this in Pashto AKA ﻮﺘ,ﭘ


  33. Some \NumberFormatter problems
    ‣ \NumberFormatter::DURATION isn't implemented in many locales
    ‣ echo $fmt->format(12345) -> 3 hours, 25 minutes, 45 seconds
    ‣ "Easy" to implement using getPattern() and setPattern()
    ‣ Documentation exists, but is not optimal

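As a rough idea of what getPattern() and setPattern() do, here is a minimal sketch on a decimal formatter (not on DURATION), assuming the intl extension is loaded. The pattern '#,##0.00' is just an example chosen here; the locale still supplies the actual separator characters.

```php
<?php
declare(strict_types=1);

$formatter = new \NumberFormatter('nl-NL', \NumberFormatter::DECIMAL);

// The pattern ICU derived from the locale data
printf('Default pattern: %s' . PHP_EOL, $formatter->getPattern());

// Override it: always group thousands and always show two decimals.
// "," and "." are pattern placeholders, not the characters that get printed.
$formatter->setPattern('#,##0.00');

$formatted = $formatter->format(1234.5);
printf('Formatted: %s' . PHP_EOL, $formatted);
```

Custom patterns like this are also what the "custom number and currency pattern" item in a home-grown L10n database would end up storing.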

  34. Date and time formatting - L10n
    ‣ Dates have 3 different notations
      ‣ YYYY-MM-DD (1.660M)
      ‣ DD-MM-YYYY (4.810M)
      ‣ MM-DD-YYYY (320M)
      ‣ "It's complicated" (457M)
    ‣ Contrary to numbers, everybody knows how a date is formatted
    Interactive question!
    What is the value of the following date in The Netherlands?
    05-03-13

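The ambiguity in the question can be made concrete by parsing the same string under each of the three notations. This is only a sketch; the labels and the $readings array are illustrative names.

```php
<?php
declare(strict_types=1);

$input = '05-03-13';

// One parse format per notation; "!" resets all fields that are not parsed
$formats = [
    'DD-MM-YY' => '!d-m-y',
    'MM-DD-YY' => '!m-d-y',
    'YY-MM-DD' => '!y-m-d',
];

$readings = [];
foreach ($formats as $label => $format) {
    $date = \DateTimeImmutable::createFromFormat($format, $input, new \DateTimeZone('UTC'));
    $readings[$label] = $date->format('Y-m-d');
    printf('%s: %s' . PHP_EOL, $label, $readings[$label]);
}
```

All three parses succeed without any warning, so the only way to know which one is right is to know the locale the string came from.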

  35. Date and time formatting - L10n
    ➡ YMD (1660)
    ➡ YMD and DMY (287)
    ➡ DMY (3295)
    ➡ DMY and MDY (130)
    ➡ MDY (320)
    ➡ YMD and DMY and MDY (40)
    Source: https://en.wikipedia.org/wiki/Date_format_by_country


  36. Date and time formatting: i18n
    ‣ Use PHP's \*Date* related classes, like ALWAYS!
    ‣ Incredibly versatile and powerful functions
    ‣ Especially in combination with locales
    ‣ Always work in UTC, let the \*Date* classes do the rest


  37. Date and time formatting: i18n
    Style      nl-NL                    fr-FR                    hi-IN                            ps-AR
    Short      "23-05-15"               "23/05/15"               "२३-५-१५"                        "۲۰۱۵/۵/۲۳"
    Medium     "23 mei 2015"            "23 mai 2015"            "२३-०५-२०१५"                     "۲۳ ۲۰۱۵ یم"
    With time  "23 mei 2015 01:34:09"   "23 mai 2015 01:34:09"   "२३-०५-२०१५ १:३४:०९ पूवार्ह्न"   "۲۳ ۱:۳۴:۰۹ ۲۰۱۵ یم"


  38. Date and time formatting: i18n
    $locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR'];

    $printDate = new \DateTime('23-05-2015 01:34:09', new \DateTimeZone('UTC'));

    foreach ($locales as $myLocale) {
        // One formatter per style, date part only, both pinned to UTC
        $short = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::SHORT, \IntlDateFormatter::NONE, 'UTC');
        $medium = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::MEDIUM, \IntlDateFormatter::NONE, 'UTC');
        printf(
            'Locale: %s, Short: "%s", Medium: "%s"'.PHP_EOL,
            $myLocale,
            $short->format($printDate),
            $medium->format($printDate)
        );
    }


  39. i18n/L10n and OS
    ‣ Variety in L10n is almost infinite
    ‣ Automatic in i18n and L10n is better
    ‣ Operating system plays an important role
    ‣ Why reinvent a very very very complicated wheel if it already exists?


  40. Timezones - L10n
    Interactive question!
    What time is it now in Seoul (ko-KR)?


  41. Ask my timebot! https://telegram.me/TheTimeBot
    Disclaimer: Feel free to use it, but please do only provide perfect input
    https://github.com/unreal4u/tg-timebot


  42. Timezones - Some data
    ‣ PHP has full support for timezones
    ‣ 39 (40?) official timezones
    ‣ Multiple timezones in one locale
      ‣ nl-NL: Europe/Amsterdam
      ‣ es-CL: America/Santiago and Pacific/Easter
      ‣ en-US: Has 4 timezones (Plus Alaska, Samoa, Hawaii and Chamorro)
      ‣ ru-RU: Has 11 timezones

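PHP can list the timezone identifiers it knows for a country. A small sketch follows; note that the count reflects identifiers in the bundled timezone database, which for historical reasons is usually higher than the number of distinct clock times in that country, so exact counts are left out here.

```php
<?php
declare(strict_types=1);

$identifiersPerRegion = [];

foreach (['NL', 'CL', 'US', 'RU'] as $region) {
    // PER_COUNTRY expects an ISO 3166-1 alpha-2 code, i.e. the region part of the locale
    $identifiersPerRegion[$region] = \DateTimeZone::listIdentifiers(\DateTimeZone::PER_COUNTRY, $region);

    printf(
        '%s: %d identifier(s), first one: %s' . PHP_EOL,
        $region,
        count($identifiersPerRegion[$region]),
        $identifiersPerRegion[$region][0]
    );
}
```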

  43. Timezones: i18n
    Caution! Calls to ICU library can get pretty expensive!
    ‣ With a known locale, get all timezones
    ‣ If there's only one, instantiate \DateTimeZone
    ‣ More than 1? Get precise timezoneId and DST settings (cache them!)
    ‣ Now calculate the offset of a timezone for the view


  44. Timezones: i18n - Check validity
    public function isValidTimeZone($timeZoneName='') {
        if (!is_string($timeZoneName)) {
            $timeZoneName = '';
        }

        try {
            new \DateTimeZone($timeZoneName);
            return true;
        } catch (\Exception $e) {
            return false;
        }
    }


  45. Timezones: i18n - Get timezone candidates
    /**
     * $region is defined as \Locale::getRegion($currentLocale)
     */
    private function _setTimezoneCandidates($region='') {
        if (!empty($region)) {
            $this->_timezoneCandidates = \DateTimeZone::listIdentifiers(
                \DateTimeZone::PER_COUNTRY,
                $region
            );
            if (!empty($this->_timezoneCandidates) && count($this->_timezoneCandidates) == 1) {
                $this->setTimezone($this->_timezoneCandidates[0]);
            }
        }
    }


  46. Timezones: i18n - Set timezone
    public function setTimezone($timeZoneName='UTC') {
        if (!$this->isValidTimeZone($timeZoneName)) {
            $timeZoneName = 'UTC';
        }

        $this->timezone = new \DateTimeZone($timeZoneName);
        $this->timezoneId = $this->timezone->getName();
        // Limit the range to "now": without arguments, element 0 is the oldest known transition
        $now = time();
        $transitions = $this->timezone->getTransitions($now, $now);
        $this->timezoneInDST = $transitions[0]['isdst'];

        return $this->timezoneId;
    }

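Whether a zone is in DST depends on the moment you ask about. The sketch below isolates that check in a standalone function; the name isInDst() is hypothetical and the two sample dates were picked here only to show both outcomes.

```php
<?php
declare(strict_types=1);

/**
 * Tells whether $zone observes DST at the given Unix timestamp.
 * With begin and end set to the same timestamp, getTransitions() returns
 * the state that is valid at that moment as its first element.
 */
function isInDst(\DateTimeZone $zone, int $timestamp): bool
{
    $transitions = $zone->getTransitions($timestamp, $timestamp);

    return (bool) $transitions[0]['isdst'];
}

$amsterdam = new \DateTimeZone('Europe/Amsterdam');
$utc = new \DateTimeZone('UTC');

$summer = (new \DateTimeImmutable('2015-07-01 12:00:00', $utc))->getTimestamp();
$winter = (new \DateTimeImmutable('2015-01-01 12:00:00', $utc))->getTimestamp();

var_dump(isInDst($amsterdam, $summer)); // summer time in the Netherlands
var_dump(isInDst($amsterdam, $winter)); // standard time in the Netherlands
```

Because the answer changes twice a year, a cached DST flag needs an expiry no later than the next transition.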

  47. Timezones: i18n - Display
    $idf = \IntlDateFormatter::create(
        'ko-KR', // $this->_currentLocale
        \IntlDateFormatter::MEDIUM,
        \IntlDateFormatter::MEDIUM,
        $this->timezoneId // Asia/Seoul
    );
    $theDate = new \DateTime(
        '23-05-2015 21:34:09',
        new \DateTimeZone('UTC')
    );
    echo $idf->format($theDate);


  48. What's the time in Seoul then?
    Result?
    UTC 23-05-2015 21:34:09 is
    2015. 5. 24. 오전 6:34:09
    in ko-KR (Offset: +9 hours)

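The offset itself can be checked without ICU, using only the \DateTime classes. A minimal sketch with the same instant as above:

```php
<?php
declare(strict_types=1);

$utc = new \DateTimeZone('UTC');
$seoul = new \DateTimeZone('Asia/Seoul');

// Store and calculate in UTC ...
$moment = new \DateTimeImmutable('2015-05-23 21:34:09', $utc);

// ... and convert only when displaying
$local = $moment->setTimezone($seoul);
$offsetHours = $seoul->getOffset($moment) / 3600;

printf('UTC:   %s' . PHP_EOL, $moment->format('Y-m-d H:i:s'));
printf('Seoul: %s (offset: %+d hours)' . PHP_EOL, $local->format('Y-m-d H:i:s'), $offsetHours);
```

Both objects describe the same instant; only the wall-clock representation differs, which is why the date rolls over to the 24th.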

  49. Encoding and charsets - L10n
    ‣ Difficult, often misunderstood subject
    ‣ Difficult to debug
    ‣ First step of debugging is knowing what encoding you are working with
    ‣ Convert to an appropriate charset with iconv()

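A sketch of those two steps, assuming the mbstring and iconv extensions are available: first check whether the bytes are valid UTF-8, then convert from the encoding you have established. The sample string and the assumption that it is ISO-8859-1 are made up for this example; a validity check cannot tell you which legacy encoding you actually received.

```php
<?php
declare(strict_types=1);

// "Bølla" as ISO-8859-1 bytes: ø is the single byte 0xF8
$incoming = "B\xF8lla";

// Step 1: know what you are working with
$isUtf8 = mb_check_encoding($incoming, 'UTF-8');
var_dump($isUtf8); // the lone 0xF8 byte is not valid UTF-8

// Step 2: convert, here assuming the source is known to be ISO-8859-1
$converted = $isUtf8 ? $incoming : iconv('ISO-8859-1', 'UTF-8', $incoming);

printf('%d bytes before, %d bytes after' . PHP_EOL, strlen($incoming), strlen($converted));
```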

  50. Encoding in PHP
    ‣ Internal work always in UTF-8, EVERYWHERE
    ‣ Include some basic stuff so that PHP also knows that it has to work in UTF-8
    ‣ Don't forget to send the browser information as well
    mb_internal_encoding('UTF-8');
    header('Content-Type: text/html; charset=UTF-8'); // use the MIME type of the actual response
    ‣ Lots of small things to consider, but can vary on each case


  51. Encoding in PHP: mails
    Caution with the imap extension! Has some problems with UTF-7
    Always encode "To" (CC, BCC) and "Subject" fields
    Code: "=?utf-8?B?5L2p5ae/?= "
    Output: 佩姿
    Code: "=?iso-8859-1?Q?B=F8lla?=, med =?iso-8859-1?Q?=F8l?= i baggen "
    Output: Bølla , med øl i baggen
    Code: "=?utf-7?Q?Petra_M+APw-ller?="
    Output: Petra Müller

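Outside the imap extension, mbstring and iconv can also handle such encoded words. This is a sketch under the assumption that both extensions are loaded; it decodes the first two examples above and encodes a sample name chosen for this illustration.

```php
<?php
declare(strict_types=1);

mb_internal_encoding('UTF-8');

// Decoding: Base64 ("B") and quoted-printable ("Q") encoded words
$decodedB = mb_decode_mimeheader('=?utf-8?B?5L2p5ae/?=');
$decodedQ = iconv_mime_decode('=?iso-8859-1?Q?B=F8lla?=', 0, 'UTF-8');

// Encoding: a header value with a non-ASCII character, Base64 flavour
$name = "M\u{00FC}ller";
$encoded = mb_encode_mimeheader($name, 'UTF-8', 'B');

printf('%s | %s | %s' . PHP_EOL, $decodedB, $decodedQ, $encoded);
```

The encoded result is plain ASCII, which is what makes it safe to put into a mail header.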

  52. Encoding in PHP: mails
    ‣ Buggy functions
      ‣ imap_rfc822_parse_adrlist()
      ‣ imap_mime_header_decode()
      ‣ Others?
    ‣ Check out https://github.com/unreal4u/string-operations/ for replacement functions


  53. Databases and L10n / i18n


  54. But before we begin…


  55. Database and encodings/charsets
    CHARSET
    COLLATION


  56. Practical use of charset
    Interactive question!
    What charset should be used to save the following string?
    f5d39e997c5d7e4e2a3ef49973f61fb2
    md5/SHA1-like strings should be ASCII-encoded
    (Why? It helps the db engine to better predict its memory allocation)


  57. Practical use of charset
    CREATE TABLE `t1` (
        `md5HashCalculation` CHAR(32) CHARSET ASCII COLLATE ascii_bin
    );


  58. Differences between TEXT and [VAR]CHAR
    ‣ [VAR]CHAR(255) holds up to 255 characters
    ‣ TINYTEXT can hold up to 255 bytes
    ‣ UTF-8 characters take 1 to 4 bytes each (MySQL's utf8 charset stores at most 3, utf8mb4 all 4)

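The difference between characters and bytes is easy to see from PHP, assuming the mbstring extension is loaded. The sample characters below were chosen for this sketch, one per byte length.

```php
<?php
declare(strict_types=1);

// One sample per UTF-8 byte length, written as code points to avoid editor issues
$samples = [
    'a'     => 'a',          // U+0061, ASCII
    'ñ'     => "\u{00F1}",   // Latin small letter n with tilde
    '佩'    => "\u{4F69}",   // CJK ideograph
    'emoji' => "\u{1F600}",  // outside the Basic Multilingual Plane
];

$byteLengths = [];
foreach ($samples as $label => $character) {
    $byteLengths[$label] = strlen($character); // strlen() counts bytes
    printf(
        '%-6s %d character, %d byte(s)' . PHP_EOL,
        $label,
        mb_strlen($character, 'UTF-8'),        // mb_strlen() counts characters
        $byteLengths[$label]
    );
}
```

A byte-limited column such as TINYTEXT therefore holds anywhere between 63 and 255 characters of UTF-8 text, depending on what the text contains.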

  59. Indexes and charsets
    When working with Unicode characters, performance can be indirectly and negatively impacted
    ‣ Too big (and complex) of a topic for now
    ‣ Use EXPLAIN to understand underlying decisions of MySQL (in some cases)
    ‣ Don't bother with micro-optimization either


  60. COLLATION
    ‣ Used to order data in a "natural" way
    ‣ Different languages have different rules
    CREATE TABLE `spanishCollation` (
        `name01` VARCHAR(15) COLLATE utf8_spanish_ci,
        `name02` VARCHAR(15) COLLATE utf8_spanish2_ci
    ) DEFAULT CHARSET utf8;

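The effect of a collation can also be observed on the PHP side with the \Collator class from the intl extension. This is only an analogy to what the database does, sketched with a word list made up for the example; it is not a statement about how MySQL implements utf8_spanish_ci.

```php
<?php
declare(strict_types=1);

$words = ['oso', "\u{00F1}and\u{00FA}", 'nube']; // oso, ñandú, nube

// Plain sort() compares bytes, so "ñ" (0xC3 0xB1) lands after every ASCII letter
$byteOrder = $words;
sort($byteOrder, SORT_STRING);

// A Spanish collator places "ñ" between "n" and "o"
$spanishOrder = $words;
$collator = new \Collator('es');
$collator->sort($spanishOrder);

printf('Byte order:    %s' . PHP_EOL, implode(', ', $byteOrder));
printf('Spanish order: %s' . PHP_EOL, implode(', ', $spanishOrder));
```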

  61. Some notes on Collation
    "*_ci" stands for case-insensitive
    Watch out with utf8_general_ci and utf8_unicode_ci!
    ➡ utf8_general_ci has some problems with Hebrew and some Cyrillic characters
    ➡ It's generally faster (7~12%)
    ➡ But utf8_unicode_ci is more compatible


  62. Collation and performance
    ‣ Performance penalty: order in another collation
    ‣ It will have to do a filesort
    ‣ Which is MySQL's way of saying "quicksort"
    ‣ [Partial] keys can help avoid this quicksort operation


  63. General database localization
    ‣ Not recommended: translation on database level
    ‣ If absolutely needed, investigate EAV model
    ‣ PRO: Quick, simple and cheap
    ‣ CON: Queries may become complex


  64. Names and addresses
    ‣ UTF-8 does NOT cover all cases!
    ‣ Best way to save information is to save it in binary format:
    CREATE TABLE `thaPeople` (
        `name` MEDIUMBLOB NULL DEFAULT NULL,
        `address` MEDIUMBLOB NULL DEFAULT NULL
    );
    ‣ However this is a very extreme case


  65. Your own L10n database
    ‣ Does the locale use the metric or imperial system (either British or American)?
    ‣ What type of rounding is used in that locale?
    ‣ Optional: custom number and currency pattern to overwrite any default rules
    ‣ The preferred timezone (user based, not L10n based)
    ‣ Direction of text


  66. Fonts
    ‣ Easily overlooked, yet very important
    ‣ Web-safe fonts are generally safe to use
    ‣ Don't forget to test multibyte characters
    ‣ 2 bytes: ñÖÑú - ӬģĽ
    ‣ 3 bytes: 佸ਁ - —၍₶
    ‣ 4+bytes: 韴韵韶 -
    ‣ Example: Mamá vive en Föllinge en el bosque
    del Ñañdú.¿Enredado? ¡Deberías! (SimSun-ExtB)


  67. JavaScript considerations
    ‣ Always use native Date() object
    ‣ Has support for timezones
    ‣ Native i18n in JavaScript is limited to formatting (the Intl object); there is no translation support
    ‣ http://i18next.com is able to save the day!


  68. Finally: Who am I?
    Want to know more? My name is Camilo Sperberg
    Tweet me @unreal4u
    Email [email protected]


  69. Finally: Who am I?
    ‣ Blog: http://blog.unreal4u.com/ (Spanish)
    ‣ Leave comments on https://legacy.joind.in/17727
    ‣ Slides will be ready to be downloaded on:
    ‣ http://unreal4u.com/talks/
    ‣ https://speakerdeck.com/unreal4u


  70. Nice reads and more information
    ‣ https://github.com/triplepoint/php-units-of-measure
    ‣ https://github.com/unreal4u/localization
    ‣ http://www.w3.org/International/articles/language-tags/
    ‣ http://php.net/manual/en/book.intl.php
    ‣ http://www.sitepoint.com/localizing-php-applications-1/
    ‣ http://www.utf8-chartable.de/
    ‣ http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
