i18n & L10n: The basics (Groningen PHP)

The Internationalization and Localization talk as presented at Groningen PHP on 2016-07-07

Camilo Sperberg

July 07, 2016

Transcript

  1. Internationalis(z)ation and Localis(z)ation: The Basics®
    Groningen PHP 2016-07-07
    Camilo Sperberg / @unreal4u
  2. A little favor…
    Let me know if my "ehmm" tag becomes too annoying
  3. Main index
    ‣ Definitions and requirements
    ‣ The \Locale object
    ‣ What constitutes a locale
    ‣ Encodings
    ‣ Database
    ‣ Other considerations
  4. What is internationalisation?
    Internationalisation (i18n): preparing your application to be localised (Pro-tip: How? Watch this talk!)
    Localisation (L10n): translating, adding icons and adapting other things to a specific region
  5. Most discussed thing in RFCs
    [Figure: the Baggins family tree (LOTR) next to the succession of i18n RFCs]
    ‣ LOTR ("Son of"): Balbo Baggins (1167 - 1258) and Berylla Boffin (1172 - 1265), Mungo Baggins (1207 - 1300) and Laura Grubb, Bungo Baggins (1246 - 1326) and Belladonna Took, Bilbo Baggins (1290 - ?)
    ‣ RFC i18n ("Replaces"): RFC 1766 (Mar. 1995), RFC 3066 (Jan. 2001), RFC 4646 and RFC 4647 (Sep. 2006), RFC 5646 (Sep. 2009)
  6. i18n and PHP: recommended extensions
    ‣ For gettext: php-gettext
      ‣ Needs gettext
      ‣ https://www.gnu.org/software/gettext/
    ‣ For intl: php-intl
      ‣ Needs ICU4C
      ‣ http://site.icu-project.org/
    ‣ mb_* functions: multibyte support
  7. L10n: Definition
    "A set of parameters that defines the user's language, country and any special variant preferences that the user wants to see in their user interface" (Wikipedia)
    AKA: rules for a specific region
    ‣ Current code standard: RFC 5646 (which replaced RFC 4646)
    ‣ Language in lowercase (ISO 639-1), hyphen, region in uppercase (ISO 3166-1 alpha-2)
  8. L10n: Code standards
    ‣ nl-NL: Dutch as SUISHE*® in the Netherlands
    ‣ nl-BE: Dutch as SUISHE*® in Belgium
    * Spoken, Used, Interpreted, Seen, Heard, Etc
  9. L10n: Code standards
    ‣ pt-BR: Portuguese as SUISHE*® in Brazil
    ‣ pt-PT: Portuguese as SUISHE*® in Portugal
    * Spoken, Used, Interpreted, Seen, Heard, Etc
  10. L10n: Code standards
    ‣ es-ES, es-CL, es-AR, es-PE, es-*: Spanish as SUISHE*® in Spain, Chile, Argentina, Peru, etc.
    * Spoken, Used, Interpreted, Seen, Heard, Etc
  11. L10n: Code standards
    ‣ de-CH-1901: German as used in Switzerland using the 1901 variant
  12. L10n: Code standards
    ‣ hy-Latn-IT-arevela: Eastern Armenian written in Latin script, as used in Italy
  13. L10n: Determining which to use
    Introducing the \Locale object! Best practice tips:
    ‣ Get it from your $_GET or $_POST
    ‣ Get it from the headers
    ‣ Get it based on IP
  14. i18n: Detecting locale

          public function getLocaleFromClient() {
              // Try the request parameters first, then the headers, then the IP address
              $this->locale = $this->getLocaleFromGetRequest();
              if (empty($this->locale)) {
                  $this->locale = $this->getLocaleFromHeaders();
                  if (empty($this->locale)) {
                      $this->locale = $this->getLocaleFromIP();
                  }
              }

              return $this->locale;
          }
  15. i18n: Detecting locale

          public function getLocaleFromHeaders() {
              $this->locale = '';

              if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
                  // Let ICU pick the best match from the browser's Accept-Language header
                  $preferredLocale = \Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
                  $this->locale = $this->_checkLocale($preferredLocale);
              }

              return $this->locale;
          }
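The slide calls `_checkLocale()` without showing it. A minimal sketch of what such a check could look like, assuming the application keeps a list of supported locales: the function name `checkLocale()`, the supported list and the fallback are all hypothetical and not taken from the talk.

```php
<?php
// Hypothetical helper: map a detected locale onto one the application supports.
function checkLocale(string $candidate, array $supported, string $fallback): string
{
    // \Locale::acceptFromHttp() typically returns underscores ("nl_NL"); normalise to hyphens.
    $candidate = str_replace('_', '-', trim($candidate));
    if ($candidate === '') {
        return $fallback;
    }

    // 1. Exact match, ignoring case.
    foreach ($supported as $locale) {
        if (strcasecmp($locale, $candidate) === 0) {
            return $locale;
        }
    }

    // 2. Same language, different region: take the first supported locale for that language.
    $language = strtolower(explode('-', $candidate)[0]);
    foreach ($supported as $locale) {
        if (strtolower(explode('-', $locale)[0]) === $language) {
            return $locale;
        }
    }

    // 3. Nothing usable: fall back to the default.
    return $fallback;
}

$supported = ['nl-NL', 'nl-BE', 'es-CL', 'en-US'];
echo checkLocale('nl_BE', $supported, 'en-US'), PHP_EOL; // nl-BE
echo checkLocale('es_AR', $supported, 'en-US'), PHP_EOL; // es-CL
echo checkLocale('ko_KR', $supported, 'en-US'), PHP_EOL; // en-US
```

Falling back to the same language in another region is a design choice; whether es-CL is acceptable for an es-AR visitor depends on the application.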
  16. \Locale: cool functions!
    Use … To …
    ‣ \Locale::getDisplayLanguage(): create a list of languages
    ‣ \Locale::getDisplayRegion(): display the country where the locale is used
    ‣ \Locale::acceptFromHttp(): parse the browser's Accept header
    ‣ \Locale::setDefault(): set the default localization to use
  17. L10n: what to look out for?
    ‣ Warning! Locale is not just a translation
    ‣ What's normal for you can be strange to others
  18. Images and gestures
    ‣ Pointing fingers can be offensive in Arabic countries
    ‣ The "peace" sign (✌) is offensive in Australia, Ireland, New Zealand, South Africa and the United Kingdom
  19. Translations - Semantics (L10n)

          echo $numberResults.' results found within a '.$range.'km range';

    If translated directly into Spanish, it will sound like this:
    ➡ "Found 123 results within range 5km"
  20. Translations - Semantics (i18n)
    ‣ Solvable by implementing printf()

          printf(
              '%1$d results found within a %2$d km range',
              $numberResults,
              $range
          );

    ‣ Translator decides where to print out variables
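To illustrate why numbered placeholders matter, here is a small sketch in which the translated string puts the range before the result count. The Spanish sentence and the `$messages` array are made up for this example; in a real application the format strings would come from a translation catalogue.

```php
<?php
$numberResults = 123;
$range = 5;

// Hypothetical catalogue: each translation decides where the variables go.
// Single quotes keep PHP from interpolating "$d" as a variable.
$messages = [
    'en-US' => '%1$d results found within a %2$d km range',
    'es-CL' => 'En un radio de %2$d km se encontraron %1$d resultados',
];

foreach ($messages as $locale => $format) {
    // The argument order in code never changes; only the format string does.
    echo $locale, ': ', sprintf($format, $numberResults, $range), PHP_EOL;
}
// en-US: 123 results found within a 5 km range
// es-CL: En un radio de 5 km se encontraron 123 resultados
```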
  21. Translations - Semantics (L10n)
    ‣ es-ES: Carro, Coche
    ‣ es-CL: Auto

  22. This is a “carro” or “coche” in es-CL:

  23. Translations - Semantics (L10n)
    British / American
    ‣ Holiday / Vacation
    ‣ Football / Soccer
    ‣ American Football / Football
    ‣ Flat / Apartment
    ‣ Garden / Yard
    ‣ Rubbish / Garbage, Trash
  24. Translations - Semantics
    Interactive question! Suppose an electronics shop that sells batteries: what's wrong with the following list?
    ‣ Smart Phone
    ‣ Barcode Scanner
    ‣ Control Remote
    ‣ See
    ‣ GPS
    ‣ Digital camera
  25. Translations - Semantics
    Captain here: "Watch" was translated as the verb instead of the noun
  26. Translations - Semantics
    On a B&B page some time ago:
    ‣ nl_NL: (Original text) "Kom genieten van het uitzicht bij de Reeuwijkse plassen!"
    ‣ en_US: (Google Translate) "Come and enjoy the view at the Reeuwijkse pee!"
  27. 27

  28. Translations - Plural forms (L10n)
    Amount: English / Polish
    ‣ 0: Apples / Jabłek
    ‣ 1: Apple / Jabłko
    ‣ 2 .. 4: Apples / Jabłka
    ‣ 5 .. 21: Apples / Jabłek
    ‣ 22 .. 24: Apples / Jabłka
    ‣ 25 .. 31: Apples / Jabłek
    More complex cases do exist!
    ‣ Slovenian: 4 plural forms
    There are also cases with 1 plural form
    ‣ Japanese
    ‣ Vietnamese
  29. Translations - Plural forms (i18n)
    Gettext!
    ‣ Supports plural forms
    ‣ Is cached in RAM (pros and cons)
    ‣ Very easy to edit (poEdit)
    ‣ Can be separated into modules
    ‣ Produces compiled language files
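The Polish column of the apples table follows the plural rule that gettext catalogues commonly declare for Polish (`nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)`). The sketch below spells that rule out in plain PHP so it can be checked against the table; the function names are hypothetical, and with gettext you would normally let `ngettext()` and the catalogue header do this work.

```php
<?php
// Returns the index of the Polish plural form for $n (0, 1 or 2).
function polishPluralIndex(int $n): int
{
    if ($n === 1) {
        return 0; // singular: 1 jabłko
    }
    $lastDigit = $n % 10;
    $lastTwoDigits = $n % 100;
    // 2..4, 22..24, 32..34, ... but not 12..14
    if ($lastDigit >= 2 && $lastDigit <= 4 && ($lastTwoDigits < 10 || $lastTwoDigits >= 20)) {
        return 1;
    }
    return 2; // everything else, including 0
}

// Hypothetical helper that picks the word form from the table above.
function polishApples(int $n): string
{
    $forms = ['Jabłko', 'Jabłka', 'Jabłek'];
    return $n . ' ' . $forms[polishPluralIndex($n)];
}

foreach ([0, 1, 3, 5, 12, 21, 22, 25] as $amount) {
    echo polishApples($amount), PHP_EOL;
}
```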
  30. Translations - Plural forms (i18n)
    ‣ \MessageFormatter can also help
    ‣ Can do pretty amazing stuff
    ‣ I personally don't have experience with it

          $fmt = new MessageFormatter(
              'en_GB',
              'Peter has {0, plural, =0{no cat} =1{a cat} other{# cats}}'
          );
          echo $fmt->format(array(0)), PHP_EOL;
          $fmt = new MessageFormatter(
              'nl_NL',
              'Peter heeft {0, plural, =0{geen kat} =1{een kat} other{# katten}}'
          );
          echo $fmt->format(array(0)), PHP_EOL;

          // Outputs:
          // Peter has no cat
          // Peter heeft geen kat
  31. Number formatting - L10n
    ‣ Numbers have a lot of different notations
    ‣ Corollary: nobody really knows for sure how a number should be formatted
    Interactive question! How do you format the following negative number, in Euros, here in the Netherlands? 1234,57
  32. Number formatting - L10n
    Locale: decimal / currency / percent
    ‣ nl-NL: "-1.234,57" / "€ 1.234,57-" / "25%"
    ‣ fr-FR: "-1 234,57" / "-1 234,57 €" / "25 %"
    ‣ pt-BR: "-1.234,57" / "(€1.234,57)" / "25%"
    ‣ hi-IN: "-१,२३४.५७" / "-€ १,२३४.५७" / "२५%"
    ‣ ps-AR: "-۱٬۲۳۴٫۵۷" / "-۱٬۲۳۴٫۵۷ €" / "۲۵٪"
  33. Number formatting: i18n

          $locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR',];

          foreach ($locales as $myLocale) {
              // One formatter per style, all bound to the same locale
              $numberFormatter = new \NumberFormatter($myLocale, \NumberFormatter::DECIMAL);
              $percentFormatter = new \NumberFormatter($myLocale, \NumberFormatter::PERCENT);
              $currencyFormatter = new \NumberFormatter($myLocale, \NumberFormatter::CURRENCY);
              printf('Locale: %s'.PHP_EOL, $myLocale);
              printf('[DEC]-1.234,57: "%s" :: ', $numberFormatter->format(-1234.57));
              printf('[PER]25%%: "%s" :: ', $percentFormatter->format(0.25));
              // To print in the currency of the loaded locale, pass
              // $currencyFormatter->getTextAttribute(\NumberFormatter::CURRENCY_CODE) instead of 'EUR'
              printf('[CUR]1.234,57-: "%s"'.PHP_EOL, $currencyFormatter->formatCurrency(-1234.57, 'EUR'));
          }
  35. Output of the script:

          unreal4u-MBP:localization unreal4u$ php numbers.php
          Locale: nl-NL
          [DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "€ 1.234,57-"
          All this in Dutch AKA Nederlands
          --------------------------------------------------------------------------------
          Locale: fr-FR
          [DEC]-1.234,57: "-1 234,57" :: [PER]25%: "25 %" :: [CUR]1.234,57-: "-1 234,57 €"
          All this in French AKA français
          --------------------------------------------------------------------------------
          Locale: pt-BR
          [DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "(€1.234,57)"
          All this in Portuguese AKA português
          --------------------------------------------------------------------------------
          Locale: hi-IN
          [DEC]-1.234,57: "-१,२३४.५७" :: [PER]25%: "२५%" :: [CUR]1.234,57-: "-€ १,२३४.५७"
          All this in Hindi AKA हिन्दी
          --------------------------------------------------------------------------------
          Locale: ps-AR
          [DEC]-1.234,57: "-١،٢٣۴٫۵٧" :: [PER]25%: "٢۵٪" :: [CUR]1.234,57-: "-١،٢٣۴٫۵٧ €"
          All this in Pashto AKA پښتو
  36. Some \NumberFormatter problems
    ‣ \NumberFormatter::DURATION isn't implemented in many locales
    ‣ echo $fmt->format(12345) -> 3 hours, 25 minutes, 45 seconds
    ‣ "Easy" to implement using getPattern() and setPattern()
    ‣ Documentation exists, but is not optimal
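For reference, the arithmetic behind the duration example can be written out by hand. This is only a plain-PHP sketch of the calculation, not the getPattern()/setPattern() approach mentioned on the slide, and it is not locale-aware: the English words are hard-coded and `splitDuration()` is a hypothetical helper.

```php
<?php
// Split a number of seconds into hours, minutes and seconds.
function splitDuration(int $seconds): array
{
    return [
        'hours'   => intdiv($seconds, 3600),
        'minutes' => intdiv($seconds % 3600, 60),
        'seconds' => $seconds % 60,
    ];
}

$parts = splitDuration(12345);
printf(
    '%d hours, %d minutes, %d seconds' . PHP_EOL,
    $parts['hours'],
    $parts['minutes'],
    $parts['seconds']
);
// 3 hours, 25 minutes, 45 seconds
```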
  37. Date and time formatting - L10n
    ‣ Dates have 3 different notations
      ‣ YYYY-MM-DD (1,660 million people)
      ‣ DD-MM-YYYY (4,810 million people)
      ‣ MM-DD-YYYY (320 million people)
      ‣ "It's complicated" (457 million people)
    ‣ Contrary to numbers, everybody knows how a date is formatted
    Interactive question! What is the value of the following date in The Netherlands? 05-03-13
  38. Date and time formatting - L10n
    ➡ YMD (1660)
    ➡ YMD and DMY (287)
    ➡ DMY (3295)
    ➡ DMY and MDY (130)
    ➡ MDY (320)
    ➡ YMD and DMY and MDY (40)
    https://en.wikipedia.org/wiki/Date_format_by_country
  39. Date and time formatting: i18n
    ‣ Use PHP's \*Date* related classes, like ALWAYS!
    ‣ Incredibly versatile and powerful functions
    ‣ Especially in combination with locales
    ‣ Always work in UTC, let the \*Date* classes do the rest
  40. Date and time formatting: i18n
    Locale: short / medium / with time
    ‣ nl-NL: "23-05-15" / "23 mei 2015" / "23 mei 2015 01:34:09"
    ‣ fr-FR: "23/05/15" / "23 mai 2015" / "23 mai 2015 01:34:09"
    ‣ hi-IN: "२३-५-१५" / "२३-०५-२०१५" / "२३-०५-२०१५ १:३४:०९ पूर्वाह्न"
    ‣ ps-AR: "۲۰۱۵/۵/۲۳" / "۲۳ ۲۰۱۵ یم" / "۲۳ ۱:۳۴:۰۹ ۲۰۱۵ یم"
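  41. Date and time formatting: i18n

          $locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR',];

          $printDate = new \DateTime('23-05-2015 01:34:09', new \DateTimeZone('UTC'));

          foreach ($locales as $myLocale) {
              // One formatter per date style; arguments are locale, date style, time style, timezone
              $shortFormatter = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::SHORT, \IntlDateFormatter::NONE, 'UTC');
              $mediumFormatter = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::MEDIUM, \IntlDateFormatter::NONE, 'UTC');
              printf(
                  'Locale: %s, Short: "%s", Medium: "%s"'.PHP_EOL,
                  $myLocale,
                  $shortFormatter->format($printDate),
                  $mediumFormatter->format($printDate)
              );
          }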
  42. i18n/L10n and OS
    ‣ Variety in L10n is almost infinite
    ‣ Automatic in i18n and L10n is better
    ‣ Operating system plays an important role
    ‣ Why reinvent a very very very complicated wheel if it already exists?
  43. None
  44. Timezones - L10n
    Interactive question! What time is it now in Seoul (ko-KR)?
  45. Ask my timebot! https://telegram.me/TheTimeBot
    Disclaimer: Feel free to use it, but please only provide perfect input
    https://github.com/unreal4u/tg-timebot
  46. Timezones - Some data
    ‣ PHP has full support for timezones
    ‣ 39 (40?) official timezones
    ‣ Multiple timezones in one locale
      ‣ nl-NL: Europe/Amsterdam
      ‣ es-CL: America/Santiago and Pacific/Easter
      ‣ en-US: Has 4 timezones in the contiguous states (plus Alaska, Samoa, Hawaii and Chamorro)
      ‣ ru-RU: Has 11 timezones
  47. Timezones: i18n
    Caution! Calls to the ICU library can get pretty expensive!
    ‣ With a known locale, get all timezones
    ‣ If there's only one, instantiate \DateTimeZone
    ‣ More than 1? Get precise timezoneId and DST settings (cache them!)
    ‣ Now calculate the offset of a timezone for the view
  48. Timezones: i18n - Check validity

          public function isValidTimeZone($timeZoneName='') {
              if (!is_string($timeZoneName)) {
                  $timeZoneName = '';
              }

              // \DateTimeZone throws an exception for unknown identifiers
              try {
                  new \DateTimeZone($timeZoneName);
                  return true;
              } catch (\Exception $e) {
                  return false;
              }
          }
  49. Timezones: i18n - Get timezone candidates

          /**
           * $region is defined as \Locale::getRegion($currentLocale)
           */
          private function _setTimezoneCandidates($region='') {
              if (!empty($region)) {
                  $this->_timezoneCandidates = \DateTimeZone::listIdentifiers(
                      \DateTimeZone::PER_COUNTRY,
                      $region
                  );
                  // Only one candidate for this country: no need to ask any further
                  if (!empty($this->_timezoneCandidates) && count($this->_timezoneCandidates) == 1) {
                      $this->setTimezone($this->_timezoneCandidates[0]);
                  }
              }
          }
  50. Timezones: i18n - Set timezone

          public function setTimezone($timeZoneName='UTC') {
              if (!$this->isValidTimeZone($timeZoneName)) {
                  $timeZoneName = 'UTC';
              }

              $this->timezone = new \DateTimeZone($timeZoneName);
              $this->timezoneId = $this->timezone->getName();
              // Limit the lookup to "now": the first element then describes the current DST state
              // (without arguments, element 0 is the oldest known transition)
              $now = time();
              $transitions = $this->timezone->getTransitions($now, $now);
              $this->timezoneInDST = $transitions[0]['isdst'];

              return $this->timezoneId;
          }
  51. Timezones: i18n - Display

          $idf = \IntlDateFormatter::create(
              'ko-KR', // $this->_currentLocale
              \IntlDateFormatter::MEDIUM,
              \IntlDateFormatter::MEDIUM,
              $this->timezoneId // Asia/Seoul
          );
          $theDate = new \DateTime('23-05-2015 21:34:09', new \DateTimeZone('UTC'));
          echo $idf->format($theDate);
  52. What's the time in Seoul then?
    Result? UTC 23-05-2015 21:34:09 is 2015. 5. 24. 오전 6:34:09 in ko-KR (Offset: +9 hours)
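The offset itself can be checked without the intl extension. The following sketch uses only PHP's built-in \DateTime and \DateTimeZone classes to convert the same UTC moment to Asia/Seoul; it produces an unlocalised timestamp, so it complements the \IntlDateFormatter call rather than replacing it.

```php
<?php
// The moment from the slide, stored in UTC.
$utcMoment = new DateTime('2015-05-23 21:34:09', new DateTimeZone('UTC'));

$seoul = new DateTimeZone('Asia/Seoul');

// Offset of Seoul relative to UTC at that moment, in seconds.
$offsetSeconds = $seoul->getOffset($utcMoment);

// Same instant, expressed in Seoul local time (clone keeps the UTC object untouched).
$localMoment = clone $utcMoment;
$localMoment->setTimezone($seoul);

printf(
    'UTC %s is %s in Asia/Seoul (offset: %+d hours)' . PHP_EOL,
    $utcMoment->format('Y-m-d H:i:s'),
    $localMoment->format('Y-m-d H:i:s'),
    intdiv($offsetSeconds, 3600)
);
// UTC 2015-05-23 21:34:09 is 2015-05-24 06:34:09 in Asia/Seoul (offset: +9 hours)
```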
  53. Encoding and charsets - L10n
    ‣ Difficult, often misunderstood subject
    ‣ Difficult to debug
    ‣ First step of debugging is knowing what encoding you are working with
    ‣ Convert to an appropriate charset with iconv()
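A small sketch of those two debugging steps, assuming the incoming bytes are known to be ISO-8859-1: first check whether a string is valid UTF-8, then convert it with iconv(). The helper `isValidUtf8()` is hypothetical; it relies on PCRE's `u` modifier rejecting invalid UTF-8 input.

```php
<?php
// Hypothetical helper: preg_match() with the "u" modifier fails on invalid UTF-8.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// "Bølla" as ISO-8859-1 bytes: ø is the single byte 0xF8.
$latin1 = "B\xF8lla";

// Convert from the known source charset to UTF-8.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

var_dump(isValidUtf8($latin1)); // bool(false)
var_dump(isValidUtf8($utf8));   // bool(true)
echo bin2hex($utf8), PHP_EOL;   // 42c3b86c6c61
```

Note that iconv() can only convert correctly when the source charset is actually known; guessing it wrong produces valid but garbled UTF-8.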
  54. Encoding in PHP
    ‣ Internal work always in UTF-8, EVERYWHERE
    ‣ Include some basic stuff so that PHP also knows that it has to work in UTF-8
    ‣ Don't forget to send the browser information as well

          mb_internal_encoding('UTF-8');
          // Replace text/html with the content type you are actually sending
          header('Content-Type: text/html; charset=UTF-8');

    ‣ Lots of small things to consider, but these can vary from case to case
  55. Encoding in PHP: mails
    Caution with the imap extension! Has some problems with UTF-7
    Always encode "To" (CC, BCC) and "Subject" fields
    Code ➡ Output
    ‣ "=?utf-8?B?5L2p5ae/?= <my@name.com.tw>" ➡ 佩姿 <my@name.com.tw>
    ‣ "=?iso-8859-1?Q?B=F8lla?=, med =?iso-8859-1?Q?=F8l?= i baggen <my@name.com>" ➡ Bølla, med øl i baggen <my@name.com>
    ‣ "=?utf-7?Q?Petra_M+APw-ller?=" ➡ Petra Müller
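As an illustration of what "encoding" a header field means, here is a minimal sketch that produces a base64 encoded-word like the first example above. `encodeHeaderValue()` is a hypothetical helper, not one of the author's replacement functions, and it ignores the 75-character limit that RFC 2047 places on each encoded-word, so long subjects would still need to be split.

```php
<?php
// Hypothetical helper: wrap non-ASCII header text in a base64 "encoded-word".
function encodeHeaderValue(string $text): string
{
    // Printable ASCII can be sent as-is.
    if (preg_match('/^[\x20-\x7E]*$/', $text) === 1) {
        return $text;
    }
    // Assumes $text is already UTF-8.
    return '=?UTF-8?B?' . base64_encode($text) . '?=';
}

// The two characters from the first row above, written as UTF-8 bytes.
$name = "\xE4\xBD\xA9\xE5\xA7\xBF";

echo encodeHeaderValue($name), PHP_EOL;    // =?UTF-8?B?5L2p5ae/?=
echo encodeHeaderValue('Hello'), PHP_EOL;  // Hello
```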
  56. Encoding in PHP: mails
    ‣ Buggy functions
      ‣ imap_rfc822_parse_adrlist()
      ‣ imap_mime_header_decode()
      ‣ Others?
    ‣ Check out https://github.com/unreal4u/string-operations/ for replacement functions
  57. Databases and L10n / i18n

  58. But before we begin…

  59. None
  60. Database and encodings/charsets: CHARSET and COLLATION

  61. Practical use of charset
    Interactive question! What charset should be used to save the following string? f5d39e997c5d7e4e2a3ef49973f61fb2
    md5/SHA1-like strings should be ASCII-encoded (Why? It helps the db engine to better predict its memory allocation)
  62. Practical use of charset

          CREATE TABLE `t1` (
              `md5HashCalculation` CHAR(32) CHARSET ASCII COLLATE ascii_bin
          );
  63. Differences between TEXT and [VAR]CHAR
    ‣ [VAR]CHAR(255) holds up to 255 characters
    ‣ TINYTEXT can hold up to 255 bytes
    ‣ UTF-8 characters can take up to 4 bytes each (MySQL's utf8 charset stores at most 3 bytes per character, utf8mb4 stores all 4)
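The difference between characters and bytes is easy to see from PHP. The sketch below counts both for a few sample characters and then works out how many 4-byte characters fit in a 255-byte TINYTEXT; `describe()` is a hypothetical helper, and the `\u{...}` escapes require PHP 7 or later.

```php
<?php
// Count characters (UTF-8 code points) and bytes of a string.
function describe(string $text): array
{
    return [
        'characters' => preg_match_all('/./su', $text), // "u": treat input as UTF-8
        'bytes'      => strlen($text),                  // strlen() always counts bytes
    ];
}

$samples = [
    'a',          // ASCII: 1 byte
    "\u{00F1}",   // ñ: 2 bytes
    "\u{4F69}",   // 佩: 3 bytes
    "\u{1F600}",  // emoji outside the BMP: 4 bytes
];

foreach ($samples as $sample) {
    $info = describe($sample);
    printf('%s: %d character, %d byte(s)' . PHP_EOL, $sample, $info['characters'], $info['bytes']);
}

// A 255-byte TINYTEXT holds at most this many 4-byte characters:
echo intdiv(255, 4), PHP_EOL; // 63
```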
  64. Indexes and charsets
    When working with Unicode characters, performance can be indirectly and negatively impacted
    ‣ Too big (and complex) of a topic for now
    ‣ Use EXPLAIN to understand underlying decisions of MySQL (in some cases)
    ‣ Don't bother with micro-optimization either
  65. COLLATION
    ‣ Used to order data in a "natural" way
    ‣ Different languages have different rules

          CREATE TABLE `spanishCollation` (
              `name01` VARCHAR(15) COLLATE utf8_spanish_ci,
              `name02` VARCHAR(15) COLLATE utf8_spanish2_ci
          ) DEFAULT CHARSET utf8;
  66. Some notes on Collation
    "*_ci" stands for case-insensitive
    Watch out with utf8_general_ci and utf8_unicode_ci!
    ➡ utf8_general_ci has some problems with Hebrew and some Cyrillic characters
    ➡ It's generally faster (7~12%)
    ➡ But utf8_unicode_ci is more compatible
  67. Collation and performance
    ‣ Performance penalty: order in another collation
    ‣ It will have to do a filesort
    ‣ Which is MySQL's way of saying "quicksort"
    ‣ [Partial] keys can help avoid this quicksort operation
  68. General database localization
    ‣ Not recommended: translation on database level
    ‣ If absolutely needed, investigate EAV model
    ‣ PRO: Quick, simple and cheap
    ‣ CON: Queries may become complex
  69. Names and addresses
    ‣ UTF-8 does NOT cover all cases!
    ‣ Best way to save information is to save it in binary format:

          CREATE TABLE `thaPeople` (
              `name` MEDIUMBLOB NULL DEFAULT NULL,
              `address` MEDIUMBLOB NULL DEFAULT NULL
          );

    ‣ However this is a very extreme case
    ‣ More info? Check www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
  70. Your own L10n database
    ‣ Does the locale use the metric or imperial system (either British or American)?
    ‣ What type of rounding is used in that locale?
    ‣ Optional: custom number and currency pattern to overwrite any default rules
    ‣ The preferred timezone (user based, not L10n based)
    ‣ Direction of text
  71. Fonts
    ‣ Easily overlooked, yet very important
    ‣ Web-safe fonts are generally safe to use
    ‣ Don't forget to test multibyte characters
      ‣ 2 bytes: ñÖÑú - ӬģĽ
      ‣ 3 bytes: 佸ਁ - —၍₶
      ‣ 4+bytes: 韴韵韶 - 
    ‣ Example: Mamá vive en Föllinge en el bosque del Ñañdú.¿Enredado? ¡Deberías! (SimSun-ExtB)
  72. JavaScript considerations
    ‣ Always use native Date() object
    ‣ Has support for timezones
    ‣ Native i18n support in JavaScript is limited to formatting (the ECMAScript Intl API); there is no native support for translations
    ‣ http://i18next.com is able to save the day!
  73. Finally: Who am I? Want to know more?
    My name is Camilo Sperberg
    Tweet me @unreal4u
    Email me@unreal4u.com or telegram.me/unreal4u
  74. Finally: Who am I?
    ‣ Blog: http://blog.unreal4u.com/ (Spanish)
    ‣ Slides will be available for download at:
    ‣ https://speakerdeck.com/unreal4u
  75. Thanks!

  76. Nice reads and more information
    ‣ https://github.com/triplepoint/php-units-of-measure
    ‣ https://github.com/unreal4u/localization
    ‣ http://www.w3.org/International/articles/language-tags/
    ‣ http://php.net/manual/en/book.intl.php
    ‣ http://www.sitepoint.com/localizing-php-applications-1/
    ‣ http://www.utf8-chartable.de/
    ‣ http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/