Slide 1

Slide 1 text

Internationalization and Localization: The Basics®
NijmegenPHP, 2015-09-16
Camilo Sperberg / @unreal4u

Slide 2

Slide 2 text

Disclaimer Don't be offended by anything said or shown in this talk!

Slide 3

Slide 3 text

What is internationalization?
Internationalization (i18n): Preparing your application to be localized (Pro-tip: How? Watch this talk!)
Localization (L10n): Translating, adding icons and other things specific to a certain region

Slide 4

Slide 4 text

Most discussed thing in RFCs
[Figure: "LOTR" versus "RFC i18n". The Baggins family tree (Balbo Baggins 1167 - 1258 and Berylla Boffin 1172 - 1265, Mungo Baggins 1207 - 1300 and Laura Grubb, Bungo Baggins 1246 - 1326 and Belladonna Took, Bilbo Baggins 1290 - ?, each "Son of" the previous) shown next to the chain of language tag RFCs, each of which "Replaces" the previous: RFC 1766 (Mar. 1995), RFC 3066 (Jan. 2001), RFC 4646 and RFC 4647 (Sep. 2006), RFC 5646 (Sep. 2009)]

Slide 5

Slide 5 text

i18n and PHP: needed extensions
‣ For gettext: php-gettext
  ‣ Needs gettext
  ‣ https://www.gnu.org/software/gettext/
‣ For intl: php-intl
  ‣ Needs ICU4C
  ‣ http://site.icu-project.org/
‣ mb_* functions: multibyte support

Slide 6

Slide 6 text

L10n: Definition
"A set of parameters that defines the user's language, country and any special variant preferences that the user wants to see in their user interface" (Wikipedia)
What it means: rules for a specific region
‣ Current code standard: RFC 5646 (it replaced RFC 4646 in Sep. 2009)
‣ Language in lowercase (ISO 639-1), hyphen, region in uppercase (ISO 3166-1 alpha-2)

Slide 7

Slide 7 text

L10n: Code standards
nl-NL: Dutch as SUISHE*® in the Netherlands
nl-BE: Dutch as SUISHE*® in Belgium
* Spoken, Used, Interpreted, Seen, Heard, Etc

Slide 8

Slide 8 text

L10n: Code standards
pt-BR: Portuguese as SUISHE*® in Brazil
pt-PT: Portuguese as SUISHE*® in Portugal
* Spoken, Used, Interpreted, Seen, Heard, Etc

Slide 9

Slide 9 text

L10n: Code standards
es-ES, es-CL, es-AR, es-PE, es-*: Spanish as SUISHE*® in Spain, Chile, Argentina, Peru, etc.
* Spoken, Used, Interpreted, Seen, Heard, Etc

Slide 10

Slide 10 text

L10n: Code standards
de-CH-1901: German as used in Switzerland, using the 1901 orthography variant

Slide 11

Slide 11 text

L10n: Code standards
sl-IT-nedis: Slovenian as used in Italy, Nadiza dialect

Slide 12

Slide 12 text

L10n: Code standards
hy-Latn-IT-arevela: Eastern Armenian written in Latin script, as used in Italy

Slide 13

Slide 13 text

L10n: Determining which to use
Introducing the \Locale object!
Best practice tips, in order of preference:
‣ Get it from your $_GET or $_POST
‣ Get it from the headers
‣ Get it based on the client's IP address

Slide 14

Slide 14 text

i18n: Detecting locale

public function getLocaleFromClient() {
    // An explicit request parameter wins, then the browser headers, then the IP address
    $this->locale = $this->getLocaleFromGetRequest();
    if (empty($this->locale)) {
        $this->locale = $this->getLocaleFromHeaders();
        if (empty($this->locale)) {
            $this->locale = $this->getLocaleFromIP();
        }
    }

    return $this->locale;
}

Slide 15

Slide 15 text

i18n: Detecting locale

public function getLocaleFromHeaders() {
    $this->locale = '';

    if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
        // Let ICU pick the best match from the Accept-Language header
        $preferredLocale = \Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
        $this->locale = $this->_checkLocale($preferredLocale);
    }

    return $this->locale;
}
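The _checkLocale() helper is called on the slide but not shown. A minimal sketch of what such a check might look like, assuming the application keeps a whitelist of supported locales; the function name checkLocale(), the SUPPORTED_LOCALES constant and its contents are hypothetical and not taken from the talk. It returns an empty string when nothing matches, so the fallback chain in getLocaleFromClient() keeps working.

```php
<?php
// Hypothetical whitelist; a real application would load this from its configuration.
const SUPPORTED_LOCALES = ['nl-NL', 'nl-BE', 'es-CL', 'en-GB'];

/**
 * Returns the supported locale matching $candidate, or '' if there is none.
 * Illustrative stand-in for the _checkLocale() helper, not the original code.
 */
function checkLocale(string $candidate): string
{
    // ICU-style identifiers such as "nl_NL" use underscores; normalize to hyphens
    $normalized = str_replace('_', '-', trim($candidate));
    if ($normalized === '') {
        return '';
    }

    // First pass: exact match, ignoring case
    foreach (SUPPORTED_LOCALES as $supported) {
        if (strcasecmp($supported, $normalized) === 0) {
            return $supported;
        }
    }

    // Second pass: match on language only, so "nl" or "nl-SR" maps to the first "nl-*" entry
    $language = strtolower(explode('-', $normalized)[0]);
    foreach (SUPPORTED_LOCALES as $supported) {
        if (strtolower(explode('-', $supported)[0]) === $language) {
            return $supported;
        }
    }

    return '';
}
```

Returning an empty string rather than a default locale is a design choice here: it lets the caller decide whether to try the next detection method.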

Slide 16

Slide 16 text

\Locale: cool functions!
Use                              To…
‣ \Locale::getDisplayLanguage()  Create a list of languages
‣ \Locale::getDisplayRegion()    Display the country where the locale is used
‣ \Locale::acceptFromHttp()      Parse the browser's Accept-Language header
‣ \Locale::setDefault()          Set the default locale to use

Slide 17

Slide 17 text

L10n: what to look out for?
‣ Warning! Locale is not just a translation
‣ What's normal for you can be strange to others
"Diversity is amazing, both in appearance and thoughts, please respect different opinions, agree to disagree and live in harmony." @michellesanver

Slide 18

Slide 18 text

Images and gestures
‣ Pointing fingers can be offensive in Arab countries
‣ The "peace" sign (✌) is offensive in Australia, Ireland, New Zealand, South Africa and the United Kingdom

Slide 19

Slide 19 text

Translations - Semantics (L10n)

echo $numberResults.' results found within a '.$range.'km range';

If translated directly into Spanish, it will sound like this:
➡ "Found 123 results within range 5km"

Slide 20

Slide 20 text

Translations - Semantics (i18n)
‣ Solvable by using printf() with numbered placeholders

printf(
    '%1$d results found within a %2$d km range',
    $numberResults,
    $range
);

‣ The translator decides where each variable is printed
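To illustrate why the numbered placeholders matter, here is a small sketch in which a translation reorders the two variables. The $translations array and the describeResults() function are hypothetical, and the Spanish sentence is only one possible wording; in practice the strings would come from gettext or a similar catalogue.

```php
<?php
// Hypothetical translation table, keyed by locale. Single quotes keep PHP from
// interpreting "$d" as a variable.
$translations = [
    'en-GB' => '%1$d results found within a %2$d km range',
    'es-CL' => 'En un radio de %2$d km se encontraron %1$d resultados',
];

/**
 * Fills the locale's template. The argument order in code never changes;
 * only the template decides where each value ends up.
 */
function describeResults(array $translations, string $locale, int $numberResults, int $range): string
{
    return sprintf($translations[$locale], $numberResults, $range);
}

echo describeResults($translations, 'en-GB', 123, 5), PHP_EOL;
echo describeResults($translations, 'es-CL', 123, 5), PHP_EOL;
```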

Slide 21

Slide 21 text

Translations - Semantics (L10n) 21 es-ES: Carro Coche es-CL: Auto

Slide 22

Slide 22 text

This is a “carro” or “coche” in es-CL:

Slide 23

Slide 23 text

Translations - Semantics (L10n)
British             American
Holiday             Vacation
Football            Soccer
American Football   Football
Flat                Apartment
Garden              Yard
Rubbish             Garbage / Trash

Slide 24

Slide 24 text

Translations - Semantics
Interactive question! What's wrong with the following list?
‣ Smart Phones
‣ Mobile phones
‣ Barcode Scanners
‣ Control Remotes
‣ See
‣ Printers

Slide 25

Slide 25 text

Translations - Semantics
Captain Obvious here: "Watch" was translated as the verb instead of the noun

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Translations - Plural forms (L10n)
Number     English   Polish
0          Apples    Jabłek
1          Apple     Jabłko
2 .. 4     Apples    Jabłka
5 .. 21    Apples    Jabłek
22 .. 24   Apples    Jabłka
25 .. 31   Apples    Jabłek
More complex cases do exist!
‣ Slovenian: 4 plural forms
There are also languages with a single form for every number
‣ Japanese
‣ Vietnamese
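The Polish column above follows a rule that can be written down explicitly. The sketch below uses the rule commonly given in gettext Plural-Forms headers for Polish; the function names polishPluralIndex() and polishApples() are made up for this illustration, and a real application would leave this to gettext or \MessageFormatter.

```php
<?php
/**
 * Returns which of the three Polish forms applies to $n:
 * 0 = singular, 1 = "few" (2-4, 22-24, ... but not 12-14), 2 = "many" (everything else).
 */
function polishPluralIndex(int $n): int
{
    if ($n === 1) {
        return 0;
    }

    $lastDigit = $n % 10;
    $lastTwoDigits = $n % 100;
    if ($lastDigit >= 2 && $lastDigit <= 4 && ($lastTwoDigits < 10 || $lastTwoDigits >= 20)) {
        return 1;
    }

    return 2;
}

function polishApples(int $n): string
{
    $forms = ['Jabłko', 'Jabłka', 'Jabłek'];

    return $n . ' ' . $forms[polishPluralIndex($n)];
}

foreach ([0, 1, 3, 12, 21, 22, 25] as $count) {
    echo polishApples($count), PHP_EOL;
}
```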

Slide 28

Slide 28 text

Translations - Plural forms (i18n)
Gettext!
‣ Supports plural forms
‣ Is cached in RAM (pros and cons)
‣ Very easy to edit (poEdit)
‣ Can be separated into modules
‣ Produces compiled language files

Slide 29

Slide 29 text

Translations - Plural forms (i18n)
‣ \MessageFormatter can also help
‣ Can do pretty amazing stuff
‣ I personally don't have experience with it

$fmt = new \MessageFormatter(
    'en_GB',
    'Peter has {0, plural, =0{no cat} =1{a cat} other{# cats}}'
);
echo $fmt->format(array(0)), PHP_EOL;

$fmt = new \MessageFormatter(
    'nl_NL',
    'Peter heeft {0, plural, =0{geen kat} =1{een kat} other{# katten}}'
);
echo $fmt->format(array(0)), PHP_EOL;

// Outputs:
// Peter has no cat
// Peter heeft geen kat

Slide 30

Slide 30 text

Number formatting - L10n
‣ Numbers have a lot of different notations
‣ Corollary: nobody really knows how a number should be formatted
Interactive question! How do you format the following negative number, in Euros, here in the Netherlands? 1234,57

Slide 31

Slide 31 text

Number formatting - L10n
           nl-NL           fr-FR           pt-BR           hi-IN           ps-AR
Decimal    "-1.234,57"     "-1 234,57"     "-1.234,57"     "-१,२३४.५७"     "-۱٬۲۳۴٫۵۷"
Currency   "€ 1.234,57-"   "-1 234,57 €"   "(€1.234,57)"   "-€ १,२३४.५७"   "-۱٬۲۳۴٫۵۷ €"
Percent    "25%"           "25 %"          "25%"           "२५%"           "۲۵٪"

Slide 32

Slide 32 text

Number formatting: i18n

$locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR',];

foreach ($locales as $myLocale) {
    $numberFormatter = new \NumberFormatter($myLocale, \NumberFormatter::DECIMAL);
    $percentFormatter = new \NumberFormatter($myLocale, \NumberFormatter::PERCENT);
    $currencyFormatter = new \NumberFormatter($myLocale, \NumberFormatter::CURRENCY);
    printf('Locale: %s'.PHP_EOL, $myLocale);
    printf('[DEC]-1.234,57: "%s" :: ', $numberFormatter->format(-1234.57));
    printf('[PER]25%%: "%s" :: ', $percentFormatter->format(0.25));
    // The second argument is an ISO 4217 currency code. To print in the currency of the loaded
    // locale instead, pass $currencyFormatter->getTextAttribute(\NumberFormatter::CURRENCY_CODE)
    printf('[CUR]1.234,57-: "%s"'.PHP_EOL, $currencyFormatter->formatCurrency(-1234.57, 'EUR'));
}

Slide 33

Slide 33 text

Number formatting: i18n (same code as on slide 32)

Slide 34

Slide 34 text

unreal4u-MBP:localization unreal4u$ php numbers.php
Locale: nl-NL
[DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "€ 1.234,57-"
All this in Dutch AKA Nederlands
--------------------------------------------------------------------------------
Locale: fr-FR
[DEC]-1.234,57: "-1 234,57" :: [PER]25%: "25 %" :: [CUR]1.234,57-: "-1 234,57 €"
All this in French AKA français
--------------------------------------------------------------------------------
Locale: pt-BR
[DEC]-1.234,57: "-1.234,57" :: [PER]25%: "25%" :: [CUR]1.234,57-: "(€1.234,57)"
All this in Portuguese AKA português
--------------------------------------------------------------------------------
Locale: hi-IN
[DEC]-1.234,57: "-१,२३४.५७" :: [PER]25%: "२५%" :: [CUR]1.234,57-: "-€ १,२३४.५७"
All this in Hindi AKA हिन्दी
--------------------------------------------------------------------------------
Locale: ps-AR
[DEC]-1.234,57: "-۱٬۲۳۴٫۵۷" :: [PER]25%: "۲۵٪" :: [CUR]1.234,57-: "-۱٬۲۳۴٫۵۷ €"
All this in Pashto AKA پښتو
--------------------------------------------------------------------------------

Slide 35

Slide 35 text

Some \NumberFormatter problems
‣ \NumberFormatter::DURATION isn't implemented in many locales
  ‣ echo $fmt->format(12345) -> 3 hours, 25 minutes, 45 seconds
  ‣ "Easy" to implement using getPattern() and setPattern()
‣ Documentation exists, but is not optimal

Slide 36

Slide 36 text

Date and time formatting - L10n
‣ Dates have 3 different notations (number of people using each, in millions)
  ‣ YYYY-MM-DD (1.660M)
  ‣ DD-MM-YYYY (4.810M)
  ‣ MM-DD-YYYY (320M)
  ‣ "It's complicated" (457M)
‣ Contrary to numbers, everybody knows how a date is formatted
Interactive question! What is the value of the following date? 05-03-13
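The interactive question has no single answer, which a few lines of PHP can show. This sketch parses the same string under each of the three notations; the function name interpretDate() and the labels are invented for the example.

```php
<?php
/**
 * Parses $input as year-month-day, day-month-year and month-day-year
 * and returns every reading as an unambiguous ISO 8601 date.
 */
function interpretDate(string $input): array
{
    // The leading "!" resets all fields that are not parsed (time of day, etc.) to zero
    $formats = ['YMD' => '!y-m-d', 'DMY' => '!d-m-y', 'MDY' => '!m-d-y'];
    $utc = new DateTimeZone('UTC');

    $readings = [];
    foreach ($formats as $label => $format) {
        $date = DateTimeImmutable::createFromFormat($format, $input, $utc);
        if ($date !== false) {
            $readings[$label] = $date->format('Y-m-d');
        }
    }

    return $readings;
}

print_r(interpretDate('05-03-13'));
```

The three readings are 13 March 2005, 5 March 2013 and 3 May 2013, so a date shown to users should always be formatted for their locale rather than passed around as a bare string.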

Slide 37

Slide 37 text

Date and time formatting - L10n
➡ YMD (1660)
➡ YMD and DMY (287)
➡ DMY (3295)
➡ DMY and MDY (130)
➡ MDY (320)
➡ YMD and DMY and MDY (40)
https://en.wikipedia.org/wiki/Date_format_by_country

Slide 38

Slide 38 text

Date and time formatting: i18n
‣ Use PHP's \*Date* related classes, ALWAYS!
‣ Versatile and powerful functions
‣ Especially in combination with locales
‣ Always work in UTC, let the \*Date* classes do the rest

Slide 39

Slide 39 text

Date and time formatting: i18n
            nl-NL                   fr-FR                    hi-IN                          ps-AR
Short       "23-05-15"              "23/05/15"               "२३-५-१५"                      "۲۰۱۵/۵/۲۳"
Medium      "23 mei 2015"           "23 mai 2015"            "२३-०५-२०१५"                   "۲۳ ۲۰۱۵ یم"
With time   "23 mei 2015 01:34:09"  "23 mai 2015 01:34:09"   "२३-०५-२०१५ १:३४:०९ पूर्वाह्न"   "۲۳ ۱:۳۴:۰۹ ۲۰۱۵ یم"

Slide 40

Slide 40 text

Date and time formatting: i18n

$locales = ['nl-NL', 'fr-FR', 'pt-BR', 'hi-IN', 'ps-AR',];

$printDate = new \DateTime('23-05-2015 01:34:09', new \DateTimeZone('UTC'));

foreach ($locales as $myLocale) {
    // Arguments: locale, date style, time style
    $shortDateObject = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::SHORT, \IntlDateFormatter::SHORT);
    $mediumDateObject = \IntlDateFormatter::create($myLocale, \IntlDateFormatter::MEDIUM, \IntlDateFormatter::MEDIUM);
    printf(
        'Locale: %s, Short: "%s", Medium "%s"'.PHP_EOL,
        $myLocale,
        $shortDateObject->format($printDate),
        $mediumDateObject->format($printDate)
    );
}

Slide 41

Slide 41 text

i18n/L10n and OS
‣ Variety in L10n is almost infinite
‣ Automating i18n and L10n is better than doing it by hand
‣ The operating system plays an important role
‣ Why reinvent a very very very complicated wheel if it already exists?

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Timezones - L10n
‣ PHP has full support for timezones
‣ 39 (40?) official timezones
‣ Multiple timezones in one locale
  ‣ nl-NL: Europe/Amsterdam
  ‣ es-CL: America/Santiago and Pacific/Easter
  ‣ en-US: 4 timezones in the contiguous states alone
  ‣ ru-RU: 11 timezones
Interactive question! What time is it now in Seoul (ko-KR)?

Slide 44

Slide 44 text

Timezones: i18n
Caution! Calls to the ICU library can get pretty expensive!
‣ With a known locale, get all timezones
‣ If there's only one, instantiate \DateTimeZone
‣ More than 1? Get the precise timezoneId and DST settings (cache them!)
‣ Now calculate the offset of a timezone for the view
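The first and last steps above can be sketched with PHP's core date classes alone. The function name timezoneOffsetsForRegion() is hypothetical; the region code is assumed to be a valid ISO 3166-1 alpha-2 code, such as the one \Locale::getRegion() returns.

```php
<?php
/**
 * Lists every timezone identifier of a region together with its UTC offset
 * (in seconds) at the given moment.
 */
function timezoneOffsetsForRegion(string $region, DateTimeInterface $moment): array
{
    $offsets = [];
    $candidates = DateTimeZone::listIdentifiers(DateTimeZone::PER_COUNTRY, $region);
    foreach ($candidates as $timezoneId) {
        // getOffset() takes DST into account for the moment that is passed in
        $offsets[$timezoneId] = (new DateTimeZone($timezoneId))->getOffset($moment);
    }

    return $offsets;
}

$moment = new DateTimeImmutable('2015-05-23 21:34:09', new DateTimeZone('UTC'));
print_r(timezoneOffsetsForRegion('KR', $moment)); // a single candidate
print_r(timezoneOffsetsForRegion('CL', $moment)); // several candidates
```

A region with one candidate can be resolved automatically; with several, the user has to choose, and the choice is worth caching as the slide suggests.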

Slide 45

Slide 45 text

Timezones: i18n - Check validity

public function isValidTimeZone($timeZoneName='') {
    if (!is_string($timeZoneName)) {
        $timeZoneName = '';
    }

    try {
        // The constructor throws when the identifier is unknown
        new \DateTimeZone($timeZoneName);
        return true;
    } catch (\Exception $e) {
        return false;
    }
}

Slide 46

Slide 46 text

Timezones: i18n - Get timezone candidates

/**
 * $region is defined as \Locale::getRegion($currentLocale)
 */
private function _setTimezoneCandidates($region='') {
    if (!empty($region)) {
        $this->_timezoneCandidates = \DateTimeZone::listIdentifiers(\DateTimeZone::PER_COUNTRY, $region);
        // Only one candidate: no need to ask, use it directly
        if (!empty($this->_timezoneCandidates) && count($this->_timezoneCandidates) == 1) {
            $this->setTimezone($this->_timezoneCandidates[0]);
        }
    }
}

Slide 47

Slide 47 text

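Timezones: i18n - Set timezone

public function setTimezone($timeZoneName='UTC') {
    if (!$this->isValidTimeZone($timeZoneName)) {
        $timeZoneName = 'UTC';
    }

    $this->timezone = new \DateTimeZone($timeZoneName);
    $this->timezoneId = $this->timezone->getName();
    // Without arguments getTransitions() returns the whole history, so index 0 would be
    // the oldest entry. Limiting the range to "now" makes index 0 describe the current state.
    $now = time();
    $transitions = $this->timezone->getTransitions($now, $now);
    $this->timezoneInDST = $transitions[0]['isdst'];

    return $this->timezoneId;
}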

Slide 48

Slide 48 text

Timezones: i18n - Display

$theDate = new \DateTime('23-05-2015 21:34:09', new \DateTimeZone('UTC'));
$dateObject = \IntlDateFormatter::create(
    'ko-KR', // $this->_currentLocale
    \IntlDateFormatter::MEDIUM,
    \IntlDateFormatter::MEDIUM,
    $this->timezoneId // Asia/Seoul, as stored by setTimezone()
);
echo $dateObject->format($theDate);

Slide 49

Slide 49 text

What's the time in Seoul then?
Result? UTC 23-05-2015 21:34:09 is 2015. 5. 24. 오전 6:34:09 in ko-KR (Offset: +9 hours)

Slide 50

Slide 50 text

Encoding and charsets - L10n
‣ Difficult, often misunderstood subject
‣ Difficult to debug
‣ First step of debugging is knowing what encoding you are working with
‣ Convert to an appropriate charset with iconv()
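A minimal sketch of the last two bullets, assuming the only two encodings in play are UTF-8 and one known legacy charset. The function name toUtf8() and the ISO-8859-1 default are assumptions made for this example; reliably guessing an unknown encoding is a much harder problem.

```php
<?php
/**
 * Returns $input as UTF-8. Input that is already valid UTF-8 is left alone;
 * anything else is assumed to be in $assumedCharset and converted with iconv().
 */
function toUtf8(string $input, string $assumedCharset = 'ISO-8859-1'): string
{
    // With the /u modifier PCRE refuses invalid UTF-8, so a successful
    // match of the empty pattern means the whole string is valid UTF-8
    if (preg_match('//u', $input) === 1) {
        return $input;
    }

    $converted = iconv($assumedCharset, 'UTF-8', $input);
    if ($converted === false) {
        throw new InvalidArgumentException('Input is not valid ' . $assumedCharset);
    }

    return $converted;
}

// "\xF8" is the letter ø in ISO-8859-1, but an invalid byte in UTF-8
echo bin2hex(toUtf8("\xF8l")), PHP_EOL;
```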

Slide 51

Slide 51 text

Encoding in PHP
‣ Internal work always in UTF-8, EVERYWHERE
‣ Include some basic stuff so that PHP also knows that it has to work in UTF-8
‣ Don't forget to tell the browser as well

mb_internal_encoding('UTF-8');
header('Content-Type: text/html; charset=UTF-8');

‣ Lots of small things to consider, but they vary from case to case

Slide 52

Slide 52 text

Encoding in PHP: mails
Caution with the imap extension! It has some problems with UTF-7
Always encode the "To" (CC, BCC) and "Subject" fields
Code                                                                  Output
"=?utf-8?B?5L2p5ae/?= "                                               佩姿
"=?iso-8859-1?Q?B=F8lla?=, med =?iso-8859-1?Q?=F8l?= i baggen "       Bølla , med øl i baggen
"=?utf-7?Q?Petra_M+APw-ller?="                                        Petra Müller
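The first row of the table is an RFC 2047 "encoded-word" using Base64. A minimal sketch of how such a value can be produced with core PHP; the function name encodeHeaderWord() is hypothetical, and the sketch ignores the length limit of 75 characters per encoded-word, so long subjects would still need to be split.

```php
<?php
/**
 * Wraps UTF-8 text in a single Base64 encoded-word (=?charset?B?data?=)
 * so that it can be used in a mail header such as Subject.
 */
function encodeHeaderWord(string $utf8Text): string
{
    return '=?UTF-8?B?' . base64_encode($utf8Text) . '?=';
}

// U+4F69 U+59FF are the two characters from the first table row
echo encodeHeaderWord("\u{4F69}\u{59FF}"), PHP_EOL;
```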

Slide 53

Slide 53 text

Encoding in PHP: mails
‣ Buggy functions
  ‣ imap_rfc822_parse_adrlist()
  ‣ imap_mime_header_decode()
  ‣ Others?
‣ Check out https://github.com/unreal4u/string-operations/ for replacement functions

Slide 54

Slide 54 text

Databases and L10n / i18n

Slide 55

Slide 55 text

But before we begin…

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

Database and encodings/charsets
CHARSET
COLLATION

Slide 58

Slide 58 text

Practical use of charset
Rule of thumb: pick the narrowest charset that fits the input
md5/SHA1-like strings should be ASCII-encoded (Why? It helps the DB engine predict its memory allocation better)
Interactive question! What charset should be used to save the following string? f5d39e997c5d7e4e2a3ef49973f61fb2

Slide 59

Slide 59 text

Practical use of charset

CREATE TABLE `t1` (
    `md5HashCalculation` CHAR(32) CHARSET ASCII COLLATE ascii_bin DEFAULT NULL
);

Slide 60

Slide 60 text

Differences between TEXT and [VAR]CHAR
‣ [VAR]CHAR(255) holds up to 255 characters
‣ TINYTEXT can hold up to 255 bytes
‣ UTF-8 characters can take up to 4 bytes (MySQL's utf8 charset stores at most 3 per character, utf8mb4 all 4)
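The difference between characters and bytes is easy to check from PHP. In this sketch the function name countCharacters() is made up, and "character" means Unicode code point; strlen() always counts bytes.

```php
<?php
/**
 * Counts the Unicode code points in a UTF-8 string without needing mbstring.
 */
function countCharacters(string $utf8Text): int
{
    // With /u the dot matches one whole code point; /s lets it match newlines too
    return (int) preg_match_all('/./su', $utf8Text);
}

// U+00F1 (ñ), U+6F22 (漢) and U+1F600 (an emoji) take 2, 3 and 4 bytes in UTF-8
foreach (["\u{F1}", "\u{6F22}", "\u{1F600}"] as $sample) {
    printf('%d character, %d bytes' . PHP_EOL, countCharacters($sample), strlen($sample));
}
```

Following the limits listed above, a value of 255 two-byte characters fits a VARCHAR(255) column but needs 510 bytes, twice what TINYTEXT can hold.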

Slide 61

Slide 61 text

Indexes and charsets
When working with Unicode characters, performance can be indirectly and negatively impacted
‣ Too big (and complex) a topic for now
‣ Use EXPLAIN to understand the underlying decisions of MySQL (in some cases)
‣ Don't bother with micro-optimization either

Slide 62

Slide 62 text

COLLATION
‣ Used to order data in a "natural" way
‣ Different languages have different rules

CREATE TABLE `spanishCollation` (
    `name01` VARCHAR(15) COLLATE utf8_spanish_ci,
    `name02` VARCHAR(15) COLLATE utf8_spanish2_ci
) DEFAULT CHARSET utf8;

Slide 63

Slide 63 text

Some notes on Collation
"*_ci" stands for case-insensitive
Watch out with utf8_general_ci and utf8_unicode_ci!
➡ utf8_general_ci has some problems with Hebrew and some Cyrillic characters
➡ utf8_general_ci is generally faster (7~12%)
➡ But utf8_unicode_ci is more compatible

Slide 64

Slide 64 text

Collation and performance
‣ Performance penalty: ordering in a collation other than the column's
  ‣ MySQL will have to do a filesort
  ‣ Which is MySQL's way of saying "quicksort"
‣ [Partial] keys can help avoid this quicksort operation

Slide 65

Slide 65 text

General database localization
‣ Not recommended: translation on database level
‣ If absolutely needed, investigate the EAV model
  ‣ PRO: Quick, simple and cheap
  ‣ CON: Queries may become complex

Slide 66

Slide 66 text

Your own L10n database
‣ Does the locale use the metric or imperial system (either British or American)?
‣ What type of rounding is used in that locale?
‣ Optional: custom number and currency pattern to overwrite any default rules
‣ The preferred timezone (user based, not L10n based)
‣ Direction of text

Slide 67

Slide 67 text

Fonts
‣ Easily overlooked, yet very important
‣ Web-safe fonts are generally safe to use
‣ Don't forget to test multibyte characters
  ‣ 2 bytes: ñÖÑú - ӬģĽ
  ‣ 3 bytes: 漢字 - ♥၍₶
  ‣ 4+bytes: -
‣ Example: Mamá vive en Föllinge en el bosque del Ñañdú.¿Enredado? ¡Deberías! (SimSun-ExtB)

Slide 68

Slide 68 text

JavaScript considerations
‣ Always use the native Date() object
‣ It works in both local time and UTC
‣ Limited native support for i18n in JavaScript (the Intl API formats numbers and dates, but does not handle translations)
‣ http://i18next.com is able to save the day!

Slide 69

Slide 69 text

Finally: Who am I? Want to know more?
My name is Camilo Sperberg
http://twitter.com/unreal4u

Slide 70

Slide 70 text

Finally: Who am I?
‣ Blog: http://blog.unreal4u.com/ (Spanish)
‣ Rate and comment: https://joind.in/talk/view/15219
  ‣ Please, it's the only way this talk (and others) can be improved
‣ Slides are ready to be downloaded:
  ‣ http://unreal4u.com/talks/
  ‣ https://speakerdeck.com/unreal4u

Slide 71

Slide 71 text

Thanks!

Slide 72

Slide 72 text

Nice reads and more information
‣ https://github.com/triplepoint/php-units-of-measure
‣ https://github.com/unreal4u/localization
‣ http://www.w3.org/International/articles/language-tags/
‣ http://php.net/manual/en/book.intl.php
‣ http://www.sitepoint.com/localizing-php-applications-1/
‣ http://www.utf8-chartable.de/