intl me this, intl me that Andrei Zmievski
 AppDynamics PHP UK ~ February 22, 2014 ~ London

me • Software architect at AppDynamics • PHP Core contributor (1999-2010) • Architect of the Unicode/i18n in PHP 6 • Twitter: @a • Beer lover (and brewer)

unicode 7

terms • Internationalization (i18n) • to design and develop an application without built-in cultural assumptions that is efficient to localize • Localization (l10n) • to tailor an application to meet the needs of a particular region, market, or culture

no assumptions • English/French/Chinese is just another language • Your country is just another country • Earth is just another planet (eventually)

why localize? • English speakers are now a minority on WWW • Nearly 3 out of 4 participants surveyed by Common Sense Advisory agreed that they were more likely to buy from sites in their own languages than in English • Global consumers will pay more for products with information in their language

locale • identifier referring to linguistic and cultural preferences of a user community • language • script • country • variant • @keywords • examples: sr_Latn_YU_REVISED@currency=USD, en_GB
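The parts of a locale identifier can be pulled apart with the intl Locale class. This is a minimal sketch, assuming the intl extension is loaded; the identifiers below are illustrative and are not the ones on the slide.

```php
<?php
// Split a locale identifier into its subtags.
$parts = Locale::parseLocale('sr_Latn_RS');
printf("language=%s script=%s region=%s\n",
    $parts['language'], $parts['script'], $parts['region']);

// Keywords after "@" are returned as a key => value array.
$keywords = Locale::getKeywords('de_DE@currency=EUR;collation=PHONEBOOK');
foreach ($keywords as $key => $value) {
    echo "$key = $value\n";
}
```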

locale data • Common Locale Data Repository (CLDR) • 740 locales: 238 languages and 259 territories • updated regularly

intl • available since PHP 5.3 • bundled locale data • formatters/parsers • collation (sorting) • calendars and timezones • boundary iteration • transliteration • resource bundles • character set conversion • spoof checking

API • OO and procedural API • Same underlying implementation
    collator_create()          new Collator()
    collator_set_strength()    $collator->setStrength()
    numfmt_format()            NumberFormatter::format()
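A small sketch of the two API styles side by side, assuming the intl extension is loaded. Both forms drive the same collator implementation, so the results should agree.

```php
<?php
// Object-oriented form.
$oo = new Collator('en_US');
$oo->setStrength(Collator::PRIMARY);
$ooResult = $oo->compare('a', 'A');

// Procedural form of the same calls.
$proc = collator_create('en_US');
collator_set_strength($proc, Collator::PRIMARY);
$procResult = collator_compare($proc, 'a', 'A');

// Case is ignored at primary strength, so both comparisons return 0.
var_dump($ooResult === $procResult);
```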

purpose • robust conversion between character encodings • replacement for mb_convert_encoding()

simple
    echo UConverter::transcode("Cura\xe7ao", "utf-8", "iso-8859-1");
    echo UConverter::transcode("Curaçao", "iso-8859-8", "utf-8");
    echo UConverter::transcode("Curaçao", "iso-8859-8", "utf-8",
                               array("to_subst" => "×"));
output:
    Curaçao
    Curaao
    Cura×ao

callbacks
    class MyConverter extends UConverter {
        public function fromUCallback($reason, $source, $codepoint, &$error) {
            if (($reason == UConverter::REASON_UNASSIGNED)
                && ($codepoint == 0x221A)) {
                // translate √ to sqrt
                $error = U_ZERO_ERROR;
                return 'square root of ';
            }
        }
    }
    $c = new MyConverter('ascii', 'utf-8');
    echo $c->convert("What is √2?");
output:
    What is square root of 2?

sorting • languages may sort more than one way • traditional vs. modern Spanish • Japanese stroke-radical vs. radical-stroke • German dictionary vs. phone book

collation levels
    primary      base characters
    secondary    accents and language quirks
    tertiary     case and variants of base forms
    quaternary   you will never use this
    identical    tie-breaker

collation levels • Each locale has default level setting • Differences in lower levels are ignored if higher levels are already different
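The effect of the levels can be seen by comparing the same pairs at increasing strength. This is a sketch assuming the intl extension and a UTF-8 source file; `sameAt()` is a hypothetical helper written for this example and is not part of intl.

```php
<?php
// Report whether two strings compare as equal at a given collation strength.
function sameAt(int $strength, string $a, string $b): bool
{
    $coll = new Collator('en_US');
    $coll->setStrength($strength);
    return $coll->compare($a, $b) === 0;
}

$levels = [
    'primary'   => Collator::PRIMARY,
    'secondary' => Collator::SECONDARY,
    'tertiary'  => Collator::TERTIARY,
];
foreach ($levels as $name => $strength) {
    printf("%-9s  role/rôle: %-9s  role/Role: %s\n",
        $name,
        sameAt($strength, 'role', 'rôle') ? 'same' : 'different',
        sameAt($strength, 'role', 'Role') ? 'same' : 'different');
}
```

Accents start to matter at the secondary level and case at the tertiary level, which is why the two pairs stop comparing as equal at different strengths.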

comparing strings
    $coll = new Collator("fr_FR");
    if ($coll->compare("côte", "coté") < 0) {
        echo "before";
    } else {
        echo "after";
    }
output:
    before (côte < coté)

strength control
    $coll = new Collator("fr_FR");
    $coll->setStrength(Collator::PRIMARY);
    if ($coll->compare("côte", "coté") == 0) {
        echo "same";
    } else {
        echo "different";
    }
output:
    same (côte = coté)

sorting strings
    $strings = array(
        "cote", "côte", "Côte", "coté",
        "Coté", "côté", "Côté", "coter");
    $coll = new Collator("fr_FR");
    $coll->sort($strings);
result:
    cote côte Côte coté Coté côté Côté coter

other attributes
    $coll = new Collator("en_US");
    $coll->setAttribute(Collator::CASE_FIRST, Collator::UPPER_FIRST);
    if ($coll->compare("abc", "ABC") < 0) {
        echo "before";
    } else {
        echo "after";
    }
output:
    after (ABC < abc)

numeric collation
    $strings = array("10", "1", "2");
    $coll = new Collator(null);
    // numeric collation is an attribute, set after the collator is created
    $coll->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
    $coll->sort($strings);
result:
    1 < 2 < 10

purpose • formats numbers as strings according to the locale, given pattern or set of rules • parses strings into numbers according to these patterns • replacement for number_format()
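A short sketch of the difference from number_format(), assuming the intl extension is loaded. The value and locales are chosen for illustration only.

```php
<?php
$value = 1234567.891;

// number_format() needs the separators spelled out by the caller.
echo number_format($value, 3, '.', ','), "\n";

// NumberFormatter takes the separators from the locale data.
$en = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
$de = new NumberFormatter('de_DE', NumberFormatter::DECIMAL);
$enText = $en->format($value);   // 1,234,567.891
$deText = $de->format($value);   // 1.234.567,891
echo $enText, "\n", $deText, "\n";

// Parsing reverses the formatting for the same locale.
$parsed = $de->parse($deText);
var_dump($parsed);
```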

formatter styles (1234.567 in fr_FR)
    NumberFormatter::PATTERN_DECIMAL    1234,567 (with ##.##)
    NumberFormatter::DECIMAL            1 234,56
    NumberFormatter::CURRENCY           1 234,57 €
    NumberFormatter::PERCENT            123 457 %

formatter styles (1234.567 in fr_FR)
    NumberFormatter::SCIENTIFIC    1,234567E3
    NumberFormatter::SPELLOUT      mille deux cent trente-quatre virgule cinq six sept
    NumberFormatter::ORDINAL       1 235e
    NumberFormatter::DURATION      1 235

formatting
    $fmt = new NumberFormatter('en_GB', NumberFormatter::DECIMAL);
    $fmt->format(1234);

    $fmt = new NumberFormatter('de_CH', NumberFormatter::CURRENCY);
    $fmt->formatCurrency(1234, 'CNY');
result:
    1,234
    CN¥ 1'234.00

parsing
    $fmt = new NumberFormatter('in_IN', NumberFormatter::DECIMAL);
    var_dump($fmt->parse('7.005.944', NumberFormatter::TYPE_INT32));
output:
    int(7005944)

purpose • produces concatenated messages in a language-neutral way • operates on patterns, which contain sub formats • program does not need to know the order of fragments
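The point about fragment order can be sketched as follows, assuming the intl extension and a UTF-8 source file. The two patterns are hypothetical translations written for this example; they place the same arguments in a different order while the calling code stays the same.

```php
<?php
// Hypothetical patterns: same arguments, different order per language.
$patterns = [
    'en_US' => '{0} has {1,number,integer} new messages.',
    'de_DE' => '{1,number,integer} neue Nachrichten für {0}.',
];
$args = ['Anna', 1234];

$out = [];
foreach ($patterns as $locale => $pattern) {
    // The calling code is identical for every locale.
    $out[$locale] = MessageFormatter::formatMessage($locale, $pattern, $args);
    echo $out[$locale], "\n";
}
```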

messages • goal: Today is February 22, 2014. • old way: echo "Today is ", date("F d, Y"); • intl way: pattern Today is {0,date}. with args array(time())

formatting
    $pattern = "On {0,date} you have {1,number} meetings.";
    $args = array(time(), 2);
    $fmt = new MessageFormatter("en_US", $pattern);
    echo $fmt->format($args);
output:
    On February 22, 2014 you have 2 meetings.

formatting
    $pattern = "On {0,date,short} your balance was {1,number,currency}.";
    $args = array(time(), 184.22);
    $fmt = new MessageFormatter("en_GB", $pattern);
    echo $fmt->format($args);
output:
    On 22/02/14 your balance was £184.22.

formatting
    $fr_pattern = "Aujourd'hui, {2,date,dd MMMM}, il y a {0,number} personnes sur {1}.";
    $fr_args = array(7213518802, "la Terre", time());

    $msg = new MessageFormatter("fr_FR", $fr_pattern);
    echo $msg->format($fr_args);
output:
    Aujourd'hui, 22 février, il y a 7 213 518 802 personnes sur la Terre.

parsing messages
    $pattern = "On {0,date} you have {1,number} meetings.";
    $text = "On February 22, 2014 you have 33 meetings.";
    $msg = new MessageFormatter("en_US", $pattern);
    var_dump($msg->parse($text));
output:
    array(2) {
      [0]=> int(1393056000)
      [1]=> int(33)
    }

plural selection
    $pattern = "There {0,plural, =0{are no results} =1{is # result} other{are # results}} found.";
    $fmt = new MessageFormatter("en_GB", $pattern);
    echo $fmt->format(array(0));
    echo $fmt->format(array(12));
output:
    There are no results found.
    There are 12 results found.

Break Iterators

purpose • locate linguistic boundaries • supported units • characters • words • lines • sentences • more complex ones are possible with custom rules
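Word boundaries are one of the supported units. A minimal sketch, assuming the intl extension is loaded; the sample text is illustrative. The iterator returns every segment between boundaries, including spaces and punctuation, so the example filters those out with a regular expression.

```php
<?php
$text = 'Hello, world!';

$bi = IntlBreakIterator::createWordInstance('en');
$bi->setText($text);

// Collect every segment between two word boundaries.
$parts = [];
foreach ($bi->getPartsIterator() as $part) {
    $parts[] = $part;
}

// Keep only the segments that contain a letter or a digit.
$words = array_values(array_filter($parts, function ($p) {
    return preg_match('/[\pL\pN]/u', $p) === 1;
}));
print_r($words);
```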

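sentences
    $text = <<<TEXT
    She asked, “Are you from U.K.?” John Smith Sr. nodded.
    TEXT;
    $bi = IntlBreakIterator::createSentenceInstance("en");
    $bi->setText($text);
    foreach ($bi->getPartsIterator() as $part)
        echo "** ", $part, "\n";
output:
    ** She asked, “Are you from U.K.?”
    ** John Smith Sr. nodded.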

lines
    $lineBI = IntlBreakIterator::createLineInstance("en");
    $lineBI->setText($text);
    foreach ($lineBI->getPartsIterator() as $part)
        echo $part, "\n";
output:
    She
    asked,
    "Are
    you
    from
    U.K.?"
    John
    Waxby
    Sr.
    nodded.

lines
    $offset = 39;
    $lineBI->first();
    echo substr($text, 0, $lineBI->next()), "*";
    echo substr($text, 0, $lineBI->next()), "*";
    echo substr($text, 0, $lineBI->preceding($offset)), "*";
output:
    She *
    She asked, *
    She asked, "Are you from U.K.?" John *

Resource Bundles

purpose • contain resources for localization • messages, labels, formatting patterns, etc • accessed via locale-independent interface • fallback mechanism is key

data hierarchy
    root        root
    language    en  es  ja  zh
    script      Hans  Hant
    country     US  ES  MX  JP  CN  HK
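The hierarchy above implies a lookup order from the most specific locale down to root. The sketch below only illustrates that order in plain PHP; `fallbackChain()` is a hypothetical helper written for this example and is not how ICU implements fallback internally.

```php
<?php
// Build the lookup order for a locale by dropping one subtag at a time.
function fallbackChain(string $locale): array
{
    $chain = [];
    $parts = explode('_', $locale);
    while (count($parts) > 0) {
        $chain[] = implode('_', $parts);
        array_pop($parts);
    }
    $chain[] = 'root';   // root is always the last resort
    return $chain;
}

echo implode(' -> ', fallbackChain('zh_Hant_HK')), "\n";
// zh_Hant_HK -> zh_Hant -> zh -> root
```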

data format • simple resources • string, integer, binary data, integer array • complex resources • arrays and tables

root.txt
    root {
        version:string { "1.0.0" }

        mainTitle { "Welcome to our store!" }
        errors:array {
            :string { "Website is experiencing difficulties" }
            :string { "Maximum of {0,number,integer} "
                      "products are allowed" }
        }
        sizes:intvector { 10, 100 }
    }

en_GB.txt
    en_GB {
        version { "1.0.1" }

        mainTitle:string { "Welcome to our olde shoppe!" }
        sizes:intvector { 25, 250 }
    }

compiling
    % mkdir myres
    % genrb -d myres root.txt en.txt en_GB.txt
    genrb number of files: 3
    % ls myres
    en.res en_GB.res root.res

retrieval • root
    $bundle = dirname(__FILE__).'/myres';
    $r = ResourceBundle::create('root', $bundle);
    echo $r['mainTitle'];
    echo $r['errors'][1];
    print_r($r['sizes']);
output:
    Welcome to our store!
    Maximum of {0,number,integer} products are allowed
    Array ( [0] => 10 [1] => 100 )

retrieval • en_GB
    $bundle = dirname(__FILE__).'/myres';
    $r = ResourceBundle::create('en_GB', $bundle);
    echo $r['mainTitle'];
    echo $r['errors'][1];
    print_r($r['sizes']);
output:
    Welcome to our olde shoppe!
    Maximum of {0,number,integer} products are allowed
    Array ( [0] => 25 [1] => 250 )

retrieval • de
    $bundle = dirname(__FILE__).'/myres';
    $r = ResourceBundle::create('de', $bundle);
    echo $r['mainTitle'];
    echo $r['errors'][1];
    print_r($r['sizes']);
output:
    Welcome to our store!
    Maximum of {0,number,integer} products are allowed
    Array ( [0] => 10 [1] => 100 )

Spoof Checking

You received a large payment. Click here to receive:

purpose • prevent certain classes of security attacks • check identifiers (typically URLs) for visual confusion • single script • mixed script • whole script
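A minimal sketch of a confusability check, assuming the intl extension and PHP 7 or later for the `\u{...}` escape. The two strings are illustrative; they look alike on screen but differ in one code point.

```php
<?php
// Illustrative strings chosen for this example.
$latin = 'paypal';
$mixed = "p\u{0430}ypal";   // second letter is CYRILLIC SMALL LETTER A (U+0430)

$spoof = new Spoofchecker();

// The byte sequences differ, so a plain comparison sees two distinct strings.
var_dump($latin === $mixed);

// The spoof checker compares what the strings look like instead.
$confusable = $spoof->areConfusable($latin, $mixed);
var_dump($confusable);
```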

single script
    $url1 = "";
    $url2 = "";

    $spoof = new SpoofChecker();
    if ($spoof->areConfusable($url1, $url2))
        echo "$url1 and $url2 are confusable\n";
output:
    and are confusable

mixed script
    $url1 = "";
    $url2 = "yahо";

    $spoof = new SpoofChecker();
    if ($spoof->areConfusable($url1, $url2))
        echo "$url1 and $url2 are confusable\n";
output:
    and yahо are confusable

suspicious
    $word = "Норе";
    $spoof->setAllowedLocales("en_US");
    if ($spoof->isSuspicious($word))
        echo "$word is suspicious in en_US";
    else
        echo "not suspicious";
output:
    Норе is suspicious in en_US

suspicious
    $word = "Норе";
    $spoof->setAllowedLocales("en_US,ru_RU");
    if ($spoof->isSuspicious($word))
        echo "$word is suspicious in en_US,ru_RU";
    else
        echo "not suspicious";
output:
    not suspicious

purpose • originally used for script transliteration • much more general transform mechanism, including: • case • normalization • full/half-width • hex/character names

transliteration IDs source-target/variant

transliteration IDs Any-target/variant

sample IDs • Katakana-Latin • Latin-ASCII • NFD • Any-Hex/XML
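The non-script transforms use the same interface as script conversion. A short sketch, assuming the intl extension and PHP 7 or later for the `\u{...}` escape; the inputs are illustrative.

```php
<?php
// Case mapping through the transform mechanism.
$upper = Transliterator::create('Any-Upper');
$shout = $upper->transliterate('intl');
echo $shout, "\n";   // INTL

// Normalization: decompose é (U+00E9) into e + combining acute (U+0301).
$nfd = Transliterator::create('NFD');
$decomposed = $nfd->transliterate("\u{00E9}");
var_dump($decomposed === "e\u{0301}");

// Hex escapes in the XML variant.
$hex = Transliterator::create('Any-Hex/XML');
echo $hex->transliterate("\u{00E9}"), "\n";
```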

script conversion
    $tr = Transliterator::create("Any-Latin");
    $sign = 'マックドナルド';
    echo $latin = $tr->transliterate($sign);
    $tr = Transliterator::create("Latin-Katakana");
    var_dump($tr->transliterate($latin) == $sign);
output:
    makkudonarudo

script conversion
    $tr = Transliterator::create("Cyrillic-Latin");
    echo $tr->transliterate('я в избушке сижу опять');

    $tr = Transliterator::create("Russian-Latin/BGN");
    echo $tr->transliterate('я в избушке сижу опять');
output:
    â v izbuške sižu opâtʹ
    ya v izbushke sizhu opyatʹ

Any-Name
    $tr = Transliterator::create("Any-Name");
    echo $tr->transliterate('я$');
output:
    \N{CYRILLIC SMALL LETTER YA}\N{DOLLAR SIGN}

Latin-ASCII
    $tr = Transliterator::create("Latin-ASCII");
    echo $tr->transliterate("© 1990 «PHP»");
output:
    (C) 1990 <<PHP>>

compound IDs
    $tr = Transliterator::create("Greek-Latin");
    echo $tr->transliterate("Αλφαβητικός Κατάλογος");

    $tr = Transliterator::create(
        "Greek-Latin; NFD; [:Nonspacing Mark:] Remove; NFC");
    echo $tr->transliterate("Αλφαβητικός Κατάλογος");
output:
    Alphabētikós Katálogos
    Alphabetikos Katalogos

rule-based transforms
    $rules = <<<'RULES'
    $space = ' ' ;
    $space {$space} > ;   # collapse multiple spaces
    '--' <> — ;           # convert fake dash into real one
    RULES;
    $tr = Transliterator::createFromRules($rules);
    echo $tr->transliterate("a very spacey -- and delimited -- remark");
output:
    a very spacey — and delimited — remark

спасибо thank you merci þakka þér ありがとう