Everything you always wanted to know about UTF-8 (but never dared to ask)

Disclaimer: <rant> As you never dared to ask, how am I supposed to know what you wanted to know ? Do you think I’m a mind-reader or something ? Anyways, my lawyer advised me against trying to mind-read, so I’m just going to guess and hope I get it right. Don’t come and complain afterwards that I didn’t tell you the things you wanted to know. At the end of the day: if you wanted to get your questions answered, you had better have the courage to ask them. </rant>
By: Juliette Reinders Folmer @jrf_nl

“Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.” Mark Pilgrim, April 2004

“Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.” J. Graham, April 2004

Some common misconceptions • Unicode !== UTF-8 • UTF-8 !==
internationalization • UTF-8 !== charset

Why worry about it anyway ? • Local is an
illusion, always think global: – Company/Client gets taken over by a foreign company – Mergers – Expansion to other regions – Local users/employees from other origins • Code efficiency • Cost Helgi Þormar Þorbjörnsson

Some language statistics
• 7105 ‘living’ languages
• +/- 308 languages with > 1 million speakers
• Nr 1 language in the world ? Mandarin Chinese
• Nr 2 ? Spanish
Did you know:
• That France has more than 9 officially recognized languages ? *
• That the country with the most languages is Papua New Guinea (837) ?
* Alsatian, Catalan, Corsican, Breton, French, Gallo, Occitan, Tahitian, some languages of New Caledonia
Source: Ethnologue 2013

Top 20 languages in the world *

Rank  Language                  Total speakers (M)  Script
1     Mandarin Chinese          1,197               Vernacular Chinese
2     Spanish                   414                 Roman
3     English                   335                 Roman
4     Hindi                     260                 Devanagari
5     Arabic (standard)         237                 Arabic
6     Portuguese                203                 Roman
7     Bengali                   193                 Bengali
8     Russian                   167                 Cyrillic
9     Japanese                  122                 Hiragana, Katakana, and Kanji
10    Javanese                  84                  Javanese
11    Lahnda (Western Punjabi)  83                  Lahnda, Arabic
12    German (standard)         78                  Roman
13    Korean                    77                  Korean (Hangul)
14    French                    75                  Roman
15    Telugu                    74                  Telugu
16    Marathi                   72                  Devanagari
17    Turkish                   71                  Roman
18    Tamil                     69                  Tamil
19    Vietnamese                68                  Roman
20    Urdu                      64                  Arabic

* Source: Ethnologue 2013/14

About writing systems • There are approximately 180 * writing systems in
active use • Most are used (with or without extensions) for several languages • Some languages use more than one writing system • Numerous other writing systems for ceremonial or religious use • Or for fun ;-) * Source: Omniglot

Distribution of writing systems Source: Wikipedia

There Ain't No Such Thing As Plain Text

On character sets and encoding • Charset: the character Ǻ (U+01FA in Unicode)
• Encoding: UTF-8 stores it as the bytes 11000111 10111010
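
A minimal sketch of what the slide above shows, assuming PHP 7 or later: the character Ǻ is code point U+01FA in the Unicode character set, and UTF-8 is the rule that turns that number into the bytes 11000111 10111010. The function names codepoint_to_utf8() and bytes_to_bits() are made up for this illustration; they are not part of PHP.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: show a byte string as groups of eight bits.
function bytes_to_bits(string $bytes): string
{
    $bits = [];
    foreach (str_split($bytes) as $byte) {
        $bits[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return implode(' ', $bits);
}

echo bytes_to_bits(codepoint_to_utf8(0x01FA)), "\n"; // 11000111 10111010
```

The same code point would come out as different bytes in UTF-16 or UTF-32, which is why the character set and the encoding are two separate things.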

Unicode Unicode is a computing industry standard for the consistent
encoding, representation and handling of text expressed in most of the world's writing systems. (Wikipedia) • Unicode Code charts: http://www.unicode.org/charts/

UTF • UTF = Unicode Transformation Format • UTF-8 is
one of the character encodings for implementing Unicode • Alternatives are UTF-7 (legacy), UTF-16, UTF-32 • UTF-8 is (backward) compatible with ASCII, UTF-16/32 are not. * Image source: W3C

Advantages of UTF-8 • Backward compatible with ASCII • UTF-8
can encode any Unicode character • XML requires UTF-8 or UTF-16 • UTF-8 and UTF-16 are the standards for having Unicode in HTML. UTF-8 is preferred. • Can be recognized fairly reliably, with only a small chance of confusion with other encodings. • Sorting UTF-8 strings as arrays of unsigned bytes results in the same order as sorting on Unicode code point.
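
The sorting claim in the last bullet can be checked with a handful of characters of different byte lengths. This is only a sketch, assuming PHP 7 or later for the "\u{...}" escape; the sample characters are arbitrary picks.

```php
<?php
// Code point => UTF-8 string. The "\u{...}" escape needs PHP 7.0 or later.
$chars = [
    0x1F600 => "\u{1F600}", // 4 bytes
    0x00E9  => "\u{00E9}",  // 2 bytes
    0x0041  => "A",         // 1 byte
    0x20AC  => "\u{20AC}",  // 3 bytes
    0x01FA  => "\u{01FA}",  // 2 bytes
];

// Order 1: sort on the numeric code point (the array key).
$byCodePoint = $chars;
ksort($byCodePoint, SORT_NUMERIC);
$byCodePoint = array_values($byCodePoint);

// Order 2: sort the encoded strings byte by byte.
// strcmp() is binary safe and compares bytes as unsigned values.
$byBytes = array_values($chars);
usort($byBytes, 'strcmp');

var_dump($byCodePoint === $byBytes); // bool(true)
```

Note that code point order is not the same as alphabetical order in any particular language; that is what collations are for.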

So, what’s the problem ? • Everything defaults to non-UTF-8
Mostly latin, ISO-8859-1 or US-ASCII So, what’s the solution ? • Be EXPLICIT everywhere (and I don’t mean $%&@-explicit)

We’ll be covering: •Dependency on user’s computer setup •Client side
code – HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues

User’s computer Potential issues: • Extended language support ? •
Code pages ? • Font ? • Browsers, browsers, browsers

Characteristics of text

Characteristic       English example         Arabic example
Sample text          “What I really love”    “ار ا يذ ا ا“
Language             English                 Arabic
Writing system       Roman/Latin             Arabic
Writing direction    Left to right           Right to left
Writing direction    Top to bottom           Top to bottom
Character (sub)set   Basic Latin             Arabic
Character encoding   UTF-16 (can vary)       UTF-16 (...)
Font                 Arial                   Arial
Meaning              What I really love      What I really love

Languages: •English •Greek •Ukrainian •Mandarin Chinese •Japanese •Hindi •Korean •Kannada
•Punjabi Gurmuki •Tamil •Tigre •Myanmar •Arabic •Farsi •Hebrew

About Fonts • Unicode versus non-unicode fonts Be aware &
be wary ! • Few fonts capable of handling a wide range of Unicode characters. Examples: Arial Unicode MS, Bitstream Cyberbit, Code2000, GNU Unifont

Fonts used: •Verdana •Verdana •Verdana •SimSun •MS Mincho •Code2000 •Batang
•Arial Unicode MS •Lohit Punjabi •Latha •GS GeezMahtemUnicode •WinInnwa •Arial Unicode MS •Arial •Arial Unicode MS

These are the same phrases converted to the Verdana font.
WinInnwa is a non-Unicode compliant font...

To stress the importance of Unicode and Unicode-compliant fonts:

These are the same phrases again, now converted to the
Arial Unicode MS font. The characters of the Ge’ez script (Ethiopic range) are not included in Arial Unicode MS.

Client side • Always declare the character encoding * Image
source: W3C

Client side – HTML
• Use meta-headers:
    <meta http-equiv="Charset" content="utf-8">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
• Tell the browser (and the search engines) the language too if you can:
    <html dir="[DIRECTION – ltr/rtl]" lang="[LANGCODE – eg de-DE or zh-cn for Mandarin Chinese]">
    <meta name="language" content="[LANGCODE]">
    <meta http-equiv="Content-Language" content="[LANGNAME - German]">

Client side – XHTML
• Add an XML declaration at the top:
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="[LANGCODE]" lang="[LANGCODE]">
• Be aware of quirksmode vs standards mode (IE 6)

Client side – CSS
• Add an encoding to the CSS file – must be on the very first line of the file!
    @charset "utf-8";
• Use Unicode compliant fonts and tell the browser which to use with CSS:
    <p lang="zh-cn">我很喜欢</p>
    P[LANG|="zh"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
    P[LANG="zh-cn"] { font-family: SimSun, "MS Song", "Adobe Song Std L", sans-serif; }
    P[LANG|="ar"] { font-family: Arial, "Arial Unicode MS", sans-serif; direction: rtl; }

Client-side - Javascript • Specify content-type AND charset headers! •
Can use Unicode escape sequences \uHHHH
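
A sketch of where those \uHHHH escapes can come from on the server side: when PHP hands a string to JavaScript, json_encode() by default writes every non-ASCII character as a \uHHHH escape, so the generated script stays plain ASCII whatever charset header ends up being sent. This assumes PHP 7 or later (for "\u{...}") with the json extension available; the sample name is just an example.

```php
<?php
$name = "\u{00DE}ormar"; // "Þormar" as UTF-8

// Default behaviour: non-ASCII characters become \uHHHH escapes.
$escaped = json_encode($name);
echo $escaped, "\n"; // "\u00deormar"

// JSON_UNESCAPED_UNICODE keeps the raw UTF-8 bytes instead,
// which is only safe if the page really is served as UTF-8.
$raw = json_encode($name, JSON_UNESCAPED_UNICODE);

// Both forms decode back to the same UTF-8 string.
var_dump(json_decode($escaped) === $name); // bool(true)
var_dump(json_decode($raw) === $name);     // bool(true)
```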

Putting files on the server • Save the file(s) as
encoded in UTF-8 • Don’t forget to upload the file as binary rather than ascii !

URL’s • Can you see a difference ?

Client-server communication Receiving data from the client • Make sure
you send information to the server in the correct encoding • This is especially important for user-input !!! <form accept-charset="utf-8">

Client-server communication Sending data to the client • Send an
HTTP header: header( 'Content-Type: text/html; charset=utf-8' ); • .htaccess / Apache’s httpd.conf: settings in .htaccess overrule HTML headers !

Client-server communication .htaccess examples
• # Maps file extensions to a character encoding. Especially useful in content negotiation situations. (httpd.conf)
    AddCharset utf-8 .utf8
• # Pass the default character encoding for content-type text/plain and text/html
    AddDefaultCharset On|Off|charset
    AddDefaultCharset UTF-8
    (AddDefaultCharset On means iso-8859-1)
• # Add a default character encoding per file extension
    AddType 'text/html; charset=UTF-8' html
• # Identify the encoding for a particular file:
    <Files ~ "events\.html">
    ForceType 'text/html; charset=UTF-8'
    </Files>

PHP • Currently not very friendly for UTF-8 • PHP6
development dormant • Some PHP extensions come to the rescue: MBstring iconv ~Intl • There are also some nifty function collections / classes available to help you. Take note of: http://sourceforge.net/projects/phputf8

PHP UTF-8 safe functions Safe: • explode() • str_replace() •
PHP5+ ~htmlentities() NOT Safe: • Everything else MUST READ: http://www.phpwact.org/php/i18n/utf-8

Danger zone • setlocale() – And all functions which use
locale • strtoupper() / strtolower() • number_format() / money_format() • ucfirst() / ucwords() • strftime() • Gettext extension • Filter extension text functions • Ctype

Test for well-formedness
    function utf8_compliant( $string ) {
        if ( strlen( $string ) == 0 ) {
            return true;
        }
        return ( preg_match( '/^.{1}/us', $string, $array ) == 1 );
    }
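
As an alternative sketch, assuming the mbstring extension is loaded, mb_check_encoding() answers the same well-formedness question without a regular expression. The sample byte strings below are made up for illustration.

```php
<?php
// Assumes the mbstring extension is loaded.
$valid   = "\xC7\xBA"; // U+01FA, a well-formed two-byte sequence
$invalid = "\xC7";     // lead byte without its continuation byte
$latin1  = "\xE9";     // "é" in ISO-8859-1, not valid as UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)

// The PCRE route relies on the same idea: with the /u modifier,
// preg_match() returns false when the subject is malformed UTF-8.
var_dump(preg_match('//u', $valid));   // int(1)
var_dump(preg_match('//u', $invalid)); // bool(false)
```

Either check is best run on user input as soon as it arrives, before the data reaches any other string function.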

Handling text
• You don’t need htmlentities() anymore. Use htmlspecialchars() instead:
    $html = htmlspecialchars( $utf8_string, ENT_COMPAT, 'UTF-8' );
• strlen() will count bytes, so use:
    function utf8_strlen( $string ) {
        return strlen( utf8_decode( $string ) );
    }
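
utf8_decode() is deprecated as of PHP 8.2, so a sketch based on mb_strlen() may age better. It assumes the mbstring extension and PHP 7 or later; utf8_char_count() is a hypothetical fallback written for this example, not a PHP function, and it only gives sensible results for well-formed UTF-8.

```php
<?php
$word = "\u{00DE}orbj\u{00F6}rnsson"; // "Þorbjörnsson": 12 characters, two of them 2 bytes long

echo strlen($word), "\n";             // 14 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 12 (characters)

// Fallback without mbstring: count every byte that is NOT a
// continuation byte (10xxxxxx); each character has exactly one such byte.
function utf8_char_count(string $s): int
{
    return preg_match_all('/[^\x80-\xBF]/', $s);
}

echo utf8_char_count($word), "\n";    // 12
```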

PCRE
• PCRE can be relatively UTF-8 safe if compiled with Unicode. Use:
    preg_match( '/^.+$/u', $string );
• Test whether PCRE has been compiled with Unicode support:
    if ( preg_match( '/^.{1}$/u', "ñ", $UTF8_ar ) != 1 ) {
        trigger_error( 'PCRE is not compiled with UTF-8 support', E_USER_ERROR );
    }

utf8_encode() & utf8_decode() • Only useful for converting between ISO-8859-1
and UTF-8.
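
A sketch of why that limitation matters, assuming the mbstring extension is loaded: text that is really Windows-1252 gets silently mangled when it is treated as ISO-8859-1, which is the only source encoding utf8_encode() knows about. Here mb_convert_encoding() with an explicit source encoding stands in for utf8_encode(), which is deprecated as of PHP 8.2. The sample bytes are made up.

```php
<?php
// Assumes mbstring is loaded. Byte 0x80 is the euro sign in Windows-1252.
$cp1252 = "\x80 5";

// Treating the bytes as ISO-8859-1 (what utf8_encode() does)
// yields U+0080, an invisible control character.
$wrong = mb_convert_encoding($cp1252, 'UTF-8', 'ISO-8859-1');
echo bin2hex($wrong), "\n"; // c2802035

// Naming the real source encoding yields U+20AC, the euro sign.
$right = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');
echo bin2hex($right), "\n"; // e282ac2035
```

Both results are well-formed UTF-8, so a validity check will not catch the first one. Knowing the source encoding is the only real fix.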

MBstring extension • Multibyte aware implementations of some of the
most common PHP string functions, the POSIX extended regex extension and the mail function. • Mbstring supports many different character sets, most importantly UTF-8. • Allows for conversion between character sets and implements some level of encoding detection.
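
A few of those multibyte-aware functions next to their byte-based cousins, as a sketch that assumes mbstring is loaded and PHP 7 or later. The sample word is arbitrary.

```php
<?php
// Assumes mbstring is loaded.
mb_internal_encoding('UTF-8'); // be explicit instead of relying on php.ini

$city = "\u{00E5}rhus"; // "århus", the first character takes 2 bytes

echo mb_strtoupper($city), "\n";         // ÅRHUS
echo mb_substr($city, 0, 1), "\n";       // å
echo bin2hex(substr($city, 0, 1)), "\n"; // c3 (only half of the character)

echo mb_strpos($city, 'h'), "\n";        // 2 (character offset)
echo strpos($city, 'h'), "\n";           // 3 (byte offset)
```

Mixing the two families is where things tend to go wrong: a byte offset from strpos() fed into mb_substr() points at the wrong character.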

Iconv extension • Bundled since PHP 5+. • Main purpose
of iconv : converting between different character sets. • From PHP 5+, iconv has implementations of some common string functions, but is slower than mbstring for UTF-8. • Great for dealing with files, filtering streams and handling output buffers.
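
A minimal conversion sketch, assuming the iconv extension is loaded. The input bytes are a made-up ISO-8859-1 sample.

```php
<?php
// Assumes the iconv extension is loaded.
$latin1 = "caf\xE9"; // "café" in ISO-8859-1 (4 bytes)

// iconv( from, to, string )
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($utf8), "\n"; // 636166c3a9

// iconv also ships its own multibyte-aware string functions.
echo iconv_strlen($utf8, 'UTF-8'), "\n"; // 4 (characters)
echo strlen($utf8), "\n";                // 5 (bytes)
```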

Intl extension • Bundled since PHP 5.3+, but not always
enabled. • Wrapper around the excellent ICU library • Modules: – Collator – Number Formatting – Currency Formatting – Message Formatter (replaces gettext) – Normalizer – Locale – Convertors – Transliterators – Spoof checker – And more...
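
Two of those modules in action, as a sketch that assumes the intl extension is enabled and PHP 7 or later. The German word list and the 'de_DE' locale are picked purely for illustration.

```php
<?php
// Assumes the intl extension is enabled.

// Normalizer: "e" + combining acute accent (two code points)
// becomes the single precomposed code point U+00E9.
$decomposed = "e\u{0301}";
$composed   = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($composed === "\u{00E9}"); // bool(true)

// Collator: language-aware sorting instead of byte order.
$words = ["Zebra", "\u{00C4}pfel", "Apfel"]; // "Äpfel"

$byBytes = $words;
sort($byBytes, SORT_STRING);  // Apfel, Zebra, Äpfel

$collator = new Collator('de_DE');
$collator->sort($words);      // Apfel, Äpfel, Zebra
print_r($words);
```

The byte-order result puts "Äpfel" after "Zebra" because the first byte of "Ä" (0xC3) is larger than any ASCII letter.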

Communicating with MySQL
• The connection between PHP and MySQL defaults to a latin1 connection.
• The first query you should run after making your connection:
    mysqli_query( $conn, 'SET NAMES "utf8" [COLLATE "collation_name"]' );
  OR
    mysqli_query( $conn, 'SET CHARACTER SET utf8' );
• PHP 5.2+:
    mysqli_set_charset( $conn, 'utf8' );

MySQL 4.1+

Finding out what’s available • To find out which character
encodings are available and what their default collation is: SHOW CHARACTER SET; • To find out which collations are available: SHOW COLLATION LIKE 'utf8%';

Setting up a server
• Add the following to your /etc/my.cnf file:
    [mysqld]
    ...
    default-character-set=utf8
    default-collation=utf8_general_ci
• If you are the only user you could even do (MySQL 5.x and later; not executed for super-user logins):
    init_connect='SET NAMES utf8'

Setting up databases & tables • Make sure that the database, the tables
and the text columns are all in UTF-8: (CREATE | ALTER) DATABASE / TABLE ... ( ... ) [DEFAULT] CHARACTER SET utf8 [[DEFAULT] COLLATE collation] • Don’t forget field widths

Choosing the collation • Collation == Sort order • Guideline
to the collations: _ci = case insensitive _cs = case sensitive _bin = binary • Test ! Image Source: Mysql.com

Converting an existing database • Using MySQL’s CONVERT function you
can migrate ‘old’ data: INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table; • For a complete php script to convert your database: http://www.phpwact.org/php/i18n/utf-8/mysql

Querying a database • You can specify the collation to
use for a specific query: SELECT k FROM t1 ORDER BY k COLLATE utf8_spanish_ci; • You can even use it in the WHERE clause: SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;

Common issue • If you run into the following error
message when running a query: Illegal mix of collations (utf8_bin,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation You may want to try and make your query more explicit with a character string literal: SELECT * FROM table WHERE col = _utf8'xyz';

Working with files • Use the b flag in fopen()
• Don’t use unicode in filenames

.po files • Poedit understands all encodings supported by operating
system and works in Unicode internally: http://www.poedit.net/

BOM ! Run for your life ! • ï»¿ or
extra blank line • BOM = Byte Order Mark The character is the ZERO WIDTH NON-BREAKING SPACE. If not placed at the top, it shouldn’t give any problems. • In UTF-16/32 the BOM is necessary to determine the byte order. In UTF-8 it is not.

Bom squad • Some browsers may display the BOM. •
‘headers already sent’ problems. It can also cause problems when in front of “#!” in a shell script. • Check your editor settings • Open the file in (another) editor; if you can see the BOM, manually delete it and save the file. Sometimes even just opening the file and saving it again as UTF-8 will solve it.
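
When there are too many files to open by hand, the same clean-up could be scripted. A minimal sketch, assuming PHP 7 or later; the constant UTF8_BOM and the function strip_utf8_bom() are names invented for this example.

```php
<?php
// The UTF-8 BOM is U+FEFF encoded as these three bytes.
// Read as ISO-8859-1 they show up as "ï»¿".
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: remove a leading UTF-8 BOM, if there is one.
function strip_utf8_bom(string $text): string
{
    if (strncmp($text, UTF8_BOM, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

$withBom = UTF8_BOM . "hello";
var_dump(strlen($withBom));                  // int(8)
var_dump(strip_utf8_bom($withBom));          // string(5) "hello"
var_dump(strip_utf8_bom("hello") === "hello"); // bool(true), untouched
```

Only a BOM at the very start is removed; as the previous slide says, the character is harmless elsewhere in the file.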

We’ve covered: •Dependency on user’s computer setup •Client side code
– HTML, CSS, JS •Communication with the client •Server side code – PHP •Communicating with a MySQL database •MySQL •Communicating with files •Other common issues

Keep in touch! (I’m self-employed, you can hire me ;-)
) Juliette Reinders Folmer Email: [email protected] Web: http://www.adviesenzo.nl/ LinkedIn: http://nl.linkedin.com/in/julietterf Twitter: http://twitter.com/jrf_nl GitHub: http://github.com/jrfnl/ Please rate this talk on joined.in/11233 Slides: speakerdeck.com/jrf Endorsements and recommendations on LinkedIn are much appreciated too!

Anything else you never dared to ask before ?
