Get to know the Unicode monster and don’t let it harm you - Jeremy Chan


Every programmer’s nightmare
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

A Brief History of Unicode

Basic concept
The ultimate goal of an encoding is to map characters to bytes and vice versa.
Characters: Hi <-> (encoding) <-> Bytes: 0x48 0x69
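The mapping on this slide can be tried out directly. The following is a minimal sketch, assuming US-ASCII as the encoding; the class and method names are made up for illustration and are not from the talk.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: round-trips "Hi" through an encoding.
public class EncodeDecodeDemo {
    // Characters -> bytes (assumption: US-ASCII is the chosen encoding).
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // Bytes -> characters, using the same encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Hi");
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b); // 0x48 0x69
        }
        System.out.println();
        System.out.println(decode(bytes)); // Hi
    }
}
```

Decoding only gives back the original text when the same encoding is used in both directions.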

The beginning was ASCII in the 60s
ASCII is a character set: 7 bits = 2^7 = 128 characters
A = 65, z = 122, …

128 characters aren’t nearly enough
No British pound symbol (£), no accents, etc.
ISO-8859: let’s add one more bit to support more characters for our language
◦ Latin-1 (Western European)
◦ Latin-2 (Non-Cyrillic Central and Eastern European)
◦ Latin-3 (Southern European languages and Esperanto)
◦ Latin-5 (Turkish)
◦ Latin-6 (Northern European and Baltic languages)
◦ …and many more
8 bits = 2^8 = 256 characters
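To see the limits described here, one can ask Java to encode a character that a charset does not contain. This is a rough sketch using only charsets that every JVM is required to provide; the class and helper names are illustrative. `String.getBytes(Charset)` substitutes the charset's replacement byte for unmappable characters, which is `?` (0x3F) for both charsets used below.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: what happens to characters a charset cannot represent.
public class PoundSignDemo {
    // Returns the encoded bytes as unsigned values (0..255) for easy printing.
    static int[] unsignedBytes(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // pound sign
        String euro = "\u20AC";  // euro sign

        // 7-bit ASCII has no pound sign, so it is replaced by '?' (0x3F).
        System.out.printf("0x%02X%n", unsignedBytes(pound, StandardCharsets.US_ASCII)[0]);
        // Latin-1 uses the eighth bit and stores the pound sign as 0xA3.
        System.out.printf("0x%02X%n", unsignedBytes(pound, StandardCharsets.ISO_8859_1)[0]);
        // 256 slots still run out: Latin-1 has no euro sign, so '?' again.
        System.out.printf("0x%02X%n", unsignedBytes(euro, StandardCharsets.ISO_8859_1)[0]);
    }
}
```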

Unicode (1991)
One charset to rule them all
Unicode Consortium

Code point: a layer of abstraction
Characters (Hi 💩) -> code points, each a hex number (U+0048 U+0069 U+1F4A9) -> encoding -> bytes (? ? ? ? ? ?)

How many code points?
Valid code points: U+0000 to U+10FFFF (21 bits), 1,114,112 code points in total
Currently 143,859 (~13%) assigned, as of Unicode 13.0
◦ ~1,000 characters for European languages
◦ ~75,000 characters for Chinese, Japanese and Korean
◦ >3,000 emojis
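The figures on this slide can be checked with a few lines of arithmetic. This is a small sketch built on the constants in `java.lang.Character`; the class and method names are illustrative, and the 143,859 figure is simply the number quoted on the slide.

```java
// Hypothetical demo class: re-derives the numbers quoted on the slide.
public class CodePointRangeDemo {
    // Size of the code space U+0000..U+10FFFF.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    // Number of bits needed to represent the largest code point.
    static int bitsNeeded() {
        return 32 - Integer.numberOfLeadingZeros(Character.MAX_CODE_POINT);
    }

    // Share of the code space taken by a given number of assigned characters.
    static double percentAssigned(int assigned) {
        return 100.0 * assigned / totalCodePoints();
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());                    // 1114112
        System.out.println(bitsNeeded());                         // 21
        System.out.println(Math.round(percentAssigned(143_859))); // 13
        System.out.println(Character.isValidCodePoint(0x110000)); // false
    }
}
```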

More and more emojis (Unicode 13.0, March 2020)

Examples
◦ A: U+0041 LATIN CAPITAL LETTER A (65 in decimal)
◦ Ã: U+00C3 LATIN CAPITAL LETTER A WITH TILDE
◦ ض: U+0636 ARABIC LETTER DAD
◦ 世: U+4E16 CJK UNIFIED IDEOGRAPH-4E16
◦ 💩: U+1F4A9 PILE OF POO
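The names in these examples can be looked up at runtime. The sketch below relies on `Character.getName(int)`, which reads from the Unicode data bundled with the JDK, so the result for newer characters depends on the JDK version. The class and method names are made up for illustration.

```java
// Hypothetical demo class: prints code points in the "U+XXXX NAME" style of the slides.
public class CodePointNameDemo {
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0x00C3, 0x0636, 0x1F4A9 };
        for (int cp : samples) {
            // The character itself, followed by its code point and name.
            System.out.println(new String(Character.toChars(cp)) + " " + describe(cp));
        }
    }
}
```

For characters without an individual name in the Unicode data file, such as CJK ideographs, `getName` falls back to the block name plus the hex value, so its output may differ slightly from the official label shown above.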

Fancy adding your own?
"We did it! How a comment on HackerNews lead to 4 ½ new Unicode characters"

How to turn code points into bytes?
Characters (Hi 💩) -> code points, each a hex number (U+0048 U+0069 U+1F4A9) -> encoding -> bytes (? ? ? ? ? ?)
~1 million possibilities

Unfortunately, multiple standards again
Compatibility with ASCII
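The slide's list of standards did not survive in the transcript. As an illustration, the sketch below assumes it refers to the usual Unicode encoding forms and compares two of them, UTF-8 and UTF-16, on the same characters. The class and helper names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same code points produce different bytes per encoding.
public class EncodingFormsDemo {
    // Encodes the text and renders the bytes as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String poo = new String(Character.toChars(0x1F4A9));

        // ASCII text is byte-for-byte identical in UTF-8 ...
        System.out.println(hex("Hi", StandardCharsets.US_ASCII)); // 48 69
        System.out.println(hex("Hi", StandardCharsets.UTF_8));    // 48 69
        // ... but not in UTF-16, which spends at least two bytes per character.
        System.out.println(hex("Hi", StandardCharsets.UTF_16BE)); // 00 48 00 69

        // Outside the ASCII range both need more bytes.
        System.out.println(hex(poo, StandardCharsets.UTF_8));     // F0 9F 92 A9
        System.out.println(hex(poo, StandardCharsets.UTF_16BE));  // D8 3D DC A9
    }
}
```

The first two lines are what "compatibility with ASCII" means in practice for UTF-8: existing ASCII data is already valid UTF-8.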

Now for the scary stuff

Combining multiple code points
e (U+0065 LATIN SMALL LETTER E) + ◌́ (U+0301 COMBINING ACUTE ACCENT): 2 code points
= é (U+00E9 LATIN SMALL LETTER E WITH ACUTE): 1 code point

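Same character, but not equal

    // import java.text.Normalizer; import java.util.Objects;

    // Precomposed é (1 code point) vs. e + combining acute accent (2 code points)
    Objects.equals("\u00E9", "e\u0301") // false

    Objects.equals(
        Normalizer.normalize("\u00E9", Normalizer.Form.NFD),
        Normalizer.normalize("e\u0301", Normalizer.Form.NFD)
    ) // true

Normalization: use it the way you already use toLowerCase()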
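In the spirit of "use it like toLowerCase()", the comparison can be wrapped in a small helper. This is only a sketch: the helper name is made up, and it uses NFC rather than NFD, which works equally well for this purpose as long as both sides use the same form.

```java
import java.text.Normalizer;

// Hypothetical helper class: compares strings after normalizing both sides.
public class NormalizedEqualsDemo {
    static boolean normalizedEquals(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // 1 code point
        String decomposed = "e\u0301"; // 2 code points

        System.out.println(composed.equals(decomposed));           // false
        System.out.println(normalizedEquals(composed, decomposed)); // true
        // NFC folds the pair back into the single precomposed code point.
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).length()); // 1
    }
}
```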

Emoji as well
👨 (U+1F468 MAN) + 🏾 (U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5) = 👨🏾
2 code points
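A quick way to confirm that this single visible emoji is really a sequence is to build it from its two code points and count. This is a minimal sketch with made-up class and method names.

```java
// Hypothetical demo class: one visible emoji built from two code points.
public class EmojiModifierDemo {
    // MAN (U+1F468) followed by the Fitzpatrick type-5 modifier (U+1F3FE).
    static String manWithSkinTone() {
        return new StringBuilder()
                .appendCodePoint(0x1F468)
                .appendCodePoint(0x1F3FE)
                .toString();
    }

    public static void main(String[] args) {
        String s = manWithSkinTone();
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(s.length());                      // 4 UTF-16 code units
    }
}
```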

Surrogate pairs
U+D83D (high surrogate) + U+DC7D (low surrogate) = 👽 U+1F47D EXTRATERRESTRIAL ALIEN (1 code point)
Combines two 16-bit code units (a surrogate pair) to store high code points (0x10000 to 0x10FFFF)

    String str = "Hi \uD83D\uDC7D"; // Hi 👽
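The pair on this slide follows from the UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits. The sketch below spells that out by hand with illustrative method names; in real code, `Character.highSurrogate`, `Character.lowSurrogate` and `Character.toCodePoint` do the same job.

```java
// Hypothetical demo class: manual UTF-16 surrogate pair arithmetic.
public class SurrogatePairDemo {
    // Splits a supplementary code point (0x10000..0x10FFFF) into two 16-bit code units.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombines a surrogate pair into the original code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F47D);
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]); // U+D83D U+DC7D
        System.out.printf("U+%04X%n", fromSurrogates(pair[0], pair[1]));    // U+1F47D
        // The JDK helper agrees with the manual calculation.
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x1F47D); // true
    }
}
```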

Understand what you are counting

    "\uD83D\uDC7D".length() // 2 (code units), although 👽 is a single code point

Grapheme (user-perceived character) count?

    // import java.text.BreakIterator;
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText("Hi\uD83D\uDC7D"); // Hi👽
    int count = 0;
    while (it.next() != BreakIterator.DONE) {
        count++;
    }
    System.out.println(count); // 3

More advanced features in the ICU4J library: https://sites.google.com/site/icusite/home/why-use-icu4j

    String str = "Hi \uD83D\uDC7D"; // Hi 👽
    for (char c : str.toCharArray()) {
        System.out.println(c);
    }

Output: H i ? ? (the two halves of the surrogate pair are printed separately)
Never use char again

Iterate in code points (Java 8)

    String s = "Hi \uD83D\uDC7D"; // Hi 👽
    s.codePoints().forEach(c -> {
        System.out.println(Character.toChars(c));
    });

Output: H i 👽

Be careful with regex

    Pattern pattern = Pattern.compile("My name is (\\w+)");
    Matcher matcher = pattern.matcher("My name is Zoé");
    if (matcher.find()) {
        System.out.println("Your name is " + matcher.group(1));
    }

Output: Your name is Zo

    Pattern pattern = Pattern.compile("My name is (\\w+)", Pattern.UNICODE_CHARACTER_CLASS);
    Matcher matcher = pattern.matcher("My name is Zoé");
    if (matcher.find()) {
        System.out.println("Your name is " + matcher.group(1));
    }

Output: Your name is Zoé
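The flag can also be switched on inside the pattern itself with the embedded form `(?U)`, or avoided by asking for Unicode letters explicitly with `\p{L}`. This is a small sketch comparing the three variants; the class and helper names are made up, and the input uses the precomposed é (U+00E9).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: three ways of matching a name that contains a non-ASCII letter.
public class UnicodeRegexDemo {
    // Returns the first capture group of the first match, or null if nothing matches.
    static String extractName(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "My name is Zo\u00E9";

        // Default: the word class is ASCII-only, so the match stops before the accented letter.
        System.out.println(extractName("My name is (\\w+)", input));
        // Embedded flag (?U) is equivalent to Pattern.UNICODE_CHARACTER_CLASS.
        System.out.println(extractName("(?U)My name is (\\w+)", input));
        // The Unicode letter category works without any flag.
        System.out.println(extractName("My name is (\\p{L}+)", input));
    }
}
```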

Finally… databases

RTFM: read the friendly manual, carefully
Does not work for most emojis

Reference
◦ Must read: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
◦ History of Unicode: https://www.translationroyale.com/the-history-of-unicode/
◦ Some common pitfalls: http://unicode.org/faq/utf_bom.html
◦ Unicode hacks: https://speakerdeck.com/mathiasbynens/hacking-with-unicode
◦ Unicode regex pitfalls: http://www.guido-flohr.net/unicode-regex-pitfalls/

Thanks
LINKEDIN: HTTPS://WWW.LINKEDIN.COM/IN/JEREMYCWCHAN
