Get to know the Unicode monster and don’t let it harm you - Jeremy Chan

Jeremy Chan's short talk to the London Java Community on 8th September 2020.

In this talk, Jeremy will give us a brief history of Unicode and discuss some of the common pitfalls he's come across.

Jeremy Chan is currently a VP at Goldman Sachs. He has previously worked at Credit Suisse and Barclays Investment Bank and has a Masters in Computer Science and Engineering.

London Java Community

September 08, 2020
Transcript

  1. Get to know the Unicode monster and don’t let it harm you. JEREMY CHAN
  2. Every programmer’s nightmare
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
  3. A Brief History of Unicode

  4. Basic Concept
    Ultimate goal of encoding: map characters into bytes and vice versa
    Characters (Hi) ⇄ Encoding ⇄ Bytes (0x48 0x69)
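
The mapping on this slide can be reproduced directly in Java. The following is a minimal sketch, assuming US-ASCII as the encoding (the slide does not name one); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Characters -> bytes, using the US-ASCII charset.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // Bytes -> characters, using the same charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Hi");
        System.out.printf("0x%02X 0x%02X%n", bytes[0], bytes[1]); // 0x48 0x69
        System.out.println(decode(bytes));                        // Hi
    }
}
```

Encoding and decoding only round-trip when both sides use the same charset, which is why the sketch passes it explicitly instead of relying on the platform default.
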
  5. The beginning was ASCII in the 60s
    ASCII is a character set: 7 bits = 2^7 = 128 characters
    A = 65, z = 122, …
  6. 128 characters aren’t nearly enough
    No British Pound symbol £, no accents, etc.
    ISO-8859: let’s add one more bit to support more characters for our language
    ◦ Latin-1 (Western European)
    ◦ Latin-2 (Non-Cyrillic Central and Eastern European)
    ◦ Latin-3 (Southern European languages and Esperanto)
    ◦ Latin-5 (Turkish)
    ◦ Latin-6 (Northern European and Baltic languages)
    ◦ …and many more
    8 bits = 2^8 = 256 characters
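
The pound-sign gap is easy to observe from Java. This is a small sketch, with illustrative class and method names, that asks the US-ASCII encoder whether it can represent a character and then reads the single byte Latin-1 (ISO-8859-1) assigns to it. Only charsets that every Java platform is required to support are used.

```java
import java.nio.charset.StandardCharsets;

class PoundSignDemo {
    // True if 7-bit ASCII has a byte for this character.
    static boolean asciiCanEncode(char c) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(c);
    }

    // The single byte Latin-1 uses for this character, as a value in 0..255.
    static int latin1Byte(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1)[0] & 0xFF;
    }

    public static void main(String[] args) {
        char pound = '\u00A3'; // the British pound sign
        System.out.println(asciiCanEncode('A'));          // true
        System.out.println(asciiCanEncode(pound));        // false
        System.out.printf("0x%02X%n", latin1Byte(pound)); // 0xA3
    }
}
```
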
  7. (no text transcribed)

  8. Unicode (1991): ONE CHARSET TO RULE IT ALL
    Unicode Consortium
  9. Code point: layer of abstraction
    Characters (Hi💩) → Code points, each a hex number (U+0048 U+0069 U+1F4A9) → Encoding → Bytes (? ? ? ? ? ?)
  10. How many code points?
    Valid code points: U+0000 to U+10FFFF (21 bits), 1,114,112 code points
    Currently about 143,859 (~13%) utilized
    ◦ ~1,000 characters for European languages
    ◦ ~75,000 characters for Chinese, Japanese and Korean
    ◦ > 3,000 emojis
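
The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses `Character.MAX_CODE_POINT` from the JDK; the class and method names are illustrative, and 143,859 is simply the slide's own number passed in as a parameter.

```java
class CodePointSpace {
    // Number of valid code points: U+0000 through U+10FFFF inclusive.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT + 1;
    }

    // Number of bits needed to represent the largest code point.
    static int bitsNeeded() {
        return Integer.SIZE - Integer.numberOfLeadingZeros(Character.MAX_CODE_POINT);
    }

    // Percentage of the code space taken by the given number of assigned characters.
    static double utilization(int assigned) {
        return 100.0 * assigned / totalCodePoints();
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());                // 1114112
        System.out.println(bitsNeeded());                     // 21
        System.out.printf("%.1f%%%n", utilization(143_859));
    }
}
```
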
  11. More and more emojis (Unicode 13.0, March 2020)

  12. Examples: A, U+0041 LATIN CAPITAL LETTER A (65 in decimal)
  13. Examples: Ã, U+00C3 LATIN CAPITAL LETTER A WITH TILDE
  14. Examples: ضب, U+0636 ARABIC LETTER DAD
  15. Examples: 世, U+4E16 CJK UNIFIED IDEOGRAPH-4E16
  16. Examples: 💩, U+1F4A9 PILE OF POO
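
Most of the names in these examples can be looked up at runtime with `Character.getName`. A minimal sketch follows; the class and method names are illustrative. The CJK example is left out on purpose, because `Character.getName` is documented to build the name from the Unicode block for characters without an entry in the UnicodeData file, so its output for ideographs may not match the slide's wording.

```java
class CodePointNames {
    // Formats a code point the way the slides do: U+XXXX followed by its Unicode name.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        int[] examples = { 0x0041, 0x00C3, 0x0636, 0x1F4A9 };
        for (int cp : examples) {
            System.out.println(describe(cp));
        }
    }
}
```
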

  17. Fancy adding your own?
    “We did it! How a comment on HackerNews lead to 4 ½ new Unicode characters”
  18. How to turn code points into bytes?
    Characters (Hi💩) → Code points, each a hex number (U+0048 U+0069 U+1F4A9) → Encoding → Bytes (? ? ? ? ? ?)
    ~1 million possibilities
  19. Unfortunately, multiple standards again
    Compatibility with ASCII
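
The comparison shown on this slide is not preserved in the transcript. As a hedged illustration, the sketch below encodes the same text with UTF-8 and UTF-16 (big-endian, no byte order mark), two standard Unicode encodings, and checks which output matches plain ASCII for ASCII-only text. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingSizes {
    // "Hi" followed by U+1F4A9, written as its surrogate pair.
    static final String TEXT = "Hi\uD83D\uDCA9";

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // True if the UTF-8 bytes are identical to the US-ASCII bytes.
    static boolean utf8MatchesAscii(String s) {
        return Arrays.equals(s.getBytes(StandardCharsets.UTF_8),
                             s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(TEXT));       // 6: 1 + 1 + 4 bytes
        System.out.println(utf16Length(TEXT));      // 8: 2 + 2 + 4 bytes
        System.out.println(utf8MatchesAscii("Hi")); // true
    }
}
```

UTF-8 keeps ASCII text byte-for-byte unchanged, while UTF-16 spends at least two bytes per character.
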

  20. Now for the scary stuff

  21. Combining multiple code points
    e (U+0065 LATIN SMALL LETTER E) + ◌́ (U+0301 COMBINING ACUTE ACCENT): 2 code points
    = é (U+00E9 LATIN SMALL LETTER E WITH ACUTE): 1 code point
  22. Same character, but not equal
    Objects.equals("\u00E9", "e\u0301") // false: precomposed é vs. e + combining acute accent
    Objects.equals(
        Normalizer.normalize("\u00E9", Normalizer.Form.NFD),
        Normalizer.normalize("e\u0301", Normalizer.Form.NFD)
    ) // true
    Normalization: apply it before comparing, the same way you would apply toLowerCase()
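
A self-contained version of this comparison might look as follows. It is a sketch rather than the speaker's code: the helper name `equalsNormalized` is hypothetical, and it normalizes to NFC where the slide uses NFD. Either form works as long as both sides use the same one.

```java
import java.text.Normalizer;

class NormalizedEquals {
    // Compares two strings after bringing both to the same normalization form.
    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // one code point
        String decomposed = "e\u0301"; // letter e plus combining acute accent

        System.out.println(precomposed.equals(decomposed));           // false
        System.out.println(equalsNormalized(precomposed, decomposed)); // true
    }
}
```
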
  23. Emoji as well
    U+1F468 Man + U+1F3FE Emoji Modifier Fitzpatrick Type-5 = 👨🏾 (2 code points)
  24. Surrogate Pairs
    U+D83D (HIGH_SURROGATES) + U+DC7D (LOW_SURROGATES) = U+1F47D EXTRATERRESTRIAL ALIEN (1 code point)
    Combines two 16-bit code units (a surrogate pair) to store high code points (0x10000 to 0x10FFFF)
    String str = "Hi \uD83D\uDC7D"; // Hi 👽
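
The arithmetic behind a surrogate pair is short enough to write out. The sketch below follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits each); the class and method names are illustrative, and the JDK's `Character.toChars` is used as a cross-check.

```java
import java.util.Arrays;

class SurrogateMath {
    // Splits a supplementary code point (0x10000 to 0x10FFFF) into its surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F47D);
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]); // U+D83D U+DC7D
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F47D))); // true
    }
}
```
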
  25. Understand what you are counting
    "\uD83D\uDC7D".length() // 2 (code units)
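
To make the difference concrete, the following sketch counts the same string in code units and in code points, using the alien example from the neighbouring slides. The class and method names are illustrative.

```java
class CountingUnits {
    // Number of 16-bit code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "Hi \uD83D\uDC7D";
        System.out.println(codeUnits(s));  // 5
        System.out.println(codePoints(s)); // 4
    }
}
```
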
  26. Grapheme (user-perceived character) count?
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText("Hi\uD83D\uDC7D"); // Hi👽
    int count = 0;
    while (it.next() != BreakIterator.DONE) { count++; }
    System.out.println(count); // 3
    More advanced features in ICU4J library: https://sites.google.com/site/icusite/home/why-use-icu4j
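
The same loop can be wrapped in a helper and applied to the combining-accent case from earlier in the talk. This is a sketch with an illustrative class and method name. It relies only on the JDK's `java.text.BreakIterator`; more complex emoji sequences may be segmented differently depending on the JDK version, which is where a library such as ICU4J comes in.

```java
import java.text.BreakIterator;

class GraphemeCount {
    // Counts user-perceived characters by walking character boundaries.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String accented = "e\u0301"; // letter e plus combining acute accent
        System.out.println(accented.length()); // 2 code units
        System.out.println(graphemes(accented)); // 1 user-perceived character
    }
}
```
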
  27. Never use char again
    String str = "Hi \uD83D\uDC7D"; // Hi 👽
    for (char c : str.toCharArray()) { System.out.println(c); }
    Output, one per line: H, i, a space, ?, ?
  28. Iterate in code points (Java 8)
    String s = "Hi \uD83D\uDC7D"; // Hi 👽
    s.codePoints().forEach(c -> { System.out.println(Character.toChars(c)); });
    Output, one per line: H, i, a space, 👽
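
When debugging, it can help to see the code points themselves rather than the glyphs. The following sketch, with illustrative names, maps the `codePoints()` stream to `U+XXXX` labels so the result does not depend on whether the console can render the emoji.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointList {
    // Lists each code point of the string in U+XXXX notation.
    static List<String> labels(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(labels("Hi \uD83D\uDC7D")); // [U+0048, U+0069, U+0020, U+1F47D]
    }
}
```
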
  29. Be careful with regex
    Pattern pattern = Pattern.compile("My name is (\\w+)");
    Matcher matcher = pattern.matcher("My name is Zoé");
    if (matcher.find()) { System.out.println("Your name is " + matcher.group(1)); }
    Output: Your name is Zo
  30. Pattern pattern = Pattern.compile("My name is (\\w+)", Pattern.UNICODE_CHARACTER_CLASS);
    Matcher matcher = pattern.matcher("My name is Zoé");
    if (matcher.find()) { System.out.println("Your name is " + matcher.group(1)); }
    Output: Your name is Zoé
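
The two regex slides can be combined into one runnable comparison. In this sketch the helper `extractName` is hypothetical, and the é is written as the escape for U+00E9 so the example does not depend on the source file's encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordMatch {
    // Returns the captured name, or null if the input does not match.
    static String extractName(String input, int flags) {
        Matcher m = Pattern.compile("My name is (\\w+)", flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "My name is Zo\u00E9";
        // Default: the word class covers only ASCII letters, digits and underscore.
        System.out.println(extractName(input, 0));
        // With the flag, the word class follows Unicode properties.
        System.out.println(extractName(input, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

The first call captures only `Zo`, the second captures the full name.
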
  31. Finally…databases

  32. RTFM: read the friendly manual, carefully
    Does not work for most emojis
  33. (no text transcribed)

  34. Reference
    Must read: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
    History of Unicode: https://www.translationroyale.com/the-history-of-unicode/
    Some common pitfalls: http://unicode.org/faq/utf_bom.html
    Unicode hacks: https://speakerdeck.com/mathiasbynens/hacking-with-unicode
    Unicode regex pitfalls: http://www.guido-flohr.net/unicode-regex-pitfalls/
  35. Thanks
    LINKEDIN: HTTPS://WWW.LINKEDIN.COM/IN/JEREMYCWCHAN