Get to know the Unicode monster and don’t let it harm you - Jeremy Chan

Jeremy Chan's short talk to the London Java Community on 8th September 2020.

In this talk, Jeremy will give us a brief history of Unicode and discuss some of the common pitfalls he's come across.

Jeremy Chan is currently a VP at Goldman Sachs. He has previously worked at Credit Suisse and Barclays Investment Bank and has a Masters in Computer Science and Engineering.

London Java Community

September 08, 2020

Transcript

  1. Get to know the Unicode monster and don’t let it harm you. JEREMY CHAN
  2. Every programmer’s nightmare:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
  3. A Brief History of Unicode

  4. Basic concept. The ultimate goal of an encoding: map characters into bytes and vice versa.
    Characters "Hi" <-> encoding <-> bytes 0x48 0x69
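The mapping on this slide can be tried directly in Java. The following is a minimal sketch, assuming US-ASCII as the encoding; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodeDecodeSketch {
    // Characters -> bytes, using US-ASCII as the encoding.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // Bytes -> characters, using the same encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Hi");
        System.out.printf("0x%02X 0x%02X%n", bytes[0], bytes[1]); // 0x48 0x69
        System.out.println(decode(bytes));                        // Hi
    }
}
```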
  5. In the beginning was ASCII, in the 60s. ASCII is a character set: 7 bits = 2^7 = 128 characters, e.g. A = 65, z = 122, …
  6. 128 characters aren’t nearly enough: no British pound symbol £, no accents, etc.
    ISO-8859: let’s add one more bit to support more characters for our language. 8 bits = 2^8 = 256 characters.
    ◦ Latin-1 (Western European)
    ◦ Latin-2 (Non-Cyrillic Central and Eastern European)
    ◦ Latin-3 (Southern European languages and Esperanto)
    ◦ Latin-5 (Turkish)
    ◦ Latin-6 (Northern European and Baltic languages)
    ◦ …and many more
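To see the limitation concretely, here is a small sketch that encodes the pound sign with US-ASCII and with ISO-8859-1 (Latin-1). It relies on the documented behaviour of `String.getBytes(Charset)`, which substitutes the charset's replacement byte (`?`, 0x3F, for US-ASCII) for unmappable characters. The class name and the `hex` helper are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PoundSignSketch {
    // Encodes the text and returns the bytes as unsigned hex, e.g. "A3".
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the pound sign
        // 7-bit ASCII has no slot for it, so Java substitutes '?'.
        System.out.println(hex(pound, StandardCharsets.US_ASCII));   // 3F
        // Latin-1 uses the eighth bit and can represent it.
        System.out.println(hex(pound, StandardCharsets.ISO_8859_1)); // A3
    }
}
```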
  8. Unicode (1991): ONE CHARSET TO RULE IT ALL. Unicode Consortium
  9. Code point: a layer of abstraction.
    Characters "Hi 💩" <-> code points (hex numbers) U+0048 U+0069 U+1F4A9 <-> encoding <-> bytes ? ? ? ? ? ?
  10. How many code points?
    Valid code points: U+0000 – U+10FFFF (21 bits), i.e. 1,114,112 code points.
    Currently about 143,859 (~13%) utilized:
    ◦ ~1,000 characters for European languages
    ◦ ~75,000 characters for Chinese, Japanese and Korean
    ◦ >3,000 emojis
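The figures on this slide can be re-derived from `Character.MAX_CODE_POINT`. This is a sketch for checking the arithmetic only; the class and method names are hypothetical, and 143,859 is the count quoted on the slide.

```java
import java.util.Locale;

class CodePointSpaceSketch {
    // U+0000 through U+10FFFF inclusive.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT + 1;
    }

    // Number of bits needed to hold the largest code point.
    static int bitsNeeded() {
        return Integer.toBinaryString(Character.MAX_CODE_POINT).length();
    }

    // Share of the code space that is assigned, in percent.
    static double percentUsed(int assigned) {
        return 100.0 * assigned / totalCodePoints();
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints()); // 1114112
        System.out.println(bitsNeeded());      // 21
        System.out.println(String.format(Locale.ROOT, "%.1f%%", percentUsed(143_859))); // 12.9%
    }
}
```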
  11. More and more emojis (Unicode 13.0, March 2020)

  12. Examples: A, U+0041 LATIN CAPITAL LETTER A (65 in decimal)
  13. Examples: Ã, U+00C3 LATIN CAPITAL LETTER A WITH TILDE

  14. Examples: ض, U+0636 ARABIC LETTER DAD

  15. Examples: 世, U+4E16 CJK UNIFIED IDEOGRAPH-4E16

  16. Examples: 💩, U+1F4A9 PILE OF POO
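The names shown in these examples can be looked up from Java with `Character.getName(int)`. A minimal sketch follows; `describe` is a made-up helper, not part of the talk.

```java
class CodePointNameSketch {
    // Formats a code point as "U+XXXX NAME" using the JDK's Unicode data.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');        // 65
        System.out.println(describe(0x0041)); // U+0041 LATIN CAPITAL LETTER A
        System.out.println(describe(0x00C3)); // U+00C3 LATIN CAPITAL LETTER A WITH TILDE
        System.out.println(describe(0x1F4A9)); // U+1F4A9 PILE OF POO
    }
}
```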

  17. Fancy adding your own? "We did it! How a comment on HackerNews lead to 4 ½ new Unicode characters"
  18. How to turn code points into bytes? ~1 million possibilities.
    Characters "Hi 💩" <-> code points (hex numbers) U+0048 U+0069 U+1F4A9 <-> encoding <-> bytes ? ? ? ? ? ?
  19. Unfortunately, multiple standards again. Compatibility with ASCII
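The slide text does not name the encodings it compares. As an assumption, the sketch below uses UTF-8 and UTF-16 (big-endian), two widely used Unicode encodings, to show what "compatibility with ASCII" means in bytes. The class name and `hex` helper are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8VsUtf16Sketch {
    // Encodes the text and returns the bytes as space-separated unsigned hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hi = "Hi";
        String poo = "\uD83D\uDCA9"; // U+1F4A9 PILE OF POO

        System.out.println(hex(hi, StandardCharsets.US_ASCII)); // 48 69
        System.out.println(hex(hi, StandardCharsets.UTF_8));    // 48 69 (same bytes as ASCII)
        System.out.println(hex(hi, StandardCharsets.UTF_16BE)); // 00 48 00 69 (not ASCII-compatible)

        System.out.println(hex(poo, StandardCharsets.UTF_8));    // F0 9F 92 A9
        System.out.println(hex(poo, StandardCharsets.UTF_16BE)); // D8 3D DC A9
    }
}
```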

  20. Now for the scary stuff

  21. Combining multiple code points:
    e (U+0065 LATIN SMALL LETTER E) + ́ (U+0301 COMBINING ACUTE ACCENT), 2 code points
    = é (U+00E9 LATIN SMALL LETTER E WITH ACUTE), 1 code point
  22. Same character, but not equal:
    Objects.equals("\u00E9", "e\u0301") // false
    Objects.equals(Normalizer.normalize("\u00E9", Normalizer.Form.NFD), Normalizer.normalize("e\u0301", Normalizer.Form.NFD)) // true
    Normalization: use it the way you already use toLowerCase() before comparing strings.
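A runnable version of the comparison on this slide might look like the sketch below. It writes the two forms of "é" with Unicode escapes so the difference survives copy and paste; `equalsNormalized` is a hypothetical helper name, and NFD is used only because the slide does.

```java
import java.text.Normalizer;

class NormalizationSketch {
    // Compares two strings after bringing both to the same normalization form.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // one code point: LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301"; // two code points: e + COMBINING ACUTE ACCENT

        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(equalsNormalized(precomposed, decomposed)); // true
    }
}
```

Any normalization form works for equality checks as long as both sides use the same one.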
  23. Emoji as well: U+1F468 Man + U+1F3FE Emoji Modifier Fitzpatrick Type-5 = one displayed emoji, 2 code points
  24. Surrogate pairs: U+D83D HIGH_SURROGATES + U+DC7D LOW_SURROGATES = 👽 U+1F47D EXTRATERRESTRIAL ALIEN, 1 code point.
    Combines two 16-bit code units (a surrogate pair) to store high code points (0x10000 to 0x10FFFF).
    String str = "Hi \uD83D\uDC7D"; // Hi 👽
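The pairing can be checked with the JDK's own helpers, `Character.toChars` and `Character.toCodePoint`. A short sketch, with an illustrative class name:

```java
class SurrogatePairSketch {
    // Splits a code point into its UTF-16 code units (one or two chars).
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(0x1F47D); // EXTRATERRESTRIAL ALIEN

        System.out.println(units.length);                         // 2
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DC7D
        System.out.println(Character.isHighSurrogate(units[0]));  // true
        System.out.println(Character.isLowSurrogate(units[1]));   // true

        // Recombine the pair into the original code point.
        int codePoint = Character.toCodePoint(units[0], units[1]);
        System.out.printf("U+%X%n", codePoint);                   // U+1F47D
    }
}
```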
  25. Understand what you are counting: "👽".length() // 2 (code units)
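To make the distinction explicit, this sketch compares `String.length()` (UTF-16 code units) with `String.codePointCount` (code points). The method names `codeUnits` and `codePoints` are made up for the example.

```java
class CountingSketch {
    // Number of 16-bit UTF-16 code units, which is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String alien = "\uD83D\uDC7D"; // U+1F47D, stored as a surrogate pair
        System.out.println(codeUnits(alien));  // 2
        System.out.println(codePoints(alien)); // 1

        String decomposed = "e\u0301"; // e + combining acute accent
        System.out.println(codeUnits(decomposed));  // 2
        System.out.println(codePoints(decomposed)); // 2 (still one user-perceived character)
    }
}
```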
  26. Grapheme (user-perceived character) count?
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText("Hi👽");
    int count = 0;
    while (it.next() != BreakIterator.DONE) { count++; }
    System.out.println(count); // 3
    More advanced features in the ICU4J library: https://sites.google.com/site/icusite/home/why-use-icu4j
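The slide's loop can be wrapped in a helper and tried on other inputs. This is a sketch using only `java.text.BreakIterator`; `graphemes` is a hypothetical name. Complex emoji sequences may be segmented differently depending on the JDK version, so the example sticks to a surrogate pair and a combining accent.

```java
import java.text.BreakIterator;

class GraphemeCountSketch {
    // Counts user-perceived characters by walking character boundaries.
    static int graphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(graphemes("Hi\uD83D\uDC7D")); // 3: H, i, alien
        System.out.println(graphemes("Zoe\u0301"));      // 3: Z, o, e + combining accent
        System.out.println("Zoe\u0301".length());        // 4 code units
    }
}
```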
  27. Never use char again:
    String str = "Hi \uD83D\uDC7D"; // Hi 👽
    for (char c : str.toCharArray()) { System.out.println(c); }
    Output: H i ? ?
  28. Iterate in code points (Java 8):
    String s = "Hi \uD83D\uDC7D"; // Hi 👽
    s.codePoints().forEach(c -> { System.out.println(Character.toChars(c)); });
    Output: H i 👽
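A variation on the slide's loop that collects each code point as its own `String`, which is often easier to work with than printing `char[]` values. The class and method names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointIterationSketch {
    // Splits a string into one String per code point, keeping surrogate pairs together.
    static List<String> characters(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> chars = characters("Hi \uD83D\uDC7D");
        System.out.println(chars.size()); // 4: "H", "i", " ", and the alien emoji
        for (String c : chars) {
            System.out.println(c);
        }
    }
}
```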
  29. Be careful with regex:
    Pattern pattern = Pattern.compile("My name is (\\w+)");
    Matcher matcher = pattern.matcher("My name is Zoé");
    if (matcher.find()) { System.out.println("Your name is " + matcher.group(1)); }
    Output: Your name is Zo
  30. With the Unicode flag:
    Pattern pattern = Pattern.compile("My name is (\\w+)", Pattern.UNICODE_CHARACTER_CLASS);
    Matcher matcher = pattern.matcher("My name is Zoé");
    if (matcher.find()) { System.out.println("Your name is " + matcher.group(1)); }
    Output: Your name is Zoé
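The two regex slides can be combined into one runnable comparison. In this sketch, `extractName` is a made-up helper that takes the pattern flags as a parameter, and the input uses the precomposed é (U+00E9).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeRegexSketch {
    // Returns the captured name, or null if the input does not match.
    static String extractName(String input, int flags) {
        Matcher matcher = Pattern.compile("My name is (\\w+)", flags).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "My name is Zo\u00E9";

        // Default: \w only matches [a-zA-Z_0-9], so the match stops before the accented letter.
        System.out.println(extractName(input, 0)); // Zo

        // With the flag, \w follows the Unicode definition of a word character.
        System.out.println(extractName(input, Pattern.UNICODE_CHARACTER_CLASS)); // Zoé
    }
}
```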
  31. Finally…databases

  32. RTFM: read the friendly manual, carefully. Does not work for most emojis.
  34. Reference
    Must read: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
    History of Unicode: https://www.translationroyale.com/the-history-of-unicode/
    Some common pitfalls: http://unicode.org/faq/utf_bom.html
    Unicode hacks: https://speakerdeck.com/mathiasbynens/hacking-with-unicode
    Unicode regex pitfalls: http://www.guido-flohr.net/unicode-regex-pitfalls/
  35. Thanks. LINKEDIN: HTTPS://WWW.LINKEDIN.COM/IN/JEREMYCWCHAN