Unicode Essentials

Essential Unicode topics that every software developer should know about. This deck helps minimize the burden of dealing with modern text systems.

Dan Chen

March 08, 2022

Transcript

  2. % node

    Welcome to Node.js v14.11.0.
    > 'helloworld'.split('').reverse().join('')
    'dlrowolleh'
    > '你好世界'.split('').reverse().join('')
    '界世好你'
    > 'hello💩world'.split('').reverse().join('')
    'dlrow��olleh'

    (poop testing)
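The broken reversal in the Node.js session above comes from reversing UTF-16 code units, which swaps the two halves of a surrogate pair. A minimal Python sketch of the same effect; `utf16_units` is a helper name made up for this illustration:

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


s = "hello\U0001F4A9world"  # 'hello💩world'

# Python strings are sequences of codepoints, so slicing keeps the emoji whole.
by_codepoint = s[::-1]

# Reversing code units (what split('') does in JavaScript) puts the low
# surrogate in front of the high surrogate, which is not a valid pair.
units = utf16_units(s)
reversed_units = units[::-1]

print(by_codepoint)
print([hex(u) for u in reversed_units[5:7]])
```

Codepoint-level reversal is still not fully safe: combining characters and ZWJ sequences can be torn apart the same way.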
  3. % node

    Welcome to Node.js v14.11.0.
    > 'helloworld'.length
    10
    > '你好世界'.length
    4
    > 'hello💩world'.length
    12
  4. % python3

    Python 3.9.0 (default, Dec 6 2020, 18:02:34)
    >>> len('helloworld')
    10
    >>> len('hello💩world')
    11
    >>> len('helloworld ')    [the emoji sequence in this string is not preserved in the transcript]
    18
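Node.js and Python disagree above because they count different things. A small sketch that counts one string three ways; `measure` is a hypothetical helper, not part of the deck:

```python
def measure(text):
    """Count one string three ways (illustrative helper)."""
    return {
        "code_points": len(text),                           # what Python's len() reports
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # what JavaScript's .length reports
        "utf8_bytes": len(text.encode("utf-8")),            # typical size on disk or on the wire
    }


for sample in ("helloworld", "你好世界", "hello\U0001F4A9world"):
    print(sample, measure(sample))
```

None of the three numbers is the count of visible characters; that needs grapheme cluster segmentation.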
  5. Life is tough, Unicode is hard. Afraid no more!

    • Essential topics to minimize the burden of dealing with texts
    • Unicode basics
      ◦ Codepoints
      ◦ Encodings
    • Common issues with Unicode
      ◦ Length (bytes, characters)
      ◦ Normalization
      ◦ …
  6. How to represent text in computer world?

    • ASCII (American Standard Code for Information Interchange)
      ◦ Character = 7 bits, e.g., ⟨A⟩ = 65, or 0x41, or 0b1000001
      ◦ Not enough for modern texts (even with extended ASCII)
    • Big-5 (i.e. CP950 on Windows)
      ◦ Character = 16 bits (2 bytes), e.g., ⟨字⟩ = 0xA672
      ◦ Region-specific encoding/codepage
    • Unicode Codepoints
      ◦ Character = 21 bits (logically), e.g., ⟨字⟩ = U+5B57, or No. 23383
      ◦ One standard to rule them all (over the world)
      ◦ Technically, some codepoints do not refer to visible/standalone characters

    (字, /zì/, means “character”)
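The relationship between a character and its codepoint can be checked directly in Python. A minimal sketch using only built-ins; the `big5` codec name is the one shipped with CPython:

```python
ch = "字"
codepoint = ord(ch)            # the Unicode codepoint as an integer
label = f"U+{codepoint:04X}"   # conventional U+XXXX notation

print(codepoint, label)             # decimal number and U+ form of the same codepoint
print(chr(0x41), ord("A"))          # ASCII occupies the first 128 codepoints
print(ch.encode("big5").hex())      # the same character in a region-specific encoding
print((0x10FFFF).bit_length())      # bits needed for the highest possible codepoint
```

The last line is where the "21 bits (logically)" figure comes from: the codepoint range ends at U+10FFFF.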
  7. OK then, How to represent a number in computer world?

    • A unique identifier (codepoint) of a Unicode character = 21 bits (~3 bytes, unsigned)
    • For example, ⟨字⟩ is No. 23383 (U+5B57)
    • “Let's store each character with 4 bytes”
      ◦ Named UCS-4, or UTF-32 (where UCS = Universal Coded Character Set, and UTF = Unicode Transformation Format)
      ◦ The encoded buffer looks like the codepoint sequence, easy peasy!
      ◦ ⟨語言⟩ = ⟨U+8A9E⟩⟨U+8A00⟩ → ⟨9E 8A 00 00⟩⟨00 8A 00 00⟩ (little-endian)
      ◦ ⟨AB⟩ = ⟨U+0041⟩⟨U+0042⟩ → ⟨41 00 00 00⟩⟨42 00 00 00⟩ (little-endian)
      ◦ Not efficient in most cases (where using ASCII or Extended ASCII was/is enough)

    (語言, /yǔ yán/, means “language”)

    ⚠ I'm trying to make it short & simple, hence the information may be imprecise or incorrect. Please refer to Wikipedia and the Unicode website for detailed history.
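The fixed-width layout of UTF-32 is easy to verify: every codepoint becomes exactly 4 bytes, and only the byte order differs between the LE and BE variants. A minimal sketch with Python's built-in codecs:

```python
text = "語言"

le = text.encode("utf-32-le")  # least significant byte first
be = text.encode("utf-32-be")  # most significant byte first

print(le.hex(" "))
print(be.hex(" "))

# Because the width is fixed, decoding by hand is just slicing into 4-byte units.
units = [int.from_bytes(le[i:i + 4], "little") for i in range(0, len(le), 4)]
print([f"U+{u:04X}" for u in units])
```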
  8. OK then, How to represent a number in computer world? (cont.)

    • “Alright. What if we put commonly used characters in the lower 2 bytes?”
      ◦ UCS-2, and later UTF-16 (unlike UCS-4 & UTF-32, these two are different!)
      ◦ ⟨語⟩⟨言⟩ = ⟨U+8A9E⟩⟨U+8A00⟩ → bytes ⟨9E 8A⟩⟨00 8A⟩ (little-endian)
      ◦ ⟨A⟩⟨B⟩ = ⟨U+0041⟩⟨U+0042⟩ → ⟨41 00⟩⟨42 00⟩ (little-endian)
      ◦ Reduces overall document size by 50% (in comparison to UTF-32)
    • “Come on. My ASCII document is still bloated to 200% with a lot of trash.”
      ◦ UTF-8 encodes a character with 1~4 bytes smartly (i.e. variable length)
      ◦ ⟨語言AB⟩ = ⟨U+8A9E⟩⟨U+8A00⟩⟨U+0041⟩⟨U+0042⟩ → ⟨E8 AA 9E⟩⟨E8 A8 80⟩⟨41⟩⟨42⟩
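How UTF-8 picks 1 to 4 bytes can be sketched by hand. The function below is an illustration written for this transcript (`utf8_encode` is a made-up name), not how any real codec is implemented; it follows the standard bit layout and is checked against Python's built-in encoder:

```python
def utf8_encode(codepoint):
    """Encode one codepoint as UTF-8 by hand (illustrative, not optimized)."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if codepoint < 0x80:        # 0xxxxxxx: plain ASCII, 1 byte
        return bytes([codepoint])
    if codepoint < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


encoded = b"".join(utf8_encode(ord(c)) for c in "語言AB")
print(encoded.hex(" "))
```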
  9. Encoded Sequence (Storage/Transport) ↔ Unicode Codepoints (In-Memory) ↔ Unicode Planes (Rendering)

    Encoded sequence: E8 AA 9E E8 A8 80 41 42

    E8 AA 9E ↔ U+8A9E ↔ 語
    E8 A8 80 ↔ U+8A00 ↔ 言
    41       ↔ U+0041 ↔ A
    42       ↔ U+0042 ↔ B

    “A standard to give every character a unique identifier”
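The three layers on this slide map onto three steps in Python: bytes from storage, decoded text in memory, and the codepoint behind each character. A minimal sketch:

```python
raw = bytes.fromhex("E8 AA 9E E8 A8 80 41 42")  # encoded sequence (storage/transport)

text = raw.decode("utf-8")                       # codepoints in memory
codepoints = [f"U+{ord(c):04X}" for c in text]   # one identifier per character

for cp, ch in zip(codepoints, text):
    print(cp, ch)
```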
  10. “No worries! My string is encoded in Unicode”

    • UTF-8
    • UTF-16LE, UTF-16BE
    • UTF-32LE, UTF-32BE
    • UCS-2
    • UCS-4
    • …
  11. % python3

    Python 3.9.0 (default, Dec 6 2020, 18:02:34)
    >>> '語言AB'.encode('utf-16le')
    b'\x9e\x8a\x00\x8aA\x00B\x00'
    >>> '語言AB'[1].encode('utf-32le')
    b'\x00\x8a\x00\x00'
    >>> '語言AB'.encode('utf-8')
    b'\xe8\xaa\x9e\xe8\xa8\x80AB'
    >>> ' '.join(map(hex, '語言AB'.encode('utf-8')))
    '0xe8 0xaa 0x9e 0xe8 0xa8 0x80 0x41 0x42'
  12. UCS-2 vs. UTF-16

    • Unicode needs 21 bits (logically) to represent all characters
      ◦ But UCS-2 only takes 16 bits for each character
      ◦ What if we need to store/display a character that takes more than 16 bits?
      ◦ For example, 💩 = U+1F4A9 = 0b11111010010101001 (17 bits)
    • Augment UCS-2 with variable length capability → UTF-16
    • Unicode defines planes
      ◦ Plane = Group of 65536 (2^16) codepoints
      ◦ BMP (Basic Multilingual Plane), or simply Plane 0, defines commonly used characters
      ◦ A codepoint outside the BMP takes more than 16 bits
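Since each plane holds 2^16 codepoints, the plane number is simply the codepoint shifted right by 16 bits. A short sketch; `plane_of` and `fits_in_ucs2` are names invented for this example:

```python
def plane_of(ch):
    """Plane index of a single character: 0 is the BMP, 1 is the SMP, and so on."""
    return ord(ch) >> 16


def fits_in_ucs2(ch):
    """True if the character can be stored in one 16-bit unit."""
    return ord(ch) <= 0xFFFF


for ch in ("A", "字", "\U0001F4A9"):
    print(f"U+{ord(ch):04X}", plane_of(ch), fits_in_ucs2(ch), ord(ch).bit_length())
```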
  13. Surrogate Pairs

    • Take 4 bytes (rather than 2 bytes) in UTF-16 to represent a codepoint outside the BMP
    • A high surrogate (U+D800~U+DBFF) combines with a low surrogate (U+DC00~U+DFFF)
    • Some (classic) JavaScript APIs behave like UCS-2
      ◦ While some other JavaScript APIs behave like UTF-16 (i.e. surrogate-pair aware)
      ◦ Google V8 internally uses UTF-16 (ref. v8/heap/factory.cc)

    % node
    Welcome to Node.js v14.11.0.
    > '💩'.length
    2
    > [0,1].map(i => '💩'.charCodeAt(i).toString(16))
    [ 'd83d', 'dca9' ]
    > [...'Hello 💩']
    [ 'H', 'e', 'l', 'l', 'o', ' ', '💩' ]
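The `d83d dca9` pair in the Node.js output follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch of that calculation (function names are illustrative):

```python
def to_surrogates(codepoint):
    """Split a codepoint outside the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints outside the BMP use surrogate pairs")
    offset = codepoint - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F4A9)
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```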
  14. Given a “Unicode” file, how do you tell its encoding?

    • UTF-8, UCS-2, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, etc.
    • BOM (Byte Order Mark): ⟨U+FEFF⟩
      ◦ Specify endianness → E.g., UTF-16BE = ⟨0xFE 0xFF⟩, UTF-16LE = ⟨0xFF 0xFE⟩
      ◦ Specify encoding → E.g., UTF-8 = ⟨0xEF 0xBB 0xBF⟩
    • Case Study (...)
      ◦ Microsoft Excel opens CSV with the “system encoding” (Windows settings)
      ◦ E.g., CP950 (Big-5), unless the BOM is present

    #!python3
    import csv

    # Plain UTF-8, no BOM: Excel falls back to the system encoding
    with open(p, 'w', encoding='utf-8') as f:
        w = csv.DictWriter(…)

    # ---- vs ----

    # UTF-8 with BOM: Excel detects UTF-8
    with open(p, 'w', encoding='utf-8-sig') as f:
        w = csv.DictWriter(…)
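BOM sniffing can be sketched with the constants in Python's `codecs` module. `sniff_bom` is a hypothetical helper; it only recognizes a leading BOM and says nothing about files without one:

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0) if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


with_bom = "a,b\n".encode("utf-8-sig")   # 'utf-8-sig' prepends EF BB BF
print(with_bom[:3].hex(" "), sniff_bom(with_bom))
print(sniff_bom("a,b\n".encode("utf-8")))
```

One known ambiguity of this approach: a UTF-16LE file whose first character after the BOM is U+0000 looks like UTF-32LE.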
  16. Unicode Equivalence / Normalization

    • Canonically equivalent codepoint sequences
      ◦ Same appearance & meaning when printed or displayed
      ◦ For example, the diacritics (accents) and the Hangul syllables
    • Compatible codepoint sequences
      ◦ Possibly distinct appearance, but same meaning in some contexts
      ◦ For example, ⟨ﬀ⟩ U+FB00 is compatible with two ordinary Latin ⟨f⟩ U+0066 letters
      ◦ Another example, ligature ⟨㍿⟩ U+337F (/kabushiki gaisha/)
      ◦ Ref. CJK Compatibility
    • Normal form (Normalization)
      ◦ Reduce equivalent codepoint sequences to the same codepoints
      ◦ NFD, NFC, NFKD, NFKC

    % python3
    >>> from unicodedata import normalize
    >>> '\uC77C', '\uC774\u11AF'
    ('일', '일')
    >>> '\uC77C' == '\uC774\u11AF'
    False
    >>> '\uC77C' == normalize('NFC', '\uC774\u11AF')
    True
    >>> normalize('NFD', '\uC77C') == '\uC774\u11AF'
    False
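The difference between canonical and compatibility normalization shows up clearly on the examples from this slide. A minimal sketch with `unicodedata.normalize`; `same_text` is a helper name invented here:

```python
from unicodedata import normalize

composed = "\u00E9"      # é as a single codepoint
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normal form."""
    return normalize(form, a) == normalize(form, b)


print(composed == decomposed, same_text(composed, decomposed))

# Canonical forms (NFC/NFD) leave compatibility characters alone;
# the K forms (NFKC/NFKD) also replace them.
print(normalize("NFC", "\uFB00"), normalize("NFKC", "\uFB00"))
print(normalize("NFKC", "\u337F"))

# NFD fully decomposes a Hangul syllable into its individual jamo, which is
# why the last comparison in the session above is False.
print([f"U+{ord(c):04X}" for c in normalize("NFD", "\uC77C")])
```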
  17. Emoji ❤💀🤖

    • Emoji codepoints reside in SMP (Supplementary Multilingual Plane)
      ◦ Surrogate pairs in UTF-16 (2 × 2 bytes), 4 bytes in UTF-8
      ◦ Poop testing 💩 (e.g., in Node.js, "💩".length gives 2)
    • ZWJ (zero-width joiner, U+200D) Sequence
      ◦ Man Technologist = a combination of 👨 U+1F468, ⟨ZWJ⟩ U+200D, and 💻 U+1F4BB
      ◦ Family = 👨 + ⟨ZWJ⟩ + 👩 + ⟨ZWJ⟩ + 👧 + ⟨ZWJ⟩ + 👦
      ◦ Kiss ❤ 💋 = 👨 + 🏾 + ⟨ZWJ⟩ + ❤ + ⟨Variation Selector-16⟩ U+FE0F + ⟨ZWJ⟩ + 💋 + ⟨ZWJ⟩ + 👩 + 🏻
    • Skin Tone Modifiers
      ◦ 5 modifiers (U+1F3FB~U+1F3FF) based on the 6-level Fitzpatrick scale (types I and II share one modifier)
      ◦ Woman (Light Skin & Curly Hair) = 👩 U+1F469 🏻 U+1F3FB ⟨ZWJ⟩ U+200D 🦱 U+1F9B1
    • Flags
      ◦ Taiwan = ⟨🇹⟩ U+1F1F9 ⟨🇼⟩ U+1F1FC

    % python3    [the emoji inside the quotes below are not preserved in the transcript, except the first]
    >>> len('💩')
    1
    >>> len(' ')
    3
    >>> len(' ')
    4
    >>> len(' ')
    7
    >>> len(' ')
    2
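The sequences described in the bullets above can be rebuilt from escapes, which avoids depending on how an editor or terminal renders them. A minimal sketch; the dictionary labels are only descriptive text chosen for this example:

```python
ZWJ = "\u200D"  # zero-width joiner

samples = {
    "pile of poo": "\U0001F4A9",
    "man technologist": "\U0001F468" + ZWJ + "\U0001F4BB",
    "woman, light skin, curly hair": "\U0001F469\U0001F3FB" + ZWJ + "\U0001F9B1",
    "family": ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"]),
    "flag of Taiwan": "\U0001F1F9\U0001F1FC",
}

for label, s in samples.items():
    utf16_units = len(s.encode("utf-16-le")) // 2
    # len(s) counts codepoints; each sample still displays as one visible emoji
    # on platforms that support the sequence.
    print(f"{label}: {len(s)} codepoints, {utf16_units} UTF-16 units",
          [f"U+{ord(c):04X}" for c in s])
```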
  18. Programming with Unicode (the simple version)

    • Python 2 → Stop using it, and migrate to Python 3
    • Python 3
      ◦ len(str) and [x for x in str] operate on codepoints (beware of combining characters ⚠)
      ◦ Rule of 🍔 → ① bytes#decode to text on input, ② process text, ③ str#encode to bytes on output
      ◦ Use unicodedata#normalize and str#casefold for display/sorting/comparison
      ◦ Sorting (standard sorted) may require locale#setlocale, or just import pyuca
      ◦ Use grapheme#length to count visible characters
      ◦ Ref. Unicode HOWTO and Unicode Objects and Codecs (Python 3 official documentation)
    • JavaScript
      ◦ String#length counts UTF-16 code units (UCS-2-like behavior), while for … of str and [...str] operate on codepoints
      ◦ Beware of surrogate pairs (in some APIs) and combining characters ⚠
      ◦ Consider String#normalize, String#localeCompare, and String#toLocaleLowerCase
      ◦ Use GraphemeSplitter#countGraphemes to count visible characters
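The Rule of 🍔 and the normalize-plus-casefold advice can be combined in one small sketch. Both function names are hypothetical, and NFKC followed by casefold is a simplification of full Unicode caseless matching that is good enough for illustration:

```python
from unicodedata import normalize


def canonical_key(text):
    """Key for caseless comparison: NFKC normalization, then casefold (simplified)."""
    return normalize("NFKC", text).casefold()


def dedupe_lines(raw, encoding="utf-8"):
    """Drop lines that are equal under canonical_key, keeping the first spelling."""
    text = raw.decode(encoding)               # 1. bytes -> text on input
    seen, kept = set(), []
    for line in text.splitlines():            # 2. process text, never raw bytes
        key = canonical_key(line)
        if key and key not in seen:
            seen.add(key)
            kept.append(line)
    return "\n".join(kept).encode(encoding)   # 3. text -> bytes on output


raw = "Straße\nSTRASSE\ncafe\u0301\ncaf\u00e9\n".encode("utf-8")
print(dedupe_lines(raw).decode("utf-8"))
```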
  19. Fun Programming Tricks (though you may not need these)

    % python3
    Python 3.9.0 (default, Dec 6 2020, 18:02:34)
    >>> import unicodedata
    >>> '\N{PILE OF POO}'
    '💩'
    >>> x = '⑧⑥'
    >>> int(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: '⑧⑥'
    >>> sum([unicodedata.numeric(x[i]) * 10**(len(x)-i-1)
    ...      for i in range(len(x))])
    86.0
  20. Advanced Topics (not covered today)

    • Zalgo text (creepy/glitchy appearance with diacritics)
    • Collation (ordering strings by language rules rather than byte-to-byte comparison, e.g., by radical (部首) or strokes (筆劃) for Chinese)
    • Casefolding (aggressive “lower()” for non-Latin caseless comparison)
    • CJK Unified Ideographs (For ⟨㋿⟩, which one to use → ⟨令 U+4EE4⟩ vs ⟨令 U+F9A8⟩)
    • IDN homograph attack (phishing scam)
    • Regular Expressions (e.g., r"\p{InCJK_Unified_Ideographs}")
    • Transliteration (letter swap rules, e.g., Greek ⟨α⟩ → ⟨a⟩, sometimes used in SMS)
    • Punycode (represent Unicode characters in ASCII subset, e.g., ⟨中文⟩ → ⟨xn--fiq228c⟩)
    • Right-to-left mark ⟨U+200F⟩ (e.g., ⟨U+0623 U+0647 U+0644 U+0627! U+200F⟩ → ⟨!ﻼھأ⟩)
    • Localization (with Unicode CLDR) → Unicode is not just about characters
    • Databases: MySQL (utf8mb4), and other languages: C, C++, Swift, Go, HTML, etc.
    • Security issues related to Unicode processing (e.g., CVE-2018-4290 – crashes iPhone)
    • … (Unicode is too hard to master)
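A few of the topics listed above can be poked at with the standard library alone. This is only a taste, assuming Python's built-in `idna` codec (which implements the older IDNA 2003 rules) is acceptable for illustration:

```python
import unicodedata

# Punycode / IDN: the ASCII-compatible form of a non-ASCII domain label.
domain = "中文"
ascii_form = domain.encode("idna")
round_trip = ascii_form.decode("idna")
print(ascii_form, round_trip)  # compare with the xn-- form quoted on the slide

# Casefolding is more aggressive than lower().
print("ß".lower(), "ß".casefold())

# IDN homograph: these two letters look alike but are different codepoints.
latin_a, cyrillic_a = "a", "\u0430"
print(latin_a == cyrillic_a,
      unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))
```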
  21. Further Readings

    • Unicode Solutions in Python 2 and Python 3 (slides) 👍
    • Unicode, JavaScript and the Emoji family (slides) 👍
    • What Every Programmer Absolutely Needs To Know About Encodings & Character Sets (blog post) 👍
    • I � Unicode (slides) 👍
    • Hacking with Unicode in 2016 (slides) 👍
    • Python 3 and Unicode (slides)
    • Counting characters – Twitter Developer Platform (blog post)
    • International Components for Unicode (slides)
    • On Support for the New Japanese Era Name, Microsoft Japan Co., Ltd. (slides, in Japanese)
    • Unicode Explained (book, O'Reilly, 2006)
    • Unicode Demystified (book, O'Reilly, 2002)
    • CJKV Information Processing, 2/e (book, O'Reilly, 2008)
    • rust-lang/rust #12056 path: Windows paths may contain non-utf8-representable sequences (GitHub)
    • Unicode Support – The Linux kernel user's & admin's guide (documentation)
  22. Useful Online Tools

    • Compart Unicode Lookup
    • Unicode search tool (scarfboy.com)
    • Emojipedia
    • Shapecatcher (recognize handwritten Unicode characters)
    • 古今文字集成 (ancient Chinese characters lookup)
    • Unicode tools curation (unicode.org)
  23. Bonus: Encoding Option in Windows Notepad

    • ANSI → System Locale
    • Unicode → UTF-16LE
    • Unicode big endian → UTF-16BE
    • UTF-8 → UTF-8
  24. Bonus: Encoding Option in Windows Notepad (cont.)

    • ANSI → System Locale
    • UTF-16 LE
    • UTF-16 BE
    • UTF-8
    • UTF-8 with BOM

    Things seem to get better in the latest Windows (20H2~)