Unicode Essentials - Speaker Deck

Slide 1

Slide 1 text

Unicode Essentials
Something you need to know about modern text systems
Dan Chen / 2022-03-08

Slide 2

Slide 2 text

Q1: How do you reverse a string in JavaScript?

Slide 3

Slide 3 text

A: 'helloworld'.split('').reverse().join('')

Slide 4

Slide 4 text

Slide 5

Slide 5 text

% node
Welcome to Node.js v14.11.0.
> 'helloworld'.split('').reverse().join('')
'dlrowolleh'
> '你好世界'.split('').reverse().join('')
'界世好你'
> 'hello💩world'.split('').reverse().join('')
'dlrow��olleh'
Poop testing: reversing by UTF-16 code unit splits 💩 into two unpaired surrogates.
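
The breakage above can be reproduced outside of JavaScript. The following is a minimal Python sketch, assuming we model a JavaScript string as a list of 16-bit UTF-16 code units; the helper names (`utf16_units`, `reverse_by_code_unit`, `reverse_by_codepoint`) are made up for this illustration.

```python
def utf16_units(text):
    """Split text into 2-byte UTF-16 code units (little-endian)."""
    raw = text.encode('utf-16-le')
    return [raw[i:i + 2] for i in range(0, len(raw), 2)]

def reverse_by_code_unit(text):
    """Reverse code unit by code unit, which swaps the halves of a surrogate pair."""
    reversed_raw = b''.join(reversed(utf16_units(text)))
    # The swapped surrogates are no longer valid UTF-16, so decode them
    # with errors='replace' to get U+FFFD instead of an exception.
    return reversed_raw.decode('utf-16-le', errors='replace')

def reverse_by_codepoint(text):
    """Reverse codepoint by codepoint (Python str is a codepoint sequence)."""
    return text[::-1]

sample = 'hello\U0001F4A9world'
print(len(utf16_units(sample)))       # 12
print(reverse_by_code_unit(sample))   # the emoji is gone, replaced by U+FFFD
print(reverse_by_codepoint(sample))   # the emoji survives
```

Reversing by codepoint still breaks combining characters and ZWJ sequences, so this is only the first layer of the problem.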

Slide 6

Slide 6 text

Q2: How do you get a string's length in JavaScript?

Slide 7

Slide 7 text

A: 'helloworld'.length

Slide 8

Slide 8 text

% node
Welcome to Node.js v14.11.0.
> 'helloworld'.length
10
> '你好世界'.length
4
> 'hello💩world'.length
12
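
"Length" depends on the unit being counted. A small Python sketch, with a hypothetical helper `lengths`, that reports the same string in codepoints, UTF-16 code units (what JavaScript's `.length` counts), and UTF-8 bytes:

```python
def lengths(text):
    """Measure one string in three different units."""
    return {
        'codepoints': len(text),                            # Python's len()
        'utf16_units': len(text.encode('utf-16-le')) // 2,  # JavaScript's .length
        'utf8_bytes': len(text.encode('utf-8')),            # size on disk as UTF-8
    }

print(lengths('helloworld'))
print(lengths('你好世界'))
print(lengths('hello\U0001F4A9world'))
```

None of the three is the number of characters a reader sees; that needs grapheme cluster segmentation.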

Slide 9

Slide 9 text

“JavaScript is lame. That's why I use Python”

Slide 10

Slide 10 text

% python3
Python 3.9.0 (default, Dec 6 2020, 18:02:34)
>>> len('helloworld')
10
>>> len('hello💩world')
11
>>> len('helloworld ')
18

Slide 11

Slide 11 text

Life is tough, Unicode is hard. Afraid no more!
● Essential topics to minimize the burden of dealing with text
● Unicode basics
○ Codepoints
○ Encodings
● Common issues with Unicode
○ Length (bytes, characters)
○ Normalization
○ …

Slide 12

Slide 12 text

“A standard to give every character a unique identifier”

Slide 13

Slide 13 text

How to represent text in the computer world?
● ASCII (American Standard Code for Information Interchange)
○ Character = 7 bits, e.g., ⟨A⟩ = 65, or 0x41, or 0b1000001
○ Not enough for modern text (even with the extended variants)
● Big-5 (i.e. CP950 on Windows)
○ Character = 16 bits (2 bytes), e.g., ⟨字⟩ = 0xA672
○ Region-specific encoding/codepage
● Unicode Codepoints
○ Character = 21 bits (logically), e.g., ⟨字⟩ = U+5B57, or No. 23383
○ One standard to rule them all (all over the world)
○ Technically, some codepoints do not refer to visible/standalone characters
Note: 字 /zì/ = character
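
The identifiers on this slide can be checked in a Python session. This is a small sketch using only built-ins (`ord`, `chr`, and the standard `big5` codec); nothing here is specific to the talk.

```python
ch = '字'
cp = ord(ch)                       # codepoint as an integer
print(cp, hex(cp), f'U+{cp:04X}')  # 23383 0x5b57 U+5B57
print(chr(0x5B57))                 # back from codepoint to character

print(ord('A'), bin(ord('A')))     # 65 0b1000001 (fits in 7 bits)

# The same character has a different, region-specific number in Big-5.
print(ch.encode('big5').hex())     # a672
```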

Slide 14

Slide 14 text

OK then, how to represent a number in the computer world?
● A unique identifier (codepoint) of a Unicode character = 21 bits (~3 bytes unsigned)
● For example, ⟨字⟩ is at No. 23383 (U+5B57)
● “Let's store each character with 4 bytes”
○ Named UCS-4, or UTF-32 (where UCS = Universal Coded Character Set, and UTF = Unicode Transformation Format)
○ The encoded buffer looks like the codepoint sequence, easy peasy!
○ ⟨語言⟩ = ⟨U+8A9E⟩⟨U+8A00⟩ → ⟨9E 8A 00 00⟩⟨00 8A 00 00⟩ (little-endian)
○ ⟨AB⟩ = ⟨U+0041⟩⟨U+0042⟩ → ⟨41 00 00 00⟩⟨42 00 00 00⟩
○ Not efficient in most cases (where using ASCII or Extended ASCII was/is enough)
Note: 語言 /yǔ yán/ = language
⚠ I'm trying to keep this short and simple, so the information may be imprecise or incorrect. Please refer to Wikipedia and the Unicode website for the detailed history.
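
"The encoded buffer looks like the codepoint sequence" can be taken literally. A minimal sketch, assuming little-endian byte order and a hypothetical helper `utf32le_bytes`, that writes each codepoint as a 4-byte integer and compares the result with Python's own `utf-32-le` codec:

```python
def utf32le_bytes(text):
    """Encode text as UTF-32LE by writing each codepoint as a 4-byte integer."""
    return b''.join(ord(ch).to_bytes(4, 'little') for ch in text)

print(utf32le_bytes('語言').hex(' '))  # 9e 8a 00 00 00 8a 00 00
print(utf32le_bytes('AB').hex(' '))    # 41 00 00 00 42 00 00 00
```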

Slide 15

Slide 15 text

OK then, how to represent a number in the computer world? (cont.)
● “Alright. What if we put commonly used characters in the lower 2 bytes?”
○ UCS-2, and later UTF-16 (unlike UCS-4 & UTF-32, these two are different!)
○ ⟨語⟩⟨言⟩ = ⟨U+8A9E⟩⟨U+8A00⟩ → bytes ⟨9E 8A⟩⟨00 8A⟩ (little-endian)
○ ⟨A⟩⟨B⟩ = ⟨U+0041⟩⟨U+0042⟩ → ⟨41 00⟩⟨42 00⟩
○ Reduces overall document size by 50% (in comparison to UTF-32)
● “Come on. My ASCII document is still bloated to 200% with a lot of trash.”
○ UTF-8 smartly encodes a character with 1~4 bytes (i.e. variable length)
○ ⟨語言AB⟩ = ⟨U+8A9E⟩⟨U+8A00⟩⟨U+0041⟩⟨U+0042⟩ → ⟨E8 AA 9E⟩⟨E8 A8 80⟩⟨41⟩⟨42⟩
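
How UTF-8 picks 1 to 4 bytes is easier to see in code. The following is a simplified sketch of the bit packing, not a replacement for `str.encode`: the function name `utf8_encode_codepoint` is made up here, and the sketch does not reject surrogate codepoints the way a real encoder must.

```python
def utf8_encode_codepoint(cp):
    """Pack one codepoint into 1-4 bytes using the UTF-8 bit layout."""
    if cp < 0x80:       # 7 bits  -> 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode_codepoint(0x8A9E).hex(' '))  # e8 aa 9e
print(utf8_encode_codepoint(0x41).hex(' '))    # 41

# Size of the same ASCII text in the three encodings.
for codec in ('utf-32-le', 'utf-16-le', 'utf-8'):
    print(codec, len('AB'.encode(codec)))
```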

Slide 16

Slide 16 text

Rule of thumb → favor UTF-8 by default
Ref. https://zh.wikipedia.org/wiki/UTF-8#/media/File:Unicode_Web_growth.svg

Slide 17

Slide 17 text

“A standard to give every character a unique identifier”
Encoded Sequence (Storage/Transport): E8 AA 9E E8 A8 80 41 42
Unicode Codepoints (In-Memory): E8 AA 9E ↔ U+8A9E, E8 A8 80 ↔ U+8A00, 41 ↔ U+0041, 42 ↔ U+0042
Unicode Planes (Rendering): U+8A9E ↔ 語, U+8A00 ↔ 言, U+0041 ↔ A, U+0042 ↔ B

Slide 18

Slide 18 text

“No worries! My string is encoded in Unicode”
UTF-8
UTF-16LE, UTF-16BE
UTF-32LE, UTF-32BE
UCS-2
UCS-4
…

Slide 19

Slide 19 text

% python3
Python 3.9.0 (default, Dec 6 2020, 18:02:34)
>>> '語言AB'.encode('utf-16le')
b'\x9e\x8a\x00\x8aA\x00B\x00'
>>> '語言AB'[1].encode('utf-32le')
b'\x00\x8a\x00\x00'
>>> '語言AB'.encode('utf-8')
b'\xe8\xaa\x9e\xe8\xa8\x80AB'
>>> ' '.join(map(hex, '語言AB'.encode('utf-8')))
'0xe8 0xaa 0x9e 0xe8 0xa8 0x80 0x41 0x42'

Slide 20

Slide 20 text

UCS-2 vs. UTF-16
● Unicode needs 21 bits (logically) to represent all characters
○ But UCS-2 only takes 16 bits for each character
○ What if we need to store/display a character that takes more than 16 bits?
○ For example, 💩 = U+1F4A9 = 0b11111010010101001 (17 bits)
● Augment UCS-2 with variable-length capability → UTF-16
● Unicode defines planes
○ Plane = group of 65536 (2^16) codepoints
○ The BMP (Basic Multilingual Plane), or simply Plane 0, defines commonly used characters
○ A codepoint outside the BMP takes more than 16 bits
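
Since a plane is a block of 2^16 codepoints, the plane number is just the codepoint shifted right by 16 bits. A short sketch; `plane` and `in_bmp` are hypothetical helper names:

```python
def plane(cp):
    """Plane number of a codepoint (each plane holds 2**16 codepoints)."""
    return cp >> 16

def in_bmp(ch):
    """True if the character fits in 16 bits, i.e. lives in Plane 0."""
    return plane(ord(ch)) == 0

print(plane(0x5B57), in_bmp('字'))            # 0 True
print(plane(0x1F4A9), in_bmp('\U0001F4A9'))   # 1 False
print((0x1F4A9).bit_length())                 # 17
```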

Slide 21

Slide 21 text

Surrogate Pairs
● Take 4 bytes (rather than 2 bytes) in UTF-16 to represent a codepoint outside the BMP
● A high surrogate (U+D800~U+DBFF) combines with a low surrogate (U+DC00~U+DFFF)
● Some (classic) JavaScript APIs behave like UCS-2
○ While some other JavaScript APIs behave like UTF-16 (i.e. they are surrogate-pair aware)
○ Google V8 internally uses UTF-16 (ref. v8/heap/factory.cc)
% node
Welcome to Node.js v14.11.0.
> '💩'.length
2
> [0,1].map(i => '💩'.charCodeAt(i).toString(16))
[ 'd83d', 'dca9' ]
> [...'Hello 💩']
[ 'H', 'e', 'l', 'l', 'o', ' ', '💩' ]
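
The `d83d` / `dca9` values above come from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch of that calculation, with hypothetical function names `to_surrogates` and `from_surrogates`:

```python
def to_surrogates(cp):
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('only codepoints outside the BMP need surrogates')
    offset = cp - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Combine a surrogate pair back into one codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F4A9)
print(hex(high), hex(low))               # 0xd83d 0xdca9
print(hex(from_surrogates(high, low)))   # 0x1f4a9
```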

Slide 22

Slide 22 text

Given a “Unicode” file, how do you tell its encoding?
● UTF-8, UCS-2, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, etc.
● BOM (Byte Order Mark): ⟨U+FEFF⟩
○ Specifies endianness → E.g., UTF-16BE = ⟨0xFE 0xFF⟩, UTF-16LE = ⟨0xFF 0xFE⟩
○ Specifies encoding → E.g., UTF-8 = ⟨0xEF 0xBB 0xBF⟩
● Case Study (...)
○ Microsoft Excel opens CSV with the “system encoding” (Windows settings)
○ E.g., CP950 (Big-5), unless the BOM is present
#!python3
with open(p, 'w', encoding='utf-8') as f:
    w = csv.DictWriter(…)
# ---- vs ----
with open(p, 'w', encoding='utf-8-sig') as f:
    w = csv.DictWriter(…)
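
Checking for a BOM is a prefix comparison against a few known byte strings. A minimal sketch using the constants from the standard `codecs` module; the function name `sniff_bom` and the returned labels are choices made for this example, and a file without a BOM simply yields `None`.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
# starts with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

print(sniff_bom('語言'.encode('utf-8-sig')))  # utf-8-sig
print(sniff_bom('語言'.encode('utf-8')))      # None
print(codecs.BOM_UTF16_BE.hex(' '))           # fe ff
```

A missing BOM proves nothing, which is why Excel falls back to the system encoding in the case study above.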

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Unicode Equivalence / Normalization
● Canonically equivalent codepoint sequences
○ Same appearance & meaning when printed or displayed
○ For example, diacritics (accents) and Hangul syllables
● Compatible codepoint sequences
○ Possibly distinct appearance, but same meaning in some contexts
○ For example, ⟨ﬀ⟩ U+FB00 is compatible with two ordinary Latin ⟨f⟩ U+0066 letters
○ Another example, the ligature ⟨㍿⟩ U+337F (/kabushiki gaisha/)
○ Ref. CJK Compatibility
● Normal form (Normalization)
○ Reduces equivalent codepoint sequences to the same codepoints
○ NFD, NFC, NFKD, NFKC
% python3
>>> from unicodedata import normalize
>>> '\uC77C', '\uC774\u11AF'
('일', '일')
>>> '\uC77C' == '\uC774\u11AF'
False
>>> '\uC77C' == normalize('NFC', '\uC774\u11AF')
True
>>> normalize('NFD', '\uC77C') == '\uC774\u11AF'
False
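
The difference between the canonical forms (NFC/NFD) and the compatibility forms (NFKC/NFKD) shows up on the slide's own examples. A short sketch using only `unicodedata.normalize`; the last line also shows why the final comparison above is `False`, since NFD decomposes the syllable all the way down to three jamo.

```python
from unicodedata import normalize

composed = '\u00E9'      # é as a single codepoint
decomposed = 'e\u0301'   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                    # False
print(normalize('NFC', decomposed) == composed)  # True
print(normalize('NFD', composed) == decomposed)  # True

# Canonical forms keep compatibility characters; the K forms replace them.
print(normalize('NFC', '\uFB00') == '\uFB00')    # True, the ligature survives
print(normalize('NFKC', '\uFB00'))               # ff
print(normalize('NFKC', '\u337F'))               # 株式会社

# NFD of a Hangul syllable yields its individual jamo.
print([f'U+{ord(c):04X}' for c in normalize('NFD', '\uC77C')])
```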

Slide 25

Slide 25 text

Emoji ❤💀🤖
● Emoji codepoints reside in the SMP (Supplementary Multilingual Plane)
○ Surrogate pairs in UTF-16 (2 × 2 bytes), 4 bytes in UTF-8
○ Poop testing 💩 (e.g., in Node.js, "💩".length gives 2)
● ZWJ (zero-width joiner, U+200D) Sequence
○ Man Technologist = a combination of 👨 U+1F468, ⟨ZWJ⟩ U+200D, and 💻 U+1F4BB
○ Family = 👨 + ⟨ZWJ⟩ + 👩 + ⟨ZWJ⟩ + 👧 + ⟨ZWJ⟩ + 👦
○ Kiss = 👨 + 🏾 + ⟨ZWJ⟩ + ❤ + ⟨Variation Selector-16⟩ U+FE0F + ⟨ZWJ⟩ + 💋 + ⟨ZWJ⟩ + 👩 + 🏻
● Skin Tone Modifiers
○ 5 modifiers based on the 6-type Fitzpatrick scale (types 1 and 2 share one modifier)
○ Woman (Light Skin & Curly Hair) = 👩 U+1F469 🏻 U+1F3FB ⟨ZWJ⟩ U+200D 🦱 U+1F9B1
● Flags
○ Taiwan = ⟨🇹⟩ U+1F1F9 ⟨🇼⟩ U+1F1FC
% python3
>>> len('💩')
1
>>> len('👨‍💻')
3
>>> len('👩🏻‍🦱')
4
>>> len('👨‍👩‍👧‍👦')
7
>>> len('🇹🇼')
2
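
The sequences listed on this slide can be assembled from escapes, which makes the individual codepoints visible even when a terminal or editor renders them as one glyph. A sketch using only the codepoints named above; the variable names are arbitrary.

```python
ZWJ = '\u200D'  # zero-width joiner

man_technologist = '\U0001F468' + ZWJ + '\U0001F4BB'
family = ZWJ.join(['\U0001F468', '\U0001F469', '\U0001F467', '\U0001F466'])
woman_curly_hair = '\U0001F469' + '\U0001F3FB' + ZWJ + '\U0001F9B1'
flag_tw = '\U0001F1F9' + '\U0001F1FC'  # regional indicators T + W

for label, seq in [('man technologist', man_technologist),
                   ('family', family),
                   ('woman, light skin, curly hair', woman_curly_hair),
                   ('flag', flag_tw)]:
    print(label,
          len(seq),                            # codepoints
          len(seq.encode('utf-16-le')) // 2,   # UTF-16 code units
          len(seq.encode('utf-8')))            # UTF-8 bytes
```

Each of these renders as a single visible character where the font supports it, yet none of the three counts is 1.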

Slide 26

Slide 26 text

Programming with Unicode (the simple version)
● Python 2 → Stop using it, and migrate to Python 3
● Python 3
○ len(str) and [x for x in str] operate on codepoints (beware of combining characters ⚠)
○ Rule of 🍔 → ① bytes#decode to text on input, ② process text, ③ str#encode to bytes on output
○ Use unicodedata#normalize and str#casefold for display/sorting/comparison
○ Sorting (standard sorted) may require locale#setlocale, or just import pyuca
○ Use grapheme#length to count visible characters
○ Ref. Unicode HOWTO and Unicode Objects and Codecs (Python 3 official documentation)
● JavaScript
○ String#length counts UTF-16 code units (UCS-2 style), while for … of str and [...str] operate on codepoints
○ Beware of surrogate pairs (in some APIs) and combining characters ⚠
○ Consider String#normalize, String#localeCompare, and String#toLocaleLowerCase
○ Use GraphemeSplitter#countGraphemes to count visible characters
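
The Python advice above fits in a few lines. This is a minimal sketch of the "sandwich" (decode on input, work on text, encode on output) combined with normalization and case folding for comparison; `comparison_key` is a hypothetical helper, and the choice of NFC here is an assumption, since stricter caseless matching may need more normalization passes.

```python
from unicodedata import normalize

def comparison_key(text):
    """Key for caseless, normalization-insensitive comparison."""
    return normalize('NFC', text).casefold()

# 1. Decode bytes to text at the boundary.
raw_input_bytes = b'Stra\xc3\x9fe'   # 'Straße' encoded as UTF-8
text = raw_input_bytes.decode('utf-8')

# 2. Process text as str only.
print(comparison_key(text) == comparison_key('STRASSE'))        # True
print(comparison_key('e\u0301') == comparison_key('\u00C9'))    # True

# 3. Encode back to bytes on output.
output_bytes = text.upper().encode('utf-8')
print(output_bytes)
```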

Slide 27

Slide 27 text

Fun Programming Tricks (though you may not need these)
% python3
Python 3.9.0 (default, Dec 6 2020, 18:02:34)
>>> import unicodedata
>>> '\N{PILE OF POO}'
'💩'
>>> x = '⑧⑥'
>>> int(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '⑧⑥'
>>> sum([unicodedata.numeric(x[i]) * 10**(len(x)-i-1) \
...      for i in range(len(x))])
86.0
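
Two related tricks, offered as a sketch rather than as part of the original slide: character names can be looked up in both directions, and compatibility normalization (NFKC) turns circled digits into plain digits that `int()` accepts.

```python
import unicodedata

poo = '\N{PILE OF POO}'
print(unicodedata.name(poo))             # PILE OF POO
print(unicodedata.lookup('PILE OF POO') == poo)

x = '\u2467\u2465'                       # circled digits eight and six
plain = unicodedata.normalize('NFKC', x)
print(plain, int(plain))                 # 86 86

# Per-character alternative: read the digit value of each character.
print([unicodedata.digit(ch) for ch in x])
```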

Slide 28

Slide 28 text

Advanced Topics (not covered today)
● Zalgo text (creepy/glitchy appearance with diacritics)
● Collation (ordering strings by more than byte-to-byte comparison, e.g., by radical or stroke count (部首筆劃) for Chinese, ref)
● Casefolding (aggressive “lower()” for caseless comparison beyond Latin)
● CJK Unified Ideographs (For ⟨㋿⟩, which one to use → ⟨令 U+4EE4⟩ vs ⟨令 U+F9A8⟩)
● IDN homograph attack (phishing scam)
● Regular Expressions (e.g., r"\p{InCJK_Unified_Ideographs}")
● Transliteration (letter swap rules, e.g., Greek ⟨α⟩ → ⟨a⟩, sometimes used in SMS)
● Punycode (represent Unicode characters in an ASCII subset, e.g., ⟨中文⟩ → ⟨xn--fiq228c⟩)
● Right-to-left mark ⟨U+200F⟩ (e.g., ⟨U+0623 U+0647 U+0644 U+0627! U+200F⟩ → ⟨!ﻼھأ⟩)
● Localization (with Unicode CLDR) → Unicode is not just about characters
● Databases: MySQL (utf8mb4), and other languages: C, C++, Swift, Go, HTML, etc.
● Security issues related to Unicode processing (e.g., CVE-2018-4290 – crashes iPhone)
● … (Unicode is too hard to master)
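
Two of these topics can be poked at with the standard library alone. The sketch below uses Python's built-in `punycode` and `idna` codecs for the Punycode example from the list, and `unicodedata.name` to show why homographs are hard to spot by eye; note that the built-in `idna` codec implements the older IDNA 2003 rules, so treat it as an illustration only.

```python
import unicodedata

label = '中文'
print(label.encode('punycode'))        # b'fiq228c'
print(label.encode('idna'))            # b'xn--fiq228c'
print(b'xn--fiq228c'.decode('idna'))   # 中文

# Homographs: these two letters look alike but are different codepoints.
latin_a, cyrillic_a = 'a', '\u0430'
print(latin_a == cyrillic_a)           # False
print(unicodedata.name(latin_a))       # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))    # CYRILLIC SMALL LETTER A
```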

Slide 29

Slide 29 text

Further Reading
● Unicode Solutions in Python 2 and Python 3 (slides) 👍
● Unicode, JavaScript and the Emoji family (slides) 👍
● What Every Programmer Absolutely Needs To Know About Encodings & Character Sets (blog post) 👍
● I � Unicode (slides) 👍
● Hacking with Unicode in 2016 (slides) 👍
● Python 3 and Unicode (slides)
● Counting characters – Twitter Developer Platform (blog post)
● International Components for Unicode (slides)
● 新元号対応についてー日本マイクロソフト株式会社 (Support for the new Japanese era name, Microsoft Japan; slides)
● Unicode Explained (book, O'Reilly, 2006)
● Unicode Demystified (book, O'Reilly, 2002)
● CJKV Information Processing, 2/e (book, O'Reilly, 2008)
● rust-lang/rust #12056 path: Windows paths may contain non-utf8-representable sequences (GitHub)
● Unicode Support – The Linux kernel user's & admin's guide (documentation)

Slide 30

Slide 30 text

Useful Online Tools
● Compart Unicode Lookup
● Unicode search tool (scarfboy.com)
● Emojipedia
● Shapecatcher (recognize handwritten Unicode characters)
● 古今文字集成 (ancient Chinese characters lookup)
● Unicode tools curation (unicode.org)

Slide 31

Slide 31 text

Bonus: Encoding Option in Windows Notepad
● ANSI → System Locale
● Unicode → UTF-16LE
● Unicode big endian → UTF-16BE
● UTF-8 → UTF-8

Slide 32

Slide 32 text

Bonus: Encoding Option in Windows Notepad (cont.)
● ANSI → System Locale
● UTF-16 LE
● UTF-16 BE
● UTF-8
● UTF-8 with BOM
Things seem to get better in the latest Windows (20H2~)