Slide 28
● Zalgo text (creepy/glitchy appearance with diacritics)
● Collation (locale-aware ordering and comparison of strings rather than byte-by-byte comparison, e.g., by radical or stroke count for Chinese, ref)
● Case folding (a more aggressive “lower()” for caseless comparison, e.g., ⟨ß⟩ → ⟨ss⟩)
● CJK Unified Ideographs (for ⟨㋿⟩, which ⟨令⟩ to use → unified ⟨U+4EE4⟩ vs compatibility ⟨U+F9A8⟩)
● IDN homograph attack (phishing scam)
● Regular Expressions (e.g., r"\p{InCJK_Unified_Ideographs}")
● Transliteration (letter swap rules, e.g., Greek ⟨α⟩ → ⟨a⟩, sometimes used in SMS)
● Punycode (represent Unicode characters in ASCII subset, e.g., ⟨中文⟩ → ⟨xn--fiq228c⟩)
● Right-to-left mark ⟨U+200F⟩ (e.g., ⟨U+0623 U+0647 U+0644 U+0627! U+200F⟩ → ⟨!ﻼھأ⟩)
● Localization (with Unicode CLDR) → Unicode is not just about characters
● Unicode in databases (e.g., MySQL utf8mb4) and in other languages: C, C++, Swift, Go, HTML, etc.
● Security issues related to Unicode processing (e.g., CVE-2018-4290 – crashes iPhone)
● … (Unicode is too hard to master)
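As a rough illustration of the case folding and CJK bullets above, here is a minimal Python sketch using only the standard library. The helper names `caseless_key`, `caseless_equal`, and `unify` are hypothetical and chosen for this example. The `㋿` line assumes Python 3.8 or later, whose Unicode data includes that character.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Hypothetical helper: build a key for caseless comparison.

    Normalizes to NFD, applies case folding, then normalizes again,
    because folding can produce text that is no longer normalized.
    """
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings ignoring case."""
    return caseless_key(a) == caseless_key(b)


def unify(text: str) -> str:
    """Hypothetical helper: NFKC maps compatibility characters to preferred forms."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print("Straße".lower())                     # lower() leaves ß untouched
    print("Straße".casefold())                  # casefold() expands ß to ss
    print(caseless_equal("STRASSE", "straße"))  # True
    print(hex(ord(unify("\uF9A8"))))            # compatibility ideograph maps to 0x4ee4
    print(unify("\u32FF"))                      # square era name ㋿; needs Unicode 12.1+ data
```

Normalizing before comparing is one reasonable answer to the "which one to use" question: store and compare the normalized form, so the compatibility code point never reaches the comparison.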
Advanced Topics (not covered today)
部首 (radical), 筆劃 (stroke count)
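The Punycode and IDN homograph bullets can be sketched with Python's built-in `punycode` and `idna` codecs. This is a minimal sketch, not a complete IDN implementation: the built-in `idna` codec follows the older IDNA 2003 rules, the helper names `to_ascii` and `to_unicode` are hypothetical, and the spoofed domain is an invented example.

```python
def to_ascii(domain: str) -> str:
    """Hypothetical helper: encode a Unicode domain to its ASCII (xn--) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Hypothetical helper: decode an ASCII (xn--) domain back to Unicode."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    # Raw Punycode has no prefix; IDNA adds "xn--" to each encoded label.
    print("中文".encode("punycode"))   # b'fiq228c'
    print(to_ascii("中文"))            # xn--fiq228c
    print(to_unicode("xn--fiq228c"))   # 中文

    # Homograph: Cyrillic а (U+0430) looks like Latin a (U+0061).
    spoof = "\u0430pple.com"           # invented example domain
    print(spoof == "apple.com")        # False, although the two look alike
    print(to_ascii(spoof))             # the ASCII form exposes the difference
```

Showing the ASCII form of a mixed-script domain is one common way to make a homograph visible to the user.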