Text Encoding Every Engineer Must Understand

Slide 1

Creative Commons 3.0 BY-NC-ND
Text Encoding Every Engineer Must Understand
It's 2020 and you still don't understand Unicode?
Inndy Lin / [email protected]

Slide 2

奧義智慧 Proprietary and Confidential Information
How do computers store text?

Slide 3

ASCII Encoding
► ASCII defines 0x00 ~ 0x7F
► Printable characters: 0x20 ~ 0x7E
  ► ' ': 0x20
  ► A: 0x41
  ► a: 0x61
► Q: What is a quick way to switch a letter between upper and lower case?
► Ans: c ^ 0x20 (valid for ASCII letters only)
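The XOR trick works because the upper- and lowercase forms of an ASCII letter differ only in bit 0x20 (A is 0x41, a is 0x61). A minimal Python sketch follows; the helper name `toggle_case` is hypothetical, and the guard for non-letters is an added assumption, since the slide only states the XOR itself.

```python
def toggle_case(ch: str) -> str:
    """Flip the case of a single ASCII letter by toggling bit 0x20."""
    if not (ch.isascii() and ch.isalpha()):
        # Non-letters are returned unchanged: XOR with 0x20 would corrupt
        # them (for example '@' (0x40) would become '`' (0x60)).
        return ch
    return chr(ord(ch) ^ 0x20)


print(toggle_case("A"))  # a
print("".join(toggle_case(c) for c in "Hello, World"))  # hELLO, wORLD
```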

Slide 4

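ASCII Control Characters
► 0x00 ~ 0x1F, 0x7F
► Before GUIs were invented and became widespread, people worked with terminals and printers
► Why does a Windows line ending (CR LF, "\r\n") differ from a Linux line ending (LF, "\n")?
  ► CR moves the cursor to the start of the line
  ► LF moves the cursor down to the next line
► As for why Linux uses only LF, I don't know XD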
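To make the CR / LF difference concrete, here is a small Python sketch that shows the two byte values and converts Windows-style line endings to Linux-style ones. The function name `normalize_newlines` is hypothetical; also treating a lone CR as a line ending is an assumption that goes beyond what the slide covers.

```python
CR, LF = b"\r", b"\n"
print(CR.hex(), LF.hex())  # 0d 0a


def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF (and, by assumption, a lone CR) to a single LF."""
    # Replace the two-byte sequence first so that CR LF does not
    # turn into two separate newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_text = b"line 1\r\nline 2\r\n"
print(normalize_newlines(windows_text))  # b'line 1\nline 2\n'
```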

Slide 5

What about Europeans?

Slide 6

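Latin-1 Encoding
► ASCII defines the range 0x00 ~ 0x7F
► E(xtended) ASCII: 0x00 ~ 0xFF
► Several 8-bit extensions exist:
  ► Latin-1, a.k.a. ISO/IEC 8859-1
  ► CP 437 (the original IBM PC code page; its upper half differs from Latin-1)
► Python Hack: the latin1 codec can hold any binary data, because it maps every byte 0x00 ~ 0xFF to the code point with the same value
  ► b"\x01\x02\x80\xff".decode("latin1")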
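The latin1 hack can be checked directly: every byte value decodes to the code point with the same number, so decoding and re-encoding is lossless. The sketch below uses only Python's built-in codecs; the comparison against UTF-8 and CP 437 is added for illustration and is not on the slide.

```python
raw = bytes(range(256))            # every possible byte value
text = raw.decode("latin1")        # never fails

# Each byte maps to the code point with the same numeric value.
assert [ord(c) for c in text] == list(range(256))
# Encoding again restores the original bytes exactly.
assert text.encode("latin1") == raw

# UTF-8, by contrast, rejects arbitrary binary data.
try:
    b"\x80\xff".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("utf-8 accepted the bytes:", utf8_ok)

# CP 437 is a different 8-bit code page: byte 0x80 is U+00C7 there.
print(hex(ord(b"\x80".decode("cp437"))))   # 0xc7
print(hex(ord(b"\x80".decode("latin1"))))  # 0x80
```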

Slide 7

What about Asians?

Slide 8

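CCCII, CSIC, CISCII, …??? Big5!!!
► After computers arrived, Taiwan built several Chinese systems of its own
  ► 倚天中文系統 (ETen Chinese System), IBM 5550, 王安碼 (Wang code)
  ► I don't know the details; I wasn't born yet back then XD
► 資策會 (Institute for Information Industry) gathered the vendors for a meeting and drew up 大五碼 (named for the five vendors), i.e. Big5, which Hong Kong later adopted as well
► Big5 was later revised to add characters; the "big5" codec in Python follows a very early version of the standard, so some characters are missing
► China built GBK on its own
► Hong Kong built HKSCS on top of Big5
► Python Tips: decode Big5-encoded Chinese text with "big5-hkscs"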
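The Python tip can be wrapped in a small helper. This is only a sketch: the function name `decode_big5` is hypothetical, and falling back to Microsoft's "cp950" variant and then to replacement characters is an assumption of this example, not something the slide recommends.

```python
def decode_big5(data: bytes) -> str:
    """Decode Big5-family bytes, preferring the larger HKSCS repertoire."""
    # "big5hkscs" is the canonical Python codec name; "big5-hkscs" is an alias.
    for codec in ("big5hkscs", "cp950"):
        try:
            return data.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort (assumption): keep what can be decoded and mark the rest.
    return data.decode("big5hkscs", errors="replace")


encoded = "中文編碼".encode("big5")
print(len(encoded))          # 8, two bytes per character
print(decode_big5(encoded))  # 中文編碼
```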

Slide 9

What if one document needs to contain two languages?

Slide 10

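Unicode!
► Unicode 1.0 was established in 1991
► Besides the scripts of every country, it also includes Emoji
  ► 2020/3: Unicode 13.0 added the Bubble Tea Emoji! → 🧋 (U+1F9CB)
► Windows NT: UCS-2
► Windows XP: UTF-16
► UTF-8
  ► A superset of ASCII: 0x00 ~ 0x7F is compatible with ASCII
  ► Variable-length encoding (1 to 4 bytes per code point)
  ► A Chinese character usually takes 3 bytes
    ⚫ But in Big5 and UTF-16 a Chinese character needs only 2 bytes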
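UTF-8 stores a code point in 1 to 4 bytes depending on its value, which is why ASCII stays at one byte while a typical Chinese character needs three. The hand-written encoder below is a sketch for illustration only: the name `utf8_encode` is hypothetical, and it skips the validation a real codec performs (surrogates and values above U+10FFFF are not rejected).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (no validation; illustration only)."""
    if code_point < 0x80:          # 0xxxxxxx: identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ("A", "\u00e9", "\u4e2d", "\U0001F9CB"):
    encoded = utf8_encode(ord(ch))
    # Cross-check against Python's built-in codec.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s)", encoded.hex())
```

For comparison, `"\u4e2d".encode("big5")` and `"\u4e2d".encode("utf-16-le")` are both 2 bytes long, matching the last bullet of the slide.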

Slide 11

Charset, Encoding?
► A charset defines which symbols are available (Character, Symbol)
► An encoding is the scheme for storing a character as binary
► Big5 covers both a charset and an encoding
► Unicode is a charset; UTF-8 is an encoding
► UTF-16 actually comes in two variants: UTF-16LE and UTF-16BE
► BOM (Byte-Order Mark): a U+FEFF character placed at the start of the file
  ► In PHP, a BOM in front of the opening tag is sent as output → Warning: Cannot modify header information - headers already sent
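The charset / encoding split and the BOM can both be seen from Python. In this sketch the code point is the charset-level identity of a character and the byte strings are its encodings; the helper name `strip_utf8_bom` and the sample PHP string are made up for illustration.

```python
import codecs

ch = "\u4e2d"                          # one character in the Unicode charset
print(hex(ord(ch)))                    # 0x4e2d: the code point
print(ch.encode("utf-8").hex())        # e4b8ad
print(ch.encode("utf-16-be").hex())    # 4e2d
print(ch.encode("utf-16-le").hex())    # 2d4e


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file saved "with BOM" starts with U+FEFF, which is 3 bytes in UTF-8.
with_bom = "\ufeff<?php echo 1;".encode("utf-8")
print(with_bom[:3].hex())              # efbbbf
print(strip_utf8_bom(with_bom))        # b'<?php echo 1;'
# The "utf-8-sig" codec drops the BOM while decoding.
print(with_bom.decode("utf-8-sig"))    # <?php echo 1;
```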

Slide 12

Integer Encoding
► Little Endian
  ► int a = 0x12345678;
  ► In memory / in a file: 78 56 34 12
► UTF-16LE
  ► >>> chr(0x2266).encode('utf-16le').hex()
  ► '6622'
  ► >>> chr(0x2266).encode('utf-16').hex() # with BOM, native byte order (little-endian here)
  ► 'fffe6622'
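The same byte-order idea applies to plain integers and to UTF-16 code units. The sketch below uses the standard `struct` and `codecs` modules to reproduce the byte sequences from the slide and adds the big-endian counterparts for contrast; the variable names are only illustrative.

```python
import codecs
import struct

value = 0x12345678
little = struct.pack("<I", value)   # 32-bit unsigned, little-endian
big = struct.pack(">I", value)      # 32-bit unsigned, big-endian
print(little.hex())  # 78563412
print(big.hex())     # 12345678

ch = chr(0x2266)
utf16_le = ch.encode("utf-16-le")
utf16_be = ch.encode("utf-16-be")
print(utf16_le.hex())  # 6622
print(utf16_be.hex())  # 2266

# Writing the BOM explicitly avoids depending on the machine's byte order.
with_bom = codecs.BOM_UTF16_LE + utf16_le
print(with_bom.hex())  # fffe6622
# The generic "utf-16" decoder reads the BOM to pick the byte order.
decoded = with_bom.decode("utf-16")
```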

Slide 13

Unicode Encoding Black Magic
► What happens when a Unicode code point does not fit in a single char / wchar_t
► UTF-8: a Chinese character takes 3 bytes, an Emoji takes 4 bytes
► UTF-16 sometimes uses 4 bytes to represent one character
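The practical consequence is that the "length" of a string depends on what is being counted. A short Python sketch, using an arbitrary three-character sample (an ASCII letter, a Chinese character, and the Bubble Tea emoji):

```python
text = "A\u4e2d\U0001F9CB"

code_points = len(text)                            # Python counts code points
utf8_bytes = len(text.encode("utf-8"))             # 1 + 3 + 4
utf16_units = len(text.encode("utf-16-le")) // 2   # 1 + 1 + 2 (surrogate pair)

print(code_points, utf8_bytes, utf16_units)  # 3 8 4
```

Languages whose strings are arrays of 16-bit units report the UTF-16 count, so the same text can appear to be 3 or 4 "characters" long depending on the runtime.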

Slide 14

Unicode - Surrogate Pairs
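A surrogate pair is how UTF-16 represents a code point above U+FFFF: subtract 0x10000, then split the remaining 20 bits into a high surrogate (0xD800 plus the top 10 bits) and a low surrogate (0xDC00 plus the bottom 10 bits). The sketch below follows that arithmetic; the function names are hypothetical and chosen for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F9CB)   # Bubble Tea emoji
print(hex(high), hex(low))               # 0xd83e 0xddcb
# Cross-check against Python's UTF-16 encoder.
print(chr(0x1F9CB).encode("utf-16-be").hex())  # d83eddcb
```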

Slide 15

Further Reading
► http://utf8everywhere.org/ ← highly recommended
► https://www.meziantou.net/how-to-correctly-count-the-number-of-characters-of-a-string.htm
► https://github.com/tonsky/FiraCode
► https://speakerdeck.com/inndy/binary-processing

Slide 16

Further Reading
► https://en.wikipedia.org/wiki/Unicode
► https://en.wikipedia.org/wiki/UTF-8
► https://en.wikipedia.org/wiki/UTF-16
► https://en.wikipedia.org/wiki/Emoji
► https://en.wikipedia.org/wiki/Big5
► https://en.wikipedia.org/wiki/Code_page_950

Slide 17

Further Reading
► https://www.ptt.cc/bbs/Python/M.1467340705.A.2F5.html
► https://docs.python.org/3.5/library/codecs.html#standard-encodings