Java Unicode NCR 處理

Slide 1

Slide 1 text

Unicode NCR 處理 BrianHsu [email protected]

Slide 2

Slide 2 text

基本的概念 ● 電腦是二進位，只能儲存 0/1 兩種狀態 ● 使用對於 bit 的解讀來存放不同類型的資料 – 整數（二補數） – 浮點數（ IEEE754 ） – 字元

Slide 3

Slide 3 text

ASCII ● 使用一個 BYTE 來儲存字元 ● 1byte=8bit=2^8=256 種狀態 ● ASCII 針對前 128 個有統一的定義 – 剩下的 128 個由各系統自行決定應用 ● A=65 、 B=66 、 C=67... ● 0=48 、 1=49 、 2=50.... – 剩下的 128 個字元，由 CodePage 決定如何解讀

Slide 4

Slide 4 text

CJK 字元的儲存 ● ASCII 使用的 1 個 byte=256 個字元對英文語系很夠用，但漢字圈的字元多達上萬個 – Unicode 3.1 收錄超過七萬個 ● 使用兩個 Byte 來儲存非 ASCII 內定義的字元

Slide 5

Slide 5 text

萬碼奔騰的年代 ● 繁體中文： BIG5 ● 簡體中文： GB2312 ● 日文： EUC-JP/Shift_JIS... ● BIG5 – 聯合目錄現行 XML 交換編碼 – 以 1 個 byte 儲存 ASCII 範圍字元 – 以 2 個 byte 儲存中文字元 – 缺字問題 ● 堃、喆……等

Slide 6

Slide 6 text

Unicode 編碼

Slide 7

Slide 7 text

Unicode 解決的問題 ● 解決各系統編碼不同的問題 – 只要有字型，可以同時顯示不同國家的語言，如中日韓混合 ● 已成為網際網路實質標準 ● 收錄的漢字比 BIG5 多很多 – 堃、喆

Slide 8

Slide 8 text

Unicode 的基本概念 ● 每個字元對應到一個抽象的 CodePoint – CodePoint 是整數數值 – CodePoint 是抽象的，不是實際上儲存的值 ● 為了與 ASCII 相容、 ASCII 內的字元數值和 Unicode 裡的 CodePoint 相同 – A=0x0041=65 – B=0x0042=66 ● 其他的字元範例 – 公 =0x516C=20844 – 喆 =0x5586=21894

Slide 9

Slide 9 text

沒有「儲存為 Unicode 」這種事！！

Slide 10

Slide 10 text

Unicode 實際儲存時 ● Unicode CointPoint 是抽象的！ ● 儲存時需要依照編碼規則儲存 – UTF-8 – UTF-16BE – UTF-16LE

Slide 11

Slide 11 text

UTF-8 ● 網際網路上的實質資料交換標準 ● 使用 1~4 個 byte 來儲存 – 與 ASCII 相容，在 ASCII 範圍內的字使用 1 個 byte 儲存 – 其於字元視需要使用２～４個 byte 來儲存

Slide 12

Slide 12 text

UTF-16BigEndean ● Java 內部字元的儲存方式 – Java 裡 1 char = 2byte ● 固定使用 2 個 byte 的倍數 – 一個 UnicodePoint 存為 2 個 byte – 一個 UnicodePoint 存為 4 個 byte ● 若 CodePoint 是在 2^16=65535 內 – 直接儲存 CodePoint 轉二進位後的值 – A=0x0041 – 堃 =0x5B03

Slide 13

Slide 13 text

UTF-16BigEndean ● 若字元的 CodePoint 超過 65535 ● 使用兩個 Byte 表示 – 並非直接對應 CodePoint ，需經過編碼 – 編碼過後的第一個 Byte 必定為 HighSurrogate 範圍內 (0xD800–0xDBFF) – 言㐌 ● CodePoint 0x279A7 ● 實際編碼 – 0xD85E 0xDDA7

Slide 14

Slide 14 text

NCR ● Numeric character reference ● 使用 Unicode 的 CodePoint 來表示字元 – 用在 XML/HTML 中 – 可以在非 UTF-8/UTF-16 編碼的文件中，表示 Unicode 的字元 ● 型式 – 堃 // 十進位 – �x5803; // 十六進位 – 堃 // 十六進位

Slide 15

Slide 15 text

NCR ● 可以在 Big5 的文件中表示出 Big5 中沒有的字 ● 例： – 游鍚堃 – 陶喆

Slide 16

Slide 16 text

從 NCR 轉回 UTF-16 ● 不要自己做！ ● 光是 CodePoint 在 65535 以上字元編碼就很麻煩而且容易出錯 ● 使用 Apache Commons Language 函式庫 – JAR檔連結 – JavaDoc API

Slide 17

Slide 17 text

從 NCR 轉回 UTF-16 import org.apache.commons.lang3.StringEscapeUtils; public class Test { public static void main(String [] args) { String str1 = " 游鍚堃與陶喆"; String str2 = StringEscapeUtils.unescapeXml(str1); String str3 = "𧦧懷 "; String str4 = StringEscapeUtils.unescapeXml(str3); System.out.println(str2); System.out.println(str4); } }

Slide 18

Slide 18 text

將 BIG5 中沒有的字轉成 NCR ● 「游鍚堃」轉成「游鍚堃堃」 ● 將字串轉為字元陣列 – 記得有一個 Unicode 字元對應到四個 Byte 的狀態 – 把字元陣列掃過一次 ● 先確定是否為 CodePoint 的開頭 ● 如果是的話 – 檢查是否為 Big5 中的字元，如果不是就轉成 NCR ● 如果不是的話 – 這是四個 Byte 的狀態的尾巴二個 byte ，不理他 ● 參照 tw.digitalarchives.util.TextUtil

Slide 19

Slide 19 text

將 BIG5 中沒有的字轉成 NCR import tw.digitalarchives.util.TextUtil; public class Test { public static void main(String [] args) { String str1 = " 游鍚堃與陶喆 "; String str2 = " 懷 "; String str3 = TextUtil.normalizeString(str1); String str4 = TextUtil.normalizeString(str2); System.out.println(str3); System.out.println(str4); } }

Slide 20

Slide 20 text

參考資料 ● Unicode Code Point列表 ● Unicode Code Point查詢