rune, Unicode, and Character Counts

rune, Unicode, character counts, and a talk that leaves you asking "what even is a character?" by Kotone (@_ktnyt) @ Go勉強会 (Go study meetup) by bitkey × voicy

Self-introduction
Kotone (板谷美玲) @_ktnyt
Web engineer at LAPRAS Inc.
The person behind "io.Readerをすこれ" (roughly, "go love io.Reader")
Hobbies: programming, driving, music

Can you count characters?

How many characters are in each of these strings?
1: A
2: あ
3: ㌖
4: 🙏
5: 🙋‍♀️
6: 🙋🏻‍♀️
7: 👩‍👩‍👧‍👧

Let's try it!

    package main

    import "fmt"

    func main() {
        ss := []string{
            "A",
            "あ",
            "㌖",
            "🙏",
            "🙋‍♀️",
            "🙋🏻‍♀️",
            "👩‍👩‍👧‍👧",
        }
        // Print a table of each string and its len().
        fmt.Println("| s | len(s) |")
        fmt.Println("|:-:|-------:|")
        for _, s := range ss {
            fmt.Printf("| %s | %d |\n", s, len(s))
        }
    }

Result

| s | len(s) |
|:-:|-------:|
| A | 1 |
| あ | 3 |
| ㌖ | 3 |
| 🙏 | 4 |
| 🙋‍♀️ | 13 |
| 🙋🏻‍♀️ | 17 |
| 👩‍👩‍👧‍👧 | 25 |

Why?
The specification of len(<string>) in Go:

Call      Argument type    Result
len(s)    string type      string length in bytes
          [n]T, *[n]T      array length (== n)
          []T              slice length
          map[K]T          map length (number of defined keys)
          chan T           number of elements queued in channel buffer
          type parameter   see below
cap(s)    [n]T, *[n]T      array length (== n)
          []T              slice capacity
          chan T           channel buffer capacity
          type parameter   see below

from: https://go.dev/ref/spec#Length_and_capacity
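To see where those byte counts come from, here is a minimal sketch that dumps the UTF-8 bytes behind a string. The helper name utf8Hex is made up for this illustration and is not part of the talk.

```go
package main

import "fmt"

// utf8Hex is a hypothetical helper for this illustration: it returns the
// bytes of s as space-separated uppercase hex. These bytes are exactly
// what len(s) counts.
func utf8Hex(s string) string {
	return fmt.Sprintf("% X", []byte(s))
}

func main() {
	for _, s := range []string{"A", "あ", "㌖"} {
		fmt.Printf("%s: len=%d bytes=[%s]\n", s, len(s), utf8Hex(s))
	}
}
```

Indexing a string, as in s[0], likewise yields a single byte rather than a character.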

Very well, then: rune.
Rune literals
A rune literal represents a rune constant, an integer value identifying a Unicode code point. A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'. Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.
from: https://go.dev/ref/spec#Rune_literals
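As a small illustration of the quoted passage, the sketch below writes the same code point, U+3042, in several literal forms. The function name hiraganaA is hypothetical and exists only for this example.

```go
package main

import "fmt"

// hiraganaA returns the same code point, U+3042, written four ways.
// The name is hypothetical, used only for this illustration.
func hiraganaA() []rune {
	return []rune{
		'あ',          // the character itself
		'\u3042',     // 4-digit Unicode escape
		'\U00003042', // 8-digit Unicode escape
		0x3042,       // plain integer constant
	}
}

func main() {
	for _, r := range hiraganaA() {
		fmt.Printf("%c %U %d\n", r, r, r)
	}
	// rune is an alias for int32, so %T reports int32.
	fmt.Printf("%T\n", 'x')
}
```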

Unicode Code Point?
Any value in the Unicode codespace; that is, the range of integers from 0 to 0x10FFFF.
from: https://www.unicode.org/glossary/#code_point
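The range in that definition can be checked with the standard library's unicode.MaxRune constant. This is only a sketch; inCodespace and encodable are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// inCodespace reports whether v lies in the Unicode codespace,
// 0 through 0x10FFFF. Hypothetical helper for this illustration.
func inCodespace(v int64) bool {
	return v >= 0 && v <= unicode.MaxRune
}

// encodable wraps utf8.ValidRune, which additionally rejects the
// surrogate range U+D800 to U+DFFF: those are code points, but they
// cannot be encoded as UTF-8.
func encodable(r rune) bool {
	return utf8.ValidRune(r)
}

func main() {
	for _, v := range []int64{0x41, 0xD800, 0x10FFFF, 0x110000} {
		fmt.Printf("%#x: in codespace=%t, encodable=%t\n",
			v, inCodespace(v), encodable(rune(v)))
	}
}
```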

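An aside: Unicode and UTF-8/UTF-16
Encoding the Unicode codespace, 0x0-0x10FFFF, takes 21 bits. To hold that in a variable whose size is 8N bits (N bytes) you need at least 24 bits, and in practice implementations almost always use 32 bits. UTF-8 stores a Unicode code point in 8-bit code units (strictly speaking, some bits of each unit are taken by a prefix, so a unit does not carry a full 8 bits of the code point), and UTF-16 stores it in 16-bit code units (with the same caveat).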
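The numbers above can be reproduced with the standard library. The following is a rough sketch; codeUnits is a hypothetical helper name, not something from the talk.

```go
package main

import (
	"fmt"
	"math/bits"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits returns how many UTF-8 code units (bytes) and how many
// UTF-16 code units (16-bit values) are needed to encode r.
// Hypothetical helper for this illustration.
func codeUnits(r rune) (u8, u16 int) {
	return utf8.RuneLen(r), len(utf16.Encode([]rune{r}))
}

func main() {
	// Number of bits needed to represent the largest code point.
	fmt.Println("bits for 0x10FFFF:", bits.Len32(0x10FFFF))

	for _, r := range []rune{'A', '\u3042', '\U0001F64F'} {
		u8, u16 := codeUnits(r)
		fmt.Printf("%U: %d UTF-8 unit(s), %d UTF-16 unit(s)\n", r, u8, u16)
	}
}
```

Code points above U+FFFF need two UTF-16 code units (a surrogate pair), which is why the last entry differs from the first two.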

Let's count with rune

    package main

    import "fmt"

    func main() {
        ss := []string{
            "A",
            "あ",
            "㌖",
            "🙏",
            "🙋‍♀️",
            "🙋🏻‍♀️",
            "👩‍👩‍👧‍👧",
        }
        // Compare the byte count with the number of runes (code points).
        fmt.Println("| s | len(s) | len([]rune(s)) |")
        fmt.Println("|:-:|-------:|---------------:|")
        for _, s := range ss {
            fmt.Printf("| %s | %d | %d |\n", s, len(s), len([]rune(s)))
        }
    }

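Result

| s | len(s) | len([]rune(s)) |
|:-:|-------:|---------------:|
| A | 1 | 1 |
| あ | 3 | 1 |
| ㌖ | 3 | 1 |
| 🙏 | 4 | 1 |
| 🙋‍♀️ | 13 | 4 |
| 🙋🏻‍♀️ | 17 | 5 |
| 👩‍👩‍👧‍👧 | 25 | 7 |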
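For reference, the same rune count can be obtained in other ways with the standard library. The sketch below compares a range loop, utf8.RuneCountInString, and the []rune conversion; the function name counts is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of code points in s computed three ways:
// by ranging over the string, with utf8.RuneCountInString, and by
// converting to []rune. Hypothetical helper for this illustration.
func counts(s string) (byRange, byUTF8, byConv int) {
	for range s {
		byRange++ // ranging over a string steps one rune at a time
	}
	return byRange, utf8.RuneCountInString(s), len([]rune(s))
}

func main() {
	// U+1F64B U+200D U+2640 U+FE0F, written with escapes.
	s := "\U0001F64B\u200D\u2640\uFE0F"
	a, b, c := counts(s)
	fmt.Println(len(s), a, b, c)
}
```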

But why?

Code point notation

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        ss := []string{
            "A",
            "あ",
            "㌖",
            "🙏",
            "🙋‍♀️",
            "🙋🏻‍♀️",
            "👩‍👩‍👧‍👧",
        }
        fmt.Println("| s | Code Points |")
        fmt.Println("|:-:|:------------|")
        for _, s := range ss {
            rr := []rune(s)
            cp := make([]string, len(rr))
            for i, r := range rr {
                // %U formats a rune in U+XXXX notation.
                cp[i] = fmt.Sprintf("%U", r)
            }
            fmt.Printf("| %s | %s |\n", s, strings.Join(cp, " "))
        }
    }

| s | Code Points |
|:-:|:------------|
| A | U+0041 |
| あ | U+3042 |
| ㌖ | U+3316 |
| 🙏 | U+1F64F |
| 🙋‍♀️ | U+1F64B U+200D U+2640 U+FE0F |
| 🙋🏻‍♀️ | U+1F64B U+1F3FB U+200D U+2640 U+FE0F |
| 👩‍👩‍👧‍👧 | U+1F469 U+200D U+1F469 U+200D U+1F467 U+200D U+1F467 |

U+200D: Zero Width Joiner, U+FE0F: Variation Selector-16
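Going the other way, a string can be assembled from the code points listed above, which confirms the byte and rune counts seen earlier. This is a minimal sketch; fromCodePoints is a hypothetical helper name.

```go
package main

import "fmt"

// fromCodePoints builds a string by UTF-8 encoding the given code
// points in order. Hypothetical helper for this illustration.
func fromCodePoints(cps ...rune) string {
	return string(cps)
}

func main() {
	// The four code points from the table: person raising hand,
	// joiner, female sign, variation selector.
	s := fromCodePoints(0x1F64B, 0x200D, 0x2640, 0xFE0F)
	fmt.Printf("%s: %d bytes, %d code points\n", s, len(s), len([]rune(s)))
}
```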

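UAX #29: UNICODE TEXT SEGMENTATION
https://unicode.org/reports/tr29/
The specification for how Unicode text is split into characters. In Hangul, for example, one character is built by combining parts that each represent a sound, so a sequence of several code points still has to be presented as a single character. A sequence of one or more code points that appears as a single character is called a grapheme cluster.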
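To make the Hangul example concrete, the sketch below splits a precomposed syllable into its parts using the arithmetic decomposition described in the Unicode Standard. It is not part of the talk, the function name decomposeHangul is hypothetical, and it only illustrates that one visible character can correspond to several code points; it does not implement UAX #29 segmentation.

```go
package main

import "fmt"

// Constants of the Hangul syllable decomposition arithmetic.
const (
	sBase  = 0xAC00 // first precomposed syllable
	lBase  = 0x1100 // first leading consonant
	vBase  = 0x1161 // first vowel
	tBase  = 0x11A7 // one before the first trailing consonant
	vCount = 21
	tCount = 28
	nCount = vCount * tCount // 588 syllables per leading consonant
	sCount = 19 * nCount     // 11172 precomposed syllables
)

// decomposeHangul splits a precomposed Hangul syllable into its
// conjoining jamo. Any other rune is returned unchanged.
func decomposeHangul(r rune) []rune {
	sIndex := r - sBase
	if sIndex < 0 || sIndex >= sCount {
		return []rune{r}
	}
	l := lBase + sIndex/nCount
	v := vBase + (sIndex%nCount)/tCount
	t := tBase + sIndex%tCount
	if t == tBase {
		return []rune{l, v} // no trailing consonant
	}
	return []rune{l, v, t}
}

func main() {
	parts := decomposeHangul('\uD55C') // the syllable U+D55C
	s := string(parts)
	fmt.Printf("%U -> %U (%d code points, %d bytes)\n",
		'\uD55C', parts, len(parts), len(s))
}
```

Both the single code point and the decomposed sequence display as one character, which is the situation grapheme clusters are meant to describe.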

So, in short?
Code unit → 8 bit (UTF-8), 16 bit (UTF-16), etc.
    UTF-8 encodes ASCII efficiently.
One Unicode character (code point) → at least 21 bit (multi-byte)
    UTF-8 → 1-4 byte
    Go rune → 32 bit
One apparent character (grapheme cluster) → one or more code points
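The 1-4 byte range for UTF-8 can be observed by decoding a string one rune at a time. A minimal sketch, assuming a hypothetical helper named runeSizes:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeSizes returns the number of UTF-8 bytes used by each code point
// in s, in order. Hypothetical helper for this illustration.
func runeSizes(s string) []int {
	var sizes []int
	for len(s) > 0 {
		// DecodeRuneInString reports the rune at the front of s and
		// how many bytes it occupies.
		_, size := utf8.DecodeRuneInString(s)
		sizes = append(sizes, size)
		s = s[size:]
	}
	return sizes
}

func main() {
	// U+0041, U+00E9, U+3042, U+1F64F: one code point of each length.
	fmt.Println(runeSizes("A\u00E9\u3042\U0001F64F"))
}
```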

rivo/uniseg

    package main

    import (
        "fmt"

        "github.com/rivo/uniseg"
    )

    func main() {
        ss := []string{
            "A",
            "あ",
            "㌖",
            "🙏",
            "🙋‍♀️",
            "🙋🏻‍♀️",
            "👩‍👩‍👧‍👧",
        }
        // Compare bytes, runes, and grapheme clusters for each string.
        fmt.Println("| s | len(s) | len([]rune(s)) | uniseg.GraphemeClusterCount |")
        fmt.Println("|:-:|-------:|---------------:|----------------------------:|")
        for _, s := range ss {
            fmt.Printf("| %s | %d | %d | %d |\n",
                s, len(s), len([]rune(s)), uniseg.GraphemeClusterCount(s))
        }
    }

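Result

| s | len(s) | len([]rune(s)) | uniseg.GraphemeClusterCount |
|:-:|-------:|---------------:|----------------------------:|
| A | 1 | 1 | 1 |
| あ | 3 | 1 | 1 |
| ㌖ | 3 | 1 | 1 |
| 🙏 | 4 | 1 | 1 |
| 🙋‍♀️ | 13 | 4 | 1 |
| 🙋🏻‍♀️ | 17 | 5 | 1 |
| 👩‍👩‍👧‍👧 | 25 | 7 | 1 |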

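The text "午前3時のいばらきけん" ("Ibaraki Prefecture at 3 a.m.") with assorted combining diacritical marks attached. As far as Unicode is concerned, this is 11 characters plus a lot of decoration.

| len(s) | len([]rune(s)) | uniseg.GraphemeClusterCount |
|-------:|---------------:|----------------------------:|
| 129 | 60 | 11 |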
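For text that is only base characters plus combining marks, a rough count is possible with the standard library alone by skipping runes in the nonspacing mark category (Mn). This is an approximation added for illustration, not an implementation of UAX #29 and not what uniseg does; the name countNonMarks is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// countNonMarks counts the runes in s that are not nonspacing marks
// (Unicode category Mn). Hypothetical helper: it approximates the
// number of visible characters only for "base + combining mark" text.
func countNonMarks(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.Is(unicode.Mn, r) {
			n++
		}
	}
	return n
}

func main() {
	// One hiragana letter followed by three combining diacritical marks.
	decorated := "\u3044\u0301\u0302\u0303"
	fmt.Println(len(decorated), len([]rune(decorated)), countNonMarks(decorated))

	// Joiner-based emoji sequences are NOT handled: the result is more than 1.
	fmt.Println(countNonMarks("\U0001F64B\u200D\u2640\uFE0F"))
}
```

The second call shows why a proper grapheme cluster implementation is still needed for emoji sequences.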

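"Characters" are hard
"㌖" is six characters to a Japanese reader, yet a single character (one code point) to Unicode. "👩‍👩‍👧‍👧" is one character to a human, yet seven code points to Unicode. "Characters" look simple, but they are actually hard.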
