
rune, Unicode, and Character Counts

ktnyt
June 30, 2022


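"Counting characters with len(s) gives you the byte count, so to count characters properly you have to count runes (said with a straight face)." There was a time when I believed that too. What is Unicode Text Segmentation, and why is it so closely tied to emoji?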

Transcript

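  1. Let's try it!

    package main

    import "fmt"

    func main() {
        // The last four entries are emoji written as escapes; slide 8 lists their code points.
        ss := []string{
            "A",
            "あ",
            "㌖",
            "\U0001F64F",
            "\U0001F64B\u200d\u2640\ufe0f",
            "\U0001F64B\U0001F3FB\u200d\u2640\ufe0f",
            "\U0001F469\u200d\U0001F469\u200d\U0001F467\u200d\U0001F467",
        }
        // Print a Markdown table of each string and its len.
        fmt.Println("| s | len(s) |")
        fmt.Println("|:-:|-------:|")
        for _, s := range ss {
            fmt.Printf("| %s | %d |\n", s, len(s))
        }
    }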
  2. Why? What the Go spec says about len(<string>)

    Call      Argument type    Result

    len(s)    string type      string length in bytes
              [n]T, *[n]T      array length (== n)
              []T              slice length
              map[K]T          map length (number of defined keys)
              chan T           number of elements queued in channel buffer
              type parameter   see below

    cap(s)    [n]T, *[n]T      array length (== n)
              []T              slice capacity
              chan T           channel buffer capacity
              type parameter   see below

    from: https://go.dev/ref/spec#Length_and_capacity
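
The "string length in bytes" row can be checked directly. The following is a minimal sketch; byteLen is a hypothetical helper introduced only for this illustration, and it does nothing more than call len.

```go
package main

import "fmt"

// byteLen returns what len reports for a string: its length in bytes.
// (Hypothetical helper, named only for this illustration.)
func byteLen(s string) int {
	return len(s)
}

func main() {
	// "あ" is the single code point U+3042, stored as three UTF-8 bytes.
	s := "あ"
	fmt.Printf("len(%q) = %d, bytes = % x\n", s, byteLen(s), []byte(s))
}
```

Printing the bytes next to the count makes it visible that len is measuring the UTF-8 encoding, not characters.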
  3. Very well, then: rune

    Rune literals

    A rune literal represents a rune constant, an integer value identifying a Unicode code point. A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'. Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.

    from: https://go.dev/ref/spec#Rune_literals
  4. Unicode Code Point?

    Any value in the Unicode codespace; that is, the range of integers from 0 to 0x10FFFF.

    from: https://www.unicode.org/glossary/#code_point
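
To make the two definitions above concrete, here is a small sketch showing that a rune is just an integer and that the codespace ends at 0x10FFFF. isCodePoint is a hypothetical helper written for this example; unicode.MaxRune is the standard library constant for the upper bound.

```go
package main

import (
	"fmt"
	"unicode"
)

// isCodePoint reports whether v lies in the Unicode codespace, 0 to 0x10FFFF.
// (Hypothetical helper, named only for this illustration.)
func isCodePoint(v int64) bool {
	return v >= 0 && v <= unicode.MaxRune
}

func main() {
	r := 'あ' // a rune literal: an integer constant identifying U+3042
	fmt.Printf("%T %d %U\n", r, r, r)

	fmt.Println(isCodePoint(0x10FFFF)) // last value inside the codespace
	fmt.Println(isCodePoint(0x110000)) // first value outside it
}
```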
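  5. An aside: Unicode and UTF-8/UTF-16

    Encoding the Unicode codespace, 0x0-0x10FFFF, takes 21 bits. A variable whose size is a multiple of 8 bits (N bytes) therefore needs at least 24 bits, and in practice almost every implementation uses 32 bits. UTF-8 stores a Unicode code point as a sequence of 8-bit code units, and UTF-16 as a sequence of 16-bit code units. In both encodings some bits of a unit act as a prefix marking its role in the sequence, so a unit does not always carry a full 8 (or 16) bits of code point data.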
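
The difference between a code point and its code units can be inspected with the standard library. This is a sketch only: utf8Units and utf16Units are hypothetical helper names, built on utf8.EncodeRune and utf16.Encode.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf8Units returns the 8-bit code units (bytes) that encode r in UTF-8.
// (Hypothetical helper, named only for this illustration.)
func utf8Units(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

// utf16Units returns the 16-bit code units that encode r in UTF-16.
// (Hypothetical helper, named only for this illustration.)
func utf16Units(r rune) []uint16 {
	return utf16.Encode([]rune{r})
}

func main() {
	// One ASCII letter, one hiragana, one emoji outside the Basic Multilingual Plane.
	for _, r := range []rune{'A', 'あ', 0x1F64F} {
		fmt.Printf("%U: UTF-8 % x, UTF-16 %04x\n", r, utf8Units(r), utf16Units(r))
	}
}
```

The emoji needs four UTF-8 code units and two UTF-16 code units (a surrogate pair), while the ASCII letter needs one of each.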
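  6. Let's count with rune

    package main

    import "fmt"

    func main() {
        // The last four entries are emoji written as escapes; slide 8 lists their code points.
        ss := []string{
            "A",
            "あ",
            "㌖",
            "\U0001F64F",
            "\U0001F64B\u200d\u2640\ufe0f",
            "\U0001F64B\U0001F3FB\u200d\u2640\ufe0f",
            "\U0001F469\u200d\U0001F469\u200d\U0001F467\u200d\U0001F467",
        }
        // Compare the byte count with the number of runes (code points).
        fmt.Println("| s | len(s) | len([]rune(s)) |")
        fmt.Println("|:-:|-------:|---------------:|")
        for _, s := range ss {
            fmt.Printf("| %s | %d | %d |\n", s, len(s), len([]rune(s)))
        }
    }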
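
As a side note to the rune count above, len([]rune(s)) is not the only way to count code points. The sketch below is an alternative, not the approach used in the slides: it uses utf8.RuneCountInString, which avoids building a []rune, and a range loop, which decodes one rune per iteration. runeCount is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount counts the code points in s without converting it to []rune.
// (Hypothetical helper, named only for this illustration.)
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "Aあ\U0001F64F"

	// Ranging over a string yields the starting byte offset and the decoded rune.
	for i, r := range s {
		fmt.Printf("byte offset %d: %U (%d bytes)\n", i, r, utf8.RuneLen(r))
	}

	fmt.Println(len(s), runeCount(s)) // 8 bytes, 3 code points
}
```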
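  7. Code point notation

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // The last four entries are emoji written as escapes; slide 8 lists their code points.
        ss := []string{
            "A",
            "あ",
            "㌖",
            "\U0001F64F",
            "\U0001F64B\u200d\u2640\ufe0f",
            "\U0001F64B\U0001F3FB\u200d\u2640\ufe0f",
            "\U0001F469\u200d\U0001F469\u200d\U0001F467\u200d\U0001F467",
        }
        fmt.Println("| s | Code Points |")
        fmt.Println("|:-:|:------------|")
        for _, s := range ss {
            rr := []rune(s)
            // Format every rune in U+XXXX notation.
            cp := make([]string, len(rr))
            for i, r := range rr {
                cp[i] = fmt.Sprintf("%U", r)
            }
            fmt.Printf("| %s | %s |\n", s, strings.Join(cp, " "))
        }
    }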
  8. Output

    | s | Code Points |
    |:-:|:------------|
    | A | U+0041 |
    | あ | U+3042 |
    | ㌖ | U+3316 |
    | 🙏 | U+1F64F |
    | 🙋‍♀️ | U+1F64B U+200D U+2640 U+FE0F |
    | 🙋🏻‍♀️ | U+1F64B U+1F3FB U+200D U+2640 U+FE0F |
    | 👩‍👩‍👧‍👧 | U+1F469 U+200D U+1F469 U+200D U+1F467 U+200D U+1F467 |

    U+200D: Zero Width Joiner (ZWJ), U+FE0F: Variation Selector-16
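
The role of U+200D in the sequences above can be illustrated by splitting a string on it. This is only a sketch for inspection, not a text segmentation algorithm; zwjParts is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// zwj is U+200D ZERO WIDTH JOINER, which glues emoji into a single sequence.
const zwj = "\u200d"

// zwjParts splits s at every zero width joiner.
// (Hypothetical helper, named only for this illustration.)
func zwjParts(s string) []string {
	return strings.Split(s, zwj)
}

func main() {
	// The family emoji from the table: four people joined by three ZWJs.
	family := "\U0001F469\u200d\U0001F469\u200d\U0001F467\u200d\U0001F467"

	parts := zwjParts(family)
	fmt.Println(len([]rune(family)), len(parts)) // 7 code points, 4 joined emoji
	for _, p := range parts {
		fmt.Printf("%+q\n", p) // each part as an ASCII-only escaped literal
	}
}
```

Seven code points are displayed as one apparent character, which is why counting runes does not match what a reader sees.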
  9. So, in short?

    Code Unit → 8 bit (UTF-8), 16 bit (UTF-16), etc.
      UTF-8 encodes ASCII efficiently (one byte per character).
    One Unicode character (Code Point) → at least 21 bit (multi-byte)
      UTF-8 → 1-4 byte
      Go rune → 32 bit
    One apparent character (Grapheme Cluster) → one or more Code Points
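  10. rivo/uniseg

    package main

    import (
        "fmt"

        "github.com/rivo/uniseg"
    )

    func main() {
        // The last four entries are emoji written as escapes; slide 8 lists their code points.
        ss := []string{
            "A",
            "あ",
            "㌖",
            "\U0001F64F",
            "\U0001F64B\u200d\u2640\ufe0f",
            "\U0001F64B\U0001F3FB\u200d\u2640\ufe0f",
            "\U0001F469\u200d\U0001F469\u200d\U0001F467\u200d\U0001F467",
        }
        // Compare bytes, runes (code points), and grapheme clusters.
        fmt.Println("| s | len(s) | len([]rune(s)) | uniseg.GraphemeClusterCount |")
        fmt.Println("|:-:|-------:|---------------:|----------------------------:|")
        for _, s := range ss {
            fmt.Printf("| %s | %d | %d | %d |\n", s, len(s), len([]rune(s)), uniseg.GraphemeClusterCount(s))
        }
    }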