The World of One Character @orisano
Everyone can count characters, right?
a
a => 1
あ
あ => 1
佛
佛 => 1
🤔 => 1
Z͑ͫ̓ͪ̂ͫ̽ ̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌ ̌͘! ͖̬̰̙̗̿̋ ͥͥ̂ͣ̐́́͜͞
Z͑ͫ̓ͪ̂ͫ̽ ̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌ ̌͘! ͖̬̰̙̗̿̋ ͥͥ̂ͣ̐́́͜͞ => 6
Can everyone count bytes? (UTF-8)
あ => 3
佛 => 4
🤔 => 4
👨‍👩‍👦 => 18
Z͑ͫ̓ͪ̂ͫ̽ ̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌ ̌͘! ͖̬̰̙̗̿̋ ͥͥ̂ͣ̐́́͜͞ => 143
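The UTF-8 byte counts above can be checked with a small sketch. It assumes a JavaScript runtime that provides the global TextEncoder (current browsers and Node.js do). The helper name utf8ByteLength is hypothetical and is not from the talk.

```javascript
// Count the bytes a string occupies when encoded as UTF-8.
// `utf8ByteLength` is a hypothetical helper name chosen for this sketch.
function utf8ByteLength(text) {
  // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
  return new TextEncoder().encode(text).length;
}

// Build the emoji from code points so the example does not depend on
// how this source file itself is encoded.
const thinking = String.fromCodePoint(0x1f914);
const family = String.fromCodePoint(0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f466);

console.log(utf8ByteLength("a"));      // 1
console.log(utf8ByteLength("\u3042")); // 3 (U+3042 is あ)
console.log(utf8ByteLength(thinking)); // 4
console.log(utf8ByteLength(family));   // 18
```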
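How should the "one character" you have in mind be counted?
It cannot be counted in bytes
Unicode is a character set: each character corresponds to a number
あ => 3042
🤔 => 1F914
This number is called a code point
A method of representing code points as a byte sequence is called an encoding
UTF-8 and UTF-16 are examples of encodings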
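To make the split between code point and encoding concrete, here is a minimal sketch that prints the code point of a character next to its UTF-8 bytes and its UTF-16 code units. It assumes a runtime with the global TextEncoder. The helper names toHex, codePointOf, utf8Bytes and utf16Units are hypothetical.

```javascript
// Format a number as upper-case hex, padded to `width` digits.
function toHex(n, width) {
  return n.toString(16).toUpperCase().padStart(width, "0");
}

// Code point of the first character in `text`, as hex.
function codePointOf(text) {
  return toHex(text.codePointAt(0), 4);
}

// The same text encoded as UTF-8, one hex string per byte.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text), (b) => toHex(b, 2));
}

// JavaScript strings are stored as UTF-16 code units, so charCodeAt
// exposes the UTF-16 encoding directly.
function utf16Units(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(toHex(text.charCodeAt(i), 4));
  }
  return units;
}

const hiraganaA = "\u3042";                     // あ
const thinking = String.fromCodePoint(0x1f914); // 🤔

console.log(codePointOf(hiraganaA), utf8Bytes(hiraganaA), utf16Units(hiraganaA));
// 3042 [ 'E3', '81', '82' ] [ '3042' ]
console.log(codePointOf(thinking), utf8Bytes(thinking), utf16Units(thinking));
// 1F914 [ 'F0', '9F', 'A4', '94' ] [ 'D83E', 'DD14' ]
```

One code point, two different byte sequences: the number stays the same and only the encoding changes.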
So does counting code points solve the problem?
No
👨‍👩‍👦 => 1F468 + 200D + 1F469 + 200D + 1F466
In fact, several code points can combine to form a single character
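The decomposition above can be reproduced with a short sketch that lists the code points of a string. The helper name codePoints is hypothetical. The emoji is built with String.fromCodePoint so the example does not rely on the file's own encoding.

```javascript
// List the code points of a string as upper-case hex.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0).toString(16).toUpperCase());
}

// MAN + ZERO WIDTH JOINER + WOMAN + ZERO WIDTH JOINER + BOY
const family = String.fromCodePoint(0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f466);

console.log(codePoints(family));
// [ '1F468', '200D', '1F469', '200D', '1F466' ]
console.log(codePoints(family).length); // 5
```

A reader sees one character, yet counting code points yields 5.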
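The "one character" that humans perceive is called a grapheme cluster
How can grapheme clusters be extracted from a sequence of code points?
There are strict rules for whether the position between two code points is a grapheme cluster boundary
UAX #29: Unicode Text Segmentation
I implemented this in JS: github.com/orisano/graphemesplit
For details, see UAX #29: http://unicode.org/reports/tr29/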
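As a rough illustration of counting grapheme clusters, the sketch below uses the built-in Intl.Segmenter rather than the author's graphemesplit library, whose API is not shown in these slides. It assumes a runtime that ships Intl.Segmenter with full ICU data (recent browsers, Node.js 16 or later). The exact boundaries depend on the Unicode version bundled with the runtime. The helper name countGraphemes is hypothetical.

```javascript
// Count grapheme clusters using the runtime's text segmentation rules.
// Assumption: Intl.Segmenter is available; this is not the graphemesplit API.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count++;
  }
  return count;
}

const family = String.fromCodePoint(0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f466);

console.log(countGraphemes("a"));       // 1
console.log(countGraphemes("e\u0301")); // 1 (e + COMBINING ACUTE ACCENT)
console.log(countGraphemes(family));    // 1 (five code points joined by ZWJ)
console.log(countGraphemes("ab"));      // 2
```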
When a stranger tells you "N characters", be sure to confirm what they mean!
1 byte? 1 code point? 1 grapheme cluster?