Implementing UTF-8 in mruby/c, Continued (続・mruby/cにUTF-8を実装する)
ima1zumi
August 19, 2023
Transcript
Implementing UTF-8 in mruby/c, Continued 2023-08-19 ima1zumi
Introduction • @ima1zumi (Mari Imaizumi) • Character encoding lover • IRB and Reline committer • ESM, inc.
Sponsor
What I talked about at RubyKaigi 2023 • I am implementing UTF-8 in mruby/c
What is mruby/c? • Low memory usage • < 40KB • Targets one-chip microcontrollers • Implemented in C and Ruby [^3]: https://github.com/mrubyc/mrubyc
mruby/c !== mruby Ref: "mruby/cで始めるオリジナルIoTデバイス作り" (Getting started building original IoT devices with mruby/c) https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html
String in mruby/c • Only the equivalent of Ruby's ASCII-8BIT is available • UTF-8 support is being implemented
Methods that need no changes • new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=> • 25 methods
Methods that need changes • Implemented: 12 • index, size(length), slice([]), slice!([]=), insert, inspect, ord • <<, Integer#chr, each_char, encoding, valid_encoding? • Todo: 3 • tr, tr!, split
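Integer#chr • Treats self as a code point and returns the corresponding character • With UTF-8, self is treated as a Unicode scalar value • Scalar values exclude the high and low surrogates (0xD800 through 0xDFFF) https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf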
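The scalar value rule above can be sketched in plain Ruby. This is only an illustration running on CRuby, not the mruby/c implementation; the helper names `unicode_scalar_value?` and `chr_or_nil` are made up for this example.

```ruby
# Hypothetical helper (not part of mruby/c): true when cp is a Unicode
# scalar value, i.e. a code point that is not a surrogate.
def unicode_scalar_value?(cp)
  return false unless cp.is_a?(Integer)
  return false if cp < 0 || cp > 0x10FFFF      # outside the Unicode code space
  return false if (0xD800..0xDFFF).cover?(cp)  # high and low surrogates
  true
end

# Hypothetical helper: CRuby's Integer#chr raises RangeError for a
# surrogate when asked for a UTF-8 character, so map that to nil.
def chr_or_nil(cp)
  cp.chr(Encoding::UTF_8)
rescue RangeError
  nil
end
```

With these helpers, `unicode_scalar_value?(0x3042)` is true and `unicode_scalar_value?(0xD800)` is false, which matches what CRuby's own `Integer#chr` accepts and rejects for UTF-8.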
Scalar value: 0b0011_0000_0100_0010 (U+3042)
0b0011_0000_0100_0010 (U+3042) → 1110zzzz 10yyyyyy 10xxxxxx
0b0011_0000_0100_0010 (U+3042) → 11100011 10yyyyyy 10xxxxxx
0b0011_0000_0100_0010 (U+3042) → 11100011 10000001 10xxxxxx
0b0011_0000_0100_0010 (U+3042) → 11100011 10000001 10000010
0b00110000_01000010 (U+3042) → 0b11100011_10000001_10000010
"あ".bytes.map { _1.to_s(2) } => ["11100011", "10000001", "10000010"]
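The bit-packing steps for U+3042 can be written out as a small Ruby sketch. It is an illustration of the standard UTF-8 encoding form, not the C code used in mruby/c; `utf8_bytes` is a hypothetical helper name.

```ruby
# Hypothetical helper: encode one Unicode scalar value as UTF-8 bytes.
def utf8_bytes(cp)
  if cp.negative? || cp > 0x10FFFF || (0xD800..0xDFFF).cover?(cp)
    raise RangeError, "not a Unicode scalar value: #{cp}"
  end

  if cp <= 0x7F
    [cp]                                         # 0xxxxxxx
  elsif cp <= 0x7FF
    [0b1100_0000 | (cp >> 6),                    # 110yyyyy
     0b1000_0000 | (cp & 0b11_1111)]             # 10xxxxxx
  elsif cp <= 0xFFFF
    [0b1110_0000 | (cp >> 12),                   # 1110zzzz
     0b1000_0000 | ((cp >> 6) & 0b11_1111),      # 10yyyyyy
     0b1000_0000 | (cp & 0b11_1111)]             # 10xxxxxx
  else
    [0b1111_0000 | (cp >> 18),                   # 11110uuu
     0b1000_0000 | ((cp >> 12) & 0b11_1111),     # 10uuzzzz
     0b1000_0000 | ((cp >> 6) & 0b11_1111),      # 10yyyyyy
     0b1000_0000 | (cp & 0b11_1111)]             # 10xxxxxx
  end
end

p utf8_bytes(0x3042).map { |b| b.to_s(2) }
# prints ["11100011", "10000001", "10000010"]
```

The three-byte branch is the one exercised by U+3042: the top 4 bits go into the lead byte, and the remaining 12 bits are split into two 6-bit continuation bytes.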
String#<<
String#encoding • mruby/c switches its character encoding through a build option • The method simply checks the build option and returns the result
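String#valid_encoding? • With ASCII-8BIT, anything is OK • With UTF-8, we need to determine whether the bytes are well-formed • Because the String encoding is switched by a build option, checking every String for well-formed UTF-8 would mean binary data can no longer be stored in a String • Even so, I do not want to allow invalid UTF-8 • Security risk • I do not want to enlarge the String struct • Compromise: make the check available through valid_encoding?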
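Well-formed UTF-8 byte sequences • mruby/c switches the String encoding through a build option • Checking every String for validity would make it impossible to create binary data • Even so, I do not want to allow invalid UTF-8 • Security risk • I do not want to enlarge the String struct • Compromise: make the check available through valid_encoding?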
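As a rough illustration of what a well-formedness check involves, here is a straightforward byte-by-byte validator in Ruby following Table 3-7 ("Well-Formed UTF-8 Byte Sequences") of the Unicode Standard. It is a sketch under my own assumptions, not the mruby/c implementation and not one of the faster range or SIMD algorithms; `utf8_lead_info` and `well_formed_utf8?` are hypothetical names.

```ruby
# Hypothetical helper: for a lead byte, return [number of continuation
# bytes, allowed range of the second byte], or nil if the lead is invalid.
def utf8_lead_info(b0)
  case b0
  when 0x00..0x7F             then [0, nil]
  when 0xC2..0xDF             then [1, 0x80..0xBF]
  when 0xE0                   then [2, 0xA0..0xBF]  # rejects overlong forms
  when 0xE1..0xEC, 0xEE..0xEF then [2, 0x80..0xBF]
  when 0xED                   then [2, 0x80..0x9F]  # rejects surrogates
  when 0xF0                   then [3, 0x90..0xBF]  # rejects overlong forms
  when 0xF1..0xF3             then [3, 0x80..0xBF]
  when 0xF4                   then [3, 0x80..0x8F]  # rejects > U+10FFFF
  end
end

# Hypothetical helper: true when the byte array is well-formed UTF-8.
def well_formed_utf8?(bytes)
  i = 0
  while i < bytes.length
    info = utf8_lead_info(bytes[i])
    return false unless info
    trail, second = info
    return false if i + trail >= bytes.length  # sequence is truncated
    trail.times do |k|
      range = k.zero? ? second : (0x80..0xBF)
      return false unless range.cover?(bytes[i + 1 + k])
    end
    i += trail + 1
  end
  true
end
```

For example, `well_formed_utf8?([0xE3, 0x81, 0x82])` is true, while the overlong `[0xC0, 0x80]` and the encoded surrogate `[0xED, 0xA0, 0x80]` are both rejected. The restricted second-byte ranges are what distinguish this from a check that only counts continuation bytes.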
valid_encoding? • There are various UTF-8 validation algorithms • [Fast UTF-8 validation: an introduction to the range algorithm (in Japanese)](https://tekitoh-memdhoi.info/views/872) • https://github.com/cyb70289/utf8 • [[2010.03090] Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090) • I want to understand these
upcase, downcase • Not supported • A Unicode uppercase/lowercase conversion mapping table would be required • "LJ".downcase == "lj" • There are probably few use cases on a microcontroller that need this
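To show why case conversion is cheap for ASCII but needs a table for Unicode, here is a small Ruby sketch. `ascii_downcase` is a hypothetical helper written for this example and is not an mruby/c API; the comparison with `String#downcase` relies on CRuby, which ships Unicode case-mapping data.

```ruby
# Hypothetical helper: ASCII-only downcase. It needs no mapping table,
# only a fixed offset of 0x20 for the bytes 'A'..'Z'.
def ascii_downcase(str)
  str.bytes
     .map { |b| (0x41..0x5A).cover?(b) ? b + 0x20 : b }
     .pack("C*")
     .force_encoding(str.encoding)
end

ascii_downcase("ABC")       # "abc"
ascii_downcase("\u00C4B")   # "Äb": the non-ASCII letter is left as-is
"\u00C4".downcase           # "ä" on CRuby, which uses Unicode case mappings
```

Bytes at or above 0x80 are never touched, so multi-byte UTF-8 sequences pass through unchanged; mapping them correctly is the part that would require the Unicode table.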