Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
続・mruby/cにUTF-8 を実装する
Search
ima1zumi
August 19, 2023
1
33
続・mruby/cにUTF-8 を実装する
ima1zumi
August 19, 2023
Tweet
Share
More Decks by ima1zumi
See All by ima1zumi
Ruby Taught Me About Under the Hood
ima1zumi
6
16k
Exploring Reline: Enhancing Command Line Usability
ima1zumi
1
94
10年物のRailsアプリにキャッチアップ!〜コードを読まずに理解したかった〜
ima1zumi
0
97
RubyKaigiの登壇者一覧ページを作った
ima1zumi
0
440
Relineのその後の生活
ima1zumi
0
240
IRB and Reline Kaigi 2024
ima1zumi
0
15
Exploring Reline: Enhancing Command Line Usability
ima1zumi
3
15k
Reline 1分 Cooking
ima1zumi
0
39
UTF-8 is coming to mruby/c
ima1zumi
4
5.5k
Featured
See All Featured
Learning to Love Humans: Emotional Interface Design
aarron
273
40k
A Modern Web Designer's Workflow
chriscoyier
695
190k
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
8
480
The Language of Interfaces
destraynor
160
25k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
507
140k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.6k
Speed Design
sergeychernyshev
32
1.1k
Product Roadmaps are Hard
iamctodd
PRO
54
11k
Practical Orchestrator
shlominoach
190
11k
Automating Front-end Workflow
addyosmani
1370
200k
Java REST API Framework Comparison - PWX 2021
mraible
33
8.8k
Transcript
ଓɾmruby/cʹUTF-8 Λ࣮͢Δ 2023-08-19 ima1zumi
Introduction • @ima1zumi (Mari Imaizumi) • Character encoding lover •
IRB and Reline committer • ESM, inc. 2
ఏڙ
RubyKaigi 2023Ͱͨ͜͠ͱ • mruby/cʹUTF-8Λ࣮ͯ͠·͢
mruby/cͱ • ϝϞϦ༻ྔ͕গͳ͍ • < 40KB • ϫϯνοϓϚΠίϯ͕λʔήοτ • CͱRubyͰ࣮͞Ε͍ͯΔ
5 [^3]: https://github.com/mrubyc/mrubyc
mruby/c !== mruby 6 Ref: mruby/cͰ࢝ΊΔΦϦδφϧIoTσόΠε࡞Γ https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html
mruby/cͷString • RubyͰ͍͏ͱ͜ΖͷASCII-8BIT͔͑͠ͳ͍ • UTF-8͑ΔΑ͏ʹ࣮த
ରԠ͠ͳ͍͍ͯ͘ϝιου • new, +, *, to_i, to_f, to_s, b, clear,
chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=> • 25ݸ 8
ରԠ͕ඞཁ • ࣮ࡁ: 12 • index, size(length), slice([]), slice!([]=), insert,
inspect, ord • <<, Integer#chr, each_char, encoding, valid_encoding? • Todo: 3 • tr, tr!, split 9
Integer#chr • selfΛίʔυϙΠϯτͱΈͳͯ͠จࣈΛฦ͢ • UTF-8ͷ߹Unicode scalar valueͱΈͳ͢ • Scalar value্Լαϩήʔτ(0xD800͔Β0xDFFF)ؚ·ͳ͍
https://www.unicode.org/versions/ Unicode15.0.0/ch03.pdf
None
0b0011_0000_0100_0010 (U+3042) ↑scalar value
0b0011_0000_0100_0010 (U+3042) 1110zzzz 10yyyyyy 10xxxxxx ↑scalar value
1110zzzz 10yyyyyy 10xxxxxx ↑scalar value 0b0011_0000_0100_0010 (U+3042)
11100011 10yyyyyy 10xxxxxx ↑scalar value 0b0011_0000_0100_0010 (U+3042)
0b0011_0000_0100_0010 (U+3042) 11100011 10yyyyyy 10xxxxxx ↑scalar value
0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10xxxxxx ↑scalar value
0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10xxxxxx ↑scalar value
0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10000010 ↑scalar value
0b00110000_01000010 (U+3042) 0b11100011_10000001_10000010 "あ".bytes.map { _1.to_s(2) } => ["11100011", "10000001",
"10000010"]
None
String#<<
String#encoding • mruby/cͷจࣈίʔυΓସ͑ϏϧυΦϓγϣϯͰ੍ޚ • ϏϧυΦϓγϣϯΛݟͯฦ͚ͩ͢
String#valid_encoding? • ASCII-8BITͳΜͰOK • UTF-8well-formed͔Ͳ͏͔ͷఆ͕ඞཁ • ϏϧυΦϓγϣϯͰStringͷencodingΛΓସ͍͑ͯΔͨΊɺUTF-8ͱͯ͠well-formed͔Ͳ͏͔ Λͯ͢ͷStringͰ֬ೝ͢ΔͱόΠφϦ͕StringʹೖΒͳ͍ͱ͍͏͕͋Δ • ͔ͱ͍ͬͯෆਖ਼UTF-8ڐͨ͘͠ͳ͍
• ηΩϡϦςΟϦεΫ • StringͷߏମΛେ͖ͨ͘͘͠ͳ͍ • ંҊ: valid_encoding?ͰνΣοΫՄೳʹ͢Δ
Well-formed UTF-8 byte sequences • mruby/cϏϧυΦϓγϣϯͰStringͷencodingΛΓସ͑Δ • શStringͰਖ਼͍͔֬͠ೝ͢ΔͱόΠφϦ͕࡞ෆՄೳʹͳΔ • ͔ͱ͍ͬͯෆਖ਼UTF-8ڐͨ͘͠ͳ͍
• ηΩϡϦςΟϦεΫ • StringͷߏମΛେ͖ͨ͘͘͠ͳ͍ • ંҊ: valid_encoding?ͰνΣοΫՄೳʹ͢Δ
None
None
valid_encoding? • UTF-8ͷόϦσʔγϣϯΞϧΰϦζϜ͍Ζ͍Ζ • [ߴͳUTF-8όϦσʔγϣϯɺrangeΞϧΰϦζϜͷհ - ͖ͯͱ͏ ͳ͍͞ͱɻ͐ͨΜ](https://tekitoh-memdhoi.info/views/872) • https://github.com/cyb70289/utf8
• [[2010.03090] Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090) • Θ͔Γ͍ͨ
upcase, downcase • ରԠ͠ͳ͍ • UnicodeͷେจࣈɺখจࣈมϚοϐϯάςʔϒϧ͕ඞਢ • "LJ".downcase == "lj"
• ϚΠίϯͰͦ͜·Ͱ͍ͨ͠Ϣʔεέʔε͋·Γͳͦ͞͏