Implementing UTF-8 in mruby/c, Continued
ima1zumi
August 19, 2023
Transcript
Implementing UTF-8 in mruby/c, Continued 2023-08-19 ima1zumi
Introduction • @ima1zumi (Mari Imaizumi) • Character encoding lover • IRB and Reline committer • ESM, inc.
Sponsored by
What I talked about at RubyKaigi 2023 • I am implementing UTF-8 in mruby/c
What is mruby/c? • Low memory usage • < 40KB • Targets one-chip microcontrollers • Implemented in C and Ruby [^3]: https://github.com/mrubyc/mrubyc
mruby/c !== mruby Ref: "mruby/cで始めるオリジナルIoTデバイス作り" (Building Original IoT Devices with mruby/c) https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html
String in mruby/c • Only what Ruby calls ASCII-8BIT is available • UTF-8 support is being implemented as well
Methods that need no changes • new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=> • 25 methods
Methods that need changes • Implemented: 12 • index, size(length), slice([]), slice!([]=), insert, inspect, ord • <<, Integer#chr, each_char, encoding, valid_encoding? • Todo: 3 • tr, tr!, split
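Methods such as size(length) and index need changes because, in UTF-8, a character position is no longer the same as a byte position. The following is a minimal sketch of character counting in plain Ruby, not mruby/c's C implementation. The name `utf8_char_count` is hypothetical, and the input is assumed to be well-formed UTF-8.

```ruby
# Hypothetical helper for illustration; not part of mruby/c.
# Counts characters in a well-formed UTF-8 byte array by counting every
# byte that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
def utf8_char_count(bytes)
  bytes.count { |b| (b & 0xC0) != 0x80 }
end

sample = "a\u3042" # "a" followed by U+3042
puts sample.bytes.size             # byte length
puts utf8_char_count(sample.bytes) # character length
```

Counting lead bytes this way only gives a meaningful answer when the bytes are already well-formed, which is why validation is treated as a separate concern.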
Integer#chr • Returns the character for self interpreted as a code point • In UTF-8, self is treated as a Unicode scalar value • Scalar values exclude the high and low surrogates (0xD800 to 0xDFFF) https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
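The scalar value rule can be written as a simple range check. This is a sketch in plain Ruby, and `unicode_scalar_value?` is a hypothetical name used only for illustration. The upper bound 0x10FFFF is the end of the Unicode code space.

```ruby
# Hypothetical helper for illustration; not part of mruby/c.
# A Unicode scalar value is any code point except the surrogates
# 0xD800..0xDFFF, so the valid set is two contiguous ranges.
def unicode_scalar_value?(n)
  n.between?(0x0000, 0xD7FF) || n.between?(0xE000, 0x10FFFF)
end

puts unicode_scalar_value?(0x3042) # U+3042 is a scalar value
puts unicode_scalar_value?(0xD800) # a surrogate is not
```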
0b0011_0000_0100_0010 (U+3042) ↑scalar value
0b0011_0000_0100_0010 (U+3042) ↑scalar value 1110zzzz 10yyyyyy 10xxxxxx
0b0011_0000_0100_0010 (U+3042) ↑scalar value 11100011 10yyyyyy 10xxxxxx
0b0011_0000_0100_0010 (U+3042) ↑scalar value 11100011 10000001 10xxxxxx
0b0011_0000_0100_0010 (U+3042) ↑scalar value 11100011 10000001 10000010
0b00110000_01000010 (U+3042) 0b11100011_10000001_10000010 "あ".bytes.map { _1.to_s(2) } => ["11100011", "10000001", "10000010"]
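The bit-filling steps for U+3042 generalize to all four UTF-8 lengths. The sketch below is plain Ruby written for illustration, not mruby/c's C code. `utf8_bytes` is a hypothetical name, and raising RangeError for non-scalar input is an assumption of this sketch.

```ruby
# Hypothetical helper for illustration; not part of mruby/c.
# Encodes one Unicode scalar value into its UTF-8 byte sequence.
def utf8_bytes(scalar)
  unless scalar.between?(0, 0xD7FF) || scalar.between?(0xE000, 0x10FFFF)
    raise RangeError, "not a Unicode scalar value: 0x#{scalar.to_s(16)}"
  end

  if scalar <= 0x7F      # 0xxxxxxx
    [scalar]
  elsif scalar <= 0x7FF  # 110yyyyy 10xxxxxx
    [0xC0 | (scalar >> 6), 0x80 | (scalar & 0x3F)]
  elsif scalar <= 0xFFFF # 1110zzzz 10yyyyyy 10xxxxxx
    [0xE0 | (scalar >> 12),
     0x80 | ((scalar >> 6) & 0x3F),
     0x80 | (scalar & 0x3F)]
  else                   # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    [0xF0 | (scalar >> 18),
     0x80 | ((scalar >> 12) & 0x3F),
     0x80 | ((scalar >> 6) & 0x3F),
     0x80 | (scalar & 0x3F)]
  end
end

p utf8_bytes(0x3042).map { |b| b.to_s(2) }
# => ["11100011", "10000001", "10000010"]
```

For U+3042 this takes the three-byte branch: the top 4 bits go into the lead byte and the remaining 12 bits are split into two 6-bit groups.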
String#<<
String#encoding • mruby/c's character encoding is switched with a build option • It just looks at the build option and returns the result
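String#valid_encoding? • ASCII-8BIT accepts any bytes • UTF-8 requires a check for whether the bytes are well-formed • Because the String encoding is switched by a build option, checking every String for well-formed UTF-8 would make it impossible to put binary data into a String • Still, ill-formed UTF-8 should not be allowed • Security risk • The String struct should not grow • Compromise: make the check available through valid_encoding?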
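Well-formed UTF-8 byte sequences • mruby/c switches the String encoding with a build option • Validating every String would make it impossible to create binary data • Still, ill-formed UTF-8 should not be allowed • Security risk • The String struct should not grow • Compromise: make the check available through valid_encoding?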
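A straightforward way to check well-formedness is to walk the bytes against the ranges in the Unicode Standard's table of well-formed UTF-8 byte sequences (chapter 3). The sketch below is plain Ruby for illustration and is not the algorithm mruby/c uses. `well_formed_utf8?` is a hypothetical name.

```ruby
# Hypothetical helper for illustration; not mruby/c's implementation.
# Returns true when the byte array is a well-formed UTF-8 sequence.
def well_formed_utf8?(bytes)
  i = 0
  while i < bytes.size
    # For each lead byte: [allowed range of the second byte, trailing byte count]
    spec =
      case bytes[i]
      when 0x00..0x7F             then [nil, 0]
      when 0xC2..0xDF             then [0x80..0xBF, 1]
      when 0xE0                   then [0xA0..0xBF, 2] # rejects overlong forms
      when 0xE1..0xEC, 0xEE..0xEF then [0x80..0xBF, 2]
      when 0xED                   then [0x80..0x9F, 2] # rejects surrogates
      when 0xF0                   then [0x90..0xBF, 3] # rejects overlong forms
      when 0xF1..0xF3             then [0x80..0xBF, 3]
      when 0xF4                   then [0x80..0x8F, 3] # rejects > U+10FFFF
      end
    return false if spec.nil? # 0x80..0xC1 and 0xF5..0xFF never start a character

    second, trailing = spec
    return false if i + trailing >= bytes.size # sequence is cut off

    trailing.times do |k|
      range = k.zero? ? second : (0x80..0xBF)
      return false unless range.cover?(bytes[i + 1 + k])
    end
    i += trailing + 1
  end
  true
end

puts well_formed_utf8?([0xE3, 0x81, 0x82]) # U+3042
puts well_formed_utf8?([0xED, 0xA0, 0x80]) # encoded surrogate
```

Because the check runs only when asked for, arbitrary binary data can still be stored in a String, which matches the compromise described on the slide.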
valid_encoding? • There are various UTF-8 validation algorithms • [Fast UTF-8 validation: an introduction to the range algorithm - てきとうなさいと。べぇたばん](https://tekitoh-memdhoi.info/views/872) • https://github.com/cyb70289/utf8 • [[2010.03090] Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090) • I want to understand them
upcase, downcase • Not supported • Unicode uppercase/lowercase conversion requires mapping tables • "LJ".downcase == "lj" • There do not seem to be many use cases that need this on a microcontroller
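The difference between ASCII and Unicode case conversion can be seen in CRuby, used here only as a reference point since mruby/c does not support these methods for Unicode. ASCII letters convert with fixed arithmetic, while a character such as U+01C7 needs a table lookup. `ascii_downcase` is a hypothetical helper written for this sketch.

```ruby
# Hypothetical helper for illustration; not part of mruby/c.
# ASCII-only downcase: add 0x20 to bytes in "A".."Z", leave the rest alone.
# UTF-8 lead and continuation bytes are all >= 0x80, so they are untouched.
def ascii_downcase(str)
  str.bytes
     .map { |b| b.between?(0x41, 0x5A) ? b + 0x20 : b }
     .pack("C*")
     .force_encoding(str.encoding)
end

puts ascii_downcase("LJ") == "lj"              # arithmetic is enough for ASCII
puts ascii_downcase("\u01C7") == "\u01C7"      # U+01C7 is left unchanged
puts "\u01C7".downcase == "\u01C9"             # CRuby maps it via Unicode tables
```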