Slide 1

Slide 1 text

Continued: Implementing UTF-8 in mruby/c 2023-08-19 ima1zumi

Slide 2

Slide 2 text

Introduction • @ima1zumi (Mari Imaizumi) • Character encoding lover • IRB and Reline committer • ESM, inc.

Slide 3

Slide 3 text

Sponsor

Slide 4

Slide 4 text

What I talked about at RubyKaigi 2023 • I am implementing UTF-8 in mruby/c

Slide 5

Slide 5 text

What is mruby/c? • Low memory usage • < 40KB • Targets one-chip microcontrollers • Implemented in C and Ruby [^3]: https://github.com/mrubyc/mrubyc

Slide 6

Slide 6 text

mruby/c !== mruby Ref: "Getting started building original IoT devices with mruby/c" https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html

Slide 7

Slide 7 text

String in mruby/c • Only what Ruby calls ASCII-8BIT is available • UTF-8 support is being implemented

Slide 8

Slide 8 text

Methods that need no changes • new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=> • 25 methods

Slide 9

Slide 9 text

Methods that need changes • Implemented: 12 • index, size(length), slice([]), slice!([]=), insert, inspect, ord • <<, Integer#chr, each_char, encoding, valid_encoding? • Todo: 3 • tr, tr!, split
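Why these methods need work can be illustrated in CRuby (not mruby/c): the same bytes give different answers depending on whether the String is treated as ASCII-8BIT or UTF-8. The helper name `string_view` is made up for this sketch.

```ruby
# Returns [size, index of "b", element at position 1] for a string.
def string_view(str)
  [str.size, str.index("b"), str[1]]
end

utf8   = "a\u3042b"   # "aあb": 1 + 3 + 1 bytes in UTF-8
binary = utf8.b       # same bytes, tagged ASCII-8BIT

p string_view(binary) # byte-based view: size 5, "b" at index 4
p string_view(utf8)   # character-based view: size 3, "b" at index 2
```

Methods such as `bytes` or `getbyte` are byte-oriented either way, which is consistent with them appearing in the "no changes" list.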

Slide 10

Slide 10 text

Integer#chr • Treats self as a code point and returns the corresponding character • For UTF-8, self is treated as a Unicode scalar value • Scalar values exclude the high and low surrogates (0xD800 to 0xDFFF)
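The scalar value rule can be sketched in CRuby as follows. `scalar_value?` is a hypothetical helper for illustration, not a function from mruby/c.

```ruby
# True when cp is a Unicode scalar value:
# any code point in 0..0x10FFFF except the surrogates 0xD800..0xDFFF.
def scalar_value?(cp)
  cp.between?(0, 0x10FFFF) && !cp.between?(0xD800, 0xDFFF)
end

p scalar_value?(0x3042)   # inside the valid range
p scalar_value?(0xD800)   # a surrogate, so not a scalar value
p 0x3042.chr("UTF-8")     # CRuby returns the character for U+3042

begin
  0xD800.chr("UTF-8")     # CRuby rejects surrogates for UTF-8
rescue RangeError => e
  puts "RangeError: #{e.message}"
end
```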

Slide 11

Slide 11 text

https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

0b0011_0000_0100_0010 (U+3042) ↑scalar value

Slide 14

Slide 14 text

0b0011_0000_0100_0010 (U+3042) 1110zzzz 10yyyyyy 10xxxxxx ↑scalar value

Slide 15

Slide 15 text

1110zzzz 10yyyyyy 10xxxxxx ↑scalar value 0b0011_0000_0100_0010 (U+3042)

Slide 16

Slide 16 text

11100011 10yyyyyy 10xxxxxx ↑scalar value 0b0011_0000_0100_0010 (U+3042)

Slide 17

Slide 17 text

0b0011_0000_0100_0010 (U+3042) 11100011 10yyyyyy 10xxxxxx ↑scalar value

Slide 18

Slide 18 text

0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10xxxxxx ↑scalar value

Slide 19

Slide 19 text

0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10xxxxxx ↑scalar value

Slide 20

Slide 20 text

0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10000010 ↑scalar value

Slide 21

Slide 21 text

0b00110000_01000010 (U+3042)
0b11100011_10000001_10000010
"あ".bytes.map { _1.to_s(2) }
=> ["11100011", "10000001", "10000010"]
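The bit packing walked through on slides 13 to 21 can be written out as a small Ruby sketch. It covers only the three-byte form (U+0800 to U+FFFF), and `encode_3byte` is a name chosen for this example rather than anything in mruby/c.

```ruby
# Encodes a scalar value in U+0800..U+FFFF as three UTF-8 bytes
# following the pattern 1110zzzz 10yyyyyy 10xxxxxx.
def encode_3byte(cp)
  unless cp.between?(0x0800, 0xFFFF) && !cp.between?(0xD800, 0xDFFF)
    raise RangeError, "not a 3-byte scalar value: 0x#{cp.to_s(16)}"
  end
  [
    0b1110_0000 | (cp >> 12),          # zzzz: top 4 bits
    0b1000_0000 | ((cp >> 6) & 0x3F),  # yyyyyy: middle 6 bits
    0b1000_0000 | (cp & 0x3F)          # xxxxxx: low 6 bits
  ]
end

bytes = encode_3byte(0x3042)
p bytes.map { |b| b.to_s(2) }               # compare with the slide's bit patterns
p bytes.pack("C*").force_encoding("UTF-8")  # the resulting character
```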

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

String#<<
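One reason `String#<<` is encoding-sensitive: in CRuby, an Integer argument is interpreted as a code point in the receiver's encoding. The sketch below shows CRuby behavior only; how mruby/c handles this case is not stated on the slide. `append_codepoint` is a hypothetical helper.

```ruby
# Appends an Integer code point to str, returning the result
# or a description of the RangeError raised by CRuby.
def append_codepoint(str, cp)
  str << cp
rescue RangeError => e
  "RangeError: #{e.message}"
end

p append_codepoint(+"a", 0x3042)   # UTF-8 receiver: appends the encoded character
p append_codepoint("a".b, 0x3042)  # ASCII-8BIT receiver: values above 0xFF are rejected
```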

Slide 24

Slide 24 text

String#encoding • Switching the character encoding in mruby/c is controlled by a build option • The method simply returns a value based on that build option

Slide 25

Slide 25 text

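String#valid_encoding? • ASCII-8BIT accepts anything • UTF-8 requires checking whether the bytes are well-formed • Because the String encoding is switched by a build option, checking every String for well-formed UTF-8 would mean binary data could not be stored in a String • Still, I do not want to allow ill-formed UTF-8 • Security risk • I also do not want to enlarge the String struct • Compromise: make the check available through valid_encoding?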
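For reference, this is how the same distinction looks in CRuby, where the encoding is a per-String tag rather than a build option. `valid_as` is a helper invented for this sketch.

```ruby
# Builds a String from raw bytes, tags it with the given encoding,
# and asks CRuby whether the bytes are valid in that encoding.
def valid_as(bytes, encoding)
  bytes.pack("C*").force_encoding(encoding).valid_encoding?
end

p valid_as([0xE3, 0x81, 0x82], "UTF-8")   # complete "あ"
p valid_as([0xE3, 0x81], "UTF-8")         # truncated sequence
p valid_as([0xE3, 0x81], "ASCII-8BIT")    # any bytes are fine as binary
```

Under the compromise above, a String holding the truncated bytes could still exist in a UTF-8 build, and the caller would detect the problem by calling `valid_encoding?`.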

Slide 26

Slide 26 text

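Well-formed UTF-8 byte sequences • mruby/c switches the String encoding with a build option • Verifying every String would make it impossible to create binary data • Still, I do not want to allow ill-formed UTF-8 • Security risk • I also do not want to enlarge the String struct • Compromise: make the check available through valid_encoding?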
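A straightforward, table-driven check for well-formedness might look like the sketch below. It follows the byte ranges of Table 3-7 ("Well-Formed UTF-8 Byte Sequences") in chapter 3 of the Unicode Standard. It is an illustration in Ruby, not the mruby/c implementation (which is written in C), and the names `CONT`, `trailing_ranges`, and `well_formed_utf8?` are assumptions of this example.

```ruby
CONT = (0x80..0xBF) # ordinary continuation byte

# Maps a lead byte to the allowed ranges of the bytes that must follow,
# or nil when the byte cannot start a well-formed sequence.
def trailing_ranges(lead)
  case lead
  when 0x00..0x7F             then []
  when 0xC2..0xDF             then [CONT]
  when 0xE0                   then [0xA0..0xBF, CONT]      # no overlong forms
  when 0xE1..0xEC, 0xEE..0xEF then [CONT, CONT]
  when 0xED                   then [0x80..0x9F, CONT]      # no surrogates
  when 0xF0                   then [0x90..0xBF, CONT, CONT]
  when 0xF1..0xF3             then [CONT, CONT, CONT]
  when 0xF4                   then [0x80..0x8F, CONT, CONT] # max U+10FFFF
  end
end

# bytes: Array of Integers (0..255)
def well_formed_utf8?(bytes)
  i = 0
  while i < bytes.size
    ranges = trailing_ranges(bytes[i])
    return false if ranges.nil?
    ranges.each_with_index do |range, k|
      b = bytes[i + 1 + k]
      return false if b.nil? || !range.cover?(b)
    end
    i += 1 + ranges.size
  end
  true
end

p well_formed_utf8?("\u3042".bytes)      # complete three-byte sequence
p well_formed_utf8?([0xED, 0xA0, 0x80])  # would encode surrogate U+D800
```

The restricted second-byte ranges after 0xE0, 0xED, 0xF0, and 0xF4 are what rule out overlong encodings, surrogates, and values above U+10FFFF.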

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

valid_encoding? • There are various UTF-8 validation algorithms • [Introducing the range algorithm for ultra-fast UTF-8 validation - てきとうなさいと。べぇたばん](https://tekitoh-memdhoi.info/views/872) • https://github.com/cyb70289/utf8 • [[2010.03090] Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090) • I want to understand them

Slide 30

Slide 30 text

upcase, downcase • Not supported • Unicode upper/lower case conversion requires a mapping table • "LJ".downcase == "lj" • There are probably few use cases that need this on a microcontroller
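The mapping-table requirement can be seen in CRuby with a non-ASCII letter pair. The sketch assumes the slide's example refers to single-code-point letters such as U+01C7 (Ǉ) and U+01C9 (ǉ); that is a guess, since the extracted text shows plain ASCII. `downcase_both` is a made-up helper.

```ruby
# Returns [full Unicode downcase, ASCII-only downcase] for a string.
def downcase_both(str)
  [str.downcase, str.downcase(:ascii)]
end

lj_upper = "\u01C7" # LATIN CAPITAL LETTER LJ
lj_lower = "\u01C9" # LATIN SMALL LETTER LJ

full, ascii_only = downcase_both(lj_upper)
p full == lj_lower        # needs Unicode case-mapping data
p ascii_only == lj_upper  # ASCII-only conversion leaves it unchanged
p downcase_both("ABC")    # plain ASCII needs no table
```

ASCII-only conversion is simple arithmetic on the range A to Z, whereas the U+01C7 to U+01C9 mapping cannot be derived without a lookup table.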