UTF-8 is coming to mruby/c
2023-05-11 ima1zumi
Introduction
• @ima1zumi (Mari Imaizumi)
• Character encoding lover
• IRB and Reline committer
• ESM, inc.
Day 2
Coffeehouse sponsor
Distribute Ruby Method Karuta
Question
Have you ever used UTF-8?
UTF-8 is the de facto standard
• > UTF-8 is used by 97.9% of all the websites whose character encoding they know.
  • https://w3techs.com/technologies/overview/character_encoding
• Many programming languages support UTF-8
  • Ruby, Python, Rust, C++[^1], PHP, Java[^2]
[^1]: char8_t: A type for UTF-8 characters and strings https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html
[^2]: JEP 400: UTF-8 by Default https://openjdk.org/jeps/400
What is UTF-8
• 8-bit UCS Transformation Format
• Variable length
  • 1 byte ~ 4 bytes
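As a quick illustration of "variable length", this sketch prints how many bytes one character from each length class takes in UTF-8. It runs on CRuby, not mruby/c, and the four sample characters are chosen here for illustration only.

```ruby
# One sample character per UTF-8 length class (samples are illustrative).
samples = [
  "A",          # U+0041
  "\u00E9",     # U+00E9 (é)
  "\u3042",     # U+3042 (あ)
  "\u{1F48E}",  # U+1F48E (💎)
]

# String#bytesize counts bytes regardless of the encoding label.
utf8_lengths = samples.map(&:bytesize)

samples.each_with_index do |char, i|
  puts format("U+%04X: %d byte(s)", char.ord, utf8_lengths[i])
end
```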
UTF-8 implementation in mruby/c (in progress)
https://www.s-itoc.jp/support/technical-support/mrubyc/mrubyc-logo/
Before implementation
After implementation
How to implement in mruby/c
https://github.com/mrubyc/mrubyc/pull/191
What is mruby/c
• mruby/c is another implementation of mruby
[^3]: https://github.com/mrubyc/mrubyc
What is mruby
• mruby is the lightweight implementation of the Ruby language
• Focus on compatibility with Ruby
• Memory size < 400KB
What is mruby/c
• Small memory consumption
  • < 40KB
• Concurrent
• Target
  • one-chip microprocessors
• Written in C or Ruby
mruby/c !== mruby
Ref: "Building original IoT devices with mruby/c" (article in Japanese) https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html
Use case
• Microcontroller
• IoT device
• Reduction of defect rate in industrial sewing machines[^4]
• prk_firmware
[^4]: https://www.ruby.or.jp/ja/showcase/case80.html
Why I started
https://slide.rabbit-shocker.org/authors/hasumikin/RubyWorldConference2022/?page=58
Advantages of using UTF-8 in mruby/c
• prk_firmware allows UTF-8
• Network programming
• Shell
It would be nice if UTF-8 were implemented in mruby/c!
What should we implement?
A mruby/c string is just a sequence of bytes
What does it mean for a String to be a byte sequence?
• It doesn't have a character encoding.
What is a character encoding
• A label that indicates which character code the byte sequence is written in
• Does not affect the byte sequence
11100011 10000001 10000010
UTF-8
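The "label" idea can be tried out in CRuby (not mruby/c), where every String carries an encoding. In this minimal sketch, the three bytes from the slide are first held as a binary string and then relabeled as UTF-8; the variable names are made up for illustration.

```ruby
# The three bytes from the slide: 11100011 10000001 10000010
raw = [0b11100011, 0b10000001, 0b10000010].pack("C*") # binary (ASCII-8BIT) label

# force_encoding only swaps the label; the bytes are left untouched.
labeled = raw.dup.force_encoding(Encoding::UTF_8)

puts raw.bytes == labeled.bytes # the byte sequence is identical
puts raw.size                   # counted as bytes
puts labeled.size               # counted as UTF-8 characters
```

With the UTF-8 label, the same three bytes are read as the single character U+3042 (あ).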
If String is a byte sequence
Why "💎❤🏯".size is 11
• Because "💎❤🏯" is 11 bytes in UTF-8
• A mruby/c string has no character encoding, so size returns the number of bytes
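The 11 can be reproduced in CRuby by viewing the string without its UTF-8 label. This is only an approximation of the mruby/c behavior described above: `String#b` returns a binary-labeled copy, which stands in here for a string with no character encoding.

```ruby
str = "\u{1F48E}\u{2764}\u{1F3EF}" # 💎❤🏯 written as escapes

byte_view = str.b                        # same bytes, binary label
per_char  = str.each_char.map(&:bytesize) # bytes used by each character

puts byte_view.size   # size counts bytes when there is no encoding
puts str.size         # size counts characters with the UTF-8 label
puts per_char.inspect # how the byte total splits up per character
```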
How can I get `"💎❤🏯".size` to return 3?
How do we know "how many characters" from byte sequences?
In the case of UTF-8, you can tell by looking at the first n bits of a byte
• UTF-8 is variable length from 1 to 4 bytes
• The leading bits mark the first byte of a 1-, 2-, 3-, or 4-byte sequence, or an ongoing (continuation) byte
  • 1 byte: 0xxxxxxx
  • 2 bytes: 110xxxxx
  • 3 bytes: 1110xxxx
  • 4 bytes: 11110xxx
  • ongoing (continuation byte): 10xxxxxx
Sample
• 1 byte: 0xxxxxxx
• 2 bytes: 110xxxxx
• 3 bytes: 1110xxxx
• 4 bytes: 11110xxx
• ongoing (continuation byte): 10xxxxxx
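The bit patterns above can be walked through in code. The following Ruby sketch is not the code from the mruby/c pull request; `utf8_char_length` and `utf8_char_count` are hypothetical names. It assumes well-formed input and does not check that continuation bytes actually follow each lead byte.

```ruby
# Length in bytes of the character that starts with +byte+,
# or nil for a continuation byte (10xxxxxx) or an invalid lead byte.
def utf8_char_length(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then 1
  elsif (byte & 0b1110_0000) == 0b1100_0000 then 2
  elsif (byte & 0b1111_0000) == 0b1110_0000 then 3
  elsif (byte & 0b1111_1000) == 0b1111_0000 then 4
  end
end

# Count characters by jumping from lead byte to lead byte.
def utf8_char_count(bytes)
  count = 0
  i = 0
  while i < bytes.size
    len = utf8_char_length(bytes[i])
    raise ArgumentError, "unexpected byte at offset #{i}" if len.nil?
    count += 1
    i += len # skip over this character's continuation bytes
  end
  count
end

puts utf8_char_count("\u{1F48E}\u{2764}\u{1F3EF}".bytes) # 💎❤🏯
```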
Implementation
Written in Ruby
Written in C
size (length) is now working
Tests
Implement in the same way
Status of UTF-8 support in mruby/c
Implementation not required
• new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=>
• These work as byte sequence operations, so they already handle UTF-8
Implementation required
• Implemented:
  • index, size(length), slice([]), slice!([]=), insert, inspect, ord
• To be implemented:
  • <<, tr, tr!, chr, each_char, split
Implementation required
• Handling character count (index, size, slice, insert)
• Handling Unicode code point (ord, chr, <<)
• Only ASCII was supported (inspect)
• Multipurpose (tr, tr!)
• Segmentation fault (split)
What does counting characters mean in Unicode?
• Unicode code point
• Grapheme clusters
Unicode code point
• Any value in the Unicode codespace
• That is, the range of integers from 0 to 10FFFF (hexadecimal).
• Not all code points are assigned to encoded characters.
Grapheme clusters
• Visually perceived text units made of combined Unicode code points
• Roughly what a user perceives as a character
Emoji zero width joiner sequences
Unicode code point / grapheme clusters
• Strings are easier to manipulate by Unicode code point
• Counting by grapheme clusters is slower
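The difference between the two ways of counting shows up with an emoji zero width joiner sequence. This sketch uses CRuby's `String#codepoints` and `String#grapheme_clusters` (available since Ruby 2.5), not mruby/c; the family emoji is picked here as an example, and the grapheme cluster result depends on the Unicode data of the Ruby version in use.

```ruby
# 👨 + ZWJ + 👩 + ZWJ + 👧, written as escapes
family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"

# "e" followed by a combining acute accent
accented = "e\u0301"

[family, accented].each do |s|
  puts "bytes=#{s.bytesize} " \
       "code points=#{s.codepoints.size} " \
       "grapheme clusters=#{s.grapheme_clusters.size}"
end
```

Counting by code point needs only the lead-byte rules of UTF-8, while grapheme clusters need Unicode property data, which is one reason code points are the simpler unit for a small runtime.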
Handling character count
• index, size, slice, insert
• Convert character count to byte count or byte count to character count
String#index before implementation
String#index
byte index:      0  1  2  3  4  5  6  7  8  9  10
byte:            F0 9F 92 8E E2 9D A4 F0 9F 8F AF
character index: 0           1        2
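The conversion between character index and byte offset shown in the table can be sketched as follows. This is an illustration in Ruby, not the mruby/c implementation; the three method names are hypothetical, and the input is assumed to be well-formed UTF-8.

```ruby
# True when +byte+ is a UTF-8 continuation byte (10xxxxxx).
def continuation_byte?(byte)
  (byte & 0b1100_0000) == 0b1000_0000
end

# Byte offset where character number +char_index+ starts, or nil if out of range.
def char_index_to_byte_offset(bytes, char_index)
  seen = -1
  bytes.each_with_index do |byte, offset|
    next if continuation_byte?(byte)
    seen += 1 # every non-continuation byte starts a new character
    return offset if seen == char_index
  end
  nil
end

# Index of the character that contains the byte at +byte_offset+, or nil if out of range.
def byte_offset_to_char_index(bytes, byte_offset)
  return nil if byte_offset < 0 || byte_offset >= bytes.size
  bytes[0..byte_offset].count { |b| !continuation_byte?(b) } - 1
end

bytes = "\u{1F48E}\u{2764}\u{1F3EF}".bytes # F0 9F 92 8E E2 9D A4 F0 9F 8F AF
puts (0..2).map { |i| char_index_to_byte_offset(bytes, i) }.inspect
puts byte_offset_to_char_index(bytes, 7)
```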
String#slice
Handling Unicode code point
• ord, chr, <<
String#ord
• Returns the integer ordinal of the first character of self
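For UTF-8, that ordinal is the Unicode code point of the first character, so an `ord` over raw bytes has to decode the first byte sequence. A minimal Ruby sketch of that decoding follows; `utf8_ord` is a hypothetical name, this is not the mruby/c code, and continuation bytes are not validated.

```ruby
# Decode the code point of the first UTF-8 character in +bytes+.
def utf8_ord(bytes)
  first = bytes[0]
  raise ArgumentError, "empty string" if first.nil?
  return first if first < 0x80 # 0xxxxxxx: ASCII, the byte is the code point

  if (first & 0xE0) == 0xC0    # 110xxxxx: 2-byte sequence
    len = 2
    cp = first & 0x1F
  elsif (first & 0xF0) == 0xE0 # 1110xxxx: 3-byte sequence
    len = 3
    cp = first & 0x0F
  elsif (first & 0xF8) == 0xF0 # 11110xxx: 4-byte sequence
    len = 4
    cp = first & 0x07
  else
    raise ArgumentError, "invalid lead byte"
  end

  # Each continuation byte contributes its low 6 bits.
  bytes[1, len - 1].each { |b| cp = (cp << 6) | (b & 0x3F) }
  cp
end

puts format("U+%04X", utf8_ord("\u{1F48E}".bytes))
```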
(digression) String#ord
• Useful snippet in .irbrc
Relationship between Unicode code point and UTF-8
• A Unicode code point is a value that uniquely identifies a Unicode character
• UTF-8 encodes Unicode code points
  • So do UTF-16 and UTF-32
• One-to-one correspondence between UTF-8 byte sequences and Unicode code points
  • Can convert from byte sequences to Unicode code points
  • Can also convert from Unicode code points to byte sequences
UTF-8 Bit Distribution
• https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf D92
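The bit distribution can be read as a recipe for turning a code point into bytes, which is what a method like `chr` needs. The following Ruby sketch follows that layout; `utf8_encode` is a hypothetical name, and this is not the mruby/c implementation.

```ruby
# Encode a Unicode code point as an array of UTF-8 bytes.
def utf8_encode(cp)
  if cp < 0 || cp > 0x10FFFF || (0xD800..0xDFFF).cover?(cp)
    raise RangeError, "not a Unicode scalar value: #{cp}"
  end

  if cp <= 0x7F      # 0xxxxxxx
    [cp]
  elsif cp <= 0x7FF  # 110yyyyy 10xxxxxx
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
  elsif cp <= 0xFFFF # 1110zzzz 10yyyyyy 10xxxxxx
    [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  else               # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  end
end

puts utf8_encode(0x1F48E).map { |b| format("%02X", b) }.join(" ")
```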
Implementation
ASCII only supported
The status quo
• 695 String tests passed
• 80% of String methods have been implemented to support UTF-8
Future issues
• Support chr, <<, each_char, tr, tr!
• Performance evaluation
• Enable UTF-8 in prk_firmware
• Raise errors for strings that are invalid as UTF-8
Happy Binary Hacking!