Slide 1

Slide 1 text

UTF-8 is coming to mruby/c 2023-05-11 ima1zumi

Slide 2

Slide 2 text

Introduction
• @ima1zumi (Mari Imaizumi)
• Character encoding lover
• IRB and Reline committer
• ESM, Inc.

Slide 3

Slide 3 text

• Paste last year's image

Slide 4

Slide 4 text

Day 2

Slide 5

Slide 5 text

Coffeehouse sponsor

Slide 6

Slide 6 text

Distribute Ruby Method Karuta

Slide 7

Slide 7 text

Question

Slide 8

Slide 8 text

Question Have you ever used UTF-8?

Slide 9

Slide 9 text

UTF-8 is the de facto standard
• > UTF-8 is used by 97.9% of all the websites whose character encoding they know.
  • https://w3techs.com/technologies/overview/character_encoding
• Many programming languages support UTF-8
  • Ruby, Python, Rust, C++[^1], PHP, Java[^2]
[^1]: char8_t: A type for UTF-8 characters and strings https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html
[^2]: JEP 400: UTF-8 by Default https://openjdk.org/jeps/400

Slide 10

Slide 10 text

What is UTF-8

Slide 11

Slide 11 text

What is UTF-8
• 8-bit UCS Transformation Format
• Variable length
  • 1 to 4 bytes per character
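To make the variable length concrete, here is a minimal sketch in CRuby (not mruby/c) that prints the byte length of one character from each width class. The sample characters are illustrative choices, not taken from the slides.

```ruby
# One sample character per UTF-8 width class, mapped to its expected width.
# The characters are illustrative assumptions, not from the talk.
SAMPLES = {
  "A"         => 1, # U+0041, ASCII
  "\u00E9"    => 2, # U+00E9, e with acute accent
  "\u3042"    => 3, # U+3042, hiragana "a"
  "\u{1F48E}" => 4, # U+1F48E, gem stone
}.freeze

SAMPLES.each_key do |char|
  # String#ord returns the code point, String#bytesize the number of bytes.
  puts format("U+%04X is %d byte(s) in UTF-8", char.ord, char.bytesize)
end
```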

Slide 12

Slide 12 text

UTF-8 implementation in mruby/c (in progress)
https://www.s-itoc.jp/support/technical-support/mrubyc/mrubyc-logo/

Slide 13

Slide 13 text

Before implementation

Slide 14

Slide 14 text

Before implementation

Slide 15

Slide 15 text

Before implementation

Slide 16

Slide 16 text

Before implementation

Slide 17

Slide 17 text

Before implementation

Slide 18

Slide 18 text

Before implementation

Slide 19

Slide 19 text

Before implementation

Slide 20

Slide 20 text

Before implementation

Slide 21

Slide 21 text

Before implementation

Slide 22

Slide 22 text

Before implementation

Slide 23

Slide 23 text

Before implementation

Slide 24

Slide 24 text

Before implementation

Slide 25

Slide 25 text

Before implementation

Slide 26

Slide 26 text

Before implementation

Slide 27

Slide 27 text

After implementation

Slide 28

Slide 28 text

How to implement in mruby/c https://github.com/mrubyc/mrubyc/pull/191

Slide 29

Slide 29 text

What is mruby/c
• mruby/c is another implementation of mruby
[^3]: https://github.com/mrubyc/mrubyc

Slide 30

Slide 30 text

What is mruby
• mruby is a lightweight implementation of the Ruby language
• Focus on compatibility with Ruby
• Memory size < 400KB

Slide 31

Slide 31 text

What is mruby/c
• Small memory consumption
  • < 40KB
• Concurrent
• Target
  • one-chip microprocessors
• Written in C or Ruby
[^3]: https://github.com/mrubyc/mrubyc

Slide 32

Slide 32 text

mruby/c !== mruby
Ref: Getting started making original IoT devices with mruby/c https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html

Slide 33

Slide 33 text

Use cases
• Microcontrollers
• IoT devices
• Reduction of defect rate in industrial sewing machines[^4]
• prk_firmware
[^4]: https://www.ruby.or.jp/ja/showcase/case80.html

Slide 34

Slide 34 text

Why I started
https://slide.rabbit-shocker.org/authors/hasumikin/RubyWorldConference2022/?page=58

Slide 35

Slide 35 text

Advantages of using UTF-8 in mruby/c
• prk_firmware allows UTF-8
• Network programming
• Shell

Slide 36

Slide 36 text

It would be nice if UTF-8 were implemented in mruby/c!

Slide 37

Slide 37 text

What should we implement?

Slide 38

Slide 38 text

An mruby/c string is just a sequence of bytes

Slide 39

Slide 39 text

What does String mean as a byte sequence?
• It doesn't have a character encoding.

Slide 40

Slide 40 text

What is a character encoding
• A label indicating which character code the byte sequence uses
• Does not affect the byte sequence

Slide 41

Slide 41 text

What is a character encoding
• A label indicating which character code the byte sequence uses
• Does not affect the byte sequence
11100011 10000001 10000010

Slide 42

Slide 42 text

What is a character encoding
• A label indicating which character code the byte sequence uses
• Does not affect the byte sequence
11100011 10000001 10000010 UTF-8
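The "label" idea can be tried in CRuby, which does attach an encoding to every string. The sketch below builds the three bytes shown on this slide as a binary string and then relabels a copy as UTF-8 with String#force_encoding. The bytes stay the same; only the interpretation changes.

```ruby
# The three bytes from the slide, packed into a binary (ASCII-8BIT) string.
raw = [0b11100011, 0b10000001, 0b10000010].pack("C*")

# force_encoding only swaps the label; it does not convert or touch the bytes.
labeled = raw.dup.force_encoding(Encoding::UTF_8)

puts raw.encoding               # ASCII-8BIT
puts labeled.encoding           # UTF-8
puts raw.bytes == labeled.bytes # true: identical byte sequence
puts raw.size                   # 3: counted as bytes
puts labeled.size               # 1: counted as one UTF-8 character (U+3042)
```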

Slide 43

Slide 43 text

If String is a byte sequence

Slide 44

Slide 44 text

Why "💎❤🏯".size is 11
• Because "💎❤🏯" is 11 bytes in UTF-8
• An mruby/c string has no character encoding, so size returns the byte count
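The 11 can be reproduced in CRuby by dropping the encoding label with String#b, which approximates a label-free byte string. This is a sketch of the effect, not of mruby/c internals. The string is written with escapes so the heart is exactly U+2764, without a variation selector.

```ruby
gems = "\u{1F48E}\u{2764}\u{1F3EF}" # gem stone, heavy black heart, Japanese castle

puts gems.size     # 3: CRuby counts characters because the string is labeled UTF-8
puts gems.bytesize # 11
puts gems.b.size   # 11: as a binary string, size falls back to the byte count

# The individual bytes: 4 + 3 + 4.
puts gems.bytes.map { |byte| format("%02X", byte) }.join(" ")
# F0 9F 92 8E E2 9D A4 F0 9F 8F AF
```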

Slide 45

Slide 45 text

How can I get `"💎❤🏯".size` to return 3?

Slide 46

Slide 46 text

How do we know "how many characters" from byte sequences?

Slide 47

Slide 47 text

In the case of UTF-8, you can tell by looking at the first n bits of a byte
• UTF-8 is variable length, from 1 to 4 bytes
• A byte either starts a 1-, 2-, 3-, or 4-byte sequence or is an ongoing (continuation) byte
  • 1 byte: 0xxxxxxx
  • 2 bytes: 110xxxxx
  • 3 bytes: 1110xxxx
  • 4 bytes: 11110xxx
  • ongoing: 10xxxxxx
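The bit patterns above are enough to count characters in a plain byte string. The following is a minimal Ruby sketch under that assumption. The method names `utf8_sequence_length` and `utf8_char_count` are hypothetical and are not the names used in the mruby/c pull request.

```ruby
# Length of the sequence that starts with +lead+, judged only from its high
# bits. Returns nil for an ongoing byte (10xxxxxx) or an invalid lead byte.
def utf8_sequence_length(lead)
  if    (lead & 0b1000_0000) == 0b0000_0000 then 1
  elsif (lead & 0b1110_0000) == 0b1100_0000 then 2
  elsif (lead & 0b1111_0000) == 0b1110_0000 then 3
  elsif (lead & 0b1111_1000) == 0b1111_0000 then 4
  end
end

# Counts characters by hopping from lead byte to lead byte.
def utf8_char_count(str)
  bytes = str.bytes
  count = 0
  pos = 0
  while pos < bytes.size
    length = utf8_sequence_length(bytes[pos])
    raise ArgumentError, "unexpected byte at offset #{pos}" if length.nil?
    pos += length
    count += 1
  end
  count
end

gems = "\u{1F48E}\u{2764}\u{1F3EF}".b # 11 bytes, no UTF-8 label
puts utf8_char_count(gems) # 3
```

The sketch trusts the lead byte and does not check that the following bytes really are ongoing bytes, so it is not a validator.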

Slide 48

Slide 48 text

Sample

Slide 49

Slide 49 text

Sample
• 1 byte: 0xxxxxxx
• 2 bytes: 110xxxxx
• 3 bytes: 1110xxxx
• 4 bytes: 11110xxx
• ongoing: 10xxxxxx

Slide 50

Slide 50 text

Sample
• 1 byte: 0xxxxxxx
• 2 bytes: 110xxxxx
• 3 bytes: 1110xxxx
• 4 bytes: 11110xxx
• ongoing: 10xxxxxx

Slide 51

Slide 51 text

Sample
• 1 byte: 0xxxxxxx
• 2 bytes: 110xxxxx
• 3 bytes: 1110xxxx
• 4 bytes: 11110xxx
• ongoing: 10xxxxxx

Slide 52

Slide 52 text

Sample
• 1 byte: 0xxxxxxx
• 2 bytes: 110xxxxx
• 3 bytes: 1110xxxx
• 4 bytes: 11110xxx
• ongoing: 10xxxxxx

Slide 53

Slide 53 text

Implementation

Slide 54

Slide 54 text

Written in Ruby

Slide 55

Slide 55 text

Written in C

Slide 56

Slide 56 text

size (length) is now working

Slide 57

Slide 57 text

Tests

Slide 58

Slide 58 text

Implement in the same way

Slide 59

Slide 59 text

Status of UTF-8 support in mruby/c

Slide 60

Slide 60 text

Implementation not required
• new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=>
• Byte sequence operations can handle UTF-8
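A quick way to see why these methods need no changes: run them on binary copies of UTF-8 strings in CRuby. Concatenation, comparison, and substring tests only compare bytes, and in well-formed UTF-8 a complete character's bytes can only match at a character boundary, because lead bytes and ongoing bytes use disjoint bit patterns. The sketch below uses String#b to mimic a label-free byte string.

```ruby
whole = "\u{1F48E}\u{2764}\u{1F3EF}".b # gem, heart, castle as raw bytes
head  = "\u{1F48E}".b
tail  = "\u{2764}\u{1F3EF}".b
heart = "\u{2764}".b

puts (head + tail) == whole   # true: + just appends bytes
puts whole.start_with?(head)  # true
puts whole.end_with?(tail)    # true
puts whole.include?(heart)    # true
puts (head <=> tail)          # 1: byte-wise comparison, 0xF0 > 0xE2
```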

Slide 61

Slide 61 text

Implementation required
• Implemented:
  • index, size(length), slice([]), slice!([]=), insert, inspect, ord
• To be implemented:
  • <<, tr, tr!, chr, each_char, split

Slide 62

Slide 62 text

Implementation required
• Handling character count (index, size, slice, insert)
• Handling Unicode code point (ord, chr, <<)
• Only ASCII was supported (inspect)
• Multipurpose (tr, tr!)
• Segmentation fault (split)

Slide 63

Slide 63 text

What does counting Unicode characters mean?
• Unicode code points
• Grapheme clusters

Slide 64

Slide 64 text

Unicode code point
• Any value in the Unicode codespace
• That is, the range of integers from 0 to 10FFFF (hexadecimal).
• Not all code points are assigned to encoded characters.

Slide 65

Slide 65 text

Grapheme clusters
• Visually perceived text units made of one or more combined Unicode code points
• Roughly what a user perceives as a character

Slide 66

Slide 66 text

Emoji zero width joiner sequences

Slide 67

Slide 67 text

Unicode code point / grapheme clusters
• Unicode code points make string manipulation easier
• grapheme_clusters is slower
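The difference between the two units can be observed in CRuby with String#codepoints and String#grapheme_clusters. The two sample strings are illustrative assumptions: a letter with a combining accent and an emoji zero width joiner sequence.

```ruby
accented = "e\u0301"                                # "e" + combining acute accent
family   = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}" # man, ZWJ, woman, ZWJ, girl

puts accented.codepoints.size        # 2
puts accented.grapheme_clusters.size # 1
puts family.codepoints.size          # 5
puts family.grapheme_clusters.size   # 1

# String#size in CRuby counts code points, not grapheme clusters.
puts family.size # 5
```

Counting code points only needs the lead-byte rule, while grapheme clusters need the Unicode segmentation rules and character property data, which is one reason they cost more.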

Slide 68

Slide 68 text

Unicode code point / grapheme clusters

Slide 69

Slide 69 text

Handling character count
• index, size, slice, insert
• Convert character count to byte count or byte count to character count
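As a rough illustration of the character-to-byte direction, which methods such as slice and insert need, the sketch below walks the bytes and returns the offset at which the n-th character starts. `char_to_byte_offset` is a hypothetical helper name, and the input is assumed to be well-formed UTF-8.

```ruby
# Byte offset at which character number +char_index+ starts, or nil when the
# string has fewer characters. An index equal to the character count maps to
# the end of the string, which is what an append-style insert needs.
def char_to_byte_offset(str, char_index)
  offset = 0
  seen = 0
  str.each_byte do |byte|
    unless (byte & 0b1100_0000) == 0b1000_0000 # not an ongoing byte: a character starts here
      return offset if seen == char_index
      seen += 1
    end
    offset += 1
  end
  seen == char_index ? offset : nil
end

gems = "\u{1F48E}\u{2764}\u{1F3EF}".b
puts char_to_byte_offset(gems, 0) # 0
puts char_to_byte_offset(gems, 1) # 4
puts char_to_byte_offset(gems, 2) # 7
puts char_to_byte_offset(gems, 3) # 11
```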

Slide 70

Slide 70 text

String#index before implementation

Slide 71

Slide 71 text

String#index
byte index       0  1  2  3  4  5  6  7  8  9  10
byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
character index

Slide 72

Slide 72 text

String#index
byte index       0  1  2  3  4  5  6  7  8  9  10
byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
character index  0           1        2

Slide 73

Slide 73 text

String#index
byte index       0  1  2  3  4  5  6  7  8  9  10
byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
character index  0           1        2
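The mapping from byte index to character index can be sketched in Ruby as follows: search at the byte level, then count how many characters start before the match. `byte_to_char_index` and `utf8_index` are hypothetical names, and this is only one possible approach, not necessarily the one taken in mruby/c.

```ruby
# Number of characters that start before +byte_offset+, i.e. the character
# index of the character beginning at that byte.
def byte_to_char_index(str, byte_offset)
  str.byteslice(0, byte_offset).each_byte.count do |byte|
    (byte & 0b1100_0000) != 0b1000_0000 # every non-ongoing byte starts a character
  end
end

# Character-based index built on a byte-level search. String#b makes the
# search operate on bytes, as a label-free string would.
def utf8_index(haystack, needle)
  byte_offset = haystack.b.index(needle.b)
  byte_offset && byte_to_char_index(haystack.b, byte_offset)
end

gems = "\u{1F48E}\u{2764}\u{1F3EF}"
puts utf8_index(gems, "\u{2764}")  # 1 (byte offset 4)
puts utf8_index(gems, "\u{1F3EF}") # 2 (byte offset 7)
```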

Slide 74

Slide 74 text

String#slice

Slide 75

Slide 75 text

Handling Unicode code point

Slide 76

Slide 76 text

Handling Unicode code point
• ord, chr, <<

Slide 77

Slide 77 text

String#ord
• Returns the integer ordinal of the first character of self
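For a byte string, returning the ordinal means decoding the first UTF-8 sequence by hand. Here is a minimal Ruby sketch of that decoding. `utf8_ord` is a hypothetical name, and the sketch assumes well-formed input: it does not check for truncated or overlong sequences.

```ruby
# Decodes the first UTF-8 sequence of +str+ into its code point.
def utf8_ord(str)
  bytes = str.bytes
  raise ArgumentError, "empty string" if bytes.empty?

  lead = bytes[0]
  # Pick the sequence length and the payload bits carried by the lead byte.
  length, code =
    if    (lead & 0x80) == 0x00 then [1, lead]
    elsif (lead & 0xE0) == 0xC0 then [2, lead & 0x1F]
    elsif (lead & 0xF0) == 0xE0 then [3, lead & 0x0F]
    elsif (lead & 0xF8) == 0xF0 then [4, lead & 0x07]
    else raise ArgumentError, "invalid lead byte"
    end

  # Each following byte (10xxxxxx) contributes six more bits.
  bytes[1, length - 1].each { |byte| code = (code << 6) | (byte & 0x3F) }
  code
end

puts format("U+%04X", utf8_ord("\u{1F48E}\u{2764}".b)) # U+1F48E
```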

Slide 78

Slide 78 text

(digression) String#ord
• Useful snippet in .irbrc

Slide 79

Slide 79 text

Relationship between Unicode code points and UTF-8
• A Unicode code point is a value that uniquely defines a Unicode character
• UTF-8 encodes Unicode code points
  • So do UTF-16 and UTF-32
• One-to-one correspondence between UTF-8 byte sequences and Unicode code points
  • Can convert from byte sequences to Unicode code points
  • Can also convert from Unicode code points to byte sequences
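The code point to byte sequence direction, which chr and << need, can be sketched as below. `utf8_encode` is a hypothetical name, and the result is a binary string holding the UTF-8 bytes, mirroring a label-free string.

```ruby
# Encodes one Unicode code point as UTF-8 bytes.
def utf8_encode(code_point)
  if code_point.negative? || code_point > 0x10FFFF || (0xD800..0xDFFF).cover?(code_point)
    raise RangeError, "not encodable as UTF-8: #{code_point}"
  end

  bytes =
    if code_point < 0x80
      [code_point]
    elsif code_point < 0x800
      [0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)]
    elsif code_point < 0x10000
      [0xE0 | (code_point >> 12), 0x80 | ((code_point >> 6) & 0x3F), 0x80 | (code_point & 0x3F)]
    else
      [0xF0 | (code_point >> 18), 0x80 | ((code_point >> 12) & 0x3F),
       0x80 | ((code_point >> 6) & 0x3F), 0x80 | (code_point & 0x3F)]
    end
  bytes.pack("C*")
end

puts utf8_encode(0x1F48E).bytes.map { |byte| format("%02X", byte) }.join(" ")
# F0 9F 92 8E
```

The surrogate range D800 to DFFF is rejected because well-formed UTF-8 never encodes those code points.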

Slide 80

Slide 80 text

UTF-8 Bit Distribution
• https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf (D92)

Slide 81

Slide 81 text

Implementation

Slide 82

Slide 82 text

Only ASCII supported

Slide 83

Slide 83 text

The status quo
• 695 String tests passed
• 80% of String methods have been implemented to support UTF-8

Slide 84

Slide 84 text

Future issues
• Support chr, <<, each_char, tr, tr!
• Performance evaluation
• Enable UTF-8 in prk_firmware
• Error handling for strings that are invalid as UTF-8

Slide 85

Slide 85 text

Happy Binary Hacking!