UTF-8 is coming to mruby/c

ima1zumi
May 11, 2023

Transcript

  1. UTF-8 is coming to mruby/c
    2023-05-11 ima1zumi

  2. Introduction
    • @ima1zumi (Mari Imaizumi)


    • Character encoding lover


    • IRB and Reline committer


    • ESM, inc.

  3. • Paste last year's photo

  4. Day 2

  5. Coffeehouse sponsor

  6. Distribute Ruby Method Karuta

  7. Question

  8. Question
    Have you ever used UTF-8?

  9. UTF-8 is the de facto standard
    • > UTF-8 is used by 97.9% of all the websites whose character
    encoding they know.


    • https://w3techs.com/technologies/overview/character_encoding


    • Many programming languages support UTF-8


    • Ruby, Python, Rust, C++[^1], PHP, Java[^2]
    [^1]: char8_t: A type for UTF-8 characters and strings https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html


    [^2]: JEP 400: UTF-8 by Default https://openjdk.org/jeps/400

  10. What is UTF-8

  11. What is UTF-8
    • 8-bit UCS Transformation Format


    • Variable length


    • 1 to 4 bytes per code point
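
As a rough illustration of the variable length, the following CRuby sketch (plain Ruby, not mruby/c code) prints how many bytes a few sample characters occupy in UTF-8. The sample characters and the constant names are chosen here for illustration and do not come from the slides.

```ruby
# Sample characters, written as escapes so the file encoding does not matter:
# "a", U+00E9 (e with acute), U+3042 (hiragana a), U+1F48E (gem stone).
SAMPLES = ["a", "\u00E9", "\u3042", "\u{1F48E}"].freeze

# UTF-8 byte length of each sample character.
BYTE_LENGTHS = SAMPLES.map(&:bytesize)

SAMPLES.each do |ch|
  puts format("U+%04X takes %d byte(s)", ch.ord, ch.bytesize)
end
```

The four samples cover each possible length, from 1 byte for ASCII up to 4 bytes for the emoji.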

  12. UTF-8 implementation in mruby/c (in progress)
    https://www.s-itoc.jp/support/technical-support/mrubyc/mrubyc-logo/

  13. Before implementation

  14. Before implementation

  15. Before implementation

  16. Before implementation

  17. Before implementation

  18. Before implementation

  19. Before implementation

  20. Before implementation

  21. Before implementation

  22. Before implementation

  23. Before implementation

  24. Before implementation

  25. Before implementation

  26. Before implementation

  27. After implementation

  28. How to implement in mruby/c
    https://github.com/mrubyc/mrubyc/pull/191

  29. What is mruby/c
    • mruby/c is another implementation of mruby
    [^3]: https://github.com/mrubyc/mrubyc

  30. What is mruby
    • mruby is a lightweight implementation of the Ruby language


    • Focus on compatibility with Ruby


    • Memory size < 400KB

  31. What is mruby/c
    • Small memory consumption


    • < 40KB


    • Concurrent


    • Target


    • one-chip microprocessors


    • Written in C or Ruby
    [^3]: https://github.com/mrubyc/mrubyc

  32. mruby/c !== mruby
    Ref: mruby/cͰ࢝ΊΔΦϦδφϧIoTσόΠε࡞Γ


    https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html

  33. Use case
    • Microcontrollers


    • IoT devices


    • Reduction of defect rate in industrial sewing machines[^4]


    • prk_firmware
    [^4]: https://www.ruby.or.jp/ja/showcase/case80.html

  34. Why I started
    https://slide.rabbit-shocker.org/authors/hasumikin/RubyWorldConference2022/?page=58

  35. Advantages of using UTF-8 in mruby/c
    • prk_firmware allows UTF-8


    • Network programming


    • Shell

  36. It would be nice if UTF-8 were implemented in mruby/c!

  37. What should we implement?

  38. mruby/c string is just a sequence of bytes

  39. What does it mean for a String to be a byte sequence?
    • It doesn't have a character encoding.

  40. What is a character encoding
    • A label that indicates which character code the byte sequence is encoded in


    • Does not affect the byte sequence

  41. What is a character encoding
    • A label that indicates which character code the byte sequence is encoded in


    • Does not affect the byte sequence
    11100011 10000001 10000010

  42. What is a character encoding
    • A label that indicates which character code the byte sequence is encoded in


    • Does not affect the byte sequence
    11100011 10000001 10000010
    UTF-8
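
The point that the label does not change the bytes can be tried in CRuby, where `String#force_encoding` swaps only the label. This is a sketch in plain Ruby (an mruby/c string carries no such label), and the constant names are made up for this example.

```ruby
# The three bytes shown on the slide.
BYTES = [0b11100011, 0b10000001, 0b10000010].freeze

# Array#pack("C*") builds a binary (ASCII-8BIT) string: bytes without character semantics.
RAW = BYTES.pack("C*")

# Attach the UTF-8 label to a copy. Only the label changes, not the bytes.
LABELED = RAW.dup.force_encoding(Encoding::UTF_8)

puts RAW.size                    # counted as bytes
puts LABELED.size                # counted as characters
puts LABELED.bytes == RAW.bytes  # the byte sequence itself is untouched
puts LABELED == "\u3042"         # read as UTF-8, the bytes are U+3042 (hiragana "a")
```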

  43. If String is a byte sequence

  44. Why "💎❤🏯".size is 11
    • Because "💎❤🏯" is 11 bytes in UTF-8


    • An mruby/c string has no character encoding, so size returns the number of bytes
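
For comparison, CRuby keeps both views of the same string. In this sketch `String#b` (a binary copy) stands in for the byte-only behaviour described above; that is an analogy chosen for illustration, not a statement about how mruby/c is implemented.

```ruby
GEMS = "\u{1F48E}\u2764\u{1F3EF}"  # the three emoji from the slide: gem, heart, castle

puts GEMS.bytesize  # 11: 4 + 3 + 4 bytes
puts GEMS.b.size    # 11: a binary copy counts bytes, like a string with no encoding
puts GEMS.size      # 3: with the UTF-8 label, CRuby counts characters
```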

  45. How can I get `"💎❤🏯".size` to return 3?

  46. How do we know "how many characters" from byte sequences?

  47. In the case of UTF-8, you can tell by looking at the first n bits of a byte
    • UTF-8 is variable length from 1 to 4 bytes


    • A byte is either the first byte of a 1-, 2-, 3-, or 4-byte sequence, or an ongoing (continuation) byte


    • 1 byte: 0xxxxxxx


    • 2 bytes: 110xxxxx


    • 3 bytes: 1110xxxx


    • 4 bytes: 11110xxx


    • ongoing: 10xxxxxx
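
The bit patterns above translate fairly directly into code. The following is a minimal sketch in plain Ruby, not the code from the mruby/c pull request; the method names `utf8_sequence_length` and `utf8_char_count` are hypothetical. It assumes the input is well-formed UTF-8 and does not check that the expected ongoing bytes actually follow each first byte.

```ruby
# Length in bytes of the UTF-8 sequence that starts with +byte+, judged from
# its leading bits. Returns nil for an ongoing byte (10xxxxxx) or any other
# byte that cannot start a sequence.
def utf8_sequence_length(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then 1  # 0xxxxxxx
  elsif (byte & 0b1110_0000) == 0b1100_0000 then 2  # 110xxxxx
  elsif (byte & 0b1111_0000) == 0b1110_0000 then 3  # 1110xxxx
  elsif (byte & 0b1111_1000) == 0b1111_0000 then 4  # 11110xxx
  end
end

# Count characters by jumping from one first byte to the next.
def utf8_char_count(bytes)
  count = 0
  offset = 0
  while offset < bytes.size
    length = utf8_sequence_length(bytes[offset])
    if length.nil?
      raise ArgumentError, format("byte 0x%02X at offset %d cannot start a character", bytes[offset], offset)
    end
    offset += length
    count += 1
  end
  count
end

puts utf8_char_count("\u{1F48E}\u2764\u{1F3EF}".bytes)  # 3
```

Walking by sequence length touches only one byte per character. Counting every byte that is not of the form 10xxxxxx would give the same result for well-formed input.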

  48. Sample

  49. Sample
    • 1 byte: 0xxxxxxx


    • 2 bytes: 110xxxxx


    • 3 bytes: 1110xxxx


    • 4 bytes: 11110xxx


    • ongoing: 10xxxxxx

  50. Sample
    • 1 byte: 0xxxxxxx


    • 2 bytes: 110xxxxx


    • 3 bytes: 1110xxxx


    • 4 bytes: 11110xxx


    • ongoing: 10xxxxxx

  51. Sample
    • 1 byte: 0xxxxxxx


    • 2 bytes: 110xxxxx


    • 3 bytes: 1110xxxx


    • 4 bytes: 11110xxx


    • ongoing: 10xxxxxx

  52. Sample
    • 1 byte: 0xxxxxxx


    • 2 bytes: 110xxxxx


    • 3 bytes: 1110xxxx


    • 4 bytes: 11110xxx


    • ongoing: 10xxxxxx

  53. Implementation

  54. Written in Ruby

  55. Written in C

  56. size (length) is now working

  57. Tests

  58. Implement in the same way

  59. Status of UTF-8 support in mruby/c

  60. Implementation not required
    • new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?,
    getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?,
    end_with?, include?, bytes, ==, <=>


    • Byte sequence operations can handle UTF-8
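
One way to see why these methods need no changes: well-formed UTF-8 is self-synchronizing, so comparing, concatenating, or searching whole strings byte by byte gives the same answer as doing it character by character. The sketch below checks this in CRuby for a few of the listed methods, using `String#b` as a stand-in for byte-only strings (an assumption made for illustration). The constant names are hypothetical.

```ruby
HAYSTACK = "\u{1F48E}\u2764\u{1F3EF}"  # gem, heart, castle
NEEDLE   = "\u2764"                     # heart
GEM      = "\u{1F48E}"
CASTLE   = "\u{1F3EF}"

# Each value is [result on byte-only copies, result on UTF-8 strings].
BYTEWISE_VS_UTF8 = {
  include:    [HAYSTACK.b.include?(NEEDLE.b),    HAYSTACK.include?(NEEDLE)],
  start_with: [HAYSTACK.b.start_with?(GEM.b),    HAYSTACK.start_with?(GEM)],
  end_with:   [HAYSTACK.b.end_with?(CASTLE.b),   HAYSTACK.end_with?(CASTLE)],
  equal:      [HAYSTACK.b == HAYSTACK.dup.b,     HAYSTACK == HAYSTACK.dup],
  plus:       [(HAYSTACK.b + NEEDLE.b).bytes,    (HAYSTACK + NEEDLE).bytes]
}.freeze

BYTEWISE_VS_UTF8.each do |name, (bytewise, utf8)|
  puts "#{name}: #{bytewise == utf8 ? 'same result' : 'different result'}"
end
```

This holds when the arguments are themselves whole UTF-8 strings. Methods that take or return a position or a count are the ones that need UTF-8 awareness.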

  61. Implementation required
    • Implemented:


    • index, size(length), slice([]), slice!([]=), insert, inspect, ord


    • To be implemented:


    • <<, tr, tr!, chr, each_char, split

  62. Implementation required
    • Handling character count (index, size, slice, insert)


    • Handling Unicode code point (ord, chr, <<)


    • Only ASCII was supported (inspect)


    • Multipurpose (tr, tr!)


    • Segmentation fault (split)

  63. What does counting Unicode characters mean?
    • Unicode code point


    • Grapheme clusters

  64. Unicode code point
    • Any value in the Unicode codespace


    • That is, the range of integers from 0 to 10FFFF (hexadecimal).


    • Not all code points are assigned to encoded characters.

  65. Grapheme clusters
    • Visually perceived text units of combined Unicode code points


    • Almost user-perceived character

  66. Emoji zero width joiner sequences

  67. Unicode code point / grapheme clusters
    • Strings are easier to manipulate by Unicode code point


    • Splitting into grapheme clusters (grapheme_clusters) is slower
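
The difference between the two units can be seen in CRuby, which offers both `String#codepoints` and `String#grapheme_clusters`. The sample strings below are chosen for this sketch and are not taken from the slides.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one accented letter.
ACCENTED = "e\u0301"

# An emoji zero width joiner sequence: man, ZWJ (U+200D), woman, ZWJ, girl.
FAMILY = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"

puts ACCENTED.bytesize                # 3
puts ACCENTED.codepoints.size         # 2
puts ACCENTED.grapheme_clusters.size  # 1

puts FAMILY.bytesize                  # 18
puts FAMILY.codepoints.size           # 5
# The grapheme cluster count of a ZWJ sequence depends on the Unicode data
# shipped with the Ruby in use, so no expected value is given here.
puts FAMILY.grapheme_clusters.size
```

Counting code points needs only the UTF-8 bit patterns, while finding grapheme cluster boundaries needs Unicode property data, which is part of why it costs more.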

  68. Unicode code point / grapheme clusters

  69. Handling character count
    • index, size, slice, insert


    • Convert character count to byte count or byte count to character count
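
The conversion in both directions can be sketched as follows. This is plain Ruby written for this transcript, not the implementation in mruby/c; all method names are hypothetical, and the input is assumed to be well-formed UTF-8.

```ruby
# True for an ongoing (continuation) byte, 10xxxxxx.
def continuation_byte?(byte)
  (byte & 0b1100_0000) == 0b1000_0000
end

# Byte offset at which character number +char_index+ starts, or nil if the
# string has fewer characters.
def char_index_to_byte_offset(bytes, char_index)
  seen = 0
  bytes.each_with_index do |byte, offset|
    next if continuation_byte?(byte)
    return offset if seen == char_index
    seen += 1
  end
  nil
end

# Number of characters that start before +byte_offset+, which is the
# character index of the character starting at that offset.
def byte_offset_to_char_index(bytes, byte_offset)
  bytes.first(byte_offset).count { |byte| !continuation_byte?(byte) }
end

# index built from a byte-wise search plus one conversion.
# String#b is used here to get a byte-wise search in CRuby.
def utf8_index(haystack, needle)
  byte_offset = haystack.b.index(needle.b)
  byte_offset && byte_offset_to_char_index(haystack.bytes, byte_offset)
end

gems = "\u{1F48E}\u2764\u{1F3EF}"  # gem, heart, castle
puts char_index_to_byte_offset(gems.bytes, 2)  # 7
puts byte_offset_to_char_index(gems.bytes, 7)  # 2
puts utf8_index(gems, "\u{1F3EF}")             # 2
```

Methods that take a character position (slice, insert) would convert it to a byte offset first, and methods that return one (index) would convert the byte offset back.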

  70. String#index before implementation

  71. String#index
    byte index       0  1  2  3  4  5  6  7  8  9  10
    byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
    character index

  72. String#index
    byte index       0  1  2  3  4  5  6  7  8  9  10
    byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
    character index  0           1        2

  73. String#index
    byte index       0  1  2  3  4  5  6  7  8  9  10
    byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
    character index  0           1        2

  74. String#slice

  75. Handling Unicode code point

  76. Handling Unicode code point
    • ord, chr, <<

  77. String#ord
    • Returns the integer ordinal of the first character of self

  78. (digression) String#ord
    • Useful snippet in .irbrc

  79. Relationship between Unicode code point and UTF-8
    • A Unicode code point is a value that uniquely defines a Unicode character


    • UTF-8 encodes Unicode code points


    • So do UTF-16 and UTF-32


    • One-to-one correspondence between UTF-8 byte sequences and Unicode code points


    • Can convert from byte sequences to Unicode code points


    • Can also convert from Unicode code points to byte sequences
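
The byte-to-code-point direction, which is what `String#ord` needs, can be sketched like this. It is an illustration in plain Ruby, not the mruby/c code; the method name `utf8_first_codepoint` is hypothetical, and the sketch assumes well-formed input (it does not reject overlong forms or check the ongoing bytes).

```ruby
# Decode the first character of a UTF-8 byte array into its code point by
# stripping the marker bits and joining the remaining payload bits.
def utf8_first_codepoint(bytes)
  b0 = bytes[0]
  if (b0 & 0x80) == 0x00      # 0xxxxxxx: 7 payload bits
    b0
  elsif (b0 & 0xE0) == 0xC0   # 110xxxxx 10xxxxxx: 5 + 6 bits
    ((b0 & 0x1F) << 6) | (bytes[1] & 0x3F)
  elsif (b0 & 0xF0) == 0xE0   # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 bits
    ((b0 & 0x0F) << 12) | ((bytes[1] & 0x3F) << 6) | (bytes[2] & 0x3F)
  elsif (b0 & 0xF8) == 0xF0   # 11110xxx and three ongoing bytes: 3 + 6 + 6 + 6 bits
    ((b0 & 0x07) << 18) | ((bytes[1] & 0x3F) << 12) |
      ((bytes[2] & 0x3F) << 6) | (bytes[3] & 0x3F)
  else
    raise ArgumentError, format("byte 0x%02X cannot start a character", b0)
  end
end

puts format("U+%04X", utf8_first_codepoint([0xE3, 0x81, 0x82]))        # U+3042
puts format("U+%04X", utf8_first_codepoint([0xF0, 0x9F, 0x92, 0x8E]))  # U+1F48E
```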

  80. UTF-8 Bit Distribution
    • https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf D92
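
Going the other way, from a code point to bytes, follows the same bit layout and is roughly what `chr` and `<<` with an integer would need. The following is a sketch under that reading, written in plain Ruby; `utf8_encode` is a hypothetical name, not a function from mruby/c.

```ruby
# Encode one Unicode code point as an array of UTF-8 bytes.
def utf8_encode(code_point)
  if code_point.negative? || code_point > 0x10FFFF
    raise RangeError, "#{code_point} is outside the Unicode codespace"
  end
  if (0xD800..0xDFFF).cover?(code_point)
    raise RangeError, "surrogate code points cannot be encoded in UTF-8"
  end

  if code_point <= 0x7F        # 0xxxxxxx
    [code_point]
  elsif code_point <= 0x7FF    # 110xxxxx 10xxxxxx
    [0xC0 | (code_point >> 6),
     0x80 | (code_point & 0x3F)]
  elsif code_point <= 0xFFFF   # 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (code_point >> 12),
     0x80 | ((code_point >> 6) & 0x3F),
     0x80 | (code_point & 0x3F)]
  else                         # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (code_point >> 18),
     0x80 | ((code_point >> 12) & 0x3F),
     0x80 | ((code_point >> 6) & 0x3F),
     0x80 | (code_point & 0x3F)]
  end
end

puts utf8_encode(0x1F48E).map { |byte| format("%02X", byte) }.join(" ")  # F0 9F 92 8E
```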

  81. Implementation

  82. Only ASCII supported

  83. The status quo
    • 695 String tests passed


    • 80% of String methods have been implemented to support UTF-8

  84. Future issues
    • Support chr, <<, each_char, tr, tr!


    • Performance evaluation


    • Enable UTF-8 in prk_firmware


    • Error handling for strings that are not valid UTF-8

  85. Happy Binary Hacking!
