Dive into Encoding

ima1zumi
September 11, 2021

Transcript

  1. Dive into Encoding

  2. ima1zumi
    ESM, inc.
    Ruby on Rails engineer
    irb, reline, rurema [1] contributor
    Learning about character codes out of curiosity
    [1] Japanese Ruby Reference Manual: https://docs.ruby-lang.org/ja/latest/doc/index.html

  3. Reason for this talk
    A bug fix in reline [1]
    Bug
    Fix※
    ※ Some terminals still do not display it correctly,
    😢 because ZWJ (U+200D) is not always supported.
    [1] https://github.com/ruby/reline/pull/217

  4. Reason for this talk
    Ruby CSI??
    Character Encoding…???

  5. Reason for this talk
    I thought it would be good to make a self-made character encoding and add it to Ruby.
    Let's do it

  6. p 'いろは'.encode(Encoding::IROHA)
    # => "\x80\x81\x82"

  7. What I want to say
    Create character encoding
    How to add a charset to CRuby

  8. Agenda
    About character encoding
    Create self-made encoding IROHA
    Ruby character encoding
    How to add character encoding to CRuby

  9. Not covered
    Non-CRuby implementations

  10. Request
    If you find any mistakes, inaccuracies, etc., please contact #rubykaigiB or @ima1zumi!

  11. Defining Character Encoding

  12. Meaning of "Character Encoding"
    Coded Character Set
    Character Encoding Scheme

  13. Coded Character Set (CCS)
    Character Set: A set of characters collected without duplication
    Coded Character Set: A character set in which each character is assigned a number
    The number is called a code point, code position, etc.
    (e.g.) Unicode, ASCII, JIS X 0213
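
As a small illustration of a coded character set, stock Ruby can show the Unicode code point assigned to a character. This is only a sketch: the helper name `code_point_label` is made up here and does not appear in the talk.

```ruby
# Hypothetical helper: formats a character's Unicode code point as "U+XXXX".
def code_point_label(char)
  format('U+%04X', char.ord)
end

hiragana_a = 'あ'
p hiragana_a.ord                   # the code point as an Integer (0x3042)
p code_point_label(hiragana_a)     # the same number in the usual U+ notation
p 0x3042.chr(Encoding::UTF_8)      # from the number back to the character
p 'いろは'.codepoints.map { |cp| format('U+%04X', cp) }
```

The number itself says nothing about bytes yet; that is the job of the character encoding scheme.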

  14. Character Encoding Scheme (CES)
    A method of transforming each character into a byte sequence
    (e.g.) UTF-8, UTF-16, EUC-JP
    'あ'.encode(Encoding::UTF_8).bytes.map { _1.to_s(16) }
    # => ["e3", "81", "82"]
    'あ'.encode(Encoding::UTF_16BE).bytes.map { _1.to_s(16) }
    # => ["30", "42"]
    UTF_16BE stands for UTF-16 (Big Endian)
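
To make the point concrete, the sketch below prints the bytes of the same character under a few encoding schemes. The helper `hex_bytes` is a hypothetical name introduced for this example; it only wraps `String#encode` and `String#bytes`.

```ruby
# Hypothetical helper: hex bytes of a string after transcoding it.
def hex_bytes(string, encoding)
  string.encode(encoding).bytes.map { |byte| format('%02x', byte) }
end

char = 'あ' # U+3042 in the Unicode coded character set
[Encoding::UTF_8, Encoding::UTF_16BE, Encoding::UTF_16LE, Encoding::EUC_JP].each do |encoding|
  puts "#{encoding.name}: #{hex_bytes(char, encoding).join(' ')}"
end
```

UTF-8, UTF-16BE and UTF-16LE all serialize the same code point differently, while EUC-JP serializes the code assigned by a different coded character set.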

  15. What is Character Encoding? [1]
    Coded Character Set: A collection of abstract characters with code numbers assigned to them
    Character Encoding Scheme: What byte sequence to use to represent abstract characters
    [1] RFC 2130 https://www.rfc-editor.org/rfc/rfc2130.txt

  16. What do we need to create a character encoding?
    Name
    Characters
    Byte sequence

  17. Self-made encoding: IROHA
    Name: IROHA
    Characters: ASCII + Iroha uta
    Byte sequence: 1 byte

  18. Iroha uta
    Like "The quick brown fox jumps over the lazy dog."

  19. ASCII Table
    7-bit encoding

  20. IROHA Table
    1-byte encoding (like ISO/IEC 8859)

  21. Ruby character encoding

  22. Ruby M17N
    M17N: Multilingualization
    Since Ruby 1.9
    Code Set Independent (CSI)

  23. Code Set Independent (CSI)
    Treats all encodings equally
    (e.g.) Ruby, Solaris
    Advantages:
    Independent of any specific character set
    Can handle multiple character sets in a single application
    Less overhead (no conversion to an internal code is required)
    Disadvantages: Probably difficult to implement※
    ※ref: https://jp.quora.com/Ruby-deha-naze-UCS-seiki-ka-wo-saiyou-shi-tei-nai-node-shou-ka

  24. Each instance of String has Encoding information
    String#encoding
    str1.encoding
    # => #<Encoding:...>

    str2.encoding
    # => #<Encoding:...>
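
A minimal sketch of what "each String instance carries its own encoding" looks like in stock Ruby. The helper `encoding_summary` is a hypothetical name for this example, and the sample text is an arbitrary choice.

```ruby
# Hypothetical helper: what Ruby knows about one String instance.
def encoding_summary(string)
  { encoding: string.encoding, chars: string.length, bytes: string.bytesize }
end

utf8_text  = 'いろは'
utf16_text = utf8_text.encode(Encoding::UTF_16BE) # bytes are rewritten
binary     = utf8_text.b                          # copy of the same bytes, tagged as binary

[utf8_text, utf16_text, binary].each { |string| p encoding_summary(string) }

# force_encoding changes only the tag, never the bytes.
retagged = binary.dup.force_encoding(Encoding::UTF_8)
p retagged == utf8_text
```

All three strings live side by side in one program without being normalized to a single internal code, which is the CSI idea described above.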

  25. Character encoding can be converted for each String instance
    str1 = 'abc'
    p str1.encoding
    # => #<Encoding:UTF-8>

    str2 = str1.encode(Encoding::US_ASCII)
    p str2.encoding
    # => #<Encoding:US-ASCII>
    If you want finer control over the conversion, use Encoding::Converter.
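
For reference, a hedged sketch of `Encoding::Converter` used to convert input that arrives in pieces. The method name `convert_in_chunks` and the choice of EUC-JP as the destination are assumptions made for this example.

```ruby
# Hypothetical helper: converts an array of string chunks with one
# stateful Encoding::Converter and returns the concatenated result.
def convert_in_chunks(chunks, from:, to:)
  converter = Encoding::Converter.new(from, to)
  pieces = chunks.map { |chunk| converter.convert(chunk) } # treated as partial input
  pieces << converter.finish                               # flush any pending state
  pieces.join
end

euc_jp = convert_in_chunks(%w[い ろ は], from: 'UTF-8', to: 'EUC-JP')
p euc_jp.encoding
p euc_jp.bytes.map { |byte| format('%02x', byte) }
```

Unlike `String#encode`, the converter object keeps its state between calls, which is useful when reading from a stream.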

  26. Universal Coded Set (UCS)
    Has only one internal code; converts between the external code and UCS at input/output
    Used by many programming languages: C#, Java, JavaScript, Perl, Python, etc.
    Advantages: Implementation and string handling can be unified.
    Disadvantages:
    Conversion may occur at input/output.
    Some information is sometimes lost in conversion.

  27. Ruby’s implementation of CSI
    How is it implemented?
    To find out:
    1. Make a self-made character encoding
    2. Make it work in Ruby
    Change the CRuby code
    Build

  28. Define character encoding in Ruby

  29. What you need to define a character encoding in Ruby
    Character code conversion table
    Converts a character in one encoding to the same character in another encoding
    Constant of the Encoding class
    (e.g.) Self-made encoding name: IROHA -> Encoding::IROHA

  30. Files to add/modify in CRuby
    Conversion table
    enc/trans/iroha-tbl.rb -> Conversion table
    enc/trans/single_byte.trans -> use iroha-tbl.rb
    Encoding class constant definition
    enc/ascii.c
    ref: https://github.com/ima1zumi/ruby/pull/2

  31. diff (1/3)
    diff --git a/enc/ascii.c b/enc/ascii.c
    index a2fef2f879..0d248bd129 100644
    --- a/enc/ascii.c
    +++ b/enc/ascii.c
    @@ -74,6 +74,7 @@ ENC_REPLICATE("CP852", "IBM852")
     ENC_REPLICATE("IBM855", "ASCII-8BIT")
     ENC_REPLICATE("CP855", "IBM855")
     ENC_REPLICATE("IBM857", "ASCII-8BIT")
    +ENC_REPLICATE("IROHA", "ASCII-8BIT")
     ENC_ALIAS("CP857", "IBM857")
     ENC_REPLICATE("IBM860", "ASCII-8BIT")
     ENC_ALIAS("CP860", "IBM860")

  32. diff (2/3)
    diff --git a/enc/trans/iroha-tbl.rb b/enc/trans/iroha-tbl.rb
    new file mode 100644
    index 0000000000..1d170e221e
    --- /dev/null
    +++ b/enc/trans/iroha-tbl.rb
    @@ -0,0 +1,49 @@
    +IROHA_TO_UCS_TBL = [
    +  ["80", 0x3044], # い
    +  ["81", 0x308d], # ろ
    +  ["82", 0x306f], # は
    +  ["83", 0x306b], # に
    +  ["84", 0x307b], # ほ
    # (snip)
    +  ["AD", 0x305b], # せ
    +  ["AE", 0x3059], # す
    +]

  33. diff (3/3)
    diff --git a/enc/trans/single_byte.trans b/enc/trans/single_byte.trans
    index 0d5407b918..57eb87a9c9 100644
    --- a/enc/trans/single_byte.trans
    +++ b/enc/trans/single_byte.trans
    @@ -64,6 +64,7 @@
     transcode_tblgen_singlebyte "IBM865"
     transcode_tblgen_singlebyte "IBM866"
     transcode_tblgen_singlebyte "IBM869"
    +transcode_tblgen_singlebyte "IROHA"
     transcode_tblgen_singlebyte "MACCROATIAN"
     transcode_tblgen_singlebyte "MACCYRILLIC"
     transcode_tblgen_singlebyte "MACGREEK"

  34. Conversion table: iroha-tbl.rb
    IROHA_TO_UCS_TBL = [
      ["80", 0x3044], ["81", 0x308d], ["82", 0x306f], ["83", 0x306b],
      ["84", 0x307b], ["85", 0x3078], ["86", 0x3068], ["87", 0x3061],
      ["88", 0x308a], ["89", 0x306c], ["8A", 0x308b], ["8B", 0x3092],
      ["8C", 0x308f], ["8D", 0x304b], ["8E", 0x3088], ["8F", 0x305f],
      ["90", 0x308c], ["91", 0x305d], ["92", 0x3064], ["93", 0x306d],
      ["94", 0x306a], ["95", 0x3089], ["96", 0x3080], ["97", 0x3046],
      ["98", 0x3090], ["99", 0x306e], ["9A", 0x304a], ["9B", 0x304f],
      ["9C", 0x3084], ["9D", 0x307e], ["9E", 0x3051], ["9F", 0x3075],
      ["A0", 0x3053], ["A1", 0x3048], ["A2", 0x3066], ["A3", 0x3042],
      ["A4", 0x3055], ["A5", 0x304d], ["A6", 0x3086], ["A7", 0x3081],
      ["A8", 0x307f], ["A9", 0x3057], ["AA", 0x3048], ["AB", 0x3072],
      ["AC", 0x3082], ["AD", 0x305b], ["AE", 0x3059]
    ]
    You don't have to write the ASCII byte sequence. [1]
    [1] https://github.com/ruby/ruby/blob/d92f09a5eea009fa28cd046e9d0eb698e3d94c5c/tool/transcode-tblgen.rb#L882-L883
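
To show what such a table expresses, here is a pure-Ruby sketch that encodes and decodes with the first five rows only. It is not how CRuby's transcoder works internally; the names `IROHA_SAMPLE_TBL`, `iroha_encode` and `iroha_decode` are hypothetical, and ASCII bytes are assumed to pass through unchanged.

```ruby
# First five rows of the table above: [byte as hex string, Unicode code point].
IROHA_SAMPLE_TBL = [
  ["80", 0x3044], ["81", 0x308d], ["82", 0x306f], ["83", 0x306b],
  ["84", 0x307b]
].freeze

BYTE_TO_CODE_POINT = IROHA_SAMPLE_TBL.to_h { |hex, cp| [hex.to_i(16), cp] }.freeze
CODE_POINT_TO_BYTE = BYTE_TO_CODE_POINT.invert.freeze

# UTF-8 string -> binary string of IROHA bytes (ASCII passes through).
def iroha_encode(utf8_string)
  bytes = utf8_string.each_codepoint.map do |cp|
    next cp if cp < 0x80
    CODE_POINT_TO_BYTE.fetch(cp) { raise ArgumentError, format('U+%04X is not in the table', cp) }
  end
  bytes.pack('C*')
end

# Binary string of IROHA bytes -> UTF-8 string.
def iroha_decode(binary_string)
  code_points = binary_string.each_byte.map do |byte|
    byte < 0x80 ? byte : BYTE_TO_CODE_POINT.fetch(byte)
  end
  code_points.pack('U*')
end

p iroha_encode('いろは')
p iroha_decode("\x80\x81\x82".b)
```

A real transcoder also needs to report undefined conversions in a structured way, which is what the generated C tables and `Encoding::UndefinedConversionError` provide.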

  35. Conversion table

  36. enc/trans/single_byte.trans
    single_byte.trans
    Generates a character encoding conversion table as C code with ERB
    Generates enc/trans/single_byte.c
    It is included when CRuby is built.

  37. enc/trans/single_byte.c
    see: https://github.com/ima1zumi/encoding_iroha/blob/1a58e8d/ext/encoding_iroha/iroha-tbl.h
    (excerpt)
    // (abbreviated)
    #define from_IROHA_offsets 21206
    0, 255,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
    33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 34, 43, 44, 45, 46, 47,
    47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
    47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
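
One way to read the offsets array is "one entry per byte value, pointing at the action for that byte". The sketch below is only an interpretation that reproduces the numbers shown; it is not a description of what tool/transcode-tblgen.rb actually does. The names `IROHA_CODE_POINTS` and `build_offsets` are hypothetical, and the assumption is that all ASCII bytes share one action and all unmapped bytes share another.

```ruby
# Code points for bytes 0x80..0xAE, copied in order from the conversion table slide.
IROHA_CODE_POINTS = [
  0x3044, 0x308d, 0x306f, 0x306b, 0x307b, 0x3078, 0x3068, 0x3061,
  0x308a, 0x306c, 0x308b, 0x3092, 0x308f, 0x304b, 0x3088, 0x305f,
  0x308c, 0x305d, 0x3064, 0x306d, 0x306a, 0x3089, 0x3080, 0x3046,
  0x3090, 0x306e, 0x304a, 0x304f, 0x3084, 0x307e, 0x3051, 0x3075,
  0x3053, 0x3048, 0x3066, 0x3042, 0x3055, 0x304d, 0x3086, 0x3081,
  0x307f, 0x3057, 0x3048, 0x3072, 0x3082, 0x305b, 0x3059
].freeze

# Returns 256 indexes; bytes with the same action share the same index.
def build_offsets(code_points, first_byte: 0x80)
  action_index = {} # action => index, numbered in order of first appearance
  (0..255).map do |byte|
    action =
      if byte < 0x80
        :ascii                               # assumed: ASCII needs no mapping
      elsif (code_point = code_points[byte - first_byte])
        code_point                           # mapped to a Unicode code point
      else
        :undefined                           # no mapping for this byte
      end
    action_index[action] ||= action_index.size
  end
end

offsets = build_offsets(IROHA_CODE_POINTS)
offsets.each_slice(16) { |row| puts row.join(', ') }
```

Under these assumptions the printed rows match the sixteen rows of the excerpt, including the repeated 34 at byte 0xAA, which follows from 0xA1 and 0xAA both mapping to 0x3048 in the table.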

  38. enc/ascii.c
    Makes the encoding available as a constant of the Encoding class.
    It can then be referenced as Encoding::IROHA.
    ENC_REPLICATE("IROHA", "ASCII-8BIT")

  39. Let’s try using Encoding::IROHA
    Encoding.find
    String#encode

  40. Encoding.find
    Search the encoding with specified name. name should be a string. [1]
    p Encoding.find('IROHA')
    # => #<Encoding:IROHA>
    [1] class Encoding - Documentation for Ruby 3.0.0: https://docs.ruby-lang.org/en/3.0.0/Encoding.html#method-c-find
  41. String#encode
    Return a copy of string transcoded to encoding. [1]
    p 'いろは'.encode(Encoding::IROHA)
    # => "\x80\x81\x82"
    [1] class String - Documentation for Ruby 3.0.0
  42. Conversion error
    'α'.encode(Encoding::IROHA)
    # error: in `encode': U+03B1 from UTF-8 to IROHA
    # (Encoding::UndefinedConversionError)
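
The same kind of error can be reproduced and handled on stock Ruby. This sketch uses US-ASCII as the destination, since Encoding::IROHA only exists in the patched build; the helper name `to_ascii_or_report` is made up for the example.

```ruby
# Hypothetical helper: transcodes to US-ASCII, or describes why it failed.
def to_ascii_or_report(text)
  text.encode(Encoding::US_ASCII)
rescue Encoding::UndefinedConversionError => e
  "cannot convert #{e.error_char.dump} " \
    "(#{e.source_encoding_name} -> #{e.destination_encoding_name})"
end

puts to_ascii_or_report('abc')
puts to_ascii_or_report('α')

# String#encode can also substitute characters instead of raising.
p 'α'.encode(Encoding::US_ASCII, undef: :replace, replace: '?')
p 'α'.encode(Encoding::US_ASCII, fallback: { 'α' => 'a' })
```

Whether to raise, replace, or fall back is a per-call decision, so each conversion can choose how strict it needs to be.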

  43. encode

  44. encode

  45. Extra
    A gem that adds Encoding::IROHA
    gem install encoding_iroha
    It calls private APIs
    See this commit: https://github.com/ima1zumi/encoding_iroha/commit/1a58e8d

  46. Conclusion
    Character Encoding
    Coded Character Set
    Character Encoding Scheme
    Add Encoding
    Conversion table
    Encoding constant

  47. References (1/3)
    Ruby M17N
    成瀬 ゆい. "Ruby M17N の設計と実装". Rubyist Magazine. 2009-02-12. https://magazine.rubyist.net/articles/0025/0025-Ruby19_m17n.html, (Accessed 2021-08-26)
    Martin J. Dürst. "Ruby M17N". 2008-06-21. https://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html, (Accessed 2021-08-26)
    成瀬 ゆい. "なるせにっき". 2008-06-23. https://naruse.hateblo.jp/entries/2008/06/23, (Accessed 2021-08-26)
    成瀬 ゆい, Martin J. Dürst. "Ruby M17N". 2008-06-21. https://web.archive.org/web/20150925234827/http://jp.rubyist.net/RubyKaigi2008/video/2008-06-21_rubykaigi2008-day1_5.ogg, (Accessed 2021-09-06)

  48. References (2/3)
    Character Encoding
    成瀬 ゆい. "A Reintroduction To Ruby M17N". 2010-03-03. https://www.slideshare.net/nalsh/a-reintroduction-to-ruby-m17-n, (Accessed 2021-08-26)
    矢野 啓介. [改訂新版]プログラマのための文字コード技術入門. 技術評論社, 2018, 400p. 978-4297102913
    杜甫々. "文字コード入門". とほほのWWW入門. 2020-03-01. https://www.tohoho-web.com/ex/charset.html, (Accessed 2021-08-26)
    伊藤 喜一. "(プログラマのための)いまさら聞けない標準規格の話 第1回 文字コード概要編". 2021-07-14. https://www.ogis-ri.co.jp/otc/hiroba/technical/program_standards/part1.html, (Accessed 2021-08-26)
    小林 龍生・安岡 孝一・戸村 哲・三上 喜貴 編. インターネット時代の文字コード. 共立出版, 2002, 277p. 4-320-12038-8
    Jukka K. Korpela. Unicode Explained. O’Reilly, 0-596-10121-X

  49. References (3/3)
    Unicode
    Unicode. "Unicode Terminology English - Japanese". unknown. http://www.unicode.org/terminology/term_en_ja.html, (Accessed 2021-08-26)
    Unicode. "UTR#17: Unicode Character Encoding Model". 2008-11-11. https://unicode.org/reports/tr17/, (Accessed 2021-08-26)
    Create IROHA
    larskanis. "Add string encoding IBM720 alias CP720 by larskanis · Pull Request #3803 · ruby/ruby". GitHub. 2020-11-22. https://github.com/ruby/ruby/pull/3803, (Accessed 2021-08-27)
    Ruby logo
    "Ruby のロゴについて". unknown. https://www.ruby-lang.org/ja/about/logo/, (Accessed 2021-09-10)