Dive into Encoding

ima1zumi
September 11, 2021

Transcript

  1. ima1zumi

    ESM, inc. Ruby on Rails engineer. Contributor to irb, reline, and rurema [1]. Learning about character codes out of curiosity.

    [1] Japanese Ruby Reference Manual: https://docs.ruby-lang.org/ja/latest/doc/index.html
  2. Reason for talking

    The bug fix in reline [1]. Bug fix※

    ※ Some terminals will not display it correctly 😢 because ZWJ (U+200D) is sometimes not supported.

    [1] https://github.com/ruby/reline/pull/217
  3. Reason for talking

    I thought it would be better to make a self-made character encoding and add it to Ruby. Do it.
  4. Agenda

    - About character encoding
    - Create self-made encoding IROHA
    - Ruby character encoding
    - How to add character encoding to CRuby
  5. Coded Character Set (CCS)

    Character set: a set of characters collected without duplication.
    Coded character set: a character set in which each character is assigned a number. The number is called a code point, code position, etc.
    (e.g.) Unicode, ASCII, JIS X 0213
  6. Character Encoding Scheme (CES)

    A method of transforming each character into a byte sequence.
    (e.g.) UTF-8, UTF-16, EUC-JP

        'あ'.encode(Encoding::UTF_8).bytes.map { _1.to_s(16) }
        # => ["e3", "81", "82"]
        'あ'.encode(Encoding::UTF_16BE).bytes.map { _1.to_s(16) }
        # => ["30", "42"]

    UTF_16BE stands for UTF-16 (Big Endian).
  7. What is Character Encoding? [1]

    Coded Character Set: a collection of abstract characters with code numbers assigned to them.
    Character Encoding Scheme: which byte sequence is used to represent each abstract character.

    [1] RFC 2130 https://www.rfc-editor.org/rfc/rfc2130.txt
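The two layers can be observed separately from Ruby. The following is a minimal sketch using only core String methods; the helper names describe_char and hex_bytes are hypothetical and do not come from the talk.

```ruby
# Hypothetical helper: hex bytes of a character in a given encoding scheme.
def hex_bytes(char, encoding)
  char.encode(encoding).bytes.map { |b| b.to_s(16) }
end

# Hypothetical helper: one code point (CCS layer) and its byte sequences
# under several encoding schemes (CES layer).
def describe_char(char)
  {
    code_point: format('U+%04X', char.encode(Encoding::UTF_8).ord),
    utf8: hex_bytes(char, Encoding::UTF_8),
    utf16be: hex_bytes(char, Encoding::UTF_16BE),
    euc_jp: hex_bytes(char, Encoding::EUC_JP)
  }
end

p describe_char("\u3042") # "\u3042" is あ
```

The code point stays the same while the byte sequence changes with the encoding scheme. EUC-JP is included to show that the bytes need not be derived from the Unicode code point at all.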
  8. Code Set Independent (CSI)

    Treats all encodings equally.
    Used by Ruby and Solaris.
    Advantages:
    - Independent of any specific character set
    - Can handle multiple character sets in a single application
    - Less overhead (no conversion to a single internal code is required)
    Disadvantages:
    - Probably difficult to implement※

    ※ ref: https://jp.quora.com/Ruby-deha-naze-UCS-seiki-ka-wo-saiyou-shi-tei-nai-node-shou-ka
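As a rough illustration of what "multiple character sets in a single application" means in practice, the sketch below keeps the same character in two encodings side by side. The constant and method names are hypothetical; the behavior relies only on core String and Encoding.

```ruby
UTF8_A  = "\u3042"                         # あ in UTF-8
EUCJP_A = UTF8_A.encode(Encoding::EUC_JP)  # あ in EUC-JP

# Both strings live in the same process, each carrying its own encoding.
p UTF8_A.encoding   # => #<Encoding:UTF-8>
p EUCJP_A.encoding  # => #<Encoding:EUC-JP>
p UTF8_A.bytesize   # => 3
p EUCJP_A.bytesize  # => 2

# Hypothetical helper: concatenate, returning the error class on failure.
def concat_result(left, right)
  left + right
rescue Encoding::CompatibilityError => e
  e.class
end

# No implicit conversion happens; mixing non-ASCII strings is an error.
p concat_result(UTF8_A, EUCJP_A)
# => Encoding::CompatibilityError

# Converting explicitly makes the operation valid.
p concat_result(UTF8_A, EUCJP_A.encode(Encoding::UTF_8)) == "\u3042\u3042"
# => true
```

Because there is no single internal code, the conversion step is left to the caller. That is part of the implementation cost the slide mentions.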
  9. Each instance of String has Encoding information

    String#encoding

        str1.encoding
        # => #<Encoding:UTF-8>

        str2.encoding
        # => #<Encoding:US-ASCII>
  10. Character encoding can be converted for each String instance

        str1 = 'abc'
        p str1.encoding
        # => #<Encoding:UTF-8>

        str2 = str1.encode(Encoding::US_ASCII)
        p str2.encoding
        # => #<Encoding:US-ASCII>

    If you want to do more detailed conversion, use Encoding::Converter.
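A short sketch of what Encoding::Converter adds over String#encode: it keeps state between calls, so input can arrive in pieces that split a character. The helper names below are hypothetical.

```ruby
# One-shot conversion, equivalent to String#encode for this simple case.
def utf8_to_eucjp(str)
  ec = Encoding::Converter.new('UTF-8', 'EUC-JP')
  ec.convert(str) + ec.finish
end

# Chunked conversion: the converter buffers an incomplete character
# until the remaining bytes arrive in a later call.
def convert_in_chunks(chunks, from:, to:)
  ec = Encoding::Converter.new(from, to)
  out = chunks.map { |chunk| ec.convert(chunk) }
  out << ec.finish
  out.join
end

p utf8_to_eucjp("\u3042").bytes.map { |b| b.to_s(16) }
# => ["a4", "a2"]

# The three UTF-8 bytes of あ (E3 81 82) split across two chunks.
chunks = ["\xE3\x81".b, "\x82".b]
p convert_in_chunks(chunks, from: 'UTF-8', to: 'EUC-JP').bytes.map { |b| b.to_s(16) }
```

String#encode is enough for a whole string held in memory. The converter is mainly useful for streams and for inspecting conversion results step by step.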
  11. Universal Coded Set (UCS)

    Has only one internal code. Converts between the external code and UCS at input/output.
    Used by many programming languages: C#, Java, JavaScript, Perl, Python, etc.
    Advantages:
    - Implementation and string handling can be unified.
    Disadvantages:
    - Conversion may occur at input/output.
    - Some information is sometimes lost.
  12. Ruby’s implementation of CSI

    How do they implement it? In order to find out:
    1. Make a self-made character encoding
    2. Make it work in Ruby
       - Change CRuby code
       - Build
  13. What you need to define a character encoding in Ruby

    - Character code conversion table: maps each character in the new encoding to the corresponding character in another encoding
    - Constant of the Encoding class
      (e.g.) Self-made encoding name: IROHA -> Encoding::IROHA
  14. Files to add/modify in CRuby

    Conversion table
    - enc/trans/iroha-tbl.rb -> the conversion table
    - enc/trans/single_byte.trans -> uses iroha-tbl.rb
    Encoding class constant definition
    - enc/ascii.c

    ref: https://github.com/ima1zumi/ruby/pull/2
  15. diff (1/3)

        diff --git a/enc/ascii.c b/enc/ascii.c
        index a2fef2f879..0d248bd129 100644
        --- a/enc/ascii.c
        +++ b/enc/ascii.c
        @@ -74,6 +74,7 @@ ENC_REPLICATE("CP852", "IBM852")
         ENC_REPLICATE("IBM855", "ASCII-8BIT")
         ENC_REPLICATE("CP855", "IBM855")
         ENC_REPLICATE("IBM857", "ASCII-8BIT")
        +ENC_REPLICATE("IROHA", "ASCII-8BIT")
         ENC_ALIAS("CP857", "IBM857")
         ENC_REPLICATE("IBM860", "ASCII-8BIT")
         ENC_ALIAS("CP860", "IBM860")
  16. diff (2/3)

        diff --git a/enc/trans/iroha-tbl.rb b/enc/trans/iroha-tbl.rb
        new file mode 100644
        index 0000000000..1d170e221e
        --- /dev/null
        +++ b/enc/trans/iroha-tbl.rb
        @@ -0,0 +1,49 @@
        +IROHA_TO_UCS_TBL = [
        +  ["80", 0x3044], # い
        +  ["81", 0x308d], # ろ
        +  ["82", 0x306f], # は
        +  ["83", 0x306b], # に
        +  ["84", 0x307b], # ほ
         # (omitted)
        +  ["AD", 0x305b], # せ
        +  ["AE", 0x3059], # す
        +]
  17. diff (3/3)

        diff --git a/enc/trans/single_byte.trans b/enc/trans/single_byte.trans
        index 0d5407b918..57eb87a9c9 100644
        --- a/enc/trans/single_byte.trans
        +++ b/enc/trans/single_byte.trans
        @@ -64,6 +64,7 @@
         transcode_tblgen_singlebyte "IBM865"
         transcode_tblgen_singlebyte "IBM866"
         transcode_tblgen_singlebyte "IBM869"
        +transcode_tblgen_singlebyte "IROHA"
         transcode_tblgen_singlebyte "MACCROATIAN"
         transcode_tblgen_singlebyte "MACCYRILLIC"
         transcode_tblgen_singlebyte "MACGREEK"
  18. Conversion table: iroha-tbl.rb

        IROHA_TO_UCS_TBL = [
          ["80", 0x3044], ["81", 0x308d], ["82", 0x306f], ["83", 0x306b],
          ["84", 0x307b], ["85", 0x3078], ["86", 0x3068], ["87", 0x3061],
          ["88", 0x308a], ["89", 0x306c], ["8A", 0x308b], ["8B", 0x3092],
          ["8C", 0x308f], ["8D", 0x304b], ["8E", 0x3088], ["8F", 0x305f],
          ["90", 0x308c], ["91", 0x305d], ["92", 0x3064], ["93", 0x306d],
          ["94", 0x306a], ["95", 0x3089], ["96", 0x3080], ["97", 0x3046],
          ["98", 0x3090], ["99", 0x306e], ["9A", 0x304a], ["9B", 0x304f],
          ["9C", 0x3084], ["9D", 0x307e], ["9E", 0x3051], ["9F", 0x3075],
          ["A0", 0x3053], ["A1", 0x3048], ["A2", 0x3066], ["A3", 0x3042],
          ["A4", 0x3055], ["A5", 0x304d], ["A6", 0x3086], ["A7", 0x3081],
          ["A8", 0x307f], ["A9", 0x3057], ["AA", 0x3048], ["AB", 0x3072],
          ["AC", 0x3082], ["AD", 0x305b], ["AE", 0x3059]
        ]

    You don’t have to write the ASCII byte sequences. [1]

    [1] https://github.com/ruby/ruby/blob/d92f09a5eea009fa28cd046e9d0eb698e3d94c5c/tool/transcode-tblgen.rb#L882-L883
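To show what such a table expresses, here is a minimal pure-Ruby sketch of a table-driven single-byte conversion. It is not how CRuby's transcoder works internally. It uses only the first five entries of the table, and the names IROHA_SAMPLE_TBL, iroha_encode, and iroha_decode are hypothetical. ASCII bytes are passed through unchanged, mirroring the note that they need not be listed.

```ruby
# First five entries of the table: [byte as hex string, Unicode code point].
IROHA_SAMPLE_TBL = [
  ['80', 0x3044], # い
  ['81', 0x308d], # ろ
  ['82', 0x306f], # は
  ['83', 0x306b], # に
  ['84', 0x307b]  # ほ
].freeze

# Lookup hashes in both directions.
BYTE_TO_UCS = IROHA_SAMPLE_TBL.to_h { |hex, cp| [hex.to_i(16), cp] }.freeze
UCS_TO_BYTE = BYTE_TO_UCS.invert.freeze

# Hypothetical helper: UTF-8 string -> binary string of IROHA bytes.
def iroha_encode(str)
  bytes = str.encode(Encoding::UTF_8).codepoints.map do |cp|
    next cp if cp < 0x80 # ASCII maps to itself

    UCS_TO_BYTE.fetch(cp) do
      raise ArgumentError, format('U+%04X has no mapping in the sample table', cp)
    end
  end
  bytes.pack('C*')
end

# Hypothetical helper: binary string of IROHA bytes -> UTF-8 string.
def iroha_decode(bytes)
  bytes.bytes.map { |b| b < 0x80 ? b : BYTE_TO_UCS.fetch(b) }.pack('U*')
end

p iroha_encode("\u3044\u308d\u306f")                 # いろは => "\x80\x81\x82"
p iroha_decode("\x80\x81\x82".b) == "\u3044\u308d\u306f" # => true
```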
  19. enc/trans/single_byte.trans

    single_byte.trans generates the character encoding conversion table as C code with erb.
    It generates enc/trans/single_byte.c, which is included when CRuby is built.
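The general idea of generating C source from a Ruby table with erb can be sketched as follows. This is not the actual template used by CRuby's transcode-tblgen.rb, which emits a more compact lookup structure. The template, the C array name, and generate_c_table are all hypothetical.

```ruby
require 'erb'

# Hypothetical template: one C array entry per table row.
C_TABLE_TEMPLATE = ERB.new(<<~TEMPLATE, trim_mode: '-')
  /* generated sketch: byte -> Unicode code point */
  static const unsigned short <%= name %>_to_ucs[<%= table.size %>] = {
  <%- table.each do |hex, cp| -%>
    /* 0x<%= hex %> */ 0x<%= format('%04X', cp) %>,
  <%- end -%>
  };
TEMPLATE

# Hypothetical helper: render the template for a given table.
def generate_c_table(name, table)
  C_TABLE_TEMPLATE.result_with_hash(name: name, table: table)
end

puts generate_c_table('iroha', [['80', 0x3044], ['81', 0x308d]])
```

Keeping the table in Ruby and the output in C means adding an encoding only requires a new table plus one line naming it, as in the diffs shown earlier in the talk.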
  20. enc/trans/single_byte.c

    see: https://github.com/ima1zumi/encoding_iroha/blob/1a58e8d/ext/encoding_iroha/iroha-tbl.h

        // abbreviated
        #define from_IROHA_offsets 21206
            0, 255,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
            1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
           17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
           33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 34, 43, 44, 45, 46, 47,
           47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
           47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
  21. enc/ascii.c

    Makes the encoding available as a constant of the Encoding class, so that it can be referenced as Encoding::IROHA.

        ENC_REPLICATE("IROHA", "ASCII-8BIT")
  22. Encoding.find

    Search the encoding with specified name. name should be a string. [1]

        p Encoding.find('IROHA')
        # => #<Encoding:IROHA>

    [1] class Encoding - Documentation for Ruby 3.0.0: https://docs.ruby-lang.org/en/3.0.0/Encoding.html#method-c-find
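Encoding::IROHA only exists in a Ruby built with the changes above, so the same lookup can be tried on a stock Ruby with built-in encodings. In this sketch the wrapper find_encoding is a hypothetical helper that turns the ArgumentError for unknown names into nil.

```ruby
# Hypothetical helper: Encoding.find, but nil instead of ArgumentError.
def find_encoding(name)
  Encoding.find(name)
rescue ArgumentError
  nil
end

p find_encoding('UTF-8')            # => #<Encoding:UTF-8>
p find_encoding('utf-8')            # names are matched case-insensitively
p find_encoding('CP932')            # an alias resolves to Windows-31J
p find_encoding('no-such-encoding') # => nil
```

On a stock build, looking up 'IROHA' through this helper is expected to return nil as well, since the name is not registered.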
  23. String#encode

    Return a copy of string transcoded to encoding. [1]

        p 'いろは'.encode(Encoding::IROHA)
        # => "\x80\x81\x82"

    [1] class String - Documentation for Ruby 3.0.0
  24. Conversion error

        'α'.encode(Encoding::IROHA)
        # error: in `encode': U+03B1 from UTF-8 to IROHA
        # (Encoding::UndefinedConversionError)
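The same error appears with built-in encodings whenever the target has no mapping for a character. The sketch below uses US-ASCII as a stand-in for IROHA so that it runs on a stock Ruby. Both helper names are hypothetical.

```ruby
ALPHA = "\u03B1" # α

# Hypothetical helper: report what could not be converted instead of raising.
def conversion_failure(str, encoding)
  str.encode(encoding)
  nil
rescue Encoding::UndefinedConversionError => e
  { char: e.error_char, from: e.source_encoding_name, to: e.destination_encoding_name }
end

# Hypothetical helper: substitute a placeholder for unmappable characters.
def encode_with_replacement(str, encoding)
  str.encode(encoding, undef: :replace, replace: '?')
end

p conversion_failure(ALPHA, Encoding::US_ASCII)
p encode_with_replacement("a#{ALPHA}", Encoding::US_ASCII) # => "a?"
```

Whether to raise, replace, or skip is a per-call choice, which fits the CSI approach of leaving conversion policy to the application.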
  25. extra

    A gem that adds Encoding::IROHA: gem install encoding_iroha
    It calls private APIs. See this commit: https://github.com/ima1zumi/encoding_iroha/commit/1a58e8d
  26. References (1/3)

    Ruby M17N
    - 成瀬 ゆい. "Ruby M17N の設計と実装". Rubyist Magazine. 2009-02-12. https://magazine.rubyist.net/articles/0025/0025-Ruby19_m17n.html, (Accessed 2021-08-26)
    - Martin J. Dürst. "Ruby M17N". 2008-06-21. https://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html, (Accessed 2021-08-26)
    - 成瀬 ゆい. "なるせにっき". 2008-06-23. https://naruse.hateblo.jp/entries/2008/06/23, (Accessed 2021-08-26)
    - 成瀬 ゆい, Martin J. Dürst. "Ruby M17N". 2008-06-21. https://web.archive.org/web/20150925234827/http://jp.rubyist.net/RubyKaigi2008/video/2008-06-21_rubykaigi2008-day1_5.ogg, (Accessed 2021-09-06)
  27. References (2/3)

    Character Encoding
    - 成瀬 ゆい. "A Reintroduction To Ruby M17N". 2010-03-03. https://www.slideshare.net/nalsh/a-reintroduction-to-ruby-m17-n, (Accessed 2021-08-26)
    - 矢野 啓介. [改訂新版]プログラマのための文字コード技術入門. 技術評論社, 2018, 400p. 978-4297102913
    - 杜甫々. "文字コード入門". とほほのWWW入門. 2020-03-01. https://www.tohoho-web.com/ex/charset.html, (Accessed 2021-08-26)
    - 伊藤 喜一. "(プログラマのための)いまさら聞けない標準規格の話 第1回 文字コード概要編". 2021-07-14. https://www.ogis-ri.co.jp/otc/hiroba/technical/program_standards/part1.html, (Accessed 2021-08-26)
    - 小林 龍生・安岡 孝一・戸村 哲・三上 喜貴 編. インターネット時代の文字コード. 共立出版, 2002, 277p. 4-320-12038-8
    - Jukka K. Korpela. Unicode Explained. O’Reilly, 0-596-10121-X
  28. References (3/3)

    Unicode
    - Unicode. "Unicode Terminology English - Japanese". unknown. http://www.unicode.org/terminology/term_en_ja.html, (Accessed 2021-08-26)
    - Unicode. "UTR#17: Unicode Character Encoding Model". 2008-11-11. https://unicode.org/reports/tr17/, (Accessed 2021-08-26)
    Create IROHA
    - larskanis. "Add string encoding IBM720 alias CP720 by larskanis · Pull Request #3803 · ruby/ruby". GitHub. 2020-11-22. https://github.com/ruby/ruby/pull/3803, (Accessed 2021-08-27)
    Ruby logo
    - "Ruby のロゴについて". unknown. https://www.ruby-lang.org/ja/about/logo/ (Accessed 2021-09-10)