
Dive into Encoding

ima1zumi
September 11, 2021

Transcript

  1. None
  2. ima1zumi

    ESM, Inc. Ruby on Rails engineer. Contributor to irb, reline, and rurema [1]. Learning about character codes out of curiosity.
    [1] Japanese Ruby Reference Manual: https://docs.ruby-lang.org/ja/latest/doc/index.html
  3. Reason for talking

    The bug fix in reline [1]. Bug fix※
    ※ Some terminals will not display this correctly 😢 because ZWJ (U+200D) is sometimes not supported.
    [1] https://github.com/ruby/reline/pull/217
  4. Reason for talking: Ruby CSI?? Character Encoding…???

  5. Reason for talking

    I thought it would be better to make a self-made character encoding and add it to Ruby. Do it.
  6. `p 'いろは'.encode(Encoding::IROHA) # => "\x80\x81\x82"`

  7. What I want to say

    How to create a character encoding. How to add a charset to CRuby.
  8. Agenda

    About character encoding. Create a self-made encoding, IROHA. Ruby character encoding. How to add a character encoding to CRuby.
  9. Not covered: non-CRuby implementations

  10. Request

    If you find any mistakes, inaccuracies, etc., please contact #rubykaigiB or @ima1zumi!
  11. Defining Character Encoding

  12. "Character encoding" can mean: Coded Character Set, or Character Encoding Scheme

  13. Coded Character Set (CCS)

    Character set: a set of characters collected without duplication. Coded character set: a character set in which each character is assigned a number. That number is called a code point, code position, etc. Examples: Unicode, ASCII, JIS X 0213.
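
To make the idea of a code point concrete, here is a minimal sketch using Ruby's built-in `String#ord` and `Integer#chr`. It assumes the source file is UTF-8 (Ruby's default), so the numbers shown are Unicode code points; the variable names are only illustrative.

```ruby
# A coded character set assigns one number (a code point) to each character.
hiragana_a = 'あ'
code_point = hiragana_a.ord            # Unicode code point as an Integer
puts format('U+%04X', code_point)      # prints U+3042

latin_a = 'A'
puts format('U+%04X', latin_a.ord)     # prints U+0041

# The assignment works in both directions: number -> character.
round_trip = code_point.chr(Encoding::UTF_8)
puts round_trip == hiragana_a          # prints true
```

Note that nothing here says how the number is stored as bytes; that is the job of the character encoding scheme.
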
  14. Character Encoding Scheme (CES)

    A method of transforming each character into a byte sequence. Examples: UTF-8, UTF-16, EUC-JP.
        'あ'.encode(Encoding::UTF_8).bytes.map { _1.to_s(16) }
        # => ["e3", "81", "82"]
        'あ'.encode(Encoding::UTF_16BE).bytes.map { _1.to_s(16) }
        # => ["30", "42"]
    UTF_16BE stands for UTF-16 (Big Endian).
  15. What is Character Encoding?

    Coded Character Set: a collection of abstract characters with code numbers assigned to them. Character Encoding Scheme: which byte sequence is used to represent the abstract characters. [1]
    [1] RFC 2130: https://www.rfc-editor.org/rfc/rfc2130.txt
  16. What do we need to create a character encoding?

    A name, a set of characters, and a byte sequence for each character.
  17. Self-made encoding: IROHA

    Name: IROHA. Characters: ASCII + Iroha uta. Byte sequence: 1 byte.
  18. Iroha uta

    A pangram, like "The quick brown fox jumps over the lazy dog."
  19. ASCII table: 7-bit encoding

  20. IROHA table: 1-byte encoding (like ISO/IEC 8859)

  21. Ruby character encoding

  22. Ruby M17N

    M17N: multilingualization. Since Ruby 1.9. Code Set Independent (CSI).
  23. Code Set Independent (CSI)

    Treats all encodings fairly. Examples: Ruby, Solaris. Advantages: independent of any specific character set; can handle multiple character sets in a single application; less overhead. Disadvantages: probably difficult to implement※.
    ※ ref: https://jp.quora.com/Ruby-deha-naze-UCS-seiki-ka-wo-saiyou-shi-tei-nai-node-shou-ka
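
As a rough illustration of "multiple character sets in a single application", the sketch below keeps the same character in three stock encodings side by side. It uses only encodings that ship with CRuby; the variable names are illustrative.

```ruby
utf8  = 'あ'                               # string literals are UTF-8 by default
eucjp = 'あ'.encode(Encoding::EUC_JP)
sjis  = 'あ'.encode(Encoding::Shift_JIS)

# Each String carries its own encoding; none is converted to a common internal form.
encodings  = [utf8, eucjp, sjis].map { |s| s.encoding.name }
byte_sizes = [utf8, eucjp, sjis].map(&:bytesize)
p encodings   # prints ["UTF-8", "EUC-JP", "Shift_JIS"]
p byte_sizes  # prints [3, 2, 2]

# Mixing non-ASCII strings of different encodings is rejected, not silently converted.
mix_error =
  begin
    utf8 + eucjp
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
p mix_error   # prints Encoding::CompatibilityError
```

The explicit error is the price of the CSI design: the programmer decides when, and into what, a conversion happens.
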
  24. Each instance of String has Encoding information: `String#encoding`

        str1.encoding
        # => #<Encoding:UTF-8>
        str2.encoding
        # => #<Encoding:US-ASCII>
  25. Character encoding can be converted for each String instance

        str1 = 'abc'
        p str1.encoding
        # => #<Encoding:UTF-8>
        str2 = str1.encode(Encoding::US_ASCII)
        p str2.encoding
        # => #<Encoding:US-ASCII>
    If you want to do more detailed conversion, use `Encoding::Converter`.
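
For reference, a minimal sketch of `Encoding::Converter` with stock encodings (UTF-8 to EUC-JP is chosen here only as an example pair). It shows the basic `convert` / `finish` calls, not the full range of options the class offers.

```ruby
# Build a converter once, then feed it strings.
converter = Encoding::Converter.new('UTF-8', 'EUC-JP')

converted = converter.convert('あ')
rest      = converter.finish        # flush anything still buffered

p converted.encoding                           # prints #<Encoding:EUC-JP>
p converted.bytes.map { |b| b.to_s(16) }       # prints ["a4", "a2"]
p rest.empty?                                  # prints true
```
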
  26. Universal Coded Set (UCS)

    Has only one internal code and converts between the external code and the UCS on input/output. Used by many programming languages: C#, Java, JavaScript, Perl, Python, etc. Advantages: implementation and string handling can be unified. Disadvantages: conversion may occur on input/output, and some information is sometimes lost.
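
Ruby itself is CSI, but it can imitate the "convert at input/output" step of the UCS approach through the external:internal encoding pair of an IO mode string. The sketch below assumes a temporary file holding EUC-JP bytes; the file name prefix is arbitrary.

```ruby
require 'tempfile'

text = nil
Tempfile.create('eucjp-sample') do |file|
  file.binmode
  file.write("\xA4\xA2".b)   # the two EUC-JP bytes for "あ"
  file.flush

  # "r:EUC-JP:UTF-8" means external encoding EUC-JP, internal encoding UTF-8,
  # so the bytes are transcoded while they are being read.
  text = File.open(file.path, 'r:EUC-JP:UTF-8', &:read)
end

p text.encoding   # prints #<Encoding:UTF-8>
p text            # prints "あ"
```

In a UCS language this conversion is always on; in Ruby it only happens when an internal encoding is requested.
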
  27. Ruby's implementation of CSI

    How is it implemented? To find out: 1. make a self-made character encoding; 2. make it work in Ruby (change the CRuby code, then build).
  28. Define character encoding in Ruby

  29. What you need to define a character encoding in Ruby

    A character code conversion table, which maps a character in one encoding to the same character in another. Constants of the Encoding class, e.g. self-made encoding name IROHA -> `Encoding::IROHA`.
  30. Files to add/modify in CRuby

    Conversion table: `enc/trans/iroha-tbl.rb` (the conversion table) and `enc/trans/single_byte.trans` (uses `iroha-tbl.rb`). Encoding class constant definition: `enc/ascii.c`. ref: https://github.com/ima1zumi/ruby/pull/2
  31. diff (1/3)

        diff --git a/enc/ascii.c b/enc/ascii.c
        index a2fef2f879..0d248bd129 100644
        --- a/enc/ascii.c
        +++ b/enc/ascii.c
        @@ -74,6 +74,7 @@ ENC_REPLICATE("CP852", "IBM852")
         ENC_REPLICATE("IBM855", "ASCII-8BIT")
         ENC_REPLICATE("CP855", "IBM855")
         ENC_REPLICATE("IBM857", "ASCII-8BIT")
        +ENC_REPLICATE("IROHA", "ASCII-8BIT")
         ENC_ALIAS("CP857", "IBM857")
         ENC_REPLICATE("IBM860", "ASCII-8BIT")
         ENC_ALIAS("CP860", "IBM860")
  32. diff (2/3)

        diff --git a/enc/trans/iroha-tbl.rb b/enc/trans/iroha-tbl.rb
        new file mode 100644
        index 0000000000..1d170e221e
        --- /dev/null
        +++ b/enc/trans/iroha-tbl.rb
        @@ -0,0 +1,49 @@
        +IROHA_TO_UCS_TBL = [
        +  ["80", 0x3044], # い
        +  ["81", 0x308d], # ろ
        +  ["82", 0x306f], # は
        +  ["83", 0x306b], # に
        +  ["84", 0x307b], # ほ
        # (omitted)
        +  ["AD", 0x305b], # せ
        +  ["AE", 0x3059], # す
        +]
  33. diff (3/3)

        diff --git a/enc/trans/single_byte.trans b/enc/trans/single_byte.trans
        index 0d5407b918..57eb87a9c9 100644
        --- a/enc/trans/single_byte.trans
        +++ b/enc/trans/single_byte.trans
        @@ -64,6 +64,7 @@
           transcode_tblgen_singlebyte "IBM865"
           transcode_tblgen_singlebyte "IBM866"
           transcode_tblgen_singlebyte "IBM869"
        +  transcode_tblgen_singlebyte "IROHA"
           transcode_tblgen_singlebyte "MACCROATIAN"
           transcode_tblgen_singlebyte "MACCYRILLIC"
           transcode_tblgen_singlebyte "MACGREEK"
  34. Conversion table: iroha-tbl.rb

        IROHA_TO_UCS_TBL = [
          ["80", 0x3044], ["81", 0x308d], ["82", 0x306f], ["83", 0x306b],
          ["84", 0x307b], ["85", 0x3078], ["86", 0x3068], ["87", 0x3061],
          ["88", 0x308a], ["89", 0x306c], ["8A", 0x308b], ["8B", 0x3092],
          ["8C", 0x308f], ["8D", 0x304b], ["8E", 0x3088], ["8F", 0x305f],
          ["90", 0x308c], ["91", 0x305d], ["92", 0x3064], ["93", 0x306d],
          ["94", 0x306a], ["95", 0x3089], ["96", 0x3080], ["97", 0x3046],
          ["98", 0x3090], ["99", 0x306e], ["9A", 0x304a], ["9B", 0x304f],
          ["9C", 0x3084], ["9D", 0x307e], ["9E", 0x3051], ["9F", 0x3075],
          ["A0", 0x3053], ["A1", 0x3048], ["A2", 0x3066], ["A3", 0x3042],
          ["A4", 0x3055], ["A5", 0x304d], ["A6", 0x3086], ["A7", 0x3081],
          ["A8", 0x307f], ["A9", 0x3057], ["AA", 0x3048], ["AB", 0x3072],
          ["AC", 0x3082], ["AD", 0x305b], ["AE", 0x3059]
        ]
    You don't have to write the ASCII byte sequence. [1] Note that 0xAA repeats 0x3048 (え), already used for 0xA1; in Iroha order that slot is ゑ (U+3091).
    [1] https://github.com/ruby/ruby/blob/d92f09a5eea009fa28cd046e9d0eb698e3d94c5c/tool/transcode-tblgen.rb#L882-L883
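
To show what a table of this shape is for, here is a pure-Ruby sketch of the lookup it describes. This is not how CRuby's transcoder works internally (that is generated C code); the lambdas `to_iroha` and `from_iroha` and the five-entry sample table are hypothetical, and bytes below 0x80 are assumed to be plain ASCII.

```ruby
# First five entries of the slide's table: [IROHA byte in hex, Unicode code point].
sample_table = [
  ['80', 0x3044], # い
  ['81', 0x308d], # ろ
  ['82', 0x306f], # は
  ['83', 0x306b], # に
  ['84', 0x307b], # ほ
]

byte_to_code_point = sample_table.to_h { |hex, cp| [hex.to_i(16), cp] }
code_point_to_byte = byte_to_code_point.invert

# UTF-8 string -> IROHA bytes. ASCII passes through; unknown characters raise KeyError.
to_iroha = lambda do |str|
  str.codepoints.map { |cp| cp < 0x80 ? cp : code_point_to_byte.fetch(cp) }.pack('C*')
end

# IROHA bytes -> UTF-8 string.
from_iroha = lambda do |bytes|
  bytes.bytes.map { |b| b < 0x80 ? b : byte_to_code_point.fetch(b) }.pack('U*')
end

encoded = to_iroha.call('いろは')
decoded = from_iroha.call(encoded)
p encoded   # prints "\x80\x81\x82"
p decoded   # prints "いろは"
```

Each kana costs one byte here instead of three in UTF-8, which is the whole point of a single-byte encoding.
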
  35. Conversion table

  36. enc/trans/single_byte.trans

    `single_byte.trans` generates the character encoding conversion table as C code with erb, producing `enc/trans/single_byte.c`, which is included when CRuby is built.
  37. enc/trans/single_byte.c

    see: https://github.com/ima1zumi/encoding_iroha/blob/1a58e8d/ext/encoding_iroha/iroha-tbl.h
        // abbr
        #define from_IROHA_offsets 21206
        0, 255,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
        33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 34, 43, 44, 45, 46, 47,
        47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
        47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
        // abbr
  38. enc/ascii.c

    Makes it possible to refer to the encoding as a constant of the Encoding class: it can be referenced as `Encoding::IROHA`.
        ENC_REPLICATE("IROHA", "ASCII-8BIT")
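
The effect of such definitions can be observed from Ruby with an encoding that already ships with CRuby. The sketch below uses the stock IBM857 / CP857 pair that sits next to the IROHA line in the diff; it assumes, as that diff suggests, that CP857 is registered as an alias of IBM857.

```ruby
ibm857 = Encoding.find('IBM857')

# An alias is a second name for the very same Encoding object.
alias_same = Encoding.find('CP857').equal?(ibm857)
names      = ibm857.names

# Being derived from ASCII-8BIT, the encoding is ASCII-compatible.
ascii_compatible = ibm857.ascii_compatible?

p ibm857
p alias_same        # prints true
p names
p ascii_compatible  # prints true
```
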
  39. Let's try using Encoding::IROHA: `Encoding.find`, `String#encode`

  40. Encoding.find

    Search the encoding with specified name. name should be a string. [1]
        p Encoding.find('IROHA')
        # => #<Encoding:IROHA>
    [1] class Encoding - Documentation for Ruby 3.0.0: https://docs.ruby-lang.org/en/3.0.0/Encoding.html#method-c-find
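
For comparison, this is how `Encoding.find` behaves on a stock Ruby build, where IROHA has not been added. The sketch assumes neither the patched CRuby nor the encoding_iroha gem is in use.

```ruby
# Names are matched case-insensitively and return the shared Encoding object.
utf8 = Encoding.find('utf-8')
same = utf8.equal?(Encoding::UTF_8)
p utf8   # prints #<Encoding:UTF-8>
p same   # prints true

# An unregistered name raises ArgumentError.
missing_error =
  begin
    Encoding.find('IROHA')
    nil
  rescue ArgumentError => e
    e.class
  end
p missing_error   # prints ArgumentError on a stock build
```
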
  41. String#encode

    Return a copy of string transcoded to encoding. [1]
        p 'いろは'.encode(Encoding::IROHA)
        # => "\x80\x81\x82"
    [1] class String - Documentation for Ruby 3.0.0
  42. Conversion error

        'α'.encode(Encoding::IROHA)
        # error: in `encode': U+03B1 from UTF-8 to IROHA
        # (Encoding::UndefinedConversionError)
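
The same error can be reproduced, and handled, with a stock single-byte encoding. The sketch below substitutes ISO-8859-1 for IROHA (an assumption made only so that it runs on an unpatched Ruby); like IROHA, it has no slot for the Greek letter α.

```ruby
# Without options, a character missing from the target encoding raises.
error_char =
  begin
    'α'.encode(Encoding::ISO_8859_1)
    nil
  rescue Encoding::UndefinedConversionError => e
    e.error_char          # the character that could not be converted
  end
p error_char   # prints "α"

# undef: :replace substitutes a replacement character instead of raising.
replaced = 'α'.encode(Encoding::ISO_8859_1, undef: :replace)
custom   = 'α'.encode(Encoding::ISO_8859_1, undef: :replace, replace: '*')
p replaced     # prints "?"
p custom       # prints "*"
```
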
  43. encode

  44. encode

  45. Extra: an Encoding::IROHA gem

    `gem install encoding_iroha`. It calls private APIs. See this commit: https://github.com/ima1zumi/encoding_iroha/commit/1a58e8d
  46. Conclusion

    Character encoding: Coded Character Set and Character Encoding Scheme. Adding an encoding: conversion table and Encoding constant.
  47. References (1/3): Ruby M17N

    成瀬 ゆい. "Ruby M17N の設計と実装". Rubyist Magazine. 2009-02-12. https://magazine.rubyist.net/articles/0025/0025-Ruby19_m17n.html, (Accessed 2021-08-26)
    Martin J. Dürst. "Ruby M17N". 2008-06-21. https://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html, (Accessed 2021-08-26)
    成瀬 ゆい. "なるせにっき". 2008-06-23. https://naruse.hateblo.jp/entries/2008/06/23, (Accessed 2021-08-26)
    成瀬 ゆい, Martin J. Dürst. "Ruby M17N". 2008-06-21. https://web.archive.org/web/20150925234827/http://jp.rubyist.net/RubyKaigi2008/video/2008-06-21_rubykaigi2008-day1_5.ogg, (Accessed 2021-09-06)
  48. References (2/3): Character Encoding

    成瀬 ゆい. "A Reintroduction To Ruby M17N". 2010-03-03. https://www.slideshare.net/nalsh/a-reintroduction-to-ruby-m17-n, (Accessed 2021-08-26)
    矢野 啓介. [改訂新版]プログラマのための文字コード技術入門. 技術評論社, 2018, 400p. 978-4297102913
    杜甫々. "文字コード入門". とほほのWWW入門. 2020-03-01. https://www.tohoho-web.com/ex/charset.html, (Accessed 2021-08-26)
    伊藤 喜一. "(プログラマのための)いまさら聞けない標準規格の話 第1回 文字コード概要編". 2021-07-14. https://www.ogis-ri.co.jp/otc/hiroba/technical/program_standards/part1.html, (Accessed 2021-08-26)
    小林 龍生・安岡 孝一・戸村 哲・三上 喜貴 編. インターネット時代の文字コード. 共立出版, 2002, 277p. 4-320-12038-8
    Jukka K. Korpela. Unicode Explained. O'Reilly, 0-596-10121-X
  49. References (3/3): Unicode, Create IROHA, Ruby logo

    Unicode: Unicode. "Unicode Terminology English - Japanese". unknown. http://www.unicode.org/terminology/term_en_ja.html, (Accessed 2021-08-26)
    Unicode: Unicode. "UTR#17: Unicode Character Encoding Model". 2008-11-11. https://unicode.org/reports/tr17/, (Accessed 2021-08-26)
    Create IROHA: larskanis. "Add string encoding IBM720 alias CP720 by larskanis · Pull Request #3803 · ruby/ruby". GitHub. 2020-11-22. https://github.com/ruby/ruby/pull/3803, (Accessed 2021-08-27)
    Ruby logo: "Ruby のロゴについて". unknown. https://www.ruby-lang.org/ja/about/logo/ (Accessed 2021-09-10)