Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

ima1zumi
- ESM, Inc.
- Ruby on Rails engineer
- irb, reline, and rurema [1] contributor
- Learning about character codes out of curiosity

[1] Japanese Ruby Reference Manual: https://docs.ruby-lang.org/ja/latest/doc/index.html

Slide 3

Slide 3 text

Reason for talking
A bug fix in reline [1]
Bug -> Fix ※
※ Some terminals still do not display the result correctly 😢, because ZWJ (U+200D) is not always supported.

[1] https://github.com/ruby/reline/pull/217

Slide 4

Slide 4 text

Reason for talking
Ruby CSI?? Character Encoding…???

Slide 5

Slide 5 text

Reason for talking
I thought it would be a good idea to create my own character encoding and add it to Ruby.
Let's do it.

Slide 6

Slide 6 text

    p 'いろは'.encode(Encoding::IROHA)
    # => "\x80\x81\x82"

Slide 7

Slide 7 text

What I want to say
- How to create a character encoding
- How to add a character encoding to CRuby

Slide 8

Slide 8 text

Agenda
- About character encoding
- Create the self-made encoding IROHA
- Ruby character encoding
- How to add a character encoding to CRuby

Slide 9

Slide 9 text

Not covered
- Non-CRuby implementations

Slide 10

Slide 10 text

Request
If you find any mistakes or inaccuracies, please let me know at #rubykaigiB or @ima1zumi!

Slide 11

Slide 11 text

Defining Character Encoding

Slide 12

Slide 12 text

What "character encoding" means
- Coded Character Set
- Character Encoding Scheme

Slide 13

Slide 13 text

Coded Character Set (CCS)
- Character set: a collection of characters without duplicates
- Coded character set: a character set in which each character is assigned a number
  - The number is called a code point, code position, etc.
- Examples: Unicode, ASCII, JIS X 0213
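To make "code point" concrete, here is a minimal sketch that looks up the Unicode code point of a character and goes back again. The helper name code_point_label is made up for illustration; String#ord and Integer#chr are standard Ruby methods. It assumes the source file is UTF-8 (Ruby's default).

```ruby
# Hypothetical helper: format a character's code point in the usual U+XXXX notation.
def code_point_label(char)
  format('U+%04X', char.ord)
end

puts code_point_label('A')    # ASCII and Unicode assign the same number to "A"
puts code_point_label('あ')

# The reverse direction: from a code point back to the character.
puts 0x3042.chr(Encoding::UTF_8)
```

Note that nothing here says how the number is stored as bytes; that is the job of the character encoding scheme.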

Slide 14

Slide 14 text

Character Encoding Scheme (CES)
- A method of transforming each character into a byte sequence
- Examples: UTF-8, UTF-16, EUC-JP

    'あ'.encode(Encoding::UTF_8).bytes.map { _1.to_s(16) }
    # => ["e3", "81", "82"]
    'あ'.encode(Encoding::UTF_16BE).bytes.map { _1.to_s(16) }
    # => ["30", "42"]

UTF_16BE stands for UTF-16 (big endian).
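As a small extension of the example above, the following sketch prints the bytes of the same character under several encoding schemes, including EUC-JP, which uses a different coded character set (JIS X 0208) rather than Unicode. The helper hex_bytes is a name chosen here for illustration.

```ruby
# Hypothetical helper: transcode a string and list its bytes in hex.
def hex_bytes(str, encoding)
  str.encode(encoding).bytes.map { |b| b.to_s(16) }
end

# The same abstract character, four different byte sequences.
[Encoding::UTF_8, Encoding::UTF_16BE, Encoding::UTF_16LE, Encoding::EUC_JP].each do |enc|
  puts "#{enc.name}: #{hex_bytes('あ', enc).join(' ')}"
end
```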

Slide 15

Slide 15 text

What is Character Encoding? [1]
- Coded Character Set: a collection of abstract characters with code numbers assigned to them
- Character Encoding Scheme: which byte sequence is used to represent each abstract character

[1] RFC 2130 https://www.rfc-editor.org/rfc/rfc2130.txt

Slide 16

Slide 16 text

What do we need to create a character encoding?
- Name
- Characters
- Byte sequence

Slide 17

Slide 17 text

Self-made encoding: IROHA
- Name: IROHA
- Characters: ASCII + Iroha uta
- Byte sequence: 1 byte per character

Slide 18

Slide 18 text

Iroha uta
A pangram, like "The quick brown fox jumps over the lazy dog."

Slide 19

Slide 19 text

ASCII Table
7-bit encoding

Slide 20

Slide 20 text

IROHA Table
1-byte encoding (like ISO/IEC 8859)
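The IROHA design can be modelled in plain Ruby before touching CRuby. The sketch below is only an illustration under the assumptions stated on the earlier slides (ASCII in 0x00-0x7F, the 47 characters of the Iroha uta from 0x80 upward); it is not how CRuby implements transcoding, and the names IROHA_UTA, iroha_encode, and iroha_decode are made up here.

```ruby
# Hypothetical pure-Ruby model of the IROHA table (not the CRuby implementation).
IROHA_UTA = 'いろはにほへとちりぬるをわかよたれそつねならむうゐのおくやまけふこえてあさきゆめみしゑひもせす'

# Character -> byte value, assigned in poem order starting at 0x80.
IROHA_CHAR_TO_BYTE = IROHA_UTA.chars.each_with_index.to_h { |ch, i| [ch, 0x80 + i] }.freeze
IROHA_BYTE_TO_CHAR = IROHA_CHAR_TO_BYTE.invert.freeze

# Encode a UTF-8 string into a binary string of IROHA bytes.
def iroha_encode(str)
  bytes = str.each_char.map do |ch|
    next ch.ord if ch.ord < 0x80 # ASCII maps to itself
    IROHA_CHAR_TO_BYTE.fetch(ch) { raise ArgumentError, "no IROHA byte for #{ch}" }
  end
  bytes.pack('C*')
end

# Decode a binary string of IROHA bytes back into a UTF-8 string.
def iroha_decode(bytes)
  chars = bytes.each_byte.map do |b|
    next b.chr(Encoding::UTF_8) if b < 0x80
    IROHA_BYTE_TO_CHAR.fetch(b) { raise ArgumentError, format('undefined byte 0x%02X', b) }
  end
  chars.join
end

p iroha_encode('いろは')                  # three bytes, one per character
puts iroha_decode(iroha_encode('abc いろは'))
```

Because every character fits in one byte, the model is just a pair of lookup tables, which is why a single-byte encoding is a convenient first experiment.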

Slide 21

Slide 21 text

Ruby character encoding

Slide 22

Slide 22 text

Ruby M17N
- M17N: Multilingualization
- Available since Ruby 1.9
- Code Set Independent (CSI)

Slide 23

Slide 23 text

Code Set Independent (CSI)
- Treats all encodings equally
- Used by: Ruby, Solaris
- Advantages:
  - Independent of any specific character set
  - Can handle multiple character sets in a single application
  - Less overhead
- Disadvantages:
  - Probably difficult to implement ※

※ ref: https://jp.quora.com/Ruby-deha-naze-UCS-seiki-ka-wo-saiyou-shi-tei-nai-node-shou-ka
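One visible consequence of CSI is that strings in different encodings can live side by side in one program, and Ruby refuses to silently mix them. The following sketch illustrates this with stock encodings; try_concat is a helper name invented for the example.

```ruby
# Hypothetical helper: return the concatenation, or the error class if the
# two strings have incompatible encodings.
def try_concat(left, right)
  left + right
rescue Encoding::CompatibilityError => e
  e.class
end

UTF8_A  = 'あ'                          # UTF-8 (the default source encoding)
EUCJP_A = 'あ'.encode(Encoding::EUC_JP) # same character, different bytes

p UTF8_A.encoding
p EUCJP_A.encoding

# Two non-ASCII strings in different encodings cannot be joined as-is.
p try_concat(UTF8_A, EUCJP_A)

# An ASCII-only string is compatible with any ASCII-compatible encoding.
p try_concat(UTF8_A, 'abc'.encode(Encoding::EUC_JP))
```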

Slide 24

Slide 24 text

Each String instance has encoding information
String#encoding

    str1.encoding
    # => #<Encoding:...>

    str2.encoding
    # => #<Encoding:...>

Slide 25

Slide 25 text

Character encoding can be converted for each String instance

    str1 = 'abc'
    p str1.encoding
    # => #<Encoding:UTF-8>

    str2 = str1.encode(Encoding::US_ASCII)
    p str2.encoding
    # => #<Encoding:US-ASCII>

For more detailed control over conversion, use Encoding::Converter.

Slide 26

Slide 26 text

Universal Coded Character Set (UCS)
- Has only one internal code
- Converts between the external code and UCS on input/output
- Used by many programming languages: C#, Java, JavaScript, Perl, Python, etc.
- Advantages:
  - Implementation and string handling can be unified
- Disadvantages:
  - Conversion may occur on input/output
  - Some information is sometimes lost in conversion

Slide 27

Slide 27 text

Ruby’s implementation of CSI
How is it implemented? To find out:
1. Create a self-made character encoding
2. Make it work in Ruby
   - Change the CRuby code
   - Build

Slide 28

Slide 28 text

Define character encoding in Ruby

Slide 29

Slide 29 text

What you need to define a character encoding in Ruby
- Character code conversion table
  - Converts characters between one encoding and another
- Constant of the Encoding class
  - e.g. self-made encoding name: IROHA -> Encoding::IROHA

Slide 30

Slide 30 text

Files to add/modify in CRuby
- Conversion table
  - enc/trans/iroha-tbl.rb -> the conversion table
  - enc/trans/single_byte.trans -> uses iroha-tbl.rb
- Encoding class constant definition
  - enc/ascii.c

ref: https://github.com/ima1zumi/ruby/pull/2

Slide 31

Slide 31 text

diff (1/3)

    diff --git a/enc/ascii.c b/enc/ascii.c
    index a2fef2f879..0d248bd129 100644
    --- a/enc/ascii.c
    +++ b/enc/ascii.c
    @@ -74,6 +74,7 @@ ENC_REPLICATE("CP852", "IBM852")
     ENC_REPLICATE("IBM855", "ASCII-8BIT")
     ENC_REPLICATE("CP855", "IBM855")
     ENC_REPLICATE("IBM857", "ASCII-8BIT")
    +ENC_REPLICATE("IROHA", "ASCII-8BIT")
     ENC_ALIAS("CP857", "IBM857")
     ENC_REPLICATE("IBM860", "ASCII-8BIT")
     ENC_ALIAS("CP860", "IBM860")

Slide 32

Slide 32 text

diff (2/3)

    diff --git a/enc/trans/iroha-tbl.rb b/enc/trans/iroha-tbl.rb
    new file mode 100644
    index 0000000000..1d170e221e
    --- /dev/null
    +++ b/enc/trans/iroha-tbl.rb
    @@ -0,0 +1,49 @@
    +IROHA_TO_UCS_TBL = [
    +  ["80", 0x3044], # い
    +  ["81", 0x308d], # ろ
    +  ["82", 0x306f], # は
    +  ["83", 0x306b], # に
    +  ["84", 0x307b], # ほ
     # (omitted)
    +  ["AD", 0x305b], # せ
    +  ["AE", 0x3059], # す
    +]

Slide 33

Slide 33 text

diff (3/3)

    diff --git a/enc/trans/single_byte.trans b/enc/trans/single_byte.trans
    index 0d5407b918..57eb87a9c9 100644
    --- a/enc/trans/single_byte.trans
    +++ b/enc/trans/single_byte.trans
    @@ -64,6 +64,7 @@
     transcode_tblgen_singlebyte "IBM865"
     transcode_tblgen_singlebyte "IBM866"
     transcode_tblgen_singlebyte "IBM869"
    +transcode_tblgen_singlebyte "IROHA"
     transcode_tblgen_singlebyte "MACCROATIAN"
     transcode_tblgen_singlebyte "MACCYRILLIC"
     transcode_tblgen_singlebyte "MACGREEK"

Slide 34

Slide 34 text

Conversion table: iroha-tbl.rb

    IROHA_TO_UCS_TBL = [
      ["80", 0x3044], ["81", 0x308d], ["82", 0x306f], ["83", 0x306b],
      ["84", 0x307b], ["85", 0x3078], ["86", 0x3068], ["87", 0x3061],
      ["88", 0x308a], ["89", 0x306c], ["8A", 0x308b], ["8B", 0x3092],
      ["8C", 0x308f], ["8D", 0x304b], ["8E", 0x3088], ["8F", 0x305f],
      ["90", 0x308c], ["91", 0x305d], ["92", 0x3064], ["93", 0x306d],
      ["94", 0x306a], ["95", 0x3089], ["96", 0x3080], ["97", 0x3046],
      ["98", 0x3090], ["99", 0x306e], ["9A", 0x304a], ["9B", 0x304f],
      ["9C", 0x3084], ["9D", 0x307e], ["9E", 0x3051], ["9F", 0x3075],
      ["A0", 0x3053], ["A1", 0x3048], ["A2", 0x3066], ["A3", 0x3042],
      ["A4", 0x3055], ["A5", 0x304d], ["A6", 0x3086], ["A7", 0x3081],
      ["A8", 0x307f], ["A9", 0x3057], ["AA", 0x3091], ["AB", 0x3072],
      ["AC", 0x3082], ["AD", 0x305b], ["AE", 0x3059]
    ]

You don’t have to write the ASCII byte sequences. [1]

[1] https://github.com/ruby/ruby/blob/d92f09a5eea009fa28cd046e9d0eb698e3d94c5c/tool/transcode-tblgen.rb#L882-L883
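A hand-written table like this is easy to mistype, so a quick sanity check before building can help. The sketch below is an assumption-labelled illustration, not part of CRuby's tooling: duplicate_code_points is a made-up helper, and SAMPLE_TBL is a deliberately broken three-row sample in the same [byte_hex, code_point] format.

```ruby
# Hypothetical check: find code points that are assigned to more than one byte.
# Returns a hash of code point => list of byte strings that claim it.
def duplicate_code_points(table)
  table
    .group_by { |_byte, code_point| code_point }
    .select { |_code_point, rows| rows.size > 1 }
    .transform_values { |rows| rows.map(&:first) }
end

# Deliberately broken sample: 0x3044 appears twice.
SAMPLE_TBL = [
  ['80', 0x3044],
  ['81', 0x308d],
  ['82', 0x3044]
].freeze

duplicate_code_points(SAMPLE_TBL).each do |code_point, bytes|
  puts format('U+%04X is mapped from bytes %s', code_point, bytes.join(', '))
end
```

A duplicate matters because the table is used in both directions: a code point that appears twice has no single byte to encode to.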

Slide 35

Slide 35 text

Conversion table

Slide 36

Slide 36 text

enc/trans/single_byte.trans
- Generates the character encoding conversion table as C code with erb
  - single_byte.trans generates enc/trans/single_byte.c
- The generated file is included when CRuby is built.

Slide 37

Slide 37 text

enc/trans/single_byte.c
see: https://github.com/ima1zumi/encoding_iroha/blob/1a58e8d/ext/encoding_iroha/iroha-tbl.h

Excerpt:

    // (abbreviated)
    #define from_IROHA_offsets 21206
        0, 255,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
        33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 34, 43, 44, 45, 46, 47,
        47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,
        47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47,

Slide 38

Slide 38 text

enc/ascii.c
- Makes the encoding available as a constant of the Encoding class
- It can then be referenced as Encoding::IROHA

    ENC_REPLICATE("IROHA", "ASCII-8BIT")

Slide 39

Slide 39 text

Let’s try using Encoding::IROHA
- Encoding.find
- String#encode

Slide 40

Slide 40 text

Encoding.find [1]
Search the encoding with specified name. name should be a string.

    p Encoding.find('IROHA')
    # => #<Encoding:IROHA>

[1] class Encoding - Documentation for Ruby 3.0.0: https://docs.ruby-lang.org/en/3.0.0/Encoding.html#method-c-find
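On a stock Ruby build, where IROHA has not been added, the same lookup can be tried with built-in names. This sketch relies only on documented behavior of Encoding.find (case-insensitive lookup, aliases, ArgumentError for unknown names); encoding_known? is a helper name invented here.

```ruby
# Hypothetical helper: true if Ruby knows an encoding by this name.
def encoding_known?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

p Encoding.find('utf-8')          # lookup is case-insensitive
p Encoding.find('CP857')          # an alias resolves to the encoding it names
p encoding_known?('NO-SUCH-ENCODING')
```

The CP857 lookup is the alias mechanism shown in the enc/ascii.c diff, seen from the Ruby side.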

Slide 41

Slide 41 text

String#encode [1]
Return a copy of string transcoded to encoding.

    p 'いろは'.encode(Encoding::IROHA)
    # => "\x80\x81\x82"

[1] class String - Documentation for Ruby 3.0.0

Slide 42

Slide 42 text

Conversion error

    'α'.encode(Encoding::IROHA)
    # error: in `encode': U+03B1 from UTF-8 to IROHA
    # (Encoding::UndefinedConversionError)
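The same kind of error can be reproduced and handled on a stock Ruby build. The sketch below uses US-ASCII as a stand-in target, since IROHA only exists in the patched build; undefined_char and lossy_encode are helper names made up for the example, while the undef: and replace: options and UndefinedConversionError#error_char are standard.

```ruby
# Hypothetical helper: return the first character that cannot be converted,
# or nil if the whole string converts cleanly.
def undefined_char(str, encoding)
  str.encode(encoding)
  nil
rescue Encoding::UndefinedConversionError => e
  e.error_char
end

# Hypothetical helper: convert, substituting a placeholder for characters
# that the target encoding does not define.
def lossy_encode(str, encoding, placeholder = '?')
  str.encode(encoding, undef: :replace, replace: placeholder)
end

p undefined_char('aαb', Encoding::US_ASCII)
p lossy_encode('aαb', Encoding::US_ASCII)
```

Whether to raise or substitute is a design choice: substitution keeps the program running but loses information.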

Slide 43

Slide 43 text

encode

Slide 44

Slide 44 text

encode

Slide 45

Slide 45 text

Extra
- A gem that adds Encoding::IROHA: gem install encoding_iroha
- It calls private APIs
- See this commit: https://github.com/ima1zumi/encoding_iroha/commit/1a58e8d

Slide 46

Slide 46 text

Conclusion
- Character Encoding
  - Coded Character Set
  - Character Encoding Scheme
- Add Encoding
  - Conversion table
  - Encoding constant

Slide 47

Slide 47 text

References (1/3)
Ruby M17N
- 成瀬 ゆい. "Ruby M17N の設計と実装" [Design and Implementation of Ruby M17N]. Rubyist Magazine. 2009-02-12. https://magazine.rubyist.net/articles/0025/0025-Ruby19_m17n.html (Accessed 2021-08-26)
- Martin J. Dürst. "Ruby M17N". 2008-06-21. https://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html (Accessed 2021-08-26)
- 成瀬 ゆい. "なるせにっき" [Naruse's Diary]. 2008-06-23. https://naruse.hateblo.jp/entries/2008/06/23 (Accessed 2021-08-26)
- 成瀬 ゆい, Martin J. Dürst. "Ruby M17N". 2008-06-21. https://web.archive.org/web/20150925234827/http://jp.rubyist.net/RubyKaigi2008/video/2008-06-21_rubykaigi2008-day1_5.ogg (Accessed 2021-09-06)

Slide 48

Slide 48 text

References (2/3)
Character Encoding
- 成瀬 ゆい. "A Reintroduction To Ruby M17N". 2010-03-03. https://www.slideshare.net/nalsh/a-reintroduction-to-ruby-m17-n (Accessed 2021-08-26)
- 矢野 啓介. [改訂新版]プログラマのための文字コード技術入門 [Introduction to Character Code Technology for Programmers, Revised Edition]. 技術評論社, 2018, 400p. 978-4297102913
- 杜甫々. "文字コード入門" [Introduction to Character Codes]. とほほのWWW入門. 2020-03-01. https://www.tohoho-web.com/ex/charset.html (Accessed 2021-08-26)
- 伊藤 喜一. "(プログラマのための)いまさら聞けない標準規格の話 第1回 文字コード概要編" [Standards You Were Afraid to Ask About (for Programmers), Part 1: Character Code Overview]. 2021-07-14. https://www.ogis-ri.co.jp/otc/hiroba/technical/program_standards/part1.html (Accessed 2021-08-26)
- 小林 龍生・安岡 孝一・戸村 哲・三上 喜貴 編. インターネット時代の文字コード [Character Codes in the Internet Age]. 共立出版, 2002, 277p. 4-320-12038-8
- Jukka K. Korpela. Unicode Explained. O’Reilly, 0-596-10121-X

Slide 49

Slide 49 text

References (3/3)
Unicode
- Unicode. "Unicode Terminology English - Japanese". unknown. http://www.unicode.org/terminology/term_en_ja.html (Accessed 2021-08-26)
- Unicode. "UTR#17: Unicode Character Encoding Model". 2008-11-11. https://unicode.org/reports/tr17/ (Accessed 2021-08-26)
Create IROHA
- larskanis. "Add string encoding IBM720 alias CP720 by larskanis · Pull Request #3803 · ruby/ruby". GitHub. 2020-11-22. https://github.com/ruby/ruby/pull/3803 (Accessed 2021-08-27)
Ruby logo
- "Ruby のロゴについて" [About the Ruby logo]. unknown. https://www.ruby-lang.org/ja/about/logo/ (Accessed 2021-09-10)