Slide 1

Slide 1 text

Ruby Taught Me About Encoding Under the Hood Mari Imaizumi @ima1zumi RubyKaigi 2025 2025-04-16

Slide 2

Slide 2 text

About me Mari Imaizumi @ima1zumi 🍊 Originally from Matsuyama Working at STORES, Inc. A member of IRB team Ruby committer 2

Slide 3

Slide 3 text

About me Mari Imaizumi @ima1zumi Originally from Matsuyama 🏯 Working at STORES, Inc. A member of IRB team Ruby committer 3

Slide 4

Slide 4 text

I'm back on this stage after 20 years 4

Slide 5

Slide 5 text

About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at STORES, Inc. 💻 A member of IRB team Ruby committer 5

Slide 6

Slide 6 text

Nursery Sponsor Speakers: LT: Sponsor booth Let's try IRB Treasure Hunt 💎 6

Slide 7

Slide 7 text

About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at STORES, Inc. A member of IRB team 💎 Ruby committer 7

Slide 8

Slide 8 text

About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at STORES, Inc. A member of IRB team Ruby committer 🆕 8

Slide 9

Slide 9 text

9

Slide 10

Slide 10 text

About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at STORES, Inc. A member of IRB team Ruby committer Character encoding enthusiast 🖋 10

Slide 11

Slide 11 text

Character encoding is still interesting 11

Slide 12

Slide 12 text

Agenda History of Character Encodings Fell down the rabbit hole of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 12

Slide 13

Slide 13 text

History of Character Encodings

Slide 14

Slide 14 text

Character Encoding Character Set: A, B, C, ..., Z, a, b, c, ..., z, 0, 1, 2, ..., 9, SP, !, ", #, ... Character Encoding Scheme: A -> 0x41, B -> 0x42, a -> 0x61, 0 -> 0x30

Slide 15

Slide 15 text

Why do we need character encodings? 15

Slide 16

Slide 16 text

non-electric long-distance communication methods 16

Slide 17

Slide 17 text

Smoke Signals

Slide 18

Slide 18 text

Optical telegraph (Semaphore) Public Domain https://commons.wikimedia.org/wiki/File:Chappe_semaphore.jpg

Slide 19

Slide 19 text

Optical telegraph (Semaphore) Photo by Patrick87 / CC BY-SA 3.0 https://commons.wikimedia.org/wiki/File:Chappe.svg

Slide 20

Slide 20 text

Morse code Late 18th to 19th century: Practical application of electricity 1837–1844: Invention and practical use of Morse code 20

Slide 21

Slide 21 text

Morse code 21 Public Domain https://commons.wikimedia.org/w/index.php?curid=3902977

Slide 22

Slide 22 text

ASCII, EBCDIC
1963: ASCII (American Standard Code for Information Interchange). Established by ASA, now known as ANSI. 7-bit encoding.
1964: EBCDIC (Extended Binary Coded Decimal Interchange Code). Established by IBM. 8-bit encoding.

Slide 23

Slide 23 text

ASCII 23

Slide 24

Slide 24 text

EBCDIC 24

Slide 25

Slide 25 text

Character  ASCII (Hex)  EBCDIC (Hex)
A          \x41         \xC1
B          \x42         \xC2
a          \x61         \x81
b          \x62         \x82
Space      \x20         \x40
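
The comparison above can be checked with a few lines of Ruby. This is only a sketch: the `EBCDIC_BYTES` table and the `byte_pair` helper are hypothetical names, and the table holds just the five values shown on the slide, not a complete EBCDIC code page.

```ruby
# Byte values copied from the slide's table (not a full EBCDIC code page).
EBCDIC_BYTES = {
  "A" => 0xC1,
  "B" => 0xC2,
  "a" => 0x81,
  "b" => 0x82,
  " " => 0x40,
}.freeze

# Returns [ascii_byte, ebcdic_byte] for one character from the table.
def byte_pair(char)
  ascii = char.ord # for ASCII characters the code point equals the ASCII byte
  [ascii, EBCDIC_BYTES.fetch(char)]
end

EBCDIC_BYTES.each_key do |char|
  ascii, ebcdic = byte_pair(char)
  puts format("%-4s ASCII=\\x%02X EBCDIC=\\x%02X", char.inspect, ascii, ebcdic)
end
```

One visible consequence of the two layouts: in ASCII, uppercase letters sort before lowercase ones, while in this EBCDIC excerpt it is the other way around (`a` is \x81, `A` is \xC1).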

Slide 26

Slide 26 text

A Maze of Character Encodings 1967 ISO/IEC 646 1969 JIS X 0201 1978 JIS X 0208 1984 EUC-JP 1986 ISO/IEC 2022 1987 ISO/IEC 8859-1 (Latin-1) 26

Slide 27

Slide 27 text

Unify to Unicode® 1991: Unicode 1.0 27

Slide 28

Slide 28 text

The Unicode Standard A universal character encoding standard Developed by the Unicode Consortium 28

Slide 29

Slide 29 text

Unicode Design Goals Universal: use a single character set for all scripts worldwide, e.g., Latin, Chinese, Hiragana, Katakana, Greek, Cyrillic, Arabic, Hangul, Devanagari, Tamil, etc. Efficient Unambiguous

Slide 30

Slide 30 text

Unicode Code Points Code points (U+xxxx) represent abstract characters. Range: U+0000 to U+10FFFF. Each code point uniquely encodes one abstract character. a U+0061 愛 U+611B Å U+00C5 🍊 U+1F34A

Slide 31

Slide 31 text

UTF-8 Unicode defines a universal set of characters with unique code points. UTF-8 transforms these code points into a variable-length sequence of 1–4 bytes. In short, Unicode is the “what,” and UTF-8 is the “how.”

Slide 32

Slide 32 text

Abstract Character    a                      あ                  🍊
Name                  LATIN SMALL LETTER A   HIRAGANA LETTER A   TANGERINE
Code Point            U+0061                 U+3042              U+1F34A
UTF-8 byte sequence   \x61                   \xE3\x81\x82        \xF0\x9F\x8D\x8A
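
To make the "what" versus "how" distinction concrete, here is a minimal hand-written UTF-8 encoder in Ruby. It is a sketch for illustration, not how Ruby implements encoding internally; `utf8_bytes` is a hypothetical helper name, and the sketch does not reject the surrogate range U+D800 to U+DFFF as a strict encoder would.

```ruby
# Encode one Unicode code point into its UTF-8 byte sequence by hand.
def utf8_bytes(code_point)
  case code_point
  when 0x0000..0x007F   # 1 byte:  0xxxxxxx
    [code_point]
  when 0x0080..0x07FF   # 2 bytes: 110xxxxx 10xxxxxx
    [0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)]
  when 0x0800..0xFFFF   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (code_point >> 12),
     0x80 | ((code_point >> 6) & 0x3F),
     0x80 | (code_point & 0x3F)]
  when 0x10000..0x10FFFF # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (code_point >> 18),
     0x80 | ((code_point >> 12) & 0x3F),
     0x80 | ((code_point >> 6) & 0x3F),
     0x80 | (code_point & 0x3F)]
  else
    raise ArgumentError, "not a Unicode code point: #{code_point}"
  end
end

# The three code points from the table above.
[0x0061, 0x3042, 0x1F34A].each do |cp|
  hex = utf8_bytes(cp).map { |b| format("\\x%02X", b) }.join
  puts format("U+%04X -> %s", cp, hex)
end
```

The results agree with Ruby's built-in `String#bytes`, for example `utf8_bytes(0x3042) == "\u3042".bytes`.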

Slide 33

Slide 33 text

Unicode Specifications Unicode Standard The Unicode Character Database (UCD) Unicode Code Charts Unicode Standard Annexes (UAX)

Slide 34

Slide 34 text

Unify to Unicode 1991: Unicode 1.0 1993: UTF-8 was presented 2008: UTF-8 becomes the most common encoding 2010: Unicode 6.0 released (Emoji added) 2024: Unicode 16.0.0 released 34

Slide 35

Slide 35 text

35 History of Character Encodings

Slide 36

Slide 36 text

Character Encodings and Me I started programming around 2016, when the world was already using Unicode. 36

Slide 37

Slide 37 text

Character Encodings and Me I started programming around 2016, when the world was already using Unicode. The major issues with character encoding proliferation and incompatibility primarily surfaced in the 1990s. 37

Slide 38

Slide 38 text

Character Encodings and Me I started programming around 2016, when the world was already using Unicode. The major issues with character encoding proliferation and incompatibility primarily surfaced in the 1990s. So why did I become a “character encoding enthusiast” in a Unicode era? 38

Slide 39

Slide 39 text

Recap From pre-electric signals like smoke and semaphore, we moved to ASCII and other encodings. Unicode became the universal solution. Even in the Unicode era, deep knowledge of character encodings remains essential. 39

Slide 40

Slide 40 text

Fell down the rabbit hole of character encodings

Slide 41

Slide 41 text

Agenda History of Character Encodings Fell down the rabbit hole of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 41

Slide 42

Slide 42 text

How I Met Character Encodings 2016: My first assignment was on a mainframe COBOL, Assembler, JCL, z/OS

Slide 43

Slide 43 text

Our Development Environment 43 Mainframe z/OS, EBCDIC TSO ISPF EDIT

Slide 44

Slide 44 text

EBCDIC and Japanese EBCDIC uses 8-bit Only 256 characters It’s impossible to fully represent Japanese: Hiragana: about 50 characters Katakana: about 50 characters Joyo kanji (commonly used kanji): 2,136 characters 44

Slide 45

Slide 45 text

EBCDIC Katakana extension 45 CP290 EBCDIC

Slide 46

Slide 46 text

Halfwidth Katakana 46

Slide 47

Slide 47 text

Halfwidth Katakana 47

Slide 48

Slide 48 text

Halfwidth Katakana 48 It's hard to read 🥺

Slide 49

Slide 49 text

Halfwidth Katakana 49

Slide 50

Slide 50 text

EBCDIC with Kanji Even with only 8 bits, there was still a need to input kanji Use Shift-In (SI) and Shift-Out (SO) Control Character 50

Slide 51

Slide 51 text

[Figure: example bytes, with arrows marking the Shift-In and Shift-Out bytes]

Slide 52

Slide 52 text

😵‍💫 [Figure: example bytes, with arrows marking the Shift-In and Shift-Out bytes and the bytes between them]

Slide 53

Slide 53 text

Multiple Character Sets: Complexity Outside alphabets & halfwidth kana, everything was cumbersome in our environment Constantly checked hex bytes to avoid overwriting SI/SO control chars Realized that correct character input isn’t guaranteed 53
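
Why overwriting one control byte is so damaging can be sketched in Ruby. The sketch assumes the usual control codes (Shift-Out = 0x0E switches to double-byte mode, Shift-In = 0x0F switches back) and treats every other byte as data; `split_by_shift_state` is a hypothetical helper, and the double-byte values in the usage example are made up for illustration rather than taken from a real kanji code page.

```ruby
SO = 0x0E # Shift-Out: following bytes are double-byte (kanji) data
SI = 0x0F # Shift-In: back to single-byte data

# Split a byte stream into [mode, bytes] segments based on SO/SI.
def split_by_shift_state(bytes)
  segments = []
  mode = :single
  current = []
  flush = lambda do
    segments << [mode, current] unless current.empty?
    current = []
  end

  bytes.each do |byte|
    case byte
    when SO
      flush.call
      mode = :double
    when SI
      flush.call
      mode = :single
    else
      current << byte
    end
  end
  flush.call
  segments
end

intact  = [0xC1, SO, 0x45, 0x66, SI, 0xC2]
damaged = [0xC1, 0x45, 0x66, SI, 0xC2] # the SO byte was overwritten

p split_by_shift_state(intact)
p split_by_shift_state(damaged)
```

With the SO byte intact, the middle two bytes form one double-byte segment. Once it is lost, the same bytes are read as single-byte characters, which is the kind of corruption the slide describes.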

Slide 54

Slide 54 text

Recap EBCDIC’s limited code space for Japanese required halfwidth kana and SI/SO switching. Accidental overwriting of control characters caused data corruption. 😢 Showing the characters you typed isn't easy 54

Slide 55

Slide 55 text

Agenda History of Character Encodings Fell down the rabbit hole of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 55

Slide 56

Slide 56 text

A Few Years Later… 56

Slide 57

Slide 57 text

Reuniting with Character Encodings Learned Ruby & Ruby on Rails at Fjord Boot Camp 57

Slide 58

Slide 58 text

Reuniting with Character Encodings Learned Ruby & Ruby on Rails at Fjord Boot Camp @igaiga showed me family emoji 👨‍👩‍👧‍👦 that crashed IRB

Slide 59

Slide 59 text

Reuniting with Character Encodings Learned Ruby & Ruby on Rails at Fjord Boot Camp @igaiga showed me family emoji that crashed IRB Reported the issue 59

Slide 60

Slide 60 text

Reuniting with Character Encodings Learned Ruby & Ruby on Rails at Fjord Boot Camp @igaiga showed me family emoji that crashed IRB Reported the issue @aycabta (Reline’s author) said, “Fix it yourself!” 60

Slide 61

Slide 61 text

Reuniting with Character Encodings Learned Ruby & Ruby on Rails at Fjord Boot Camp @igaiga showed me family emoji that crashed IRB Reported the issue @aycabta (Reline’s author) said, “Fix it yourself!” So I did 61

Slide 62

Slide 62 text

What kind of bug was it? 👨‍👩‍👧‍👦 + Backspace + Backspace => IRB crash Why did this happen?

Slide 63

Slide 63 text

63

Slide 64

Slide 64 text

Family emoji 👨‍👩‍👧‍👦

Slide 65

Slide 65 text

Family emoji 👨‍👩‍👧‍👦
"👨‍👩‍👧‍👦".chars.size # => 7

Slide 66

Slide 66 text

Family emoji 👨‍👩‍👧‍👦
"👨‍👩‍👧‍👦".chars.size # => 7
"👨‍👩‍👧‍👦".chars # => ["👨", "‍", "👩", "‍", "👧", "‍", "👦"]

Slide 67

Slide 67 text

Family emoji 👨‍👩‍👧‍👦
"👨‍👩‍👧‍👦".chars.size # => 7
"👨‍👩‍👧‍👦".chars # => ["👨", "‍", "👩", "‍", "👧", "‍", "👦"]
"👨‍👩‍👧‍👦".chars.map { it.ord.to_s(16) } # => ["1f468", "200d", "1f469", "200d", "1f467", "200d", "1f466"]
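
The same string can be measured at three levels: bytes, code points, and grapheme clusters. The sketch below spells the emoji with `\u{...}` escapes so that the invisible Zero Width Joiners cannot be lost in copy and paste; the variable names are illustrative only.

```ruby
# Family emoji written as explicit code points: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
family = "\u{1F468 200D 1F469 200D 1F467 200D 1F466}"

counts = {
  bytes: family.bytesize,                          # UTF-8 bytes
  code_points: family.codepoints.size,             # same as family.chars.size
  grapheme_clusters: family.grapheme_clusters.size # user-perceived characters
}
hex = family.codepoints.map { |cp| cp.to_s(16) }

p counts
p hex
```

Each of the four emoji takes 4 bytes and each joiner takes 3, so the string is 25 bytes, 7 code points, and 1 grapheme cluster.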

Slide 68

Slide 68 text

Paste 👨‍👩‍👧‍👦, Backspace, Backspace
/usr/local/lib/ruby/2.7.0/reline/line_editor.rb:1568:in `-': nil can't be coerced into Integer (TypeError)

Slide 69

Slide 69 text

Terminal columns: 1 2 3 4 5 6 7 8 9 10 11 12
Start: > █
After pasting 👨‍👩‍👧‍👦: > 👨‍👩‍👧‍👦 █

Slide 70

Slide 70 text

Terminal columns: 1 2 3 4 5 6 7 8 9 10 11 12
Start: > █
After pasting 👨‍👩‍👧‍👦: > 👨‍👩‍👧‍👦 █
As code points: > 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦 █
※ ZWJ is Zero Width Joiner (U+200D)

Slide 71

Slide 71 text

Terminal columns: 1 2 3 4 5 6 7 8 9 10 11 12
Start: > █
After pasting 👨‍👩‍👧‍👦: > 👨‍👩‍👧‍👦 █
After Backspace: > █

Slide 72

Slide 72 text

Code Points vs Visible Characters a: 1 code point 👨‍👩‍👧‍👦: 7 code points Still one character visually How to handle? Grapheme Cluster

Slide 73

Slide 73 text

Abstract Character    a                      あ                  🍊
Name                  LATIN SMALL LETTER A   HIRAGANA LETTER A   TANGERINE
Code Point            U+0061                 U+3042              U+1F34A
UTF-8 byte sequence   \x61                   \xE3\x81\x82        \xF0\x9F\x8D\x8A

Slide 74

Slide 74 text

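Grapheme Cluster: a
  Abstract Characters: a
  Name: LATIN SMALL LETTER A
  Code Points: U+0061
  UTF-8 byte sequences: \x61
Grapheme Cluster: あ
  Abstract Characters: あ
  Name: HIRAGANA LETTER A
  Code Points: U+3042
  UTF-8 byte sequences: \xE3\x81\x82
Grapheme Cluster: 🍊
  Abstract Characters: 🍊
  Name: TANGERINE
  Code Points: U+1F34A
  UTF-8 byte sequences: \xF0\x9F\x8D\x8A
Grapheme Cluster: 👨‍👩‍👧‍👦
  Abstract Characters: "👨", "\u200D", "👩", "\u200D", "👧", "\u200D", "👦"
  Name: nil
  Code Points: U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466
  UTF-8 byte sequences: \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA7\xE2\x80\x8D\xF0\x9F\x91\xA6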

Slide 75

Slide 75 text

Grapheme Cluster Multiple code points seen as one character Defined by Unicode UAX #29: Unicode Text Segmentation Ensures user-expected cursor movement & deletion
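
The difference between deleting by code point and deleting by grapheme cluster can be sketched as follows. Both helpers are hypothetical and exist only for illustration; this is not Reline's actual implementation, which also has to track display width and cursor position.

```ruby
# Naive backspace: remove the last code point.
def delete_last_code_point(text)
  text.chars[0...-1].join
end

# Grapheme-aware backspace: remove the last user-perceived character.
def delete_last_grapheme(text)
  text.grapheme_clusters[0...-1].join
end

family = "\u{1F468 200D 1F469 200D 1F467 200D 1F466}"
line = "> " + family

naive = delete_last_code_point(line)
aware = delete_last_grapheme(line)

p naive.codepoints.size # the prompt plus six leftover code points
p aware                 # only the prompt remains
```

The naive version leaves a dangling sequence that ends in a Zero Width Joiner, while the grapheme-aware version removes the whole family emoji in one step, which matches what a user expects from a single Backspace.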

Slide 76

Slide 76 text

Combining Characters and Precomposed Characters
Å: combining U+0041 U+030A, precomposed U+00C5
び: combining U+3072 U+3099, precomposed U+3073
षि: combining U+0937 U+093F, precomposed nil
👨‍👩‍👧‍👦: combining U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466, precomposed nil

Slide 77

Slide 77 text

77

Slide 78

Slide 78 text

Merged into Reline (Ruby 3.0) Explored Grapheme Clusters & Unicode depth Realized text isn’t just simple code points Found encoding to be fascinating, not just tricky 78

Slide 79

Slide 79 text

Recap Family emoji caused IRB to crash because multiple code points formed a single visible character. We added Grapheme Cluster handling in Reline to respect Unicode text segmentation. This fix ensures cursor movement and deletion align with user expectations, revealing the complexity of multi-codepoint characters.

Slide 80

Slide 80 text

Agenda History of Character Encodings Fell down the rabbit hole of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 80

Slide 81

Slide 81 text

Unicode 15.1.0

Slide 82

Slide 82 text

A Few Years Later… 82

Slide 83

Slide 83 text

Ruby Hackathon at RubyWorld Conference 2024 Noticed stalled Unicode updates in Ruby Commented on Redmine and took action 83

Slide 84

Slide 84 text

Upgrading Ruby to Unicode 15.1.0 Why We Needed an Update New rule: Indic_Conjunct_Break for Devanagari, e.g. शक्ति (śakti) Without the update, conjunct sequences like this are not recognized as a single grapheme cluster Staying current improves international text handling

Slide 85

Slide 85 text

Ruby and Unicode [Diagram labels: Unicode Character Database (UnicodeData.txt, DerivedCoreProperties.txt, etc.); Ruby (name2ctype.h, casefold.h)]

Slide 86

Slide 86 text

Ruby Unicode Upgrades New characters added e.g. Unicode 15.1.0 added 627 characters Properties added or updated InCB, Age, etc Aligned with Unicode specs 86

Slide 87

Slide 87 text

Unicode Character Database Defines characters and properties in text files Lists code points, categories, etc. Machine-readable data Ruby references UnicodeData.txt, etc.

Slide 88

Slide 88 text

UnicodeData.txt
0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
...
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0044;LATIN CAPITAL LETTER D;Lu;0;L;;;;;N;;;;0064;
0045;LATIN CAPITAL LETTER E;Lu;0;L;;;;;N;;;;0065;
0046;LATIN CAPITAL LETTER F;Lu;0;L;;;;;N;;;;0066;
...
304C;HIRAGANA LETTER GA;Lo;0;L;304B 3099;;;;N;;;;;
...
094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
...
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
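
Each line of UnicodeData.txt is a semicolon-separated record with 15 fields. A minimal Ruby sketch of reading a few of them is shown below; `UnicodeDataEntry` and `parse_unicode_data_line` are hypothetical names, only five of the fields are kept, and this is not the parser Ruby's build tools use.

```ruby
UnicodeDataEntry = Struct.new(:code_point, :name, :general_category,
                              :decomposition, :simple_lowercase)

# Parse one UnicodeData.txt line into a small struct.
def parse_unicode_data_line(line)
  fields = line.chomp.split(";", -1) # -1 keeps trailing empty fields
  decomposition = fields[5].split
                           .reject { |token| token.start_with?("<") } # skip tags such as <compat>
                           .map(&:hex)
  UnicodeDataEntry.new(
    fields[0].hex,                             # field 0: code point
    fields[1],                                 # field 1: name
    fields[2],                                 # field 2: General_Category
    decomposition,                             # field 5: decomposition mapping
    fields[13].empty? ? nil : fields[13].hex   # field 13: simple lowercase mapping
  )
end

capital_a = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
ga        = parse_unicode_data_line("304C;HIRAGANA LETTER GA;Lo;0;L;304B 3099;;;;N;;;;;")

p capital_a
p ga
```

For HIRAGANA LETTER GA the decomposition field lists U+304B and U+3099, the base letter and the combining voiced sound mark that together are canonically equivalent to U+304C.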

Slide 89

Slide 89 text

DerivedCoreProperties.txt
# DerivedCoreProperties-16.0.0.txt
# Date: 2024-05-31, 18:09:32 GMT
# © 2024 Unicode®, Inc.
# ================================================
# Derived Property: Math
# Generated from: Sm + Other_Math
002B ; Math # Sm PLUS SIGN
003C..003E ; Math # Sm [3] LESS-THAN SIGN..GREATER-THAN SIGN
# ================================================
# Derived Property: Lowercase
# Generated from: Ll + Other_Lowercase
0061..007A ; Lowercase # L& [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
# ================================================
# Derived Property: Uppercase
# Generated from: Lu + Other_Uppercase
0041..005A ; Uppercase # L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z

Slide 90

Slide 90 text

Onigmo A regex engine used by Ruby Supports Unicode property matches: \p{PropertyName} and grapheme cluster matches with \X String#grapheme_clusters also calls \X 90
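
The relationship between `\X` and `String#grapheme_clusters` can be observed from Ruby code. This is a small sketch with an arbitrary sample string; it only compares results and says nothing about how the method is implemented internally.

```ruby
# "a", a tangerine emoji, and "A" followed by COMBINING RING ABOVE.
text = "a\u{1F34A}A\u030A"

clusters_via_regex  = text.scan(/\X/)          # one match per extended grapheme cluster
clusters_via_method = text.grapheme_clusters

p clusters_via_regex.size
p clusters_via_regex == clusters_via_method
```

Both approaches split the four code points into three clusters, because the combining mark stays attached to the preceding "A".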

Slide 91

Slide 91 text

Property
"abc".match?(/\p{ASCII}/) # => true
"あいう".match?(/\p{ASCII}/) # => false
"🏯".match?(/\p{Emoji}/) # => true
"1".match?(/\p{Emoji}/) # => true
https://docs.ruby-lang.org/en/3.4/regexp/unicode_properties_rdoc.html

Slide 92

Slide 92 text

Ruby Unicode Upgrades [Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated test]

Slide 93

Slide 93 text

Ruby's Unicode Update Process Increase the specified Unicode version in the build process Run scripts to auto-generate tables Add new tests (some manually)

Slide 94

Slide 94 text

94

Slide 95

Slide 95 text

95

Slide 96

Slide 96 text

96

Slide 97

Slide 97 text

Issues with the Unicode 15.1.0 Update [Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated test]

Slide 98

Slide 98 text

Issues with the Unicode 15.1.0 Update
Failed to parse the UCD
/* 'InCB': Derived Property */
#endif /* USE_UNICODE_PROPERTIES */
tool/enc-unicode.rb:52:in 'Object#pair_codepoints': undefined method 'sort!' for nil (NoMethodError)
  codepoints.sort!
            ^^^^^^
  from tool/enc-unicode.rb:282:in 'Object#make_const'
  from tool/enc-unicode.rb:441:in 'block (2 levels) in <main>'
  from tool/enc-unicode.rb:434:in 'Array#each'
  from tool/enc-unicode.rb:434:in 'block in <main>'

Slide 99

Slide 99 text

# DerivedCoreProperties-15.1.0.txt
(snip)
# Derived Property: Indic_Conjunct_Break
# Generated from the Grapheme_Cluster_Break, Indic_Syllabic_Category,
# Canonical_Combining_Class, and Script properties as described in UAX #44:
# ================================================
# Indic_Conjunct_Break=Linker
094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA
(snip)
# Total code points: 6
# ================================================
# Indic_Conjunct_Break=Consonant
0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
0958..095F ; InCB; Consonant # Lo [8] DEVANAGARI LETTER QA..DEVANAGARI LETTER YYA
(snip)
# Total code points: 240
# ================================================
# Indic_Conjunct_Break=Extend
0300..036F ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
(snip)
# Total code points: 2192

Slide 100

Slide 100 text

Indic Conjunct Break (InCB) Preserves consonant + linker + consonant as one unit, e.g. क + ् + त = क्त Prevents splitting of Indic ligatures (e.g., Devanagari) Crucial for correct grapheme cluster handling

Slide 101

Slide 101 text

DerivedCoreProperties.txt
# DerivedCoreProperties-15.1.0.txt
# ================================================
# Derived Property: Math
002B ; Math # Sm PLUS SIGN
# ================================================
# Indic_Conjunct_Break=Linker
094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA
# ================================================
# Indic_Conjunct_Break=Consonant
0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
# ================================================
# Indic_Conjunct_Break=Extend
0300..036F ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X

Slide 102

Slide 102 text

Fixed Parsing Logic Updated parsing to correctly handle properties like InCB; Consonant 102
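
The difference between the two line shapes (`code ; Property` and `code ; Property; Value`) can be handled with one pattern that treats the third field as optional. The Ruby sketch below is a hypothetical illustration of that idea; `LINE_PATTERN` and `parse_derived_properties` are made-up names, and this is not the actual change made to tool/enc-unicode.rb.

```ruby
# code point or range ; property name ; optional value, followed by an optional comment
LINE_PATTERN = /\A(\h+)(?:\.\.(\h+))?\s*;\s*(\w+)(?:\s*;\s*(\w+))?\s*(?:#.*)?\z/

# Returns a hash such as { "Math" => [ranges], "InCB=Consonant" => [ranges] }.
def parse_derived_properties(text)
  table = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?("#")
    match = LINE_PATTERN.match(line) or next
    first = match[1].hex
    last  = (match[2] || match[1]).hex
    key   = match[4] ? "#{match[3]}=#{match[4]}" : match[3]
    table[key] << (first..last)
  end
  table
end

SAMPLE = <<~DATA
  # Derived Property: Math
  002B ; Math # Sm PLUS SIGN
  # Indic_Conjunct_Break=Linker
  094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA
  # Indic_Conjunct_Break=Consonant
  0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
DATA

properties = parse_derived_properties(SAMPLE)
properties.each { |name, ranges| puts "#{name}: #{ranges.inspect}" }
```

Keying the table by `Property=Value` keeps binary properties such as Math and enumerated ones such as InCB in the same structure, so a lookup never returns nil for a property that is present in the file.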

Slide 103

Slide 103 text

Successfully generated name2ctype.h [Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated test]

Slide 104

Slide 104 text

Test failed... [Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated test]

Slide 105

Slide 105 text

# GraphemeBreakTest-15.1.0.txt
# Format:
# <string> (# <comment>)?
# <string> contains hex Unicode code points, with
# ÷ wherever there is a break opportunity, and
# × wherever there is not.
#
# These samples may be extended or changed in the future.
#
÷ 0020 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) ÷ [999.0] SPACE (Other) ÷ [0.3]
÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS (Extend_ExtCccZwj) ÷ [999.0] SPACE (Other) ÷ [0.3]
÷ 0020 ÷ 000D ÷ # ÷ [0.2] SPACE (Other) ÷ [5.0] <CARRIAGE RETURN (CR)> (CR) ÷ [0.3]
÷ 0915 × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) × [9.3] DEVANAGARI LETTER TA (ConjunctLinkingScripts_LinkingConsonant) ÷ [0.3]

Slide 106

Slide 106 text

make check
# Running tests:
2) Failure:
TestGraphemeBreaksFromFile#test_each_grapheme_cluster [/Users/mi/ghq/github.com/ruby/ruby/test/ruby/enc/test_grapheme_breaks.rb:67]:
line 1202, expected '["क्त"]', but got '["क्", "त"]', comment: (snip)
<["क्त"]> expected but was
<["क्", "त"]>.

Slide 107

Slide 107 text

https://unicode.org/reports/tr29/ ※ Ruby supports extended grapheme clusters

Slide 108

Slide 108 text

Example: GB11 "👨👩👧👦".match?(/\A\p{Extended_Pictographic}+\z/) => true 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦 x x x ÷ x ÷ ÷ 108

Slide 109

Slide 109 text

109 GB9c \p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* × \p{InCB=Consonant}

Slide 110

Slide 110 text

GB9c क्त = क + ् + त
U+0915 DEVANAGARI LETTER KA: InCB=Consonant
U+094D DEVANAGARI SIGN VIRAMA: InCB=Linker
U+0924 DEVANAGARI LETTER TA: InCB=Consonant

Slide 111

Slide 111 text

111 क्त \p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* × \p{InCB=Consonant} क

Slide 112

Slide 112 text

112 क्त \p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* × \p{InCB=Consonant} क nil ्

Slide 113

Slide 113 text

113 क्त \p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* × \p{InCB=Consonant} क nil ् nil त

Slide 114

Slide 114 text

114 क्त \p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* × \p{InCB=Consonant} क nil ् nil त
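
The walkthrough above can be approximated in plain Ruby without regex property support. This is a sketch under clear assumptions: the `INCB` table contains only the few ranges quoted from DerivedCoreProperties-15.1.0.txt in this talk (the real property covers many more code points), and `incb` and `gb9c_no_break?` are hypothetical helpers, not Onigmo's implementation.

```ruby
# Excerpt of Indic_Conjunct_Break values, taken from the ranges shown in the slides.
INCB = {
  consonant: [0x0915..0x0939, 0x0958..0x095F],
  linker:    [0x094D..0x094D],
  extend:    [0x0300..0x036F],
}.freeze

# Returns :consonant, :linker, :extend, or nil for a code point.
def incb(code_point)
  INCB.each do |value, ranges|
    return value if ranges.any? { |range| range.cover?(code_point) }
  end
  nil
end

# True when GB9c forbids a break between codepoints[index - 1] and codepoints[index]:
#   Consonant [Extend Linker]* Linker [Extend Linker]* × Consonant
def gb9c_no_break?(codepoints, index)
  return false unless incb(codepoints[index]) == :consonant

  seen_linker = false
  (index - 1).downto(0) do |i|
    case incb(codepoints[i])
    when :linker    then seen_linker = true
    when :extend    then next
    when :consonant then return seen_linker
    else return false
    end
  end
  false
end

kta = [0x0915, 0x094D, 0x0924] # KA, VIRAMA, TA
p gb9c_no_break?(kta, 2)              # is a break forbidden before TA?
p gb9c_no_break?([0x0915, 0x0924], 1) # two consonants with no linker between them
```

Walking backwards from the second consonant mirrors the rule's shape: any run of Extend and Linker code points is skipped, but at least one Linker must be seen before reaching the first consonant.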

Slide 115

Slide 115 text

node_extended_grapheme_cluster Builds the internal node structure for \X Implements complex Unicode Grapheme Break rules Creates ALT/SEQ/CCLASS nodes for CR, LF, Control, etc. Hard-coded logic that must stay synced with Unicode updates 115

Slide 116

Slide 116 text

static int
node_extended_grapheme_cluster(Node** np, ScanEnv* env)
{
  ...
  /* xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})* */
  {
    Node **XP_list = core_alts + 5; /* size: 3 */
    R_ERR(create_property_node(XP_list+0, env, "Extended_Pictographic"));

    /* (Extend* ZWJ \p{Extended_Pictographic})* */
    {
      Node **Ex_list = XP_list + 2; /* size: 4 */
      R_ERR(quantify_property_node(Ex_list+0, env, "Grapheme_Cluster_Break=Extend", '*'));

      /* ZWJ (ZERO WIDTH JOINER) */
      r = ONIGENC_CODE_TO_MBC(env->enc, 0x200D, buf);
      if (r < 0) goto err;
      Ex_list[1] = node_new_str_raw(buf, buf + r);
      if (IS_NULL(Ex_list[1])) goto err;

      R_ERR(create_property_node(Ex_list+2, env, "Extended_Pictographic"));

      R_ERR(create_node_from_array(LIST, XP_list+1, Ex_list));
    }
    R_ERR(quantify_node(XP_list+1, 0, REPEAT_INFINITE)); /* TODO: Check about node freeing */

    R_ERR(create_node_from_array(LIST, core_alts+4, XP_list));
  }

Slide 117

Slide 117 text

Create nodes in Onigmo to align with the regex engine
/* conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ */
{
  // \p{InCB=Consonant}
  Node **CC_list = core_alts + 6; /* size: 3 */
  R_ERR(create_property_node(CC_list+0, env, "InCB=Consonant"));
  {
    Node **CC_inner_list = CC_list + 2; /* size: 5 */
    {
      // [\p{InCB=Extend} \p{InCB=Linker}]*
      R_ERR(create_property_node(CC_inner_list+0, env, "InCB=Extend"));
      R_ERR(add_property_to_cc(NCCLASS(CC_inner_list[0]), "InCB=Linker", 0, env));
      R_ERR(quantify_node(CC_inner_list+0, 0, REPEAT_INFINITE));
    }
    R_ERR(quantify_node(CC_list+1, 1, REPEAT_INFINITE));
  }
  (snip)

Slide 118

Slide 118 text

Grapheme Clusters Implementation Create Nodes for \p{InCB=Consonant} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant} All tests passed!

Slide 119

Slide 119 text

Unicode 15.1.0 Devanagari consonant clusters no longer split
# Before
"क्त".grapheme_clusters # => ["क्", "त"]
# After
"क्त".grapheme_clusters # => ["क्त"]

Slide 120

Slide 120 text

Merged! 120

Slide 121

Slide 121 text

Recap Ruby now supports Unicode 15.1.0, adding Indic_Conjunct_Break (InCB) for Devanagari ligatures. Onigmo’s grapheme cluster logic (\X) was updated with new break rules (GB9c). Devanagari consonant clusters (e.g., क्त) no longer split. In Ruby 3.5, Unicode 15.1.0 is available.

Slide 122

Slide 122 text

Agenda History of Character Encodings Fell down the rabbit hole of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 122

Slide 123

Slide 123 text

Future works: Unicode 16.0.0

Slide 124

Slide 124 text

Unicode 16.0.0 Ruby's Unicode 16.0.0 update is currently in progress (WIP). Normalization tests are failing. [Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated test]

Slide 125

Slide 125 text

Unicode Normalization Unicode normalization unifies strings that look identical but differ internally. NFD/NFC use canonical equivalence (e.g., e + U+0301 → é). NFKD/NFKC use compatibility equivalence (e.g., ① → 1). Normalization reduces search mismatches and security risks. It prevents garbled text across OS/file systems and boosts data compatibility.
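
The four normalization forms are available in Ruby through `String#unicode_normalize`. The short sketch below uses "e" plus U+0301 for canonical equivalence; U+2460 (CIRCLED DIGIT ONE) is my own choice of compatibility example, and the variable names are illustrative.

```ruby
decomposed = "e\u0301" # "e" followed by COMBINING ACUTE ACCENT
composed   = "\u00E9"  # precomposed LATIN SMALL LETTER E WITH ACUTE

# The two strings render the same but compare as different byte sequences.
same_before = decomposed == composed

# Canonical equivalence: NFC composes, NFD decomposes.
nfc = decomposed.unicode_normalize(:nfc)
nfd = composed.unicode_normalize(:nfd)

# Compatibility equivalence: NFKC also folds compatibility characters.
circled_one = "\u2460"
nfkc = circled_one.unicode_normalize(:nfkc)
nfc_of_circled = circled_one.unicode_normalize(:nfc)

p same_before
p nfc == composed
p nfd == decomposed
p nfkc
p nfc_of_circled == circled_one
```

Comparing strings only after normalizing both sides to the same form is what removes the search mismatches mentioned above; the circled digit is unchanged by NFC and only becomes "1" under the compatibility forms.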

Slide 126

Slide 126 text

126 1611E 16123 1611E 1611E 1611F NFD NFC 16121 1611F 16126 Expected 1611E 16123 1611E 1611E 1611F NFD 1611E 16123 Actual NFC

Slide 127

Slide 127 text

Implementation Rewrote most of the normalization logic Just for understanding https://github.com/ruby/ruby/pull/13117 Referenced Rust's unicode-normalization project https://github.com/unicode-rs/unicode-normalization 127

Slide 128

Slide 128 text

Future works All tests pass, but performance may have regressed. I removed optimizations temporarily. Plan to maintain performance while upgrading to Unicode 16.0.0. Considering “Quick Check” for faster validation. 128

Slide 129

Slide 129 text

Acknowledgements

Slide 130

Slide 130 text

Acknowledgments Thank you, STORES, Inc. team, for your reviews and scheduling support. Special thanks to fujimura-san, @ko1, and @mame for repeatedly reviewing my work. My husband Takuya’s support made this presentation possible. I’m also grateful to everyone who reviewed my PRs and provided valuable advice. 130

Slide 131

Slide 131 text

RubyKaigi 2025 Has Begun!

Slide 132

Slide 132 text

132 https://x.com/spikeolaf/status/1909531905747484889

Slide 133

Slide 133 text

Ask the Speaker Share Your Thoughts 133

Slide 134

Slide 134 text

👨‍👩‍👧‍👦

Slide 135

Slide 135 text

Value Curiosity 135

Slide 136

Slide 136 text

No content