About me
Mari Imaizumi @ima1zumi
Originally from Matsuyama
Working at STORES, Inc.
A member of IRB team 💎
Ruby committer 🆕
Character encoding enthusiast 🖋
10
Slide 11
Slide 11 text
Character encoding is
still interesting
11
Slide 12
Slide 12 text
Agenda
History of Character Encodings
Fell down the rabbit hole of character encodings
Encounter with EBCDIC
The pitfalls of character counting
Upgrading Ruby to Unicode 15.1.0
Future works
12
Slide 13
Slide 13 text
History of Character
Encodings
Slide 14
Slide 14 text
Character Encoding
Character Set
A, B, C, ..., Z,
a, b, c, ..., z,
0, 1, 2, ..., 9,
SP, !, ", #, ...
Character Encoding Scheme
A -> 0x41
B -> 0x42
a -> 0x61
0 -> 0x30
Slide 15
Slide 15 text
Why do we need character
encodings?
15
Slide 16
Slide 16 text
non-electric long-distance
communication methods
16
Slide 17
Slide 17 text
Smoke Signals
Slide 18
Slide 18 text
Optical telegraph (Semaphore)
18
Public Domain
https://commons.wikimedia.org/wiki/File:Chappe_semaphore.jpg
Slide 19
Slide 19 text
19
Photo by Patrick87 / CC BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Chappe.svg
Optical telegraph (Semaphore)
Slide 20
Slide 20 text
Morse code
Late 18th to 19th century:
Practical application of electricity
1837–1844:
Invention and practical use of Morse code
20
Slide 21
Slide 21 text
Morse code
21
Public Domain
https://commons.wikimedia.org/w/index.php?curid=3902977
Slide 22
Slide 22 text
ASCII, EBCDIC
1963 ASCII
American Standard Code for Information Interchange
Established by ASA (now known as ANSI)
7-bit encoding
1964 EBCDIC
Extended Binary Coded Decimal Interchange Code
8-bit encoding, established by IBM
22
Slide 23
Slide 23 text
ASCII
23
Slide 24
Slide 24 text
EBCDIC
24
Slide 25
Slide 25 text
25
Character  ASCII (Hex)  EBCDIC (Hex)
A          \x41         \xC1
B          \x42         \xC2
a          \x61         \x81
b          \x62         \x82
Space      \x20         \x40
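The table above can be reproduced in Ruby. This is a minimal sketch, assuming the IBM037 code page (one common EBCDIC variant, picked here for illustration since the slide does not name a code page) and assuming the running Ruby ships a transcoder for it. The helper name `hex_bytes` is hypothetical.

```ruby
# Encode a string in the given encoding and render its bytes as \xNN.
def hex_bytes(str, encoding)
  str.encode(encoding).bytes.map { |b| format("\\x%02X", b) }.join
end

["A", "B", "a", "b", " "].each do |ch|
  ascii  = hex_bytes(ch, "US-ASCII")
  ebcdic = hex_bytes(ch, "IBM037") # assumption: IBM037 stands in for "EBCDIC"
  puts format("%-5s ASCII=%s EBCDIC=%s", ch.inspect, ascii, ebcdic)
end
```

The same character maps to different bytes in each encoding, so bytes alone are meaningless without knowing which encoding produced them.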
Slide 26
Slide 26 text
A Maze of Character Encodings
1967 ISO/IEC 646
1969 JIS X 0201
1978 JIS X 0208
1984 EUC-JP
1986 ISO/IEC 2022
1987 ISO/IEC 8859-1 (Latin-1)
26
Slide 27
Slide 27 text
Unify to Unicode®
1991: Unicode 1.0
27
Slide 28
Slide 28 text
The Unicode Standard
A universal character encoding standard
Developed by the Unicode Consortium
28
Slide 29
Slide 29 text
Unicode Design Goals
Universal
Use a single character set for all scripts worldwide
e.g., Latin, Chinese, Hiragana, Katakana, Greek, Cyrillic,
Arabic, Hangul, Devanagari, Tamil, etc.
Efficient
Unambiguous
a あ Ω क Д ن ઑ அ
Slide 30
Slide 30 text
Unicode Code Points
Code points (U+xxxx) to represent abstract characters
U+0000 to U+10FFFF
Each code point uniquely encodes one abstract character
a: U+0061
愛: U+611B
Å: U+00C5
🍊: U+1F34A
30
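In Ruby, the code point of a character can be inspected with `String#ord`. A small sketch using the four characters from this slide (the helper name `code_point_label` is hypothetical):

```ruby
# Format a single character's code point in the usual U+XXXX notation.
def code_point_label(char)
  format("U+%04X", char.ord)
end

# a, 愛, Å, 🍊 written as escapes so the file encoding does not matter
["a", "\u611B", "\u00C5", "\u{1F34A}"].each do |char|
  puts "#{char} #{code_point_label(char)}"
end

# Going the other way: build a string from a code point.
tangerine = [0x1F34A].pack("U")
puts tangerine
```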
Slide 31
Slide 31 text
UTF-8
Unicode defines a universal set of characters with unique
code points.
UTF-8 transforms these code points into a variable-length
sequence of 1–4 bytes.
In short, Unicode is the “what,” and UTF-8 is the “how.”
31
Slide 32
Slide 32 text
32
Abstract Character    a                     あ                 🍊
Name                  LATIN SMALL LETTER A  HIRAGANA LETTER A  TANGERINE
Code Point            U+0061                U+3042             U+1F34A
UTF-8 byte sequences  \x61                  \xE3\x81\x82       \xF0\x9F\x8D\x8A
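To make the "how" concrete, here is a rough sketch of the UTF-8 transformation written by hand. It is for illustration only: the method name `utf8_bytes` is hypothetical, it does not reject surrogate code points, and real code should rely on Ruby's built-in encoding support instead.

```ruby
# Turn one Unicode code point into its UTF-8 byte sequence (1 to 4 bytes).
def utf8_bytes(cp)
  case cp
  when 0x0000..0x007F     # 1 byte:  0xxxxxxx
    [cp]
  when 0x0080..0x07FF     # 2 bytes: 110xxxxx 10xxxxxx
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
  when 0x0800..0xFFFF     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  when 0x10000..0x10FFFF  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  else
    raise ArgumentError, "not a Unicode code point: #{cp}"
  end
end

[0x0061, 0x3042, 0x1F34A].each do |cp|
  hex = utf8_bytes(cp).map { |b| format("\\x%02X", b) }.join
  puts format("U+%04X -> %s", cp, hex)
end
```

The results match the byte sequences in the table, and they agree with what `[cp].pack("U").bytes` returns.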
Slide 33
Slide 33 text
Unicode Specifications
Unicode Standard
The Unicode Character Database (UCD)
Unicode Code Charts
Unicode Standard Annexes (UAX)
33
Slide 34
Slide 34 text
Unify to Unicode
1991: Unicode 1.0
1993: UTF-8 was presented
2008: UTF-8 becomes the most common encoding
2010: Unicode 6.0 released (Emoji added)
2024: Unicode 16.0.0 released
34
Slide 35
Slide 35 text
35
History of Character Encodings
Slide 36
Slide 36 text
Character Encodings and Me
I started programming around 2016, when the world was
already using Unicode.
The major issues with character encoding proliferation
and incompatibility primarily surfaced in the 1990s.
So why did I become a “character encoding enthusiast”
in a Unicode era?
38
Slide 39
Slide 39 text
Recap
From pre-electric signals like smoke and semaphore, we
moved to ASCII and other encodings.
Unicode became the universal solution.
Even in the Unicode era, deep knowledge of character
encodings remains essential.
39
Slide 40
Slide 40 text
Fell down the rabbit hole of
character encodings
Slide 42
Slide 42 text
How I Met Character Encodings
2016: My first assignment was on a mainframe
COBOL, Assembler, JCL, z/OS
42
Slide 43
Slide 43 text
Our Development Environment
43
Mainframe
z/OS, EBCDIC
TSO
ISPF EDIT
Slide 44
Slide 44 text
EBCDIC and Japanese
EBCDIC uses 8-bit
Only 256 characters
It’s impossible to fully represent Japanese:
Hiragana: about 50 characters
Katakana: about 50 characters
Joyo kanji (commonly used kanji): 2,136 characters
44
Slide 45
Slide 45 text
EBCDIC Katakana extension
45
CP290 EBCDIC
Slide 46
Slide 46 text
Halfwidth Katakana
46
Slide 47
Slide 47 text
Halfwidth Katakana
47
Slide 48
Slide 48 text
Halfwidth Katakana
48
It's hard to read 🥺
Slide 49
Slide 49 text
Halfwidth Katakana
49
Slide 50
Slide 50 text
EBCDIC with Kanji
Even with only 8 bits, there was still a need to input kanji
Use Shift-In (SI) and Shift-Out (SO)
Control Character
50
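A rough sketch of how SI/SO switching splits a byte stream. SO (0x0E) switches to double-byte mode and SI (0x0F) switches back to single-byte mode. The method name `segments` and the sample byte values are hypothetical placeholders, not real kanji codes.

```ruby
SO = 0x0E # Shift-Out: following bytes are read as double-byte characters
SI = 0x0F # Shift-In:  back to single-byte characters

# Split a byte array into [mode, bytes] runs, dropping the SO/SI bytes.
def segments(bytes)
  result = []
  mode = :single
  current = []
  bytes.each do |byte|
    if byte == SO || byte == SI
      result << [mode, current] unless current.empty?
      current = []
      mode = (byte == SO ? :double : :single)
    else
      current << byte
    end
  end
  result << [mode, current] unless current.empty?
  result
end

intact  = [0xC1, SO, 0x45, 0x66, SI, 0xC2] # placeholder data bytes
damaged = [0xC1, SO, 0x45, 0x66, 0x40, 0xC2] # SI overwritten with a space
p segments(intact)
p segments(damaged)
```

In the damaged case the trailing single-byte data is still read in double-byte mode, which is the kind of corruption that overwriting a control character causes.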
Slide 51
Slide 51 text
51
example bytes
Shift-In, Shift-Out
bytes ↓
Slide 52
Slide 52 text
52
😵💫
Shift-In, Shift-Out
bytes ↓
bytes ↓
example bytes
Slide 53
Slide 53 text
Multiple Character Sets:
Complexity
Outside alphabets & halfwidth kana, everything was
cumbersome in our environment
Constantly checked hex bytes to avoid overwriting SI/SO
control chars
Realized that correct character input isn’t guaranteed
53
Slide 54
Slide 54 text
Recap
EBCDIC’s limited code space for Japanese required
halfwidth kana and SI/SO switching.
Accidental overwriting of control characters caused data
corruption. 😢
Showing the characters you typed isn't easy
54
Slide 56
Slide 56 text
A Few Years Later…
56
Slide 57
Slide 57 text
Reuniting with Character
Encodings
Learned Ruby & Ruby on Rails at Fjord Boot Camp
@igaiga showed me family emoji 🧑🧑🧒🧒 that crashed IRB
Reported the issue
@aycabta (Reline’s author) said, “Fix it yourself!”
So I did
61
Slide 62
Slide 62 text
What kind of bug was it?
🧑🧑🧒🧒 + Backspace + Backspace => IRB crash
Why did this happen?
62
Code Points vs Visible Characters
a: 1 code point
🧑‍🧑‍🧒‍🧒: 7 code points
Still one character visually
How to handle?
Grapheme Cluster
Slide 74
Slide 74 text
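Grapheme Cluster: a
  Abstract Characters: a
  Name: LATIN SMALL LETTER A
  Code Points: U+0061
  UTF-8 byte sequences: \x61
Grapheme Cluster: あ
  Abstract Characters: あ
  Name: HIRAGANA LETTER A
  Code Points: U+3042
  UTF-8 byte sequences: \xE3\x81\x82
Grapheme Cluster: 🍊
  Abstract Characters: 🍊
  Name: TANGERINE
  Code Points: U+1F34A
  UTF-8 byte sequences: \xF0\x9F\x8D\x8A
Grapheme Cluster: 👨‍👩‍👧‍👦
  Abstract Characters: "👨", "\u200D", "👩", "\u200D", "👧", "\u200D", "👦"
  Name: nil
  Code Points: U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466
  UTF-8 byte sequences: \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA7\xE2\x80\x8D\xF0\x9F\x91\xA6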
Slide 75
Slide 75 text
Grapheme Cluster
Multiple code points seen as one character
Defined by Unicode
UAX #29: Unicode Text Segmentation
Ensures user-expected cursor movement & deletion
75
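A minimal sketch of why deletion has to work on grapheme clusters rather than code points. It does not reproduce the Reline code or the crash itself; the helper names are hypothetical, and the family emoji is written with escapes (man, ZWJ, woman, ZWJ, girl, ZWJ, boy).

```ruby
FAMILY = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"

# Backspace that removes the last user-perceived character.
def delete_last_grapheme(text)
  text.grapheme_clusters[0...-1].join
end

# Naive backspace that removes only the last code point.
def delete_last_code_point(text)
  text[0...-1]
end

puts "bytes:             #{FAMILY.bytesize}"
puts "code points:       #{FAMILY.codepoints.size}"
puts "grapheme clusters: #{FAMILY.grapheme_clusters.size}"

line = "a" + FAMILY
p delete_last_grapheme(line)              # the whole emoji is removed
p delete_last_code_point(line).codepoints # a dangling ZWJ is left at the end
```

The naive version leaves a partial sequence behind, so the cursor position and what is shown on screen no longer agree.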
Merged into Reline (Ruby 3.0)
Explored Grapheme Clusters & Unicode depth
Realized text isn’t just simple code points
Found encoding to be fascinating, not just tricky
78
Slide 79
Slide 79 text
Recap
Family emoji caused IRB to crash because multiple code
points formed a single visible character.
We added Grapheme Cluster handling in Reline to respect
Unicode text segmentation.
This fix ensures cursor movement and deletion align with
user expectations, revealing the complexity of
multi-codepoint characters.
79
Slide 81
Slide 81 text
Unicode 15.1.0
Slide 82
Slide 82 text
A Few Years Later…
82
Slide 83
Slide 83 text
Ruby Hackathon at RubyWorld
Conference 2024
Noticed stalled Unicode updates
in Ruby
Commented on Redmine and took
action
83
Slide 84
Slide 84 text
Upgrading Ruby to Unicode 15.1.0
Why We Needed an Update
New rule: Indic_Conjunct_Break for Devanagari
e.g. शक्ति (śakti)
Without update, combined chars aren’t recognized as one
Staying current improves international text handling
84
Slide 85
Slide 85 text
Ruby and Unicode
[Diagram: Unicode Character Database files (UnicodeData.txt, DerivedCoreProperties.txt, etc.) -> Ruby (name2ctype.h, casefold.h)]
Slide 86
Slide 86 text
Ruby Unicode Upgrades
New characters added
e.g. Unicode 15.1.0 added 627 characters
Properties added or updated
InCB, Age, etc
Aligned with Unicode specs
86
Slide 87
Slide 87 text
Unicode Character Database
Defines characters and properties in text files
Lists code points, categories, etc
Machine-readable data
Ruby references UnicodeData.txt, etc.
87
Slide 88
Slide 88 text
UnicodeData.txt
0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
...
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0044;LATIN CAPITAL LETTER D;Lu;0;L;;;;;N;;;;0064;
0045;LATIN CAPITAL LETTER E;Lu;0;L;;;;;N;;;;0065;
0046;LATIN CAPITAL LETTER F;Lu;0;L;;;;;N;;;;0066;
...
304C;HIRAGANA LETTER GA;Lo;0;L;304B 3099;;;;N;;;;;
...
094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
...
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
88
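Each UnicodeData.txt line is a semicolon-separated record with 15 fields. A small sketch that reads a few of them (the struct and method names are hypothetical, and this is not how Ruby's own tooling parses the file):

```ruby
UnicodeDataEntry = Struct.new(:code_point, :name, :general_category, :simple_lowercase)

# Parse one UnicodeData.txt line. Field 0 is the code point, 1 the name,
# 2 the General_Category, and 13 the simple lowercase mapping (may be empty).
def parse_unicode_data_line(line)
  fields = line.chomp.split(";", -1) # -1 keeps trailing empty fields
  lower = fields[13]
  UnicodeDataEntry.new(
    fields[0].hex,
    fields[1],
    fields[2],
    lower.empty? ? nil : lower.hex
  )
end

entry = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
puts format("U+%04X %s (%s) lowercase: U+%04X",
            entry.code_point, entry.name, entry.general_category, entry.simple_lowercase)
```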
Onigmo
A regex engine used by Ruby
Supports Unicode property matches: \p{PropertyName}
and grapheme cluster matches with \X
String#grapheme_clusters also calls \X
90
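Both features can be tried from plain Ruby. A short sketch (the sample string is an illustration: "a", HIRAGANA LETTER A, and a family emoji written with escapes):

```ruby
text = "a\u3042\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"

# \p{...} matches characters by Unicode property.
hiragana = text.scan(/\p{Hiragana}/)

# \X matches one extended grapheme cluster at a time.
clusters = text.scan(/\X/)

p hiragana
p clusters.size
p clusters == text.grapheme_clusters
```

Because both paths go through the same `\X` logic, updating the regex engine's break rules also changes what `String#grapheme_clusters` returns.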
Ruby Unicode Upgrades
[Diagram: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc. -> enc-unicode.rb -> name2ctype.h, casefold.h (auto generated) -> test]
Slide 93
Slide 93 text
Ruby's Unicode Update Process
Increase the specified Unicode version in the build process
Run scripts to auto-generate tables
Add new tests (some manually)
93
Slide 94
Slide 94 text
94
Slide 95
Slide 95 text
95
Slide 96
Slide 96 text
96
Slide 97
Slide 97 text
Issues with the Unicode 15.1.0
Update
97
Slide 98
Slide 98 text
Issues with the Unicode 15.1.0
Update
Failed to parse the UCD
98
/* 'InCB': Derived Property */
#endif /* USE_UNICODE_PROPERTIES */
tool/enc-unicode.rb:52:in 'Object#pair_codepoints': undefined method 'sort!' for nil (NoMethodError)
codepoints.sort!
^^^^^^
from tool/enc-unicode.rb:282:in 'Object#make_const'
from tool/enc-unicode.rb:441:in 'block (2 levels) in <main>'
from tool/enc-unicode.rb:434:in 'Array#each'
from tool/enc-unicode.rb:434:in 'block in <main>'
Slide 99
Slide 99 text
99
# DerivedCoreProperties-15.1.0.txt
(snip)
# Derived Property: Indic_Conjunct_Break
# Generated from the Grapheme_Cluster_Break, Indic_Syllabic_Category,
# Canonical_Combining_Class, and Script properties as described in UAX #44:
# ================================================
# Indic_Conjunct_Break=Linker
094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA
(snip)
# Total code points: 6
# ================================================
# Indic_Conjunct_Break=Consonant
0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
0958..095F ; InCB; Consonant # Lo [8] DEVANAGARI LETTER QA..DEVANAGARI LETTER YYA
(snip)
# Total code points: 240
# ================================================
# Indic_Conjunct_Break=Extend
0300..036F ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
(snip)
# Total code points: 2192
Slide 100
Slide 100 text
Indic Conjunct Break (InCB)
Preserves consonant + linker + consonant as one unit
e.g. क + ् + त = क्त
Prevents splitting of Indic ligatures (e.g., Devanagari)
Crucial for correct grapheme cluster handling
100
Slide 101
Slide 101 text
DerivedCoreProperties.txt
# DerivedCoreProperties-15.1.0.txt
# ================================================
# Derived Property: Math
002B ; Math # Sm PLUS SIGN
# ================================================
# Indic_Conjunct_Break=Linker
094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA
# ================================================
# Indic_Conjunct_Break=Consonant
0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
# ================================================
# Indic_Conjunct_Break=Extend
0300..036F ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
101
Slide 102
Slide 102 text
Fixed Parsing Logic
Updated parsing to correctly handle properties like InCB; Consonant
102
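The difference is that older entries have two fields (code points; property) while InCB entries have three (code points; property; value). A hypothetical sketch of a parser that accepts both shapes. This is not the actual tool/enc-unicode.rb code, and the names `PROPERTY_LINE` and `parse_derived_property` are made up for illustration.

```ruby
# Matches "XXXX" or "XXXX..YYYY", then "; Property", then an optional "; Value".
PROPERTY_LINE = /\A(\h+)(?:\.\.(\h+))?\s*;\s*(\w+)(?:\s*;\s*(\w+))?/

# Returns [property_name, code_point_range], or nil for comments and blank lines.
def parse_derived_property(line)
  m = PROPERTY_LINE.match(line)
  return nil unless m

  first = m[1].hex
  last  = (m[2] || m[1]).hex
  name  = m[4] ? "#{m[3]}=#{m[4]}" : m[3] # e.g. "InCB=Consonant"
  [name, first..last]
end

p parse_derived_property("002B ; Math # Sm PLUS SIGN")
p parse_derived_property("094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA")
p parse_derived_property("0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA")
```

Keying the table on "property=value" keeps InCB=Linker, InCB=Consonant, and InCB=Extend as separate sets instead of merging them under one name.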
Slide 103
Slide 103 text
Successfully generated
name2ctype.h
103
Slide 104
Slide 104 text
Test failed...
104
Slide 105
Slide 105 text
105
# GraphemeBreakTest-15.1.0.txt
# Format:
# <string> (# <comment>)?
# <string> contains hex Unicode code points, with
# ÷ wherever there is a break opportunity, and
# × wherever there is not.
#
# These samples may be extended or changed in the future.
#
÷ 0020 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) ÷ [999.0] SPACE (Other) ÷ [0.3]
÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS (Extend_ExtCccZwj) ÷ [999.0] SPACE (Other) ÷ [0.3]
÷ 0020 ÷ 000D ÷ # ÷ [0.2] SPACE (Other) ÷ [5.0] <CARRIAGE RETURN (CR)> (CR) ÷ [0.3]
÷ 0915 × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) × [9.3] DEVANAGARI LETTER TA (ConjunctLinkingScripts_LinkingConsonant) ÷ [0.3]
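A sketch of how one test line can be turned into the expected clusters and compared with Ruby's result. This is an illustration, not the code in test_grapheme_breaks.rb, and the method name `expected_clusters` is hypothetical.

```ruby
# Convert a GraphemeBreakTest.txt data line into the expected clusters.
# "÷" separates clusters; "×" joins code points inside one cluster.
def expected_clusters(line)
  data = line.sub(/#.*/, "").strip # drop the trailing comment
  data.split("÷").map(&:strip).reject(&:empty?).map do |cluster|
    cluster.split("×").map { |hex| hex.strip.hex }.pack("U*")
  end
end

line = "÷ 0020 × 0308 ÷ 0020 ÷ # SPACE, COMBINING DIAERESIS, SPACE"
expected = expected_clusters(line)
actual = expected.join.grapheme_clusters

p expected
p actual == expected
```

For the Devanagari line (0915 × 094D × 0924) the comparison only succeeds on a Ruby whose break rules include the Unicode 15.1.0 update, which is what the failing test reports.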
Slide 106
Slide 106 text
106
make check
# Running tests:
2) Failure:
TestGraphemeBreaksFromFile#test_each_grapheme_cluster [/Users/mi/ghq/github.com/ruby/ruby/test/ruby/enc/test_grapheme_breaks.rb:67]:
line 1202, expected '["क्त"]', but got '["क्", "त"]', comment: (snip)
<["क्त"]> expected but was <["क्", "त"]>.
node_extended_grapheme_cluster
Builds the internal node structure for \X
Implements complex Unicode Grapheme Break rules
Creates ALT/SEQ/CCLASS nodes for CR, LF, Control, etc.
Hard-coded logic that must stay synced with Unicode
updates
115
Grapheme Clusters
Implementation
Create Nodes for
\p{InCB=Consonant} [\p{InCB=Extend} \p{InCB=Linker}]*
\p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]*
\p{InCB=Consonant}
All tests passed!
118
Slide 119
Slide 119 text
Unicode 15.1.0
Devanagari consonant clusters no longer split
119
# Before
"क्त".grapheme_clusters
# => ["क्", "त"]
# After
"क्त".grapheme_clusters
# => ["क्त"]
Slide 120
Slide 120 text
Merged!
120
Slide 121
Slide 121 text
Recap
121
Ruby now supports Unicode 15.1.0, adding
Indic_Conjunct_Break (InCB) for Devanagari ligatures.
Onigmo’s grapheme cluster logic (\X) was updated with
new break rules (GB9c).
Devanagari consonant clusters (e.g., क्त) no longer split
In Ruby 3.5, Unicode 15.1.0 is available.
Slide 123
Slide 123 text
Future works:
Unicode 16.0.0
Slide 124
Slide 124 text
Unicode 16.0.0
Ruby's Unicode 16.0.0 update currently in progress
Normalization tests are failing
124
WIP
Slide 125
Slide 125 text
Unicode Normalization
Unicode normalization unifies strings that look identical but differ internally.
NFD/NFC use canonical equivalence (e.g., e + ◌́ (U+0301) → é).
NFKD/NFKC use compatibility equivalence (e.g., ① → 1).
Normalization reduces search mismatches and security risks.
Prevents garbled text across OS/file systems and boosts data compatibility.
125
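The four forms can be tried with `String#unicode_normalize`. A short sketch; the sample strings are illustrations written with escapes, and `same_text?` is a hypothetical helper.

```ruby
decomposed = "e\u0301" # "e" followed by COMBINING ACUTE ACCENT
composed   = "\u00E9"  # precomposed "é"

# Compare two strings after bringing both to the same normalization form.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

p decomposed == composed                        # raw comparison differs
p same_text?(decomposed, composed)              # equal once normalized
p composed.unicode_normalize(:nfd).codepoints   # split back into two code points

circled_one = "\u2460" # CIRCLED DIGIT ONE
p circled_one.unicode_normalize(:nfc)           # canonical forms keep it as is
p circled_one.unicode_normalize(:nfkc)          # compatibility forms fold it to "1"
```

Canonical forms only merge sequences that mean exactly the same thing, while compatibility forms also fold presentation variants, which is why NFKC is the one that changes the circled digit.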
Implementation
Rewrote most of the normalization logic
Just for understanding
https://github.com/ruby/ruby/pull/13117
Referenced Rust's unicode-normalization project
https://github.com/unicode-rs/unicode-normalization
127
Slide 128
Slide 128 text
Future works
All tests pass, but performance may have regressed.
I removed optimizations temporarily.
Plan to maintain performance while upgrading to Unicode
16.0.0.
Considering “Quick Check” for faster validation.
128
Slide 129
Slide 129 text
Acknowledgements
Slide 130
Slide 130 text
Acknowledgments
Thank you, STORES, Inc. team, for your reviews and
scheduling support.
Special thanks to fujimura-san, @ko1, and @mame for
repeatedly reviewing my work.
My husband Takuya’s support made this presentation
possible.
I’m also grateful to everyone who reviewed my PRs and
provided valuable advice.
130