Why we need Unicode
What Unicode is and how it’s organized
Terminology
Sorting
Emoji
Practical considerations in JavaScript, MySQL, regular expressions, and malformed text
Helper data and libraries
Slide 7
Slide 7 text
Purpose
Slide 8
Slide 8 text
Why do we need Unicode?
Reliably write, store and send text
Covers most living languages and many historical ones
Does away with a multitude of standards
Slide 9
Slide 9 text
What is Unicode?
Slide 10
Slide 10 text
What is Unicode?
A numbered list of characters
Ways to encode these numbers
Information about the characters
Algorithms for sorting, displaying, etc
Slide 11
Slide 11 text
A numbered list of characters
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
304B;HIRAGANA LETTER KA;Lo;0;L;;;;;N;;;;;
1F409;DRAGON;So;0;ON;;;;;N;;;;;
http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
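Each record in UnicodeData.txt is one line of semicolon-separated fields. As a minimal sketch in Python, the helper below (`parse_record` and `Record` are names made up for this illustration) reads the code point, name, general category and simple lowercase mapping from records like the ones above. It assumes the usual 15-field layout of the file and ignores the remaining fields.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    code_point: int
    name: str
    general_category: str
    simple_lowercase: Optional[int]


def parse_record(line: str) -> Record:
    """Parse one UnicodeData.txt line (assumed to have 15 fields)."""
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    lowercase = fields[13]  # empty when there is no simple lowercase mapping
    return Record(
        code_point=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        simple_lowercase=int(lowercase, 16) if lowercase else None,
    )


a = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
dragon = parse_record("1F409;DRAGON;So;0;ON;;;;;N;;;;;")
print(f"U+{a.code_point:04X} {a.name} ({a.general_category})")
print(f"U+{dragon.code_point:04X} {dragon.name} ({dragon.general_category})")
```

The code point is stored as an integer so that it can be compared, sorted or passed to `chr()` directly.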
Slide 12
Slide 12 text
0041;LATIN CAPITAL LETTER A
Code point: 0041. Character name: LATIN CAPITAL LETTER A.
Ways to encode these numbers
Character  Code point  UTF-8       UTF-16      UTF-32
A          U+0041      0x41        0x0041      0x00000041
か         U+304B      0xE3818B    0x304B      0x0000304B
🐉         U+1F409     0xF09F9089  0xD83DDC09  0x0001F409
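The three encodings can be compared with Python's built-in codecs. This is a small sketch, not part of the original slides: `encoded_hex` is a helper name invented here, and the big-endian codec variants (`utf-16-be`, `utf-32-be`) are chosen so that no byte order mark is prepended to the output.

```python
# Characters are written as escapes so the source stays pure ASCII.
SAMPLES = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "HIRAGANA LETTER KA": "\u304B",
    "DRAGON": "\U0001F409",
}


def encoded_hex(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as an uppercase hex string."""
    return text.encode(encoding).hex().upper()


for name, ch in SAMPLES.items():
    print(
        f"U+{ord(ch):04X}",
        "UTF-8:", encoded_hex(ch, "utf-8"),
        "UTF-16:", encoded_hex(ch, "utf-16-be"),
        "UTF-32:", encoded_hex(ch, "utf-32-be"),
        name,
    )
```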
Slide 15
Slide 15 text
Information about the characters
0600..0604 ; Arabic # Cf [5] ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT
30FB;W # Po KATAKANA MIDDLE DOT
1F1F8 1F1EA ; Emoji_Flag_Sequence # 6.0 [1] (🇸🇪) Flag for Sweden
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
http://www.unicode.org/Public/9.0.0/ucd/EastAsianWidth.txt
http://www.unicode.org/Public/emoji/3.0//emoji-sequences.txt
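Some of these properties can be queried without parsing the data files. The sketch below uses Python's standard `unicodedata` module, whose data reflects whichever Unicode version the Python build ships with. `describe` is a helper name made up for this example. The standard module does not expose the script property from Scripts.txt, so that one is left out here.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few character properties for a single code point."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
    }


middle_dot = describe("\u30FB")   # KATAKANA MIDDLE DOT
number_sign = describe("\u0600")  # ARABIC NUMBER SIGN
print(middle_dot)
print(number_sign)
```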
Slide 16
Slide 16 text
Algorithms for sorting, displaying, etc
http://unicode.org/reports/
Bidirectional Algorithm
Line Breaking Algorithm
Text Segmentation
Collation Algorithm
Regular Expressions
Slide 17
Slide 17 text
“Unicode is complicated!”
Slide 18
Slide 18 text
Unicode is complicated
because language is complicated
because humans are complicated
Slide 19
Slide 19 text
1987  Work started
1991  Version 1.0: 7,161 characters
2010
2016  Version 9.0: 128,237 characters
Slide 20
Slide 20 text
Organization
17 planes, with 65,536 code points each
Blocks of related characters
Modern scripts in Plane 0
BMP, 0000-FFFF
Emoji in Plane 1
SMP, 10000-1FFFF
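Because every plane holds 0x10000 (65,536) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch follows; `plane_of` and the constant names are illustrative, not taken from any library.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1  # 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the plane (0..16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


print(plane_of(0x0041))    # plane 0, the BMP
print(plane_of(0x1F409))   # plane 1, the SMP
print(f"{MAX_CODE_POINT:X}")
```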
Slide 23
Slide 23 text
Terminology
Slide 24
Slide 24 text
Glyph                Á Á
Grapheme             Á
Character sequence   A ´
Code point           U+0041 + U+00B4     U+1F600 or U+D83D + U+DE00
Code unit (UTF-8)    41 C2 B4            F0 9F 98 80
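The difference between code points and code units shows up as soon as the same text is measured in different ways. The sketch below is an illustration only (`counts` and `utf8_hex` are invented helper names). It uses the two sequences from the table above: U+0041 followed by U+00B4, and U+1F600.

```python
def counts(text: str) -> dict:
    """Count code points and code units for a string."""
    return {
        "code_points": len(text),  # Python strings are indexed by code point
        "utf8_code_units": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-be")) // 2,
    }


def utf8_hex(text: str) -> str:
    """Return the UTF-8 code units as space-separated hex bytes."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


accented = "\u0041\u00B4"   # the two-character sequence from the table
face = "\U0001F600"         # a single code point outside the BMP

print(counts(accented), utf8_hex(accented))
print(counts(face), utf8_hex(face))
```

A language whose strings are counted in UTF-16 code units would report a length of 2 for the second string, even though it holds a single code point.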
UCS-2
First!
65,536 characters are enough!
Only BMP
Older programming languages
Slide 28
Slide 28 text
UTF-16
BMP → one 16-bit code unit
U+0041 (A) → 0041
Above BMP → two 16-bit code units
U+1F600 (😀) → D83D + DE00
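The two code units for a character above the BMP come from a fixed piece of arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The sketch below writes that out by hand for illustration (`to_utf16_units` is a made-up name). In practice the built-in `utf-16` codecs do this for you.

```python
from typing import Tuple


def to_utf16_units(code_point: int) -> Tuple[int, ...]:
    """Return the UTF-16 code unit(s) for one code point."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        return (code_point,)                 # BMP: one code unit
    if code_point > 0x10FFFF:
        raise ValueError("beyond the Unicode code space")
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return (high, low)


print([f"{unit:04X}" for unit in to_utf16_units(0x0041)])
print([f"{unit:04X}" for unit in to_utf16_units(0x1F600)])
```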
Slide 29
Slide 29 text
UTF-16
Systems and programming languages
Byte order matters
UTF-16: FEFF 0041
UTF-16 LE: FFFE 4100
Byte Order Mark (0xFEFF, ZERO WIDTH NO-BREAK SPACE)
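The byte sequences above can be reproduced with Python's codecs. This is a sketch under the assumption that the text is the single letter A; the variable names are illustrative. The generic `utf-16` decoder reads the byte order mark to decide which byte order to use, then drops it from the result.

```python
import codecs

text = "A"

# Prepend the byte order mark explicitly for each byte order.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(big_endian.hex().upper())      # FEFF0041
print(little_endian.hex().upper())   # FFFE4100

# Both byte orders decode to the same text when the BOM is present.
decoded_big = big_endian.decode("utf-16")
decoded_little = little_endian.decode("utf-16")
print(decoded_big == decoded_little == text)
```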
Slide 30
Slide 30 text
UTF-8
This is the encoding you are looking for
One to four 8-bit code units
Easy to parse
Byte order does not matter
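One reason UTF-8 is easy to parse is that the first byte of each character announces how many bytes the character occupies. The sketch below is a simplified illustration with invented function names. It splits a byte string into per-character chunks using only the lead byte and does not validate the continuation bytes, so it is not a full decoder.

```python
from typing import List


def sequence_length(lead_byte: int) -> int:
    """Return the length of a UTF-8 sequence from its first byte."""
    if lead_byte <= 0x7F:
        return 1                      # 0xxxxxxx: ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= lead_byte <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead_byte <= 0xF4:
        return 4                      # 11110xxx
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


def split_characters(data: bytes) -> List[bytes]:
    """Split UTF-8 bytes into one chunk per encoded code point."""
    chunks = []
    index = 0
    while index < len(data):
        length = sequence_length(data[index])
        chunks.append(data[index:index + length])
        index += length
    return chunks


encoded = "A\u304B\U0001F409".encode("utf-8")
print([chunk.hex().upper() for chunk in split_characters(encoded)])
```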
Common Locale Data Repository (CLDR)
Language and country data
Formatting dates, numbers, currency
Translations of scripts, languages, date units
Script characters, sorting and transliteration rules
Slide 75
Slide 75 text
International Components for Unicode
C/C++ and Java libraries for Unicode handling
Uses CLDR data
Encoding conversion
Sorting, formatting, normalization
Time and calendar conversion
Regular expressions
Text segmentation
Slide 76
Slide 76 text
“If you think you understand Unicode, you don't understand Unicode.”
– Richard Feynman