What is Unicode? - Speaker Deck

Slide 1

Slide 1 text

What is Unicode? Kim Ahlström – @kimtaro

Slide 2

Slide 2 text

https://xkcd.com/927/

Slide 6

Slide 6 text

Why we need Unicode
What Unicode is and how it’s organized
Terminology
Sorting
Emoji
Practical considerations in JavaScript, MySQL, Regular Expressions, and malformed text
Helper data and libraries

Slide 7

Slide 7 text

Purpose

Slide 8

Slide 8 text

Why do we need Unicode?
Reliably write, store and send text
Most living languages, and many historical ones
Does away with a multitude of standards

Slide 9

Slide 9 text

What is Unicode?

Slide 10

Slide 10 text

What is Unicode?
A numbered list of characters
Ways to encode these numbers
Information about the characters
Algorithms for sorting, displaying, etc.

Slide 11

Slide 11 text

A numbered list of characters
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
304B;HIRAGANA LETTER KA;Lo;0;L;;;;;N;;;;;
1F409;DRAGON;So;0;ON;;;;;N;;;;;
http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
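A minimal JavaScript sketch of reading one of these records, assuming the semicolon-separated layout shown in the sample lines. `parseUnicodeDataLine` is a hypothetical helper name, not part of any library.

```javascript
// Parse one line of UnicodeData.txt into a few named fields.
// parseUnicodeDataLine is a hypothetical helper for illustration only.
function parseUnicodeDataLine(line) {
  const fields = line.split(";");
  return {
    codePoint: parseInt(fields[0], 16), // e.g. 0x0041
    name: fields[1],                    // e.g. "LATIN CAPITAL LETTER A"
    generalCategory: fields[2],         // e.g. "Lu" (uppercase letter)
    bidiClass: fields[4],               // e.g. "L" (left-to-right)
    // Field 13 is the simple lowercase mapping; it is empty when there is none.
    lowercase: fields[13] ? parseInt(fields[13], 16) : null,
  };
}

const record = parseUnicodeDataLine("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;");
console.log(record.name, String.fromCodePoint(record.lowercase));
```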

Slide 12

Slide 12 text

0041;LATIN CAPITAL LETTER A
0041: code point
LATIN CAPITAL LETTER A: character name

Slide 13

Slide 13 text

0 - 10FFFF
1,114,112 code points
271,792 allocated

Slide 14

Slide 14 text

Ways to encode these numbers
Character  Code point  UTF-8       UTF-16      UTF-32
A          U+0041      0x41        0x0041      0x00000041
か         U+304B      0xE3818B    0x304B      0x0000304B
🐉         U+1F409     0xF09F9089  0xD83DDC09  0x0001F409
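The three encodings can be reproduced with a short JavaScript sketch. It assumes a runtime with a global `TextEncoder` (Node.js and browsers); `encodings` and `hex` are hypothetical helper names.

```javascript
// Hex-encode a single character in UTF-8, UTF-16 and UTF-32.
// `encodings` and `hex` are hypothetical helpers for illustration.
function hex(n, width) {
  return n.toString(16).toUpperCase().padStart(width, "0");
}

function encodings(ch) {
  // TextEncoder always produces UTF-8 bytes.
  const utf8 = Array.from(new TextEncoder().encode(ch), (b) => hex(b, 2)).join("");
  // JavaScript strings are sequences of UTF-16 code units; split("") yields them one by one.
  const utf16 = ch.split("").map((u) => hex(u.charCodeAt(0), 4)).join("");
  // UTF-32 is the code point itself, padded to 32 bits.
  const utf32 = hex(ch.codePointAt(0), 8);
  return { utf8, utf16, utf32 };
}

for (const ch of ["A", "\u304B", "\u{1F409}"]) { // A, HIRAGANA LETTER KA, DRAGON
  console.log(ch, encodings(ch));
}
```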

Slide 15

Slide 15 text

Information about the characters
0600..0604 ; Arabic # Cf [5] ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT
30FB;W # Po KATAKANA MIDDLE DOT
1F1F8 1F1EA ; Emoji_Flag_Sequence # 6.0 [1] (🇸🇪) Flag for Sweden
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
http://www.unicode.org/Public/9.0.0/ucd/EastAsianWidth.txt
http://www.unicode.org/Public/emoji/3.0//emoji-sequences.txt

Slide 16

Slide 16 text

Algorithms for sorting, displaying, etc.
http://unicode.org/reports/
Bidirectional Algorithm
Line Breaking Algorithm
Text Segmentation
Collation Algorithm
Regular Expressions

Slide 17

Slide 17 text

“Unicode is complicated!”

Slide 18

Slide 18 text

Unicode is complicated because language is complicated because humans are complicated

Slide 19

Slide 19 text

1987: Work started
1991: 1.0, 7,161 characters
2010
2016: 9.0, 128,237 characters

Slide 20

Slide 20 text

Organization
17 planes, with 65,536 code points each
Blocks of related characters
Modern scripts in Plane 0 → BMP, 0000-FFFF
Emoji in Plane 1 → SMP, 10000-1FFFF
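A small JavaScript sketch of the plane arithmetic: each plane spans 0x10000 (65,536) code points, so the plane number is the code point shifted right by 16 bits. `planeOf` is a hypothetical helper name.

```javascript
const PLANE_SIZE = 0x10000; // 65,536 code points per plane
const PLANE_COUNT = 17;     // planes 0 through 16

// Return the plane number (0-16) that a code point belongs to.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point");
  }
  return codePoint >> 16;
}

console.log(PLANE_SIZE * PLANE_COUNT); // 1114112, the size of the code space
console.log(planeOf(0x304B));          // 0 (BMP)
console.log(planeOf(0x1F409));         // 1 (SMP)
```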

Slide 23

Slide 23 text

Terminology

Slide 24

Slide 24 text

Glyph: Á Á
Grapheme: Á
Character Sequence: A ´
Code Point: U+0041 + U+00B4; U+1F600 or U+D83D + U+DE00
Code Unit (UTF-8): 41 C2 B4; F0 9F 98 80

Slide 25

Slide 25 text

Character  Bytes  Code points  Glyphs
$          1      1            1
¢          2      1            1
€          3      1            1
[glyph]    4      1            1
⛪         3      1            1
[glyph]    4      1            1
[glyph]    4      2            1
[glyph]    8      2            1
[glyph]    11     3            1
[glyph]    27     8            1
https://twitter.com/FakeUnicode/status/717751277342490624
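These three counts can be measured directly. A sketch in JavaScript, assuming a runtime that provides `TextEncoder` and `Intl.Segmenter` (recent Node.js versions and browsers); `measure` is a hypothetical helper name.

```javascript
// Count UTF-8 bytes, code points and user-perceived characters (graphemes).
// Assumes Intl.Segmenter is available in the runtime.
function measure(text) {
  const bytes = new TextEncoder().encode(text).length;
  const codePoints = Array.from(text).length; // string iteration goes by code point
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const graphemes = Array.from(segmenter.segment(text)).length;
  return { bytes, codePoints, graphemes };
}

// DOLLAR SIGN, CENT SIGN, EURO SIGN, CHURCH, and the flag of Sweden
const samples = ["$", "\u00A2", "\u20AC", "\u26EA", "\u{1F1F8}\u{1F1EA}"];
for (const s of samples) {
  console.log(s, measure(s));
}
```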

Slide 26

Slide 26 text

Encodings UCS-2 UTF-8 UTF-16 UTF-32

Slide 27

Slide 27 text

UCS-2
First!
65,536 characters are enough!
Only BMP
Older programming languages

Slide 28

Slide 28 text

UTF-16
BMP → One 16-bit code unit
U+0041 (A) → 0041
Above BMP → Two 16-bit code units
U+1F600 (😀) → D83D + DE00
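The split into two code units follows the standard surrogate-pair arithmetic. A JavaScript sketch, with `toSurrogates` as a hypothetical helper name:

```javascript
// Split a code point above the BMP into a UTF-16 surrogate pair.
function toSurrogates(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("only code points above the BMP need a surrogate pair");
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const pair = toSurrogates(0x1F600);
console.log(pair.map((u) => u.toString(16).toUpperCase())); // [ 'D83D', 'DE00' ]
console.log(String.fromCharCode(...pair) === "\u{1F600}");   // true
```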

Slide 29

Slide 29 text

UTF-16
Systems and programming languages
Byte order matters
UTF-16 BE: FEFF 0041
UTF-16 LE: FFFE 4100
Byte Order Mark (0xFEFF, ZERO WIDTH NO-BREAK SPACE)
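A sketch of how a reader might use the Byte Order Mark to pick the byte order, written in JavaScript. `detectByteOrder` and `decodeUtf16` are hypothetical helpers, and treating input without a BOM as big-endian is an assumption of this sketch.

```javascript
// Inspect the first two bytes for a Byte Order Mark.
function detectByteOrder(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "BE";
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "LE";
  }
  return null; // no BOM present
}

// Decode UTF-16 bytes (a Uint8Array) into a string, skipping the BOM.
function decodeUtf16(bytes) {
  const order = detectByteOrder(bytes);
  const littleEndian = order === "LE"; // assumption: no BOM means big-endian
  const start = order === null ? 0 : 2;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let out = "";
  for (let i = start; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode(view.getUint16(i, littleEndian));
  }
  return out;
}

console.log(decodeUtf16(Uint8Array.of(0xFE, 0xFF, 0x00, 0x41))); // A
console.log(decodeUtf16(Uint8Array.of(0xFF, 0xFE, 0x41, 0x00))); // A
```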

Slide 30

Slide 30 text

UTF-8
This is the encoding you are looking for
One to four 8-bit code units
Easy to parse
Byte order does not matter

Slide 31

Slide 31 text

UTF-8
U+0041   A   41
U+00D6   Ö   C3 96
U+304B   か  E3 81 8B
U+1F600  😀  F0 9F 98 80
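The byte values above follow from the UTF-8 bit layout. A hand-rolled JavaScript sketch for illustration only: `utf8Bytes` is a hypothetical helper, and it skips the validation a real encoder needs (it does not reject surrogates or values above U+10FFFF).

```javascript
// Encode one code point as UTF-8 by hand to show the bit layout.
function utf8Bytes(cp) {
  if (cp < 0x80) {
    return [cp];                                   // 0xxxxxxx
  }
  if (cp < 0x800) {
    return [0xC0 | (cp >> 6),                      // 110xxxxx
            0x80 | (cp & 0x3F)];                   // 10xxxxxx
  }
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12),                     // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18),                       // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

const toHex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
for (const cp of [0x0041, 0x00D6, 0x304B, 0x1F600]) {
  console.log("U+" + cp.toString(16).toUpperCase().padStart(4, "0"), toHex(utf8Bytes(cp)));
}
```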

Slide 32

Slide 32 text

UTF-32
32-bit code units, U+0041 (A) → 00000041
Space inefficient
One code unit per code point
BORING!

Slide 33

Slide 33 text

Han unification

Slide 34

Slide 34 text

Han unification
https://github.com/adobe-fonts/source-han-sans/raw/release/SourceHanSansReadMe.pdf
zh-Hans zh-Hant ja ko

Slide 35

Slide 35 text

Han unification
Mississippi
Mißißippi

Slide 36

Slide 36 text

Indicate language

Slide 37

Slide 37 text

Å U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
Å U+212B ANGSTROM SIGN
Å U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE

Slide 38

Slide 38 text

Å U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
Å U+212B ANGSTROM SIGN
Å U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE
Equivalence

Slide 39

Slide 39 text

Å U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
Å U+212B ANGSTROM SIGN
Å U+0041 LATIN CAPITAL LETTER A + U+030A COMBINING RING ABOVE
Normalization
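A JavaScript sketch of the three spellings and how normalization makes them comparable, using the built-in `String.prototype.normalize`. The variable names are illustrative.

```javascript
const precomposed = "\u00C5"; // LATIN CAPITAL LETTER A WITH RING ABOVE
const angstrom = "\u212B";    // ANGSTROM SIGN
const combining = "A\u030A";  // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

// The strings look alike but compare as different without normalization.
console.log(precomposed === angstrom, precomposed === combining); // false false

// NFC composes to the shortest canonical form, NFD decomposes fully.
const nfc = [precomposed, angstrom, combining].map((s) => s.normalize("NFC"));
const nfd = [precomposed, angstrom, combining].map((s) => s.normalize("NFD"));

console.log(nfc.every((s) => s === "\u00C5"));  // true
console.log(nfd.every((s) => s === "A\u030A")); // true
```

Normalizing both sides to the same form before comparing or storing is the usual way to treat equivalent sequences as equal.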

Slide 40

Slide 40 text

Sorting
By code point - BAD!
Unicode Collation Algorithm
Determined by language
Customizations

Slide 41

Slide 41 text

Swedish → Z sorts before Ä
German → Ä sorts before Z
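A JavaScript sketch of language-dependent sorting with the built-in `Intl.Collator`. It assumes the runtime ships locale data for Swedish and German (the default in current Node.js builds); with reduced locale data the results may fall back to a generic order.

```javascript
const letters = ["Z", "\u00C4", "A"]; // Z, Ä, A

// Default sort compares UTF-16 code units, which is rarely what a reader expects.
const byCodeUnit = ["apple", "Zebra"].sort();
console.log(byCodeUnit); // [ 'Zebra', 'apple' ]

// Locale-aware sorting: the same letters, two different correct answers.
const swedish = [...letters].sort(new Intl.Collator("sv").compare);
const german = [...letters].sort(new Intl.Collator("de").compare);

console.log(swedish.join(" ")); // Swedish puts Ä after Z
console.log(german.join(" "));  // German puts Ä next to A
```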

Slide 42

Slide 42 text

Emoji

Slide 43

Slide 43 text

Emoji

Slide 44

Slide 44 text

Emoji
2,198 emoji
Most in SMP → Two UTF-16 or four UTF-8 code units
Separate data files

Slide 45

Slide 45 text

Emoji

Slide 46

Slide 46 text

Emoji

Slide 47

Slide 47 text

ZWJ sequences

Slide 48

Slide 48 text

ZWJ sequences
👩‍👩‍👦‍👦 = 👩 WOMAN + ZERO WIDTH JOINER + 👩 WOMAN + ZERO WIDTH JOINER + 👦 BOY + ZERO WIDTH JOINER + 👦 BOY
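A JavaScript sketch of building and taking apart this sequence. The constant names are illustrative; whether the result renders as a single glyph depends on the font and platform.

```javascript
const ZWJ = "\u200D";       // ZERO WIDTH JOINER
const WOMAN = "\u{1F469}";  // WOMAN
const BOY = "\u{1F466}";    // BOY

// Join the four people with ZWJ to request a single family glyph.
const family = [WOMAN, WOMAN, BOY, BOY].join(ZWJ);

console.log(family);
console.log(Array.from(family).length); // 7 code points
console.log(family.length);             // 11 UTF-16 code units
console.log(family.split(ZWJ));         // the four individual emoji
```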

Slide 49

Slide 49 text

Genders
🚶 PEDESTRIAN + ZERO WIDTH JOINER + ♀ FEMALE SIGN + VARIATION SELECTOR-16

Slide 50

Slide 50 text

Fitzpatrick scale
🏻 Emoji Modifier Fitzpatrick Type-1-2
🏼 Emoji Modifier Fitzpatrick Type-3
🏽 Emoji Modifier Fitzpatrick Type-4
🏾 Emoji Modifier Fitzpatrick Type-5
🏿 Emoji Modifier Fitzpatrick Type-6

Slide 51

Slide 51 text

Genders + skin tones
🏃 RUNNER + EMOJI MODIFIER FITZPATRICK TYPE-4 + ZERO WIDTH JOINER + ♀ FEMALE SIGN + VARIATION SELECTOR-16
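The same sequence can be assembled from its code points. A JavaScript sketch; `womanRunning` and the `FITZPATRICK` table are hypothetical names, and rendering as one glyph depends on platform support.

```javascript
// Emoji modifiers for the Fitzpatrick scale, keyed by type.
const FITZPATRICK = {
  "1-2": 0x1F3FB,
  "3": 0x1F3FC,
  "4": 0x1F3FD,
  "5": 0x1F3FE,
  "6": 0x1F3FF,
};
const RUNNER = 0x1F3C3;
const ZERO_WIDTH_JOINER = 0x200D;
const FEMALE_SIGN = 0x2640;
const VARIATION_SELECTOR_16 = 0xFE0F;

// Build RUNNER + skin tone modifier + ZWJ + FEMALE SIGN + VS16.
function womanRunning(type) {
  const modifier = FITZPATRICK[type];
  if (modifier === undefined) {
    throw new RangeError("unknown Fitzpatrick type: " + type);
  }
  return String.fromCodePoint(RUNNER, modifier, ZERO_WIDTH_JOINER, FEMALE_SIGN, VARIATION_SELECTOR_16);
}

const runner = womanRunning("4");
console.log(runner, Array.from(runner, (c) => "U+" + c.codePointAt(0).toString(16).toUpperCase()));
```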

Slide 53

Slide 53 text

Flags
🇸🇪 = REGIONAL INDICATOR SYMBOL LETTER S + REGIONAL INDICATOR SYMBOL LETTER E
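A JavaScript sketch of turning a two-letter region code into its flag sequence by offsetting into the regional indicator block. `flagFor` is a hypothetical helper; it does not check that the code is an actual region.

```javascript
// Map "SE" to REGIONAL INDICATOR SYMBOL LETTER S + LETTER E.
function flagFor(regionCode) {
  if (!/^[A-Z]{2}$/.test(regionCode)) {
    throw new RangeError("expected two letters A-Z");
  }
  const base = 0x1F1E6; // REGIONAL INDICATOR SYMBOL LETTER A
  const points = Array.from(regionCode, (ch) => base + ch.charCodeAt(0) - 0x41);
  return String.fromCodePoint(...points);
}

const sweden = flagFor("SE");
console.log(sweden, Array.from(sweden, (c) => c.codePointAt(0).toString(16).toUpperCase()));
// the code points are 1F1F8 and 1F1EA
```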

Slide 54

Slide 54 text

JavaScript

Slide 55

Slide 55 text

JavaScript

Slide 56

Slide 56 text

JavaScript
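A sketch, not taken from the slides, of the kind of pitfalls that come from JavaScript strings being sequences of UTF-16 code units. It uses only pre-ES6 string methods.

```javascript
// U+1F600 written as a surrogate pair, the only option before ES6.
const grin = "\uD83D\uDE00";

console.log(grin.length);                     // 2, although it is one character
console.log(grin.charCodeAt(0).toString(16)); // d83d, only the high surrogate

// Splitting on the empty string cuts between the two surrogates,
// so a naive reverse produces a broken string.
const reversed = ("a" + grin).split("").reverse().join("");
console.log(reversed === grin + "a");         // false
```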

Slide 57

Slide 57 text

ES6

Slide 58

Slide 58 text

ES6

Slide 59

Slide 59 text

ES6

Slide 60

Slide 60 text

ES6

Slide 61

Slide 61 text

ES6

Slide 62

Slide 62 text

ES6

Slide 63

Slide 63 text

ES6

Slide 64

Slide 64 text

ES6

Slide 65

Slide 65 text

ES6
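A sketch, not taken from the slides, of ES6 features that work on code points rather than code units: code point escapes, `codePointAt`, `String.fromCodePoint`, string iteration, and the `u` regular expression flag.

```javascript
const grin = "\u{1F600}"; // code point escape, no surrogate pair needed

console.log(grin.codePointAt(0).toString(16));       // 1f600
console.log(String.fromCodePoint(0x1F600) === grin);  // true

// Spread and for...of iterate by code point.
console.log([...grin].length);                        // 1
const reversed = [...("a" + grin)].reverse().join("");
console.log(reversed === grin + "a");                 // true

// Without the u flag "." matches one code unit, with it one code point.
console.log(/^.$/.test(grin), /^.$/u.test(grin));     // false true
```

Iterating by code point still splits combining sequences and ZWJ sequences, so it is not the same as iterating by grapheme.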

Slide 66

Slide 66 text

MySQL
“utf8”: at most three bytes per character, so only the BMP
“utf8mb4”: up to four bytes per character
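Characters above U+FFFF need four bytes in UTF-8, so they do not fit in a three-byte character set. A JavaScript sketch of checking a string in application code before storing it; `needsUtf8mb4` is a hypothetical helper name.

```javascript
// True if the text contains any code point outside the BMP,
// i.e. any character that takes four bytes in UTF-8.
function needsUtf8mb4(text) {
  return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

console.log(needsUtf8mb4("A"));          // false, one byte
console.log(needsUtf8mb4("\u304B"));     // false, three bytes
console.log(needsUtf8mb4("\u{1F600}"));  // true, four bytes
```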

Slide 67

Slide 67 text

Regular Expressions
Unicode properties: \p{}
General categories: Uppercase_Letter, Math_Symbol
Scripts: Latin, Arabic, Katakana
In Unicode-aware engines, \w and \b match beyond ASCII, like Ä
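A JavaScript sketch of Unicode property escapes, which require the `u` flag. Note that this is engine-specific: in JavaScript `\w` stays ASCII-only even with the `u` flag, so `\p{L}` is used for letters instead.

```javascript
const upper = /\p{Uppercase_Letter}/u;       // same as \p{Lu}
const math = /\p{Math_Symbol}/u;             // same as \p{Sm}
const katakana = /^\p{Script=Katakana}+$/u;  // whole string in the Katakana script
const arabic = /\p{Script=Arabic}/u;

console.log(upper.test("\u00C4"));                       // true, LATIN CAPITAL LETTER A WITH DIAERESIS
console.log(math.test("\u2211"));                        // true, N-ARY SUMMATION
console.log(katakana.test("\u30AB\u30BF\u30AB\u30CA"));  // true
console.log(arabic.test("\u0639"));                      // true, ARABIC LETTER AIN

// In JavaScript \w does not match beyond ASCII; a property class does.
console.log(/\w/u.test("\u00C4"));     // false
console.log(/\p{L}/u.test("\u00C4"));  // true
```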

Slide 68

Slide 68 text

Malformed text

Slide 69

Slide 69 text

Malformed text
Don’t know the character encoding
Broken UTF-8, UTF-16

Slide 70

Slide 70 text

Malformed text
$ cat broken_utf8.txt
?BÄAÖC

Slide 71

Slide 71 text

Malformed text
$ iconv -f utf-8 -t utf-8 -c broken_utf8.txt
BÄAÖC
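A rough JavaScript counterpart to the `iconv -c` call, using the built-in `TextDecoder`. The byte array is made-up sample data standing in for the broken file, and `isValidUtf8` is a hypothetical helper name.

```javascript
// Sample data: the bytes of "BÄAÖC" preceded by 0xFF, which never occurs in valid UTF-8.
const broken = Uint8Array.of(0xFF, 0x42, 0xC3, 0x84, 0x41, 0xC3, 0x96, 0x43);

// Lenient decoding substitutes U+FFFD REPLACEMENT CHARACTER for bad bytes.
const lenient = new TextDecoder("utf-8").decode(broken);

// Dropping U+FFFD approximates `iconv -c`. Caveat: it also drops any
// U+FFFD that was genuinely present in the input.
const cleaned = lenient.replace(/\uFFFD/g, "");
console.log(cleaned); // BÄAÖC

// Strict decoding throws on malformed input, which is useful for validation.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}
console.log(isValidUtf8(broken)); // false
```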

Slide 72

Slide 72 text

Malformed text

Slide 73

Slide 73 text

CLDR & ICU

Slide 74

Slide 74 text

Common Locale Data Repository
Language and country data
Formatting dates, numbers, currency
Translations of scripts, languages, date units
Script characters, sorting and transliteration rules

Slide 75

Slide 75 text

International Components for Unicode
C/C++ and Java libraries for Unicode handling
Uses CLDR data
Encoding conversion
Sorting, formatting, normalization
Time and calendar conversion
Regular expressions
Text segmentation
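In JavaScript, the built-in `Intl` API is typically backed by ICU and CLDR data (this is the case in Node.js and Chromium-based browsers). A sketch, assuming a runtime with `Intl.DisplayNames` and full locale data; exact formatted strings can vary between ICU versions, so no output is shown for the currency example.

```javascript
// Translated names of regions and languages come from CLDR.
const regionNames = new Intl.DisplayNames(["en"], { type: "region" });
const languageNames = new Intl.DisplayNames(["en"], { type: "language" });

console.log(regionNames.of("SE"));   // Sweden
console.log(languageNames.of("sv")); // Swedish

// Locale-aware currency formatting, here Swedish kronor in the sv-SE locale.
const price = new Intl.NumberFormat("sv-SE", {
  style: "currency",
  currency: "SEK",
}).format(1234.5);
console.log(price);
```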

Slide 76

Slide 76 text

“If you think you understand Unicode, you don't understand Unicode.”
– Richard Feynman

Slide 77

Slide 77 text

Questions?
@kimtaro
http://www.unicode.org
http://emojipedia.org
http://graphemica.com
http://site.icu-project.org