Unicode, a 💌 - Speaker Deck

Slide 1

Slide 1 text

Unicorns, a 💌 Florian Gilcher, 6 August 2015

Slide 2

Slide 2 text

Unicode, a 💌 Florian Gilcher, 6 August 2015

Slide 3

Slide 3 text

Let’s talk basic Strings. Strings are: a memory abstraction; saved as bytes, packed; semantically mapped to characters.

Slide 4

Slide 4 text

What are Strings used for? Strings are used for: human-readable, semi-structured, binary data; words and text.

Slide 5

Slide 5 text

    string = "user/c9e4e625-b872-4e5b-9d6c-2d8dde25bbe1"
    hash = { "foo" => 2, "bar" => 2 }
    json = '{ "foo" : 2, "bar" : 2 }'

Slide 6

Slide 6 text

    string = "user/c9e4e625-b872-4e5b-9d6c-2d8dde25bbe1"
    string.bytes
    #=> [117, 115, 101, 114, 47, 99, 57, 101,
    #    52, 101, 54, 50, 53, 45, 98, 56, 55, 50, 45,
    #    52, 101, 53, 98, 45, 57, 100, 54, 99, 45, 50,
    #    100, 56, 100, 100, 101, 50, 53, 98, 98, 101, 49]
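The byte view on this slide can be reversed, which shows that a String is just bytes plus an encoding label. The following is a minimal sketch; the constant names `KEY`, `KEY_BYTES` and `REBUILT` are illustrative and not from the talk.

```ruby
# Example value taken from the slide; the constant names are assumptions.
KEY = "user/c9e4e625-b872-4e5b-9d6c-2d8dde25bbe1"

# One Integer per byte of the String.
KEY_BYTES = KEY.bytes

# Pack the Integers back into a binary String, then relabel it as UTF-8.
REBUILT = KEY_BYTES.pack("C*").force_encoding("UTF-8")

puts KEY_BYTES.first(5).inspect   # the bytes of "user/"
puts REBUILT == KEY               # true: same bytes, same encoding label
```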

Slide 7

Slide 7 text

Information content: an address of an object, in a readable form; a property of an object (the type). Structure is implicit; the reader has to know it.

Slide 8

Slide 8 text

Operations on strings: determining length; indexing into the string (e.g. get characters 0-3); parsing for structured information; splitting; merging.
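For a String used as a semi-structured key, these operations are byte-level and unambiguous. A short sketch, assuming the `type/id` layout from the earlier slide; `parse_key` and `build_key` are hypothetical helpers.

```ruby
# Hypothetical helper: split a "type/id" key into its two parts.
def parse_key(key)
  type, id = key.split("/", 2)
  { type: type, id: id }
end

# Hypothetical helper: merge the parts back into a key.
def build_key(type, id)
  [type, id].join("/")
end

key = "user/c9e4e625-b872-4e5b-9d6c-2d8dde25bbe1"

puts key.length          # determining length
puts key[0..3]           # indexing: characters 0-3
puts parse_key(key)[:id] # parsing for structured information
puts build_key("user", "42")
```

This only works because the reader of the String knows the implicit structure; nothing in the bytes says where the type ends.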

Slide 9

Slide 9 text

text = <

Slide 10

Slide 10 text

Information content: the language (Polish); pronunciation information (“o” vs. “ó”); a statement (“We want to be inclusive”); sentences, words, syllables, etc.

Slide 11

Slide 11 text

Strings have many uses: Strings, the memory abstraction, can be data containers, hold semi-structured data, or hold actual text. Each of those uses needs to be treated differently.

Slide 12

Slide 12 text

What do character encodings do? “a character encoding is used to represent a repertoire of characters by some kind of an encoding system.”

Slide 13

Slide 13 text

Morse: .- / .--. .-. .. -- .- .-. -.-- / --. --- .- .-.. / --- ..-. / .- .-.. .-.. / - .... . / -.-. --- -. ..-. . .-. . -. -.-. . ... / .- -. -.. / ..- ... . .-. / --. .-. --- ..- .--. ...

Slide 14

Slide 14 text

US-ASCII: 7 bits, up to 128 characters. Encodes the basic Latin alphabet (a-zA-Z), punctuation, and additional special chars.

Slide 15

Slide 15 text

Special chars? US-ASCII contains instructional data to the interpreting machine: the bell character, escape, carriage return.

Slide 16

Slide 16 text

Special chars? A notification to the user; a notification to the program; a notification for the hardware.
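The three control characters named on the previous slide are ordinary members of the US-ASCII table with their own numeric codes. A small sketch to make that visible; the lookup table `CONTROL_CHARS` is illustrative.

```ruby
# Illustrative table of the control characters mentioned in the slides.
CONTROL_CHARS = {
  "\a" => "bell",
  "\e" => "escape",
  "\r" => "carriage return"
}

CONTROL_CHARS.each do |char, name|
  # ord gives the numeric code; the POSIX class [[:cntrl:]] marks it as a control character.
  control = !(char =~ /[[:cntrl:]]/).nil?
  puts "#{name}: code #{char.ord}, control character: #{control}"
end
```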

Slide 17

Slide 17 text

Characters are not always text! The most common mistake in text handling is to think that characters are all like “a” and “b”. They encode many things.

Slide 18

Slide 18 text

Glyphs: Actual displayable things are called “glyphs”.

Slide 19

Slide 19 text

Operations on text: trimming (100 words only); display (render the glyphs); translation.

Slide 20

Slide 20 text

Unicode is a text encoding. It defines a vast set of characters, assigns them to numeric codes, and includes almost all characters used in writing systems, plus their control instructions (e.g. right-to-left writing).

Slide 21

Slide 21 text

About Unicode: Many special encodings were developed for multiple writing systems. Unicode unifies (almost) all of them. It is maintained by the Unicode Consortium.

Slide 22

Slide 22 text

Full members of the Consortium: Adobe; Apple; Google; Huawei; IBM; Microsoft; Ministry of Awqaf and Religious Affairs of the Sultanate of Oman; Oracle; SAP; Yahoo!

Slide 23

Slide 23 text

Institutions: Government of Bangladesh; Government of India; Government of Tamil Nadu; University of California, Berkeley.

Slide 24

Slide 24 text

Very brief history: US-ASCII-compatible (UTF-8). Unicode unifies (almost) all. A universal character list. Defines numeric values (code points) for each of those characters. Maintained by the Unicode Consortium.

Slide 25

Slide 25 text

Unicode is not an encoding. Unicode is a machine-readable description of many characters, not an encoding. Unicode has multiple ways to be encoded. Examples: UTF-8, UTF-16, UTF-32; UCS-2 (warning, this is what Postgres uses!); WTF-8 (for mapping between weird Unicode implementations).
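The difference between a code point and its encodings can be seen by encoding one character several ways. A minimal sketch using Ruby's built-in transcoding; the helper `encoded_bytes` and the choice of U+00FC as sample are assumptions for illustration.

```ruby
# Hypothetical helper: the bytes of a text in a given encoding.
def encoded_bytes(text, encoding)
  text.encode(encoding).bytes
end

# One code point, U+00FC (LATIN SMALL LETTER U WITH DIAERESIS).
U_UMLAUT = "\u00FC"

%w[UTF-8 UTF-16BE UTF-32BE].each do |encoding|
  puts "#{encoding}: #{encoded_bytes(U_UMLAUT, encoding).inspect}"
end
```

The code point stays the same in all three cases; only the byte layout changes.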

Slide 26

Slide 26 text

Appreciation Slide: Thank you, JavaScript!

Slide 27

Slide 27 text

UTF-8: a variable-width encoding. Encodes each character number in 1-4 bytes.

Slide 28

Slide 28 text

Encoding Strategy

Slide 29

Slide 29 text

Encoding Strategy: Hooray, we can fit US-ASCII in Byte 1!
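The slide's diagram did not survive in the transcript, so here is a sketch of the standard UTF-8 bit layout written by hand. `utf8_bytes` is a hypothetical helper, not code from the talk, and for brevity it does not reject the surrogate range U+D800 to U+DFFF.

```ruby
# Hypothetical helper: encode one code point into its UTF-8 bytes.
def utf8_bytes(codepoint)
  case codepoint
  when 0x0000..0x007F
    # 0xxxxxxx: US-ASCII fits unchanged into a single byte.
    [codepoint]
  when 0x0080..0x07FF
    # 110xxxxx 10xxxxxx
    [0xC0 | (codepoint >> 6),
     0x80 | (codepoint & 0x3F)]
  when 0x0800..0xFFFF
    # 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (codepoint >> 12),
     0x80 | ((codepoint >> 6) & 0x3F),
     0x80 | (codepoint & 0x3F)]
  when 0x10000..0x10FFFF
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (codepoint >> 18),
     0x80 | ((codepoint >> 12) & 0x3F),
     0x80 | ((codepoint >> 6) & 0x3F),
     0x80 | (codepoint & 0x3F)]
  else
    raise ArgumentError, "not a Unicode code point: #{codepoint}"
  end
end

# Compare against Ruby's own encoder for a few sample code points.
[0x61, 0xFC, 0x20AC, 0x1F48C].each do |cp|
  puts utf8_bytes(cp) == [cp].pack("U").bytes
end
```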

Slide 30

Slide 30 text

Consequences: characters have varying (memory) width; jumping to a position in the String that encodes the text means having to read it left-to-right; compatibility with the most important legacy encoding.
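These consequences show up directly in Ruby's String API. A small sketch; the sample text and the constant names `SAMPLE` and `FRAGMENT` are illustrative.

```ruby
# Illustrative sample: "a" (1 byte), U+00FC (2 bytes), U+1F48C (4 bytes) in UTF-8.
SAMPLE = "a\u00FC\u{1F48C}"

puts SAMPLE.length     # number of code points
puts SAMPLE.bytesize   # number of bytes

# Character indexing has to respect character boundaries.
puts SAMPLE[2] == "\u{1F48C}"

# Byte indexing is direct, but can land in the middle of a character.
FRAGMENT = SAMPLE.byteslice(2, 1)
puts FRAGMENT.valid_encoding?
```

The pure US-ASCII part of the String behaves exactly as it would in a legacy single-byte encoding, which is the compatibility the slide refers to.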

Slide 31

Slide 31 text

So why not fixed-width? Wouldn’t it be better to have all characters in a fixed width, which would allow us to skip forward and back without reading the string? UTF-32 is fixed-length and encodes all Unicode codepoints
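With UTF-32, the position of the n-th code point is plain arithmetic. The following sketch illustrates that; `utf32_codepoint_at` is a hypothetical helper and transcodes on every call only to keep the example short.

```ruby
# Hypothetical helper: read the code point at a given index via UTF-32.
def utf32_codepoint_at(text, index)
  utf32 = text.encode("UTF-32BE")
  # Every code point occupies exactly 4 bytes, so we can jump straight to it.
  chunk = utf32.byteslice(index * 4, 4)
  chunk && chunk.unpack("N").first
end

sample = "a\u00FC\u{1F48C}"
puts sample.encode("UTF-32BE").bytesize          # 4 bytes per code point
puts utf32_codepoint_at(sample, 2).to_s(16)      # hex value of the third code point
```

Note that this finds code points, which are not necessarily what a reader perceives as characters.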

Slide 32

Slide 32 text

Composition: What is an “ü” even? A character on its own? A “u” with a trema on top?

Slide 33

Slide 33 text

Composition: Both.

Slide 34

Slide 34 text

Composition: Even with fixed-length encoding, reading our desired info out of a String is still left-to-right. Reading too few bytes might change the meaning of a character. Reading from some point in between may miss context and yield nonsensical data.

Slide 35

Slide 35 text

Indexing isn’t clear. What is:
    string = "\u0075\u0308" #=> "ü"
    string[1]
NilClass or the trema?

Slide 36

Slide 36 text

Length isn’t clear. What is:
    string = "\u0075\u0308" #=> "ü"
    string.length
1 or 2?
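Ruby answers both questions at the code point level. To count what a reader would perceive as one character, the regex escape `\X` (extended grapheme cluster) can be used. A minimal sketch; the constant name `DECOMPOSED` is illustrative.

```ruby
# "u" followed by U+0308 COMBINING DIAERESIS.
DECOMPOSED = "\u0075\u0308"

puts DECOMPOSED.length                # counts code points
puts DECOMPOSED[1] == "\u0308"        # indexing returns the combining mark
puts DECOMPOSED.scan(/\X/).length     # counts grapheme clusters
```

Newer Ruby versions (2.5 and later) also offer `String#grapheme_clusters` for the same purpose.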

Slide 37

Slide 37 text

Operations become non-trivial:
    string1 = "ü" # precomposed, U+00FC
    string2 = "\u0075\u0308"
    string1 == string2 #=> false
Unicode says they are equal.

Slide 38

Slide 38 text

Normalization: All Unicode Strings need to be normalized before comparison. Normalization ensures that ambiguous representations are mapped to a single one. Normalization is costly. There are multiple normalization strategies.
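In Ruby 2.2 and later, `String#unicode_normalize` implements the standard normalization forms. A sketch of a normalizing comparison; `same_text?` is a hypothetical helper and NFC is chosen here only as an example form.

```ruby
# Hypothetical helper: compare two Strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

PRECOMPOSED = "\u00FC"         # single code point
COMBINED    = "\u0075\u0308"   # "u" plus combining diaeresis

puts PRECOMPOSED == COMBINED             # byte-wise comparison
puts same_text?(PRECOMPOSED, COMBINED)   # comparison after normalization

# NFC composes, NFD decomposes: two of the multiple strategies.
puts PRECOMPOSED.unicode_normalize(:nfd).length
puts COMBINED.unicode_normalize(:nfc).length
```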

Slide 39

Slide 39 text

Beware of bugs: This is a common and silent bug in search implementations.
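A sketch of how such a search bug can look and one way to avoid it. The document list and both search helpers are hypothetical; a real search engine would normalize once at indexing time rather than on every query.

```ruby
# Hypothetical search that compares raw code points.
def naive_search(documents, query)
  documents.select { |doc| doc.include?(query) }
end

# Hypothetical search that normalizes both sides to NFC first.
def normalized_search(documents, query)
  needle = query.unicode_normalize(:nfc)
  documents.select { |doc| doc.unicode_normalize(:nfc).include?(needle) }
end

# The same word twice: once precomposed, once with a combining diaeresis.
DOCUMENTS = ["M\u00FCnchen", "Mu\u0308nchen"]

puts naive_search(DOCUMENTS, "M\u00FCnchen").length
puts normalized_search(DOCUMENTS, "M\u00FCnchen").length
```

The naive version raises no error; it simply returns fewer results, which is what makes the bug silent.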

Slide 40

Slide 40 text

“Fully Unicode compatible Strings”: Many languages boast of being fully compatible. This is deceiving. You don’t want many of the text operations on Strings.

Slide 41

Slide 41 text

Unicode upcase/downcase:
    "ü".upcase #=> "ü"

Slide 42

Slide 42 text

Unicode upcase/downcase:
    "ü".upcase #=> "ü"
Ruby, why Ü no upcase Unicode?

Slide 43

Slide 43 text

Unicode upcase/downcase: Because it is the most practically useless feature ever. Upcase and downcase are operations for using Strings as program labels. Don’t use them on text.

Slide 44

Slide 44 text

Trivia question "i".upcase

Slide 45

Slide 45 text

Trivia question:
    "i".upcase
İ in the Turkish locale, I in all others.
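The talk predates Ruby 2.4, which added Unicode-aware case mapping with explicit options, so the results shown on the earlier upcase slides reflect older Ruby versions. A sketch of the locale dependence using those options; `upcase_for` and its locale list are hypothetical, since Ruby itself has no notion of a current locale for Strings.

```ruby
# Hypothetical helper: pick a case mapping based on a language tag.
def upcase_for(text, locale)
  # Turkish and Azerbaijani distinguish dotted and dotless i.
  if %w[tr az].include?(locale)
    text.upcase(:turkic)
  else
    text.upcase
  end
end

puts upcase_for("i", "en") == "I"
puts upcase_for("i", "tr") == "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The :ascii option restricts mapping to a-z, like Ruby before 2.4.
puts "\u00FC".upcase(:ascii) == "\u00FC"
```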

Slide 46

Slide 46 text

Conclusion: Even in basic Western scripts (US-ASCII), upcase is not well defined.

Slide 47

Slide 47 text

Reminder: Operations on text: trimming (100 words only); display (render the glyphs); translation.

Slide 48

Slide 48 text

When handling text, learn about handling text. Remember the following. Folding: like upcasing, but more general; folding follows rules like “ß” to “SS”. Collation: sorting values by the rules of the respective language. Locales: always work within the context of a locale.
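Ruby 2.4 and later expose Unicode case folding through `downcase(:fold)`. Note that it folds “ß” to lowercase “ss”, while `upcase` yields “SS”. A minimal sketch; `fold_equal?` is a hypothetical helper. Collation is not covered here, because Ruby's standard library has no locale-aware sorting and a library such as ICU would be needed.

```ruby
# Hypothetical helper: caseless comparison via Unicode case folding.
def fold_equal?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

STREET = "Stra\u00DFe"   # German "Straße"

puts STREET.downcase(:fold)             # folded form
puts STREET.upcase                      # full case mapping
puts fold_equal?(STREET, "STRASSE")
```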

Slide 49

Slide 49 text

Unicode metadata: Unicode ships with huge metadata collections about characters: type of character (e.g. non-printable, word-separating, whitespace, etc.); origin of character; variants of character; role (punctuation, etc.).

Slide 50

Slide 50 text

Unicode upcase/downcase: Ruby Regexes support that metadata.
    "[({})]".gsub!(/\p{Ps}/, "") #=> "})]"
Use it!
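A few more property classes in the same spirit as the slide's `\p{Ps}` (open punctuation) example. The helper names and sample Strings are illustrative; the non-mutating `gsub` is used because `gsub!` returns nil when nothing matched.

```ruby
# Hypothetical helper: remove every opening bracket-like character.
def strip_open_punctuation(text)
  text.gsub(/\p{Ps}/, "")
end

# Hypothetical helper: collect uppercase letters of any script.
def uppercase_letters(text)
  text.scan(/\p{Lu}/)
end

puts strip_open_punctuation("[({})]")
puts uppercase_letters("\u00C4pfel und \u00D6l").inspect

# \p{Pe} is the counterpart of \p{Ps}: closing punctuation.
puts "[({})]".gsub(/\p{Pe}/, "")
```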

Slide 51

Slide 51 text

How much support to expect from a programming language? Validity checks (are all characters even Unicode?); transcoding, conversion; knowledge of character boundaries; knowledge about character metadata in Regexes.
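Ruby covers the first two points with `valid_encoding?`, `scrub` (Ruby 2.1 and later) and `encode`. A sketch under the assumption that incoming bytes are meant to be UTF-8; `clean_utf8` is a hypothetical helper.

```ruby
# Hypothetical helper: label raw bytes as UTF-8 and replace invalid sequences.
def clean_utf8(raw)
  text = raw.dup.force_encoding("UTF-8")
  # Validity check first; only scrub when something is actually broken.
  text.valid_encoding? ? text : text.scrub("\uFFFD")
end

# Raw input containing a byte (0xFF) that can never occur in UTF-8.
broken = "ab\xFFc".b

puts clean_utf8(broken) == "ab\uFFFDc"

# Transcoding: the same character as a single Latin-1 byte.
puts "\u00FC".encode("ISO-8859-1").bytes.inspect

# Characters without a Latin-1 equivalent need an explicit fallback.
puts "\u{1F48C}".encode("ISO-8859-1", undef: :replace, replace: "?")
```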

Slide 52

Slide 52 text

Social aspects: Unicode can be racist (emoji skin colors). Unicode can be sexist (emoji figures assigning roles to gender). Unicode can be normative (it decides about correct writing and display). Unicode can have severe social impact (for people whose name is not encodable).

Slide 53

Slide 53 text

Social aspects: “I Can Text You A Pile of Poo, But I Can’t Write My Name” https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name “Han Unification” https://en.wikipedia.org/wiki/Han_unification

Slide 54

Slide 54 text

Mojibake
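The slide itself carries only the title, so as an illustration: mojibake typically appears when bytes written in one encoding are read as another. A sketch that produces and then reverses the classic UTF-8 read as ISO-8859-1 case; both helpers are hypothetical.

```ruby
# Hypothetical helper: misread UTF-8 bytes as ISO-8859-1 and convert "back" to UTF-8.
def mojibake(text)
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Hypothetical helper: undo exactly that one mistake.
def repair_mojibake(garbled)
  garbled.encode("ISO-8859-1").force_encoding("UTF-8")
end

original = "\u00FC"
garbled  = mojibake(original)

puts garbled.length                         # one character became two
puts garbled == "\u00C3\u00BC"              # the familiar two-character garbage
puts repair_mojibake(garbled) == original
```

The repair only works when the wrong guess is known and no bytes were lost along the way.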

Slide 55

Slide 55 text

ICU: http://site.icu-project.org/

Slide 56

Slide 56 text

Why I love Unicode: Because it is an incredible technological achievement with huge range.

Slide 57

Slide 57 text

Why I love Unicode: Because it codifies one of the oldest cultural techniques of humans.

Slide 58

Slide 58 text

Why I love Unicode: Because it is inseparable from its social impact, for better or worse.

Slide 59

Slide 59 text

Is this text?