Make Your Next App 🎉👏😄💘👍

Me
• Call me TP
• Follow @uranusjr
• https://uranusjr.com

http://macdown.uranusjr.com

Encoding

en·code verb \in-ˈkōd, en-\
: to put (a message) into the form of a code so that it can be kept secret
: to put information in the form of a code on (something)
: to change (information) into a set of letters, numbers, or symbols that can be read by a computer
(Merriam-Webster Dictionary)

Info → (Encode) → Code → (Decode) → Info

Storage

Probably not going to work.

A → (Encode) → 0100 0001 → (Decode) → A

* www.unicode.org

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
* www.unicode.org

SO CLOSE…

Glyph  Code Point
M      4D
雄     96C4

Bytes (big endian): 4D → M, 96 C4 → 雄

Bytes (big endian): 4D 96 → 䶖, C4 → Ä

Fixed-length v. Variable-length

UCS-2
• 2-byte Universal Character Set
• Fixed-length, from 0x0000 to 0xFFFF
• Superseded by UTF-16

Bytes (big endian): 00 4D → M, 96 C4 → 雄
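
A rough sketch of the fixed-width layout above, assuming Java's UTF-16BE charset as a stand-in for UCS-2 (for code points up to 0xFFFF the two produce the same bytes). The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the talk.
final class FixedWidthDemo {
    // Encode text as big-endian 16-bit units and render the bytes as hex.
    static String hexBytes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'M' is U+004D; the second glyph is U+96C4.
        System.out.println(hexBytes("M\u96C4")); // 00 4D 96 C4
    }
}
```

Every character takes exactly two bytes here, so a reader always knows where one character ends and the next begins.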

Unicode Unlimited
Plane Name                           CP Range
Supplementary Multilingual Plane     10000–1FFFF
Supplementary Ideographic Plane      20000–2FFFF
Supplementary Special-purpose Plane  E0000–EFFFF
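
Since each plane spans 0x10000 code points, the plane of a code point can be read off its upper bits. A minimal sketch, with a hypothetical `planeOf` helper that is not from the talk:

```java
// Hypothetical helper for illustration; not part of the talk.
final class PlaneDemo {
    // Each plane holds 0x10000 code points, so the plane index is the
    // code point shifted right by 16 bits.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0x96C4));  // 0, Basic Multilingual Plane
        System.out.println(planeOf(0x1F4F1)); // 1, Supplementary Multilingual Plane
        System.out.println(planeOf(0x20000)); // 2, Supplementary Ideographic Plane
        System.out.println(planeOf(0xE0001)); // 14, Supplementary Special-purpose Plane
    }
}
```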

Glyph  Code Point
📱     U+1F4F1
🕴     U+1F574 *
* Unicode 7.0 (June 2014)

Bytes (big endian): 96 C4 → 雄, 01 F4 F1 → 📱

UTF-16
• 16-bit Unicode Transformation Format
• Variable-length, code points take multiples of 16 bits
• D800–DFFF: “surrogate pairs”

Lead \ Tail  DC00    DC01    ⋯  DFFF
D800         010000  010001  ⋯  0103FF
D801         010400  010401  ⋯  0107FF
⋮            ⋮       ⋮       ⋱  ⋮
DBFF         10FC00  10FC01  ⋯  10FFFF

• 📱 = 1F4F1
• 1F4F1 – 10000 = F4F1
• F4F1 = 3D × 400 + F1
• F1 + DC00 = DCF1 (low surrogate)
• 3D + D800 = D83D (high surrogate)
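
The arithmetic above can be sketched in a few lines of Java. The `toSurrogates` helper is hypothetical and written only to mirror the steps on the slide; in real code `Character.toChars` does the same job.

```java
// Hypothetical helper for illustration; not part of the talk.
final class SurrogateMath {
    // Split a code point at or above 0x10000 into a high and a low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("needs a code point above 0xFFFF");
        }
        int offset = codePoint - 0x10000;              // 1F4F1 - 10000 = F4F1
        char high = (char) (0xD800 + (offset >> 10));   // F4F1 / 400 = 3D, plus D800
        char low = (char) (0xDC00 + (offset & 0x3FF));  // F4F1 mod 400 = F1, plus DC00
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F4F1);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DCF1
    }
}
```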

Bytes (big endian): 96 C4 → 雄, D8 3D DC F1 → 📱
D8 3D: in reserved space, so it must be part of a surrogate pair

Okay cool, how do all these matter to me?

Unicode Programming
• Variables are fixed-length
• Characters at 0x10000 and above are relatively new
• Backward compatibility

s = D8 3D DC F1 (big endian)
s[0] = D8 3D
s[1] = DC F1

s = "MOPCON"
print s[0] // "M"
s = "高雄"
print s[0] // "高"
s = "📱"
print s[0] // Uh-oh.
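
To make the "uh-oh" concrete, here is a small Java sketch of what indexing returns for a character above U+FFFF. The helper names are hypothetical; `charAt` and `codePointAt` are the standard `String` methods.

```java
// Hypothetical helper for illustration; not part of the talk.
final class IndexingDemo {
    // The first 16-bit code unit, as hex.
    static String firstUnit(String s) {
        return String.format("%04X", (int) s.charAt(0));
    }

    // The first full code point, as hex.
    static String firstCodePoint(String s) {
        return String.format("%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        String phone = new String(Character.toChars(0x1F4F1));
        System.out.println(firstUnit("MOPCON"));   // 004D, the letter M
        System.out.println(firstUnit(phone));      // D83D, only the high surrogate
        System.out.println(firstCodePoint(phone)); // 1F4F1, the whole character
    }
}
```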

s = ""
s.length // 6.
s[0] == "" // False!!
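
The same surprise can be reproduced in Java. This is only a sketch: the three emoji below are an assumption chosen for the example, and any three characters above U+FFFF would give the same counts.

```java
// Hypothetical helper for illustration; not part of the talk.
final class LengthDemo {
    // Build a string from code points so the source file stays ASCII-only.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Count code points instead of 16-bit code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Assumption: three emoji picked for this sketch.
        String s = fromCodePoints(0x1F389, 0x1F44F, 0x1F604);
        String first = fromCodePoints(0x1F389);

        System.out.println(s.length());         // 6, counted in 16-bit units
        System.out.println(codePointLength(s)); // 3, counted in code points
        // A single char holds only half of the first emoji.
        System.out.println(String.valueOf(s.charAt(0)).equals(first)); // false
        System.out.println(s.startsWith(first));                       // true
    }
}
```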

Combining Characters
• When precomposed characters do not apply (e.g. Cyrillic)
• 0300–036F “Combining Diacritical Marks”
• Decorate the character before them

y + ˘ = y̆

y + ˘ = y̆ breve (U+02D8)

y + ◌̆ = y̆ combining breve (U+0306)
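
The difference between the two breves shows up in their Unicode general categories. A minimal sketch, assuming a hypothetical `isCombiningMark` helper built on the standard `Character.getType`:

```java
// Hypothetical helper for illustration; not part of the talk.
final class CombiningDemo {
    // True when the code point is a mark that attaches to the preceding character.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        System.out.println(isCombiningMark(0x02D8)); // false, the spacing breve stands alone
        System.out.println(isCombiningMark(0x0306)); // true, the combining breve decorates 'y'

        String yBreve = "y\u0306";
        System.out.println(yBreve.length()); // 2, one visible character but two code units
    }
}
```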

How many characters are there in világ?
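
One way to explore the question in Java: the answer depends on whether "á" is stored precomposed or as "a" plus a combining acute, and on what is being counted. The `graphemes` helper is a hypothetical sketch that uses `java.text.BreakIterator` to count user-perceived characters.

```java
import java.text.BreakIterator;
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of the talk.
final class GraphemeCount {
    // Count user-perceived characters by walking character boundaries.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String composed = "vil\u00E1g"; // precomposed U+00E1
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(composed.length());     // 5
        System.out.println(decomposed.length());   // 6, 'a' and U+0301 are separate
        System.out.println(graphemes(composed));   // 5
        System.out.println(graphemes(decomposed)); // 5
    }
}
```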

Canonicalisation: when é is not equal to é
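
A short Java sketch of the problem and one common remedy: normalise both sides to the same form before comparing. The `canonicallyEqual` helper is hypothetical; NFC is chosen here as an assumption, and NFD would work equally well as long as both sides use the same form.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of the talk.
final class CanonicalEquality {
    // Compare two strings after bringing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // single code point
        String combined = "e\u0301";   // 'e' followed by combining acute accent

        System.out.println(precomposed.equals(combined));            // false
        System.out.println(canonicallyEqual(precomposed, combined)); // true
    }
}
```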

Objective-C (NSString)
• Characters are unichar
• Basically just unsigned short
• Indexes are always 16-bit offsets
• length = Count of 16-bit sequences
• Ranges behave the same way

-characterAtIndex:
-getCharacters:range:
-length
-isEqualToString: // Also -isEqual:
-substringFromIndex:
-substringToIndex:
-substringWithRange:
* Not exhaustive

-rangeOfComposedCharacterSequenceAtIndex:
-rangeOfComposedCharacterSequencesForRange:
-enumerateSubstringsInRange:options:usingBlock:
-compare: // Or -localizedCompare:
-decomposedStringWithCanonicalMapping
-decomposedStringWithCompatibilityMapping
-precomposedStringWithCanonicalMapping
-precomposedStringWithCompatibilityMapping

Swift (String)
• Full Unicode support
• The Character type
• Splitting “just works”
• Be careful with zero-width characters
• Some APIs are still unichar-based

But.
// String(seq:Array("")[0..<2])
// "".substringWithRange(NSMakeRange(0, 2))

Java (String)
• Characters are char
• Primitive 16-bit
• Everything in String is dangerous!!
• The Character wrapper class

java.lang.Character
Character.codePointAt(...)
Character.codePointBefore(...)
Character.codePointCount(...)
Character.isSurrogate(ch)
Character.isLowSurrogate(ch)
Character.isHighSurrogate(ch)
Character.isSurrogatePair(high, low)
java.text.Normalizer
Normalizer.normalize(src, form)
Normalizer.isNormalized(src, form)
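
As a hedged illustration of how code-point-aware calls avoid splitting a surrogate pair, the sketch below iterates by code point and takes a prefix measured in code points. The class and its two helpers are hypothetical; `codePointAt`, `codePointCount`, `offsetByCodePoints`, and `Character.charCount` are standard library methods.

```java
// Hypothetical helper for illustration; not part of the talk.
final class CodePointSafe {
    // Take the first n code points without cutting a surrogate pair in half.
    static String prefix(String s, int n) {
        int total = s.codePointCount(0, s.length());
        int take = Math.max(0, Math.min(n, total));
        return s.substring(0, s.offsetByCodePoints(0, take));
    }

    // List the code points of a string as U+XXXX values.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A" + new String(Character.toChars(0x1F4F1)) + "B";
        System.out.println(describe(s));            // U+0041 U+1F4F1 U+0042
        System.out.println(describe(prefix(s, 2))); // U+0041 U+1F4F1
        // A plain substring by code units leaves a lone high surrogate behind.
        System.out.println(Character.isHighSurrogate(s.substring(0, 2).charAt(1))); // true
    }
}
```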

C# (String)
• Characters are Char (16-bit)
• Equals() and Compare() can do folding and normalisation
• But no direct surrogate support

Chars[index]
Length
CompareTo(str)
Equals(str)
GetEnumerator()
Substring(...)
// Operators == and !=
* Not exhaustive

System.String
String.Compare(s1, s2, opts)
IsNormalized(...)
Normalize(...)
Equals(str, options)
System.CompareOptions
System.Char
Char.ConvertFromUtf32(...)
Char.IsHighSurrogate(...)
Char.IsLowSurrogate(...)
Char.IsSurrogate(...)
Char.IsSurrogatePair(...)

Summing Up
• Learn your history and terminology
• Pay extra attention when
  • Indexing
  • Subscripting (to get a “character”)
  • Iterating through
  • Getting a substring
• Look for high-level APIs if possible

WHAT IF I AM INTERESTED IN SOMETHING ELSE?

More Decks by Tzu-ping Chung
