Make Your Next App 🎉👏😄💘👍

Does your app contain text? I bet it does. But can it handle it correctly? I bet it can’t! Your app is not handling text correctly. “Wait,” I hear you say, “our app is fine. It uses Unicode.” Right. Except you don’t really understand Unicode, and as a consequence your code doesn’t really handle it. Here I’ll tell you what Unicode really is, and how you really should handle it in your next app. Correctly.

Tzu-ping Chung

October 25, 2014

Transcript

  1. en·code verb \in-ˈkōd, en-\ : to put (a message) into the form of a code so that it can be kept secret : to put information in the form of a code on (something) : to change (information) into a set of letters, numbers, or symbols that can be read by a computer (Merriam-Webster Dictionary)
  2. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. * www.unicode.org
  3. Unicode Unlimited
     Plane Name                             CP Range
     Supplementary Multilingual Plane       10000–1FFFF
     Supplementary Ideographic Plane        20000–2FFFF
     Supplementary Special-purpose Plane    E0000–EFFFF
  4. UTF-16
    • 16-bit Unicode Transformation Format
    • Variable-length, code points take multiples of 16 bits (one or two 16-bit code units)
    • D800–DFFF “surrogate pairs”
  5. Lead \ Tail   DC00     DC01     ⋯   DFFF
     D800          010000   010001   ⋯   0103FF
     D801          010400   010401   ⋯   0107FF
     ⋮             ⋮        ⋮        ⋱   ⋮
     DBFF          10FC00   10FC01   ⋯   10FFFF
  6. • 📱 = 1F4F1
    • 1F4F1 - 10000 = F4F1
    • F4F1 = 3D × 400 + F1
    • F1 + DC00 = DCF1 (low surrogate)
    • 3D + D800 = D83D (high surrogate)
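
The arithmetic on slide 6 can be sketched in Java as follows. This is only an illustration: the class and method names (`SurrogateMath`, `toSurrogates`, `toCodePoint`) are made up for this example and do not come from the talk. The result is checked against the JDK's own `Character.toChars`.

```java
import java.util.Arrays;

final class SurrogateMath {
    /** Splits a supplementary code point (U+10000..U+10FFFF) into {high, low} surrogates. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 1F4F1 - 10000 = F4F1
        char high = (char) (0xD800 + offset / 0x400);    // 3D + D800 = D83D
        char low = (char) (0xDC00 + offset % 0x400);     // F1 + DC00 = DCF1
        return new char[] {high, low};
    }

    /** Reverses the split: combines a high and a low surrogate into one code point. */
    static int toCodePoint(char high, char low) {
        return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F4F1);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);          // D83D DCF1
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F4F1)));     // true
        System.out.printf("%X%n", toCodePoint(pair[0], pair[1]));                // 1F4F1
    }
}
```

In real code `Character.toChars` and `Character.toCodePoint` already do this; the hand-written version is here only to mirror the slide.
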
  7. F1 C4 ꧅ 96 * Big Endian DC 3D D8 In preserved space Must be a surrogate pair 3D D8
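
The byte values on slide 7 look like the surrogate pair D83D DCF1 laid out in memory. As a rough check, and assuming that reading of the slide is right, the following Java sketch prints the UTF-16 bytes of U+1F4F1 in both byte orders. The helper name `hex` and the class name are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Bytes {
    /** Formats bytes as space-separated two-digit hex, e.g. "3D D8". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F4F1, built from its code point so the source file stays ASCII-only.
        String phone = new String(Character.toChars(0x1F4F1));
        System.out.println(hex(phone.getBytes(StandardCharsets.UTF_16LE))); // 3D D8 F1 DC
        System.out.println(hex(phone.getBytes(StandardCharsets.UTF_16BE))); // D8 3D DC F1
    }
}
```

Either way the first 16-bit unit is D83D, which falls in the D800–DBFF range, so a decoder knows a low surrogate has to follow.
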
  8. Unicode Programming
    • Variables are fixed-length
    • Characters higher than 0x10000 are relatively new
    • Backward compatibility
  9. s = "MOPCON"
    print s[0]   // "M"
    s = "넝꧅"
    print s[0]   // "넝"
    s = ""
    print s[0]   // Uh-oh.
  10. s = ""
    s.length      // 6.
    s[0] == ""    // False!!
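
The surprise on slides 9 and 10 is easy to reproduce in Java, whose `String` is also indexed by 16-bit units. The string literals on those slides did not survive extraction, so the sample below is an assumption: three emoji (U+1F389, U+1F44F, U+1F604), picked only because they give the length of 6 that the slide reports. The class and helper names are hypothetical.

```java
final class LengthDemo {
    // ASSUMPTION: stand-in for the slide's lost literal (U+1F389, U+1F44F, U+1F604).
    static final String SAMPLE = "\uD83C\uDF89\uD83D\uDC4F\uD83D\uDE04";

    /** Counts code points rather than 16-bit chars. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Returns the first code point as a string; never half of a surrogate pair. */
    static String firstCodePoint(String s) {
        return s.substring(0, s.offsetByCodePoints(0, 1));
    }

    public static void main(String[] args) {
        System.out.println(SAMPLE.length());                                 // 6
        System.out.println(codePoints(SAMPLE));                              // 3
        // charAt(0) is only the high surrogate, not the first emoji.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(0)));     // true
        System.out.println(firstCodePoint(SAMPLE).length());                 // 2
    }
}
```

Note that `firstCodePoint` throws on an empty string, because there is no code point to step over.
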
  11. NO.

  12. Combining Characters
    • When precomposed characters do not apply (e.g. Cyrillic)
    • 0300–036F “Combining Diacritical Marks”
    • Decorate the character before them
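
To see why combining characters matter for comparison, here is a small Java sketch using `java.text.Normalizer`. The example strings (a precomposed é versus e followed by U+0301) and the helper name `canonicallyEqual` are chosen for illustration and are not from the talk.

```java
import java.text.Normalizer;

final class CombiningDemo {
    /** Compares two strings after normalizing both to NFC. */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";   // U+00E9, one code point
        String combined = "e\u0301";     // U+0065 followed by COMBINING ACUTE ACCENT

        System.out.println(precomposed.length());                      // 1
        System.out.println(combined.length());                         // 2
        System.out.println(precomposed.equals(combined));              // false
        System.out.println(canonicallyEqual(precomposed, combined));   // true
    }
}
```

Both strings render the same on screen, which is exactly why a plain `equals` is misleading here.
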
  13. Objective-C (NSString)
    • Characters are unichar
    • Basically just unsigned short
    • Indexes are always 16-bit offsets
    • length = Count of 16-bit sequences
    • Ranges behave the same way
  14. -characterAtIndex:
    -getCharacters:range:
    -length
    -isEqualToString:   // Also -isEqual:
    -substringFromIndex:
    -substringToIndex:
    -substringWithRange:
    * Not exhaustive
  15. -rangeOfComposedCharacterSequenceAtIndex:
    -rangeOfComposedCharacterSequencesForRange:
    -enumerateSubstringsInRange:options:usingBlock:
    -compare:   // Or -localizedCompare:
    -decomposedStringWithCanonicalMapping
    -decomposedStringWithCompatibilityMapping
    -precomposedStringWithCanonicalMapping
    -precomposedStringWithCompatibilityMapping
  16. Swift (String)
    • Full Unicode support
    • The Character type
    • Splitting “just works”
    • Be careful with zero-width characters
    • Some APIs are still unichar-based
  17. But.
    // String(seq:Array("")[0..<2])
    // "".substringWithRange(NSMakeRange(0, 2))
  18. Java (String)
    • Characters are char
    • Primitive 16-bit
    • Everything in String is dangerous!!
    • The Character wrapper class
  19. java.lang.Character
    Character.codePointAt(...)
    Character.codePointBefore(...)
    Character.codePointCount(...)
    Character.isSurrogate(ch)
    Character.isLowSurrogate(ch)
    Character.isHighSurrogate(ch)
    Character.isSurrogatePair(high, low)
    java.text.Normalizer
    Normalizer.normalize(src, form)
    Normalizer.isNormalized(src, form)
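
A few of the `java.lang.Character` methods listed on slide 19 are enough to walk a string safely. The sketch below is one possible way to do it; the class and method names (`CodePointWalk`, `codePointsOf`, `splitsSurrogatePair`) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    /** Collects the code points of s, stepping over both halves of each surrogate pair. */
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            out.add(cp);
            i += Character.charCount(cp);   // 1 for BMP characters, 2 for supplementary ones
        }
        return out;
    }

    /** True if cutting s at this char index would separate a surrogate pair. */
    static boolean splitsSurrogatePair(String s, int index) {
        return index > 0 && index < s.length()
                && Character.isSurrogatePair(s.charAt(index - 1), s.charAt(index));
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDCF1b";   // 'a', U+1F4F1, 'b'
        System.out.println(codePointsOf(s));             // [97, 128241, 98]
        System.out.println(splitsSurrogatePair(s, 2));   // true
        System.out.println(splitsSurrogatePair(s, 3));   // false
    }
}
```

A check like `splitsSurrogatePair` is worth running before any `substring` call whose index came from arithmetic rather than from a code point walk.
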
  20. C# (String)
    • Characters are Char (16-bit)
    • Equals() and Compare() can do folding and normalisation
    • But no direct surrogate support
  21. Chars(index)
    Length
    CompareTo(str)
    Equals(str)
    GetEnumerator()
    Substring(...)
    // Operators == and !=
    * Not exhaustive
  22. System.String
    String.Compare(s1, s2, opts)
    IsNormalized(...)
    Normalize(...)
    Equals(str, options)
    System.Globalization.CompareOptions
    System.Char
    Char.ConvertFromUtf32(...)
    Char.IsHighSurrogate(...)
    Char.IsLowSurrogate(...)
    Char.IsSurrogate(...)
    Char.IsSurrogatePair(...)
  23. Summing Up
    • Learn your history and terminology
    • Pay extra attention when
      • Indexing
      • Subscripting (to get a “character”)
      • Iterating through
      • Getting a substring
    • Look for high-level APIs if possible
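
As one example of the "high-level API" advice, Java's `java.text.BreakIterator` can split a string into user-perceived characters instead of 16-bit units. This is a minimal sketch with a hypothetical class name. It covers the cases discussed in the talk (surrogate pairs and combining marks); how more complex emoji sequences are grouped depends on the JDK version, so treat it as a starting point.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class Graphemes {
    /** Splits s at character boundaries as reported by BreakIterator. */
    static List<String> split(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // 'e' + combining acute accent, then U+1F4F1 as a surrogate pair, then 'x'
        String s = "e\u0301\uD83D\uDCF1x";
        System.out.println(s.length());        // 5 chars
        System.out.println(split(s).size());   // 3 user-perceived characters
    }
}
```

Indexing, iterating and substring extraction can then work on the returned list rather than on raw `char` offsets.
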