Make Your Next App 🎉👏😄💘👍

Does your app contain text? I bet it does. But can it handle that text correctly? I bet it can’t! Your app is not handling text correctly. “Wait,” I hear you say, “our app is fine. It uses Unicode.” Right. Except you don’t really understand Unicode, and as a consequence your code doesn’t really handle it. Here I’ll tell you what Unicode really is, and how you really should handle it in your next app. Correctly.

Tzu-ping Chung

October 25, 2014

Transcript

  1. Make Your Next App

  2. Me • Call me TP • Follow @uranusjr • https://uranusjr.com

  3. None
  4. www. .com

  5. http://macdown.uranusjr.com

  6. Encoding

  7. en·code verb \in-ˈkōd, en-\ : to put (a message) into the form of a code so that it can be kept secret : to put information in the form of a code on (something) : to change (information) into a set of letters, numbers, or symbols that can be read by a computer (Merriam-Webster Dictionary)
  8. Info Code Encode Decode

  9. None
  10. Storage

  11. None
  12. Probably not going to work.

  13. A 0100 0001 Encode Decode
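
The encode/decode round trip on this slide can be tried directly. The following is a minimal sketch in Python 3, assuming ASCII as the code; the variable names are illustrative only.

```python
# Encode: character -> number -> bits. Decode reverses the process.
char = "A"
code = ord(char)                   # numeric value of the character (0x41)
bits = format(code, "08b")         # the same value written as 8 binary digits
encoded = char.encode("ascii")     # the byte that would actually be stored
decoded = encoded.decode("ascii")  # back to the original character

print(bits, encoded.hex(), decoded)
```

The stored byte and the character are two views of the same information; the encoding is the agreement that connects them.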

  14. * www.unicode.org

  15. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. * www.unicode.org
  16. None
  17. SO CLOSE…

  18. Glyph → Code Point • M → 4D • 雄 → 96C4

  19. M → 4D • 雄 → 96 C4 * Big Endian

  20. 4D 96 → ♃ • C4 → Ä * Big Endian

  21. Fixed-length v. Variable-length

  22. UCS-2 • 2-byte Universal Character Set • Fixed-length, from 0x0000 to 0xFFFF • Superseded by UTF-16
  23. M → 00 4D • 雄 → 96 C4 * Big Endian
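
To see the fixed two-byte layout in practice, here is a small Python 3 sketch. Python ships no codec named UCS-2, so this assumes the `utf-16-be` codec as a stand-in; the two produce identical bytes for code points up to 0xFFFF.

```python
# "M" followed by U+96C4, both inside the 0x0000-0xFFFF range.
text = "M\u96c4"

# Assumption: utf-16-be stands in for big-endian UCS-2 here.
encoded = text.encode("utf-16-be")

# Split the byte string into 2-byte units, one per character.
units = [encoded[i:i + 2].hex() for i in range(0, len(encoded), 2)]
print(units)
```

Every character takes exactly two bytes, so the decoder never has to guess where one character ends and the next begins.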

  24. Unicode Unlimited • Plane name: code point range • Supplementary Multilingual Plane: 10000–1FFFF • Supplementary Ideographic Plane: 20000–2FFFF • Supplementary Special-purpose Plane: E0000–EFFFF
  25. Glyph → Code Point • 📱 → U+1F4F1 • 🕴 → U+1F574 * Unicode 7.0 (June 2014)

  26. 雄 → 96 C4 • 📱 → 01 F4 F1 * Big Endian

  27. UTF-16 • 16-bit Unicode Transformation Format • Variable-length, code points take multiples of 16 bits • D800–DFFF “surrogate pairs”
  28. Lead \ Tail   DC00     DC01     ⋯   DFFF
      D800          010000   010001   ⋯   0103FF
      D801          010400   010401   ⋯   0107FF
      ⋮             ⋮        ⋮        ⋱   ⋮
      DBFF          10FC00   10FC01   ⋯   10FFFF
  29. 📱 = 1F4F1 • 1F4F1 – 10000 = F4F1 • F4F1 = 3D × 400 + F1 • F1 + DC00 = DCF1 (low surrogate) • 3D + D800 = D83D (high surrogate)
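
The arithmetic on this slide can be written as a short function. This is a sketch in Python 3; `to_surrogate_pair` is a hypothetical helper name, not a standard library function, and the final check compares it against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 1F4F1 - 10000 = F4F1
    lead, tail = divmod(offset, 0x400)   # F4F1 = 3D * 400 + F1
    return 0xD800 + lead, 0xDC00 + tail  # D83D (high), DCF1 (low)


high, low = to_surrogate_pair(0x1F4F1)
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in encoder (big endian).
builtin = "\U0001F4F1".encode("utf-16-be")
print(builtin.hex())
```

Each surrogate carries 10 bits of the offset, which is why the divisor is 0x400 and why the pair can reach exactly up to U+10FFFF.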
  30. 雄 → 96 C4 • 📱 → D8 3D DC F1 * Big Endian • D8 3D: in reserved space, must be part of a surrogate pair
  31. Okay cool, how do all these matter to me?

  32. Unicode Programming • Variables are fixed-length • Characters at or above 0x10000 are relatively new • Backward compatibility
  33. s = 📱 • s[0] → D8 3D • s[1] → DC F1 * Big Endian
  34. s = "MOPCON"
      print s[0]   // "M"
      s = "高雄"
      print s[0]   // "高"
      s = ""
      print s[0]   // Uh-oh.
  35. s = ""
      s.length     // 6.
      s[0] == ""   // False!!
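
The mismatch between "length" and visible characters comes from counting 16-bit units. Python 3 strings are indexed by code point, so the sketch below measures the UTF-16 size explicitly to mimic what a 16-bit string type would report; `utf16_length` is a hypothetical helper written for this illustration.

```python
def utf16_length(text):
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


phone = "\U0001F4F1"  # one code point outside the 0x0000-0xFFFF range

print(len(phone))              # counted in code points
print(utf16_length(phone))     # counted in 16-bit units
print(utf16_length("MOPCON"))  # plain ASCII: one unit per character
```

A string type that indexes by 16-bit unit returns half of a surrogate pair for `s[0]`, which is why comparing that element against the whole character fails.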
  36. NO.

  37. None
  38. Combining Characters • When precomposed characters do not apply (e.g. Cyrillic) • 0300–036F “Combining Diacritical Marks” • Decorate the character before them
  39. y + ˘ = y̆

  40. y + ˘ = y̆ breve (U+02D8)

  41. y + ˘ = y̆ combining breve (U+0306)
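
The difference between the two breves is visible in the Unicode character database. A minimal Python 3 sketch using the standard `unicodedata` module:

```python
import unicodedata

spacing_breve = "\u02d8"    # stands on its own
combining_breve = "\u0306"  # attaches to the character before it

for mark in (spacing_breve, combining_breve):
    print(
        f"U+{ord(mark):04X}",
        unicodedata.name(mark),
        unicodedata.category(mark),   # Sk = modifier symbol, Mn = nonspacing mark
        unicodedata.combining(mark),  # 0 means "does not combine"
    )

# Both strings hold two code points, but only the second renders as one glyph.
separate = "y" + spacing_breve
decorated = "y" + combining_breve
print(len(separate), len(decorated))
```

Only the mark from the 0300–036F block has a non-zero combining class, so only it decorates the preceding `y`.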

  42. None
  43. How many characters are there in világ?
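
The answer depends on how the á is stored. The Python 3 sketch below counts both ways; `count_user_characters` is a hypothetical helper and a deliberately naive approximation that only skips combining marks. It is not full grapheme cluster segmentation and ignores cases such as zero-width joiner sequences.

```python
import unicodedata


def count_user_characters(text):
    """Count code points that are not combining marks (naive approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


precomposed = "vil\u00e1g"  # á as a single code point
decomposed = "vila\u0301g"  # a followed by a combining acute accent

print(len(precomposed), len(decomposed))
print(count_user_characters(precomposed), count_user_characters(decomposed))
```

Both strings look identical on screen, yet a plain length check reports different numbers for them.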

  44. Canonicalisation When é is not equal to é
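
Two strings that render the same can compare unequal until both are normalised to the same form. A minimal Python 3 sketch using `unicodedata.normalize`; `canonically_equal` is a hypothetical helper, and choosing NFC over NFD here is an arbitrary assumption, since either works as long as both sides use the same form.

```python
import unicodedata

precomposed = "\u00e9"  # é as one code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(precomposed == decomposed)  # raw comparison of code points


def canonically_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(canonically_equal(precomposed, decomposed))

# NFC composes, NFD decomposes; each maps one spelling onto the other.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(len(nfc), len(nfd))
```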

  45. None
  46. None
  47. Objective-C (NSString) • Characters are unichar • Basically just unsigned short • Indexes are always 16-bit offsets • length = Count of 16-bit sequences • Ranges behave the same way
  48. -characterAtIndex: • -getCharacters:range: • -length • -isEqualToString: // Also -isEqual: • -substringFromIndex: • -substringToIndex: • -substringWithRange: * Not exhaustive
  49. -rangeOfComposedCharacterSequenceAtIndex: • -rangeOfComposedCharacterSequencesForRange: • -enumerateSubstringsInRange:options:usingBlock: • -compare: // Or -localizedCompare: • -decomposedStringWithCanonicalMapping • -decomposedStringWithCompatibilityMapping • -precomposedStringWithCanonicalMapping • -precomposedStringWithCompatibilityMapping
  50. Swift (String) • Full Unicode support • The Character type • Splitting “just works” • Be careful with zero-width characters • Some APIs are still unichar-based
  51. But.
      // String(seq: Array("")[0..<2])
      // "".substringWithRange(NSMakeRange(0, 2))
  52. Java (String) • Characters are char • Primitive 16-bit • Everything in String is dangerous!! • The Character wrapper class
  53. java.lang.Character • Character.codePointAt(...) • Character.codePointBefore(...) • Character.codePointCount(...) • Character.isSurrogate(ch) • Character.isLowSurrogate(ch) • Character.isHighSurrogate(ch) • Character.isSurrogatePair(high, low) • java.text.Normalizer • Normalizer.normalize(src, form) • Normalizer.isNormalized(src, form)
  54. C# (String) • Characters are Char (16-bit) • Equals() and Compare() can do folding and normalisation • But no direct surrogate support
  55. Chars(index) • Length • CompareTo(str) • Equals(str) • GetEnumerator() • Substring(...) // Operators == and != * Not exhaustive
  56. System.String • String.Compare(s1, s2, opts) • IsNormalized(...) • Normalize(...) • Equals(str, options) • System.CompareOptions • System.Char • Char.ConvertFromUtf32(...) • Char.IsHighSurrogate(...) • Char.IsLowSurrogate(...) • Char.IsSurrogate(...) • Char.IsSurrogatePair(...)
  57. Summing Up • Learn your history and terminology • Pay extra attention when • Indexing • Subscripting (to get a “character”) • Iterating through • Getting a substring • Look for high-level APIs if possible
  58. WHAT IF I AM INTERESTED IN SOMETHING ELSE?