Handling Strings Like a Boss

Unicode, and how to deal with it using NSString and friends.

Presented at the HelsinkiOS/CocoaHeads meetup on February 12, 2013.

Ali Rantakari

February 12, 2013

Transcript

  1. Things that depend on the definition of a “character”

    NSUInteger characterOffset
    NSUInteger characterIndex
    NSUInteger maximumLength
    NSRange interestingRange
  2. A character?

    Character = (depending on context, and on who’s talking):
    • Byte
    • UTF-16 unit
    • Unicode code point
    • Grapheme
    (Not a comprehensive list, but these are the ones I’ll be discussing.)
  3. Unicode

    • Standard for consistent encoding, representation and handling of text
    • Specifies:
      • A bunch of “characters”, each with a unique numeric identifier: a code point
      • Encodings for representing code points in bytes
      • Normalization forms
      • …and some other stuff we’ll skip
  4. Code Points: a cursory glance

    A  U+0041  Latin capital letter A
    💩  U+1F4A9  Pile of poo
    }  U+007D  Splendid sideways mustache
    🏩  U+1F3E9  Love hotel
    Å  U+00C5  Latin capital letter A with ring above
    -  U+00AD  Soft hyphen
    (maybe:)  U+200B  Zero width space
    ㎯  U+33AF  Squared rad over S squared
    ۋ  U+06CB  Arabic letter Ve
    ⣡  U+28E1  Braille pattern dots-1678
  5. Unicode Planes

    • Code points are divided into 17 planes
    • Plane #0: Basic Multilingual Plane (BMP)
      • Most commonly used code points
    • The other planes are collectively called supplementary planes, or astral planes
  6. BMP

  7. Unicode Encodings: UTF-8

    • Commonly used, pragmatic
    • Variable width encoding
      • A code point is encoded in 1-4 bytes
      • Conserves space
    • Backwards compatible with ASCII
      • Each valid ASCII string is a valid UTF-8 string
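
The 1-4 byte rule can be sketched in plain C (which also compiles as Objective-C). This is an illustration only, not code from the talk; the helper name `utf8_encode` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the talk): encodes one Unicode code point
 * as UTF-8 into out[0..3]. Returns the number of bytes written (1-4), or 0
 * if cp is a surrogate value or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {                       /* ASCII stays a single byte */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes: rest of the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes: astral planes */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, U+0041 takes one byte (0x41), U+00C4 takes two (0xC3 0x84), and U+1F4A9 takes four (0xF0 0x9F 0x92 0xA9).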
  8. Unicode Encodings: UTF-32

    • Constant width encoding
      • A code point is always encoded in four bytes
      • Wastes space
    • Enables random access into string buffers
      • (pointer) + ((code point offset) * 4)
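
The pointer arithmetic above amounts to a single constant-time read. A minimal sketch, assuming the buffer is in native byte order and the index is in bounds; `utf32_code_point_at` is a hypothetical name, not an API from the talk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads the code point at code point offset `index`
 * from a UTF-32 byte buffer, i.e. the four bytes at bytes + index * 4.
 * Assumes native byte order and that the caller has checked the bounds. */
uint32_t utf32_code_point_at(const unsigned char *bytes, size_t index)
{
    uint32_t cp;
    memcpy(&cp, bytes + index * 4, sizeof cp);
    return cp;
}
```

No scanning is needed, which is exactly what the variable width encodings cannot offer.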
  9. Unicode Encodings: UTF-16

    • Sort of between the previous two
    • Variable width encoding
      • Each unit is two bytes
      • A code point is encoded in 1-2 units
      • When two units are used for a code point, this is called a “surrogate pair”
      • All code points in the BMP require only one unit
    • Popular with programming language / platform standard libraries
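
How a code point becomes one or two units can be sketched in plain C. The function name `utf16_encode` is hypothetical and the code is an illustration, not part of the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16 into out[0..1].
 * Returns the number of units written (1 or 2), or 0 for values that are
 * not encodable (surrogate values, or anything above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {                    /* BMP: one unit, value unchanged */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                         /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+0041 this yields the single unit 65; for U+2070E it yields the surrogate pair 55361, 57102.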
  10. NSString is indexed by UTF-16 units.

    Apple’s documentation likes to talk about “Unicode characters”. Whenever they say this, they mean UTF-16 units.

    “A string object presents itself as an array of Unicode characters …”

    “characterAtIndex: Returns the character at a given array position.”
        - (unichar)characterAtIndex:(NSUInteger)index
  11. Break it down

    Code point:    𠜎 U+2070E CJK Unified ideograph 2070E
    UTF-16 units:  (unichar) 55361, (unichar) 57102

    Code point:    𠜱 U+20731 CJK Unified ideograph 20731
    UTF-16 units:  (unichar) 55361, (unichar) 57137
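
Going the other way, the two units of a surrogate pair can be combined back into a code point. A minimal sketch in plain C; the names are hypothetical and not from the talk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers mirroring what CFStringIsSurrogateHighCharacter and
 * CFStringIsSurrogateLowCharacter check, written out as plain range tests. */
bool utf16_is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool utf16_is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combines a surrogate pair into its code point. The caller is expected to
 * have verified both halves with the predicates above. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    return 0x10000
         + (((uint32_t)high - 0xD800) << 10)
         + ((uint32_t)low - 0xDC00);
}
```

Feeding in the units from this slide, (55361, 57102) gives 0x2070E and (55361, 57137) gives 0x20731.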
  12. Break it down (again)

    Graphemes:     Ä
    Code points:   A U+0041 Latin capital letter A, ¨ U+0308 Combining diaeresis
    UTF-16 units:  (unichar) 65, (unichar) 776
  13. Combining Characters

    ¨  U+0308  Combining diaeresis
    ᷎  U+1DCE  Combining ogonek above
       U+0020  Space
       U+035C  Combining double breve below
    ̐  U+0310  Combining Candrabindu
    ̫  U+032B  Combining inverted double arch below

    Six code points → One grapheme
  14. Combining Characters

    Decomposed:   A U+0041 Latin capital letter A + ¨ U+0308 Combining diaeresis
    Precomposed:  Ä U+00C4 Latin capital letter A with diaeresis
  15. Chop chop

    #import <Foundation/Foundation.h>

    void Output(NSString *s)
    {
        [s writeToFile:@"/dev/stdout"
            atomically:NO
              encoding:NSUTF8StringEncoding
                 error:nil];
    }

    int main(int argc, char *argv[])
    {
        NSString *s = @"𠜎𠜱 is chinese";
        Output(s);
        Output([NSString stringWithFormat:@" (len %lu)\n", s.length]);
        NSLog(@"%@", s);

        NSString *sub = [s substringWithRange:NSMakeRange(1,4)];
        Output(sub);
        Output([NSString stringWithFormat:@" (len %lu)\n", sub.length]);
        NSLog(@"%@", sub);
    }

    $ clang -framework Cocoa test.m -o test && ./test
    𠜎𠜱 is chinese (len 15)
    2013-02-02 18:47:47.342 test[42044:707] 𠜎𠜱 is chinese
     (len 4)
    $

    The substring doesn’t get printed out at all (!)
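  16. What went wrong?

    Code points:   𠜎            𠜱            (space)  i    s    (space)  c   h    …
    UTF-16 units:  55361 57102  55361 57137  32       105  115  32       99  104  …

    [s substringWithRange:NSMakeRange(1,4)]

    [Diagram: one bracket marks “What we wanted” and another marks “What we got”. What we got is units 1 through 4 (57102 55361 57137 32), which starts in the middle of a surrogate pair.]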
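
One plausible reading of the empty output is that the substring is no longer well-formed UTF-16, so it cannot be converted to UTF-8 for printing. The talk does not spell this out, so treat it as an assumption. A plain C sketch of such a well-formedness check, with the hypothetical name `utf16_is_well_formed`:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if every high surrogate in
 * units[0..count) is immediately followed by a low surrogate and no low
 * surrogate appears on its own. */
bool utf16_is_well_formed(const uint16_t *units, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {              /* high surrogate */
            if (i + 1 >= count)
                return false;                          /* pair cut off at the end */
            if (units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF)
                return false;                          /* not followed by a low half */
            i++;                                       /* skip the low half */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;                              /* lone low surrogate */
        }
    }
    return true;
}
```

Applied to the units on this slide, the whole string passes, while the four units starting at index 1 fail because they begin with a lone low surrogate (57102).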
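  17. Fix it real good

    [s substringWithRange:
        [s rangeOfComposedCharacterSequencesForRange:
            NSMakeRange(1,4)]]

    Code points:   𠜎            𠜱            (space)  i    s    (space)  c   h    …
    UTF-16 units:  55361 57102  55361 57137  32       105  115  32       99  104  …

    [Diagram: the original “Range” bracket is widened into the “Range of composed character sequences”, so that it covers whole surrogate pairs.]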
  18. Dealing with grapheme clusters

    To align an index or a range to grapheme cluster boundaries:

        - (NSRange)rangeOfComposedCharacterSequenceAtIndex:(NSUInteger)index;
        - (NSRange)rangeOfComposedCharacterSequencesForRange:(NSRange)range;

    “These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.”

    To detect UTF-16 surrogate pairs:

        Boolean CFStringIsSurrogateHighCharacter(UniChar character)
        Boolean CFStringIsSurrogateLowCharacter(UniChar character)
  19. This is categorically a good idea

    // Instead of calling `-substringWithRange:`, call:
    //
    - (NSString *) my_substringWithRange:(NSRange)utf16UnitRange
    {
        return [self substringWithRange:
                [self rangeOfComposedCharacterSequencesForRange:
                 utf16UnitRange]];
    }

    // Instead of calling `-characterAtIndex:`, call:
    //
    - (NSString *) my_composedCharacterAtIndex:(NSUInteger)utf16UnitIndex
    {
        return [self substringWithRange:
                [self rangeOfComposedCharacterSequenceAtIndex:
                 utf16UnitIndex]];
    }
  20. BOM

    • The byte order mark (BOM) is a “metacharacter” that may be present in the beginning of a string stream
    • Specifies the encoding that is used
      • So that the recipient doesn’t have to get it from out-of-band information, or worse, infer it
    • Specifies byte order (endianness)
      • Except for UTF-8, which has a constant byte order
    • UTF-16: two bytes
    • UTF-8: three bytes
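
Checking a byte buffer for a BOM can be sketched as follows. This is an illustration in plain C, not code from the talk; the enum and function names are hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM starts with the same two bytes as the UTF-16 one.

```c
#include <stddef.h>

/* Hypothetical result type for the sketch below. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE } bom_kind;

/* Hypothetical helper: inspects the first bytes of a buffer for a UTF-8 or
 * UTF-16 byte order mark. UTF-32 BOMs are not handled in this sketch. */
bom_kind detect_bom(const unsigned char *bytes, size_t length)
{
    if (length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return BOM_UTF8;                   /* U+FEFF encoded as UTF-8: three bytes */
    if (length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return BOM_UTF16_BE;               /* U+FEFF, big-endian: two bytes */
    if (length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return BOM_UTF16_LE;               /* U+FEFF, little-endian: two bytes */
    return BOM_NONE;
}
```

A result of `BOM_NONE` does not say which encoding is in use; the recipient then has to rely on out-of-band information.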
  21. BOM

    This includes the BOM:
        [s dataUsingEncoding:NSUTF16StringEncoding]

    This does not:
        [s lengthOfBytesUsingEncoding:NSUTF16StringEncoding]

    Neither does this:
        [s dataUsingEncoding:NSUTF8StringEncoding]
    (because -dataUsingEncoding: includes the BOM only for representing endianness)
  22. Offsets from another world

    Let’s say you get this from some external API:

        char *utf8_encoded_string;
        long array_of_interesting_offsets[];

    Now, you want to do some fairly involved string manipulation based on that information, so you decide to decode that into an NSString so you can use its API.

    But the offsets are UTF-8 byte offsets. How do we translate them into UTF-16 unit offsets?
  23. Offsets from another world

    [Diagram: the char buffer is split at the offsets into separate segments, and each segment is decoded into its own NSString.]
  24. Offsets from another world

    • You can handle the byte offset problem by splitting the buffer into parts that you decode into NSStrings separately
      • This abstracts away the encoding implementation and lets NSString deal with it
    • This depends on being sure that the offsets are aligned to code point boundaries
      • If they are not, you must align them yourself, which requires understanding the encoding
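
If you do have to align offsets yourself, the relevant fact about UTF-8 is that continuation bytes always have the bit pattern 10xxxxxx. The plain C sketch below shows that alignment and, as a different technique from the buffer-splitting approach on the slide, a direct translation of a byte offset into a UTF-16 unit offset. All names are hypothetical and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: moves a byte offset backwards until it points at the
 * first byte of a code point (or at the start of the buffer). */
size_t utf8_align_offset(const char *utf8, size_t byte_offset)
{
    while (byte_offset > 0 && utf8_is_continuation((unsigned char)utf8[byte_offset]))
        byte_offset--;
    return byte_offset;
}

/* Hypothetical helper: returns how many UTF-16 units are needed for the
 * bytes utf8[0..byte_offset). Assumes valid UTF-8 and an aligned offset. */
size_t utf16_offset_for_utf8_offset(const char *utf8, size_t byte_offset)
{
    size_t units = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        unsigned char b = (unsigned char)utf8[i];
        if (utf8_is_continuation(b))
            continue;                      /* already counted with its lead byte */
        units += (b >= 0xF0) ? 2 : 1;      /* 4-byte sequences need a surrogate pair */
    }
    return units;
}
```

Splitting the buffer and letting NSString decode each part, as the slide suggests, avoids having to maintain this kind of encoding-specific code at all.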
  25. Length in code points

    Okay:

    - (NSUInteger) my_lengthInCodePoints
    {
        NSUInteger numCodePoints = 0;
        NSUInteger len = self.length;
        for (NSUInteger i = 0; i < len; i++)
        {
            unichar u = [self characterAtIndex:i];
            if (CFStringIsSurrogateHighCharacter(u)
                || !CFStringIsSurrogateLowCharacter(u))
                numCodePoints++;
        }
        return numCodePoints;
    }

    Better:

    - (NSUInteger) my_lengthInCodePoints
    {
        return [self lengthOfBytesUsingEncoding:NSUTF32StringEncoding] / 4;
    }
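
The “Okay” variant boils down to counting every unit that is not a low surrogate. The same loop over a plain array of units, as a sketch in C without Foundation; the function name is hypothetical and the input is assumed to be well-formed UTF-16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in well-formed UTF-16 by skipping
 * low surrogates, so that each surrogate pair is counted exactly once. */
size_t utf16_count_code_points(const uint16_t *units, size_t count)
{
    size_t code_points = 0;
    for (size_t i = 0; i < count; i++) {
        int is_low_surrogate = units[i] >= 0xDC00 && units[i] <= 0xDFFF;
        if (!is_low_surrogate)
            code_points++;
    }
    return code_points;
}
```

For the first ten units of the example string from the “Chop chop” slide (two surrogate pairs followed by six BMP units) this gives eight code points.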
  26. Length in grapheme clusters

    “how many user-perceived characters”

    - (NSUInteger) xa_graphemeLength
    {
        NSUInteger numGraphemes = 0;
        NSUInteger index = 0;
        NSUInteger len = self.length;
        while (index < len)
        {
            numGraphemes++;
            index = NSMaxRange(
                [self rangeOfComposedCharacterSequenceAtIndex:index]);
        }
        return numGraphemes;
    }
  27. Normalization

    Decomposed:   A U+0041, ¨ U+0308, i U+0069, t U+0074, i U+0069
    Precomposed:  Ä U+00C4, i U+0069, t U+0074, i U+0069

    // Normalizes in place, so myNSString must actually be an NSMutableString:
    CFStringNormalize((CFMutableStringRef)myNSString, kCFStringNormalizationFormD)

    - (NSString *)decomposedStringWithCanonicalMapping;
    - (NSString *)precomposedStringWithCanonicalMapping;
    - (NSString *)decomposedStringWithCompatibilityMapping;
    - (NSString *)precomposedStringWithCompatibilityMapping;
  28. Comparing strings

    No need to manually normalize, and then compare. NSString has got you covered:

    NSString *precomposed = @"Äiti";
    NSString *decomposed = @"Äiti"; // A¨iti

    [precomposed isEqualToString:decomposed]; // = NO
    [precomposed compare:decomposed]; // = NSOrderedSame (!)
    [precomposed compare:decomposed
                 options:NSLiteralSearch]; // = NSOrderedDescending

    // If specified, ignores diacritics (o-umlaut == o)
    NSDiacriticInsensitiveSearch

    // If specified, ignores width differences ('a' == U+FF41)
    NSWidthInsensitiveSearch
  29. Transformation fun with CFStringTransform()

    kCFStringTransformToXMLHex:
        Straße → Stra&#xDF;e

    kCFStringTransformToLatin:
        ະདྷਫ᜞ → wèi lái shuǐ dào
        спасибо → spasibo

    kCFStringTransformToUnicodeName:
        👭 → {TWO WOMEN HOLDING HANDS}

    kCFStringTransformStripDiacritics, kCFStringTransformStripCombiningMarks:
        Älämölö → Alamolo
        Garçon → Garcon
  30. Summary (⅓)

    • Using NSStrings can be pernicious for us:
      • As long as you only handle strings with code points within the BMP, the number of UTF-16 units will always equal the number of code points
      • As soon as you leave the BMP, though, this no longer holds! Watch out for the astral planes!
  31. Summary (⅔)

    • When someone says “character”
    • When you think “character”

    Stop, and think about what it means in that context.
  32. Summary

    When Apple says “character”, they almost always mean “UTF-16 unit”.

    In their documentation, the article “Characters and Grapheme Clusters” explains these things well:

    “It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.”
  33. In Conclusion

    • Apple gives us very good tools for working with Unicode strings
    • You just need to:
      • Understand the different things that “character” can mean in different contexts
      • Understand the language Apple uses in its documentation
      • Be aware of some of the pitfalls