Handling Strings Like a Boss

Slide 1

Slide 1 text

Handling strings like a Boss
Ali Rantakari
HelsinkiOS / CocoaHeads • February 12, 2013

Slide 2

Slide 2 text

A string?
“A sequence of characters”: obvious for a human, but not for a computer.

Slide 3

Slide 3 text

Things that depend on the definition of a “character”:
    NSUInteger characterOffset
    NSUInteger characterIndex
    NSUInteger maximumLength
    NSRange interestingRange

Slide 4

Slide 4 text

A character?
Depending on context, and on who’s talking, a “character” can mean:
• Byte
• UTF-16 unit
• Unicode code point
• Grapheme
(Not a comprehensive list, but these are the ones I’ll be discussing.)

Slide 5

Slide 5 text

Unicode
• Standard for consistent encoding, representation and handling of text
• Specifies:
  • A bunch of “characters”, each with a unique numeric identifier (a code point)
  • Encodings for representing code points in bytes
  • Normalization forms
  • …and some other stuff we’ll skip

Slide 6

Slide 6 text

Code Points: a cursory glance
    A  U+0041   Latin capital letter A
    💩 U+1F4A9  Pile of poo
    }  U+007D   Splendid sideways mustache
    🏩 U+1F3E9  Love hotel
    Å  U+00C5   Latin capital letter A with ring above
    -  U+00AD   Soft hyphen (maybe that is what it looks like)
       U+200B   Zero width space
    ㎯ U+33AF   Squared rad over S squared
    ۋ  U+06CB   Arabic letter Ve
    ⣡  U+28E1   Braille pattern dots-1678

Slide 7

Slide 7 text

Unicode Planes
• Code points are divided into 17 planes
• Plane #0: Basic Multilingual Plane (BMP)
  • Most commonly used code points
• The other planes are collectively called supplementary planes, or astral planes
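As a rough illustration of the plane split, here is a minimal sketch. `MyPlaneOfCodePoint` is a hypothetical helper (not a Foundation API), and it assumes the input is a valid code point in the range U+0000 to U+10FFFF. Each plane holds 65,536 code points, so the plane number is simply the code point shifted right by 16 bits.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Assumes `codePoint` is a valid Unicode code point (0 ... 0x10FFFF).
static NSUInteger MyPlaneOfCodePoint(uint32_t codePoint)
{
    // Each plane spans 0x10000 code points, so the plane index is the
    // value above the low 16 bits. Plane 0 is the BMP.
    return codePoint >> 16;
}
```

With this, U+0041 lands in plane 0 (the BMP), while U+1F4A9 and U+2070E land in planes 1 and 2, both astral.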

Slide 8

Slide 8 text

BMP

Slide 9

Slide 9 text

Unicode Encodings: UTF-8
• Commonly used, pragmatic
• Variable width encoding
  • A code point is encoded in 1-4 bytes
  • Conserves space
• Backwards compatible with ASCII
  • Each valid ASCII string is a valid UTF-8 string
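To make the "1-4 bytes" point concrete, the following sketch computes how many bytes UTF-8 needs for a single code point. `MyUTF8ByteCount` is a hypothetical name, and the function assumes a valid code point that is not a surrogate; it does no validation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Assumes `codePoint` is a valid, non-surrogate code point.
static NSUInteger MyUTF8ByteCount(uint32_t codePoint)
{
    if (codePoint < 0x80)    return 1;  // ASCII range: identical to ASCII
    if (codePoint < 0x800)   return 2;  // e.g. U+00C5 (Å)
    if (codePoint < 0x10000) return 3;  // rest of the BMP
    return 4;                           // supplementary (astral) planes
}
```

The first branch is the ASCII compatibility mentioned above: any code point below U+0080 is stored as the same single byte it has in ASCII.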

Slide 10

Slide 10 text

Unicode Encodings: UTF-32
• Constant width encoding
  • A code point is always encoded in four bytes
  • Wastes space
• Enables random access into string buffers
  • (pointer) + ((code point offset) * 4)

Slide 11

Slide 11 text

Unicode Encodings: UTF-16
• Sort of between the previous two
• Variable width encoding
  • Each unit is two bytes
  • A code point is encoded in 1-2 units
  • When two units are used for a code point, this is called a “surrogate pair”
  • All code points in the BMP require only one unit
• Popular with programming language / platform standard libraries
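The surrogate pair arithmetic can be sketched as follows. `MyEncodeUTF16` is a hypothetical helper written for illustration; it assumes a valid, non-surrogate code point and performs no range checks.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Writes the UTF-16 encoding of `codePoint` into `out` and returns the
// number of units used (1 or 2). Assumes a valid, non-surrogate code point.
static NSUInteger MyEncodeUTF16(uint32_t codePoint, unichar out[2])
{
    if (codePoint < 0x10000) {
        out[0] = (unichar)codePoint;            // BMP: a single unit
        return 1;
    }
    uint32_t v = codePoint - 0x10000;           // 20 bits remain
    out[0] = (unichar)(0xD800 + (v >> 10));     // high surrogate: top 10 bits
    out[1] = (unichar)(0xDC00 + (v & 0x3FF));   // low surrogate: bottom 10 bits
    return 2;
}
```

For U+2070E this yields the units 55361 and 57102, the same values that `-characterAtIndex:` reports for that ideograph.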

Slide 12

Slide 12 text

NSString is indexed by UTF-16 units.
Apple’s documentation likes to talk about “Unicode characters”. Whenever they say this, they mean UTF-16 units:
    “A string object presents itself as an array of Unicode characters …”
    characterAtIndex:
    “Returns the character at a given array position.”
    - (unichar)characterAtIndex:(NSUInteger)index

Slide 13

Slide 13 text

Pop Quiz
    STAssertEquals(@"未来水稻".length, 4lu, nil);
Fail or no fail?
Great success!

Slide 14

Slide 14 text

Pop Quiz
    STAssertEquals(@"𠜎𠜱".length, 2lu, nil);
Fail or no fail?
Compu’er says “no”.

Slide 15

Slide 15 text

Break it down
Code points:
    𠜎 U+2070E CJK Unified ideograph 2070E
    𠜱 U+20731 CJK Unified ideograph 20731
UTF-16 units:
    (unichar) 55361  (unichar) 57102
    (unichar) 55361  (unichar) 57137
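A breakdown like this can be reproduced with a few lines of code. The sketch below uses a hypothetical helper, `MyUTF16Units`, that collects the values returned by `-characterAtIndex:` into an array of NSNumbers.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Returns every UTF-16 unit of `s` as an NSNumber, in order.
static NSArray *MyUTF16Units(NSString *s)
{
    NSMutableArray *units = [NSMutableArray arrayWithCapacity:s.length];
    for (NSUInteger i = 0; i < s.length; i++) {
        // -characterAtIndex: returns a unichar, i.e. one UTF-16 unit
        [units addObject:@([s characterAtIndex:i])];
    }
    return units;
}
```

Passing in the two ideographs above (written with escapes as `@"\U0002070E\U00020731"`) gives four units, which is why `length` reports 4 rather than 2.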

Slide 16

Slide 16 text

Pop Quiz
    STAssertEquals(@"Äiti".length, 4lu, nil);
    (the Ä in this literal is typed as A followed by a combining diaeresis)
Fail or no fail?
Compu’er says “no”.

Slide 17

Slide 17 text

Break it down (again)
Graphemes:
    Ä
Code points:
    A U+0041 Latin capital letter A
    ¨ U+0308 Combining diaeresis
UTF-16 units:
    (unichar) 65  (unichar) 776

Slide 18

Slide 18 text

Combining Characters
Six code points → one grapheme:
    ¨ U+0308 Combining diaeresis
    ᷎ U+1DCE Combining ogonek above
      U+0020 Space
      U+035C Combining double breve below
    ̐ U+0310 Combining Candrabindu
    ̫ U+032B Combining inverted double arch below

Slide 19

Slide 19 text

Combining Characters
Decomposed:
    A U+0041 Latin capital letter A
    ¨ U+0308 Combining diaeresis
Precomposed:
    Ä U+00C4 Latin capital letter A with diaeresis

Slide 20

Slide 20 text

Grapheme Clusters
• UTF-16 surrogate pairs
• Base code point + combining character code points

Slide 21

Slide 21 text

Chop chop

    #import <Cocoa/Cocoa.h>

    // Writes the string to stdout as UTF-8
    void Output(NSString *s)
    {
        [s writeToFile:@"/dev/stdout" atomically:NO
              encoding:NSUTF8StringEncoding error:nil];
    }

    int main(int argc, char *argv[])
    {
        NSString *s = @"𠜎𠜱 is chinese";
        Output(s);
        Output([NSString stringWithFormat:@" (len %lu)\n", (unsigned long)s.length]);
        NSLog(@"%@", s);

        // Units 1-4: starts in the middle of the first surrogate pair
        NSString *sub = [s substringWithRange:NSMakeRange(1,4)];
        Output(sub);
        Output([NSString stringWithFormat:@" (len %lu)\n", (unsigned long)sub.length]);
        NSLog(@"%@", sub);
    }

    $ clang -framework Cocoa test.m -o test && ./test
    𠜎𠜱 is chinese (len 15)
    2013-02-02 18:47:47.342 test[42044:707] 𠜎𠜱 is chinese
     (len 4)
    $

The substring doesn’t get printed out at all (!)

Slide 22

Slide 22 text

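What went wrong?
    [s substringWithRange:NSMakeRange(1,4)]
Code points:   𠜎            𠜱            ␠   i    s    ␠   c   h    …
UTF-16 units:  55361 57102  55361 57137  32  105  115  32  99  104  …
What we wanted: a range that covers whole code points.
What we got: units 1-4 (57102 55361 57137 32), which starts in the middle of the first surrogate pair.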

Slide 23

Slide 23 text

Fix it real good
    [s substringWithRange:
        [s rangeOfComposedCharacterSequencesForRange:
            NSMakeRange(1,4)]]
Code points:   𠜎            𠜱            ␠   i    s    ␠   c   h    …
UTF-16 units:  55361 57102  55361 57137  32  105  115  32  99  104  …
Range: units 1-4 (57102 55361 57137 32)
Range of composed character sequences: units 0-4 (55361 57102 55361 57137 32)

Slide 24

Slide 24 text

Dealing with grapheme clusters
To align an index or a range to grapheme cluster boundaries:
    - (NSRange)rangeOfComposedCharacterSequenceAtIndex:(NSUInteger)index;
    - (NSRange)rangeOfComposedCharacterSequencesForRange:(NSRange)range;
“These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.”
To detect UTF-16 surrogate pairs:
    Boolean CFStringIsSurrogateHighCharacter(UniChar character)
    Boolean CFStringIsSurrogateLowCharacter(UniChar character)

Slide 25

Slide 25 text

This is categorically a good idea

    // In a category on NSString:

    // Instead of calling `-substringWithRange:`, call:
    //
    - (NSString *) my_substringWithRange:(NSRange)utf16UnitRange
    {
        return [self substringWithRange:
                [self rangeOfComposedCharacterSequencesForRange:
                 utf16UnitRange]];
    }

    // Instead of calling `-characterAtIndex:`, call:
    //
    - (NSString *) my_composedCharacterAtIndex:(NSUInteger)utf16UnitIndex
    {
        return [self substringWithRange:
                [self rangeOfComposedCharacterSequenceAtIndex:
                 utf16UnitIndex]];
    }

Slide 26

Slide 26 text

Pop Quiz
    NSString *s = @"Hello";
    STAssertEquals([s dataUsingEncoding:NSUTF8StringEncoding].length,
                   [s lengthOfBytesUsingEncoding:NSUTF8StringEncoding],
                   nil);
Fail or no fail?
Success.

Slide 27

Slide 27 text

Pop Quiz
    NSString *s = @"Hello";
    STAssertEquals([s dataUsingEncoding:NSUTF16StringEncoding].length,
                   [s lengthOfBytesUsingEncoding:NSUTF16StringEncoding],
                   nil);
Fail or no fail?
It fails.

Slide 28

Slide 28 text

BOM
• The byte order mark (BOM) is a “metacharacter” that may be present at the beginning of a string stream
• Specifies the encoding that is used
  • So that the recipient doesn’t have to get it from out-of-band information, or worse, infer it
• Specifies byte order (endianness)
  • Except for UTF-8, which has a constant byte order
• UTF-16: two bytes
• UTF-8: three bytes

Slide 29

Slide 29 text

BOM
This includes the BOM:
    [s dataUsingEncoding:NSUTF16StringEncoding]
This does not:
    [s lengthOfBytesUsingEncoding:NSUTF16StringEncoding]
Neither does this:
    [s dataUsingEncoding:NSUTF8StringEncoding]
(because -dataUsingEncoding: includes the BOM only for representing endianness)
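One way to see the BOM for yourself is to look at the first two bytes of the encoded data. The following is a minimal sketch; `MyHasUTF16BOM` is a hypothetical helper that only recognizes the two UTF-16 byte order marks (FF FE and FE FF).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Returns YES if `data` begins with a UTF-16 byte order mark,
// in either byte order.
static BOOL MyHasUTF16BOM(NSData *data)
{
    if (data.length < 2) return NO;
    const uint8_t *bytes = data.bytes;
    BOOL littleEndian = (bytes[0] == 0xFF && bytes[1] == 0xFE);
    BOOL bigEndian    = (bytes[0] == 0xFE && bytes[1] == 0xFF);
    return littleEndian || bigEndian;
}
```

If the BOM is unwanted, the explicit-endian encodings (`NSUTF16LittleEndianStringEncoding`, `NSUTF16BigEndianStringEncoding`) state the byte order up front, so the data they produce carries no BOM.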

Slide 30

Slide 30 text

Practicum

Slide 31

Slide 31 text

Offsets from another world
Let’s say you get this from some external API:
    char *utf8_encoded_string;
    long array_of_interesting_offsets[];
Now, you want to do some fairly involved string manipulation based on that information, so you decide to decode that into an NSString so you can use its API.
But the offsets are UTF-8 byte offsets. How do we translate them into UTF-16 unit offsets?

Slide 32

Slide 32 text

Offsets from another world
[Diagram: the char buffer is split into parts at the offsets, and each part is decoded into a separate NSString.]

Slide 33

Slide 33 text

Offsets from another world
• You can handle the byte offset problem by splitting the buffer into parts that you decode into NSStrings separately
  • This abstracts away the encoding implementation and lets NSString deal with it
• This depends on being sure that the offsets are aligned to code point boundaries
  • If they are not, you must align them yourself, which requires understanding the encoding
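A variation on the splitting idea is to decode only the part of the buffer before an offset and ask the resulting NSString for its length. The sketch below assumes ARC and uses a hypothetical helper name, `MyUTF16OffsetForUTF8Offset`; it relies on `-initWithBytes:length:encoding:` returning nil when the bytes are not valid for the encoding, and it decodes the prefix again for every offset, so it is linear in the offset each time.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation. Assumes ARC.
// Translates a UTF-8 byte offset into a UTF-16 unit offset by decoding
// the first `byteOffset` bytes and measuring the result.
// Returns NSNotFound if that prefix cannot be decoded as UTF-8.
static NSUInteger MyUTF16OffsetForUTF8Offset(const char *utf8, NSUInteger byteOffset)
{
    NSString *prefix = [[NSString alloc] initWithBytes:utf8
                                                length:byteOffset
                                              encoding:NSUTF8StringEncoding];
    if (prefix == nil) return NSNotFound;
    return prefix.length;   // -length counts UTF-16 units
}
```

For example, in the UTF-8 buffer `"\xF0\xA0\x9C\x8E is"` (U+2070E followed by " is"), byte offset 4 corresponds to UTF-16 offset 2, because the four-byte sequence decodes to one surrogate pair.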

Slide 34

Slide 34 text

Length in code points

Okay:

    - (NSUInteger) my_lengthInCodePoints
    {
        NSUInteger numCodePoints = 0;
        NSUInteger len = self.length;
        for (NSUInteger i = 0; i < len; i++)
        {
            unichar u = [self characterAtIndex:i];
            // Count every unit except the second half of a surrogate pair
            if (CFStringIsSurrogateHighCharacter(u)
                || !CFStringIsSurrogateLowCharacter(u))
                numCodePoints++;
        }
        return numCodePoints;
    }

Better:

    - (NSUInteger) my_lengthInCodePoints
    {
        return [self lengthOfBytesUsingEncoding:NSUTF32StringEncoding] / 4;
    }

Slide 35

Slide 35 text

Length in grapheme clusters
“how many user-perceived characters”

    - (NSUInteger) xa_graphemeLength
    {
        NSUInteger numGraphemes = 0;
        NSUInteger index = 0;
        NSUInteger len = self.length;
        while (index < len)
        {
            numGraphemes++;
            // Jump to the first unit after the current grapheme cluster
            index = NSMaxRange(
                [self rangeOfComposedCharacterSequenceAtIndex:index]);
        }
        return numGraphemes;
    }

Slide 36

Slide 36 text

Enumerating grapheme clusters
“gimme all the user-perceived characters”

    [self enumerateSubstringsInRange:NSMakeRange(0, self.length)
                             options:NSStringEnumerationByComposedCharacterSequences
                          usingBlock:^(NSString *substring,
                                       NSRange substringRange,
                                       NSRange enclosingRange,
                                       BOOL *stop)
    {
    }];
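As a small usage sketch, the enumeration can be wrapped in a function that collects each grapheme cluster into an array. `MyGraphemeClusters` is a hypothetical name chosen for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Returns the grapheme clusters of `s` as an array of substrings.
static NSArray *MyGraphemeClusters(NSString *s)
{
    NSMutableArray *clusters = [NSMutableArray array];
    [s enumerateSubstringsInRange:NSMakeRange(0, s.length)
                          options:NSStringEnumerationByComposedCharacterSequences
                       usingBlock:^(NSString *substring,
                                    NSRange substringRange,
                                    NSRange enclosingRange,
                                    BOOL *stop)
    {
        // `substring` is one user-perceived character, possibly made of
        // several UTF-16 units
        [clusters addObject:substring];
    }];
    return clusters;
}
```

For the decomposed string `@"A\u0308iti"` this returns four substrings, the first of which has a `length` of 2.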

Slide 37

Slide 37 text

Normalization
Decomposed:
    A U+0041   ¨ U+0308   i U+0069   t U+0074   i U+0069
Precomposed:
    Ä U+00C4   i U+0069   t U+0074   i U+0069

    CFStringNormalize((CFMutableStringRef)myNSString, kCFStringNormalizationFormD)

    - (NSString *)decomposedStringWithCanonicalMapping;
    - (NSString *)precomposedStringWithCanonicalMapping;
    - (NSString *)decomposedStringWithCompatibilityMapping;
    - (NSString *)precomposedStringWithCompatibilityMapping;
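To show what the canonical mapping methods do in practice, here is a minimal sketch that normalizes two strings to the precomposed form before comparing them literally. `MyCanonicallyEqual` is a hypothetical helper, written only to illustrate the effect of normalization.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Brings both strings to the precomposed canonical form, then compares
// them unit by unit.
static BOOL MyCanonicallyEqual(NSString *a, NSString *b)
{
    NSString *na = [a precomposedStringWithCanonicalMapping];
    NSString *nb = [b precomposedStringWithCanonicalMapping];
    return [na isEqualToString:nb];
}
```

With `@"\u00C4iti"` (precomposed, 4 units) and `@"A\u0308iti"` (decomposed, 5 units), a plain `-isEqualToString:` reports NO, while this helper reports YES because both sides normalize to the same four units.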

Slide 38

Slide 38 text

Comparing strings
No need to manually normalize and then compare; NSString has got you covered:

    NSString *precomposed = @"Äiti";
    NSString *decomposed = @"A\u0308iti"; // A¨iti

    [precomposed isEqualToString:decomposed]; // = NO
    [precomposed compare:decomposed]; // = NSOrderedSame (!)
    [precomposed compare:decomposed options:NSLiteralSearch]; // = NSOrderedDescending

    // If specified, ignores diacritics (o-umlaut == o)
    NSDiacriticInsensitiveSearch
    // If specified, ignores width differences ('a' == U+FF41)
    NSWidthInsensitiveSearch

Slide 39

Slide 39 text

Transformation fun with CFStringTransform()
kCFStringTransformToXMLHex:
    Straße → Stra&#xDF;e
kCFStringTransformToLatin:
    未来水稻 → wèi lái shuǐ dào
    спасибо → spasibo
kCFStringTransformToUnicodeName:
    👭 → {TWO WOMEN HOLDING HANDS}
kCFStringTransformStripDiacritics, kCFStringTransformStripCombiningMarks:
    Älämölö → Alamolo
    Garçon → Garcon
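Calling the function looks roughly like this. The sketch wraps `CFStringTransform()` with `kCFStringTransformStripDiacritics` in a hypothetical helper named `MyStripDiacritics`; since the transform works in place, the helper makes a mutable copy first so the caller's string stays untouched.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of Foundation.
// Returns a copy of `s` with diacritics stripped.
static NSString *MyStripDiacritics(NSString *s)
{
    // CFStringTransform() mutates its argument, so work on a copy
    NSMutableString *result = [s mutableCopy];
    // NULL range = transform the whole string; false = forward direction
    CFStringTransform((__bridge CFMutableStringRef)result,
                      NULL,
                      kCFStringTransformStripDiacritics,
                      false);
    return result;
}
```

The other transforms listed above should work the same way by swapping the transform identifier.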

Slide 40

Slide 40 text

Summary (⅓)
• Using NSStrings can be pernicious for us:
  • As long as you only handle strings with code points within the BMP, the number of UTF-16 units will always equal the number of code points
  • As soon as you leave the BMP, though, this no longer holds!
Watch out for the astral planes!

Slide 41

Slide 41 text

Summary (⅔)
• When someone says “character”
• When you think “character”
Stop, and think about what it means in that context.

Slide 42

Slide 42 text

Summary
When Apple says “character”, they almost always mean “UTF-16 unit”.
In their documentation, the article “Characters and Grapheme Clusters” explains these things well:
“It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.”

Slide 43

Slide 43 text

In Conclusion
• Apple gives us very good tools for working with Unicode strings
• You just need to:
  • Understand the different things that “character” can mean in different contexts
  • Understand the language Apple uses in its documentation
  • Be aware of some of the pitfalls