Handling Strings Like a Boss

Ali Rantakari
HelsinkiOS / CocoaHeads • February 12, 2013

A string?
“A sequence of characters”: obvious for a human, but not for a computer.

Things that depend on the definition of a “character”
    NSUInteger characterOffset
    NSUInteger characterIndex
    NSUInteger maximumLength
    NSRange interestingRange

A character?
Depending on context, and on who’s talking, a “character” can mean:
• Byte
• UTF-16 unit
• Unicode code point
• Grapheme
(Not a comprehensive list, but these are the ones I’ll be discussing.)

Unicode
• Standard for consistent encoding, representation and handling of text
• Specifies:
  • A bunch of “characters”, each with a unique numeric identifier: a code point
  • Encodings for representing code points in bytes
  • Normalization forms
  • …and some other stuff we’ll skip

Code Points: a cursory glance
• A U+0041 Latin capital letter A
• 💩 U+1F4A9 Pile of poo
• } U+007D Splendid sideways mustache
• 🏩 U+1F3E9 Love hotel
• Å U+00C5 Latin capital letter A with ring above
• - U+00AD Soft hyphen
• (maybe:) U+200B Zero width space
• ㎯ U+33AF Squared rad over S squared
• ۋ U+06CB Arabic letter Ve
• ⣡ U+28E1 Braille pattern dots-1678

Unicode Planes
• Code points are divided into 17 planes
• Plane #0: Basic Multilingual Plane (BMP)
  • Most commonly used code points
• The other planes are collectively called supplementary planes, or astral planes

Unicode Encodings: UTF-8
• Commonly used, pragmatic
• Variable width encoding
  • A code point is encoded in 1-4 bytes
  • Conserves space
• Backwards compatible with ASCII
  • Each valid ASCII string is a valid UTF-8 string
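The variable width is easy to see by encoding a few of the code points from the earlier slide. This is a minimal sketch in Python rather than Objective-C, chosen only because it runs anywhere; the helper name `utf8_width` is made up for this example.

```python
def utf8_width(code_point):
    """Number of bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))

# A few code points from different ranges (labels are informal).
samples = [
    (0x0041, "Latin capital letter A"),
    (0x00C5, "Latin capital letter A with ring above"),
    (0x33AF, "Squared rad over S squared"),
    (0x1F4A9, "Pile of poo"),
]

for code_point, label in samples:
    print(f"U+{code_point:04X} {label}: {utf8_width(code_point)} byte(s)")

# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_is_utf8 = "Hello".encode("ascii") == "Hello".encode("utf-8")
```

Only the last sample lies outside the BMP, and it is the only one that needs all four bytes.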

Unicode Encodings: UTF-32
• Constant width encoding
  • A code point is always encoded in four bytes
  • Wastes space
• Enables random access into string buffers
  • (pointer) + ((code point offset) * 4)
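The offset arithmetic on this slide can be sketched as follows. This is illustrative Python, not part of the talk; `code_point_at` is a hypothetical helper, and the buffer is assumed to be little-endian UTF-32 without a BOM.

```python
import struct

text = "A\U0001F4A9}"               # three code points, the middle one is astral
buffer = text.encode("utf-32-le")   # explicit byte order, so no BOM is written

def code_point_at(buf, index):
    """Read the code point at `index` with plain offset arithmetic."""
    (code_point,) = struct.unpack_from("<I", buf, index * 4)
    return code_point

print(len(buffer))                      # four bytes per code point
print(hex(code_point_at(buffer, 1)))    # the astral code point, found without scanning
```

The same lookup in UTF-8 or UTF-16 would require walking the buffer from the start, because earlier code points may have different widths.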

Unicode Encodings: UTF-16
• Sort of between the previous two
• Variable width encoding
  • Each unit is two bytes
  • A code point is encoded in 1-2 units
  • When two units are used for a code point, this is called a “surrogate pair”
  • All code points in the BMP require only one unit
• Popular with programming language / platform standard libraries
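How a code point outside the BMP becomes a surrogate pair can be sketched in a few lines. This is illustrative Python rather than anything from Foundation; `utf16_units` is a name invented for the example, and it assumes its input is a valid code point that is not itself a surrogate.

```python
def utf16_units(code_point):
    """Return the UTF-16 units (as integers) that encode one code point."""
    if code_point < 0x10000:
        return [code_point]                  # BMP: a single unit
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return [high, low]                       # the surrogate pair

print(utf16_units(0x0041))    # one unit
print(utf16_units(0x2070E))   # two units: a high and a low surrogate
```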

NSString is indexed by UTF-16 units.
Apple’s documentation likes to talk about “Unicode characters”. Whenever they say this, they mean UTF-16 units.
“A string object presents itself as an array of Unicode characters …”
“characterAtIndex: Returns the character at a given array position.”
    - (unichar)characterAtIndex:(NSUInteger)index

Pop Quiz
    STAssertEquals(@"未来水稻".length, 4lu, nil);
Fail or no fail? Great success!

Pop Quiz
    STAssertEquals(@"𠜎𠜱".length, 2lu, nil);
Fail or no fail? Compu’er says “no”.

Break it down
Code points:
• 𠜎 U+2070E CJK unified ideograph 2070E
• 𠜱 U+20731 CJK unified ideograph 20731
UTF-16 units:
• (unichar) 55361, (unichar) 57102
• (unichar) 55361, (unichar) 57137

Pop Quiz
    STAssertEquals(@"Äiti".length, 4lu, nil);
(The Ä in this literal is typed as A followed by U+0308.)
Fail or no fail? Compu’er says “no”.

Break it down (again)
Graphemes: Ä
Code points: A U+0041 Latin capital letter A, ¨ U+0308 Combining diaeresis
UTF-16 units: (unichar) 65, (unichar) 776
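The three quiz results can be reproduced outside of Foundation by counting UTF-16 units directly. A minimal sketch in Python; `utf16_length` is a made-up helper that is meant to mirror what `-length` reports, not Apple's implementation.

```python
def utf16_length(s):
    """Count UTF-16 units, which is what NSString's -length reports."""
    return len(s.encode("utf-16-le")) // 2   # two bytes per unit, no BOM

bmp_only = "未来水稻"                    # four BMP code points
astral = "\U0002070E\U00020731"          # two code points outside the BMP
decomposed = "A\u0308iti"                # A + combining diaeresis, then "iti"
precomposed = "\u00C4iti"                # the single code point U+00C4, then "iti"

for s in (bmp_only, astral, decomposed, precomposed):
    print(repr(s), utf16_length(s))
```

The two spellings of "Äiti" look identical on screen but differ by one unit.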

Combining Characters
• U+0020 Space
• ¨ U+0308 Combining diaeresis
• ᷎ U+1DCE Combining ogonek above
• U+035C Combining double breve below
• ̐ U+0310 Combining Candrabindu
• ̫ U+032B Combining inverted double arch below
Six code points → One grapheme

Combining Characters
Decomposed: A U+0041 Latin capital letter A + ¨ U+0308 Combining diaeresis
Precomposed: Ä U+00C4 Latin capital letter A with diaeresis
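Converting between the two spellings is a standard normalization step. As a hedged illustration in Python (using its `unicodedata` module instead of the Foundation APIs the talk covers), NFC composes and NFD decomposes:

```python
import unicodedata

decomposed = "A\u0308"     # U+0041 followed by U+0308
precomposed = "\u00C4"     # U+00C4

composed_again = unicodedata.normalize("NFC", decomposed)
decomposed_again = unicodedata.normalize("NFD", precomposed)

def code_points(s):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]

print(code_points(composed_again))
print(code_points(decomposed_again))
```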

Grapheme Clusters
• UTF-16 surrogate pairs
• Base code point + combining character code points

Chop chop
    #import <Foundation/Foundation.h>

    // Write a string to stdout as UTF-8, without NSLog's timestamp prefix.
    void Output(NSString *s)
    {
        [s writeToFile:@"/dev/stdout" atomically:NO
              encoding:NSUTF8StringEncoding error:nil];
    }

    int main(int argc, char *argv[])
    {
        @autoreleasepool {
            NSString *s = @"𠜎𠜱 is chinese";
            Output(s);
            Output([NSString stringWithFormat:@" (len %lu)\n", (unsigned long)s.length]);
            NSLog(@"%@", s);

            // Range is in UTF-16 units and starts in the middle of a surrogate pair.
            NSString *sub = [s substringWithRange:NSMakeRange(1,4)];
            Output(sub);
            Output([NSString stringWithFormat:@" (len %lu)\n", (unsigned long)sub.length]);
            NSLog(@"%@", sub);
        }
        return 0;
    }

    $ clang -framework Cocoa test.m -o test && ./test
    𠜎𠜱 is chinese (len 15)
    2013-02-02 18:47:47.342 test[42044:707] 𠜎𠜱 is chinese
     (len 4)
    $
The substring doesn’t get printed out at all (!)

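What went wrong?
    [s substringWithRange:NSMakeRange(1,4)]
Code points: 𠜎 | 𠜱 | (space) | i | s | (space) | c | h | …
UTF-16 units: 55361 57102 | 55361 57137 | 32 | 105 | 115 | 32 | 99 | 104 | …
What we wanted: a range that starts and ends on code point boundaries.
What we got: units 1 through 4 (57102 55361 57137 32), which start with the second half of a surrogate pair.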

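Fix it real good
    [s substringWithRange:
        [s rangeOfComposedCharacterSequencesForRange:
            NSMakeRange(1,4)]]
Code points: 𠜎 | 𠜱 | (space) | i | s | (space) | c | h | …
UTF-16 units: 55361 57102 | 55361 57137 | 32 | 105 | 115 | 32 | 99 | 104 | …
Range: units 1 through 4.
Range of composed character sequences: units 0 through 4, so the split surrogate pair is included whole.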
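The failure and the fix can both be reproduced on a raw array of UTF-16 units. The sketch below is Python and only handles surrogate pairs; unlike `-rangeOfComposedCharacterSequencesForRange:` it does not know about combining marks. All helper names are invented for the example.

```python
import struct

def to_units(s):
    """Encode a string into a list of UTF-16 units (integers)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

def from_units(units):
    """Decode a list of UTF-16 units; raises if a surrogate is unpaired."""
    return struct.pack(f"<{len(units)}H", *units).decode("utf-16-le")

def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF

def align_range(units, start, length):
    """Widen (start, length) so it never splits a surrogate pair."""
    end = start + length
    if 0 < start < len(units) and is_low(units[start]) and is_high(units[start - 1]):
        start -= 1
    if 0 < end < len(units) and is_high(units[end - 1]) and is_low(units[end]):
        end += 1
    return start, end - start

units = to_units("\U0002070E\U00020731 is chinese")

try:
    from_units(units[1:5])            # the naive range (1, 4)
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False                  # a lone low surrogate cannot be decoded

start, length = align_range(units, 1, 4)
fixed = from_units(units[start:start + length])
```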

Dealing with grapheme clusters
To align an index or a range to grapheme cluster boundaries:
    - (NSRange)rangeOfComposedCharacterSequenceAtIndex:(NSUInteger)index;
    - (NSRange)rangeOfComposedCharacterSequencesForRange:(NSRange)range;
“These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.”
To detect UTF-16 surrogate pairs:
    Boolean CFStringIsSurrogateHighCharacter(UniChar character)
    Boolean CFStringIsSurrogateLowCharacter(UniChar character)

This is categorically a good idea
    // Instead of calling `-substringWithRange:`, call:
    //
    - (NSString *) my_substringWithRange:(NSRange)utf16UnitRange
    {
        return [self substringWithRange:
                [self rangeOfComposedCharacterSequencesForRange:utf16UnitRange]];
    }

    // Instead of calling `-characterAtIndex:`, call:
    //
    - (NSString *) my_composedCharacterAtIndex:(NSUInteger)utf16UnitIndex
    {
        return [self substringWithRange:
                [self rangeOfComposedCharacterSequenceAtIndex:utf16UnitIndex]];
    }

Pop Quiz
    NSString *s = @"Hello";
    STAssertEquals([s dataUsingEncoding:NSUTF8StringEncoding].length,
                   [s lengthOfBytesUsingEncoding:NSUTF8StringEncoding],
                   nil);
Fail or no fail? Success.

Pop Quiz
    NSString *s = @"Hello";
    STAssertEquals([s dataUsingEncoding:NSUTF16StringEncoding].length,
                   [s lengthOfBytesUsingEncoding:NSUTF16StringEncoding],
                   nil);
Fail or no fail? It fails.

BOM
• The byte order mark (BOM) is a “metacharacter” that may be present at the beginning of a string stream
• Specifies the encoding that is used
  • So that the recipient doesn’t have to get it from out-of-band information, or worse, infer it
• Specifies byte order (endianness)
  • Except for UTF-8, which has a constant byte order
• UTF-16: two bytes
• UTF-8: three bytes

BOM
This includes the BOM:
    [s dataUsingEncoding:NSUTF16StringEncoding]
This does not:
    [s lengthOfBytesUsingEncoding:NSUTF16StringEncoding]
Neither does this:
    [s dataUsingEncoding:NSUTF8StringEncoding]
(because -dataUsingEncoding: includes the BOM only for representing endianness)
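The same two-byte discrepancy shows up in other environments. As a rough analogy in Python (its codecs are not the Foundation encodings, so treat the mapping as an assumption), the generic "utf-16" codec prepends a BOM while an explicit byte order does not; `bom_length` is a helper invented for the example and only checks the UTF-8 and UTF-16 marks.

```python
import codecs

s = "Hello"
utf16_with_bom = s.encode("utf-16")      # generic codec: BOM + native byte order
utf16_no_bom = s.encode("utf-16-le")     # byte order stated up front: no BOM
utf8_plain = s.encode("utf-8")           # no BOM
utf8_with_bom = s.encode("utf-8-sig")    # UTF-8 with the optional three-byte BOM

def bom_length(data):
    """Length of a leading UTF-8 or UTF-16 BOM, or 0 if there is none."""
    for bom in (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if data.startswith(bom):
            return len(bom)
    return 0

for name, data in [("utf-16", utf16_with_bom), ("utf-16-le", utf16_no_bom),
                   ("utf-8", utf8_plain), ("utf-8-sig", utf8_with_bom)]:
    print(name, len(data), "bytes, BOM:", bom_length(data))
```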

Practicum

Offsets from another world
Let’s say you get this from some external API:
    char *utf8_encoded_string;
    long array_of_interesting_offsets[];
Now, you want to do some fairly involved string manipulation based on that information, so you decide to decode that into an NSString so you can use its API.
But the offsets are UTF-8 byte offsets. How do we translate them into UTF-16 unit offsets?

Offsets from another world
[Diagram: the char buffer is split at the interesting offsets, and each piece is decoded into its own NSString.]

Offsets from another world
• You can handle the byte offset problem by splitting the buffer into parts that you decode into NSStrings separately
• This abstracts away the encoding implementation and lets NSString deal with it
• This depends on being sure that the offsets are aligned to code point boundaries
  • If they are not, you must align them yourself, which requires understanding the encoding
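The splitting approach can be sketched as follows. This is Python, not the NSString-based code the talk has in mind; `utf8_offsets_to_utf16` is a hypothetical helper, and it assumes every offset already sits on a code point boundary (decoding raises an error otherwise).

```python
def utf8_offsets_to_utf16(buffer, byte_offsets):
    """Translate UTF-8 byte offsets into UTF-16 unit offsets.

    The buffer is cut at each offset and every piece is decoded on its own,
    so the decoder does the counting. Results follow sorted offset order.
    """
    result = []
    units_so_far = 0
    previous = 0
    for offset in sorted(byte_offsets):
        piece = buffer[previous:offset].decode("utf-8")   # raises if misaligned
        units_so_far += len(piece.encode("utf-16-le")) // 2
        result.append(units_so_far)
        previous = offset
    return result

# "Äiti 𠜎 ok" with a precomposed Ä (2 bytes) and an astral ideograph (4 bytes).
buffer = "\u00C4iti \U0002070E ok".encode("utf-8")
unit_offsets = utf8_offsets_to_utf16(buffer, [2, 6, 10])

try:
    utf8_offsets_to_utf16(buffer, [1])    # byte 1 is in the middle of Ä
    misaligned_detected = False
except UnicodeDecodeError:
    misaligned_detected = True
```

Letting the decoder fail loudly on a misaligned offset is a deliberate choice here; silently rounding the offset would hide a bug in whatever produced it.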

Length in code points
Okay:
    - (NSUInteger) my_lengthInCodePoints
    {
        NSUInteger numCodePoints = 0;
        NSUInteger len = self.length;
        for (NSUInteger i = 0; i < len; i++)
        {
            unichar u = [self characterAtIndex:i];
            // Count every unit except the second half of a surrogate pair.
            if (CFStringIsSurrogateHighCharacter(u)
                || !CFStringIsSurrogateLowCharacter(u))
                numCodePoints++;
        }
        return numCodePoints;
    }
Better:
    - (NSUInteger) my_lengthInCodePoints
    {
        return [self lengthOfBytesUsingEncoding:NSUTF32StringEncoding] / 4;
    }
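Both counting strategies are easy to cross-check against each other. A small Python sketch under the assumption that the input is well-formed UTF-16; the function names are invented for the example.

```python
import struct

def to_units(s):
    """Encode a string into a list of UTF-16 units (integers)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

def code_points_by_scanning(units):
    """Count every unit except low surrogates (the second half of a pair)."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

def code_points_by_utf32(s):
    """Count code points as the UTF-32 byte length divided by four."""
    return len(s.encode("utf-32-le")) // 4

s = "\U0002070E\U00020731 is chinese"
units = to_units(s)
print(len(units), code_points_by_scanning(units), code_points_by_utf32(s))
```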

Length in grapheme clusters
“how many user-perceived characters”
    - (NSUInteger) xa_graphemeLength
    {
        NSUInteger numGraphemes = 0;
        NSUInteger index = 0;
        NSUInteger len = self.length;
        while (index < len)
        {
            numGraphemes++;
            // Jump to the first unit after the cluster that contains `index`.
            index = NSMaxRange(
                [self rangeOfComposedCharacterSequenceAtIndex:index]);
        }
        return numGraphemes;
    }
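For comparison, here is a deliberately simplified grapheme counter in Python. It only treats code points in the Unicode "Mark" categories as extending the previous cluster, which is an assumption made for this sketch; real segmentation (as done by Foundation) follows more rules, for example for emoji sequences and Hangul.

```python
import unicodedata

def approx_grapheme_length(s):
    """Approximate count of user-perceived characters.

    A code point starts a new cluster unless it is a combining mark
    (general category Mn, Mc or Me). A mark at the very start of the
    string still counts as one cluster.
    """
    count = 0
    for index, ch in enumerate(s):
        if index == 0 or not unicodedata.category(ch).startswith("M"):
            count += 1
    return count

decomposed = "A\u0308iti"
stacked = " \u0308\u1DCE\u035C\u0310\u032B"   # a space carrying five combining marks
astral = "\U0002070E\U00020731"

for s in (decomposed, stacked, astral):
    print(len(s), "code points ->", approx_grapheme_length(s), "cluster(s)")
```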

Enumerating grapheme clusters
“gimme all the user-perceived characters”
    [self enumerateSubstringsInRange:NSMakeRange(0, self.length)
                             options:NSStringEnumerationByComposedCharacterSequences
                          usingBlock:^(NSString *substring, NSRange substringRange,
                                       NSRange enclosingRange, BOOL *stop)
    {
    }];

Normalization
Decomposed: A U+0041, ¨ U+0308, i U+0069, t U+0074, i U+0069
Precomposed: Ä U+00C4, i U+0069, t U+0074, i U+0069
    CFStringNormalize((CFMutableStringRef)myNSString, kCFStringNormalizationFormD)
    - (NSString *)decomposedStringWithCanonicalMapping;
    - (NSString *)precomposedStringWithCanonicalMapping;
    - (NSString *)decomposedStringWithCompatibilityMapping;
    - (NSString *)precomposedStringWithCompatibilityMapping;

Comparing strings
No need to manually normalize and then compare. NSString has got you covered:
    NSString *precomposed = @"\u00C4iti";  // Äiti
    NSString *decomposed = @"A\u0308iti";  // A¨iti
    [precomposed isEqualToString:decomposed];                 // = NO
    [precomposed compare:decomposed];                         // = NSOrderedSame (!)
    [precomposed compare:decomposed options:NSLiteralSearch]; // = NSOrderedDescending

    // If specified, ignores diacritics (o-umlaut == o)
    NSDiacriticInsensitiveSearch
    // If specified, ignores width differences ('a' == U+FF41)
    NSWidthInsensitiveSearch
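In environments without a comparison that understands canonical equivalence, the "normalize, then compare" step has to be done by hand. A minimal Python sketch of that manual route, included only for contrast with what NSString does for you; `canonically_equal` is a made-up name.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00C4iti"    # Äiti with U+00C4
decomposed = "A\u0308iti"    # Äiti with A + U+0308

literal_equal = precomposed == decomposed            # code point by code point
equivalent = canonically_equal(precomposed, decomposed)
print(literal_equal, equivalent)
```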

Transformation fun with CFStringTransform()
• kCFStringTransformToXMLHex: Straße → Stra&#xDF;e
• kCFStringTransformToLatin: 未来水稻 → wèi lái shuǐ dào, спасибо → spasibo
• kCFStringTransformToUnicodeName: 👭 → {TWO WOMEN HOLDING HANDS}
• kCFStringTransformStripDiacritics, kCFStringTransformStripCombiningMarks: Älämölö → Alamolo, Garçon → Garcon
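Two of these transforms have simple analogues that show what is going on underneath. The sketch below is Python and does not call CFStringTransform(); `strip_diacritics` is a hypothetical helper that decomposes the string and drops nonspacing marks, which is an approximation of the stripping transforms rather than their exact behavior.

```python
import unicodedata

def strip_diacritics(s):
    """Decompose, drop nonspacing marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def unicode_name(ch):
    """Look up the Unicode name of a single code point."""
    return unicodedata.name(ch)

print(strip_diacritics("Älämölö"), strip_diacritics("Garçon"))
print(unicode_name("\U0001F46D"))
```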

Summary (⅓)
• Using NSStrings can be pernicious for us:
  • As long as you only handle strings with code points within the BMP, the number of UTF-16 units will always equal the number of code points
  • As soon as you leave the BMP, though, this no longer holds! Watch out for the astral planes!

Summary (⅔)
• When someone says “character”
• When you think “character”
Stop, and think about what it means in that context.

Summary
When Apple says “character”, they almost always mean “UTF-16 unit”.
In their documentation, the article “Characters and Grapheme Clusters” explains these things well:
“It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.”

In Conclusion
• Apple gives us very good tools for working with Unicode strings
• You just need to:
  • Understand the different things that “character” can mean in different contexts
  • Understand the language Apple uses in its documentation
  • Be aware of some of the pitfalls
