text • Specifies: • A bunch of “characters”, each with a unique numeric identifier — a code point • Encodings for representing code points in bytes • Normalization forms • …and some of other stuff we’ll skip
of poo } U+007D Splendid sideways mustache U+1F3E9 Love hotel Å U+00C5 Latin capital letter A with ring above - U+00AD Soft hyphen (maybe:) U+200B Zero width space U+33AF Squared rad over S squared ۋ U+06CB Arabic letter Ve ⣡ U+28E1 Braille pattern dots-1678 A cursory glance
Plane #0: Basic Multilingual Plane (BMP) • Most commonly used code points • The other planes are collectively called supplementary planes, or astral planes
width encoding • A code point is encoded in 1-4 bytes • Conserves space • Backwards compatible with ASCII • Each valid ASCII string is a valid UTF-8 string
• Variable width encoding • Each unit is two bytes • A code point is encoded in 1-2 units • When two units are used for a code point, this is called a “surrogate pair” • All code points in the BMP require only one unit • Popular with programming language / platform standard libraries
talk about “unicode characters”. Whenever they say this, they mean UTF-16 units. A string object presents itself as an array of Unicode characters … characterAtIndex: Returns the character at a given array position. - (unichar)characterAtIndex:(NSUInteger)index
character) Boolean CFStringIsSurrogateLowCharacter(UniChar character) “These methods should be the default choice for programmatically determining the boundaries of user-perceived characters.” To align an index or a range to grapheme cluster boundaries: To detect UTF-16 surrogate pairs:
that may be present in the beginning of a string stream • Specifies the encoding that is used • So that the recipient doesn’t have to get it from out-of-band information, or worse, infer it • Specifies byte order (endianness) • Except for UTF-8, which has a constant byte order • UTF-16: two bytes • UTF-8: three bytes
does not: [s dataUsingEncoding:NSUTF8StringEncoding] Neither does this: (because -dataUsingEncoding: includes the BOM only for representing endianness)
you get this from some external API: Now, you want to do some fairly involved string manipulation based on that information, so you decide to decode that into an NSString so you can use its API. But the offsets are UTF-8 byte offsets — how do we translate them into UTF-16 unit offsets?
offset problem by splitting the buffer into parts that you decode into NSStrings separately • This abstracts away the encoding implementation and lets NSString deal with it • This depends on being sure that the offsets are aligned to code point boundaries • If they are not, you must align them yourself, which requires understanding the encoding
• As long as you only handle strings with code points within the BMP, the number of UTF-16 units will always equal the number of code points • As soon as you leave the BMP, though, this no longer holds! Watch out for the astral planes!
sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.” When Apple says “character”, they almost always mean “UTF-16 unit”. In their documentation, the article “Characters and Grapheme Clusters” explains these things well:
working with Unicode strings • You just need to: • Understand the different things that “character” can mean in different contexts • Understand the language Apple uses in its documentation • Be aware of some of the pitfalls