You Don't Know Jack About Unicode

1 October 2012 | Jason Garman, Kyrus Technology

• Have seen a lot of questions about Unicode and character encoding
• Frequently seen as black magic… but it really isn’t!
• Let’s start with a brief history…

Codepage 1252 (each cell: character, then decimal value; n/a = unassigned)

Code  -0        -1        -2        -3        -4        -5        -6        -7        -8        -9        -A        -B        -C        -D        -E        -F
0x0_  NUL 0     SOH 1     STX 2     ETX 3     EOT 4     ENQ 5     ACK 6     BEL 7     BS 8      HT 9      LF 10     VT 11     FF 12     CR 13     SO 14     SI 15
0x1_  DLE 16    DC1 17    DC2 18    DC3 19    DC4 20    NAK 21    SYN 22    ETB 23    CAN 24    EM 25     SUB 26    ESC 27    FS 28     GS 29     RS 30     US 31
0x2_  SP 32     ! 33      " 34      # 35      $ 36      % 37      & 38      ' 39      ( 40      ) 41      * 42      + 43      , 44      - 45      . 46      / 47
0x3_  0 48      1 49      2 50      3 51      4 52      5 53      6 54      7 55      8 56      9 57      : 58      ; 59      < 60      = 61      > 62      ? 63
0x4_  @ 64      A 65      B 66      C 67      D 68      E 69      F 70      G 71      H 72      I 73      J 74      K 75      L 76      M 77      N 78      O 79
0x5_  P 80      Q 81      R 82      S 83      T 84      U 85      V 86      W 87      X 88      Y 89      Z 90      [ 91      \ 92      ] 93      ^ 94      _ 95
0x6_  ` 96      a 97      b 98      c 99      d 100     e 101     f 102     g 103     h 104     i 105     j 106     k 107     l 108     m 109     n 110     o 111
0x7_  p 112     q 113     r 114     s 115     t 116     u 117     v 118     w 119     x 120     y 121     z 122     { 123     | 124     } 125     ~ 126     DEL 127
0x8_  € 128     n/a 129   ‚ 130     ƒ 131     „ 132     … 133     † 134     ‡ 135     ˆ 136     ‰ 137     Š 138     ‹ 139     Œ 140     n/a 141   Ž 142     n/a 143
0x9_  n/a 144   ‘ 145     ’ 146     “ 147     ” 148     • 149     – 150     — 151     ˜ 152     ™ 153     š 154     › 155     œ 156     n/a 157   ž 158     Ÿ 159
0xA_  NBSP 160  ¡ 161     ¢ 162     £ 163     ¤ 164     ¥ 165     ¦ 166     § 167     ¨ 168     © 169     ª 170     « 171     ¬ 172     SHY 173   ® 174     ¯ 175
0xB_  ° 176     ± 177     ² 178     ³ 179     ´ 180     µ 181     ¶ 182     · 183     ¸ 184     ¹ 185     º 186     » 187     ¼ 188     ½ 189     ¾ 190     ¿ 191
0xC_  À 192     Á 193     Â 194     Ã 195     Ä 196     Å 197     Æ 198     Ç 199     È 200     É 201     Ê 202     Ë 203     Ì 204     Í 205     Î 206     Ï 207
0xD_  Ð 208     Ñ 209     Ò 210     Ó 211     Ô 212     Õ 213     Ö 214     × 215     Ø 216     Ù 217     Ú 218     Û 219     Ü 220     Ý 221     Þ 222     ß 223
0xE_  à 224     á 225     â 226     ã 227     ä 228     å 229     æ 230     ç 231     è 232     é 233     ê 234     ë 235     ì 236     í 237     î 238     ï 239
0xF_  ð 240     ñ 241     ò 242     ó 243     ô 244     õ 245     ö 246     ÷ 247     ø 248     ù 249     ú 250     û 251     ü 252     ý 253     þ 254     ÿ 255

© Carl Valentin GmbH

Codepage 1251 (each cell: character, then decimal value; n/a = unassigned)

Code  -0        -1        -2        -3        -4        -5        -6        -7        -8        -9        -A        -B        -C        -D        -E        -F
0x0_  NUL 0     SOH 1     STX 2     ETX 3     EOT 4     ENQ 5     ACK 6     BEL 7     BS 8      HT 9      LF 10     VT 11     FF 12     CR 13     SO 14     SI 15
0x1_  DLE 16    DC1 17    DC2 18    DC3 19    DC4 20    NAK 21    SYN 22    ETB 23    CAN 24    EM 25     SUB 26    ESC 27    FS 28     GS 29     RS 30     US 31
0x2_  SP 32     ! 33      " 34      # 35      $ 36      % 37      & 38      ' 39      ( 40      ) 41      * 42      + 43      , 44      - 45      . 46      / 47
0x3_  0 48      1 49      2 50      3 51      4 52      5 53      6 54      7 55      8 56      9 57      : 58      ; 59      < 60      = 61      > 62      ? 63
0x4_  @ 64      A 65      B 66      C 67      D 68      E 69      F 70      G 71      H 72      I 73      J 74      K 75      L 76      M 77      N 78      O 79
0x5_  P 80      Q 81      R 82      S 83      T 84      U 85      V 86      W 87      X 88      Y 89      Z 90      [ 91      \ 92      ] 93      ^ 94      _ 95
0x6_  ` 96      a 97      b 98      c 99      d 100     e 101     f 102     g 103     h 104     i 105     j 106     k 107     l 108     m 109     n 110     o 111
0x7_  p 112     q 113     r 114     s 115     t 116     u 117     v 118     w 119     x 120     y 121     z 122     { 123     | 124     } 125     ~ 126     DEL 127
0x8_  Ђ 128     Ѓ 129     ‚ 130     ѓ 131     „ 132     … 133     † 134     ‡ 135     € 136     ‰ 137     Љ 138     ‹ 139     Њ 140     Ќ 141     Ћ 142     Џ 143
0x9_  ђ 144     ‘ 145     ’ 146     “ 147     ” 148     • 149     – 150     — 151     n/a 152   ™ 153     љ 154     › 155     њ 156     ќ 157     ћ 158     џ 159
0xA_  NBSP 160  Ў 161     ў 162     Ј 163     ¤ 164     Ґ 165     ¦ 166     § 167     Ё 168     © 169     Є 170     « 171     ¬ 172     SHY 173   ® 174     Ї 175
0xB_  ° 176     ± 177     І 178     і 179     ґ 180     µ 181     ¶ 182     · 183     ё 184     № 185     є 186     » 187     ј 188     Ѕ 189     ѕ 190     ї 191
0xC_  А 192     Б 193     В 194     Г 195     Д 196     Е 197     Ж 198     З 199     И 200     Й 201     К 202     Л 203     М 204     Н 205     О 206     П 207
0xD_  Р 208     С 209     Т 210     У 211     Ф 212     Х 213     Ц 214     Ч 215     Ш 216     Щ 217     Ъ 218     Ы 219     Ь 220     Э 221     Ю 222     Я 223
0xE_  а 224     б 225     в 226     г 227     д 228     е 229     ж 230     з 231     и 232     й 233     к 234     л 235     м 236     н 237     о 238     п 239
0xF_  р 240     с 241     т 242     у 243     ф 244     х 245     ц 246     ч 247     ш 248     щ 249     ъ 250     ы 251     ь 252     э 253     ю 254     я 255

© Carl Valentin GmbH
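To make the code page problem concrete, here is a small Python sketch that decodes the same byte under both tables. It assumes only Python's built-in `cp1252` and `cp1251` codecs; the variable names are illustrative.

```python
# One byte, two code pages: the meaning depends entirely on which table is used.
raw = bytes([0xC0])  # decimal 192 in both tables

as_1252 = raw.decode("cp1252")  # Western European code page
as_1251 = raw.decode("cp1251")  # Cyrillic code page

print(as_1252)             # À (LATIN CAPITAL LETTER A WITH GRAVE)
print(as_1251)             # А (CYRILLIC CAPITAL LETTER A)
print(as_1252 == as_1251)  # False
```

The byte never changed; only the assumed table did. That ambiguity is what Unicode was designed to remove.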

• Unicode isn’t “16 bits”
• It isn’t an encoding; you can’t say “that string is in Unicode”
• It’s not a black art
• It’s not particularly difficult to deal with

• Well… I’m simplifying a bit, but yes
• It’s now published electronically, for example
• But the basic idea is that you map a numeric code point to every character that you wish to represent in a computer system
• There are pages and pages mapping characters to a number (a Unicode code point)
  - I wonder where the term “code pages” came from?
• Each character also has properties, such as a name, its case (lower/upper), and more
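As a quick illustration of the "character plus properties" idea, the sketch below looks up a few entries using Python's standard `unicodedata` module. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch):
    """Return (code point, official name, general category) for one character."""
    return ord(ch), unicodedata.name(ch), unicodedata.category(ch)

for ch in ("A", "a", "\u2603"):
    code_point, name, category = describe(ch)
    # Category "Lu" = uppercase letter, "Ll" = lowercase letter, "So" = other symbol
    print(f"U+{code_point:04X}  {name}  {category}")
```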

• Unicode did begin life with only 65k characters (“16 bits”)
  - … back in 1991
  - Only five years later, the Unicode Consortium realized this was insufficient and expanded the maximum number of Unicode characters
  - Yet the misconception remains
• Today, Unicode has space to encode 1,114,112 code points (U+0000 through U+10FFFF)

• Unicode is divided into seventeen “planes” (numbered 0 through 16) of 65,536 code points each
• Plane zero (U+0000 – U+FFFF) is designated the Basic Multilingual Plane (BMP) and contains the vast majority of the modern scripts
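The plane arithmetic is easy to check. In this sketch (constant and function names are my own), the plane number is simply the code point shifted right by 16 bits, and 17 planes of 65,536 code points give 1,114,112 code points in total, the highest being U+10FFFF (decimal 1,114,111).

```python
PLANE_SIZE = 0x10000       # 65,536 code points per plane
PLANE_COUNT = 17           # planes 0 through 16
MAX_CODE_POINT = 0x10FFFF  # highest code point, decimal 1,114,111

def plane_of(ch):
    """Return the plane number (0..16) that a single character lives in."""
    return ord(ch) >> 16

print(PLANE_SIZE * PLANE_COUNT)  # 1114112
print(plane_of("\u2603"))        # 0 (SNOWMAN is in the BMP)
print(plane_of("\U0001F600"))    # 1 (GRINNING FACE is in plane 1)
```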

• There have been six major revisions of Unicode since its original release in 1991
• The latest version of Unicode, 6.2.0 (released September 2012), exists for basically one reason:
• A new currency symbol!

This slide brought to you by U+20BA, TURKISH LIRA SYMBOL

• Code Point: An abstract number associated with a character. Typically expressed as U+<hex number>, written with at least four hex digits; code points outside the BMP need five or six.
• For example U+0041 == LATIN CAPITAL LETTER A
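A minimal sketch of the U+ notation in Python, where `ord` and `chr` convert between a character and its code point. The helper name `u_plus` is hypothetical.

```python
def u_plus(ch):
    """Format one character's code point in U+ notation (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

print(u_plus("A"))           # U+0041
print(chr(0x0041))           # A
print(u_plus("\U0001F600"))  # U+1F600 (outside the BMP, so five digits)
```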

• Character: “The smallest component of written language that has semantic value” (Unicode standard language)... bear with me, though
• Glyph: A visual representation of a character in a given font

• So for example, the character “a” (U+0061, LATIN SMALL LETTER A) can be represented as the following glyphs (each set in a different font on the original slide): a a a a a a a

• Well… what’s it look like in WinHex?
• Let’s take the character ‘☃’, for example (SNOWMAN, U+2603)

• The point is: a Unicode character and code point is simply an abstract construct…
• In order to actually physically represent a Unicode character in memory, on the network, or on disk, you must use an encoding
• Defined Unicode encoding schemes:
  - UTF-8 (“Unicode Transformation Format”)
  - UTF-16
  - UTF-32

• The only fixed-width encoding is UTF-32, which uses 4 bytes per Unicode character
• UTF-8 and UTF-16 are variable-width encodings
• UTF-8 encodes all Unicode code points using one to four bytes
  - Backwards compatible with ASCII: 0x00-0x7F map directly to the ASCII equivalents
  - No embedded NULs (other than U+0000 itself); compatible with C strings
  - Used when space/speed is at a premium
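The variable width of UTF-8 is easy to see by encoding a few characters. This sketch uses only `str.encode`; the sample characters are my own choice, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
samples = ["A", "\u00e9", "\u2603", "\U0001F600"]  # A, é, ☃, 😀

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")
    # UTF-8 grows from one to four bytes; UTF-32 is always four bytes.
    print(f"U+{ord(ch):04X}  utf-8: {utf8.hex(' '):<12} ({len(utf8)} bytes)  utf-32: {len(utf32)} bytes")
```

The first sample shows the ASCII compatibility: `A` is the single byte `41` in both ASCII and UTF-8.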

• UTF-16 encodes all Unicode code points using two or four bytes
  - If a character cannot be encoded in two bytes (it is outside of the BMP), then a surrogate pair is used
  - The first surrogate, or high surrogate, is a 16-bit code unit in the range D800 – DBFF
  - The second surrogate, or low surrogate, is a 16-bit code unit in the range DC00 – DFFF
  - Used widely throughout Windows & Java
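The surrogate pair calculation can be sketched by hand: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogates` is made up for illustration; the last line cross-checks the result against Python's own UTF-16 codec.

```python
def to_surrogates(code_point):
    """Split a code point outside the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits, lands in D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits, lands in DC00..DFFF
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")                    # D83D DE00
print("\U0001F600".encode("utf-16-be").hex(" "))  # d8 3d de 00
```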

• So how do we represent ☃ (U+2603) again?
• UTF-32 is usually represented big-endian:
  - 00 00 26 03
• UTF-16 can be represented little- or big-endian
  - Windows will use little-endian by default
  - The standard says big-endian by default
  - So… most applications will prepend the BOM (Byte Order Mark), U+FEFF, which is a reserved Unicode character
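Here is the snowman in each byte order, as a short sketch using Python's built-in codecs and the `codecs.BOM_UTF16_LE` constant. The variable names are illustrative.

```python
import codecs

snowman = "\u2603"

print(snowman.encode("utf-32-be").hex(" "))  # 00 00 26 03
print(snowman.encode("utf-16-be").hex(" "))  # 26 03
print(snowman.encode("utf-16-le").hex(" "))  # 03 26

# A leading BOM tells the reader which byte order follows.
data = codecs.BOM_UTF16_LE + snowman.encode("utf-16-le")
print(data.hex(" "))          # ff fe 03 26
print(data.decode("utf-16"))  # ☃ (the generic "utf-16" codec reads and strips the BOM)
```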

bits               bytes  encoding      characters
11000100 01000010  c4 42  Windows-1252  ÄB
11000100 01000010  c4 42  Mac Roman     ƒB
11000100 01000010  c4 42  GB18030       腂
11000100 01000010  c4 42  KOI8-R        дB
11000100 01000010  c4 42  UTF-16        쑂
11000100 01000010  c4 42  UTF-8         invalid sequence
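The table can be reproduced roughly as follows. This sketch assumes Python's codec names for these encodings (`cp1252`, `mac_roman`, `gb18030`, `koi8_r`) and reads UTF-16 as big-endian; `decode_as` is a hypothetical helper.

```python
def decode_as(raw, encoding):
    """Decode raw bytes with the given codec; return None if they are invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

raw = bytes([0xC4, 0x42])

for encoding in ("cp1252", "mac_roman", "gb18030", "koi8_r", "utf-16-be", "utf-8"):
    text = decode_as(raw, encoding)
    print(f"{encoding:10} {text if text is not None else 'invalid sequence'}")
```

Same two bytes, six different answers: the bytes alone never say which encoding produced them.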

This slide brought to you by U+202D, LEFT-TO-RIGHT OVERRIDE, and u'\u202Etpp.stohsnee\u202Dfunny.scr'
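One hedged way to defend against this kind of file name trick is to look for bidirectional control characters before trusting what a name appears to say. The set below and both helper names are my own choices for this sketch, not a complete defense.

```python
# Explicit bidirectional formatting characters that can reorder displayed text.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def has_bidi_controls(name):
    """Return True if the name contains any bidirectional control character."""
    return any(ch in BIDI_CONTROLS for ch in name)

def show_escaped(name):
    """Render the name with non-ASCII characters as escapes, so nothing is hidden."""
    return name.encode("unicode_escape").decode("ascii")

suspicious = "\u202etpp.stohsnee\u202dfunny.scr"
print(has_bidi_controls(suspicious))    # True
print(show_escaped(suspicious))         # \u202etpp.stohsnee\u202dfunny.scr
print(has_bidi_controls("report.ppt"))  # False
```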

• Ensure that all software components agree on your encoding
  - For example, create databases with UTF-8 encoding; the default is often “Latin-1”
• Centralize your Unicode handling
  - If you’re handling files, for example, see if you can use the hash of the file as a surrogate identifier, with a table mapping hashes to file names
  - Simplifies handling when external scripts don’t “need” access to the original Unicode filename
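The hash-as-identifier idea might look like the sketch below. SHA-256, the in-memory dict, and the function names are all assumptions for illustration; a real system would more likely keep the mapping in a database table.

```python
import hashlib
import tempfile
from pathlib import Path

def file_id(path):
    """Return the SHA-256 hex digest of the file's contents (always plain ASCII)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def register(path, names_by_id):
    """Record the original (possibly non-ASCII) file name under its content hash."""
    digest = file_id(path)
    names_by_id[digest] = path.name
    return digest

if __name__ == "__main__":
    names_by_id = {}
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "\u2603.txt"  # a file literally named ☃.txt
        original.write_bytes(b"let it snow")
        digest = register(original, names_by_id)
    # Downstream tools only ever see the ASCII digest.
    print(digest, "->", names_by_id[digest])
```

External scripts can then pass the digest around freely, and only the central table ever has to deal with the Unicode name.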

• Test your software to ensure it supports various Unicode features
  - Characters in the BMP (Basic Multilingual Plane)
  - Characters in the supplementary planes (planes 1-16)
  - Unicode “special” characters
• Use Unicode normalization, collation, and comparison libraries
  - If you’re sorting, comparing, or otherwise doing something other than storing and displaying Unicode text, don’t just sort by byte order
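A few lines of Python show why naive comparison and sorting go wrong. This sketch uses the standard `unicodedata.normalize`; the helper `same_text` and the sample strings are made up, and real collation would need a dedicated library, which is not shown here.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: equal once normalized

# Sorting by code point puts "Äpfel" after "zebra", which few readers expect.
print(sorted(["zebra", "\u00c4pfel", "apple"]))  # ['apple', 'zebra', 'Äpfel']

# Test data from outside the BMP: one code point, but two UTF-16 code units.
emoji = "\U0001F600"
print(len(emoji), len(emoji.encode("utf-16-le")) // 2)  # 1 2
```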

• Stuff that we won’t cover:
  - Collation: how does one sort all these characters?
  - Normalization: more than one way to encode a character…
  - Advanced typography: initial, medial, final, isolated forms, bidirectional character layout, …
  - Date formats, decimal formats (punctuation), and other “locale” items
