Slide 1

Slide 1 text

U is for Unicode
Greg Back
July 30, 2017
/gtback @gtback

Slide 2

Slide 2 text

U is for Unicode (With apologies to Sue Grafton) Undertow

Slide 3

Slide 3 text

What is Unicode?

Slide 4

Slide 4 text

What is an encoding?

Slide 5

Slide 5 text

ASCII (RFC 20)

 32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
 33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
 34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
 35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
 36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
 37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
 38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
 39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

Slide 6

Slide 6 text

Latin-1

 32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
 33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
 34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
 35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
 36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
 37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
 38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
 39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

160    168 ¨  176 °  184 ¸  192 À  200 È  208 Ð  216 Ø  224 à  232 è  240 ð  248 ø
161 ¡  169 ©  177 ±  185 ¹  193 Á  201 É  209 Ñ  217 Ù  225 á  233 é  241 ñ  249 ù
162 ¢  170 ª  178 ²  186 º  194 Â  202 Ê  210 Ò  218 Ú  226 â  234 ê  242 ò  250 ú
163 £  171 «  179 ³  187 »  195 Ã  203 Ë  211 Ó  219 Û  227 ã  235 ë  243 ó  251 û
164 ¤  172 ¬  180 ´  188 ¼  196 Ä  204 Ì  212 Ô  220 Ü  228 ä  236 ì  244 ô  252 ü
165 ¥  173 -  181 µ  189 ½  197 Å  205 Í  213 Õ  221 Ý  229 å  237 í  245 õ  253 ý
166 ¦  174 ®  182 ¶  190 ¾  198 Æ  206 Î  214 Ö  222 Þ  230 æ  238 î  246 ö  254 þ
167 §  175 ¯  183 ·  191 ¿  199 Ç  207 Ï  215 ×  223 ß  231 ç  239 ï  247 ÷  255 ÿ

Slide 7

Slide 7 text

Latin-2

 32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
 33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
 34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
 35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
 36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
 37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
 38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
 39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

160    168 ¨  176 °  184 ¸  192 Ŕ  200 Č  208 Đ  216 Ř  224 ŕ  232 č  240 đ  248 ř
161 Ą  169 Š  177 ą  185 š  193 Á  201 É  209 Ń  217 Ů  225 á  233 é  241 ń  249 ů
162 ˘  170 Ş  178 ˛  186 ş  194 Â  202 Ę  210 Ň  218 Ú  226 â  234 ę  242 ň  250 ú
163 Ł  171 Ť  179 ł  187 ť  195 Ă  203 Ë  211 Ó  219 Ű  227 ă  235 ë  243 ó  251 ű
164 ¤  172 Ź  180 ´  188 ź  196 Ä  204 Ě  212 Ô  220 Ü  228 ä  236 ě  244 ô  252 ü
165 Ľ  173 -  181 ľ  189 ˝  197 Ĺ  205 Í  213 Ő  221 Ý  229 ĺ  237 í  245 ő  253 ý
166 Ś  174 Ž  182 ś  190 ž  198 Ć  206 Î  214 Ö  222 Ţ  230 ć  238 î  246 ö  254 ţ
167 §  175 Ż  183 ˇ  191 ż  199 Ç  207 Ď  215 ×  223 ß  231 ç  239 ď  247 ÷  255 ˙

Slide 8

Slide 8 text

Greek (ISO/IEC 8859-7)

 32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
 33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
 34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
 35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
 36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
 37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
 38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
 39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

160    168 ¨  176 °  184 Έ  192 ΐ  200 Θ  208 Π  216 Ψ  224 ΰ  232 θ  240 π  248 ψ
161 ‘  169 ©  177 ±  185 Ή  193 Α  201 Ι  209 Ρ  217 Ω  225 α  233 ι  241 ρ  249 ω
162 ’  170 ͺ  178 ²  186 Ί  194 Β  202 Κ  210    218 Ϊ  226 β  234 κ  242 ς  250 ϊ
163 £  171 «  179 ³  187 »  195 Γ  203 Λ  211 Σ  219 Ϋ  227 γ  235 λ  243 σ  251 ϋ
164 €  172 ¬  180 ΄  188 Ό  196 Δ  204 Μ  212 Τ  220 ά  228 δ  236 μ  244 τ  252 ό
165 ₯  173 -  181 ΅  189 ½  197 Ε  205 Ν  213 Υ  221 έ  229 ε  237 ν  245 υ  253 ύ
166 ¦  174    182 Ά  190 Ύ  198 Ζ  206 Ξ  214 Φ  222 ή  230 ζ  238 ξ  246 φ  254 ώ
167 §  175 ―  183 ·  191 Ώ  199 Η  207 Ο  215 Χ  223 ί  231 η  239 ο  247 χ  255

Slide 9

Slide 9 text

Japanese (IBM Code Page 932)

 32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
 33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
 34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
 35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
 36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
 37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
 38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
 39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

160    168 ィ  176 ー  184 ク  192 タ  200 ネ  208 ミ  216 リ
161 。  169 ゥ  177 ア  185 ケ  193 チ  201 ノ  209 ム  217 ル
162 「  170 ェ  178 イ  186 コ  194 ツ  202 ハ  210 メ  218 レ
163 」  171 ォ  179 ウ  187 サ  195 テ  203 ヒ  211 モ  219 ロ
164 、  172 ャ  180 エ  188 シ  196 ト  204 フ  212 ヤ  220 ワ
165 ・  173 ュ  181 オ  189 ス  197 ナ  205 ヘ  213 ユ  221 ン
166 ヲ  174 ョ  182 カ  190 セ  198 ニ  206 ホ  214 ヨ  222 ゙
167 ァ  175 ッ  183 キ  191 ソ  199 ヌ  207 マ  215 ラ  223 ゚

Slide 10

Slide 10 text

Japanese (IBM Code Page 932)
(Same single-byte table as Slide 9.)
Plus some 2-byte forms!!

Slide 11

Slide 11 text

Clue #1: Humans deal with characters. Computers deal with bytes.
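The code-page slides above can be made concrete with a small sketch: one byte value, decoded under four different single-byte tables, yields four different characters. The byte 209 (0xD1) is an arbitrary choice for illustration; the codec names are the ones Python 3's standard library uses for these code pages.

```python
# One byte value, interpreted under four of the code pages shown earlier.
raw = bytes([209])  # the single byte 0xD1 (arbitrary illustrative choice)

# Codec names as spelled in Python's standard library.
codecs_to_try = ["latin-1", "iso8859-2", "iso8859-7", "cp932"]

# Same byte in, a different character out for every code page.
decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, char in decoded.items():
    print(f"{name:10} -> {char!r} (U+{ord(char):04X})")
```

Note that Python's cp932 codec returns the halfwidth katakana form of the character for this byte.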

Slide 12

Slide 12 text

What is Unicode?
One ring to rule them all
“encoding”
https://commons.wikimedia.org/wiki/File:Unico_Anello.png

Slide 13

Slide 13 text

Character: A
Code Point: U+0041
Glyph(s): A A A A A
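As a minimal Python 3 sketch of the character / code point relationship on this slide: `ord()` maps a character to its code point and `chr()` maps back. Glyphs are a font concern and are not visible from code. The variable names are illustrative only.

```python
char = "A"

# Character -> code point (an integer)
code_point = ord(char)

# Conventional "U+XXXX" spelling of the code point
label = f"U+{code_point:04X}"

# Code point -> character
round_trip = chr(code_point)

print(char, code_point, label, round_trip)
```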

Slide 14

Slide 14 text

Encoding Unicode Code Points
- UCS-2: Fixed-width (2 bytes per code point)
- UTF-16: Variable-width (2 or 4 bytes per code point)
- UTF-32: Fixed-width (4 bytes per code point)
- UTF-8: Variable-width (1-4 bytes per code point)

Slide 15

Slide 15 text

Encoding Code Points
- UCS-2: Fixed-width (2 bytes per code point)
- UTF-16: Variable-width (2 or 4 bytes per code point)
- UTF-32: Fixed-width (4 bytes per code point)
- UTF-8: Variable-width (1-4 bytes per code point)
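The widths listed above can be checked with a short Python 3 sketch that encodes four sample characters and counts the resulting bytes. The sample characters are an illustrative choice; the `-le` codec variants are used so that no byte-order mark is added to the count. Python has no UCS-2 codec, so it is left out here.

```python
# Sample characters from different parts of the Unicode range.
samples = [
    "A",           # U+0041
    "\u00dc",      # U+00DC, U with diaeresis
    "\u30c4",      # U+30C4, katakana TU
    "\U0001F44D",  # U+1F44D, thumbs up (outside the 16-bit range)
]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# widths[char][encoding] = number of bytes used for that one code point
widths = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, row in widths.items():
    print(f"U+{ord(ch):04X}", row)
```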

Slide 16

Slide 16 text

Unicode and Python Python 2 (“two blue”) Python 3 (“three green”)

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Text Types
Python 3: Unicode, Bytes
Python 2: Unicode, Bytes

Slide 20

Slide 20 text

Text Types
Python 3: Unicode, Bytes
Python 2: Unicode, Bytes
✔ ✔

Slide 21

Slide 21 text

Text Types
Python 3: Unicode, Bytes
Python 2: Unicode, Bytes
¯\_(ツ)_/¯
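The type names did not survive in the slide text, so the following is a sketch based on the language itself rather than on the slides: in Python 3 the Unicode type is `str` and the bytes type is `bytes` (in Python 2 they are `unicode` and `str`). The sample values are illustrative.

```python
text = "\u00dc"        # str: a sequence of Unicode code points
data = b"\xc3\x9c"     # bytes: a sequence of integers in range(256)

print(type(text).__name__, len(text))   # one code point
print(type(data).__name__, len(data))   # two bytes (the UTF-8 form of the same character)

# In Python 3 a str never compares equal to a bytes object,
# even when the bytes are an encoding of the same text.
same = (text == data)
print(same)
```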

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Adding Text Types
Python 3 CANNOT add Unicode and Bytes.
Python 2 CAN add Unicode and Bytes (in most cases).

Slide 25

Slide 25 text

Adding Text Types
Python 3 CANNOT add Unicode and Bytes.
Python 2 CAN add Unicode and Bytes (in most cases).
✔ ✘

Slide 26

Slide 26 text

Adding Text Types
Python 3 CANNOT add Unicode and Bytes.
Python 2 CAN add Unicode and Bytes (in most cases).
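A minimal Python 3 sketch of the "cannot add" behaviour: mixing `str` and `bytes` with `+` raises `TypeError`, and the fix is to decode the bytes explicitly first. The sample values and the choice of UTF-8 are assumptions for illustration.

```python
text = "caf"
data = b"\xc3\xa9"  # UTF-8 bytes for e-acute (assumed encoding)

error_message = None
try:
    combined = text + data           # str + bytes is refused in Python 3
except TypeError as exc:
    error_message = str(exc)
    # Decode explicitly, then both operands are str.
    combined = text + data.decode("utf-8")

print(error_message)
print(combined)
```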

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Clue #2:
- Unicode → Bytes: encode()
- Bytes → Unicode: decode()
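Clue #2 as a short Python 3 round trip: `encode()` turns text into bytes under a named encoding, and `decode()` with the same encoding turns those bytes back into the same text. The sample word is written with escapes so the code points are unambiguous.

```python
# "Unicode" spelled with accented letters: 7 code points.
text = "\u00dc\u00f1\u00ee\u00e7\u00f8d\u00e9"

# Unicode -> Bytes
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Bytes -> Unicode (must use the encoding that produced the bytes)
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(latin1_bytes))
```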

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

System Default Encodings
>>> latin1_string.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Encoding Errors
Python 3:
>>> latin1_string.encode('ascii')
AttributeError: 'bytes' object has no attribute 'encode'
Python 2:
>>> latin1_string.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)
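A Python 3 sketch of what goes wrong in the call above and one way to do it correctly. The value of `latin1_string` is hypothetical; the slides only reveal that its first byte is 0xdc. Bytes have no `encode()` method, so the bytes are decoded first and the resulting text is then encoded.

```python
# Hypothetical value: a word starting with U-diaeresis, encoded as Latin-1.
latin1_string = b"\xdcber"

# Python 3 bytes objects have decode() but no encode().
has_encode = hasattr(latin1_string, "encode")

# Bytes -> Unicode with the encoding the bytes were written in.
text = latin1_string.decode("latin-1")

# Unicode -> Bytes: ASCII cannot represent U+00DC, so this fails loudly...
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc

# ...while UTF-8 can represent every code point.
utf8_bytes = text.encode("utf-8")

print(has_encode, repr(text), ascii_error, utf8_bytes)
```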

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Clue #2:
- Unicode → Bytes: encode()
- Bytes → Unicode: decode()

Clue #3: In Python,
- Unicode is not an encoding.
- UTF-8 is an encoding.

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

System Default Encodings
>>> latin1_string.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

System Default Encodings
sys.getdefaultencoding()
Python 3: UTF-8
Python 2: ascii (?)
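A quick way to check the default on a Python 3 interpreter, as a minimal sketch. It also shows that `str.encode()` called with no argument uses UTF-8 in Python 3.

```python
import sys

# The interpreter-wide default used for implicit str <-> bytes conversions.
default = sys.getdefaultencoding()
print(default)

# With no argument, str.encode() uses UTF-8 in Python 3.
implicit = "\u00dc".encode()
explicit = "\u00dc".encode("utf-8")
print(implicit == explicit)
```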

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

Clue #4: Unicode “Sandwich”
1. Decode (Bytes → Unicode) as soon as possible.
2. Encode (Unicode → Bytes) as late as possible.
3. Deal with Unicode everywhere in the middle.
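The sandwich can be sketched as a single function: bytes come in, are decoded immediately, all processing happens on text, and encoding happens only on the way out. The function name `shout` and its parameters are hypothetical, chosen only for this illustration.

```python
def shout(raw: bytes, in_encoding: str, out_encoding: str) -> bytes:
    """Upper-case some encoded text (hypothetical helper for illustration)."""
    text = raw.decode(in_encoding)      # 1. bytes -> Unicode, as early as possible
    text = text.upper()                 # 3. all real work happens on Unicode
    return text.encode(out_encoding)    # 2. Unicode -> bytes, as late as possible


latin1_input = "\u00fcber".encode("latin-1")
result = shout(latin1_input, "latin-1", "utf-8")
print(result)

# Skipping the sandwich and working on bytes directly only handles ASCII:
# the 0xfc byte (u-diaeresis in Latin-1) is left untouched.
bytes_only = latin1_input.upper()
print(bytes_only)
```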

Slide 42

Slide 42 text

Detecting Encodings
Q: How do I know the encoding of a stream of bytes?
A1: You don’t.
A2: Unless you’re told.
A3: But even then you can’t be sure.
A4: Fortunately, you can “guess”.
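One naive way to "guess" is to try a list of candidate encodings in order and keep the first that decodes without error. This is only a sketch under that assumption, not the detection method used in the talk; `guess_decode` and its candidate list are hypothetical, and statistical detectors such as the third-party chardet package take a different approach.

```python
def guess_decode(data, candidates=("utf-8", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    Hypothetical helper. Order matters: latin-1 accepts every byte
    sequence, so it only makes sense as the last resort.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


print(guess_decode(b"\xc3\x9c"))  # valid UTF-8
print(guess_decode(b"\xdc"))      # not valid UTF-8, falls through to latin-1

# A "successful" decode is still only a guess: the same UTF-8 bytes also
# decode without error as latin-1, producing two wrong characters.
mojibake = b"\xc3\x9c".decode("latin-1")
print(repr(mojibake))
```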

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Clue #5: Be prepared for anything.
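One reading of this clue, offered as an assumption since the surrounding slides have no text: when input may not match the expected encoding, choose an error handler deliberately instead of letting the default `strict` mode raise. The sketch below uses error handlers from Python 3's standard codecs machinery on an illustrative byte string.

```python
data = b"\xdcber"  # illustrative bytes that are not valid ASCII

# Substitute U+FFFD REPLACEMENT CHARACTER for each undecodable byte.
replaced = data.decode("ascii", errors="replace")

# Silently drop undecodable bytes (lossy, use with care).
ignored = data.decode("ascii", errors="ignore")

# Keep a visible escape so the original byte value can be recovered.
escaped = data.decode("ascii", errors="backslashreplace")

print(repr(replaced), repr(ignored), repr(escaped))
```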

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

Python 2 and 3 Compatibility
six.text_type - Always Unicode
six.binary_type - Always Bytes
six.string_types - Possible string data

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Python 2 and 3 Compatibility
six.text_type - Always Unicode
six.binary_type - Always Bytes
six.string_types - Possible string data
from __future__ import unicode_literals - All string literals in source code are Unicode.
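A sketch of how the six type aliases might be used to normalise input to Unicode on either Python version. The helper `to_text` is hypothetical. The `ImportError` fallback is an addition so the example still runs on Python 3 when six is not installed; it is not part of six.

```python
# In a shared Python 2 / Python 3 module, the first statement would be:
#     from __future__ import unicode_literals

try:
    import six
    text_type, binary_type = six.text_type, six.binary_type
except ImportError:
    # Assumption for this sketch: without six, fall back to the Python 3 types.
    text_type, binary_type = str, bytes


def to_text(value, encoding="utf-8"):
    """Return value as Unicode text (hypothetical helper)."""
    if isinstance(value, binary_type):
        return value.decode(encoding)   # Bytes -> Unicode
    if isinstance(value, text_type):
        return value                    # already Unicode
    raise TypeError("expected text or bytes, got %r" % type(value))


print(to_text(b"\xc3\x9c"))
print(to_text("abc"))
```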

Slide 55

Slide 55 text

Clue #6: Use six and unicode_literals for Python 2 / Python 3 compatibility

Slide 56

Slide 56 text

Mystery Solved (Conclusion)
Clue #1: Humans:Characters :: Computers:Bytes
Clue #2: U → B: encode(), B → U: decode()
Clue #3: “Unicode” isn’t an encoding. “UTF-8” is.
Clue #4: Unicode Sandwich
Clue #5: Be prepared for anything.
Clue #6: Use six and unicode_literals

Slide 57

Slide 57 text

Resources
- http://www.joelonsoftware.com/articles/Unicode.html (“Spolsky Unicode”)
- http://nedbatchelder.com/text/unipain.html (“Unipain Python”)
- https://speakerdeck.com/ramalho/unicode-solutions-in-python-2-and-python-3 (“Speakerdeck Unicode Python”)
- https://www.safaribooksonline.com/library/view/fluent-python/9781491946237/ch04.html (“Fluent Python Unicode”)

Thank you! Questions?
/gtback @gtback

Slide 58

Slide 58 text

Unicode Zoo
→ U+2192 RIGHTWARDS ARROW
✔ U+2714 HEAVY CHECK MARK
✘ U+2718 HEAVY BALLOT X
👍 U+1F44D THUMBS UP SIGN
👎 U+1F44E THUMBS DOWN SIGN

Slide 59

Slide 59 text

Unicode Zoo
Ü U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
ñ U+00F1 LATIN SMALL LETTER N WITH TILDE
î U+00EE LATIN SMALL LETTER I WITH CIRCUMFLEX
ç U+00E7 LATIN SMALL LETTER C WITH CEDILLA
ø U+00F8 LATIN SMALL LETTER O WITH STROKE
d U+0064 LATIN SMALL LETTER D
é U+00E9 LATIN SMALL LETTER E WITH ACUTE

Slide 60

Slide 60 text

Unicode Zoo
¯ U+00AF MACRON
\ U+005C REVERSE SOLIDUS
_ U+005F LOW LINE
( U+0028 LEFT PARENTHESIS
ツ U+30C4 KATAKANA LETTER TU
) U+0029 RIGHT PARENTHESIS
/ U+002F SOLIDUS
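A zoo listing like the ones above can be generated with the standard library's `unicodedata` module. This minimal sketch walks the shrug string and looks up the code point and official name of each character.

```python
import unicodedata

# The shrug, written with escapes for the non-ASCII characters.
shrug = "\u00af\\_(\u30c4)_/\u00af"

# One (code point label, official name) pair per character.
zoo = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in shrug]

for ch, (label, name) in zip(shrug, zoo):
    print(ch, label, name)
```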

Slide 61

Slide 61 text

Ü is for Üñîçødé
Greg Back
July 30, 2017
/gtback @gtback

Slide 62

Slide 62 text

℉◎υя ṧ¢øяℯ @ηⅾ ṧε♥εᾔ ¥ε@ґṧ αℊø øüя ḟ@т♄εґṧ ßґ☺υℊ♄☂ ḟ☺ґ⊥н øη ⊥ℌ☤ṧ ¢øη☂☤ᾔℯηт, @ η℮ẘ ηα⊥☤øη, ¢☺ηḉℯїṽ℮ḓ їᾔ ℒ☤♭ℯя☂ƴ, αηḓ ⅾℯḓḯ¢ α⊥℮ⅾ ⊥◎ т♄℮ ρґ◎℘øṧ☤тїøᾔ тнα☂ αʟʟ μεη @яℯ ¢я℮α☂ε∂ εⓠü@ʟ▪

Ⓕⓞⓤⓡ ⓢⓒⓞⓡⓔ ⓐⓝⓓ ⓢⓔⓥⓔⓝ ⓨⓔⓐⓡⓢ ⓐⓖⓞ ⓞⓤⓡ ⓕⓐⓣⓗⓔⓡⓢ ⓑⓡⓞⓤⓖⓗⓣ ⓕⓞⓡⓣⓗ ⓞⓝ ⓣⓗⓘⓢ ⓒⓞⓝⓣⓘⓝⓔⓝⓣ , ⓐ ⓝⓔⓦ ⓝⓐⓣⓘⓞⓝ, ⓒⓞⓝⓒⓔⓘⓥⓔⓓ ⓘⓝ Ⓛⓘⓑⓔⓡⓣⓨ, ⓐⓝⓓ ⓓⓔⓓⓘⓒⓐⓣⓔⓓ ⓣⓞ ⓣⓗⓔ ⓟⓡⓞⓟⓞⓢⓘⓣⓘⓞⓝ ⓣⓗⓐⓣ ⓐⓛⓛ ⓜⓔⓝ ⓐⓡⓔ ⓒⓡⓔⓐⓣⓔⓓ ⓔⓠⓤⓐⓛ◎

, , , .

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Slide 64

Slide 64 text

Final Clue: DO NOT use non-standard code points for stylistic effect. But for comedic effect, it's encouraged! ☺
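One practical reason behind this advice, sketched here as an assumption rather than something the slides state: look-alike code points are different characters, so stylised text fails plain comparisons and searches. Compatibility normalisation (NFKC) in the standard `unicodedata` module folds some of them, such as the circled letters, back to ordinary letters.

```python
import unicodedata

fancy = "\u24bb\u24de\u24e4\u24e1"  # "Four" written in circled Latin letters
plain = "Four"

# Visually similar, but different code points: comparison fails.
looks_equal = (fancy == plain)

# NFKC replaces compatibility characters with their ordinary equivalents.
folded = unicodedata.normalize("NFKC", fancy)

print(looks_equal, folded, folded == plain)
```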