U is for Unicode: PyOhio

Slides from my "U is for Unicode: Solving the Mystery" talk at PyOhio 2017

Greg Back

July 30, 2017

Transcript

  1. ASCII (RFC 20)

     32 SP  40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127 DEL
  2. Latin-1

     32 SP  40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127 DEL

    160 NBSP  168 ¨  176 °  184 ¸  192 À  200 È  208 Ð  216 Ø  224 à  232 è  240 ð  248 ø
    161 ¡  169 ©  177 ±  185 ¹  193 Á  201 É  209 Ñ  217 Ù  225 á  233 é  241 ñ  249 ù
    162 ¢  170 ª  178 ²  186 º  194 Â  202 Ê  210 Ò  218 Ú  226 â  234 ê  242 ò  250 ú
    163 £  171 «  179 ³  187 »  195 Ã  203 Ë  211 Ó  219 Û  227 ã  235 ë  243 ó  251 û
    164 ¤  172 ¬  180 ´  188 ¼  196 Ä  204 Ì  212 Ô  220 Ü  228 ä  236 ì  244 ô  252 ü
    165 ¥  173 -  181 µ  189 ½  197 Å  205 Í  213 Õ  221 Ý  229 å  237 í  245 õ  253 ý
    166 ¦  174 ®  182 ¶  190 ¾  198 Æ  206 Î  214 Ö  222 Þ  230 æ  238 î  246 ö  254 þ
    167 §  175 ¯  183 ·  191 ¿  199 Ç  207 Ï  215 ×  223 ß  231 ç  239 ï  247 ÷  255 ÿ
  3. Latin-2

     32 SP  40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127 DEL

    160 NBSP  168 ¨  176 °  184 ¸  192 Ŕ  200 Č  208 Đ  216 Ř  224 ŕ  232 č  240 đ  248 ř
    161 Ą  169 Š  177 ą  185 š  193 Á  201 É  209 Ń  217 Ů  225 á  233 é  241 ń  249 ů
    162 ˘  170 Ş  178 ˛  186 ş  194 Â  202 Ę  210 Ň  218 Ú  226 â  234 ę  242 ň  250 ú
    163 Ł  171 Ť  179 ł  187 ť  195 Ă  203 Ë  211 Ó  219 Ű  227 ă  235 ë  243 ó  251 ű
    164 ¤  172 Ź  180 ´  188 ź  196 Ä  204 Ě  212 Ô  220 Ü  228 ä  236 ě  244 ô  252 ü
    165 Ľ  173 -  181 ľ  189 ˝  197 Ĺ  205 Í  213 Ő  221 Ý  229 ĺ  237 í  245 ő  253 ý
    166 Ś  174 Ž  182 ś  190 ž  198 Ć  206 Î  214 Ö  222 Ţ  230 ć  238 î  246 ö  254 ţ
    167 §  175 Ż  183 ˇ  191 ż  199 Ç  207 Ď  215 ×  223 ß  231 ç  239 ď  247 ÷  255 ˙
  4. Greek (ISO/IEC 8859-7)

     32 SP  40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127 DEL

    160 NBSP  168 ¨  176 °  184 Έ  192 ΐ  200 Θ  208 Π  216 Ψ  224 ΰ  232 θ  240 π  248 ψ
    161 ‘  169 ©  177 ±  185 Ή  193 Α  201 Ι  209 Ρ  217 Ω  225 α  233 ι  241 ρ  249 ω
    162 ’  170 ͺ  178 ²  186 Ί  194 Β  202 Κ  210 (unassigned)  218 Ϊ  226 β  234 κ  242 ς  250 ϊ
    163 £  171 «  179 ³  187 »  195 Γ  203 Λ  211 Σ  219 Ϋ  227 γ  235 λ  243 σ  251 ϋ
    164 €  172 ¬  180 ΄  188 Ό  196 Δ  204 Μ  212 Τ  220 ά  228 δ  236 μ  244 τ  252 ό
    165 ₯  173 -  181 ΅  189 ½  197 Ε  205 Ν  213 Υ  221 έ  229 ε  237 ν  245 υ  253 ύ
    166 ¦  174 (unassigned)  182 Ά  190 Ύ  198 Ζ  206 Ξ  214 Φ  222 ή  230 ζ  238 ξ  246 φ  254 ώ
    167 §  175 ―  183 ·  191 Ώ  199 Η  207 Ο  215 Χ  223 ί  231 η  239 ο  247 χ  255 (unassigned)
  6. Japanese (IBM Code Page 932)

     32 SP  40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127 DEL

    160 (blank)  168 ィ  176 ー  184 ク  192 タ  200 ネ  208 ミ  216 リ
    161 。  169 ゥ  177 ア  185 ケ  193 チ  201 ノ  209 ム  217 ル
    162 「  170 ェ  178 イ  186 コ  194 ツ  202 ハ  210 メ  218 レ
    163 」  171 ォ  179 ウ  187 サ  195 テ  203 ヒ  211 モ  219 ロ
    164 、  172 ャ  180 エ  188 シ  196 ト  204 フ  212 ヤ  220 ワ
    165 ・  173 ュ  181 オ  189 ス  197 ナ  205 ヘ  213 ユ  221 ン
    166 ヲ  174 ョ  182 カ  190 セ  198 ニ  206 ホ  214 ヨ  222 ゙
    167 ァ  175 ッ  183 キ  191 ソ  199 ヌ  207 マ  215 ラ  223 ゚

    Plus some 2-byte forms!!
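
The code-page tables above assign different characters to the same byte values. As a minimal Python 3 sketch, the snippet below decodes byte 192 under each of them. The codec names `latin-1`, `iso8859-2`, `iso8859-7`, and `cp932` are Python's standard-library names, chosen here as stand-ins for the four tables; Python's `cp932` may differ in detail from the IBM code page named on the slide. The helper name `decode_everywhere` is hypothetical.

```python
# One byte, four interpretations: decode byte 192 (0xC0) under each code page.
RAW = bytes([0xC0])
CODECS = ("latin-1", "iso8859-2", "iso8859-7", "cp932")


def decode_everywhere(raw, codecs=CODECS):
    """Return {codec name: decoded text} for the same raw bytes."""
    return {codec: raw.decode(codec) for codec in codecs}


for codec, char in decode_everywhere(RAW).items():
    # Print the codec, the decoded character, and its Unicode code point.
    print(f"{codec:10} {char!r} U+{ord(char):04X}")
```

The bytes never change; only the table used to interpret them does.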
  7. What is Unicode?

    One ring to rule them all “encoding”
    https://commons.wikimedia.org/wiki/File:Unico_Anello.png
  8. Encoding Unicode Code Points

    - UCS-2: Fixed-width (2 bytes per code point)
    - UTF-16: Variable-width (2 or 4 bytes per code point)
    - UTF-32: Fixed-width (4 bytes per code point)
    - UTF-8: Variable-width (1-4 bytes per code point)
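
The widths listed above can be checked directly. This is a small Python 3 sketch; the helper name `encoded_sizes` and the sample characters are illustrative choices. The little-endian codec variants are used so that no byte-order mark is counted. Python ships no UCS-2 codec, so it is omitted; for code points up to U+FFFF, UCS-2 and UTF-16 produce the same two bytes.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(char):
    """Return {encoding: number of bytes} for a single code point."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}


# ASCII letter, Latin-1 letter, katakana, and an emoji outside the BMP.
for char in ("d", "\u00dc", "\u30c4", "\U0001F44D"):
    print(f"U+{ord(char):04X}", encoded_sizes(char))
```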
  11. Text Types

    Python 3: Unicode <class 'str'>, Bytes <class 'bytes'>
    Python 2: Unicode <class 'unicode'>, Bytes <class 'str'>
    ✔ ✔
  12. Text Types

    Python 3: Unicode <class 'str'>, Bytes <class 'bytes'>
    Python 2: Unicode <class 'unicode'>, Bytes <class 'str'>
    ¯\_(ツ)_/¯
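
For the Python 3 half of this slide, a short sketch makes the two types concrete. The variable names are illustrative; the Python 2 types cannot be shown in the same interpreter.

```python
# Python 3 only: text and bytes are distinct types.
text = "\u00dc"        # one code point: LATIN CAPITAL LETTER U WITH DIAERESIS
data = b"\xc3\x9c"     # the same character as two UTF-8 bytes

print(type(text), len(text))
print(type(data), len(data))
```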
  14. Adding Text Types

    Python 3 CANNOT add <str> and <bytes>
    Python 2 CAN add <str> and <unicode> (in most cases)
    ✔ ✘
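
The Python 3 behaviour can be seen in a few lines. This is a sketch; `can_concatenate` is a hypothetical helper that only reports whether `+` raises `TypeError`.

```python
def can_concatenate(left, right):
    """Return True if left + right works, False if it raises TypeError."""
    try:
        left + right
    except TypeError:
        return False
    return True


print(can_concatenate("text ", b"bytes"))                  # str + bytes
print(can_concatenate("text ", b"bytes".decode("ascii")))  # str + str
```

Decoding the bytes first, with an explicit encoding, is what makes the second call succeed.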
  16. System Default Encodings

    >>> latin1_string.encode('ascii')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)
  17. Encoding Errors

    Python 3:
    >>> latin1_string.encode('ascii')
    AttributeError: 'bytes' object has no attribute 'encode'

    Python 2:
    >>> latin1_string.encode('ascii')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)
  18. Clue #2:

    - Unicode → Bytes: encode()
    - Bytes → Unicode: decode()

    Clue #3: In Python,

    - Unicode is not an encoding.
    - UTF-8 is an encoding.
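
Both clues can be tried in Python 3. The sketch below round-trips a string through UTF-8 and then asks the `codecs` registry whether a given name is a known encoding; `is_codec` is a hypothetical helper, and the sample string is arbitrary.

```python
import codecs

text = "\u00dcnicode"
data = text.encode("utf-8")       # Unicode -> bytes
restored = data.decode("utf-8")   # bytes -> Unicode


def is_codec(name):
    """Return True if Python knows an encoding with this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


print(data, restored == text)
print(is_codec("utf-8"), is_codec("unicode"))
```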
  19. System Default Encodings

    >>> latin1_string.encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)
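
The error mentions the 'ascii' codec even though 'utf-8' was requested because Python 2, when asked to encode a byte string, first decodes it implicitly with the default encoding. A Python 3 sketch can spell that hidden step out. The value of `latin1_string` is an assumption (the slide only reveals that its first byte is 0xdc), and `python2_style_encode` is a hypothetical helper that models the default encoding as ASCII.

```python
latin1_string = b"\xdc"  # assumed value: "Ü" encoded as Latin-1


def python2_style_encode(byte_string, encoding):
    """Mimic Python 2: implicitly decode as ASCII, then encode."""
    return byte_string.decode("ascii").encode(encoding)


try:
    python2_style_encode(latin1_string, "utf-8")
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc)

# Naming the real source encoding avoids the implicit ASCII step.
utf8_bytes = latin1_string.decode("latin-1").encode("utf-8")
print(utf8_bytes)
```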
  20. Clue #4: Unicode “Sandwich”

    1. Decode (Bytes → Unicode) as soon as possible
    2. Encode (Unicode → Bytes) as late as possible
    3. Deal with Unicode everywhere in the middle.
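
The three steps above can be sketched as one small Python 3 function. The name `shout`, the upper-casing "work", and the choice of UTF-8 for output are all illustrative assumptions.

```python
def shout(raw, input_encoding):
    """Unicode sandwich: bytes in, str in the middle, bytes out."""
    text = raw.decode(input_encoding)   # 1. decode as soon as possible
    result = text.upper()               # 3. work only with Unicode
    return result.encode("utf-8")       # 2. encode as late as possible


incoming = "\u00f1and\u00fa".encode("latin-1")
print(shout(incoming, "latin-1"))
```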
  21. Detecting Encodings

    Q: How do I know the encoding of a stream of bytes?
    A1: You don’t.
    A2: Unless you’re told.
    A3: But even then you can’t be sure.
    A4: Fortunately, you can “guess”.
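
One very simple way to "guess" is to try a few candidate encodings in order and keep the first that decodes without error. This is only a sketch of the idea, not necessarily the approach the talk had in mind; `guess_decode` and its candidate list are hypothetical. It also illustrates A3: `latin-1` accepts every byte, so the last candidate always "succeeds", even when it is wrong.

```python
def guess_decode(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


print(guess_decode(b"abc"))
print(guess_decode("\u00dc".encode("utf-8")))
# Byte 0xC0 is "Ŕ" in Latin-2, but the guess reports Latin-1 and "À".
print(guess_decode(b"\xc0"))
```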
  23. Python 2 and 3 Compatibility

    six.text_type - Always Unicode
    six.binary_type - Always Bytes
    six.string_types - Possible string data
    from __future__ import unicode_literals - All string literals in source code are Unicode.
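
A sketch of how these names might be used in code meant to run on both Python versions. The helper `to_text` is hypothetical, and the `ImportError` fallback is an assumption added only so the example runs on Python 3 when `six` is not installed.

```python
try:
    import six
    text_type, binary_type = six.text_type, six.binary_type
except ImportError:
    # Assumption: Python 3 without six; these are the equivalent types there.
    text_type, binary_type = str, bytes


def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding it if it is bytes."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError("expected text or bytes, got %r" % type(value))


print(to_text(b"\xc3\x9c") == to_text(u"\u00dc"))
```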
  24. Mystery Solved (Conclusion)

    Clue #1: Humans:Characters :: Computers:Bytes
    Clue #2: U → B: encode(), B → U: decode()
    Clue #3: “Unicode” isn’t an encoding. “UTF-8” is.
    Clue #4: Unicode Sandwich
    Clue #5: Be prepared for anything.
    Clue #6: Use six and unicode_literals
  25. Resources

    http://www.joelonsoftware.com/articles/Unicode.html (“Spolsky Unicode”)
    http://nedbatchelder.com/text/unipain.html (“Unipain Python”)
    https://speakerdeck.com/ramalho/unicode-solutions-in-python-2-and-python-3 (“Speakerdeck Unicode Python”)
    https://www.safaribooksonline.com/library/view/fluent-python/9781491946237/ch04.html (“Fluent Python Unicode”)

    Thank you! Questions?
    /gtback @gtback
  26. Unicode Zoo

    → U+2192 RIGHTWARDS ARROW
    ✔ U+2714 HEAVY CHECK MARK
    ✘ U+2718 HEAVY BALLOT X
    👍 U+1F44D THUMBS UP SIGN
    👎 U+1F44E THUMBS DOWN SIGN
  27. Unicode Zoo

    Ü U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
    ñ U+00F1 LATIN SMALL LETTER N WITH TILDE
    î U+00EE LATIN SMALL LETTER I WITH CIRCUMFLEX
    ç U+00E7 LATIN SMALL LETTER C WITH CEDILLA
    ø U+00F8 LATIN SMALL LETTER O WITH STROKE
    d U+0064 LATIN SMALL LETTER D
    é U+00E9 LATIN SMALL LETTER E WITH ACUTE
  28. Unicode Zoo

    ¯ U+00AF MACRON
    \ U+005C REVERSE SOLIDUS
    _ U+005F LOW LINE
    ( U+0028 LEFT PARENTHESIS
    ツ U+30C4 KATAKANA LETTER TU
    ) U+0029 RIGHT PARENTHESIS
    / U+002F SOLIDUS
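
Listings like the "Unicode Zoo" slides can be generated with the standard-library `unicodedata` module. A minimal sketch; `describe` is a hypothetical helper.

```python
import unicodedata


def describe(char):
    """Format one character as 'U+XXXX OFFICIAL NAME'."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))


# Walk through the shrug from the slide, one code point at a time.
for char in "\u00af\\_(\u30c4)_/\u00af":
    print(char, describe(char))
```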
  29. ℉◎υя ṧ¢øяℯ @ηⅾ ṧε♥εᾔ ¥ε@ґṧ αℊø øüя ḟ@т♄εґṧ ßґ☺υℊ♄☂ ḟ☺ґ⊥н øη ⊥ℌ☤ṧ ¢øη☂☤ᾔℯηт, @ η℮ẘ ηα⊥☤øη, ¢☺ηḉℯїṽ℮ḓ їᾔ ℒ☤♭ℯя☂ƴ, αηḓ ⅾℯḓḯ¢α⊥℮ⅾ ⊥◎ т♄℮ ρґ◎℘øṧ☤тїøᾔ тнα☂ αʟʟ μεη @яℯ ¢я℮α☂ε∂ εⓠü@ʟ▪

    Ⓕⓞⓤⓡ ⓢⓒⓞⓡⓔ ⓐⓝⓓ ⓢⓔⓥⓔⓝ ⓨⓔⓐⓡⓢ ⓐⓖⓞ ⓞⓤⓡ ⓕⓐⓣⓗⓔⓡⓢ ⓑⓡⓞⓤⓖⓗⓣ ⓕⓞⓡⓣⓗ ⓞⓝ ⓣⓗⓘⓢ ⓒⓞⓝⓣⓘⓝⓔⓝⓣ, ⓐ ⓝⓔⓦ ⓝⓐⓣⓘⓞⓝ, ⓒⓞⓝⓒⓔⓘⓥⓔⓓ ⓘⓝ Ⓛⓘⓑⓔⓡⓣⓨ, ⓐⓝⓓ ⓓⓔⓓⓘⓒⓐⓣⓔⓓ ⓣⓞ ⓣⓗⓔ ⓟⓡⓞⓟⓞⓢⓘⓣⓘⓞⓝ ⓣⓗⓐⓣ ⓐⓛⓛ ⓜⓔⓝ ⓐⓡⓔ ⓒⓡⓔⓐⓣⓔⓓ ⓔⓠⓤⓐⓛ◎

    , , , .

    Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
  31. Final Clue:

    DO NOT use non-standard code points for stylistic effect.
    But for comedic effect, it's encouraged! ☺
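
One practical reason for this clue is that styled text no longer compares equal to the plain text it imitates. A short Python 3 sketch with the standard-library `unicodedata` module; the sample strings are taken from the circled-letter slide. NFKC normalization happens to recover the circled letters, but it does not help with look-alike substitutions such as `℉`, which normalizes to `°F` rather than `F`.

```python
import unicodedata

fancy = "\u24bb\u24de\u24e4\u24e1 \u24e2\u24d2\u24de\u24e1\u24d4"  # circled "Four score"
plain = "Four score"

# Styled text is made of different code points, so equality and search fail.
print(fancy == plain)

# Compatibility normalization maps the circled letters back to ASCII letters.
normalized = unicodedata.normalize("NFKC", fancy)
print(normalized == plain)

# A look-alike symbol is not the letter it resembles, even after NFKC.
degree_f = unicodedata.normalize("NFKC", "\u2109")
print(degree_f)
```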