U is for Unicode: PyOhio

Slides from my "U is for Unicode: Solving the Mystery" talk at PyOhio 2017

Greg Back

July 30, 2017

Transcript

  1. ASCII (RFC 20)

     32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127
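
The ASCII grid above can be regenerated with Python's built-in `chr()` and `ord()`. This is a minimal sketch for Python 3; the names `printable` and `rows` are illustrative and not from the talk.

```python
# Printable ASCII runs from 32 (space) through 126 (~); 127 is the DEL control code.
printable = {code: chr(code) for code in range(32, 127)}

# ord() is the inverse of chr(): character -> code.
code_for_a = ord("A")

# Lay the codes out in the same 8-row grid as the slide (32, 40, 48, ... across).
rows = [
    "  ".join(f"{code:>3} {chr(code)}" for code in range(start, 127, 8))
    for start in range(32, 40)
]
print("\n".join(rows))

# Every ASCII character encodes to exactly one byte with the same numeric value.
encoded_a = "A".encode("ascii")
```
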
  2. Latin-1

     32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

    160    168 ¨  176 °  184 ¸  192 À  200 È  208 Ð  216 Ø  224 à  232 è  240 ð  248 ø
    161 ¡  169 ©  177 ±  185 ¹  193 Á  201 É  209 Ñ  217 Ù  225 á  233 é  241 ñ  249 ù
    162 ¢  170 ª  178 ²  186 º  194 Â  202 Ê  210 Ò  218 Ú  226 â  234 ê  242 ò  250 ú
    163 £  171 «  179 ³  187 »  195 Ã  203 Ë  211 Ó  219 Û  227 ã  235 ë  243 ó  251 û
    164 ¤  172 ¬  180 ´  188 ¼  196 Ä  204 Ì  212 Ô  220 Ü  228 ä  236 ì  244 ô  252 ü
    165 ¥  173 -  181 µ  189 ½  197 Å  205 Í  213 Õ  221 Ý  229 å  237 í  245 õ  253 ý
    166 ¦  174 ®  182 ¶  190 ¾  198 Æ  206 Î  214 Ö  222 Þ  230 æ  238 î  246 ö  254 þ
    167 §  175 ¯  183 ·  191 ¿  199 Ç  207 Ï  215 ×  223 ß  231 ç  239 ï  247 ÷  255 ÿ
  3. Latin-2

     32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

    160    168 ¨  176 °  184 ¸  192 Ŕ  200 Č  208 Đ  216 Ř  224 ŕ  232 č  240 đ  248 ř
    161 Ą  169 Š  177 ą  185 š  193 Á  201 É  209 Ń  217 Ů  225 á  233 é  241 ń  249 ů
    162 ˘  170 Ş  178 ˛  186 ş  194 Â  202 Ę  210 Ň  218 Ú  226 â  234 ę  242 ň  250 ú
    163 Ł  171 Ť  179 ł  187 ť  195 Ă  203 Ë  211 Ó  219 Ű  227 ă  235 ë  243 ó  251 ű
    164 ¤  172 Ź  180 ´  188 ź  196 Ä  204 Ě  212 Ô  220 Ü  228 ä  236 ě  244 ô  252 ü
    165 Ľ  173 -  181 ľ  189 ˝  197 Ĺ  205 Í  213 Ő  221 Ý  229 ĺ  237 í  245 ő  253 ý
    166 Ś  174 Ž  182 ś  190 ž  198 Ć  206 Î  214 Ö  222 Ţ  230 ć  238 î  246 ö  254 ţ
    167 §  175 Ż  183 ˇ  191 ż  199 Ç  207 Ď  215 ×  223 ß  231 ç  239 ď  247 ÷  255 ˙
  4. Greek (ISO/IEC 8859-7)

     32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

    160    168 ¨  176 °  184 Έ  192 ΐ  200 Θ  208 Π  216 Ψ  224 ΰ  232 θ  240 π  248 ψ
    161 ‘  169 ©  177 ±  185 Ή  193 Α  201 Ι  209 Ρ  217 Ω  225 α  233 ι  241 ρ  249 ω
    162 ’  170 ͺ  178 ²  186 Ί  194 Β  202 Κ  210    218 Ϊ  226 β  234 κ  242 ς  250 ϊ
    163 £  171 «  179 ³  187 »  195 Γ  203 Λ  211 Σ  219 Ϋ  227 γ  235 λ  243 σ  251 ϋ
    164 €  172 ¬  180 ΄  188 Ό  196 Δ  204 Μ  212 Τ  220 ά  228 δ  236 μ  244 τ  252 ό
    165 ₯  173 -  181 ΅  189 ½  197 Ε  205 Ν  213 Υ  221 έ  229 ε  237 ν  245 υ  253 ύ
    166 ¦  174    182 Ά  190 Ύ  198 Ζ  206 Ξ  214 Φ  222 ή  230 ζ  238 ξ  246 φ  254 ώ
    167 §  175 ―  183 ·  191 Ώ  199 Η  207 Ο  215 Χ  223 ί  231 η  239 ο  247 χ  255
  5. Japanese (IBM Code Page 932)

     32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

    160     168 ィ  176 ー  184 ク  192 タ  200 ネ  208 ミ  216 リ
    161 。  169 ゥ  177 ア  185 ケ  193 チ  201 ノ  209 ム  217 ル
    162 「  170 ェ  178 イ  186 コ  194 ツ  202 ハ  210 メ  218 レ
    163 」  171 ォ  179 ウ  187 サ  195 テ  203 ヒ  211 モ  219 ロ
    164 、  172 ャ  180 エ  188 シ  196 ト  204 フ  212 ヤ  220 ワ
    165 ・  173 ュ  181 オ  189 ス  197 ナ  205 ヘ  213 ユ  221 ン
    166 ヲ  174 ョ  182 カ  190 セ  198 ニ  206 ホ  214 ヨ  222 ゙
    167 ァ  175 ッ  183 キ  191 ソ  199 ヌ  207 マ  215 ラ  223 ゚
  6. Japanese (IBM Code Page 932)

     32     40 (   48 0   56 8   64 @   72 H   80 P   88 X   96 `  104 h  112 p  120 x
     33 !   41 )   49 1   57 9   65 A   73 I   81 Q   89 Y   97 a  105 i  113 q  121 y
     34 "   42 *   50 2   58 :   66 B   74 J   82 R   90 Z   98 b  106 j  114 r  122 z
     35 #   43 +   51 3   59 ;   67 C   75 K   83 S   91 [   99 c  107 k  115 s  123 {
     36 $   44 ,   52 4   60 <   68 D   76 L   84 T   92 \  100 d  108 l  116 t  124 |
     37 %   45 -   53 5   61 =   69 E   77 M   85 U   93 ]  101 e  109 m  117 u  125 }
     38 &   46 .   54 6   62 >   70 F   78 N   86 V   94 ^  102 f  110 n  118 v  126 ~
     39 '   47 /   55 7   63 ?   71 G   79 O   87 W   95 _  103 g  111 o  119 w  127

    160     168 ィ  176 ー  184 ク  192 タ  200 ネ  208 ミ  216 リ
    161 。  169 ゥ  177 ア  185 ケ  193 チ  201 ノ  209 ム  217 ル
    162 「  170 ェ  178 イ  186 コ  194 ツ  202 ハ  210 メ  218 レ
    163 」  171 ォ  179 ウ  187 サ  195 テ  203 ヒ  211 モ  219 ロ
    164 、  172 ャ  180 エ  188 シ  196 ト  204 フ  212 ヤ  220 ワ
    165 ・  173 ュ  181 オ  189 ス  197 ナ  205 ヘ  213 ユ  221 ン
    166 ヲ  174 ョ  182 カ  190 セ  198 ニ  206 ホ  214 ヨ  222 ゙
    167 ァ  175 ッ  183 キ  191 ソ  199 ヌ  207 マ  215 ラ  223 ゚

    Plus some 2-byte forms!!
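
The code pages above share the ASCII half but disagree about everything from 160 up. As a small illustration (Python 3, standard-library codecs only; the variable names are illustrative), the single byte 208 decodes to a different character under each one:

```python
raw = bytes([208])  # one byte, value 0xD0

# Codec names are Python's spellings for Latin-1, Latin-2, ISO/IEC 8859-7 and code page 932.
decoded = {
    name: raw.decode(name)
    for name in ("latin-1", "iso8859-2", "iso8859-7", "cp932")
}
for name, ch in decoded.items():
    print(f"{name:>10}: {ch!r} U+{ord(ch):04X}")

# ASCII only defines 0-127, so the same byte is an error there.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# "Plus some 2-byte forms": kanji take two bytes in code page 932.
kanji_length = len("日".encode("cp932"))
```

Note that Python's `cp932` codec maps bytes 161 to 223 to the halfwidth katakana block (U+FF61 to U+FF9F), so the decoded character is the halfwidth form of the glyph drawn on the slide.
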
  7. What is Unicode?

    One ring to rule them all “encoding”
    Image: https://commons.wikimedia.org/wiki/File:Unico_Anello.png
  8. Encoding Unicode Code Points

    - UCS-2: Fixed-width (2 bytes per code point)
    - UTF-16: Variable-width (2 or 4 bytes per code point)
    - UTF-32: Fixed-width (4 bytes per code point)
    - UTF-8: Variable-width (1-4 bytes per code point)
  9. Encoding Code Points

    - UCS-2: Fixed-width (2 bytes per code point)
    - UTF-16: Variable-width (2 or 4 bytes per code point)
    - UTF-32: Fixed-width (4 bytes per code point)
    - UTF-8: Variable-width (1-4 bytes per code point)
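
The widths listed above can be checked directly by encoding a few code points and counting bytes. This sketch assumes Python 3; the sample characters are my own choice, and the `-le` codec variants are used only so that no byte-order mark is added to the count. Python ships no UCS-2 codec, so it is left out; UCS-2 cannot represent code points above U+FFFF at all.

```python
# U+0041, U+00DC, U+20AC and U+1F44D: one sample from each UTF-8 length class.
samples = ["A", "\u00dc", "\u20ac", "\U0001F44D"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# sizes[character][encoding] -> number of bytes used for that one code point
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, by_encoding in sizes.items():
    print(f"U+{ord(ch):04X}", by_encoding)
```
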
  10. Text Types

    Python 3
    - Unicode: <class 'str'>
    - Bytes: <class 'bytes'>

    Python 2
    - Unicode: <class 'unicode'>
    - Bytes: <class 'str'>
  11. Text Types

    Python 3
    - Unicode: <class 'str'>
    - Bytes: <class 'bytes'>

    Python 2
    - Unicode: <class 'unicode'>
    - Bytes: <class 'str'>

    ✔ ✔
  12. Text Types

    Python 3
    - Unicode: <class 'str'>
    - Bytes: <class 'bytes'>

    Python 2
    - Unicode: <class 'unicode'>
    - Bytes: <class 'str'>

    ¯\_(ツ)_/¯
  13. Adding Text Types

    Python 3 CANNOT add <str> and <bytes>
    Python 2 CAN add <str> and <unicode> (in most cases)
  14. Adding Text Types

    Python 3 CANNOT add <str> and <bytes>
    Python 2 CAN add <str> and <unicode> (in most cases)

    ✔ ✘
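
The Python 3 half of this slide is easy to reproduce. A minimal sketch, with illustrative variable names: mixing the two text types raises `TypeError` until the bytes are explicitly decoded.

```python
text = "\u00dc"   # 'Ü' as Unicode text: <class 'str'>
data = b"\xdc"    # the same character as one Latin-1 byte: <class 'bytes'>

# Python 3 refuses to guess an encoding, so str + bytes fails outright.
try:
    text + data
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Decoding explicitly makes both operands str, and the addition works.
combined = text + data.decode("latin-1")
```
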
  16. System Default Encodings

    >>> latin1_string.encode('ascii')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)
  17. Encoding Errors

    >>> latin1_string.encode('ascii')
    AttributeError: 'bytes' object has no attribute 'encode'

    >>> latin1_string.encode('ascii')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)
  18. Clue #2:

    - Unicode → Bytes: encode()
    - Bytes → Unicode: decode()

    Clue #3: In Python,

    - Unicode is not an encoding.
    - UTF-8 is an encoding.
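
Clues #2 and #3 in a few lines of Python 3. This is a sketch with illustrative names, using the Ü (U+00DC) character that appears elsewhere in the deck: the Unicode string is the abstract text, and each encoding gives it a different byte representation.

```python
text = "\u00dc"  # 'Ü' as Unicode text

# Unicode -> Bytes: encode(). The result depends on the encoding chosen.
latin1_bytes = text.encode("latin-1")  # one byte: 0xDC
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0x9C

# Bytes -> Unicode: decode(), using the same encoding that produced the bytes.
round_trip_latin1 = latin1_bytes.decode("latin-1")
round_trip_utf8 = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding does not fail here; it silently yields different text.
mojibake = utf8_bytes.decode("latin-1")
```
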
  19. System Default Encodings

    >>> latin1_string.encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)
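
Why does calling `encode()` produce a *decode* error? In Python 2, `latin1_string` is a byte string, and encoding only makes sense for Unicode, so Python 2 first decodes the bytes implicitly with its default encoding, ASCII. The sketch below imitates that behaviour in Python 3; `py2_style_encode` is a hypothetical helper written for this illustration, not a real API.

```python
latin1_string = b"\xdc"  # 'Ü' encoded as Latin-1, i.e. bytes rather than text


def py2_style_encode(byte_string, encoding):
    """Hypothetical helper: mimic Python 2's implicit ASCII decode before encoding."""
    return byte_string.decode("ascii").encode(encoding)


# The target encoding never matters: the hidden ASCII decode fails first.
try:
    py2_style_encode(latin1_string, "utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# The fix is to decode explicitly with the encoding the bytes are really in.
fixed = latin1_string.decode("latin-1").encode("utf-8")
```

In Python 3 the implicit step is gone entirely, which is why the same call fails earlier with `AttributeError: 'bytes' object has no attribute 'encode'`.
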
  20. Clue #4: Unicode “Sandwich”

    1. Decode (Bytes → Unicode) as soon as possible
    2. Encode (Unicode → Bytes) as late as possible
    3. Deal with Unicode everywhere in the middle.
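
One way to picture the sandwich, as a sketch under assumptions: the functions `shout` and `process` and the choice of Latin-1 input and UTF-8 output are invented for this example. Bytes exist only at the two edges, and all real work happens on `str`.

```python
def shout(text: str) -> str:
    """The filling: pure text logic that never sees bytes."""
    return text.upper()


def process(raw_in: bytes, in_encoding: str, out_encoding: str) -> bytes:
    text = raw_in.decode(in_encoding)   # 1. decode as soon as the bytes arrive
    result = shout(text)                # 3. work with Unicode in the middle
    return result.encode(out_encoding)  # 2. encode only when the bytes leave


# Latin-1 bytes for 'über' in, UTF-8 bytes for 'ÜBER' out.
out = process(b"\xfcber", "latin-1", "utf-8")
```

In Python 3, `open(path, encoding=...)` performs the decode and encode steps at the file boundary, so code that reads and writes text files this way is already sandwich-shaped.
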
  21. Detecting Encodings

    Q: How do I know the encoding of a stream of bytes?
    A1: You don’t.
    A2: Unless you’re told.
    A3: But even then you can’t be sure.
    A4: Fortunately, you can “guess”.
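
A deliberately naive take on “guessing”, using only the standard library: try candidate encodings in order and keep the first that decodes without error. The function name and the candidate list are assumptions made for this sketch. Third-party detectors such as `chardet` use statistical heuristics instead and are usually a better choice in practice.

```python
def guess_decode(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    Order matters: Latin-1 accepts every possible byte, so it must come last.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


plain = guess_decode(b"hello")
utf8 = guess_decode("\u00dc".encode("utf-8"))
fallback = guess_decode(b"\xdc")

# A3 in action: byte 208 "succeeds" as Latin-1 even if the sender meant Greek.
wrong_but_plausible = guess_decode(bytes([208]))
```
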
  22. Python 2 and 3 Compatibility

    - six.text_type: Always Unicode
    - six.binary_type: Always Bytes
    - six.string_types: Possible string data
  23. Python 2 and 3 Compatibility

    - six.text_type: Always Unicode
    - six.binary_type: Always Bytes
    - six.string_types: Possible string data
    - from __future__ import unicode_literals: All string literals in source code are Unicode.
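
To keep the example free of third-party imports, the sketch below spells out by hand what the three `six` names resolve to on each major version; in real code you would simply `import six`. The helper `to_text` and its UTF-8 default are hypothetical, added only to show the names in use.

```python
import sys

if sys.version_info[0] >= 3:
    text_type = str               # what six.text_type is on Python 3
    binary_type = bytes           # what six.binary_type is on Python 3
    string_types = (str,)         # what six.string_types is on Python 3
else:
    text_type = unicode           # noqa: F821 (Python 2 only)
    binary_type = str
    string_types = (basestring,)  # noqa: F821 (Python 2 only)


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return Unicode text whichever type comes in."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        return value.decode(encoding)
    raise TypeError("expected text or bytes, got %r" % type(value))


from_bytes = to_text(b"\xc3\x9c")
from_text = to_text(u"\u00dc")
```
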
  24. Mystery Solved (Conclusion)

    Clue #1: Humans:Characters :: Computers:Bytes
    Clue #2: U → B: encode(); B → U: decode()
    Clue #3: “Unicode” isn’t an encoding. “UTF-8” is.
    Clue #4: Unicode Sandwich
    Clue #5: Be prepared for anything.
    Clue #6: Use six and unicode_literals
  25. Resources

    - http://www.joelonsoftware.com/articles/Unicode.html (“Spolsky Unicode”)
    - http://nedbatchelder.com/text/unipain.html (“Unipain Python”)
    - https://speakerdeck.com/ramalho/unicode-solutions-in-python-2-and-python-3 (“Speakerdeck Unicode Python”)
    - https://www.safaribooksonline.com/library/view/fluent-python/9781491946237/ch04.html (“Fluent Python Unicode”)

    Thank you! Questions?
    /gtback @gtback
  26. Unicode Zoo

    → U+2192 RIGHTWARDS ARROW
    ✔ U+2714 HEAVY CHECK MARK
    ✘ U+2718 HEAVY BALLOT X
    👍 U+1F44D THUMBS UP SIGN
    👎 U+1F44E THUMBS DOWN SIGN
  27. Unicode Zoo

    Ü U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
    ñ U+00F1 LATIN SMALL LETTER N WITH TILDE
    î U+00EE LATIN SMALL LETTER I WITH CIRCUMFLEX
    ç U+00E7 LATIN SMALL LETTER C WITH CEDILLA
    ø U+00F8 LATIN SMALL LETTER O WITH STROKE
    d U+0064 LATIN SMALL LETTER D
    é U+00E9 LATIN SMALL LETTER E WITH ACUTE
  28. Unicode Zoo

    ¯ U+00AF MACRON
    \ U+005C REVERSE SOLIDUS
    _ U+005F LOW LINE
    ( U+0028 LEFT PARENTHESIS
    ツ U+30C4 KATAKANA LETTER TU
    ) U+0029 RIGHT PARENTHESIS
    / U+002F SOLIDUS
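
The zoo listings can be reproduced with the standard-library `unicodedata` module, which maps characters to their official names and back. A short sketch; the `describe` helper is an illustrative name, not part of the talk.

```python
import unicodedata


def describe(ch):
    """Format one character the way the zoo slides do: code point, then name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Arrow, check mark, ballot X, thumbs up, thumbs down (emoji written as escapes).
zoo = [describe(ch) for ch in "\u2192\u2714\u2718\U0001F44D\U0001F44E"]

# The shrug, one character at a time.
shrug = [describe(ch) for ch in "¯\\_(ツ)_/¯"]

# lookup() goes from an official name back to the character.
ballot = unicodedata.lookup("HEAVY BALLOT X")

print("\n".join(zoo + shrug))
```
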
  29. ℉◎υя ṧ¢øяℯ @ηⅾ ṧε♥εᾔ ¥ε@ґṧ αℊø øüя ḟ@т♄εґṧ ßґ☺υℊ♄☂ ḟ☺ґ⊥н øη ⊥ℌ☤ṧ ¢øη☂☤ᾔℯηт, @ η℮ẘ ηα⊥☤øη, ¢☺ηḉℯїṽ℮ḓ їᾔ ℒ☤♭ℯя☂ƴ, αηḓ ⅾℯḓḯ¢α⊥℮ⅾ ⊥◎ т♄℮ ρґ◎℘øṧ☤тїøᾔ тнα☂ αʟʟ μεη @яℯ ¢я℮α☂ε∂ εⓠü@ʟ▪

    Ⓕⓞⓤⓡ ⓢⓒⓞⓡⓔ ⓐⓝⓓ ⓢⓔⓥⓔⓝ ⓨⓔⓐⓡⓢ ⓐⓖⓞ ⓞⓤⓡ ⓕⓐⓣⓗⓔⓡⓢ ⓑⓡⓞⓤⓖⓗⓣ ⓕⓞⓡⓣⓗ ⓞⓝ ⓣⓗⓘⓢ ⓒⓞⓝⓣⓘⓝⓔⓝⓣ, ⓐ ⓝⓔⓦ ⓝⓐⓣⓘⓞⓝ, ⓒⓞⓝⓒⓔⓘⓥⓔⓓ ⓘⓝ Ⓛⓘⓑⓔⓡⓣⓨ, ⓐⓝⓓ ⓓⓔⓓⓘⓒⓐⓣⓔⓓ ⓣⓞ ⓣⓗⓔ ⓟⓡⓞⓟⓞⓢⓘⓣⓘⓞⓝ ⓣⓗⓐⓣ ⓐⓛⓛ ⓜⓔⓝ ⓐⓡⓔ ⓒⓡⓔⓐⓣⓔⓓ ⓔⓠⓤⓐⓛ◎

    , , , .

    Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
  31. Final Clue: DO NOT use non-standard code points for stylistic effect.

    But for comedic effect, it's encouraged! ☺
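
One practical reason behind the final clue: to a program, stylized look-alikes are different text. The sketch below goes slightly beyond the slides by using `unicodedata.normalize` with the NFKC form. The circled letters carry compatibility mappings back to plain letters, so NFKC can fold them, but the mixed-script look-alikes have no such mapping and stay unsearchable.

```python
import unicodedata

plain = "Four score"
circled = "Ⓕⓞⓤⓡ ⓢⓒⓞⓡⓔ"    # CIRCLED LATIN letters, as in the second variant
lookalike = "℉◎υя ṧ¢øяℯ"    # mixed symbols and scripts, as in the first variant

# Neither stylized string equals, or even contains, the plain words.
circled_matches = circled == plain
word_found = "score" in lookalike

# NFKC normalization recovers the circled text, but not the look-alike text.
folded_circled = unicodedata.normalize("NFKC", circled)
folded_lookalike = unicodedata.normalize("NFKC", lookalike)
```
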