Slide 1

Slide 1 text

ℙƴ☂ℌøἤ ⒝⒴⒯⒠⒮ DΣMYƧƬIFIΣD Boris FELD - PyParis, Paris - 2017

Slide 2

Slide 2 text

Boris FELD Python developer Mercurial and Python consultant at Octobus https://lothiraldan.github.io/ @lothiraldan /me

Slide 3

Slide 3 text

Unicode is ���!

Slide 4

Slide 4 text

Let's test it!

Slide 5

Slide 5 text

What is the length of this Unicode string in Python 2? len(u'😎') 1, 2, 3 or 4? 1. Unicode length

Slide 6

Slide 6 text

It depends on your Python build:
DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
$> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27mu/bin/python \
   -c "print len(u'\U0001f60e')"
1
But it can also be:
DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
$> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27m/bin/python \
   -c "print len(u'\U0001f60e')"
2
Unicode length

Slide 7

Slide 7 text

When could you see this error message? UnicodeEncodeError: 'ascii' codec can't encode character When doing .encode('ascii')? When doing .decode('ascii')? When doing .decode('utf-8')? In all of these situations? 2. UnicodeEncodeError

Slide 8

Slide 8 text

In all of these situations!
>>> x = u'é'
>>> x.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> x.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> x.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
UnicodeEncodeError

Slide 9

Slide 9 text

When should you use chr and unichr? You should always use chr. You should always use unichr. You should use chr for ASCII and unichr for Unicode. 3. Chr vs unichr

Slide 10

Slide 10 text

Prefer using unichr for everything. Chr vs unichr

Slide 11

Slide 11 text

Skeptical dog is skeptical

Slide 12

Slide 12 text

We have to go back!

Slide 13

Slide 13 text

The 60s

Slide 14

Slide 14 text

Apollo 11

Slide 15

Slide 15 text

Woodstock

Slide 16

Slide 16 text

Something important

Slide 17

Slide 17 text

Something huge

Slide 18

Slide 18 text

ASCII was born

Slide 19

Slide 19 text

In the 1960s, the American Standards Association wanted to answer the question: how to represent text digitally? The important question

Slide 20

Slide 20 text

Problem: computers only speak bits. How do we transform text into bits? Problem

Slide 21

Slide 21 text

We know how to convert integers to binary:
0 = 0000000
1 = 0000001
2 = 0000010
3 = 0000011
.............
127 = 1111111
Let's assign each character an integer from 0 to 127, called a "code point". Pretty simple solution
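As a quick illustration of this mapping (a minimal sketch, run with Python 3, though the same holds in Python 2; ord and chr are introduced in more detail a few slides later):
# Each character gets an integer code point, which fits in 7 bits.
assert ord('a') == 97
assert format(ord('a'), '07b') == '1100001'
# And the mapping can be reversed.
assert chr(97) == 'a'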

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

ASCII with Python

Slide 24

Slide 24 text

Let's take a string: "pyparis" A string is a sequence of characters: assert list("pyparis") == ['p', 'y', 'p', 'a', 'r', 'i', 's'] What is a string?

Slide 25

Slide 25 text

assert type("pyparis"[0]) == assert len("pyparis"[0]) == 1 A character (from the Greek χαρακτήρ "engraved or stamped mark" on coins or seals, "branding mark, symbol") is a sign or symbol. — Wikipedia A character is basically anything. It could represents be a letter, a digit or even an emoji. What is character

Slide 26

Slide 26 text

To retrieve the ASCII code point of a character, we can use ord: assert ord("p") == 112 To reverse the process, we can use chr: assert chr(112) == "p" Code point in Python

Slide 27

Slide 27 text

Character   p    y    p    a    r    i    s
Code Point  112  121  112  97   114  105  115
Code points

Slide 28

Slide 28 text

Character   p        y        p        a        r        i        s
Code Point  112      121      112      97       114      105      115
Binary      1110000  1111001  1110000  1100001  1110010  1101001  1110011
encode: code point → binary
decode: binary → code point
ASCII encoding
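This table can be reproduced directly (a minimal sketch, assuming Python 3, though ord behaves the same on a Python 2 str):
word = "pyparis"
code_points = [ord(c) for c in word]
assert code_points == [112, 121, 112, 97, 114, 105, 115]
# ASCII fits in 7 bits, so each code point is shown as 7 binary digits.
binary = [format(cp, '07b') for cp in code_points]
assert binary == ['1110000', '1111001', '1110000', '1100001', '1110010', '1101001', '1110011']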

Slide 29

Slide 29 text

encode is meant to transform a string into some bytes:
from binascii import hexlify, unhexlify
string = 'abc'
data = string.encode('ascii')
assert hexlify(data) == b'616263'
decode is meant to transform some bytes into a string:
data = unhexlify('616263')
string = data.decode('ascii')
assert string == 'abc'
Each of these methods accepts an encoding parameter: the name of the conversion algorithm to use. Encode vs Decode

Slide 30

Slide 30 text

Everything is awesome...

Slide 31

Slide 31 text

... right?

Slide 32

Slide 32 text

Small problem

Slide 33

Slide 33 text

ASCII solved the problem for the USA but not for everyone else. Not everyone speaks English

Slide 34

Slide 34 text

ASCII only uses the 7 lower bits of a byte: 01100001 But on most computers a byte is actually 8 bits, so we could support more characters. And so new standards were born... Other standards

Slide 35

Slide 35 text

Some were based on ASCII and used the 8th bit to add support for accents, for example Latin1, which defines the character É with the code point 201. Some others were not compatible at all, like EBCDIC, used on IBM mainframes, where code point 75 (1001011) represents the punctuation mark "." while in ASCII it represents "K". Of course they were not all cross-compatible... Other standards
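A quick check of the Latin1 claim (a minimal sketch, assuming Python 3; 'latin-1' is the codec name in the standard library):
# É has code point 201 in Latin1, which is the single byte 0xC9.
assert 'É'.encode('latin-1') == b'\xc9'
assert 0xc9 == 201
# The same character does not exist in ASCII at all.
try:
    'É'.encode('ascii')
except UnicodeEncodeError:
    pass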

Slide 36

Slide 36 text

It was a mess

Slide 37

Slide 37 text

Initial text           a         b         ã         é
Latin1 Code Point      97        98        227       233
Latin1 encoding        01100001  01100010  11100011  11101001
ASCII decoding         a         b         ERROR     ERROR
Mac OS Roman decoding  a         b         „         È
EBCDIC decoding        /         ERROR     T         Z
Example
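The same confusion can be reproduced directly (a minimal sketch, assuming Python 3; 'mac_roman' is the standard-library name for the Mac OS Roman codec):
data = 'abãé'.encode('latin-1')
assert data == b'ab\xe3\xe9'
# Decoded as Latin1, we get the original text back.
assert data.decode('latin-1') == 'abãé'
# Decoded as Mac OS Roman, the accented bytes become different characters.
assert data.decode('mac_roman') == 'ab„È'
# Decoded as ASCII, bytes above 127 are simply invalid.
try:
    data.decode('ascii')
except UnicodeDecodeError:
    pass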

Slide 38

Slide 38 text

Here comes our savior!

Slide 39

Slide 39 text

One Standard to rule them all, One Standard to find them, One Standard to bring them all and in the greater good bind them Unicode the savior

Slide 40

Slide 40 text

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. — Wikipedia It all started in 1987-1988 as a collaboration between Joe Becker from Xerox and Lee Collins and Mark Davis from Apple. The Unicode code points are, fortunately for us, ASCII-compatible. What is Unicode?

Slide 41

Slide 41 text

The latest version of Unicode contains a repertoire of 128,237 characters covering 135 modern and historic scripts, as well as multiple symbol sets. — Wikipedia ASCII defined 128 characters, so Unicode defines roughly 1000 times more characters. It defines several blocks: Basic Latin: ab...XYZ Greek, Aramaic, Cherokee: ΔעᏗ Right to left scripts, Cuneiform, hieroglyphs: Mahjong Tiles, Domino Tiles, Playing cards: Emoticons, Musical notations: Unicode size
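Every one of these characters has a code point and an official name, which can be looked up from Python (a minimal sketch, assuming Python 3 and the standard-library unicodedata module):
import unicodedata
assert unicodedata.name('A') == 'LATIN CAPITAL LETTER A'
assert unicodedata.name('Δ') == 'GREEK CAPITAL LETTER DELTA'
assert unicodedata.name('€') == 'EURO SIGN'
assert ord('€') == 8364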

Slide 42

Slide 42 text

Remember the ASCII table? Unicode vs ASCII

Slide 43

Slide 43 text

Unicode with Python

Slide 44

Slide 44 text

Let's take a Unicode character: €. First, declare the encoding of your Python source file as utf-8:
# -*- coding: utf-8 -*-
Then, you can write it this way: u'€'
Or: u'\u20AC'
Its code point is 8364: ord(u'€') == 8364
How to write Unicode in Python

Slide 45

Slide 45 text

Let's convert the code point into binary:
                  €
Code Point        8364
Naive conversion  00100000 10101100
Problem

Slide 46

Slide 46 text

It doesn't fit into 1 byte. The problems when you start using more than 1 byte are multiple and annoying: How to order the bytes (Big and Little Endian problems, anyone)? How to recognize which byte you are reading in a file or stream? How to detect and correct transmission errors where only some bytes are missing? 8364 in binary takes two bytes. Unicode code points go well beyond 1 000 000 (many are not allocated yet), taking up to 3 bytes. Multi-bytes
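To make the size problem concrete (a minimal sketch, assuming Python 3):
# The euro sign's code point does not fit in a single byte...
assert ord('€') == 8364 > 255
# ...so every real encoding has to spread it over several bytes.
assert len('€'.encode('utf-8')) == 3
assert len('€'.encode('utf-16-be')) == 2
assert len('€'.encode('utf-32-be')) == 4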

Slide 47

Slide 47 text

As ASCII was simple, transforming ASCII code points into binary was straightforward. But the presence of high code point characters in Unicode complicates the process. There are multiple ways of doing it, called encodings: UTF-8, UTF-16, UTF-32. Multiple encodings

Slide 48

Slide 48 text

If you are not sure, use UTF-8: it is compatible with every character, works well most of the time and solves the multi-byte problems elegantly. If you process more Asian characters than Latin, use UTF-16 so you use less space and memory. If you need to interact with another program, use that program's default encoding (CSV, anyone?). Comparison of Unicode encodings - Wikipedia Choose an encoding

Slide 49

Slide 49 text

UTF-8 Everywhere Manifesto UTF-8 everywhere

Slide 50

Slide 50 text

                  A                                    €
Code Point        65                                   8364
Naive conversion  01000001                             00100000 10101100
UTF-8             01000001                             11100010 10000010 10101100
UTF-16            00000000 01000001                    00100000 10101100
UTF-32            00000000 00000000 00000000 01000001  00000000 00000000 00100000 10101100
What are the differences?
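These byte sequences can be checked directly (a minimal sketch, assuming Python 3; the -be variants are used to avoid the byte-order mark that plain 'utf-16'/'utf-32' would prepend):
text = 'A€'
assert text.encode('utf-8') == b'A\xe2\x82\xac'
assert text.encode('utf-16-be') == b'\x00A\x20\xac'
assert text.encode('utf-32-be') == b'\x00\x00\x00A\x00\x00\x20\xac'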

Slide 51

Slide 51 text

Let's clarify something: encode is meant to transform a Unicode string into some bytes:
from binascii import hexlify, unhexlify
assert hexlify(u'é'.encode('utf-8')) == b'c3a9'
decode is meant to transform some bytes into a Unicode string:
assert unhexlify('c3a9').decode('utf-8') == u'é'
Encode vs Decode

Slide 52

Slide 52 text

Python 2

Slide 53

Slide 53 text

Counting the length of an ASCII string is easy: count the number of bytes! But it's much harder with Unicode strings. Python 2 tries hard to get you a correct answer. Let's go back to our example: 😎. Its code point is 128526. 1. String length

Slide 54

Slide 54 text

Python 2 comes in several flavors, two of which are related to Unicode: it's either a narrow build or a wide build. It basically changes how Python stores its strings. For code points < 65535, everything works the same: Python stores each character as exactly one code unit. For code points > 65535, it differs. The wide build's character size is enough for all Unicode code points, but the narrow build's character size is not, so it stores those upper code points as a pair of characters (a surrogate pair). The narrow build uses less memory, but this explains why it returns 2 for len(u'😎'): Python 2 actually stores two characters. Multiple flavors of Python 2
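You can check which flavor you are running (a minimal sketch, Python 2 only; sys.maxunicode reveals the build type):
import sys
if sys.maxunicode > 0xFFFF:
    # Wide build: one storage unit per code point.
    assert len(u'\U0001f60e') == 1
else:
    # Narrow build: code points above 0xFFFF become a surrogate pair.
    assert len(u'\U0001f60e') == 2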

Slide 55

Slide 55 text

Remember the meaning of encode and decode? Encode transforms a Unicode string into some bytes. Decode transforms some bytes into a Unicode string. 2. Encoding / Decoding in Python 2

Slide 56

Slide 56 text

Python 2 always had a string type, and introduced the unicode type in Python 2.0. Python 2's str is badly named, as it's basically a bag of bytes. When you display it, Python will try to decode it for you. So for ASCII-only strings, encode and decode return the same value:
x = 'abc'
assert x.encode('ascii') == x
assert x.decode('ascii') == x
Python 2 type system

Slide 57

Slide 57 text

Python is a strongly typed language, meaning that Python shouldn't coerce types behind your back:
>>> '012' + 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects
But it doesn't respect this property with strings. Remember that decode converts bytes into a Unicode string in Python 2?
x = u'é'
x.decode('utf-8')
As decode is called on a unicode instance, which isn't bytes, Python first tries to make some bytes out of the string, effectively doing:
x = u'é'
x.encode('ascii').decode('utf-8')
That's why you can see a UnicodeEncodeError while trying to decode a Unicode string. Python 2 type coercing

Slide 58

Slide 58 text

You can use chr to get the character of a code point:
assert chr(65) == 'A'
But it only works for code points below 256!
>>> chr(8364)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)
For Unicode you need to use unichr:
assert unichr(8364) == u'€'
3. Python 2 chr vs unichr

Slide 59

Slide 59 text

Python 3 ♥ ♥ ♥ ♥

Slide 60

Slide 60 text

Python 3 now always stores its strings the same way and len returns the right answer no matter what:
x = '😎'
assert len(x) == 1
1. Python 3 single flavor

Slide 61

Slide 61 text

Python 3's biggest change was to the string type system:
           Byte strings   Unicode strings
Python 2   str            unicode
Python 3   bytes          str
2. Python 3 big change
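A quick way to see the new split (a minimal sketch, assuming Python 3):
# Text literals are str (Unicode), byte literals are bytes.
assert type('abc') is str
assert type(b'abc') is bytes
# The two types are never silently mixed.
assert 'abc' != b'abc'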

Slide 62

Slide 62 text

Now that Python 3 has separate types for bytes and strings, we can no longer mix up encode and decode:
>>> string = '😎'
>>> string.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
Decoding a Unicode string never made sense anyway.
>>> data = b'\xf0\x9f\x98\x8e'
>>> data.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
So you always know which types you are dealing with. 2. Python 3 coherent type system

Slide 63

Slide 63 text

Unicode strings are now the norm, so Python 3 dropped the u prefix for Unicode strings and added a b prefix for bytes, so you can directly write: x = '😎' Python 3.3 reintroduced the u prefix for codebases that need to be compatible with both Python 2 and Python 3, so this also works: x = u'😎' 2. No more u prefix

Slide 64

Slide 64 text

Python 3 no longer has separate chr and unichr functions, just use chr: assert chr(65) == 'A' assert chr(8364) == '€' 3. Python 3 chr

Slide 65

Slide 65 text

Pain relief tips

Slide 66

Slide 66 text

Thanks to the new type system, it is now easier to identify which parts of the code need to encode strings and decode bytes:
bytes (outside world) → decode (library) → unicode (business logic) → encode (library) → bytes (outside world)
1. Unicode sandwich
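A minimal sketch of the pattern, assuming Python 3 and hypothetical helper names:
def read_request(raw):           # boundary: bytes come in...
    return raw.decode('utf-8')   # ...decode as early as possible
def process(text):               # business logic only ever sees str
    return text.upper()
def write_response(text):        # boundary: encode as late as possible
    return text.encode('utf-8')
assert write_response(process(read_request(b'caf\xc3\xa9'))) == b'CAF\xc3\x89'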

Slide 67

Slide 67 text

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end. — Python doc on unicode Unicode sandwich

Slide 68

Slide 68 text

You cannot infer the encoding of bytes; rely on the declared encoding instead:
Content-Type: text/html; charset=ISO-8859-4
# -*- coding: iso8859-1 -*-
If you really really really really need to guess the encoding, you can use chardet, but remember: it's best effort. 2. Use declared encoding
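If you do end up guessing, the call looks roughly like this (a minimal sketch, assuming chardet is installed via pip install chardet; the result is only a guess with a confidence score, never a guarantee):
import chardet
raw = 'caféé et ça'.encode('latin-1')
guess = chardet.detect(raw)
# chardet returns a dict with its best guess and a confidence score; treat it as a hint only.
print(guess['encoding'], guess['confidence'])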

Slide 69

Slide 69 text

encode and decode accept a second argument for error handling. By default it is set to strict, which means crash:
>>> x = u'abcé'
>>> x.encode('ascii', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3...
You can also use replace to replace each invalid character by ?:
assert x.encode('ascii', errors='replace') == 'abc?'
Or you can simply ignore them:
assert x.encode('ascii', errors='ignore') == 'abc'
Finally you can replace them by their XML character reference:
assert x.encode('ascii', errors='xmlcharrefreplace') == 'abc&#233;'
3. Error handling

Slide 70

Slide 70 text

Use Unicode whenever possible. Use Python 3. Explicitly encode and decode strings in Python 2: it might solve bugs in your code and ease the Python 3 conversion. Unicode sandwich. Never guess an encoding! Use error handling. Conclusion

Slide 71

Slide 71 text

for c in range(0x1F410, 0x1F4f0): print (r"\U%08x"%c).decode("unicode-escape"), Python fun

Slide 72

Slide 72 text

Thank you!

Slide 73

Slide 73 text

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Pragmatic Unicode
Unicode In Python, Completely Demystified
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
Holy batman
Reddit on unicode
References