History of Unicode

16 bits = 65,536 distinct values

Code Point   Character   Name
U+0041       A           LATIN CAPITAL LETTER A
U+00E9       é           LATIN SMALL LETTER E WITH ACUTE
U+00F1       ñ           LATIN SMALL LETTER N WITH TILDE
U+1F4A9      💩          PILE OF POO
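The code points and names in the table can be looked up from Python 3 itself. A minimal sketch using the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point (as 'U+XXXX') and the Unicode name of one character."""
    return "U+{:04X}".format(ord(ch)), unicodedata.name(ch)


# The four characters from the table, written as escapes to stay ASCII-safe.
for ch in ["A", "\u00e9", "\u00f1", "\U0001F4A9"]:
    print(describe(ch))

# U+1F4A9 is larger than 0xFFFF, so it does not fit in 16 bits.
print(ord("\U0001F4A9") > 0xFFFF)
```

The last line shows why 65,536 values are not enough for every code point in use today.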
Encoding to ASCII

Convert a Unicode string to ASCII:
● Code point < 128: each byte is the same as the value of the code point
● Code point >= 128: encoding raises UnicodeEncodeError
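Both cases can be tried in Python 3. A minimal sketch; the helper `to_ascii` is a hypothetical name, and returning None on failure is just one possible way to handle the error.

```python
def to_ascii(text):
    """Encode text as ASCII; return None if any code point is >= 128."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


print(to_ascii("melon"))        # every code point is < 128
print(to_ascii("mel\u00f3n"))   # U+00F3 is outside ASCII, so this is None

# The errors argument trades the exception for a lossy result.
print("mel\u00f3n".encode("ascii", errors="replace"))
```

With errors="replace" the unencodable character is replaced by "?", so the original text cannot be recovered from the bytes.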
Encoding to UTF-8

Convert a Unicode string to UTF-8:
● Code point < 128: the byte is the same as the value of the code point
● Code point >= 128: it is turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255
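The byte counts can be checked directly in Python 3. A minimal sketch; the sample characters are chosen for illustration (the euro sign U+20AC is not from the slides), and `utf8_length` is a made-up helper name.

```python
def utf8_length(ch):
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample per sequence length: 1, 2, 3, and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F4A9"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+{:04X}".format(ord(ch)), len(encoded), encoded)

# Every byte of a multi-byte sequence falls in the range 128..255.
print(all(b >= 128 for b in "\U0001F4A9".encode("utf-8")))
```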
Decoding to Unicode

>>> bar = b'mel\xc3\xb3n'
>>> bar.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> bar.decode('utf-8')
'melón'
Reading and Writing Unicode Data (Python 3)

>>> with open('unicode.txt', encoding='utf-8') as f:
...     for line in f:
...         print(line)

* locale.getpreferredencoding(False): Return the encoding used for text data. This is typically what open() uses when encoding is omitted.
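Writing works the same way as reading: pass the encoding explicitly to open(). A minimal sketch, assuming a temporary file is acceptable for the demonstration; `round_trip` is a hypothetical helper name.

```python
import os
import tempfile


def round_trip(text, encoding="utf-8"):
    """Write text to a temporary file, then read it back with the same encoding."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)  # clean up the temporary file


# The text survives unchanged because both sides use the same encoding.
print(round_trip("mel\u00f3n\n") == "mel\u00f3n\n")
```

Naming the encoding on both the write and the read keeps the result independent of the platform default.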
Concatenate Strings (Python 2)

>>> u'me' + b'l\xc3\xb3n'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)*

>>> import sys
>>> # The encoding used for these implicit decodings is the value
>>> sys.getdefaultencoding()
'ascii'
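One way to avoid the implicit decoding is to decode the bytes explicitly before concatenating. A minimal sketch written for Python 3 rather than Python 2, assuming the bytes are UTF-8; `join_text` is a made-up helper name.

```python
def join_text(text, raw, encoding="utf-8"):
    """Decode raw bytes explicitly, then concatenate the result with text."""
    return text + raw.decode(encoding)


result = join_text("me", b"l\xc3\xb3n")
print(result == "mel\u00f3n")

# Python 3 never decodes implicitly: mixing str and bytes is a TypeError.
try:
    "me" + b"l\xc3\xb3n"
except TypeError as exc:
    print(type(exc).__name__)
```

The explicit decode makes the choice of encoding visible instead of leaving it to a default.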