Slide 1

Slide 1 text

Unicode: WTF? @_amatellanes

Slide 2

Slide 2 text

Unicode

Slide 3

Slide 3 text

We need bytes but they need a meaning

Slide 4

Slide 4 text

History of Unicode

1968: ASCII (American Standard Code for Information Interchange) was standardized.

Slide 5

Slide 5 text

History of Unicode

ASCII (decimal)  Hex  Symbol
0                0    NUL
...
65               41   A
66               42   B
67               43   C
68               44   D
...
127              7F   DEL

Slide 6

Slide 6 text

History of Unicode

10 PRINT "MISE A JOUR TERMINEE"
11 PRINT "PARAMETRES ENREGISTRES"

ASCII has no accented letters, so these French messages are written without their accents.

Slide 7

Slide 7 text

History of Unicode

16 bits = 65,536 distinct values

Code Point  Character  Name
U+0041      A          LATIN CAPITAL LETTER A
U+00E9      é          LATIN SMALL LETTER E WITH ACUTE
U+00F1      ñ          LATIN SMALL LETTER N WITH TILDE
U+1F4A9     💩         PILE OF POO
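The mapping between characters, code points, and names in the table can be checked from Python 3. This is a minimal sketch using the standard unicodedata module; the variable names (characters, rows, round_trip, fits_in_16_bits) are illustrative and not part of the original slides.

```python
import unicodedata

# The four characters from the table, written as escapes so the code
# points are visible in the source.
characters = ["\u0041", "\u00e9", "\u00f1", "\U0001F4A9"]

rows = []
for ch in characters:
    code_point = ord(ch)                    # character -> integer code point
    label = "U+{:04X}".format(code_point)   # conventional U+XXXX notation
    rows.append((label, unicodedata.name(ch)))

# Only ASCII text is printed, so this works on any terminal encoding.
for label, name in rows:
    print(label, name)

# chr() is the inverse of ord().
round_trip = chr(0x00F1)

# U+1F4A9 is above 0xFFFF, so 16 bits are not enough to hold it.
fits_in_16_bits = ord("\U0001F4A9") <= 0xFFFF
```

The last check suggests why 65,536 values turned out to be too few: the final row of the table already lies outside that range.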

Slide 8

Slide 8 text

Encoding to ASCII

Convert a Unicode string to ASCII:

● Code point < 128: each byte is the same as the value of the code point.
● Code point >= 128: you will get an error.
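The two rules above can be sketched in Python 3 as follows. The helper names can_encode_ascii and ascii_bytes_match_code_points are hypothetical, introduced only for this illustration.

```python
def can_encode_ascii(text):
    """Return True when every code point in text is below 128."""
    return all(ord(ch) < 128 for ch in text)


def ascii_bytes_match_code_points(text):
    """Check that each ASCII byte equals the code point it encodes."""
    encoded = text.encode("ascii")
    return list(encoded) == [ord(ch) for ch in text]


# Rule 1: code points below 128 map to identical byte values.
plain_ok = can_encode_ascii("foo") and ascii_bytes_match_code_points("foo")

# Rule 2: a code point >= 128 makes the ASCII codec raise an error.
accented = "mel\u00f3n"  # 'melon' with an acute accent on the o
try:
    accented.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True
```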

Slide 9

Slide 9 text

UTF: Unicode Transformation Format

UTF encodings include:

● UTF-8
● UTF-16
● UTF-32
● ...
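To get a feel for how these encodings differ, the sketch below counts the bytes each one needs for three sample characters in Python 3. The choice of samples and the use of the little-endian codec names ("utf-16-le", "utf-32-le") are assumptions made for this illustration; they are not from the slides.

```python
# One ASCII letter, one accented letter, and one code point above U+FFFF.
samples = ["A", "\u00f1", "\U0001F4A9"]

# The "-le" (little-endian) codec names are used so that no byte order
# mark is prepended and the lengths reflect the character alone.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

byte_lengths = {
    ch: [len(ch.encode(enc)) for enc in encodings]
    for ch in samples
}

for ch in samples:
    print("U+{:04X}".format(ord(ch)), byte_lengths[ch])
```

UTF-32 always uses four bytes, while UTF-8 and UTF-16 vary with the code point.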

Slide 10

Slide 10 text

Encoding to UTF

Convert Unicode to UTF-8:

● Code point < 128: each byte is the same as the value of the code point.
● Code point >= 128: it is turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
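The UTF-8 rules above can be observed directly in Python 3. In this sketch the sample characters (including the euro sign, which does not appear in the slides) and the helper all_high_bytes are illustrative choices.

```python
# A, n with tilde, euro sign, and U+1F4A9: one sample per sequence length.
samples = ["A", "\u00f1", "\u20ac", "\U0001F4A9"]

utf8 = {ch: ch.encode("utf-8") for ch in samples}


def all_high_bytes(encoded):
    """Return True when every byte is between 128 and 255."""
    return all(128 <= b <= 255 for b in encoded)


for ch in samples:
    encoded = utf8[ch]
    # Print the code point, the sequence length, and the byte values.
    print("U+{:04X}".format(ord(ch)), len(encoded), list(encoded))
```

Because ASCII characters keep their single-byte values, any ASCII text is also valid UTF-8.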

Slide 11

Slide 11 text

Python

Slide 12

Slide 12 text

String Types

Literal  Python 2  Python 3
b'abc'   str       bytes
u'abc'   unicode   str
'abc'    str       str

Slide 13

Slide 13 text

String Types

Literal  Python 2  Python 3
b'abc'   str       bytes
u'abc'   unicode   str
'abc'    str       str

In Python 2, str is a sequence of bytes, NOT a text string.

Slide 14

Slide 14 text

Unicode in Python 3

>>> 'Δ'
>>> u'Δ'
>>> '\N{GREEK CAPITAL LETTER DELTA}'
>>> '\u0394'
>>> '\U00000394'
>>> chr(int('0x0394', 16))
>>> b'\xce\x94'.decode('utf-8')

* unicodedata - Unicode Database
  https://docs.python.org/3/library/unicodedata.html

Slide 15

Slide 15 text

Unicode in Python 2

>>> u'Δ'
>>> unicode('\xce\x94', encoding='utf-8')
>>> unichr(int('0x0394', 16))
>>> b'\xce\x94'.decode('utf-8')

* unicodedata - Unicode Database
  https://docs.python.org/2/library/unicodedata.html

Slide 16

Slide 16 text

Encoding Unicode to ASCII

>>> foo = 'foo'
>>> type(foo)
<class 'str'>
>>> foo.encode('ascii')
b'foo'
>>> foo = 'melón'
>>> type(foo)
<class 'str'>
>>> foo.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 3: ordinal not in range(128)

Slide 17

Slide 17 text

Encoding Unicode to UTF-8

>>> foo = 'foo'
>>> type(foo)
<class 'str'>
>>> foo.encode('utf-8')
b'foo'
>>> foo = 'melón'
>>> type(foo)
<class 'str'>
>>> foo.encode('utf-8')
b'mel\xc3\xb3n'

Slide 18

Slide 18 text

Encoding Unicode to UTF-8

>>> foo = 'melón'
>>> bytes(foo, 'utf-8')  # Only Python 3
b'mel\xc3\xb3n'

Slide 19

Slide 19 text

Decoding to Unicode

>>> bar = b'mel\xc3\xb3n'
>>> bar.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> bar.decode('utf-8')
'melón'
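A wrong codec does not always raise an error; some codecs accept any byte and silently produce the wrong text. The sketch below, for Python 3, decodes the same bytes with UTF-8 and with Latin-1. Latin-1 is not mentioned in the slides and is used here only as an assumed example of a mismatched codec.

```python
data = b"mel\xc3\xb3n"

# Correct codec: the two bytes 0xC3 0xB3 become one character, U+00F3.
as_utf8 = data.decode("utf-8")

# Mismatched codec: Latin-1 maps every byte to one character, so the
# same two bytes become U+00C3 and U+00B3 and no error is raised.
as_latin1 = data.decode("latin-1")

# ascii() escapes non-ASCII characters, so printing is safe anywhere.
print(ascii(as_utf8), len(as_utf8))
print(ascii(as_latin1), len(as_latin1))
```

The bytes are identical in both cases; only the encoding chosen to interpret them gives them a meaning.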

Slide 20

Slide 20 text

Unicode Literals in Python Source Code

# -*- coding: <encoding name> -*-

                  Python 3  Python > 2.4  Python ≤ 2.4
Default encoding  UTF-8     ASCII         Euro-centric

Slide 21

Slide 21 text

Reading and Writing Unicode Data (Python 3)

>>> with open('unicode.txt', encoding='utf-8') as f:
...     for line in f:
...         print(line)

* locale.getpreferredencoding(False): returns the encoding used for text data.
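The slide shows only reading. A minimal round trip in Python 3, writing text and reading it back, might look like the sketch below. The temporary directory and the variable names are assumptions made so the example is self-contained.

```python
import os
import tempfile

text = "mel\u00f3n"  # 'melon' with an acute accent on the o

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "unicode.txt")

    # Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode shows the bytes that were actually stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading with the same encoding recovers the original str.
    with open(path, encoding="utf-8") as f:
        read_back = f.read()
```

Passing encoding explicitly avoids depending on the platform default.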

Slide 22

Slide 22 text

Reading and Writing Unicode Data (Python 2)

>>> import codecs
>>> f = codecs.open('unicode.txt', encoding='utf-8')
>>> for line in f:
...     print line

Slide 23

Slide 23 text

Tip

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
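One way to picture this tip in Python 3 is a function that decodes at the input boundary, works on str in the middle, and encodes at the output boundary. The function name to_upper_utf8 and its encoding parameter are hypothetical, chosen only for this sketch.

```python
def to_upper_utf8(raw_input_bytes, encoding="utf-8"):
    """Upper-case encoded text while handling str internally."""
    # 1. Decode the input as soon as possible.
    text = raw_input_bytes.decode(encoding)

    # 2. Work only with Unicode strings internally.
    result = text.upper()

    # 3. Encode the output only at the end.
    return result.encode(encoding)


output = to_upper_utf8(b"mel\xc3\xb3n")
```

Upper-casing the raw bytes instead would leave the accented letter unchanged, since bytes.upper() only touches ASCII letters.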

Slide 24

Slide 24 text

Concatenate Strings (Python 3)

>>> 'me' + 'lón'
>>> b'me' + b'l\xc3\xb3n'
>>> 'me' + b'l\xc3\xb3n'

Slide 25

Slide 25 text

Concatenate Strings (Python 3)

>>> 'me' + 'lón'
'melón'
>>> b'me' + b'l\xc3\xb3n'
b'mel\xc3\xb3n'
>>> 'me' + b'l\xc3\xb3n'
TypeError: Can't convert 'bytes' object to str implicitly

Slide 26

Slide 26 text

Concatenate Strings (Python 2)

>>> 'me' + 'lón'
>>> u'me' + u'lón'
>>> b'me' + b'l\xc3\xb3n'

Slide 27

Slide 27 text

Concatenate Strings (Python 2)

>>> 'me' + 'lón'
'mel\xc3\xb3n'
>>> u'me' + u'lón'
u'mel\xf3n'
>>> b'me' + b'l\xc3\xb3n'
'mel\xc3\xb3n'

Slide 28

Slide 28 text

Concatenate Strings (Python 2)

>>> u'me' + b'l\xc3\xb3n'

Slide 29

Slide 29 text

Concatenate Strings (Python 2)

>>> u'me' + b'l\xc3\xb3n'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)*

* The encoding used for these implicit decodings is the value of sys.getdefaultencoding():

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
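The error above comes from the implicit ASCII decoding. Decoding the bytes explicitly with the encoding they were written in avoids it. This sketch should behave the same in Python 2.6+ and Python 3.3+; it assumes the bytes are UTF-8, and the variable names are illustrative.

```python
prefix = u"me"
suffix_bytes = b"l\xc3\xb3n"

# Decode explicitly instead of relying on the implicit ASCII decoding.
combined = prefix + suffix_bytes.decode("utf-8")
```

After the explicit decode both operands are Unicode strings, so the concatenation no longer depends on the default encoding.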

Slide 30

Slide 30 text

Python 2 and 3 compatibility

As of Python 2.6 (PEP 3112):

>>> from __future__ import unicode_literals

Slide 31

Slide 31 text

References

● Unicode HOWTO Python 2
  ○ https://docs.python.org/2/howto/unicode.html
● Unicode HOWTO Python 3
  ○ https://docs.python.org/3/howto/unicode.html
● Ned Batchelder: Pragmatic Unicode
  ○ http://nedbatchelder.com/text/unipain.html
  ○ https://www.youtube.com/watch?v=sgHbC6udIqc
● Facundo Batista: Entendiendo Unicode
  ○ https://www.youtube.com/watch?v=Dr1R4ZlVLxI