Unicode: WTF?

Unicode: WTF? @_amatellanes

Unicode

We need bytes but they need a meaning

History of Unicode
1968: ASCII (American Standard Code for Information Interchange) was standardized.

History of Unicode
ASCII   Hex   Symbol
0       0
127     7F    DEL
...
65      41    A
66      42    B
67      43    C
68      44    D
...

History of Unicode
10 PRINT "MISE A JOUR TERMINEE"
11 PRINT "PARAMETRES ENREGISTRES"

History of Unicode
16 bits = 65,536 distinct values
Code Point   Character   Name
U+0041       A           LATIN CAPITAL LETTER A
U+00E9       é           LATIN SMALL LETTER E WITH ACUTE
U+00F1       ñ           LATIN SMALL LETTER N WITH TILDE
U+1F4A9      💩          PILE OF POO

Encoding to ASCII
Convert a Unicode string to ASCII:
• Code point < 128: each byte is the same as the value of the code point
• Code point >= 128: you will get an error

UTF
Unicode Transformation Format. UTF encodings include:
• UTF-8
• UTF-16
• UTF-32
• ...

Encoding to UTF
Convert Unicode to UTF-8:
• Code point < 128: each byte is the same as the value of the code point
• Code point >= 128: it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255
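The two UTF-8 rules above can be checked directly in Python 3. This is a minimal sketch: the helper name `utf8_profile` and the sample characters are assumptions chosen for illustration, not part of the original slides.

```python
# Sample characters (an illustrative choice): one each for the
# 1-, 2-, 3- and 4-byte UTF-8 cases.
samples = ['A', '\u00e9', '\u20ac', '\U0001F4A9']


def utf8_profile(ch):
    """Return (code point, number of UTF-8 bytes, list of byte values)."""
    encoded = ch.encode('utf-8')
    return ord(ch), len(encoded), list(encoded)


for ch in samples:
    code_point, length, byte_values = utf8_profile(ch)
    print('U+%04X -> %d byte(s): %s' % (code_point, length, byte_values))
```

For code points below 128 the single byte equals the code point; for everything else, every byte in the sequence falls in the 128 to 255 range.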

Python

String Types
Literal   Python 2   Python 3
b'abc'    str        bytes
u'abc'    unicode    str
'abc'     str        str
In Python 2, str holds bytes, NOT text.

Unicode in Python 3
>>> 'Δ'
>>> u'Δ'
>>> '\N{GREEK CAPITAL LETTER DELTA}'
>>> '\u0394'
>>> '\U00000394'
>>> chr(int('0x0394', 16))
>>> b'\xce\x94'.decode('utf-8')
* unicodedata - Unicode Database: https://docs.python.org/3/library/unicodedata.html
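As a small illustration of the unicodedata module referenced above, the sketch below looks up the same Greek delta by character and by name (Python 3). The variable names are assumptions made for this example.

```python
import unicodedata

delta = '\u0394'

# Character -> official Unicode name
name = unicodedata.name(delta)

# Official name -> character (the inverse of the call above)
same_char = unicodedata.lookup(name)

# General category: 'Lu' means "Letter, uppercase"
category = unicodedata.category(delta)

print(name, same_char, category)
```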

Unicode in Python 2
>>> u'Δ'
>>> unicode('\xce\x94', encoding='utf-8')
>>> unichr(int('0x0394', 16))
>>> b'\xce\x94'.decode('utf-8')
* unicodedata - Unicode Database: https://docs.python.org/2/library/unicodedata.html

Encoding Unicode to ASCII
>>> foo = 'foo'
>>> type(foo)
<class 'str'>
>>> foo.encode('ascii')
b'foo'
>>> foo = 'melón'
>>> type(foo)
<class 'str'>
>>> foo.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 3: ordinal not in range(128)

Encoding Unicode to UTF-8
>>> foo = 'foo'
>>> type(foo)
<class 'str'>
>>> foo.encode('utf-8')
b'foo'
>>> foo = 'melón'
>>> type(foo)
<class 'str'>
>>> foo.encode('utf-8')
b'mel\xc3\xb3n'

Encoding Unicode to UTF-8
>>> foo = 'melón'
>>> bytes(foo, 'utf-8')  # Only Python 3
b'mel\xc3\xb3n'
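When the target really has to be ASCII, str.encode accepts an errors argument that replaces the UnicodeEncodeError with a lossy fallback. The slides do not cover this; the sketch below is an addition that only uses handlers from the standard library (Python 3), and which handler is appropriate depends on the application.

```python
foo = 'mel\u00f3n'  # 'melón'

# Default behaviour (errors='strict') raises UnicodeEncodeError.
try:
    foo.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lossy alternatives: each one trades fidelity for an ASCII-only result.
replaced = foo.encode('ascii', errors='replace')           # '?' placeholder
ignored = foo.encode('ascii', errors='ignore')             # character dropped
escaped = foo.encode('ascii', errors='backslashreplace')   # Python-style escape
xml_ref = foo.encode('ascii', errors='xmlcharrefreplace')  # XML character reference

print(replaced, ignored, escaped, xml_ref)
```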

Decoding to Unicode
>>> bar = b'mel\xc3\xb3n'
>>> bar.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> bar.decode('utf-8')
'melón'
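Decoding with the wrong codec does not always raise an error. The sketch below (Python 3) decodes the same bytes as Latin-1, which accepts every byte value and silently produces garbled text. Latin-1 is an assumption picked for illustration; the slides only show ASCII and UTF-8.

```python
raw = b'mel\xc3\xb3n'

right = raw.decode('utf-8')      # the intended text
wrong = raw.decode('latin-1')    # no error, but each byte becomes one character

# If the mistake is known, it can be undone by reversing the wrong step.
repaired = wrong.encode('latin-1').decode('utf-8')

print(repr(right), repr(wrong), repr(repaired))
```

This is the "bytes need a meaning" point in practice: the byte sequence is identical, and only the declared encoding decides what text it represents.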

Unicode Literals in Python Source Code
# -*- coding: <encoding name> -*-
Default encoding when no declaration is present:
Python 3                   UTF-8
Python 2, later than 2.4   ASCII
Python 2.4 and earlier     Euro-centric (Latin-1 assumed)

Reading and Writing Unicode Data (Python 3)
>>> with open('unicode.txt', encoding='utf-8') as f:
...     for line in f:
...         print(line)
* locale.getpreferredencoding(False): returns the encoding used for text data.
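The slide above only shows reading. A minimal round-trip sketch for Python 3 follows, writing text with an explicit encoding and reading it back. The temporary directory and the sample text are assumptions made so the example is self-contained.

```python
import os
import tempfile

text = 'mel\u00f3n\n\u0394\n'  # 'melón' and 'Δ', one per line

# Hypothetical location: a fresh temporary directory, so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'unicode.txt')

# Writing: str goes in, UTF-8 bytes land on disk.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading: UTF-8 bytes come off disk, str comes out.
with open(path, encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]

# Binary mode skips decoding and exposes the stored bytes.
with open(path, 'rb') as f:
    raw = f.read()

print(lines, raw)
```

Passing encoding explicitly in both calls keeps the result independent of the platform's preferred encoding.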

Reading and Writing Unicode Data (Python 2)
>>> import codecs
>>> f = codecs.open('unicode.txt', encoding='utf-8')
>>> for line in f:
...     print line

Tip
Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
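The tip can be sketched as a function that decodes at the input boundary, works on str, and encodes at the output boundary (Python 3). The function name `shout` and the uppercase transformation are hypothetical stand-ins for real processing.

```python
def shout(raw_input_bytes, encoding='utf-8'):
    """Decode early, process text, encode late."""
    text = raw_input_bytes.decode(encoding)  # bytes -> str as soon as possible
    result = text.upper()                    # all processing happens on str
    return result.encode(encoding)           # str -> bytes only at the end


incoming = b'mel\xc3\xb3n'
outgoing = shout(incoming)

# For contrast: working on the bytes directly only touches ASCII letters,
# so the accented character is left in lower case.
bytes_only = incoming.upper()

print(outgoing, bytes_only)
```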

Concatenate Strings (Python 3)
>>> 'me' + 'lón'
'melón'
>>> b'me' + b'l\xc3\xb3n'
b'mel\xc3\xb3n'
>>> 'me' + b'l\xc3\xb3n'
TypeError: Can't convert 'bytes' object to str implicitly
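Since Python 3 refuses to mix str and bytes, one side has to be converted explicitly first. A minimal sketch, assuming the bytes are known to be UTF-8 (the variable names are illustrative):

```python
prefix = 'me'
suffix_bytes = b'l\xc3\xb3n'

# Option 1: decode the bytes and stay in text.
as_text = prefix + suffix_bytes.decode('utf-8')

# Option 2: encode the text and stay in bytes.
as_bytes = prefix.encode('utf-8') + suffix_bytes

print(repr(as_text), as_bytes)
```

The first option is usually the better fit for the tip about working with Unicode strings internally.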

Concatenate Strings (Python 2)
>>> 'me' + 'lón'
'mel\xc3\xb3n'
>>> u'me' + u'lón'
u'mel\xf3n'
>>> b'me' + b'l\xc3\xb3n'
'mel\xc3\xb3n'

Concatenate Strings (Python 2)
>>> u'me' + b'l\xc3\xb3n'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)*
* The encoding used for these implicit decodings is the value of sys.getdefaultencoding():
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Python 2 and 3 compatibility
As of Python 2.6 (PEP 3112):
>>> from __future__ import unicode_literals

References
• Unicode HOWTO Python 2
  ◦ https://docs.python.org/2/howto/unicode.html
• Unicode HOWTO Python 3
  ◦ https://docs.python.org/3/howto/unicode.html
• Ned Batchelder: Pragmatic Unicode
  ◦ http://nedbatchelder.com/text/unipain.html
  ◦ https://www.youtube.com/watch?v=sgHbC6udIqc
• Facundo Batista: Entendiendo Unicode
  ◦ https://www.youtube.com/watch?v=Dr1R4ZlVLxI

Adrián Matellanes
