Unicode: WTF?

A quick overview of what Unicode is and how Python supports it.

Adrián Matellanes

September 14, 2016

Transcript

  1. History of Unicode

    ASCII   Hex   Symbol
    0       0
    127     7F    DEL
    ...
    65      41    A
    66      42    B
    67      43    C
    68      44    D
    ...
  2. History of Unicode

    16 bits = 65,536 distinct values

    Code Point   Character   Name
    U+0041       A           LATIN CAPITAL LETTER A
    U+00E9       é           LATIN SMALL LETTER E WITH ACUTE
    U+00F1       ñ           LATIN SMALL LETTER N WITH TILDE
    U+1F4A9      💩          PILE OF POO
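The code point table above can be reproduced from Python 3 itself. This is a minimal sketch using only the standard library: `ord` returns a character's code point, `chr` is its inverse, and `unicodedata.name` returns the official name. The helper name `describe` is hypothetical and exists only for this illustration.

```python
import unicodedata


def describe(char):
    """Return (code point label, official Unicode name) for one character."""
    # ord() gives the integer code point; format it in the usual U+XXXX style.
    label = 'U+{:04X}'.format(ord(char))
    return label, unicodedata.name(char)


# The four characters from the table, written as escapes where helpful.
for char in ['A', '\u00e9', '\u00f1', '\U0001F4A9']:
    print(describe(char))

# chr() is the inverse of ord().
assert chr(0x41) == 'A'
assert ord('\U0001F4A9') == 0x1F4A9
```

Note that U+1F4A9 does not fit in 16 bits, which is why a fixed 16-bit scheme is not enough to cover all of Unicode.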
  3. Encoding to ASCII

    Convert a Unicode string into ASCII:
    • Code point < 128: each byte is the same as the value of the code point.
    • Code point >= 128: you will get an error.
  4. Encoding to UTF-8

    Convert a Unicode string into UTF-8:
    • Code point < 128: each byte is the same as the value of the code point.
    • Code point >= 128: it is turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
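The two UTF-8 rules above can be checked directly in Python 3. The sketch below encodes one character from each length class and lists the resulting byte values. The helper name `utf8_bytes` is hypothetical, and the euro sign (U+20AC) is an extra sample character chosen here only because it needs three bytes.

```python
def utf8_bytes(char):
    """Return the UTF-8 encoding of a single character as a list of ints."""
    return list(char.encode('utf-8'))


# One, two, three and four byte sequences, respectively.
for char in ['A', '\u00f1', '\u20ac', '\U0001F4A9']:
    encoded = utf8_bytes(char)
    print('U+{:04X} -> {} byte(s): {}'.format(ord(char), len(encoded), encoded))

# Code points below 128 map to a single byte with the same value.
assert utf8_bytes('A') == [ord('A')]

# Code points from 128 upward produce only bytes in the range 128..255.
assert all(128 <= b <= 255 for b in utf8_bytes('\u00f1'))
```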
  5. String Types

    Literal   Python 2   Python 3
    b'abc'    str        bytes
    u'abc'    unicode    str
    'abc'     str        str

    In Python 2, str is for bytes, NOT for text strings.
  6. Unicode in Python 3

    >>> 'Δ'
    >>> u'Δ'
    >>> '\N{GREEK CAPITAL LETTER DELTA}'
    >>> '\u0394'
    >>> '\U00000394'
    >>> chr(int('0x0394', 16))
    >>> b'\xce\x94'.decode('utf-8')

    * unicodedata - Unicode Database: https://docs.python.org/3/library/unicodedata.html
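All of the Python 3 spellings listed on this slide should produce the same one-character string. The following sketch checks that, and uses `unicodedata.lookup` for the reverse mapping from name to character. The variable names are illustrative only.

```python
import unicodedata

delta = '\u0394'

# Every spelling from the slide, collected in one list.
spellings = [
    'Δ',
    u'Δ',
    '\N{GREEK CAPITAL LETTER DELTA}',
    '\U00000394',
    chr(int('0x0394', 16)),
    b'\xce\x94'.decode('utf-8'),
]

# They all compare equal to the same single-character str.
assert all(s == delta for s in spellings)
assert len(delta) == 1

# unicodedata maps a name back to its character.
assert unicodedata.lookup('GREEK CAPITAL LETTER DELTA') == delta
print(unicodedata.name(delta))
```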
  7. Unicode in Python 2

    >>> u'Δ'
    >>> unicode('\xce\x94', encoding='utf-8')
    >>> unichr(int('0x0394', 16))
    >>> b'\xce\x94'.decode('utf-8')

    * unicodedata - Unicode Database: https://docs.python.org/2/library/unicodedata.html
  8. Encoding Unicode to ASCII

    >>> foo = 'foo'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('ascii')
    b'foo'
    >>> foo = 'melón'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 3: ordinal not in range(128)
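The error above comes from the default `errors='strict'` policy of `str.encode`. The slides do not cover the alternatives, so the following is only a hedged sketch of the other standard policies in Python 3. All of them lose or alter information, so whether any of them is acceptable depends on the application.

```python
foo = 'mel\u00f3n'  # 'melón'

try:
    foo.encode('ascii')
except UnicodeEncodeError as exc:
    # The default policy, errors='strict', raises as shown on the slide.
    print('strict:', exc.reason)

# Drop characters that cannot be encoded.
ignored = foo.encode('ascii', errors='ignore')

# Substitute a question mark for each unencodable character.
replaced = foo.encode('ascii', errors='replace')

# Keep a readable escape sequence instead of the character.
escaped = foo.encode('ascii', errors='backslashreplace')

print(ignored, replaced, escaped)
```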
  9. Encoding Unicode to UTF-8

    >>> foo = 'foo'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('utf-8')
    b'foo'
    >>> foo = 'melón'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('utf-8')
    b'mel\xc3\xb3n'
  10. Encoding Unicode to UTF-8

    >>> foo = 'melón'
    >>> bytes(foo, 'utf-8')  # Only Python 3
    b'mel\xc3\xb3n'
  11. Decoding to Unicode

    >>> bar = b'mel\xc3\xb3n'
    >>> bar.decode('ascii')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    >>> bar.decode('utf-8')
    'melón'
  12. Unicode Literals in Python Source Code

    # -*- coding: <encoding name> -*-

                        Python 3   Python > 2.4   Python ≤ 2.4
    Default encoding    UTF-8      ASCII          Euro-centric
  13. Reading and Writing Unicode Data (Python 3)

    >>> with open('unicode.txt', encoding='utf-8') as f:
    ...     for line in f:
    ...         print(line)

    * locale.getpreferredencoding(False): returns the encoding used for text data.
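The slide shows only the reading side. As a hedged sketch of a full round trip in Python 3, the example below writes a string with an explicit encoding, reads it back as text, and then reads the same file in binary mode to show the bytes on disk. The temporary directory is an assumption made here so the example does not touch real data.

```python
import os
import tempfile

text = 'mel\u00f3n'  # 'melón'

# Write to a throwaway file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'unicode.txt')

# Encoding happens on write: the str is turned into UTF-8 bytes.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Decoding happens on read: UTF-8 bytes come back as str.
with open(path, encoding='utf-8') as f:
    lines = list(f)

# Binary mode skips decoding and shows the raw bytes stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(raw)
```

Passing `encoding` explicitly on both sides avoids depending on the platform's preferred encoding.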
  14. Reading and Writing Unicode Data (Python 2)

    >>> import codecs
    >>> f = codecs.open('unicode.txt', encoding='utf-8')
    >>> for line in f:
    ...     print line
  15. Tip

    Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
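The tip above can be sketched as a three-step function in Python 3: decode at the input boundary, process as `str`, encode at the output boundary. The function name `shout` and the choice of uppercasing as the "processing" step are hypothetical, picked only to show why the work should happen on text rather than bytes.

```python
def shout(raw_input_bytes, encoding='utf-8'):
    """Decode at the edge, work on str, encode at the other edge."""
    # 1. Decode the input as soon as it arrives.
    text = raw_input_bytes.decode(encoding)
    # 2. All processing happens on Unicode strings.
    text = text.upper()
    # 3. Encode only when the result leaves the program.
    return text.encode(encoding)


result = shout(b'mel\xc3\xb3n')
print(result)

# Working on the raw bytes instead only uppercases the ASCII letters,
# so the accented character is left in lowercase.
assert b'mel\xc3\xb3n'.upper() == b'MEL\xc3\xb3N'
```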
  16. Concatenate Strings (Python 3)

    >>> 'me' + 'lón'
    >>> b'me' + b'l\xc3\xb3n'
    >>> 'me' + b'l\xc3\xb3n'
  17. Concatenate Strings (Python 3)

    >>> 'me' + 'lón'
    'melón'
    >>> b'me' + b'l\xc3\xb3n'
    b'mel\xc3\xb3n'
    >>> 'me' + b'l\xc3\xb3n'
    TypeError: Can't convert 'bytes' object to str implicitly
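Since Python 3 refuses to mix `str` and `bytes`, one side has to be converted explicitly. A minimal sketch of both options follows. It assumes the bytes are UTF-8, which the program has to know from context because the bytes themselves do not say.

```python
text = 'me'
data = b'l\xc3\xb3n'

try:
    text + data
except TypeError:
    # Python 3 never guesses an encoding for you.
    print('str + bytes is rejected')

# Option 1: decode the bytes and stay in str.
as_text = text + data.decode('utf-8')

# Option 2: encode the str and stay in bytes.
as_bytes = text.encode('utf-8') + data

print(as_bytes)
```

Option 1 is usually the one that matches the advice to work with Unicode strings internally.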
  18. Concatenate Strings (Python 2)

    >>> 'me' + 'lón'
    'mel\xc3\xb3n'
    >>> u'me' + u'lón'
    u'mel\xf3n'
    >>> b'me' + b'l\xc3\xb3n'
    'mel\xc3\xb3n'
  19. Concatenate Strings (Python 2)

    >>> u'me' + b'l\xc3\xb3n'
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

    * The encoding used for these implicit decodings is the value of sys.getdefaultencoding():

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
  20. Python 2 and 3 compatibility

    As of Python 2.6 (PEP 3112):

    >>> from __future__ import unicode_literals
  21. References

    • Unicode HOWTO Python 2
      ◦ https://docs.python.org/2/howto/unicode.html
    • Unicode HOWTO Python 3
      ◦ https://docs.python.org/3/howto/unicode.html
    • Ned Batchelder: Pragmatic Unicode
      ◦ http://nedbatchelder.com/text/unipain.html
      ◦ https://www.youtube.com/watch?v=sgHbC6udIqc
    • Facundo Batista - Entendiendo Unicode
      ◦ https://www.youtube.com/watch?v=Dr1R4ZlVLxI