Unicode: WTF?

A quick overview of what Unicode is and how Python supports it.

Adrián Matellanes

September 14, 2016

Transcript

  1. Unicode: WTF? @_amatellanes

  2. Unicode

  3. We need bytes but they need a meaning

  4. History of Unicode

    1968: ASCII (American Standard Code for Information Interchange) was standardized
  5. History of Unicode

    ASCII  Hex  Symbol
    0      0
    ...
    65     41   A
    66     42   B
    67     43   C
    68     44   D
    ...
    127    7F   DEL
  6. History of Unicode

    10 PRINT "MISE A JOUR TERMINEE"
    11 PRINT "PARAMETRES ENREGISTRES"
  7. History of Unicode

    16 bits = 65,536 distinct values

    Code Point  Character  Name
    U+0041      A          LATIN CAPITAL LETTER A
    U+00E9      é          LATIN SMALL LETTER E WITH ACUTE
    U+00F1      ñ          LATIN SMALL LETTER N WITH TILDE
    U+1F4A9     💩         PILE OF POO
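The code point table on this slide can be reproduced with Python 3's standard library. The following is only a small sketch: the helper name `describe` is hypothetical, and the characters are written as escapes so the snippet does not depend on the source file's encoding.

```python
import unicodedata


def describe(char):
    """Return the 'U+XXXX' label and the official Unicode name of one character."""
    return "U+{:04X}".format(ord(char)), unicodedata.name(char)


# The four characters from the slide: A, é, ñ and PILE OF POO.
rows = [describe(c) for c in "A\u00e9\u00f1\U0001F4A9"]
for label, name in rows:
    print(label, name)
```

Note that U+1F4A9 is larger than 65,535, so it does not fit in a single 16-bit value.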
  8. Encoding to ASCII

    Convert a Unicode string into ASCII:
    • Code point < 128: each byte is the same as the value of the code point
    • Code point >= 128: you will get an error
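The two rules above can be spelled out by hand. This is an illustrative sketch in Python 3, not how the built-in codec is implemented; `encode_ascii` is a hypothetical helper that should agree with `str.encode('ascii')`.

```python
def encode_ascii(text):
    """Encode text to ASCII by hand, following the two rules on the slide."""
    out = bytearray()
    for position, char in enumerate(text):
        code_point = ord(char)
        if code_point >= 128:
            # Rule 2: anything outside the ASCII range is an error.
            raise UnicodeEncodeError(
                "ascii", text, position, position + 1,
                "ordinal not in range(128)",
            )
        # Rule 1: the byte value equals the code point.
        out.append(code_point)
    return bytes(out)


print(encode_ascii("foo"))
```

Calling `encode_ascii` on a string such as 'melón' raises `UnicodeEncodeError`, the same exception type the real codec raises.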
  9. UTF

    Unicode Transformation Format

    UTF encodings include:
    • UTF-8
    • UTF-16
    • UTF-32
    • ...
  10. Encoding to UTF

    Convert Unicode to UTF-8:
    • Code point < 128: each byte is the same as the value of the code point
    • Code point >= 128: it is turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255
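To see the variable-length rule in practice, one can encode a few characters and look at the resulting bytes. A minimal Python 3 sketch follows; the sample characters (A, ñ, the euro sign and PILE OF POO) are chosen only for illustration.

```python
# One character for each possible UTF-8 length: 1, 2, 3 and 4 bytes.
samples = ["A", "\u00f1", "\u20ac", "\U0001F4A9"]

for char in samples:
    encoded = char.encode("utf-8")
    # Show the code point, the number of bytes, and each byte value.
    print("U+{:04X}".format(ord(char)), len(encoded), list(encoded))

lengths = [len(char.encode("utf-8")) for char in samples]
```

Every byte produced for the non-ASCII samples falls between 128 and 255, as the slide states.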
  11. Python

  12. String Types

    Literal  Python 2  Python 3
    b'abc'   str       bytes
    u'abc'   unicode   str
    'abc'    str       str
  13. String Types

    Literal  Python 2  Python 3
    b'abc'   str       bytes
    u'abc'   unicode   str
    'abc'    str       str

    In Python 2, str is for bytes, NOT strings
  14. Unicode in Python 3

    >>> 'Δ'
    >>> u'Δ'
    >>> '\N{GREEK CAPITAL LETTER DELTA}'
    >>> '\u0394'
    >>> '\U00000394'
    >>> chr(int('0x0394', 16))
    >>> b'\xce\x94'.decode('utf-8')

    * unicodedata - Unicode Database: https://docs.python.org/3/library/unicodedata.html
  15. Unicode in Python 2

    >>> u'Δ'
    >>> unicode('\xce\x94', encoding='utf-8')
    >>> unichr(int('0x0394', 16))
    >>> b'\xce\x94'.decode('utf-8')

    * unicodedata - Unicode Database: https://docs.python.org/2/library/unicodedata.html
  16. Encoding Unicode to ASCII

    >>> foo = 'foo'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('ascii')
    b'foo'
    >>> foo = 'melón'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 3: ordinal not in range(128)
  17. Encoding Unicode to UTF-8

    >>> foo = 'foo'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('utf-8')
    b'foo'
    >>> foo = 'melón'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('utf-8')
    b'mel\xc3\xb3n'
  18. Encoding Unicode to UTF-8

    >>> foo = 'melón'
    >>> bytes(foo, 'utf-8')  # Only Python 3
    b'mel\xc3\xb3n'
  19. Decoding to Unicode

    >>> bar = b'mel\xc3\xb3n'
    >>> bar.decode('ascii')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    >>> bar.decode('utf-8')
    'melón'
  20. Unicode Literals in Python Source Code

    # -*- coding: <encoding name> -*-

                      Python 3  Python > 2.4  Python ≤ 2.4
    Default encoding  UTF-8     ASCII         Euro-centric
  21. Reading and Writing Unicode Data (Python 3)

    >>> with open('unicode.txt', encoding='utf-8') as f:
    ...     for line in f:
    ...         print(line)

    * locale.getpreferredencoding(False): Return the encoding used for text data.
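The slide only shows reading. As a hedged sketch of the full round trip in Python 3, the snippet below writes a string with an explicit encoding, inspects the raw bytes on disk, and reads the text back. The temporary directory is an assumption made so the example leaves no files behind; the file name is the one used on the slide.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unicode.txt")

    # Writing: str is encoded to UTF-8 bytes on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write("mel\u00f3n")

    # Binary mode shows the bytes that were actually stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading: bytes are decoded back to str on the way in.
    with open(path, encoding="utf-8") as f:
        text = f.read()

print(raw, text)
```

Passing `encoding` explicitly avoids depending on the platform default reported by `locale.getpreferredencoding(False)`.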
  22. Reading and Writing Unicode Data (Python 2)

    >>> import codecs
    >>> f = codecs.open('unicode.txt', encoding='utf-8')
    >>> for line in f:
    ...     print line
  23. Tip

    Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
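One way to picture this tip is a function that receives bytes, decodes them immediately, does all of its work on `str`, and encodes only when returning. The sketch below is in Python 3; the function name `shout` and the choice of UTF-8 as the default are assumptions made for illustration.

```python
def shout(raw_input, encoding="utf-8"):
    """Upper-case a byte string, working with text internally."""
    text = raw_input.decode(encoding)   # decode as soon as the data arrives
    result = text.upper() + "!"         # all processing happens on str
    return result.encode(encoding)      # encode only at the very end


print(shout(b"mel\xc3\xb3n"))
```

Because the upper-casing happens on decoded text, the accented letter is handled as one character rather than as two unrelated bytes.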
  24. Concatenate Strings (Python 3)

    >>> 'me' + 'lón'
    >>> b'me' + b'l\xc3\xb3n'
    >>> 'me' + b'l\xc3\xb3n'
  25. Concatenate Strings (Python 3)

    >>> 'me' + 'lón'
    'melón'
    >>> b'me' + b'l\xc3\xb3n'
    b'mel\xc3\xb3n'
    >>> 'me' + b'l\xc3\xb3n'
    TypeError: Can't convert 'bytes' object to str implicitly
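Since Python 3 refuses to mix `str` and `bytes`, the conversion has to be made explicit on one side or the other. A minimal sketch, assuming the bytes are UTF-8 encoded (the variable names are illustrative):

```python
prefix = "me"
suffix = b"l\xc3\xb3n"

# Option 1: decode the bytes and concatenate as text.
as_text = prefix + suffix.decode("utf-8")

# Option 2: encode the text and concatenate as bytes.
as_bytes = prefix.encode("utf-8") + suffix

print(as_text, as_bytes)
```

Which option is appropriate depends on whether the surrounding code works with text or with raw bytes.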
  26. Concatenate Strings (Python 2)

    >>> 'me' + 'lón'
    >>> u'me' + u'lón'
    >>> b'me' + b'l\xc3\xb3n'
  27. Concatenate Strings (Python 2)

    >>> 'me' + 'lón'
    'mel\xc3\xb3n'
    >>> u'me' + u'lón'
    u'mel\xf3n'
    >>> b'me' + b'l\xc3\xb3n'
    'mel\xc3\xb3n'
  28. Concatenate Strings (Python 2) >>> u'me' + b'l\xc3\xb3n'

  29. Concatenate Strings (Python 2)

    >>> u'me' + b'l\xc3\xb3n'
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)*

    >>> import sys
    >>> # The encoding used for these implicit decodings is the value of:
    >>> sys.getdefaultencoding()
    'ascii'
  30. Python 2 and 3 compatibility

    As of Python 2.6 (PEP 3112):

    >>> from __future__ import unicode_literals
  31. References

    • Unicode HOWTO Python 2
      ◦ https://docs.python.org/2/howto/unicode.html
    • Unicode HOWTO Python 3
      ◦ https://docs.python.org/3/howto/unicode.html
    • Ned Batchelder: Pragmatic Unicode
      ◦ http://nedbatchelder.com/text/unipain.html
      ◦ https://www.youtube.com/watch?v=sgHbC6udIqc
    • Facundo Batista - Entendiendo Unicode
      ◦ https://www.youtube.com/watch?v=Dr1R4ZlVLxI