Unicode: WTF?

A quick overview of what Unicode is and how Python supports it.

Adrián Matellanes

September 14, 2016

Transcript

  1. Unicode: WTF?
    @_amatellanes

  2. Unicode

  3. We need bytes but they need a meaning

  4. History of Unicode
    In 1968, ASCII (American Standard Code for Information Interchange) was standardized.

  5. History of Unicode
    Decimal  Hex  Symbol
    0        00   NUL
    ...
    65       41   A
    66       42   B
    67       43   C
    68       44   D
    ...
    127      7F   DEL

  6. History of Unicode
    10 PRINT "MISE A JOUR TERMINEE"
    11 PRINT "PARAMETRES ENREGISTRES"

  7. History of Unicode
    16 bits = 65,536 distinct values (the original Unicode design; code points now run up to U+10FFFF)
    Code Point  Character  Name
    U+0041      A          LATIN CAPITAL LETTER A
    U+00E9      é          LATIN SMALL LETTER E WITH ACUTE
    U+00F1      ñ          LATIN SMALL LETTER N WITH TILDE
    U+1F4A9     💩         PILE OF POO
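
The rows of the code point table can be reproduced from Python 3's standard library. This is a minimal sketch, not part of the original slides; `describe` is a hypothetical helper name chosen for illustration.

```python
import unicodedata


def describe(ch):
    """Hypothetical helper: format one character as 'U+XXXX NAME'."""
    # ord() returns the code point as an integer; unicodedata.name()
    # returns the official character name from the Unicode database.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The four characters from the table, written as escapes so the
# example does not depend on the source file's encoding.
for ch in ["\u0041", "\u00e9", "\u00f1", "\U0001F4A9"]:
    print(describe(ch))

# PILE OF POO lies above U+FFFF, so it does not fit in 16 bits.
fits_in_16_bits = ord("\U0001F4A9") <= 0xFFFF
```

The last line makes the limit concrete: a 16-bit value cannot hold every code point in use today.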

  8. Encoding to ASCII
    Converting a Unicode string to ASCII:
    ● Code point < 128: each byte is the same as the value of the code point
    ● Code point >= 128: you get an error
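
The two rules can be written out by hand to see that nothing else is going on. This is only a sketch in Python 3: `encode_ascii` is a hypothetical helper, and the built-in `str.encode('ascii')` is what real code should call.

```python
def encode_ascii(text):
    """Hypothetical helper: apply the two ASCII rules character by character."""
    out = bytearray()
    for position, ch in enumerate(text):
        code_point = ord(ch)
        if code_point < 128:
            # Rule 1: the byte value equals the code point.
            out.append(code_point)
        else:
            # Rule 2: anything else cannot be represented in ASCII.
            raise ValueError(
                f"cannot encode U+{code_point:04X} at position {position}"
            )
    return bytes(out)


print(encode_ascii("foo"))
```

For pure ASCII input the result matches `'foo'.encode('ascii')`; for `'mel\u00f3n'` the sketch raises at position 3, the same position the built-in codec reports.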

  9. UTF
    Unicode Transformation Format
    UTF encodings include:
    ● UTF-8
    ● UTF-16
    ● UTF-32
    ● ...

  10. Encoding to UTF
    Converting a Unicode string to UTF-8:
    ● Code point < 128: each byte is the same as the value of the code point
    ● Code point >= 128: it is turned into a sequence of two, three, or four
    bytes, where each byte of the sequence is between 128 and 255.
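
A quick way to check these rules in Python 3 is to encode one character from each length class and inspect the bytes. A minimal sketch; `utf8_profile` is a hypothetical helper and the sample characters are chosen only for illustration.

```python
def utf8_profile(ch):
    """Hypothetical helper: return (byte count, whether every byte is >= 128)."""
    encoded = ch.encode("utf-8")
    return len(encoded), all(byte >= 128 for byte in encoded)


samples = {
    "LATIN CAPITAL LETTER A": "\u0041",           # code point < 128
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",  # two bytes
    "EURO SIGN": "\u20ac",                        # three bytes
    "PILE OF POO": "\U0001F4A9",                  # four bytes
}

for name, ch in samples.items():
    print(name, ch.encode("utf-8"), utf8_profile(ch))
```

Only the first sample yields a byte below 128, which is why ASCII text is already valid UTF-8.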

  11. Python

  12. String Types
    Literal   Python 2   Python 3
    b'abc'    str        bytes
    u'abc'    unicode    str
    'abc'     str        str

  13. String Types
    Literal   Python 2   Python 3
    b'abc'    str        bytes
    u'abc'    unicode    str
    'abc'     str        str
    In Python 2, str holds bytes, NOT text

  14. Unicode in Python 3
    >>> 'Δ'
    >>> u'Δ'
    >>> '\N{GREEK CAPITAL LETTER DELTA}'
    >>> '\u0394'
    >>> '\U00000394'
    >>> chr(int('0x0394', 16))
    >>> b'\xce\x94'.decode('utf-8')
    * unicodedata - Unicode Database https://docs.python.org/3/library/unicodedata.html
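
Every spelling on this slide produces the same one-character string. The sketch below checks that in Python 3 and looks the character up in `unicodedata`; the variable names are illustrative only.

```python
import unicodedata

# Reference value: GREEK CAPITAL LETTER DELTA built from its code point.
delta = chr(0x0394)

spellings = [
    "\N{GREEK CAPITAL LETTER DELTA}",  # by name
    "\u0394",                          # 4-digit escape
    "\U00000394",                      # 8-digit escape
    chr(int("0x0394", 16)),            # from the integer code point
    b"\xce\x94".decode("utf-8"),       # from its UTF-8 bytes
]

all_same = all(s == delta for s in spellings)
name = unicodedata.name(delta)
print(all_same, name)
```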

  15. Unicode in Python 2
    >>> u'Δ'
    >>> unicode('\xce\x94', encoding='utf-8')
    >>> unichr(int('0x0394', 16))
    >>> b'\xce\x94'.decode('utf-8')
    * unicodedata - Unicode Database https://docs.python.org/2/library/unicodedata.html

  16. Encoding Unicode to ASCII
    >>> foo = 'foo'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('ascii')
    b'foo'
    >>> foo = 'melón'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 3: ordinal not in range(128)

  17. Encoding Unicode to UTF-8
    >>> foo = 'foo'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('utf-8')
    b'foo'
    >>> foo = 'melón'
    >>> type(foo)
    <class 'str'>
    >>> foo.encode('utf-8')
    b'mel\xc3\xb3n'

  18. Encoding Unicode to UTF-8
    >>> foo = 'melón'
    >>> bytes(foo, 'utf-8') # Only Python 3
    b'mel\xc3\xb3n'
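
In Python 3 the method and the constructor give the same bytes, and the encoded form is longer than the text because one character became two bytes. A small sketch with illustrative variable names:

```python
text = "mel\u00f3n"  # 'melón', written with an escape for portability

via_method = text.encode("utf-8")
via_constructor = bytes(text, "utf-8")

# len() counts code points for str and bytes for bytes.
character_count = len(text)
byte_count = len(via_method)
print(via_method == via_constructor, character_count, byte_count)
```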

  19. Decoding to Unicode
    >>> bar = b'mel\xc3\xb3n'
    >>> bar.decode('ascii')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    >>> bar.decode('utf-8')
    'melón'
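
When the real encoding is unknown, `bytes.decode` also accepts an `errors` argument. The slides do not cover it; this Python 3 sketch only illustrates what the standard `'replace'` and `'ignore'` handlers do to the same bytes. Both lose information, so decoding with the correct codec remains the real fix.

```python
raw = b"mel\xc3\xb3n"

# Strict decoding (the default) raises, as on the slide.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("ascii", errors="replace")

# 'ignore' silently drops each undecodable byte.
ignored = raw.decode("ascii", errors="ignore")

# Decoding with the encoding the bytes were written in recovers the text.
correct = raw.decode("utf-8")
print(strict_failed, ascii(replaced), ignored, ascii(correct))
```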

  20. Unicode Literals in Python Source Code
    # -*- coding: -*-
                        Python 3   Python > 2.4   Python ≤ 2.4
    Default encoding    UTF-8      ASCII          Euro-centric

  21. Reading and Writing Unicode Data (Python 3)
    >>> with open('unicode.txt', encoding='utf-8') as f:
    ...     for line in f:
    ...         print(line)
    * locale.getpreferredencoding(False) returns the encoding used for text data; open() falls back to it when no encoding is given.
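
The slide only reads a file. A round trip makes the encoding step visible on both sides. This is a Python 3 sketch that assumes a temporary directory is acceptable for the demo; `round_trip` is a hypothetical helper.

```python
import os
import tempfile


def round_trip(text, encoding="utf-8"):
    """Hypothetical helper: write text, then return (raw bytes, decoded text)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "unicode.txt")
        # Writing in text mode encodes str to bytes with the given codec.
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        # Binary mode shows the bytes that actually landed on disk.
        with open(path, "rb") as f:
            raw = f.read()
        # Reading in text mode decodes those bytes back to str.
        with open(path, encoding=encoding) as f:
            decoded = f.read()
    return raw, decoded


raw, decoded = round_trip("mel\u00f3n")
print(raw, ascii(decoded))
```

Passing `encoding` explicitly on both sides keeps the result independent of the machine's locale.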

  22. Reading and Writing Unicode Data (Python 2)
    >>> import codecs
    >>> f = codecs.open('unicode.txt', encoding='utf-8')
    >>> for line in f:
    ...     print line

  23. Tip
    Software should only work with Unicode strings internally, decoding the input data as
    soon as possible and encoding the output only at the end.
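
In code, the tip comes down to three steps: decode at the input boundary, work on `str`, encode at the output boundary. A minimal Python 3 sketch, where `shout` is a hypothetical function and UTF-8 is assumed as the wire encoding:

```python
def shout(raw_input, encoding="utf-8"):
    """Hypothetical example: upper-case incoming bytes, Unicode inside."""
    # 1. Decode as soon as the bytes enter the program.
    text = raw_input.decode(encoding)
    # 2. Do all the real work on Unicode strings.
    result = text.upper()
    # 3. Encode only when the data leaves the program.
    return result.encode(encoding)


print(shout(b"mel\xc3\xb3n"))
```

Upper-casing the raw bytes instead would leave the two bytes of the accented letter untouched, because `bytes.upper()` only changes ASCII letters.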

  24. Concatenate Strings (Python 3)
    >>> 'me' + 'lón'
    >>> b'me' + b'l\xc3\xb3n'
    >>> 'me' + b'l\xc3\xb3n'

  25. Concatenate Strings (Python 3)
    >>> 'me' + 'lón'
    'melón'
    >>> b'me' + b'l\xc3\xb3n'
    b'mel\xc3\xb3n'
    >>> 'me' + b'l\xc3\xb3n'
    TypeError: Can't convert 'bytes' object to str implicitly
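
Python 3 refuses to guess an encoding, so the fix is to convert one side explicitly. The sketch below shows both directions and assumes the bytes are UTF-8; the variable names are illustrative.

```python
prefix = "me"
suffix_bytes = b"l\xc3\xb3n"

# Mixing str and bytes raises TypeError in Python 3.
try:
    prefix + suffix_bytes
    mixed_failed = False
except TypeError:
    mixed_failed = True

# Option 1: decode the bytes and concatenate text.
as_text = prefix + suffix_bytes.decode("utf-8")

# Option 2: encode the text and concatenate bytes.
as_bytes = prefix.encode("utf-8") + suffix_bytes
print(mixed_failed, ascii(as_text), as_bytes)
```

Following the tip from the earlier slide, option 1 is usually the one to use inside a program.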

  26. Concatenate Strings (Python 2)
    >>> 'me' + 'lón'
    >>> u'me' + u'lón'
    >>> b'me' + b'l\xc3\xb3n'

  27. Concatenate Strings (Python 2)
    >>> 'me' + 'lón'
    'mel\xc3\xb3n'
    >>> u'me' + u'lón'
    u'mel\xf3n'
    >>> b'me' + b'l\xc3\xb3n'
    'mel\xc3\xb3n'

  28. Concatenate Strings (Python 2)
    >>> u'me' + b'l\xc3\xb3n'

  29. Concatenate Strings (Python 2)
    >>> u'me' + b'l\xc3\xb3n'
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
    >>> import sys
    >>> # The encoding used for this implicit decoding is the value
    >>> # returned by sys.getdefaultencoding()
    >>> sys.getdefaultencoding()
    'ascii'

  30. Python 2 and 3 compatibility
    As of Python 2.6 (PEP 3112):
    >>> from __future__ import unicode_literals

  31. References
    ● Unicode HOWTO Python 2
    ○ https://docs.python.org/2/howto/unicode.html
    ● Unicode HOWTO Python 3
    ○ https://docs.python.org/3/howto/unicode.html
    ● Ned Batchelder: Pragmatic Unicode
    ○ http://nedbatchelder.com/text/unipain.html
    ○ https://www.youtube.com/watch?v=sgHbC6udIqc
    ● Facundo Batista - Entendiendo Unicode
    ○ https://www.youtube.com/watch?v=Dr1R4ZlVLxI
