
Unicode Solutions in Python 2 and Python 3

Dealing with Unicode, legacy encodings, encoding errors, sorting, safe comparisons etc. An overview of chapter 4 of the book Fluent Python (O'Reilly, 2014)

Luciano Ramalho

April 28, 2015

Transcript

  1. Unicode solutions in Python 2 and 3
  2. Unicode solutions in Python 2 and 3
    • Script samples shown on the slide: Latin-1, emoticons, Armenian, Egyptian Hieroglyphs, Malayalam, CJK Unified Ideographs
  3. About me: Luciano Ramalho
    • Programming in Python since 1998
    • Focus on content management (i.e. text wrangling)
    • Teaching Python since 1999
    • Speaker at PyCon US, OSCON, FISL, PythonBrasil, RuPy, QCon...
    • Author of Fluent Python
    • Twitter: @ramalhoorg
    • Native language: Português – “ação” (4 non-ASCII characters here)
  4. Resources
    • All code, slides and images used in this talk:
      – https://github.com/fluentpython/unicode-solutions
    • Fluent Python
      – http://shop.oreilly.com/product/0636920032519.do
      – Relevant content and examples:
        • Chapter 4: Text versus Bytes – all 39 pages
        • Chapter 18: Concurrency with asyncio – the charfinder examples
  5. A bite of theory
  6. The single-byte codepage ballet
    • Source code: http://bit.ly/1Oqt0MZ
    • Video: https://www.youtube.com/watch?v=J4qioAacrYo

  7. Why Unicode
    • Too many incompatible byte encodings
    • Separate concepts:
      – character identity: one code point for each abstract character
        • U+0041 → LATIN CAPITAL LETTER A
        • U+096C → DEVANAGARI DIGIT SIX
      – binary representation: multiple encodings
        • U+0041 → UTF-8: 0x41; UTF-16LE: 0x41 0x00
        • U+096C → UTF-8: 0xE0 0xA5 0xAC; UTF-16LE: 0x6C 0x09
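
The split between code point and byte representation on this slide can be checked directly in Python 3. This is a minimal sketch; `hex_bytes` is a helper name made up for the illustration.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated 0xNN values."""
    return ' '.join('0x{:02X}'.format(byte) for byte in data)


# One code point per character, but a different byte sequence per encoding.
for char in ('A', '\u096c'):
    code_point = 'U+{:04X}'.format(ord(char))
    print(code_point,
          'UTF-8:', hex_bytes(char.encode('utf-8')),
          'UTF-16LE:', hex_bytes(char.encode('utf-16le')))
```
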
  8. A sample of encodings (Figure 4-1 of Fluent Python)
  9. Byte and text solutions

  10. Data types for text or bytes
    • Human text
      – Python 2.7: unicode – u'café', u'caf\xe9'
      – Python 3.4: str – 'café', u'café', 'caf\xe9'
    • (immutable) Bytes
      – Python 2.7: str – 'café', 'caf\xe9', b'café'
      – Python 3.4: bytes – b'caf\xc3\xa9'
    • (mutable) Bytes
      – Python 2.7: bytearray – bytearray(b'caf\xc3\xa9')
      – Python 3.4: bytearray – bytearray(b'caf\xc3\xa9')
  11. .encode() v. .decode()
    • “Humans use text. Computers speak bytes.” – Esther Nam and Travis Fischer in Character encoding and Unicode in Python (PyCon US 2014)
    • Use .encode() to convert human text to bytes
    • Use .decode() to convert bytes to human text
    • Diagram: 'café' → encode → b'caf\xc3\xa9' → decode → 'café'
    • 2.7 gotcha: methods .encode() and .decode() exist in both str and unicode types!
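
A short Python 3 round trip for the rule above (encode text to get bytes, decode bytes to get text). UTF-8 is assumed here because it is the encoding used in the slide's diagram.

```python
text = 'café'                       # human text: 4 characters
data = text.encode('utf-8')         # text -> bytes, for storage or transmission
print(data)                         # 'é' becomes the two bytes \xc3\xa9

round_trip = data.decode('utf-8')   # bytes -> text, for humans
print(round_trip)
```
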
  12. Text v. bytes (Py3)
    • Items in Unicode text are characters
    • Items in byte sequences are bytes
      – integers 0...255
      – shown as ASCII sequences with a b'' prefix for convenience
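
The difference in item types can be seen by indexing and iterating. This is a sketch for Python 3 only; in Python 2 indexing a str gives a one-character str instead of an integer.

```python
text = 'café'
data = text.encode('utf-8')

chars = list(text)         # items of text are one-character strings
byte_values = list(data)   # items of bytes are integers in range(256)

first_char = text[0]       # a str of length 1
first_byte = data[0]       # an int

print(chars, byte_values, sep='\n')
```
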
  13. Text v. bytes (Py2, Py3)
  14. Bytes and bytearray (Py3)
  15. Bytes and bytearray (Py2, Py3)
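
The code shown on these slides did not survive in the transcript. As a stand-in, here is a minimal Python 3 sketch of how bytes and bytearray behave; it is not the original slide code.

```python
cafe = 'café'.encode('utf_8')   # immutable bytes

first_item = cafe[0]            # indexing gives an int
first_slice = cafe[:1]          # slicing gives bytes, even for one item

try:
    cafe[0] = 67                # bytes cannot be changed in place
    bytes_are_mutable = True
except TypeError:
    bytes_are_mutable = False

cafe_arr = bytearray(cafe)      # mutable copy of the same bytes
cafe_arr[-2:] = b'e'            # replace the two bytes of 'é' with ASCII 'e'
print(cafe_arr)
```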

  16. Common codecs (Py3)
    • codec = encoder/decoder table or algorithm
  17. Common codecs (Py2, Py3)
    • codec = encoder/decoder table or algorithm
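
To make the codec idea concrete, the sketch below encodes one string with three codecs from the standard library. The sample text and the choice of codecs are illustrative, not taken from the slide.

```python
text = 'El Niño'

# Same text, three codecs, three different byte sequences.
encoded = {codec: text.encode(codec)
           for codec in ('latin_1', 'utf_8', 'utf_16_le')}

for codec, data in encoded.items():
    print(codec, len(data), data, sep='\t')
```
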
  18. Coping with Unicode Errors
    • SyntaxError
      – A .py file has source code in an unexpected encoding
    • UnicodeDecodeError
      – A binary sequence contains bytes that are not valid in the expected encoding
    • UnicodeEncodeError
      – A Unicode string contains code points that cannot be represented in the desired encoding
  19. Coping with SyntaxError (Py2)
    • A .py file uses an unexpected encoding
      – The source file encoding is not the default, and no # coding comment was found.
      – The source file encoding is not the one declared in the # coding comment
    • Default source encoding:
      – Python 2.7 → ASCII
      – Python 3.x → UTF-8
    • 2.7 gotcha: default source encoding is ASCII
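
The effect of the `# coding` comment can be simulated in Python 3 without writing a file, by compiling source bytes directly. The Latin-1 source and the variable names below are invented for the illustration.

```python
# Source bytes as they would sit on disk in a file saved as Latin-1:
# 'é' is the single byte 0xE9, which is not valid UTF-8 by itself.
source = b"# -*- coding: latin-1 -*-\ngreeting = 'caf\xe9'\n"

# The coding comment tells the compiler how to decode the bytes.
namespace = {}
exec(compile(source, '<latin1_demo>', 'exec'), namespace)

greeting = namespace['greeting']
print(greeting)
```
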
  20. UnicodeEncodeError (Py3)
    • A character in the Unicode text cannot be represented in the target byte encoding
      – happens with legacy encodings that cover only a small subset of Unicode
  21. UnicodeEncodeError (Py2)
    • A character in the Unicode text cannot be represented in the target byte encoding
      – happens with legacy encodings that cover only a small subset of Unicode
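
A Python 3 sketch of the error and of the standard `errors` handlers that work around it. The cp437 codec stands in for a legacy encoding with no 'ã'; the example string is an assumption, since the slide's code is not in the transcript.

```python
city = 'São Paulo'

try:
    city.encode('cp437')          # strict mode is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = city.encode('cp437', errors='ignore')             # drops 'ã' silently
replaced = city.encode('cp437', errors='replace')           # substitutes '?'
xml_ref = city.encode('cp437', errors='xmlcharrefreplace')  # XML character reference

print(strict_failed, ignored, replaced, xml_ref, sep='\n')
```
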
  22. UnicodeDecodeError (Py3)
    • Invalid byte in the source encoding
      – more common with the UTF encodings
  23. UnicodeDecodeError (Py2)
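
A Python 3 sketch of a decoding failure and two ways to cope. The byte string is an illustrative assumption: 'Montréal' encoded in a single-byte codepage, then wrongly read as UTF-8.

```python
octets = b'Montr\xe9al'   # 0xE9 is 'é' in Latin-1 and cp1252

try:
    octets.decode('utf_8')        # 0xE9 followed by 'a' is not valid UTF-8
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

as_cp1252 = octets.decode('cp1252')                      # right codec, right text
with_marker = octets.decode('utf_8', errors='replace')   # U+FFFD marks the bad byte

print(strict_failed, as_cp1252, with_marker, sep='\n')
```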

  24. Best practice to avoid errors
    • Figure 4-2 of Fluent Python, after Ned Batchelder's Pragmatic Unicode talk: http://nedbatchelder.com/text/unipain.html
  25. How to implement the sandwich
    • Avoid calling .encode() or .decode() if possible.
      – if impossible to avoid, restrict usage to code sections that perform the actual I/O.
    • Django and most frameworks already perform encoding/decoding in library code (not in your code)
    • Always specify encoding when opening text files, so you send and receive text, and not bytes
      – in Python 2.7, remember to use io.open()
    • 2.7 gotcha: no way to specify encoding with built-in open(…). Must use io.open(…).
  26. Text I/O (Py3)
    • open() built-in is Unicode-aware in Python 3
      – text mode is the default and accepts an encoding argument
      – .write() method only accepts Unicode text
      – .read() method returns Unicode text
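
A Python 3 sketch of text I/O with an explicit encoding, so the program itself only handles text. The temporary directory and the file name `cafe.txt` are placeholders for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')   # scratch file for the demo

with open(path, 'w', encoding='utf_8') as fp:
    chars_written = fp.write('café')    # write() takes text, returns character count

size_on_disk = os.path.getsize(path)    # bytes on disk: 'é' takes two in UTF-8

with open(path, encoding='utf_8') as fp:
    text = fp.read()                    # read() returns text

with open(path, 'rb') as fp:
    raw = fp.read()                     # binary mode gives the raw bytes

print(chars_written, size_on_disk, text, raw)
```
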
  27. Bytes or text I/O (Py2)
    • open() built-in only supports bytes in Python 2
      – even in “text mode” (deals with CR+LF...)
      – no encoding argument accepted
      – .write() implicitly converts unicode to str using ASCII codec
        • remember: 'café' is actually b'caf\xc3\xa9'
      – .read() method always returns bytes
  28. Bytes or text I/O (Py2)
    • io.open() is the Unicode-aware open() from Python 3 backported to Python 2.6+
  29. Bytes or text I/O (Py2)
    • io.open() also handles bytes
      – mode 'b'
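
A sketch of `io.open()` written so that the same code should run on Python 2.7 and Python 3: the `u''` prefix and the `\xe9` escape avoid any dependence on the source file encoding. The scratch file name is a placeholder.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')   # scratch file for the demo

# Text mode: io.open() accepts an encoding and works with Unicode text.
with io.open(path, 'w', encoding='utf_8') as fp:
    fp.write(u'caf\xe9')

with io.open(path, encoding='utf_8') as fp:
    text = fp.read()

# Binary mode ('b'): io.open() returns the bytes untouched.
with io.open(path, 'rb') as fp:
    raw = fp.read()

print(repr(text), repr(raw))
```
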
  30. FAQ: How to find out the encoding of a file?
    • Some files have an encoding header
      – HTML, XML, some database dumps
    • Otherwise, you must be told. Ask!
    • If you can't ask, try the Chardet package
      – not 100% safe, but pretty smart
      – uses statistics and heuristics
      – includes a chardetect command-line tool
  31. FAQ: What are the default encodings?
    • Python 3 on recent GNU/Linux and OS X: UTF-8 FTW!
  32. FAQ: What are the default encodings?
    • Python 3 on Windows 7: four different encodings!
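
The listings on these two slides are missing from the transcript. The sketch below is one way to inspect the defaults on a given machine in Python 3; the choice of settings to query is an assumption and may not match the slide.

```python
import locale
import sys

# Settings that can each report a different encoding on the same machine.
defaults = {
    'locale.getpreferredencoding()': locale.getpreferredencoding(),
    'sys.getdefaultencoding()': sys.getdefaultencoding(),
    'sys.getfilesystemencoding()': sys.getfilesystemencoding(),
    # stdout may be replaced or redirected, so the attribute can be missing or None
    'sys.stdout.encoding': getattr(sys.stdout, 'encoding', None),
}

for expression, value in defaults.items():
    print('{:>32} -> {!r}'.format(expression, value))
```
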
  33. Unicode solutions

  34. Combining characters (Py3)
    • Latin character accents and other diacritical marks can be written as separate characters
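
A Python 3 sketch of the same word written two ways. Escapes are used so the example does not depend on how an editor stores 'é'.

```python
s1 = 'caf\u00e9'     # 'é' as a single precomposed code point
s2 = 'cafe\u0301'    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Both render as 'café', yet they differ in length and compare unequal.
print(s1, s2, len(s1), len(s2), s1 == s2)
```
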
  35. Normalization (Py3)
    • Composing or decomposing all characters
      – Optional: replacing compatibility characters
    • Normalization forms: NFC, NFKC, NFD, NFKD
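
A sketch of the normalization forms using `unicodedata.normalize` from the standard library. The sample strings are chosen for the illustration.

```python
from unicodedata import normalize

s1 = 'caf\u00e9'     # precomposed 'é'
s2 = 'cafe\u0301'    # 'e' + combining acute accent

nfc_equal = normalize('NFC', s1) == normalize('NFC', s2)   # compose both
nfd_length = len(normalize('NFD', s1))                     # decompose: 'é' becomes 2 code points

# The K forms also replace compatibility characters, which loses information.
half = '\u00bd'                        # VULGAR FRACTION ONE HALF
half_nfkc = normalize('NFKC', half)    # '1', FRACTION SLASH, '2'

print(nfc_equal, nfd_length, half_nfkc)
```
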
  36. Case folding (Py3)
    • Standard character substitutions
    • 2.7 gotcha: unicode.casefold() not implemented
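
A Python 3 sketch of case folding for caseless comparison. The helper name `fold_equal` is an assumption for this example; it combines NFC normalization with `str.casefold()`.

```python
from unicodedata import normalize


def fold_equal(str1, str2):
    """Compare two strings ignoring case and composed/decomposed differences."""
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())


lowered = 'Straße'.lower()       # lower() keeps the 'ß'
folded = 'Straße'.casefold()     # casefold() substitutes 'ss'

print(lowered, folded, fold_equal('Straße', 'STRASSE'))
```
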
  37. Unicode sorting (Py3)
    • By default, text ordering uses the code point values of the characters
      – this is not what humans expect
    • in Portuguese, accents and diacritics are tiebreakers only
    • Annotations on the code sample: wrong: açaí should be first; wrong: caju should be last
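
The slide's code is not in the transcript, so the fruit list below is an assumption chosen to reproduce the two annotated problems. Escapes keep the accented letters precomposed.

```python
fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']
# i.e. caju, atemoia, cajá, açaí, acerola

# Plain sorted() compares code points: 'ç' (U+00E7) sorts after 't',
# and 'á' (U+00E1) sorts after 'u'.
by_code_point = sorted(fruits)
print(by_code_point)
```
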
  38. Unicode sorting (Py3)
    • The standard library solution requires use of the locale module and a suitable locale available in the OS
      – only main program should set locale, and only at start-up
      – desired locale is not always available...
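
A sketch of the locale-based approach, assuming a locale named `pt_BR.UTF-8` exists on the system. The name varies by platform, and the quality of the resulting order depends on the OS, so no expected output is shown.

```python
import locale

fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']

try:
    # Done once, at start-up, by the main program only.
    locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
except locale.Error:
    sorted_fruits = None     # the desired locale is not installed here
else:
    # strxfrm transforms each string into a key that follows locale rules.
    sorted_fruits = sorted(fruits, key=locale.strxfrm)

print(sorted_fruits)
```
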
  39. Unicode sorting (Py3)
    • James Tauber's PyUCA is a Python 3 implementation of UCA (Unicode Collation Algorithm)
      – locale independent!
      – designed to work with many languages
      – https://pypi.python.org/pypi/pyuca/
  40. Unicode database (Py3)
    • Metadata about each Unicode character
      – name, numeric value, category etc.
      – standard library: unicodedata (Python 2 and Python 3)
    • Annotation on the code sample: numeric values!
  41. Unicode database (Py3)
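
A Python 3 sketch of the metadata available from `unicodedata`. The `describe` helper and the sample characters are assumptions made for the illustration.

```python
import unicodedata


def describe(char):
    """Return (code point, category, numeric value or None, official name)."""
    return ('U+{:04X}'.format(ord(char)),
            unicodedata.category(char),
            unicodedata.numeric(char, None),   # None when the character has no numeric value
            unicodedata.name(char))


# 'A', DEVANAGARI DIGIT SIX, VULGAR FRACTION ONE HALF, ROMAN NUMERAL EIGHT
for char in 'A\u096c\u00bd\u2167':
    print(*describe(char), sep='\t')
```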

  42. flupy-ch18/charfinder.py (Py3)
    • Command-line utility to search for characters by words in the official name
      – e.g. “cat face”, “black chess”...
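
The idea behind such a search can be sketched with a brute-force scan of the Unicode database. This is not the code from the book, which is not reproduced in the transcript; `find_chars` is a hypothetical name.

```python
import sys
import unicodedata


def find_chars(query):
    """Return the characters whose official name contains every word in query."""
    wanted = set(query.upper().split())
    found = []
    for code in range(sys.maxunicode + 1):
        char = chr(code)
        name = unicodedata.name(char, '')   # '' for unnamed code points
        if name and wanted <= set(name.replace('-', ' ').split()):
            found.append(char)
    return found


for char in find_chars('black chess'):
    print('U+{:04X}'.format(ord(char)), char, unicodedata.name(char), sep='\t')
```

Scanning every code point on each query is slow but simple; an index from words to characters would be the natural next step for a real tool.
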
  43. flupy-ch18/http_charfinder.py (Py3)
    • HTTP and Telnet servers used to illustrate asyncio programming (Fluent Python, ch. 18)
  44. ¿Preguntas? (Questions?)
    • More answers:
      – Python Unicode HOWTO
        • for Python 2
        • for Python 3
      – Fluent Python, chapter 4
      – Twitter: @ramalhoorg