Unicode Solutions in Python 2 and Python 3

Dealing with Unicode, legacy encodings, encoding errors, sorting, safe comparisons etc. An overview of chapter 4 of the book Fluent Python (O'Reilly, 2014)

Luciano Ramalho

April 28, 2015

Transcript

  1. Unicode solutions in Python 2 and 3

    Latin-1, emoticons, Armenian, Egyptian Hieroglyphs, Malayalam, CJK Unified Ideographs
  2. About me: Luciano Ramalho

    • Programming in Python since 1998
    • Focus on content management (i.e. text wrangling)
    • Teaching Python since 1999
    • Speaker at PyCon US, OSCON, FISL, PythonBrasil, RuPy, QCon...
    • Author of Fluent Python
    • Twitter: @ramalhoorg
    • Native language: Português – “ação” (4 non-ASCII characters here)
  3. Resources

    • All code, slides and images used in this talk:
      – https://github.com/fluentpython/unicode-solutions
    • Fluent Python
      – http://shop.oreilly.com/product/0636920032519.do
      – Relevant content and examples:
        • Chapter 4: Text versus Bytes – all 39 pages
        • Chapter 18: Concurrency with asyncio – the charfinder examples
  4. Why Unicode

    • Too many incompatible byte encodings
    • Separate concepts:
      – character identity: one code point for each abstract character
        • U+0041 → LATIN CAPITAL LETTER A
        • U+096C → DEVANAGARI DIGIT SIX
      – binary representation: multiple encodings
        • U+0041 → UTF-8: 0x41; UTF-16LE: 0x41 0x00
        • U+096C → UTF-8: 0xE0 0xA5 0xAC; UTF-16LE: 0x6C 0x09
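The split between code point and byte representation on this slide can be checked directly in Python 3. This is a minimal sketch; the helper name `encodings` is made up for illustration and is not from the talk.

```python
def encodings(char):
    """Return the bytes of one character in two encodings (illustrative helper)."""
    return {
        'utf-8': char.encode('utf-8'),
        'utf-16le': char.encode('utf-16le'),
    }

for char in ('A', '\u096c'):
    # ord() gives the code point; the encodings give the bytes.
    print('U+{:04X}'.format(ord(char)), encodings(char))
```

The code point is the same in both cases; only the bytes change with the encoding.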
  5. Data types for text or bytes

    • Human text
      – Python 2.7: unicode (u'café', u'caf\xe9')
      – Python 3.4: str ('café', u'café', 'caf\xe9')
    • (immutable) Bytes
      – Python 2.7: str ('café', 'caf\xe9', b'café')
      – Python 3.4: bytes (b'caf\xc3\xa9')
    • (mutable) Bytes
      – Python 2.7: bytearray (bytearray(b'caf\xc3\xa9'))
      – Python 3.4: bytearray (bytearray(b'caf\xc3\xa9'))
  6. .encode() v. .decode()

    • “Humans use text. Computers speak bytes.” – Esther Nam and Travis Fischer in Character encoding and Unicode in Python (PyCon US 2014)
    • Use .encode() to convert human text to bytes
    • Use .decode() to convert bytes to human text
    • 'café' → encode → b'caf\xc3\xa9' → decode → 'café'
    • 2.7 gotcha: methods .encode() and .decode() exist in both str and unicode types!
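A short Python 3 round trip showing the two directions described above. UTF-8 is chosen here only as an example encoding.

```python
text = 'café'                        # human text: 4 characters
data = text.encode('utf-8')          # text -> bytes
round_trip = data.decode('utf-8')    # bytes -> text

print(len(text), len(data))          # é takes two bytes in UTF-8
print(data, round_trip)
```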
  7. Text v. bytes (Py3)

    • Items in unicode text are characters
    • Items in byte sequences are bytes
      – integers 0...255
      – shown as ASCII sequences with a b'' prefix for convenience
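The difference between the items of the two types shows up when indexing. A small Python 3 sketch:

```python
text = 'café'
data = text.encode('utf-8')

char_item = text[3]      # a one-character str
byte_item = data[3]      # an int in range(256)
byte_slice = data[3:4]   # slicing bytes yields bytes, not an int
as_ints = list(data)     # every item of a bytes object is an integer

print(repr(char_item), byte_item, byte_slice, as_ints)
```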
  8. Coping with Unicode Errors

    • SyntaxError
      – A .py file has source code in an unexpected encoding
    • UnicodeDecodeError
      – A binary sequence contains bytes that are not valid in the expected encoding
    • UnicodeEncodeError
      – A Unicode string contains code points that cannot be represented in the desired encoding
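As an illustration of the UnicodeDecodeError case, the sketch below decodes Latin-1 bytes as if they were UTF-8 (Python 3). The sample bytes are an assumption chosen for the demonstration.

```python
latin1_bytes = b'caf\xe9'   # 'café' encoded as Latin-1

try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print('cannot decode:', exc.reason)

# Decoding with the right codec works; errors='replace' trades the
# exception for U+FFFD REPLACEMENT CHARACTER.
right_codec = latin1_bytes.decode('latin-1')
replaced = latin1_bytes.decode('utf-8', errors='replace')
```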
  9. Coping with SyntaxError

    • A .py file uses an unexpected encoding
      – The source file encoding is not the default, and no # coding comment was found.
      – The source file encoding is not the one declared in the # coding comment
    • Default source encoding:
      – Python 2.7 → ASCII
      – Python 3.x → UTF-8
    • 2.7 gotcha: default source encoding is ASCII
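To see the `# coding` comment at work without creating a file, this Python 3 sketch compiles source bytes directly. The bytes stand in for a hypothetical .py file saved as Latin-1; the declaration tells the parser how to decode them.

```python
# Source bytes as they would sit on disk when saved as Latin-1:
# é is the single byte 0xE9, which is not valid UTF-8 on its own.
source = b"# coding: latin-1\ns = 'caf\xe9'\n"

namespace = {}
# compile() honors the coding declaration when given bytes.
exec(compile(source, '<demo>', 'exec'), namespace)
print(namespace['s'])
```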
  10. UnicodeEncodeError (Py3)

    • A character in the Unicode text cannot be represented in the target byte encoding
      – happens with legacy encodings that cover only a small subset of Unicode
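A minimal Python 3 sketch of this error and of the standard `errors` handlers that avoid it. ASCII is used here as the example of a limited target encoding.

```python
text = 'café'

try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print('cannot encode:', exc.reason)

# Alternatives to raising, each one lossy or lossless in a different way:
ignored = text.encode('ascii', errors='ignore')             # drops é
replaced = text.encode('ascii', errors='replace')           # é becomes ?
xml_safe = text.encode('ascii', errors='xmlcharrefreplace') # é becomes &#233;
```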
  11. UnicodeEncodeError (Py2)

    • A character in the Unicode text cannot be represented in the target byte encoding
      – happens with legacy encodings that cover only a small subset of Unicode
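  12. Best practice to avoid errors

    The “Unicode sandwich”: decode bytes to text on input, process text only, encode text to bytes on output. Figure 4-2 of Fluent Python, after Ned Batchelder's Pragmatic Unicode talk: http://nedbatchelder.com/text/unipain.html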
  13. How to implement the sandwich

    • Avoid calling .encode() or .decode() if possible.
      – if impossible to avoid, restrict usage to code sections that perform the actual I/O.
    • Django and most frameworks already perform encoding/decoding in library code (not in your code)
    • Always specify encoding when opening text files, so you send and receive text, and not bytes
      – in Python 2.7, remember to use io.open()
    • 2.7 gotcha: no way to specify encoding with built-in open(…). Must use io.open(…).
  14. Text I/O (Py3)

    • open() built-in is Unicode-aware in Python 3
      – text mode is the default and accepts an encoding argument
      – .write() method only accepts Unicode text
      – .read() method returns Unicode text
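A small Python 3 sketch of text I/O with an explicit encoding. The temporary file name `cafe.txt` is only for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# Text mode with an explicit encoding: we write str, not bytes.
with open(path, 'w', encoding='utf-8') as fp:
    written = fp.write('café')       # returns the number of characters

with open(path, encoding='utf-8') as fp:
    text = fp.read()                 # str

with open(path, 'rb') as fp:
    raw = fp.read()                  # bytes, as stored on disk

print(written, repr(text), raw)
```

Passing `encoding` explicitly keeps the result independent of the platform default.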
  15. Bytes or text I/O (Py2)

    • open() built-in only supports bytes in Python 2
      – even in “text mode” (deals with CR+LF...)
      – no encoding argument accepted
      – .write() implicitly converts unicode to str using ASCII codec
        • remember: 'café' is actually b'caf\xc3\xa9'
      – .read() method always returns bytes
  16. Bytes or text I/O (Py2)

    • io.open() is the Unicode-aware open() from Python 3 backported to Python 2.6+
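A sketch of code meant to behave the same on Python 2.6+ and Python 3.3+ by going through `io.open()`. It was only reasoned through for Python 3 here, and the file name is illustrative.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')

# The u'' prefix marks Unicode text in Python 2 and is accepted by 3.3+.
with io.open(path, 'w', encoding='utf-8') as fp:
    fp.write(u'caf\u00e9')

with io.open(path, encoding='utf-8') as fp:
    text = fp.read()

# In Python 3, io.open is the very same object as the built-in open.
same_function = io.open is open
```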
  17. FAQ: How to find out the encoding of a file?

    • Some files have an encoding header
      – HTML, XML, some database dumps
    • Otherwise, you must be told. Ask!
    • If you can't ask, try the Chardet package
      – not 100% safe, but pretty smart
      – uses statistics and heuristics
      – includes a chardetect command-line tool
  18. FAQ: What are the default encodings?

    • Python 3 on recent GNU/Linux and OSX: UTF-8 FTW!
  19. FAQ: What are the default encodings?

    • Python 3 on Windows 7: four different encodings!
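One way to inspect the defaults on a given machine is to ask the standard library, as in the Python 3 sketch below. Which settings the original slides displayed is not recoverable from the transcript, so this selection is an assumption, and the values depend on the platform and configuration.

```python
import locale
import sys

defaults = {
    'locale.getpreferredencoding()': locale.getpreferredencoding(),
    'sys.getdefaultencoding()': sys.getdefaultencoding(),
    'sys.getfilesystemencoding()': sys.getfilesystemencoding(),
    # stdout may be redirected or replaced, so read the attribute defensively.
    'sys.stdout.encoding': getattr(sys.stdout, 'encoding', None),
}

for setting, value in sorted(defaults.items()):
    print('{:>32} -> {!r}'.format(setting, value))
```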
  20. Combining characters (Py3)

    • Latin character accents and other diacritical marks can be written as separate characters
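A Python 3 sketch of the same word written with a precomposed character and with a combining mark. Escapes are used so that the two forms stay distinct regardless of how the editor saves the text.

```python
composed = 'caf\u00e9'      # é as one code point, U+00E9
decomposed = 'cafe\u0301'   # e followed by U+0301 COMBINING ACUTE ACCENT

# Both render as 'café', but they are different sequences of code points.
print(composed, decomposed)
print(len(composed), len(decomposed), composed == decomposed)
```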
  21. Normalization (Py3)

    • Composing or decomposing all characters
      – Optional: replacing compatibility characters
    • Normalization forms: NFC, NFKC, NFD, NFKD
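The four forms are available through `unicodedata.normalize()` in the standard library. A minimal Python 3 sketch, with sample strings chosen for illustration:

```python
from unicodedata import normalize

composed = 'caf\u00e9'      # é as a single code point
decomposed = 'cafe\u0301'   # e + combining acute accent
ligature = '\ufb01'         # LATIN SMALL LIGATURE FI, a compatibility character

nfc = normalize('NFC', decomposed)    # composes: equals `composed`
nfd = normalize('NFD', composed)      # decomposes: equals `decomposed`
nfc_lig = normalize('NFC', ligature)  # NFC leaves compatibility characters alone
nfkc_lig = normalize('NFKC', ligature)  # NFKC replaces the ligature with 'fi'
```

The K forms can lose information, so they suit searching and indexing better than storage.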
  22. Case folding (Py3)

    • Standard character substitutions
    • 2.7 gotcha: unicode.casefold() not implemented
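Case folding combines naturally with normalization for caseless comparisons. The helper `fold_equal` below is a sketch written for this transcript, not code taken from the slides; it needs Python 3.3 or later for `str.casefold()`.

```python
from unicodedata import normalize

def fold_equal(a, b):
    """Compare two strings ignoring case and composed/decomposed differences."""
    return normalize('NFC', a).casefold() == normalize('NFC', b).casefold()

# lower() leaves ß alone, while casefold() substitutes 'ss'.
lowered = 'Straße'.lower()
folded = 'Straße'.casefold()

print(lowered, folded)
print(fold_equal('Straße', 'STRASSE'), fold_equal('cafe\u0301', 'CAF\u00c9'))
```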
  23. Unicode sorting (Py3)

    • By default, text ordering uses the code point values of the characters
      – this is not what humans expect
    • in Portuguese, accents and diacritics are tiebreakers only
    • The default sort result is wrong: açaí should be first and caju should be last
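The problem can be reproduced with a few Portuguese words (Python 3). The word list is an assumption made for this sketch; the original slide's code is not in the transcript. The second sort is only a rough stand-in for "accents as tiebreakers", built by stripping combining marks, and is not a real collation algorithm.

```python
import unicodedata

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

# Default ordering: compares code points, so ç and á sort after every ASCII letter.
by_code_point = sorted(fruits)

def shave_marks(text):
    """Remove combining marks after decomposing (illustrative helper)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# Base letters first; the accented original only breaks ties.
by_base_letters = sorted(fruits, key=lambda word: (shave_marks(word), word))

print(by_code_point)
print(by_base_letters)
```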
  24. Unicode sorting (Py3)

    • The standard library solution requires use of the locale module and a suitable locale available in the OS
      – only main program should set locale, and only at start-up
      – desired locale is not always available...
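A sketch of the locale-based approach in Python 3. The locale name `pt_BR.UTF-8` is an assumption; it may be missing or spelled differently on a given system, so the code falls back to `None` instead of failing. Even when the call succeeds, the resulting order depends on the operating system's collation support.

```python
import locale

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']

try:
    # Normally done once, at start-up, by the main program only.
    locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
except locale.Error:
    locale_sorted = None    # desired locale not available here
else:
    # strxfrm transforms each string into a locale-aware sort key.
    locale_sorted = sorted(fruits, key=locale.strxfrm)

print(locale_sorted)
```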
  25. Unicode sorting (Py3)

    • James Tauber's PyUCA is a Python 3 implementation of UCA (Unicode Collation Algorithm)
      – locale independent!
      – designed to work with many languages
      – https://pypi.python.org/pypi/pyuca/
  26. Unicode database (Py3)

    • Metadata about each Unicode character
      – name, numeric value, category etc.
      – standard library: unicodedata (Python 2 and Python 3)
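A few lookups against the Unicode database using the standard `unicodedata` module (Python 3). The sample characters are chosen for illustration.

```python
import unicodedata

samples = ['A', '1', '\u00bc', '\u096c']   # A, 1, ¼, DEVANAGARI DIGIT SIX

metadata = {}
for char in samples:
    metadata[char] = (
        unicodedata.name(char),
        unicodedata.category(char),
        unicodedata.numeric(char, None),   # None when there is no numeric value
    )
    print('U+{:04X}'.format(ord(char)), *metadata[char], sep='\t')
```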
  27. flupy-ch18/charfinder.py (Py3)

    • Command-line utility to search for characters by words in the official name
      – e.g. “cat face”, “black chess”...
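The idea of searching by words in the official name can be sketched with a brute-force scan over `unicodedata` names. This is not the charfinder implementation from the book; the function `find_chars` and its parameters are hypothetical.

```python
import sys
import unicodedata

def find_chars(query, start=32, stop=sys.maxunicode + 1):
    """Yield characters whose official name contains every word in `query`."""
    wanted = set(query.upper().split())
    for code in range(start, stop):
        char = chr(code)
        name = unicodedata.name(char, None)   # None for unnamed code points
        if name and wanted <= set(name.replace('-', ' ').split()):
            yield char

# Limiting the range keeps the demonstration fast.
for char in find_chars('black chess', stop=0x3000):
    print('U+{:04X}'.format(ord(char)), char, unicodedata.name(char), sep='\t')
```

Scanning every code point on each query is slow; a real tool would likely build an index of name words once and reuse it.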
  28. flupy-ch18/http_charfinder.py (Py3)

    • HTTP and Telnet servers used to illustrate asyncio programming (Fluent Python, ch. 18)
  29. ¿Preguntas? (Questions?)

    • More answers:
      – Python Unicode HOWTO
        • for Python 2
        • for Python 3
      – Fluent Python, chapter 4
      – Twitter: @ramalhoorg