Unicode Solutions in Python 2 and Python 3

1 Unicode solutions in Python 2 and 3

2 Unicode solutions in Python 2 and 3
• Latin-1
• emoticons
• Armenian
• Egyptian Hieroglyphs
• Malayalam
• CJK Unified Ideographs

3 About me: Luciano Ramalho
• Programming in Python since 1998
• Focus on content management (i.e. text wrangling)
• Teaching Python since 1999
• Speaker at PyCon US, OSCON, FISL, PythonBrasil, RuPy, QCon...
• Author of Fluent Python
• Twitter: @ramalhoorg
• Native language: Português – “ação” (4 non-ASCII characters here)

4 Resources
• All code, slides and images used in this talk:
  – https://github.com/fluentpython/unicode-solutions
• Fluent Python
  – http://shop.oreilly.com/product/0636920032519.do
  – Relevant content and examples:
    • Chapter 4: Text versus Bytes – all 39 pages
    • Chapter 18: Concurrency with asyncio – the charfinder examples

5 A bite of theory

6 The single-byte codepage ballet
• Source code: http://bit.ly/1Oqt0MZ
• Video: https://www.youtube.com/watch?v=J4qioAacrYo

7 Why Unicode
• Too many incompatible byte encodings
• Separate concepts:
  – character identity: one code point for each abstract character
    • U+0041 → LATIN CAPITAL LETTER A
    • U+096C → DEVANAGARI DIGIT SIX
  – binary representation: multiple encodings
    • U+0041 → UTF-8: 0x41; UTF-16LE: 0x41 0x00
    • U+096C → UTF-8: 0xE0 0xA5 0xAC; UTF-16LE: 0x6C 0x09
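
The split between character identity and binary representation can be checked in a Python 3 console. This is a minimal sketch using the two code points from the slide; the helper name `describe` is hypothetical.

```python
import unicodedata

samples = ['\u0041', '\u096c']  # the two code points listed on the slide


def describe(char):
    """Return (code point, official name, UTF-8 bytes, UTF-16LE bytes)."""
    return (
        'U+{:04X}'.format(ord(char)),   # identity: the code point
        unicodedata.name(char),         # identity: the official name
        char.encode('utf_8'),           # one binary representation
        char.encode('utf_16le'),        # another binary representation
    )


for char in samples:
    print(describe(char))
```

The same character keeps one code point while its byte sequence changes with the encoding.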

8 A sample of encodings
Figure 4-1 of Fluent Python

9 Byte and text solutions

10 Data types for text or bytes
• Human text
  – Python 2.7: unicode, e.g. u'café', u'caf\xe9'
  – Python 3.4: str, e.g. 'café', u'café', 'caf\xe9'
• Bytes (immutable)
  – Python 2.7: str, e.g. 'café', 'caf\xe9', b'café'
  – Python 3.4: bytes, e.g. b'caf\xc3\xa9'
• Bytes (mutable)
  – Python 2.7: bytearray, e.g. bytearray(b'caf\xc3\xa9')
  – Python 3.4: bytearray, e.g. bytearray(b'caf\xc3\xa9')

11 .encode() v. .decode()
• “Humans use text. Computers speak bytes.” – Esther Nam and Travis Fischer in Character encoding and Unicode in Python (PyCon US 2014)
• Use .encode() to convert human text to bytes
• Use .decode() to convert bytes to human text
• 'café' → encode → b'caf\xc3\xa9' → decode → 'café'
• 2.7 gotcha: methods .encode() and .decode() exist in both str and unicode types!
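
A minimal Python 3 sketch of the round trip described above, assuming UTF-8 as the encoding (the slide does not name one, but its byte string matches UTF-8):

```python
text = 'caf\xe9'                     # human text: 'café', 4 characters
data = text.encode('utf_8')          # text -> bytes
round_trip = data.decode('utf_8')    # bytes -> text

print(len(text), text)
print(len(data), data)
```

The encoded form is one byte longer than the text because 'é' needs two bytes in UTF-8.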

12 Text v. bytes (Py3)
• Items in unicode text are characters
• Items in byte sequences are bytes
  – integers 0...255
  – shown as ASCII sequences with a b'' prefix for convenience
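
The difference between the items of the two types shows up when indexing. A short Python 3 sketch, with variable names chosen only for illustration:

```python
text = 'caf\xe9'                 # 'café'
data = text.encode('utf_8')

first_char = text[0]             # indexing str yields a str of length 1
first_byte = data[0]             # indexing bytes yields an int in range(256)
first_slice = data[:1]           # slicing bytes yields bytes

print(list(text))                # the characters
print(list(data))                # the integers behind the b'' display
```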

13 Text v. bytes (Py2, Py3)

14 Bytes and bytearray (Py3)
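
A minimal Python 3 sketch contrasting the two binary types; the sample value is an assumption, not necessarily what the slide showed.

```python
cafe = bytes('caf\xe9', encoding='utf_8')   # immutable byte sequence
cafe_arr = bytearray(cafe)                   # mutable copy of the same bytes

cafe_arr[0] = ord('C')       # item assignment takes an integer
cafe_arr.extend(b'!')        # bytearray can grow in place

try:
    cafe[0] = ord('C')       # bytes does not support item assignment
except TypeError:
    bytes_are_immutable = True
else:
    bytes_are_immutable = False

print(cafe, cafe_arr, bytes_are_immutable)
```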

15 Bytes and bytearray (Py2, Py3)

16 Common codecs (Py3)
• codec = encoder/decoder table or algorithm
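
To see codecs at work, the same text can be encoded with several of them. This is a sketch in Python 3; the sample string and the choice of codecs are assumptions made for illustration.

```python
text = 'El Ni\xf1o'   # 'El Niño'

# Same text, three codecs, three different byte sequences.
encoded = {codec: text.encode(codec)
           for codec in ('latin_1', 'utf_8', 'utf_16le')}

for codec, data in encoded.items():
    print('{:10}{!r}'.format(codec, data))
```

Latin-1 uses one byte per character here, UTF-8 uses two bytes for 'ñ' only, and UTF-16LE uses two bytes for every character in this string.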

17 Common codecs (Py2, Py3)
• codec = encoder/decoder table or algorithm

18 Coping with Unicode Errors
• SyntaxError
  – A .py file has source code in an unexpected encoding
• UnicodeDecodeError
  – A binary sequence contains bytes that are not valid in the expected encoding
• UnicodeEncodeError
  – A Unicode string contains code points that cannot be represented in the desired encoding

19 Coping with SyntaxError (Py2)
• A .py file uses an unexpected encoding
  – The source file encoding is not the default, and no # coding comment was found.
  – The source file encoding is not the one declared in the # coding comment
• Default source encoding:
  – Python 2.7 → ASCII
  – Python 3.x → UTF-8
• 2.7 gotcha: default source encoding is ASCII
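
In Python 3, the standard library can report which encoding it would pick for a piece of source code, which helps when diagnosing this kind of SyntaxError. A minimal sketch using `tokenize.detect_encoding`; the helper name `source_encoding` and the sample sources are hypothetical.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source code."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# A source with a coding comment on its first line.
declared = source_encoding(b"# coding: latin-1\nname = 'caf\xe9'\n")

# No coding comment: the Python 3 default applies.
default = source_encoding(b"x = 1\n")

print(declared, default)
```

Note that `detect_encoding` reports a normalized codec name, so the value returned for `latin-1` is `iso-8859-1`.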

20 UnicodeEncodeError (Py3)
• A character in the Unicode text cannot be represented in the target byte encoding
  – happens with legacy encodings that cover only a small subset of Unicode
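
A minimal Python 3 sketch of the error and of the `errors` argument that softens it. The sample string and the use of ASCII as the limited target encoding are assumptions for illustration.

```python
city = 'S\xe3o Paulo'   # 'São Paulo'

failure = None
try:
    city.encode('ascii')             # 'ã' has no ASCII representation
except UnicodeEncodeError as exc:
    # The exception records the codec, the position and the offending text.
    failure = (exc.encoding, exc.start, exc.object[exc.start])

ignored = city.encode('ascii', errors='ignore')              # drop the character
replaced = city.encode('ascii', errors='replace')            # substitute '?'
as_xml = city.encode('ascii', errors='xmlcharrefreplace')    # XML entity

print(failure, ignored, replaced, as_xml)
```

Only `xmlcharrefreplace` keeps the information; `ignore` and `replace` lose data silently.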

21 UnicodeEncodeError (Py2)
• A character in the Unicode text cannot be represented in the target byte encoding
  – happens with legacy encodings that cover only a small subset of Unicode

22 UnicodeDecodeError (Py3)
• Invalid byte in the source encoding
  – more common with the UTF encodings
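
A minimal Python 3 sketch: bytes written as Latin-1 are not valid UTF-8, so decoding them as UTF-8 fails. The sample bytes are an assumption for illustration.

```python
octets = b'Montr\xe9al'                 # 'Montréal' encoded as Latin-1

as_latin1 = octets.decode('latin_1')    # correct codec: works

failed_at = None
try:
    octets.decode('utf_8')              # wrong codec: 0xE9 starts an invalid sequence
except UnicodeDecodeError as exc:
    failed_at = exc.start               # index of the offending byte

# errors='replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = octets.decode('utf_8', errors='replace')

print(as_latin1, failed_at, replaced)
```

Single-byte legacy codecs tend to decode almost any byte stream without complaint, producing garbage instead of an error, which is one reason the exception shows up mostly with the UTF encodings.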

23 UnicodeDecodeError (Py2)

24 Best practice to avoid errors
Figure 4-2 of Fluent Python, after Ned Batchelder's Pragmatic Unicode talk: http://nedbatchelder.com/text/unipain.html

25 How to implement the sandwich
• Avoid calling .encode() or .decode() if possible.
  – if impossible to avoid, restrict usage to code sections that perform the actual I/O.
• Django and most frameworks already perform encoding/decoding in library code (not in your code)
• Always specify encoding when opening text files, so you send and receive text, and not bytes
  – in Python 2.7, remember to use io.open()
• 2.7 gotcha: no way to specify encoding with built-in open(…). Must use io.open(…).
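
When decoding and encoding cannot be left to a library, the sandwich can be sketched as below: bytes are decoded at the input edge, all processing happens on text, and the result is encoded at the output edge. The function `shout` and its `encoding` parameter are hypothetical names for this Python 3 illustration.

```python
def shout(raw, encoding='utf_8'):
    """Upper-case a byte payload without mixing bytes and text."""
    text = raw.decode(encoding)        # top slice: decode bytes on input
    result = text.upper()              # filling: the logic sees only text
    return result.encode(encoding)     # bottom slice: encode text on output


payload = shout(b'caf\xc3\xa9')
print(payload)
```

Upper-casing works on 'é' because the function operates on characters, not on the two UTF-8 bytes that represent it.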

26 Text I/O (Py3)
• open() built-in is Unicode-aware in Python 3
  – text mode default accepts encoding argument
  – .write() method only accepts Unicode text
  – .read() method returns Unicode text
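
A minimal Python 3 sketch of these three points. The scratch file path is created just for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')   # scratch file for the demo

# Text mode with an explicit encoding: write() takes str and returns
# the number of characters written.
with open(path, 'w', encoding='utf_8') as fp:
    chars_written = fp.write('caf\xe9')

# Reading with the same encoding returns str.
with open(path, encoding='utf_8') as fp:
    text = fp.read()

bytes_on_disk = os.stat(path).st_size   # 'é' takes two bytes in UTF-8

# A text-mode file rejects bytes.
try:
    with open(path, 'a', encoding='utf_8') as fp:
        fp.write(b'caf\xc3\xa9')
except TypeError as exc:
    rejected = str(exc)
else:
    rejected = None

print(chars_written, bytes_on_disk, repr(text), rejected)
```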

27 Bytes or text I/O (Py2)
• open() built-in only supports bytes in Python 2
  – even in “text mode” (deals with CR+LF...)
  – no encoding argument accepted
  – .write() implicitly converts unicode to str using ASCII codec
    • remember: 'café' is actually b'caf\xc3\xa9'
  – .read() method always returns bytes

28 Bytes or text I/O (Py2)
• io.open() is the Unicode-aware open() from Python 3 backported to Python 2.6+

29 Bytes or text I/O (Py2)
• io.open() also handles bytes
  – mode 'b'
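
A sketch of io.open() handling both text and bytes. It is written with u'' literals and escapes so that the same lines are valid in Python 2.7 and Python 3.3+; the scratch file path is created just for the example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')   # scratch file for the demo

# Text mode: io.open() accepts an encoding and expects Unicode text.
with io.open(path, 'w', encoding='utf_8') as fp:
    fp.write(u'caf\xe9')

# Text mode read: returns Unicode text.
with io.open(path, encoding='utf_8') as fp:
    text = fp.read()

# Binary mode ('b'): returns the raw bytes, no decoding.
with io.open(path, 'rb') as fp:
    raw = fp.read()

print(repr(text), repr(raw))
```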

30 FAQ: How to find out the encoding of a file?
• Some files have an encoding header
  – HTML, XML, some database dumps
• Otherwise, you must be told. Ask!
• If you can't ask, try the Chardet package
  – not 100% safe, but pretty smart
  – uses statistics and heuristics
  – includes a chardetect command-line tool

31 FAQ: What are the default encodings?
• Python 3 on recent GNU/Linux and OSX: UTF-8 FTW!

32 FAQ: What are the default encodings?
• Python 3 on Windows 7: four different encodings!
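
The defaults can be inspected on any machine. This Python 3 sketch prints a few of the settings that matter; which ones the slide listed is an assumption, and the values depend on the platform and its configuration.

```python
import locale
import sys

defaults = {
    # Used by open() in text mode when no encoding argument is given.
    'locale.getpreferredencoding()': locale.getpreferredencoding(),
    # Used for implicit str/bytes conversions inside Python.
    'sys.getdefaultencoding()': sys.getdefaultencoding(),
    # Used to encode and decode file names.
    'sys.getfilesystemencoding()': sys.getfilesystemencoding(),
    # Used for console output; may be missing if stdout was replaced.
    'sys.stdout.encoding': getattr(sys.stdout, 'encoding', None),
}

for expression, value in defaults.items():
    print('{:>32} -> {!r}'.format(expression, value))
```

Because these values can differ from each other and between machines, passing an explicit encoding to open() is the portable choice.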

33 Unicode solutions

34 Combining characters (Py3)
• Latin character accents and other diacritical marks can be written as separate characters
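
A minimal Python 3 sketch: the same word written with a precomposed 'é' and with 'e' followed by a combining accent. Escapes are used so the two spellings stay distinct in the source.

```python
precomposed = 'caf\xe9'       # 'é' as one code point, U+00E9
decomposed = 'cafe\u0301'     # 'e' followed by U+0301 COMBINING ACUTE ACCENT

same_length = len(precomposed) == len(decomposed)
same_string = precomposed == decomposed

print(precomposed, decomposed)          # render identically on most terminals
print(len(precomposed), len(decomposed), same_string)
```

The two strings look the same to a reader but compare unequal, which is the problem normalization addresses.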

35 Normalization (Py3)
• Composing or decomposing all characters
  – Optional: replacing compatibility characters
• Normalization forms: NFC, NFKC, NFD, NFKD
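
A minimal Python 3 sketch of the four forms using `unicodedata.normalize`; the sample characters are assumptions for illustration.

```python
from unicodedata import normalize

precomposed = 'caf\xe9'       # 'é' as U+00E9
decomposed = 'cafe\u0301'     # 'e' + COMBINING ACUTE ACCENT

# NFC composes, NFD decomposes; either one makes the strings comparable.
nfc_equal = normalize('NFC', precomposed) == normalize('NFC', decomposed)
nfd_length = len(normalize('NFD', precomposed))

# The K forms also replace compatibility characters.
half = '\xbd'                 # VULGAR FRACTION ONE HALF
micro = '\xb5'                # MICRO SIGN
half_nfc = normalize('NFC', half)      # unchanged
half_nfkc = normalize('NFKC', half)    # '1', FRACTION SLASH, '2'
micro_nfkc = normalize('NFKC', micro)  # GREEK SMALL LETTER MU

print(nfc_equal, nfd_length, ascii(half_nfkc), ascii(micro_nfkc))
```

The K forms change the text, so they suit searching and indexing better than storage.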

36 Case folding (Py3)
• Standard character substitutions
• 2.7 gotcha: unicode.casefold() not implemented
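
A minimal Python 3 sketch of case-insensitive comparison with `str.casefold()`, combined with NFC normalization. The helper name `fold_equal` and the sample words are assumptions for illustration.

```python
from unicodedata import normalize


def fold_equal(a, b):
    """Compare two strings ignoring case and composed/decomposed spelling."""
    return normalize('NFC', a).casefold() == normalize('NFC', b).casefold()


eszett_lower = '\xdf'.lower()        # 'ß' is unchanged by lower()
eszett_folded = '\xdf'.casefold()    # casefold() substitutes 'ss'
micro_folded = '\xb5'.casefold()     # MICRO SIGN folds to GREEK SMALL LETTER MU

print(fold_equal('Stra\xdfe', 'STRASSE'), ascii(eszett_folded), ascii(micro_folded))
```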

37 Unicode sorting (Py3)
• By default, text ordering uses the code point values of the characters
  – this is not what humans expect
• in Portuguese, accents and diacritics are tiebreakers only
• wrong: açaí should be first
• wrong: caju should be last
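
A minimal Python 3 sketch of the problem. The word list is an assumption, chosen to contain the two words the slide calls out.

```python
fruits = ['caju', 'atemoia', 'caj\xe1', 'a\xe7a\xed', 'acerola']
# i.e. caju, atemoia, cajá, açaí, acerola

by_code_point = sorted(fruits)

# The accented letters have larger code points than any unaccented letter.
cedilla_vs_t = (ord('\xe7'), ord('t'))    # 'ç' against 't'
a_acute_vs_u = (ord('\xe1'), ord('u'))    # 'á' against 'u'

print(by_code_point)
print(cedilla_vs_t, a_acute_vs_u)
```

Here 'açaí' lands after 'atemoia' and 'cajá' lands after 'caju', the opposite of what a Portuguese reader expects.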

38 Unicode sorting (Py3)
• The standard library solution requires use of the locale module and a suitable locale available in the OS
  – only main program should set locale, and only at start-up
  – desired locale is not always available...
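
A sketch of the standard library approach in Python 3, using `locale.strxfrm` as the sort key. The helper name `sort_with_locale` and the locale name `pt_BR.UTF-8` are assumptions; locale names vary between operating systems. The function sets and restores the locale only to keep the demonstration self-contained, whereas a real program would set it once at start-up as the slide advises.

```python
import locale


def sort_with_locale(words, locale_name='pt_BR.UTF-8'):
    """Sort words with the collation rules of locale_name.

    Returns None when the OS does not provide that locale.
    """
    previous = locale.setlocale(locale.LC_COLLATE)    # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None                                   # locale not installed
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore


fruits = ['caju', 'atemoia', 'caj\xe1', 'a\xe7a\xed', 'acerola']
print(sort_with_locale(fruits))
```

The result depends on the platform's locale data, so the output is not shown here.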

39 Unicode sorting (Py3)
• James Tauber's PyUCA is a Python 3 implementation of UCA (Unicode Collation Algorithm)
  – locale independent!
  – designed to work with many languages
  – https://pypi.python.org/pypi/pyuca/

40 Unicode database (Py3)
• Metadata about each Unicode character
  – name, numeric value, category etc.
  – standard library: unicodedata (Python 2 and Python 3)
• numeric values!
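
A minimal Python 3 sketch of the metadata available through `unicodedata`. The helper name `char_info` and the sample characters are assumptions for illustration.

```python
import unicodedata


def char_info(char):
    """Collect a few database fields for one character."""
    return {
        'code_point': 'U+{:04X}'.format(ord(char)),
        'name': unicodedata.name(char),
        'category': unicodedata.category(char),
        'numeric': unicodedata.numeric(char),
    }


# DIGIT ONE, VULGAR FRACTION ONE QUARTER, DEVANAGARI DIGIT THREE,
# CIRCLED DIGIT THREE
for char in '1\xbc\u0969\u2462':
    print(char_info(char))
```

Characters that look nothing like ASCII digits still carry a numeric value in the database.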

41 Unicode database (Py3)

42 flupy-ch18/charfinder.py (Py3)
• Command-line utility to search for characters by words in the official name
  – e.g. “cat face”, “black chess”...
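
The idea of searching by words in the official name can be sketched with `unicodedata` alone. This is not the charfinder.py code from the book, only a simplified illustration in Python 3; the function name `find_chars` and its `start` and `stop` parameters are hypothetical.

```python
import sys
import unicodedata


def find_chars(query, start=0x20, stop=sys.maxunicode + 1):
    """Yield characters whose official name contains every word in query."""
    wanted = set(query.upper().split())
    for code in range(start, stop):
        char = chr(code)
        name = unicodedata.name(char, None)   # None for unnamed code points
        if name and wanted.issubset(name.split()):
            yield char


for char in find_chars('black chess', stop=0x3000):
    print('U+{:04X}'.format(ord(char)), char, unicodedata.name(char))
```

Scanning the whole code space on every query is slow; it is acceptable for a sketch, while a real tool would likely build an index once.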

43 flupy-ch18/http_charfinder.py (Py3)
• HTTP and Telnet servers used to illustrate asyncio programming (Fluent Python, ch. 18)

44 Questions?
• More answers:
  – Python Unicode HOWTO
    • for Python 2
    • for Python 3
  – Fluent Python, chapter 4
  – Twitter: @ramalhoorg
