Slide 1

Slide 1 text

Unicode solutions in Python 2 and 3

Slide 2

Slide 2 text

Unicode solutions in Python 2 and 3
Labels on the slide: Latin-1, emoticons, Armenian, Egyptian Hieroglyphs, Malayalam, CJK Unified Ideographs

Slide 3

Slide 3 text

About me: Luciano Ramalho
● Programming in Python since 1998
● Focus on content management (i.e. text wrangling)
● Teaching Python since 1999
● Speaker at PyCon US, OSCON, FISL, PythonBrasil, RuPy, QCon...
● Author of Fluent Python
● Twitter: @ramalhoorg
● Native language: Português – “ação”
Callout: 4 non-ASCII characters here

Slide 4

Slide 4 text

Resources
● All code, slides and images used in this talk:
  – https://github.com/fluentpython/unicode-solutions
● Fluent Python
  – http://shop.oreilly.com/product/0636920032519.do
  – Relevant content and examples:
    ● Chapter 4: Text versus Bytes – all 39 pages
    ● Chapter 18: Concurrency with asyncio – the charfinder examples

Slide 5

Slide 5 text

A bite of theory

Slide 6

Slide 6 text

The single-byte codepage ballet
Source code: http://bit.ly/1Oqt0MZ
Video: https://www.youtube.com/watch?v=J4qioAacrYo

Slide 7

Slide 7 text

Why Unicode
● Too many incompatible byte encodings
● Separate concepts:
  – character identity: one code point for each abstract character
    ● U+0041 → LATIN CAPITAL LETTER A
    ● U+096C → DEVANAGARI DIGIT SIX
  – binary representation: multiple encodings
    ● U+0041 → UTF-8: 0x41; UTF-16LE: 0x41 0x00
    ● U+096C → UTF-8: 0xE0 0xA5 0xAC; UTF-16LE: 0x6C 0x09
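The split between character identity and binary representation can be checked from a Python 3 prompt. This is a minimal sketch, not code from the talk; `describe` is a hypothetical helper name introduced here.

```python
import unicodedata

def describe(char):
    """Return the code point label, official name and two encodings of char."""
    return {
        'code_point': 'U+{:04X}'.format(ord(char)),
        'name': unicodedata.name(char),
        'utf-8': char.encode('utf-8'),
        'utf-16le': char.encode('utf-16le'),
    }

# The two characters used on the slide: U+0041 and U+096C.
for char in ('A', '\u096c'):
    print(describe(char))
```

The code point is the same in every row; only the byte sequences differ by encoding.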

Slide 8

Slide 8 text

A sample of encodings
Figure 4-1 of Fluent Python

Slide 9

Slide 9 text

Byte and text solutions

Slide 10

Slide 10 text

Data types for text or bytes

                   Python 2.7                             Python 3.4
Human text         unicode: u'café', u'caf\xe9'           str: 'café', u'café', 'caf\xe9'
Bytes (immutable)  str: 'café', 'caf\xe9', b'café'        bytes: b'caf\xc3\xa9'
Bytes (mutable)    bytearray: bytearray(b'caf\xc3\xa9')   bytearray: bytearray(b'caf\xc3\xa9')

Slide 11

Slide 11 text

.encode() v. .decode()
● “Humans use text. Computers speak bytes.”
  – Esther Nam and Travis Fischer in Character encoding and Unicode in Python (PyCon US 2014)
● Use .encode() to convert human text to bytes
● Use .decode() to convert bytes to human text
2.7 gotcha: methods .encode() and .decode() exist in both str and unicode types!
Diagram: 'café' → encode → b'caf\xc3\xa9'; b'caf\xc3\xa9' → decode → 'café'
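A minimal Python 3 sketch of the round trip in the diagram, assuming UTF-8 as the encoding (the variable names are illustrative):

```python
text = 'caf\u00e9'                 # 'café': 4 characters of human text
data = text.encode('utf-8')        # text -> bytes
round_trip = data.decode('utf-8')  # bytes -> text

print(len(text), len(data))        # 4 characters, 5 bytes
print(data)                        # b'caf\xc3\xa9'
print(round_trip == text)          # True
```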

Slide 12

Slide 12 text

Text v. bytes (Py3)
● Items in unicode text are characters
● Items in byte sequences are bytes
  – integers 0...255
  – shown as ASCII sequences with a b'' prefix for convenience
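The difference between items of text and items of bytes shows up when indexing. A short Python 3 sketch, written for this transcript rather than taken from the slide:

```python
text = 'caf\u00e9'            # 'café'
data = text.encode('utf-8')   # b'caf\xc3\xa9'

first_char = text[0]          # 'c': a str of length 1
first_byte = data[0]          # 99: an int in range(256)
byte_slice = data[:1]         # b'c': slicing bytes yields bytes
all_bytes = list(data)        # [99, 97, 102, 195, 169]

print(first_char, first_byte, byte_slice, all_bytes)
```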

Slide 13

Slide 13 text

Text v. bytes (Py2, Py3)

Slide 14

Slide 14 text

Bytes and bytearray (Py3)

Slide 15

Slide 15 text

Bytes and bytearray (Py2, Py3)
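An illustrative Python 3 sketch of the difference between the two binary types: bytes is immutable, bytearray can be changed in place. This is not the console session from the slides.

```python
frozen = bytes([99, 97, 102, 195, 169])   # b'caf\xc3\xa9'
mutable = bytearray(frozen)               # mutable copy of the same bytes

mutable[0] = ord('C')                     # item assignment works on bytearray

error = None
try:
    frozen[0] = ord('C')                  # bytes does not support item assignment
except TypeError as exc:
    error = exc

print(frozen, mutable, type(error).__name__)
```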

Slide 16

Slide 16 text

Common codecs (Py3)
● codec = encoder/decoder table or algorithm

Slide 17

Slide 17 text

Common codecs (Py2, Py3)
● codec = encoder/decoder table or algorithm
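To see how codecs differ, the same text can be encoded with several of them. A minimal Python 3 sketch; the sample string and the choice of codecs are assumptions made for illustration.

```python
sample = 'El Ni\u00f1o'   # 'El Niño': 7 characters

encoded = {}
for codec in ('latin_1', 'utf_8', 'utf_16'):
    encoded[codec] = sample.encode(codec)
    print(codec, encoded[codec], sep='\t')
```

latin_1 uses one byte per character here, utf_8 needs two bytes for the 'ñ', and utf_16 adds a 2-byte byte-order mark before two bytes per character.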

Slide 18

Slide 18 text

Coping with Unicode Errors
● SyntaxError
  – A .py file has source code in an unexpected encoding
● UnicodeDecodeError
  – A binary sequence contains bytes that are not valid in the expected encoding
● UnicodeEncodeError
  – A Unicode string contains code points that cannot be represented in the desired encoding

Slide 19

Slide 19 text

Coping with SyntaxError (Py2)
● A .py file uses an unexpected encoding
  – The source file encoding is not the default, and no # coding comment was found.
  – The source file encoding is not the one declared in the # coding comment
● Default source encoding:
  – Python 2.7 → ASCII
  – Python 3.x → UTF-8
2.7 gotcha: default source encoding is ASCII

Slide 20

Slide 20 text

UnicodeEncodeError (Py3)
● A character in the Unicode text cannot be represented in the target byte encoding
  – happens with legacy encodings that cover only a small subset of Unicode
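A Python 3 sketch of how this error arises and how the `errors` argument of `.encode()` changes the outcome. The sample text and the cp437 codec are chosen for illustration; they are not taken from the slide.

```python
city = 'S\u00e3o Paulo'   # 'São Paulo'

failure = None
try:
    city.encode('cp437')                  # cp437 has no 'ã'
except UnicodeEncodeError as exc:
    failure = exc

ignored = city.encode('cp437', errors='ignore')              # drops the character
replaced = city.encode('cp437', errors='replace')            # substitutes '?'
xml_refs = city.encode('cp437', errors='xmlcharrefreplace')  # XML character reference
as_utf8 = city.encode('utf-8')                               # UTF-8 covers all of Unicode

print(failure, ignored, replaced, xml_refs, as_utf8, sep='\n')
```

Silently ignoring or replacing characters loses data, so those handlers are best reserved for cases where that loss is acceptable.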

Slide 21

Slide 21 text

UnicodeEncodeError (Py2)
● A character in the Unicode text cannot be represented in the target byte encoding
  – happens with legacy encodings that cover only a small subset of Unicode

Slide 22

Slide 22 text

UnicodeDecodeError (Py3)
● Invalid byte in the source encoding
  – more common with the UTF encodings
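A Python 3 sketch of a decoding failure, assuming a byte string that was produced with a single-byte encoding and is then read as UTF-8 (the sample bytes are illustrative):

```python
octets = b'Montr\xe9al'                   # 'Montréal' encoded as cp1252/latin-1

as_cp1252 = octets.decode('cp1252')       # every byte is valid in cp1252

failure = None
try:
    octets.decode('utf_8')                # 0xE9 followed by 'a' is not valid UTF-8
except UnicodeDecodeError as exc:
    failure = exc

# errors='replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
patched = octets.decode('utf_8', errors='replace')

print(as_cp1252, failure, patched, sep='\n')
```

Single-byte codecs accept almost any byte, so they tend to produce garbage silently instead of raising; UTF-8 has structure, which is why invalid input is detected.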

Slide 23

Slide 23 text

UnicodeDecodeError (Py2)

Slide 24

Slide 24 text

Best practice to avoid errors: the “Unicode sandwich”
Figure 4-2 of Fluent Python, after Ned Batchelder's Pragmatic Unicode talk: http://nedbatchelder.com/text/unipain.html

Slide 25

Slide 25 text

How to implement the sandwich
● Avoid calling .encode() or .decode() if possible.
  – if impossible to avoid, restrict usage to code sections that perform the actual I/O.
● Django and most frameworks already perform encoding/decoding in library code (not in your code)
● Always specify encoding when opening text files, so you send and receive text, and not bytes
  – in Python 2.7, remember to use io.open()
2.7 gotcha: no way to specify encoding with built-in open(…). Must use io.open(…).

Slide 26

Slide 26 text

Text I/O (Py3)
● open() built-in is Unicode-aware in Python 3
  – text mode (the default) accepts an encoding argument
  – .write() method only accepts Unicode text
  – .read() method returns Unicode text
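A minimal Python 3 sketch of text I/O with an explicit encoding. The temporary file and its name are assumptions for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')  # hypothetical scratch file

# Text mode is the default; always pass the encoding explicitly.
with open(path, 'w', encoding='utf-8') as fp:
    chars_written = fp.write('caf\u00e9')            # returns characters written

size_on_disk = os.stat(path).st_size                 # bytes, not characters

with open(path, encoding='utf-8') as fp:
    text = fp.read()                                 # str

with open(path, 'rb') as fp:
    raw = fp.read()                                  # bytes, no decoding

print(chars_written, size_on_disk, text, raw)
```

Four characters are written, but the file holds five bytes because 'é' takes two bytes in UTF-8.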

Slide 27

Slide 27 text

Bytes or text I/O (Py2)
● open() built-in only supports bytes in Python 2
  – even in “text mode” (deals with CR+LF...)
  – no encoding argument accepted
  – .write() implicitly converts unicode to str using ASCII codec
    ● remember: 'café' is actually b'caf\xc3\xa9'
  – .read() method always returns bytes

Slide 28

Slide 28 text

Bytes or text I/O (Py2)
● io.open() is the Unicode-aware open() from Python 3 backported to Python 2.6+

Slide 29

Slide 29 text

Bytes or text I/O (Py2)
● io.open() also handles bytes
  – mode 'b'
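A sketch of io.open() in both modes, written so that the same source should run on Python 2.6+ and on Python 3.3+ (where io.open is the built-in open). The scratch file name is an assumption.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cafe.txt')  # hypothetical scratch file

# Text mode: explicit encoding; write and read unicode text.
with io.open(path, 'w', encoding='utf-8') as fp:
    fp.write(u'caf\u00e9')
with io.open(path, encoding='utf-8') as fp:
    text = fp.read()       # unicode in Python 2, str in Python 3

# Binary mode: no encoding argument; read() returns raw bytes.
with io.open(path, 'rb') as fp:
    raw = fp.read()        # str in Python 2, bytes in Python 3

print(repr(text), repr(raw))
```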

Slide 30

Slide 30 text

FAQ: How to find out the encoding of a file?
● Some files have an encoding header
  – HTML, XML, some database dumps
● Otherwise, you must be told. Ask!
● If you can't ask, try the Chardet package
  – not 100% safe, but pretty smart
  – uses statistics and heuristics
  – includes a chardetect command-line tool

Slide 31

Slide 31 text

FAQ: What are the default encodings?
Python 3 on recent GNU/Linux and OSX: UTF-8 FTW!

Slide 32

Slide 32 text

FAQ: What are the default encodings?
Python 3 on Windows 7: four different encodings!
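The defaults on a given machine can be inspected with a few standard-library calls. This Python 3 sketch is not the script shown in the talk, and its output depends on the OS, locale and Python version.

```python
import locale
import sys

defaults = {
    # Used by open() in text mode when no encoding argument is given.
    'locale.getpreferredencoding()': locale.getpreferredencoding(),
    # Used for implicit str/bytes conversions; always 'utf-8' in Python 3.
    'sys.getdefaultencoding()': sys.getdefaultencoding(),
    # Used to encode and decode file names.
    'sys.getfilesystemencoding()': sys.getfilesystemencoding(),
    # Encoding of the console or pipe attached to standard output, if any.
    'sys.stdout.encoding': getattr(sys.stdout, 'encoding', None),
}

for expression, value in sorted(defaults.items()):
    print('{:>32} -> {!r}'.format(expression, value))
```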

Slide 33

Slide 33 text

Unicode solutions

Slide 34

Slide 34 text

Combining characters (Py3)
● Latin character accents and other diacritical marks can be written as separate characters
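A Python 3 sketch of the same word written with a precomposed character and with a combining mark. The example word is an assumption, chosen to match the 'café' used elsewhere in the talk.

```python
import unicodedata

composed = 'caf\u00e9'      # 'é' as one code point: U+00E9
decomposed = 'cafe\u0301'   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(composed, decomposed)              # both render as 'café'
print(len(composed), len(decomposed))    # 4 5
print(composed == decomposed)            # False: different code point sequences
print(unicodedata.name(decomposed[-1]))  # COMBINING ACUTE ACCENT
```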

Slide 35

Slide 35 text

Normalization (Py3)
● Composing or decomposing all characters
  – Optional: replacing compatibility characters
● Normalization forms: NFC, NFKC, NFD, NFKD
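A Python 3 sketch of the four forms using `unicodedata.normalize` from the standard library; the sample strings are illustrative.

```python
from unicodedata import normalize

composed = 'caf\u00e9'      # precomposed 'é'
decomposed = 'cafe\u0301'   # 'e' + combining acute accent

nfc = normalize('NFC', decomposed)   # compose: equals the precomposed form
nfd = normalize('NFD', composed)     # decompose: equals the combining form

# The K forms also replace compatibility characters, which may lose information.
half = normalize('NFKC', '\u00bd')   # VULGAR FRACTION ONE HALF -> '1', U+2044, '2'
micro = normalize('NFKC', '\u00b5')  # MICRO SIGN -> GREEK SMALL LETTER MU

print(nfc == composed, nfd == decomposed, half, micro)
```

Comparing strings after normalizing both to the same form (NFC is a common choice) avoids the mismatch between composed and decomposed spellings.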

Slide 36

Slide 36 text

Case folding (Py3)
● Standard character substitutions
2.7 gotcha: unicode.casefold() not implemented
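A Python 3 sketch contrasting `str.lower()` with `str.casefold()`; the sample strings are assumptions for illustration.

```python
street = 'Stra\u00dfe'               # 'Straße'

lowered = street.lower()             # 'straße': lower() keeps the 'ß'
folded = street.casefold()           # 'strasse': casefold() maps 'ß' to 'ss'
micro_folded = '\u00b5'.casefold()   # MICRO SIGN folds to GREEK SMALL LETTER MU

# Case-insensitive comparison: fold both sides before comparing.
same = street.casefold() == 'STRASSE'.casefold()

print(lowered, folded, micro_folded, same)
```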

Slide 37

Slide 37 text

Unicode sorting (Py3)
● By default, text ordering uses the code point values of the characters
  – this is not what humans expect
    ● in Portuguese, accents and diacritics are tiebreakers only
Callouts: wrong: açaí should be first; wrong: caju should be last
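A Python 3 sketch of the problem. The word list is an assumption chosen to match the slide's callouts; accented letters are written as escapes so that the code points are unambiguous.

```python
# caju, atemoia, cajá, açaí, acerola
fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']

by_code_point = sorted(fruits)
print(by_code_point)
# ['acerola', 'atemoia', 'açaí', 'caju', 'cajá']
# 'ç' (U+00E7) sorts after 't', and 'á' (U+00E1) sorts after 'u'.

# What a Portuguese speaker expects instead:
expected = ['a\u00e7a\u00ed', 'acerola', 'atemoia', 'caj\u00e1', 'caju']
print(by_code_point == expected)   # False
```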

Slide 38

Slide 38 text

Unicode sorting (Py3)
● The standard library solution requires use of the locale module and a suitable locale available in the OS
  – only main program should set locale, and only at start-up
  – desired locale is not always available...
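A Python 3 sketch of locale-aware sorting with `locale.strxfrm` as the sort key. `sorted_for_locale` is a hypothetical helper, and the locale name 'pt_BR.UTF-8' is an assumption that may not exist on a given system. The helper switches the locale and restores it only to keep the demonstration self-contained; in a real program the locale would be set once at start-up, as the slide advises.

```python
import locale

def sorted_for_locale(words, locale_name):
    """Sort words using the collation rules of locale_name.

    Returns None when the OS does not provide that locale.
    """
    previous = locale.setlocale(locale.LC_COLLATE)    # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None                                   # locale not available
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore

fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']
result = sorted_for_locale(fruits, 'pt_BR.UTF-8')
print(result if result is not None else 'locale not available')
```

Even when the locale is available, the resulting order depends on the collation support in the OS, so it can differ between platforms.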

Slide 39

Slide 39 text

Unicode sorting (Py3)
● James Tauber's PyUCA is a Python 3 implementation of UCA (Unicode Collation Algorithm)
  – locale independent!
  – designed to work with many languages
  – https://pypi.python.org/pypi/pyuca/
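A sketch of sorting with PyUCA, assuming the third-party pyuca package is installed (`pip install pyuca`) and using its `Collator().sort_key` as the sort key. The import is guarded so that the snippet still runs without the package.

```python
fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']

try:
    from pyuca import Collator      # third-party package, not in the standard library
except ImportError:
    uca_sorted = None               # pyuca is not installed
else:
    collator = Collator()           # loads the collation table bundled with pyuca
    uca_sorted = sorted(fruits, key=collator.sort_key)

print(uca_sorted if uca_sorted is not None else 'pyuca not installed')
```

No locale has to be configured in the OS, which is the main attraction over the locale-based approach.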

Slide 40

Slide 40 text

Unicode database (Py3)
● Metadata about each Unicode character
  – name, numeric value, category etc.
  – standard library: unicodedata (Python 2 and Python 3)
Callout: numeric values!
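A Python 3 sketch that queries the unicodedata module for a few characters with numeric values. The sample characters are chosen for illustration and are not taken from the slide.

```python
import unicodedata

rows = []
# DIGIT ONE, VULGAR FRACTION ONE QUARTER, ARABIC-INDIC DIGIT THREE, CIRCLED DIGIT SEVEN
for char in '1\u00bc\u0663\u2466':
    rows.append((
        'U+{:04X}'.format(ord(char)),
        unicodedata.name(char),        # official character name
        unicodedata.category(char),    # general category, e.g. 'Nd' or 'No'
        unicodedata.numeric(char),     # numeric value as a float
    ))

for row in rows:
    print(*row, sep='\t')
```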

Slide 41

Slide 41 text

Unicode database (Py3)

Slide 42

Slide 42 text

flupy-ch18/charfinder.py (Py3)
● Command-line utility to search for characters by words in the official name
  – e.g. “cat face”, “black chess”...
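The idea of searching by words in the official name can be sketched with unicodedata alone. This is a simplified illustration, not the charfinder.py source; `find_chars` and its parameters are hypothetical.

```python
import sys
import unicodedata

def find_chars(query, start=32, stop=sys.maxunicode + 1):
    """Yield (char, name) for characters whose name contains every query word."""
    wanted = set(query.upper().split())
    for code in range(start, stop):
        char = chr(code)
        name = unicodedata.name(char, None)   # None for unnamed code points
        if name is None:
            continue
        name_words = set(name.replace('-', ' ').split())
        if wanted <= name_words:              # all query words must be present
            yield char, name

# Restrict the range to keep the demonstration fast.
for char, name in find_chars('black chess', 0x2600, 0x2700):
    print('U+{:04X}'.format(ord(char)), char, name, sep='\t')
```

Scanning the whole code space on every query is slow; building an index from words to characters once would be the natural next step for a real tool.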

Slide 43

Slide 43 text

flupy-ch18/http_charfinder.py (Py3)
● HTTP and Telnet servers used to illustrate asyncio programming (Fluent Python, ch. 18)

Slide 44

Slide 44 text

¿Preguntas? (Questions?)
● More answers:
  – Python Unicode HOWTO
    ● for Python 2
    ● for Python 3
  – Fluent Python, chapter 4
  – Twitter: @ramalhoorg