Opening The Floodgates: Unicode Identifiers in Python

Slide 1

Slide 1 text

Opening The Floodgates Unicode Identifiers In Python

Slide 2

Slide 2 text

“Hey Matt, look what Swift can do”

Slide 3

Slide 3 text

Python can’t do it. Well… Python won’t do it.

Slide 4

Slide 4 text

Text Representation
The Treachery Of Images by René Magritte

Slide 5

Slide 5 text

Encoding failure

Slide 6

Slide 6 text

Encode throws DecodeError?! ლ(ಠ益ಠლ)

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Unicode
● A standard that defines how characters should be represented in computers
● Other standards (ASCII, Latin-1, EUC-JP) are culturally specific
  ○ Inappropriate for a global internet
● Over one million code points, enough to represent every known character
● There is even room for emoji!

Slide 9

Slide 9 text

Unicode
● A Unicode string is a sequence of code points, each in the range 0 to 1,114,111 (U+10FFFF)
● These code points correspond to characters
  ○ U+0045 -> E
  ○ U+1F60A -> 😊
● To represent these in the machine, an encoding is used
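A minimal sketch of the character-to-code-point mapping above, using only Python 3 built-ins. The variable names are made up for illustration.

```python
import sys

# ord() maps a one-character string to its code point; chr() goes the other way.
letter = "E"
smiley = "\U0001F60A"  # SMILING FACE WITH SMILING EYES

letter_cp = ord(letter)
smiley_cp = ord(smiley)

# Format the integers the way Unicode charts write them.
print(f"U+{letter_cp:04X}")
print(f"U+{smiley_cp:04X}")

# The largest code point a Python 3 str can hold.
highest = sys.maxunicode
print(hex(highest))
```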

Slide 10

Slide 10 text

Encoding 101
● UTF-8 is one way of encoding Unicode
● Each code point is represented with one to four bytes
  ○ E -> U+0045 -> 01000101
  ○ 😊 -> U+1F60A -> 11110000 10011111 10011000 10001010
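The bit patterns on this slide can be reproduced in Python 3 along these lines. `utf8_bits` is a hypothetical helper written for this example, not a standard library function.

```python
def utf8_bits(text):
    """Return the UTF-8 encoding of text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))

# One byte for a basic Latin letter, four bytes for the emoji.
print(utf8_bits("E"))
print(utf8_bits("\U0001F60A"))
```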

Slide 11

Slide 11 text

Why Should You Care
● When you leave the pretty world that is your program, you are in the real world. The real world is in bytes
  ○ Databases
  ○ Console
  ○ HTTP
● '(\xe2\x95\xaf\xc2\xb0\xe2\x96\xa1\xc2\xb0\xef\xbc\x89\xe2\x95\xaf\xef\xb8\xb5 \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb'
  ○ Knowing that these bytes are encoded in UTF-8 is the difference between '(╯°□°）╯︵ ┻━┻' and a UnicodeDecodeError
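A small Python 3 sketch of that difference: the same bytes decoded with the right codec and with the wrong one. The variable names are illustrative only.

```python
# The bytes from the slide, as they might arrive from a database or a socket.
raw = (b"(\xe2\x95\xaf\xc2\xb0\xe2\x96\xa1\xc2\xb0\xef\xbc\x89"
       b"\xe2\x95\xaf\xef\xb8\xb5 \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb")

# Decoding with the correct codec recovers the text.
table_flip = raw.decode("utf-8")
print(table_flip)

# Guessing ASCII fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("ascii decode failed:", exc.reason)
```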

Slide 12

Slide 12 text

So About Encode Throwing DecodeError
● "¥" is a Python 2 string (a sequence of bytes)
● u"¥" is a unicode object
● encode must be called on unicode objects
  ○ Encode into bytes
  ○ Decode into unicode
● Call encode on a string and Python 2 tries to “help” by first decoding it into unicode with the default ASCII codec
● And it explodes.
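Python 3 no longer does this implicit step, so the sketch below only imitates it. `py2_style_encode` is a hypothetical helper that spells out what Python 2 did behind the scenes, assuming the default ASCII codec and a UTF-8 source file.

```python
# What the Python 2 str literal "¥" would hold in a UTF-8 source file.
yen_bytes = "\u00a5".encode("utf-8")

def py2_style_encode(byte_string, encoding):
    """Hypothetical helper: mimic Python 2's str.encode on a byte string."""
    text = byte_string.decode("ascii")  # the implicit "helpful" decode
    return text.encode(encoding)        # the encode the caller asked for

try:
    py2_style_encode(yen_bytes, "utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decode, not from the encode.
    error_name = type(exc).__name__

print(error_name)
```

Pure ASCII input passes through the helper untouched, which is why this bug tends to hide until real-world data shows up.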

Slide 13

Slide 13 text

Identifiers in Python
● Before Python 3, identifiers only supported basic Latin letters, digits, and underscore
● PEP 3131 expands this definition to include Unicode

Slide 14

Slide 14 text

A Mixed Blessing
● Inclusiveness
  ○ Not everyone uses Latin alphabets
● Complexity
  ○ Imagine using a library with a Unicode-heavy API
    ■ γάμμα.ɸ()
  ○ Harder to read source code
  ○ Security concerns
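One way the readability and security concerns can show up is with lookalike letters. The slides do not spell this case out, so treat it as an illustrative assumption: in Python 3, Latin "a" and Cyrillic "а" are two distinct identifiers.

```python
import unicodedata

latin_a = "a"
cyrillic_a = "\u0430"  # renders almost identically in most fonts

# Bind both names in a scratch namespace by executing generated source.
namespace = {}
exec(f"{latin_a} = 1\n{cyrillic_a} = 2", namespace)

# Two separate variables exist, even though the source looks like "a" twice.
print(namespace[latin_a], namespace[cyrillic_a])
print(unicodedata.name(latin_a))
print(unicodedata.name(cyrillic_a))
```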

Slide 15

Slide 15 text

So why can’t I program with poop already?!
● Python 3 identifiers only allow a subset of Unicode
● Specifically, they follow the suggestions of The Unicode Consortium
  ○ Identifier: XID_START XID_CONTINUE*
● XID_START
  ○ Basically “This is a letter”
● XID_CONTINUE
  ○ Everything in XID_START plus digits, combining marks, and other modifier characters
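The rule above can be poked at from Python 3 with `str.isidentifier()`, which applies the language's own identifier definition. The sample strings and labels below are chosen for illustration.

```python
candidates = {
    "greek_word": "\u03b3\u03ac\u03bc\u03bc\u03b1",  # γάμμα: all letters
    "accent_after_letter": "a\u0300",                # combining grave accent as a continuation
    "accent_first": "\u0300a",                       # the same accent cannot start an identifier
    "emoji": "\U0001F4A9",                           # pile of poo: neither start nor continue
}

# Ask Python whether each string would be accepted as an identifier.
results = {label: text.isidentifier() for label, text in candidates.items()}

for label, ok in results.items():
    print(f"{label}: {ok}")
```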

Slide 16

Slide 16 text

Continuation mark demonstrated Combining Grave Accent

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Long Story Short Emoji is not considered the start or continuation of a word

Slide 19

Slide 19 text

Swift Follows These Suggestions (mostly)
● Exception made for emoji (for some reason)
● Gives a better error
  ○ Python says the accent is invalid, but technically it just is not a valid start
  ○ I wrote a patch (http://bugs.python.org/issue23263)

Slide 20

Slide 20 text

OK! WE KNOW EVERYTHING WE NEED TO KNOW!

Slide 21

Slide 21 text

Obviously, I need to fork the interpreter

Slide 22

Slide 22 text

Ok… fine

Slide 23

Slide 23 text

Victory! https://github.com/Bachmann1234/python-emoji

Slide 24

Slide 24 text

That was a lot of… stuff
Ned Batchelder: Pragmatic Unicode
Travis Fischer, Esther Nam: Character encoding and Unicode in Python

Slide 25

Slide 25 text

Matt Bachmann
@MattBachmann
https://github.com/Bachmann1234