Opening The Floodgates: Unicode Identifiers in Python


“Hey Matt, look what Swift can do”

Python can’t do it. Well… Python won’t do it.

Text Representation
The Treachery of Images by René Magritte

Encoding failure

Encode throws DecodeError?! ლ(ಠ益ಠლ)

Unicode
• A standard that defines how characters should be represented in computers
• Other standards (ASCII, Latin-1, EUC-JP) are culturally specific
  ◦ Inappropriate for a global internet
• Room for over one million code points, enough to represent all known characters
• There is even room for emoji!

Unicode
• A Unicode string is a sequence of code points from 0 to 1,114,111 (U+10FFFF)
• These code points correspond to characters
  ◦ U+0045 -> E
  ◦ U+1F60A -> 😊
• To represent these in the machine, an encoding is used.
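A minimal sketch of the character/code point mapping, using only Python 3 built-ins. The helper name `u_notation` is made up for illustration and is not from the talk.

```python
import sys

# ord() maps a character to its code point; chr() goes the other way.
code_point_e = ord("E")
smiley = chr(0x1F60A)


def u_notation(ch):
    """Format a single character in the usual U+XXXX notation (hypothetical helper)."""
    return "U+{:04X}".format(ord(ch))


print(u_notation("E"))
print(u_notation(smiley))
# sys.maxunicode is the largest code point Python will accept.
print(hex(sys.maxunicode))
```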

Encoding 101
• UTF-8 is one way of encoding Unicode
• Each code point is represented with one to four bytes
  ◦ E -> U+0045 -> 01000101
  ◦ 😊 -> U+1F60A -> 11110000 10011111 10011000 10001010
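The bit patterns on this slide can be reproduced in Python 3 with a small sketch like the one below. The function name `utf8_bits` is an assumption made for this example.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of ch as space-separated 8-bit groups (hypothetical helper)."""
    return " ".join("{:08b}".format(byte) for byte in ch.encode("utf-8"))


# One byte for a basic Latin letter, four bytes for U+1F60A.
print(utf8_bits("E"))
print(utf8_bits("\U0001F60A"))
```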

Why Should You Care
• When you leave the pretty world that is your program, you are in the real world. The real world is in bytes
  ◦ Databases
  ◦ Console
  ◦ HTTP
• '(\xe2\x95\xaf\xc2\xb0\xe2\x96\xa1\xc2\xb0\xef\xbc\x89\xe2\x95\xaf\xef\xb8\xb5 \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb'
  ◦ Understanding that these bytes are encoded in UTF-8 is the difference between '(╯°□°）╯︵ ┻━┻' and a UnicodeDecodeError

So About Encode Throwing DecodeError
• "¥" is a Python 2 string (a sequence of bytes)
• u"¥" is a unicode object
• encode must be called on unicode objects
  ◦ Encode into bytes
  ◦ Decode into unicode
• Call encode on a string and Python 2 tries to “help” by decoding it into unicode first, using the default ASCII codec
• And it explodes.
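Python 3 removed this implicit step, so the sketch below only imitates it: `py2_style_encode` is a hypothetical helper that performs the ASCII decode explicitly before encoding, which is roughly what Python 2 did behind the scenes when `encode` was called on a byte string.

```python
# In a UTF-8 source file, the Python 2 literal "¥" would hold these two bytes.
yen_bytes = "¥".encode("utf-8")


def py2_style_encode(byte_string, encoding="utf-8"):
    """Imitate Python 2's str.encode: decode as ASCII first, then encode (hypothetical helper)."""
    return byte_string.decode("ascii").encode(encoding)


try:
    py2_style_encode(yen_bytes)
    error_name = None
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decode step, not from the encode.
    error_name = type(exc).__name__

print(error_name)
```

Pure ASCII input passes through the same helper untouched, which is why this bug often stays hidden until the first non-ASCII character shows up.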

Identifiers in Python
• Before Python 3, identifiers only supported basic Latin letters, digits, and underscores
• PEP 3131 expands this definition to include Unicode
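A short Python 3 sketch of what PEP 3131 allows. `str.isidentifier()` applies the identifier rules to a string without compiling anything; the variable names are made up for the example.

```python
# Check candidate names against the identifier rules.
print("gamma".isidentifier())
print("γάμμα".isidentifier())
print("2fast".isidentifier())   # a digit cannot start an identifier

# Non-Latin names behave like any other variable in Python 3.
γάμμα = 3
doubled = γάμμα * 2
print(doubled)
```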

A Mixed Blessing
• Inclusiveness
  ◦ Not everyone uses Latin alphabets
• Complexity
  ◦ Imagine using a library with a Unicode-heavy API
    ▪ γάμμα.ɸ()
  ◦ Harder to read source code
  ◦ Security concerns

So why can’t I program with poop already?!
• Python 3 identifiers only allow a subset of Unicode
• Specifically, they follow the suggestions of the Unicode Consortium
  ◦ Identifier: XID_START XID_CONTINUE*
• XID_START
  ◦ Basically “This is a letter”
• XID_CONTINUE
  ◦ Everything in XID_START plus digits and modifier characters such as combining marks
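The standard library does not expose XID_START and XID_CONTINUE directly, so the sketch below approximates them with `str.isidentifier()`. The helpers `can_start` and `can_continue` are hypothetical names, and the approximation is loose: Python also accepts an underscore as a start character.

```python
def can_start(ch):
    """Approximate XID_START: is ch on its own a valid identifier?"""
    return ch.isidentifier()


def can_continue(ch):
    """Approximate XID_CONTINUE: is ch valid after a known-good start character?"""
    return ("a" + ch).isidentifier()


COMBINING_GRAVE = "\u0300"

# A letter can start and continue; a digit or combining mark can only continue.
for label, ch in [("letter", "é"), ("digit", "7"), ("combining grave", COMBINING_GRAVE)]:
    print(label, can_start(ch), can_continue(ch))
```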

Continuation mark demonstrated: Combining Grave Accent

Long Story Short
Emoji are not considered the start or continuation of a word
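One way to see this from a stock Python 3 interpreter, as a sketch: the emoji's Unicode general category is a symbol rather than a letter or mark, and the compiler rejects it as a name. The constant `POO` is just a label chosen for this example.

```python
import unicodedata

POO = "\U0001F4A9"  # PILE OF POO

# General category "So" means "Symbol, other": not a letter, digit, or mark.
category = unicodedata.category(POO)
print(category)

# An unmodified interpreter refuses the emoji as an identifier.
try:
    compile(POO + " = 1", "<example>", "exec")
    compiled = True
except SyntaxError:
    compiled = False
print(compiled)
```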

Swift Follows These Suggestions (mostly)
• Exception made for emoji (for some reason)
• Gives a better error
  ◦ Python says the accent is invalid, but technically it just is not a valid start
  ◦ I wrote a patch (http://bugs.python.org/issue23263)

OK! WE KNOW EVERYTHING WE NEED TO KNOW!

Obviously, I need to fork the interpreter

Ok… fine

Victory! https://github.com/Bachmann1234/python-emoji

That was a lot of… stuff
• Ned Batchelder: Pragmatic Unicode
• Travis Fischer, Esther Nam: Character encoding and Unicode in Python

Matt Bachmann @MattBachmann https://github.com/Bachmann1234
