Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Opening The Floodgates: Unicode Identifiers in Python

Opening The Floodgates: Unicode Identifiers in Python

Python 3 introduced unicode identifiers into the language. This talk discusses what this means, what restrictions are included and shows what can happen when you drop those restrictions

Matt Bachmann

February 17, 2015
Tweet

More Decks by Matt Bachmann

Other Decks in Programming

Transcript

  1. Unicode • Standard to define how characters should be represented

    in computers • Other standards (ascii, latin-1, euc-jp) are culturally specific ◦ Inappropriate for a global internet • Over one million characters allowing all known characters to be represented • There is even room for Emoji!
  2. Unicode • A unicode string is a collection of code

    points from 0 to 1,114,112 • These codepoints correspond to characters ◦ U+0045 -> E ◦ U+1F60A -> • To represent these in the machine an encoding is used.
  3. Encoding 101 • UTF-8 is one way of encoding unicode

    • Each code point is represented with one or more bytes ◦ E -> U+0045 -> 01000101 ◦ -> U+1F60A -> 11110000 10011111 10011000 10001010
  4. Why Should you care • When you leave the pretty

    world that is your program you are in the real world. The real world is in bytes ◦ Databases ◦ Console ◦ HTTP • '(\xe2\x95\xaf\xc2\xb0\xe2\x96\xa1\xc2\xb0\xef\xbc\x89\xe2\x95\xa f\xef\xb8\xb5 \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb' ◦ Understanding that these bytes are encoded in UTF-8 is the difference between '(╯°□°)╯︵ ┻━┻' and a UnicodeDecodeError
  5. So about Encode throwing DecodeError • “¥” is a python

    2 string • u”¥” is a unicode object • encode must be called on unicode objects ◦ Encode into bytes ◦ Decode into unicode • Call encode on a string Python tries to “help” by decoding it into unicode first • And it explodes.
  6. Identifiers in Python • Python Identifiers pre Python 3 only

    supported basic latin letters, digits, and underscore • Pep 3131 expands this definition to include Unicode
  7. A Mixed Blessing • Inclusiveness ◦ Not everyone uses latin

    alphabets • Complexity ◦ Imagine using a library with a unicode heavy API ▪ γάμμα.ɸ() ◦ Harder to read source code ◦ Security concerns
  8. So why can’t I program with poop already?! • Python3

    identifiers only allow a subset of Unicode • Specifically the follow the suggestions of The Unicode Consortium ◦ Identifier: XID_START XID_CONTINUE* • XID_START ◦ Basically “This is a letter” • XID_CONTINUE ◦ Everything in XID_START and modifier characters
  9. Swift Follows These Suggestions (mostly) • Exception made for emoji

    (for some reason) • Gives better error ◦ Python says the accent is invalid, but technically it just is not a valid start ◦ I wrote a patch (http://bugs.python.org/issue23263)
  10. That was a lot of… stuff Ned Batchelder: Pragmatic Unicode

    Travis Fischer, Esther Nam: Character encoding and Unicode in Python