
Pythons in the tower of babel

Or how to decode those Unicode errors

Mario Corchero

April 26, 2016


Transcript

  1. Pythons within the tower of Babel
    Or how to decode those Unicode errors
    Mario Corchero, News Automation @Bloomberg
  2. History of text - ASCII
    • American Standard Code for Information Interchange
    • 0-127
  3. Unicode
    • 0 to 0x10ffff
    • A table that maps all possible characters into a “code point”
    • A Unicode string is a sequence of code points
    Glyph: A | Code point: U+0041 | Character: Latin capital letter A
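The glyph / code point / character split on this slide can be inspected from Python itself. The following is a minimal sketch, assuming Python 3 (where `str` is a sequence of code points); `describe` is a hypothetical helper name, not something from the talk.

```python
import unicodedata


def describe(text):
    """Return (glyph, code point label, character name) for each code point."""
    return [
        (ch, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Only the code point label and the name are printed, so the output stays
# ASCII even on a console that cannot display the glyphs themselves.
for glyph, code_point, name in describe("A\u00eb"):
    print(code_point, name)

# The code space ends at 0x10FFFF; chr() rejects anything above it.
try:
    chr(0x110000)
except ValueError as exc:
    print("out of range:", exc)
```

Note that `len()` on such a string counts code points, not bytes and not necessarily what a reader would perceive as characters.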
  4. Unicode - Save that to disk
    • A rule to translate a Unicode string to bytes is called an encoding.
    The same way we encode/decode letters into sound, the Unicode encodings allow us to encode/decode into bytes
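To make the idea of an encoding as a rule concrete, here is a small sketch, assuming Python 3. It applies several standard codecs to the same two code points and shows that each rule yields different bytes, and that some rules cannot represent the text at all. `encode_all` is a hypothetical helper introduced only for illustration.

```python
def encode_all(text, encodings=("utf-8", "utf-16-le", "latin-1", "ascii")):
    """Map each encoding name to the bytes it yields, or None if it cannot encode text."""
    result = {}
    for name in encodings:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            # This encoding has no byte sequence for at least one code point.
            result[name] = None
    return result


# U+6C49 U+8BED, the two code points used in the JSON slides of this deck.
encoded = encode_all("\u6c49\u8bed")
for name, data in encoded.items():
    print(name, data)

# Decoding with the same rule restores the original code points.
assert encoded["utf-8"].decode("utf-8") == "\u6c49\u8bed"
```

ASCII only covers 0-127 and Latin-1 only 0-255, which is why both report `None` here while the UTF encodings succeed.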
  5. Take home
    • Bytes vs Unicode – There is no “string”
    • Encodings – Know your stuff
    • Python2 vs Python3
    • Unicode Sandwich
    • Test with Unicode
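The "Unicode Sandwich" is usually described as: decode bytes as soon as they enter the program, work only with text inside, and encode again at the last moment on the way out. A minimal sketch under that reading, assuming Python 3 and UTF-8 at both edges; `shout` is a hypothetical function and the in-memory streams stand in for real files or sockets.

```python
import io


def shout(raw_in, raw_out, encoding="utf-8"):
    """Read bytes, upper-case the text, write bytes back out."""
    text = raw_in.read().decode(encoding)      # bottom slice: bytes -> text
    processed = text.upper()                   # filling: text only, no bytes
    raw_out.write(processed.encode(encoding))  # top slice: text -> bytes


# Exercising the function with non-ASCII input, as "Test with Unicode" suggests.
source = io.BytesIO("caf\u00e9".encode("utf-8"))
sink = io.BytesIO()
shout(source, sink)
print(sink.getvalue())
```

Keeping the encoding as a parameter at the boundary is one way to respect "Encodings – Know your stuff": the choice is explicit rather than left to a platform default.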
  6. Interpreter
    “SyntaxError: Non-ASCII character '\xeb' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details”
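The message on this slide comes from Python 2, which assumes ASCII source files unless a PEP 263 declaration says otherwise; Python 3 assumes UTF-8 instead. The sketch below, assuming Python 3, shows the declaration doing its job: the source is built as raw bytes so that the literal really contains the single byte 0xEB from the error message. The choice of `latin-1` is an assumption made for the example; a real file should declare whatever encoding the editor actually saved it in.

```python
# Source bytes as they would sit on disk. The first line is the PEP 263
# declaration; the string literal on the second line holds the raw byte 0xEB.
declared = b"# -*- coding: latin-1 -*-\nname = '\xeb'\n"

namespace = {}
# compile() accepts bytes and honors the declared source encoding.
exec(compile(declared, "test.py", "exec"), namespace)

# ascii() keeps the output printable on any console.
print(ascii(namespace["name"]))
```

The byte 0xEB is decoded through the declared encoding into the code point U+00EB, so the interpreter never has to guess.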
  7. Terminal setup
    “UnicodeEncodeError: 'charmap' codec can't encode character u'\u1234' in position 0: character maps to <undefined>”
    This means your console is not set up to display Unicode.
    See https://wiki.python.org/moin/PrintFails
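This failure can be reproduced without touching the real terminal by wrapping an in-memory buffer in a text stream with a limited encoding. The sketch assumes Python 3; `write_as_console` is a hypothetical helper, and `cp1252` is picked only because it is one of the "charmap" codecs commonly seen on Windows consoles.

```python
import io


def write_as_console(text, encoding, errors="strict"):
    """Write text to an in-memory stand-in for a console and return the bytes."""
    buffer = io.BytesIO()
    console = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
    console.write(text)
    console.flush()
    return buffer.getvalue()


# A cp1252 "console" has no byte for U+1234, so strict encoding fails.
try:
    write_as_console("\u1234", "cp1252")
except UnicodeEncodeError as exc:
    print("failed:", exc)

# An error handler degrades gracefully instead of raising.
print(write_as_console("\u1234", "cp1252", errors="backslashreplace"))

# A UTF-8 "console" can represent the character directly.
print(write_as_console("\u1234", "utf-8"))
```

The error handler only hides the symptom; the actual fix is still to configure the console for an encoding that covers the text being printed.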
  8. JSON
    JSON module is broken (yes and no)
    When parsing in JavaScript I get: "æ± è¯"
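Garbage of this shape typically appears when UTF-8 bytes are read back with a single-byte encoding. The sketch below, assuming Python 3, reproduces the visible characters by decoding the UTF-8 bytes of U+6C49 U+8BED as Latin-1. That this is exactly how the slide's output was produced is an assumption; `garble` and `repair` are hypothetical names.

```python
def garble(text):
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo garble() by reversing both steps."""
    return garbled.encode("latin-1").decode("utf-8")


original = "\u6c49\u8bed"
garbled = garble(original)

# Six code points come back, one per UTF-8 byte; two of them (U+0089 and
# U+00AD) are invisible, which is why only four glyphs show up on screen.
print(len(garbled), ascii(garbled))

assert repair(garbled) == original
```

The round trip only works because Latin-1 maps every byte to a code point; with a lossy mis-decoding the original text may not be recoverable.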
  9. JSON
    Input | ensure_ascii | Output
    u"汉语" | True | '"\\u6c49\\u8bed"'
    u"汉语" | False | u'"\u6c49\u8bed"'
    u"汉语".encode("utf-8") | True | '"\\u6c49\\u8bed"'
    u"汉语".encode("utf-8") | False | '"\xe6\xb1\x89\xe8\xaf\xad"'
    [u"汉语", u"汉语".encode("utf-8")] | True | '["\\u6c49\\u8bed", "\\u6c49\\u8bed"]'
    [u"汉语", u"汉语".encode("utf-8")] | False | UnicodeDecodeError
    There is no binary in the JSON format!
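The table above reflects Python 2, where `json.dumps` accepted byte strings and guessed at their encoding. As a point of comparison, here is a short sketch assuming Python 3, where the `json` module handles text with either `ensure_ascii` setting but refuses bytes outright, which matches the slide's closing remark.

```python
import json

text = "\u6c49\u8bed"

# Default (ensure_ascii=True): non-ASCII code points become \uXXXX escapes.
escaped = json.dumps(text)
print(escaped)

# ensure_ascii=False: the code points are kept as they are in the result.
raw = json.dumps(text, ensure_ascii=False)
print(ascii(raw))

# Both forms parse back to the same text.
assert json.loads(escaped) == json.loads(raw) == text

# Bytes are rejected instead of being silently decoded.
try:
    json.dumps(text.encode("utf-8"))
except TypeError as exc:
    print("rejected:", exc)
```

Under this behavior the caller has to decode bytes to text (or pick an explicit representation for binary data) before serializing, so a mixed list like the last rows of the table cannot slip through.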