
Python, Locales and Writing Systems - PyCon US, 12th May 2018

Python 3 removes a lot of the confusion around Unicode handling in Python, but that by no means fixes everything. Different locales and writing systems have unique behaviours that can trip you up. Here are some of the worst ones and how to handle them correctly.

Presented at PyCon US 2018: https://us.pycon.org/2018/schedule/presentation/106/

Turkish i and ı:
- http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
- http://www.i18nguy.com/unicode/turkish-i18n.html
- https://pypi.python.org/pypi/PyICU

Right-to-left writing systems:
- https://github.com/mpcabd/python-arabic-reshaper

Korean text:
- http://www.gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html
- https://docs.python.org/2/library/unicodedata.html

Security:
- http://howto.hackallthethings.com/2016/06/using-multi-byte-characters-to-nullify.html
- http://www.rafayhackingarticles.net/2016/08/google-chrome-firefox-address-bar.html
- https://www.xudongz.com/blog/2017/idn-phishing/
- https://www.theguardian.com/technology/2017/apr/19/phishing-url-trick-hackers
- https://it.slashdot.org/story/02/05/28/0142248/spoofing-urls-with-unicode

Further links:
http://lukas-prokop.at/talks/pydays18-unicode/#1 (a great exploration of Unicode's nooks and crannies)


Rae Knowler

May 12, 2018


  1. Python, Locales and Writing Systems Rae Knowler PyCon US May

  2. Who I am

    @RaeKnowler. Python (Django, CKAN), PHP, Go, JavaScript. they/them/their. https://www.flickr.com/photos/zurichtourism/5160475075
  3. @RaeKnowler #PyCon2018

  4. Python 3 is great

    Unicode by default! Source file encoding assumed to be UTF-8. No need to specify u'foobar' for non-ASCII strings.
  5. Python 3 is great

    Less of this:
  6. Turkish i and ı

    http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail Emine Çalçoban, Ramazan Çalçoban
  7. Turkish i and ı

    Dotless: 'ı' (U+0131), 'I' (U+0049). Dotted: 'i' (U+0069), 'İ' (U+0130). More details here: http://www.i18nguy.com/unicode/turkish-i18n.html
  8. Turkish i and ı

  9. Turkish i and ı

  10. Turkish i and ı - Solutions

    • PyICU: a Python extension wrapping IBM’s International Components for Unicode C++ library (ICU). https://pypi.python.org/pypi/PyICU
    • Or… make a translation table and use str.translate() to replace characters when changing the case
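
The translation-table idea from slide 10 could look roughly like the sketch below. The names `TR_LOWER`, `TR_UPPER`, `turkish_lower` and `turkish_upper` are hypothetical, and the tables cover only the four letters listed on slide 7. It is a minimal illustration, not a replacement for a locale-aware library such as PyICU.

```python
# Map the Turkish-specific letters first, before the default
# (locale-unaware) case conversion can pick the wrong counterpart.
TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def turkish_lower(text):
    """Lowercase text using the Turkish dotted/dotless i pairs."""
    return text.translate(TR_LOWER).lower()


def turkish_upper(text):
    """Uppercase text using the Turkish dotted/dotless i pairs."""
    return text.translate(TR_UPPER).upper()


# The built-in methods ignore the Turkish pairs:
print("ISPARTA".lower())          # isparta
print(turkish_lower("ISPARTA"))   # ısparta
print("istanbul".upper())         # ISTANBUL
print(turkish_upper("istanbul"))  # İSTANBUL
```

The tables are applied before `.lower()` and `.upper()` so that the remaining characters still go through Python's normal case mapping.
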
  11. Right-to-left writing systems

    https://en.wikipedia.org/wiki/File:Simtat_Aluf_Batslut.JPG
  12. Right-to-left writing systems

    Unicode wants characters ordered logically, not visually → we need bidirectional (bidi) support → pip install python-bidi
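
To make "ordered logically" concrete, the sketch below uses only the standard library's `unicodedata` module. It does not reorder anything for display, which is the job of a bidi implementation such as the python-bidi package named on the slide. The sample word and variable names are assumptions chosen for illustration.

```python
import unicodedata

# The Hebrew word "shalom", written with escapes so that the order of the
# code points in the source is unambiguous: shin, lamed, vav, final mem.
word = "\u05e9\u05dc\u05d5\u05dd"

# Index 0 is the first letter as read (logical order), even though a
# right-to-left renderer draws it at the right-hand edge.
first = word[0]
print(unicodedata.name(first))  # HEBREW LETTER SHIN

# Each character carries a bidi class that a display algorithm consults:
# "R" for Hebrew letters, "AL" for Arabic letters, "L" for Latin letters.
for ch in (first, "\u0627", "a"):
    print(unicodedata.name(ch), unicodedata.bidirectional(ch))
```
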
  13. Right-to-left writing systems

  14. Right-to-left writing systems

    Arabic letters have contextual forms. Their placement in the text changes their shape. https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Contextual_forms
  15. Right-to-left writing systems

    → Python Arabic Reshaper to the rescue: https://github.com/mpcabd/python-arabic-reshaper
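
Unicode also encodes the contextual shapes as separate "presentation form" code points, which the standard library can show. The sketch below looks up the four shapes of one letter (beh) and folds them back to the base letter with NFKC. This goes in the opposite direction to a reshaper and is not a description of how python-arabic-reshaper works; the variable names are illustrative.

```python
import unicodedata

BASE_BEH = "\u0628"  # ARABIC LETTER BEH, the code point normally stored in text

forms = {}
for form in ("ISOLATED", "FINAL", "INITIAL", "MEDIAL"):
    # Presentation forms are looked up by their official character names.
    shaped = unicodedata.lookup(f"ARABIC LETTER BEH {form} FORM")
    forms[form] = shaped
    folded = unicodedata.normalize("NFKC", shaped)
    print(f"U+{ord(shaped):04X} {form:<8} -> U+{ord(folded):04X}")
```
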
  16. Right-to-left writing systems

  17. Fullwidth and halfwidth characters

    Notice any difference? --- The quick brown fox jumped over  the lazy dog. The quick brown fox jumped over the lazy dog.
  18. Fullwidth and halfwidth characters

    Some fonts don't even bother styling the fullwidth characters. --- The quick brown fox jumped over  the lazy dog. The quick brown fox jumped over the lazy dog.
  19. Fullwidth and halfwidth characters

    假借字, 形声字 Han characters (in Chinese, Japanese, Korean) are fullwidth
  20. Fullwidth and halfwidth characters

    假借字, 形声字 ミムメモヤユヨラリルレロワン ミムメモヤユヨラリルレロワン There are fullwidth and halfwidth kana (Japanese)
  21. Fullwidth and halfwidth characters

    假借字, 形声字 ミムメモヤユヨラリルレロワン ミムメモヤユヨラリルレロワン なにぬねのは Hiragana (Japanese) are always fullwidth
  22. Fullwidth and halfwidth characters

    Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. https://www.w3.org/2007/02/japanese-layout/docs/aligned/japanese-layout-requirements-en.html
  23. Fullwidth and halfwidth characters

    pip install jaconv
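
The slide points to jaconv for converting between widths. As a standard-library alternative (a swapped-in technique, not what the slide shows), the sketch below inspects widths with `unicodedata.east_asian_width` and folds them with NFKC normalisation. `to_fullwidth` is a hypothetical helper written for this example, and NFKC folds more than width (Roman numerals and ligatures, for instance), so it may be too aggressive for some data.

```python
import unicodedata


def to_fullwidth(text):
    """Hypothetical helper: shift printable ASCII into the fullwidth block."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")               # IDEOGRAPHIC SPACE
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))  # U+FF01..U+FF5E
        else:
            out.append(ch)
    return "".join(out)


plain = "The quick brown fox"
wide = to_fullwidth(plain)

print(wide == plain)                                 # False
print(unicodedata.east_asian_width(plain[0]))        # Na
print(unicodedata.east_asian_width(wide[0]))         # F
print(unicodedata.normalize("NFKC", wide) == plain)  # True

# Katakana exists in both widths; NFKC maps halfwidth to the regular form.
half_mi = unicodedata.lookup("HALFWIDTH KATAKANA LETTER MI")
full_mi = unicodedata.lookup("KATAKANA LETTER MI")
print(unicodedata.east_asian_width(half_mi))              # H
print(unicodedata.east_asian_width(full_mi))              # W
print(unicodedata.normalize("NFKC", half_mi) == full_mi)  # True
```
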

  24. Fullwidth and halfwidth characters

  25. Korean text

    https://en.wikipedia.org/wiki/Hangul#/media/File:Hangeul.svg
  26. Korean text

    Lots more detail here: http://www.gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html
  27. Korean text

    Unicode canonical equivalence: You can build the same character in several different ways, and they mean the same thing. 한 means the same as ㅎㅏㄴ
  28. Korean text

    Unicode canonical equivalence: You can build the same character in several different ways, and they mean the same thing. 한 means the same as ㅎㅏㄴ. Normal Form D (NFD): ㅎㅏㄴ. Normal Form C (NFC): 한
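
A short sketch of the two normal forms, using the standard library's `unicodedata.normalize`. The helper `same_text` is a hypothetical name. Note that NFD produces the conjoining jamo code points (U+1112, U+1161, U+11AB), which may differ from the standalone jamo a slide or font displays.

```python
import unicodedata

syllable = "\ud55c"  # 한 as one precomposed code point (already NFC)

decomposed = unicodedata.normalize("NFD", syllable)
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(syllable), len(decomposed))             # 1 3
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+1112', 'U+1161', 'U+11AB']
print(syllable == decomposed)                     # False
print(recomposed == syllable)                     # True


def same_text(a, b):
    """Hypothetical helper: compare after normalising both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(syllable, decomposed))            # True
```

Normalising both sides before comparing, hashing or storing is one way to keep canonically equivalent strings from being treated as different.
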
  29. Korean text

    Unicode compatibility equivalence: There are multiple code points for identical characters, for backwards compatibility reasons. U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I) (https://docs.python.org/2/library/unicodedata.html)
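
The Roman numeral example can be checked with the same module. This sketch only shows that the compatibility forms (NFKC/NFKD) fold U+2160 into U+0049 while the canonical form NFC leaves it alone; the variable names are illustrative.

```python
import unicodedata

roman_one = "\u2160"
latin_i = "\u0049"

print(unicodedata.name(roman_one))  # ROMAN NUMERAL ONE
print(unicodedata.name(latin_i))    # LATIN CAPITAL LETTER I

# Different code points, so a plain comparison fails.
print(roman_one == latin_i)                                 # False
# Canonical normalisation keeps them distinct ...
print(unicodedata.normalize("NFC", roman_one) == latin_i)   # False
# ... while compatibility normalisation treats them as the same.
print(unicodedata.normalize("NFKC", roman_one) == latin_i)  # True
```
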
  30. Korean text

  31. Korean text

  32. Korean text

  33. Korean text

  34. Korean text

  35. Security

    This is a huge topic! A couple of quick examples...
  36. Security - SQL Injection

    User input: I don't like raisins. Sanitised user input: 'I don\'t like raisins'. Hex encoding of \ is 0x5C
  37. Security - SQL Injection

    Hex encoding for 稞: 0xb8 0x5c. User input: 0xb8' OR 1=1. Sanitised user input: '稞' OR 1=1' (the escaping backslash is absorbed into 稞, so the quote is no longer escaped)
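
The byte-level mechanics can be reproduced without a database. The slide does not name the character encoding; the sketch below assumes Big5, in which 0xB8 0x5C is a valid two-byte character. `naive_escape` is a hypothetical stand-in for a sanitiser that works on raw bytes without knowing the encoding.

```python
def naive_escape(raw):
    """Hypothetical byte-level sanitiser: put a backslash before each quote."""
    return raw.replace(b"'", b"\\'")


user_input = b"\xb8' OR 1=1"
escaped = naive_escape(user_input)
print(escaped)          # b"\xb8\\' OR 1=1"

# Assumption: the database connection interprets these bytes as Big5.
# The lead byte 0xB8 pairs with the inserted backslash (0x5C), so the
# backslash becomes part of a character instead of escaping the quote.
decoded = escaped.decode("big5")
print(decoded)
print("\\" in decoded)  # False
```

Parameterised queries, together with a connection encoding that client and server agree on, generally avoid this class of problem because the input is never spliced into the SQL text.
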
  38. Security - SQL Injection

    More details here: http://howto.hackallthethings.com
  39. Security - Address Bar Spoofing

    A nice google.com link: http://google.com/test/test/test/تارﺎﻣا.ﻲﺑﺮﻋ This actually led to: http://تارﺎﻣا.ﻲﺑﺮﻋ/google.com/test/test/test
  40. Security - Address Bar Spoofing

    More details here: http://www.rafayhackingarticles.net/2016/08/google-chrome-firefox-address-bar.html
  41. Security - Unicode characters in urls

    https://аррӏе.com vs
  42. Security - Unicode characters in urls

    https://www.xn--80ak6aa92e.com/ vs
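
The `xn--` label on slide 42 is the Punycode (ASCII-compatible) spelling of the look-alike domain. A minimal sketch, using the `punycode` codec from the standard library, decodes the label and prints the name of each character so the script becomes visible. The variable names are illustrative.

```python
import unicodedata

label = "xn--80ak6aa92e"  # from the URL on slide 42

# Strip the "xn--" prefix, then decode the rest with the stdlib codec.
decoded = label[len("xn--"):].encode("ascii").decode("punycode")

print(decoded == "apple")  # False, despite looking similar on screen
for ch in decoded:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Checking the script of each character, or showing the `xn--` form to users, is one way an application might flag such names; which policy is appropriate depends on the application.
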

  43. Security - Unicode characters in urls

    Xudong Zheng, Phishing with Unicode Domains: https://www.xudongz.com/blog/2017/idn-phishing/ Safari, Edge and Chrome: show an alert. Firefox: see Zheng's page for a fix
  44. Security - Unicode characters in urls

    Unicode trick lets hackers hide phishing URLs (The Guardian, April 2017) https://www.theguardian.com/technology/2017/apr/19/phishing-url-trick-hackers
  45. Security - Unicode characters in urls

    Spoofing URLs with Unicode (Slashdot, May 2002) https://it.slashdot.org/story/02/05/28/0142248/spoofing-urls-with-unicode
  46. Conclusions

    This stuff isn't easy … but it is interesting! There are a lot of useful libraries out there. You won't be the first person to have your particular problem. Python 3 makes dealing with Unicode a lot easier.
  47. Further links

    • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): http://www.joelonsoftware.com/articles/Unicode.html
    • Unicode, or why py3k was necessary: http://lukas-prokop.at/talks/pydays18-unicode/#1
    • Dark corners of Unicode: https://eev.ee/blog/2015/09/12/dark-corners-of-unicode
  48. Further links

    • I Can Text You A Pile of Poo, But I Can’t Write My Name: https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name
    • Nope, Not Arabic: http://nopenotarabic.tumblr.com/
    • Symbol Codes, Computing with Accents, Symbols and Foreign Scripts: http://sites.psu.edu/symbolcodes/
  49. Thanks!

    @RaeKnowler rae.knowler@liip.ch