Slide 1

Slide 1 text

Python, Locales and Writing Systems Rae Knowler PyCon US May 2018

Slide 2

Slide 2 text

Who I am @RaeKnowler Python (Django, CKAN) PHP, Go, JavaScript they/them/their https://www.flickr.com/photos/zurichtourism/5160475075

Slide 3

Slide 3 text

@RaeKnowler #PyCon2018

Slide 4

Slide 4 text

Python 3 is great Unicode by default! Source file encoding is assumed to be UTF-8. No need to specify u'foobar' for non-ASCII strings.
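A minimal sketch of what "Unicode by default" looks like in practice. The sample string is an illustrative assumption, not taken from the talk.

```python
# In Python 3 every str is a sequence of Unicode code points,
# and the source file itself is read as UTF-8 unless declared otherwise.
greeting = "Grüezi, 世界"

# The u'' prefix is still accepted, but it no longer changes anything.
assert u"Grüezi, 世界" == greeting

# len() counts code points; the UTF-8 encoding needs more bytes than that.
code_points = len(greeting)
utf8_bytes = greeting.encode("utf-8")

print(code_points, len(utf8_bytes))
```

Bytes only appear at the edges (files, sockets), which is where the explicit encode and decode calls belong.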

Slide 5

Slide 5 text

Python 3 is great Less of this:

Slide 6

Slide 6 text

Turkish i and ı http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail Emine Çalçoban Ramazan Çalçoban

Slide 7

Slide 7 text

Turkish i and ı Dotless: 'ı' (U+0131), 'I' (U+0049) Dotted: 'i' (U+0069), 'İ' (U+0130) More details here: http://www.i18nguy.com/unicode/turkish-i18n.html
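A short sketch of why these four code points cause trouble: Python's built-in case methods use the default, locale-independent Unicode mappings, so they never produce the Turkish pairs. The helper name `code_points` is made up for this example.

```python
def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

DOTLESS_SMALL = "\u0131"   # ı
DOTTED_CAPITAL = "\u0130"  # İ

# Default mappings: i <-> I, regardless of the language of the text.
upper_of_dotted = "i".upper()                    # 'I' (U+0049), Turkish expects U+0130
lower_of_dotless_cap = "I".lower()               # 'i' (U+0069), Turkish expects U+0131
upper_of_dotless = DOTLESS_SMALL.upper()         # 'I' (U+0049)
lower_of_dotted_cap = DOTTED_CAPITAL.lower()     # 'i' plus U+0307 COMBINING DOT ABOVE

print(code_points(lower_of_dotted_cap))
# A round trip silently changes the letter: ı -> I -> i
print(code_points(DOTLESS_SMALL.upper().lower()))
```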

Slide 8

Slide 8 text

Turkish i and ı

Slide 9

Slide 9 text

Turkish i and ı

Slide 10

Slide 10 text

Turkish i and ı - Solutions ● PyICU: a Python extension wrapping IBM’s International Components for Unicode C++ library (ICU). https://pypi.python.org/pypi/PyICU ● Or… make a translation table and use str.translate() to replace characters when changing the case
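The translation-table option could look roughly like this. It is a sketch using only the standard library; the function and table names are hypothetical, and it only handles the two Turkish i pairs, not full locale-aware casing.

```python
DOTLESS_SMALL = "\u0131"   # ı
DOTTED_CAPITAL = "\u0130"  # İ

# Swap the Turkish-specific letters first, then let the default mappings do the rest.
TURKISH_TO_UPPER = str.maketrans({"i": DOTTED_CAPITAL, DOTLESS_SMALL: "I"})
TURKISH_TO_LOWER = str.maketrans({"I": DOTLESS_SMALL, DOTTED_CAPITAL: "i"})

def turkish_upper(text):
    """Uppercase text, mapping i -> İ and ı -> I."""
    return text.translate(TURKISH_TO_UPPER).upper()

def turkish_lower(text):
    """Lowercase text, mapping I -> ı and İ -> i."""
    return text.translate(TURKISH_TO_LOWER).lower()

print(turkish_upper("istanbul"))
print(turkish_lower("ISPARTA"))
```

Translating before calling upper() or lower() matters: doing it afterwards would be too late, because the default mapping has already merged the dotted and dotless letters.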

Slide 11

Slide 11 text

Right-to-left writing systems https://en.wikipedia.org/wiki/File:Simtat_Aluf_Batslut.JPG

Slide 12

Slide 12 text

Right-to-left writing systems Unicode wants characters ordered logically, not visually → we need bidirectional (bidi) support → pip install python-bidi
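To make "logical, not visual" concrete, here is a standard-library sketch (it does not use python-bidi). The Hebrew sample word is an assumption chosen for illustration. Each character carries a bidirectional class, and a bidi implementation uses those classes to work out the display order.

```python
import unicodedata

# "shalom" stored in logical (reading) order: shin, lamed, vav, final mem.
shalom = "\u05e9\u05dc\u05d5\u05dd"  # שלום
mixed = "abc " + shalom

# Index 0 is the first letter you read, even though it is drawn rightmost.
first_letter = unicodedata.name(shalom[0])
last_letter = unicodedata.name(shalom[-1])

# 'L' = left-to-right, 'R' = right-to-left, 'WS' = whitespace.
classes = [unicodedata.bidirectional(ch) for ch in mixed]

print(first_letter, last_letter)
print(classes)
```

Reordering for display is only needed when the output target (a PDF canvas, an image, some terminals) does not apply the bidi algorithm itself.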

Slide 13

Slide 13 text

Right-to-left writing systems

Slide 14

Slide 14 text

Right-to-left writing systems Arabic letters have contextual forms: a letter's position within a word changes its shape. https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Contextual_forms
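A standard-library sketch of the idea, using the letter BEH as an assumed example. Text normally stores the abstract letter; Unicode also has separate "presentation form" code points for each shape, and compatibility normalization folds them back to the abstract letter.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter stored in normal text

# The four contextual shapes of BEH in the Arabic Presentation Forms-B block.
presentation_forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

names = [unicodedata.name(ch) for ch in presentation_forms]

# NFKC maps every presentation form back to the abstract letter.
folded = [unicodedata.normalize("NFKC", ch) for ch in presentation_forms]

for name in names:
    print(name)
```

A renderer with shaping support picks the right form on its own. Substituting presentation forms by hand is only needed when the renderer cannot shape text.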

Slide 15

Slide 15 text

Right-to-left writing systems → Python Arabic Reshaper to the rescue: https://github.com/mpcabd/python-arabic-reshaper

Slide 16

Slide 16 text

Right-to-left writing systems

Slide 17

Slide 17 text

Fullwidth and halfwidth characters Notice any difference? --- The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.

Slide 18

Slide 18 text

Fullwidth and halfwidth characters Some fonts don't even bother styling the fullwidth characters. --- The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.

Slide 19

Slide 19 text

Fullwidth and halfwidth characters 假借字, 形声字 Han characters (in Chinese, Japanese, Korean) are fullwidth

Slide 20

Slide 20 text

Fullwidth and halfwidth characters 假借字, 形声字 ミムメモヤユヨラリルレロワン ミムメモヤユヨラリルレロワン There are fullwidth and halfwidth kana (Japanese)

Slide 21

Slide 21 text

Fullwidth and halfwidth characters 假借字, 形声字 ミムメモヤユヨラリルレロワン ミムメモヤユヨラリルレロワン なにぬねのは Hiragana (Japanese) are always fullwidth
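The width of a character is recorded in the Unicode database, so the claims above can be checked with the standard library. This is a sketch; the sample characters are picked for illustration and written as escapes so the code points are unambiguous.

```python
import unicodedata

samples = {
    "A": "ASCII letter A",
    "\uff21": "fullwidth Latin A",
    "\u5047": "Han character (first character of the Han example)",
    "\u30df": "katakana MI",
    "\uff90": "halfwidth katakana MI",
    "\u306a": "hiragana NA",
}

# 'Na' = narrow, 'F' = fullwidth, 'W' = wide, 'H' = halfwidth.
widths = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {widths[ch]:>2} {description}")
```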

Slide 22

Slide 22 text

Fullwidth and halfwidth characters Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. https://www.w3.org/2007/02/japanese-layout/docs/aligned/japanese-layout-requirements-en.html

Slide 23

Slide 23 text

Fullwidth and halfwidth characters pip install jaconv
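For comparison, a rough standard-library approximation of this kind of width conversion. It does not use jaconv and is not a replacement for it; the function names are hypothetical. Note that NFKC changes many other compatibility characters too, so it is a blunt tool.

```python
import unicodedata

# Fullwidth ASCII variants U+FF01..U+FF5E sit exactly 0xFEE0 above ASCII 0x21..0x7E.
FULLWIDTH_TO_ASCII = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULLWIDTH_TO_ASCII[0x3000] = 0x20  # ideographic space -> ordinary space

def fullwidth_ascii_to_halfwidth(text):
    """Replace fullwidth Latin letters, digits and punctuation with ASCII."""
    return text.translate(FULLWIDTH_TO_ASCII)

def halfwidth_kana_to_fullwidth(text):
    """Fold halfwidth katakana to the ordinary forms via NFKC (affects other characters too)."""
    return unicodedata.normalize("NFKC", text)

print(fullwidth_ascii_to_halfwidth("\uff30\uff59\uff23\uff4f\uff4e\u3000\uff12\uff10\uff11\uff18"))
print(halfwidth_kana_to_fullwidth("\uff90\uff91"))
```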

Slide 24

Slide 24 text

Fullwidth and halfwidth characters

Slide 25

Slide 25 text

Korean text https://en.wikipedia.org/wiki/Hangul#/media/File:Hangeul.svg

Slide 26

Slide 26 text

Korean text Lots more detail here: http://www.gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html

Slide 27

Slide 27 text

Korean text Unicode canonical equivalence: You can build the same character in several different ways, and they mean the same thing. 한 means the same as ㅎㅏㄴ

Slide 28

Slide 28 text

Korean text Unicode canonical equivalence: You can build the same character in several different ways, and they mean the same thing. 한 means the same as ㅎㅏㄴ Normal Form D (NFD): ㅎㅏㄴ Normal Form C (NFC): 한
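A sketch of the two normal forms with the standard library's unicodedata module. NFD splits the syllable into three conjoining jamo; NFC puts it back together. The two strings compare as different until they are normalized to the same form.

```python
import unicodedata

han = "\ud55c"  # 한, one precomposed syllable

decomposed = unicodedata.normalize("NFD", han)        # three conjoining jamo
recomposed = unicodedata.normalize("NFC", decomposed)  # back to one code point

jamo_names = [unicodedata.name(ch) for ch in decomposed]

print(len(han), len(decomposed))
print(jamo_names)
print(han == decomposed, han == recomposed)
```

Normalizing both sides to the same form before comparing, hashing or storing text avoids surprises with input that looks identical on screen.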

Slide 29

Slide 29 text

Korean text Unicode compatibility equivalence: There are multiple code points for identical characters, for backwards compatibility reasons. U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I) (https://docs.python.org/2/library/unicodedata.html)
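The same example in code, as a small sketch. The canonical forms (NFC, NFD) leave the Roman numeral alone; only the compatibility forms (NFKC, NFKD) fold it to the Latin letter. The helper `loosely_equal` is a hypothetical name.

```python
import unicodedata

roman_one = "\u2160"  # ROMAN NUMERAL ONE
latin_i = "I"         # U+0049 LATIN CAPITAL LETTER I

nfc = unicodedata.normalize("NFC", roman_one)    # unchanged
nfkc = unicodedata.normalize("NFKC", roman_one)  # folded to 'I'

def loosely_equal(a, b):
    """Compare two strings under compatibility equivalence."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(nfc == roman_one, nfkc == latin_i)
print(loosely_equal("\u2167", "VIII"))  # U+2167 is ROMAN NUMERAL EIGHT
```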

Slide 30

Slide 30 text

Korean text

Slide 31

Slide 31 text

Korean text

Slide 32

Slide 32 text

Korean text

Slide 33

Slide 33 text

Korean text

Slide 34

Slide 34 text

Korean text

Slide 35

Slide 35 text

Security This is a huge topic! A couple of quick examples...

Slide 36

Slide 36 text

Security - SQL Injection User input: I don't like raisins Sanitised user input: 'I don\'t like raisins' Hex encoding of \ is 0x5C

Slide 37

Slide 37 text

Security - SQL Injection Hex encoding for 稞: 0xb8 0x5c User input: 0xb8' OR 1=1 Sanitised user input: '稞' OR 1=1'
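A sketch of the mechanism, under two assumptions that are not stated on the slide: the escaping function works on raw bytes, and the database interprets those bytes as Big5. `naive_escape` is a hypothetical stand-in for a charset-unaware sanitiser.

```python
def naive_escape(raw):
    """Escape backslashes and single quotes byte by byte, ignoring the charset."""
    return raw.replace(b"\\", b"\\\\").replace(b"'", b"\\'")

# The attacker sends a lone lead byte followed by a quote.
attack = b"\xb8' OR 1=1"

# The sanitiser inserts 0x5C (backslash) in front of the quote...
escaped = naive_escape(attack)

# ...but a multi-byte decoder (Big5 assumed here) reads 0xB8 0x5C as ONE character,
# so the backslash disappears and the quote is no longer escaped.
seen_by_database = escaped.decode("big5")

print(escaped)
print(seen_by_database)
```

Parameterised queries sidestep the problem, because the driver never builds the SQL string out of escaped user input.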

Slide 38

Slide 38 text

Security - SQL Injection More details here: http://howto.hackallthethings.com/2016/06/using-multi-byte-characters-to-nullify.html

Slide 39

Slide 39 text

Security - Address Bar Spoofing A nice google.com link: http://google.com/test/test/test/تارﺎﻣا.ﻲﺑﺮﻋ This actually led to: http://تارﺎﻣا.ﻲﺑﺮﻋ/google.com/test/test/test

Slide 40

Slide 40 text

Security - Address Bar Spoofing More details here: http://www.rafayhackingarticles.net/2016/08/google-chrome-firefox-address-bar.html

Slide 41

Slide 41 text

Security - Unicode characters in urls https://аррӏе.com vs https://apple.com

Slide 42

Slide 42 text

Security - Unicode characters in urls https://www.xn--80ak6aa92e.com/ vs https://apple.com
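The xn-- form is the Punycode (ASCII-compatible) encoding of the Unicode label. A small sketch with Python's built-in punycode codec shows what is hiding behind it; the script check via character names is a simplification chosen for this example.

```python
import unicodedata

ace_label = "xn--80ak6aa92e"

# Drop the "xn--" prefix and decode the remainder with the punycode codec.
label = ace_label[len("xn--"):].encode("ascii").decode("punycode")

# The first word of each character name tells us the script.
scripts = {unicodedata.name(ch).split()[0] for ch in label}
is_ascii = all(ord(ch) < 128 for ch in label)

for ch in label:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(scripts, is_ascii)
```

Every letter in the label is Cyrillic, which is why it can pass for "apple" in fonts that draw the two alphabets alike.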

Slide 43

Slide 43 text

Security - Unicode characters in urls Xudong Zheng, Phishing with Unicode Domains: https://www.xudongz.com/blog/2017/idn-phishing/ Safari, Edge and Chrome: show an alert. Firefox: see Zheng's page for a fix.

Slide 44

Slide 44 text

Security - Unicode characters in urls Unicode trick lets hackers hide phishing URLs (The Guardian, April 2017) https://www.theguardian.com/technology/2017/apr/19/phishing-url-trick-hackers

Slide 45

Slide 45 text

Security - Unicode characters in urls Spoofing URLs with Unicode (Slashdot, May 2002) https://it.slashdot.org/story/02/05/28/0142248/spoofing-urls-with-unicode

Slide 46

Slide 46 text

Conclusions This stuff isn't easy … but it is interesting! There are a lot of useful libraries out there. You won't be the first person to have your particular problem. Python 3 makes dealing with Unicode a lot easier.

Slide 47

Slide 47 text

Further links ● The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): http://www.joelonsoftware.com/articles/Unicode.html ● Unicode, or why py3k was necessary: http://lukas-prokop.at/talks/pydays18-unicode/#1 ● Dark corners of Unicode: https://eev.ee/blog/2015/09/12/dark-corners-of-unicode

Slide 48

Slide 48 text

Further links ● I Can Text You A Pile of Poo, But I Can’t Write My Name: https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name ● Nope, Not Arabic: http://nopenotarabic.tumblr.com/ ● Symbol Codes, Computing with Accents, Symbols and Foreign Scripts: http://sites.psu.edu/symbolcodes/

Slide 49

Slide 49 text

Thanks! @RaeKnowler rae.knowler@liip.ch