Unicode in python

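A brief introduction to the Unicode problems you run into in Python, covering both Python 2 and Python 3,
followed by some simple advice on writing multilingual Python programs.

The content builds on Ned Batchelder's PyCon 2012 talk, but covers more ground than he did.
http://nedbatchelder.com/text/unipain.html

This talk is aimed at beginners, so it contains no hard-to-follow Python code.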

birdhackor

April 19, 2014

Transcript

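  1. Unicode

    The idea behind Unicode is very simple. It works like a phone book: every number (the phone number) is assigned one character (the person's name). By convention, a Unicode value is written as U+ followed by a number in hexadecimal, ranging from 0000 to 10FFFF, for a total of 1,114,112 values. Each such value is called a code point, so Unicode can map at most 1,114,112 characters. The latest specification, Unicode 6.3 (September 2013), defines 110,122 characters (not counting control characters and the like), so only about one tenth of the code points are in use so far.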
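  2. Unicode

    Unicode divides the code points into many blocks, and each block tries to hold characters of the same kind. Because more characters may need to be added later, quite a few blocks do not yet have a character assigned to every code point.

    One fact that many people get wrong: "Unicode itself is not an encoding." Unicode only defines the mapping between numbers and characters. It does not say how a code point is represented in bytes. Strictly speaking, Unicode is a specification of a mapping table, not an encoding.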
  3. ℙƴ☂ℌøἤ

    U+2119: DOUBLE-STRUCK CAPITAL P
    U+01B4: LATIN SMALL LETTER Y WITH HOOK
    U+2602: UMBRELLA
    U+210C: BLACK-LETTER CAPITAL H
    U+00F8: LATIN SMALL LETTER O WITH STROKE
    U+1F24: GREEK SMALL LETTER ETA WITH PSILI AND OXIA
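As a small illustration of the code point listing above, the standard library's `unicodedata` module can look up the official name of each character. This is only a sketch in Python 3; the helper name `describe` is made up for this example.

```python
import unicodedata

# The six characters from the slide, written as escapes so the source stays ASCII.
text = "\u2119\u01b4\u2602\u210c\u00f8\u1f24"  # ℙƴ☂ℌøἤ


def describe(s):
    """Return one 'U+XXXX: NAME' string per code point in s (hypothetical helper)."""
    return ["U+{:04X}: {}".format(ord(ch), unicodedata.name(ch)) for ch in s]


for line in describe(text):
    print(line)
```

`ord()` gives the code point of a character as an integer, and `chr()` goes the other way.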
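  4. UTF-16 and UCS-2

    Put simply, UTF-16 is an encoding that represents Unicode characters in units of 16 bits.

    People who have not looked closely often assume that, because UTF-16 works in 16-bit units, it can only represent Unicode characters from 0 to 65535. That is a misunderstanding. The encoding that really is limited to characters 0 to 65535 is UCS-2, a subset of UTF-16.

    The biggest difference between the two is that UTF-16 is variable-length: a character that does not fit in one unit (two bytes) is represented with two units (four bytes). UCS-2 is fixed-length and always uses exactly one unit (two bytes) per character, so UCS-2 can represent at most the Unicode characters from 0 to 65535.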
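  5. Code points up to U+FFFF

    The number is stored directly in two bytes, and the stored value is exactly the code point's number.

    For example, "U+2602; ☂" is stored as "0010011000000010". The 2602 in the code point is hexadecimal. It is converted to the binary value 10011000000010, and the missing leading bits are padded with zeros to fill two bytes.

    As another example, "U+00F8; ø" is stored as "0000000011111000".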
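The two examples above can be reproduced with a few lines of Python 3. This is a minimal sketch; `utf16_bmp_bits` is a hypothetical helper name, and it deliberately rejects anything above U+FFFF as well as the surrogate range, which is reserved for the pair mechanism.

```python
def utf16_bmp_bits(ch):
    """Return the 16-bit binary string UTF-16 stores for a character up to U+FFFF."""
    cp = ord(ch)
    if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not directly representable in one 16-bit unit")
    # Zero-pad the binary form of the code point to 16 bits.
    return format(cp, "016b")


print(utf16_bmp_bits("\u2602"))  # U+2602 UMBRELLA
print(utf16_bmp_bits("\u00f8"))  # U+00F8 LATIN SMALL LETTER O WITH STROKE

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = "\u2602".encode("utf-16-be")
```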
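  6. Code points above U+FFFF

    Step 1: subtract 10000 (hexadecimal) from the code point's number. The result lies in the range 000000 to 0FFFFF, a 20-bit number. (Only U+010000 to U+10FFFF need this scheme, which is why subtracting 10000 gives the range 000000 to 0FFFFF.)

    Step 2: take the first 10 bits of the number from step 1 and add D800 (hexadecimal) to get the first unit, which occupies one unit of two bytes. This unit is called the high surrogate, and its value lies in the range D800 to DBFF.

    Step 3: take the 10 bits left over after step 2 (the last 10 bits of the number from step 1) and add DC00 to get the second unit. This unit is called the low surrogate, and its value lies in the range DC00 to DFFF.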
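  7. Code points above U+FFFF

    Here we walk through the procedure once using U+1030A.

    1030A minus 10000 gives 30A. In binary that is 1100001010, which is padded with zeros to 20 bits to give 00000000001100001010.

    The first 10 bits are 0000000000, or 0000 in hexadecimal. Adding D800 gives D800.

    The last 10 bits are 1100001010, or 030A in hexadecimal. Adding DC00 gives DF0A.

    So U+1030A is represented in UTF-16 as D800DF0A.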
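The surrogate-pair steps can be sketched in Python 3 as follows. The function name `to_surrogates` is an assumption made for this illustration; the final lines cross-check the hand computation for U+1030A against Python's built-in `utf-16-be` codec.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into (high surrogate, low surrogate)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = cp - 0x10000              # 20-bit value in 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)     # first 10 bits, range D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)    # last 10 bits, range DC00..DFFF
    return high, low


high, low = to_surrogates(0x1030A)
print("{:04X} {:04X}".format(high, low))

# Python's own encoder should agree with the manual result.
encoded = "\U0001030A".encode("utf-16-be")
print(encoded.hex())
```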
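  8. UTF-32 and UCS-4

    UTF-32 and UCS-4 both use four bytes per character. Four bytes can represent far more values than Unicode defines, so there is no variable-length issue: four bytes are always enough.

    The representation is very simple. As with the two-byte case in UTF-16, the code point's number is stored directly and padded with zeros to four bytes.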
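A short Python 3 sketch of this fixed-width layout: the code point is written as a zero-padded four-byte integer. The helper name `utf32_be` is hypothetical, and big-endian order is assumed here for readability.

```python
def utf32_be(ch):
    """Return the code point of ch as four big-endian bytes."""
    return ord(ch).to_bytes(4, "big")


for ch in ("\u00f8", "\u2602", "\U0001030A"):
    manual = utf32_be(ch)
    builtin = ch.encode("utf-32-be")  # Python's own encoder, without a BOM
    print("U+{:04X}".format(ord(ch)), manual.hex(), manual == builtin)
```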
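  9. Byte order

    Take U+1030A from the example above, which UTF-16 represents as D800DF0A. When it is stored, are the bytes laid out in memory as D8 00 DF 0A or as 00 D8 0A DF? That depends on the processor or on the network protocol.

    This talk uses Big Endian, because it matches the order in which we write the results of our arithmetic, which makes it easier to discuss and to read.

    To tell the user (or the program) how the data is ordered, a BOM marker can be added to the start of the data to identify the byte order.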
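The two byte orders and the role of the BOM can be observed directly in Python 3. This is a small sketch using the standard `utf-16-be`, `utf-16-le` and `utf-16` codecs and the `codecs.BOM_UTF16_LE` constant.

```python
import codecs

ch = "\U0001030A"

big = ch.encode("utf-16-be")      # D8 00 DF 0A
little = ch.encode("utf-16-le")   # 00 D8 0A DF
print(big.hex(), little.hex())

# Prefix the little-endian bytes with a BOM. The generic "utf-16" decoder
# reads the BOM, picks the matching byte order, and strips the marker.
with_bom = codecs.BOM_UTF16_LE + little
decoded = with_bom.decode("utf-16")
print(decoded == ch)
```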
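  10. UTF-8

    UTF-8 is a very clever encoding. It stores data with a variable-length scheme: code points 0 to 127 are each stored in a single byte, and higher code points are stored in two or more bytes.

    This design has several advantages. It saves space, and for code points 0 to 127 the ASCII and UTF-8 representations are identical. What does that mean? If you open an ASCII-encoded file assuming it is UTF-8, you get no garbled text and exactly the same result, so compatibility with existing files is not broken.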
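Both properties can be checked with a quick Python 3 sketch: ASCII text produces the same bytes under either codec, and the byte count per character grows with the code point. The sample strings are arbitrary choices for this illustration.

```python
sample = "Hello, Unicode!"

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")
same = ascii_bytes == utf8_bytes          # identical byte sequences
decoded = ascii_bytes.decode("utf-8")     # ASCII data read as UTF-8

# Bytes needed per character: U+0041, U+00F8, U+5929, U+1030A.
lengths = [len(ch.encode("utf-8")) for ch in "A\u00f8\u5929\U0001030A"]
print(same, decoded, lengths)
```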
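  11. UTF-8

    Look at whether the byte starts with 0 or 1. If it starts with 0, the byte is the same as in ASCII, maps to U+0000 to U+007F, and represents a single character by itself. If it starts with 1, the byte is one of the bytes of a multi-byte character.

    If the byte starts with 1, check its second bit. If that bit is 0, the byte is not the first byte of the character. For a character encoded in two bytes, the byte starting with 10 is the second one. For a character encoded in three bytes, a byte starting with 10 may be the second or the third, and so on.

    If the byte starts with 1 and the second bit is also 1, the byte is part of a multi-byte character and is always the leading byte. The number of leading 1 bits tells you how many bytes the character uses: a byte starting with 110 begins a two-byte character, and a byte starting with 1110 begins a three-byte character.

    Strip the leading marker bits described above, concatenate what remains in order, and you have the corresponding code point.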
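  12. UTF-8

    Suppose you have the data 00100011 11100101 10100100 10101001. How do you parse it?

    The first byte, 00100011, starts with 0, so it maps directly to ASCII. In hexadecimal it is 23, and looking that up in the ASCII table gives #.

    The second byte, 11100101, starts with 1, so it belongs to a multi-byte character. It starts with 1110, so it is the leading byte of a three-byte character, and the next two bytes should both start with 10. The third and fourth bytes do indeed start with 10, so these three bytes are parsed together as one character.

    Strip the 1110 marker from the second byte and the 10 markers from the third and fourth bytes, then concatenate the rest to get 0101100100101001. In hexadecimal that is 5929, so these three bytes represent U+5929, and the table shows that this code point is the character "天".

    So the UTF-8 data 00100011 11100101 10100100 10101001 represents #天.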
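The parsing rules can be turned into a small decoder. The Python 3 sketch below follows the bit-prefix logic literally; `decode_utf8` is a hypothetical helper that handles sequences of one to four bytes and, for brevity, does not reject overlong forms or encoded surrogates the way a production decoder would.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points by inspecting bit prefixes."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:                  # 0xxxxxxx: single byte, same as ASCII
            length, cp = 1, first
        elif first >> 5 == 0b110:         # 110xxxxx: leads a 2-byte character
            length, cp = 2, first & 0x1F
        elif first >> 4 == 0b1110:        # 1110xxxx: leads a 3-byte character
            length, cp = 3, first & 0x0F
        elif first >> 3 == 0b11110:       # 11110xxx: leads a 4-byte character
            length, cp = 4, first & 0x07
        else:                             # 10xxxxxx cannot start a character
            raise ValueError("invalid leading byte at index {}".format(i))
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated sequence at index {}".format(i))
        for byte in tail:
            if byte >> 6 != 0b10:         # every following byte must start with 10
                raise ValueError("expected continuation byte")
            cp = (cp << 6) | (byte & 0x3F)  # append the 6 payload bits
        code_points.append(cp)
        i += length
    return code_points


data = bytes([0b00100011, 0b11100101, 0b10100100, 0b10101001])
cps = decode_utf8(data)
text = "".join(chr(cp) for cp in cps)
print(["U+{:04X}".format(cp) for cp in cps], text)
```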
  13. unicode and str

    In Python 2.0, PEP 100 introduced the unicode type, and from then on Python 2.x had two kinds of strings.

    In Python 2.x, type str is a sequence of bytes, while type unicode is a sequence of code points.
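  14. Platform-dependence problems

    In the early days, Python's unicode type used an unsigned long array internally to store its data. Starting with PEP 261, Python let us choose with a compile-time option whether the underlying storage uses UCS2 or UCS4.

    As a result, the same unicode object could have different lengths on different interpreters. For example, a single character above U+FFFF has length 2 on a UCS2 build and length 1 on a UCS4 build.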
  15. str and bytes

    In Python 3, type str is a sequence of code points, while type bytes is a sequence of bytes.
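A quick way to see the difference between the two Python 3 types is to compare lengths and element types for one non-ASCII character. This is only an illustrative sketch.

```python
text = "\u5929"                 # one code point, U+5929
data = text.encode("utf-8")     # the same character as UTF-8 bytes

print(type(text).__name__, len(text))   # str: length counts code points
print(type(data).__name__, len(data))   # bytes: length counts bytes

first_char = text[0]            # indexing a str yields a 1-character str
first_byte = data[0]            # indexing bytes yields an int in 0..255
```

Mixing the two types, for example `text + data`, raises `TypeError` in Python 3 instead of converting implicitly.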
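  16. Python 3 troubles

    Wasted space.

    Before 3.3, as noted earlier when discussing UTF-8, storing strings as UTF-16 or UTF-32 takes up too much space, and different builds could also produce different results.

    PEP 393 introduced the "Flexible String Representation". Put simply, when storing a string internally, Python checks the largest code point it contains. If that code point needs more than one byte, the whole string is stored using the largest required number of bytes as the unit.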
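  17. Python 3 troubles

    For example, if a string consists of ASCII characters plus a single U+10FFFF, the whole str stores every code point in 4 bytes internally. If the whole string is ASCII only, every code point is stored in 1 byte.

    Besides saving space, there is another big benefit. When comparing strings, if their code points use different numbers of bytes, the two strings must be different and their contents do not even need to be compared.

    From 3.3 onward, different build options no longer lead to different internal encodings.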
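The effect of the flexible representation can be observed with `sys.getsizeof`. This sketch assumes CPython 3.3 or later; the exact byte counts are an implementation detail and vary between versions, so only the ordering of the three sizes is meaningful here.

```python
import sys

n = 1000
ascii_only = "a" * n                          # fits in 1 byte per code point
with_bmp = "a" * (n - 1) + "\u5929"           # widest code point needs 2 bytes
with_astral = "a" * (n - 1) + "\U0010FFFF"    # widest code point needs 4 bytes

# All three strings have the same length in code points...
lengths = [len(s) for s in (ascii_only, with_bmp, with_astral)]

# ...but on CPython 3.3+ the memory footprint grows with the widest code point.
sizes = [sys.getsizeof(s) for s in (ascii_only, with_bmp, with_astral)]
print(lengths, sizes)
```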
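  18. The cruelty of the real world

    So after version 3.3, have all the problems really been solved?

    The real world is cruel. For example, for the sake of compatibility and to comply with the Header specification, WSGI forces you to deal with utf8 and Latin 1 compatibility when encoding. Python 2 did not need to worry about this.

    PEP 383 did introduce surrogateescape for functions such as os.fsencode and os.fsdecode, which solved part of the problem. Unfortunately, it also means that when you write code later and run into functions of this kind, you have to be very careful about whether surrogateescape is handled, or you get an Error anyway. To some extent, the underlying problem was never really solved.

    On top of that, CPython's C API has become more and more frightening as the unicode-related features grow more complex, which is not good news for anyone who has to touch the C API.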
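The surrogateescape pitfall can be reproduced with plain `bytes.decode` and `str.encode` in Python 3. In this sketch the file name `raw` is an invented example of bytes that are valid Latin 1 but not valid UTF-8.

```python
raw = b"caf\xe9.txt"   # hypothetical file name in Latin 1, invalid as UTF-8

# With surrogateescape, the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")

# Forgetting the error handler later fails, because lone surrogates
# cannot be encoded as strict UTF-8.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ascii(name), roundtrip == raw, strict_failed)
```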
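  19. Recommendations

    Recommendation 1: "When doing I/O, always decode incoming bytes data. Inside the program, work with unicode only as far as possible. When writing data out, remember to encode it to bytes before interacting with the outside world."

    Recommendation 2: "Don't be put off by the effort. Take the time to make sure the types of the objects in your program are clear. If something is bytes, try to be sure which encoding it uses."

    Recommendation 3: "Pick at least three to five different kinds of languages and test your program with them."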
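Recommendation 1 can be sketched in Python 3 as a function that decodes at the input boundary, works on str internally, and encodes at the output boundary. The names `shout` and `process`, and the default of UTF-8, are assumptions made for this illustration; in real code the encoding should come from whatever the data source declares.

```python
def shout(text):
    """Internal logic: operates on str (code points) only."""
    return text.upper()


def process(raw, encoding="utf-8"):
    """Decode incoming bytes, process as str, and encode the result again."""
    text = raw.decode(encoding)       # bytes -> str at the input boundary
    result = shout(text)              # everything in between sees only str
    return result.encode(encoding)    # str -> bytes at the output boundary


incoming = "h\u00e9llo \u5929".encode("utf-8")
outgoing = process(incoming)
print(outgoing.decode("utf-8"))
```

In line with recommendation 3, the sample input mixes an accented Latin letter with a CJK character rather than using ASCII only.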