Unicode in python

A brief introduction to the Unicode problems you run into in Python, covering Python 2 and Python 3,
followed by some simple advice on writing multilingual Python programs.

The content builds on Ned Batchelder's talk at PyCon 2012, but covers more ground than he did.
http://nedbatchelder.com/text/unipain.html

This talk is aimed mainly at beginners, so it contains no Python code that is hard to follow.

birdhackor

April 19, 2014

Transcript

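  1. Unicode

    The idea behind Unicode is very simple. It works like a phone book: every number (the phone number) is assigned one corresponding character (the person's name). By specification, Unicode is written as U+number, where the number is in hexadecimal and ranges from 0000 to 10FFFF, 1,114,112 values in total.

    Each such value is called a code point. In other words, Unicode can map at most 1,114,112 characters. The latest version, Unicode 6.3 (September 2013), defines 110,122 characters (not counting control characters and the like), so only about one tenth of the code points have been used so far.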
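The notation and the size of the codespace can be checked directly in Python 3. This is a minimal sketch; the helper name `u_notation` is made up for illustration and is not part of the talk.

```python
# Highest code point in the Unicode codespace.
MAX_CODE_POINT = 0x10FFFF

# Counting from U+0000, the codespace holds 0x10FFFF + 1 code points.
total_code_points = MAX_CODE_POINT + 1


def u_notation(ch):
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    return "U+{:04X}".format(ord(ch))


umbrella = chr(0x2602)        # code point -> character
print(total_code_points)      # 1114112
print(u_notation(umbrella))   # U+2602
```

`ord` and `chr` are the built-in conversions between a character and its code point.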
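  2. Unicode

    Unicode divides the code points into many blocks, and each block tries to hold characters of the same kind. Several blocks leave some of their code points undefined so that characters can be added in the future.

    One fact that many people get wrong: "Unicode itself is not an encoding." Unicode only defines the mapping; it does not specify how a code point is represented in bytes. Strictly speaking, Unicode is a specification of a mapping table, not an encoding.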
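To make the distinction concrete, the sketch below takes one code point and encodes it three ways in Python 3. The code point stays the same; only the bytes differ. The choice of U+5929 as the sample character is an assumption for illustration.

```python
ch = "\u5929"  # one code point, U+5929

# The same code point produces different byte sequences per encoding.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}

for name in encodings:
    print(name, encoded[name].hex())
```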
  3. ℙƴ☂ℌøἤ

    U+2119: DOUBLE-STRUCK CAPITAL P
    U+01B4: LATIN SMALL LETTER Y WITH HOOK
    U+2602: UMBRELLA
    U+210C: BLACK-LETTER CAPITAL H
    U+00F8: LATIN SMALL LETTER O WITH STROKE
    U+1F24: GREEK SMALL LETTER ETA WITH PSILI AND OXIA
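The code points and official names on this slide can be looked up with the standard `unicodedata` module. A short sketch, written for Python 3:

```python
import unicodedata

# The six characters from the slide, written as escapes.
text = "\u2119\u01b4\u2602\u210c\u00f8\u1f24"

# Pair each character's U+ notation with its official Unicode name.
names = [("U+{:04X}".format(ord(ch)), unicodedata.name(ch)) for ch in text]

for code_point, name in names:
    print(code_point, name)
```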
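  4. UTF-16 and UCS-2

    Put simply, UTF-16 is an encoding that represents Unicode characters in units of 16 bits.

    People who have not looked closely often assume that, because UTF-16 works in 16-bit units, it can only represent Unicode characters 0~65535. That is a misunderstanding. The encoding that really is limited to Unicode characters 0~65535 is UCS-2, a subset of UTF-16.

    The biggest difference between UTF-16 and UCS-2 is that UTF-16 can vary the number of bits it uses for one "character". A character that does not fit in one unit (two bytes) is represented in UTF-16 with two units (four bytes in total). UCS-2 is fixed-length: it always uses one unit (two bytes) per character, so it can represent at most Unicode characters 0~65535.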
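  5. Code points up to U+FFFF

    The number is stored directly in two bytes, and its value is exactly the number of the code point.

    For example, "U+2602; ☂" is stored directly as "0010011000000010". The 2602 of the code point is hexadecimal; it is converted to binary 10011000000010, and the missing leading bits are padded with zeros to fill 2 bytes.

    Likewise, "U+00F8; ø" is stored as "0000000011111000".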
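The two bit patterns above can be reproduced with Python 3's built-in `utf-16-be` codec. This is a minimal sketch; `utf16_bits` is a hypothetical helper name. The last line also shows the variable-length behaviour that separates UTF-16 from UCS-2.

```python
def utf16_bits(ch):
    """Return the big-endian UTF-16 encoding of ch as a string of bits."""
    data = ch.encode("utf-16-be")
    return "".join("{:08b}".format(byte) for byte in data)


print(utf16_bits("\u2602"))  # 0010011000000010
print(utf16_bits("\u00f8"))  # 0000000011111000

# A code point above U+FFFF needs two 16-bit units, that is, four bytes.
print(len("\U0001030A".encode("utf-16-be")))  # 4
```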
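  6. Code points after U+FFFF

    First, subtract 10000 (hexadecimal) from the number of the code point. The result lies in the range 000000~0FFFFF, a number that takes 20 bits. (Only U+010000~U+10FFFF need this scheme, which is why subtracting 10000 gives the range 000000~0FFFFF.)

    Second, take the top 10 bits of that result and add D800 (hexadecimal) to get the first unit, which occupies one unit of two bytes. This unit is called the high surrogate, and its value lies in the range D800~DBFF.

    Third, take the remaining 10 bits (the low 10 bits of the first step's result) and add DC00 to get the second unit. This unit is called the low surrogate, and its value lies in the range DC00~DFFF.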
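  7. Code points after U+FFFF

    Here we walk through a real case using U+1030A.

    1030A minus 10000 gives 30A. In binary that is 1100001010; padding with zeros up to 20 bits gives 00000000001100001010.

    The top 10 bits are 0000000000, which is 0000 in hexadecimal; adding D800 gives D800.

    The low 10 bits are 1100001010, which is 030A in hexadecimal; adding DC00 gives DF0A.

    So U+1030A is represented in UTF-16 as D800DF0A.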
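The three steps can be sketched as a small Python 3 function and checked against the built-in codec. The function name `to_surrogate_pair` is an assumption made for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000        # step 1: a 20-bit number
    high = 0xD800 + (offset >> 10)       # step 2: top 10 bits + D800
    low = 0xDC00 + (offset & 0x3FF)      # step 3: low 10 bits + DC00
    return high, low


high, low = to_surrogate_pair(0x1030A)
print("{:04X}{:04X}".format(high, low))  # D800DF0A
```

The result matches `"\U0001030A".encode("utf-16-be")`, which yields the bytes D8 00 DF 0A.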
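  8. UTF-32 and UCS-4

    UTF-32 and UCS-4 both use four bytes per character. Four bytes can represent far more values than the whole of Unicode defines, so there is no variable-length problem: four bytes are always enough.

    The representation is simple. As with UTF-16 for code points up to U+FFFF, the number of the code point is stored directly, padded with zeros to four bytes.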
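A quick Python 3 check of the fixed width, using the built-in `utf-32-be` codec. The sample characters are picked for illustration only.

```python
samples = ["\u00f8", "\u2602", "\U0001030A"]

# Every code point becomes exactly four bytes holding its zero-padded number.
utf32 = {ch: ch.encode("utf-32-be") for ch in samples}

for ch in samples:
    print("U+{:04X}".format(ord(ch)), utf32[ch].hex())
```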
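  9. Byte order

    Take U+1030A from above, which is D800DF0A in UTF-16. When it is stored, following memory order, is it laid out as D8 00 DF 0A or as 00 D8 0A DF? That depends on the processor or on the network protocol.

    This discussion uses Big Endian, because it matches the order in which we write the results of our arithmetic, which makes it easier to discuss and to read.

    To tell the user (or a program) how the data is ordered, a BOM marker can be added at the start of the data to identify the byte order.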
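Both layouts, and the role of the BOM, can be seen with Python 3's codecs. A minimal sketch: the plain `utf-16` decoder reads the BOM and picks the byte order from it.

```python
import codecs

ch = "\U0001030A"

big = ch.encode("utf-16-be")      # bytes D8 00 DF 0A
little = ch.encode("utf-16-le")   # bytes 00 D8 0A DF

# Prefix the little-endian data with its BOM; the generic "utf-16"
# decoder uses the BOM to detect the order and then drops it.
with_bom = codecs.BOM_UTF16_LE + little
decoded = with_bom.decode("utf-16")

print(big.hex(), little.hex(), decoded == ch)
```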
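  10. UTF-8

    UTF-8 is a very clever encoding. It stores data with a variable-length scheme: code points 0-127 are each stored as one byte, and higher code points are stored as two or more bytes.

    This design has several advantages. It saves space, and for code points 0-127 the ASCII encoding and the UTF-8 encoding are exactly the same. What does that mean? If you assume a file is UTF-8 and open an ASCII-encoded file that way, there is no mojibake and the result displays identically, so compatibility with existing files is not broken.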
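Both properties are easy to confirm in Python 3. The sample string and characters below are chosen for illustration.

```python
ascii_text = "Unicode in python"

# For code points 0-127, ASCII and UTF-8 produce identical bytes.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Higher code points take two, three, or four bytes.
lengths = {
    "U+{:04X}".format(ord(ch)): len(ch.encode("utf-8"))
    for ch in ["#", "\u00f8", "\u5929", "\U0001030A"]
}

print(same_bytes)
print(lengths)
```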
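  11. UTF-8

    Look at whether the byte starts with 0 or 1. If it starts with 0, the byte is the same as in ASCII and maps to U+0000~U+007F; the byte directly represents a single character. If it starts with 1, the byte is one of the bytes of a multi-byte character.

    If the byte starts with 1, check its second bit. If that bit is 0, the byte is not the first byte of the character. If the character uses two bytes, the byte starting with 10 is the second one. If the character uses three bytes, a byte starting with 10 may be the second or the third, and so on.

    If the byte starts with 1 and the second bit is also 1, the byte is part of a multi-byte character and is always its first byte. The number of leading 1 bits tells you how many bytes the character uses: 110 at the start means a two-byte character, 1110 means a three-byte character.

    Strip the leading bits used for these checks, concatenate what remains in order, and the result is the corresponding code point.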
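  12. UTF-8

    Suppose you have the data 00100011 11100101 10100100 10101001. How do you parse it?

    The first byte, 00100011, starts with 0, so it maps directly to ASCII. In hexadecimal it is 23, and the ASCII table gives #.

    The second byte, 11100101, starts with 1, so it is part of a multi-byte character. It starts with 1110, so it is the first byte of a three-byte character, and the next two bytes should both start with 10. The third and fourth bytes do start with 10, so these three bytes are parsed together as one character.

    Strip the 1110 marker from the second byte and the 10 markers from the third and fourth bytes, then concatenate: 0101100100101001. In hexadecimal that is 5929, so these three bytes represent U+5929. Looking it up shows that this code point is the character "天".

    So the UTF-8 data 00100011 11100101 10100100 10101001 represents #天.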
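The parsing rules can be sketched as a small hand-written decoder in Python 3. This is an illustration under simplifying assumptions, not a replacement for `bytes.decode`: the function name `decode_utf8` is made up, and the sketch does not reject overlong forms or surrogate code points the way a conforming decoder must.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points (simplified sketch)."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if (first & 0b10000000) == 0:               # 0xxxxxxx: ASCII
            length, value = 1, first
        elif (first & 0b11100000) == 0b11000000:    # 110xxxxx: 2 bytes
            length, value = 2, first & 0b00011111
        elif (first & 0b11110000) == 0b11100000:    # 1110xxxx: 3 bytes
            length, value = 3, first & 0b00001111
        elif (first & 0b11111000) == 0b11110000:    # 11110xxx: 4 bytes
            length, value = 4, first & 0b00000111
        else:
            raise ValueError("stray continuation byte or invalid lead byte")
        if i + length > len(data):
            raise ValueError("truncated sequence")
        for byte in data[i + 1:i + length]:
            if (byte & 0b11000000) != 0b10000000:   # must be 10xxxxxx
                raise ValueError("expected a continuation byte")
            value = (value << 6) | (byte & 0b00111111)
        code_points.append(value)
        i += length
    return code_points


data = bytes([0b00100011, 0b11100101, 0b10100100, 0b10101001])
decoded = decode_utf8(data)
print(["U+{:04X}".format(cp) for cp in decoded])   # ['U+0023', 'U+5929']
print("".join(chr(cp) for cp in decoded))
```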
  13. unicode and str

    In Python 2.0, PEP 100 introduced the unicode type, and Python 2.x has had two string types ever since.

    In Python 2.x, type str is a sequence of bytes, while type unicode is a sequence of code points.
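  14. Platform-dependence problems

    Early on, Python's unicode type stored its data in an unsigned long array. Starting with PEP 261, Python let us choose with a build option whether the underlying storage uses UCS2 or UCS4.

    As a result, the same unicode object could have different lengths on different interpreter builds.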
  15. str and bytes

    In Python 3, type str is a sequence of code points, while type bytes is a sequence of bytes.
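The difference shows up as soon as a string contains a non-ASCII character. A short Python 3 sketch, with sample data chosen for illustration:

```python
text = "#\u5929"              # str: two code points
data = text.encode("utf-8")   # bytes: four bytes

print(type(text).__name__, len(text))   # str 2
print(type(data).__name__, len(data))   # bytes 4

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(mixing_allowed)   # False
```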
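  16. Python 3 troubles: wasted space

    Before 3.3, there was the problem already mentioned when discussing UTF-8: UTF-16 or UTF-32 takes up too much space, and different builds could also produce different results.

    PEP 393 introduced the "Flexible String Representation". Put simply, when storing a string internally, Python checks the largest code point used. If it needs more than one byte, the string is stored using the largest required number of bytes as the unit for every code point.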
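  17. Python 3 troubles

    For example, if a string consists of ASCII characters plus a single U+10FFFF, the whole str stores every code point in 4 bytes internally.

    If the whole string uses only ASCII, the whole str stores every code point in 1 byte internally.

    Besides saving space, there is another big advantage: when comparing strings, if their code points take different numbers of bytes, the two strings are certainly different and their contents need not be compared at all.

    Since 3.3, it also no longer happens that different build options lead to different internal encodings.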
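The effect of the flexible representation can be observed with `sys.getsizeof`. This sketch assumes CPython 3.3 or later; the exact sizes are implementation details and vary between versions, so only the relative sizes are meaningful.

```python
import sys

ascii_only = "a" * 1000
with_astral = "a" * 999 + "\U0010FFFF"   # one code point above U+FFFF

# Same number of code points, very different memory footprint,
# because the widest code point decides the unit for the whole string.
size_ascii = sys.getsizeof(ascii_only)
size_astral = sys.getsizeof(with_astral)

print(len(ascii_only), len(with_astral))
print(size_ascii, size_astral)

# No more narrow/wide builds: every build reports the full range.
print(hex(sys.maxunicode))
```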
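  18. The cruelty of the real world

    So it looks as if all the old problems were solved from 3.3 on?

    The real world is cruel. For example, for the sake of compatibility and to comply with the rules for headers, the WSGI specification forces you to deal with utf8 and Latin 1 compatibility when encoding. Python 2 did not have to worry about this.

    PEP 383 did introduce surrogateescape for functions such as os.fsencode and os.fsdecode, which solved part of the problem. Unfortunately, it also means that later code has to be very careful, whenever it meets functions of this kind, about whether surrogateescape is handled; otherwise it raises an Error as well. To some extent, the problem has not been solved at its root.

    Beyond that, the CPython C API has become more and more frightening as the unicode-related features have grown more complex, which is not good news for users who touch the C API.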
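The surrogateescape pitfall can be reproduced without touching the file system. In this Python 3 sketch, the byte string stands in for a hypothetical Latin-1 encoded file name handed over by the operating system.

```python
# Bytes that are valid Latin-1 but not valid UTF-8 (hypothetical file name).
raw = b"caf\xe9.txt"

# surrogateescape smuggles the undecodable byte 0xE9 in as U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Code that forgets about surrogateescape fails when it encodes strictly.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same error handler restores the original bytes.
round_tripped = name.encode("utf-8", errors="surrogateescape")

print(ascii(name), strict_failed, round_tripped == raw)
```

Every place that later encodes or prints such a string has to use the same error handler, which is the burden the slide describes.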
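  19. Advice

    Advice 1: "On I/O, always remember to decode bytes coming from outside, work only with unicode internally as far as possible, and when outputting data remember to encode to bytes before interacting with the outside world."

    Advice 2: "Do not shy away from the effort. Take care to make sure the types of the objects in your program are clear, and if something is bytes, make sure as far as possible which encoding it uses."

    Advice 3: "Test your program with at least three to five different kinds of languages."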
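The first piece of advice can be sketched as follows in Python 3. The function names `shout` and `process` and the in-memory streams are assumptions made for this example; real code would read from a file or socket opened in binary mode.

```python
import io


def shout(text):
    """Pure text logic: works on str only and knows nothing about encodings."""
    return text.upper()


def process(stream_in, stream_out, encoding="utf-8"):
    """Decode at the input edge, work on str, encode at the output edge."""
    raw = stream_in.read()                      # bytes from the outside
    text = raw.decode(encoding)                 # decode as early as possible
    result = shout(text)                        # inside: str only
    stream_out.write(result.encode(encoding))   # encode as late as possible


source = io.BytesIO("h\u00e9llo \u5929".encode("utf-8"))
sink = io.BytesIO()
process(source, sink)
print(sink.getvalue())
```

Keeping the encoding as an explicit parameter also follows the second piece of advice: wherever bytes appear, the encoding they use is stated.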