Slide 1

Slide 1 text

Unicode in Python

Slide 2

Slide 2 text

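林勇成, microelectronics student
About two years of experience with Python; really still a layman. The photo is, of course, not me... Personal blog: https://blog.niwyclin.org/ <-- not set up yet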

Slide 3

Slide 3 text

Five "facts" and three "suggestions"

Slide 4

Slide 4 text

Fact 1: "In the computer world, all data is bytes; there is no such thing as 'plain text'."

Slide 5

Slide 5 text

㐊ى

Slide 6

Slide 6 text

ߔាت଴

Slide 7

Slide 7 text

Fact 2: "In an age of global communication, don't expect a single byte to solve everything."

Slide 8

Slide 8 text

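Unicode
The idea behind Unicode is very simple. It works like a phone book: every number (the phone number) is assigned one character (the name). The notation is U+ followed by the number in hexadecimal, ranging from 0000 to 10FFFF, for a total of 1,114,112 values. Each such value is called a code point, so Unicode can map at most 1,114,112 characters. The latest version, Unicode 6.3 (September 2013), defines 110,122 characters (not counting control characters and the like), which means only about one tenth of the code points are in use so far.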
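As a small illustration of the numbering described above, Python 3's built-in `ord()` and `chr()` convert between a character and its code point. The helper name `u_notation` is made up for this sketch.

```python
# Code points run from U+0000 to U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
total_code_points = MAX_CODE_POINT + 1


def u_notation(ch):
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    return "U+%04X" % ord(ch)


print(total_code_points)          # 1114112
print(u_notation("\u2602"))       # U+2602
print(chr(0x2602) == "\u2602")    # True
```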

Slide 9

Slide 9 text

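Unicode
Unicode divides the code points into many blocks, and each block tries to hold characters of the same script. Several blocks leave room for characters that may be added later, so not every code point in a block is assigned.
A fact many people get wrong: "Unicode itself is not an encoding." Unicode only defines the mapping between numbers and characters; it does not say how a code point is represented in bytes. Strictly speaking, Unicode is a mapping-table standard, not an encoding.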

Slide 10

Slide 10 text

ℙƴ☂ℌøἤ
U+2119: DOUBLE-STRUCK CAPITAL P
U+01B4: LATIN SMALL LETTER Y WITH HOOK
U+2602: UMBRELLA
U+210C: BLACK-LETTER CAPITAL H
U+00F8: LATIN SMALL LETTER O WITH STROKE
U+1F24: GREEK SMALL LETTER ETA WITH PSILI AND OXIA
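The same listing can be reproduced with the standard `unicodedata` module. This is only a sketch; the characters are written as escapes so the source file's own encoding does not matter.

```python
import unicodedata

# The six characters from the slide, written as escapes.
text = "\u2119\u01b4\u2602\u210c\u00f8\u1f24"

# Pair each character's U+XXXX notation with its official Unicode name.
names = [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]

for code_point, name in names:
    print("%s: %s" % (code_point, name))
```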

Slide 11

Slide 11 text

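UTF-16 and UCS-2
Put simply, UTF-16 is an encoding that represents Unicode characters in units of 16 bits.
People who have not looked closely often assume that, because the unit is 16 bits, UTF-16 can only represent the Unicode characters 0~65535.
That is a misconception. The encoding that really is limited to the Unicode characters 0~65535 is UCS-2, a subset of UTF-16.
The biggest difference between UTF-16 and UCS-2 is that in UTF-16 the number of bits used for one "character" is variable: a character that does not fit in one unit (two bytes) is represented with two units (four bytes). UCS-2 is fixed-length and always uses one unit (two bytes) per character, so UCS-2 can represent only the Unicode characters 0~65535.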

Slide 12

Slide 12 text

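Code points up to U+FFFF
The number is stored directly in two bytes, and the stored value is the code point number itself.
For example, "U+2602; ☂" is stored as "0010011000000010": the 2602 in the code point is hexadecimal, it converts to binary 10011000000010, and the missing high bits are zero-padded to fill 2 bytes.
Likewise, "U+00F8; ø" is stored as "0000000011111000".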
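A minimal sketch of the two examples above, assuming Python 3. The helper `utf16_bmp_bits` is a name invented here; it zero-pads the code point to 16 bits, and the result is checked against the built-in `utf-16-be` codec.

```python
def utf16_bmp_bits(ch):
    """Return the 16-bit pattern for a character at or below U+FFFF (hypothetical helper)."""
    cp = ord(ch)
    if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not directly storable in a single 16-bit unit")
    return format(cp, "016b")  # binary, zero-padded to 16 digits


umbrella_bits = utf16_bmp_bits("\u2602")
o_stroke_bits = utf16_bmp_bits("\u00f8")
print(umbrella_bits)  # 0010011000000010
print(o_stroke_bits)  # 0000000011111000

# The built-in codec produces the same two bytes (big-endian order).
umbrella_bytes = "\u2602".encode("utf-16-be")
```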

Slide 13

Slide 13 text

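Code points above U+FFFF
1. Subtract 10000 (hexadecimal) from the code point number. The result lies in the range 000000~0FFFFF, a number of 20 bits. (Only U+010000~U+10FFFF need this method, which is why subtracting 10000 gives the range 000000~0FFFFF.)
2. Take the top 10 bits of the number from step 1 and add D800 (hexadecimal) to get the first unit, which occupies one unit of two bytes. This unit is called the high surrogate, and its value falls in the range D800~DBFF.
3. Take the remaining 10 bits (the low 10 bits of the number from step 1) and add DC00 to get the second unit. This unit is called the low surrogate, and its value falls in the range DC00~DFFF.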

Slide 14

Slide 14 text

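Code points above U+FFFF
The values produced by steps 2 and 3 fall in completely separate ranges, and Unicode deliberately reserves the range they occupy, D800~DFFF, and assigns no characters to it. So a decoder can easily tell that it is looking at a surrogate pair, and just as easily tell whether a unit carries the top 10 bits or the low 10 bits. Undoing the offsets and recombining the two halves gives back the correct Unicode code point.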

Slide 15

Slide 15 text

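Code points above U+FFFF
Here we use U+1030A to walk through a real example.
1030A minus 10000 gives 30A, which is 1100001010 in binary; zero-padded to 20 bits it becomes 00000000001100001010.
The top 10 bits are 0000000000, which is 0000 in hexadecimal; adding D800 gives D800.
The low 10 bits are 1100001010, which is 030A in hexadecimal; adding DC00 gives DF0A.
So U+1030A is represented in UTF-16 as D800DF0A.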
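The three steps, and the reverse direction, can be sketched in a few lines of Python 3. The function names are invented for this illustration; the result is checked against the built-in `utf-16-be` codec.

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = cp - 0x10000                # step 1: a 20-bit number
    high = 0xD800 + (offset >> 10)       # step 2: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # step 3: low 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1030A)
print("%04X %04X" % (high, low))  # D800 DF0A

# Cross-check against Python's own encoder.
encoded = "\U0001030A".encode("utf-16-be")
```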

Slide 16

Slide 16 text

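UTF-32 and UCS-4
UTF-32 and UCS-4 both use four bytes per character. Four bytes can represent far more values than all of Unicode defines, so there is no variable-length problem: four bytes are always enough.
The representation is simple and works like UTF-16 for small code points: the code point number is stored directly, zero-padded to four bytes.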
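A short sketch of that fixed-width layout in Python 3: `int.to_bytes` zero-pads the code point to four bytes, which matches the output of the built-in `utf-32-be` codec.

```python
def utf32_be(ch):
    """Store the code point number directly in four big-endian bytes."""
    return ord(ch).to_bytes(4, "big")


small = utf32_be("\u00f8")       # 00 00 00 F8
large = utf32_be("\U0001030A")   # 00 01 03 0A

# The built-in codec gives the same bytes.
codec_small = "\u00f8".encode("utf-32-be")
codec_large = "\U0001030A".encode("utf-32-be")
```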

Slide 17

Slide 17 text

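Byte order
Take U+1030A above, which UTF-16 represents as D800DF0A. When it is stored in memory, is the order D8 00 DF 0A or 00 D8 0A DF?
That depends on the processor or the network protocol.
This discussion uses big endian, because it matches the order in which we write the results of our arithmetic, which makes it easier to discuss and to read.
To tell the user (or the program) how the data is ordered, a BOM mark can be added at the start of the data to identify the byte order.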
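Both byte orders and the BOM can be observed with Python 3's codecs. A minimal sketch: `utf-16-be` and `utf-16-le` fix the order explicitly, while the plain `utf-16` decoder reads the BOM to decide.

```python
import codecs

ch = "\U0001030A"

big = ch.encode("utf-16-be")
little = ch.encode("utf-16-le")
print(big.hex())     # d800df0a
print(little.hex())  # 00d80adf

# Prefix the little-endian bytes with the matching BOM; the generic
# "utf-16" decoder consumes the BOM and picks the right byte order.
with_bom = codecs.BOM_UTF16_LE + little
decoded = with_bom.decode("utf-16")
```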

Slide 18

Slide 18 text

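UTF-8
The UTF-8 encoding is very clever. It stores data with a variable-length scheme: code points 0-127 are stored in one byte, and higher code points are stored in two or more bytes.
This design has several advantages. It saves space, and for code points 0-127 the ASCII and UTF-8 representations are exactly the same. What does that mean? If you open an ASCII-encoded file assuming it is UTF-8, there is no garbled text and the result looks exactly the same, so compatibility with existing files is not broken.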
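A quick check of both claims in Python 3: ASCII text encodes to identical bytes under `ascii` and `utf-8`, and the byte count grows with the code point. The sample characters are chosen here for illustration.

```python
text = "Unicode in Python"

# For code points 0-127 the two encodings agree byte for byte.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# Higher code points need more bytes in UTF-8.
samples = ["A", "\u00f8", "\u2602", "\U0001030A"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]
```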

Slide 19

Slide 19 text

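UTF-8
Look at whether the byte starts with 0 or 1. If it starts with 0, the byte is the same as in ASCII: it maps to U+0000~U+007F and represents a single character by itself. If it starts with 1, the byte is one of the bytes of a multi-byte character.
If the byte starts with 1, check its second bit. If that bit is 0, the byte is not the first byte of the character. For a two-byte character, the byte starting with 10 is the second byte. For a three-byte character, a byte starting with 10 may be the second or the third byte, and so on.
If the byte starts with 1 and the second bit is also 1, the byte belongs to a multi-byte character and is definitely its first byte. The number of leading 1s tells how many bytes the character uses: for example, 110 marks a two-byte character and 1110 marks a three-byte character.
Remove the leading marker bits described above and join the remaining bits in order to get the corresponding code point.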

Slide 20

Slide 20 text

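UTF-8
Suppose you have the data 00100011 11100101 10100100 10101001. How is it parsed?
The first byte, 00100011, starts with 0, so it maps directly to ASCII. In hexadecimal it is 23, and the ASCII table gives #.
The second byte, 11100101, starts with 1, so it belongs to a multi-byte character. It starts with 1110, which marks the first byte of a three-byte character, so the next two bytes should both start with 10. The third and fourth bytes do start with 10, so these three bytes are parsed together as one character.
Remove the 1110 marker from the second byte and the 10 markers from the third and fourth bytes, then join the rest to get 0101100100101001. In hexadecimal that is 5929, so these three bytes represent U+5929, and a lookup shows that this code point is the character "天".
So the UTF-8 data 00100011 11100101 10100100 10101001 represents #天.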
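The parsing rules can be sketched as a small hand-written decoder. This is a teaching sketch only: it follows the marker-bit rules from the slides and omits checks a real decoder performs (overlong forms, surrogates). The function name `decode_utf8` is invented here.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes by inspecting the leading marker bits."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first >> 7 == 0b0:            # 0xxxxxxx: single byte, same as ASCII
            length, cp = 1, first
        elif first >> 5 == 0b110:        # 110xxxxx: two-byte character
            length, cp = 2, first & 0b00011111
        elif first >> 4 == 0b1110:       # 1110xxxx: three-byte character
            length, cp = 3, first & 0b00001111
        elif first >> 3 == 0b11110:      # 11110xxx: four-byte character
            length, cp = 4, first & 0b00000111
        else:
            raise ValueError("byte %d is not a valid first byte" % i)
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated sequence")
        for byte in tail:
            if byte >> 6 != 0b10:        # continuation bytes start with 10
                raise ValueError("expected a continuation byte")
            cp = (cp << 6) | (byte & 0b00111111)
        chars.append(chr(cp))
        i += length
    return "".join(chars)


sample = bytes([0b00100011, 0b11100101, 0b10100100, 0b10101001])
result = decode_utf8(sample)
print(result == "#\u5929")  # True
```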

Slide 21

Slide 21 text

String and Bytes in Python

Slide 22

Slide 22 text

Python 2

Slide 23

Slide 23 text

unicode and str
In Python 2.0, PEP 100 introduced the unicode type, and from then on Python 2.x has had two string types.
In Python 2.x, type str is a sequence of bytes, while type unicode is a sequence of code points.

Slide 24

Slide 24 text

unicode and str
The two types are converted into each other with encode() and decode().
unicode .encode() → bytes
bytes .decode() → unicode
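The same round trip, sketched in Python 3, where the text type is called `str` and the byte type `bytes` (in Python 2 these roles are played by `unicode` and `str`). The same character produces different bytes under different encodings.

```python
text = "\u5929"  # the character used in the UTF-8 walk-through

utf8_bytes = text.encode("utf-8")       # three bytes: E5 A4 A9
utf16_bytes = text.encode("utf-16-be")  # two bytes: 59 29

# decode() with the matching encoding restores the original text.
back_from_utf8 = utf8_bytes.decode("utf-8")
back_from_utf16 = utf16_bytes.decode("utf-16-be")
```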

Slide 25

Slide 25 text

unicode and str

Slide 26

Slide 26 text

unicode and str
Of course, if the encoding chosen for encode cannot represent a code point, Python raises an error, and the same holds in the other direction.

Slide 27

Slide 27 text

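Error handling
What should be done in a situation like the one above? We can tell Python how to handle the error. There are three simple handlers:
replace: everything the encoding cannot represent is replaced with ?
xmlcharrefreplace: everything the encoding cannot represent is replaced with an HTML/XML character entity reference.
ignore: everything the encoding cannot represent is simply dropped.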
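A sketch of the default behaviour and the three handlers, using Python 3 and a sample string chosen here (`a`, `ø`, `☂`) encoded to ASCII.

```python
text = "a\u00f8\u2602"

# Default ("strict") handling raises an error.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("ascii", "replace")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
ignored = text.encode("ascii", "ignore")

print(replaced)  # b'a??'
print(xml_refs)  # b'a&#248;&#9730;'
print(ignored)   # b'a'
```

Decoding accepts `replace` and `ignore` as well; with `replace`, undecodable bytes become U+FFFD rather than `?`.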

Slide 28

Slide 28 text

Error handling

Slide 29

Slide 29 text

The trouble with Python 2

Slide 30

Slide 30 text

The trouble with Python 2

Slide 31

Slide 31 text

Fact 3: "We should keep the two types explicit and handle both of them ourselves."

Slide 32

Slide 32 text

Fact 4: "Without knowing the exact encoding, you cannot determine it automatically with 100% certainty by guessing or detection."
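One way to see why detection cannot be certain: the same bytes decode without error under more than one encoding. A small Python 3 sketch, with Latin-1 chosen here as the competing guess.

```python
data = "\u5929".encode("utf-8")  # the bytes E5 A4 A9

as_utf8 = data.decode("utf-8")      # one character: U+5929
as_latin1 = data.decode("latin-1")  # three characters: U+00E5 U+00A4 U+00A9

# Both decodes succeed, so the bytes alone do not reveal which was intended.
print(len(as_utf8), len(as_latin1))  # 1 3
```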

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Fact 5: "The encoding the user tells you may be wrong."

Slide 35

Slide 35 text

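Platform dependence
Early on, Python's unicode type stored its data internally in an unsigned long array. Starting with PEP 261, Python let us choose at compile time whether the internal storage uses UCS2 or UCS4.
As a result, the same unicode object could have a different length on different interpreters.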
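The two lengths can be reproduced on a current interpreter. A sketch assuming Python 3.3 or later, where `len()` counts code points; a UCS2 ("narrow") build of Python 2 would instead have reported the number of UTF-16 units.

```python
ch = "\U0001030A"

# Python 3.3+ counts code points.
code_point_count = len(ch)

# The number of 16-bit units, i.e. what a UCS2 build would have counted.
utf16_unit_count = len(ch.encode("utf-16-be")) // 2

print(code_point_count, utf16_unit_count)  # 1 2
```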

Slide 36

Slide 36 text

Python 3

Slide 37

Slide 37 text

str and bytes
In Python 3, type str is a sequence of code points, while type bytes is a sequence of bytes.
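A brief Python 3 sketch of the difference: the two types have different lengths and different element types for the same text, and they cannot be mixed implicitly.

```python
text = "#\u5929"
data = text.encode("utf-8")

text_length = len(text)  # 2 code points
data_length = len(data)  # 4 bytes

second_char = text[1]    # a one-character str
second_byte = data[1]    # an int, here 0xE5

# Mixing the two types raises TypeError instead of converting silently.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```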

Slide 38

Slide 38 text

str and bytes

Slide 39

Slide 39 text

The trouble with Python 3
One of the troubles in Python 3's design is the file system, which each OS handles differently.
For pathnames:
Unix: Bytestrings
Windows: UTF-16
OS X: UTF-16
Python: str type

Slide 40

Slide 40 text

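The trouble with Python 3
Wasted space
Before 3.3, as noted earlier when UTF-8 was introduced, UTF-16 or UTF-32 storage takes up too much space, and it also made different builds produce different results.
PEP 393 introduced the "Flexible String Representation". Put simply, when Python stores a string internally, it checks the largest code point used; if that value needs more than one byte, the whole string is stored using the largest width required.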

Slide 41

Slide 41 text

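The trouble with Python 3
For example, if a string consists of ASCII characters plus one U+10FFFF, the whole str stores every code point in 4 bytes internally.
If the whole string is ASCII only, the str stores every code point in 1 byte.
Besides saving space, there is another big benefit: when two strings are compared and their code points use different widths, they are known to be different strings without comparing their contents at all.
Since 3.3, different build options no longer lead to different internal encodings.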
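The effect can be observed with `sys.getsizeof`. This sketch assumes CPython 3.3 or later; exact sizes are an implementation detail and vary between versions, so only the ordering is compared.

```python
import sys

# Three strings of equal length that differ only in their widest code point.
ascii_only = "a" * 1000
with_bmp = "a" * 999 + "\u5929"          # widest code point needs 2 bytes
with_astral = "a" * 999 + "\U0010FFFF"   # widest code point needs 4 bytes

size_ascii = sys.getsizeof(ascii_only)
size_bmp = sys.getsizeof(with_bmp)
size_astral = sys.getsizeof(with_astral)

# One wide character widens the storage of the whole string.
print(size_ascii, size_bmp, size_astral)
```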

Slide 42

Slide 42 text

The cruel real world

Slide 43

Slide 43 text

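The cruel real world
So it looks as if everything was solved from 3.3 on?
The real world is cruel. For example, for compatibility and to comply with the header rules, the WSGI specification forces you to deal with UTF-8 versus Latin 1 compatibility problems when encoding. Python 2 users never had to worry about this.
PEP 383 introduced surrogateescape, which functions such as os.fsencode and os.fsdecode use, and this solved part of the problem. Unfortunately it also affects code written afterwards: whenever you touch functions in this area you must check carefully whether surrogateescape is handled, or an Error is raised. To some extent, the problem has not been solved at its root.
Beyond that, CPython's C API keeps getting more complex as Unicode-related features accumulate, and more and more intimidating. That is not good news for users of the C API either.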
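A sketch of the surrogateescape pitfall in Python 3. The file name bytes are made up for this example: they are Latin-1 encoded, so they are not valid UTF-8.

```python
raw_name = b"caf\xe9.txt"  # hypothetical file name, Latin-1 bytes

# surrogateescape smuggles the undecodable byte into the str as U+DCE9.
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

# Forgetting the handler later on raises an error.
try:
    name.encode("utf-8")
    plain_encode_failed = False
except UnicodeEncodeError:
    plain_encode_failed = True
```

The last step is the situation described above: any code that later encodes such a string without the handler fails.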

Slide 44

Slide 44 text

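Suggestions
Suggestion 1: "At I/O time, remember to decode the bytes that come in from outside; inside the program, try to work only with unicode; and when writing data out, remember to encode to bytes before talking to the outside world."
Suggestion 2: "Don't shy away from the effort: spend the energy to make sure the types of the objects in your program are unambiguous. If something is bytes, also do your best to find out which encoding it uses."
Suggestion 3: "Pick at least three to five different kinds of languages to test your program with."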
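Suggestion 1 can be sketched as a small Python 3 function that decodes at the input edge, works on str in the middle, and encodes at the output edge. The function name `shout` and the in-memory streams are stand-ins chosen for this illustration, and the encoding is assumed to be known.

```python
import io


def shout(raw_in, raw_out, encoding="utf-8"):
    """Read bytes, process text, write bytes (hypothetical example)."""
    text = raw_in.read().decode(encoding)    # bytes -> str on the way in
    result = text.upper()                    # only str inside the program
    raw_out.write(result.encode(encoding))   # str -> bytes on the way out


# In-memory byte streams stand in for files or sockets.
source = io.BytesIO("\u00f8 umbrella \u2602".encode("utf-8"))
sink = io.BytesIO()
shout(source, sink)

output_text = sink.getvalue().decode("utf-8")
```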