Slide 1

Slide 1 text

PYTHON'S UNICODE INTERNALS
Benjamin Peterson

Slide 2

Slide 2 text

“Modern programs must handle Unicode - Python has excellent support for Unicode, and will keep getting better.” - GvR

Slide 3

Slide 3 text

PURPOSES
- Explain the history of Python's Unicode support.
- Examine in depth the current Unicode implementation.

Slide 4

Slide 4 text

SOME HISTORY

Slide 5

Slide 5 text

MORE INNOCENT TIMES UNICODE IN PYTHON 2

Slide 6

Slide 6 text

GENESIS - PEP 100 (PYTHON 2.0)
- unicode type
- codecs module
- str <-> unicode coercion model
- Simple (4.5K loc)

Slide 7

Slide 7 text

DATA FORMAT
ARRAY OF unsigned short CODE UNITS (UTF-16)
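As a rough illustration of this layout, the sketch below uses Python 3's array module to view a string as native-endian 16-bit code units. The helper name utf16_code_units is hypothetical, and the code assumes that the platform's unsigned short is two bytes wide, which holds on mainstream platforms.

```python
import array
import sys


def utf16_code_units(text):
    """Return the UTF-16 code units of text as an array of unsigned shorts."""
    # Assumption: array type code "H" (C unsigned short) is 2 bytes wide.
    assert array.array("H").itemsize == 2
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    units = array.array("H")
    units.frombytes(text.encode(codec))
    return units


print(list(utf16_code_units("Hi")))                       # [72, 105]
print([hex(u) for u in utf16_code_units("\U0001F07F")])   # ['0xd83c', '0xdc7f']
```

A character outside the Basic Multilingual Plane occupies two code units here, which is what the narrow-build behaviour later in the talk comes from.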

Slide 8

Slide 8 text

UNICODE AS AN OPTIONAL FEATURE
    $ python -S
    Python 2.7.3+ (2.7:f6e74759d740, Jan  1 2013, 23:06:34)
    [GCC 4.5.4] on linux2
    >>> unicode
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'unicode' is not defined
    >>> type(u"Hello, World!")
    <type 'str'>

Slide 9

Slide 9 text

PEP 261 - SUPPORT FOR "WIDE" UNICODE CHARACTERS
    --enable-unicode=(ucs2|ucs4)
"THIS PEP REPRESENTS THE LEAST-EFFORT SOLUTION."

Slide 10

Slide 10 text

UNICODE IN PYTHON 3 A BRAVE NEW WORLD

Slide 11

Slide 11 text

KEY PYTHON 3 UNICODE CHANGES
- str is now a Unicode type
- The old str type becomes bytes
- bytes and str are not implicitly coercible
- Identifiers are Unicode
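A minimal Python 3 sketch of these changes follows. The variable names are illustrative only; the point is that conversion between str and bytes always goes through an explicit encode or decode step.

```python
text = "Schl\u00fcssel"        # str: a sequence of code points
data = text.encode("utf-8")    # bytes: produced by an explicit encoding step

mixing_error = None
try:
    text + data                # mixing str and bytes is an error in Python 3
except TypeError as exc:
    mixing_error = exc

print(type(mixing_error).__name__)   # TypeError
print(len(text), len(data))          # 9 10 (the u-umlaut takes two UTF-8 bytes)
print(data.decode("utf-8") == text)  # True

# Identifiers may contain non-ASCII letters (PEP 3131).
schlüssel = 42
```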

Slide 12

Slide 12 text

NEW STRESSES ON AN OLD SYSTEM

Slide 13

Slide 13 text

WHAT'S A PATHNAME?
- UNIX: BYTESTRINGS
- WINDOWS: UTF-16
- OSX: UTF-16 (WITH WEIRD NORMALIZATION)
- PYTHON: str

Slide 14

Slide 14 text

UNIX IS PROBLEMATIC
    >>> filename = b"myfile-\xef\xef.txt"
    >>> open(filename, "w").write("hi")
    >>> filename.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' failed
    >>> filename.decode("utf-16")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf16' failed

Slide 15

Slide 15 text

PEP 383 (PYTHON 3.1) NON-DECODABLE BYTES IN SYSTEM CHARACTER INTERFACES

Slide 16

Slide 16 text

PRESERVING UNDECODABLE BYTES
- On decode (e.g. os.listdir), map undecodable bytes to lone surrogate code points (U+DC80 to U+DCFF).
- On encode (e.g. os.stat), map those surrogates back to the original bytes.
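The mechanism is exposed in Python 3 as the "surrogateescape" error handler, where each undecodable byte 0xNN becomes the lone surrogate U+DCNN. The sketch below applies the handler to an explicit UTF-8 codec, so it does not depend on the filesystem encoding of the machine it runs on; the byte string is just an example of invalid UTF-8.

```python
raw = b"myfile-\xef\xef.txt"   # not valid UTF-8

# Decoding: each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))             # 'myfile-\udcef\udcef.txt'

# Encoding with the same handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)         # True

# Without the handler, the escaped string cannot be encoded at all.
strict_encoding_failed = False
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    strict_encoding_failed = True
print(strict_encoding_failed)  # True
```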

Slide 17

Slide 17 text

ROUND-TRIPPING BYTESTRING FILENAMES
    >>> import os
    >>> filename = b"myfile-\xef\xef.txt"
    >>> new = os.fsdecode(b"myfile-\xef\xef.txt")
    >>> new
    'myfile-\udcef\udcef.txt'
    >>> os.fsencode(new)
    b'myfile-\xef\xef.txt'

Slide 18

Slide 18 text

UTF-16 VS. UTF-32 REPRESENTATION

Slide 19

Slide 19 text

NARROW BUILD
    >>> char = u"\U0001F07F"
    >>> len(char)
    2
    >>> unicodedata.category(char[0])
    'Cs'  # Surrogate

WIDE BUILD
    >>> char = u"\U0001F07F"
    >>> len(char)
    1
    >>> unicodedata.category(char[0])
    'So'  # Symbol
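The narrow-build result follows from how UTF-16 splits a code point above U+FFFF into two surrogate code units. The sketch below reproduces that arithmetic in Python 3; the function name to_surrogate_pair is hypothetical.

```python
import unicodedata


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a pair")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the bottom 10 into the low.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = to_surrogate_pair(0x1F07F)
print(hex(high), hex(low))                 # 0xd83c 0xdc7f
print(unicodedata.category(chr(high)))     # Cs
print(unicodedata.category("\U0001F07F"))  # So
```

On a narrow build, char[0] is the high surrogate, which is why its category is 'Cs' rather than 'So'.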

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

UTF-16 WASTES MEMORY
UTF-32 WASTES TONS OF MEMORY

Slide 22

Slide 22 text

PEP 393 (PYTHON 3.3) FLEXIBLE STRING REPRESENTATION

Slide 23

Slide 23 text

IDEA USE SMALLEST REPRESENTATION POSSIBLE FOR A GIVEN STRING

Slide 24

Slide 24 text

PEP 393 DATA REPRESENTATION
    Maximum code point   Data size (bytes)   ASCII flag   Example
    127                  1                   1            Hello, World!
    255                  1                   0            Schlüssel
    65535                2                   0
    1114111              4                   0            Y
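The selection rule in the table can be modelled in a few lines of Python. This is only a sketch of the rule, not CPython's code: the function name pep393_kind is hypothetical, and the real implementation makes this decision in C when a string is created.

```python
def pep393_kind(text):
    """Return (bytes per code point, ASCII flag) implied by the table's rule."""
    max_code_point = max(map(ord, text), default=0)
    if max_code_point <= 127:
        return 1, 1
    if max_code_point <= 255:
        return 1, 0
    if max_code_point <= 65535:
        return 2, 0
    return 4, 0


print(pep393_kind("Hello, World!"))   # (1, 1)
print(pep393_kind("Schl\u00fcssel"))  # (1, 0)
print(pep393_kind("\u20ac"))          # (2, 0)
print(pep393_kind("\U0001F07F"))      # (4, 0)
```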

Slide 25

Slide 25 text

INVARIANT: SMALLEST REPRESENTATION IS ALWAYS USED.
E.G. ASCII STRINGS ALWAYS USE THE ONE-BYTE REPRESENTATION
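On CPython 3.3 or later, the invariant can be observed indirectly with sys.getsizeof. The sketch below is CPython-specific and makes two assumptions: the per-object header is the same for two strings of the same kind, so it cancels out in a difference, and slicing builds a new string in the narrowest form. The helper name bytes_per_code_point is hypothetical.

```python
import sys


def bytes_per_code_point(char):
    """Estimate storage per code point for strings made only of char."""
    # Both strings use the same representation, so the fixed header cancels.
    return (sys.getsizeof(char * 1001) - sys.getsizeof(char)) // 1000


print(bytes_per_code_point("a"))           # 1
print(bytes_per_code_point("\u00fc"))      # 1
print(bytes_per_code_point("\u20ac"))      # 2
print(bytes_per_code_point("\U0001F07F"))  # 4

# An ASCII-only slice of a 4-byte string is stored as a 1-byte string again.
wide = "abc\U0001F07F"
print(sys.getsizeof(wide[:3]) == sys.getsizeof("abc"))   # True
```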

Slide 26

Slide 26 text

PEP 393 IMPLICATIONS

Slide 27

Slide 27 text

EVERYTHING IS A CODEPOINT!
- Narrow vs wide builds abolished.
- len(s) gives length in codepoints.
- Indexing a string always gives a valid codepoint.
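A short check of these points, assuming Python 3.3 or later. The variable names are illustrative.

```python
import sys
import unicodedata

tile = "\U0001F07F"
text = "a" + tile + "b"

print(sys.maxunicode == 0x10FFFF)       # True: no narrow builds any more
print(len(text))                        # 3: length counts code points
print(text[1] == tile)                  # True: indexing never splits a character
print(unicodedata.category(text[1]))    # So
print(text[::-1] == "b" + tile + "a")   # True: reversal keeps the character intact
```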

Slide 28

Slide 28 text

PERFORMANCE CHARACTERISTICS ARE COMPLICATED

Slide 29

Slide 29 text

ASCII CAN BE BLAZING FAST (AND A LOT OF THINGS ARE ASCII.)

Slide 30

Slide 30 text

OTHER CASES ARE LESS CLEAR

Slide 31

Slide 31 text

COMPLEXITY
LINES IN CORE UNICODE IMPLEMENTATION
- 3.2: 15,000
- hg tip: 20,000

Slide 32

Slide 32 text

COMPLEXITY
    #define PyUnicode_GET_SIZE(op)                        \
        (assert(PyUnicode_Check(op)),                     \
         (((PyASCIIObject *)(op))->wstr) ?                \
          PyUnicode_WSTR_LENGTH(op) :                     \
          ((void)PyUnicode_AsUnicode((PyObject *)(op)),   \
           assert(((PyASCIIObject *)(op))->wstr),         \
           PyUnicode_WSTR_LENGTH(op)))

Slide 33

Slide 33 text

C-API

Slide 34

Slide 34 text

OLD C-API
    Py_ssize_t
    count_ascii(PyObject *string)
    {
        Py_UNICODE *data = PyUnicode_AS_UNICODE(string);
        Py_ssize_t i, count = 0;
        for (i = 0; i < PyUnicode_GET_SIZE(string); i++) {
            if (data[i] <= 127)
                count++;
        }
        return count;
    }

Slide 35

Slide 35 text

NEW C-API
    Py_ssize_t
    count_ascii(PyObject *string)
    {
        if (PyUnicode_READY(string) < 0)
            return -1;
        int kind = PyUnicode_KIND(string);
        void *data = PyUnicode_DATA(string);
        Py_ssize_t i, count = 0, len = PyUnicode_GET_LENGTH(string);
        for (i = 0; i < len; i++) {
            if (PyUnicode_READ(kind, data, i) <= 127)
                count++;
        }
        return count;
    }
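To make the kind/data/read pattern concrete, here is a pure-Python model of the same loop. It is a sketch, not the C-API: kind_of, data_of and read are hypothetical stand-ins for PyUnicode_KIND, PyUnicode_DATA and PyUnicode_READ, and the "data" buffer is simulated by encoding the string with a fixed-width codec in native byte order.

```python
import sys

_LITTLE = sys.byteorder == "little"
# Fixed-width codecs standing in for the 1-, 2- and 4-byte representations.
_CODECS = {
    1: "latin-1",
    2: "utf-16-le" if _LITTLE else "utf-16-be",
    4: "utf-32-le" if _LITTLE else "utf-32-be",
}


def kind_of(text):
    """Bytes per code point, chosen from the largest code point."""
    max_code_point = max(map(ord, text), default=0)
    return 1 if max_code_point <= 0xFF else 2 if max_code_point <= 0xFFFF else 4


def data_of(text, kind):
    """Raw buffer holding one fixed-width unit per code point."""
    if kind == 1:
        return text.encode(_CODECS[1])
    # surrogatepass lets lone surrogates through the UTF-16/32 codecs.
    return text.encode(_CODECS[kind], errors="surrogatepass")


def read(kind, data, index):
    """Fetch the code point at index from a buffer of the given kind."""
    start = index * kind
    return int.from_bytes(data[start:start + kind], sys.byteorder)


def count_ascii(text):
    kind = kind_of(text)
    data = data_of(text, kind)
    count = 0
    for i in range(len(text)):
        if read(kind, data, i) <= 127:
            count += 1
    return count


print(count_ascii("Schl\u00fcssel"))    # 8
print(count_ascii("a\U0001F07Fb"))      # 2
```

The caller never needs to know the width in advance: the kind is looked up once and every read dispatches on it, which is the shape of the new C loop.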

Slide 36

Slide 36 text

OLD C-API MUST CONTINUE TO WORK. RESULT: LEGACY STRINGS

Slide 37

Slide 37 text

FUTURE WORK
- More performance improvements
- Unicode spec compliance
- More Unicode algorithms
- re module could use some work

Slide 38

Slide 38 text

LESSONS
- Global configuration options are bad.
- It's okay to start simple; evolution is possible.
- It's much easier to preserve compatibility for Python code than C-API clients.
- Optimize for the common case.

Slide 39

Slide 39 text

QUESTIONS?
Further contact: benjamin@python.org