Python's Unicode Internals by Benjamin Peterson

PyCon 2013

March 15, 2013
Transcript

  1. 1.

    PYTHON'S UNICODE INTERNALS

    Benjamin Peterson <benjamin@python.org>
  2. 2.

    “Modern programs must handle Unicode - Python has excellent support for Unicode, and will keep getting better.” - GvR
  3. 3.

    PURPOSES

    Explain the history of Python's Unicode support.
    Examine in depth the current Unicode implementation.
  4. 6.

    GENESIS - PEP 100 (PYTHON 2.0)

    unicode type
    codecs module
    str <-> unicode coercion model
    Simple (4.5K loc)
  5. 7.

    DATA FORMAT

    ARRAY OF unsigned short CODE UNITS (UTF-16)
  6. 8.

    UNICODE AS AN OPTIONAL FEATURE

        $ python -S
        Python 2.7.3+ (2.7:f6e74759d740, Jan  1 2013, 23:06:34)
        [GCC 4.5.4] on linux2
        >>> unicode
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        NameError: name 'unicode' is not defined
        >>> type(u"Hello, World!")
        <type 'str'>
  7. 9.

    PEP 261 - SUPPORT FOR "WIDE" UNICODE CHARACTERS

        --enable-unicode=(ucs2|ucs4)

    "THIS PEP REPRESENTS THE LEAST-EFFORT SOLUTION."
  8. 11.

    KEY PYTHON 3 UNICODE CHANGES

    str is now a Unicode type
    The old str type becomes bytes
    bytes and str are not implicitly coercible
    Identifiers are Unicode
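
The changes listed on this slide can be observed directly in Python 3. The following is a minimal sketch rather than anything taken from the talk; the helper name `try_concat` and the variable names are made up for illustration.

```python
# In Python 3, str holds Unicode text and bytes holds raw octets.
text = "Schlüssel"
data = text.encode("utf-8")          # explicit str -> bytes conversion


def try_concat(a, b):
    """Return a + b, or None if Python refuses to mix the two types."""
    try:
        return a + b
    except TypeError:
        return None


mixed = try_concat(text, data)                    # str + bytes is rejected
joined = try_concat(text, data.decode("utf-8"))   # str + str works

# Identifiers may contain non-ASCII letters as well.
größe = len(text)
```

The conversion between the two types always names an encoding explicitly, which is the practical meaning of "not implicitly coercible".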
  9. 14.

    UNIX IS PROBLEMATIC

        >>> filename = b"myfile-\xef\xef.txt"
        >>> open(filename, "w").write("hi")
        >>> filename.decode("utf-8")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'utf-8' failed
        >>> filename.decode("utf-16")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'utf16' failed
  10. 16.

    PRESERVING UNDECODABLE BYTES

    On decode (e.g. os.listdir), map undecodable bytes to private use codepoints. (U+E000 to U+F8FF)
    On encode (e.g. os.stat), map private use characters back to bytes.
  11. 17.

    ROUND-TRIPPING BYTESTRING FILENAMES

        >>> filename = b"myfile-\xef\xef.txt"
        >>> new = os.fsdecode(b"myfile-\xef\xef.txt")
        >>> new
        'myfile-\udcef\udcef.txt'
        >>> os.fsencode(new)
        b"myfile-\xef\xef.txt"
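
The round trip on this slide can be reproduced without touching the filesystem. In CPython the error handler behind `os.fsdecode` and `os.fsencode` on POSIX is `surrogateescape` (PEP 383), which maps each undecodable byte 0xNN to the lone surrogate U+DCNN, consistent with the `\udcef` output shown above. The sketch below assumes UTF-8 as the encoding and calls the codec directly so the result does not depend on the platform's filesystem encoding.

```python
raw = b"myfile-\xef\xef.txt"          # not valid UTF-8

# Strict decoding fails, as in the UNIX IS PROBLEMATIC example.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape turns each undecodable byte 0xNN into U+DCNN.
name = raw.decode("utf-8", "surrogateescape")
escaped = [hex(ord(ch)) for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")
```

Because the escaped code points are lone surrogates, the resulting `str` cannot be encoded with the strict handler; it is only meant to be passed back to the operating system.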
  12. 19.

    NARROW BUILD

        >>> char = u"\U0001F07F"
        >>> len(char)
        2
        >>> unicodedata.category(char[0])
        'Cs' # Surrogate

    WIDE BUILD

        >>> char = u"\U0001F07F"
        >>> len(char)
        1
        >>> unicodedata.category(char[0])
        'So' # Symbol
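
The length of 2 on a narrow build comes from UTF-16 storing code points above U+FFFF as a surrogate pair. The sketch below, written for Python 3, computes that pair by hand; the function name `utf16_code_units` is hypothetical and not part of any library.

```python
import unicodedata


def utf16_code_units(ch):
    """Return the UTF-16 code units for a single code point."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                     # BMP: one code unit
    cp -= 0x10000                       # 20 bits remain
    high = 0xD800 + (cp >> 10)          # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)         # bottom 10 bits
    return [high, low]


char = "\U0001F07F"
units = utf16_code_units(char)

# What a narrow build reported for char[0]: the high surrogate.
first_unit_category = unicodedata.category(chr(units[0]))
```

On a narrow build, `char[0]` was the first of these two code units, which is why its category was `Cs` rather than the category of the character itself.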
  13. 20.
  14. 24.

    PEP 393 DATA REPRESENTATION

    Maximum codepoint    Data size    ASCII flag    Example
    127                  1            1             Hello, World!
    255                  1            0             Schlüssel
    65535                2            0
    1114111              4            0             Y
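
The table can be read as a rule that picks the narrowest fixed-width storage able to hold the largest code point in a string. The following sketch models that rule in pure Python; it does not inspect CPython's real object layout, and the function name `pep393_layout` is an assumption made for illustration.

```python
def pep393_layout(s):
    """Return (bytes_per_character, ascii_flag) following the table above."""
    max_cp = max(map(ord, s), default=0)
    if max_cp <= 127:
        return 1, 1      # pure ASCII: one byte per character, flag set
    if max_cp <= 255:
        return 1, 0      # Latin-1 range: still one byte per character
    if max_cp <= 65535:
        return 2, 0      # rest of the BMP: two bytes per character
    return 4, 0          # anything above U+FFFF: four bytes per character
```

For example, `pep393_layout("Schlüssel")` gives `(1, 0)`, while a single character above U+FFFF anywhere in the string moves the whole string to four bytes per character.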
  15. 27.

    EVERYTHING IS A CODEPOINT!

    Narrow vs wide builds abolished.
    len(s) gives length in codepoints.
    Indexing a string always gives a valid codepoint.
  16. 31.

    COMPLEXITY

    LINES IN CORE UNICODE IMPLEMENTATION

    3.2: 15,000
    hg tip: 20,000
  17. 32.

    COMPLEXITY

        #define PyUnicode_GET_SIZE(op) \
            (assert(PyUnicode_Check(op)), \
             (((PyASCIIObject *)(op))->wstr) ? \
              PyUnicode_WSTR_LENGTH(op) : \
              ((void)PyUnicode_AsUnicode((PyObject *)(op)), \
               assert(((PyASCIIObject *)(op))->wstr), \
               PyUnicode_WSTR_LENGTH(op)))
  18. 33.
  19. 34.

    OLD C-API

        Py_ssize_t
        count_ascii(PyObject *string)
        {
            Py_UNICODE *data = PyUnicode_AS_UNICODE(string);
            Py_ssize_t i, count = 0;
            for (i = 0; i < PyUnicode_GET_SIZE(string); i++) {
                if (data[i] <= 127)
                    count++;
            }
            return count;
        }
  20. 35.

    NEW C-API

        Py_ssize_t
        count_ascii(PyObject *string)
        {
            if (PyUnicode_READY(string) < 0)
                return -1;
            int kind = PyUnicode_KIND(string);
            void *data = PyUnicode_DATA(string);
            Py_ssize_t i, count = 0, len = PyUnicode_GET_LENGTH(string);
            for (i = 0; i < len; i++) {
                if (PyUnicode_READ(kind, data, i) <= 127)
                    count++;
            }
            return count;
        }
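
The difference from the old API is that the character width is no longer fixed, so every read has to be dispatched on the string's kind. The sketch below models that idea in pure Python as a rough analogy: `make_buffer`, `read` and the codec table are hypothetical stand-ins for `PyUnicode_KIND`, `PyUnicode_DATA` and `PyUnicode_READ`, not their actual implementation. It also assumes the string contains no lone surrogates, which the codecs used here would reject.

```python
# Hypothetical model: one fixed-width little-endian codec per "kind".
_CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def make_buffer(s):
    """Return (kind, data): the narrowest fixed-width buffer that holds s."""
    max_cp = max(map(ord, s), default=0)
    kind = 1 if max_cp <= 0xFF else 2 if max_cp <= 0xFFFF else 4
    return kind, s.encode(_CODECS[kind])


def read(kind, data, i):
    """Fetch code point i from a buffer whose items are kind bytes wide."""
    return int.from_bytes(data[i * kind:(i + 1) * kind], "little")


def count_ascii(s):
    """Count code points <= 127, mirroring the loop in the C version."""
    kind, data = make_buffer(s)
    length = len(data) // kind
    return sum(1 for i in range(length) if read(kind, data, i) <= 127)
```

For instance, `count_ascii("Schlüssel")` walks a one-byte-wide buffer and returns 8, while a string containing a character above U+FFFF is walked four bytes at a time.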
  21. 37.

    FUTURE WORK

    More performance improvements
    Unicode spec compliance
    More Unicode algorithms
    re module could use some work
  22. 38.

    LESSONS

    Global configuration options are bad.
    It's okay to start simple; evolution is possible.
    It's much easier to preserve compatibility for Python code than C-API clients.
    Optimize for the common case.
  23. 39.

    QUESTIONS?

    Further contact: benjamin@python.org