Python's Unicode Internals by Benjamin Peterson

PyCon 2013

March 15, 2013

Transcript

  1. “Modern programs must handle Unicode - Python has excellent support for Unicode, and will keep getting better.” - GvR
  2. PURPOSES
    Explain the history of Python's Unicode support.
    Examine in depth the current Unicode implementation.
  3. GENESIS - PEP 100 (PYTHON 2.0)
    unicode type
    codecs module
    str <-> unicode coercion model
    Simple (4.5K loc)
  4. DATA FORMAT
    ARRAY OF UNSIGNED SHORT CODEUNITS (UTF-16)
  5. UNICODE AS AN OPTIONAL FEATURE
    $ python -S
    Python 2.7.3+ (2.7:f6e74759d740, Jan 1 2013, 23:06:34)
    [GCC 4.5.4] on linux2
    >>> unicode
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'unicode' is not defined
    >>> type(u"Hello, World!")
    <type 'str'>
  6. PEP 261 - SUPPORT FOR "WIDE" UNICODE CHARACTERS
    --enable-unicode=(ucs2|ucs4)
    "THIS PEP REPRESENTS THE LEAST-EFFORT SOLUTION."
  7. KEY PYTHON 3 UNICODE CHANGES
    str is now a Unicode type.
    The old str type becomes bytes.
    bytes and str are not implicitly coercible.
    Identifiers are Unicode.
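The changes on this slide can be tried out directly in Python 3. The following is a minimal sketch; the sample string and the helper name `try_concat` are illustrative and do not come from the talk.

```python
# Python 3: str holds code points, bytes holds raw octets.
text = "Schlüssel"
data = text.encode("utf-8")        # explicit str -> bytes conversion


def try_concat(a, b):
    """Return a + b, or the name of the exception that the attempt raised."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


mixed = try_concat(text, data)     # str + bytes is never coerced implicitly
round_trip = data.decode("utf-8")  # explicit bytes -> str conversion

# Identifiers may contain non-ASCII letters.
größe = len(text)

print(type(text).__name__, type(data).__name__)
print(mixed)
print(round_trip == text, größe, len(data))
```

The UTF-8 form is one byte longer than the string because "ü" encodes to two bytes.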
  8. UNIX IS PROBLEMATIC
    >>> filename = b"myfile-\xef\xef.txt"
    >>> open(filename, "w").write("hi")
    >>> filename.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' failed
    >>> filename.decode("utf-16")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf16' failed
  9. PRESERVING UNDECODABLE BYTES
    On decode (e.g. os.listdir), map undecodable bytes to private use codepoints (U+E000 to U+F8FF).
    On encode (e.g. os.stat), map private use characters back to bytes.
  10. ROUND-TRIPPING BYTESTRING FILENAMES
    >>> filename = b"myfile-\xef\xef.txt"
    >>> new = os.fsdecode(b"myfile-\xef\xef.txt")
    >>> new
    'myfile-\udcef\udcef.txt'
    >>> os.fsencode(new)
    b"myfile-\xef\xef.txt"
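The `\udcef` escapes in the session above are lone surrogates in the range U+DC80 to U+DCFF, which is what the `surrogateescape` error handler (PEP 383) produces. The sketch below applies that handler explicitly with the UTF-8 codec. It does not call `os.fsdecode()` itself, because the codec that function uses depends on the platform and locale. The variable names are illustrative.

```python
raw = b"myfile-\xef\xef.txt"   # the two 0xEF bytes are not valid UTF-8

# Decode, turning each undecodable byte 0xNN into the lone surrogate U+DCNN.
decoded = raw.decode("utf-8", "surrogateescape")

# Collect the escape code points that stand in for the bad bytes.
escaped = [hex(ord(c)) for c in decoded if 0xDC80 <= ord(c) <= 0xDCFF]

# Encode again with the same handler to get the original bytes back.
restored = decoded.encode("utf-8", "surrogateescape")

print(escaped)
print(restored == raw)
```

Encoding the decoded name without the handler raises `UnicodeEncodeError`, so an escaped filename cannot silently leak into strict UTF-8 output.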
  11. NARROW BUILD
    >>> char = u"\U0001F07F"
    >>> len(char)
    2
    >>> unicodedata.category(char[0])
    'Cs' # Surrogate
    WIDE BUILD
    >>> char = u"\U0001F07F"
    >>> len(char)
    1
    >>> unicodedata.category(char[0])
    'So' # Symbol
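The length of 2 on a narrow build comes from UTF-16 storing code points above U+FFFF as a surrogate pair. A small sketch of that split follows; the function name `to_surrogate_pair` is made up for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F07F)
print(hex(high), hex(low))

# Cross-check against the built-in UTF-16 codec (big-endian, no BOM).
print("\U0001F07F".encode("utf-16-be").hex())
```

On a narrow build, `char[0]` is the high surrogate, which is why its category is reported as `Cs`.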
  12. PEP 393 DATA REPRESENTATION
    Maximum codepoint   Data size   ASCII flag   Example
    127                 1           1            Hello, World!
    255                 1           0            Schlüssel
    65535               2           0
    1114111             4           0            Y
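The table can be read as a rule that picks the storage width from the largest code point in a string. The sketch below restates that rule in Python. It models the table only and is not CPython's actual selection code, and the function name `pep393_kind` is hypothetical.

```python
def pep393_kind(s):
    """Return (bytes per code point, ASCII flag) for s, following the table."""
    max_cp = max(map(ord, s), default=0)   # empty strings count as ASCII
    if max_cp <= 127:
        return 1, 1
    if max_cp <= 255:
        return 1, 0
    if max_cp <= 65535:
        return 2, 0
    return 4, 0


# The euro sign and the domino tile are illustrative inputs, not from the slide.
for sample in ("Hello, World!", "Schlüssel", "\u20ac", "\U0001F07F"):
    print(repr(sample), pep393_kind(sample))
```

A single wide character therefore widens the storage of the whole string, which is the cost of keeping every code point at a fixed width.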
  13. EVERYTHING IS A CODEPOINT!
    Narrow vs wide builds abolished.
    len(s) gives length in codepoints.
    Indexing a string always gives a valid codepoint.
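These guarantees can be checked on Python 3.3 or later with the same character used in the narrow and wide build comparison. This is a short sketch; the surrounding letters are added only to make the indexing visible.

```python
import unicodedata

s = "a\U0001F07Fb"

# Length counts code points, not UTF-16 code units.
length = len(s)

# Indexing yields the whole supplementary-plane character, never half a pair.
middle = s[1]
categories = [unicodedata.category(ch) for ch in s]

print(length, hex(ord(middle)), categories)
```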
  14. COMPLEXITY
    LINES IN CORE UNICODE IMPLEMENTATION
    3.2: 15,000
    hg tip: 20,000
  15. COMPLEXITY
    #define PyUnicode_GET_SIZE(op) \
        (assert(PyUnicode_Check(op)), \
         (((PyASCIIObject *)(op))->wstr) ? \
          PyUnicode_WSTR_LENGTH(op) : \
          ((void)PyUnicode_AsUnicode((PyObject *)(op)), \
           assert(((PyASCIIObject *)(op))->wstr), \
           PyUnicode_WSTR_LENGTH(op)))
  16. OLD C-API
    Py_ssize_t count_ascii(PyObject *string) {
        Py_UNICODE *data = PyUnicode_AS_UNICODE(string);
        Py_ssize_t i, count = 0;
        for (i = 0; i < PyUnicode_GET_SIZE(string); i++) {
            if (data[i] <= 127)
                count++;
        }
        return count;
    }
  17. NEW C-API
    Py_ssize_t count_ascii(PyObject *string) {
        if (PyUnicode_READY(string) < 0)
            return -1;
        int kind = PyUnicode_KIND(string);
        void *data = PyUnicode_DATA(string);
        Py_ssize_t i, count = 0, len = PyUnicode_GET_LENGTH(string);
        for (i = 0; i < len; i++) {
            if (PyUnicode_READ(kind, data, i) <= 127)
                count++;
        }
        return count;
    }
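For readers who do not write C extensions, the kind/data/index access pattern of the new API can be modelled in pure Python. This is a rough sketch under stated assumptions: the helpers `to_kind_and_data` and `read` are hypothetical stand-ins for `PyUnicode_KIND`/`PyUnicode_DATA` and `PyUnicode_READ`, and the buffer is laid out little-endian for simplicity, whereas CPython uses the machine's native byte order.

```python
def to_kind_and_data(s):
    """Pick the narrowest fixed width (1, 2 or 4 bytes) that fits every code point."""
    max_cp = max(map(ord, s), default=0)
    kind = 1 if max_cp <= 0xFF else 2 if max_cp <= 0xFFFF else 4
    # Assumption: little-endian layout, chosen only to keep the model simple.
    data = b"".join(ord(ch).to_bytes(kind, "little") for ch in s)
    return kind, data


def read(kind, data, i):
    """Model of PyUnicode_READ: fetch code point i from fixed-width storage."""
    return int.from_bytes(data[i * kind:(i + 1) * kind], "little")


def count_ascii(s):
    """Count code points <= 127, mirroring the loop in the C function."""
    kind, data = to_kind_and_data(s)
    length = len(data) // kind
    return sum(1 for i in range(length) if read(kind, data, i) <= 127)


print(count_ascii("Hello, World!"), count_ascii("Schlüssel"), count_ascii("a\U0001F07Fb"))
```

The caller never needs to know the element width up front, which is the point of passing `kind` to every read.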
  18. FUTURE WORK
    More performance improvements
    Unicode spec compliance
    More Unicode algorithms
    re module could use some work
  19. LESSONS
    Global configuration options are bad.
    It's okay to start simple; evolution is possible.
    It's much easier to preserve compatibility for Python code than C-API clients.
    Optimize for the common case.
  20. QUESTIONS?
    Further contact: benjamin@python.org