Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Binary COPY in Postgres

Karthik
July 25, 2015

Binary COPY in Postgres

Talk given at local python meetup about using COPY for bulk inserts.
code to the ipython notebook is here https://github.com/shrayasr/talks/tree/master/2015/july/chennaipy/beauty-of-pg-1

Karthik

July 25, 2015
Tweet

More Decks by Karthik

Other Decks in Technology

Transcript

  1. There are 2 kinds of Inserts Ones that insert a

    single row Ones that insert multiple rows
  2. Ones that insert a single row I n [ 2

    ] : r o w = { ' s o m e _ t e x t ' : " c h e n n a i P y " , ' s o m e _ i n t ' : 7 , ' s o m e _ d o u b l e ' : 3 . 1 4 , ' s o m e _ b o o l e a n ' : T r u e }
  3. I n [ 3 ] : c u r .

    e x e c u t e ( " I N S E R T I N T O r a n d o m _ t e s t _ t a b l e ( s o m e _ t e x t , s o m e _ i n t , s o m e _ d o u b l e , s o m e _ b o o l e a n ) V A L U E S ( % s , % s , % s , % s ) " , ( r o w [ ' s o m e _ t e x t ' ] , r o w [ ' s o m e _ i n t ' ] , r o w [ ' s o m e _ d o u b l e ' ] , r o w [ ' s o m e _ b o o l e a n ' ] ) )
  4. Ones that insert a multiple row I n [ 5

    ] : r o w s = ( { ' s o m e _ t e x t ' : ' c h e n n a i P y ' , ' s o m e _ i n t ' : 7 , ' s o m e _ d o u b l e ' : 3 . 1 4 , ' s o m e _ b o o l e a n ' : T r u e } , { ' s o m e _ t e x t ' : ' I M S c ' , ' s o m e _ i n t ' : 1 4 , ' s o m e _ d o u b l e ' : 6 . 2 8 , ' s o m e _ b o o l e a n ' : F a l s e } , { ' s o m e _ t e x t ' : ' J u l y 2 5 ' , ' s o m e _ i n t ' : 2 8 , ' s o m e _ d o u b l e ' : 9 . 4 2 , ' s o m e _ b o o l e a n ' : T r u e } )
  5. I n [ 6 ] : f o r r

    o w i n r o w s : c u r . e x e c u t e ( " I N S E R T I N T O r a n d o m _ t e s t _ t a b l e ( s o m e _ t e x t , s o m e _ i n t , s o m e _ d o u b l e , s o m e _ b o o l e a n ) V A L U E S ( % s , % s , % s , % s ) " , ( r o w [ ' s o m e _ t e x t ' ] , r o w [ ' s o m e _ i n t ' ] , r o w [ ' s o m e _ d o u b l e ' ] , r o w [ ' s o m e _ b o o l e a n ' ] ) )
  6. I n [ 8 ] : c u r .

    e x e c u t e m a n y ( " I N S E R T I N T O r a n d o m _ t e s t _ t a b l e ( s o m e _ t e x t , s o m e _ i n t , s o m e _ d o u b l e , s o m e _ b o o l e a n ) V A L U E S ( % ( s o m e _ t e x t ) s , % ( s o m e _ i n t ) s , % ( s o m e _ d o u b l e ) s , % ( s o m e _ b o o l e a n ) s ) " , ( r o w s ) )
  7. How I wish we had less data ( { "

    s o m e _ t e x t " : " D o n i s i p r o i d e n t l a b o r e a l i q u i p n o n . " , " s o m e _ i n t " : 3 8 0 2 6 8 , " s o m e _ d o u b l e " : 9 1 1 . 3 0 7 4 , " s o m e _ b o o l e a n " : f a l s e } , { " s o m e _ t e x t " : " D e s e r u n t l a b o r u m e a e i u s m o d e s s e m i n i m e n i m a n i m . " , " s o m e _ i n t " : 8 3 9 8 2 , " s o m e _ d o u b l e " : 1 2 1 1 . 3 5 8 8 , " s o m e _ b o o l e a n " : f a l s e } , { " s o m e _ t e x t " : " Q u i s r e p r e h e n d e r i t d e s e r u n t e t l a b o r e o c c a e c a t a u t e e s t m o l l i t q u i s . " , " s o m e _ i n t " : 6 7 0 2 6 4 , " s o m e _ d o u b l e " : 9 9 6 . 9 8 7 8 , " s o m e _ b o o l e a n " : t r u e } , { " s o m e _ t e x t " : " C u p i d a t a t n u l l a n o s t r u d m i n i m s u n t . " , " s o m e _ i n t " : 4 8 9 6 0 6 , " s o m e _ d o u b l e " : 1 1 6 5 . 5 4 1 2 , " s o m e _ b o o l e a n " : t r u e } , { " s o m e _ t e x t " : " E n i m p r o i d e n t c i l l u m a m e t s i n t c u l p a e s s e a m e t f u g i a t s u n t m i n i m s i n t e l i t e t s i t . " , " s o m e _ i n t " : 6 5 2 2 1 1 , " s o m e _ d o u b l e " : 1 1 1 1 . 1 1 9 8 , " s o m e _ b o o l e a n " : t r u e } , { " s o m e _ t e x t " : " I d v e n i a m p a r i a t u r a u t e c u p i d a t a t m o l l i t q u i s a n i m a m e t q u i s e i u s m o d p a r i a t u r s u n t d o . " , " s o m e _ i n t " : 5 9 5 0 3 5 , " s o m e _ d o u b l e " : 1 2 0 0 . 9 6 8 9 , " s o m e _ b o o l e a n " : t r u e } , { " s o m e _ t e x t " : " I n e l i t a l i q u i p m o l l i t n o n i d a u t e . " , " s o m e _ i n t " : 5 6 8 6 2 6 , " s o m e _ d o u b l e " : 9 6 1 . 6 3 8 7 , " s o m e _ b o o l e a n " : t r u e } , { " s o m e _ t e x t " : " A n i m a l i q u a q u i e x e r c i t a t i o n s i t v e l i t d e s e r u n t l a b o r u m d o r e p r e h e n d e r i t . " , " s o m e _ i n t " : 5 5 4 1 5 5 , " s o m e _ d o u b l e " : 9 8 0 . 9 9 1 , " s o m e _ b o o l e a n " : f a l s e } , { " s o m e _ t e x t " : " C u p i d a t a t e s s e i n o c c a e c a t a m e t q u i s r e p r e h e n d e r i t d o l o r e c o n s e q u a t p a r i a t u r . " , " s o m e _ i n t " : 8 5 2 7 3 8 , " s o m e _ d o u b l e " : 9 3 7 . 3 5 3 7 , " s o m e _ b o o l e a n " : f a l s e } , { " s o m e _ t e x t " : " E x c e p t e u r s u n t n o n s u n t l a b o r u m l a b o r i s p a r i a t u r a l i q u i p l a b o r e v o l u p t a t e a l i q u a i r u r e . " , " s o m e _ i n t " : 1 3 8 9 8 3 , " s o m e _ d o u b l e " : 9 4 6 . 2 2 2 1 , " s o m e _ b o o l e a n " : f a l s e } , . . . , { " s o m e _ t e x t " : " D e s e r u n t n u l l a o c c a e c a t l a b o r u m a l i q u i p m o l l i t c o n s e q u a t . " , " s o m e _ i n t " : 4 1 5 0 6 9 , " s o m e _ d o u b l e " : 9 3 5 . 2 3 2 , " s o m e _ b o o l e a n " : t r u e } , { " s o m e _ t e x t " : " E n i m s u n t a u t e e x c u p i d a t a t r e p r e h e n d e r i t e i u s m o d a m e t . " , " s o m e _ i n t " : 7 1 5 3 5 9 , " s o m e _ d o u b l e " : 9 2 6 . 8 9 5 1 , " s o m e _ b o o l e a n " : t r u e } , { " s o m e _ t e x t " : " D u i s d o l o r n o n d u i s m o l l i t e u e s t . " , " s o m e _ i n t " : 1 4 2 9 8 8 , " s o m e _ d o u b l e " : 9 4 9 . 6 2 3 8 , " s o m e _ b o o l e a n " : t r u e } ) I n [ 1 2 ] : % t i m e i n s e r t _ m u l t i _ r o w ( ) C P U t i m e s : u s e r 2 2 7 m s , s y s : 8 7 . 8 m s , t o t a l : 3 1 5 m s W a l l t i m e : 1 . 2 s
  8. The COPY command is optimized for loading large numbers of

    rows; it is less flexible than INSERT, but incurs significantly less overhead for large data loads. (http://www.postgresql.org/docs/current/interactive/populate.html)
  9. Binary # c r e a t e a n

    e w b y t e s t r e a m c y = B y t e s I O ( )
  10. Header # 1 1 - b y t e s

    e q u e n c e P G C O P Y \ n \ 3 7 7 \ r \ n \ 0 c y . w r i t e ( p a c k ( ' > 1 1 s ' , b ' P G C O P Y \ n \ 3 7 7 \ r \ n \ 0 ' ) ) # F l a g s f i e l d c y . w r i t e ( p a c k ( ' > i ' , 0 ) ) # l e n g t h i n b y t e s o f r e m a i n d e r o f h e a d e r ; c u r r e n t l y , t h i s i s z e r o c y . w r i t e ( p a c k ( ' > i ' , 0 ) )
  11. Tuples -- # 1 6 - b i t i

    n t e g e r c o u n t o f t h e n u m b e r o f f i e l d s i n t h e t u p l e # t h i s i s d o n e f o r e v e r y t u p l e c y . w r i t e ( p a c k ( ' > h ' , 4 ) ) t e x t _ l e n = l e n ( r o w [ ' s o m e _ t e x t ' ] . e n c o d e ( ' u t f - 8 ' ) ) # 3 2 - b i t l e n g t h w o r d c y . w r i t e ( p a c k ( ' > i ' , t e x t _ l e n ) ) # f o l l o w e d b y t h a t m a n y b y t e s o f f i e l d d a t a c y . w r i t e ( p a c k ( ' > ' + s t r ( t e x t _ l e n ) + ' s ' , r o w [ ' s o m e _ t e x t ' ] . e n c o d e ( ' u t f - 8 ' ) ) ) c y . w r i t e ( p a c k ( ' > i ' , 8 ) ) c y . w r i t e ( p a c k ( ' > q ' , r o w [ ' s o m e _ i n t ' ] ) ) c y . w r i t e ( p a c k ( ' > i ' , 8 ) ) c y . w r i t e ( p a c k ( ' > d ' , r o w [ ' s o m e _ d o u b l e ' ] ) ) c y . w r i t e ( p a c k ( ' > i ' , 1 ) ) c y . w r i t e ( p a c k ( ' > ? ' , r o w [ ' s o m e _ b o o l e a n ' ] ) ) https://docs.python.org/2/library/struct.html http://www.postgresql.org/docs/9.4/static/datatype-numeric.html
  12. Trailer # f i l e t r a i

    l e r c o n s i s t s o f a 1 6 - b i t i n t e g e r w o r d c o n t a i n i n g - 1 c y . w r i t e ( p a c k ( ' > h ' , - 1 ) )
  13. But is it any better ? I n [ 1

    4 ] : % t i m e i n s e r t _ v i a _ b i n a r y _ c o p y ( ) C P U t i m e s : u s e r 7 9 . 7 m s , s y s : 2 1 . 5 m s , t o t a l : 1 0 1 m s W a l l t i m e : 2 5 2 m s
  14. Closing points C O P Y is great for inserting

    large amounts data No freeform prepared statements Sometimes very poor error messages.