
Big Data, Small Machine

Datasets too large to fit into RAM are increasingly common, even in simple environments like Kaggle competitions. Adam will introduce some ways of dealing with this issue and demonstrate scalable machine learning techniques which are production-ready and capable of processing tens of thousands of events per second on an old laptop.

adamdrake

June 16, 2016

Transcript

  1. Big Data, Small Machine
     Data Science Singapore - 20160616
     Adam Drake
     @aadrake
     http://aadrake.com

  2. Claims:
     - RAM is growing faster than data
     - Many techniques for dealing with Big Data
     - One machine is fine for ML

  3. Big RAM is eating big data
     Big EC2 instance RAM size increases by ~50% y/y:

     Year  Type         RAM (GiB)
     2007  m1.xlarge    15
     2009  m2.4xlarge   68
     2012  hs1.8xlarge  117
     2014  r3.8xlarge   244
     2016  x1.32xlarge  1952

     ⇒ single node in-memory analytics forever!?
     Note: a Tyan FT76-B7922 holds up to 6 TB of RAM.
     Source: https://github.com/ogrisel/decks/tree/master/2016_pydata_berlin

  4. Online advertising: 0.17% CTR, or roughly 20 clicks per 10,000 views
     (10,000 x 0.0017 = 17)
     Source: http://www.smartinsights.com/internet-advertising/internet-advertising-analytics/display-advertising-clickthrough-rates/

  5. Data Source

     def getRecord(path, numFeatures):
         count = 0
         for i, line in enumerate(open(path)):
             if i == 0:
                 # do whatever you want at initialization
                 x = [0] * numFeatures  # so we don't need to create a new x every time
                 continue
             for t, feat in enumerate(line.strip().split(',')):
                 if t == 0:
                     y = feat  # assuming first position in record is some kind of label
                 else:
                     # do something with the features
                     x[t - 1] = feat
             count += 1
             yield (count, x, y)

     Or...

     import pandas as pd

     reader = pd.read_csv('blah.csv', chunksize=10000)
     for chunk in reader:
         doSomething(chunk)

  6. Stateless Feature Extraction
     Hello hashing trick... Assume data like:

     fname,lname,location
     Adam,Drake,Singapore

     features = ['fnameAdam', 'lnameDrake', 'locationSingapore']
     maxWeights = 2 ** 25

     def hashedFeatures(feats):
         hashes = [hash(x) for x in feats]
         return [x % maxWeights for x in hashes]

     print(hashedFeatures(features))
     # [18445008, 8643786, 20445187]

     These are the indices in the weights array.

  7. Incremental learning
     Just use any model in sklearn which has a partial_fit() method.

     Classification:
     - sklearn.naive_bayes.MultinomialNB
     - sklearn.naive_bayes.BernoulliNB
     - sklearn.linear_model.Perceptron
     - sklearn.linear_model.SGDClassifier
     - sklearn.linear_model.PassiveAggressiveClassifier

     Regression:
     - sklearn.linear_model.SGDRegressor
     - sklearn.linear_model.PassiveAggressiveRegressor

     Clustering:
     ...

     A minimal sketch follows.

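     For instance, a minimal out-of-core training sketch (my illustration, not
     from the deck; the synthetic chunk generator stands in for batches read
     from disk):

     import numpy as np
     from sklearn.linear_model import SGDClassifier

     clf = SGDClassifier(loss='log_loss')  # logistic regression via SGD
                                           # (the loss is named 'log' in older sklearn)
     classes = np.array([0, 1])            # partial_fit needs every class up front

     rng = np.random.default_rng(0)
     for _ in range(100):                  # stand-in for a stream of chunks from disk
         X_chunk = rng.normal(size=(1000, 20))
         y_chunk = (X_chunk[:, 0] > 0).astype(int)
         clf.partial_fit(X_chunk, y_chunk, classes=classes)
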
  8. Incremental learning contd.
     Not all models can handle stateless features, and some need to know the
     classes in advance. Check the documentation and the presence of a classes
     argument for partial_fit() (a quick programmatic check is sketched below).
     Or... just stick with SGDClassifier and SGDRegressor.

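     For instance (my sketch, not the deck's), you can inspect the signature
     directly:

     import inspect
     from sklearn.naive_bayes import MultinomialNB
     from sklearn.linear_model import SGDRegressor

     # Does partial_fit take a classes argument?
     print('classes' in inspect.signature(MultinomialNB.partial_fit).parameters)  # True
     print('classes' in inspect.signature(SGDRegressor.partial_fit).parameters)   # False
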
  9. Or write your own...

     # Turn the record into a list of hash values
     x = [0]  # 0 is the index of the bias term
     for key, value in record.items():
         index = int(value + key[1:], 16) % D  # weakest hash ever ;)
         x.append(index)

     # Get the prediction for the given record (now transformed to hash values)
     wTx = 0.
     for i in x:  # do wTx
         wTx += w[i]  # w[i] * x[i], but if i in x we got x[i] = 1.
     p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid

     # Update the loss
     p = max(min(p, 1. - 10e-12), 10e-12)
     loss += -log(p) if y == 1. else -log(1. - p)

     # Update the weights
     for i in x:
         # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
         # (p - y) * x[i] is the current gradient
         # note that in our case, if i in x then x[i] = 1
         w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
         n[i] += 1.

     Current logloss: 0.463056
     Run time: 1h06m36s

 10. Power up: Multicore
     - Use all cores
     - Lock shared memory to prevent non-atomic modifications
     A locked-increment sketch follows; the deck's shared-memory demo is on
     the next slide.

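     A minimal locking sketch (my illustration, assuming the same RawArray
     setup the deck uses next): guard the non-atomic read-modify-write with
     an explicit Lock.

     from multiprocessing import Process, Lock
     from multiprocessing.sharedctypes import RawArray

     def incr_locked(arr, i, lock):
         with lock:            # serialize the non-atomic read-modify-write
             arr[i] += 1

     if __name__ == '__main__':
         arr = RawArray('d', 10)
         lock = Lock()
         procs = [Process(target=incr_locked, args=(arr, i, lock))
                  for i in range(10)]
         for p in procs:
             p.start()
         for p in procs:
             p.join()
         print(arr[:])         # every slot incremented exactly once
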
 11. from multiprocessing.sharedctypes import RawArray
     from multiprocessing import Process
     import time
     import random

     def incr(arr, i):
         time.sleep(random.randint(1, 4))
         arr[i] += 1
         print(arr[:])

     arr = RawArray('d', 10)
     procs = [Process(target=incr, args=(arr, i)) for i in range(10)]
     for p in procs:
         p.start()
     for p in procs:
         p.join()

     '''
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
     [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
     [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
     [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
     [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
     [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
     [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
     [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
     [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
     '''

 12. from multiprocessing import Queue

     procs = [Process(target=worker, args=(q, w, n, D, alpha, loss, count,)) \
              for x in range(4)]
     for p in procs:
         p.start()

     for t, row in enumerate(DictReader(open(train))):
         q.put(row)

     (The worker is sketched below.)

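     The worker itself isn't shown in the deck; a hedged sketch of what it
     might look like, assuming it drains dict rows off the queue and applies
     the hash/predict/update steps from slide 9 (the 'label' column name and
     the None shutdown sentinel are my assumptions):

     from math import exp, sqrt

     def worker(q, w, n, D, alpha, loss, count):
         # loss/count bookkeeping omitted in this sketch
         while True:
             record = q.get()
             if record is None:               # hypothetical shutdown sentinel
                 break
             y = float(record.pop('label'))   # assumed label column name
             x = [0]                          # bias term index
             for key, value in record.items():
                 x.append(int(value + key[1:], 16) % D)
             wTx = sum(w[i] for i in x)
             p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))
             for i in x:
                 w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
                 n[i] += 1.
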
 13. (Same Queue-based fan-out code as on the previous slide.)

     But it's SLOWER, most likely because every row has to be pickled and sent
     between processes, and that overhead dwarfs the update work itself.

 14. Go

     // Hash the record values
     for i, v := range record {
         hashResult := hash([]byte(fields[i]+v)) % int(D)
         x[i+1] = int(math.Abs(float64(hashResult)))
     }

     // Get the prediction for the given record (now transformed to hash values)
     wTx := 0.0
     for _, v := range x {
         wTx += (*w)[v]
     }
     p := 1.0 / (1.0 + math.Exp(-math.Max(math.Min(wTx, 20.0), -20.0)))

     // Update the loss
     p = math.Max(math.Min(p, 1.0-math.Pow(10, -12)), math.Pow(10, -12))
     if y == 1 {
         *loss += -math.Log(p)
     } else {
         *loss += -math.Log(1.0 - p)
     }

     // Update the weights
     for _, v := range x {
         (*w)[v] = (*w)[v] - (p-float64(y))*alpha/(math.Sqrt((*n)[v])+1.0)
         (*n)[v]++
     }

 15. (The hand-rolled Python version from slide 9 again, shown side by side
     for comparison with the Go version.)

 16. Power up: Multicore
     Use all cores and lock weights to prevent non-atomic modifications.

     for i := 0; i < 4; i++ {
         wg.Add(1)
         go worker(input, fields, &w, &n, D, alpha, &loss, &count, &wg, mutex)
     }

 17. Implementation   Logloss    Run time
     Python           0.463056   1h06m36s
     Go               0.459211   8m22s
     Go (4 cores)     0.459252   7m3s

     Single-core Go is roughly 8x faster than the Python version (3996s vs
     502s); four locked cores only shave off about another minute.

 18. Stochastic Gradient Descent

     // Update the weights
     for _, v := range x {
         (*w)[v] = (*w)[v] - (p-float64(y))*alpha/(math.Sqrt((*n)[v])+1.0)
         (*n)[v]++
     }

     Source: Robbins and Monro, 1951 (restated as an equation below)

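     Restated as an equation (my paraphrase of the snippet above, using the
     deck's names: prediction p, label y, base rate alpha, and per-weight
     update count n_i), for each active hashed index i with x_i = 1:

     % per-coordinate SGD step with the adaptive learning rate heuristic
     w_i \leftarrow w_i - (p - y)\,\frac{\alpha}{\sqrt{n_i} + 1},
     \qquad n_i \leftarrow n_i + 1
     % (p - y) is the gradient of the log loss for logistic regression
     % when the feature value x_i is 1
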
 19. More efficient multicore
     How about round-robin updates? Multiple cores compute gradients, but only
     one at a time updates the weight vector. This is a bit faster (a sketch
     of the idea follows).
     Source: http://papers.nips.cc/paper/3888-slow-learners-are-fast.pdf

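     A hedged Python sketch of that idea (my illustration, not code from the
     paper or the deck): each worker computes its prediction against possibly
     stale weights without locking, and only the weight update itself is
     serialized.

     from math import exp, sqrt
     # lock is a multiprocessing.Lock() shared by all workers;
     # w and n are shared arrays (e.g. RawArray) as in the earlier slides

     def sgd_worker(rows, w, n, lock, alpha):
         # rows: iterable of (x, y) pairs, x a list of hashed feature
         # indices (including the bias index 0), y a 0/1 label
         for x, y in rows:
             wTx = sum(w[i] for i in x)   # read possibly stale weights, lock-free
             p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))
             with lock:                   # only the update is serialized
                 for i in x:
                     w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
                     n[i] += 1.
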
 20. Implementation                     Logloss    Run time   Records/sec
     Python                             0.463056   1h06m36s   11,471
     Multicore Go (4 cores, no locks)   0.455223   3m55s      195,066

     Dropping the locks yields roughly a 17x throughput improvement over the
     Python baseline (195,066 / 11,471 ≈ 17).