Expressive Parallel Analytics with Clojure

Henry Garner
December 04, 2015

Sharing experience gained from his work on a mission-critical data product earlier this year, Henry will speak about some newer features of Clojure that enable data scientists to write concise, expressive and performant data processing code. He’ll explore transducers and reducing functions, and show how simple functional combinators can make even sophisticated analytical code both faster and easier to comprehend.

Henry Garner is a freelance data engineer working primarily in Clojure. He’s the author of the Packt book 'Clojure for Data Science' and managed to squeeze the buzzwords 'big data' and 'machine learning' onto the cover. And also into this biography.

Transcript

  1. Analytic sequence
     1. Load & join
     2. Apply rules
        1. Apply filters
        2. Normalise data
           1. Harmonise units
           2. Summary statistics
           3. Harmonise ranges
        3. Calculate score
     3. Output
     4. x 13 x 7
  2. (load-data "data.edn")
     ;; ({:name "A", :fx 0.8, :a 90, :b 50}
     ;;  {:name "B", :fx 0.2, :a 80, :b 80}
     ;;  {:name "C", :fx 0.1, :a 60, :b 40}
     ;;  {:name "D", :fx 0.5, :a 50, :b 70})
  3. (->> (load-data "data.edn")
          (filter relevant?)
          (map convert-currency)
          (map assign-score))
     ;; ({:name "A", :fx 0.8, :a 112.5, :b 62.5, :score 175.0}
     ;;  {:name "B", :fx 0.2, :a 400.0, :b 400.0, :score 800.0}
     ;;  {:name "D", :fx 0.5, :a 100.0, :b 140.0, :score 240.0})
  4. (filter relevant?)
  5. (def xform
       (comp (filter relevant?)
             (map convert-currency)
             (map assign-score)))
  6. (sequence xform (load-data "data.edn"))
     ;; ({:name "A", :fx 0.8, :a 112.5, :b 62.5, :score 175.0}
     ;;  {:name "B", :fx 0.2, :a 400.0, :b 400.0, :score 800.0}
     ;;  {:name "D", :fx 0.5, :a 100.0, :b 140.0, :score 240.0})
  7. (->> (load-data "data.edn")
          (sequence (comp xform (take 2))))
     ;; ({:name "A", :fx 0.8, :a 112.5, :b 62.5, :score 175.0}
     ;;  {:name "B", :fx 0.2, :a 400.0, :b 400.0, :score 800.0})

     (->> (load-data "data.edn")
          (sequence (comp xform (map :score))))
     ;; (175.0 800.0 240.0)
  8. (def scores
       (comp xform (map :score)))

     (->> (load-data "data.edn")
          (transduce scores +))
     ;; 1215.0
  9. (defn mean [xs]
       (let [sum (reduce + xs)
             count (count xs)]
         (when-not (zero? count)
           (/ sum count))))
  10. (defn mean [accum x]
        (-> (update-in accum [:sum] + x)
            (update-in [:count] inc)))

      (reduce mean (range 10))
      ;; => ...?
  11. 1. Unhandled java.lang.NullPointerException
         (No message)
               Numbers.java: 1013  clojure.lang.Numbers/ops
               Numbers.java:  112  clojure.lang.Numbers/inc
                   core.clj:  892  clojure.core/inc
                   AFn.java:  154  clojure.lang.AFn/applyToHelper
                   AFn.java:  144  clojure.lang.AFn/applyTo
                   core.clj:  632  clojure.core/apply
                   core.clj: 5923  clojure.core/update-in
                RestFn.java:  445  clojure.lang.RestFn/invoke
                  sweet.clj:  242  example.sweet/mean-reducer
             LongRange.java:  222  clojure.lang.LongRange/reduce
                   core.clj: 6514  clojure.core/reduce
                       REPL:    1  example.sweet/eval28337
  12. (reduce mean {:sum 0 :count 0} (range 10))
      ;; => {:sum 45, :count 10}
  13. (defn mean
        ;; Init
        ([] {:sum 0 :count 0})
        ;; Step
        ([accum x]
         (-> (update-in accum [:count] inc)
             (update-in [:sum] + x))))

      (reduce mean (mean) (range 10))
      ;; => {:sum 45, :count 10}
  14. (defn mean
        ;; Init
        ([] {:sum 0 :count 0})
        ;; Step
        ([accum x]
         (-> (update-in accum [:count] inc)
             (update-in [:sum] + x)))
        ;; Complete
        ([{:keys [sum count]}]
         (when-not (zero? count)
           (/ sum count))))

      (mean (reduce mean (mean) (range 10)))
      ;; => 9/2
  15. (transduce (map identity) mean (range 10))
      ;; => 9/2
  16. (defn identity-transducer [rf]
        (fn
          ([] (rf))               ;; Init
          ([acc] (rf acc))        ;; Complete
          ([acc x] (rf acc x))))  ;; Step

      (defn identity-transducer [rf] rf)

      (def identity-transducer identity)
  17. (transduce identity mean (range 10))
      ;; => 9/2
  18. https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

      (defn variance
        ;; Init
        ([] [0 0 0])
        ;; Step
        ([[count mean sum-of-squares] x]
         (let [count' (inc count)
               mean' (+ mean (/ (- x mean) count'))]
           [count' mean'
            (+ sum-of-squares (* (- x mean') (- x mean)))]))
        ;; Complete
        ([[count mean sum-of-squares]]
         (/ sum-of-squares (max 1 (dec count)))))
  19. (->> (load-data "data.edn")
           (transduce scores variance))
      ;; => 118075.0
  20. (def standard-deviation
        (completing variance #(Math/sqrt (variance %))))

      (->> (load-data "data.edn")
           (transduce scores standard-deviation))
      ;; => 343.6204301260331
  21. (->> (load-data "data.edn")
           (map (juxt :a :b)))
      ;; ([90 50] [80 80] [60 40] [50 70])
  22. (->> (load-data "data.edn")
           (transduce scores (juxt mean standard-deviation)))
      ;; => ?
  23. 1. Unhandled java.lang.NullPointerException
         (No message)
               Numbers.java: 1013  clojure.lang.Numbers/ops
               Numbers.java:  112  clojure.lang.Numbers/inc
                   core.clj:  892  clojure.core/inc
                   AFn.java:  154  clojure.lang.AFn/applyToHelper
                   AFn.java:  144  clojure.lang.AFn/applyTo
                   core.clj:  632  clojure.core/apply
                   core.clj: 5923  clojure.core/update-in
                RestFn.java:  445  clojure.lang.RestFn/invoke
                  sweet.clj:   82  example.sweet/mean
                   core.clj: 2464  clojure.core/juxt/fn
                   core.clj: 2611  clojure.core/map/fn/fn
                   core.clj: 2611  clojure.core/map/fn/fn
                   core.clj: 2611  clojure.core/map/fn/fn
                   core.clj: 2675  clojure.core/filter/fn/fn
              protocols.clj:  167  clojure.core.protocols/fn
              protocols.clj:   19  clojure.core.protocols/fn/G
              protocols.clj:   31  clojure.core.protocols/seq-reduce
              protocols.clj:  101  clojure.core.protocols/fn
  24. (defn juxt-r [& rfns]
        (fn
          ([] (mapv (fn [f] (f)) rfns))
          ([acc] (mapv (fn [f a] (f a)) rfns acc))
          ([acc x] (mapv (fn [f a] (f a x)) rfns acc))))

      (def rf (juxt-r + conj))

      (transduce identity rf (range 10))
      ;; => [45 [0 1 2 3 4 5 6 7 8 9]]
  25. (def rf (juxt-r + ((take 3) conj)))

      (transduce identity rf (range 10))
      ;; => ...?
  26. (def rf (juxt-r + ((take 3) conj)))

      (transduce identity rf (range 10))
      ;; => [45 #object[clojure.lang.Reduced {:status :ready,
      ;;                                      :val [0 1 2]}]]
  27. (defn take [n]
        (fn [rf]
          (let [nv (volatile! n)]
            (fn
              ([] (rf))
              ([result] (rf result))
              ([result input]
               (let [n @nv
                     nn (vswap! nv dec)
                     result (if (pos? n)
                              (rf result input)
                              result)]
                 (if (not (pos? nn))
                   (ensure-reduced result)
                   result)))))))
  28. (defn juxt-r [& rfns]
        (fn
          ([] (mapv (fn [f] (f)) rfns))
          ([acc] (mapv (fn [f a] (f (unreduced a))) rfns acc))
          ([acc x]
           (let [all-reduced? (volatile! true)
                 results (mapv (fn [f a]
                                 (if-not (reduced? a)
                                   (do (vreset! all-reduced? false)
                                       (f a x))
                                   a))
                               rfns acc)]
             (if @all-reduced? (reduced results) results)))))
  29. (def rf (juxt-r + ((take 3) conj)))

      (transduce identity rf (range 10))
      ;; => [45 [0 1 2]]

      …but…

      (transduce identity rf (range 10))
      ;; => [45 []]
  30. (def rf ((map inc) +))

      (transduce identity rf (range 10))
      ;; => 55
  31. (defn facet [rf fns]
        (->> (map (fn [f] ((map f) rf)) fns)
             (apply juxt-r)))

      (def rf (facet + [:a :b]))

      (->> (load-data "data.edn")
           (transduce identity rf))
      ;; => [280 240]
  32. (defn weighted-mean [nf df]
        (let [rf (facet mean [nf df])]
          (completing rf (fn [x]
                           (let [[n d] (rf x)]
                             (when-not (zero? d)
                               (/ n d)))))))

      (->> (load-data "data.edn")
           (transduce identity (weighted-mean :a :b)))
      ;; => 7/6
  33. (defn fuse [kvs]
        (let [rfns (vals kvs)
              rf (apply juxt-r rfns)]
          (completing rf #(zipmap (keys kvs) (rf %)))))

      (def rf (fuse {:mean mean
                     :sd standard-deviation}))

      (->> (load-data "data.edn")
           (transduce (map :a) rf))
      ;; => {:mean 70, :sd 18.257418583505537}
  34. (def rf
        (fuse {:mean-score ((map :score) mean)
               :fields (facet
                        (fuse {:mean mean
                               :sd standard-deviation})
                        [:a :b])}))

      (->> (load-data "data.edn")
           (transduce xform rf))
      ;; {:mean-score 405.0,
      ;;  :fields [{:mean 204.16666666666666,
      ;;            :sd 169.71176545346918}
      ;;           {:mean 200.83333333333334,
      ;;            :sd 176.78258775494078}]}
  35. Reducers

      solve(problem):
          if problem is small enough:
              solve problem directly (sequential algorithm)
          else:
              for part in subdivide(problem)
                  fork subtask to solve part
              join all subtasks spawned in previous loop
              combine results from subtasks
  36. (import '[org.HdrHistogram DoubleHistogram])

      (defn iqr-reducer
        ([] (DoubleHistogram. 1e8 3))
        ([hist x] (doto hist (.recordValue x)))
        ([hist] hist))

      (defn iqr-combiner
        ([] (DoubleHistogram. 1e8 3))
        ([a b] (doto a (.add b)))
        ([hist] (vector (.getValueAtPercentile hist 25)
                        (.getValueAtPercentile hist 75))))
  37. (require '[clojure.core.reducers :as r])

      (->> (load-data "data.edn")
           (eduction xform (map :score))
           (r/fold iqr-combiner iqr-reducer))
      ;; => #object[org.HdrHistogram.DoubleHistogram]
  38. (require '[clojure.core.reducers :as r])

      (->> (load-data "data.edn")
           (eduction xform (map :score))
           (r/fold iqr-combiner iqr-reducer)
           (iqr-combiner))
      ;; => [175.0 240.0]
  39. (require '[clojure.core.async :as async])

      (defn fold [n xform reducef combinef in]
        (let [reduced (async/chan n)
              f (xform reducef)]
          (->> (for [_ (range n)]
                 (async/reduce f (f) in))
               (async/merge)
               (async/pipeline n reduced (map f)))
          (async/go
            (->> (async/reduce combinef (combinef) reduced)
                 (async/<!)
                 (combinef)))))
  40. (def data (take 100000 (cycle (load-data "data.edn"))))

      (quick-bench
       (->> (async/to-chan data)
            (fold 8 (comp xform (map :score))
                  histogram-reducer
                  histogram-combiner)
            (async/<!!)))
      ;; Execution time mean : 162.811354 ms
      ;; Execution time std-deviation : 168.664279 ms
  41. (quick-bench
       (->> (eduction xform (map :score) data)
            (r/fold 8 histogram-combiner histogram-reducer)))
      ;; Execution time mean : 50.593113 ms
      ;; Execution time std-deviation : 2.644261 ms
  42. Summary
      Step functions
        init, step, complete
        reduced?
        composition: juxt-r, facet, fuse
      Transducible contexts:
        sequence, transduce, eduction
        fold
        pipeline
  43. References
      https://github.com/cgrand/xforms
        reduce, into, by-key, partition, pad, for and window
        str, str!, avg, count, juxt, juxt-map and first
      https://github.com/aphyr/tesser
        correlation, variance, covariance, standard-deviation
      https://tbaldridge.pivotshare.com/
        Logic Programming, Core.Async, Transients, and more
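
The fork/join outline on the Reducers slide can be tried without the HdrHistogram dependency. The following is a minimal sketch, not code from the talk: the names `mean-reducer`, `mean-combiner` and `parallel-mean`, and the partition size of 512, are assumptions chosen for illustration. It relies only on `clojure.core.reducers/fold`, which splits a vector into partitions, reduces each one with the reducing function, and merges the partial results with the combining function.

```clojure
(require '[clojure.core.reducers :as r])

;; Reducing function (hypothetical name): folds one value into a
;; partial {:sum :count} accumulator within a single partition.
(defn mean-reducer
  [acc x]
  (-> acc
      (update :sum + x)
      (update :count inc)))

;; Combining function (hypothetical name): the zero-arity call seeds
;; each partition, the two-arity call merges two partial accumulators.
(defn mean-combiner
  ([] {:sum 0 :count 0})
  ([a b] (merge-with + a b)))

(defn parallel-mean
  "Mean of a vector of numbers, or nil when the vector is empty.
  512 is an arbitrary partition size picked for this sketch."
  [xs]
  (let [{:keys [sum count]} (r/fold 512 mean-combiner mean-reducer xs)]
    (when-not (zero? count)
      (/ sum count))))

(parallel-mean (vec (range 10000)))
;; => 9999/2
```

Passing a vector matters here: `r/fold` only parallelises foldable collections such as vectors and maps, and falls back to a sequential reduce for lazy sequences.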