PyData Berlin Meetup Nov 2015 - (Some of the) things I wish I knew before starting using Python for Data Science

Lightning talk given at the PyData Berlin meetup, November 2015.

Miguel Cabrera

November 19, 2015

Transcript

  1. (Some of the) Things I wish I knew before starting using Python for Data Science. Miguel Cabrera, mfcabrera@gmail.com
  2. Background: C/Java experience; Python at the university; mostly NumPy/scikit-learn; not Pythonic.
  3. From This

  4. To This

  5. None
  6. Integration Time. You have to integrate your code into an existing code base. You have to make your code maintainable and reusable. Sometimes your code deals with semi-structured and textual data.
  7. The Things

  8. Autovivification

  9. One way. Straight out of Wikipedia:

        from collections import defaultdict

        def tree():
            return defaultdict(tree)

        common_name = tree()
        common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'
        print(common_name)

    Output:

        defaultdict(<function tree at 0x100607c80>, {'Mammalia':
          defaultdict(<function tree at 0x100607c80>, {'Primates':
            defaultdict(<function tree at 0x100607c80>, {'Homo':
              defaultdict(<function tree at 0x100607c80>, {'H. sapiens': 'human being'})})})})
  10. Another way. This question on Stack Overflow shows an alternative (maybe clearer) way:

        class Vividict(dict):
            def __missing__(self, key):
                # Called by dict on a missing key: create, store and
                # return a nested Vividict.
                value = self[key] = type(self)()
                return value

        common_name = Vividict()
        common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'

    Resulting structure:

        Mammalia : (Primates : (Homo : (H. sapiens : human being)))
  11. What for? We have this:

        id-1  a  20  10
        id-2  a  50   2
        id-1  b  -1  -5
        id-3  c  10  30
        id-2  d  -1  -2

    And let's say we would like to end up with something like:

        {"id-1": {"a": {"score_1": 20, "score_2": 10},
                  "b": {"score_1": -1, "score_2": -5}}}
  12. With a Vividict:

        import pprint

        class Vividict(dict):
            def __missing__(self, key):
                value = self[key] = type(self)()
                return value

        zombie = Vividict()
        # `table` is not defined on the slide; it is assumed to hold
        # the rows shown on the previous slide.
        for row in table:
            zombie[row[0]][row[1]]['score_1'] = row[2]
            zombie[row[0]][row[1]]['score_2'] = row[3]

        pprint.pprint(zombie)

    Output:

        {'id-1': {'a': {'score_1': 20, 'score_2': 10},
                  'b': {'score_1': -1, 'score_2': -5}},
         'id-2': {'a': {'score_1': 50, 'score_2': 2},
                  'd': {'score_1': -1, 'score_2': -2}},
         'id-3': {'c': {'score_1': 10, 'score_2': 30}}}
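A self-contained variant of the slide above, as a rough sketch. It assumes `table` is a list of 4-tuples holding the rows from slide 11 (the slides never define it) and uses the `defaultdict` tree from slide 9 instead of `Vividict`; the variable names `nested` and `plain` are illustrative only.

```python
import json
from collections import defaultdict


def tree():
    """Return a dict that creates nested trees on missing keys."""
    return defaultdict(tree)


# Assumption: the rows from slide 11, held as (id, label, score_1, score_2).
table = [
    ("id-1", "a", 20, 10),
    ("id-2", "a", 50, 2),
    ("id-1", "b", -1, -5),
    ("id-3", "c", 10, 30),
    ("id-2", "d", -1, -2),
]

nested = tree()
for row_id, label, score_1, score_2 in table:
    nested[row_id][label]["score_1"] = score_1
    nested[row_id][label]["score_2"] = score_2

# json.dumps treats a defaultdict like any other dict, so a round trip
# yields plain nested dicts without the autovivifying behaviour.
plain = json.loads(json.dumps(nested))
print(plain["id-1"])
```

One caveat with either approach: merely reading a missing key (for example `nested["id-9"]`) creates it, so converting to plain dicts before handing the structure to other code is usually a safer choice.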
  13. Iterators and Iterables

  14. What? source: http://nvie.com/posts/iterators-vs-generators/

  15. Example: a generator

        generator = (word + '!' for word in 'hit me baby one more time'.split())

        try:
            len(generator)
        except TypeError:
            print("Generators have no length!")

        for w in generator:
            print(w)

    Output:

        Generators have no length!
        hit!
        me!
        baby!
        one!
        more!
        time!
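One consequence that may be worth a small sketch: a generator can be consumed only once, while an object whose `__iter__` returns a fresh generator can be looped over repeatedly. The `Shout` class below is a hypothetical helper written for this illustration, not something from the talk.

```python
class Shout(object):
    """Iterable that yields each word with '!' appended (hypothetical helper)."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        # A new generator is created on every call, so each loop starts over.
        return (word + '!' for word in self.text.split())


generator = (word + '!' for word in 'one more time'.split())
first_pass = list(generator)
second_pass = list(generator)   # the generator is already exhausted

iterable = Shout('one more time')
first_loop = list(iterable)
second_loop = list(iterable)    # a fresh generator each time

print(first_pass, second_pass)
print(first_loop, second_loop)
```

This matters for algorithms that make several passes over the data, such as training for multiple epochs.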
  16. What does it have to do with Data Science? Data streaming through lazy evaluation. Excellent discussion: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
  17. Something more useful:

        import codecs

        class HdfsLineSentence(object):
            # self.source is not shown on the slide; it is assumed to be
            # set elsewhere to an object with an open() method.
            def __iter__(self):
                stream = self.source.open('r')
                for line in stream:
                    # Each line holds an id and a sentence separated by a tab.
                    cid, s = line.split('\t')
                    s = u" ".join(codecs.decode(word, 'utf-8', 'replace')
                                  for word in s.split())
                    s = s.split()
                    yield s
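For readers without an HDFS client at hand, the same streaming pattern can be sketched against a local file. Everything here is an assumption modelled on the slide: the class name `LineSentences`, the tab-separated "id, sentence" layout, and the demo data are made up for illustration.

```python
import io
import os
import tempfile


class LineSentences(object):
    """Stream tokenised sentences from a tab-separated file, one line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on each iteration, so the object can be
        # looped over several times without holding the data in memory.
        with io.open(self.path, 'r', encoding='utf-8', errors='replace') as stream:
            for line in stream:
                cid, sentence = line.rstrip('\n').split('\t', 1)
                yield sentence.split()


# Tiny demo file; the ids and sentences are made up for illustration.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write('c1\tgreat hotel near the beach\n')
    tmp.write('c2\tnoisy room\n')

sentences = LineSentences(tmp.name)
tokens_per_line = [len(tokens) for tokens in sentences]
all_tokens = [tokens for tokens in sentences]   # a second pass works too
os.remove(tmp.name)
print(tokens_per_line)
```

Because only one line is in memory at a time, the same class should work unchanged on files far larger than RAM.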
  18. NamedTuples

  19. Why? Many Python developers write code around the dict class or tuples. You never know what to expect, and the code becomes hard to read. From http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python:

        pt1 = (1.0, 5.0)
        pt2 = (2.5, 1.5)

        from math import sqrt
        line_length = sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
  20. Enter NamedTuples. Named tuples assign meaning to each position in a tuple and allow for more readable, self-documenting code. They can be used wherever regular tuples are used, and they add the ability to access fields by name instead of position index.

        from collections import namedtuple
        Point = namedtuple('Point', 'x y')
        pt1 = Point(1.0, 5.0)
        pt2 = Point(2.5, 1.5)

        from math import sqrt
        line_length = sqrt((pt1.x - pt2.x)**2 + (pt1.y - pt2.y)**2)
  21. NamedTuples provide cool methods. Some of them:

        Name              Description
        _asdict           Return a new OrderedDict which maps field names to their values.
        _make(iterable)   Class method that makes a new instance from an existing sequence or iterable.
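A short sketch of these methods in use, reusing the `Point` type from slide 20. The `row` data is made up, and `_replace` is a further namedtuple method that the slide does not list.

```python
from collections import namedtuple

Point = namedtuple('Point', 'x y')

# _make builds an instance from any iterable, e.g. a parsed CSV row.
row = ['1.0', '5.0']
pt = Point._make(float(value) for value in row)

# _asdict maps field names to values, handy for JSON or DataFrame rows.
as_mapping = dict(pt._asdict())

# _replace returns a modified copy; the original tuple is unchanged.
moved = pt._replace(x=2.5)

print(pt, as_mapping, moved)
```

Note that since Python 3.8 `_asdict` returns a regular `dict` rather than an `OrderedDict`; wrapping the result in `dict(...)` as above behaves the same on both.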
  22. You can extend a NamedTuple:

        _HotelBase = namedtuple(
            'HotelDescriptor',
            ['cluster_id', 'trust_score', 'reviews_count',
             'category_scores', 'intensity_factors'],
        )

        class HotelDescriptor(_HotelBase):
            def compute_prior(self):
                if not self.trust_score or not self.reviews_count:
                    raise NotEnoughDataForRanking(
                        "Cannot compute prior without ty score and reviews")
                return _compute_prior(self.trust_score, self.reviews_count)

            (...)
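The hotel example depends on helpers that the slide does not show, so here is the same subclassing pattern in a self-contained form, reusing the `Point` example from slide 20. The `distance_to` method is a hypothetical addition for illustration.

```python
from collections import namedtuple
from math import sqrt


class Point(namedtuple('Point', 'x y')):
    # Empty __slots__ keeps instances as light as a plain tuple
    # by preventing a per-instance __dict__.
    __slots__ = ()

    def distance_to(self, other):
        """Euclidean distance to another point."""
        return sqrt((self.x - other.x) ** 2 + (self.y - other.y) ** 2)


pt1 = Point(1.0, 5.0)
pt2 = Point(2.5, 1.5)
line_length = pt1.distance_to(pt2)
print(line_length)
```

The subclass is still an immutable tuple, so it can be unpacked, hashed and compared like any other tuple while carrying domain logic as methods.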
  23. Conclusion. (Aspiring) Data Scientists / Engineers should learn: the standard library (the collections module in particular); iterables and iterators; object-oriented practices; documenting your code; how to package; exposing your models (e.g. via an API).
  24. Questions?

  25. Created by Miguel Cabrera.