Slide 1

Slide 1 text

(Some of the) Things I wish I knew before starting using Python for Data Science
Miguel Cabrera

Slide 2

Slide 2 text

Background
- C/Java experience
- Python at the university
- Mostly NumPy/scikit-learn
- Not Pythonic

Slide 3

Slide 3 text

From This

Slide 4

Slide 4 text

To This

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Integration Time
- You have to integrate your code into an existing code base.
- You have to make your code maintainable and reusable.
- Sometimes your code deals with semi-structured and textual data.

Slide 7

Slide 7 text

The Things

Slide 8

Slide 8 text

Autovivification

Slide 9

Slide 9 text

One way
Straight out of Wikipedia:

    from collections import defaultdict

    def tree():
        return defaultdict(tree)

    common_name = tree()
    common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'
    print(common_name)

Output:

    defaultdict(<function tree at 0x100607c80>,
                {'Mammalia': defaultdict(<function tree at 0x100607c80>,
                 {'Primates': defaultdict(<function tree at 0x100607c80>,
                  {'Homo': defaultdict(<function tree at 0x100607c80>,
                   {'H. sapiens': 'human being'})})})})

Slide 10

Slide 10 text

Another Way
This question on Stack Overflow shows an alternative (maybe clearer) way:

    class Vividict(dict):
        def __missing__(self, key):
            # called by dict on a missing key: create and store a nested Vividict
            value = self[key] = type(self)()
            return value

    common_name = Vividict()
    common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'
    print(common_name)

Output:

    {'Mammalia': {'Primates': {'Homo': {'H. sapiens': 'human being'}}}}

Slide 11

Slide 11 text

What for?
We have this:

    id-1  a  20  10
    id-2  a  50   2
    id-1  b  -1  -5
    id-3  c  10  30
    id-2  d  -1  -2

And let's say we would like to end up with something like:

    {
        "id-1": {
            "a": {"score_1": 20, "score_2": 10},
            "b": {"score_1": -1, "score_2": -5}
        }
    }

Slide 12

Slide 12 text

With a ViviDict

    import pprint

    class Vividict(dict):
        def __missing__(self, key):
            value = self[key] = type(self)()
            return value

    # rows of the table from the previous slide
    table = [
        ('id-1', 'a', 20, 10),
        ('id-2', 'a', 50, 2),
        ('id-1', 'b', -1, -5),
        ('id-3', 'c', 10, 30),
        ('id-2', 'd', -1, -2),
    ]

    zombie = Vividict()
    for row in table:
        zombie[row[0]][row[1]]['score_1'] = row[2]
        zombie[row[0]][row[1]]['score_2'] = row[3]

    pprint.pprint(zombie)

Output:

    {'id-1': {'a': {'score_1': 20, 'score_2': 10},
              'b': {'score_1': -1, 'score_2': -5}},
     'id-2': {'a': {'score_1': 50, 'score_2': 2},
              'd': {'score_1': -1, 'score_2': -2}},
     'id-3': {'c': {'score_1': 10, 'score_2': 30}}}
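The same grouping can also be sketched with the defaultdict-based tree() helper from the Wikipedia example. The snippet below is only an illustration: the variable names (scores, as_json, plain) are made up here, and the JSON round-trip is added to show one way of getting plain dictionaries back.

```python
import json
from collections import defaultdict


def tree():
    """Return a defaultdict whose missing keys become new trees."""
    return defaultdict(tree)


# Rows copied from the "What for?" slide: (id, label, score_1, score_2).
table = [
    ("id-1", "a", 20, 10),
    ("id-2", "a", 50, 2),
    ("id-1", "b", -1, -5),
    ("id-3", "c", 10, 30),
    ("id-2", "d", -1, -2),
]

scores = tree()
for row_id, label, score_1, score_2 in table:
    scores[row_id][label]["score_1"] = score_1
    scores[row_id][label]["score_2"] = score_2

# json.dumps accepts dict subclasses, so the nested defaultdicts
# serialize like ordinary dictionaries.
as_json = json.dumps(scores, sort_keys=True)
plain = json.loads(as_json)  # the round-trip yields plain dicts
print(plain["id-1"])
```

One thing to keep in mind with either approach: merely reading a missing key (for example scores["id-9"]) creates it, so membership checks should use the in operator.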

Slide 13

Slide 13 text

Iterators and Iterables

Slide 14

Slide 14 text

What?
Source: http://nvie.com/posts/iterators-vs-generators/
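As a rough illustration of the distinction (not taken from the slides): an iterable is anything iter() can be called on, while an iterator is the object that next() advances and that is eventually exhausted. The variable names below are arbitrary.

```python
words = ["hit", "me", "baby"]   # a list is an iterable, not an iterator

iterator = iter(words)          # iter() asks the iterable for a fresh iterator
first = next(iterator)          # next() advances the iterator by one item
rest = list(iterator)           # consuming the remainder exhausts it
leftover = list(iterator)       # an exhausted iterator yields nothing more

# The iterable itself is unaffected and can hand out a new iterator.
again = list(iter(words))

print(first, rest, leftover, again)
```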

Slide 15

Slide 15 text

Example: A Generator

    generator = (word + '!' for word in 'hit me baby one more time'.split())

    try:
        len(generator)
    except TypeError:
        print("Generators have no length!")

    for w in generator:
        print(w)

Output:

    Generators have no length!
    hit!
    me!
    baby!
    one!
    more!
    time!

Slide 16

Slide 16 text

What does it have to do with Data Science?
Data Streaming through Lazy Evaluation
Excellent discussion: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
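A minimal sketch of what lazy evaluation buys you, assuming a made-up endless source of tab-separated lines (numbered_lines and parse are hypothetical helpers, not from the talk): each stage is a generator, so only the items actually requested are ever produced.

```python
import itertools


def numbered_lines():
    """Yield an endless stream of fake log lines, one at a time."""
    for n in itertools.count(1):
        yield "line %d\tsome text" % n


def parse(lines):
    """Lazily split each tab-separated line into (id, text)."""
    for line in lines:
        line_id, text = line.split("\t")
        yield line_id, text


# Nothing is computed until items are requested; islice pulls only three.
first_three = list(itertools.islice(parse(numbered_lines()), 3))
print(first_three)
```

Because the source never ends, building a list of all lines first would never finish; the generator pipeline keeps one line in memory at a time.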

Slide 17

Slide 17 text

Something more useful

    import codecs

    class HdfsLineSentence(object):
        def __init__(self, source):
            # constructor not shown on the slide; assumed to store the source
            self.source = source

        def __iter__(self):
            stream = self.source.open('r')
            for line in stream:
                # each line holds an id and a sentence separated by a tab
                cid, s = line.split('\t')
                # decode word by word, replacing invalid bytes (expects byte strings)
                s = u" ".join(codecs.decode(word, 'utf-8', 'replace') for word in s.split())
                s = s.split()
                yield s
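Putting the loop in __iter__ rather than in a one-off generator makes the object restartable: every pass reopens the source. Below is a self-contained sketch of the same pattern over in-memory text; the class name LineSentences and the opener argument are invented for illustration and stand in for the HDFS source above.

```python
import io


class LineSentences(object):
    """Iterable that re-reads its source on every pass (hypothetical name)."""

    def __init__(self, opener):
        # opener: zero-argument callable returning an iterable of text lines.
        self.opener = opener

    def __iter__(self):
        for line in self.opener():
            doc_id, text = line.rstrip("\n").split("\t")
            yield text.split()


raw = "1\thello lazy world\n2\tone more time\n"
sentences = LineSentences(lambda: io.StringIO(raw))

first_pass = list(sentences)
second_pass = list(sentences)  # works again: __iter__ reopens the source
print(first_pass)
```

A plain generator would be empty on the second pass, which matters for algorithms that need to scan the corpus more than once.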

Slide 18

Slide 18 text

NamedTuples

Slide 19

Slide 19 text

Why
- Many Python developers write code around the dict class or tuples
- You never know what to expect
- Code becomes hard to read

From http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python

    pt1 = (1.0, 5.0)
    pt2 = (2.5, 1.5)

    from math import sqrt
    line_length = sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

Slide 20

Slide 20 text

Enter NamedTuples
Named tuples assign meaning to each position in a tuple and allow for more readable, self-documenting code. They can be used wherever regular tuples are used, and they add the ability to access fields by name instead of position index.

    from collections import namedtuple
    Point = namedtuple('Point', 'x y')
    pt1 = Point(1.0, 5.0)
    pt2 = Point(2.5, 1.5)

    from math import sqrt
    line_length = sqrt((pt1.x - pt2.x)**2 + (pt1.y - pt2.y)**2)

Slide 21

Slide 21 text

NamedTuples provide cool methods
Some of them:

    Name             Description
    _asdict()        Return a new OrderedDict (a regular dict since Python 3.8) which maps field names to their values
    _make(iterable)  Class method that makes a new instance from an existing sequence or iterable
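A short sketch of these methods in use, reusing the Point type from the earlier slide. The call to _replace is an extra that is not listed in the table above; it is part of the standard namedtuple API.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")

pt = Point._make([1.0, 5.0])   # build an instance from any iterable
as_mapping = pt._asdict()      # field names mapped to values
moved = pt._replace(x=2.5)     # new instance; namedtuples are immutable

print(pt, dict(as_mapping), moved)
```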

Slide 22

Slide 22 text

You can extend a NamedTuple

    from collections import namedtuple

    _HotelBase = namedtuple(
        'HotelDescriptor',
        ['cluster_id', 'trust_score', 'reviews_count',
         'category_scores', 'intensity_factors'],
    )

    class HotelDescriptor(_HotelBase):
        def compute_prior(self):
            # NotEnoughDataForRanking and _compute_prior are not shown on the slide
            if not self.trust_score or not self.reviews_count:
                raise NotEnoughDataForRanking(
                    "Cannot compute prior without ty score and reviews")
            return _compute_prior(self.trust_score, self.reviews_count)

        (...)
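Since the slide's helpers are not shown, here is a self-contained sketch of the same pattern with a deliberately different, hypothetical type (Review, positive_ratio); it is not the HotelDescriptor implementation. Setting __slots__ to an empty tuple is an optional extra that keeps instances from growing a per-instance __dict__.

```python
from collections import namedtuple

_ReviewBase = namedtuple("Review", ["positive", "total"])


class Review(_ReviewBase):
    """Hypothetical example type; not the HotelDescriptor from the slide."""

    __slots__ = ()  # keep instances as light as plain tuples (no __dict__)

    def positive_ratio(self):
        if not self.total:
            raise ValueError("Cannot compute a ratio without reviews")
        return self.positive / float(self.total)


review = Review(positive=3, total=4)
print(review.positive_ratio())   # 0.75
print(review == (3, 4))          # still a tuple underneath: True
```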

Slide 23

Slide 23 text

Conclusion
(Aspiring) Data Scientists / Engineers should learn:
- The standard library (the collections module in particular)
- Iterables and iterators
- Object-oriented practices
- Documenting your code
- How to package
- Exposing your models (e.g. via an API)

Slide 24

Slide 24 text

Questions?

Slide 25

Slide 25 text

Created by Miguel Cabrera.