Europython 2016 - Things I wish I knew before using Python for Data Processing

30-minute talk at EuroPython 2016, based on a lightning talk.

Miguel Cabrera

July 20, 2016

Transcript

  1. Things I Wish I Knew Before Using Python for Data Processing
    Miguel Cabrera, @mfcabrera, 20 July 2016

  2. Hello! I am Miguel!
    Data Engineer / Scientist @ TrustYou
    Python ~ 2 years
    Berlin
    @mfcabrera, mfcabrera at gmail, http://mfcabrera.com

  3. Priors!
    •  (Relatively) new to Python, mostly the scientific stack
    •  You have used things like NumPy, scikit-learn, Gensim, etc.
    •  Your job title includes either the word "Data" or "Machine Learning"
    •  Not necessarily a trained software engineer

  4. Who is who?
    •  Data Scientist?
    •  Data Analyst?
    •  Data Engineer?
    •  Machine Learning Developer?
    •  Software Developer?
    •  Other?

  5. Agenda
    •  Basic concepts and practices
    •  Some goodies of the collections module
    •  Iterators and iterables
    •  Conclusion

  6. David’s Story
    •  Recent university grad
    •  Mostly R and Matlab
    •  Writes nifty code to classify documents using Jupyter Notebooks
    •  Mostly NLTK and scikit-learn
  7. 7

  8. 8

  9. 9

  10. Data Science Spaghetti Code

  11. 1. Back to the basics

  12. Code vs Software
    Daniel Moisset: https://www.youtube.com/watch?v=4dlWg0B4ASw

  13. “Code is something that runs on a Computer”

  14. Code does not necessarily…
    •  Have tests
    •  Follow conventions
    •  Have documentation
    •  Follow processes

  15. “Software is the programming text that is part of a deliverable”

  16. You want to build Software…
    •  Maintainable
    •  Testable
    •  Deployable
  17. Python is…
    “Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes.”
    Source: https://docs.python.org/3/faq/general.html#what-is-python

  18. Object Oriented Programming (OOP)
    “(OOP) is a programming paradigm based on the concept of "objects", which may contain data, in the form of fields, often known as attributes; and code, in the form of procedures, often known as methods.”
    Source: https://en.wikipedia.org/wiki/Object-oriented_programming
  19. How does an object look in Python?

  20. How does an object look in Python? Cookie Cutters & Cookies

  21. How does an object look in Python? Classes & Objects
  22.
        class Cookie(object):
            def __init__(self, sugar=5):
                self.sugar = sugar

            def eat(self):
                pass

            def split(self):
                pass
  26.
        class Cookie(object):
            def __init__(self, sugar=5):
                self.sugar = sugar

            def eat(self):
                pass

            def split(self):
                pass

        c = Cookie(3)

  27.
        class Alfajor(Cookie):
            def __init__(self, chocolate=10, sugar=10):
                super(Alfajor, self).__init__(sugar=sugar)
                self.chocolate = chocolate

        a = Alfajor(chocolate=20, sugar=30)
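A short usage sketch for the two classes above, assuming Python 3 (where `super()` needs no arguments). The `describe` helper is hypothetical and not part of the slides; it is only there to show that code written against `Cookie` also accepts an `Alfajor`.

```python
class Cookie:
    def __init__(self, sugar=5):
        self.sugar = sugar


class Alfajor(Cookie):
    def __init__(self, chocolate=10, sugar=10):
        # Let the parent class store the attribute it owns.
        super().__init__(sugar=sugar)
        self.chocolate = chocolate


def describe(cookie):
    # Hypothetical helper: accepts any Cookie, including subclasses.
    return "{} (sugar={})".format(type(cookie).__name__, cookie.sugar)


c = Cookie(3)
a = Alfajor(chocolate=20, sugar=30)

print(describe(c))            # Cookie (sugar=3)
print(describe(a))            # Alfajor (sugar=30)
print(isinstance(a, Cookie))  # True
```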
  28.
        from sklearn import svm

        data = ...  # multiple lines to load the data
        X = ...     # multiple lines to extract the features
        y = ...     # ...
        clf = svm.SVC()
        clf.fit(X, y)
        clf.predict(...)
        # multiple lines to store the results
  29. How do I write good object oriented code?

  30. How do I write good OO code?
    •  DRY
    •  KISS
    •  SOLID

  31. DRY: Don’t Repeat Yourself
    “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system”
    Source: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself

  32. KISS: Keep it Simple Stupid
    “Simplicity is the ultimate sophistication” (Leonardo da Vinci)

  33. Be SOLID
    •  Single responsibility principle
    •  Open/closed principle
    •  Liskov substitution principle
    •  Interface segregation principle
    •  Dependency inversion principle
    “Principles Of OOD”, Robert C. Martin
    Source: https://es.wikipedia.org/wiki/SOLID
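One possible reading of the first and last SOLID principles, applied to a script-style workflow (load data, extract features, predict). This is only a sketch: `ListLoader`, `LengthFeatures`, `ThresholdClassifier` and `Pipeline` are hypothetical names, and the "model" is a trivial stand-in rather than a real classifier.

```python
class ListLoader:
    """Loads documents; knows nothing about features or models."""

    def __init__(self, documents):
        self.documents = documents

    def load(self):
        return list(self.documents)


class LengthFeatures:
    """Turns one document into one feature (here: its word count)."""

    def extract(self, document):
        return len(document.split())


class ThresholdClassifier:
    """Stand-in model: labels a document by comparing its feature to a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, feature):
        return "long" if feature >= self.threshold else "short"


class Pipeline:
    """Depends only on the load/extract/predict methods, not on concrete classes."""

    def __init__(self, loader, features, classifier):
        self.loader = loader
        self.features = features
        self.classifier = classifier

    def run(self):
        return [
            self.classifier.predict(self.features.extract(doc))
            for doc in self.loader.load()
        ]


pipeline = Pipeline(ListLoader(["a b c", "a"]), LengthFeatures(), ThresholdClassifier(threshold=2))
labels = pipeline.run()
print(labels)  # ['long', 'short']
```

Each class has one reason to change, and any piece can be swapped (for example a loader that reads files) without touching `Pipeline`.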
  35. Learn OOP in Python

  36. Coding Conventions

  37. Coding Conventions
    •  “Readability counts” (PEP 20)
    •  Spaces vs. tabs
    •  Indentation rules
    •  Code organization
    •  PEP 8 is the de facto style

  38. PEP-8: http://pep8.org

  39. 39

  40. Other Topics
    •  Project structure
    •  Testing (check out py.test!)
    •  Versioning and branching
    •  Code reviews
    •  Software development life cycle
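For the testing bullet, a minimal sketch of what a py.test-style test can look like: plain functions whose names start with `test_`, using bare `assert` statements. The `count_items` function is a hypothetical example, not code from the talk.

```python
def count_items(items):
    """Return a dict mapping each item to how often it occurs."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def test_counts_repeated_items():
    counts = count_items(["a", "b", "a"])
    assert counts["a"] == 2
    assert counts["b"] == 1


def test_empty_input_gives_empty_counts():
    assert count_items([]) == {}
```

Saved in a file such as `test_counting.py`, these functions should be collected automatically when `pytest` is run in that directory.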
  41. Books!

  42. Talks @ Europython
    •  Clean Code in Python by Mariano Anaya (Today at 15:45, Barria 2)
    •  What’s the point of Object Orientation? by Iwan Vosloo (Thursday 11:15, A2)

  43. 2. Tips & Tricks

  44. 2. Tips & Tricks: The collections module

  45. Counting

  46. First Attempt

  47. Using dicts!

        items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

        item_counts = {}

        for i in items:
            # Test membership in the counts dict, not in the input list
            if i in item_counts:
                item_counts[i] = item_counts[i] + 1
            else:
                item_counts[i] = 1

  48. Using dicts (EAFP version)

        items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

        item_counts = {}

        for i in items:
            try:
                item_counts[i] = item_counts[i] + 1
            except KeyError:
                item_counts[i] = 1
  49. We can do better

  50. Let’s use defaultdict

  51. Using defaultdict

        from collections import defaultdict

        items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

        item_counts = defaultdict(int)

        for i in items:
            item_counts[i] = item_counts[i] + 1

        # defaultdict(int, {'a': 3, 'b': 1, 'c': 2, 'x': 2, 'y': 1})
  52. Let’s use Counter

  53. Using Counter

        from collections import Counter

        items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

        item_counts = Counter(items)
        print(item_counts)
        # Counter({'a': 3, 'x': 2, 'c': 2, 'b': 1, 'y': 1})

  54. Counter’s extra goodies

  55. Extra goodies

        >>> c = Counter(a=3, b=1)
        >>> d = Counter(a=1, b=2)
        >>> c.most_common()
        >>> c.values()
        >>> c + d
        >>> c - d
        >>> c & d    # intersection: min(c[x], d[x])
        >>> c | d    # union: max(c[x], d[x])
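The slide lists these operations without their results. A runnable version with the values they produce for the same two counters, assuming Python 3:

```python
from collections import Counter

c = Counter(a=3, b=1)
d = Counter(a=1, b=2)

print(c.most_common())     # [('a', 3), ('b', 1)]
print(sorted(c.values()))  # [1, 3]
print(c + d)               # Counter({'a': 4, 'b': 3})
print(c - d)               # Counter({'a': 2})  (non-positive counts are dropped)
print(c & d)               # Counter({'a': 1, 'b': 1})
print(c | d)               # Counter({'a': 3, 'b': 2})
```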
  56. Counter is a class

  57. Classes can be extended

  58. Counter based PMF

        from collections import Counter

        class PMF(Counter):
            def normalize(self):
                total = float(sum(self.values()))
                for key in self:
                    self[key] /= total

  59. Counter based PMF

        from collections import Counter

        class PMF(Counter):
            def normalize(self):
                total = float(sum(self.values()))
                for key in self:
                    self[key] /= total

            def __init__(self, *args, **kwargs):
                super(PMF, self).__init__(…)
                self.normalize()
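A runnable sketch of the same idea in Python 3. The slide elides the arguments passed to the parent constructor; forwarding `*args, **kwargs` unchanged is an assumption made here, not something the slide states.

```python
from collections import Counter


class PMF(Counter):
    """A Counter whose values are rescaled so that they sum to 1."""

    def __init__(self, *args, **kwargs):
        # Assumption: pass every argument straight through to Counter.
        super().__init__(*args, **kwargs)
        self.normalize()

    def normalize(self):
        total = float(sum(self.values()))
        for key in self:
            self[key] /= total


pmf = PMF(["a", "b", "a", "a"])
print(pmf["a"])  # 0.75
print(pmf["b"])  # 0.25
```

Because `PMF` is still a `Counter`, methods such as `most_common()` keep working on the normalized values.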
  60. On Counting and Python
    •  Use and extend the Counter class
    •  Awesome article from @TreyHunner on counting [1]
    [1] http://treyhunner.com/2015/11/counting-things-in-python/

  61. Named Tuples

  62. Named Tuples
    •  Code built around plain dicts, tuples or lists
    •  You never know what to expect
    •  Code becomes hard to read
  63. Example

        from math import sqrt
        pt1 = (1.0, 5.0)
        pt2 = (2.5, 1.5)

        line_length = sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

    Source: http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python

  64. Example

        line_length = sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

    Source: http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python

  65. Example

        from collections import namedtuple
        from math import sqrt
        Point = namedtuple('Point', 'x y')
        pt1 = Point(1.0, 5.0)
        pt2 = Point(2.5, 1.5)

        line_length = sqrt((pt1.x - pt2.x)**2 + (pt1.y - pt2.y)**2)

    Source: http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python
  66. It has cool methods
    _asdict(): return a new OrderedDict (a regular dict since Python 3.8) which maps field names to their values
    _make(iterable): class method that makes a new instance from an existing sequence or iterable
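A small sketch of both methods, reusing the `Point` type from the earlier example. The `row` list is a made-up stand-in for something like one parsed CSV record.

```python
from collections import namedtuple

Point = namedtuple('Point', 'x y')

pt = Point(1.0, 5.0)

# _asdict(): field names mapped to values (handy for JSON or logging).
as_mapping = dict(pt._asdict())
print(as_mapping)  # {'x': 1.0, 'y': 5.0}

# _make(): build an instance from any sequence or iterable.
row = [2.5, 1.5]
pt2 = Point._make(row)
print(pt2)         # Point(x=2.5, y=1.5)
```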
  67. Extending NamedTuple

        from collections import namedtuple

        _HotelBase = namedtuple(
            'HotelDescriptor',
            ['cluster_id', 'trust_score', 'reviews_count',
             'category_scores', 'intensity_factors'],
        )

        class HotelDescriptor(_HotelBase):
            def compute_prior(self):
                if not self.trust_score or not self.reviews_count:
                    raise NotEnoughDataForRanking("…")
                return _compute_prior(self.trust_score, …)
  68. 2. Tips & Tricks: 2.1 Iterators & Iterables

  69.
        l = [1, 2, 3, 4]
        for i in l:
            print(i)
  70. [Diagram: iterables, iterators and generators. By Vincent Driessen, source: http://nvie.com/posts/iterators-vs-generators/]
    •  A container (list, dict, etc.) typically is an iterable
    •  A comprehension produces a container
    •  iter() on an iterable returns an iterator
    •  An iterator always is an iterable
    •  next() on an iterator lazily produces the next value
    •  A generator always is an iterator
    •  A generator expression is a generator; a generator function is a generator

  71-74. [Diagram build-up: an iterable, iter(), an iterator, next(); a comprehension produces a container (list, dict, etc.)]

  75. [Diagram, with code]
        assert 1 in [1, 2, 3]

  76. [Diagram, with code]
        assert 1 in {1, 2, 3}

  77. [Diagram build-up: a container typically is an iterable]

  78-80. [Diagram, with code]
        l = [1, 2, 3, 4]
        x = iter(l)
        y = iter(l)

        type(l)
        # <class 'list'>
        type(x)
        # <class 'list_iterator'>

        next(x)
        # 1
        next(y)
        # 1
        next(y)
        # 2

  81. [Diagram, with code]
        l = [1, 2, 3, 4]
        for e in l:
            print(e)
  82. Iterables
    •  An iterable is any object that can return an iterator
    •  Containers, files, sockets, etc.
    •  Implement __iter__()
    •  Some of them may be infinite
    •  The itertools module contains many helper functions
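To make the last two bullets concrete, a brief sketch with three standard `itertools` helpers. The variable names are only illustrative.

```python
from itertools import chain, count, islice

# count() is an infinite iterable: it never raises StopIteration.
evens = (n for n in count() if n % 2 == 0)

# islice() takes a finite window without materialising the rest.
first_five = list(islice(evens, 5))
print(first_five)  # [0, 2, 4, 6, 8]

# chain() walks several iterables one after another.
combined = list(chain([1, 2], (3, 4), "ab"))
print(combined)    # [1, 2, 3, 4, 'a', 'b']
```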
  83.
        class InverseReader(object):
            def __init__(self):
                with open('file.txt') as f:
                    self.lines = f.readlines()
                # Start one past the last line: __next__ decrements before reading
                self.index = len(self.lines)

            def __iter__(self):
                return self

            def __next__(self):
                self.index -= 1
                if self.index < 0:
                    raise StopIteration
                return self.lines[self.index]

            next = __next__  # Python 2 spelling of the same method

  84.
        ir = InverseReader()

        for line in ir:
            print(line)
  85. [Diagram: the full picture again, now including generators: a generator always is an iterator; a generator expression is a generator; a generator function is a generator]

  86-90. [Diagram build-up: a generator expression and a generator function each is a generator; a generator always is an iterator]

  91. [Diagram: a generator expression is a generator; a generator function is a generator]
  92-95. [Diagram: a generator expression is a generator; code built up over four slides]

        numbers = [x for x in range(1, 10)]
        squares = [x * x for x in numbers]
        type(squares)
        # list

        lazy_squares = (x * x for x in numbers)
        lazy_squares
        # <generator object <genexpr> at 0x104c6da00>

        next(lazy_squares)
        # 1
        next(lazy_squares)
        # 4

        lazy_squares = (x * x for x in numbers)
        for x in lazy_squares:
            print(x)
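One consequence of laziness that the slides imply but do not show: a generator can be consumed only once, whereas a list can be traversed repeatedly. A minimal sketch:

```python
numbers = range(1, 10)

squares = [x * x for x in numbers]       # list: stored in memory, reusable
lazy_squares = (x * x for x in numbers)  # generator: computed on demand

total = sum(lazy_squares)      # consumes the generator
leftover = list(lazy_squares)  # nothing left on the second pass

print(total)         # 285
print(leftover)      # []
print(sum(squares))  # 285 (the list can be summed again and again)
```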
  96. [Diagram: a generator function is a generator]

  97-98. [Diagram, with code]

        def fib():
            prev, curr = 0, 1
            while True:
                yield curr
                prev, curr = curr, prev + curr

        f = fib()

        next(f)
        # 1
        next(f)
        # 1
        next(f)
        # 2

  99. [Diagram, with code]

        from itertools import islice

        for x in islice(fib(), 0, 3):
            print(x)
        # 1
        # 1
        # 2
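Since `fib()` never terminates on its own, the consumer decides where to stop. Besides `islice`, `itertools.takewhile` can cut the stream on a condition; the limit of 100 below is an arbitrary choice for illustration.

```python
from itertools import islice, takewhile


def fib():
    prev, curr = 0, 1
    while True:
        yield curr
        prev, curr = curr, prev + curr


# Stop at the first value that fails the predicate.
below_100 = list(takewhile(lambda n: n < 100, fib()))
print(below_100)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# Skip the first five values, then take the next three.
window = list(islice(fib(), 5, 8))
print(window)     # [8, 13, 21]
```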
  100.
        class HdfsLineSentence(object):
            def __init__(self, source):
                self.source = source

            def __iter__(self):
                stream = self.source.open('r')
                for line in stream:
                    cid, s = line.split('\t')
                    # decode and do some work with s
                    yield s

        sentences = HdfsLineSentence(...)
        for s in sentences:
            print(s)

  101. [Diagram: the full iterables, iterators and generators picture, repeated]

  102. What does it have to do with Data Processing?

  103. Unknown amount of data

  104. Not enough memory

  105. Data streaming via lazy evaluation

  106. Data processing pipelines through iterables

  107. Chaining iterables

  108.
        class HdfsLineSentence(object):
            def __init__(self, source):
                self.source = source

            def __iter__(self):
                stream = self.source.open('r')
                for line in stream:
                    cid, s = line.split('\t')
                    # decode and do some work with s
                    yield s

        sentences = HdfsLineSentence(...)
        for s in sentences:
            print(s)

  109-112. [Code built up over four slides]

        class FilterComment(object):
            def __init__(self, source):
                self.source = source

            def __iter__(self):
                for s in self.source:
                    if s[0] != "#":
                        yield s

        sents = FilterComment(HdfsLineSentence(source))
        for s in sents:
            print(s)

  113.
        def filter_comment(source):
            for s in source:
                if s[0] != "#":
                    yield s

        sents = filter_comment(HdfsLineSentence(source))

        for s in sents:
            print(s)
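The same chaining pattern can be tried without HDFS. In this sketch an in-memory list stands in for the stream; `read_sentences`, `lowercase` and the sample records are hypothetical, and `startswith` is used instead of `s[0]` so that empty strings do not raise an `IndexError`.

```python
def read_sentences(lines):
    """Stand-in source: yields the sentence from 'id<TAB>sentence' records."""
    for line in lines:
        cid, sentence = line.rstrip("\n").split("\t")
        yield sentence


def filter_comment(source):
    for s in source:
        if not s.startswith("#"):
            yield s


def lowercase(source):
    for s in source:
        yield s.lower()


raw = ["1\tHello World\n", "2\t# a comment\n", "3\tLazy Pipelines\n"]

# Nothing is read or transformed until the pipeline is iterated.
pipeline = lowercase(filter_comment(read_sentences(raw)))

result = list(pipeline)
print(result)  # ['hello world', 'lazy pipelines']
```

Each stage holds only one record at a time, so the same code should work for inputs that do not fit in memory.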
  114. Talks @ Europython
    •  Iteration, iteration, iteration by John Sutherland (Friday 15:45, Barria 1)

  115. 3. Conclusions

  116. Data Scientists / Engineers / ML Developers should learn…

  117. collections and itertools modules

  118. Iterables and iterators for data processing pipelines

  119. Object oriented programming

  120. Good software engineering practices

  121. Credits
    •  Spaghetti: https://www.flickr.com/photos/129610671@N02/16633987421/ (CC BY-NC-ND 2.0), vision.communicate
    •  Autovification: credit AV Dezign, https://www.flickr.com/photos/91345457@N07/22666878846/ (CC BY-NC-ND 2.0)
    •  Iterators and Iterables based on work of Vincent Driessen: http://nvie.com/posts/iterators-vs-generators/
    •  Ideas on iterables taken from the RaRe Technologies blog: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
    •  Counter image: Dean Hochman, https://www.flickr.com/photos/17997843@N02/24061690099/ (CC BY-NC-ND 2.0)
    •  PMF class based on Vik Paruchuri’s https://www.dataquest.io/blog/python-counter-class/
    •  Cookies: Wikipedia, https://en.wikipedia.org/wiki/File:R%C5%AFzn%C3%A9_druhy_cukrov%C3%AD_(2).jpg (CC BY 3.0)
    •  Counting things in Python: http://treyhunner.com/2015/11/counting-things-in-python/

  122. We are Hiring! Visit our table for more info!

  123. Thanks! Any questions?
    You can find me at: @mfcabrera, mfcabrera@gmail.com