EuroPython 2016 - Things I Wish I Knew Before Using Python for Data Processing

A 30-minute talk at EuroPython 2016, based on a lightning talk.

Miguel Cabrera

July 20, 2016

Transcript

  1. Things I Wish I Knew
    Before Using Python for
    Data Processing

    Miguel Cabrera
    @mfcabrera

    20 July 2016

  2. Hello!
    I am Miguel!
    Data Engineer / Scientist @ TrustYou
    Python ~ 2 years
    Berlin

    @mfcabrera
    mfcabrera at gmail
    http://mfcabrera.com

  3. Priors!
    ● (Relatively) new to Python, mostly the scientific stack
    ● You have used things like NumPy, Scikit-Learn, Gensim, etc.
    ● Your job title includes either the word “Data” or “Machine Learning”
    ● Not necessarily a trained Software Engineer

  4. Who is who?
    ● Data Scientist?
    ● Data Analyst?
    ● Data Engineer?
    ● Machine Learning Developer?
    ● Software Developer?
    ● Other?

  5. Agenda
    ● Basic concepts and practices
    ● Some goodies of the collections module
    ● Iterators and Iterables
    ● Conclusion

  6. David’s Story
    ● Recent university grad
    ● Mostly R and Matlab
    ● Writes nifty code to classify documents using Jupyter Notebooks
    ● Mostly NLTK and Scikit-Learn

  7. Data Science Spaghetti Code

  8. 1.
    Back to the basics

  9. Code vs Software
    Daniel Moisset: https://www.youtube.com/watch?v=4dlWg0B4ASw

  10. “Code is something that
    runs on a Computer”

  11. Code does not necessarily…
     
    ● Have tests
    ● Follow conventions
    ● Have documentation
    ● Follow processes

  12. “Software is the
    programming text that
    is part of a deliverable”

  13. You want to build Software…
     
    ● Maintainable
    ● Testable
    ● Deployable


  14. Python is…
    Python is an interpreted, interactive, object-oriented
    programming language. It incorporates modules, exceptions,
    dynamic typing, very high level dynamic data types, and classes.
    Source: https://docs.python.org/3/faq/general.html#what-is-python


  15. Object Oriented Programming (OOP)
    (OOP) is a programming paradigm based on the concept of "objects",
    which may contain data, in the form of fields, often known as
    attributes; and code, in the form of procedures, often known as
    methods.
    Source: https://en.wikipedia.org/wiki/Object-oriented_programming

  16. How does an object
    look in Python?

  17. How does an object
    look in Python?
    Cookie Cutters & Cookies

  18. How does an object
    look in Python?
    Classes & Objects

  19. class Cookie(object):
        def __init__(self, sugar=5):
            self.sugar = sugar

        def eat(self):
            pass

        def split(self):
            pass

  23. class Cookie(object):
        def __init__(self, sugar=5):
            self.sugar = sugar

        def eat(self):
            pass

        def split(self):
            pass

    c = Cookie(3)

  24. class Alfajor(Cookie):
        def __init__(self, chocolate=10, sugar=10):
            super(Alfajor, self).__init__(sugar=sugar)
            self.chocolate = chocolate

    a = Alfajor(chocolate=20, sugar=30)
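
The cookie slides use Python 2 style (`class Cookie(object)` and `super(Alfajor, self)`). A minimal sketch of the same two classes in Python 3 style is below. The `describe` method is a hypothetical helper added only for illustration; it is not on the slides.

```python
class Cookie:
    """The cookie cutter: every instance gets its own `sugar` value."""

    def __init__(self, sugar=5):
        self.sugar = sugar

    def describe(self):
        # Hypothetical helper (not on the slides): shows that subclasses
        # inherit behaviour defined on the base class.
        return "{} with sugar={}".format(type(self).__name__, self.sugar)


class Alfajor(Cookie):
    def __init__(self, chocolate=10, sugar=10):
        # Python 3 shorthand for super(Alfajor, self).__init__(...)
        super().__init__(sugar=sugar)
        self.chocolate = chocolate


c = Cookie(3)
a = Alfajor(chocolate=20, sugar=30)
```

Here `a` is both an `Alfajor` and a `Cookie`, so `a.describe()` works without `Alfajor` defining it.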

  25. from sklearn import svm
    data = ...  # multiple lines to load the data
    X = ...  # multiple lines to extract the features
    y = ...  # ...
    clf = svm.SVC()
    clf.fit(X, y)
    clf.predict(...)
    # multiple lines to store the results

  26. How do I write good
    object oriented code?

  27. How do I write good OO
    code?
     
    ● DRY
    ● KISS
    ● SOLID


  28. DRY: Don’t Repeat Yourself
    Every piece of knowledge must have a single, unambiguous,
    authoritative representation within a system
    Source: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
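
One possible reading of DRY in a data-processing script, as a small sketch: the rule for normalizing a token lives in exactly one function, and everything else calls it. All names here (`clean_token`, `tokenize`, `same_word`) are hypothetical and chosen only for this example.

```python
def clean_token(token):
    """Single authoritative definition of how a token is normalized."""
    return token.strip().lower()


def tokenize(text):
    # Splits on whitespace, then applies the one normalization rule
    return [clean_token(t) for t in text.split()]


def same_word(a, b):
    # Reuses clean_token instead of repeating the strip/lower logic
    return clean_token(a) == clean_token(b)


tokens = tokenize("  Hello World ")
```

If the normalization rule changes later, only `clean_token` has to be edited.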


  29. KISS: Keep it Simple Stupid
    Simplicity is the ultimate sophistication
    Leonardo da Vinci

  30. Be SOLID
    ● Single responsibility principle
    ● Open/closed principle
    ● Liskov substitution principle
    ● Interface segregation principle
    ● Dependency inversion principle
    “Principles Of OOD”, Robert C. Martin. Source: https://es.wikipedia.org/wiki/SOLID

  32. Learn OOP in Python

  33. Coding Conventions

  34. Coding Conventions
    ● “Readability counts” (PEP 20)
    ● Spaces vs. tabs
    ● Indentation rules
    ● Code organization
    ● PEP 8 is the de facto style

  35. PEP-8
    http://pep8.org

  36. Other Topics
    ● Project structure
    ● Testing (check out py.test!)
    ● Versioning and branching
    ● Code reviews
    ● Software Development Life Cycle

  37. Talks @ Europython
    ● Clean Code in Python by Mariano Anaya (Today at 15:45, Barria 2)
    ● What’s the point of Object Orientation? by Iwan Vosloo (Thursday 11:15, A2)

  38. 2.
    Tips & Tricks

  39. 2.
    Tips & Tricks
    The Collections Module

  40. First Attempt

  41. Using dicts!
    items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

    item_counts = {}

    for i in items:
        if i in item_counts:
            item_counts[i] = item_counts[i] + 1
        else:
            item_counts[i] = 1

  42. Using dicts (EAFP version)
    items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

    item_counts = {}

    for i in items:
        try:
            item_counts[i] = item_counts[i] + 1
        except KeyError:
            item_counts[i] = 1
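
A third plain-dict variant, which is not on the slides, uses `dict.get` with a default of 0 so that neither a membership check nor a `try`/`except` is needed. This is only a sketch that reuses the same `items` list:

```python
items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

item_counts = {}

for i in items:
    # get() returns 0 for keys that have not been seen yet
    item_counts[i] = item_counts.get(i, 0) + 1
```

It is still more ceremony than the `collections` helpers that come next.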

  43. We can do better

  44. Let’s use defaultdict  

  45. Using defaultdict
    from collections import defaultdict

    items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

    item_counts = defaultdict(int)

    for i in items:
        item_counts[i] = item_counts[i] + 1

    # defaultdict(int, {'a': 3, 'b': 1, 'c': 2, 'x': 2, 'y': 1})
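
`defaultdict` is not limited to counting. As a sketch of another common use (the word list and variable names are made up for this example), `defaultdict(list)` groups values by key without first checking whether the key exists:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Missing keys start out as an empty list, so no membership check is needed
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

grouped = dict(by_letter)  # plain dict copy, e.g. for printing or serializing
```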

  46. Let’s use Counter  

  47. Using Counter
    from collections import Counter

    items = ["a", "b", "a", "x", "x", "y", "c", "c", "a"]

    item_counts = Counter(items)
    print(item_counts)
    # Counter({'a': 3, 'x': 2, 'c': 2, 'b': 1, 'y': 1})

  48. Counter’s extra goodies  

  49. Extra goodies
    >>> c = Counter(a=3, b=1)
    >>> d = Counter(a=1, b=2)
    >>> c.most_common()
    >>> c.values()
    >>> c + d
    >>> c - d
    >>> c & d    # intersection: min(c[x], d[x])
    >>> c | d    # union: max(c[x], d[x])
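
The slide lists the operations without their results. A small sketch that evaluates them for the same two counters (the variable names on the left are chosen for this example):

```python
from collections import Counter

c = Counter(a=3, b=1)
d = Counter(a=1, b=2)

ranked = c.most_common()  # (element, count) pairs, highest count first
added = c + d             # counts are summed per key
subtracted = c - d        # results that are zero or negative are dropped
intersection = c & d      # min(c[x], d[x]) per key
union = c | d             # max(c[x], d[x]) per key
```

Note that subtraction drops `b` entirely, because 1 - 2 is negative.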

  50. Counter is a class  

  51. Classes can be extended  

  53. Counter based PMF
    from collections import Counter


    class PMF(Counter):
        def normalize(self):
            total = float(sum(self.values()))
            for key in self:
                self[key] /= total

        def __init__(self, *args, **kwargs):
            super(PMF, self).__init__(…)
            self.normalize()
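
The slide elides the arguments passed to `Counter.__init__`. A runnable Python 3 sketch follows, under the assumption that all arguments are simply forwarded unchanged; the slide does not confirm this, so treat it as one plausible completion rather than the original code.

```python
from collections import Counter


class PMF(Counter):
    """Probability mass function: counts rescaled so they sum to 1."""

    def __init__(self, *args, **kwargs):
        # Assumption: forward everything to Counter, then normalize once
        super().__init__(*args, **kwargs)
        self.normalize()

    def normalize(self):
        total = float(sum(self.values()))
        for key in self:
            # Reassigning existing keys does not change the dict size,
            # so this is safe while iterating
            self[key] /= total


pmf = PMF("aabc")
```

Because `PMF` is still a `Counter`, helpers such as `most_common()` keep working on the normalized values.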

  54. On Counting and Python
    ● Use and extend the Counter class
    ● Awesome article from @TreyHunner on counting [1].
    [1] http://treyhunner.com/2015/11/counting-things-in-python/

  55. Named Tuples

  56. Named Tuples
    ● Code built around dicts, tuples or lists
    ● You never know what to expect
    ● Code becomes hard to read

  57. Example
    from math import sqrt
    pt1 = (1.0, 5.0)
    pt2 = (2.5, 1.5)

    line_length = sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
    Source: http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python

  59. Example
    from collections import namedtuple
    from math import sqrt
    Point = namedtuple('Point', 'x y')
    pt1 = Point(1.0, 5.0)
    pt2 = Point(2.5, 1.5)

    line_length = sqrt((pt1.x - pt2.x)**2 + (pt1.y - pt2.y)**2)
    Source: http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python

  60. It has cool methods
    _asdict: Return a new OrderedDict which maps field names to their values
    _make(iterable): Class method that makes a new instance from an existing sequence or iterable.
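
A short sketch of these methods on the `Point` type from the earlier example. `_replace` is not on the slide but is another standard namedtuple method. The concrete mapping type returned by `_asdict` depends on the Python version (an `OrderedDict` on older versions, a plain `dict` from 3.8 on), so the example converts it with `dict()`.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")
pt = Point(1.0, 5.0)

as_dict = dict(pt._asdict())         # field names mapped to values
from_list = Point._make([2.5, 1.5])  # build an instance from any iterable
moved = pt._replace(x=0.0)           # new instance; pt itself is unchanged
```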

  61. Extending NamedTuple
    from collections import namedtuple

    _HotelBase = namedtuple(
        'HotelDescriptor',
        ['cluster_id', 'trust_score', 'reviews_count',
         'category_scores', 'intensity_factors'],
    )

    class HotelDescriptor(_HotelBase):
        def compute_prior(self):
            if not self.trust_score or not self.reviews_count:
                raise NotEnoughDataForRanking("…")
            return _compute_prior(self.trust_score,…)

  62. 2.
    Tips & Tricks
    2.1 Iterators & Iterables

  63. l = [1, 2, 3, 4]
    for i in l:
        print(i)

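  64. [Diagram: how containers, iterables, iterators and generators relate]
    ● A comprehension produces a container (list, dict, etc.)
    ● A container typically is an iterable
    ● iter() turns an iterable into an iterator
    ● An iterator always is an iterable
    ● next() on an iterator lazily produces the next value
    ● A generator always is an iterator
    ● A generator expression is a generator
    ● A generator function is a generator
    By Vincent Driessen - Source: http://nvie.com/posts/iterators-vs-generators/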

  69. container (list, dict, etc.)
    assert 1 in [1, 2, 3]

  70. container (list, dict, etc.)
    assert 1 in {1, 2, 3}

  74. l = [1, 2, 3, 4]
    x = iter(l)
    y = iter(l)

    type(l)
    >> <type 'list'>
    type(x)
    >> <type 'listiterator'>

    next(x)
    >> 1
    next(y)
    >> 1
    next(y)
    >> 2
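
The session above shows that each call to `iter()` returns an independent iterator over the same list. A small Python 3 sketch makes that explicit and also shows what happens once an iterator is exhausted (variable names are chosen for this example):

```python
numbers = [1, 2, 3]

x = iter(numbers)
y = iter(numbers)

first_x = next(x)           # advances x only
first_y = next(y)           # y keeps its own position
rest_of_y = list(y)         # consumes whatever is left in y
exhausted = next(y, None)   # default value instead of raising StopIteration
is_own_iterator = iter(x) is x  # an iterator returns itself from iter()
```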

  75. l = [1, 2, 3, 4]
    for e in l:
        print(e)

  76. Iterables
    ● An iterable is any object that can return an iterator
    ● Containers, files, sockets, etc.
    ● Implement __iter__().
    ● Some of them may be infinite
    ● The itertools module contains many helper functions
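
A brief sketch of the last two points, using three standard `itertools` functions: `count` is an infinite iterable, `islice` takes a finite slice of it, and `chain` walks several iterables one after another. The variable names are chosen for this example.

```python
from itertools import chain, count, islice

# count(0) never ends, so the generator below is infinite as well
evens = (n for n in count(0) if n % 2 == 0)

# islice pulls only as many values as requested
first_five_evens = list(islice(evens, 5))

# chain accepts any mix of iterables: lists, tuples, strings, ...
combined = list(chain([1, 2], (3, 4), "ab"))
```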

  77. class InverseReader(object):
        def __init__(self):
            with open('file.txt') as f:
                self.lines = f.readlines()
            # Start one past the end: next() decrements before reading
            self.index = len(self.lines)

        def __iter__(self):
            return self

        def next(self):  # Python 3: __next__
            self.index -= 1
            if self.index < 0:
                raise StopIteration
            return self.lines[self.index]

        __next__ = next  # alias so the class also works on Python 3

  78. ir = InverseReader()

    for line in ir:
        print(line)
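
To try the same iterator protocol without a `file.txt` on disk, here is a Python 3 sketch that takes the lines as a constructor argument. `InverseLines` is a hypothetical variant written for this example, not the class from the talk.

```python
class InverseLines:
    """Iterator that yields the given lines from last to first."""

    def __init__(self, lines):
        self.lines = list(lines)
        # One past the last element; __next__ decrements before reading
        self.index = len(self.lines)

    def __iter__(self):
        # An iterator returns itself
        return self

    def __next__(self):
        self.index -= 1
        if self.index < 0:
            raise StopIteration
        return self.lines[self.index]


result = list(InverseLines(["first", "second", "third"]))
```

For simple cases the built-in `reversed()` does the same job; the class form is mainly useful for showing what `for` calls under the hood.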

  89. a generator expression is a generator
    numbers = [x for x in range(1, 10)]
    squares = [x * x for x in numbers]
    type(squares)
    # list

    lazy_squares = (x * x for x in numbers)
    lazy_squares
    # <generator object <genexpr> at 0x104c6da00>

    next(lazy_squares)
    # 1
    next(lazy_squares)
    # 4

    lazy_squares = (x * x for x in numbers)
    for x in lazy_squares:
        print(x)
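
Two practical consequences of that laziness, as a small sketch: a generator expression can feed an aggregate such as `sum` without building an intermediate list, and a generator can only be consumed once.

```python
numbers = range(1, 10)

# No intermediate list of squares is created here
total = sum(x * x for x in numbers)

lazy_squares = (x * x for x in numbers)
first_pass = list(lazy_squares)   # consumes the generator
second_pass = list(lazy_squares)  # nothing left to produce
```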

  90. a generator function is a generator

  92. def fib():
        prev, curr = 0, 1
        while True:
            yield curr
            prev, curr = curr, prev + curr

    f = fib()

    next(f)
    # 1
    next(f)
    # 1
    next(f)
    # 2

  93. from itertools import islice

    def fib():
        prev, curr = 0, 1
        while True:
            yield curr
            prev, curr = curr, prev + curr

    for x in islice(fib(), 0, 3):
        print(x)
    # 1
    # 1
    # 2
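
Since `fib()` never stops on its own, the consumer decides how much to take. Besides `islice`, `itertools.takewhile` is one way to cut an infinite generator by a condition instead of a count. This is a sketch; the threshold of 50 is arbitrary.

```python
from itertools import islice, takewhile


def fib():
    prev, curr = 0, 1
    while True:
        yield curr
        prev, curr = curr, prev + curr


# Stop after a fixed number of values
first_three = list(islice(fib(), 3))

# Stop at the first value that fails the condition
below_50 = list(takewhile(lambda n: n < 50, fib()))
```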

  94. class HdfsLineSentence(object):
        def __init__(self, source):
            self.source = source

        def __iter__(self):
            stream = self.source.open('r')
            for line in stream:
                cid, s = line.split('\t')
                # decode and do some work with s
                yield s

    sentences = HdfsLineSentence(...)
    for s in sentences:
        print(s)

  96. What does it have to do with
    Data Processing?  

  97. Unknown amount of data

  98. Not enough memory  

  99. Data streaming via lazy
    evaluation  

  100. Data processing pipelines
    through iterables  

  101. Chaining iterables  

  102. class HdfsLineSentence(object):
        def __init__(self, source):
            self.source = source

        def __iter__(self):
            stream = self.source.open('r')
            for line in stream:
                cid, s = line.split('\t')
                # decode and do some work with s
                yield s

    sentences = HdfsLineSentence(...)
    for s in sentences:
        print(s)

  104. class FilterComment(object):
        def __init__(self, source):
            self.source = source

        def __iter__(self):
            for s in self.source:
                # startswith() also copes with empty strings, unlike s[0]
                if not s.startswith("#"):
                    yield s

    sents = FilterComment(HdfsLineSentence(source))
    for s in sents:
        print(s)

  107. def filter_comment(source):
        for s in source:
            # startswith() also copes with empty strings, unlike s[0]
            if not s.startswith("#"):
                yield s

    sents = filter_comment(HdfsLineSentence(source))

    for s in sents:
        print(s)
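
The same chaining idea can be tried without HDFS. The sketch below runs a three-stage pipeline over an in-memory list of tab-separated lines; `parse_sentences`, `lowercase` and the sample data are hypothetical stand-ins invented for this example.

```python
def parse_sentences(lines):
    # Stand-in for HdfsLineSentence: each line is "<id>\t<sentence>"
    for line in lines:
        cid, sentence = line.rstrip("\n").split("\t")
        yield sentence


def filter_comment(source):
    for s in source:
        if not s.startswith("#"):
            yield s


def lowercase(source):
    for s in source:
        yield s.lower()


raw_lines = ["1\tGreat Hotel\n", "2\t# ignore me\n", "3\tNice Pool\n"]

# Nothing is read or processed until the pipeline is iterated
pipeline = lowercase(filter_comment(parse_sentences(raw_lines)))
result = list(pipeline)
```

Each stage pulls one item at a time from the stage before it, so memory use stays flat regardless of how many lines the source produces.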

  108. Talks @ Europython
    ● Iteration, iteration, iteration by John Sutherland (Friday 15:45, Barria 1)

  109. 3.
    Conclusions

  110. Data Scientists /
    Engineers / ML Developers
    should learn…

  111. collections and itertools  
    modules

  112. Iterables and iterators for data processing pipelines

  113. Object oriented programming

  114. Good software engineering practices

  115. Credits
    Spaghetti: https://www.flickr.com/photos/129610671@N02/16633987421/ (CC BY-NC-ND 2.0) vision.communicate
    Autovification: Credit: AV Dezign https://www.flickr.com/photos/91345457@N07/22666878846/ (CC BY-NC-ND 2.0)
    Iterators and Iterables based on work of Vincent Driessen: http://nvie.com/posts/iterators-vs-generators/
    Ideas from Iterables taken from RaRe Technologies blog: http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
    Counter image: Dean Hochman. Source: https://www.flickr.com/photos/17997843@N02/24061690099/ (CC BY-NC-ND 2.0)
    PMF Class based on Vik Paruchuri’s https://www.dataquest.io/blog/python-counter-class/
    Cookies: Source Wikipedia https://en.wikipedia.org/wiki/File:R%C5%AFzn%C3%A9_druhy_cukrov%C3%AD_(2).jpg (CC BY 3.0)
    Counting things in Python: http://treyhunner.com/2015/11/counting-things-in-python/

  116. We are Hiring!
    Visit our table for more info!

  117. Thanks!
    Any questions?
    You can find me at:
    @mfcabrera
    mfcabrera at gmail