Data Engineering for Data Scientists

AnacondaCON, Austin, Texas / April 9, 2018 at 4:10-5:00pm

Max Humber

April 09, 2018

Transcript

  2. Data Engineering for Data Scientists Max Humber

  4. When models and data applications are pushed to production, they become brittle black boxes that can and will break. In this talk you’ll learn how to one-up your data science workflow with a little engineering! More specifically, how to improve the reliability and quality of your data applications, all so that your models won’t break (or at least won’t break as often). Examples for this session are in Python 3.6+ and rely on: logging, to debug and diagnose things while they’re running; Click, to develop “beautiful” command line interfaces with minimal boilerplate; and pytest, to write short, elegant, and maintainable tests.

  9. you can't do this

  10. without this you can't do this

  12. #1 .py #2 defence #3 log #4 cli #5

  25. #1 Lose the Notebook

  30. .ipynb
      ✅ exploratory analysis
      ✅ visualizing ideas
      ✅ prototyping
      ❌ messy
      ❌ bad at versioning
      ❌ not ideal for production

  32. .ipynb → .py

  33. $ jupyter nbconvert --to script [NOTEBOOK_NAME].ipynb

  39. cmd+enter

  43. lose the notebook not the kernel

  46. #2 Get Defensive

  55. $ pip install sklearn-pandas

  56. DataFrameMapper CategoricalImputer

  57. from sklearn_pandas import DataFrameMapper, CategoricalImputer

      mapper = DataFrameMapper([
          ('time', None),
          ('pick_up', None),
          ('last_drop_off', CategoricalImputer()),
          ('last_pick_up', CategoricalImputer())
      ])
      mapper.fit(X_train)
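
A sketch of the mapper in use; the toy frame below is invented for illustration, but matches the four columns the mapper expects:

      import pandas as pd

      X_train = pd.DataFrame({
          'time': pd.to_datetime(['2018-04-09 09:15', '2018-04-09 17:45']),
          'pick_up': ['home', 'work'],
          'last_drop_off': ['work', None],
          'last_pick_up': [None, 'home'],
      })
      Z = mapper.fit_transform(X_train)  # missing categoricals imputed with the mode
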
  66. import pandas as pd
      from sklearn.base import TransformerMixin

      class DateEncoder(TransformerMixin):
          def fit(self, X, y=None):
              return self

          def transform(self, X):
              dt = X.dt
              return pd.concat([dt.month, dt.dayofweek, dt.hour], axis=1)
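
On its own, the transformer turns a datetime Series into a three-column frame of month, day of week, and hour; a quick sketch:

      import pandas as pd

      times = pd.Series(pd.to_datetime(['2018-04-09 16:10:00']))
      DateEncoder().fit_transform(times)  # one row: month=4, dayofweek=0 (Monday), hour=16
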
  71. month, dayofweek, hour

  78. #3 LOG ALL THE THINGS

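The logging in this section is presumably the stdlib logging module the abstract names; a minimal setup sketch (filename and format string are illustrative):

      import logging

      logging.basicConfig(
          filename='model.log',
          level=logging.INFO,
          format='%(asctime)s %(levelname)s %(message)s',
      )
      logger = logging.getLogger(__name__)
      logger.info('training started')
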
  82. Cerberus is a lightweight and extensible data validation library for Python

  83. $ pip install cerberus
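
Before the pandas wrapper on the next slide, plain Cerberus works on dictionaries; a minimal sketch (schema and documents invented for illustration):

      from cerberus import Validator

      schema = {'pick_up': {'type': 'string', 'allowed': ['home', 'work', 'other']}}
      v = Validator(schema)
      v.validate({'pick_up': 'home'})  # True
      v.validate({'pick_up': 'mars'})  # False; details land in v.errors
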
  93. from cerberus import Validator
      from copy import deepcopy

      class PandasValidator(Validator):
          def validate(self, document, schema, update=False, normalize=True):
              document = document.to_dict(orient='list')
              schema = self.transform_schema(schema)
              return super().validate(document, schema, update=update, normalize=normalize)

          def transform_schema(self, schema):
              schema = deepcopy(schema)
              for k, v in schema.items():
                  schema[k] = {'type': 'list', 'schema': v}
              return schema
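
The abstract also promises pytest; a minimal sketch of a test for this validator (the module name, schema, and data are invented for illustration):

      # test_validate.py -- run with: pytest test_validate.py
      import pandas as pd
      from validate import PandasValidator  # assuming the class above lives in validate.py

      def test_pandas_validator_flags_bad_pick_up():
          schema = {'pick_up': {'type': 'string', 'allowed': ['home', 'work', 'other']}}
          df = pd.DataFrame({'pick_up': ['home', 'mars']})
          v = PandasValidator()
          v.validate(df, schema)
          assert v.errors  # 'mars' is not an allowed value
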
  102. 78asd86d876ad8678sdadsa687d

  106. #4 Learn how to CLI

  107. input output

  116. < refactor >

  120. $ python model.py predict --file=max_bike_data.csv

  123. $ python model.py predict my_bike_data.csv

  124. $ python model.py predict sunny_bike_data.csv

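The abstract names Click for the CLI layer, though the predict.py a few slides down uses Fire; a minimal sketch of the same idea in Click, with the body stubbed (a model.py with subcommands would use click.group instead):

      import click

      @click.command()
      @click.argument('file')
      def predict(file):
          """Predict the next drop-off from a CSV of rides."""
          click.echo(f'scoring {file}...')
          # load the pickled pipeline here and call pipe.predict(...) on the CSV

      if __name__ == '__main__':
          predict()
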
  127. #5 mummify

  128. you suck at git and logging but it’s not your fault

  138. model.py base

       import pandas as pd
       import numpy as np
       from sklearn.model_selection import train_test_split
       from sklearn.preprocessing import LabelBinarizer
       from sklearn.pipeline import make_pipeline
       from sklearn_pandas import DataFrameMapper, CategoricalImputer
       from helpers import DateEncoder

       df = pd.read_csv('../max_bike_data.csv')
       df['time'] = pd.to_datetime(df['time'])
       df = df[(df['pick_up'].notnull()) & (df['drop_off'].notnull())]

       TARGET = 'drop_off'
       y = df[TARGET].values
       X = df.drop(TARGET, axis=1)
       X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size=0.2, random_state=42)

       mapper = DataFrameMapper([
           ('time', DateEncoder(), {'input_df': True}),
           ('pick_up', LabelBinarizer()),
           ('last_drop_off', [CategoricalImputer(), LabelBinarizer()]),
           ('last_pick_up', [CategoricalImputer(), LabelBinarizer()])
       ])

       lb = LabelBinarizer()
       y_train = lb.fit_transform(y_train)

  139. model.py add

       from sklearn.neighbors import KNeighborsClassifier

       model = KNeighborsClassifier()
       pipe = make_pipeline(mapper, model)
       pipe.fit(X_train, y_train)
       acc_train = pipe.score(X_train, y_train)
       acc_test = pipe.score(X_test, lb.transform(y_test))
       print(f'Training: {acc_train:.3f}, Testing: {acc_test:.3f}')

  140. model.py mummify

       import mummify
       mummify.log(f'Training: {acc_train:.3f}, Testing: {acc_test:.3f}')

  141. model.py model swap 1

       from sklearn.ensemble import RandomForestClassifier
       model = RandomForestClassifier()

  142. model.py model swap 2

       from sklearn.neural_network import MLPClassifier
       model = MLPClassifier()

  143. model.py model swap 2 + max_iter

       from sklearn.neural_network import MLPClassifier
       model = MLPClassifier(max_iter=2000)

  144. mummify command line

       $ mummify history
       $ mummify switch
       $ mummify history

  145. mummify is just git

       $ git --git-dir=.mummify status

  146. mummify adjust hypers on 1

       from sklearn.neighbors import KNeighborsClassifier
       model = KNeighborsClassifier(n_neighbors=6)

  147. mummify adjust hypers on 1

       from sklearn.neighbors import KNeighborsClassifier
       model = KNeighborsClassifier(n_neighbors=4)

  148. mummify switch back to rf

       from sklearn.ensemble import RandomForestClassifier
       model = RandomForestClassifier(n_estimators=1000)

  149. pickle model

       import pickle

       with open('rick.pkl', 'wb') as f:
           pickle.dump((pipe, lb), f)

  150. predict.py

       import pickle
       from fire import Fire
       import pandas as pd

       with open('rick.pkl', 'rb') as f:
           pipe, lb = pickle.load(f)

       def predict(file):
           df = pd.read_csv(file)
           df['time'] = pd.to_datetime(df['time'])
           y = pipe.predict(df)
           y = lb.inverse_transform(y)[0]
           return f'Max is probably going to {y}'

       if __name__ == '__main__':
           Fire(predict)

       $ git --git-dir=.mummify add .
       $ git --git-dir=.mummify commit -m 'add predict'

  151. new_data.csv

       time,pick_up,last_drop_off,last_pick_up
       2018-04-09 9:15:52,home,other,home
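
With Fire exposing predict() at the command line, scoring that file should be a one-liner (the printed destination of course depends on the trained model):

      $ python predict.py new_data.csv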

  154. https://github.com/maxhumber/mummify

       $ pip install mummify
       $ conda install -c maxhumber mummify

  155. #END

  156. hydrogen sklearn sklearn-pandas cerberus

  158. mummify https://leanpub.com/personal_finance_with_python/c/anaconda First 50 get it free!
