
Data Engineering for Data Scientists

AnacondaCON, Austin, Texas / April 9, 2018 at 4:10-5:00pm

Max Humber

Transcript

  2. Data Engineering for Data Scientists
    Max Humber


  4. When models and data applications are pushed to production,
    they become brittle black boxes that can and will break. In this
    talk you’ll learn how to one-up your data science workflow with a
    little engineering! Or, more specifically, how to improve the
    reliability and quality of your data applications... all so that your
    models won’t break (or at least won’t break as often)! Examples
    for this session will be in Python 3.6+ and will rely on: logging, to
    let us debug and diagnose things while they’re running; Click, to
    develop “beautiful” command-line interfaces with minimal
    boilerplate; and pytest, to write short, elegant, and
    maintainable tests.

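    The abstract promises pytest, but the slides never show a test, so here is a
    minimal sketch (not from the deck) assuming the DateEncoder transformer that
    appears later (slide 66) lives in helpers.py:

    # test_helpers.py (hypothetical)
    import pandas as pd
    from helpers import DateEncoder

    def test_date_encoder_returns_month_dayofweek_hour():
        s = pd.Series(pd.to_datetime(['2018-04-09 16:10:00']))  # a Monday
        out = DateEncoder().fit_transform(s)
        assert out.shape == (1, 3)
        assert list(out.iloc[0]) == [4, 0, 16]  # month=4, Monday=0, hour=16

    Run it with: $ pytest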


  10. without this
    you can't do this


  12. #1 .py
    #2 defence
    #3 log
    #4 cli
    #5


  25. #1
    Lose the Notebook


  30. .ipynb
    exploratory analysis
    visualizing ideas
    prototyping
    messy
    bad at versioning
    not ideal for production


  32. .ipynb → .py


  33. $ jupyter nbconvert --to script [NOTEBOOK_NAME].ipynb


  39. cmd+enter (run your .py line by line from the editor, e.g. with Hydrogen)


  43. lose the notebook
    not the kernel


  46. #2
    Get Defensive


  55. $ pip install sklearn-pandas


  56. DataFrameMapper
    CategoricalImputer


  57. from sklearn_pandas import DataFrameMapper, CategoricalImputer

    mapper = DataFrameMapper([
        ('time', None),
        ('pick_up', None),
        ('last_drop_off', CategoricalImputer()),
        ('last_pick_up', CategoricalImputer())
    ])
    mapper.fit(X_train)

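    A hedged usage sketch (the row below is invented): once fit, the mapper
    fills a missing category with the most frequent value it saw in X_train:

    import numpy as np
    import pandas as pd

    X_new = pd.DataFrame({
        'time': pd.to_datetime(['2018-04-09 16:10:00']),
        'pick_up': ['home'],
        'last_drop_off': [np.nan],  # missing -> imputed by CategoricalImputer
        'last_pick_up': ['home'],
    })
    Z = mapper.transform(X_new)     # array with the NaN replaced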

  66. import pandas as pd
    from sklearn.base import TransformerMixin

    class DateEncoder(TransformerMixin):
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            dt = X.dt
            return pd.concat([dt.month, dt.dayofweek, dt.hour], axis=1)


  71. month, dayofweek, hour

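    A quick hedged check of what the encoder returns (timestamp invented):

    import pandas as pd

    s = pd.Series(pd.to_datetime(['2018-04-09 16:10:00']))
    DateEncoder().fit_transform(s)  # one row: month=4, dayofweek=0, hour=16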

  78. #3
    LOG ALL THE THINGS

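    These slides don't spell out a logging setup, so here is a minimal stdlib
    sketch (the filename and format string are invented):

    import logging

    logging.basicConfig(
        filename='model.log',
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(message)s',
    )
    logging.info('training started')  # lands in model.log with a timestamp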


  83. Cerberus is a lightweight and extensible data validation library for Python
    $ pip install cerberus

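    A minimal hedged example of plain Cerberus (schema and document invented):

    from cerberus import Validator

    schema = {'pick_up': {'type': 'string',
                          'allowed': ['home', 'work', 'other']}}
    v = Validator(schema)
    v.validate({'pick_up': 'gym'})  # False
    v.errors                        # {'pick_up': ['unallowed value gym']}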

  93. from copy import deepcopy
    from cerberus import Validator

    class PandasValidator(Validator):
        def validate(self, document, schema, update=False, normalize=True):
            document = document.to_dict(orient='list')
            schema = self.transform_schema(schema)
            # propagate Cerberus's True/False result
            return super().validate(document, schema,
                                    update=update, normalize=normalize)

        def transform_schema(self, schema):
            schema = deepcopy(schema)
            for k, v in schema.items():
                schema[k] = {'type': 'list', 'schema': v}
            return schema

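    A hedged sketch of the wrapper on a toy frame: transform_schema turns each
    column rule into a list rule, so Cerberus can check a whole column at once:

    import pandas as pd

    df = pd.DataFrame({'pick_up': ['home', 'work', None]})
    v = PandasValidator()
    v.validate(df, {'pick_up': {'type': 'string'}})  # False: None isn't a string
    v.errors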

  102. 78asd86d876ad8678sdadsa687d


  106. #4
    Learn how to CLI


  107. input output


  116. < refactor >


  120. $ python model.py predict --file=max_bike_data.csv


  123. $ python model.py predict my_bike_data.csv


  124. $ python model.py predict sunny_bike_data.csv

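    predict.py (slide 150) ends up using Fire, but the abstract also promises
    Click; a hypothetical Click version of the same command could look like:

    import click
    import pandas as pd

    @click.group()
    def cli():
        pass

    @cli.command()
    @click.option('--file', type=click.Path(exists=True))
    def predict(file):
        df = pd.read_csv(file)          # load the ride data
        click.echo(f'{len(df)} rows')   # stand-in for the real prediction

    if __name__ == '__main__':
        cli()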


  127. #5
    mummify


  128. you suck at git
    and logging
    but it’s not your fault


  138. model.py base

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.pipeline import make_pipeline
    from sklearn_pandas import DataFrameMapper, CategoricalImputer
    from helpers import DateEncoder

    df = pd.read_csv('../max_bike_data.csv')
    df['time'] = pd.to_datetime(df['time'])
    df = df[(df['pick_up'].notnull()) & (df['drop_off'].notnull())]

    TARGET = 'drop_off'
    y = df[TARGET].values
    X = df.drop(TARGET, axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    mapper = DataFrameMapper([
        ('time', DateEncoder(), {'input_df': True}),
        ('pick_up', LabelBinarizer()),
        ('last_drop_off', [CategoricalImputer(), LabelBinarizer()]),
        ('last_pick_up', [CategoricalImputer(), LabelBinarizer()])
    ])

    lb = LabelBinarizer()
    y_train = lb.fit_transform(y_train)


  139. model.py add
    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    pipe = make_pipeline(mapper, model)
    pipe.fit(X_train, y_train)
    acc_train = pipe.score(X_train, y_train)
    acc_test = pipe.score(X_test, lb.transform(y_test))
    print(f'Training: {acc_train:.3f}, Testing: {acc_test:.3f}')


  140. model.py mummify

    import mummify
    mummify.log(f'Training: {acc_train:.3f}, Testing: {acc_test:.3f}')


  141. from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier()
    model.py model swap 1


  142. from sklearn.neural_network import MLPClassifier
    model = MLPClassifier()
    model.py model swap 2


  143. from sklearn.neural_network import MLPClassifier
    model = MLPClassifier(max_iter=2000)
    model.py model swap 2 + max_iter


  144. mummify command line

    $ mummify history
    $ mummify switch
    $ mummify history


  145. mummify is just git

    $ git --git-dir=.mummify status


  146. from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier(n_neighbors=6)
    mummify adjust hypers on 1


  147. from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier(n_neighbors=4)
    mummify adjust hypers on 1


  148. from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=1000)
    mummify switch back to rf


  149. pickle model

    import pickle

    with open('rick.pkl', 'wb') as f:
        pickle.dump((pipe, lb), f)


  150. predict.py

    import pickle
    import pandas as pd
    from fire import Fire

    with open('rick.pkl', 'rb') as f:
        pipe, lb = pickle.load(f)

    def predict(file):
        df = pd.read_csv(file)
        df['time'] = pd.to_datetime(df['time'])
        y = pipe.predict(df)
        y = lb.inverse_transform(y)[0]
        return f'Max is probably going to {y}'

    if __name__ == '__main__':
        Fire(predict)

    $ git --git-dir=.mummify add .
    $ git --git-dir=.mummify commit -m 'add predict'


  151. new_data.csv

    time,pick_up,last_drop_off,last_pick_up
    2018-04-09 9:15:52,home,other,home

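    Presumably invoked like this (Fire maps the positional argument to
    predict's file parameter):

    $ python predict.py new_data.csv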


  154. https://github.com/maxhumber/mummify
    pip install mummify
    conda install -c maxhumber mummify


  155. #END


  156. hydrogen
    sklearn
    sklearn-pandas
    cerberus


  158. mummify
    https://leanpub.com/personal_finance_with_python/c/anaconda
    First 50 get it free!
