Full-stack Data Science: How to be a One-Man Data Team


Slides for the talk I gave at a CognitionX meetup.

Based on my 3 years of experience working at startups, moving from web development to data engineering to data science. A ton of tips and tricks for doing data science fast.


Greg Goltsov

July 26, 2016

Transcript

  1. 1.

    Full-stack data science: how to be a one-man data team

    Greg Goltsov · Data Hacker · gregory.goltsov.info · @gregoltsov
  2. 2.

    3+ years in startups · Pythonista · Built backends for 1 mil+ users · Delivered to Fortune 10 · Engineering → science

    Greg Goltsov · Data Hacker · gregory.goltsov.info · @gregoltsov
  3. 3.

    My journey · Invest in tools that last · Data is simple · Explore literally · Start fast, iterate faster · Analysis is a DAG · Don't guard, empower instead · What next?
  4. 5.

    CS + Physics · Games dev · Data analyst/engineer/viz/* · Data Hacker · Data Scientist

    University · Touch Surgery · Appear Here
  9. 20.

    CREATE TABLE events (
      name       varchar(200),
      visitor_id varchar(200),
      properties jsonb,
      browser    jsonb
    );

    Postgres JSONB: "Hey, there's MongoDB in my Postgres!"
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  10. 21.

    INSERT INTO events VALUES (
      'pageview', '1',
      '{ "page": "/account" }',
      '{ "name": "Chrome", "os": "Mac", "resolution": { "x": 1440, "y": 900 } }'
    );

    INSERT INTO events VALUES (
      'purchase', '5',
      '{ "amount": 10 }',
      '{ "name": "Firefox", "os": "Windows", "resolution": { "x": 1024, "y": 768 } }'
    );

    Postgres JSONB
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  11. 22.

    SELECT browser->>'name' AS browser, count(browser)
    FROM events
    GROUP BY browser->>'name';

     browser | count
    ---------+-------
     Firefox |     3
     Chrome  |     2

    Postgres JSONB
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
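    The same operators run unchanged from Python, and -> composes for nested keys. A minimal sketch, assuming a local Postgres holding the events table above, using the dataset library that appears later in the deck (slide 40); the connection URL and credentials are placeholders:

    import dataset

    # placeholder credentials -- point this at your own database
    db = dataset.connect('postgresql://u:p@localhost:5432/db')

    # the aggregation from the slide, with an explicit alias for the count
    for row in db.query("""
        SELECT browser->>'name' AS browser, count(browser) AS n
        FROM events
        GROUP BY browser->>'name'
    """):
        print(row['browser'], row['n'])

    # -> steps into nested objects, ->> pulls the leaf out as text
    for row in db.query("""
        SELECT browser->'resolution'->>'x' AS width, count(*) AS n
        FROM events
        GROUP BY 1
    """):
        print(row['width'], row['n'])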
  13. 24.

    WITH new_users AS (...),
         unverified_users_ids AS (...)
    SELECT COUNT(new_users.id)
    FROM new_users
    WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);

    Postgres WITH
  15. 30.

    # plain python
    col_C = []
    for i, row in enumerate(col_A):
        c = row + col_B[i]
        col_C.append(c)

    # pandas
    df['C'] = df['A'] + df['B']

    pandas: What vs How
    http://worrydream.com/LadderOfAbstraction
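    To make the comparison runnable, a tiny self-contained sketch with made-up data (the column names A/B/C mirror the slide):

    import pandas as pd

    # toy data standing in for col_A / col_B
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

    # "how": spell the loop out by hand
    col_C = []
    for i, a in enumerate(df['A']):
        col_C.append(a + df['B'].iloc[i])

    # "what": declare the relationship, let pandas vectorise it
    df['C'] = df['A'] + df['B']

    print(col_C)              # [11, 22, 33]
    print(df['C'].tolist())   # [11, 22, 33]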
  16. 31.

    Like to clean data · Slice & dice data fluently
    http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says
  18. 38.

    cookiecutter-data-science: `rails new` for data science
    Notebooks are for exploration · Sane structure for collaboration
    https://drivendata.github.io/cookiecutter-data-science
  19. 39.

    !"" Makefile <- Makefile with commands like `make data` or

    `make train` !"" data # !"" external <- Data from third party sources. # !"" interim <- Intermediate data that has been transformed. # !"" processed <- The final, canonical data sets for modeling. # $"" raw <- The original, immutable data dump. !"" docs <- A default Sphinx project; see sphinx-doc.org for details !"" models <- Trained and serialized models, model predictions !"" notebooks <- Jupyter notebooks !"" references <- Data dictionaries, manuals, and all other explanatory materials. !"" reports <- Generated analysis as HTML, PDF, LaTeX, etc. !"" requirements.txt <- The requirements file for reproducing the env !"" src <- Source code for use in this project. # !"" data <- Scripts to download or generate data # # $"" make_dataset.py # !"" features <- Scripts to turn raw data into features for modeling # # $"" build_features.py # !"" models <- Scripts to train models and then use trained models to make # # # predictions # # !"" predict_model.py # # $"" train_model.py # $"" visualization <- Scripts to create exploratory and results oriented visualizations # $"" visualize.py
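    Generating that skeleton is a single call; a minimal sketch, assuming the cookiecutter package is installed (the equivalent CLI is `cookiecutter <template-url>`):

    from cookiecutter.main import cookiecutter

    # prompts for project name etc., then lays out the directory tree above
    cookiecutter('https://github.com/drivendata/cookiecutter-data-science')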
  20. 40.

    dataset.readthedocs.io · Just write SQL

    import dataset
    import pandas
    from stuf import stuf

    # connect, return rows as objects with attributes;
    # materialise the result so it can be iterated more than once
    db = dataset.connect('postgresql://u:p@localhost:5432/db', row_type=stuf)
    rows = list(db.query('SELECT country, COUNT(*) c FROM user GROUP BY country'))

    # print all the rows
    for row in rows:
        print(row['country'], row['c'])

    # get data into pandas, that's where the fun begins!
    rows_df = pandas.DataFrame.from_records(rows)
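    A hypothetical continuation of rows_df, as a hint of where that fun goes (the last line needs matplotlib):

    # rank countries by user count and add a share-of-total column
    top = rows_df.sort_values('c', ascending=False)
    top['share'] = top['c'] / top['c'].sum()
    print(top.head(10))

    # quick-and-dirty bar chart straight from the DataFrame
    top.set_index('country')['c'].head(10).plot(kind='bar')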
  21. 41.

    # sklearn-pandas
    import sklearn.pipeline
    import sklearn.preprocessing
    from sklearn import ensemble, feature_selection, grid_search
    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        (['age'], [sklearn.preprocessing.Imputer(),
                   sklearn.preprocessing.StandardScaler()]),
        ...])

    pipeline = sklearn.pipeline.Pipeline([
        ('featurise', mapper),
        ('feature_selection', feature_selection.SelectKBest(k=100)),
        ('random_forest', ensemble.RandomForestClassifier())])

    cv_params = dict(
        feature_selection__k=[100, 200],
        random_forest__n_estimators=[50, 100, 200])

    cv = grid_search.GridSearchCV(pipeline, param_grid=cv_params)
    cv.fit(X, y)  # X, y: your training features and labels
    best_model = cv.best_estimator_
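    The slide uses the 2016-era scikit-learn API (Imputer, sklearn.grid_search, both gone in current releases). A self-contained sketch of the same pipeline-plus-grid-search pattern against today's API, with made-up numeric data standing in for the DataFrameMapper step:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # toy data: 100 samples, 20 features, a sprinkling of missing values
    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)
    X[rng.rand(100, 20) < 0.05] = np.nan
    y = rng.randint(0, 2, 100)

    pipeline = Pipeline([
        ('impute', SimpleImputer()),
        ('scale', StandardScaler()),
        ('feature_selection', SelectKBest(k=10)),
        ('random_forest', RandomForestClassifier())])

    cv_params = dict(
        feature_selection__k=[5, 10],
        random_forest__n_estimators=[50, 100])

    cv = GridSearchCV(pipeline, param_grid=cv_params)
    cv.fit(X, y)
    best_model = cv.best_estimator_
    print(cv.best_params_)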
  23. 47.

    "The goal is to turn data into information, and information into insight."
    – Carly Fiorina, former HP CEO