Full-stack Data Science: How to be a One-Man Data Team

Slides for the talk I gave at a CognitionX meetup.

Based on my three years of experience working at startups, moving from web development to data engineering to data science. A collection of tips and tricks for doing data science fast.

Greg Goltsov

July 26, 2016

Transcript

  1. Full-stack data science: how to be a one-man data team.
     Greg Goltsov, Data Hacker. gregory.goltsov.info, @gregoltsov
  2. 3+ years in startups. Pythonista. Built backends for 1M+ users.
     Delivered to Fortune 10. Engineering → science.
     Greg Goltsov, Data Hacker. gregory.goltsov.info, @gregoltsov
  3. My journey. Invest in tools that last. Data is simple. Explore literally.
     Start fast, iterate faster. Analysis is a DAG. Don't guard, empower instead.
     What next?
  4. CS + Physics (university) → games dev → data analyst/engineer/viz/*
     (Touch Surgery) → Data Hacker / Data Scientist (Appear Here).
  5. "Hey, there's MongoDB in my Postgres!" Postgres JSONB:

     CREATE TABLE events (
         name       varchar(200),
         visitor_id varchar(200),
         properties jsonb,
         browser    jsonb
     );

     http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  6. Postgres JSONB, inserting events:

     INSERT INTO events VALUES (
         'pageview', '1',
         '{ "page": "/account" }',
         '{ "name": "Chrome", "os": "Mac",
            "resolution": { "x": 1440, "y": 900 } }'
     );

     INSERT INTO events VALUES (
         'purchase', '5',
         '{ "amount": 10 }',
         '{ "name": "Firefox", "os": "Windows",
            "resolution": { "x": 1024, "y": 768 } }'
     );

     http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  7. Postgres JSONB, querying JSON fields directly:

     SELECT browser->>'name' AS browser, count(browser)
     FROM events
     GROUP BY browser->>'name';

      browser | count
     ---------+-------
      Firefox |     3
      Chrome  |     2

     http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  8. Postgres WITH (common table expressions):

     WITH new_users AS (...),
          unverified_users_ids AS (...)
     SELECT COUNT(new_users.id)
     FROM new_users
     WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);
  9. What vs How (http://worrydream.com/LadderOfAbstraction), pandas:

     # plain python
     col_C = []
     for i, row in enumerate(col_A):
         c = row + col_B[i]
         col_C.append(c)

     # pandas
     df['C'] = df['A'] + df['B']
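
     A minimal runnable version of the slide's example, with a made-up
     two-column frame (the names A, B, C are just illustrative):

     import pandas as pd

     df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

     # "how": spell out every step yourself
     col_C = []
     for i, a in enumerate(df['A']):
         col_C.append(a + df['B'].iloc[i])

     # "what": declare the result and let pandas vectorise it
     df['C'] = df['A'] + df['B']
     assert list(df['C']) == col_C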
  10. Learn to like cleaning data. Slice & dice data fluently.
      http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says
  11. cookiecutter-data-science: "rails new" for data science. Notebooks are
      for exploration. Sane structure for collaboration.
      https://drivendata.github.io/cookiecutter-data-science
  12. !"" Makefile <- Makefile with commands like `make data` or

    `make train` !"" data # !"" external <- Data from third party sources. # !"" interim <- Intermediate data that has been transformed. # !"" processed <- The final, canonical data sets for modeling. # $"" raw <- The original, immutable data dump. !"" docs <- A default Sphinx project; see sphinx-doc.org for details !"" models <- Trained and serialized models, model predictions !"" notebooks <- Jupyter notebooks !"" references <- Data dictionaries, manuals, and all other explanatory materials. !"" reports <- Generated analysis as HTML, PDF, LaTeX, etc. !"" requirements.txt <- The requirements file for reproducing the env !"" src <- Source code for use in this project. # !"" data <- Scripts to download or generate data # # $"" make_dataset.py # !"" features <- Scripts to turn raw data into features for modeling # # $"" build_features.py # !"" models <- Scripts to train models and then use trained models to make # # # predictions # # !"" predict_model.py # # $"" train_model.py # $"" visualization <- Scripts to create exploratory and results oriented visualizations # $"" visualize.py
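
      To make the layout concrete: a hypothetical minimal src/data/make_dataset.py
      (the paths follow the tree above; the dropna() is a placeholder cleaning
      step, not the template's actual code):

      from pathlib import Path

      import pandas as pd

      RAW = Path('data/raw')
      PROCESSED = Path('data/processed')

      def main():
          # read the immutable raw dump (hypothetical file name)
          df = pd.read_csv(RAW / 'events.csv')
          # placeholder transformation; real cleaning goes here
          df = df.dropna()
          PROCESSED.mkdir(parents=True, exist_ok=True)
          df.to_csv(PROCESSED / 'events.csv', index=False)

      if __name__ == '__main__':
          main()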
  13. dataset.readthedocs.io: just write SQL.

      import dataset
      import pandas
      from stuf import stuf

      # connect; return rows as objects with attributes
      db = dataset.connect('postgresql://u:p@localhost:5432/db', row_type=stuf)

      # materialise the result so it can be iterated more than once
      rows = list(db.query(
          'SELECT country, COUNT(*) AS c FROM "user" GROUP BY country'))

      # print all the rows
      for row in rows:
          print(row['country'], row['c'])

      # get data into pandas, that's where the fun begins!
      rows_df = pandas.DataFrame.from_records(rows)
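
      Tying this back to the JSONB slides: a minimal sketch that runs the
      browser count from slide 7 through dataset into pandas (same made-up
      connection string as above; adjust for your own database):

      import dataset
      import pandas as pd

      db = dataset.connect('postgresql://u:p@localhost:5432/db')
      rows = list(db.query("""
          SELECT browser->>'name' AS browser, count(*) AS c
          FROM events
          GROUP BY browser->>'name'
      """))
      browsers_df = pd.DataFrame.from_records(rows)
      print(browsers_df.sort_values('c', ascending=False))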
  14. sklearn-pandas:

      mapper = DataFrameMapper([
          (['age'], [sklearn.preprocessing.Imputer(),
                     sklearn.preprocessing.StandardScaler()]),
          ...])

      pipeline = sklearn.pipeline.Pipeline([
          ('featurise', mapper),
          ('feature_selection', feature_selection.SelectKBest(k=100)),
          ('random_forest', ensemble.RandomForestClassifier())])

      cv_params = dict(
          feature_selection__k=[100, 200],
          random_forest__n_estimators=[50, 100, 200])

      cv = grid_search.GridSearchCV(pipeline, param_grid=cv_params)
      cv.fit(X, y)                     # X, y: your features and labels
      best_model = cv.best_estimator_
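
      The slide uses the scikit-learn 0.17-era API. A self-contained sketch of
      the same pipeline against a modern scikit-learn (Imputer is now
      SimpleImputer, grid_search moved to model_selection), with a toy frame
      and labels invented for illustration:

      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.feature_selection import SelectKBest
      from sklearn.impute import SimpleImputer
      from sklearn.model_selection import GridSearchCV
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn_pandas import DataFrameMapper

      # toy data: 'age' has a gap for the imputer to fill
      df = pd.DataFrame({'age':    [25, np.nan, 47, 33, 51, 19],
                         'visits': [3, 8, 2, 5, 9, 1]})
      y = np.array([0, 1, 0, 1, 1, 0])

      mapper = DataFrameMapper([
          (['age'], [SimpleImputer(), StandardScaler()]),
          (['visits'], StandardScaler())])

      pipeline = Pipeline([
          ('featurise', mapper),
          ('feature_selection', SelectKBest(k=1)),
          ('random_forest', RandomForestClassifier())])

      cv_params = dict(
          feature_selection__k=[1, 2],
          random_forest__n_estimators=[10, 50])

      cv = GridSearchCV(pipeline, param_grid=cv_params, cv=2)
      cv.fit(df, y)
      best_model = cv.best_estimator_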
  15. "The goal is to turn data into information, and information into insight."
      – Carly Fiorina, former HP CEO