Full-stack Data Science: How to be a One-Man Data Team

Slides for the talk I gave at a CognitionX meetup.

Based on my 3+ years of experience working at startups, going from web development to data engineering to data science. A ton of tips and tricks on fast ways of doing data science.

Greg Goltsov

July 26, 2016

Transcript

  1. Full-stack data science how to be a one-man data team

    Greg Goltsov Data Hacker gregory.goltsov.info @gregoltsov
  2. 3+ years in startups Pythonista Built backends for 1 mil+

    users Delivered to Fortune 10 Engineering → science Greg Goltsov Data Hacker gregory.goltsov.info @gregoltsov
  3. My journey Invest in tools that last Data is simple

    Explore literally Start fast, iterate faster Analysis is a DAG Don’t guard, empower instead What next?
  4. Small/medium data Python Concepts > code

  5. CS + Physics Games dev Data analyst/ engineer/viz/* Data Hacker

    Data Scientist University Touch Surgery Appear Here
  6. CS + Physics Games dev

  7. None
  8. Web dev

  9. Web dev

  10. Web dev Full-stack dev DBA SysAdmin DevOps Data Analyst Data

    Engineer Team Lead
  11. 60,000 to 1,000,000

  12. None
  13. First contract, 1-man team Engineering → science Learning. Fast.

  14. Invest in tools that last

  15. Basic: Postgres, Pandas, Scikit-learn, Notebooks, Luigi/Airflow, Bash, Git

    Extra: Flask, AWS EC2, AWS Redshift, d3.js, ElasticSearch, Spark
  16. Data is simple*

  17. SQL is simple

  18. SQL is everywhere

  19. If you can, use Postgres

  20. Postgres JSONB: “Hey, there’s MongoDB in my Postgres!”

    CREATE TABLE events (
        name       varchar(200),
        visitor_id varchar(200),
        properties jsonb,
        browser    jsonb
    );
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  21. Postgres JSONB

    INSERT INTO events VALUES (
        'pageview', '1',
        '{ "page": "/account" }',
        '{ "name": "Chrome", "os": "Mac",
           "resolution": { "x": 1440, "y": 900 } }'
    );
    INSERT INTO events VALUES (
        'purchase', '5',
        '{ "amount": 10 }',
        '{ "name": "Firefox", "os": "Windows",
           "resolution": { "x": 1024, "y": 768 } }'
    );
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  22. Postgres JSONB

    SELECT browser->>'name' AS browser, count(browser)
    FROM events
    GROUP BY browser->>'name';

     browser | count
    ---------+-------
     Firefox |     3
     Chrome  |     2
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  23. Postgres CSV

    your_db=# \o 'path/to/export.csv'
    your_db=# COPY ( SELECT * ... ) TO STDOUT WITH CSV HEADER;
  24. Postgres WITH

    WITH new_users AS (...),
         unverified_users_ids AS (...)
    SELECT COUNT(new_users.id)
    FROM new_users
    WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);
  25. Postgres products

  26. Postgres products

  27. Postgres products

  28. Pandas

  29. “R in Python” DataFrame Simple I/O Plotting Split/apply/combine
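
    A minimal split/apply/combine sketch; the `orders` frame and its column names are invented for illustration:

    import pandas as pd

    # toy data, made up for the example
    orders = pd.DataFrame({
        'country': ['UK', 'UK', 'FR', 'FR', 'DE'],
        'amount':  [10.0, 25.0, 7.5, 12.0, 30.0],
    })

    # split by country, apply aggregations, combine into one summary frame
    summary = orders.groupby('country')['amount'].agg(['count', 'mean', 'sum'])
    print(summary)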

  30. pandas: What vs How http://worrydream.com/LadderOfAbstraction

    # plain python
    col_C = []
    for i, row in enumerate(col_A):
        c = row + col_B[i]
        col_C.append(c)

    # pandas
    df['C'] = df['A'] + df['B']
  31. Like to clean data Slice & dice data fluently

    http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says
  32. Scikit-learn

  33. Fit, transform, predict Train/test split NumPy
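
    A minimal fit/predict sketch with a train/test split, assuming a recent scikit-learn (these import paths date from version 0.18):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # hold out a test set to estimate how the model generalises
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)            # learn from the training split
    print(model.score(X_test, y_test))     # evaluate on unseen data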

  34. Explore literally

  35. None
  36. Start fast, iterate faster

  37. Apache Zeppelin Spark Notebook on steroids http://zeppelin-project.org/

  38. drivendata.github.io/cookiecutter-data-science Rails new for data science Notebooks are for exploration

    Sane structure for collaboration https://drivendata.github.io/cookiecutter-data-science
  39. !"" Makefile <- Makefile with commands like `make data` or

    `make train` !"" data # !"" external <- Data from third party sources. # !"" interim <- Intermediate data that has been transformed. # !"" processed <- The final, canonical data sets for modeling. # $"" raw <- The original, immutable data dump. !"" docs <- A default Sphinx project; see sphinx-doc.org for details !"" models <- Trained and serialized models, model predictions !"" notebooks <- Jupyter notebooks !"" references <- Data dictionaries, manuals, and all other explanatory materials. !"" reports <- Generated analysis as HTML, PDF, LaTeX, etc. !"" requirements.txt <- The requirements file for reproducing the env !"" src <- Source code for use in this project. # !"" data <- Scripts to download or generate data # # $"" make_dataset.py # !"" features <- Scripts to turn raw data into features for modeling # # $"" build_features.py # !"" models <- Scripts to train models and then use trained models to make # # # predictions # # !"" predict_model.py # # $"" train_model.py # $"" visualization <- Scripts to create exploratory and results oriented visualizations # $"" visualize.py
  40. dataset.readthedocs.io Just write SQL

    import dataset
    import pandas
    from stuf import stuf

    # connect, return rows as objects with attributes
    db = dataset.connect('postgresql://u:p@localhost:5432/db', row_type=stuf)
    # materialise the iterator so the rows can be reused below
    rows = list(db.query('SELECT country, COUNT(*) c FROM "user" GROUP BY country'))

    # print all the rows
    for row in rows:
        print(row['country'], row['c'])

    # get data into pandas, that's where the fun begins!
    rows_df = pandas.DataFrame.from_records(rows)
  41. sklearn-pandas

    import sklearn.pipeline
    import sklearn.preprocessing
    from sklearn import ensemble, feature_selection, grid_search
    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        (['age'], [sklearn.preprocessing.Imputer(),
                   sklearn.preprocessing.StandardScaler()]),
        ...])
    pipeline = sklearn.pipeline.Pipeline([
        ('featurise', mapper),
        ('feature_selection', feature_selection.SelectKBest(k=100)),
        ('random_forest', ensemble.RandomForestClassifier())])
    cv_params = dict(
        feature_selection__k=[100, 200],
        random_forest__n_estimators=[50, 100, 200])
    cv = grid_search.GridSearchCV(pipeline, param_grid=cv_params)
    cv.fit(X, y)  # X, y: your training features and labels
    best_model = cv.best_estimator_
  42. None
  43. Analysis is a DAG

  44. get_data.sh && process_data.sh && publish_data.sh

  45. Luigi/Airflow Describe the pipeline
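
    A minimal Luigi sketch of describing a pipeline as a DAG; the task names and file paths are invented for illustration:

    import luigi

    class GetData(luigi.Task):
        def output(self):
            return luigi.LocalTarget('data/raw/events.csv')

        def run(self):
            with self.output().open('w') as f:
                f.write('...fetched data...\n')

    class ProcessData(luigi.Task):
        # declaring the dependency is what builds the DAG;
        # Luigi only re-runs tasks whose outputs are missing
        def requires(self):
            return GetData()

        def output(self):
            return luigi.LocalTarget('data/processed/events.csv')

        def run(self):
            with self.input().open() as raw, self.output().open('w') as out:
                out.write(raw.read())

    if __name__ == '__main__':
        luigi.run()  # python pipeline.py ProcessData --local-scheduler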

  46. Don’t guard, empower instead http://jhtart.deviantart.com/art/Castle-171525835

  47. The goal is to turn data into information, and information

    into insight. – Carly Fiorina, former HP CEO
  48. What next?

  49. None
  50. Building a data science portfolio: Storytelling with data Awesome public

    datasets
  51. Meetups Meet Speak Give back

  52. Read Keep up DataTau DataScienceWeekly O’Reilly Data Newsletter KDnuggets

  53. Kaggle Learn Study Compete

  54. None
  55. Thanks! Questions? Greg Goltsov Data Hacker gregory.goltsov.info @gregoltsov