Slide 1

Slide 1 text

Full-stack data science
How to be a one-man data team

Greg Goltsov
Data Hacker
gregory.goltsov.info
@gregoltsov

Slide 2

Slide 2 text

3+ years in startups
Pythonista
Built backends for 1 mil+ users
Delivered to Fortune 10
Engineering → science

Slide 3

Slide 3 text

My journey
Invest in tools that last
Data is simple
Explore literally
Start fast, iterate faster
Analysis is a DAG
Don’t guard, empower instead
What next?

Slide 4

Slide 4 text

Small/medium data
Python
Concepts > code

Slide 5

Slide 5 text

CS + Physics
Games dev
Data analyst/engineer/viz/*
Data Hacker
Data Scientist

University
Touch Surgery
Appear Here

Slide 6

Slide 6 text

CS + Physics Games dev

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Web dev

Slide 9

Slide 9 text

Web dev

Slide 10

Slide 10 text

Web dev
Full-stack dev
DBA
SysAdmin
DevOps
Data Analyst
Data Engineer
Team Lead

Slide 11

Slide 11 text

60,000 to 1,000,000

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

First contract, 1-man team
Engineering → science
Learning. Fast.

Slide 14

Slide 14 text

Invest in tools that last

Slide 15

Slide 15 text

Basic:
Postgres
Pandas
Scikit-learn
Notebooks
Luigi/Airflow
Bash
Git

Extra:
Flask
AWS EC2
AWS Redshift
d3.js
ElasticSearch
Spark

Slide 16

Slide 16 text

Data is simple*

Slide 17

Slide 17 text

SQL is simple

Slide 18

Slide 18 text

SQL is everywhere

Slide 19

Slide 19 text

If you can, use Postgres

Slide 20

Slide 20 text

“Hey, there’s MongoDB in my Postgres!”

CREATE TABLE events (
  name       varchar(200),
  visitor_id varchar(200),
  properties jsonb,
  browser    jsonb
);

Postgres JSONB
http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json

Slide 21

Slide 21 text

INSERT INTO events VALUES (
  'pageview', '1',
  '{ "page": "/account" }',
  '{ "name": "Chrome", "os": "Mac",
     "resolution": { "x": 1440, "y": 900 } }'
);

INSERT INTO events VALUES (
  'purchase', '5',
  '{ "amount": 10 }',
  '{ "name": "Firefox", "os": "Windows",
     "resolution": { "x": 1024, "y": 768 } }'
);

Postgres JSONB
http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json

Slide 22

Slide 22 text

SELECT browser->>'name' AS browser, count(browser)
FROM events
GROUP BY browser->>'name';

 browser | count
---------+-------
 Firefox |     3
 Chrome  |     2

Postgres JSONB
http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json

Slide 23

Slide 23 text

your_db=# \o 'path/to/export.csv'
your_db=# COPY (
  SELECT * ...
) TO STDOUT WITH CSV HEADER;

Postgres CSV
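The exported file loads straight into pandas. A minimal sketch of that handoff, with an in-memory string standing in for the contents of the exported CSV (the browser counts are illustrative, echoing the earlier JSONB query):

```python
import io

import pandas as pd

# stand-in for a file written by \o + COPY ... TO STDOUT WITH CSV HEADER
export_csv = "browser,count\nFirefox,3\nChrome,2\n"

# the CSV header row becomes column names; numeric types are inferred
df = pd.read_csv(io.StringIO(export_csv))
print(df)
```

In practice you would pass the exported file's path to `pd.read_csv` directly.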

Slide 24

Slide 24 text

WITH new_users AS (...),
     unverified_users_ids AS (...)
SELECT COUNT(new_users.id)
FROM new_users
WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);

Postgres WITH

Slide 25

Slide 25 text

Postgres products

Slide 26

Slide 26 text

Postgres products

Slide 27

Slide 27 text

Postgres products

Slide 28

Slide 28 text

Pandas

Slide 29

Slide 29 text

“R in Python”
DataFrame
Simple I/O
Plotting
Split/apply/combine
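Split/apply/combine is a single `groupby` call in pandas. A toy sketch with made-up data (the table and column names are illustrative, not from the talk):

```python
import pandas as pd

# toy events table, one row per session
df = pd.DataFrame({
    'browser': ['Chrome', 'Firefox', 'Chrome', 'Firefox', 'Firefox'],
    'visits':  [1, 2, 3, 1, 1],
})

# split by browser, apply sum to each group, combine into one Series
totals = df.groupby('browser')['visits'].sum()
print(totals)
```

The same pattern scales from `sum` to arbitrary per-group functions via `agg` and `apply`.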

Slide 30

Slide 30 text

pandas: What vs How

# plain python
col_C = []
for i, row in enumerate(col_A):
    c = row + col_B[i]
    col_C.append(c)

# pandas
df['C'] = df['A'] + df['B']

http://worrydream.com/LadderOfAbstraction

Slide 31

Slide 31 text

Like to clean data
Slice & dice data fluently

http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says

Slide 32

Slide 32 text

Scikit-learn

Slide 33

Slide 33 text

Fit, transform, predict
Train/test split
NumPy
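Every scikit-learn estimator follows the same fit/transform/predict interface. A minimal sketch of all three ideas together, using the bundled iris data (the dataset and model choice are illustrative, not from the talk):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# train/test split: hold out a quarter of the data for honest evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# fit/transform: learn the scaling on train data only, apply it to both
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# fit/predict: train the model, then score it on the held-out set
model = LogisticRegression(max_iter=200).fit(X_train_s, y_train)
accuracy = model.score(X_test_s, y_test)
print(accuracy)
```

Because transformers and estimators share this interface, they compose into the pipelines shown later in the deck.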

Slide 34

Slide 34 text

Explore literally

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

Start fast, iterate faster

Slide 37

Slide 37 text

Apache Zeppelin
Spark
Notebook on steroids
http://zeppelin-project.org/

Slide 38

Slide 38 text

cookiecutter-data-science
`rails new` for data science
Notebooks are for exploration
Sane structure for collaboration

https://drivendata.github.io/cookiecutter-data-science

Slide 39

Slide 39 text

├── Makefile           <- Makefile with commands like `make data` or `make train`
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions
├── notebooks          <- Jupyter notebooks
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
├── requirements.txt   <- The requirements file for reproducing the env
└── src                <- Source code for use in this project.
    ├── data           <- Scripts to download or generate data
    │   └── make_dataset.py
    ├── features       <- Scripts to turn raw data into features for modeling
    │   └── build_features.py
    ├── models         <- Scripts to train models and then use trained models to make predictions
    │   ├── predict_model.py
    │   └── train_model.py
    └── visualization  <- Scripts to create exploratory and results oriented visualizations
        └── visualize.py

Slide 40

Slide 40 text

dataset.readthedocs.io
Just write SQL

# connect, return rows as objects with attributes
db = dataset.connect('postgresql://u:p@localhost:5432/db', row_type=stuf)
rows = db.query('SELECT country, COUNT(*) c FROM user GROUP BY country')

# print all the rows
for row in rows:
    print(row['country'], row['c'])

# get data into pandas, that's where the fun begins!
rows_df = pandas.DataFrame.from_records(rows)

Slide 41

Slide 41 text

# sklearn-pandas
mapper = DataFrameMapper([
    (['age'], [sklearn.preprocessing.Imputer(),
               sklearn.preprocessing.StandardScaler()]),
    ...])

pipeline = sklearn.pipeline.Pipeline([
    ('featurise', mapper),
    ('feature_selection', feature_selection.SelectKBest(k=100)),
    ('random_forest', ensemble.RandomForestClassifier())])

cv_params = dict(
    feature_selection__k=[100, 200],
    random_forest__n_estimators=[50, 100, 200])

cv = grid_search.GridSearchCV(pipeline, param_grid=cv_params)
cv.fit(X, y)
best_model = cv.best_estimator_

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Analysis is a DAG

Slide 44

Slide 44 text

get_data.sh && process_data.sh && publish_data.sh
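The shell chain works until a step fails and everything has to rerun. Treating the steps as a DAG lets a scheduler resolve dependencies and skip work that is already done; a minimal plain-Python sketch of the idea (task names mirror the scripts above, no real work is executed):

```python
# each task lists the tasks it depends on
deps = {
    'get_data': [],
    'process_data': ['get_data'],
    'publish_data': ['process_data'],
}

def run(task, done=None):
    """Run a task after its dependencies, skipping anything already done."""
    done = done if done is not None else []
    for dep in deps[task]:
        run(dep, done)
    if task not in done:
        done.append(task)  # stand-in for actually executing the step
    return done

order = run('publish_data')
print(order)
```

Luigi and Airflow formalise exactly this: you declare dependencies and the scheduler figures out what to run, in what order, and what can be skipped or retried.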

Slide 45

Slide 45 text

Luigi/Airflow
Describe the pipeline

Slide 46

Slide 46 text

Don’t guard, empower instead
http://jhtart.deviantart.com/art/Castle-171525835

Slide 47

Slide 47 text

The goal is to turn data into information, and information into insight. – Carly Fiorina, former HP CEO

Slide 48

Slide 48 text

What next?

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Building a data science portfolio: Storytelling with data
Awesome public datasets

Slide 51

Slide 51 text

Meetups
Meet
Speak
Give back

Slide 52

Slide 52 text

Read. Keep up.
DataTau
DataScienceWeekly
O’Reilly Data Newsletter
KDnuggets

Slide 53

Slide 53 text

Kaggle
Learn
Study
Compete

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

Thanks! Questions?

Greg Goltsov
Data Hacker
gregory.goltsov.info
@gregoltsov