Full-stack Data Science: How to be a One-Man Data Team

Slides for the talk I gave at a CognitionX meetup.

Based on my 3+ years of experience working at startups, going from web development to data engineering to data science. A ton of tips and tricks on fast ways of doing data science.

Greg Goltsov

July 26, 2016

Transcript

  1. Full-stack data science how to be a one-man data team

    Greg Goltsov Data Hacker gregory.goltsov.info @gregoltsov
  2. 3+ years in startups Pythonista Built backends for 1 mil+

    users Delivered to Fortune 10 Engineering → science Greg Goltsov Data Hacker gregory.goltsov.info @gregoltsov
  3. My journey Invest in tools that last Data is simple

    Explore literally Start fast, iterate faster Analysis is a DAG Don’t guard, empower instead What next?
  4. Small/medium data Python Concepts > code

  5. CS + Physics Games dev Data analyst/ engineer/viz/* Data Hacker

    Data Scientist University Touch Surgery Appear Here
  6. CS + Physics Games dev

  7. None
  8. Web dev

  9. Web dev

  10. Web dev Full-stack dev DBA SysAdmin DevOps Data Analyst Data

    Engineer Team Lead
  11. 60,000 to 1,000,000

  12. None
  13. First contract, 1-man team Engineering → science Learning. Fast.

  14. Invest in tools that last

  15. Basic: Postgres, Pandas, Scikit-learn, Notebooks, Luigi/Airflow, Bash, Git

    Extra: Flask, AWS EC2, AWS Redshift, d3.js, ElasticSearch, Spark
  16. Data is simple*

  17. SQL is simple

  18. SQL is everywhere

  19. If you can, use Postgres

  20. Postgres JSONB: “Hey, there’s MongoDB in my Postgres!”

    CREATE TABLE events (
        name       varchar(200),
        visitor_id varchar(200),
        properties jsonb,
        browser    jsonb
    );
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  21. Postgres JSONB

    INSERT INTO events VALUES (
        'pageview', '1',
        '{ "page": "/account" }',
        '{ "name": "Chrome", "os": "Mac",
           "resolution": { "x": 1440, "y": 900 } }'
    );
    INSERT INTO events VALUES (
        'purchase', '5',
        '{ "amount": 10 }',
        '{ "name": "Firefox", "os": "Windows",
           "resolution": { "x": 1024, "y": 768 } }'
    );
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  22. Postgres JSONB

    SELECT browser->>'name' AS browser, count(browser)
    FROM events
    GROUP BY browser->>'name';

     browser | count
    ---------+-------
     Firefox |     3
     Chrome  |     2
    http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json
  23. Postgres CSV

    your_db=# \o 'path/to/export.csv'
    your_db=# COPY ( SELECT * ... ) TO STDOUT WITH CSV HEADER;
  24. Postgres WITH

    WITH new_users AS (...),
         unverified_users_ids AS (...)
    SELECT COUNT(new_users.id)
    FROM new_users
    WHERE new_users.id NOT IN (SELECT id FROM unverified_users_ids);
  25. Postgres products

  26. Postgres products

  27. Postgres products

  28. Pandas

  29. “R in Python” DataFrame Simple I/O Plotting Split/apply/combine
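
    A minimal split/apply/combine sketch; the `orders` frame and its column names are invented for illustration:

    import pandas as pd

    # toy data, made up for the example
    orders = pd.DataFrame({
        'country': ['UK', 'UK', 'FR', 'FR', 'DE'],
        'amount':  [10.0, 25.0, 7.5, 12.0, 30.0],
    })

    # split by country, apply aggregations, combine into one summary frame
    summary = orders.groupby('country')['amount'].agg(['count', 'mean', 'sum'])
    print(summary)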

  30. pandas: What vs How http://worrydream.com/LadderOfAbstraction

    # plain python
    col_C = []
    for i, row in enumerate(col_A):
        c = row + col_B[i]
        col_C.append(c)

    # pandas
    df['C'] = df['A'] + df['B']
  31. Like to clean data Slice & dice data fluently

    http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says
  32. Scikit-learn

  33. Fit, transform, predict Train/test split NumPy
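
    A minimal fit/predict sketch with a train/test split, assuming a recent scikit-learn (these import paths date from version 0.18):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # hold out a test set to estimate how the model generalises
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)            # learn from the training split
    print(model.score(X_test, y_test))     # evaluate on unseen data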

  34. Explore literally

  35. None
  36. Start fast, iterate faster

  37. Apache Zeppelin Spark Notebook on steroids http://zeppelin-project.org/

  38. drivendata.github.io/cookiecutter-data-science Rails new for data science Notebooks are for exploration

    Sane structure for collaboration https://drivendata.github.io/cookiecutter-data-science
  39. !"" Makefile <- Makefile with commands like `make data` or

    `make train` !"" data # !"" external <- Data from third party sources. # !"" interim <- Intermediate data that has been transformed. # !"" processed <- The final, canonical data sets for modeling. # $"" raw <- The original, immutable data dump. !"" docs <- A default Sphinx project; see sphinx-doc.org for details !"" models <- Trained and serialized models, model predictions !"" notebooks <- Jupyter notebooks !"" references <- Data dictionaries, manuals, and all other explanatory materials. !"" reports <- Generated analysis as HTML, PDF, LaTeX, etc. !"" requirements.txt <- The requirements file for reproducing the env !"" src <- Source code for use in this project. # !"" data <- Scripts to download or generate data # # $"" make_dataset.py # !"" features <- Scripts to turn raw data into features for modeling # # $"" build_features.py # !"" models <- Scripts to train models and then use trained models to make # # # predictions # # !"" predict_model.py # # $"" train_model.py # $"" visualization <- Scripts to create exploratory and results oriented visualizations # $"" visualize.py
  40. dataset.readthedocs.io Just write SQL

    import dataset
    import pandas
    from stuf import stuf

    # connect, return rows as objects with attributes
    db = dataset.connect('postgresql://u:p@localhost:5432/db', row_type=stuf)
    # materialise the iterator so the rows can be reused below
    rows = list(db.query('SELECT country, COUNT(*) c FROM "user" GROUP BY country'))

    # print all the rows
    for row in rows:
        print(row['country'], row['c'])

    # get data into pandas, that's where the fun begins!
    rows_df = pandas.DataFrame.from_records(rows)
  41. sklearn-pandas

    import sklearn.pipeline
    import sklearn.preprocessing
    from sklearn import ensemble, feature_selection, grid_search
    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        (['age'], [sklearn.preprocessing.Imputer(),
                   sklearn.preprocessing.StandardScaler()]),
        ...])
    pipeline = sklearn.pipeline.Pipeline([
        ('featurise', mapper),
        ('feature_selection', feature_selection.SelectKBest(k=100)),
        ('random_forest', ensemble.RandomForestClassifier())])
    cv_params = dict(
        feature_selection__k=[100, 200],
        random_forest__n_estimators=[50, 100, 200])
    cv = grid_search.GridSearchCV(pipeline, param_grid=cv_params)
    cv.fit(X, y)  # X, y: your training features and labels
    best_model = cv.best_estimator_
  42. None
  43. Analysis is a DAG

  44. get_data.sh && process_data.sh && publish_data.sh

  45. Luigi/Airflow Describe the pipeline
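
    A minimal Luigi sketch of describing a pipeline as a DAG; the task names and file paths are invented for illustration:

    import luigi

    class GetData(luigi.Task):
        def output(self):
            return luigi.LocalTarget('data/raw/events.csv')

        def run(self):
            with self.output().open('w') as f:
                f.write('...fetched data...\n')

    class ProcessData(luigi.Task):
        # declaring the dependency is what builds the DAG;
        # Luigi only re-runs tasks whose outputs are missing
        def requires(self):
            return GetData()

        def output(self):
            return luigi.LocalTarget('data/processed/events.csv')

        def run(self):
            with self.input().open() as raw, self.output().open('w') as out:
                out.write(raw.read())

    if __name__ == '__main__':
        luigi.run()  # python pipeline.py ProcessData --local-scheduler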

  46. Don’t guard, empower instead http://jhtart.deviantart.com/art/Castle-171525835

  47. The goal is to turn data into information, and information

    into insight. – Carly Fiorina, former HP CEO
  48. What next?

  49. None
  50. Building a data science portfolio: Storytelling with data Awesome public

    datasets
  51. Meetups Meet Speak Give back

  52. Read Keep up DataTau DataScienceWeekly O’Reilly Data Newsletter KDnuggets

  53. Kaggle Learn Study Compete

  54. None
  55. Thanks! Questions? Greg Goltsov Data Hacker gregory.goltsov.info @gregoltsov