Shipping data science products is hard! Learn from 10 years of my experience how to visualise, extract, annotate and machine-learn on data sets, through to successful strategies for deployment and reproducibility.
Explain existing data (visualisation!) • Automate repetitive/slow processes (higher accuracy, more repeatable) • Augment data to make new data (e.g. for search engines and ML) • Predict the future (e.g. replace human intuition or use subtler relationships)
Extracting data from files • Copy/pasting PDF/PNG data is laborious • How can we scale it? • textract - a unified interface • Apache's Tika is (maybe) better • Specialised tools e.g. Sovren • Think in pipelines of transforms (see the sketch below) • This might take months!
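A minimal sketch of the textract route (the file path is a placeholder, and a real pipeline would chain many more transforms):

```python
# Minimal sketch: textract gives one interface over many extractors
# (pdftotext for PDFs, tesseract OCR for images, etc.).
# "report.pdf" is a placeholder path.
import textract

raw = textract.process("report.pdf")  # returns bytes
text = raw.decode("utf-8")
print(text[:500])  # eyeball the first extracted characters
```

Each extracted document should then feed the next transform in the pipeline rather than being edited in place.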
Thoughts from you? • No obvious tools to show me: • these examples were well-fitted • these were always wrongly fitted • these were always uncertain • No data diagnostics to validate inputs (e.g. for Logistic Regression) • No visualisers for most of the models • Could your hard-won knowledge become new debug tools? (PLEASE! - a sketch of the idea follows)
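As a hedged sketch of one such debug tool (not an existing library - the data, names and threshold below are illustrative assumptions): rank examples by how confidently a classifier gets them wrong, and by how uncertain it stays.

```python
# Sketch: surface "always wrongly-fitted" and "always uncertain"
# examples for a binary classifier. Synthetic data and the 0.5
# decision threshold are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]      # P(class == 1) per example
pred = (proba >= 0.5).astype(int)
confidence = np.abs(proba - 0.5) * 2.0  # 0 = uncertain, 1 = certain
wrong = pred != y

# Most confident mistakes first: candidates for label/feature bugs.
print("confidently wrong:", np.argsort(-(confidence * wrong))[:10])
# Least confident overall: the model never makes its mind up here.
print("most uncertain:  ", np.argsort(confidence)[:10])
```

Run on a held-out set, these two orderings are a cheap first pass at the well-fitted / wrongly-fitted / uncertain split the slide asks for.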
Keep It Simple (Stupid!) • We're (probably) not publishing the best result • Debuggability is key - a 3am Sunday CTO beeper alert is no time for complexity • Watson-Watt's “cult of the imperfect” • Dumb models + clean data beat other combinations
Your data is missing, it is poor and it lies • Missing data kills projects! • Log everything! • Build data-quality tools & reports (see the sketch below) • Note! More data -> desynchronisation • DESYNCH IS BAAAAAAD! • R&D != Engineering • Discovery-based • Iterative • Success and failure are equally useful
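A minimal data-quality report sketch with pandas (the CSV path and the created_at column are placeholder assumptions):

```python
# Sketch: the two questions a data-quality report should answer
# first -- what is missing, and are the feeds in sync?
import pandas as pd

df = pd.read_csv("input.csv")  # placeholder path

# Fraction of missing values per column, worst first.
print(df.isnull().mean().sort_values(ascending=False))

# Row counts per day expose desynchronised or stalled feeds.
if "created_at" in df.columns:  # hypothetical timestamp column
    dates = pd.to_datetime(df["created_at"], errors="coerce")
    print(dates.dt.date.value_counts().sort_index())
```

Logging these numbers on every run turns "the data lies" from a surprise into a diffable report.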
Python modules (setup.py) • python setup.py develop # symlink (see the sketch below) • Unit tests + coverage • Use a config system (e.g. my github.com/ianozsvald/python_template_with_config) • Use git with a deployment branch • Post-commit git hooks for unit testing • Keep Separation of Concerns! • The “12 Factor App” has useful ideas
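A minimal setup.py along those lines (the package name and version are placeholders):

```python
# setup.py -- minimal sketch. After `python setup.py develop`,
# setuptools symlinks the package into site-packages, so code edits
# take effect without reinstalling.
from setuptools import setup, find_packages

setup(
    name="myproject",         # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[],      # pin real dependencies here
)
```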
MySQL UTF8 is 3-byte by default #sigh • JavaScript months are 0-based (not 1-based) • Never compromise on datetimes (ISO 8601 - example below) • iOS NSDate's epoch is 2001 • Windows CP1252 text (strongly prefer UTF8) • MongoDB: pass no_cursor_timeout=True on long-running cursors • GitHub's 100MB file limit (new Large File Storage helps) • Never throw data away! Never overwrite original data! Always transform it (e.g. with Luigi) • Data duplication bites you in the end...
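On the datetime point, a small sketch of the safe pattern: keep timestamps timezone-aware in UTC and serialise them as ISO 8601.

```python
# Sketch: timezone-aware UTC timestamps, ISO 8601 on the wire.
from datetime import datetime, timezone

now = datetime.now(timezone.utc)  # aware, never naive
stamp = now.isoformat()           # e.g. "2016-05-06T14:03:02.123456+00:00"
print(stamp)

# ISO 8601 round-trips cleanly (fromisoformat needs Python 3.7+),
# and same-offset strings sort chronologically as plain text.
assert datetime.fromisoformat(stamp) == now
```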
Tell me your dirty data stories (I want to automate some of this) • Keep It Simple • Come talk about your projects at our PyDataLondon monthly meetup... • It isn't what you know but who you know • (I do coaching on this stuff at ModelInsight.io)