Ship It! [PyDataLondon 2015]

Shipping data science products is hard! Learn from 10 years of my experience for how to visualise, extract, annotate and machine learn on data sets through to successful strategies for deployment and reproducibility.

ianozsvald

June 21, 2015

Transcript

  1. Ship It! Turning raw data into valuable services

    PyDataLondon 2015 Conference • Ian Ozsvald @IanOzsvald ModelInsight.io
  2. Who Am I?

    [email protected] @IanOzsvald PyDataLondon Conf June 2015
    • “Industrial Data Science” for 15 years
    • Consultant
    • O'Reilly Author
    • Teacher at PyCons
  3. Which projects succeed?

    • Explain existing data (visualisation!)
    • Automate repetitive/slow processes (higher accuracy, more repeatable)
    • Augment data to make new data (e.g. for search engines and ML)
    • Predict the future (e.g. replace human intuition or use subtler relationships)
  4. Visualising data

    • Most data isn't interesting...
    • Requires human curation + detective skills to get the good stuff
    • Probably needs an engineer/researcher + business person
  5. Extracting data from binary files

    • Copy/pasting PDF/PNG data is laborious
    • How can we scale it?
    • textract - unified interface
    • Apache's Tika (maybe) better
    • Specialised tools e.g. Sovren
    • Think on pipelines of transforms
    • This might take months!
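
The "pipelines of transforms" idea on slide 5 can be sketched as a list of small text-cleaning functions applied in order. This is only an illustration: the extraction step itself (textract, Tika or a specialised tool) is left out, the raw string stands in for whatever that step returns, and the function names are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

# A transform takes text and returns cleaned text (hypothetical convention).
Transform = Callable[[str], str]


def strip_control_chars(text: str) -> str:
    """Drop non-printable characters, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def normalise_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def run_pipeline(raw: str, steps: Iterable[Transform]) -> str:
    """Apply each transform in order, feeding each result to the next step."""
    return reduce(lambda acc, step: step(acc), steps, raw)


# The stages are plain functions, so each one can be unit tested on its own.
PIPELINE = [strip_control_chars, normalise_whitespace, str.lower]

# Stand-in for text pulled out of a PDF; note the form feed and ragged spacing.
raw_text = "Invoice\x0c  No.\t 42 \n TOTAL: 10.00"
cleaned = run_pipeline(raw_text, PIPELINE)
```

Keeping every stage as a separate function means a bad extraction can be traced to one step rather than to the pipeline as a whole.
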
  6. Augmenting data

    • Identifying people, places, brands, sentiment
    • “i love my apple phone”
    • Context-sensitive (e.g. movies vs products)
    • Accurately count mentions & sentiment
  7. Medical data (anti-allergy)

    ML prediction using food+alcohol, pollen, pollution, location, cats, ...
  8. Machine Learning

    • PyMC (Markov Chain Monte Carlo)
    Please cite these projects! (it helps their funding)
  9. Debugging Machine Learning?

    • Thoughts from you?
    • No obvious tools to show me:
      • these examples were well-fitted
      • these always wrongly-fitted
      • these always uncertain
    • No data-diagnostics to validate inputs (e.g. for Logistic Regression)
    • No visualisers for most of the models
    • Your hard-won knowledge->new debug tools? (PLEASE!)
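
One rough way to get the three views asked for on slide 9 is to bucket examples by the predicted probability of a binary classifier. This is a sketch under assumptions, not a tool from the talk: the 0.4 to 0.6 "uncertain" band, the tuple layout and the function name are all hypothetical, and it looks at a single run, whereas "always" in the slide suggests repeating this across folds or retrainings.

```python
from typing import Dict, Iterable, List, Tuple

# (example id, true label 0/1, predicted probability of class 1)
Example = Tuple[str, int, float]


def bucket_predictions(
    examples: Iterable[Example], low: float = 0.4, high: float = 0.6
) -> Dict[str, List[str]]:
    """Split examples into well-fitted, wrongly-fitted and uncertain groups.

    The thresholds are assumptions; tune them for the model in question.
    """
    buckets: Dict[str, List[str]] = {
        "well_fitted": [],
        "wrongly_fitted": [],
        "uncertain": [],
    }
    for example_id, true_label, prob in examples:
        if low <= prob <= high:
            # The model is sitting on the fence for this example.
            buckets["uncertain"].append(example_id)
        elif (prob > high) == bool(true_label):
            # Confident, and the confident class matches the truth.
            buckets["well_fitted"].append(example_id)
        else:
            # Confident, but in the wrong direction.
            buckets["wrongly_fitted"].append(example_id)
    return buckets


predictions = [("a", 1, 0.95), ("b", 0, 0.90), ("c", 1, 0.50), ("d", 0, 0.05)]
buckets = bucket_predictions(predictions)
```

Reading the raw rows behind the `wrongly_fitted` ids is often a quicker route to a data problem than staring at an aggregate score.
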
  10. Debugging Machine Learning?

    “How “good” is your model, and how can you make it better?” Chih-Chun Chen, Elena Chatzimichali at PyDataLondon 2015
  11. Delivery: Keep It Simple (Stupid!)

    • We're (probably) not publishing the best result
    • Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
    • “cult of the imperfect” Watson-Watt
    • Dumb models + clean data beat other combinations
  12. Don't Kill It!

    • Your data is missing, it is poor and it lies
    • Missing data kills projects!
    • Log everything!
    • Make data quality tools & reports
    • Note! More data->desynchronisation
    • DESYNCH IS BAAAAAAD!
    • R&D != Engineering
      • Discovery-based
      • Iterative
      • Success and failure equally useful
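
A data quality report of the kind slide 12 recommends can start very small. The sketch below counts missing values per column in a CSV using only the standard library. The set of strings treated as "missing", the function name and the sample columns are assumptions for illustration.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO, Tuple

# Strings treated as "missing" (an assumption; adjust per data source).
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def missing_value_report(csv_file: TextIO) -> Dict[str, Tuple[int, int]]:
    """Return {column: (missing_count, total_rows)} for a CSV with a header row."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    missing: Counter = Counter()
    total_rows = 0
    for row in reader:
        total_rows += 1
        for column in fieldnames:
            value = row.get(column)
            # Short rows yield None; blank or marker strings also count as missing.
            if value is None or value.strip().lower() in MISSING_MARKERS:
                missing[column] += 1
    return {column: (missing[column], total_rows) for column in fieldnames}


# Hypothetical sample data, held in memory instead of read from disk.
sample = io.StringIO("id,pollen,location\n1,12,London\n2,,London\n3,NA,\n")
report = missing_value_report(sample)
for column, (n_missing, n_rows) in report.items():
    print(f"{column}: {n_missing}/{n_rows} missing")
```

Running a report like this on every new data delivery, and logging the result, makes gaps visible before they reach a model.
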
  13. Internal deployment

    • Scripts to drive report
    • CSVs/Reports
    • Database updates
    • IPython Notebook (not secure though!)
    • Bokeh
  14. Deploying live systems

    • Spyre (locked-down)
    • Microservices
    • Flask is my go-to tool
    • Swagger docs
    • (git pull / fabric / provisioned machines)
    • Docker + Amazon ECS
  15. Python Deployment

    • Make Python modules (setup.py)
    • python setup.py develop # symlink
    • Unit tests + coverage
    • Use a config system (e.g. my github.com/ianozsvald/python_template_with_config)
    • Use git with a deployment branch
    • Post-commit git hooks for unit testing
    • Keep Separation of Concerns!
    • “12 Factor App” useful ideas
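
As an illustration of "use a config system" combined with the 12 Factor idea of reading configuration from the environment, here is a minimal sketch. It is not the contents of the template repository named on the slide. The variable names `APP_ENV` and `DATABASE_URL`, the `Config` fields and the default URLs are all hypothetical.

```python
import os
from dataclasses import dataclass, replace
from typing import Mapping


@dataclass(frozen=True)
class Config:
    name: str
    database_url: str
    debug: bool


# Per-environment defaults (hypothetical values for illustration).
_DEFAULTS = {
    "dev": Config(name="dev", database_url="sqlite:///dev.db", debug=True),
    "production": Config(
        name="production", database_url="sqlite:///prod.db", debug=False
    ),
}


def load_config(environ: Mapping[str, str] = os.environ) -> Config:
    """Pick a base config from APP_ENV, then let DATABASE_URL override it."""
    env_name = environ.get("APP_ENV", "dev")
    if env_name not in _DEFAULTS:
        raise ValueError(
            f"Unknown APP_ENV {env_name!r}; expected one of {sorted(_DEFAULTS)}"
        )
    base = _DEFAULTS[env_name]
    return replace(
        base, database_url=environ.get("DATABASE_URL", base.database_url)
    )


# Passing a plain dict keeps the loader easy to unit test.
config = load_config({"APP_ENV": "production"})
```

Because the loader accepts any mapping, tests can pass a dictionary while deployed code falls back to the real process environment.
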
  16. Some common gotchas

    • MySQL UTF8 is 3 byte by default #sigh
    • JavaScript months are 0-based (not 1)
    • Never compromise on datetimes (ISO 8601)
    • iOS NSDate's epoch is 2001
    • Windows CP1252 text (strongly prefer UTF8)
    • MongoDB no_timeout_cursor=True
    • Github's 100MB file limits (new Large File Support)
    • Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
    • Data duplication bites you in the end...
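
Three of the gotchas on slide 16 are easy to show in a few lines of standard-library Python: the NSDate reference date of 1 January 2001 UTC, 0-based JavaScript months, and CP1252 text that needs re-encoding as UTF-8. The helper names below are hypothetical and the sketch assumes timestamps are in seconds.

```python
from datetime import datetime, timedelta, timezone

# NSDate counts seconds from 2001-01-01 00:00:00 UTC, not from 1970.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001: float) -> str:
    """Convert an NSDate-style timestamp to a timezone-aware ISO 8601 string."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def js_month_to_calendar_month(js_month: int) -> int:
    """Map a JavaScript month (0 = January) to a calendar month (1 = January)."""
    if not 0 <= js_month <= 11:
        raise ValueError(f"JavaScript months run 0..11, got {js_month}")
    return js_month + 1


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Decode Windows CP1252 bytes and re-encode them as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


one_day_after_epoch = nsdate_to_iso8601(86400)
december = js_month_to_calendar_month(11)
cafe_utf8 = cp1252_to_utf8(b"caf\xe9")  # 0xE9 is "é" in CP1252
```

Converting at the system boundary, and storing only ISO 8601 strings and UTF-8 text, keeps these conventions from leaking into the rest of a pipeline.
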
  17. (Perhaps) Avoid Big Data

    • Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
    • 244GB RAM EC2+many Xeons $2.80/hr
  18. Closing

    • Tell me your dirty data stories (I want to automate some of this)
    • Keep It Simple
    • Come talk on your projects at our PyDataLondon monthly meetup...
    • It isn't what you know but who you know
    • (I do coaching on this stuff in ModelInsight.io)