
Ship Data Science Products! (PyConUK 2015)

ianozsvald
September 20, 2015

Building and shipping working Data Science and scientific products is hard. Learn from 10 years of Ian's experience at ModelInsight.io and find efficient routes through bad data, complicated data workflows and weakly designed code to successfully deployed projects.

http://www.pyconuk.org/talks/ship-data-science-products/



Transcript

  1. Ship Data Science Products: Turning raw data into valuable services

    PyConUK 2015 Conference Ian Ozsvald @IanOzsvald ModelInsight.io
  2. Who Am I? ([email protected] @IanOzsvald PyConUK Conf Sept 2015)

    • “Industrial Data Science” for 15 years
    • Consultant
    • O'Reilly Author
    • Teacher at PyCons
  3. Which projects succeed?

    • Explain existing data (visualisation!)
    • Automate repetitive/slow processes (higher accuracy, more repeatable)
    • Augment data to make new data (e.g. for search engines and ML)
    • Predict the future (e.g. replace human intuition or use subtler relationships)
  4. Visualising data

    • Most data isn't interesting...
    • Requires human curation + detective skills to get the good stuff
    • Probably needs an engineer/researcher + business person
  5. Extracting data from binary files

    • Copy/pasting PDF/PNG data is laborious
    • How can we scale it?
    • textract - unified interface
    • Apache's Tika (maybe) better
    • Specialised tools e.g. Sovren
    • This might take months!
  6. Augmenting data

    • Identifying people, places, brands, sentiment
    • “i love my apple phone”
    • Context-sensitive (e.g. movies vs products)
    • Accurately count mentions & sentiment
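
The “i love my apple phone” example on the Augmenting data slide hinges on context: the same token can be a brand or a fruit. The toy sketch below shows the idea with hand-written keyword sets. The word lists and the function name `classify_apple_mention` are hypothetical, chosen only for illustration; a production system would normally learn such context from labelled data rather than hard-code it.

```python
import re

# Hypothetical context vocabularies; a real system would learn these from data.
BRAND_CONTEXT = {"phone", "iphone", "mac", "macbook", "ipad", "laptop"}
FRUIT_CONTEXT = {"pie", "juice", "tree", "eat", "ate", "orchard"}
POSITIVE = {"love", "great", "like"}
NEGATIVE = {"hate", "awful", "broken"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify_apple_mention(text):
    """Guess whether 'apple' is a brand or a fruit, plus a crude sentiment score.

    Returns None when the text does not mention 'apple' at all.
    """
    tokens = tokenize(text)
    if "apple" not in tokens:
        return None
    words = set(tokens)
    brand_score = len(words & BRAND_CONTEXT)
    fruit_score = len(words & FRUIT_CONTEXT)
    if brand_score > fruit_score:
        sense = "brand"
    elif fruit_score > brand_score:
        sense = "fruit"
    else:
        sense = "unknown"
    # Sentiment is positive-word count minus negative-word count.
    sentiment = len(words & POSITIVE) - len(words & NEGATIVE)
    return {"sense": sense, "sentiment": sentiment}


if __name__ == "__main__":
    print(classify_apple_mention("i love my apple phone"))
    print(classify_apple_mention("I ate an apple pie"))
```

Even this crude version makes the counting problem visible: a mention is only counted for the brand once its context has been resolved.
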
  7. Medical data (anti-allergy)

    ML prediction using food+alcohol, pollen, pollution, location, cats, ...
  8. Machine Learning

    • PyMC (Markov Chain Monte Carlo)
    Please cite these projects! (it helps their funding)
  9. Delivery: Keep It Simple (Stupid!)

    • We're (probably) not publishing the best result
    • Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
    • “cult of the imperfect” (Watson-Watt)
    • Dumb models + clean data beat other combinations
  10. Don't Kill It!

    • Your data is missing, it is poor and it lies
    • Missing data kills projects!
    • Log everything!
    • Make data quality tools & reports
    • Note! More data -> desynchronisation
    • DESYNCH IS BAAAAAAD!
    • R&D != Engineering
    • Discovery-based
    • Iterative
    • Success and failure equally useful
    • engarde
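
The advice to make data quality tools and reports can start very small. The sketch below, assuming a CSV input with a header row, counts missing values per column. The function name `missing_value_report`, the sample columns and the set of tokens treated as missing are illustrative assumptions, not something taken from the talk.

```python
import csv
import io
from collections import Counter

# Tokens treated as "missing" (an assumption; adjust per data source).
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def missing_value_report(csv_file):
    """Return (row_count, {column: missing_count}) for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    missing = Counter({name: 0 for name in fieldnames})
    row_count = 0
    for row in reader:
        row_count += 1
        for name in fieldnames:
            value = row.get(name)
            # Short rows yield None for absent trailing fields.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                missing[name] += 1
    return row_count, dict(missing)


if __name__ == "__main__":
    sample = io.StringIO(
        "user_id,pollen,location\n"
        "1,high,London\n"
        "2,,Brighton\n"
        "3,N/A,\n"
    )
    rows, report = missing_value_report(sample)
    for column, count in report.items():
        print(f"{column}: {count}/{rows} missing")
```

Running a report like this on every new data drop, and logging the result, gives an early warning when a source starts losing fields.
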
  11. Internal deployment

    • CSVs/Reports
    • Database updates
    • IPython Notebook (not secure though!)
    • Bokeh
  12. Deploying live systems

    • Spyre (locked-down)
    • Microservices
    • Docker + Amazon ECS
    • Flask+Swagger
  13. Python Deployment

    • Make Python modules (setup.py)
    • python setup.py develop # symlink
    • Unit tests + coverage
    • Use a config system (e.g. my github.com/ianozsvald/python_template_with_config)
    • Use git with a deployment branch
    • Post-commit git hooks for unit testing
    • Keep Separation of Concerns!
    • “12 Factor App” useful ideas
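
One way to read "use a config system" alongside the "12 Factor App" ideas is to select settings from environment variables, with per-environment defaults kept in code. The sketch below is a minimal illustration under that assumption. It is not the approach of the linked template repository, and the names `APP_ENV`, `DATABASE_URL`, `Config` and `load_config`, along with the default values, are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    env: str
    database_url: str
    debug: bool


# Per-environment defaults (hypothetical values for illustration).
DEFAULTS = {
    "dev": {"database_url": "sqlite:///dev.db", "debug": True},
    "prod": {"database_url": "sqlite:///prod.db", "debug": False},
}


def load_config(environ=None):
    """Build a Config from environment variables, falling back to defaults."""
    environ = os.environ if environ is None else environ
    env = environ.get("APP_ENV", "dev")
    if env not in DEFAULTS:
        raise ValueError(f"Unknown APP_ENV: {env!r}")
    defaults = DEFAULTS[env]
    return Config(
        env=env,
        # An explicit DATABASE_URL overrides the per-environment default.
        database_url=environ.get("DATABASE_URL", defaults["database_url"]),
        debug=defaults["debug"],
    )


if __name__ == "__main__":
    print(load_config())
```

Passing a plain dict as `environ` keeps the loader easy to unit test without touching the real process environment.
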
  14. Some common gotchas

    • MySQL UTF8 is 3 byte by default #sigh
    • JavaScript months are 0-based (not 1)
    • Never compromise on datetimes (ISO 8601)
    • iOS NSDate's epoch is 2001
    • Windows CP1252 text (strongly prefer UTF8)
    • MongoDB no_cursor_timeout=True
    • GitHub's 100MB file limit (new Large File Storage)
    • Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
    • Data duplication bites you in the end...
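
Several of these gotchas can be guarded against with a few small helpers at the edges of a pipeline. The sketch below uses only the Python standard library; the function names are hypothetical and the checks are deliberately minimal.

```python
from datetime import datetime, timedelta, timezone

# NSDate's reference date is 2001-01-01T00:00:00Z, not the Unix epoch of 1970.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_datetime(seconds_since_2001):
    """Convert an NSDate time interval (seconds since 2001-01-01 UTC)."""
    return NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)


def to_iso8601(dt):
    """Serialise a timezone-aware datetime as an ISO 8601 string in UTC."""
    if dt.tzinfo is None:
        raise ValueError("Refusing to serialise a naive datetime")
    return dt.astimezone(timezone.utc).isoformat()


def needs_four_byte_utf8(text):
    """True if any character lies outside the Basic Multilingual Plane.

    Such characters need 4 bytes in UTF-8, so they do not fit a
    3-byte MySQL `utf8` column.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


def js_month_to_python(js_month):
    """JavaScript Date months run 0-11; Python's datetime uses 1-12."""
    return js_month + 1


def decode_cp1252(raw_bytes):
    """Decode Windows CP1252 bytes to str so they can be re-encoded as UTF-8."""
    return raw_bytes.decode("cp1252")


if __name__ == "__main__":
    print(to_iso8601(nsdate_to_datetime(0)))
    print(needs_four_byte_utf8("caf\u00e9"), needs_four_byte_utf8("\U0001F600"))
    print(js_month_to_python(0))
    print(decode_cp1252(b"\x93quoted\x94"))
```

Doing these conversions once, at ingestion, keeps the rest of the code working with timezone-aware datetimes and Unicode strings only.
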
  15. (Perhaps) Avoid Big Data

    • Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
    • 244GB RAM EC2+many Xeons $2.80/hr
  16. Closing

    • Tell me your dirty data stories (I'm starting to automate some of this)
    • Keep It Simple
    • Come talk on your projects at our PyDataLondon monthly meetup...
    • It isn't what you know but who you know
    • (I do coaching and consulting on this stuff in ModelInsight.io)