Shipping Data Science Products

ianozsvald
October 14, 2015

A pragmatic guide to shipping data science products, focused on (but not limited to) Python. Mostly it is a collection of lessons from 15 years of trying to figure out how to do these jobs efficiently and maintainably! This is the opening plenary talk at: https://budapestbi2015.sched.org/

Transcript

  1. Shipping Data Science Products! Turning raw data into valuable services

    BudapestBI Forum 2015 License: CC By Attribution Ian Ozsvald @IanOzsvald ModelInsight.io
  2. [email protected] @IanOzsvald BudapestBI Forum October 2015 Who Am I?

    • “Industrial Data Science” for 15 years
    • Data Product Builder
    • O'Reilly Author
    • Teacher at PyCons
  3. Who are you?

    • Type A(nalysis) or B(uilding)
    • Robert Chang - “Doing Data Science at Twitter”
  4. What frustrations do we share?

    • Lack of useful data
    • Biggest time sink - cleaning & transforming
    • Conservative management
    • How can we derisk projects?
    • Medium Data
    • luckily we have Wes in the room
  5. Which projects succeed?

    • Explain existing data (visualisation!)
    • Automate repetitive/slow processes (higher accuracy, more repeatable)
    • Augment data to make new data (e.g. for search engines and ML)
    • Predict the future (e.g. replace human intuition or use subtler relationships)
  6. Visualising data

    • Most data isn't interesting...
    • Requires human curation + detective skills to get the good stuff
    • Couple a researcher + a business person
  7. Medical data (anti-allergy)

    Predict using:
    • food
    • alcohol
    • pollen
    • pollution
    • location
    • cats
    • ...
  8. Extracting data from binary files

    • Copy/pasting PDF/PNG data is laborious
    • How can we scale it?
    • textract/Tika - unified interface
    • Specialised tools e.g. Sovren
    • This might take months!
  9. Augmenting data

    • Identifying people, places, brands, sentiment
    • “i love my apple phone”
    • Context-sensitive (e.g. movies vs products)
    • Build custom machine-learned tools
    • Augment job titles
    • Reconcile the same order in 2 tables
  10. Machine Learning

    • PyMC (Markov Chain Monte Carlo)

    Please cite these projects! (it helps their funding)
  11. Debugging Machine Learning?

    • Thoughts from you?
    • No obvious tools to show me:
        • these examples were well-fitted
        • these always wrongly-fitted
        • these always uncertain
    • No data-diagnostics to validate inputs (e.g. for Logistic Regression)
    • No visualisers for most of the models
    • Your hard-won knowledge->new debug tools? (PLEASE!)
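One rough way to get at the "well-fitted / always wrongly-fitted / always uncertain" split from the slide above is to refit a model several times (for example with repeated cross-validation) and bucket each example by how it was treated across runs. The sketch below is only an illustration under that assumption: `diagnose_examples`, the bucket names and the `uncertain_band` threshold are hypothetical, not taken from the talk or from any library.

```python
from collections import Counter


def diagnose_examples(true_labels, prob_runs, uncertain_band=0.15):
    """Bucket each example by how repeated model fits treated it.

    true_labels: list of 0/1 labels, one per example.
    prob_runs: list of runs; each run is a list of predicted P(label == 1),
        one per example (e.g. from repeated cross-validation).
    uncertain_band: hypothetical threshold; probabilities within this
        distance of 0.5 count as uncertain.
    """
    buckets = []
    for i, label in enumerate(true_labels):
        probs = [run[i] for run in prob_runs]
        if all(abs(p - 0.5) <= uncertain_band for p in probs):
            buckets.append("always uncertain")
        elif all((p > 0.5) == bool(label) for p in probs):
            buckets.append("well-fitted")
        elif all((p > 0.5) != bool(label) for p in probs):
            buckets.append("always wrong")
        else:
            buckets.append("mixed")
    return buckets


# Made-up probabilities from three hypothetical refits of a binary classifier.
labels = [1, 0, 1, 0]
runs = [
    [0.92, 0.10, 0.20, 0.55],
    [0.88, 0.05, 0.30, 0.45],
    [0.95, 0.20, 0.10, 0.52],
]
report = diagnose_examples(labels, runs)
print(report)
print(Counter(report))
```

The "always wrong" and "always uncertain" rows are the ones worth reading by hand, since they often point at mislabelled or dirty inputs rather than at the model.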
  12. Delivery: Keep It Simple (Stupid!)

    • We're (probably) not publishing the best result
    • Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
    • “cult of the imperfect” Watson-Watt
    • Dumb models + clean data beat other combinations
  13. Don't Kill It!

    • Your data is missing, it is poor and it lies
    • Missing data kills projects!
    • Log everything!
    • Make data quality tools & reports
    • More data->desynchronisation
    • R&D != Engineering
    • Discovery-based
    • Success and failure equally useful

    engarde
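A data quality report does not need to be elaborate to be useful. As a minimal sketch of the "make data quality tools & reports" point above, the following counts missing values per column in a CSV using only the standard library. The function name, the sample columns and the list of markers treated as "missing" are all assumptions made for illustration.

```python
import csv
import io


def missing_value_report(csv_text, missing_markers=("", "NA", "null")):
    """Return (row count, {column: number of missing values}) for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in reader.fieldnames:
            value = row[name]
            # Short rows yield None for the absent fields.
            if value is None or value.strip() in missing_markers:
                counts[name] += 1
    return n_rows, counts


# Hypothetical sample data with a blank amount, an "NA" amount and a blank country.
sample = "order_id,amount,country\n1,10.5,HU\n2,,UK\n3,NA,\n"
n_rows, counts = missing_value_report(sample)
for name, n_missing in counts.items():
    print(f"{name}: {n_missing}/{n_rows} missing")
```

Running a report like this on every new data delivery, and logging the result, makes it easier to spot when an upstream source quietly starts dropping a field.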
  14. Internal deployment

    • CSVs/Reports
    • Database updates
    • IPython Notebook (not secure though!)
  15. Deploying live systems

    • Spyre (locked-down)
    • Microservices
    • Flask is my go-to tool
    • Swagger docs
    • (git pull / fabric / provisioned machines)
    • Docker + Amazon ECS
  16. Python Deployment

    • Make Python modules (setup.py)
    • python setup.py develop # symlink
    • Unit tests + coverage
    • Use a config system (e.g. github.com/ianozsvald/python_template_with_config)
    • Keep Separation of Concerns!
    • “12 Factor App” useful ideas
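To make the "use a config system" and "12 Factor App" points above concrete, here is a small sketch of environment-driven configuration. It is not the approach used in the linked template repository (that is not shown in this deck); the `Config` fields and the `APP_ENV` and `DATABASE_URL` variable names are hypothetical choices for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    database_url: str
    debug: bool


# Hypothetical named configurations; real projects would list their own.
CONFIGS = {
    "dev": Config("dev", "sqlite:///dev.db", True),
    "prod": Config("prod", "sqlite:///prod.db", False),
}


def get_config(environ=None):
    """Pick a named config from APP_ENV, letting DATABASE_URL override it."""
    environ = os.environ if environ is None else environ
    env_name = environ.get("APP_ENV", "dev")
    if env_name not in CONFIGS:
        raise ValueError(f"Unknown APP_ENV: {env_name!r}")
    base = CONFIGS[env_name]
    url = environ.get("DATABASE_URL", base.database_url)
    return Config(base.name, url, base.debug)


# Passing a dict instead of os.environ keeps the function easy to unit test.
print(get_config({"APP_ENV": "prod"}))
```

Keeping every environment-specific value behind one function like this means the rest of the code never needs to know whether it is running on a laptop or in production.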
  17. Some common gotchas

    • MySQL UTF8 is 3 byte by default #sigh
    • JavaScript months are 0-based (not 1)
    • Never compromise on datetimes (ISO 8601)
    • iOS NSDate's epoch is 2001
    • Windows CP1252 text (strongly prefer UTF8)
    • MongoDB no_cursor_timeout=True
    • GitHub's 100MB file limits (new Large File Storage)
    • Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
    • Data duplication bites you in the end...
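Several of the gotchas above are easy to show in a few lines of standard-library Python. The helper names below are hypothetical and the sample values are made up; treat this as a sketch of the pitfalls, not a complete conversion library.

```python
from datetime import datetime, timedelta, timezone

# iOS NSDate counts seconds from 2001-01-01 00:00:00 UTC, not the Unix epoch.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001):
    """Convert an NSDate-style offset to a timezone-aware ISO 8601 string."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def js_month_to_python(js_month):
    """JavaScript months run 0-11; Python's datetime expects 1-12."""
    return js_month + 1


def needs_utf8mb4(text):
    """True if any character needs 4 bytes in UTF-8.

    MySQL's legacy 'utf8' charset stores at most 3 bytes per character,
    so such text needs the 'utf8mb4' charset instead.
    """
    return any(len(ch.encode("utf-8")) > 3 for ch in text)


print(nsdate_to_iso8601(0))  # 2001-01-01T00:00:00+00:00

# Windows CP1252 bytes: decode explicitly, then store everything as UTF-8.
raw = b"caf\xe9 \x80 5"
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")
print(text)

print(needs_utf8mb4("café"), needs_utf8mb4("\U0001F600"))
```

Decoding at the boundary and carrying timezone-aware ISO 8601 strings everywhere else keeps these problems contained in one place.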
  18. (Perhaps) Avoid Big Data

    • Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
    • 244GB RAM EC2 + many Xeons: $2.80/hr
  19. “Data Science Delivered”

    • New mini project / pamphlet
    • Includes dirty data strategies, ways to debug ML, thoughts on managing projects - 15 yrs experience (please critique and file bugs!)
    • https://github.com/ianozsvald/data_science_delivered
    • Please give me your feedback
  20. Closing

    • Tell me your dirty data stories, perhaps in a Ruin Pub? (I am automating some of this)
    • Takehome - Keep it clean, keep it simple
    • Come and talk about your projects at our PyDataLondon monthly meetup or start your own!