Shipping Data Science Products

ianozsvald
October 14, 2015

A pragmatic guide to shipping data science products, focused on (but not limited to) Python. Mostly it is a collection of lessons from 15 years of trying to figure out how to do these jobs efficiently and maintainably! This is the opening plenary talk at: https://budapestbi2015.sched.org/

Transcript

  1. Shipping Data Science Products! Turning raw data into valuable services

    BudapestBI Forum 2015 License: CC By Attribution Ian Ozsvald @IanOzsvald ModelInsight.io
  2. [email protected] @IanOzsvald BudapestBI Forum October 2015 Who Am I?

    • “Industrial Data Science” for 15 years • Data Product Builder • O'Reilly Author • Teacher at PyCons
  3. Who are you?

    • Type A(nalysis) or B(uilding) • Robert Chang - “Doing Data Science at Twitter”
  4. What frustrations do we share?

    • Lack of useful data • Biggest time sink - cleaning & transforming • Conservative management • How can we derisk projects? • Medium Data • luckily we have Wes in the room
  5. Which projects succeed?

    • Explain existing data (visualisation!) • Automate repetitive/slow processes (higher accuracy, more repeatable) • Augment data to make new data (e.g. for search engines and ML) • Predict the future (e.g. replace human intuition or use subtler relationships)
  6. Visualising data

    • Most data isn't interesting... • Requires human curation + detective skills to get the good stuff • Couple a researcher + a business person
  7. Medical data (anti-allergy)

    Predict using: • food • alcohol • pollen • pollution • location • cats • ...
  8. Extracting data from binary files

    • Copy/pasting PDF/PNG data is laborious • How can we scale it? • textract/Tika - unified interface • Specialised tools e.g. Sovren • This might take months!
  9. Augmenting data

    • Identifying people, places, brands, sentiment • “i love my apple phone” • Context-sensitive (e.g. movies vs products) • Build custom machine-learned tools • Augment job titles • Reconcile the same order in 2 tables
  10. Machine Learning

    • PyMC (Markov Chain Monte Carlo) Please cite these projects! (it helps their funding)
  11. Debugging Machine Learning?

    • Thoughts from you? • No obvious tools to show me: • these examples were well-fitted • these always wrongly-fitted • these always uncertain • No data-diagnostics to validate inputs (e.g. for Logistic Regression) • No visualisers for most of the models • Your hard-won knowledge->new debug tools? (PLEASE!)
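
One way to read the "well-fitted / always wrongly-fitted / always uncertain" split on this slide is to refit a model several times (for example on resampled data) and look at how consistently each example is predicted. The sketch below is a hypothetical illustration in plain Python, not a tool from the talk: `bucket_examples`, the bucket names, and the toy predictions are all assumptions, and "uncertain" is interpreted here as "right in some runs and wrong in others".

```python
from collections import Counter


def bucket_examples(true_labels, predictions_per_run):
    """Label each example by how consistently it was predicted across runs.

    predictions_per_run is a list of prediction lists, one per run,
    each aligned with true_labels (hypothetical input format).
    """
    if not predictions_per_run:
        raise ValueError("need at least one run of predictions")
    n_runs = len(predictions_per_run)
    buckets = {}
    for i, truth in enumerate(true_labels):
        # Count the runs in which this example was predicted correctly.
        hits = sum(1 for run in predictions_per_run if run[i] == truth)
        if hits == n_runs:
            buckets[i] = "always_right"
        elif hits == 0:
            buckets[i] = "always_wrong"
        else:
            buckets[i] = "uncertain"
    return buckets


# Toy data: 4 examples, 3 runs (made up purely for illustration).
true_labels = [1, 0, 1, 0]
predictions_per_run = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
buckets = bucket_examples(true_labels, predictions_per_run)
summary = Counter(buckets.values())
```

Examples that land in the always-wrong bucket are usually the first ones worth inspecting by hand, since they may point at labelling errors or dirty inputs rather than a weak model.
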
  12. Delivery: Keep It Simple (Stupid!)

    • We're (probably) not publishing the best result • Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity • “cult of the imperfect” Watson-Watt • Dumb models + clean data beat other combinations
  13. Don't Kill It!

    • Your data is missing, it is poor and it lies • Missing data kills projects! • Log everything! • Make data quality tools & reports • More data->desynchronisation • R&D != Engineering • Discovery-based • Success and failure equally useful engarde
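
A "data quality report" can start very small. The following is a minimal sketch, assuming CSV input and a made-up set of missing-value markers; `missing_value_report`, `MISSING_MARKERS`, and the sample rows are hypothetical names and data used only for illustration.

```python
import csv
import io

# Assumption: these strings count as "missing"; adjust for your own data.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def missing_value_report(csv_text):
    """Return (row_count, {column: missing_count}) for CSV text with a header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in counts:
            value = row.get(name)
            # Short rows yield None; otherwise compare against the markers.
            if value is None or value.strip().lower() in MISSING_MARKERS:
                counts[name] += 1
    return n_rows, counts


# Made-up sample: one empty amount, one "NA" amount, one empty country.
sample = "order_id,amount,country\n1,9.99,HU\n2,,UK\n3,NA,\n"
n_rows, counts = missing_value_report(sample)
```

Running a report like this on every new data delivery, and logging the counts, makes it easier to spot when a source quietly starts dropping a field.
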
  14. Internal deployment

    • CSVs/Reports • Database updates • IPython Notebook (not secure though!)
  15. Deploying live systems

    • Spyre (locked-down) • Microservices • Flask is my go-to tool • Swagger docs • (git pull / fabric / provisioned machines) • Docker + Amazon ECS
  16. Python Deployment

    • Make Python modules (setup.py) • python setup.py develop # symlink • Unit tests + coverage • Use a config system (e.g. github.com/ianozsvald/python_template_with_config) • Keep Separation of Concerns! • “12 Factor App” useful ideas
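
As a rough illustration of the "use a config system" and "12 Factor App" points, configuration can be read from environment variables with development defaults. This is a minimal sketch and is not taken from the linked python_template_with_config repository; the `Config` fields and the `APP_DATABASE_URL` / `APP_DEBUG` variable names are assumptions made up for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    database_url: str
    debug: bool


def load_config(environ=None):
    """Build a Config from environment variables (hypothetical names)."""
    # Passing a plain dict instead of os.environ keeps this easy to unit test.
    environ = os.environ if environ is None else environ
    return Config(
        database_url=environ.get("APP_DATABASE_URL", "sqlite:///dev.db"),
        debug=environ.get("APP_DEBUG", "0") == "1",
    )


# Development: nothing set, so the defaults apply.
dev = load_config({})

# Production-like: values supplied by the environment.
prod = load_config(
    {"APP_DATABASE_URL": "postgresql://db.example/prod", "APP_DEBUG": "0"}
)
```

Keeping all settings in one immutable object means the rest of the code never reads the environment directly, which helps with the separation of concerns mentioned on the slide.
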
  17. Some common gotchas

    • MySQL UTF8 is 3 byte by default #sigh • JavaScript months are 0-based (not 1) • Never compromise on datetimes (ISO 8601) • iOS NSDate's epoch is 2001 • Windows CP1252 text (strongly prefer UTF8) • MongoDB no_timeout_cursor=True • Github's 100MB file limits (new Large File Support) • Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi) • Data duplication bites you in the end...
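
Three of these gotchas (the NSDate epoch, 0-based JavaScript months, and CP1252 text) can be handled with small normalising helpers at the point where data enters a pipeline. The sketch below uses only the Python standard library; the function names are hypothetical and it assumes NSDate values arrive as seconds since 2001-01-01 UTC.

```python
from datetime import datetime, timedelta, timezone

# NSDate counts seconds from 2001-01-01 00:00:00 UTC, not from 1970.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001):
    """Convert an NSDate-style offset to an ISO 8601 string in UTC."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def js_month_to_calendar(js_month):
    """Convert a 0-based JavaScript month (0 = January) to 1-12."""
    if not 0 <= js_month <= 11:
        raise ValueError(f"JavaScript month out of range: {js_month}")
    return js_month + 1


def cp1252_to_utf8(raw):
    """Re-encode Windows CP1252 bytes as UTF-8 bytes."""
    return raw.decode("cp1252").encode("utf-8")


iso = nsdate_to_iso8601(86400)           # one day after the NSDate epoch
month = js_month_to_calendar(0)          # January
quoted = cp1252_to_utf8(b"\x93hi\x94")   # CP1252 curly quotes around "hi"
```

Doing these conversions once, on the way in, and storing only ISO 8601 datetimes and UTF-8 text downstream keeps the rest of the code free of per-source special cases.
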
  18. (Perhaps) Avoid Big Data

    • Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data • 244GB RAM EC2+many Xeons $2.80/hr
  19. “Data Science Delivered”

    • New mini project / pamphlet • Includes dirty data strategies, ways to debug ML, thoughts on managing projects - 15 yrs experience (please critique and file bugs!) • https://github.com/ianozsvald/data_science_delivered • Please give me your feedback
  20. Closing

    • Tell me your dirty data stories, perhaps in a Ruin Pub? (I am automating some of this) • Takehome - Keep it clean, keep it simple • Come talk on your projects at our PyDataLondon monthly meetup or start your own!