Applied Data Science

lanzani
April 09, 2017

Data science is the rising star of the 21st century. But many organizations are struggling with implementing data products that generate revenue. This presentation addresses the common pitfalls and how to tackle them.
It also covers the Kaggle “curse” (the obsession with accuracy over practicality) and what it means for your data products, the data science workflow, and, last but not least, data contracts.

Video available here ggd.li/gio-pyams2017

Transcript

  1. WHO AM I
    • Imported from Italy ca. 2006
    • Master & PhD in Theoretical Physics in Leiden (2006-2012)
    • Consultant Software Quality @ KPMG (2012-2013)
    • Data Whisperer @ GoDataDriven (2013-2016)
    • Chief Science Officer @ GoDataDriven (2016-…)
    • PyData Amsterdam chair 2016, organizer PyData meetup
    • Father of 5 (yup)
  2. BEST MODEL
    • Which one would you choose here?
    • It’s about making a tradeoff
    • This tradeoff is the most important job of the PO
    • A 100% correct answer might not exist!!!
  3. HELPFUL GODATADRIVEN DEFINITION
    • It creates value using data!
    • Using machine learning
    • Or visualization and analytics
  4. WHEN YOU SAY DATA SCIENCE, COMPANIES UNDERSTAND
    • All the things big data
    • Predictive modeling & Advanced Analytics
    • More money
    • Do all the cool things the others are doing
  5. BEYOND THE DATA WAREHOUSE
    [Diagram labels: Traditional operational data sources; EDW; Data consumer; Web app; Dashboard / Reporting; Traditional Business app]
  6. DATA PLATFORM
    [Diagram labels: Machine Learning; Data pipelines; Appropriate storage; Scale!; Data consumer; Web app; Dashboard / Reporting; Traditional Business app; API; External API; Logs; Chat/transcripts; Scraping; Unstructured data; Traditional operational data sources]
  7. WHAT COMPANIES GOT
    • A lot of POCs
    • A lot of screenshots/presentations/dashboards on a laptop
    • Extra mouths to feed with no returns
    • Nice stories to tell to their network, about those screenshots and especially those dashboards
    • Headaches with data and infra even more scattered
  8. BUT…
    • We got a data scientist working on trees, and forests
    • Neural networks!
    • Deep learning!!!
  9. WHAT DO COMPANIES ACTUALLY NEED
    • Put things into production
    • They don’t teach that in any data science MOOC (that I know)
  10. BUILDING A DATA PRODUCT
    [Diagram labels: Predictive Model; Input Data; Data Product; Smart Product; add data sources; improve model; productionize model; feature engineering & modeling; Usage Behavior; Product Backlog; expose & integrate; analyze & update requirements; extend model API; monitor & measure; Iterative Development Cycle]
  11. DATA CONTRACTS
    [Diagram labels: Data Platform; BU (×3); Data Lake; Data Products; Business (×3); IT (×3); App (×3); external data collection; internal data collection]
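
The slide only shows the diagram, so what a data contract looks like in practice is not spelled out here. One possible reading, sketched below as an assumption, is an explicit agreement between a data producer and the data platform about which fields a record carries and of which type. The field names, the `CONTRACT` mapping, and the `violations` helper are all hypothetical and chosen only for illustration.

```python
# A minimal sketch of a data contract check (all names are hypothetical).
# The contract maps each required field to the Python type it must have.
CONTRACT = {
    "customer_id": int,
    "signup_date": str,
    "monthly_spend": float,
}


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


# A record that honours the contract and one that breaks it twice.
good = {"customer_id": 42, "signup_date": "2017-04-09", "monthly_spend": 12.5}
bad = {"customer_id": 42, "monthly_spend": "12.5"}

print(violations(good, CONTRACT))
print(violations(bad, CONTRACT))
```

Running such a check at the boundary between a source system and the data lake would let a broken upstream change surface as a contract violation rather than as a silently degraded model.
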
  12. JOB MARKET 2016: US
    • Ask HN: What's the state of the job market in data science and machine learning?
    • https://news.ycombinator.com/item?id=13232883
    • The supply-demand dynamics have changed a lot in the last couple years.
    • Two groups: people with work experience + strong software development skills, and those without
    • The first group is in higher demand than ever
    • The second group has gotten extremely crowded [from people] […] who have completed MOOCs or bootcamps
    • Supply keeps growing while demand is flat or shrinking
    • especially as executives get burned by “data scientists” who don't know how to help them build things of value
  13. JOB MARKET 2016: US
    • The biggest differentiator I've seen is to be able to participate in actually building production quality systems vs being proficient enough in R or python to hack together a prototype on a very small dataset
  14. JOB MARKET 2017: NL
    • I am seeing the same things happening
    • We (GoDataDriven) are definitely only interested in these profiles (people who are already there, or that are getting there)
    • Many of our clients are in the same position
  15. GOOD SOFTWARE
    • Testable (and tested)
    • Modular (otherwise you cannot test it)
    • DRY
    • Efficient
    • Performant
    • Maintainable (clear code!)
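
To make the first three properties concrete, here is a small sketch of feature-preparation code written as pure functions. The feature names and caps are made up for illustration. The clipping and scaling logic lives in one place each (DRY), every step is a separate function (modular), and that is what makes the plain-assert test at the bottom possible (testable).

```python
import math


def clip(value, lower, upper):
    """Limit value to the closed interval [lower, upper]."""
    return max(lower, min(value, upper))


def log_scale(value):
    """Compress large non-negative values; log1p keeps 0 mapped to 0."""
    return math.log1p(value)


def prepare_amount(raw, cap=10_000.0):
    """Hypothetical feature: a transaction amount, clipped then log-scaled."""
    return log_scale(clip(raw, 0.0, cap))


def prepare_visits(raw, cap=500.0):
    """Hypothetical feature: a visit count, reusing the same two steps."""
    return log_scale(clip(raw, 0.0, cap))


def test_prepare_amount():
    # Negative input is clipped to 0, and log1p(0) is exactly 0.
    assert prepare_amount(-5.0) == 0.0
    # Anything above the cap behaves like the cap itself.
    assert prepare_amount(1e9) == prepare_amount(10_000.0)
    # The transformation preserves ordering inside the valid range.
    assert prepare_amount(10.0) < prepare_amount(100.0)


test_prepare_amount()
```

The same test function can be collected by a test runner such as pytest without changes, since it relies only on bare `assert` statements.
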
  16. I’VE TOLD YOU SO
    • https://blog.godatadriven.com/production-ready-ds
    • Many data scientists approach the problem at hand with a Kaggle-like mentality: delivering the best model in absolute terms, no matter what the practical implications are.
    • In reality it's not the best model that we implement, but the one that combines quality and practicality.
    • Business case for True Positives, True Negatives & Cost of False Positives and False Negatives
    • Netflix competition
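
The business-case bullet can be illustrated with a small calculation. The sketch below is not taken from the talk: the confusion counts, the monetary values, and the names `ConfusionCounts`, `BusinessCase`, and `expected_value` are all assumptions invented for the example. It scores two hypothetical models on the same 1000 cases, once by accuracy and once by the money each outcome is worth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


@dataclass(frozen=True)
class BusinessCase:
    value_tp: float  # money gained per true positive
    value_tn: float  # money gained per true negative
    cost_fp: float   # money lost per false positive
    cost_fn: float   # money lost per false negative


def accuracy(counts):
    """Fraction of all cases that were classified correctly."""
    total = counts.tp + counts.fp + counts.tn + counts.fn
    return (counts.tp + counts.tn) / total


def expected_value(counts, case):
    """Total monetary value of a model's predictions under a business case."""
    return (
        counts.tp * case.value_tp
        + counts.tn * case.value_tn
        - counts.fp * case.cost_fp
        - counts.fn * case.cost_fn
    )


# Made-up numbers: both models see the same 90 positives and 910 negatives.
model_a = ConfusionCounts(tp=80, fp=10, tn=900, fn=10)
model_b = ConfusionCounts(tp=88, fp=40, tn=870, fn=2)

# Made-up business case in which a missed positive is very expensive.
case = BusinessCase(value_tp=100.0, value_tn=0.0, cost_fp=5.0, cost_fn=500.0)

for name, counts in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(counts), expected_value(counts, case))
```

With these invented numbers, model A wins on accuracy (0.98 against 0.958) while model B wins on value (7600 against 2950), which is the kind of gap between the "best" model and the practical one that the slide points at.
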