Data Science: From Lab to Factory by Sean Owen at Big Data Spain 2013

Big Data is about enabling new data-driven services that were not feasible before, and machine learning is key to many of them. In this session, we find out about the gap between the lab and production and what the data science industry needs to do to close it.
Session presented at Big Data Spain 2013 Conference
7th Nov 2013
Kinépolis Madrid
Event promoted by http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2013/conference/data-science-from-lab-to-factory


Big Data Spain

November 14, 2013


  1. Data Science: From Lab to Factory Sean Owen

  2. None
  3. What’s the Big Data Big Deal? www.dataversity.net/data-buzzwords-defined-for-business-us

  4. Just Cheaper Extract-Transform-Load? blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-expl

  5. … Or Safer Drugs? Cloudera analysis of FDA drug data:

    “Our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.” http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-ev
  6. New Value Projects are Feasible

    [Chart labels: Cost to Productionize (Cheap, Expensive, Absurd); Value (Little, More, New Kinds); Then, Now; Newly Feasible Data Projects]
  7. Big Data Dream?

  8. I Dream of … Telematics

    Data: Every week, my car uploads driving summary to my insurance company
    Big Data: Every night, every car uploads all sensor data to my insurance company
  9. I Dream of … Telematics

    Insight: Stop-start extremely accident-prone when icy
    Insight: Brake failure preceded many accidents in claims
    Integrated: Auto e-mail stop-start drivers in forecast snowy areas
    Integrated: If braking power < 80% normal, alert customer / dealer
  10. I Dream of … Telematics

    Real-Time: Intersection ahead, past curve
    In the past, cars brake hard: caution
    Now, many cars stopped: brake soon
    … hot brakes, 70% wear: brake now!
  11. The Gap

  12. The Gap

    [Diagram labels: Collect, Transform, Store, Model, Deploy, Insight; Data, Value; "?"]
  13. Lab To Factory

  14. Data Scientist

  15. “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” @josh_wills
  16. A New Problem?

  17. It Used To Be So Solved…

  18. Data Science Flow

  19. Big Data Reopened the Gap

  20. Big Data Science Flow

  21. R

    • Powerful statistical environment • Mature, Open Source • One machine • Not integrated with run-time systems
  22. SciPy / sklearn

    • Machine learning for Python • Quality, Open Source • Popular for prototyping, contests • Parallel, but one machine
  23. Apache Mahout

    • Machine learning on Hadoop • Open Source • Popular basis for large-scale machine learning • Code, not a product
  24. Bridging the Gap

  25. New Answers

    • Sheer Data Volume • Drowns out noise • Right Algorithms • Easy parallel scale (e.g. decision forests) • Generalize to diverse input (e.g. matrix factorization) • Hadoop • Scalable load, build • Deploy Infrastructure • Auto tuning and eval • Real-time update
  26. Dos and Don’ts for 2014

  27. Do: Build your big data warehouse now

  28. Don’t: Worry about data format and quality yet

  29. Do: Collect as much data as could be relevant

  30. Don’t: Wait to start collecting potentially useful data

  31. Do: Invest in developing data science capability

  32. Do: Invest in proofs-of-concept now

  33. None
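
Illustrative sketch (not from the talk): the "Integrated" rule on slide 9, "If braking power < 80% normal, alert customer / dealer", could look roughly like the following once a model's insight is wired into a production system. The `BrakeReading` type, its field names, the car identifiers, and the helper functions are all hypothetical; only the 80% threshold comes from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BrakeReading:
    """One hypothetical telemetry sample; field names are assumptions."""
    braking_power: float         # measured braking power
    normal_braking_power: float  # expected braking power, in the same units


def needs_brake_alert(reading: BrakeReading, threshold: float = 0.80) -> bool:
    """Return True when braking power is below `threshold` of normal.

    The 0.80 default mirrors the "< 80% normal" rule on slide 9.
    """
    return reading.braking_power < threshold * reading.normal_braking_power


def cars_to_alert(readings: Dict[str, BrakeReading]) -> List[str]:
    """Return the sorted IDs of cars whose latest reading triggers the rule."""
    return sorted(
        car_id for car_id, reading in readings.items()
        if needs_brake_alert(reading)
    )


# Hypothetical nightly batch: car ID -> latest brake reading.
latest = {
    "car-a": BrakeReading(braking_power=95.0, normal_braking_power=100.0),
    "car-b": BrakeReading(braking_power=70.0, normal_braking_power=100.0),
    "car-c": BrakeReading(braking_power=50.0, normal_braking_power=100.0),
}

for car_id in cars_to_alert(latest):
    print(f"alert customer / dealer for {car_id}")
```

The rule itself is trivial; the point the deck makes is that the hard part is the path from collected data to a deployed check like this one, run automatically on every upload.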