Data Science: From Lab to Factory by Sean Owen at Big Data Spain 2013

Big Data is about enabling new data-driven services that were not feasible before, and machine learning is key to many of them. This session looks at the gap between the lab and production and at what the data science industry needs to do to close it.
Session presented at Big Data Spain 2013 Conference
7th Nov 2013
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2013/conference/data-science-from-lab-to-factory

Big Data Spain

November 14, 2013

Transcript

  1. … Or Safer Drugs? Cloudera analysis of FDA drug data:

    “Our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.” http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-ev
  2. New Value Projects are Feasible

    [Chart: "Cost to Productionize" (Cheap, Expensive, Absurd) against "Value" (Little, More, New Kinds), comparing "Then" and "Now", with a region labeled "Newly Feasible Data Projects".]
  3. I Dream of … Telematics

    • Data: Every week, my car uploads a driving summary to my insurance company
    • Big Data: Every night, every car uploads all sensor data to my insurance company
  4. I Dream of … Telematics

    Insight
    • Stop-start driving is extremely accident-prone when icy
    • Brake failure preceded many accidents in claims
    Integrated
    • Auto e-mail stop-start drivers in forecast snowy areas
    • If braking power < 80% of normal, alert customer / dealer
  5. I Dream of … Telematics

    Real-Time: Intersection ahead, past curve
    • In the past, cars brake hard: caution
    • Now, many cars stopped: brake soon
    • … hot brakes, 70% wear: brake now!
  6. “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

    @josh_wills
  7. R

    • Powerful statistical environment
    • Mature, Open Source
    • One machine
    • Not integrated with run-time systems
  8. SciPy / sklearn

    • Machine learning for Python
    • Quality, Open Source
    • Popular for prototyping, contests
    • Parallel, but one machine
  9. Apache Mahout

    • Machine learning on Hadoop
    • Open Source
    • Popular basis for large-scale machine learning
    • Code, not a product
  10. New Answers

    • Sheer Data Volume
      • Drowns out noise
    • Right Algorithms
      • Easy parallel scale (e.g. decision forests)
      • Generalize to diverse input (e.g. matrix factorization)
    • Hadoop
      • Scalable load, build
    • Deploy Infrastructure
      • Auto tuning and eval
      • Real-time update
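
The brake rule on slide 4 (braking power below 80% of normal triggers an alert to the customer or dealer) is simple enough to sketch in code. The following is only an illustration of that one rule, not anything shown in the talk: the `BrakeReading` record, its field names, and the sample readings are hypothetical, and only the 80% threshold comes from the slide.

```python
from dataclasses import dataclass

# Threshold from slide 4: alert when braking power is below 80% of normal.
BRAKE_ALERT_RATIO = 0.80


@dataclass
class BrakeReading:
    # Hypothetical record shape; the talk does not define a telemetry schema.
    car_id: str
    braking_power: float  # measured braking power, arbitrary units
    normal_power: float   # expected braking power for this car, same units


def needs_brake_alert(reading: BrakeReading) -> bool:
    """Return True when braking power is strictly below 80% of normal."""
    if reading.normal_power <= 0:
        raise ValueError("normal_power must be positive")
    return reading.braking_power / reading.normal_power < BRAKE_ALERT_RATIO


def cars_to_alert(readings):
    """Collect the IDs of cars whose readings trip the rule, in input order."""
    return [r.car_id for r in readings if needs_brake_alert(r)]


# Made-up sample readings, for illustration only.
readings = [
    BrakeReading("car-a", 95.0, 100.0),
    BrakeReading("car-b", 79.0, 100.0),
    BrakeReading("car-c", 80.0, 100.0),
]
print(cars_to_alert(readings))
```

Only `car-b` falls strictly below the threshold here. In a production setting of the kind the talk describes, a check like this would presumably run against incoming sensor uploads rather than an in-memory list.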