Slide 1

Slide 1 text

Data Science: From Lab to Factory Sean Owen

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

What’s the Big Data Big Deal? www.dataversity.net/data-buzzwords-defined-for-business-us

Slide 4

Slide 4 text

Just Cheaper Extract-Transform-Load? blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-expl

Slide 5

Slide 5 text

… Or Safer Drugs? Cloudera analysis of FDA drug data: “Our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.” http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-ev

Slide 6

Slide 6 text

New Value Projects are Feasible
[Chart: Value (Little to More) against Cost to Productionize (Cheap, Expensive, Absurd), Then versus Now. Labels: New Kinds, Newly Feasible Data Projects]

Slide 7

Slide 7 text

Big Data Dream?

Slide 8

Slide 8 text

I Dream of … Telematics
Data: Every week, my car uploads a driving summary to my insurance company
Big Data: Every night, every car uploads all sensor data to my insurance company

Slide 9

Slide 9 text

I Dream of … Telematics
Insight
• Stop-start driving is extremely accident-prone when icy
• Brake failure preceded many accidents in claims
Integrated
• Auto e-mail stop-start drivers in forecast snowy areas
• If braking power < 80% of normal, alert customer / dealer
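The braking-power alert on this slide amounts to a simple threshold rule. The following is a minimal sketch, not the speaker's implementation: only the 80% threshold comes from the slide, while the `BrakeReading` fields, the idea of a per-car "normal" baseline, and the function names are hypothetical.

```python
from dataclasses import dataclass

# The 80% threshold is from the slide; all names below are hypothetical.
BRAKE_ALERT_RATIO = 0.80


@dataclass
class BrakeReading:
    car_id: str
    braking_power: float  # latest measured braking power, arbitrary units
    normal_power: float   # assumed baseline for this car, same units


def needs_brake_alert(reading: BrakeReading) -> bool:
    """Return True when braking power is below 80% of the normal baseline."""
    if reading.normal_power <= 0:
        raise ValueError("normal_power must be positive")
    return reading.braking_power < BRAKE_ALERT_RATIO * reading.normal_power


def cars_to_alert(readings):
    """Return the IDs of cars whose customer / dealer should be alerted."""
    return [r.car_id for r in readings if needs_brake_alert(r)]


# Made-up sample readings, for illustration only.
readings = [
    BrakeReading("car-1", braking_power=95.0, normal_power=100.0),
    BrakeReading("car-2", braking_power=72.0, normal_power=100.0),
]
print(cars_to_alert(readings))
```

In a nightly batch over uploaded sensor data, the same predicate could be applied per car; how the alert is actually delivered is outside the scope of this sketch.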

Slide 10

Slide 10 text

I Dream of … Telematics
Real-Time: Intersection ahead, past curve
• In the past, cars brake hard: caution
• Now, many cars stopped: brake soon
• … hot brakes, 70% wear: brake now!

Slide 11

Slide 11 text

The Gap

Slide 12

Slide 12 text

The Gap ?
Data: Collect, Transform, Store
Value: Model, Deploy, Insight

Slide 13

Slide 13 text

Lab To Factory

Slide 14

Slide 14 text

Data Scientist

Slide 15

Slide 15 text

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” @josh_wills

Slide 16

Slide 16 text

A New Problem?

Slide 17

Slide 17 text

It Used To Be So Solved…

Slide 18

Slide 18 text

Data Science Flow

Slide 19

Slide 19 text

Big Data Reopened the Gap

Slide 20

Slide 20 text

Big Data Science Flow

Slide 21

Slide 21 text

R
• Powerful statistical environment
• Mature, Open Source
• One machine
• Not integrated with run-time systems

Slide 22

Slide 22 text

SciPy / sklearn
• Machine learning for Python
• Quality, Open Source
• Popular for prototyping, contests
• Parallel, but one machine

Slide 23

Slide 23 text

Apache Mahout
• Machine learning on Hadoop
• Open Source
• Popular basis for large-scale machine learning
• Code, not a product

Slide 24

Slide 24 text

Bridging the Gap

Slide 25

Slide 25 text

New Answers
• Sheer Data Volume
  • Drowns out noise
• Right Algorithms
  • Easy parallel scale (e.g. decision forests)
  • Generalize to diverse input (e.g. matrix factorization)
• Hadoop
  • Scalable load, build
• Deploy Infrastructure
  • Auto tuning and eval
  • Real-time update
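One way to read "Sheer Data Volume: Drowns out noise" is that averaging over more independent noisy observations shrinks the estimation error. The sketch below illustrates that reading only; the Gaussian noise model, the parameter values, and the function name are assumptions for illustration and do not come from the slides.

```python
import random


def mean_estimate_error(n_samples, true_value=10.0, noise_sd=5.0, seed=0):
    """Absolute error of the sample mean over n noisy observations.

    Assumes independent Gaussian noise, which is an illustrative choice.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    samples = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_samples)]
    return abs(sum(samples) / n_samples - true_value)


# The expected error shrinks roughly like noise_sd / sqrt(n_samples).
for n in (10, 1_000, 100_000):
    print(n, round(mean_estimate_error(n), 4))
```

This covers only independent noise; systematic bias in the data does not average away with volume.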

Slide 26

Slide 26 text

Dos and Don’ts for 2014

Slide 27

Slide 27 text

Do: Build your big data warehouse now

Slide 28

Slide 28 text

Don’t: Worry about data format and quality yet

Slide 29

Slide 29 text

Do: Collect as much data as could be relevant

Slide 30

Slide 30 text

Don’t: Wait to start collecting potentially useful data

Slide 31

Slide 31 text

Do: Invest in developing data science capability

Slide 32

Slide 32 text

Do: Invest in proofs-of-concept now

Slide 33

Slide 33 text

No content