What’s the Big Data Big Deal?
www.dataversity.net/data-buzzwords-defined-for-business-us
Slide 4
Slide 4 text
Just Cheaper Extract-Transform-Load?
blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-expl
Slide 5
Slide 5 text
… Or Safer Drugs?
Cloudera analysis of FDA drug data: “Our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.”
http://blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-ev
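One simple way to surface drug pairs like those in the quote is to compare how often an adverse event is reported alongside a pair with how often it is reported overall. The sketch below is not Cloudera's method; the report format, the `pair_lift` helper, and the toy records are all hypothetical and only illustrate the counting idea.

```python
from collections import Counter
from itertools import combinations

def pair_lift(reports, event):
    """Return {drug pair: lift} for one adverse event.

    lift = P(event | pair taken) / P(event overall).
    `reports` is a list of (set_of_drugs, set_of_events) tuples
    (an assumed format, not the FDA's).
    """
    total = len(reports)
    event_total = sum(1 for _, events in reports if event in events)
    if total == 0 or event_total == 0:
        return {}
    base_rate = event_total / total

    pair_count = Counter()        # reports mentioning each drug pair
    pair_event_count = Counter()  # ...that also mention the event
    for drugs, events in reports:
        for pair in combinations(sorted(drugs), 2):
            pair_count[pair] += 1
            if event in events:
                pair_event_count[pair] += 1

    return {
        pair: (pair_event_count[pair] / n) / base_rate
        for pair, n in pair_count.items()
    }

# Toy, made-up reports purely for illustration (not FDA data).
reports = [
    ({"drug_a", "drug_b"}, {"event_x"}),
    ({"drug_a", "drug_b"}, {"event_x"}),
    ({"drug_a"}, set()),
    ({"drug_b"}, set()),
    ({"drug_a", "drug_c"}, set()),
    ({"drug_c"}, {"event_x"}),
]
lifts = pair_lift(reports, "event_x")
```

A lift well above 1 flags a pair worth a closer look; with real report volumes the per-pair counting is the part that spreads naturally across many machines.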
Slide 6
Slide 6 text
New Value Projects are Feasible
[Chart. Axes: Cost to Productionize (Cheap, Expensive, Absurd) and Value (Little, More, New Kinds). Labels: Then, Now, Newly Feasible Data Projects.]
Slide 7
Slide 7 text
Big Data Dream?
Slide 8
Slide 8 text
I Dream of … Telematics
Data: Every week, my car uploads driving summary to my insurance company
Big Data: Every night, every car uploads all sensor data to my insurance company
Slide 9
Slide 9 text
I Dream of … Telematics
Insight:
• Stop-start extremely accident-prone when icy
• Brake failure preceded many accidents in claims
Integrated:
• Auto e-mail stop-start drivers in forecast snowy areas
• If braking power < 80% normal, alert customer / dealer
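The braking rule on this slide is easy to state as code. The following is a minimal sketch under stated assumptions: the `BrakeReading` record, its field names, and the sample readings are hypothetical, and only the 80% threshold comes from the slide.

```python
from dataclasses import dataclass

BRAKE_ALERT_RATIO = 0.80  # threshold from the slide: below 80% of normal

@dataclass
class BrakeReading:
    vehicle_id: str
    braking_power: float  # measured value, in any consistent unit
    normal_power: float   # expected value for this vehicle (assumed known)

def needs_brake_alert(reading: BrakeReading) -> bool:
    """True when measured braking power is below 80% of normal."""
    if reading.normal_power <= 0:
        raise ValueError("normal_power must be positive")
    return reading.braking_power / reading.normal_power < BRAKE_ALERT_RATIO

def vehicles_to_alert(readings):
    """Vehicle ids whose customer / dealer should be notified."""
    return [r.vehicle_id for r in readings if needs_brake_alert(r)]

# Made-up nightly readings for illustration.
readings = [
    BrakeReading("car-1", 95.0, 100.0),
    BrakeReading("car-2", 72.0, 100.0),
    BrakeReading("car-3", 80.0, 100.0),  # exactly 80%: not below threshold
]
alerts = vehicles_to_alert(readings)
```

The rule itself is trivial; the "integrated" part is having the nightly sensor upload feed it automatically rather than waiting for a claim.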
Slide 10
Slide 10 text
I Dream of … Telematics
Intersection ahead, past curve
Real-Time
In the past, cars brake hard: caution
Now, many cars stopped: brake soon
… hot brakes, 70% wear: brake now!
Slide 11
Slide 11 text
The Gap
Slide 12
Slide 12 text
?
The Gap
Collect
Transform
Store
Data
Value
Model
Deploy
Insight
Slide 13
Slide 13 text
Lab To Factory
Slide 14
Slide 14 text
Data Scientist
Slide 15
Slide 15 text
“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
@josh_wills
Slide 16
Slide 16 text
A New Problem?
Slide 17
Slide 17 text
It Used To Be So Solved…
Slide 18
Slide 18 text
Data Science Flow
Slide 19
Slide 19 text
Big Data Reopened the Gap
Slide 20
Slide 20 text
Big Data Science Flow
Slide 21
Slide 21 text
R
• Powerful statistical environment
• Mature, Open Source
• One machine
• Not integrated with run-time systems
Slide 22
Slide 22 text
SciPy / sklearn
• Machine learning for Python
• Quality, Open Source
• Popular for prototyping, contests
• Parallel, but one machine
Slide 23
Slide 23 text
Apache Mahout
• Machine learning on Hadoop
• Open Source
• Popular basis for large-scale machine learning
• Code, not a product
Slide 24
Slide 24 text
Bridging the Gap
Slide 25
Slide 25 text
New Answers
• Sheer Data Volume
  • Drowns out noise
• Right Algorithms
  • Easy parallel scale (e.g. decision forests)
  • Generalize to diverse input (e.g. matrix factorization)
• Hadoop
  • Scalable load, build
• Deploy Infrastructure
  • Auto tuning and eval
  • Real-time update
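Decision forests scale out easily because each tree is trained independently on its own bootstrap sample. The sketch below illustrates that property with the standard library only; it uses one-feature, depth-one trees (stumps), and a thread pool stands in for cluster workers. The function names and the toy data are hypothetical, and this is not how Mahout or any particular product implements forests.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_stump(args):
    """Fit a one-feature threshold classifier on a bootstrap sample."""
    data, seed = args
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]  # sample with replacement
    best_threshold, best_errors = None, len(sample) + 1
    for threshold, _ in sample:
        # The stump predicts label 1 when x >= threshold, else 0.
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def train_forest(data, n_trees=25, workers=4):
    # Each stump depends only on its own seed and sample, so the jobs
    # need no coordination and could run on separate machines.
    jobs = [(data, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, jobs))

def predict(forest, x):
    """Majority vote over all stumps."""
    votes = sum(x >= threshold for threshold in forest)
    return int(votes * 2 > len(forest))

# Toy data: label is 1 when x >= 5.0.
data = [(x / 10, int(x >= 50)) for x in range(100)]
forest = train_forest(data)
```

Because the trees never exchange state during training, moving from one machine to many changes only where the jobs run, not the algorithm.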
Slide 26
Slide 26 text
Dos and Don’ts for 2014
Slide 27
Slide 27 text
Do: Build your big data warehouse now
Slide 28
Slide 28 text
Don’t: Worry about data format and quality yet
Slide 29
Slide 29 text
Do: Collect as much data as could be relevant
Slide 30
Slide 30 text
Don’t: Wait to start collecting potentially useful data