Outline
• Part 1: Industrial Machine Learning
• Part 2: ML and Hadoop: The State of the World
• Part 3: ML and Hadoop: Where Things are Headed
Copyright 2012 Cloudera Inc. All rights reserved
“Machine learning is statistics minus any checking of models and assumptions.”
-- Brian Ripley, UseR! 2004 (provocatively paraphrased)
Hadoop Platform: Substrate
• Commodity servers
• Open source operating system
• “” Configuration Management
• “” Coordination Service
• “” File System API
• “” Efficient and Extensible File Formats
• “” Efficient and Extensible RPC Libraries
MapReduce
• Great for:
  • Data Preparation
  • Feature Engineering
  • Model Validation/Evaluation
• Works Well For Certain Model Fitting Problems
  • Collaborative Filtering Algorithms
  • Expectation Maximization
  • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Many Kinds of Problems
  • Way More Detail in the KDD 2011 Talk
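As a rough illustration of why data preparation and feature engineering fit MapReduce well, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It computes a per-user event count as a feature. The record format (`user,item`) and the names `map_record`, `reduce_counts`, and `run_mapreduce` are hypothetical, chosen only for this example; this is not Hadoop API code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_record(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (user, 1) for each 'user,item' record (assumed format)."""
    user, _item = record.split(",")
    yield user, 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one user."""
    return key, sum(values)


def run_mapreduce(records: List[str]) -> Dict[str, int]:
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_mapreduce(["u1,a", "u2,b", "u1,c"]))
```

Each record is processed independently in the map step and each key independently in the reduce step, which is what lets a real cluster spread this kind of work across many machines.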
Apache Mahout
• The starting place for MapReduce-based machine learning algorithms
• Not machine-learning-in-a-box
  • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
  • Recommendations
  • Clustering
  • Classification
  • Frequent Itemset Mining
Apache Mahout (cont.)
• Best Library: Taste Recommender
  • Oldest project, most widely deployed in production
  • SVD implementation is particularly active
• Good Libraries: Online SGD
  • Does not use MapReduce
  • Vowpal Wabbit is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
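To make the "Online SGD" bullet concrete, the following is a minimal sketch of online stochastic gradient descent for logistic regression. It sees one example at a time and updates the weights immediately, which is why it runs as a sequential pass rather than a MapReduce job. This is an illustrative sketch only, not Mahout's or Vowpal Wabbit's implementation; the function name `sgd_logistic` and the fixed learning rate are assumptions.

```python
import math
from typing import Iterable, List, Tuple


def sgd_logistic(
    examples: Iterable[Tuple[List[float], int]],
    n_features: int,
    learning_rate: float = 0.1,  # assumed constant step size
) -> List[float]:
    """One sequential pass of online SGD for logistic regression."""
    weights = [0.0] * n_features
    for features, label in examples:
        # Predict with the current weights.
        score = sum(w * x for w, x in zip(weights, features))
        prediction = 1.0 / (1.0 + math.exp(-score))
        # Gradient step on the log-loss for this single example.
        error = label - prediction
        for j, x in enumerate(features):
            weights[j] += learning_rate * error * x
    return weights


if __name__ == "__main__":
    # Tiny made-up stream: the second feature determines the label.
    stream = [([1.0, 1.0], 1), ([1.0, -1.0], 0)] * 200
    print(sgd_logistic(stream, n_features=2))
```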
AllReduce
• Developed at Yahoo! Research
• Defines the allreduce operation
  • N machines each have a number => each machine has the sum of the numbers
• At the heart of Vowpal Wabbit’s performance
• Implemented in C++
• Can be patched into Apache Hadoop and used today
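The allreduce operation described above can be sketched in a single process: each list entry stands in for one machine's number, and the result gives every machine the same sum. The reduce-up-a-tree-then-broadcast layout used here is one common way to organize the communication and is an assumption for illustration, not a description of the Yahoo! Research C++ code. The name `allreduce_sum` is hypothetical.

```python
from typing import List


def allreduce_sum(values: List[float]) -> List[float]:
    """Simulate allreduce: values[i] is machine i's number; every machine ends with the sum."""
    n = len(values)
    if n == 0:
        return []
    partial = list(values)
    # Reduce phase: machines form a binary tree where the parent of i is (i - 1) // 2.
    # Walking from the highest index down, each node has already absorbed its
    # children's partial sums by the time it sends its own sum to its parent.
    for i in range(n - 1, 0, -1):
        partial[(i - 1) // 2] += partial[i]
    total = partial[0]  # the root now holds the global sum
    # Broadcast phase: the root's total is sent back to every machine.
    return [total] * n


if __name__ == "__main__":
    print(allreduce_sum([1, 2, 3, 4]))  # → [10, 10, 10, 10]
```

In iterative model fitting, the "number" on each machine is typically a locally computed quantity such as a gradient component, so after one allreduce every machine can apply the same update.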
Spark
• Developed at Berkeley’s AMP Lab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS
GraphLab
• Developed at CMU
• Lower-level primitives
  • (but higher than MPI)
• Map/Reduce => Update/Sort
• Flexible, allows for asynchronous computations
• Reads from HDFS