SETI@Home
Search for Extra-Terrestrial Intelligence
• Prove the viability of distributed grid computing
Slide 4
What problem are we trying to solve?
Distributed Computing
Slide 5
Counts of all the distinct words
• in a file?
• in a directory?
• on the Web?
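For a single file on one machine, this is a short program; a minimal Java sketch (input path taken from args, class name illustrative):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Naive single-machine word count: fine for one file, hopeless at Web scale.
public class LocalWordCount {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}

The rest of the deck is about what happens when the input no longer fits on one machine.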
Slide 6
We need to process 100TB datasets
• On 1 node:
  o Scanning @ 50MB/s = 23 days
• On 1000 node cluster:
  o Scanning @ 50MB/s = 33 min
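As a sanity check on those numbers (taking 1TB ≈ 10^6 MB): 100TB / 50MB/s = 2,000,000 s ≈ 23 days on a single node; spread evenly across 1,000 nodes that becomes 2,000 s ≈ 33 minutes.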
Slide 7
We need a framework for distributing computation
Slide 8
We need a new paradigm
Slide 9
No content
Slide 10
Hadoop is an open-source Java framework for running applications.
Slide 11
Scalable
Hadoop can reliably store and process petabytes of data.
Economical
Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient
Hadoop can process the distributed data in parallel on the nodes where the data is located.
Reliable
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
Slide 12
Hadoop Components
Hadoop Distributed File System (HDFS)
• Java, Shell, C and HTTP APIs
Hadoop MapReduce
• Java and Streaming APIs
Hadoop on Demand
• Tools to manage dynamic setup and teardown of Hadoop nodes
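For the Shell API above, the everyday HDFS commands look like this (paths hypothetical):

hadoop fs -mkdir /user/demo              # create an HDFS directory
hadoop fs -put large.txt /user/demo/     # copy a local file into HDFS
hadoop fs -ls /user/demo                 # list the directory
hadoop fs -cat /user/demo/large.txt      # stream a file back out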
Slide 13
HBase
Table storage on top of HDFS, modeled after Google's BigTable
Pig
Language for dataflow programming
Hive
SQL interface to structured data stored in HDFS
Other Tools
Slide 14
Hadoop MapReduce
• Mappers and Reducers are allocated
• Code is shipped to nodes
• Mappers and Reducers are run on the same machines as DataNodes
• Two major daemons: JobTracker and TaskTracker
Slide 15
JobTracker
• Long-lived master daemon which distributes tasks
• Maintains a job history of job execution
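In classic (pre-YARN) Hadoop, TaskTrackers and job clients locate the JobTracker via mapred-site.xml; a minimal sketch, with master-host as a placeholder hostname:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master-host:9001</value>  <!-- host:port of the JobTracker daemon -->
  </property>
</configuration>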
Slide 16
• Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
• Create a hierarchical HDFS with directories and files.
• Use the Hadoop API to store a large text file (see the sketch after this list).
• Create a MapReduce application
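A minimal sketch of the file-storage step using the HDFS Java API (paths and class name are illustrative; the cluster address comes from core-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStoreFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml
        FileSystem fs = FileSystem.get(conf);      // connect to the default HDFS
        fs.mkdirs(new Path("/user/demo/input"));   // create a directory hierarchy
        // copy a large local text file into HDFS
        fs.copyFromLocalFile(new Path("/tmp/large.txt"),
                             new Path("/user/demo/input/large.txt"));
        fs.close();
    }
}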
Slide 17
Map
• Mapper takes an input key/value pair
• Does something to its input
• Emits an intermediate key/value pair
• One call per input record
• Fully data-parallel
Reduce
• Input is the list of all intermediate values for a given key
• Reducer aggregates the list of intermediate values
• Returns a final key/value pair for output
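The canonical example of this pattern is word count. A minimal sketch against the Java MapReduce API (class names illustrative): the Mapper emits (word, 1) per token, and the Reducer sums the per-word list.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: called once per input record; emits intermediate (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives one word plus the list of its intermediate counts,
// and returns the final (word, total) pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}

A driver would wire these into a Job via setMapperClass/setReducerClass before submitting to the JobTracker.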
Adobe
- Use for data storage and processing
- 30 nodes
Facebook
- Use for reporting
Slide 22
• Video and Image processing
• Log analysis
• Spam/BOT analysis
• Behavioral analytics
Slide 23
Recommended Hardware
Commodity servers
• 1 RU
• 2 x 4 core CPU
• 4-8GB of RAM using ECC memory
• 4 x 1TB SATA drives
• 1-5TB external storage
Typically arranged in a two-level architecture
• 30/40 nodes per rack
Slide 24
• No version and dependency management.
• Configuration…