SETI@Home: Search for Extra-Terrestrial Intelligence
• Prove the viability of the distributed grid computing concept (succeeded)
• Detect intelligent life outside Earth (failed)
Scalable
Hadoop can reliably store and process petabytes of data.
Economical
Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient
Hadoop can process the distributed data in parallel on the nodes where the data is located.
Reliable
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
Hadoop Components
Hadoop Distributed File System (HDFS)
• Java, Shell, C and HTTP APIs
Hadoop MapReduce
• Java and Streaming APIs
Hadoop on Demand
• Tools to manage dynamic setup and teardown of Hadoop nodes
Other Tools
HBase
• Table storage on top of HDFS, modeled after Google's Bigtable
Pig
• Language for dataflow programming
Hive
• SQL interface to structured data stored in HDFS
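To show what "table storage on top of HDFS" looks like from client code, here is a small sketch against the classic (pre-1.0) HBase Java client. The table name webtable, its content column family, and the row key are made up for illustration, and newer HBase releases replace HTable and Put.add with the Connection/Table and addColumn APIs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate the cluster.
        Configuration conf = HBaseConfiguration.create();

        // "webtable" is a hypothetical table with a "content" column family.
        HTable table = new HTable(conf, "webtable");

        // Store one cell: row "com.example/", column content:html.
        Put put = new Put(Bytes.toBytes("com.example/"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Read the same cell back by row key.
        Result result = table.get(new Get(Bytes.toBytes("com.example/")));
        byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
        System.out.println(Bytes.toString(html));

        table.close();
    }
}
```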
Hadoop MapReduce
• Mappers and Reducers are allocated
• Code is shipped to nodes
• Mappers and Reducers are run on the same machines as the DataNodes
• Two major daemons: JobTracker and TaskTracker
Hadoop MapReduce
JobTracker
• Long-lived master daemon which distributes tasks
• Maintains a history of job execution
TaskTrackers
• Long-lived client daemon which executes Map and Reduce tasks
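To make the JobTracker's role concrete, here is a minimal, self-contained driver sketch against the classic org.apache.hadoop.mapred Java API: JobClient.runJob submits the configured job to the JobTracker, which ships the job code to the TaskTrackers and schedules the Map and Reduce tasks. The class name, job name, and HDFS paths are placeholders, and the identity Mapper/Reducer are used only so the sketch runs on its own.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        // Client-side job configuration; the jar containing this class is what
        // gets shipped to the nodes that run the tasks.
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("identity-passthrough");            // placeholder job name

        conf.setMapperClass(IdentityMapper.class);           // pass records through unchanged
        conf.setReducerClass(IdentityReducer.class);

        conf.setOutputKeyClass(LongWritable.class);          // TextInputFormat keys are byte offsets
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path("/data/input"));   // placeholder HDFS paths
        FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

        // Submits the job to the JobTracker and blocks until it reports completion.
        RunningJob job = JobClient.runJob(conf);
        System.out.println("Job successful: " + job.isSuccessful());
    }
}
```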
Hadoop MapReduce
• Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
• Create a hierarchical HDFS with directories and files.
• Use the Hadoop API to store a large text file (a sketch follows below).
• Create a MapReduce application.
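As a sketch of the "store a large text file" step, the following uses the Hadoop FileSystem Java API to stream a local file into HDFS. The class name, local path, and hdfs:// destination URI are placeholders; the Configuration object is assumed to pick up the cluster settings (e.g. from core-site.xml) on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        String localFile = "/tmp/large.txt";                       // placeholder local file
        String hdfsDest  = "hdfs://namenode:9000/data/large.txt";  // placeholder HDFS destination

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(hdfsDest), conf);

        // Stream the local file into HDFS; it is split into blocks and
        // replicated across DataNodes as it is written.
        try (InputStream in = new BufferedInputStream(new FileInputStream(localFile));
             FSDataOutputStream out = fs.create(new Path(hdfsDest))) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```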
Map
• Mapper takes an input key/value pair
• Does something to its input
• Emits an intermediate key/value pair
• One call per input record
• Fully data-parallel
Reduce
• Input is the list of all intermediate values for a given key
• Reducer aggregates the list of intermediate values
• Returns a final key/value pair for output (see the word-count sketch below)
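The bullets above map directly onto the classic org.apache.hadoop.mapred Java interfaces. Below is the standard word-count example as a sketch: the Mapper is called once per input line and emits an intermediate (word, 1) pair per token; the Reducer receives all intermediate values for a word and aggregates them into one final pair. The class names are illustrative.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Mapper: one call per input record (here, one line of text).
    // Emits an intermediate (word, 1) pair for every token it finds.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);      // intermediate key/value pair
            }
        }
    }

    // Reducer: receives every intermediate value for one key and aggregates
    // them into a single final (word, count) pair.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}
```

To run it, a driver like the sketch after the TaskTracker slide would set conf.setMapperClass(WordCount.Map.class), conf.setReducerClass(WordCount.Reduce.class), and Text/IntWritable as the output key and value classes.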
Who is using it?
Adobe - Use for data storage and processing - 30 nodes
Facebook - Use for reporting - 320 nodes
FOX - Use for log analysis and data mining - 140 nodes
Last.fm - Use for chart calculations - 27 nodes
New York Times - Use for large scale image conversion - 100 nodes
Yahoo! - Use for Ad systems and Web search - 10,000 nodes
Recommended Hardware
Commodity servers
• 1 RU
• 2 x 4-core CPUs
• 4-8 GB of RAM using ECC memory
• 4 x 1 TB SATA drives
• 1-5 TB external storage
Typically arranged in a 2-level architecture
• 30/40 nodes per rack
Challenges
• No version and dependency management.
• Configuration.
• No security against accidents; user identification is rudimentary. Last.fm deleted a filesystem by accident.
• HDFS is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file.
• Steep learning curve. According to Facebook, using Hadoop was not easy for end users, especially for those who were not familiar with MapReduce.