Hadoop & Distributed Computing
One project in particular proved the concept works extremely well.
Search for Extra-Terrestrial Intelligence (SETI@home)
• Prove the viability of the distributed grid computing concept (succeeded)
• Detect intelligent life outside Earth (failed)
How do we process the data
• in a file?
• in a directory?
• on the Web?
Example: scanning 100 TB of data
• On 1 node:
o Scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
o Scanning @ 50 MB/s per node = 33 min
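A quick back-of-the-envelope check of those figures, as a sketch; the 100 TB dataset size is an assumption implied by the 23-day number (50 MB/s sustained for 23 days is roughly 100 TB):

public class ScanTime {
    public static void main(String[] args) {
        double datasetBytes = 100e12; // assumption: ~100 TB dataset, implied by the 23-day figure
        double bytesPerSec  = 50e6;   // 50 MB/s sustained scan rate per node
        double oneNodeSecs  = datasetBytes / bytesPerSec;                     // ~2,000,000 s
        System.out.printf("1 node:     %.0f days%n", oneNodeSecs / 86_400);   // ~23 days
        System.out.printf("1000 nodes: %.0f min%n", oneNodeSecs / 1000 / 60); // ~33 min
    }
}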
We need a framework for distributed computing.
Hadoop can reliably store and process petabytes of data.
Hadoop distributes the data and processing across clusters of commonly
available computers. These clusters can number into the thousands of nodes.
Hadoop can process the distributed data in parallel on the nodes where
the data is located.
Hadoop Distributed File System (HDFS)
• Java, Shell, C and HTTP APIs
Hadoop MapReduce
• Java and Streaming APIs
Hadoop on Demand
• Tools to manage dynamic setup and teardown of Hadoop
HBase
• Table storage on top of HDFS, modeled after Google's Bigtable
Pig
• Language for dataflow programming
Hive
• SQL interface to structured data stored in HDFS
• Code is shipped to nodes
• Mappers and Reducers are run on same machines
• Two major daemons: JobTracker and TaskTracker
JobTracker
• Long-lived master daemon which distributes tasks
• Maintains a history of job execution
TaskTracker
• Long-lived client daemon which executes Map and Reduce tasks
Distributed File System (HDFS)
• Create a hierarchical HDFS with directories and files.
• Use Hadoop API to store a large text file.
• Create a MapReduce application.
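A minimal sketch of the first two exercises using the HDFS Java API (FileSystem); the directory, file name, and file contents are hypothetical:

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExercise {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to the configured filesystem (HDFS)

        // Exercise 1: create a directory hierarchy
        Path dir = new Path("/user/demo/books");    // hypothetical path
        fs.mkdirs(dir);

        // Exercise 2: store a large text file
        Path file = new Path(dir, "large.txt");
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(fs.create(file, true)))) {
            for (int i = 0; i < 10_000_000; i++) {  // hundreds of MB of text
                out.write("this is line " + i);
                out.newLine();
            }
        }
    }
}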
Hadoop MapReduce
Mapper
• Does something to its input
• Emits intermediate key/value pair
• One call per input record
• Fully data-‐parallel
• Reducer aggregates list of intermediate values
• Returns a final key/value pair for output, as in the WordCount sketch below
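To make the Mapper/Reducer contract concrete, here is the canonical WordCount example, written against the newer org.apache.hadoop.mapreduce API (on the JobTracker-era releases described above the job is constructed slightly differently; class names and paths here are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: called once per input record; emits intermediate (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);          // intermediate key/value pair
                }
            }
        }
    }

    // Reducer: aggregates the list of intermediate values for each key
    // and returns a final (word, count) pair.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum)); // final key/value pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);        // this jar is shipped to the nodes
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, it would run with something like: hadoop jar wordcount.jar WordCount /input /output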
Who is using it?
- 30-node cluster: data storage and processing
- Facebook: reporting; 320 nodes
- 140-node cluster: log analysis and data mining
- Last.fm: chart calculation; 27 nodes
- New York Times: large-scale image conversion; 100 nodes
- Yahoo!: Ad systems and Web search; 10,000 nodes
What is it used for?
• Log analysis
• Spam/BOT analysis
• Behavioral analytics
• Analyzing sequences of customer buying behavior for cross-selling and targeting
Typical cluster node:
• 1 RU
• 2 x 4-core CPUs
• 4-8 GB of RAM using ECC memory
• 4 x 1 TB SATA drives
• 1-5 TB external storage
Typically arranged in a two-level architecture:
• 30-40 nodes per rack
Limitations
• Configuration is complex.
• No security against accidents. User identification is minimal; Last.fm deleted a filesystem by accident.
• HDFS is primarily designed for streaming access of large files.
Reading through small files normally causes lots of seeks and lots
of hopping from datanode to datanode to retrieve each small file.
• Steep learning curve. According to Facebook, using Hadoop was
not easy for end users, especially for those who were not
familiar with MapReduce.