data.
• Economical: Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
• Efficient: Hadoop can process the distributed data in parallel on the nodes where the data is located.
• Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
• Java, Shell, C and HTTP APIs (a Java sketch follows below)
Hadoop MapReduce
• Java and Streaming APIs
Hadoop on Demand
• Tools to manage dynamic setup and teardown of Hadoop nodes
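As a hedged illustration of the HDFS Java API listed above, the sketch below creates a directory hierarchy and writes a text file through org.apache.hadoop.fs.FileSystem. The class name HdfsWriteExample and the /user/demo paths are assumptions made for the example, and the NameNode address is assumed to come from core-site.xml on the classpath.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical directory and file names, used only for illustration.
    Path dir = new Path("/user/demo/books");
    fs.mkdirs(dir);

    Path file = new Path(dir, "large-text-file.txt");
    try (FSDataOutputStream out = fs.create(file, /* overwrite = */ true);
         BufferedWriter writer =
             new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
      writer.write("one line of the file being stored in HDFS\n");
    }

    System.out.println("Stored " + fs.getFileStatus(file).getLen() + " bytes at " + file);
  }
}
```

The same hierarchy can then be inspected from the shell API mentioned above, e.g. with `hadoop fs -ls /user/demo/books`.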
Distributed File System (HDFS)
• Create a hierarchical HDFS with directories and files.
• Use the Hadoop API to store a large text file.
• Create a MapReduce application (see the sketch below).
Hadoop MapReduce
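As a minimal sketch of a MapReduce application written against the Java API, here is the classic word-count job using the org.apache.hadoop.mapreduce classes. The class name WordCount and the input/output paths taken from the command line are illustrative assumptions, not part of the original deck.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Job.getInstance is the Hadoop 2.x form; older releases construct Job directly.
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar (the jar and path names here are hypothetical), it could be launched with `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output`.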
Who is using it?
• … - 30 nodes
• Facebook - Use for reporting and analytics - 320 nodes
• FOX - Use for log analysis and data mining - 140 nodes
• Last.fm - Use for chart calculation and log analysis - 27 nodes
• New York Times - Use for large-scale image conversion - 100 nodes
• Yahoo! - Use for Ad systems and Web search - 10,000 nodes
Recommended Hardware
• 4 core CPU
• 4-8 GB of RAM using ECC memory
• 4 x 1 TB SATA drives
• 1-5 TB external storage
Typically arranged in a 2-level architecture
• 30/40 nodes per rack
Challenges
• … than 150 parameters (see the configuration sketch below).
• No security against accidents. User identification was added after Last.fm deleted a filesystem by accident.
• HDFS is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file.
• Steep learning curve. According to Facebook, using Hadoop was not easy for end users, especially those who were not familiar with MapReduce.
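To give a sense of the large configuration surface mentioned above, the sketch below sets two of the better-known HDFS parameters programmatically. The property names vary between Hadoop versions (for example, older releases use dfs.block.size rather than dfs.blocksize), and the values are illustrative, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class TuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Two of the many tunables; values here are only examples.
    conf.setInt("dfs.replication", 3);                   // copies kept of each block
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // HDFS block size in bytes

    System.out.println("dfs.replication = " + conf.get("dfs.replication"));
    System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize"));
  }
}
```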