The Performance of MapReduce: An In-depth Study

Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu
School of Computing, NUS
Presented by Tang Kai

▪ Introduction
▪ Factors affecting Performance of MR
▪ Pruning search space
▪ Implementation
▪ Benchmark

▪ MapReduce-based systems are increasingly being used.
  ◦ Simple yet impressive interface
    → Map() Reduce()
  ◦ Flexible
    → Storage system independence
  ◦ Scalable
  ◦ Fine-grain fault tolerance

▪ Previous study
  ◦ Fundamental difference
    → Schema support
    → Data access
    → Fault tolerance
  ◦ Benchmark
    → Parallel DB >> MR-based

▪ Is it not possible to have a flexible, scalable and efficient MapReduce-based system?
▪ Works
  ◦ Identify several performance bottlenecks
  ◦ Manage bottlenecks and tune performance
    → well-known engineering and database techniques
▪ Conclusion
  ◦ 2.5x-3.5x performance improvement

▪ 7 steps of a MapReduce job
  1) Map
  2) Parse
  3) Process
  4) Sort
  5) Shuffle
  6) Merge
  7) Reduce
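
The seven steps above can be sketched as a small in-memory pipeline. This is a minimal illustration, not Hadoop's implementation: `run_job`, its parameters, and the reading of the first three steps as "read a split, parse each record, apply the user's map function" are assumptions made for the example.

```python
from heapq import merge
from itertools import groupby


def run_job(splits, parse, map_fn, reduce_fn, num_reducers=2):
    """Toy single-process walk through the map, sort, shuffle, merge and reduce phases."""
    first = lambda kv: kv[0]

    # Map side: read each split, parse raw records, process them with map_fn,
    # then sort the intermediate <k,v> pairs of that split by key.
    map_outputs = []
    for split in splits:
        pairs = []
        for raw in split:
            record = parse(raw)            # parse: raw data -> record
            pairs.extend(map_fn(record))   # process: record -> <k,v> pairs
        pairs.sort(key=first)              # sort: map-side sort by key
        map_outputs.append(pairs)

    # Shuffle: route every pair to a reducer by hashing its key.
    # Each reducer receives one sorted run per map output.
    partitions = [[[] for _ in map_outputs] for _ in range(num_reducers)]
    for i, run in enumerate(map_outputs):
        for key, value in run:
            partitions[hash(key) % num_reducers][i].append((key, value))

    # Merge + reduce: merge the sorted runs, then reduce each key group.
    results = {}
    for runs in partitions:
        merged = merge(*runs, key=first)
        for key, group in groupby(merged, key=first):
            results[key] = reduce_fn(key, [v for _, v in group])
    return results
```

Called with `parse=str.split`, a map function that emits `(word, 1)` and a reduce function that sums, this behaves like a word count over the input splits.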

▪ I/O mode
▪ Indexing
▪ Parsing
▪ Sorting

▪ Direct I/O
  ◦ Reads data from the disk directly
  ◦ Local
▪ Streaming I/O
  ◦ Streams data from the storage system via an inter-process communication scheme, such as TCP/IP or JDBC
  ◦ Local and remote
▪ Direct I/O outperforms streaming I/O by 10%-15%
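
The two access paths can be contrasted with a short sketch. It assumes "direct" simply means the reader opens the local file itself, while "streaming" means a separate server thread pushes the same bytes over a socket; `read_direct` and `read_streaming` are hypothetical helper names, and a local socket pair stands in for the TCP/IP connection to a storage node.

```python
import socket
import threading


def read_direct(path, chunk_size=64 * 1024):
    """Direct path: the reading process opens the local file itself."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


def read_streaming(path, chunk_size=64 * 1024):
    """Streaming path: a server thread sends the file over a socket."""
    server_sock, client_sock = socket.socketpair()

    def serve():
        # Closing server_sock at the end of the with-block signals EOF.
        with server_sock, open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                server_sock.sendall(chunk)

    server = threading.Thread(target=serve)
    server.start()
    chunks = []
    with client_sock:
        while True:
            chunk = client_sock.recv(chunk_size)
            if not chunk:  # empty read: the server closed its end
                break
            chunks.append(chunk)
    server.join()
    return b"".join(chunks)
```

Both functions return the same bytes; the streaming version adds an extra copy through the socket and a second thread of control, which is the kind of overhead the direct mode avoids.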

▪ Input of a MapReduce job
  ◦ a set of files stored in a distributed file system, e.g. HDFS
    → Range indexes
  ◦ input HDFS files are not sorted, but each data chunk in the files is indexed by keys
    → Block-level indexes
  ◦ tables stored in database servers
    → Database indexed tables
▪ Boost selection task 2x-10x depending on the selectivity
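
A block-level index of the kind described above can be sketched as follows. The example assumes the index stores only the minimum and maximum key of each chunk, which is one simple possibility and not necessarily what the authors built; `build_block_index` and `select_range` are hypothetical names.

```python
def build_block_index(chunks, key):
    """For each (unsorted) chunk, record the smallest and largest key it holds."""
    return [(min(key(r) for r in chunk), max(key(r) for r in chunk))
            for chunk in chunks]


def select_range(chunks, index, key, lo, hi):
    """Return records with lo <= key <= hi and the number of chunks scanned."""
    hits = []
    scanned = 0
    for chunk, (chunk_min, chunk_max) in zip(chunks, index):
        if chunk_max < lo or chunk_min > hi:
            continue  # the key range cannot overlap: skip the whole chunk
        scanned += 1
        hits.extend(r for r in chunk if lo <= key(r) <= hi)
    return hits, scanned
```

The more selective the predicate, the more chunks are skipped without being read, which is consistent with the speedup depending on selectivity.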

▪ Raw data -> <k,v> pair
▪ Immutable decoding
  ◦ Read-only records (set once)
▪ Mutable decoding
  → Mutable decoder is 10x faster.
  ◦ Boosts selection task 2x overall
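
The difference between the two decoding schemes can be illustrated with a small sketch. It assumes a "|"-delimited text record with two fields; the `UserVisit` class, the `ad_revenue` field, and the delimiter are hypothetical choices for the example. The immutable decoder allocates a fresh record object per input line, while the mutable decoder overwrites a single reused object.

```python
class UserVisit:
    """Minimal record type; field names are illustrative assumptions."""
    __slots__ = ("source_ip", "ad_revenue")

    def __init__(self, source_ip="", ad_revenue=0.0):
        self.source_ip = source_ip
        self.ad_revenue = ad_revenue


def decode_immutable(lines):
    """Immutable style: one new record object per line, never modified afterwards."""
    for line in lines:
        ip, revenue = line.split("|")
        yield UserVisit(ip, float(revenue))


def decode_mutable(lines):
    """Mutable style: a single record object is reused and overwritten per line."""
    record = UserVisit()
    for line in lines:
        ip, revenue = line.split("|")
        record.source_ip = ip
        record.ad_revenue = float(revenue)
        yield record  # callers must copy fields they want to keep
```

Reuse avoids one object allocation per record, but a caller that stores the yielded objects will end up holding many references to the same, last-decoded record.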

▪ Map-side sorting affects performance of aggregation
  ◦ Cost of key comparison is non-trivial.
▪ Example
  ◦ sourceIP in UserVisits table
  ◦ Sort intermediate records.
  ◦ sourceIP is a variable-length string
    → String compare (byte-to-byte)
    → Fingerprint compare (integer)
    → Fingerprint-based is 4x-5x faster.
  ◦ 20%-25% faster overall
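
Fingerprint-based comparison can be sketched as sorting on an integer derived from the key and falling back to the string only when two fingerprints tie. The sketch assumes CRC-32 as the fingerprint function, which is a stand-in and not necessarily the function the authors used; `fingerprint`, `sort_by_fingerprint`, and `aggregate` are hypothetical names.

```python
import zlib
from itertools import groupby


def fingerprint(key):
    """32-bit integer fingerprint of a string key (CRC-32 is an assumed choice)."""
    return zlib.crc32(key.encode("utf-8"))


def sort_by_fingerprint(records):
    """Sort (key, value) pairs by fingerprint; compare strings only on a tie."""
    return sorted(records, key=lambda kv: (fingerprint(kv[0]), kv[0]))


def aggregate(sorted_records):
    """Sum values per key; relies on equal keys being adjacent after sorting."""
    return {key: sum(value for _, value in group)
            for key, group in groupby(sorted_records, key=lambda kv: kv[0])}
```

The resulting order is not lexicographic, but aggregation only needs equal keys to be adjacent, so the cheaper integer comparison is sufficient for grouping.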

▪ Why
  ◦ 4 factors
    → Resulting in large search space (2*2*3*2)
  ◦ Budget limit on Amazon EC2
▪ Greedy

▪ Greedy Strategy
  ◦ I/O mode: Direct I/O, Stream I/O
  ◦ Parser: Hadoop Writable, Google's ProtocolBuffer, Berkeley DB
  ◦ Different sort schemes
  ◦ In various architectures
  ◦ Benchmark: 3 datasets, 4 queries
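
The saving from a greedy search over a 2*2*3*2 space can be sketched as follows. The assignment of the counts to particular factors, the option labels in `FACTORS`, and `toy_cost` are all assumptions for illustration: `toy_cost` only rewards list position and is not a measurement of any configuration. The greedy tuner fixes one factor at a time, keeping the cheapest option before moving to the next factor.

```python
from itertools import product

# Hypothetical factor/option layout matching a 2*2*3*2 search space.
FACTORS = {
    "io_mode": ["streaming", "direct"],
    "decoding": ["immutable", "mutable"],
    "record_format": ["writable", "protobuf", "bdb"],
    "sort_compare": ["string", "fingerprint"],
}


def toy_cost(config):
    """Placeholder cost: later list positions are cheaper. Not real data."""
    return -sum(FACTORS[name].index(option) for name, option in config.items())


def greedy_tune(factors, cost):
    """Tune one factor at a time; return the final config and the run count."""
    config = {name: options[0] for name, options in factors.items()}
    evaluations = 0
    for name, options in factors.items():
        best_option, best_cost = None, None
        for option in options:
            trial = dict(config, **{name: option})
            trial_cost = cost(trial)  # stands in for one benchmark run
            evaluations += 1
            if best_cost is None or trial_cost < best_cost:
                best_option, best_cost = option, trial_cost
        config[name] = best_option
    return config, evaluations


def exhaustive_size(factors):
    """Number of configurations a full grid search would have to run."""
    return len(list(product(*factors.values())))
```

With this layout the greedy pass needs 2+2+3+2 = 9 runs instead of 24. It finds the global optimum only when the factors are roughly independent, which is the trade-off accepted under a fixed budget.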

▪ Hadoop 0.19.2 as code base
▪ Direct I/O
  ◦ Modification of data node implementation
▪ Text decoder
  ◦ Immutable: same as DeWitt
  ◦ Mutable: by ourselves
▪ Binary decoder
  ◦ Hadoop
    → Immutable Writable decoder
    → Mutable using Hadoop API by ourselves
  ◦ Google Protocol Buffers
    → Built-in compiler -> mutable
    → Immutable by ourselves
  ◦ Berkeley DB
    → BDB binding API (mutable)

▪ Amazon EC2 (Elastic Compute Cloud)
  ◦ 7.5GB memory
  ◦ 2 virtual cores
  ◦ 64-bit Fedora 8
▪ Tuning EC2 disk I/O by shifting peak time.
▪ Hadoop Setting
  ◦ Block size of HDFS: 512MB
  ◦ Heap size of JVM: 1024MB

▪ Results for different I/O modes
  ◦ Single node
  ◦ No-op job w/ map w/o reduce

▪ Results for record parsing
  ◦ Run in a Java process instead of a MapReduce job
  ◦ Timing starts after the data is loaded into memory
▪ Mutable > Immutable
  ◦ Mutable text > mutable binary

▪ Among Hadoop-based systems
  ◦ Cache factor
▪ Between Hadoop-based systems and Parallel DB
  ◦ Close

▪ Selection task -> scan -> Index
▪ Caching
▪ Indexing

▪ Parsing: 2x faster
▪ Sorting: 20%-25% faster
  ◦ Not significant in small size aggregation task

▪ On decoding scheme
▪ Comparison of tuned MR-based & Parallel DB

▪ Cons
  ◦ Need to be committed/forked to Hadoop source code tree
  ◦ A complete framework is needed instead of miscellaneous patches.
  ◦ Various API support: CLI, Web rather than Java.
▪ Future work
  ◦ Provide query parser, optimizer etc. to build a complete solution
  ◦ Elastic power-aware data intensive Cloud
▪ http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz

Tenzing: A SQL Implementation On The MapReduce Framework

