The Performance of MapReduce: An In-depth Study

Kevin Tong

May 13, 2013

Transcript

  1. Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu, School of Computing, NUS
    Presented by Tang Kai
  2. Introduction
    Factors affecting Performance of MR
    Pruning search space
    Implementation
    Benchmark
  3. MapReduce-based systems are increasingly being used.
      ◦ Simple yet impressive interface: Map(), Reduce()
      ◦ Flexible: storage system independence
      ◦ Scalable
      ◦ Fine-grain fault tolerance
  4. Previous study
      ◦ Fundamental differences: schema support, data access, fault tolerance
      ◦ Benchmark: Parallel DB >> MR-based
  5. Is it not possible to have a flexible, scalable, and efficient MapReduce-based system?
    Work
      ◦ Identify several performance bottlenecks
      ◦ Manage bottlenecks and tune performance using well-known engineering and database techniques
    Conclusion
      ◦ 2.5x-3.5x
  6. Introduction
    Factors affecting Performance of MR
    Pruning search space
    Implementation
    Benchmark
  7. 7 steps of a MapReduce job
    1) Map 2) Parse 3) Process 4) Sort 5) Shuffle 6) Merge 7) Reduce
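
The step names on this slide can be traced through a toy, single-process job. This is only a sketch under assumptions: the comma-separated input format, the positive-value filter standing in for "Process", and the byte-sum partitioner are all hypothetical choices made for illustration, and nothing here is Hadoop code.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter


def parse(line):
    """Parse: decode a raw text line "key,value" into a (key, value) record."""
    key, value = line.split(",")
    return key, int(value)


def map_task(lines, num_reducers):
    """Map side: parse and process each line, then sort one run per reducer."""
    partitions = defaultdict(list)
    for line in lines:
        key, value = parse(line)                      # Parse
        if value > 0:                                 # Process (hypothetical filter)
            part = sum(key.encode()) % num_reducers   # deterministic toy partitioner
            partitions[part].append((key, value))
    return {p: sorted(recs) for p, recs in partitions.items()}  # Sort


def run_job(splits, num_reducers=2):
    """Run one map task per input split, then shuffle, merge, and reduce."""
    map_outputs = [map_task(split, num_reducers) for split in splits]
    result = {}
    for r in range(num_reducers):
        runs = [out[r] for out in map_outputs if r in out]      # Shuffle
        merged = merge(*runs)                                    # Merge sorted runs
        for key, group in groupby(merged, key=itemgetter(0)):   # Reduce: sum per key
            result[key] = sum(value for _, value in group)
    return result
```

For example, `run_job([["a,1", "b,2", "a,3"], ["b,4", "c,0", "a,5"]])` sums the values per key and drops the `c` record, because its value fails the toy filter.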
  8. Direct I/O
      ◦ Reads data from the disk directly
      ◦ Local
    Streaming I/O
      ◦ Streams data from the storage system through an inter-process communication scheme, such as TCP/IP or JDBC
      ◦ Local and remote
    Direct I/O outperforms streaming I/O by 10%-15%
  9. Input of a MapReduce job
      ◦ A set of files stored in a distributed file system, e.g. HDFS
    Ranged-indexes
      ◦ Input HDFS files are not sorted, but each data chunk in the files is indexed by keys
    Block-level indexes
      ◦ Tables stored in database servers
    Database indexed tables
    Boost selection task 2x-10x, depending on the selectivity
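
Why an index helps a selection task can be sketched as follows. The `IndexedChunk` class, its in-memory layout, and the sample records are hypothetical and are not taken from the paper's implementation. The idea shown is only that a per-chunk key index lets a range selection touch the matching records instead of scanning every record.

```python
from bisect import bisect_left, bisect_right


class IndexedChunk:
    """A data chunk plus a sorted key index over its records (hypothetical layout)."""

    def __init__(self, records, key_fn):
        self.records = records
        # The records stay unsorted; only the (key, position) index is sorted.
        index = sorted((key_fn(rec), pos) for pos, rec in enumerate(records))
        self.keys = [key for key, _ in index]
        self.positions = [pos for _, pos in index]

    def select_range(self, low, high):
        """Return records with low <= key <= high using binary search on the index."""
        start = bisect_left(self.keys, low)
        end = bisect_right(self.keys, high)
        return [self.records[pos] for pos in self.positions[start:end]]


def full_scan(records, key_fn, low, high):
    """Baseline without an index: test every record."""
    return [rec for rec in records if low <= key_fn(rec) <= high]
```

The fewer records a predicate matches, the more work the index lookup saves over the full scan, which fits the slide's remark that the gain depends on selectivity.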
  10. Raw data -> <k,v> pair
    Immutable decoding
      ◦ Read-only records (set once)
    Mutable decoding
    Mutable decoder is 10x faster.
      ◦ Boosts selection task 2x overall
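
The difference between the two decoding schemes can be sketched as below. The `Record` class, the `|`-separated line format, and the `ad_revenue` field are assumptions made for illustration. The immutable variant allocates a fresh record per line, while the mutable variant overwrites one reused object. The sketch shows only the object-reuse pattern; it is not expected to reproduce the speedup reported on the slide.

```python
class Record:
    """A decoded record with two illustrative fields."""
    __slots__ = ("source_ip", "ad_revenue")

    def __init__(self):
        self.source_ip = ""
        self.ad_revenue = 0.0


def decode_immutable(lines):
    """Allocate a new Record for every input line; each record is set once."""
    for line in lines:
        ip, revenue = line.split("|")
        rec = Record()
        rec.source_ip, rec.ad_revenue = ip, float(revenue)
        yield rec


def decode_mutable(lines):
    """Reuse a single Record; callers must copy any field they want to keep."""
    rec = Record()
    for line in lines:
        ip, revenue = line.split("|")
        rec.source_ip, rec.ad_revenue = ip, float(revenue)
        yield rec
```

A consumer that processes records one at a time, such as `sum(r.ad_revenue for r in decode_mutable(lines))`, is safe with the mutable decoder. Collecting the yielded objects into a list is not, because every entry refers to the same object.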
  11. Map-side sorting affects performance of aggregation
      ◦ Cost of key comparison is non-trivial.
    Example
      ◦ sourceIP in UserVisits table
      ◦ Sort intermediate records.
      ◦ sourceIP is a variable-length string: string compare (byte-to-byte) vs. fingerprint compare (integer)
    Fingerprint-based is 4x-5x faster.
      ◦ 20%-25% overall
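
A minimal sketch of fingerprint-based sorting for aggregation follows. The slide does not say which fingerprint function is used, so CRC32 is an assumption here, as are the function names. Records are ordered by the integer fingerprint first, and the string key is compared only when two fingerprints are equal. The resulting order is not lexicographic, but equal keys still end up adjacent, which is all an aggregation needs.

```python
import zlib
from itertools import groupby


def fingerprint(key):
    """Map a variable-length string key to a 32-bit integer (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8"))


def sort_with_fingerprint(records):
    """Sort (key, value) records by (fingerprint, key, value) tuples."""
    decorated = [(fingerprint(key), key, value) for key, value in records]
    decorated.sort()  # the integer decides; the string is compared only on ties
    return decorated


def aggregate(records):
    """Sum values per key over the fingerprint-sorted records."""
    totals = {}
    ordered = sort_with_fingerprint(records)
    for (_, key), group in groupby(ordered, key=lambda t: (t[0], t[1])):
        totals[key] = sum(value for _, _, value in group)
    return totals
```

Grouping on the pair (fingerprint, key) rather than on the fingerprint alone keeps the result correct even if two different keys happen to share a fingerprint.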
  12. Why
      ◦ 4 factors, resulting in a large search space (2*2*3*2)
      ◦ Budget limit on Amazon EC2
    Greedy
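
The pruning argument can be made concrete with a small sketch. The slide gives only the option counts (2*2*3*2 = 24 combinations), so the factor names, the option names, and the cost function passed in are hypothetical. Exhaustive search evaluates every combination, while a greedy pass tunes one factor at a time and needs 2+2+3+2 = 9 evaluations.

```python
from itertools import product

# Hypothetical factor and option names; only the counts 2, 2, 3, 2 come from the slide.
FACTORS = {
    "factor_a": ["a1", "a2"],
    "factor_b": ["b1", "b2"],
    "factor_c": ["c1", "c2", "c3"],
    "factor_d": ["d1", "d2"],
}


def exhaustive(factors, cost):
    """Evaluate every combination; return the best config and the number of runs."""
    names = list(factors)
    configs = [dict(zip(names, combo)) for combo in product(*factors.values())]
    return min(configs, key=cost), len(configs)


def greedy(factors, cost):
    """Tune one factor at a time, keeping the best option found so far."""
    config = {name: options[0] for name, options in factors.items()}
    runs = 0
    for name, options in factors.items():
        best_cost, best_option = None, None
        for option in options:
            trial = dict(config, **{name: option})
            runs += 1
            trial_cost = cost(trial)
            if best_cost is None or trial_cost < best_cost:
                best_cost, best_option = trial_cost, option
        config[name] = best_option
    return config, runs
```

When the factors contribute independently to the cost, the greedy pass finds the same configuration as the exhaustive search with far fewer runs. If the factors interact, it can miss the best combination, which is the price of staying within a benchmarking budget.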
  13. Greedy Strategy
    I/O mode: Direct I/O, Stream I/O
    Parser: Hadoop Writable, Google's ProtocolBuffer, Berkeley DB
    Different sort schemes in various architectures
    Benchmark: 3 datasets, 4 queries
  14. Introduction
    Factors affecting Performance of MR
    Pruning search space
    Implementation
    Benchmark
  15. Hadoop 0.19.2 as code base
    Direct I/O
      ◦ Modification of data node implementation
    Text decoder
      ◦ Immutable: same as DeWitt
      ◦ Mutable: by ourselves
    Binary decoder
      ◦ Hadoop: immutable Writable decoder; mutable using Hadoop API by ourselves
      ◦ Google Protocol Buffer: built-in compiler -> mutable; immutable by ourselves
      ◦ Berkeley DB: BDB binding API (mutable)
  16. Amazon EC2 (Elastic Compute Cloud)
      ◦ 7.5GB memory
      ◦ 2 virtual cores
      ◦ 64-bit Fedora 8
    Tuning EC2 disk I/O by shifting peak time.
    Hadoop setting
      ◦ Block size of HDFS: 512MB
      ◦ Heap size of JVM: 1024MB
  17. Introduction
    Factors affecting Performance of MR
    Pruning search space
    Implementation
    Benchmark
  18. Results for record parsing
      ◦ Run in a Java process instead of a MapReduce job
      ◦ Timing starts after the data is loaded into memory
    Mutable > Immutable
      ◦ Mutable text > mutable binary
  19. Among Hadoop-based systems
      ◦ Cache factor
    Between Hadoop-based systems and parallel DB
      ◦ Close
  20. Parsing: 2x faster
    Sorting: 20%-25% faster
      ◦ Not significant in small aggregation tasks
  21. Cons
      ◦ Needs to be committed/forked to the Hadoop source code tree
      ◦ A complete framework is needed instead of miscellaneous patches.
      ◦ Various API support: CLI, Web rather than Java.
    Future work
      ◦ Provide a query parser, optimizer, etc. to build a complete solution
      ◦ Elastic power-aware data-intensive cloud
    http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz
    Tenzing: A SQL Implementation On The MapReduce Framework