Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu
School of Computing, NUS
Presented by Tang Kai
Slide 2
Introduction
Factors affecting Performance of MR
Pruning search space
Implementation
Benchmark
Slide 3
MapReduce-based systems are increasingly being used.
◦ Simple yet impressive interface
Map() Reduce()
◦ Flexible
Storage system independence
◦ Scalable
◦ Fine-grain fault tolerance
Slide 4
Previous study
◦ Fundamental difference
Schema support
Data access
Fault tolerance
◦ Benchmark
Parallel DB >> MR-based
Slide 5
Is it not possible to have a flexible, scalable, and efficient MapReduce-based system?
This work
◦ Identifies several performance bottlenecks
◦ Manages the bottlenecks and tunes performance
using well-known engineering and database techniques
Conclusion
◦ Tuning improves performance by 2.5x-3.5x
Slide 7
7 steps of a MapReduce job
1) Map
2) Parse
3) Process
4) Sort
5) Shuffle
6) Merge
7) Reduce
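The seven steps can be sketched as a tiny in-memory pipeline. This is only an illustration, not Hadoop code: `run_job`, `map_fn`, `reduce_fn`, and `num_reducers` are hypothetical names, whitespace splitting stands in for a real parser, and reading step 1 ("Map") as "read one input split" is an assumption.

```python
import heapq
from collections import defaultdict
from itertools import groupby


def run_job(splits, map_fn, reduce_fn, num_reducers=2):
    """Toy, single-process walk through the seven steps (illustrative only)."""
    runs = defaultdict(list)  # reducer id -> sorted runs produced by mappers
    for split in splits:                           # 1) Map: read one input split
        buckets = defaultdict(list)
        for line in split:
            record = line.split()                  # 2) Parse: raw text -> record
            for key, value in map_fn(record):      # 3) Process: user map function
                buckets[hash(key) % num_reducers].append((key, value))
        for reducer, pairs in buckets.items():
            runs[reducer].append(sorted(pairs))    # 4) Sort: per-partition sort

    output = {}
    for reducer in range(num_reducers):
        fetched = runs[reducer]                    # 5) Shuffle: fetch this reducer's runs
        merged = heapq.merge(*fetched)             # 6) Merge: combine the sorted runs
        for key, group in groupby(merged, key=lambda kv: kv[0]):
            output[key] = reduce_fn(key, [v for _, v in group])  # 7) Reduce
    return output
```

A word count would call it as `run_job(splits, lambda rec: [(w, 1) for w in rec], lambda k, vs: sum(vs))`. Steps 2 and 4 are where the parsing and sorting factors discussed later come into play.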
Slide 8
I/O mode
Indexing
Parsing
Sorting
Slide 9
Direct I/O
◦ read data from the disk directly
◦ Local
Streaming I/O
◦ streams data from the storage system through an inter-process communication scheme, such as TCP/IP or JDBC
◦ Local and remote
Direct I/O outperforms streaming I/O by 10%-15%
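The difference between the two modes can be illustrated with a minimal sketch. It is not the Hadoop data node code: `direct_read` and `streaming_read` are hypothetical helpers, and a local socket pair stands in for the TCP/IP channel between a reader and a storage process.

```python
import socket
import threading


def direct_read(path):
    """Direct I/O: the reader opens the local file itself."""
    with open(path, "rb") as f:
        return f.read()


def streaming_read(path, chunk_size=4096):
    """Streaming I/O: a separate 'storage' thread pushes the bytes over a socket."""
    server, client = socket.socketpair()

    def serve():
        # Closing the server socket on exit signals end-of-stream to the reader.
        with server, open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                server.sendall(chunk)

    worker = threading.Thread(target=serve)
    worker.start()
    parts = []
    with client:
        while chunk := client.recv(chunk_size):
            parts.append(chunk)
    worker.join()
    return b"".join(parts)
```

Both functions return the same bytes; the streaming path simply adds an extra copy and an inter-process hop, which is the kind of overhead the 10%-15% figure refers to.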
Slide 10
Input of a MapReduce job
◦ a set of files stored in a distributed file system, e.g. HDFS
Range indexes
◦ input HDFS files are sorted by key
Block-level indexes
◦ input HDFS files are not sorted, but each data chunk in the files is indexed by keys
Database indexed tables
◦ tables stored in database servers
Indexing boosts the selection task by 2x-10x, depending on the selectivity
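Why an index helps a selection task can be shown with a small sketch. It assumes a simple index that stores the minimum and maximum key of each data chunk; `select_with_index` and the chunk layout are hypothetical and do not reflect the actual Hadoop implementation.

```python
def select_with_index(chunks, index, lo, hi):
    """Return records with lo <= key <= hi, skipping chunks the index rules out.

    chunks: list of chunks, each a list of (key, value) records
    index:  list of (min_key, max_key), one entry per chunk
    """
    scanned = 0
    result = []
    for (chunk_min, chunk_max), chunk in zip(index, chunks):
        if chunk_max < lo or chunk_min > hi:
            continue  # key range cannot match: the chunk is never read
        scanned += 1
        result.extend((k, v) for k, v in chunk if lo <= k <= hi)
    return result, scanned
```

The more selective the predicate, the more chunks are skipped, which is consistent with the gain depending on selectivity.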
Slide 11
Raw data -> key/value pair
Immutable decoding
◦ Read-only records (set once)
Mutable decoding
◦ Records are reused and overwritten for each input
Mutable decoder is 10x faster.
◦ Boosts the selection task by 2x overall
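The two decoding schemes differ only in whether a new record object is created per input line. The sketch below is illustrative: the `UserVisit` class, its two fields, and the `|` delimiter are assumptions, and the speedup quoted on the slide was measured on Java decoders, not on this Python code.

```python
class UserVisit:
    """Minimal record with two illustrative fields."""
    __slots__ = ("source_ip", "ad_revenue")

    def __init__(self, source_ip="", ad_revenue=0.0):
        self.source_ip = source_ip
        self.ad_revenue = ad_revenue


def decode_immutable(lines):
    """Immutable decoding: every line yields a fresh object that is set once."""
    for line in lines:
        ip, revenue = line.split("|")
        yield UserVisit(ip, float(revenue))


def decode_mutable(lines):
    """Mutable decoding: one object is allocated and overwritten for each line."""
    record = UserVisit()
    for line in lines:
        ip, revenue = line.split("|")
        record.source_ip = ip
        record.ad_revenue = float(revenue)
        yield record  # same object every time; consume it before the next step
```

The mutable variant avoids one allocation per record, but callers must not hold on to a yielded record, because the next iteration overwrites it.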
Slide 12
Map-side sorting affects performance of
aggregation
◦ Cost of key comparison is non-trivial.
Example
◦ SourceIP in UserVisits Table
◦ Sort intermediate records.
◦ sourceIP is a variable-length string
String compare (byte-by-byte)
Fingerprint compare (integer)
Fingerprint-based comparison is 4x-5x faster.
◦ 20%-25% faster overall
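A minimal sketch of fingerprint-based comparison follows. Using CRC32 as the fingerprint is an assumption made for illustration (the slides do not name the hash function), and `sort_for_grouping` is a hypothetical helper.

```python
import zlib
from functools import cmp_to_key


def fingerprint(key: bytes) -> int:
    """Fixed-size integer fingerprint of a variable-length key (CRC32 assumed)."""
    return zlib.crc32(key)


def compare(a, b):
    """Compare (fingerprint, key) pairs: integers first, bytes only on a tie."""
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    # Equal fingerprints: either the same key or a collision, so fall back
    # to the byte-by-byte comparison.
    return (a[1] > b[1]) - (a[1] < b[1])


def sort_for_grouping(keys):
    """Sort keys so that equal keys end up adjacent."""
    decorated = [(fingerprint(k), k) for k in keys]  # fingerprint computed once per key
    decorated.sort(key=cmp_to_key(compare))
    return [k for _, k in decorated]
```

The resulting order follows the fingerprints rather than the lexicographic order of the strings, which is sufficient for aggregation because only the grouping of equal keys matters.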
Slide 13
Why prune?
◦ 4 factors
Resulting in a large search space (2*2*3*2 = 24 configurations)
◦ Budget limit on Amazon EC2
Greedy strategy
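The saving from a greedy search can be sketched as follows. Everything here is an assumption for illustration: the mapping of the counts 2, 2, 3, 2 to factor names, the one-factor-at-a-time procedure in `greedy_tune`, and the toy cost function, which is not a measurement.

```python
from itertools import product


def greedy_tune(factors, cost):
    """Tune one factor at a time, keeping the best option found so far.

    factors: dict mapping factor name -> list of options
    cost:    function(config dict) -> number, lower is better
    Returns the chosen config and the number of distinct configs evaluated.
    """
    config = {name: options[0] for name, options in factors.items()}
    evaluated = set()

    def measure(candidate):
        evaluated.add(tuple(sorted(candidate.items())))
        return cost(candidate)

    for name, options in factors.items():
        config[name] = min(options, key=lambda opt: measure({**config, name: opt}))
    return config, len(evaluated)


# Hypothetical option counts matching 2*2*3*2; options are just indices.
FACTORS = {
    "io_mode": [0, 1],
    "indexing": [0, 1],
    "parsing": [0, 1, 2],
    "sorting": [0, 1],
}
EXHAUSTIVE_RUNS = len(list(product(*FACTORS.values())))
```

With these counts an exhaustive search needs 24 benchmark runs, while the greedy pass evaluates 6 distinct configurations. The greedy result is only guaranteed to be optimal when the factors do not interact.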
Slide 14
Greedy Strategy
◦ I/O mode
Direct I/O
Streaming I/O
◦ Parser
Hadoop Writable
Google's Protocol Buffers
Berkeley DB
◦ Different sort schemes
◦ In various architectures
Benchmark
◦ 3 datasets
◦ 4 queries
Slide 16
Hadoop 0.19.2 as code base
Direct I/O
◦ Modification of data node implementation
Text decoder
◦ Immutable: same as DeWitt
◦ Mutable: implemented by ourselves
Binary decoder
◦ Hadoop
Immutable: Writable decoder
Mutable: implemented by ourselves using the Hadoop API
◦ Google Protocol Buffers
Mutable: generated by the built-in compiler
Immutable: implemented by ourselves
◦ Berkeley DB
BDB binding API (mutable)
Slide 19
Results for different I/O mode
◦ Single node
◦ No-op job with a map phase and no reduce phase
Slide 20
Results for record parsing
◦ Run in Java process instead of MapReduce job
◦ Timing starts after the data is loaded into memory
Mutable > Immutable
◦ Mutable text > mutable binary
Slide 21
Among Hadoop-based systems
◦ Cache factor
Between Hadoop-based systems and parallel DBs
◦ Performance is close
Parsing: 2x faster
Sorting: 20%-25% faster
◦ Not significant in small aggregation tasks
Slide 24
On decoding scheme
Comparison of tuned MR-based & Parallel DB
Slide 25
Cons
◦ Changes need to be committed to, or forked from, the Hadoop source code tree
◦ A complete framework is needed instead of miscellaneous patches.
◦ Support for various APIs is needed: CLI and Web, rather than only Java.
Future work
◦ Provide a query parser, optimizer, etc. to build a complete solution
◦ Elastic power-aware data intensive Cloud
http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz