Shark: a better ad hoc query engine, faster than Hive

Shark: a better ad hoc query engine, faster than Hive
Zhu Guangbin

Agenda
- What is Shark?
- What does Shark do?
- How does Shark perform?
- Further work

What is Shark?
A data analysis (warehouse) system that:
- builds on Spark (MapReduce-style deterministic, idempotent tasks)
- scales out and is fault-tolerant
- supports low-latency, interactive queries through in-memory computation
- supports both SQL and complex analytics such as machine learning
- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata)

Shark is Hive on Spark

What does Shark do?
- Runs ad hoc and interactive queries at least 10x faster than Hive
- Combines HQL with complex ML analysis
- Is compatible with HQL and fully supports Hive UDFs
- Serves as an alternative to Hive for ad hoc queries

Most queries are simple, but running them on Hive is too expensive
Our queries on HiveWeb

Why is Hive so slow?
- Disk-based intermediate outputs
- Inferior data format and layout (no control of data co-partitioning)
- Execution strategies (lack of optimization based on data statistics)
- Task scheduling and launch overhead

MapReduce Model
The shuffle is expensive:
1. Map output is persisted to disk (disk I/O)
2. The reduce side fetches the map output (network I/O)
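The two shuffle costs above can be made concrete with a toy word count. This is a minimal sketch, not Hadoop code: the function names, the JSON file layout, and the byte-sum partitioner are all assumptions chosen for illustration, and a local file read stands in for the network fetch.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_map_task(task_id, records, num_reducers, out_dir):
    """Map phase: emit (word, 1) pairs and persist them, one file per reducer."""
    buckets = defaultdict(list)
    for line in records:
        for word in line.split():
            # Deterministic toy partitioner (hypothetical, not Hadoop's).
            buckets[sum(word.encode()) % num_reducers].append((word, 1))
    bytes_written = 0
    for r in range(num_reducers):
        path = os.path.join(out_dir, f"map{task_id}_part{r}.json")
        with open(path, "w") as f:
            json.dump(buckets[r], f)  # cost 1: map output goes to disk
        bytes_written += os.path.getsize(path)
    return bytes_written


def run_reduce_task(reducer_id, num_maps, out_dir):
    """Reduce phase: fetch this reducer's partition from every map task, then sum."""
    counts = defaultdict(int)
    for m in range(num_maps):
        path = os.path.join(out_dir, f"map{m}_part{reducer_id}.json")
        with open(path) as f:  # cost 2: stands in for a network fetch
            for word, n in json.load(f):
                counts[word] += n
    return dict(counts)


def word_count(splits, num_reducers=2):
    """Run all map tasks, then all reduce tasks; return counts and bytes spilled."""
    with tempfile.TemporaryDirectory() as out_dir:
        spilled = sum(
            run_map_task(i, split, num_reducers, out_dir)
            for i, split in enumerate(splits)
        )
        result = {}
        for r in range(num_reducers):
            result.update(run_reduce_task(r, len(splits), out_dir))
    return result, spilled


if __name__ == "__main__":
    counts, spilled = word_count([["a b a"], ["b c"]])
    print(counts, "bytes written between stages:", spilled)
```

Note that the number of intermediate files is (map tasks) x (reducers), and every byte of map output is written once and read once before any reducer can finish.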

MapReduce Model

Why is Shark faster?
In-memory computing
- Caches intermediate data in memory rather than persisting it to disk
Columnar memory store
- Compact storage
- JVM garbage collection friendly
- CPU-efficient compression
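To illustrate why a columnar layout is compact and compresses cheaply, here is a small sketch in Python. It is not Shark's implementation (Shark's store lives on the JVM); the column names `dt`, `city`, and `cnt` and the sample rows are hypothetical, and run-length encoding is used only as one example of a CPU-efficient scheme.

```python
from array import array
from itertools import groupby

# Hypothetical sample rows: (date, city, count).
ROWS = [
    ("2013-08-01", "beijing", 3),
    ("2013-08-01", "beijing", 5),
    ("2013-08-01", "shanghai", 2),
    ("2013-08-02", "shanghai", 7),
]


def to_columns(rows):
    """Pivot row tuples into one sequence per column.

    The numeric column becomes a single packed array instead of one object
    per value, which is the idea behind being garbage-collection friendly.
    """
    dates, cities, counts = zip(*rows)
    return {"dt": list(dates), "city": list(cities), "cnt": array("q", counts)}


def rle_encode(values):
    """Run-length encode a column: consecutive repeats become (value, run_length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, n in runs for _ in range(n)]


if __name__ == "__main__":
    cols = to_columns(ROWS)
    encoded = rle_encode(cols["dt"])
    print(encoded)           # low-cardinality column collapses into a few runs
    print(sum(cols["cnt"]))  # an aggregate touches only the column it needs
```

A query such as a sum over `cnt` scans one packed array and never touches the string columns, which is the main reason columnar layouts suit analytic queries.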

Spark Model

How does DPer use Shark?
We have integrated Shark into HiveWeb:
- Just select Shark as the query engine
- Write HQL as you would for Hive, then submit

DEMO TIME

How does Shark perform?
Cluster comparison:
- Hive/MapReduce cluster: 37 nodes, 436 map slots and 436 reduce slots in total, 1 slot per task
- Shark/Spark cluster: 3 nodes, 48 cores and 48 GB memory in total, 12 cores max and 4 GB memory per node for each client
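The size gap between the two clusters is easier to see as ratios. The sketch below only does arithmetic on the figures from the slide. Two things are assumptions: that one map slot and one core each run one task at a time (so slots and cores are roughly comparable), and that the Spark cluster's totals are spread evenly over its 3 nodes.

```python
# Figures copied from the cluster comparison slide.
hive = {"nodes": 37, "map_slots": 436, "reduce_slots": 436}
shark = {"nodes": 3, "cores": 48, "mem_gb": 48}

# How many times more machines the Hive/MapReduce cluster has.
node_ratio = hive["nodes"] / shark["nodes"]

# Assumption: one map slot ~ one concurrently running task ~ one core.
parallelism_ratio = hive["map_slots"] / shark["cores"]

# Assumption: cores and memory are spread evenly across the Spark nodes.
cores_per_node = shark["cores"] // shark["nodes"]
mem_per_node_gb = shark["mem_gb"] // shark["nodes"]

if __name__ == "__main__":
    print(f"Hive cluster has {node_ratio:.1f}x as many nodes")
    print(f"Hive map slots per Spark core: {parallelism_ratio:.1f}")
    print(f"Spark node: {cores_per_node} cores, {mem_per_node_gb} GB")
```

Under these assumptions the Hive cluster has roughly an order of magnitude more nodes and task parallelism, which is the context for the test results that follow.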

Test 1
SQL: our benchmark SQL, using Hive test-case data
Test result: see more on my blog

Test 2
SQL: we sampled 7 HQL queries from HiveWeb history
Test result: see more on the Google Sheet

Conclusion
- Shark is fully compatible with Hive
- Shark is faster, even with fewer nodes

Our Work on Shark
- Online deployment
- Bug fixes for authorization
- Integration with HiveWeb
- Performance tuning

Further work
- Scale out the Spark cluster
- Bug fixes for Hive concurrency and Kerberos issues
- Migrate ad hoc queries to Shark

Thanks
Q&A
See more on my blog: zhuguangbin.github.io
