Shark: a better ad hoc query engine, faster than Hive

zhuguangbin

July 24, 2013

Transcript

  1. What is Shark?

    A data analysis (warehouse) system that:
    - builds on Spark (MapReduce-style deterministic, idempotent tasks)
    - scales out and is fault-tolerant
    - supports low-latency, interactive queries through in-memory computation
    - supports both SQL and complex analytics such as machine learning
    - is compatible with Apache Hive (storage, SerDes, UDFs, types, metadata)
  2. What does Shark do?

    - Runs ad hoc and interactive queries at least 10x faster than Hive
    - Combines HQL with complex ML analysis
    - Is compatible with HQL and fully supports Hive UDFs
    - Serves as an alternative to Hive for ad hoc queries
  3. Why is Hive so slow?

    - Disk-based intermediate outputs
    - Inferior data format and layout (no control of data co-partitioning)
    - Execution strategies (lack of optimization based on data statistics)
    - Task scheduling and launch overhead
  4. Why is Shark faster?

    - In-memory computing: caches intermediate data in memory rather than persisting it to disk
    - Columnar memory store:
      - compact storage
      - JVM garbage-collection friendly
      - CPU-efficient compression
  5. How does DPer use Shark?

    - We have integrated Shark into HiveWeb
    - Just select Shark as the query engine
    - Write HQL as you would for Hive, then submit
  6. How does Shark perform?

    Cluster comparison:
    - Hive/MapReduce cluster: 37 nodes, 436 map slots and 436 reduce slots in total, 1 slot per task
    - Shark/Spark cluster: 3 nodes, 48 cores and 48 GB of memory in total, with at most 12 cores and 4 GB of memory per node for each client
  7. Further work

    - Scale out the Spark cluster
    - Fix bugs in Hive concurrency and Kerberos issues
    - Migrate ad hoc queries to Shark
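
Slide 4 credits much of Shark's speed to a columnar memory store with compact storage and CPU-efficient compression. The sketch below is only an illustration of that general idea in plain Python, not Shark's actual implementation (which runs on the JVM). The sample rows and the helper names `to_columns`, `run_length_encode` and `run_length_decode` are hypothetical. It pivots row tuples into one typed array per column, so that a column is a single object rather than one object per value, and applies run-length encoding, one simple example of a cheap compression scheme, to a repetitive column.

```python
from array import array

# Hypothetical sample rows: (user_id, country, clicks)
rows = [
    (1, "CN", 10),
    (2, "CN", 7),
    (3, "CN", 3),
    (4, "US", 12),
    (5, "US", 5),
]


def to_columns(rows):
    """Pivot row tuples into one sequence per column."""
    user_ids, countries, clicks = zip(*rows)
    # Numeric columns are stored in typed arrays: one object per column
    # instead of one boxed object per value.
    return {
        "user_id": array("q", user_ids),
        "country": list(countries),
        "clicks": array("q", clicks),
    }


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, count] pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


columns = to_columns(rows)
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column touches only that column's array.
total_clicks = sum(columns["clicks"])
```

With this layout, a query such as a sum over `clicks` never reads `user_id` or `country`, and the five country values shrink to two runs. Storing each column as a primitive array is also what keeps the number of heap objects low, which is presumably what the slide means by garbage-collection friendly.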