OakTable World 2014: The Current State of SQL + Hadoop

Greg Rahn
September 29, 2014

Transcript

  1. From #NoSQL to #KnowSQL: The Current State of SQL + Hadoop

    Greg Rahn | @GregRahn | 29 September 2014 | OakTable World
  2. The database community views MapReduce as

    ❖ A giant step backward in the programming paradigm for large-scale data-intensive applications
    ❖ A sub-optimal implementation, in that it uses brute force instead of indexing
    ❖ Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
    ❖ Missing most of the features that are routinely included in current DBMS
    ❖ Incompatible with all of the tools DBMS users have come to depend on
  3. Google Dumps MapReduce

    “We don’t really use MapReduce anymore,” Urs Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.”
  4. Apache Hive

    ❖ Originally developed by Facebook
    ❖ SQL to MapReduce
    ❖ Has been notoriously slow
    ❖ Hortonworks currently leading development (Stinger)
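
The "SQL to MapReduce" idea on the Apache Hive slide can be illustrated with a small, single-process sketch. This is not Hive's actual planner or the Hadoop API; the table `page_views`, its columns, and the helper names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical. It only shows how a `GROUP BY` aggregate decomposes into a map step that emits the grouping key, a shuffle that collects values per key, and a reduce step that applies the aggregate.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(country, bytes_sent).
rows = [
    ("US", 120),
    ("DE", 80),
    ("US", 30),
    ("FR", 50),
    ("DE", 20),
]


def map_phase(row):
    """Emit (key, value) pairs: the GROUP BY column becomes the key."""
    country, bytes_sent = row
    yield country, bytes_sent


def shuffle(pairs):
    """Collect all values per key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Apply the aggregate (SUM here) to one group."""
    return key, sum(values)


def run_query(table):
    # Equivalent in spirit to:
    #   SELECT country, SUM(bytes_sent) FROM page_views GROUP BY country
    pairs = (pair for row in table for pair in map_phase(row))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = run_query(rows)
print(result)
```

Note that every row is scanned regardless of the query, which is the "brute force instead of indexing" behavior criticized on the earlier slide.
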
  5. Project Stinger

    ❖ Move from MapReduce to Tez
    ❖ ORC file format & Vectorization
    ❖ In-memory hash joins (broadcast join)
    ❖ Window functions
    ❖ Decimal, Varchar, Date
    ❖ Limited subquery support
    ❖ No anti-join support
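
The "in-memory hash join (broadcast join)" bullet refers to a general technique, sketched below in plain Python under simplifying assumptions: the small table fits in memory, the same hash table is handed to every partition of the large table, and the join is an inner equi-join. The data and the helper names `build_hash_table` and `probe_partition` are hypothetical and do not reflect how Hive or Tez implement it.

```python
from collections import defaultdict


def build_hash_table(small_rows, key_index):
    """Build phase: index the small (broadcast) table by its join key."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key_index]].append(row)
    return table


def probe_partition(partition, hash_table, key_index):
    """Probe phase: stream one partition of the large table past the hash table."""
    for row in partition:
        # .get() avoids inserting empty entries for keys with no match.
        for match in hash_table.get(row[key_index], []):
            yield row + match


# Hypothetical small dimension table: (country_code, country_name).
dim_country = [("US", "United States"), ("DE", "Germany")]

# Hypothetical large fact table, already split into partitions: (order_id, country_code).
fact_partitions = [
    [(1, "US"), (2, "DE")],
    [(3, "US"), (4, "FR")],
]

# In a cluster, each worker would receive its own copy of this hash table.
hash_table = build_hash_table(dim_country, key_index=0)

joined = []
for partition in fact_partitions:
    joined.extend(probe_partition(partition, hash_table, key_index=1))

print(joined)
```

Because only the small table is copied, the large table never needs to be repartitioned across the network, which is the main appeal of this join strategy.
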
  6. Stinger.next

    ❖ ACID transactions
    ❖ Cost-based query optimization via Apache Optiq
    ❖ Non-equi joins
    ❖ More subquery support
    ❖ Materialized views (DIMMQ)
    ❖ LLAP (Live Long and Process)
  7. Impala

    ❖ Open sourced by Cloudera, October 2012
    ❖ Does not build on top of MapReduce
    ❖ MPP engine for data in HDFS
    ❖ Execution engine written in C++ (LLVM)
    ❖ Uses Parquet file format
    ❖ Currently the fastest OSS SQL for Hadoop
  8. Impala 1.x Additions

    ❖ UDFs & UDAFs
    ❖ Admission Control – allows prioritization and queueing of queries
    ❖ DECIMAL data type
    ❖ Cost-based join reordering
    ❖ In-memory HDFS caching
  9. Impala 2.0 Features

    ❖ Window functions
    ❖ Subqueries in WHERE clause
    ❖ Disk-based Joins
    ❖ Incremental statistics
    ❖ More datatypes and built-in functions
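
For readers unfamiliar with window functions, the following plain-Python sketch shows what a running total such as `SUM(amount) OVER (PARTITION BY region ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` computes. It illustrates the semantics only and says nothing about how Impala executes such queries; the `sales` data and the function name `running_total` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a table sales(region, day, amount).
sales = [
    ("east", 1, 10),
    ("west", 1, 5),
    ("east", 2, 20),
    ("west", 2, 7),
    ("east", 3, 5),
]


def running_total(rows):
    """Return each row extended with a per-region running sum of amount."""
    out = []
    # Sorting by (region, day) handles both PARTITION BY and ORDER BY.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for region, group in groupby(ordered, key=itemgetter(0)):
        total = 0  # the accumulator resets at each partition boundary
        for _, day, amount in group:
            total += amount
            out.append((region, day, amount, total))
    return out


windowed = running_total(sales)
for row in windowed:
    print(row)
```

Unlike `GROUP BY`, the window function keeps every input row and adds a computed column alongside it.
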
  10. Impala 2.1+ Road Map

    ❖ Nested data
    ❖ MERGE
    ❖ ROLLUP, CUBE, GROUPING SETS
    ❖ Sets - MINUS, INTERSECT
    ❖ Apache HBase CRUD
    ❖ UDTFs
    ❖ Intra-node parallelized aggregations and joins
    ❖ Parquet enhancements including index pages
    ❖ Amazon S3 integration
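
As a reminder of what `ROLLUP` on the road map means, the sketch below computes the result of `GROUP BY ROLLUP(region, product)` in plain Python: one total per `(region, product)`, one subtotal per `region`, and one grand total. It shows standard SQL semantics only, not Impala's planned implementation; the data and the function name `rollup` are hypothetical, and `None` stands in for the SQL NULL that marks a super-aggregate row.

```python
from collections import defaultdict

# Hypothetical rows of a table sales(region, product, amount).
rows = [
    ("east", "tv", 10),
    ("east", "radio", 5),
    ("west", "tv", 7),
]


def rollup(table, n_keys):
    """Sum the last column for every prefix of the first n_keys columns."""
    totals = defaultdict(int)
    for row in table:
        keys, amount = row[:n_keys], row[n_keys]
        # ROLLUP(a, b) produces the grouping sets (a, b), (a), and ().
        for prefix_len in range(n_keys, -1, -1):
            group = keys[:prefix_len] + (None,) * (n_keys - prefix_len)
            totals[group] += amount
    return dict(totals)


rolled_up = rollup(rows, n_keys=2)
for group, total in sorted(rolled_up.items(), key=lambda kv: str(kv[0])):
    print(group, total)
```

`CUBE` would extend the same loop to every subset of the grouping columns rather than only the prefixes.
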
  11. Presto

    ❖ Announced by Facebook November 2012
    ❖ Runs as daemon, not SQL to MapReduce
    ❖ Written in Java
    ❖ Bytecode compilation
    ❖ Connectors: Hive, MySQL, PostgreSQL, Kafka, Cassandra
  12. Apache Drill

    ❖ Proposal announced August 2012
    ❖ MapR driving development
    ❖ Least mature of the SQL projects
    ❖ Currently in a 0.5 (pre-production-ready) beta release
    ❖ Uses Optiq for CBO
    ❖ Geared more toward nested / semi-structured data; no metadata definition needed
  13. Shark / Spark SQL

    ❖ Shark == AMPLab’s Hive 0.9.0 on Spark
    ❖ Spark == Alternative to MapReduce
    ❖ July 2014 announcements
    ❖ EOL for Shark
    ❖ Spark SQL
    ❖ New Hive on Spark (HIVE-7292)
    ❖ Spark SQL driven by Databricks
  14. Pivotal HAWQ

    ❖ Announced February 2013
    ❖ Port of Greenplum DB to run on HDFS
    ❖ Rich SQL support, but less than GPDB
    ❖ Recent release added support for Parquet
    ❖ Requires Pivotal HD
  15. IBM Big SQL 3.0

    ❖ Announced April 2014
    ❖ Complete rewrite of Big SQL
    ❖ No longer uses MapReduce
    ❖ Utilizes native and Java open source–based readers/writers
    ❖ Rich SQL support
    ❖ Modern query planner, but new
  16. Actian Vector[wise] (Project Vortex)

    ❖ Announced June 2014 at Hadoop Summit
    ❖ X100 engine ported to run distributed on Hadoop
    ❖ Rich SQL support
    ❖ Integration with YARN
    ❖ Scale up/down elasticity
    ❖ Fastest SQL engine
    ❖ Custom file format
  17. Oracle Big Data SQL

    ❖ Announced July 2014
    ❖ Smart Scan for HDFS
    ❖ Requires Exadata + 12.1.0.2 + BDA
    ❖ Query only, does not write data
  18. Overall Status of SQL + Hadoop

    ❖ SQL outside Hadoop
    ❖ DBMS ports to Hadoop
    ❖ Integrated but emerging