OakTable World 2014: The Current State of SQL + Hadoop

From #NoSQL to #KnowSQL: The Current State of SQL + Hadoop
Greg Rahn | @GregRahn
29 September 2014, OakTable World

The database community views MapReduce as
❖ A giant step backward in the programming paradigm for large-scale data intensive applications
❖ A sub-optimal implementation, in that it uses brute force instead of indexing
❖ Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
❖ Missing most of the features that are routinely included in current DBMS
❖ Incompatible with all of the tools DBMS users have come to depend on

Google Dumps MapReduce
“We don’t really use MapReduce anymore,” Urs Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.”

From #NoSQL to #KnowSQL

Open Source Options

Apache Hive
❖ Originally developed by Facebook
❖ SQL to MapReduce
❖ Has been notoriously slow
❖ Hortonworks currently leading development (Stinger)

Project Stinger
❖ Move from MapReduce to Tez
❖ ORC file format & Vectorization
❖ In-memory hash joins (broadcast join)
❖ Window functions
❖ Decimal, Varchar, Date
❖ Limited subquery support
❖ No anti-join support
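
The in-memory hash join (broadcast join) named above can be illustrated with a minimal Python sketch. It is not Hive or Tez code: the function name `broadcast_hash_join` and the sample tables are hypothetical, and the "broadcast" is simulated by reusing one hash table across list-based partitions.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_partitions, key):
    """Join every partition of a large table against one small table.

    Hypothetical helper for illustration only; a real engine would ship
    the hash table to each node instead of looping over partitions.
    """
    # Build phase: hash the small table once, in memory.
    hash_table = defaultdict(list)
    for row in small_rows:
        hash_table[row[key]].append(row)

    # Probe phase: each partition probes the same hash table,
    # so the large table never needs to be shuffled or sorted.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in hash_table.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Made-up sample data: a small dimension table and a partitioned fact table.
dims = [{"region_id": 1, "region": "EMEA"}, {"region_id": 2, "region": "APAC"}]
facts = [
    [{"region_id": 1, "sales": 10}, {"region_id": 3, "sales": 7}],
    [{"region_id": 2, "sales": 5}, {"region_id": 1, "sales": 2}],
]
result = broadcast_hash_join(dims, facts, "region_id")
```

The fact row with `region_id` 3 has no match and is dropped, which is inner-join behavior. The approach only pays off when the small side fits in memory.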

Stinger.next
❖ ACID transactions
❖ Cost-based query optimization via Apache Optiq
❖ Non-equi joins
❖ More subquery support
❖ Materialized views (DIMMQ)
❖ LLAP (Live Long and Process)

Impala
❖ Open sourced by Cloudera, October 2012
❖ Does not build on top of MapReduce
❖ MPP engine for data in HDFS
❖ Execution engine written in C++ (LLVM)
❖ Uses Parquet file format
❖ Currently the fastest OSS SQL for Hadoop

Impala 1.x Additions
❖ UDFs & UDAFs
❖ Admission Control: allows prioritization and queueing of queries
❖ DECIMAL data type
❖ Cost-based join reordering
❖ In-memory HDFS caching
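
As a rough illustration of what "prioritization and queueing of queries" means, here is a toy Python model. It is not Impala's mechanism or API: the class `AdmissionController`, the `max_running` limit, and the convention that a lower priority number is more urgent are all assumptions made for this sketch.

```python
import heapq
import itertools


class AdmissionController:
    """Toy admission controller: a fixed number of slots plus a priority queue."""

    def __init__(self, max_running):
        self.max_running = max_running
        self.running = set()
        self._queue = []                 # heap of (priority, arrival order, query id)
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, query_id, priority=0):
        """Admit the query if a slot is free; otherwise queue it."""
        if len(self.running) < self.max_running:
            self.running.add(query_id)
            return "admitted"
        heapq.heappush(self._queue, (priority, next(self._order), query_id))
        return "queued"

    def finish(self, query_id):
        """Release a slot and admit the most urgent queued query, if any."""
        self.running.discard(query_id)
        if self._queue and len(self.running) < self.max_running:
            _, _, next_id = heapq.heappop(self._queue)
            self.running.add(next_id)
            return next_id
        return None


ctl = AdmissionController(max_running=2)
statuses = [
    ctl.submit("q1"),
    ctl.submit("q2"),
    ctl.submit("q3", priority=5),
    ctl.submit("q4", priority=1),
]
promoted = ctl.finish("q1")  # q4 is more urgent than q3, so it runs next
```

The point of such a gate is to queue excess queries rather than let them all start and compete for memory.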

Impala 2.0 Features
❖ Window functions
❖ Subqueries in WHERE clause
❖ Disk-based Joins
❖ Incremental statistics
❖ More datatypes and built-in functions
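
For readers unfamiliar with the query shapes behind "Subqueries in WHERE clause", the sketch below runs two of them through Python's built-in `sqlite3` module. SQLite is only a stand-in here, not Impala, and the tables and data are invented. Which forms a given engine and version accepts should be checked against its own documentation.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (1, 50), (1, 20), (3, 500);
""")

# IN subquery (a semi-join): customers with at least one order over 100.
big_spenders = conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders WHERE amount > 100)
    ORDER BY name
""").fetchall()

# NOT EXISTS subquery (an anti-join): customers with no orders at all.
no_orders = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY name
""").fetchall()

conn.close()
```

Without subquery support, both queries have to be rewritten by hand as joins, which is the gap the "limited subquery support" and "no anti-join support" bullets on the Stinger slide point to.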

Impala 2.1+ Road Map
❖ Nested data
❖ MERGE
❖ ROLLUP, CUBE, GROUPING SETS
❖ Sets - MINUS, INTERSECT
❖ Apache HBase CRUD
❖ UDTFs
❖ Intra-node parallelized aggregations and joins
❖ Parquet enhancements including index pages
❖ Amazon S3 integration

Presto
❖ Announced by Facebook November 2012
❖ Runs as daemon, not SQL to MapReduce
❖ Written in Java
❖ Bytecode compilation
❖ Connectors: Hive, MySQL, PostgreSQL, Kafka, Cassandra

Presto
❖ Approximate queries (BlinkDB)
❖ Distinct-limit optimization
❖ Window functions
❖ Amazon S3 support
❖ HyperLogLog
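
The slide does not say how Presto implements its distinct-limit optimization. The sketch below shows only the general idea as commonly understood: a `SELECT DISTINCT ... LIMIT n` can stop scanning once n distinct values have been seen. The function `distinct_limit` and the sample data are hypothetical.

```python
def distinct_limit(rows, limit):
    """Return up to `limit` distinct values and the number of rows scanned.

    Illustrative only: stops reading input as soon as enough distinct
    values are found, instead of de-duplicating the whole input first.
    """
    if limit <= 0:
        return [], 0

    seen = set()
    out = []
    scanned = 0
    for value in rows:
        scanned += 1
        if value not in seen:
            seen.add(value)
            out.append(value)
            if len(out) == limit:
                break  # early exit: the remaining rows are never read
    return out, scanned


values = ["a", "b", "a", "c", "b", "d", "e"]
result, scanned = distinct_limit(values, 3)
```

With this input, the third distinct value turns up at the fourth row, so the last three rows are never read.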

Apache Drill
❖ Proposal announced August 2012
❖ MapR driving development
❖ Most immature of SQL projects
❖ Currently in a 0.5 (pre-production-ready) beta release
❖ Uses Optiq for CBO
❖ Geared more toward nested / semi-structured data; no metadata definition needed

Shark / Spark SQL
❖ Shark == AMPLab’s Hive 0.9.0 on Spark
❖ Spark == Alternative to MapReduce
❖ July 2014 announcements: EOL for Shark; Spark SQL; new Hive on Spark (HIVE-7292)
❖ Spark SQL driven by Databricks

Closed Source Options

Pivotal HAWQ
❖ Announced February 2013
❖ Port of Greenplum DB to run on HDFS
❖ Rich SQL support, but less than GPDB
❖ Recent release added support for Parquet
❖ Requires Pivotal HD

IBM Big SQL 3.0
❖ Announced April 2014
❖ Complete rewrite of Big SQL
❖ No longer uses MapReduce
❖ Utilizes native and Java open source–based readers/writers
❖ Rich SQL support
❖ Modern query planner, but new

Actian Vector[wise] (Project Vortex)
❖ Announced June 2014 at Hadoop Summit
❖ X100 engine ported to run distributed on Hadoop
❖ Rich SQL support
❖ Integration with YARN
❖ Scale up/down elasticity
❖ Fastest SQL engine
❖ Custom file format

Oracle Big Data SQL
❖ Announced July 2014
❖ Smart Scan for HDFS
❖ Requires Exadata + 12.1.0.2 + BDA
❖ Query only, does not write data

Overall Status of SQL + Hadoop
❖ SQL outside Hadoop
❖ DBMS ports to Hadoop
❖ Integrated but emerging

SELECT question FROM audience WHERE isAwesome(question); @GregRahn
