From #NoSQL to #KnowSQL
The Current State of SQL + Hadoop
Greg Rahn | @GregRahn
29 September 2014
OakTable World
Slide 7
Slide 7 text
The database community views MapReduce as
❖ A giant step backward in the programming paradigm for large-scale data intensive applications
❖ A sub-optimal implementation, in that it uses brute force instead of indexing
❖ Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
❖ Missing most of the features that are routinely included in current DBMS
❖ Incompatible with all of the tools DBMS users have come to depend on
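To make the "brute force" critique concrete, here is a minimal single-process sketch of the map / shuffle / reduce pattern in Python. It is only an illustration, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are hypothetical. Note that every input record is scanned, with no index involved; the SQL equivalent would be roughly a GROUP BY with COUNT(*).

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for each word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    # Brute force: every record is read and mapped, no index is consulted.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input.
records = ["SQL on Hadoop", "Hadoop without SQL", "sql again"]
counts = word_count(records)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; the sketch keeps only the programming model.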
Slide 11
Slide 11 text
Google Dumps MapReduce
“We don’t really use MapReduce anymore,” Urs Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.”
Slide 19
Slide 19 text
From #NoSQL
to #KnowSQL
Slide 20
Slide 20 text
Open Source Options
Slide 21
Slide 21 text
Apache Hive
❖ Originally developed by Facebook
❖ SQL to MapReduce
❖ Has been notoriously slow
❖ Hortonworks currently leading development (Stinger)
Slide 22
Slide 22 text
Project Stinger
❖ Move from MapReduce to Tez
❖ ORC file format & Vectorization
❖ In-memory hash joins (broadcast join)
❖ Window functions
❖ Decimal, Varchar, Date
❖ Limited subquery support
❖ No anti-join support
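Two of the items above, the in-memory hash (broadcast) join and the anti-join, can be sketched in a few lines of Python. This is a single-process illustration of the techniques, not Hive's implementation: the function names and sample tables are hypothetical, and SQL NULL semantics are ignored.

```python
def broadcast_hash_join(big_rows, small_rows, big_key, small_key):
    """Inner join: hash the small table once, then stream the big table past it."""
    # Build phase: in a cluster, this hash table is what gets broadcast
    # to every worker so the big table never has to be shuffled.
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[small_key], []).append(row)
    # Probe phase: one hash lookup per row of the big table.
    for row in big_rows:
        for match in hash_table.get(row[big_key], []):
            yield {**row, **match}


def anti_join(left_rows, right_rows, left_key, right_key):
    """Return left rows that have no matching key on the right side."""
    right_keys = {row[right_key] for row in right_rows}
    return [row for row in left_rows if row[left_key] not in right_keys]


# Hypothetical sample data.
orders = [
    {"order_id": 1, "cust_id": 10},
    {"order_id": 2, "cust_id": 20},
    {"order_id": 3, "cust_id": 30},
]
customers = [
    {"cust_id": 10, "name": "Ann"},
    {"cust_id": 20, "name": "Bob"},
]

joined = list(broadcast_hash_join(orders, customers, "cust_id", "cust_id"))
orphans = anti_join(orders, customers, "cust_id", "cust_id")
```

Where an engine lacks anti-join support, the usual SQL workaround is a LEFT OUTER JOIN filtered on the right-hand key IS NULL, which returns the same rows as orphans here.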
Slide 25
Slide 25 text
Stinger.next
❖ ACID transactions
❖ Cost-based query optimization via Apache Optiq
❖ Non-equi joins
❖ More subquery support
❖ Materialized views (DIMMQ)
❖ LLAP (Live Long and Process)
Slide 26
Slide 26 text
Impala
❖ Open sourced by Cloudera, October 2012
❖ Does not build on top of MapReduce
❖ MPP engine for data in HDFS
❖ Execution engine written in C++ (LLVM)
❖ Uses Parquet file format
❖ Currently the fastest OSS SQL for Hadoop
Slide 27
Slide 27 text
Impala 1.x Additions
❖ UDFs & UDAFs
❖ Admission Control: allows prioritization and queueing of queries
❖ DECIMAL data type
❖ Cost-based join reordering
❖ In-memory HDFS caching
Slide 28
Slide 28 text
Impala 2.0 Features
❖ Window functions
❖ Subqueries in WHERE clause
❖ Disk-based Joins
❖ Incremental statistics
❖ More datatypes and built-in functions
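For readers less familiar with two of these features, window functions and subqueries in the WHERE clause, here is a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in engine, purely as an assumption for illustration; it is not Impala, and Impala's syntax details may differ. The sales table and its rows are hypothetical, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'ann', 100), ('east', 'bob', 300),
        ('west', 'cy', 200), ('west', 'dee', 50);
""")

# Window function: rank reps within each region without collapsing rows.
ranked = conn.execute("""
    SELECT region, rep, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()

# Subquery in WHERE: reps whose amount is above the overall average.
above_avg = conn.execute("""
    SELECT rep FROM sales
    WHERE amount > (SELECT AVG(amount) FROM sales)
    ORDER BY rep
""").fetchall()
```

The point of the window function is that each input row survives with its rank attached, which a plain GROUP BY cannot do.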
Slide 29
Slide 29 text
Impala 2.1+ Road Map
❖ Nested data
❖ MERGE
❖ ROLLUP, CUBE, GROUPING SETS
❖ Set operations: MINUS, INTERSECT
❖ Apache HBase CRUD
❖ UDTFs
❖ Intra-node parallelized aggregations and joins
❖ Parquet enhancements including index pages
❖ Amazon S3 integration
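As a rough picture of what the set operations and ROLLUP on this road map would provide, the sketch below again uses Python's sqlite3 module as a stand-in engine (an assumption for illustration, not Impala). SQLite spells MINUS as EXCEPT and has no ROLLUP, so the rollup is emulated with UNION ALL, which is also the usual workaround on engines that lack it. Table names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 100), ('east', 300), ('west', 200);
""")

# INTERSECT: ids present in both tables.
in_both = conn.execute(
    "SELECT id FROM a INTERSECT SELECT id FROM b ORDER BY id"
).fetchall()

# EXCEPT (MINUS in Oracle's dialect): ids in a that are not in b.
only_a = conn.execute(
    "SELECT id FROM a EXCEPT SELECT id FROM b ORDER BY id"
).fetchall()

# Emulated GROUP BY ROLLUP(region): per-region subtotals plus a grand
# total row, where NULL in the region column marks the total.
rollup = conn.execute("""
    SELECT region, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL, SUM(amount) FROM sales
""").fetchall()
```

With native ROLLUP the engine computes the subtotals and the grand total in one statement instead of one branch per grouping level.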
Slide 30
Slide 30 text
Presto
❖ Open sourced by Facebook, November 2013
❖ Runs as daemon, not SQL to MapReduce
❖ Written in Java
❖ Bytecode compilation
❖ Connectors
  ❖ Hive, MySQL, PostgreSQL, Kafka, Cassandra
Apache Drill
❖ Proposal announced August 2012
❖ MapR driving development
❖ Most immature of SQL projects
❖ Currently in a 0.5 (pre-production-ready) beta release
❖ Uses Optiq for CBO
❖ Geared more toward nested / semi-structured data; no metadata definition needed
Slide 33
Slide 33 text
Shark / Spark SQL
❖ Shark == AMPLab’s Hive 0.9.0 on Spark
❖ Spark == Alternative to MapReduce
❖ July 2014 announcements
  ❖ EOL for Shark
  ❖ Spark SQL
  ❖ New Hive on Spark (HIVE-7292)
❖ Spark SQL driven by Databricks
Slide 34
Slide 34 text
Closed Source Options
Slide 35
Slide 35 text
Pivotal HAWQ
❖ Announced February 2013
❖ Port of Greenplum DB to run on HDFS
❖ Rich SQL support, but less than GPDB
❖ Recent release added support for Parquet
❖ Requires Pivotal HD
Slide 36
Slide 36 text
IBM Big SQL 3.0
❖ Announced April 2014
❖ Complete rewrite of Big SQL
❖ No longer uses MapReduce
❖ Utilizes native and Java open source–based readers/writers
❖ Rich SQL support
❖ Modern query planner, but new
Slide 37
Slide 37 text
Actian Vector[wise] (Project Vortex)
❖ Announced June 2014 at Hadoop Summit
❖ X100 engine ported to run distributed on Hadoop
❖ Rich SQL support
❖ Integration with YARN
❖ Scale up/down elasticity
❖ Fastest SQL engine
❖ Custom file format
Slide 38
Slide 38 text
Oracle Big Data SQL
❖ Announced July 2014
❖ Smart Scan for HDFS
❖ Requires Exadata + 12.1.0.2 + BDA
❖ Query only, does not write data
Slide 39
Slide 39 text
Overall Status of SQL + Hadoop
❖ SQL outside Hadoop
❖ DBMS ports to Hadoop
❖ Integrated but emerging
Slide 40
Slide 40 text
SELECT question
FROM audience
WHERE isAwesome(question);
@GregRahn