Slide 1

Slide 1 text

From #NoSQL to #KnowSQL
The Current State of SQL + Hadoop
Greg Rahn | @GregRahn
29 September 2014
OakTable World

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

The database community views MapReduce as
❖ A giant step backward in the programming paradigm for large-scale data intensive applications
❖ A sub-optimal implementation, in that it uses brute force instead of indexing
❖ Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
❖ Missing most of the features that are routinely included in current DBMS
❖ Incompatible with all of the tools DBMS users have come to depend on

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Google Dumps MapReduce
“We don’t really use MapReduce anymore,” Urs Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.”

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

From #NoSQL to #KnowSQL

Slide 20

Slide 20 text

Open Source Options

Slide 21

Slide 21 text

Apache Hive
❖ Originally developed by Facebook
❖ SQL to MapReduce
❖ Has been notoriously slow
❖ Hortonworks currently leading development (Stinger)

Slide 22

Slide 22 text

Project Stinger
❖ Move from MapReduce to Tez
❖ ORC file format & Vectorization
❖ In-memory hash joins (broadcast join)
❖ Window functions
❖ Decimal, Varchar, Date
❖ Limited subquery support
❖ No anti-join support
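To make the anti-join bullet concrete: an anti-join returns the rows of one table that have no match in another, and is usually written with NOT EXISTS or NOT IN. On an engine without anti-join support, a common workaround is a LEFT OUTER JOIN filtered on IS NULL. The sketch below is illustrative only. It uses Python's built-in sqlite3 module as a stand-in engine (not Hive), and the `customers` and `orders` tables are hypothetical.

```python
import sqlite3

# In-memory database standing in for any SQL engine (assumption: not Hive).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cho');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (3, 7.5);
""")

# Anti-join phrased as a NOT EXISTS subquery: customers with no orders.
not_exists = conn.execute("""
    SELECT c.name
    FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()

# Workaround form: outer join, then keep only the rows that found no match.
outer_join = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT OUTER JOIN orders o ON o.customer_id = c.id
    WHERE o.customer_id IS NULL
    ORDER BY c.name
""").fetchall()

print(not_exists, outer_join)
```

Both queries return the same rows. The outer-join form only needs join and filter support, which is why it works as a fallback.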

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Stinger.next
❖ ACID transactions
❖ Cost-based query optimization via Apache Optiq
❖ Non-equi joins
❖ More subquery support
❖ Materialized views (DIMMQ)
❖ LLAP (Live Long and Process)

Slide 26

Slide 26 text

Impala
❖ Open sourced by Cloudera, October 2012
❖ Does not build on top of MapReduce
❖ MPP engine for data in HDFS
❖ Execution engine written in C++ (LLVM)
❖ Uses Parquet file format
❖ Currently the fastest OSS SQL for Hadoop

Slide 27

Slide 27 text

Impala 1.x Additions
❖ UDFs & UDAFs
❖ Admission Control – allows prioritization and queueing of queries
❖ DECIMAL data type
❖ Cost-based join reordering
❖ In-memory HDFS caching

Slide 28

Slide 28 text

Impala 2.0 Features
❖ Window functions
❖ Subqueries in WHERE clause
❖ Disk-based joins
❖ Incremental statistics
❖ More datatypes and built-in functions
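For readers who have not used window functions, here is a minimal sketch of what the feature adds: an aggregate or ranking computed per row over a partition, without collapsing rows the way GROUP BY does. It runs against Python's built-in sqlite3 module as a stand-in (not Impala) and assumes the bundled SQLite is 3.25 or newer, which is when SQLite gained window functions. The `sales` table is hypothetical.

```python
import sqlite3

# Stand-in engine; requires SQLite 3.25+ for OVER (...) support (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 1, 10), ('east', 2, 20), ('east', 3, 5),
        ('west', 1, 7),  ('west', 2, 3);
""")

# Running total per region by day, plus each row's rank by amount in its region.
rows = conn.execute("""
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
```

Every input row survives in the output. The window clauses only add columns computed over the row's partition.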

Slide 29

Slide 29 text

Impala 2.1+ Road Map
❖ Nested data
❖ MERGE
❖ ROLLUP, CUBE, GROUPING SETS
❖ Set operations: MINUS, INTERSECT
❖ Apache HBase CRUD
❖ UDTFs
❖ Intra-node parallelized aggregations and joins
❖ Parquet enhancements including index pages
❖ Amazon S3 integration
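As a rough illustration of two roadmap items, the sketch below shows what ROLLUP computes by emulating it with UNION ALL (a common workaround on engines that lack it), and what the set operations return. It uses Python's built-in sqlite3 module as a stand-in, not Impala. SQLite spells MINUS as EXCEPT, and the `sales` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'a', 10), ('east', 'b', 20), ('east', 'c', 2),
        ('west', 'a', 5),  ('west', 'b', 1);
""")

# Emulates GROUP BY ROLLUP(region, product): detail rows,
# per-region subtotals (product is NULL), and a grand total (both NULL).
rollup = conn.execute("""
    SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
    UNION ALL
    SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM sales
""").fetchall()

# INTERSECT: products sold in both regions.
common = conn.execute("""
    SELECT product FROM sales WHERE region = 'east'
    INTERSECT
    SELECT product FROM sales WHERE region = 'west'
    ORDER BY product
""").fetchall()

# EXCEPT (MINUS in some dialects): products sold in east but not in west.
east_only = conn.execute("""
    SELECT product FROM sales WHERE region = 'east'
    EXCEPT
    SELECT product FROM sales WHERE region = 'west'
    ORDER BY product
""").fetchall()

print(rollup, common, east_only)
```

The UNION ALL form scans the table once per grouping level, which is one reason native ROLLUP and GROUPING SETS support is worth having.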

Slide 30

Slide 30 text

Presto
❖ Announced by Facebook November 2013
❖ Runs as daemon, not SQL to MapReduce
❖ Written in Java
❖ Bytecode compilation
❖ Connectors: Hive, MySQL, PostgreSQL, Kafka, Cassandra

Slide 31

Slide 31 text

Presto
❖ Approximate queries (BlinkDB)
❖ Distinct-limit optimization
❖ Window functions
❖ Amazon S3 support
❖ HyperLogLog
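Since HyperLogLog appears above without explanation, here is a minimal sketch of the idea: estimate the number of distinct values from a small fixed-size array of registers instead of storing every value. This is a textbook-style version written for illustration, not Presto's implementation. The class name, the choice of SHA-1 as the hash, and the default of 4096 registers are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook-style HyperLogLog sketch (illustrative, not Presto's code)."""

    def __init__(self, p=12):
        self.p = p                  # number of hash bits used to pick a register
        self.m = 1 << p             # number of registers (4096 by default)
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # take a 64-bit hash
        idx = h >> (64 - self.p)                    # top p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")    # duplicates never change the registers
estimate = hll.count()
print(round(estimate))
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, while the sketch's memory stays fixed no matter how many values are added.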

Slide 32

Slide 32 text

Apache Drill
❖ Proposal announced August 2012
❖ MapR driving development
❖ Most immature of SQL projects
❖ Currently in a 0.5 (pre-production-ready) beta release
❖ Uses Optiq for CBO
❖ Geared more toward nested / semi-structured data (no metadata definition needed)

Slide 33

Slide 33 text

Shark / Spark SQL
❖ Shark == AMPLab’s Hive 0.9.0 on Spark
❖ Spark == Alternative to MapReduce
❖ July 2014 announcements: EOL for Shark, Spark SQL, new Hive on Spark (HIVE-7292)
❖ Spark SQL driven by Databricks

Slide 34

Slide 34 text

Closed Source Options

Slide 35

Slide 35 text

Pivotal HAWQ
❖ Announced February 2013
❖ Port of Greenplum DB to run on HDFS
❖ Rich SQL support, but less than GPDB
❖ Recent release added support for Parquet
❖ Requires Pivotal HD

Slide 36

Slide 36 text

IBM Big SQL 3.0
❖ Announced April 2014
❖ Complete rewrite of Big SQL
❖ No longer uses MapReduce
❖ Utilizes native and Java open source–based readers/writers
❖ Rich SQL support
❖ Modern query planner, but still new

Slide 37

Slide 37 text

Actian Vector[wise] (Project Vortex)
❖ Announced June 2014 at Hadoop Summit
❖ X100 engine ported to run distributed on Hadoop
❖ Rich SQL support
❖ Integration with YARN
❖ Scale up/down elasticity
❖ Fastest SQL engine
❖ Custom file format

Slide 38

Slide 38 text

Oracle Big Data SQL
❖ Announced July 2014
❖ Smart Scan for HDFS
❖ Requires Exadata + 12.1.0.2 + BDA
❖ Query only, does not write data

Slide 39

Slide 39 text

Overall Status of SQL + Hadoop
❖ SQL outside Hadoop
❖ DBMS ports to Hadoop
❖ Integrated but emerging

Slide 40

Slide 40 text

SELECT question FROM audience WHERE isAwesome(question);
@GregRahn