Apache Drill - Low Latency ANSI SQL on Hadoop Data & NoSQL at the same time

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Pre-Slideware Summary: Low Latency ANSI SQL on Hadoop Data & NoSQL, at the same time

Slide 3

Slide 3 text

Top Ranked; 500+ Customers; Cloud Leaders; MapR Enterprise Hadoop

Slide 4

Slide 4 text

WORLDWIDE PRESENCE & CUSTOMER SUPPORT (figure label: HQ)

Slide 5

Slide 5 text

One of Our Key Strengths: We Innovate

Slide 6

Slide 6 text

Hadoop Distributions (diagram labels: Open Source; Distribution A; Distribution C; Management; Architectural Innovations)

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Silos make analysis very difficult
•  How do I identify a unique {customer, trade} across data sets?
•  How can I guarantee the lack of anomalous behavior if I can’t see all data?

Slide 10

Slide 10 text

Here’s an idea: Give Users the Power to Query Across Silos, Irrespective of Data Types

Slide 11

Slide 11 text

Rethink SQL for Big Data
Preserve
•  ANSI SQL
   – Familiar and ubiquitous
•  Performance
   – Interactive nature crucial for BI/Analytics
•  One technology
   – Painful to manage different technologies
•  Enterprise ready
   – System-of-record, HA, DR, Security, Multi-tenancy, …
Invent
•  Flexible data model
   – Allow schemas to evolve rapidly
   – Support semi-structured data types
•  Agility
   – Self-service possible when developer and DBA are the same
•  Scalability
   – In all dimensions: data, speed, schemas, processes, management

Slide 12

Slide 12 text

SQL is here to stay

Slide 13

Slide 13 text

Hadoop is here to stay

Slide 14

Slide 14 text

Self-Describing Data Ubiquitous
Centralised schema
-  Static
-  Managed by the DBAs
-  In a centralised repository
Long, meticulous data preparation process (ETL, create/alter schema, etc.)
Self-describing, or schema-less, data
-  Dynamic/evolving
-  Managed by the applications
-  Embedded in the data
Less schema, more suitable for data that has higher volume, variety and velocity
Apache Drill

Slide 15

Slide 15 text

Drill
●  Apache open source project
●  Scale-out execution engine for low-latency SQL queries
●  Unified SQL-based API for zero-day analytics & operational applications
●  Flexible data sources

Slide 16

Slide 16 text

Drill & Dremel
●  Inspired by Google tech
●  SQL querying of Google data over GFS & BigTable
●  In production use since 2006 - 8 YEARS!
●  Tens of thousands of concurrent users over petabytes of data
●  Dremel paper released 2010

Slide 17

Slide 17 text

Drill
(diagram labels: Zookeeper; Query; three nodes, each with a Drillbit, a Distributed Cache and DFS/HBase)
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
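The four steps above follow a scatter-gather pattern. The sketch below mimics that pattern in plain Python and is not Drill's implementation: `NODE_DATA`, `plan`, `run_fragment` and `execute` are hypothetical names, and a thread pool stands in for remote Drillbits.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the data held on three nodes (not a Drill API).
NODE_DATA = {
    "node1": [1, 2, 3],
    "node2": [4, 5],
    "node3": [6],
}


def plan(nodes):
    """Step 2: build one fragment per node so each fragment reads local data."""
    return [{"node": name} for name in sorted(nodes)]


def run_fragment(fragment):
    """Step 3: a fragment computes a partial result from its own node's rows."""
    rows = NODE_DATA[fragment["node"]]
    return {"count": len(rows), "total": sum(rows)}


def execute(nodes):
    """Steps 1 to 4 as seen from the driving node."""
    fragments = plan(nodes)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_fragment, fragments))
    # Step 4: merge the partial results into the final answer.
    return {
        "count": sum(p["count"] for p in partials),
        "total": sum(p["total"] for p in partials),
    }
```

Calling `execute(NODE_DATA)` merges three partial counts and sums into one result on the caller's side, which is the role the slide gives to the driving node.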

Slide 18

Slide 18 text

A Drill Database
•  What is a database with Drill/MapR? There isn’t one
•  Just a directory, with a bunch of related files or other sources

~/work/bugs

BugList
symptom    version  date     bugid  dump-name
app crash  3.1.1    14/7/14  12345  cust1.tgz
app slow   3.1.0    12/7/14  45678  cust2.tgz

Customers
name  rep    se       dump-name
xxxx  dkim   junhyuk  cust1.tgz
yyyy  yoshi  aki      cust2.tgz

Slide 19

Slide 19 text

Data Source is in the Query

    select `timestamp`, message
    from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
    where errorLevel > 2

dfs1: this is a cluster in Apache Drill
-  DFS
-  HBase
-  Hive meta-store
logs: a work-space
-  Typically a sub-directory
-  Hive database
`AppServerLogs/2014/Jan/p001.parquet`: a table
-  pathnames
-  HBase table
-  Hive table
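To make the three-part name concrete, here is a small Python sketch that splits a qualified name into cluster, work-space and table. It is an illustration only: `split_name` and its regular expression are hypothetical and say nothing about how Drill itself parses identifiers.

```python
import re

# Illustrative pattern for  cluster.workspace.`table`  (an assumption, not Drill's grammar).
_NAME = re.compile(r"^(?P<cluster>\w+)\.(?P<workspace>\w+)\.`(?P<table>[^`]+)`$")


def split_name(qualified):
    """Return (cluster, workspace, table) for a name written as in the slide."""
    match = _NAME.match(qualified)
    if match is None:
        raise ValueError(f"not a three-part name: {qualified!r}")
    return match.group("cluster"), match.group("workspace"), match.group("table")
```

For the name used in the slide, `split_name` yields `dfs1` as the cluster, `logs` as the work-space and the Parquet path as the table.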

Slide 20

Slide 20 text

Can be an entire directory tree

    -- On a file
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
    group by errorLevel;

    -- On the entire data collection: all years, all months
    select errorLevel, count(*)
    from dfs.logs.`/AppServerLogs`
    group by errorLevel;
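The idea that one "table" can be a single file or a whole directory tree can be sketched in plain Python. This is an analogy rather than Drill code: it assumes JSON-lines files instead of Parquet so that it runs with the standard library alone, and `scan` and `count_by_error_level` are hypothetical names.

```python
import json
from collections import Counter
from pathlib import Path


def scan(path):
    """Yield records from one JSON-lines file, or from every *.json file under a directory."""
    root = Path(path)
    files = [root] if root.is_file() else sorted(root.rglob("*.json"))
    for file in files:
        with file.open() as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def count_by_error_level(path):
    """Equivalent in spirit to: select errorLevel, count(*) ... group by errorLevel."""
    return Counter(record["errorLevel"] for record in scan(path))
```

The same function serves both queries on the slide: pass a file path for one month's part file, or the top-level directory for all years and months.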

Slide 21

Slide 21 text

Combine data sources on the fly
•  JSON
•  CSV
•  ORC (i.e., all Hive types)
•  Parquet
•  HBase tables
•  … can combine them

    -- users with log entries above error level 5, ordered by how many they have
    select USERS.name, USERS.emails.work
    from dfs.logs.`/data/logs` LOGS, dfs.users.`/profiles.json` USERS
    where LOGS.uid = USERS.uid and LOGS.errorLevel > 5
    group by USERS.name, USERS.emails.work
    order by count(*);
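As a rough illustration of joining two differently shaped sources, the Python sketch below joins CSV log rows with nested JSON user profiles on `uid`. It is not how Drill executes the query; `join_logs_with_users`, the CSV layout and the sample field values are assumptions made for the example.

```python
import csv
import io
import json


def join_logs_with_users(logs_csv, users_json, min_level=5):
    """Return (name, work email) for every log row whose errorLevel exceeds min_level."""
    # Index the JSON-lines profiles by uid so each log row can look up its user.
    users = {}
    for line in users_json.splitlines():
        if line.strip():
            user = json.loads(line)
            users[user["uid"]] = user

    result = []
    for row in csv.DictReader(io.StringIO(logs_csv)):
        # CSV values are text, so cast errorLevel before comparing.
        if int(row["errorLevel"]) > min_level and row["uid"] in users:
            user = users[row["uid"]]
            result.append((user["name"], user["emails"]["work"]))
    return result
```

The nested lookup `user["emails"]["work"]` plays the part of `USERS.emails.work` in the slide's query.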

Slide 22

Slide 22 text

Queries are simple

    select b.bugid, b.symptom, b.`date`
    from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b
    where c.`dump-name` = b.`dump-name`

Let’s say I want to cross-reference against your list:

    select b.bugid, b.symptom
    from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2
    where b.bugid = b2.xxx
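The first join can be tried end to end with the sample rows from the "A Drill Database" slide. The sketch below uses Python's built-in `sqlite3` as a stand-in engine, which is an assumption made purely so the example runs anywhere: unlike Drill, SQLite needs the rows loaded into tables first, and it quotes identifiers with double quotes rather than backticks. The function name `bugs_for_customers` is hypothetical.

```python
import sqlite3


def bugs_for_customers():
    """Join Customers and BugList on dump-name, using the rows shown in the deck."""
    con = sqlite3.connect(":memory:")
    con.execute('create table BugList (symptom, version, "date", bugid, "dump-name")')
    con.execute('create table Customers (name, rep, se, "dump-name")')
    con.executemany(
        "insert into BugList values (?, ?, ?, ?, ?)",
        [
            ("app crash", "3.1.1", "14/7/14", 12345, "cust1.tgz"),
            ("app slow", "3.1.0", "12/7/14", 45678, "cust2.tgz"),
        ],
    )
    con.executemany(
        "insert into Customers values (?, ?, ?, ?)",
        [
            ("xxxx", "dkim", "junhyuk", "cust1.tgz"),
            ("yyyy", "yoshi", "aki", "cust2.tgz"),
        ],
    )
    rows = con.execute(
        'select b.bugid, b.symptom, b."date" '
        "from Customers c, BugList b "
        'where c."dump-name" = b."dump-name" '
        "order by b.bugid"
    ).fetchall()
    con.close()
    return rows
```

Each customer dump matches exactly one bug in the sample data, so the join returns one row per bug.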

Slide 23

Slide 23 text

What does it mean?
•  No ETL
•  Reach out directly to the particular table/file
•  As long as the permissions are fine, you can do it
•  No need to have the meta-data
   – None needed

Slide 24

Slide 24 text

•  Schema can change over course of query
•  Operators are able to reconfigure themselves on schema change events
   – Minimize flexibility overhead
   – Support more advanced execution optimization based on actual data characteristics
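One way to picture an operator that reconfigures itself on a schema change event is the toy Python class below. It is a conceptual sketch under stated assumptions, not Drill's operator code: `SumOperator`, its batch format (a list of dicts sharing the same keys) and the `reconfigurations` counter are all hypothetical.

```python
class SumOperator:
    """Sums every column, rebuilding its column list whenever the batch schema changes."""

    def __init__(self):
        self.schema = None          # column names of the batches currently being processed
        self.totals = {}            # running sum per column seen so far
        self.reconfigurations = 0   # how many schema change events were handled

    def _on_schema_change(self, new_schema):
        """React to a schema change event: adopt the new columns, keep old totals."""
        self.schema = new_schema
        self.reconfigurations += 1
        for column in new_schema:
            self.totals.setdefault(column, 0)

    def consume(self, batch):
        """Process one batch; the schema check happens once per batch, not once per row."""
        if not batch:
            return
        schema = tuple(sorted(batch[0]))
        if schema != self.schema:
            self._on_schema_change(schema)
        for row in batch:
            for column in self.schema:
                self.totals[column] += row[column]
```

Checking the schema once per batch rather than per row is one plausible reading of "minimize flexibility overhead": the inner loop stays fixed until a change event arrives.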

Slide 25

Slide 25 text

Querying JSON

donuts.json

    { "name": "classic", "fillings": [ { "name": "sugar", "cal": 400 } ] }
    { "name": "choco", "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "chocolate", "cal": 300 } ] }
    { "name": "bostoncreme", "fillings": [ { "name": "sugar", "cal": 400 }, { "name": "cream", "cal": 1000 }, { "name": "jelly", "cal": 600 } ] }

Slide 26

Slide 26 text

Another example

    select d.name, count(d.fillings)
    from (select convert_from(cf1.`donut-json`, 'JSON') as d
          from hbase.user.`donuts`) t;

•  convert_from(xx, 'JSON') invokes the JSON parser inside Drill
•  What if you could plug in any parser
   –  XML?
   –  Another NoSQL database format
   –  Any other file format
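The "plug in any parser" idea can be sketched as a small registry keyed by format name. The Python below is a toy model and does not reflect Drill's internals: the `convert_from` here is a plain Python function named after the one on the slide, and `PARSERS`, `register`, `filling_count` and the XML layout are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

PARSERS = {}  # format name -> function turning raw text into a dict


def register(fmt):
    """Decorator that plugs a parser in under a format name."""
    def wrap(parser):
        PARSERS[fmt.upper()] = parser
        return parser
    return wrap


@register("JSON")
def parse_json(raw):
    return json.loads(raw)


@register("XML")
def parse_xml(raw):
    # Assumed layout: <donut name="..."><filling name="..." cal="..."/></donut>
    root = ET.fromstring(raw)
    fillings = [
        {"name": f.get("name"), "cal": int(f.get("cal"))}
        for f in root.findall("filling")
    ]
    return {"name": root.get("name"), "fillings": fillings}


def convert_from(raw, fmt):
    """Look up the parser registered for fmt and apply it to the raw value."""
    return PARSERS[fmt.upper()](raw)


def filling_count(raw, fmt):
    """Number of fillings in one stored donut, whatever format it was stored in."""
    return len(convert_from(raw, fmt)["fillings"])
```

Adding another format is then a matter of registering one more function; the code that counts fillings does not change.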

Slide 27

Slide 27 text

No ETL
•  Basically, Drill is querying the raw data directly
•  Joining with processed data
•  NO ETL
•  Folks, this is very, very powerful
•  NO ETL

Slide 28

Slide 28 text

Seamless integration with Apache Hive
•  Low latency queries on Hive tables
•  Support for 100s of Hive file formats
•  Ability to reuse Hive UDFs
•  Support for multiple Hive metastores in a single query

Slide 29

Slide 29 text

A Quick Tour through Apache Drill

Slide 30

Slide 30 text

Apache Drill
FLEXIBLE SCHEMA MANAGEMENT: Analyze data, self-described or central metadata
FRICTIONLESS ANALYTICS ON NESTED DATA: Analyze semi-structured & nested data
PLUG AND PLAY WITH EXISTING: Reuse investments in SQL/BI tools and Apache Hive
… and with an architecture built ground up for low latency queries at scale

Slide 31

Slide 31 text

Apache Drill Roadmap
1.0: Data exploration/ad-hoc queries
•  Low-latency SQL
•  Schema-less execution
•  Files & HBase/M7 support
•  Hive integration
•  BI and SQL tool support via ODBC/JDBC
1.1: Advanced analytics and operational data
•  HBase query speedup
•  Nested data functions
•  Advanced SQL functionality
2.0: Operational SQL
•  Ultra low latency queries
•  Single row insert/update/delete
•  Workload management

Slide 32

Slide 32 text

Apache Drill Resources
•  Drill 0.5
•  Getting started with Drill is easy
   –  Download Drill Sandbox from mapr.com
•  Mailing lists
   –  drill-[email protected]
   –  drill-[email protected]
•  Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
•  Fork us on GitHub: http://github.com/apache/incubator-drill/
•  Create a JIRA: https://issues.apache.org/jira/browse/DRILL

Slide 33

Slide 33 text

Active Drill Community
•  Large community, growing rapidly
   – 35-40 contributors, 16 committers
   – Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities
•  In 2014
   – over 20 meet-ups, many more coming soon
   – 2 hackathons, with 40+ participants
•  Encourage you to join, learn, contribute and have fun …

Slide 34

Slide 34 text

Drill at MapR
•  World-class SQL team, ~20 people
•  150+ years combined experience building commercial databases
•  Oracle, DB2, ParAccel, Teradata, SQL Server, Vertica
•  Team works on Drill, Hive, Impala
•  Fixed some of the toughest problems in Apache Hive

Slide 35

Slide 35 text

Thank you! Richard Shaw [email protected] @aggress