Apache Drill status 04-2013

Apache Drill status
Michael Hausenblas, Chief Data Engineer EMEA, MapR
HUG Munich, 2013-04-19

Kudos to http://cmx.io/

Workloads
• Batch processing (MapReduce)
• Light-weight OLTP (HBase, Cassandra, etc.)
• Stream processing (Storm, S4)
• Search (Solr, Elasticsearch)
• Interactive, ad-hoc query and analysis (?)

Impala: Interactive Query at Scale, low-latency

Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources

Use Case II
• Logistics – supplier status
• Queries
  – How many shipments from supplier X?
  – How many shipments in region Y?
SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia
{ "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
{ "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
…
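The two questions on this slide amount to a lookup over the shipment documents plus a join against the supplier table. A minimal Python sketch, using only the rows shown on the slide (the function and variable names are illustrative and not part of Drill):

```python
# Supplier table from the slide: SUPPLIER_ID -> (NAME, REGION)
suppliers = {
    "ACM": ("ACME Corp", "US"),
    "GAL": ("GotALot Inc", "US"),
    "BAP": ("Bits and Pieces Ltd", "Europe"),
    "ZUP": ("Zu Pli", "Asia"),
}

# Only the two shipment documents visible on the slide; the rest are elided there.
shipments = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_from_supplier(supplier_id):
    """How many shipments from supplier X?"""
    return sum(1 for s in shipments if s["supplier"] == supplier_id)


def shipments_in_region(region):
    """How many shipments in region Y? Needs the join to the supplier table."""
    return sum(1 for s in shipments if suppliers[s["supplier"]][1] == region)


print(shipments_from_supplier("ACM"))
print(shipments_in_region("Europe"))
```

The second question is the interesting one: it can only be answered by combining the relational supplier data with the JSON shipment documents, which is the cross-source situation the use case describes.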

Today’s Solutions
• RDBMS-focused
  – ETL data from MongoDB and Hadoop
  – Query data using SQL
• MapReduce-focused
  – ETL from RDBMS and MongoDB
  – Use Hive, etc.

Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable

Google’s Dremel*
*) http://research.google.com/pubs/pub36632.html

Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other QL possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, 100s involved

High-level Architecture

High-level Architecture
• Each node: Drillbit, to maximize data locality
• Co-ordination, query planning, execution, etc., are distributed
• By default Drillbits hold all roles
• Any node can act as endpoint for a query
[Diagram: four nodes, each running a Drillbit next to a storage process]

High-level Architecture
• Zookeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Curator/Zk alongside four nodes, each with a storage process, a Drillbit and a distributed cache]

High-level Architecture
• Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
[Diagram: Curator/Zk alongside four nodes, each with a storage process, a Drillbit and a distributed cache]
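The foreman role can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Drillbit` class, the `plan_query` function and the block names are hypothetical, and the assignment rule (send each block to the first node that holds it, otherwise keep it on the foreman) is a simplification chosen for illustration, not Drill's scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Drillbit:
    """Hypothetical stand-in for a Drillbit and the data blocks stored on its node."""
    name: str
    local_blocks: set = field(default_factory=set)


def plan_query(originating, cluster, blocks_needed):
    """Toy foreman: the Drillbit that received the query assigns work by locality."""
    assignments = {}
    for block in sorted(blocks_needed):
        owners = [d for d in cluster if block in d.local_blocks]
        # Prefer a node that holds the block; otherwise the foreman handles it.
        target = owners[0] if owners else originating
        assignments[block] = target.name
    return {"foreman": originating.name, "assignments": assignments}


cluster = [
    Drillbit("node1", {"b1", "b2"}),
    Drillbit("node2", {"b3"}),
    Drillbit("node3"),
]

# Any node can be the endpoint; here the query arrives at node3.
plan = plan_query(cluster[2], cluster, {"b1", "b3", "b4"})
print(plan)
```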

Principled Query Execution
[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution; source query languages: SQL 2003, DrQL, MongoQL, DSL; labels: scanner API, topology, parser API]
query: [
  { @id: "log", op: "sequence", do: [
    { op: "scan", source: "logs" },
    { op: "filter", condition: "x > 3" },
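To make the idea of a logical plan concrete, here is a minimal Python sketch of an interpreter for the `sequence` / `scan` / `filter` fragment shown on the slide. It is not Drill's reference interpreter: the `SOURCES` table, the sample rows and the use of a Python callable in place of the `"x > 3"` condition string are all assumptions made for illustration.

```python
# Hypothetical in-memory data standing in for the "logs" source.
SOURCES = {"logs": [{"x": 1}, {"x": 4}, {"x": 7}]}

plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        # A callable stands in for the condition string "x > 3".
        {"op": "filter", "condition": lambda row: row["x"] > 3},
    ],
}


def run(node, rows=None):
    """Evaluate one plan node; a sequence feeds each step's output to the next."""
    op = node["op"]
    if op == "sequence":
        for step in node["do"]:
            rows = run(step, rows)
        return rows
    if op == "scan":
        return list(SOURCES[node["source"]])
    if op == "filter":
        return [r for r in rows if node["condition"](r)]
    raise ValueError(f"unknown operator: {op}")


result = run(plan)
print(result)
```

Because the plan is plain data, any front end that can emit this structure could drive the same execution path, which is the point of separating parsing from planning.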

Drillbit Modules
[Diagram labels: RPC Endpoint, SQL, HiveQL, Pig, Parser, Logical Plan, Optimizer, Physical Plan, Foreman, Scheduler, Operators, Distributed Cache, Storage Engine Interface, DFS Engine, HBase Engine, Mongo]

Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points

Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
  – Datameer, Tableau, Excel, SAP Crystal Reports
  – Use standard ODBC/JDBC driver

Nested Data
• Nested data becoming prevalent
  – JSON/BSON, XML, ProtoBuf, Avro
  – Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
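One way flattening goes wrong is that parent-level values get repeated once per nested element, so a naive aggregate over the flattened rows over-counts. A small Python sketch, using the donut record from the Example slide with only the two batters visible there (the `flatten` helper and the `batter_id` / `batter_type` column names are made up for illustration):

```python
donut = {
    "id": "0001", "type": "donut", "ppu": 0.55,
    "batters": {"batter": [
        {"id": "1001", "type": "Regular"},
        {"id": "1002", "type": "Chocolate"},
    ]},
}


def flatten(record):
    """Produce one flat row per nested batter, copying the parent fields into each."""
    parent = {k: v for k, v in record.items() if k != "batters"}
    return [
        {**parent, "batter_id": b["id"], "batter_type": b["type"]}
        for b in record["batters"]["batter"]
    ]


rows = flatten(donut)

# Summing ppu over the flat rows counts the single donut once per batter.
naive_total = sum(r["ppu"] for r in rows)
nested_total = donut["ppu"]

print(len(rows), naive_total, nested_total)
```

Querying the nested structure directly avoids having to remember which columns were duplicated by the flattening step.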

Optional Schema
• Many data sources don’t have rigid schemas
  – Schema changes rapidly
  – Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
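Schema discovery can be pictured as scanning records and collecting, per field, the value types actually seen. The following Python sketch shows the idea only; `discover_schema` and the sample records (including the `priority` field) are invented for illustration and do not reflect how Drill implements discovery.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


# Invented sample: records with differing fields and types.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125"},
]

schema = discover_schema(records)
print(schema)
```

A field that appears with more than one type, or only in some records, is exactly the case a rigid up-front schema cannot express.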

Extensibility Points
• Source query → parser API
• Custom operators, UDF → logical plan
• Serving tree, CF, topology → physical plan/optimizer
• Data sources & formats → scanner API
[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
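As an illustration of the last extensibility point, a pluggable data source can be modelled as anything that yields records through a common scan method. The sketch below is in Python and entirely hypothetical: the `Scanner` protocol, the two implementations and `count_rows` are invented names, and Drill's actual scanner API is a Java interface that is not shown in these slides.

```python
import json
from typing import Iterator, Protocol


class Scanner(Protocol):
    """Hypothetical scanner contract: produce records as dictionaries."""

    def scan(self) -> Iterator[dict]: ...


class JsonLinesScanner:
    """Reads one JSON document per line from a string."""

    def __init__(self, text: str):
        self.text = text

    def scan(self) -> Iterator[dict]:
        for line in self.text.splitlines():
            if line.strip():
                yield json.loads(line)


class InMemoryScanner:
    """Serves rows that are already in memory."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self) -> Iterator[dict]:
        return iter(self.rows)


def count_rows(scanner: Scanner) -> int:
    """Engine-side code depends only on the Scanner contract, not on the format."""
    return sum(1 for _ in scanner.scan())


print(count_rows(JsonLinesScanner('{"a": 1}\n{"a": 2}\n')))
print(count_rows(InMemoryScanner([{"a": 1}])))
```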

… and Hadoop?
• HDFS can be a data source
• Complementary use cases*
• … use Apache Drill
  – Find record with specified condition
  – Aggregation under dynamic conditions
• … use MapReduce
  – Data mining with multiple iterations
  – ETL
*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Example
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
data source: donuts.json
{ "id": "0001", "type": "donut", "ppu": 0.55,
  "batters": { "batter": [
    { "id": "1001", "type": "Regular" },
    { "id": "1002", "type": "Chocolate" },
    …
logical plan: simple_plan.json
query:[ { op:"sequence", do:[
  { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
  { op: "filter", expr: "donuts.ppu < 2.00" },
  …
result: out.json
{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
{ "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
{ "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
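The result rows can be sanity-checked against the visible part of the plan. The Python sketch below confirms that every row satisfies the filter `donuts.ppu < 2.00` and that, in these three rows, `sales` equals `quantity × ppu` to the cent. The plan steps after the filter are elided on the slide, so the second check is an observation about the output shown, not a claim about what the plan computes; the helper names are illustrative.

```python
# The three result rows from out.json as shown on the slide.
results = [
    {"sales": 700.0, "typeCount": 1, "quantity": 700, "ppu": 1.0},
    {"sales": 109.71, "typeCount": 2, "quantity": 159, "ppu": 0.69},
    {"sales": 184.25, "typeCount": 2, "quantity": 335, "ppu": 0.55},
]


def passes_filter(row):
    """Mirror of the plan's filter expression: donuts.ppu < 2.00."""
    return row["ppu"] < 2.00


def sales_consistent(row):
    """True if sales matches quantity * ppu within half a cent."""
    return abs(row["quantity"] * row["ppu"] - row["sales"]) < 0.005


all_pass_filter = all(passes_filter(r) for r in results)
all_sales_consistent = all(sales_consistent(r) for r in results)
print(all_pass_filter, all_sales_consistent)
```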

Status
• Heavy development by multiple organizations
• Available
  – Logical plan (ADSP)
  – Reference interpreter
  – Basic SQL parser
  – Basic demo
  – Basic HBase back-end

Status April 2013
• Extend SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution

Contributing
• Learn where and how to contribute: https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Jira, Git, Apache build and test tools
• Preparing for dependencies
  – Hazelcast
  – Netflix Curator

Contributing
General contributions appreciated:
• Supersonic (?)
• Test data & test queries
• Use case scenarios (textual desc./SQL queries)
• Documentation

Contributing
• Dremel-inspired columnar format
  – Twitter’s Parquet
  – Hive’s ORC file
• Integration with Hive metastore (?)
• DRILL-13 Storage Engine: Define Java Interface
• DRILL-15 Build HBase storage engine implementation

Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Setup cluster configuration and membership mgmt system
• Further schedule
  – Alpha Q2
  – Beta Q3

Kudos to …
• Julian Hyde, Pentaho
• Lisen Mu
• Tim Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR

Engage!
• Follow @ApacheDrill on Twitter
• Sign up at mailing lists (user | dev): http://incubator.apache.org/drill/mailing-lists.html
• Standing G+ hangouts every Tuesday at 18:00 CET
• Keep an eye on http://drill-user.org/
