Slide 1

Slide 1 text

Apache Calcite One Front-end to Rule Them All Michael Mior, PMC Chair

Slide 2

Slide 2 text

Overview ● What is Apache Calcite? ● Calcite components ● Streaming SQL ● Next steps and contributing to Calcite

Slide 3

Slide 3 text

What is Apache Calcite? ● An ANSI-compliant SQL parser ● A logical query optimizer ● A heterogenous data processing framework

Slide 4

Slide 4 text

Origins 2004 LucidEra and SQLstream were each building SQL systems 2012 Code pared down and entered the ASF incubator 2015 Graduated from incubator 2016 I joined the Calcite project as a committer 2017 Joined the PMC and was voted as chair 2018 Paper presented at SIGMOD

Slide 5

Slide 5 text

Powered by Calcite ● Many open source projects (Apache Hive, Apache Drill, Apache Phoenix, Lingual, …) ● Commercial products (MapD, Dremio, Qubole, …) ● Contributors from Huawei, Uber, Intel, Salesforce, …

Slide 6

Slide 6 text

Powered by Calcite

Slide 7

Slide 7 text

Conventional Architecture JDBC Client JDBC Server SQL Parser Optimizer Datastore Metadata Operators

Slide 8

Slide 8 text

Calcite Architecture JDBC Client JDBC Server SQL Parser Optimizer 3rd party data Pluggable Metadata Adapters Pluggable Rules 3rd party data

Slide 9

Slide 9 text

Avatica

Slide 10

Slide 10 text

Optimizer ● Operates on relational algebra by matching rules ● Calcite contains 100+ rewrite rules ● Currently working on validating these using Cosette ● Optimization is cost-based ● “Calling convention” allows optimization across backends

Slide 11

Slide 11 text

Example rules ● Join order transposition ● Transpose different operators (e.g. project before join) ● Merge adjacent operators ● Materialized view query rewriting

Slide 12

Slide 12 text

Optimizer ● Based on the Volcano optimizer generator ○ Logical operators are functions (e.g. join) ○ Physical operators implement logical operators ○ Physical properties are attributes of the data (e.g. sorting, partitioning) ● Start with logical expressions and physical properties ● Optimization produces a plan with only physical operators

Slide 13

Slide 13 text

Materialized views

Slide 14

Slide 14 text

Performance

Slide 15

Slide 15 text

Relational Algebra and Streaming ● Scan ● Filter ● Project ● Join ● Sort ● Aggregate ● Union ● Values ● Delta (relation to stream) ● Chi (stream to relation)

Slide 16

Slide 16 text

Adapters ● Connect to different backends (not just relational) ● Only required operation is a table scan ● Allow push down of filter, sort, etc. ● Calcite implements remaining operators

Slide 17

Slide 17 text

● Calling convention allows Calcite to separate backend-specific operators and generic implementations ● Any relational algebra operator can be pushed down ● Operator push down simply requires a new optimizer rule Adapters

Slide 18

Slide 18 text

Conventions 1. Plans start as logical nodes 3. Fire rules to propagate conventions to other nodes 2. Assign each Scan its table’s native convention 4. The best plan may use an engine not tied to any native format Join Filter Scan Scan Scan Join Join Filter Scan Scan Scan Join Scan Scan Scan Join Filter Join Join Filter Scan Scan Scan Join

Slide 19

Slide 19 text

Conventions ● Conventions are a uniform representation of hybrid queries ● Physical property of nodes (like ordering, distribution) ● Adapter = Schema factory + Convention + Rules to convert to a convention Join Filter Scan Scan Scan Join

Slide 20

Slide 20 text

● Column store database ● Uses tables partitioned across servers and clustered ● Supports limited filtering and sorting Apache Cassandra Adapter

Slide 21

Slide 21 text

Query example CREATE TABLE playlists (id uuid, song_order int, song_id uuid, title text, artist text, PRIMARY KEY (id, song_order)); SELECT title FROM playlists WHERE id=62c36092-82a1-3a00-93d1-46196ee77204 AND artist='Relient K' ORDER BY song_order;

Slide 22

Slide 22 text

SELECT * FROM playlists; Query example Sort Scan Project Filter ● Start with a table scan ● Remaining operations performed by Calcite

Slide 23

Slide 23 text

SELECT * FROM playlists WHERE id=62c36092-82a1-3a00-93d1-46196ee77204; Query example Sort Scan Project Filter Filter ● Push the filter on the partition key to Cassandra ● The remaining filter is performed by Calcite

Slide 24

Slide 24 text

SELECT * FROM playlists WHERE id=62c36092-82a1-3a00-93d1-46196ee77204 ORDER BY song_order; Query example Filter Scan Project Filter Sort ● Push the ordering to Cassandra ● This uses the table’s clustering key

Slide 25

Slide 25 text

SELECT title, album FROM playlists WHERE id=62c36092-82a1-3a00-93d1-46196ee77204 ORDER BY song_order; Query example Scan Filter Filter Sort Project Project ● Push down the project of necessary fields ● This is the query sent to Cassandra ● Only the filter and project are done by Calcite

Slide 26

Slide 26 text

● Materialized view maintenance ● View-based query rewriting ● Full SQL support ● Join with other data sources What we get for free

Slide 27

Slide 27 text

● All data must be modeled as relations ● Easy for relational databases ● Also relatively easy for many wide column stores ● What about document stores? Data representation

Slide 28

Slide 28 text

Semistructured Data ● Columns can have complex types (e.g. arrays and maps) ● Add UNNEST operator to relational algebra ● New rules can be added to optimize these queries name age pets Sally 29 [{name: Fido, type: Dog}, {name: Jack, type: Cat}] name age pets Sally 29 {name: Fido, type: Dog} Sally 29 {name: Jack, type: Cat}

Slide 29

Slide 29 text

MongoDB Adapter _MAP { _id : 02401, city : BROCKTON, loc : [ -71.03434799999999, 42.081571 ], pop : 59498, state : MA } { _id : 06902, city : STAMFORD, loc : [ -73.53742800000001, 41.052552 ], pop : 54605, state : CT } ● Use one column with the whole document ● Unnest attributes as needed ● This is very messy, but we have no schema to work with

Slide 30

Slide 30 text

MongoDB Adapter id city latitude longitude population state 02401 BROCKTON -71.034348 42.081571 59498 MA 06902 STAMFORD -73.537428 41.052552 54605 CT ● Views to the rescue! ● Users of adapters can define structured views over semistructured data (or do this lazily! See Apache Drill)

Slide 31

Slide 31 text

Available Adapters

Slide 32

Slide 32 text

MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: products Table: splunk SELECT p.productName, COUNT(*) AS c FROM splunk.splunk AS s JOIN mysql.products AS p ON s.productId = p.productId WHERE s.action = 'purchase' GROUP BY p.productName ORDER BY c DESC FilterIntoJoin

Slide 33

Slide 33 text

MySQL Splunk group Key: productName Agg: count sort Key: c desc FilterIntoJoin join Key: productId filter Condition: action = 'purchase' scan scan Table: splunk Table: products SELECT p.productName, COUNT(*) AS c FROM splunk.splunk AS s JOIN mysql.products AS p ON s.productId = p.productId WHERE s.action = 'purchase' GROUP BY p.productName ORDER BY c DESC

Slide 34

Slide 34 text

Streaming Data ● Calcite supports multiple windowing algorithms (e.g. tumbling, sliding, hopping) ● Streaming queries can be combined with tables ● Streaming queries can be optimized using the same rules along with new rules specifically for streaming queries

Slide 35

Slide 35 text

Streaming Data ● Relations can be used both as streams and tables ● Calcite is a reference implementation for streaming SQL (still being standardized) SELECT STREAM * FROM Orders AS o WHERE units > (SELECT AVG(units) FROM Orders AS h WHERE h.productId = o.productId AND h.rowtime > o.rowtime - INTERVAL ‘1’ YEAR)

Slide 36

Slide 36 text

Windowing Tumbling window Hopping window Session window SELECT STREAM … FROM Orders GROUP BY FLOOR(rowtime TO HOUR) SELECT STREAM … FROM Orders GROUP BY TUMBLE(rowtime, INTERVAL ‘1’ HOUR) SELECT STREAM … FROM Orders GROUP BY HOP(rowtime, INTERVAL ‘1’ HOUR, INTERVAL ‘2’ HOUR) SELECT STREAM … FROM Orders GROUP BY SESSION(rowtime, INTERVAL ‘1’ HOUR)

Slide 37

Slide 37 text

My Use Case ● Perform view-based query rewriting to provide a logical model over a denormalized data store ● Denormalized tables are views over (non-materialized) logical tables ● Queries can be rewritten from logical tables to the most cost-efficient choice of materialized views

Slide 38

Slide 38 text

Use Cases ● Parsing and validating SQL (not so easy) ● Adding a relational front end to an existing system ● Prototyping new query processing algorithms ● Integrating data from multiple backends ● Allowing RDBMS tools to work with non-relational DBs

Slide 39

Slide 39 text

Calcite Project Future Work ● Geospatial queries ● Processing scientific data formats ● Sharing data in-memory between backends ● Additional query execution engines

Slide 40

Slide 40 text

My Future Work ● Better cost modeling ● Query-based data source selection ● Cost-based database system selection

Slide 41

Slide 41 text

Contributing to Apache Calcite ● Pick an existing issue or file a new one and start coding! ● Mailing list is generally very active ● New committers and PMC members regularly added ● Many opportunities for projects at various scales

Slide 42

Slide 42 text

Additional areas for contribution ● Testing (SQL is hard!) ● Incorporating state-of-the-art in DB research ● Access control across multiple systems ● Adapters for new classes of database (eg. array DBs) ● Implement missing SQL features (e.g. set operations) …

Slide 43

Slide 43 text

Thanks to ● Edmon Begoli, Oak Ridge National Laboratory ● Jesús Camacho-Rodríguez, Hortonworks ● Julian Hyde, Hortonworks ● Daniel Lemire, Université du Québec (TÉLUQ) ● All other Calcite contributors!

Slide 44

Slide 44 text

Questions? https://calcite.apache.org/