How we built SQL driver for in-memory data grid

Slide 1

Slide 1 text

How we built SQL driver for in-memory data grid Oleksiy Dyagilev Chief Software Engineer, Epam

Slide 2

Slide 2 text

About me Contribute to GigaSpaces projects: • InsightEdge: Data Analytics Platform powered by Spark http://github.com/InsightEdge • XAP: in-memory data grid http://github.com/xap Blogging at http://dyagilev.org/

Slide 3

Slide 3 text

About GigaSpaces XAP in 5 mins

Slide 4

Slide 4 text

Scale-out In-Memory Storage Scale by sharding or replication Low latency and High throughput In-memory storage, fast indexing, co-located ops, batch ops API and query language SQL-like against POJOs and schema-free documents. JDBC driver, lots of limitations ACID transactions Maintain full ACID compliance against your data set through full transaction semantics High availability and resiliency Fault tolerance through replication, cross-data center replication, auto-healing Data Tiering RAM and SSD

Slide 5

Slide 5 text

What customers want: plug-and-play BI

Slide 6

Slide 6 text

BI connector architecture BI tools ODBC JDBC XAP Typically supported ODBC/JDBC bridge software (EasySoft, OpenLink) The missing part

Slide 7

Slide 7 text

We need an ODBC/JDBC driver. ANSI compliant! • What are the options? • Enhance existing driver • custom SQL parser • extended SQL with ability to query embedded objects, geolocation • Create a new one • leverage industry-standard SQL parser, optimizer, JDBC driver • reimplement custom operations (though BI don’t support it) ”Modern relational query optimizers are complex pieces of code and typically represent 40 to 50 developer-years of effort” Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Mc Graw Hill, 2000

Slide 8

Slide 8 text

Stages of query execution • Parse • parse SQL query • Validate • validate against metadata • Optimize • logical plan is optimized and converted into physical expression • Execute • physical plan is converted into application-specific execution SQL Relational algebra Runnable query plan

Slide 9

Slide 9 text

Apache Calcite https://www.slideshare.net/julianhyde/apache-calcite-overview • Top level Apache project since Oct, 2015 • Query planning framework • Relational algebra, rewrite rules, cost model, extensible • Packaging • Library (JDBC server optional) • Adoption • Embedded: Lingual (SQL interface for Cascading), Apache Drill, Apache Hive, Apache Kylin • Adapters: Splunk, JDBC, CSV, Phoenix, etc

Slide 10

Slide 10 text

Relational Algebra SELECT d.name, COUNT(*) as c FROM Employee e JOIN Department d ON e.depId = d.id WHERE e.age < 30 GROUP BY d.id HAVING count(*) > 5 ORDER BY c DESC Scan [Emp] Scan [Dept] Join [e.depId = d.id] Filter [e.age < 30] Aggregate [id, count(*) as c] Filter [c > 5] Project [name, c] Sort [c DESC]

Slide 11

Slide 11 text

Calcite query optimizer • Volcano planner (Goetz Graefe, 1991) implementation • Query optimizations are modeled as pluggable rules • Planner rules operate by looking for patterns in the query parse tree

Slide 12

Slide 12 text

Example: FilterIntoJoinRule optimization SELECT d.name, COUNT(*) as c FROM Employee e JOIN Department d ON e.depId = d.id WHERE e.age < 30 GROUP BY d.id HAVING count(*) > 5 ORDER BY c DESC Scan [Emp] Scan [Dept] Join [e.depId = d.id] Filter [e.age < 30] Aggregate [id, count(*) as c] Filter [c > 5] Project [name, c] Sort [c DESC] Scan [Emp] Scan [Dept] Join [e.depId = d.id] Filter [e.age < 30] Aggregate [id, count(*) as c] Filter [c > 5] Project [name, c] Sort [c DESC] FilterIntoJoinRule

Slide 13

Slide 13 text

The key idea: adapter • Push as much work as possible to the grid side (predicates, projects, joins, sorting) • Leave to Calcite implementation everything we cannot handle on the grid Custom optimization rules XAP grid (server side) Calcite (client side) Logical plan Physical plan

Slide 14

Slide 14 text

Example SELECT e.id, e.name, e.departmentId FROM Employee e JOIN Department d ON e.departmentId = d.id WHERE d.id = 2 ORDER BY e.name DESC EnumerableSort(sort0=[$1], dir0=[DESC]) EnumerableCalc(expr#0..3=[{inputs}], id=[$t2], name=[$t3], departmentId=[$t1]) XapToEnumerableConverter XapJoin(condition=[=($1, $0)], joinType=[inner]) XapProject(id=[$0]) XapFilter(condition=[=(CAST($0):INTEGER, 2)]) XapTableScan(table=[[sql-test-space, Department]]) XapProject(departmentId=[$0], id=[$2], name=[$5]) XapTableScan(table=[[sql-test-space, Employee]]) XAP grid (server side) Calcite (client side) XAP grid Calcite

Slide 15

Slide 15 text

XapFilterRule • Demo in IDE

Slide 16

Slide 16 text

Query processing • Flat query (project, filter, sort) • Aggregated query (aggregate) • Join query

Slide 17

Slide 17 text

Flat/aggregated query execution Client Partition 1 Partition 2 Partition 3 native XAP API XAP grid • XAP SQLQuery API to execute flat query • XAP Aggregators API for aggregated query

Slide 18

Slide 18 text

JOIN traits • Collocated / non-collocated. • Join tables are partitioned(routed) on the condition column • Equality conjunction (EquiJoin) • Join condition only consists of equality conjunctions, e.g. t1.c1 = t2.c2 AND t3.c2=t2.c4 AND ...

Slide 19

Slide 19 text

Collocated JOIN Client Partition 1 distributed task • Join tables are partitioned(routed) on the condition column • All the data is available on a partition. • No data movement. • Distributed Task (kind-of MapReduce) • Join in mapper • Union record in reducer • Scales well Table A Table B Partition 2 Table A Table B Partition 3 Table A Table B

Slide 20

Slide 20 text

Non-collocated JOIN • A portion of ‘Probe’ table is located on every partition • The whole ‘Build’ table is loaded on every partition(the missing data is moved) Client Partition 1 distributed task Table A Table B Partition 2 Table A Table B Partition 3 Table A Table B ‘Build’ table movement

Slide 21

Slide 21 text

JOIN Algorithms • Classic JOIN algorithms: • Hash Join (primary one, only applicable with Equality Conjunction) • Nested loop (fallback) • Merge Sort Join (currently not used)

Slide 22

Slide 22 text

Hash Join • Smaller relation is used as a ’build’, the other one as a ‘probe’ • The build should fit into the partition memory. Otherwise we get OOM • We scan ‘probe’ in place in a streaming fashion to allocate minimum memory and avoid GC

Slide 23

Slide 23 text

Testing • Unit tests • Run embedded grid instance • Check explain plan • Check query results • Check what XAP native query was pushed-down using Mockito Spy • TPC-DS benchmark

Slide 24

Slide 24 text

Unit tests @Test public void testOrderJoinProjectFilter() { CalciteAssert.that() .with(connectionFactory) .query("SELECT e.id, e.name, e.departmentId " + "FROM Employee e " + "JOIN Department d ON e.departmentId = d.id " + "WHERE d.id = 2 " + "ORDER BY e.name DESC") .returnsOrdered( "id=5; name=Smith; departmentId=2", "id=4; name=Ford; departmentId=2", "id=3; name=Clark; departmentId=2", "id=6; name=Adams; departmentId=2" ) .explainContains("PLAN=EnumerableSort(sort0=[$1], dir0=[DESC])\n" + " EnumerableCalc(expr#0..3=[{inputs}], id=[$t2], name=[$t3], departmentId=[$t1])\n" + " XapToEnumerableConverter\n" + " XapJoin(condition=[=($1, $0)], joinType=[inner])\n" + " XapProject(id=[$0])\n" + " XapFilter(condition=[=(CAST($0):INTEGER, 2)])\n" + " XapTableScan(table=[[sql-test-space, Department]])\n" + " XapProject(departmentId=[$0], id=[$2], name=[$5])\n" + " XapTableScan(table=[[sql-test-space, Employee]])"); }

Slide 25

Slide 25 text

TPC-DS benchmark • TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems. • Can run 40% of queries • Revealed a couple of bugs • Revealed bugs in Calcite • Some queries take too much time in the planner

Slide 26

Slide 26 text

Future potential work • Statistics (cost model) • Enables JOIN ordering optimization • Multi-join push down • Move Calcite runtime to the grid side • Use SQL struct type to support embedded objects

Slide 27

Slide 27 text

JOIN ordering TPC-DS Q3: SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ss_ext_sales_price) sum_agg FROM date_dim dt, store_sales, item WHERE dt.d_date_sk = store_sales.ss_sold_date_sk AND store_sales.ss_item_sk = item.i_item_sk AND item.i_manufact_id = 436 AND dt.d_moy = 12 GROUP BY dt.d_year , item.i_brand , item.i_brand_id ORDER BY dt.d_year , sum_agg DESC, brand_id LIMIT 10 fact table: store_sales (82B) two dimension table: date_dim (73K) and item (462K)

Slide 28

Slide 28 text

Cardinality Pictures taken from https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/

Slide 29

Slide 29 text

Two JOIN orderings Pictures taken from https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/

Slide 30

Slide 30 text

Cardinality Estimation. Intuition behind Histograms.

Slide 31

Slide 31 text

Selection query For logical AND expression, its filter selectivity is fs(a AND b) = fs(a) * fs (b) For logical-OR expression, its filter selectivity fs (a OR b) = fs (a) + fs (b) - fs (a AND b) = fs (a) + fs (b) – (fs (a) * fs (b)) For logical-NOT expression, its filter selectivity is fs (NOT a) = 1.0 - fs (a) f – number of tuples in the bucket dv – number of distinct values in the bucket