Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL

Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL

Apache BigData 2016 Conference: http://sched.co/8U0a

When working with BigData & IoT systems we often feel the need for a Common Query Language. The system specific languages usually require longer adoption time and are harder to integrate within the existing stacks.

To fill this gap some NoSql vendors are building SQL access to their systems. Building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allow you to integrate SQL parser, cost-based optimizer, and JDBC with your NoSql system.

We will walk through the process of building a SQL access layer for Apache Geode (In-Memory Data Grid). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of the data system.

Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.

Christian Tzolov

November 15, 2016
Tweet

More Decks by Christian Tzolov

Other Decks in Technology

Transcript

  1. Apache Calcite for Enabling SQL
    Access to NoSQL Data Systems
    such as Apache Geode
    Christian Tzolov

    View full-size slide

  2. 2
    Christian Tzolov
    Engineer at Pivotal,
    Big-Data, Hadoop,
    Spring Cloud Dataflow,
    Apache Geode, Apache HAWQ,
    Apache Committer,
    Apache Crunch PMC member
    [email protected]
    blog.tzolov.net
    Twitter: @christzolov
    https://nl.linkedin.com/in/tzolov
    Whoami
    Disclaimer
    This talk expresses my personal opinions. It is not read or approved by Pivotal
    and does not necessarily reflect the views and opinions of Pivotal nor does it
    constitute any official communication of Pivotal.
    Pivotal does not support any of the code shared here.

    View full-size slide

  3. 3
    Big Data Landscape 2016
    •  Volume
    •  Velocity
    •  Varity
    •  Scalability
    •  Latency
    •  CAP - Consistency vs. Availability

    View full-size slide

  4. 4
    •  SQL
    •  Custom APIs
    –  Key / Value
    –  Fluid API
    –  REST API
    •  {My} Query Language
    Unified Access?
    At What Cost?
    Data Access

    View full-size slide

  5. 5
    •  Apache Apex
    •  Apache Drill
    •  Apache Flink
    •  Apache Hive
    •  Apache Kylin
    •  Apache Phoenix
    •  Apache Samza
    •  Apache Storm
    •  Cascading
    •  Qubole Quark
    •  SQL-Gremlin

    SQL?
    •  Apache Geode

    View full-size slide

  6. Calcite Geode Adapter - Overview
    Geode API and OQL
    SQL/JDBC/ODBC
    Convert SQL relational
    expressions into OQL queries
    Geode Adapter
    (Geode Client)
    Geode Server
    Geode Server
    Geode Server
    Data Data Data
    Push down the relational
    expressions supported by Geode
    OQL and falls back to the Calcite
    Enumerable Adapter for the rest
    Enumerable
    Adapter
    Apache Calcite
    Spring Data
    Geode
    Spring Data API for
    interacting with Geode
    Parse SQL, converts into
    relational expression and
    optimizes

    View full-size slide

  7. Relational Expressions and Optimization
    7
    Scan Scan
    Join
    Filter
    Project
    Customer (c) BookOrder (b)
    on customerNumber
    b.totalPrice > 0
    c.firstName, b.totalPrice
    SELECT b."totalPrice", c."firstName” FROM "BookOrder" as b
    INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber”
    WHERE b."totalPrice" > 0;
    Scan Scan
    Join
    Project
    Customer (c) BookOrder (b)
    on customerNumber
    totalPrice > 0
    c.firstName, b.totalPrice
    Project
    firstName,
    customerNumber
    Filter
    totalPrice,
    customerNumber
    Project
    Optimiz

    View full-size slide

  8. Push Down Candidates
    8
    Relational Operator Geode Support
    LIMIT Supported without FETCH
    PROJECTION Supported
    FILTER Supported
    JOIN Only for collocated data
    AGGREGATE Only for MAX, MIN, SUM, AVG
    SORT Requires DISTINCT statement
    Implemented

    View full-size slide

  9. Apache Geode?
    “… in-memory, distributed database
    with strong consistency built to
    support low latency transactional
    applications at extreme scale”

    View full-size slide

  10. Why Apache Geode?
    10
    5,700 train stations
    4.5 million tickets per day
    20 million daily users
    1.4 billion page views per day
    40,000 visits per second
    7,000 stations
    72,000 miles of track
    23 million passengers daily
    120,000 concurrent users
    10,000 transactions per minute
    https://pivotal.io/big-data/case-study/distributed-in-memory-data-management-solution
    https://pivotal.io/big-data/case-study/scaling-online-sales-for-the-largest-railway-in-the-world-china-railway-corporation
    China Railway

    View full-size slide

  11. 11
    •  In-Memory Data Storage
    –  > 100TB memory
    –  JVM Heap + Off Heap
    •  Any Data Format
    –  Key-Value/Object Store
    •  ACID Compliant Transactions
    •  HA and Linear Scalability
    •  Strong Consistency
    •  Streaming and Event Processing
    –  Listeners
    –  Distributed Functions
    –  Continuous OQL Queries
    •  Multi-site / Inter-cluster
    •  Embedded and Standalone
    •  Top Level Apache Project
    Apache Geode Features

    View full-size slide

  12. Apache Geode Concepts
    Cache Server (member)
    Cache
    Region 1
    Region N
    Val
    Ke
    y
    v1
    k1
    v2
    k2

    Cache - In-memory collection
    of Regions
    Region - consistent, distributed
    Map (key-value),
    Partitioned or Replicated
    CacheServer – process
    connected to the distributed
    system with created Cache
    Client (member)
    Locator (member)
    Client –read and modify the
    content of the distributed
    system
    Locator – tracks
    system members and
    provides membership
    information

    Listeners
    Functions
    Functions – distributed,
    concurrent data
    processing
    Listener – event handler.
    Registers for one or
    more events and notified
    when they occur

    View full-size slide

  13. Geode Topologies
    Cache Server
    Cache Server
    Cache Server
    Cache Data Cache Data Cache Data
    Peer-to-Peer
    Cache Server
    Cache Server
    Cache Server
    Cache Data Cache Data Cache Data
    Client
    Local Cache
    pool Client-Server
    Cache Server
    Cache Server
    Gateway Sender

    Cache Server
    Gateway Receiver
    Cache Server
    Cache Server
    Cache Data Cache Data Cache Data Cache Data
    Gateway Receiver
    Cache Server
    … Gateway Sender
    Cache Server
    Cache Server
    Cache Data Cache Data Cache Data Cache Data
    WAN Multi-site Boundary
    Multi-Site

    View full-size slide

  14. Geode Client API
    •  Client Cache
    •  Key / Value - Region GET, PUT, REMOVE
    •  OQL – QueryService

    View full-size slide

  15. Geode Data Types & Serialization
    •  Key-Value with complex value formats
    •  Portable Data eXchange (PDX) Serialization – Delta propagation, schema
    evolution, polyglot support …
    •  Object Query Language (OQL)
    SELECT p.name
    FROM /Person p
    WHERE p.pet.type = “dino”
    {
    id: 1,
    name: “Fred”,
    age: 42,
    pet: {
    name: “Barney”,
    type: “dino”
    }
    }
    single field deserialization
    nested fields

    View full-size slide

  16. Geode Demo (GFSH and OQL)
    •  Connect to Geode cluster,
    •  List available Regions
    •  Run OQL query

    View full-size slide

  17. Apache Calcite?
    Java framework that allows SQL interface and advanced query
    optimization, for virtually any data system
    •  Query Parser, Validator and Optimizer(s)
    •  Local/Remote JDBC drivers
    •  Streaming
    •  Agnostic to how data is stored and process
    •  Balance SQL completes vs. integrity of Data system native capabilities

    View full-size slide

  18. Apache Calcite Data Types
    •  Catalog – namespaces accessed in queries
    •  Schema - collection of schemas and tables
    •  Table - single data set, collection of rows
    •  RelDataType – SQL fields types in a Table
    Your Data System
    Data System Data Types
    Calcite
    Schema
    SQL Engine
    Table Table
    JDBC
    Table

    Data Type
    Mapping SELECT title, author FROM test.BookMaster
    Data Type Fields Schema Table

    View full-size slide

  19. Calcite Data Types: RelDataType
    19
    Type of a scalar expression or row
    •  RelDataTypeFactory – RelDataType factory
    •  JavaTypeFactory - registers Java classes as record types
    •  JavaTypeFactoryImpl - Java Reflection to build RelDataTypes
    •  SqlTypeFactoryImpl - Default implementation with all SQL types

    View full-size slide

  20. Geode to Calcite Data Types Mapping
    20
    Geode Cache
    Region 1
    Region K
    Val
    Key
    v1
    k1
    v2
    k2

    Calcite Schema
    Table 1
    Table K
    Col1 Col2 ColN
    V(M,1)
    RowM V(M,2) V(M,N)
    V(2,1)
    Row2 V(2,2) V(2,N)
    V(1,1)
    Row1 V(1,2) V(1,N)

    Regions are mapped into Tables
    Geode Cache is mapped into Calcite Schema
    Geode Key/Value is mapped
    into Table Row
    Create Column Types
    (RelDataType) from Geode
    Value class
    (JavaTypeFactoryImpl)

    View full-size slide

  21. Calcite Bootstrap Flow
    21
    Typical calcite initialization flow
    Model
    (JSON)
    SchemaFactory
    Schema
    Tables
    Creates
    Creates
    Configures Calcite
    Creates

    View full-size slide

  22. Calcite Model
    {
    version: '1.0',
    defaultSchema: 'TEST',
    schemas: [ {
    name: 'TEST',
    type: 'custom',
    factory: 'org.apache.calcite.adapter.geode.simple.GeodeSchemaFactory',
    operand: {
    locatorHost: 'localhost',
    locatorPort: '10334',
    regions: 'BookMaster',
    pdxSerializablePackagePath: 'net.tzolov.geode.bookstore.domain.*'
    }
    }]
    }
    Reference to your adapter
    schema factory implementation
    class
    Parameters to be passed to
    your adapter schema factory
    implementation
    The path to .json is passed as JDBC connection argument:
    !connect jdbc:calcite:model=target/test-classes/.json︎
    Schema Name

    View full-size slide

  23. Geode Calcite Schema and Schema Factory
    public class GeodeSchemaFactory implements SchemaFactory {
    public Schema create(SchemaPlus parentSchema, String schemaName, Map operand) {
    String locatorHost = (String) operand.get(“locatorHost”);
    int locatorPort = …
    String[] regionNames = …
    String pdxPackagePath = …
    return new GeodeSchema(locatorHost, locatorPort, regionNames, pdxPackagePath);
    }
    }
    public class GeodeSchema extends AbstractSchema {
    private String regionName = ..
    protected Map getTableMap() {
    final ImmutableMap.Builder builder = ImmutableMap.builder();
    Region region = … Get Geode Region by region name …
    Class valueClass= … Find region’s value type …
    builder.put(regionName, new GeodeScannableTable(regionName, valueClass, clientCache));
    return tableMap;
    }
    Retrieves the parameters set in
    the model.json
    Create an Adapter Schema
    instance with the provided
    parameters.
    Create GeodeScannableTable
    instance for each Geode Region

    View full-size slide

  24. Geode Scannable Table
    public class GeodeScannableTable extends AbstractTable implements ScannableTable {
    public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return new JavaTypeFactoryImpl().createStructType(valueClass);
    }
    public Enumerable scan(DataContext root) {
    return new AbstractEnumerable() {
    public Enumerator enumerator() { return new GeodeEnumerator(clientCache, regionName); }
    }
    public class GeodeEnumerator implements Enumerator {
    private E current;
    private SelectResults geodeIterator;
    public GeodeEnumerator(ClientCache clientCache, String regionName) {
    geodeterator = clientCache.getQueryService().newQuery("select * from /" + regionName).execute().iterator();
    }
    public boolean moveNext() { current = convert(geodeIterator.next()); return true;}
    public E current() {return current;}
    public abstract E convert(Object geodeValue) {
    Convert PDX value into RelDataType row
    }
    Defined in the Linq4j sub-project
    Retrieves the entire Region!!
    Converts Geode value response into Calcite row data
    Uses reflection (or pdx-instance) to builds
    RelDataType from value’s class type
    Returns an Enumeration over the entire
    target data store

    View full-size slide

  25. Geode Demo (Scannable Tables)
    $ ./sqlline ︎
    sqlline> !connect jdbc:calcite:model=target/test-classes/model2.json admin admin︎

    jdbc:calcite> !tables ︎
    jdbc:calcite> SELECT * FROM "BookMaster”;︎
    jdbc:calcite> SELECT "yearPublished", AVG("retailCost") AS “AvgRetailCost” FROM "BookMaster" GROUP BY "yearPublished";︎
    jdbc:calcite> SELECT b."totalPrice", c."firstName” FROM "BookOrder" AS b INNER JOIN "Customer" AS c ON b."customerNumber" = c."customerNumber” WHERE b."totalPrice" > 0;︎


    Without and With Implementation

    View full-size slide

  26. Non-Relational Tables
    26
    Scanned without intermediate relational expression.
    •  ScannableTable - can be scanned
    •  FilterableTable - can be scanned, applying supplied filter expressions
    •  ProjectableFilterableTable - can be scanned, applying supplied filter expressions
    and projecting a given list of columns
    Enumerable scan(DataContext root, List filters, int[] projects);
    Enumerable scan(DataContext root, List filters);
    Enumerable scan(DataContext root);

    View full-size slide

  27. Calcite Ecosystem
    27
    Several “semi-independent” projects.
    JDBC and Avatica Linq4j
    Expression
    Tree
    Enumerable
    Adapter
    Relational
    •  Relational Expressions
    •  Row Expression
    •  Optimization Rules
    •  Planner …
    SQL Parser & AST
    Port of LINQ (Language-Integrated Query)
    to Java.
    Local and Remote
    JDBC driver
    Converts SQL
    queries Into AST
    (SqlNode …)
    3rd party Adapters
    Method for translating
    executable code into data
    (LINQ/MSN port)
    Default (In-memory) Data
    Store Adapter
    implementation.
    Leverages Linq4j
    Relational Algebra,
    expression,
    optimizations …
    Interpreter
    Complies Java code
    generated from linq4j
    “Expressions”. Part of the
    physical plan implementer

    View full-size slide

  28. Calcite SQL Query Execution Flow
    28
    Enumerable
    Interpreter
    Prepare
    SQL,
    Relational,
    Planner
    Geode
    Adapter
    Binder
    JDBC
    Geode
    Cluster
    1
    2
    3
    4
    5
    6 7
    7
    7
    2. Parse SQL, convert to rel.
    expressions. Validate and Optimize
    them
    3. Start building a physical plan from
    the relation expressions
    4. Implement the Geode relations and
    encode them as Expression tree
    5. Pass the Expression tree to the
    Interpreter to generate Java code
    6. Generate and Compile a Binder
    instance that on ‘bind()’ call runs
    Geodes’ query method
    1. On new SQL query JDBC delegates
    to Prepare to prepare the query
    execution
    7. JDBC uses the newly compiled
    Binder to perform the query on the
    Geode Cluster
    Calcite Framework
    Geode Adapter
    2

    View full-size slide

  29. Linq4j and Expression Tree
    29
    (Node) Visitor
    Node
    + accept(Visitor)
    + evaluate(Node)
    Expression
    Statement
    Block
    Condition
    For
    Goto
    Label
    Switch
    Throw
    Try
    While
    Binary Constant MethodCall Parameter Member


    Queryable
    Enumberable
    QueryProvider

    View full-size slide

  30. Bindable Generated Code
    30
    Calcite via Enumerable Converts Expressions into Java Code

    View full-size slide

  31. Calcite Relational Expressions
    31
    RelNode
    Relational
    expression
    TableScan
    Project
    Filter
    Aggregate
    Join
    Intersect
    Sort
    RexlNode
    Row-level
    expressions
    Project, Sort fields
    Filter, Join conditions
    Input Column
    Ref
    Literal
    Struct field
    access
    Function call
    Window
    expressions
    *
    RelTrait
    *
    Physical attribute
    of a relation

    View full-size slide

  32. Calcite Relational Expressions (2)
    32
    RelNode
    + register(RelOptPlander)
    + List getInputs();
    RelOptPlanner
    +findBestExp():RelNode
    RexNode
    RelTrait Convention
    NONE
    *
    *
    EnumberableConvention
    RelOptRule
    + onMatch(call)
    <>
    <>
    MyDBConvention
    ConverterRule
    + RelNode convert(RelNode)
    Converts from one calling
    convention to another
    Convertor
    Indicate that it converts a
    physical attribute only!
    <>
    *
    <>
    *
    <>
    Query optimizer: Transforms a
    relational expression according to
    a given set of rules and a cost model.
    RelOptCluster
    Rule transforms an expression into another. It has a list of
    Operands, which determine whether the rule can be applied to
    a particular section of the tree.
    RelOptRuleOperand
    *
    <>
    Calling convention used
    to represent a single
    data source.
    Inputs to a relational
    expression must be in
    the same convention

    View full-size slide

  33. Calcite Adapter Patterns
    33
    MyAdapterRel
    + implement(implContext)
    MyAdapterConvention
    Convention.Impl(“MyAdapter”)
    Common interface for all MyAdapter
    Relation Expressions. Provides
    implementation callback method called
    as part of physical plan implementation
    ImplContext
    + implParm1
    + implParm2 …
    RelNode
    MyAdapterTable
    + toRel(optTable)
    + asQueryable(provider,…)
    MyAdapterQueryable
    + myQuery(params) :
    Enumetator
    TranslatableTable
    <>
    AbstractQueryableTable
    AbstractTableQueryable <>
    Can convert
    queries in
    Expression
    myQuery() implements the call to your DB
    It is called by the auto generated code. It
    must return an Enumberable instance
    MyAdapterScan
    + register(planer) {
    Registers all MyAdapter Rules
    }
    <>
    MyAdapterToEnumerableConvertorRule
    operands: (RelNode.class,
    MyAdapterConvention, EnumerableConvention)
    ConverterRue
    TableScan
    MyAdapterToEnumerableConvertor
    + implement(EnumerableRelImplementor) {
    ctx = new MyAdapterRel.ImplContext()
    getImputs().implement(ctx)
    return BlockBuild.append( MY_QUERY_REF,
    Expressions.constant(ctx.implParms1),
    Expressions.constant(ctx.implParms2) …
    EnumerableRel
    ConvertorImpl
    <>
    MyAdapterProject
    MyAdapterFilter
    MyAdapterXXX
    RelOptRule
    MyAdapterProjectRu
    MyAdapterFilterRule
    MyAdapterXXXRule
    <>
    Recursively call the implement on each
    MyAdapter Relation Expression
    Encode the myQuery(params) call as
    Expressions
    MY_QUERY_REF = Types.lookupMethod(
    MyAdapterQueryable.class,
    ”myQuery”,
    String.class
    String.class);
    1
    3
    4
    5
    2
    6
    7
    8
    9
    Calcite Framework
    MyAdapter components

    View full-size slide

  34. Calcite with Geode - Without Implementation
    34
    SELECT b."totalPrice", c."firstName"
    FROM "BookOrder" as b
    INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber"
    WHERE b."totalPrice" > 0;

    View full-size slide

  35. Calcite with Geode - With Implementation
    35
    SELECT b."totalPrice", c."firstName" FROM "BookOrder" as b INNER JOIN "Customer" as c
    ON b."customerNumber" = c."customerNumber" WHERE b."totalPrice" > 0;

    View full-size slide

  36. Calcite JDBC Connection
    36

    View full-size slide

  37. What About Testing?
    37

    View full-size slide

  38. Future work
    38
    •  Improve nested data structures support
    •  Push down Join for collocated data sets
    •  Push down the COUNT expression
    •  Push down the ORDER BY expression (requires adding a DISTINCT aggregate
    expression to the query)
    •  Use new Geode OQL aggregation operations (http://bit.ly/2eKApd0) to push down
    MIN, MAX, SUM, AVG aggregations
    •  Beyond OQL (e.g. implement Join, aggregations with custom functions)
    •  Leverage Calcite Streaming with Geode

    View full-size slide

  39. References
    39
    •  Apache Geode Adapter for Apache Calcite: https://github.com/tzolov/calcite
    •  Introduction to Apache Calcite (2016) : http://bit.ly/2fB1iBz
    •  Apache Calcite Overview (2014) : http://bit.ly/2fMJgbS
    •  Introduction to Apache Geode (2016) : http://bit.ly/1Rfztbd
    •  Apache Calcite Project (2016) : https://calcite.apache.org
    •  Apache Geode Project (2016) : http://geode.apache.org
    •  Geode Object Query Language (OQL) : http://bit.ly/2eKywgp
    •  Expression Tree Basic: http://bit.ly/2flBiXH

    View full-size slide