Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Enable SQL/JDBC Access to Apache Geode/GemFire Using Apache Calcite

Enable SQL/JDBC Access to Apache Geode/GemFire Using Apache Calcite

https://springoneplatform.io/sessions/enable-sql-jdbc-access-to-apache-geode-gemfire-using-apache-calcite

When working with BigData & IoT systems we often feel the need for an established, Common Query Language.

To fill this gap some NoSql vendors are building SQL access to their systems. Building SQL engine from scratch is a daunting job and frameworks like Apache Calcite can help you with the heavy lifting. It allows you to integrate SQL parser, Cost-Based Optimizer, and JDBC with your NoSql system. Calcite has been used to empower many BigData platforms such as Hive, Spark, Flink, Drill, HBase/Phoenix to name some.

In this session I will walk you through the process of building a SQL access layer for Apache Geode (GemFire). I will share my experience, pitfalls and technical consideration like balancing between the SQL/RDBMS semantics and the design choices and limitations of In-Memory-Data-Grid systems like Geode.

Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.

Christian Tzolov

December 07, 2017
Tweet

More Decks by Christian Tzolov

Other Decks in Technology

Transcript

  1. Enable SQL/JDBC Access to Apache
    Geode/GemFire Using Apache Calcite
    By Christian Tzolov
    @christzolov
    1

    View Slide

  2. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Christian Tzolov
    Spring Software Engineer at Pivotal
    Spring Cloud Data Flow
    Apache Committer, Crunch PMC member
    Apache {Crunch, HAWQ, Zeppelin, Calcite...}
    2
    blog.tzolov.net
    @christzolov
    @tzolov

    View Slide

  3. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    “What happens if an established NoSQL database
    decides to implement a reasonably standard SQL;
    The only predictable outcome for such an
    eventuality is plenty of argument.”
    Martin Fowler, P.J.Sadalage, NoSQL Distilled, 2012
    3

    View Slide

  4. Disclaimer

    View Slide

  5. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Safe Harbor Statement
    The following is intended to outline the general direction of Pivotal's offerings. It is
    intended for information purposes only and may not be incorporated into any
    contract. Any information regarding pre-release of Pivotal offerings, future updates
    or other planned modifications is subject to ongoing evaluation by Pivotal and is
    subject to change. This information is provided without warranty or any kind,
    express or implied, and is not a commitment to deliver any material, code, or
    functionality, and should not be relied upon in making purchasing decisions
    regarding Pivotal's offerings. These purchasing decisions should only be based on
    features currently available. The development, release, and timing of any features
    or functionality described for Pivotal's offerings in this presentation remain at the
    sole discretion of Pivotal. Pivotal has no obligation to update forward looking
    information in this presentation.
    5

    View Slide

  6. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Demo?
    In a minute ;)
    6

    View Slide

  7. Data Management Systems

    View Slide

  8. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    8

    View Slide

  9. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    One Size Fit All*
    “The traditional DBMS architecture has been used to support many
    data-centric applications with widely varying characteristics and
    requirements… We argue that this concept is no longer applicable to the
    database market, and that the commercial world will fracture into a
    collection of independent database engines, some of which may be
    unified by a common front-end parser.”
    Michael Stonebraker, 2005
    9

    View Slide

  10. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Collection of Independent Databases
    10
    How to ingest and keep
    consistent across DBs?
    How to access the data
    across multiple DBs?

    View Slide

  11. Data Federation

    View Slide

  12. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Federation: Unified Query Interface
    12

    View Slide

  13. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Federation: Two Mapping Tasks
    13

    View Slide

  14. Federation: Data Model Mapping

    View Slide

  15. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Geode Model Mapping
    • PDX serialization
    • Primitive Types
    • Arrays (employ[0])
    • Nested Objects
    (parent[‘child.childChild…’])
    15
    Cache
    Region 1
    Region K
    Val
    Key
    v1
    k1
    v2
    k2
    Schema
    Table 1
    Table K
    C
    o
    l
    1
    C
    o
    l
    2
    C
    o
    l
    N
    V(M,1)
    R
    o
    w
    M
    V(M,2)
    V(M,N
    )
    V(2,1)
    R
    o
    w
    2
    V(2,2) V(2,N)
    V(1,1)
    R
    o
    w
    1
    V(1,2) V(1,N)
    2. Regions into Tables
    1. Cache into Schema
    3. Key/Value into
    Table Row
    4. Column Types inferred
    from PDX fields

    View Slide

  16. Federation: Query Mapping

    View Slide

  17. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    SQL for NoSQL?
    • Extended Relational Algebra - already present in most NoSql data
    • Relational Expression Optimization – Desirable but hard to implement
    17

    View Slide

  18. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Query = Relational Algebra
    18
    SELECT b.totalPrice, c.firstName
    FROM BookOrder AS b
    INNER JOIN BookCustomer AS c ON b.customerNumber = c.customerNumber
    WHERE b.totalPrice > 0

    }
    Scan Scan
    Join
    Filter
    Project
    BookCustomer [c] BookOrder [b]
    (on customerNumber)
    (b.totalPrice > 0)
    (c.firstName, b.totalPrice)
    Scan Scan
    Join
    Project
    BookCustomer [c] BookOrder [b]
    (on
    customerNumber)
    (totalPrice > 0)
    (c.firstName, b.totalPrice)
    Project
    (firstName,
    customerNumber)
    Filter
    (totalPrice,
    customerNumber)
    Project
    <>
    <>

    View Slide

  19. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    “Native” Geode Operations
    19
    Relational Operation Supported on Geode
    SORT/LIMIT YES (no Offset)
    PROJECT YES
    FILTER YES
    AGGREGATE
    SUM, AVG, MAX, MIN,
    COUNT, GROUP BY,
    DISTINCT
    JOIN NO

    View Slide

  20. SQL Adapter Framework

    View Slide

  21. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Apache Calcite
    • Java Framework
    • SQL Query Parser, Validator and Optimizer(s)
    • JDBC drivers - local and remote
    • Agnostic to data storage and processing
    • Powered by Calcite:
    21

    View Slide

  22. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Apache Geode Adapter
    22
    Geode API and OQL
    SQL/JDBC
    Convert relational expressions into OQL queries
    Geode Adapter
    (Geode Client)
    Geode Server
    Geode Server
    Geode Server
    Data Data Data
    Push down all supported expressions to Geode OQL
    and fall back to Calcite Enumerable for the rest
    Enumerable
    Adapter
    Apache Calcite Parse SQL into relational expression and optimizes
    Calcite-Geode-Embedded(e.g.JDBC.jar)

    View Slide

  23. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Apache Geode Concepts
    Cache Server
    Cache
    Region 1
    Region N
    Val
    Key
    v1
    k1
    v2
    k2
    Cache - In-memory collection of
    Regions
    Region - consistent, distributed Map
    (key-value),
    Partitioned or Replicated
    CacheServer – process
    connected to the distributed
    system with created Cache
    Client
    Locator
    Client –read and modify the content
    of the distributed system
    Locator – tracks system members and
    provides membership information

    Listeners
    Functions
    Functions – distributed,
    concurrent data processing
    Listener – event handler.
    Registers for one or more events
    and notified when they occur

    View Slide

  24. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Key/Value vs. OQL
    “to get extreme scale we use GemFire for what it does best: key-value
    storage. Then, when we can’t design our way to using what GemFire does
    best, we can use OQL and treat it like an object data-base”
    Mike Stolz, “Scaling Data Services with Pivotal GemFire®”, 2018

    View Slide

  25. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Query Execution Flow
    25
    2. Parse SQL, convert to rel. expressions. Validate
    and Optimize them
    3. Start building a physical plan from the relation
    expressions
    4. Implement the Geode relations and encode
    them as Expression tree
    5. Pass the Expression tree to the Interpreter to
    generate Java code
    6. Generate and Compile a Binder instance that
    on ‘bind()’ call runs Geodes’ query method
    1. On new SQL query JDBC delegates to Prepare
    to prepare the query execution
    7. JDBC uses the newly compiled Binder to
    perform the query on the Geode Cluster

    View Slide

  26. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Query Mapping - Example
    26
    SELECT author
    FROM BookMaster
    WHERE retailCost > 0
    LIMIT 1

    }
    SELECT author
    FROM BookMaster
    WHERE retailCost > 0
    LIMIT 1 OFFSET 1

    }

    View Slide

  27. Demo

    View Slide

  28. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    DB Tools Integration
    28
    Access Apache Geode /
    GemFire over SQL/JDBC
    Explore Geode & GemFire
    Data with IntelliJ
    SQL/Database tool

    View Slide

  29. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Analytics
    29
    Advanced Apache
    Geode Data Analytics
    with Apache Zeppelin
    over SQL/JDBC

    View Slide

  30. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data Federation
    30
    {Geode|Greenplum|...} Data Federation

    View Slide

  31. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Spatial (OpenGIS /PostGIS)
    31

    View Slide

  32. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    SQL Stream (in progress...)
    32
    https://calcite.apache.org/docs/stream.html

    View Slide

  33. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    SQL Stream (2)
    33
    ● How many rows come out for each row in?
    ● Does each incoming value appear in one total, or more?
    ● What defines the “window”, the set of rows that contribute to a given output row?
    ● Is the result a stream or a relation?
    ● tumbling window (GROUP BY)
    ● hopping window (multi GROUP BY)
    ● sliding window (window functions)
    ● cascading window (window
    functions)

    View Slide

  34. Usability?

    View Slide

  35. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Pros & Cons
    • Ad-hoc data exploration
    • JDBC integration with 3rd party
    tools
    • Data Federation, correlate Geode
    with other data sources
    • SQL Streaming as CQ++
    • No-intrusive and extensible
    approach
    • What happened with SQLFire?
    • Geode == Transactional System!
    SQL+Geode <> Analytical System!
    • Key/Value vs. Full Scan
    • Overhead: SQL > OQL > Functions
    • Data at Rest (Table) vs. Data at
    Motion (Stream)

    View Slide

  36. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Future work
    • Add Geode Adapter to Calcite Project: [CALCITE-2059] – VOTE! ;)
    • Table Materialization
    • SQL Streaming (https://calcite.apache.org/docs/stream.html)
    • Pre-defined vs. Inferred Schema
    • “Native” JOIN operation implementation
    • Geode Indexes and Query statistics as Calcite metadata. Hint Indexes
    • Mixing Key/Value and OQL data access
    • Explore Geode functions as an alternative to OQL
    • Explore TX support
    • Performance Benchmark
    • Calcite Spatial (https://calcite.apache.org/docs/spatial.html)

    View Slide

  37. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Credits
    to Dan Baskette for inspiring this work
    and to Roman Shaposhnik for helping
    spread the word
    37

    View Slide

  38. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    References
    38
    [1 ] Apache Geode: http://geode.apache.org
    [2] Geode Object Query Language (OQL): http://bit.ly/2BfXmNR
    [3] Apache Calcite: https://calcite.apache.org
    [4] Calcite - supported SQL language: https://calcite.apache.org/docs/reference.html
    [5] Apache Geode Adapter for Apache Calcite: https://github.com/tzolov/calcite/tree/geode-1.3
    [6] Relational Algebra Operations: http://bit.ly/2zX8cMc
    [7] Apache Phoenix - http://phoenix.apache.org
    [8] Big Data is Four Different Problems, 2016, M.Stonebraker (video): http://bit.ly/2jDmKpq
    [9] Architecture of a Database System, 2007 (J.M. Hellerstein, M. Stonebraker, J. Hamilton):
    http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
    [10] One Size Fits All (Michael Stonebraker): http://ieeexplore.ieee.org/document/1410100/?denied

    View Slide

  39. Unless otherwise indicated, these slides are © 2013-2016 Pivotal Software, Inc. and licensed under a Creative Commons
    Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Bonus: Geode Adapter Implementation
    39
    Common interface for all MyAdapter Relation
    Expressions. Provides implementation callback
    method called as part of physical plan
    implementation


    <>
    <>
    Can convert
    queries in
    Expression
    myQuery() implements the call to your DB
    It is called by the auto generated code. It must
    return an Enumberable instance
    Registers all MyAdapter Rules
    <>
    operands: (RelNode.class,
    MyAdapterConvention, EnumerableConvention)

    <>
    <>
    Recursively call the implement on each MyAdapter
    Relation Expression
    Encode the myQuery(params) call as
    Expressions

    View Slide

  40. Learn More. Stay Connected.
    Vote for [CALCITE-2059] and don’t miss:
    Simplifying Apache Geode with Spring Data
    Exploring Data-Driven, Cognitive Capabilities in Pivotal Cloud Foundry
    Orchestrating Data Microservices with Spring Cloud Data Flow
    40
    #springone
    @s1p

    View Slide