Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Federated Queries with HAWQ - SQL on Hadoop and Beyond

Federated Queries with HAWQ - SQL on Hadoop and Beyond

Spring One 2015 presentation: http://bit.ly/1LQlTrL
InfoQ video: http://bit.ly/2wpzwgs

In the space of Big Data, Pivotal offers two powerful data processing tools namely HAWQ and GemFire. HAWQ is a scalable OLAP SQL-on-Hadoop system, while GemFire is OLTP like, in-memory data grid and event processing system. This presentation will show different integration approaches that allow integration and data exchange between HAWQ and GemFire. The practical experience in applying Spring Boot and Spring XD for some of the use cases will be shared while walking you through the implementation of the different Integration strategies. Amongst other we will show an integration path that leverages SpringXD to ingest GemFire data and store it in HDFS as well as the benefits of using Spring Boot to implement REStful proxy for the HAWQ Web Table integration scenario.

Christian Tzolov

September 15, 2015
Tweet

More Decks by Christian Tzolov

Other Decks in Technology

Transcript

  1. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    SPRINGONE2GX
    WASHINGTON, DC
    Federated Queries with HAWQ
    SQL on Hadoop and Beyond
    By Christian Tzolov
    @christzolov

    View Slide

  2. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Safe Harbor Statement
    The following is intended to outline the general direction of Pivotal's offerings. It is
    intended for information purposes only and may not be incorporated into any
    contract. Any information regarding pre-release of Pivotal offerings, future updates
    or other planned modifications is subject to ongoing evaluation by Pivotal and is
    subject to change. This information is provided without warranty or any kind,
    express or implied, and is not a commitment to deliver any material, code, or
    functionality, and should not be relied upon in making purchasing decisions
    regarding Pivotal's offerings. These purchasing decisions should only be based on
    features currently available. The development, release, and timing of any features
    or functionality described for Pivotal's offerings in this presentation remain at the
    sole discretion of Pivotal. Pivotal has no obligation to update forward looking
    information in this presentation.
    2

    View Slide

  3. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    3
    Bio
    Christian Tzolov
    Pivotal Technical Architect,
    BigData, Hadoop, SpringXD
    Apache Committer, Crunch PMC member
    [email protected]
    @christzolov
    blog.tzolov.net

    View Slide

  4. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Contents
    • Data System Concepts
    • Data System Reference Architectures.
    • Real Life Use Case: OLTP and OLAP Data Systems Integration
    • Passive Data Synchronization (Demo)
    • Federated Queries With HAWQ
    • HAWQ Web Tables
    • HAWQ PXF Architecture
    • Geode PXF (Demo)
    4

    View Slide

  5. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    How Compute
    Arbitrary Functions
    on Arbitrary Data

    Query = Function (All Data)
    5

    View Slide

  6. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data System Concepts
    6

    View Slide

  7. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data
    What is Data?
    • Data is Information
    • Derived vs. Fact Information
    • Not derived information will be called Data
    • Data hold to be true simply because it exists
    • All other information is derived from the Data
    7

    View Slide

  8. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data System
    System that computes arbitrary functions on arbitrary data.
    • Low Latency
    • High Throughput
    • Scalability
    • Ad hoc queries
    • Provenance
    8

    View Slide

  9. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data System Principles
    The 6-elements Application = Data-Native System
    • Fact Data
    • Immutable Data
    • Deterministic Functions
    • Data-Lineage
    • Data Locality - space or temporal
    • All Data vs. Working Set
    9

    View Slide

  10. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data System – Key Properties
    10

    View Slide

  11. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data System Architectures
    11

    View Slide

  12. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data System Architectural Patterns
    How to build Data Systems?
    • Data Lake
    • Lambda
    • Kappa / Samza
    • Tachyon
    12

    View Slide

  13. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Data Lake (From ETL to ELT)
    Single Store For All Data
    • Store Raw Data
    • No Data Schema Limitations
    • No Data Volume Limitations
    • Late evaluation
    • Extract-Transform-Load (ETL) vs. Extract-Load-Transform (ELT)
    • Data Lineage
    13

    View Slide

  14. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    • Batch Layer
    • Serving Layer
    • Speed Layer
    14
    Lambda Architecture

    View Slide

  15. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    • Durable Queues (Kafka)
    • Replaying History
    • Versioned Outputs
    • Dynamic Output Switch
    15
    Kappa Architecture

    View Slide

  16. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Tachyon
    No write speed constrains
    • Data-Lineage Based Fault Tolerance
    • Asynchronous Checkpoints
    • Unbounded Writes
    16

    View Slide

  17. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Real Life Use Case
    17

    View Slide

  18. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Real Life Use Case
    • Integrate an In-Memory Data Grid - Geode (e.g. GemFire) with SQL-On-Hadoop
    OLAP system - HAWQ
    • Provide an unified data view across both systems
    • Use Geode as Slowly Changing Dimensions (SCDs) store for HAWQ
    • Keep the Operational and Historical data in Sync
    18

    View Slide

  19. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    • Cache - Performance / Consistency / Resiliency
    • Region
    • Distributed j.u.Map on steroids
    • Highly available, redundant
    • Distributed
    • Locator, Server, Client
    Apache Geode & GemFire (OLTP)
    19

    View Slide

  20. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Geode use cases…
    20
    China Railway

    Corporation
    5,700 train stations
    4.5 million tickets per day
    20 million daily users
    1.4 billion page views per day
    40,000 visits per second
    * http://pivotal.io/big-data/pivotal-gemfire
    Indian Railways
    7,000 stations
    72,000 miles of track
    23 million passengers daily
    120,000 concurrent users
    10,000 transactions per minute

    View Slide

  21. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    HAWQ (SQL-on-HADOOP)
    • Enterprise SQL on Hadoop analytic engine built around a high-performance
    Greenplum MPP Database
    • HAWQ runs natively on Apache Hadoop, HDFS and YARN.
    • Supports multiple Hadoop file formats: Parquet, HDFS and Avro
    • 100% ANSI SQL compliant (ANSI SQL-92, SQL-99, and SQL-2003…)
    • ODBC and JDBC connectivity
    • Implemented in C and C++
    21

    View Slide

  22. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    HAWQ (SQL-on-HADOOP)
    • Runs TPC-DS benchmark in half the wall clock time that Impala
    • Beats Impala by overall 454% in performance
    • Compared to Hive, HAWQ provides an additional of 344% of performance
    improvement on complex queries
    • Impala and Apache Hive™ do not support all standard TPC-DS queries
    • HAWQ runs 100% of them natively.
    22

    View Slide

  23. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Test Environment
    • HDFS – Distributed file system. PHD or HDP Hadoop distributions.
    • HAWQ - SQL-on-Hadoop.
    • Geode/GemFire - In-memory data grid.
    • SpringXD – runtime integration and streaming platform.
    • Apache Ambari – Manages all Clusters.
    • Apache Zeppelin – Web interface for interacting with different Data Systems.
    23
    GEODE
    Locator,
    Servers
    HAWQ
    Master,
    Segments
    HDFS
    NameNode, DataNodes
    SPRING
    XD
    Admin,
    Containers
    Zeppelin Notebook
    Ambari

    View Slide

  24. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Test Environment - Ambari Management
    24

    View Slide

  25. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Passive Data Synchronization
    25

    View Slide

  26. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Passive Synchronization (Case 1)
    26

    View Slide

  27. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Passive Synchronization (Case 1) - Demo
    27

    View Slide

  28. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Passive Synchronization (Case 2) - gpfdist
    28

    View Slide

  29. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Passive Synchronization (Case 2): Demo
    29

    View Slide

  30. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Federated Queries With HAWQ
    30

    View Slide

  31. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Access dynamic data sources on a web server or by executing OS scripts.
    31
    HAWQ Web Table – Architecture

    View Slide

  32. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    HAWQ Web Table
    • HAWQ Web Table - access dynamic data sources on a web server or by
    executing OS scripts.
    • Leverage Geode REST API and OQL.
    • RESTful SpringBoot app to convert JSON into TSV.
    32
    CREATE EXTERNAL WEB TABLE EMPLOYEE_WEB_TABLE (...)
    EXECUTE E'curl http:///gemfire-api/v1/
    queries/adhoc?q='
    ON MASTER FORMAT 'text' (delimiter '|' null 'null' escape E'\\');

    View Slide

  33. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    HAWQ Web Table – Limitations
    • Not Scalable
    • No “Filter Push Down”
    • Inflexibility
    • No Compression
    • Requires Additional Components
    33

    View Slide

  34. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    HAWQ PXF/Geode External Table
    34

    View Slide

  35. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Pivotal eXtension Framework (PXF)
    • Java-Based
    • Parallel, High Throughput Data Access
    • Heterogeneous Data Sources.
    • ANSI-compliant SQL On Any Dataset
    • Wide variety of PXF plugins
    35

    View Slide

  36. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    PXF Architecture
    36

    View Slide

  37. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    PXF Data Model
    • Data Source is modeled as a collection of one or more Fragments.
    • Each Fragment consists of many Rows that in turn are split into typed Fields.
    • Analyzer (optional) provides PXF statistical data for the HAWQ query optimizer
    • Metadata about the data source locations, access attributes, table schemas
    formats, SQL queries filters, etc
    37

    View Slide

  38. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    PXF Processors
    38

    View Slide

  39. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    PXF Runtime Model
    39

    View Slide

  40. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    CREATE EXTERNAL TABLE ext_table_name

    LOCATION('pxf://:/path/to/data?
    FRAGMENTER=package.name.FragmenterForX&
    ACCESSOR=package.name.AccessorForX&
    RESOLVER=package.name.ResolverForX&
    =')
    FORMAT ‘custom'(formatter='pxfwritable_import');
    40
    PXF External Tables Definition

    View Slide

  41. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Available PXF Plugins
    • HdfsTextSimple
    • HdfsTextMulti
    • Hive
    • HiveRC
    • HiveText
    • Hbase
    • Avro
    41
    • Accumulo
    • Casandra
    • JSON
    • Redis
    • Geode/Gemfire
    • Pipes

    View Slide

  42. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Build Geode PXF Plugin
    42

    View Slide

  43. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

    GEMFIRE
    A profile for reading Gemfire data

    io.pivotal.pxf.plugins.gemfire.GemfireFragmenter
    fragmenter>
    io.pivotal.pxf.plugins.gemfire.GemfireAccessor
    io.pivotal.pxf.plugins.gemfire.GemfireResolver


    43
    PXF Geode Profile

    View Slide

  44. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Geode External Table Definition
    CREATE EXTERNAL TABLE (...)
    LOCATION('pxf:///?
    PROFILE=GEMFIRE &
    LOCATORS= &
    REGION=')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    44

    View Slide

  45. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    public class GemfireFragmenter extends Fragmenter {
    public GemfireFragmenter(InputData inputData) throws Exception {
    super(inputData);
    }
    public List getFragments() throws Exception {
    return Arrays.asList (new Fragment(
    inputData.getDataSource(),
    new String[] { inputData.getUserProperty("URL-HOST") },
    new byte[0]));
    }
    }
    45
    Geode Fragmenter

    View Slide

  46. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    public class GemfireAccessor extends Plugin implements ReadAccessor {
    public GemfireAccessor(InputData input) throws Exception {
    super(input);
    }
    public boolean openForRead() throws Exception { … }
    public OneRow readNextObject() throws Exception {…}
    public void closeForRead() throws Exception {…}
    }
    46
    Geode Accessor

    View Slide

  47. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    public class GemfireResolver extends Plugin implements ReadResolver {
    public GemfireResolver(InputData input) {
    super(input);
    }
    public List getFields(OneRow paramOneRow) throws Exception {
    }
    }
    47
    Geode Resolver

    View Slide

  48. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Demo 3: Federated queries with PXF/Geode
    48

    View Slide

  49. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    Demo 3: Federated queries with PXF/Geode
    49

    View Slide

  50. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a

    Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
    50
    By Christian Tzolov
    @christzolov
    HADOOP WORKFLOWS AND DISTRIBUTED YARN APPS USING SPRING
    TECHNOLOGIES
    Learn More. Stay Connected.
    @springcentral Spring.io/video

    View Slide