Federated Queries with HAWQ - SQL on Hadoop and Beyond

Federated Queries with HAWQ - SQL on Hadoop and Beyond

Spring One 2015 presentation: http://bit.ly/1LQlTrL
InfoQ video: http://bit.ly/2wpzwgs

In the space of Big Data, Pivotal offers two powerful data processing tools namely HAWQ and GemFire. HAWQ is a scalable OLAP SQL-on-Hadoop system, while GemFire is OLTP like, in-memory data grid and event processing system. This presentation will show different integration approaches that allow integration and data exchange between HAWQ and GemFire. The practical experience in applying Spring Boot and Spring XD for some of the use cases will be shared while walking you through the implementation of the different Integration strategies. Amongst other we will show an integration path that leverages SpringXD to ingest GemFire data and store it in HDFS as well as the benefits of using Spring Boot to implement REStful proxy for the HAWQ Web Table integration scenario.

B53016292e87bca26da88fc940070c4f?s=128

Christian Tzolov

September 15, 2015
Tweet

Transcript

  1. 1.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ SPRINGONE2GX WASHINGTON, DC Federated Queries with HAWQ SQL on Hadoop and Beyond By Christian Tzolov @christzolov
  2. 2.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Safe Harbor Statement The following is intended to outline the general direction of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. This information is provided without warranty or any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. These purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings in this presentation remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward looking information in this presentation. 2
  3. 3.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ 3 Bio Christian Tzolov Pivotal Technical Architect, BigData, Hadoop, SpringXD Apache Committer, Crunch PMC member ctzolov@pivotal.io @christzolov blog.tzolov.net
  4. 4.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Contents • Data System Concepts • Data System Reference Architectures. • Real Life Use Case: OLTP and OLAP Data Systems Integration • Passive Data Synchronization (Demo) • Federated Queries With HAWQ • HAWQ Web Tables • HAWQ PXF Architecture • Geode PXF (Demo) 4
  5. 5.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ How Compute Arbitrary Functions  on Arbitrary Data  Query = Function (All Data) 5
  6. 6.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Concepts 6
  7. 7.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data What is Data? • Data is Information • Derived vs. Fact Information • Not derived information will be called Data • Data hold to be true simply because it exists • All other information is derived from the Data 7
  8. 8.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System System that computes arbitrary functions on arbitrary data. • Low Latency • High Throughput • Scalability • Ad hoc queries • Provenance 8
  9. 9.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Principles The 6-elements Application = Data-Native System • Fact Data • Immutable Data • Deterministic Functions • Data-Lineage • Data Locality - space or temporal • All Data vs. Working Set 9
  10. 10.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System – Key Properties 10
  11. 11.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Architectures 11
  12. 12.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Architectural Patterns How to build Data Systems? • Data Lake • Lambda • Kappa / Samza • Tachyon 12
  13. 13.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data Lake (From ETL to ELT) Single Store For All Data • Store Raw Data • No Data Schema Limitations • No Data Volume Limitations • Late evaluation • Extract-Transform-Load (ETL) vs. Extract-Load-Transform (ELT) • Data Lineage 13
  14. 14.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ • Batch Layer • Serving Layer • Speed Layer 14 Lambda Architecture
  15. 15.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ • Durable Queues (Kafka) • Replaying History • Versioned Outputs • Dynamic Output Switch 15 Kappa Architecture
  16. 16.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Tachyon No write speed constrains • Data-Lineage Based Fault Tolerance • Asynchronous Checkpoints • Unbounded Writes 16
  17. 17.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Real Life Use Case 17
  18. 18.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Real Life Use Case • Integrate an In-Memory Data Grid - Geode (e.g. GemFire) with SQL-On-Hadoop OLAP system - HAWQ • Provide an unified data view across both systems • Use Geode as Slowly Changing Dimensions (SCDs) store for HAWQ • Keep the Operational and Historical data in Sync 18
  19. 19.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ • Cache - Performance / Consistency / Resiliency • Region • Distributed j.u.Map on steroids • Highly available, redundant • Distributed • Locator, Server, Client Apache Geode & GemFire (OLTP) 19
  20. 20.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Geode use cases… 20 China Railway
 Corporation 5,700 train stations 4.5 million tickets per day 20 million daily users 1.4 billion page views per day 40,000 visits per second * http://pivotal.io/big-data/pivotal-gemfire Indian Railways 7,000 stations 72,000 miles of track 23 million passengers daily 120,000 concurrent users 10,000 transactions per minute
  21. 21.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ (SQL-on-HADOOP) • Enterprise SQL on Hadoop analytic engine built around a high-performance Greenplum MPP Database • HAWQ runs natively on Apache Hadoop, HDFS and YARN. • Supports multiple Hadoop file formats: Parquet, HDFS and Avro • 100% ANSI SQL compliant (ANSI SQL-92, SQL-99, and SQL-2003…) • ODBC and JDBC connectivity • Implemented in C and C++ 21
  22. 22.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ (SQL-on-HADOOP) • Runs TPC-DS benchmark in half the wall clock time that Impala • Beats Impala by overall 454% in performance • Compared to Hive, HAWQ provides an additional of 344% of performance improvement on complex queries • Impala and Apache Hive™ do not support all standard TPC-DS queries • HAWQ runs 100% of them natively. 22
  23. 23.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Test Environment • HDFS – Distributed file system. PHD or HDP Hadoop distributions. • HAWQ - SQL-on-Hadoop. • Geode/GemFire - In-memory data grid. • SpringXD – runtime integration and streaming platform. • Apache Ambari – Manages all Clusters. • Apache Zeppelin – Web interface for interacting with different Data Systems. 23 GEODE Locator, Servers HAWQ Master, Segments HDFS NameNode, DataNodes SPRING XD Admin, Containers Zeppelin Notebook Ambari
  24. 24.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Test Environment - Ambari Management 24
  25. 25.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Data Synchronization 25
  26. 26.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 1) 26
  27. 27.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 1) - Demo 27
  28. 28.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 2) - gpfdist 28
  29. 29.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 2): Demo 29
  30. 30.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Federated Queries With HAWQ 30
  31. 31.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Access dynamic data sources on a web server or by executing OS scripts. 31 HAWQ Web Table – Architecture
  32. 32.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ Web Table • HAWQ Web Table - access dynamic data sources on a web server or by executing OS scripts. • Leverage Geode REST API and OQL. • RESTful SpringBoot app to convert JSON into TSV. 32 CREATE EXTERNAL WEB TABLE EMPLOYEE_WEB_TABLE (...) EXECUTE E'curl http://<adapter proxy>/gemfire-api/v1/ queries/adhoc?q=<URLencoded OQL statement>' ON MASTER FORMAT 'text' (delimiter '|' null 'null' escape E'\\');
  33. 33.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ Web Table – Limitations • Not Scalable • No “Filter Push Down” • Inflexibility • No Compression • Requires Additional Components 33
  34. 34.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ PXF/Geode External Table 34
  35. 35.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Pivotal eXtension Framework (PXF) • Java-Based • Parallel, High Throughput Data Access • Heterogeneous Data Sources. • ANSI-compliant SQL On Any Dataset • Wide variety of PXF plugins 35
  36. 36.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Architecture 36
  37. 37.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Data Model • Data Source is modeled as a collection of one or more Fragments. • Each Fragment consists of many Rows that in turn are split into typed Fields. • Analyzer (optional) provides PXF statistical data for the HAWQ query optimizer • Metadata about the data source locations, access attributes, table schemas formats, SQL queries filters, etc 37
  38. 38.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Processors 38
  39. 39.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Runtime Model 39
  40. 40.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ CREATE EXTERNAL TABLE ext_table_name <attr list, ...> LOCATION('pxf://<host>:<port>/path/to/data? FRAGMENTER=package.name.FragmenterForX& ACCESSOR=package.name.AccessorForX& RESOLVER=package.name.ResolverForX& <Other custom user options>=<value>') FORMAT ‘custom'(formatter='pxfwritable_import'); 40 PXF External Tables Definition
  41. 41.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Available PXF Plugins • HdfsTextSimple • HdfsTextMulti • Hive • HiveRC • HiveText • Hbase • Avro 41 • Accumulo • Casandra • JSON • Redis • Geode/Gemfire • Pipes
  42. 42.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Build Geode PXF Plugin 42
  43. 43.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ <profile> <name>GEMFIRE</name> <description>A profile for reading Gemfire data</description> <plugins> <fragmenter>io.pivotal.pxf.plugins.gemfire.GemfireFragmenter</ fragmenter> <accessor>io.pivotal.pxf.plugins.gemfire.GemfireAccessor</accessor> <resolver>io.pivotal.pxf.plugins.gemfire.GemfireResolver</resolver> </plugins> </profile> 43 PXF Geode Profile
  44. 44.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Geode External Table Definition CREATE EXTERNAL TABLE <GEMFIRE_TABLE_NAME> (...) LOCATION('pxf://<namenode>/<path>? PROFILE=GEMFIRE & LOCATORS=<gemfire-server:port> & REGION=<region-name>') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); 44
  45. 45.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ public class GemfireFragmenter extends Fragmenter { public GemfireFragmenter(InputData inputData) throws Exception { super(inputData); } public List<Fragment> getFragments() throws Exception { return Arrays.asList (new Fragment( inputData.getDataSource(), new String[] { inputData.getUserProperty("URL-HOST") }, new byte[0])); } } 45 Geode Fragmenter
  46. 46.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ public class GemfireAccessor extends Plugin implements ReadAccessor { public GemfireAccessor(InputData input) throws Exception { super(input); } public boolean openForRead() throws Exception { … } public OneRow readNextObject() throws Exception {…} public void closeForRead() throws Exception {…} } 46 Geode Accessor
  47. 47.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ public class GemfireResolver extends Plugin implements ReadResolver { public GemfireResolver(InputData input) { super(input); } public List<OneField> getFields(OneRow paramOneRow) throws Exception { } } 47 Geode Resolver
  48. 48.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Demo 3: Federated queries with PXF/Geode 48
  49. 49.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Demo 3: Federated queries with PXF/Geode 49
  50. 50.

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software,

    Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ 50 By Christian Tzolov @christzolov HADOOP WORKFLOWS AND DISTRIBUTED YARN APPS USING SPRING TECHNOLOGIES Learn More. Stay Connected. @springcentral Spring.io/video