Slide 1

Slide 1 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ SPRINGONE2GX WASHINGTON, DC Federated Queries with HAWQ SQL on Hadoop and Beyond By Christian Tzolov @christzolov

Slide 2

Slide 2 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Safe Harbor Statement The following is intended to outline the general direction of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. This information is provided without warranty or any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. These purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings in this presentation remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward looking information in this presentation. 2

Slide 3

Slide 3 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ 3 Bio Christian Tzolov Pivotal Technical Architect, BigData, Hadoop, SpringXD Apache Committer, Crunch PMC member [email protected] @christzolov blog.tzolov.net

Slide 4

Slide 4 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Contents • Data System Concepts • Data System Reference Architectures. • Real Life Use Case: OLTP and OLAP Data Systems Integration • Passive Data Synchronization (Demo) • Federated Queries With HAWQ • HAWQ Web Tables • HAWQ PXF Architecture • Geode PXF (Demo) 4

Slide 5

Slide 5 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ How Compute Arbitrary Functions  on Arbitrary Data  Query = Function (All Data) 5

Slide 6

Slide 6 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Concepts 6

Slide 7

Slide 7 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data What is Data? • Data is Information • Derived vs. Fact Information • Not derived information will be called Data • Data hold to be true simply because it exists • All other information is derived from the Data 7

Slide 8

Slide 8 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System System that computes arbitrary functions on arbitrary data. • Low Latency • High Throughput • Scalability • Ad hoc queries • Provenance 8

Slide 9

Slide 9 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Principles The 6-elements Application = Data-Native System • Fact Data • Immutable Data • Deterministic Functions • Data-Lineage • Data Locality - space or temporal • All Data vs. Working Set 9

Slide 10

Slide 10 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System – Key Properties 10

Slide 11

Slide 11 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Architectures 11

Slide 12

Slide 12 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data System Architectural Patterns How to build Data Systems? • Data Lake • Lambda • Kappa / Samza • Tachyon 12

Slide 13

Slide 13 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data Lake (From ETL to ELT) Single Store For All Data • Store Raw Data • No Data Schema Limitations • No Data Volume Limitations • Late evaluation • Extract-Transform-Load (ETL) vs. Extract-Load-Transform (ELT) • Data Lineage 13

Slide 14

Slide 14 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ • Batch Layer • Serving Layer • Speed Layer 14 Lambda Architecture

Slide 15

Slide 15 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ • Durable Queues (Kafka) • Replaying History • Versioned Outputs • Dynamic Output Switch 15 Kappa Architecture

Slide 16

Slide 16 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Tachyon No write speed constrains • Data-Lineage Based Fault Tolerance • Asynchronous Checkpoints • Unbounded Writes 16

Slide 17

Slide 17 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Real Life Use Case 17

Slide 18

Slide 18 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Real Life Use Case • Integrate an In-Memory Data Grid - Geode (e.g. GemFire) with SQL-On-Hadoop OLAP system - HAWQ • Provide an unified data view across both systems • Use Geode as Slowly Changing Dimensions (SCDs) store for HAWQ • Keep the Operational and Historical data in Sync 18

Slide 19

Slide 19 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ • Cache - Performance / Consistency / Resiliency • Region • Distributed j.u.Map on steroids • Highly available, redundant • Distributed • Locator, Server, Client Apache Geode & GemFire (OLTP) 19

Slide 20

Slide 20 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Geode use cases… 20 China Railway
 Corporation 5,700 train stations 4.5 million tickets per day 20 million daily users 1.4 billion page views per day 40,000 visits per second * http://pivotal.io/big-data/pivotal-gemfire Indian Railways 7,000 stations 72,000 miles of track 23 million passengers daily 120,000 concurrent users 10,000 transactions per minute

Slide 21

Slide 21 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ (SQL-on-HADOOP) • Enterprise SQL on Hadoop analytic engine built around a high-performance Greenplum MPP Database • HAWQ runs natively on Apache Hadoop, HDFS and YARN. • Supports multiple Hadoop file formats: Parquet, HDFS and Avro • 100% ANSI SQL compliant (ANSI SQL-92, SQL-99, and SQL-2003…) • ODBC and JDBC connectivity • Implemented in C and C++ 21

Slide 22

Slide 22 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ (SQL-on-HADOOP) • Runs TPC-DS benchmark in half the wall clock time that Impala • Beats Impala by overall 454% in performance • Compared to Hive, HAWQ provides an additional of 344% of performance improvement on complex queries • Impala and Apache Hive™ do not support all standard TPC-DS queries • HAWQ runs 100% of them natively. 22

Slide 23

Slide 23 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Test Environment • HDFS – Distributed file system. PHD or HDP Hadoop distributions. • HAWQ - SQL-on-Hadoop. • Geode/GemFire - In-memory data grid. • SpringXD – runtime integration and streaming platform. • Apache Ambari – Manages all Clusters. • Apache Zeppelin – Web interface for interacting with different Data Systems. 23 GEODE Locator, Servers HAWQ Master, Segments HDFS NameNode, DataNodes SPRING XD Admin, Containers Zeppelin Notebook Ambari

Slide 24

Slide 24 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Test Environment - Ambari Management 24

Slide 25

Slide 25 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Data Synchronization 25

Slide 26

Slide 26 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 1) 26

Slide 27

Slide 27 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 1) - Demo 27

Slide 28

Slide 28 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 2) - gpfdist 28

Slide 29

Slide 29 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Passive Synchronization (Case 2): Demo 29

Slide 30

Slide 30 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Federated Queries With HAWQ 30

Slide 31

Slide 31 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Access dynamic data sources on a web server or by executing OS scripts. 31 HAWQ Web Table – Architecture

Slide 32

Slide 32 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ Web Table • HAWQ Web Table - access dynamic data sources on a web server or by executing OS scripts. • Leverage Geode REST API and OQL. • RESTful SpringBoot app to convert JSON into TSV. 32 CREATE EXTERNAL WEB TABLE EMPLOYEE_WEB_TABLE (...) EXECUTE E'curl http:///gemfire-api/v1/ queries/adhoc?q=' ON MASTER FORMAT 'text' (delimiter '|' null 'null' escape E'\\');

Slide 33

Slide 33 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ Web Table – Limitations • Not Scalable • No “Filter Push Down” • Inflexibility • No Compression • Requires Additional Components 33

Slide 34

Slide 34 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ HAWQ PXF/Geode External Table 34

Slide 35

Slide 35 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Pivotal eXtension Framework (PXF) • Java-Based • Parallel, High Throughput Data Access • Heterogeneous Data Sources. • ANSI-compliant SQL On Any Dataset • Wide variety of PXF plugins 35

Slide 36

Slide 36 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Architecture 36

Slide 37

Slide 37 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Data Model • Data Source is modeled as a collection of one or more Fragments. • Each Fragment consists of many Rows that in turn are split into typed Fields. • Analyzer (optional) provides PXF statistical data for the HAWQ query optimizer • Metadata about the data source locations, access attributes, table schemas formats, SQL queries filters, etc 37

Slide 38

Slide 38 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Processors 38

Slide 39

Slide 39 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ PXF Runtime Model 39

Slide 40

Slide 40 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ CREATE EXTERNAL TABLE ext_table_name LOCATION('pxf://:/path/to/data? FRAGMENTER=package.name.FragmenterForX& ACCESSOR=package.name.AccessorForX& RESOLVER=package.name.ResolverForX& =') FORMAT ‘custom'(formatter='pxfwritable_import'); 40 PXF External Tables Definition

Slide 41

Slide 41 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Available PXF Plugins • HdfsTextSimple • HdfsTextMulti • Hive • HiveRC • HiveText • Hbase • Avro 41 • Accumulo • Casandra • JSON • Redis • Geode/Gemfire • Pipes

Slide 42

Slide 42 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Build Geode PXF Plugin 42

Slide 43

Slide 43 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ GEMFIRE A profile for reading Gemfire data io.pivotal.pxf.plugins.gemfire.GemfireFragmenter io.pivotal.pxf.plugins.gemfire.GemfireAccessor io.pivotal.pxf.plugins.gemfire.GemfireResolver 43 PXF Geode Profile

Slide 44

Slide 44 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Geode External Table Definition CREATE EXTERNAL TABLE (...) LOCATION('pxf:///? PROFILE=GEMFIRE & LOCATORS= & REGION=') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); 44

Slide 45

Slide 45 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ public class GemfireFragmenter extends Fragmenter { public GemfireFragmenter(InputData inputData) throws Exception { super(inputData); } public List getFragments() throws Exception { return Arrays.asList (new Fragment( inputData.getDataSource(), new String[] { inputData.getUserProperty("URL-HOST") }, new byte[0])); } } 45 Geode Fragmenter

Slide 46

Slide 46 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ public class GemfireAccessor extends Plugin implements ReadAccessor { public GemfireAccessor(InputData input) throws Exception { super(input); } public boolean openForRead() throws Exception { … } public OneRow readNextObject() throws Exception {…} public void closeForRead() throws Exception {…} } 46 Geode Accessor

Slide 47

Slide 47 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ public class GemfireResolver extends Plugin implements ReadResolver { public GemfireResolver(InputData input) { super(input); } public List getFields(OneRow paramOneRow) throws Exception { } } 47 Geode Resolver

Slide 48

Slide 48 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Demo 3: Federated queries with PXF/Geode 48

Slide 49

Slide 49 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Demo 3: Federated queries with PXF/Geode 49

Slide 50

Slide 50 text

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a
 Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ 50 By Christian Tzolov @christzolov HADOOP WORKFLOWS AND DISTRIBUTED YARN APPS USING SPRING TECHNOLOGIES Learn More. Stay Connected. @springcentral Spring.io/video