How to use Hadoop for operational and transact...

Big Data Spain
November 25, 2014

How to use Hadoop for operational and transactional purposes by RODRIGO MERINO at Big Data Spain 2014

Hadoop is an open-source framework designed to rapidly ingest, store, and analyze large data sets. It is well suited to batch processing where immediate interactive analytics are not required. Today, however, Hadoop does not support operational and transactional workloads, which consist of a constant flow of transactions requiring low-latency response times for read/write access.

Transcript

  1. HP © Copyright 2014 Hewlett-Packard Development Company, L.P. The information

    contained herein is subject to change without notice. Trafodion: How to use Hadoop for operational and transactional purposes. Enterprise-Class Operational SQL-on-Hadoop DBMS
  2. Agenda

    • Current database landscape … and a prediction
    • How special are transactions
    • HP Trafodion
    • Trafodion innovation
    • Use cases
    • Trafodion: an open-source project
  3. The current situation

    Each database type has its strengths and its perfect fit … but each also has weaknesses. You can’t use one of them for all types of workloads!
    Source: http://www.datasciencecentral.com/profiles/blogs/hadoop-vs-nosql-vs-sql-vs-newsql-by-example
  4. Hadoop workload profiles

    • Operational: Transactional SQL = OLTP + interactions
    • Interactive: Parameterized reports; Drilldown visualization; Exploration
    • Non-interactive: Real-time analytics; Data preparation; Incremental batch processing; Dashboards, scorecards
    • Batch: Operational batch processing; Enterprise reports; Data mining
    Response time scale: sub-second to hours.
    Current market focus: data warehousing and analytics.
    Exposes Hadoop limitations: operational optimizations, data integrity, workload management, transaction support, real-time performance.
  5. We could have a situation like this. Sound familiar?

    But if we use the right tool for each job… MapReduce; MPP DBMS; NoSQL DBMS; In-Memory Analytics. Large data movement / replication of data. Varying platform requirements. Departmental segmentation. HDFS Centric; Traditional.
  6. Big Data is hard to move… because it’s BIG !!!

    Source: www.pinterest.com
  7. And there is a fair chance that something will fail

    Source: www.shutterstock.com
  8. Hadoop: One platform to rule them all

    Source: www.wallconvert.com
  9. The Future of Hadoop: What Happened & What's Possible? Operational SQL-on-Hadoop

    “Transactions were something that were long thought to be out of scope for this style of platform. There are a lot of important cases for transactions. You are selling a ticket to something, then you need to move money from one place to another. You need to assign a seat to someone. And you need to make sure that the money is in one place or the other. Not in both, not nowhere. And you need to at the same time assign that seat or not assign that seat. This is an important class of workload that is currently well served but not by the Hadoop platform. A year ago Google published a paper describing their internal system they have built on their platform, that is very similar to Hadoop, which does this, demonstrating that it's possible to bring online transaction processing to this style of platform. And in the past when we have seen it's possible, within a few years it happens. So I think the prediction we can make here is that it is inevitable that we will see just about every kind of workload be moved to this platform – even Online Transaction Processing.” – Doug Cutting, Cloudera, October 30 2013
  10. Characteristics of operational DBMS applications

    Generalized characteristics and requirements:
    • Low latency response times
    • ACID (data consistency guaranteed) transactions
    • Large number of users
    • High concurrency
    • High availability
    • Scalable data volumes
    • Multi-structured data
    • Rapidly evolving data requirements (i.e. flexible schemas)
    Expose Hadoop limitations: operational query optimization, data integrity, workload management, transaction support, real-time performance.
  11. Characteristics of operational DBMS applications

    Generalized characteristics and requirements:
    • Low latency response times
    • ACID (data consistency guaranteed) transactions
    • Large number of users
    • High concurrency
    • High availability
    • Scalable data volumes
    • Multi-structured data
    • Rapidly evolving data requirements (i.e. flexible schemas)
    Expose Hadoop limitations: operational query optimization, data integrity, workload management, transaction support, real-time performance.
  12. ACID properties for transactions

    • Atomicity: Either all operations of the transaction are properly reflected in the database or none are.
    • Consistency: Execution of a transaction in isolation preserves the consistency of the database.
    • Isolation: Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions.
    • Durability: After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
  13. The typical bank transfer example

    Transfer £50 from account A to account B.
    Transaction: Read(A); A = A - 50; Write(A); Read(B); B = B + 50; Write(B)
    • Atomicity: Shouldn’t take money from A without giving it to B
    • Consistency: Money isn’t lost or gained
    • Isolation: Other queries shouldn’t see A or B change until completion
    • Durability: The money does not go back to A
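The transfer on this slide can be sketched in a few lines of Python. This is only an illustration of atomicity using the standard-library `sqlite3` module, not Trafodion; the `accounts` table, its columns, and the helper names `make_demo_db`, `transfer`, and `balances` are assumptions made up for the example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 20)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("unknown source account or insufficient funds")
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                raise ValueError("unknown destination account")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

If the credit to the destination fails after the debit has already been applied, the rollback restores the source balance, which is the "shouldn't take money from A without giving it to B" property.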
  14. And a funnier example

    In 2009 Samoa switched from driving on the right side of the road to the left.
    “As the 6 a.m. deadline approached, Police Minister Toleafoa Faafisi went on national radio to tell drivers everywhere to stop their vehicles. Minutes later, Prime Minister Tuilaepa Sailele Malielegaoi broadcast the formal instructions for drivers to switch sides.”
    Imagine we could do it in a SQL database. If this “transaction” were not atomic there would be trouble!
    Source: michaeljswart.com
  15. Trafodion - Introduction

    Open source project to develop transactional SQL-on-HBase
    • Rides the unstoppable Hadoop wave! Transforms how companies store, process, and share big data. Affordable performance, elastic scalability, availability.
    • Open source project, downloadable for free. Eliminates vendor lock-in and licensing fees. Leverages community development resources and speed.
    • Schema flexibility and multi-structured data. Capturing and storing all data for all business functions.
    • Full-function ANSI SQL. Reuses existing SQL skills and improves developer productivity.
    • Distributed ACID transaction protection. Guarantees data consistency across multiple rows, tables, SQL statements.
    • Targeted for operational workloads! Optimized for real-time transaction processing applications, i.e. OLTP + New Style Transactions (Interactions + Observations).
    • Leverages 20+ years of HP investments.
    Transactional SQL + HBase
  16. Trafodion - Features

    Joint HP Labs & HP-IT project for transactional SQL database capabilities on Hadoop
    • Complete: Full-function SQL. Reuse existing SQL skills and improve developer productivity.
    • Protected: Distributed ACID transactions. Data consistency across multiple rows, tables, SQL statements.
    • Efficient: Low-latency R/W transactions. Optimized for real-time transaction processing applications.
    • Interoperable: Standard ODBC/JDBC access. Works with existing tools and applications.
    • Data federation: Trafodion/HBase/Hive tables. Enables multiple data model deployment.
    • Scalable: Elastic scale for high concurrency. Provides elastic scalability as number of users / data grows.
    • Highly Available: For enterprise applications. Leverages HBase / Hadoop replication.
    • Open: Hadoop and Linux distribution neutral. Easy to add to existing infrastructure with no vendor lock-in.
    • Eco-system: Leverages large Hadoop eco-system. Can use any tool or database accessing Hadoop.
    Transactional SQL + Hadoop
  17. HBase vs. Trafodion comparison

    | | HBase | Trafodion + HBase |
    |---|---|---|
    | Data abstraction | Key and value pair | Relational schema |
    | Physical layout | Column family store where row data is stored together by cells | Same except there is a single column family with space-saving column encoding |
    | Column values | Uninterpreted array of bytes | Explicitly defined and enforced data types |
    | ACID guarantee | Single row atomicity | Multiple SQL statements, tables, and rows defined as part of transaction |
    | Language API | Get/put/delete | SQL (Trafodion invokes native HBase API) |
    | Row key index | Single (string) row key | Composite (multi-column) row key |
    | Secondary indexes | Not supported | Arbitrary secondary key columns |
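To make the "uninterpreted bytes" versus "typed columns with a composite row key" contrast concrete, here is a minimal Python sketch. It is not Trafodion's actual key encoding: the two-column key `(customer_id, order_id)`, the fixed-width big-endian packing, and the plain `dict` standing in for an HBase table are all assumptions chosen for illustration.

```python
import struct


def encode_key(customer_id, order_id):
    """Pack two unsigned 32-bit ints big-endian so byte order matches numeric order."""
    return struct.pack(">II", customer_id, order_id)


def put_order(store, customer_id, order_id, amount_cents):
    """Write one row: typed values go in, opaque bytes are what gets stored."""
    store[encode_key(customer_id, order_id)] = struct.pack(">q", amount_cents)


def orders_for_customer(store, customer_id):
    """Prefix scan on the leading key column, decoding bytes back to typed values."""
    prefix = struct.pack(">I", customer_id)
    result = []
    # HBase keeps rows sorted by key; the dict stand-in has to be sorted explicitly.
    for raw_key in sorted(store):
        if raw_key.startswith(prefix):
            _, order_id = struct.unpack(">II", raw_key)
            (amount_cents,) = struct.unpack(">q", store[raw_key])
            result.append((order_id, amount_cents))
    return result
```

The store itself only ever sees byte strings; the type enforcement and the multi-column key live entirely in the layer above it, which is the role the table assigns to the SQL layer.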
  18. Trafodion and Hadoop – Benefits!

    Leverages and extends Hadoop for transactional SQL workloads
    • Complete: Full-function ANSI SQL. Reuse existing SQL skills and improve developer productivity.
    • Protected: Distributed ACID transactions. Guarantees data consistency across multiple rows, tables, SQL statements.
    • Efficient: Optimized for low-latency read and write transactions. Supports real-time transaction processing applications.
    • Flexible: Schema flexibility and multi-structured data. Seamlessly integrates structured, unstructured, and semi-structured data.
    • Interoperable: Standard ODBC/JDBC access. Works with existing tools and applications.
    • Open: Hadoop and Linux distribution neutral. Easy to add to your existing infrastructure and no vendor lock-in.
    • Open source project sponsorship and investment from HP.
    Scale without complexity • Reuse SQL skills • Complements Hadoop • Reduce costs • Real-time performance
  19. Trafodion innovation built upon Hadoop stack

    Leverages Hadoop and HBase for core modules
    • Maintains API compatibility
    • Inherited scalability and availability
    Differentiation
    • ANSI SQL via ODBC/JDBC
    • Relational schema abstraction
    • Distributed transaction protection
    • Mature SQL technology
    • Automatic parallelism
    Stack diagram labels: client application using ODBC/JDBC on Windows/Linux; client services for ODBC and JDBC; SQL compiler / optimizer / executor; distributed transaction manager; Hive; HBase; HDFS; ZooKeeper. Legend: Standard Hadoop + Trafodion.
  20. Trafodion – Software architecture (3 layers)

    • Client: user and ISV operational applications; JDBC and ODBC drivers; future database connectivity.
    • SQL: Master, ESP (Executor Server Process), CMP, DTM, and WMS, covering compiler and optimizer, workload management, SQL parallelism, and distributed transaction management.
    • Storage engine: HBase (relational schema, Trafodion tables); HBase (native HBase tables; KVS, columnar via HBase API + coprocessors); Hive (direct HDFS access to Hive tables using HCatalog); HDFS data store integration.
  21. Process overview and SQL execution flow

    • Operational application clients.
    • Client connections via ODBC, JDBC, .Net (future).
    • Database connection service: lightweight coordination service and process control using Apache ZooKeeper.
    • SQL execution service with an instance of the executor serving as the master for parallel SQL execution plans.
    • CMP (Compiler/Optimizer) component to generate the optimal execution plan.
    • DTM provides distributed transaction management across the cluster.
    • Executor server processes (ESP) used for parallel execution based on plan (optional). Multiple layers of ESP may be used.
    • HBase-Trx provides transactional resource management for HBase.
    • HBase data services responsible for accessing and maintaining database objects.
    Diagram labels: DCS, Master, CMP, DTM, ESP, TRX, HBase, HDFS.
  22. Optimized execution plans based on statistics

    Optimizer features
    • Top-down, multi-pass optimizations; branch and bound plan pruning considers more potential plans
    • Utilizes “equal-height” histogram statistics
    • SQL pushdown considerations, e.g. predicate evaluation
    • Eliminates sorts when feasible, syntactically and semantically
    • In-memory vs. overflow considerations
    • Optimal degree of parallelism (DOP) considerations including non-parallel plans
    Benefits
    • Facilitates enhanced parallelism and SQL object handling efficiencies
    • Optimizations for operational transactions and reporting workloads
    Diagram labels: SQL Statement; SQL Normalizer; SQL Analyzer; Table Statistics; Cardinality Estimator; Cost Estimator; Plan Generator; Optimized Plan.
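The idea behind "equal-height" histogram statistics can be sketched briefly. The code below is a generic illustration of the technique, not Trafodion's optimizer code; the function names and the deliberately coarse estimate (whole buckets only, no interpolation inside a bucket) are assumptions made for the example.

```python
def equal_height_boundaries(values, buckets):
    """Return bucket upper bounds so each bucket holds about len(values) / buckets rows."""
    if buckets < 1 or len(values) < buckets:
        raise ValueError("need at least one value per bucket")
    ordered = sorted(values)
    n = len(ordered)
    # The b-th boundary is the value at the end of the b-th equal-sized slice.
    return [ordered[(b * n) // buckets - 1] for b in range(1, buckets + 1)]


def estimate_selectivity_le(bounds, x):
    """Estimate the fraction of rows with value <= x by counting fully covered buckets."""
    covered = sum(1 for upper in bounds if upper <= x)
    return covered / len(bounds)
```

Because every bucket holds roughly the same number of rows, a heavily repeated value occupies several buckets on its own, so the estimate tracks skewed data far better than a fixed-width histogram with the same number of buckets would.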
  23. Data flow SQL execution with optimized DOP

    Data-flow, scheduler-driven. Parallelism throughout:
    • Operator parallelism
    • Partitioned parallelism
    • Pipeline parallelism
    – Operators executed by Master or ESP
    – Varying degrees of parallelism
    – SQL divided into operators: nested, merge, hash joins; unions; partial & full aggregations; sorts; input/output operations (scan, update, delete, insert)
    Diagram labels: Master; Join; Group By; Scan; Scan; 40, 30, 20.
  24. Trafodion distributed transaction protection

    • Multiple row inserts, updates, and deletes to a table
    • Multiple table and SQL insert, update, and delete statements
    • Distributed multiple HBase region insert, update, and delete transaction (2-phase commit)
    • Read-only transaction (eliminates commit overhead)
    Diagram labels: Trafodion; 1, 2, 3, 4; Table A, Table B, Table C, Table A; Region A, Region B, Region C, Region D.
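The two-phase commit mentioned for multi-region transactions follows a well-known pattern, sketched below in Python. This is a textbook illustration only, not the Trafodion DTM or HBase-Trx implementation; the `Region` class, its methods, and the `fail_on_prepare` switch are hypothetical names invented for the example, and real systems also log decisions durably and handle coordinator failure.

```python
class Region:
    """Hypothetical stand-in for one HBase region taking part in a transaction."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.data = {}      # committed state
        self.pending = {}   # writes staged by the open transaction

    def stage(self, key, value):
        self.pending[key] = value

    def prepare(self):
        """Phase 1: vote yes only if this region can guarantee the commit."""
        return not self.fail_on_prepare

    def commit(self):
        """Phase 2 (all voted yes): make the staged writes visible."""
        self.data.update(self.pending)
        self.pending.clear()

    def rollback(self):
        """Phase 2 (any vote was no): discard the staged writes."""
        self.pending.clear()


def two_phase_commit(regions):
    """Commit on every region or on none; return True if the transaction committed."""
    if all(region.prepare() for region in regions):
        for region in regions:
            region.commit()
        return True
    for region in regions:
        region.rollback()
    return False
```

A read-only transaction stages nothing, so there is nothing to prepare or commit, which is why the slide notes that it avoids the commit overhead.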
  25. Integrating external (non-Trafodion) Hadoop tables

    Benefits
    • Able to run queries against external tables without needing to copy them into a Trafodion table structure
    • Optimized access to external HBase and Hive tables without complex map-reduce programming
    • Data can be joined across disparate data sources (e.g. Trafodion, Hive, HBase)
    • Able to leverage HBase’s inherent schema flexibility capabilities
    HBase tables (created outside of Trafodion by HBase)
    • Schema-less format, i.e. no information in Trafodion metadata
    • Accessible through Trafodion SQL in two modes
      – Cell-per-row access, i.e. each row returned represents a single HBase cell
      – Row-wise access, i.e. all column values of the row will be returned as a single, big varchar
    Hive tables (created outside of Trafodion by Hive)
    • Hive metadata, HDFS files storage, delimited data, read/append only
    • Support for both SELECT and INSERT statements
    • Automatic data type mapping
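The difference between the two access modes for native HBase tables can be pictured with a small Python sketch. It only models the shape of the results; the nested `dict` standing in for an HBase table, the function names, and the `column=value` packing with a `|` separator are assumptions for illustration and do not reflect the actual format Trafodion returns.

```python
def cell_per_row(table):
    """Cell-per-row view: one (row_key, column, value) result row per HBase cell."""
    rows = []
    for row_key in sorted(table):
        for column in sorted(table[row_key]):
            rows.append((row_key, column, table[row_key][column]))
    return rows


def row_wise(table, sep="|"):
    """Row-wise view: one result row per HBase row, all column values in one string."""
    rows = []
    for row_key in sorted(table):
        cells = table[row_key]
        packed = sep.join(f"{column}={cells[column]}" for column in sorted(cells))
        rows.append((row_key, packed))
    return rows
```

Cell-per-row suits filtering on individual columns of a schema-less table, while row-wise hands back each row in one piece and leaves the unpacking to the caller.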
  26. Good fit for Trafodion

    • Finance: Online financial management
    • Telecom: Billing systems; Provisioning systems
    • Manufacturing: RFID tracking
    • Energy: Smart metering
    • Healthcare: Authorization and claims processing
    • Government: 911 emergency system
    • Transportation: Reservation systems
    • Consumer & Retail: Online shopping
    Multi-structured data • ACID protection, data integrity • Low latency, high concurrency
    Generates revenue • Touches the customer • Helps run the business
  27. Operational SQL on Hadoop – Use cases

    • Integration of structured, semi-structured, and unstructured support
    • Integration of operational, historical, & external (Big) data along common master data for better insights
    Structured: Item id, Description, Cost, Price, …
    Semi-structured (by Dept): TV: Type, Display Size, Resolution, Brand, Model, 3D, …; Book: ISBN, Author, Publish Date, Format, …
    Unstructured: Image, …; Review, …
    Example query: SELECT all TVs WHERE Price > 2000 and Type = ‘Plasma’ and Display Size > ‘50’ and customer sentiment is very positive
    Open distributed HDFS structures (HBase & Hive). Free at last! Capture data directly into open file structures. Accessible for reporting & analytics with no latency.
  28. Modern open source environment

    Following best practices of OpenStack project
    • Source code in GitHub
    • Build/test in OpenStack Gerrit, Zuul, Jenkins
    • Defect tracking in Launchpad
    • Documentation in MediaWiki
  29. Building an Open Source Community

    • Simple installation
    • Meritocracy
    • Recruiting project contributors
    Share your expertise: developing, fixing defects, testing, writing, translating and more.
    Want to try? Discover our capabilities: download and install in your Hadoop environment and take a test-drive. www.trafodion.org
  30. See for yourself…

    Come discover and develop on Trafodion: www.trafodion.org