The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates

Big Data Spain
December 15, 2016

Transcript

  1. The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem

    Alan Gates, Co-Founder, Hortonworks (@alanfgates)
  2. Our Hadoop Journey Begins…

    [Diagram: 2006, a cluster of nodes 1 … N running HDFS and MapReduce for batch apps]

    © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  3. Our Hadoop Journey: Ecosystem Innovation Accelerates

    [Timeline: 2006, 2011, Today]
  4. 6 Years of Apache Hive and Beyond

    A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
    • Extensive SQL:2011 Support
    • Compatible with every major BI Tool
    • Proven at 300+ PB Scale

    Timeline, 2010 to 2016, in slide order:
    • Apache Hive becomes a Top-Level Project
    • HiveServer2 adds ODBC/JDBC
    • SQL breadth expands with windowing and more
    • Apache Tez enters incubation
    • Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
    • Standard SQL authorization, integration with Apache Ranger
    • ACID transactions introduced
    • Governance added with Apache Atlas integration
    • Hive 2 introduces LLAP and intelligent in-memory caching
  5. Hive 2 with LLAP: Architecture Overview

    [Architecture diagram, components as labeled]
    • ODBC / JDBC: SQL queries
    • HiveServer2 (query endpoint)
    • Query coordinators (three shown)
    • YARN cluster: LLAP daemons (four shown), each with query executors and an in-memory cache
    • Deep storage: HDFS, S3, and other HDFS-compatible filesystems
  6. Hive 2 with LLAP: 25+x Performance Boost

    Hive 2 with LLAP averages 26x faster than Hive 1.

    [Chart: query time in seconds (lower is better) for Hive 1 / Tez and Hive 2 / LLAP, with the speedup factor on a second axis]
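
The chart above plots two timings per query plus a speedup factor. A minimal sketch of how such a factor can be computed, assuming it is the Hive 1 / Tez time divided by the Hive 2 / LLAP time and that the headline number is the mean of the per-query factors (the slide does not say how the average was taken). The function name and the timings are hypothetical placeholders, not benchmark results.

```python
from statistics import mean


def speedup_factors(baseline_s, accelerated_s):
    """Return per-query speedup factors: baseline time / accelerated time."""
    if len(baseline_s) != len(accelerated_s):
        raise ValueError("both runs must cover the same queries")
    return [b / a for b, a in zip(baseline_s, accelerated_s)]


# Placeholder timings in seconds, invented for illustration only.
hive1_tez_s = [100.0, 60.0, 240.0]
hive2_llap_s = [5.0, 2.0, 12.0]

factors = speedup_factors(hive1_tez_s, hive2_llap_s)
average_speedup = mean(factors)          # mean of per-query ratios
overall_speedup = sum(hive1_tez_s) / sum(hive2_llap_s)  # ratio of total times
```

The two summary numbers generally differ, so a reported average is only comparable when the averaging method is stated.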
  7. What’s new in Spark 2.0?

    • API Improvements
      – SparkSession: new entry point
      – Unified DataFrame & Dataset API
      – Structured Streaming / Continuous Applications
    • Performance Improvements
      – Tungsten Phase 2: whole-stage code generation
    • ML
      – ML model persistence
      – Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
    • SparkSQL
      – SQL 2003 support (new ANSI SQL parser, subquery support)
  8. How to Secure and Govern Access to Your Data?

    • Policies: Classification, Prohibition, Time, Location
    • Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
  9. Secure and Govern Your Data with Tag-Based Access Policies

    • Policies: Classification, Prohibition, Time, Location
    • Ranger: Manage Access Policies and Audit Logs (PDP, Resource Cache)
    • Atlas: Track Metadata and Lineage
    • Atlas Client: Subscribers to Topic, Gets Metadata Updates
    • Atlas Metastore: Tags, Assets, Entities
    • Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
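
To make the idea of a tag-based policy concrete, here is a minimal sketch of a policy decision check in plain Python. It is not the Ranger or Atlas API: `TagPolicy`, `Asset`, and `is_allowed` are hypothetical names, and the model assumes a policy attaches to a classification tag and can be narrowed by location and by an expiry time, loosely following the Classification, Time, and Location labels on the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy keyed on a classification tag, not on a resource."""
    tag: str                                    # classification, e.g. "PII"
    allowed_groups: frozenset                   # groups that may access tagged assets
    allowed_locations: frozenset = frozenset()  # empty means any location
    expires: Optional[datetime] = None          # None means no time limit


@dataclass
class Asset:
    """Hypothetical data lake entity (table, file, stream) carrying tags."""
    name: str
    tags: Set[str] = field(default_factory=set)


def is_allowed(asset, group, location, now, policies):
    """Deny if any tag on the asset lacks a policy permitting this request.

    Untagged assets pass this check; in a real deployment they would still
    be subject to ordinary resource-based policies, which are not modeled.
    """
    for tag in asset.tags:
        permitted = any(
            p.tag == tag
            and group in p.allowed_groups
            and (not p.allowed_locations or location in p.allowed_locations)
            and (p.expires is None or now < p.expires)
            for p in policies
        )
        if not permitted:
            return False
    return True


# Illustrative usage with made-up values.
policies = [
    TagPolicy("PII", frozenset({"compliance"}), frozenset({"EU"}), datetime(2017, 1, 1)),
]
customers = Asset("customers", {"PII"})
decision = is_allowed(customers, "compliance", "EU", datetime(2016, 12, 15), policies)
```

Because the policy refers to the tag rather than to the table, tagging a new asset as "PII" brings it under the same rule without writing a new policy.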
  10. Data In Motion

    [Diagram: three tiers labeled Sources, Regional Infrastructure, Core Infrastructure]
    • Constrained
    • High-latency
    • Localized context
    • Hybrid – cloud/on-premises
    • Low-latency
    • Global context
  11. Our Hadoop Journey: From the Data Center to the Cloud!

    [Timeline: 2006, Today]
  12. Why Hadoop in the Cloud?

    • Unlimited Elastic Scale
    • Ephemeral & Long-Running
    • IT & Business Agility
    • No Upfront HW Costs
  13. Key Architectural Considerations for Hadoop in the Cloud

    • Shared Data & Storage
    • On-Demand Ephemeral Workloads
    • Elastic Resource Management
    • Shared Metadata, Security & Governance
  14. Shared Data and Storage

    Understand and Leverage Unique Cloud Properties
    • Shared data lake is cloud storage accessible by all apps
    • Cloud storage segregated from compute
    • Built-in geo-distribution and DR

    Focus Areas
    • Address cloud storage consistency and performance
    • Enhance performance via memory and local storage
  15. Enhance Performance via Caching

    Tabular Data: LLAP Read + Write-thru Cache
    • Shared across jobs / apps and across engines
    • Cache only the needed columns
    • Spills to SSD when memory is full (anti-caching)
    • Read & Write-through cache
    • Security: Column-level and row-level

    HDFS Caching for Non-tabular Data
    • Cache data from cloud storage as needed
    • Write-through cache

    [Diagram: workloads reach cloud storage through a cache layer, LLAP for R/W tables and HDFS for files]
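
The read and write-through behavior listed above can be sketched in a few lines of Python. This is an illustration of the general technique, not LLAP's implementation: `WriteThroughCache` is a hypothetical class, a plain dict stands in for cloud storage, a second dict stands in for local SSD, and the spill choice is assumed to be least recently used.

```python
from collections import OrderedDict


class WriteThroughCache:
    """Two-tier read + write-through cache over a dict-like backing store."""

    def __init__(self, store, memory_slots):
        self.store = store            # stands in for cloud storage
        self.memory = OrderedDict()   # hot tier, least recently used first
        self.ssd = {}                 # spill tier, stands in for local SSD
        self.memory_slots = memory_slots

    def _admit(self, key, value):
        """Place an entry in memory, spilling the coldest entries to SSD."""
        self.ssd.pop(key, None)       # drop any stale spilled copy
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_slots:
            cold_key, cold_value = self.memory.popitem(last=False)
            self.ssd[cold_key] = cold_value

    def read(self, key):
        if key in self.memory:        # memory hit
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.ssd:           # SSD hit: promote back to memory
            value = self.ssd[key]
        else:                         # miss: fetch from the backing store
            value = self.store[key]
        self._admit(key, value)
        return value

    def write(self, key, value):
        self.store[key] = value       # write-through: persist before caching
        self._admit(key, value)
```

Writing to the backing store before updating the cache keeps the store authoritative, so an ephemeral cluster can be discarded without losing data.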
  16. Prescriptive On-Demand Ephemeral Workloads

    [Diagram: four on-demand ephemeral workloads, each with R/W tables on a compute fabric]
    • Data Science
    • ETL
    • Warehouse
    • Search
  17. Shared Data Requires Shared Metadata, Security, and Governance

    Shared Metadata Across All Workloads
    • Metadata considerations
      – Tabular data metastore
      – Lineage and provenance metadata
      – Pipeline and job management metadata
      – Add upon ingest
      – Update as processing modifies data
    • Access / tag-based policies and audit logs
    • Centrally stored to facilitate use across clusters
      – e.g., backed by Cloud RDS (or a shared DB)

    [Diagram: shared metadata and policies (Classification, Prohibition, Time, Location) applied to Streams, Pipelines, Feeds, Tables, Files, Objects]
  18. Elastic Resource Management in Context of Workload

    Workload Management vs. Cluster Management
    • Understand resource needs of different workload types
    • Add / remove resources to meet workload SLAs
    • Manage compute power and high-performance data access (e.g., LLAP)
    • Pricing-aware: instances (spot, reserved), data, bandwidth
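
As a rough illustration of "add / remove resources to meet workload SLAs", the sketch below sizes a cluster from the amount of pending work and an SLA window. The function names are hypothetical, and the model assumes perfectly parallel work where one node retires one task-second per second. Pricing (spot versus reserved instances, data, bandwidth) is deliberately left out.

```python
import math


def nodes_needed(pending_task_seconds, sla_seconds, min_nodes=1):
    """Nodes required to drain the pending work within the SLA window."""
    if sla_seconds <= 0:
        raise ValueError("sla_seconds must be positive")
    return max(min_nodes, math.ceil(pending_task_seconds / sla_seconds))


def scaling_delta(current_nodes, pending_task_seconds, sla_seconds):
    """Positive result: add nodes. Negative: remove nodes. Zero: hold."""
    return nodes_needed(pending_task_seconds, sla_seconds) - current_nodes


# Illustrative values: one hour of queued work, ten-minute SLA, four nodes up.
delta = scaling_delta(current_nodes=4, pending_task_seconds=3600, sla_seconds=600)
```

A workload-aware manager would run a check like this per workload type rather than once for the whole cluster.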
  19. Transformational Applications Require Connected Data

    [Diagram labels]
    • Data in Motion: Edge Data, Edge Analytics, Stream Analytics
    • Data at Rest: Deep Historical Analysis, Machine Learning
    • Locations: Data Center, Cloud