The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by Alan Gates

The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem
Alan Gates, Co-Founder, Hortonworks (@alanfgates)

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Our Hadoop Journey Begins…
[Diagram: a cluster of nodes 1 through N running HDFS and MapReduce for batch apps, 2006]

Our Hadoop Journey: Ecosystem Innovation Accelerates
[Timeline: 2006, 2011, Today]

6 Years of Apache Hive and Beyond
[Timeline axis: 2010, 2011, 2012, 2013, 2014, 2015, 2016]
• Apache Hive becomes a Top-Level Project
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching
A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale

Hive 2 with LLAP: Architecture Overview
[Architecture diagram. Components shown:]
• SQL Queries over ODBC / JDBC
• HiveServer2 (Query Endpoint)
• Query Coordinators (three coordinators shown)
• YARN Cluster of LLAP Daemons (four shown), each with Query Executors and an In-Memory Cache
• Deep Storage: HDFS, S3 + Other HDFS Compatible Filesystems

Hive 2 with LLAP: 25+x Performance Boost
Hive 2 with LLAP averages 26x faster than Hive 1
[Chart: per-query Query Time (s) (Lower is Better) for Hive 1 / Tez and Hive 2 / LLAP, with Speedup (x Factor) on a second axis]
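
The headline number on this slide is an average of per-query speedups. As a minimal sketch of that arithmetic, assuming made-up timings (the query names and numbers below are illustrative and are not the benchmark results behind the chart):

```python
from statistics import mean

# Hypothetical per-query wall-clock times in seconds; NOT the slide's data.
hive1_tez_s = {"q1": 40.0, "q2": 120.0, "q3": 15.0}
hive2_llap_s = {"q1": 2.0, "q2": 4.0, "q3": 1.5}

def speedups(baseline, candidate):
    """Per-query speedup factor: baseline time divided by candidate time."""
    return {q: baseline[q] / candidate[q] for q in baseline}

per_query = speedups(hive1_tez_s, hive2_llap_s)
# Average of the per-query factors; every query counts equally.
average_speedup = mean(per_query.values())
print(per_query, average_speedup)
```

An average of ratios weights every query equally, so it can differ from the ratio of total runtimes. The slide does not say which method was used.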

What’s new in Spark 2.0?
• API Improvements
  – SparkSession: new entry point
  – Unified DataFrame & Dataset API
  – Structured Streaming / Continuous Application
• Performance Improvements
  – Tungsten Phase 2
  – Whole-stage code generation
• ML
  – ML model persistence
  – Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
• SparkSQL
  – SQL 2003 support (new ANSI SQL parser, subquery support)

How to Secure and Govern Access to Your Data?
[Diagram: how do policies map onto entities in the data lake?]
• Policies: Classification, Prohibition, Time, Location
• Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables

Secure and Govern Your Data with Tag-Based Access Policies
[Diagram. Components shown:]
• Policies: Classification, Prohibition, Time, Location
• Ranger: Manage Access Policies and Audit Logs (PDP, Resource Cache)
• Atlas: Track Metadata and Lineage
• Atlas Client: subscribes to a topic and gets metadata updates
• Atlas Metastore: Tags, Assets, Entities
• Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
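
The idea of attaching policies to tags instead of to individual resources can be sketched in a few lines. This is a toy model under stated assumptions, not the Ranger or Atlas API: `TagPolicy`, `ENTITY_TAGS`, `POLICIES` and `is_allowed` are hypothetical names, the data is invented, and prohibition rules are left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy bound to a classification tag, not to a resource."""
    tag: str
    allowed_groups: frozenset
    expires: Optional[date] = None              # "Time" condition
    allowed_locations: frozenset = frozenset()  # "Location"; empty means anywhere

# Tags per entity, as a metadata catalog might publish them (invented data).
ENTITY_TAGS = {
    "hive:sales.customers": {"PII"},
    "hdfs:/data/logs": set(),
}

POLICIES = [
    TagPolicy("PII", frozenset({"compliance"}), expires=date(2030, 1, 1),
              allowed_locations=frozenset({"EU"})),
]

def is_allowed(entity, group, location, today):
    """Allow only if every tag on the entity has a policy admitting the request."""
    for tag in ENTITY_TAGS.get(entity, set()):
        admitted = any(
            p.tag == tag
            and group in p.allowed_groups
            and (p.expires is None or today <= p.expires)
            and (not p.allowed_locations or location in p.allowed_locations)
            for p in POLICIES
        )
        if not admitted:
            return False
    # Untagged entities fall through to True in this sketch; a real system
    # would also evaluate resource-based policies here.
    return True
```

Because the policy follows the tag, tagging a new table as `PII` in the catalog would bring it under the same rule without writing a new policy for that table.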

Data In Motion
[Diagram: data flows from SOURCES through REGIONAL INFRASTRUCTURE to CORE INFRASTRUCTURE]
At the sources:
• Constrained
• High-latency
• Localized context
At the core:
• Hybrid: cloud / on-premises
• Low-latency
• Global context

Our Hadoop Journey: From the Data Center to the Cloud!
[Timeline: 2006, Today]

Why Hadoop in the Cloud?
• Unlimited Elastic Scale
• Ephemeral & Long-Running
• IT & Business Agility
• No Upfront HW Costs

Key Architectural Considerations for Hadoop in the Cloud
• Shared Data & Storage
• On-Demand Ephemeral Workloads
• Elastic Resource Management
• Shared Metadata, Security & Governance

Shared Data and Storage
Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR
Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage

Enhance Performance via Caching
Tabular Data: LLAP Read + Write-thru Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & Write-through cache
• Security: Column-level and row-level
HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache
[Diagram: workloads reach cloud storage through a cache layer, LLAP for R/W tables and HDFS for files]
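
The caching behavior listed above (column-level caching, spill to SSD, write-through) can be illustrated with a toy two-tier cache. This is a sketch under stated assumptions and not how LLAP is implemented: the `ColumnCache` class is hypothetical, plain dicts stand in for cloud storage and the SSD tier, and least-recently-used eviction is an assumption the slide does not state.

```python
from collections import OrderedDict

class ColumnCache:
    """Toy two-tier, write-through cache keyed by (table, column)."""

    def __init__(self, store, memory_slots):
        self.store = store            # dict standing in for cloud storage
        self.memory = OrderedDict()   # hot tier, least recently used first
        self.ssd = {}                 # dict standing in for the SSD spill tier
        self.memory_slots = memory_slots

    def _admit(self, key, value):
        """Place a column in memory, spilling the coldest entries to SSD."""
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_slots:
            old_key, old_value = self.memory.popitem(last=False)
            self.ssd[old_key] = old_value  # spill instead of dropping

    def read(self, table, columns):
        """Fetch only the requested columns, filling the cache on a miss."""
        result = {}
        for col in columns:
            key = (table, col)
            if key in self.memory:
                self.memory.move_to_end(key)
                value = self.memory[key]
            elif key in self.ssd:
                value = self.ssd.pop(key)   # promote back into memory
                self._admit(key, value)
            else:
                value = self.store[key]     # miss: go to backing storage
                self._admit(key, value)
            result[col] = value
        return result

    def write(self, table, column, values):
        """Write-through: update backing storage and the cache together."""
        key = (table, column)
        self.store[key] = values
        self.ssd.pop(key, None)             # discard any stale spilled copy
        self._admit(key, values)
```

Reading two columns of a wide table touches only those two keys, which is the point of "cache only the needed columns"; columns that no query asks for never occupy memory.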

Prescriptive On-Demand Ephemeral Workloads
[Diagram: four on-demand ephemeral workloads (Data Science, ETL, Warehouse, Search), each with R/W Tables on a Compute Fabric]

Shared Data Requires Shared Metadata, Security, and Governance
Shared Metadata Across All Workloads
• Metadata considerations
  – Tabular data metastore
  – Lineage and provenance metadata
  – Pipeline and job management metadata
  – Add upon ingest
  – Update as processing modifies data
• Access / tag-based policies and audit logs
• Centrally stored to facilitate use across clusters
  – Ex. backed by Cloud RDS (or shared DB)
[Diagram: shared metadata and policies (Classification, Prohibition, Time, Location) applied to Streams, Pipelines, Feeds, Tables, Files, Objects]

Elastic Resource Management in Context of Workload
Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data access (e.g., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
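
Sizing a cluster from the workload's SLA rather than from a fixed node count can be sketched as a small calculation. This is a minimal illustration with hypothetical function names, units and prices; the slide describes the goal, not an algorithm, and a real scheduler would consider far more signals.

```python
import math

def nodes_needed(pending_work_units, units_per_node_hour, sla_hours):
    """Smallest node count that finishes the pending work inside the SLA."""
    return math.ceil(pending_work_units / (units_per_node_hour * sla_hours))

def plan_capacity(current_nodes, pending_work_units, units_per_node_hour,
                  sla_hours, spot_price, reserved_price, interruptible):
    """Return the target size, the change needed, and an instance type."""
    target = nodes_needed(pending_work_units, units_per_node_hour, sla_hours)
    delta = target - current_nodes  # positive: add nodes, negative: remove
    # Assumption: take spot instances only when they are cheaper and the
    # workload tolerates interruption.
    use_spot = interruptible and spot_price < reserved_price
    return {
        "target_nodes": target,
        "delta": delta,
        "instance_type": "spot" if use_spot else "reserved",
    }
```

For example, `plan_capacity(4, 1000, 50, 2, 0.1, 0.3, True)` asks for 10 nodes in total, so 6 more than the current 4, and picks spot instances because the workload is marked interruptible.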

Transformational Applications Require Connected Data
[Diagram: Data in Motion and Data at Rest connected across the DATA CENTER and the CLOUD. Labels: Edge Data, Edge Analytics, Stream Analytics, Machine Learning, Deep Historical Analysis]

Thank You

Big Data Spain
