Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apache Hive. ALAN GATES at Big Data Spain 2012

Coordinating the Many Tools of Big Data (Page 1)
Alan F. Gates, @alanfgates
Big Data Spain 2012, http://www.bigdataspain.org/

Big Data = Terabytes, Petabytes, … (Page 2) © Hortonworks 2012
Image Credit: Gizmodo

But It Is Also Complex Algorithms (Page 3)
• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning across their data:
$$w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell\big(f(x; w^{(t)}), y\big)$$
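
The update rule above can be sketched in a few lines of plain Python. This is only an illustration, not the Pig UDF described in the talk: it assumes a linear model f(x; w) = w · x with squared loss, and the names `sgd_step` and `train`, as well as the step-size schedule `gamma0 / (1 + decay * t)`, are hypothetical choices made for the example.

```python
import random


def sgd_step(w, x, y, gamma):
    """Apply one update: w <- w - gamma * gradient of (w.x - y)^2 with respect to w."""
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    gradient = [2.0 * (prediction - y) * xi for xi in x]
    return [wi - gamma * gi for wi, gi in zip(w, gradient)]


def train(samples, dims, epochs=300, gamma0=0.1, decay=0.01, seed=0):
    """Run SGD over (x, y) samples, visiting them in a shuffled order each epoch."""
    rng = random.Random(seed)
    w = [0.0] * dims
    t = 0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)
        for x, y in order:
            gamma = gamma0 / (1.0 + decay * t)  # step size gamma(t) shrinks over time
            w = sgd_step(w, x, y, gamma)
            t += 1
    return w


# Noise-free data from y = 2 * x1 + 1; the constant 1.0 feature carries the bias.
samples = [([x1, 1.0], 2.0 * x1 + 1.0) for x1 in (0.0, 0.5, 1.0, 1.5, 2.0)]
weights = train(samples, dims=2)
```

On this toy data the learned weights should approach 2 and 1. Each update touches a single sample, which is what makes the method attractive for very large data sets.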

Pre-Cloud: One Tool per Machine (Page 4)
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (SAS).
[Diagram: Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, Data Mart]

Cloud: Many Tools One Platform (Page 5)
• Users no longer want to be concerned with which platform their data is on; they just want to apply the tool to it
• SQL is no longer the only or primary data access tool
[Diagram: Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, OLTP]

Upside - Pick the Right Tool for the Job (Page 6)

Downside – Tools Don’t Play Well Together (Page 7)
• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user defined function interfaces

Downside – Wasted Developer Time (Page 8)
• Wastes developer time, since each tool supplies redundant functionality
[Diagram: Pig and Hive each shown with their own Parser, Optimizer, Physical Planner, and Executor, plus a Metadata box]

Downside – Wasted Developer Time (Page 9)
• Wastes developer time, since each tool supplies redundant functionality
[Diagram: the same Pig and Hive stacks (Parser, Optimizer, Physical Planner, Executor, Metadata), with the duplicated components marked "Overlap"]

Conclusion: We Need Services (Page 10)
• We need to find a way to share services where we can.
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense

Hadoop = Distributed Data Operating System (Page 11)
Service | Hadoop Component | Single Node Analogue
Table Management | HCatalog | RDBMS
User access control | Hadoop | /etc/passwd, file system permissions, etc.
Resource management | YARN | Process management
Notification | HCatalog | Signals, semaphores, mutexes
REST/Connectors | HCatalog, Hive, HBase, Oozie | Network layer
Batch data processing | Data Virtual Machine | JVM
Legend: Exists / Pieces exist in this component / To be built

HCatalog – Table Management (Page 13)
• Opens up Hive’s tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access

Data Access Without HCatalog (Page 14)
[Diagram components: Metastore, HDFS; Hive (Metastore Client, SerDe, InputFormat/OutputFormat); MapReduce (InputFormat/OutputFormat); Pig (Load/Store)]

Data & Metadata Access With HCatalog (Page 15)
[Diagram components: Metastore, HDFS; Hive (Metastore Client, SerDe, InputFormat/OutputFormat); MapReduce (HCatInputFormat/HCatOutputFormat); Pig (HCatLoader/HCatStorer); External System (REST)]

Without HCatalog (Page 16)
Feature | MapReduce | Pig | Hive
Record format | Key value pairs | Tuple | Record
Data model | User defined | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Encoded in app | Declared in script or read by loader | Read from metadata
Data location | Encoded in app | Declared in script | Read from metadata
Data format | Encoded in app | Declared in script | Read from metadata

With HCatalog (Page 17)
Feature | MapReduce + HCatalog | Pig + HCatalog | Hive
Record format | Record | Tuple | Record
Data model | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema | Read from metadata | Read from metadata | Read from metadata
Data location | Read from metadata | Read from metadata | Read from metadata
Data format | Read from metadata | Read from metadata | Read from metadata
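
As a rough illustration of the "read from metadata" rows above, the sketch below models a shared catalog in plain Python. It is a toy and not the HCatalog API: `TableInfo`, `Catalog`, `read_as_tuples`, and `read_as_records` are hypothetical names, and the table name, path, and delimiter are made-up values. Both readers take schema, location, and format from the same catalog entry instead of encoding them in the application or script; they differ only in the record shape they hand back.

```python
from dataclasses import dataclass

# Maps the type names used in the toy schema to Python conversions.
CASTS = {"string": str, "int": int, "float": float}


@dataclass(frozen=True)
class TableInfo:
    """Metadata a tool would otherwise hard-code: schema, location, format."""
    schema: tuple     # (column name, type name) pairs
    location: str     # where the data lives
    delimiter: str    # how fields are separated in storage


class Catalog:
    """Hypothetical stand-in for a shared table-metadata service."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def lookup(self, name):
        return self._tables[name]


def _parse_rows(info, storage):
    """Decode stored lines using only what the catalog entry says."""
    for line in storage[info.location]:
        fields = line.split(info.delimiter)
        yield tuple(CASTS[type_name](value)
                    for (_, type_name), value in zip(info.schema, fields))


def read_as_tuples(catalog, storage, table):
    """Pig-like view: each row is a positional tuple."""
    return list(_parse_rows(catalog.lookup(table), storage))


def read_as_records(catalog, storage, table):
    """Record-like view: each row is keyed by column name."""
    info = catalog.lookup(table)
    names = [name for name, _ in info.schema]
    return [dict(zip(names, row)) for row in _parse_rows(info, storage)]


# A dict stands in for the file system; the path and rows are invented.
storage = {"/data/web_logs": ["alice\t/index\t120", "bob\t/about\t45"]}
catalog = Catalog()
catalog.register("web_logs", TableInfo(
    schema=(("user", "string"), ("url", "string"), ("bytes", "int")),
    location="/data/web_logs",
    delimiter="\t",
))

tuples = read_as_tuples(catalog, storage, "web_logs")
records = read_as_records(catalog, storage, "web_logs")
```

Changing the stored format or location then means updating one catalog entry rather than every script and application that reads the table.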

YARN – Resource Manager (Page 18)
• Hadoop 1.0: HDFS plus MapReduce
• Hadoop 2.0: HDFS plus the YARN Resource Manager, an interface for developers to write parallel applications on top of the Hadoop cluster
• The Resource Manager provides:
  – applications a way to request resources in the cluster
  – allocation and scheduling of machine resources to the applications
• MapReduce is now an application provided inside YARN
• Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computations)
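
The request-and-allocate role described above can be sketched with a toy model. This is not the YARN API: `ToyResourceManager` and its methods are hypothetical, resources are reduced to memory in MB, and scheduling is plain first-in, first-out, which is far simpler than what a real scheduler does.

```python
from collections import deque


class ToyResourceManager:
    """Hypothetical, simplified model: tracks free memory per node and
    grants container requests in FIFO order."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)   # node name -> free memory in MB
        self.pending = deque()             # queued (application, memory) requests

    def request(self, app, memory_mb):
        """An application asks for one container of the given size."""
        self.pending.append((app, memory_mb))

    def schedule(self):
        """Grant queued requests while the head of the queue fits on some node."""
        grants = []
        while self.pending:
            app, memory_mb = self.pending[0]
            node = next((name for name, free in sorted(self.free.items())
                         if free >= memory_mb), None)
            if node is None:
                break                      # head request waits until memory is released
            self.pending.popleft()
            self.free[node] -= memory_mb
            grants.append((app, node, memory_mb))
        return grants

    def release(self, node, memory_mb):
        """A finished container returns its memory to the node."""
        self.free[node] += memory_mb


rm = ToyResourceManager({"node1": 4096, "node2": 2048})
rm.request("mapreduce-job", 3072)
rm.request("storm-topology", 2048)
rm.request("spark-app", 2048)
first_round = rm.schedule()    # the third request does not fit yet
rm.release("node1", 3072)      # the MapReduce container finishes
second_round = rm.schedule()
```

The application names are only labels here; the point is that MapReduce is one client of the resource manager among several.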

Architectural Comparison (Page 19)
[Diagram: Hadoop 1.0 versus Hadoop 2.0]

Data Virtual Machine – Shared Batch Processing (Page 20)
• Recall our previous diagram of Pig and Hive
[Diagram: Pig and Hive stacks (Parser, Optimizer, Physical Planner, Executor, Metadata), with the duplicated components marked "Overlap"]

A VM That Provides (Page 21)
• Standard operators (the equivalent of Java bytecodes):
  – Project
  – Select
  – Join
  – Aggregate
  – Sort
  – …
• An optimizer that could
  – Choose an appropriate implementation of an operator based on physical data characteristics
  – Dynamically re-optimize the plan based on information gathered while executing the plan
• Shared execution layer
  – Can provide its own YARN application master and improve on the MapReduce paradigm for batch processing
• Shared User Defined Function (UDF) framework
  – User code works across systems
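
To make the idea of a small shared operator set concrete, here is a minimal in-memory sketch of the five operators named above. It is an illustration only: the function names, signatures, and sample data are hypothetical and do not describe any actual Data Virtual Machine design.

```python
from collections import defaultdict


def project(rows, columns):
    """Keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


def select(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Inner join on a shared column, using a hash index over the right input."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate(rows, key, column):
    """Sum `column` for each distinct value of `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[column]
    return [{key: k, column: v} for k, v in totals.items()]


def sort(rows, column, descending=False):
    """Order rows by one column."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


users = [{"user": "alice", "country": "ES"}, {"user": "bob", "country": "US"}]
clicks = [{"user": "alice", "bytes": 120}, {"user": "bob", "bytes": 45},
          {"user": "alice", "bytes": 30}, {"user": "carol", "bytes": 99}]

# Bytes per country for clicks larger than 20 bytes, largest first.
plan = sort(
    aggregate(
        project(join(select(clicks, lambda r: r["bytes"] > 20), users, "user"),
                ["country", "bytes"]),
        "country", "bytes"),
    "bytes", descending=True)
```

A Pig Latin script and a Hive query that express the same computation could, in principle, both compile down to a plan of this shape and share everything beneath it.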

Taking Advantage of YARN – MR*
[Diagram: two chained MapReduce jobs (Map, Reduce, Map, Reduce) passing data through HDFS]
Why do I need these maps?

Taking Advantage of YARN – MR*
[Diagram: the two chained MapReduce jobs with HDFS between them, next to a single Map, Reduce, Reduce pipeline]
• Removes an entire write/read cycle of HDFS
• Still want to checkpoint sometimes
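
The question "Why do I need these maps?" can be illustrated with a toy, single-process simulation. Nothing here is Hadoop code: `shuffle`, `run_map`, `run_reduce`, and the sample job are hypothetical, and HDFS is not modeled at all. The sketch only shows that when the second job's map is an identity function, feeding the first reduce straight into the second shuffle gives the same answer.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as the shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_map(records, mapper):
    """Apply a mapper to every record and collect the emitted pairs."""
    return [pair for record in records for pair in mapper(record)]


def run_reduce(pairs, reducer):
    """Shuffle the pairs, then apply a reducer to each key group."""
    return [out for key, values in shuffle(pairs).items()
            for out in reducer(key, values)]


def tokenize(line):                  # first map: line -> (word, 1)
    return [(word, 1) for word in line.split()]


def count_words(word, ones):         # first reduce: emits (count, word) for the next stage
    return [(sum(ones), word)]


def identity(pair):                  # second map: only passes pairs through
    return [pair]


def words_per_count(count, words):   # second reduce: how many words share each count
    return [(count, len(words))]


lines = ["pig hive pig", "hive hcatalog pig"]

# Two chained MapReduce jobs: the second map merely re-reads the first job's output.
job1 = run_reduce(run_map(lines, tokenize), count_words)
chained = run_reduce(run_map(job1, identity), words_per_count)

# Map, Reduce, Reduce: the first reduce output feeds the second shuffle directly.
mr_star = run_reduce(run_reduce(run_map(lines, tokenize), count_words),
                     words_per_count)
```

In a real cluster the saving comes from skipping the HDFS write and re-read between the two jobs, which this in-memory sketch cannot show.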

Taking Advantage of YARN – In Memory Data Transfer (Page 26)
[Diagram: Map and Reduce tasks, with the transfers between them labeled "These are writes to disk"]
Switching the shuffle to in memory instead of on disk:
• Better performance
• Data must also be spilled to disk for retry-ability and to handle memory overflow
• Will benefit from stronger guarantees of simultaneous execution

On the Fly Optimization (Page 27)
• Traditionally databases do all optimization up front based on statistics
  – But often there are no statistics in Hadoop
  – Languages like Pig Latin allow very long series of operations that make up-front estimates unrealistic
• Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information
[Diagram: MR Job, MR Job, Hash Join]

On the Fly Optimization (Page 28)
[Diagram: the plan from the previous slide (MR Job, MR Job, Hash Join), annotated "Output fits in memory"]

On the Fly Optimization (Page 29)
[Diagram: the original plan (MR Job, MR Job, Hash Join) beside the rewritten plan (MR Job, MR Job, Map-side Join), annotated "Load into distributed cache"]
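
The re-planning step shown on these slides can be sketched as a decision made after an intermediate result has been produced. This is a toy under stated assumptions: the function names and the `memory_limit_rows` threshold are hypothetical, "fits in memory" is approximated by a row count, and both joins run in one process rather than as MapReduce jobs.

```python
def hash_join(left, right, key):
    """Stand-in for the planned hash join: both inputs are partitioned by key first."""
    partitions = {}
    for side, rows in (("left", left), ("right", right)):
        for row in rows:
            partitions.setdefault(row[key], {"left": [], "right": []})[side].append(row)
    return [{**l, **r}
            for part in partitions.values()
            for l in part["left"] for r in part["right"]]


def map_side_join(big, small, key):
    """Stand-in for a map-side join: the small input is held in memory as a lookup."""
    lookup = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    return [{**b, **s} for b in big for s in lookup.get(b[key], [])]


def adaptive_join(big, intermediate, key, memory_limit_rows):
    """Choose the join only after the intermediate result's size has been observed."""
    if len(intermediate) <= memory_limit_rows:
        return "map-side join", map_side_join(big, intermediate, key)
    return "hash join", hash_join(big, intermediate, key)


clicks = [{"user": "alice", "bytes": 120}, {"user": "bob", "bytes": 45},
          {"user": "alice", "bytes": 30}]
# Pretend this is the output of the earlier jobs; it turned out to be tiny.
spanish_users = [{"user": "alice", "country": "ES"}]

strategy, rows = adaptive_join(clicks, spanish_users, "user", memory_limit_rows=1000)
fallback, fallback_rows = adaptive_join(clicks, spanish_users, "user", memory_limit_rows=0)
```

Both strategies return the same rows; only the execution choice changes, which is why the decision can safely be deferred until the size is known.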

Thank You Big Data Spain (Page 30)