Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apache Hive. Alan Gates at Big Data Spain 2012

Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/coordinating-many-tools-of-big-data/alan-gates


Transcript

  1. Coordinating the Many Tools of Big Data
     Alan F. Gates, @alanfgates
     Big Data Spain 2012, http://www.bigdataspain.org/

  2. But It Is Also Complex Algorithms
     • An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning across their data, where $\ell$ is the loss of the prediction $f(x; w^{(t)})$ against the label $y$:
     $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \, \nabla \ell\big(f(x; w^{(t)}), y\big)$

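To make the update concrete, here is a minimal sketch of one SGD step for a linear model under squared loss. This only illustrates the formula; it is not Twitter's actual Pig UDF, and the linear model and squared loss are assumptions made for the example.

```java
// One stochastic gradient descent step for a linear model with squared
// loss: an illustration of the update rule on the slide, not the
// actual Twitter/Pig UDF code.
public final class SgdStep {
    // Updates w in place: w <- w - gamma * grad loss(f(x; w), y).
    static void update(double[] w, double[] x, double y, double gamma) {
        // Linear model prediction: f(x; w) = w . x
        double prediction = 0.0;
        for (int i = 0; i < w.length; i++) {
            prediction += w[i] * x[i];
        }
        // For squared loss l(p, y) = 0.5 * (p - y)^2, the gradient
        // with respect to w is (p - y) * x.
        double error = prediction - y;
        for (int i = 0; i < w.length; i++) {
            w[i] -= gamma * error * x[i];
        }
    }
}
```
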
  3. Pre-Cloud: One Tool per Machine
     • Databases presented SQL or SQL-like paradigms for operating on data
     • Other tools came in separate packages (e.g. R) or on separate platforms (SAS)
     [Diagram: separate boxes for Data Warehouse, Data Mart, Cube/MOLAP, OLTP, and Statistical Analysis]

  4. Cloud: Many Tools, One Platform
     • Users no longer want to be concerned with what platform their data is in – just apply the tool to it
     • SQL is no longer the only or primary data access tool
     [Diagram: Data Warehouse, Data Mart, Cube/MOLAP, OLTP, and Statistical Analysis sharing one platform]

  5. Downside – Tools Don’t Play Well Together
     • Hard for users to share data between tools:
       – Different storage formats
       – Different data models
       – Different user defined function interfaces

  6. Downside – Wasted Developer Time
     • Wastes developer time, since each tool supplies the same functionality redundantly
     [Diagram: Pig and Hive stacks side by side, each with its own Parser, Optimizer, Physical Planner, Executor, and Metadata]

  7. Downside – Wasted Developer Time
     [Same diagram, with the duplicated Parser, Optimizer, Physical Planner, Executor, and Metadata layers highlighted as overlap]

  8. Conclusion: We Need Services
     • We need to find a way to share services where we can
     • Gives users the same experience across tools
     • Allows developers to share effort when it makes sense

  9. Hadoop = Distributed Data Operating System

     Service               | Hadoop Component             | Single Node Analogue
     Table management      | HCatalog                     | RDBMS
     User access control   | Hadoop                       | /etc/passwd, file system permissions, etc.
     Resource management   | YARN                         | Process management
     Notification          | HCatalog                     | Signals, semaphores, mutexes
     REST/Connectors       | HCatalog, Hive, HBase, Oozie | Network layer
     Batch data processing | Data Virtual Machine         | JVM

     Legend (slide color coding): exists / pieces exist in this component / to be built

  10. Hadoop = Distributed Data Operating System
      [Same table as the previous slide]

  11. HCatalog – Table Management
      • Opens up Hive’s tables to other tools inside and outside Hadoop
      • Presents tools with a table paradigm that abstracts away storage details
      • Provides a shared data model
      • Provides a shared code path for data and metadata access

  12. Data Access Without HCatalog
      [Diagram: only Hive reaches the Metastore, via the Metastore Client; Hive reads HDFS through its InputFormat/OutputFormat and SerDe, while MapReduce (InputFormat/OutputFormat) and Pig (Load/Store functions) each access HDFS separately]

  13. Data & Metadata Access With HCatalog
      [Diagram: Hive is unchanged (Metastore Client, InputFormat/OutputFormat, SerDe); MapReduce now reads and writes through HCatInputFormat/HCatOutputFormat and Pig through HCatLoader/HCatStorer, both backed by the Metastore; external systems reach the metadata over REST]

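As a sketch of the MapReduce path in this diagram, here is a minimal job that reads a Hive table through HCatalog. It assumes the 2012-era API (package org.apache.hcatalog.*; later releases moved it to org.apache.hive.hcatalog.*), and the database and table names ("default", "web_logs") are hypothetical.

```java
// A minimal sketch of reading a Hive table from MapReduce via HCatalog,
// based on the 2012-era HCatalog API. Database/table names are made up.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class HCatReadSketch {

    // The mapper sees HCatRecords; no InputFormat, SerDe, schema, or
    // file location is hard-coded in the application. All of that is
    // resolved from HCatalog's metadata at runtime.
    public static class UrlMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(WritableComparable key, HCatRecord value, Context ctx)
                throws IOException, InterruptedException {
            // Fields can be read by position; position 0 is assumed
            // here to hold a string column.
            ctx.write(new Text(value.get(0).toString()), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "hcat-read-sketch");
        job.setJarByClass(HCatReadSketch.class);

        // Point the job at a Hive table by name; storage format and
        // location come from the metastore, not from this code.
        HCatInputFormat.setInput(job, InputJobInfo.create("default", "web_logs", null));
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(UrlMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same table could be read from Pig with HCatLoader or from Hive directly, which is the point of the diagram: one shared metadata and data-access path for all three tools.
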
  14. Without HCatalog

      Feature       | MapReduce       | Pig                                            | Hive
      Record format | Key-value pairs | Tuple                                          | Record
      Data model    | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
      Schema        | Encoded in app  | Declared in script or read by loader           | Read from metadata
      Data location | Encoded in app  | Declared in script                             | Read from metadata
      Data format   | Encoded in app  | Declared in script                             | Read from metadata

  15. With HCatalog

      Feature       | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
      Record format | Record                                    | Tuple                                          | Record
      Data model    | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
      Schema        | Read from metadata                        | Read from metadata                             | Read from metadata
      Data location | Read from metadata                        | Read from metadata                             | Read from metadata
      Data format   | Read from metadata                        | Read from metadata                             | Read from metadata

  16. YARN – Resource Manager
      • Hadoop 1.0: HDFS plus MapReduce
      • Hadoop 2.0: HDFS plus the YARN Resource Manager, an interface for developers to write parallel applications on top of the Hadoop cluster (see the sketch below)
      • The Resource Manager provides:
        – a way for applications to request resources in the cluster
        – allocation and scheduling of machine resources to the applications
      • MapReduce is now an application provided on top of YARN
      • Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computation)

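To show what that developer interface looks like, here is a skeleton of submitting an application to YARN with the Hadoop 2.x client API. Note this client API firmed up in releases shortly after this talk; the launch command and resource sizes below are placeholders.

```java
// Skeleton of submitting an application to YARN using the Hadoop 2.x
// client API. The AM command and resource sizes are placeholders.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("sketch-app");

        // Describe the container that will run the Application Master:
        // the per-application logic that then requests worker containers.
        ContainerLaunchContext amContainer =
                Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("/bin/true"));
        context.setAMContainerSpec(amContainer);

        // Request resources (memory in MB, virtual cores) for the AM;
        // allocation and scheduling are YARN's job, not the application's.
        context.setResource(Resource.newInstance(512, 1));

        ApplicationId id = context.getApplicationId();
        yarnClient.submitApplication(context);
        System.out.println("Submitted " + id);
    }
}
```
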
  17. Data Virtual Machine – Shared Batch Processing
      • Recall our previous diagram of Pig and Hive
      [Diagram: the Pig and Hive stacks again, with the overlapping Parser, Optimizer, Physical Planner, Executor, and Metadata layers highlighted]

  18. A VM That Provides
      • Standard operators (the equivalent of Java byte codes; see the sketch after this list):
        – Project
        – Select
        – Join
        – Aggregate
        – Sort
        – …
      • An optimizer that could:
        – Choose the appropriate implementation of an operator based on physical data characteristics
        – Dynamically re-optimize the plan based on information gathered while executing the plan
      • A shared execution layer
        – Can provide its own YARN Application Master and improve on the MapReduce paradigm for batch processing
      • A shared User Defined Function (UDF) framework, so user code works across systems

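Since the data virtual machine described here was a proposal rather than shipping code, the interfaces below are invented purely to illustrate the idea of shared operators; they do not correspond to any real Hadoop or Hive API.

```java
// Invented for illustration only: a toy rendering of the proposed data
// VM's "instruction set". Rows are modeled as Object arrays.
import java.util.ArrayList;
import java.util.List;

interface Operator {
    // Each operator is one "instruction" of the data VM, analogous to a
    // Java byte code: a well-defined transform over streams of rows
    // that every tool (Pig, Hive, ...) could compile down to.
    Iterable<Object[]> apply(Iterable<Object[]> input);
}

// Example instruction: project a subset of columns by position.
class Project implements Operator {
    private final int[] columns;

    Project(int... columns) { this.columns = columns; }

    @Override
    public Iterable<Object[]> apply(Iterable<Object[]> input) {
        List<Object[]> out = new ArrayList<Object[]>();
        for (Object[] row : input) {
            Object[] projected = new Object[columns.length];
            for (int i = 0; i < columns.length; i++) {
                projected[i] = row[columns[i]];
            }
            out.add(projected);
        }
        return out;
    }
}

// Two-input instruction: the optimizer could pick among implementations
// of this (hash join, sort-merge join, map-side join, ...) based on the
// physical characteristics of the data, even while the plan is running.
interface Join {
    Iterable<Object[]> apply(Iterable<Object[]> left, Iterable<Object[]> right);
}
```
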
  19. Taking Advantage of YARN – MR*
      [Diagram: two MapReduce jobs chained through HDFS: Map, Reduce, write to HDFS, then Map, Reduce]

  20. Taking Advantage of YARN – MR*
      [Same diagram, with a callout on the second job’s map phase: “Why do I need these maps?”]

  21. Taking Advantage of YARN – MR*
      [Diagram: the chain re-drawn as Map, Reduce, Reduce, with no intermediate HDFS write]
      • Removes an entire write/read cycle against HDFS
      • Still want to checkpoint sometimes

  22. Taking Advantage of YARN – In Memory Data Transfer
      [Diagram: maps feeding reduces]

  23. Taking Advantage of YARN – In Memory Data Transfer
      [Same diagram, with a callout on the shuffle: “These are writes to disk”]
      Switching the shuffle to memory instead of disk:
      • Better performance
      • Data must still be spilled to disk for retryability and to handle memory overflow
      • Will benefit from stronger guarantees of simultaneous execution

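A toy illustration of the trade-off on this slide: serve shuffle data from memory for speed, but keep a disk copy so a failed consumer can re-read its input and memory overflow stays bounded. This is invented code, not Hadoop's actual shuffle implementation.

```java
// Toy spillable buffer, invented for illustration. Every record is
// logged to disk (retryability); a copy stays in memory while under
// budget (the fast path for the in-memory shuffle).
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SpillableShuffleBuffer {
    private final List<byte[]> inMemory = new ArrayList<byte[]>();
    private final DataOutputStream spillLog;
    private final long memoryBudgetBytes;
    private long memoryUsedBytes = 0;

    SpillableShuffleBuffer(File spillFile, long memoryBudgetBytes) throws IOException {
        this.spillLog = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile)));
        this.memoryBudgetBytes = memoryBudgetBytes;
    }

    void add(byte[] record) throws IOException {
        // Always persist, so a failed downstream task can be retried.
        spillLog.writeInt(record.length);
        spillLog.write(record);
        // Keep a copy in memory only while under budget; past the
        // budget, consumers fall back to re-reading the spill log.
        if (memoryUsedBytes + record.length <= memoryBudgetBytes) {
            inMemory.add(record);
            memoryUsedBytes += record.length;
        }
    }

    List<byte[]> inMemoryRecords() { return inMemory; }

    void close() throws IOException { spillLog.close(); }
}
```
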
  24. On the Fly Optimization
      • Traditionally, databases do all optimization up front, based on statistics
        – But often there are no statistics in Hadoop
        – Languages like Pig Latin allow very long series of operations, which makes up-front estimates unrealistic
      • Observation: as the system operates on the data, it can gather basic statistics and change the subsequent operators based on this information
      [Diagram: two MR jobs feeding a hash join]

  25. On the Fly Optimization
      [Same diagram, annotated: the first job’s output fits in memory]

  26. On the Fly Optimization
      [Same diagram, re-planned: the hash join is replaced by a map-side join, with the small output loaded into the distributed cache]

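A sketch of the decision these three slides walk through: once the upstream job has run, the size of one join input is known exactly, so the planner can swap the shuffle-based hash join for a map-side join that ships the small input via the distributed cache. The threshold and the two plan* methods below are invented for illustration; only FileSystem.getContentSummary is a real Hadoop call.

```java
// Sketch of on-the-fly join selection. The job already ran, so the
// planner checks the observed (not estimated) size of the join input
// and re-plans the downstream operator accordingly.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RuntimeJoinPlanner {

    // Hypothetical per-map-task memory budget for a replicated input.
    private static final long MAP_SIDE_JOIN_LIMIT = 256L * 1024 * 1024;

    static void chooseJoin(Configuration conf, Path smallSideOutput) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // Observed size of the upstream job's output on HDFS.
        long observedBytes = fs.getContentSummary(smallSideOutput).getLength();

        if (observedBytes <= MAP_SIDE_JOIN_LIMIT) {
            // Small enough: load into the distributed cache and join
            // in the mappers, skipping the shuffle entirely.
            planMapSideJoin(smallSideOutput);
        } else {
            // Too big for memory: keep the shuffle-based hash join.
            planShuffleHashJoin(smallSideOutput);
        }
    }

    private static void planMapSideJoin(Path smallSide) { /* invented stub */ }

    private static void planShuffleHashJoin(Path smallSide) { /* invented stub */ }
}
```
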