Data Science London: YARN, Tez and Hive

cj_harris5

July 27, 2013

Transcript

  1. Hadoop MapReduce Classic
     • JobTracker
       – Manages cluster resources and job scheduling
     • TaskTracker
       – Per-node agent
       – Manages tasks

  2. Current Limitations
     • Scalability
       – Maximum cluster size: 4,000 nodes
       – Maximum concurrent tasks: 40,000
       – Coarse synchronization in the JobTracker
     • Single point of failure
       – A failure kills all queued and running jobs
       – Jobs need to be re-submitted by users
     • Restart is very tricky due to complex state

  3. Current Limitations
     • Hard partition of resources into map and reduce slots
       – Low resource utilization
     • Lacks support for alternate paradigms
       – Iterative applications implemented using MapReduce are 10x slower
       – Hacks for the likes of MPI / graph processing
     • Lack of wire-compatible protocols
       – Client and cluster must be the same version
       – Applications and workflows cannot migrate to different clusters

  4. Requirements
     • Reliability
     • Availability
     • Utilization
     • Wire compatibility
     • Agility & evolution – ability for customers to control upgrades to the grid software stack
     • Scalability – clusters of 6,000–10,000 machines
       – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
       – 100,000+ concurrent tasks
       – 10,000 concurrent jobs

  5. Design Centre
     • Split up the two major functions of the JobTracker
       – Cluster resource management
       – Application life-cycle management
     • MapReduce becomes a user-land library

  6. Concepts
     • Application
       – An application is a job submitted to the framework
       – Example: a MapReduce job
     • Container
       – Basic unit of allocation
       – Example: container A = 2GB, 1 CPU
       – Fine-grained resource allocation
       – Replaces the fixed map/reduce slots

  7. Architecture
     • Resource Manager
       – Global resource scheduler
       – Hierarchical queues
     • Node Manager
       – Per-machine agent
       – Manages the life-cycle of containers
       – Container resource monitoring
     • Application Master
       – Per-application
       – Manages application scheduling and task execution
       – E.g. the MapReduce Application Master

  8. Architecture
     [Diagram: clients submit jobs to the Resource Manager; per-node Node Managers host Application Masters and Containers and report node status; Application Masters send resource requests to the Resource Manager and track MapReduce status from their containers.]

  9. Improvements vis-à-vis classic MapReduce
     • Utilization
       – Generic resource model
         • Data locality (node, rack, etc.)
         • Memory
         • CPU
         • Disk b/w
         • Network b/w
       – Removes the fixed partition into map and reduce slots
     • Scalability
       – Application life-cycle management is very expensive
       – Partition resource management and application life-cycle management
       – Application management is distributed
       – Hardware trends – currently run clusters of 4,000 machines
         • 6,000 machines in 2012 outperform 12,000 machines in 2009
         • <16+ cores, 48/96G RAM, 24TB> vs. <8 cores, 16G RAM, 4TB>

  10. Improvements vis-à-vis classic MapReduce
     • Fault tolerance and availability
       – Resource Manager
         • No single point of failure – state saved in ZooKeeper (coming soon)
         • Application Masters are restarted automatically on RM restart
       – Application Master
         • Optional failover via application-specific checkpoint
         • MapReduce applications pick up where they left off via state saved in HDFS
     • Wire compatibility
       – Protocols are wire-compatible
       – Old clients can talk to new servers
       – Rolling upgrades

  11. Improvements vis-à-vis classic MapReduce
     • Innovation and agility
       – MapReduce now becomes a user-land library
       – Multiple versions of MapReduce can run in the same cluster (à la Apache Pig)
         • Faster deployment cycles for improvements
       – Customers upgrade MapReduce versions on their own schedule
       – Users can customize MapReduce

  12. To Talk to Tez, You First Have to Talk YARN
     • New processing layer in Hadoop 2.0 that decouples resource management from application management
     • Created to manage resource needs across all uses
     • Ensures predictable performance & QoS for all apps
     • Enables apps to run "IN" Hadoop rather than "ON" it – key to leveraging all the other common services of the Hadoop platform: security, data lifecycle management, etc.
     [Diagram: applications run natively IN Hadoop, on HDFS2 (redundant, reliable storage) and YARN (cluster resource management): BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC/MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …).]

  13. Tez: High Throughput and Low Latency
     • Tez runs in YARN
     • Accelerates both high-throughput AND low-latency processing
     [Diagram: the same YARN architecture as before – clients, Resource Manager, Node Managers, Application Masters, Containers – with Tez running as a YARN application.]

  14. Tez – Core Idea
     • A Tez Task is a triple: <Input, Processor, Output>
       – A task with pluggable Input, Processor & Output
     • A YARN ApplicationMaster runs a DAG of Tez Tasks

  15. Tez – Hive Performance
     • Low-level data-processing execution engine on YARN
     • Base for MapReduce, Hive, Pig, Cascading, etc.
     • Re-usable data processing primitives (e.g. sort, merge, intermediate data management)
     • Any Hive SQL can be expressed as a single job (sketch below)
       – Jobs are no longer interrupted (efficient pipeline)
       – Avoids writing intermediate output to HDFS when performance outweighs job re-start cost (speed and network/disk usage savings)
       – Breaks the MR contract to turn MRMRMR into MRRR (flexible DAG)
     • Removes task and job overhead (a 10s saving is huge for a 2s query!)

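     A quick way to see the single-job pipeline is to compare query plans under the two engines. A minimal HiveQL sketch, assuming a Hive release with Tez integration (the hive.execution.engine switch shipped in Hive 0.13, after this talk); the join is a trimmed stand-in for the full Query 27 shown later:

       -- Plan the query as one Tez DAG instead of a chain of MR jobs.
       -- (hive.execution.engine is a real property, but postdates Hive 11.)
       SET hive.execution.engine=tez;
       EXPLAIN
       SELECT col5, avg(col6)
       FROM store_sales_fact ssf
         JOIN store_dim ON (ssf.col4 = store_dim.col4)
       GROUP BY col5;

     With the engine set back to mr, the same EXPLAIN shows multiple map-reduce stages, each writing intermediate output to HDFS.
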
  16. Tez Service
     • MR query startup is expensive
       – Job-launch & task-launch latencies are fatal for short queries (on the order of 5s to 30s)
     • Tez Service
       – An always-on pool of Application Masters
       – Hive (and soon others, like Pig) jobs are executed on an Application Master from the pool instead of starting a new one (saving precious seconds)
       – Container pre-allocation / warm containers (sketch below)
     • In summary…
       – Removes job-launch overhead (Application Master pool)
       – Removes task-launch overhead (pre-warmed containers)

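     Later Hive-on-Tez releases expose the warm-container idea as session settings; a minimal sketch, assuming Hive 0.13+ (these property names postdate the Tez Service prototype described on this slide):

       -- Keep a pool of pre-launched containers for the session.
       SET hive.execution.engine=tez;
       SET hive.prewarm.enabled=true;       -- enable container pre-warming
       SET hive.prewarm.numcontainers=10;   -- size of the warm pool
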
  17. Stinger Initiative: Making Apache Hive 100x Faster
     [Roadmap slide spanning three phases – Current, 3–9 months, 9–18 months – across Hive and Hadoop:]
     • Base optimizations – generate simplified DAGs, in-memory hash joins
     • Deep analytics – SQL-compatible types, SQL-compatible windowing, more SQL subqueries
     • YARN – next-gen Hadoop data processing framework
     • Tez – express data processing tasks more simply, eliminate disk writes
     • Hive Tez Service – pre-warmed containers, low-latency dispatch
     • ORCFile – column store, high compression, predicate/filter pushdowns
     • Buffer caching – cache accessed data, optimized for the vector engine
     • Query planner – intelligent cost-based optimizer
     • Vector query engine – optimized for modern processor architectures

  18. Demo Benchmark Spec
     • The TPC-DS benchmark data + query set
     • Query 27 – a big table (store_sales) joins lots of small tables
       – a.k.a. a star schema join
     • What does Query 27 do? For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average sales price, and average coupon amount for a given gender, marital status, education, and customer demographic.

  19. Query 27 – Star Schema Join
     • Derived from TPC-DS Query 27

       SELECT col5, avg(col6)
       FROM store_sales_fact ssf
         JOIN item_dim          ON (ssf.col1 = item_dim.col1)
         JOIN date_dim          ON (ssf.col2 = date_dim.col2)
         JOIN custdmgrphcs_dim  ON (ssf.col3 = custdmgrphcs_dim.col3)
         JOIN store_dim         ON (ssf.col4 = store_dim.col4)
       GROUP BY col5
       ORDER BY col5
       LIMIT 100;

     • Table sizes: store_sales_fact 41 GB; dimension tables 58 MB, 11 MB, 80 MB, and 106 KB

  20. Benchmark Cluster Specs
     • 6-node HDP 1.3 cluster
       – 2 master nodes
       – 4 data/slave nodes
     • Slave node specs
       – Dual processors
       – 14 GB RAM

  21. Query 27 Execution Before Hive 11 – Text Format
     • The query spawned 5 MR jobs; the intermediate output of each job is written to HDFS
     • Job 1 of 5 – mappers stream the fact table and the first dimension table, sorted by join key; reducers perform the join between the two tables
     • Job 2 of 5 – mappers and reducers join the output of job 1 with the second dimension table
     • The last job performs the order-by and group operation
     • 149 mappers in total were executed; query response time was about 21 minutes (see the sketch below for reproducing this plan)

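     For reference, the chained shuffle-join plan can still be reproduced on Hive 11 by turning off automatic map-join conversion; a minimal sketch (hive.auto.convert.join is a real property whose default flipped to true in Hive 11):

       -- Force Hive 10-style behavior: every join becomes a shuffle
       -- (common) join, and each join stage runs as its own MR job.
       SET hive.auto.convert.join=false;
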
  22. Query 27 Execution With Hive 11 – Text Format
     • The query spawned 1 job with Hive 11, compared to 5 MR jobs with Hive 10
     • Job 1 of 1 – each mapper loads the 4 small dimension tables into memory and streams parts of the large fact table; the joins then occur in the mapper, hence the name MapJoin (sketch below)
     • Large increase in performance with Hive 11: query time went down from 21 minutes to about 4 minutes

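     The map-join can be left to the optimizer or requested with a hint; a minimal HiveQL sketch (hint handling varies by version; the size threshold shown is the Hive 0.11 default):

       -- Convert joins against small tables into map-side hash joins.
       SET hive.auto.convert.join=true;
       -- Tables smaller than this (bytes) are loaded into mapper memory.
       SET hive.mapjoin.smalltable.filesize=25000000;

       -- Or ask for it explicitly:
       SELECT /*+ MAPJOIN(store_dim) */ col5, avg(col6)
       FROM store_sales_fact ssf
         JOIN store_dim ON (ssf.col4 = store_dim.col4)
       GROUP BY col5;
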
  23. Query 27 Execution With Hive 11 – RC Format
     • Converting from text to the RC (columnar) file format decreased the size of the data set from 38 GB to 8.21 GB (sketch below)
     • A smaller file equates to less IO, causing the query time to decrease from 246 seconds to 136 seconds

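     The conversion itself is a single CREATE TABLE AS SELECT; a minimal sketch (the _rc table name is illustrative; STORED AS RCFILE is standard HiveQL):

       -- Rewrite the text-format fact table into the columnar RCFile layout.
       CREATE TABLE store_sales_fact_rc
       STORED AS RCFILE
       AS SELECT * FROM store_sales_fact;
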
  24. Query 27 Execution With Hive 11 – ORC Format
     • The ORC file type consolidates data more tightly than RCFile: the size of the data set decreased from 8.21 GB to 2.83 GB (sketch below)
     • A smaller file equates to less IO, causing the query time to decrease from 136 seconds to 104 seconds

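     Moving to ORC follows the same pattern; a minimal sketch (illustrative name; STORED AS ORC and the orc.compress table property are standard in Hive 0.11+):

       -- ORC applies ZLIB compression by default; stating it explicitly
       -- documents the choice.
       CREATE TABLE store_sales_fact_orc
       STORED AS ORC
       TBLPROPERTIES ("orc.compress" = "ZLIB")
       AS SELECT * FROM store_sales_fact;
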
  25. Query 27 Execution With Hive 11 – RC Format, Partitioned, Bucketed and Sorted
     • Partitioning the data decreases the input size to 1.73 GB (sketch below)
     • A smaller input equates to less IO, causing the query time to decrease from 136 seconds to 104 seconds

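     The deck does not show the DDL, so this is a hedged sketch of one plausible layout (column names follow the obfuscated query; the partition column and bucket count are illustrative):

       -- Partition on the group-by column; bucket and sort on a join key.
       CREATE TABLE store_sales_fact_part (
         col1 INT, col2 INT, col3 INT, col4 INT, col6 DOUBLE
       )
       PARTITIONED BY (col5 STRING)
       CLUSTERED BY (col4) SORTED BY (col4) INTO 32 BUCKETS
       STORED AS RCFILE;
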
  26. Query 27 Execution With Hive 11 – ORC Format, Partitioned, Bucketed and Sorted
     • Partitioned data with the ORC file format produces the smallest input size: 687 MB, down from the original 43 GB (sketch below)
     • The smallest input size yields the fastest response time for the query: 79 seconds

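     Populating such a table is typically done with a dynamic-partition insert; a minimal sketch, reusing the partitioned table from the previous sketch (swap STORED AS RCFILE for ORC to match this slide; all names remain illustrative):

       SET hive.exec.dynamic.partition=true;
       SET hive.exec.dynamic.partition.mode=nonstrict;
       SET hive.enforce.bucketing=true;   -- make inserts honor the bucket spec

       -- The dynamic partition column (col5) must come last in the SELECT.
       INSERT OVERWRITE TABLE store_sales_fact_part PARTITION (col5)
       SELECT col1, col2, col3, col4, col6, col5
       FROM store_sales_fact;
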
  27. Summary of Results

     File Type                             MR Jobs   Input Size   Mappers   Time
     Text / Hive 10                        5         43.1 GB      179       21 minutes
     Text / Hive 11                        1         38 GB        151       246 seconds
     RC / Hive 11                          1         8.21 GB      76        136 seconds
     ORC / Hive 11                         1         2.83 GB      38        104 seconds
     RC / Hive 11, partitioned/bucketed    1         1.73 GB      19        104 seconds
     ORC / Hive 11, partitioned/bucketed   1         687 MB       27        79.62 seconds

  28. Resources
     • hadoop-2.0.3-alpha: http://hadoop.apache.org/common/releases.html
     • Release documentation: http://hadoop.apache.org/common/docs/r2.0.3-alpha/
     • Blogs:
       – http://hortonworks.com/blog/category/apache-hadoop/yarn/
       – http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
       – http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

  29. Thank You! Questions & Answers