
A Bird's-Eye View of Pig and Scalding with hRaven


Presented at Hadoop Summit San Jose 2013

As Twitter's use of MapReduce rapidly expands, tracking usage on our clusters grows correspondingly more difficult. With an ever-increasing job load and a reliance on higher-level abstractions such as Pig and Scalding, the utility of existing tools for viewing job history decreases rapidly, and extracting insights becomes a challenge. At Twitter, we created hRaven to fill this gap. hRaven archives the full history and metrics from all MapReduce jobs on our clusters, and strings together each job from a Pig or Scalding script execution into a combined flow. From this archive, we can easily derive aggregate resource utilization by user, pool, or application, while historical trending of an individual application allows us to perform runtime optimization of resource scheduling. We will cover how hRaven provides a rich historical archive of MapReduce job execution, and how the data is structured into higher-level flows representing the job sequence for frameworks such as Pig, Scalding, and Hive. We will then explore how we mine hRaven data to account for Hadoop resource utilization, to optimize runtime scheduling, and to identify common anti-patterns in user jobs. Finally, we will look at the end-user experience, including Ambrose integration for flow visualization.


Gary Helmling

June 27, 2013


Transcript

  1. A Bird’s-Eye View of Pig and Scalding with hRaven
     a tale by @gario and @joep
     Hadoop Summit 2013, v1.2 Friday, June 28, 13
  2. @Twitter #HadoopSummit2013 About the authors
     • Apache HBase PMC member and Committer • Software Engineer @ Twitter • Core Storage Team - Hadoop/HBase
     • Software Engineer @ Twitter • Engineering Manager, Hadoop/HBase team @ Twitter
  3. Table of Contents
     Chapter 1: The Problem
     Chapter 2: Why hRaven?
     Chapter 3: How Does it Work? • 3a: Loading • 3b: Table structure / querying
     Chapter 4: Current Uses
     Appendix: Future Work
  4. Chapter 1: Mismatched Abstractions
     Most users run Pig and Scalding scripts, not straight MapReduce
     The JobTracker UI shows jobs, not the DAGs of jobs generated by Pig and Scalding
  5. Chapter 1: Questions
     How many Pig versus Scalding jobs do we run?
     What cluster capacity do jobs in my pool take?
     How many jobs do we run each day?
     What % of jobs have > 30k tasks?
     Why do I need to hand-tune these hundreds of jobs; can’t the cluster learn?
  6. Chapter 1: Questions #Nevermore
     How many Pig versus Scalding jobs do we run?
     What cluster capacity do jobs in my pool take?
     How many jobs do we run each day?
     What % of jobs have > 30k tasks?
     Why do I need to hand-tune these hundreds of jobs; can’t the cluster learn?
  7. Chapter 2: Why hRaven?
     Stores stats, configuration, and timing for every MapReduce job on every cluster
     Structured around the full DAG of jobs from a Pig or Scalding application
     Easily queryable for historical trending
     Allows for Pig reducer optimization based on historical run stats
     Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)
  8. Chapter 2: Key Concepts
     cluster - each cluster has a unique name mapping to the JobTracker
     user - MapReduce jobs are run as a given user
     application - a Pig or Scalding script (or plain MapReduce job)
     flow - the combined DAG of jobs executed from a single run of an application
     version - changes impacting the DAG are recorded as a new version of the same application
  9. Chapter 2: Flow Storage
     All jobs in a flow are ordered together
  10. Chapter 2: Key Features
     All jobs in a flow are ordered together
     Per-job metrics stored • Total map and reduce tasks • HDFS bytes read / written • File bytes read / written • Total map and reduce slot milliseconds
     Easy to aggregate stats for an entire flow
     Easy to scan the timeseries of each application’s flows
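Because every job in a flow carries the same small set of counters, flow-level aggregation is just a sum over the flow's jobs. A minimal sketch; the metric names below are illustrative, not hRaven's actual column qualifiers:

```python
from collections import Counter

def aggregate_flow(jobs):
    """Sum the per-job metric dicts of one flow into flow totals."""
    totals = Counter()
    for job_metrics in jobs:
        totals.update(job_metrics)  # Counter.update adds counts
    return dict(totals)

# Two hypothetical jobs from one Pig script run
flow = [
    {"map_tasks": 100, "reduce_tasks": 10, "hdfs_bytes_read": 5000},
    {"map_tasks": 40,  "reduce_tasks": 4,  "hdfs_bytes_read": 1200},
]
totals = aggregate_flow(flow)
# totals == {'map_tasks': 140, 'reduce_tasks': 14, 'hdfs_bytes_read': 6200}
```

The same roll-up works per user, pool, or cluster by grouping on the corresponding row-key components.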
  11. Chapter 3: ETL - Step 3: JobFileProcessor
     Jobs finish out of order with respect to job_id
  12. Chapter 3: job_history_raw
     Row key: cluster!jobID
     Columns:
     • jobconf - stores the serialized raw job_*_conf.xml file
     • jobhistory - stores the serialized raw job history log file
     • job_processed_success - indicates whether the job has been processed
  13. Chapter 3: job_history
     Row key: cluster!user!application!timestamp!jobID
     • cluster - unique cluster name (e.g. “cluster1@dc1”)
     • user - user running the application (“edgar”)
     • application - application ID derived from job configuration:
       • uses the “batch.desc” property if set
       • otherwise parses a consistent ID from “mapred.job.name”
     • timestamp - inverted (Long.MAX_VALUE - value) value of submission time
     • jobID - stored as JobTracker start time (long), concatenated with the job sequence number
       • job_201306271100_0001 -> [1372352073732L][1L]
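The key layout above can be sketched in a few lines. This is a simplified string rendering only: real hRaven writes byte-encoded keys, and the "!"-joined form and "-" between the two longs of the jobID are illustrative:

```python
import re

LONG_MAX = 2**63 - 1  # Java Long.MAX_VALUE

def inverted_timestamp(submit_time_ms):
    # Inverting puts the most recent run first in an ascending scan.
    return LONG_MAX - submit_time_ms

def encode_job_id(job_id, jt_start_time_ms):
    # job_201306271100_0001 -> (1372352073732, 1): JT start time
    # (epoch millis, supplied separately) plus the sequence number.
    seq = int(re.match(r"job_\d+_(\d+)$", job_id).group(1))
    return (jt_start_time_ms, seq)

def job_history_row_key(cluster, user, app, submit_time_ms, job_id, jt_start_ms):
    epoch, seq = encode_job_id(job_id, jt_start_ms)
    return "!".join([cluster, user, app,
                     str(inverted_timestamp(submit_time_ms)),
                     f"{epoch}-{seq}"])

key = job_history_row_key("cluster1@dc1", "edgar", "wordcount",
                          1372352073732, "job_201306271100_0001",
                          1372352073732)
```

Leading with cluster!user!application means one application's runs sort adjacently, newest first thanks to the inverted timestamp.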
  14. Chapter 3: job_history_task
     Row key: cluster!user!application!timestamp!jobID!taskID
     • same components as the job_history key (same ordering)
     • taskID - (e.g. “m_00001”) uniquely identifies an individual task/attempt in the job
     Two row types:
     • Task - “meta” row: cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001
     • Task Attempt - individual execution on a Task Tracker: cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001_1
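The task key simply appends taskID to the job key, so a task's "meta" row and its attempt rows land next to each other. A sketch in the same simplified string form as above (separators and the sample job key are illustrative):

```python
job_key = "cluster1@dc1!edgar!wordcount!9654!1372352073732-1"

def task_row_key(job_key, task_id, attempt=None):
    """Build a job_history_task-style key; attempt=None -> Task meta row."""
    key = f"{job_key}!{task_id}"
    if attempt is not None:  # Task Attempt row, e.g. m_00001_1
        key += f"_{attempt}"
    return key

meta = task_row_key(job_key, "m_00001")        # Task "meta" row
attempt = task_row_key(job_key, "m_00001", 1)  # Task Attempt row
# The attempt key extends the meta key, so attempts sort
# immediately after their task in an HBase scan.
assert attempt.startswith(meta)
```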
  15. Chapter 3: Querying hRaven
     Using Pig’s HBaseStorage (or direct HBase APIs)
     Through Client API
     Through REST API
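Whichever access path is used, fetching all runs of an application reduces to a prefix scan, because the row key leads with cluster!user!application. A toy simulation with an in-memory sorted list standing in for the HBase table (row contents are made up):

```python
import bisect

# Pretend table: HBase keeps rows sorted by key, as this list is.
table = sorted([
    "cluster1@dc1!edgar!pagerank!9100!jobC",
    "cluster1@dc1!edgar!wordcount!9000!jobA",
    "cluster1@dc1!edgar!wordcount!9500!jobB",
])

def prefix_scan(rows, prefix):
    """Return the contiguous run of rows starting with prefix."""
    lo = bisect.bisect_left(rows, prefix)
    hi = bisect.bisect_left(rows, prefix + "\xff")  # just past the prefix range
    return rows[lo:hi]

runs = prefix_scan(table, "cluster1@dc1!edgar!wordcount!")
# Both wordcount flows, and only those, come back in one scan.
```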
  16. Chapter 4: Current Uses
     Pig reducer optimizations
     Cluster utilization / capacity planning
     Application performance trending over time
     Identifying common job anti-patterns
     Ad-hoc analysis / troubleshooting of cluster problems
  17. Chapter 4: Pool / Application reads/writes
     Pool view • Spike in file size read • Indicates jobs spilling
     Application view • Spike in HDFS size read • Indicates spiking input
  18. Appendix: Future Work
     Real-time data loading from JobTracker / Application Master
     Full flow-centric UI (JobTracker UI replacement)
     Hadoop 2.0 compatibility (in progress)
     Ambrose integration
  19. Afterword
     Now wilt thou drop thy job data on the floor?
     Quoth the hRaven, ‘Nevermore.’
  20. #TheEnd
     @gario and @joep
     Come visit us at booth #26 to continue the story
  21. Sort order with variable-length job_id
     Desired order:
     job_201306271100_9999
     job_201306271100_10000
     ...
     job_201306271100_99999
     job_201306271100_100000
     ...
     job_201306271100_999999
     job_201306271100_1000000
     Lexical order:
     job_201306271100_10000
     job_201306271100_100000
     job_201306271100_1000000
     job_201306271100_9999
     job_201306271100_99999
     job_201306271100_999999
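The mismatch on this slide is easy to reproduce: sorting the IDs as strings gives the wrong lexical order, while parsing the ID into numbers restores the desired order, which mirrors why hRaven stores the jobID as [JobTracker start time (long)][sequence number]:

```python
ids = ["job_201306271100_9999", "job_201306271100_10000",
       "job_201306271100_99999", "job_201306271100_100000"]

# String sort: '10000' < '100000' < '9999' < '99999' -- wrong.
lexical = sorted(ids)

def numeric_key(job_id):
    # job_201306271100_9999 -> (201306271100, 9999)
    _, jt_id, seq = job_id.split("_")
    return (int(jt_id), int(seq))

# Fixed-width numeric comparison gives the desired order.
correct = sorted(ids, key=numeric_key)
```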