HadoopStack: Big Data on Cloud

This poster was presented at the IIIT-H R&D Showcase.


February 09, 2013



  1. HadoopStack: Big Data on Cloud

    International Institute of Information Technology, Hyderabad
    Search and Information Extraction Lab, LTRC, IIIT-H
    Dharmesh Kakadia, Shashank Sahni

    HadoopStack allows enterprises to process Big Data across multiple clouds seamlessly. It reduces job completion time and improves resource utilization using machine-learning-based job scheduling.

    Use Cases
    •  Hassle-free Big Data processing on cloud
    •  Single platform for all Big Data processing needs
    •  Complicated workflow pipelines
    •  Leveraging both public and private infrastructure
    •  Managed deployment of multiple clusters on cloud

    Components
    •  Provisioning: spawning Hadoop clusters on demand
    •  Job Scheduling
        •  Deadline aware
        •  Cost aware
    •  Monitoring: leveraging usage metrics for auto scaling and scheduling
    •  Job Management: single-pane-of-glass management for all jobs spread across multiple clouds

    Roadmap
    •  Integrate other Hadoop ecosystem projects: Mahout, Hive, Pig, etc.
    •  Integrate other data processing frameworks: Spark, GraphLab, and R

    Motivation
    •  High cost of setting up Big Data clusters
    •  Inability to grow and shrink
    •  Multiple clouds

    Job Submission
    •  Multiple client interfaces
    •  Web UI, command-line tools, API

    Job Manager
    •  Support for multi-tenancy
    •  Support for job workflows

    Scheduling
    •  Cost and deadline aware scheduling
    •  Support for custom scheduler plugins

    Provisioning
    •  Spawning on-demand clusters supporting multiple IaaS
    •  Provider-specific features

    Monitoring
    •  Support for multiple monitoring platforms
    •  Framework instrumentation for fine-grained monitoring

    Scaling
    •  Automatic and user-triggered scaling
    •  Cost efficient via spot instances and scaling down

    Reporting
    •  Reporting job status, resource usage, cost, etc.

    [Figure: User Workflow]

    Challenges
    •  Deadline aware scheduling
    •  Exploiting IaaS-specific features: instance grouping in AWS, determining optimal instance configuration
    •  Predicting job completion time
    •  Characterizing jobs
    •  Tracking spot instances for the price/performance tradeoff

    siel-iiith/hadoopstack
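
The deadline-aware and cost-aware scheduling goals listed above can be illustrated with a minimal sketch. This is not HadoopStack's scheduler: `ClusterOption`, `pick_cluster`, the prices, and the predicted run times are all hypothetical, chosen only for illustration. The sketch assumes a predicted completion time is already available for each candidate cluster and picks the cheapest candidate that still meets the deadline.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterOption:
    """Hypothetical candidate cluster; not part of HadoopStack."""
    name: str
    nodes: int
    hourly_price_per_node: float  # assumed price, e.g. a current spot price
    predicted_hours: float        # assumed output of a job-time predictor

    @property
    def cost(self) -> float:
        # Total cost = nodes x price per node-hour x predicted run time.
        return self.nodes * self.hourly_price_per_node * self.predicted_hours


def pick_cluster(options: Sequence[ClusterOption],
                 deadline_hours: float) -> Optional[ClusterOption]:
    """Return the cheapest option predicted to finish by the deadline."""
    feasible = [o for o in options if o.predicted_hours <= deadline_hours]
    if not feasible:
        return None  # no candidate is predicted to meet the deadline
    # Cheapest first; break cost ties in favour of the faster cluster.
    return min(feasible, key=lambda o: (o.cost, o.predicted_hours))


# Illustrative figures only.
OPTIONS = [
    ClusterOption("small", nodes=4, hourly_price_per_node=0.10, predicted_hours=10.0),
    ClusterOption("medium", nodes=8, hourly_price_per_node=0.10, predicted_hours=6.0),
    ClusterOption("large", nodes=16, hourly_price_per_node=0.10, predicted_hours=4.0),
]

if __name__ == "__main__":
    choice = pick_cluster(OPTIONS, deadline_hours=8.0)
    print(choice.name if choice else "no feasible cluster")
```

With these figures, an 8-hour deadline rules out the small cluster, and the medium cluster is chosen because it costs less than the large one. A real scheduler would also have to handle prediction error and changing spot prices, which the poster lists among its open challenges.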