HadoopStack: Big Data Processing on Cloud

This poster was presented at the Workshop on Understanding Big Data Analytics, ACM India Special Interest Group on Knowledge Discovery and Data Mining (iKDD).


February 15, 2013

  1. HadoopStack: Big Data Processing on Cloud

    Dharmesh Kakadia, Shashank Sahni, Vasudeva Varma
    SIEL@IIIT-H

    HadoopStack allows enterprises to process Big Data across multiple clouds seamlessly. It reduces job completion time and improves resource utilization using machine-learning-based job scheduling.

    Use Cases
    •  Hassle-free Big Data processing on cloud
    •  Single platform for all Big Data processing needs
    •  Complicated workflow pipelines
    •  Leveraging both public and private infrastructure
    •  Managed deployment of multiple clusters on cloud

    Components
    •  Provisioning – Spawning Hadoop clusters on demand
    •  Job Scheduling
        •  Deadline aware
        •  Cost aware
    •  Monitoring – Leveraging usage metrics for auto scaling and scheduling
    •  Job Management – Single-pane-of-glass management for all jobs spread across multiple clouds

    RoadMap
    •  Integrate other Hadoop ecosystem projects – Mahout, Hive, Pig, etc.
    •  Integrate other data processing frameworks – Spark, GraphLab and R

    Motivation
    •  High cost of setting up Big Data clusters
    •  Inability to grow and shrink
    •  Multiple clouds

    Job Submission
    •  Multiple client interfaces
    •  Web UI, command-line tools, API

    Job Manager
    •  Support for multi-tenancy
    •  Support for job workflows

    Scheduling
    •  Cost- and deadline-aware scheduling
    •  Support for custom scheduler plugins

    Provisioning
    •  Spawning on-demand clusters supporting multiple IaaS
    •  Provider-specific features

    Monitoring
    •  Support for multiple monitoring platforms
    •  Framework instrumentation for fine-grained monitoring

    Scaling
    •  Automatic and user-triggered scaling
    •  Cost efficient via spot instances and scaling down

    Reporting
    •  Reporting job status, resource usage, cost, etc.
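The Monitoring and Scaling components above imply a feedback loop in which usage metrics drive cluster size. The poster does not describe the actual policy, so the following is only a minimal sketch assuming a simple threshold rule on average utilization. The names `ScalingPolicy` and `decide_nodes`, and all threshold values, are hypothetical and not taken from HadoopStack.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical thresholds; the poster does not specify any values."""
    scale_up_above: float = 0.80    # average utilization that triggers growth
    scale_down_below: float = 0.30  # average utilization that triggers shrinking
    min_nodes: int = 1
    max_nodes: int = 20
    step: int = 1                   # nodes added or removed per decision


def decide_nodes(current_nodes: int,
                 utilization: Sequence[float],
                 policy: ScalingPolicy) -> int:
    """Return a target node count from recent utilization samples in [0, 1]."""
    if not utilization:
        return current_nodes  # no monitoring data: leave the cluster unchanged
    avg = sum(utilization) / len(utilization)
    if avg > policy.scale_up_above:
        target = current_nodes + policy.step
    elif avg < policy.scale_down_below:
        target = current_nodes - policy.step
    else:
        target = current_nodes
    # Clamp the result to the allowed cluster size.
    return max(policy.min_nodes, min(policy.max_nodes, target))


if __name__ == "__main__":
    policy = ScalingPolicy()
    print(decide_nodes(4, [0.90, 0.95, 0.85], policy))  # busy cluster
    print(decide_nodes(4, [0.10, 0.20], policy))        # idle cluster
```

A user-triggered scaling request could bypass `decide_nodes` and set the target directly, with the same clamping applied.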
    [Figure: User Workflow]

    Challenges
    •  Deadline-aware scheduling
    •  Exploiting IaaS-specific features – instance grouping in AWS, determining optimal instance configuration
    •  Predicting job completion time
    •  Characterizing jobs
    •  Tracking spot instances for price/performance tradeoff

    siel-iiith/hadoopstack
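Deadline-aware and cost-aware scheduling appear among both the components and the challenges, and both depend on a predicted job completion time. The poster gives no algorithm, so the sketch below only illustrates one plausible reading: among candidate cluster configurations, choose the cheapest one whose predicted runtime meets the deadline. `ClusterOption`, `pick_option`, and every number here are illustrative assumptions. The runtime estimates are supplied by hand, whereas HadoopStack is described as using machine learning for scheduling.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterOption:
    """A candidate cluster configuration (all fields are illustrative)."""
    name: str
    nodes: int
    price_per_node_hour: float  # e.g. an on-demand or current spot price
    est_runtime_hours: float    # predicted job completion time on this option

    @property
    def est_cost(self) -> float:
        return self.nodes * self.price_per_node_hour * self.est_runtime_hours


def pick_option(options: Sequence[ClusterOption],
                deadline_hours: float) -> Optional[ClusterOption]:
    """Cheapest option whose predicted runtime meets the deadline, else None."""
    feasible = [o for o in options if o.est_runtime_hours <= deadline_hours]
    if not feasible:
        return None
    # Break cost ties in favor of the faster option.
    return min(feasible, key=lambda o: (o.est_cost, o.est_runtime_hours))


if __name__ == "__main__":
    options = [
        ClusterOption("private-4", 4, 0.10, 6.0),
        ClusterOption("public-8", 8, 0.12, 3.0),
        ClusterOption("public-16", 16, 0.12, 2.0),
    ]
    for deadline in (8.0, 4.0, 2.5, 1.0):
        choice = pick_option(options, deadline)
        print(deadline, choice.name if choice else "no feasible option")
```

Tighter deadlines rule out the slower, cheaper options, which is the price/performance tradeoff the challenges list refers to. Spot prices would enter through `price_per_node_hour`, which would need refreshing as the market price changes.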