HadoopStack: Big Data Processing on Cloud

This poster was presented at Workshop on Understanding Big Data Analytics, ACM India Special Interest Group on Knowledge Discovery and Data Mining (iKDD)

dharmeshkakadia

February 15, 2013

Transcript

  1. HadoopStack: Big Data Processing on Cloud
    Dharmesh Kakadia, Shashank Sahni, Vasudeva Varma
    SIEL@IIIT-H

    HadoopStack allows enterprises to process Big Data across multiple clouds seamlessly. It reduces job completion time and improves resource utilization using machine learning based job scheduling.

    Use Cases
    •  Hassle-free Big Data processing on cloud
    •  Single platform for all Big Data processing needs
    •  Complicated workflow pipelines
    •  Leveraging both public and private infrastructure
    •  Managed deployment of multiple clusters on cloud

    Components
    •  Provisioning – Spawning Hadoop clusters on demand
    •  Job Scheduling
        •  Deadline aware
        •  Cost aware
    •  Monitoring – Leveraging usage metrics for auto scaling and scheduling
    •  Job Management – Single-pane-of-glass management for all jobs spread across multiple clouds

    RoadMap
    •  Integrate other Hadoop ecosystem projects – Mahout, Hive, Pig etc.
    •  Integration of other data processing frameworks – Spark, GraphLab and R

    Motivation
    •  High cost of setting up Big Data clusters
    •  Inability to grow and shrink
    •  Multiple clouds

    Job Submission
    •  Multiple client interfaces
    •  Web UI, command-line tools, API

    Job Manager
    •  Support for multi-tenancy
    •  Support for job workflows

    Scheduling
    •  Cost and deadline aware scheduling
    •  Support for custom scheduler plugins

    Provisioning
    •  Spawning on-demand clusters supporting multiple IaaS
    •  Provider specific features

    Monitoring
    •  Support for multiple monitoring platforms
    •  Framework instrumentation for fine-grained monitoring

    Scaling
    •  Automatic and user-triggered scaling
    •  Cost efficient via spot instances and scaling down

    Reporting
    •  Reporting job status, resource usage, cost etc.
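
The "cost and deadline aware" scheduling item can be illustrated with a small sketch. This is not HadoopStack's scheduler (the poster describes that as machine learning based and gives no algorithm); it is a simple rule under stated assumptions: each candidate cluster has a flat hourly price per node and an estimated runtime supplied by some predictor. The names `ClusterOption` and `pick_cluster`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterOption:
    """Hypothetical candidate cluster for one job (not a HadoopStack type)."""
    name: str
    nodes: int
    price_per_node_hour: float  # assumption: flat hourly price per node
    est_runtime_hours: float    # assumption: produced by a runtime predictor

    @property
    def est_cost(self) -> float:
        # Estimated total cost of running the job on this cluster.
        return self.nodes * self.price_per_node_hour * self.est_runtime_hours


def pick_cluster(options: Sequence[ClusterOption],
                 deadline_hours: float) -> Optional[ClusterOption]:
    """Return the cheapest option whose estimated runtime meets the deadline.

    Returns None when no option is expected to finish in time.
    """
    feasible = [o for o in options if o.est_runtime_hours <= deadline_hours]
    if not feasible:
        return None
    # Cheapest first; ties are broken in favour of the faster option.
    return min(feasible, key=lambda o: (o.est_cost, o.est_runtime_hours))


if __name__ == "__main__":
    # Illustrative numbers only; none of them come from the poster.
    options = [
        ClusterOption("private-small", nodes=4, price_per_node_hour=0.10,
                      est_runtime_hours=10.0),
        ClusterOption("public-medium", nodes=8, price_per_node_hour=0.20,
                      est_runtime_hours=5.0),
        ClusterOption("public-large", nodes=16, price_per_node_hour=0.20,
                      est_runtime_hours=3.0),
    ]
    choice = pick_cluster(options, deadline_hours=6.0)
    print(choice.name if choice else "no cluster meets the deadline")
```

With a loose deadline the rule falls back to the cheapest cluster, and with a tight one it pays for a larger cluster only as far as needed. The quality of the decision depends entirely on the runtime estimate.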
    User Workflow

    Challenges
    •  Deadline aware scheduling
    •  Exploiting IaaS specific features – instance grouping in AWS, determining optimal instance configuration
    •  Predicting job completion time
    •  Characterizing jobs
    •  Tracking spot instances for price/performance tradeoff

    siel-iiith/hadoopstack
    dharmesh.kakadia@research.iiit.ac.in
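
The spot instance price/performance tradeoff listed under Challenges can be made concrete with a rough sketch. The poster does not say how HadoopStack handles it, so everything here is an assumption: each offer carries a current spot price and a measured throughput for the job in question, and the metric is simply price divided by throughput. `SpotOffer`, `cost_per_work_unit` and `best_offer` are hypothetical names, and the instance types and prices are placeholders, not real provider data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SpotOffer:
    """Hypothetical snapshot of one spot offer (not a real provider API type)."""
    instance_type: str
    spot_price_per_hour: float
    work_units_per_hour: float  # assumption: benchmarked throughput for this job


def cost_per_work_unit(offer: SpotOffer) -> float:
    """Price paid per unit of work completed; lower is better."""
    if offer.work_units_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return offer.spot_price_per_hour / offer.work_units_per_hour


def best_offer(offers: Sequence[SpotOffer]) -> Optional[SpotOffer]:
    """Return the offer with the lowest cost per work unit, or None if empty."""
    if not offers:
        return None
    return min(offers, key=cost_per_work_unit)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    offers = [
        SpotOffer("type-a", spot_price_per_hour=0.10, work_units_per_hour=1.0),
        SpotOffer("type-b", spot_price_per_hour=0.30, work_units_per_hour=4.0),
        SpotOffer("type-c", spot_price_per_hour=0.50, work_units_per_hour=5.0),
    ]
    print(best_offer(offers).instance_type)
```

This ignores interruption risk, which matters in practice because a reclaimed spot instance can force work to be redone. A fuller treatment would track prices over time rather than use a single snapshot.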