HadoopStack: Big Data Processing on Cloud
Dharmesh Kakadia Shashank Sahni Vasudeva Varma
SIEL@IIIT-H
HadoopStack allows enterprises to process Big Data
across multiple clouds seamlessly. It reduces job
completion time and improves resource utilization
using machine-learning-based job scheduling.
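As a toy illustration of what machine-learning-based scheduling can involve, the sketch below fits a regression model on features of past jobs to predict completion time, which a scheduler can then use. The features and training data are invented for illustration; this is not HadoopStack's actual model.

```python
# Toy sketch: predict job completion time from job features with a
# linear model. Features and training data are invented assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per past job: [input_size_gb, num_map_tasks, num_nodes]
X = np.array([
    [10,  40,  4],
    [50, 200,  8],
    [80, 320, 16],
    [20,  80,  8],
])
y = np.array([12.0, 35.0, 30.0, 10.0])  # observed runtimes (minutes)

model = LinearRegression().fit(X, y)

# The predicted runtime for a new job can feed a deadline/cost-aware scheduler.
new_job = np.array([[40, 160, 8]])
print(round(float(model.predict(new_job)[0]), 1))
```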
Use Cases
• Hassle-free Big Data processing on the cloud
• Single platform for all Big Data processing needs
• Running complicated workflow pipelines
• Leveraging both public and private infrastructure
• Managed deployment of multiple clusters on the cloud
Components
• Provisioning – Spawning Hadoop clusters on demand
• Job Scheduling (see the sketch after this list)
  • Deadline aware
  • Cost aware
• Monitoring – Leveraging usage metrics for auto-scaling and scheduling
• Job Management – Single pane of glass management for all jobs spread across multiple clouds
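To make the deadline- and cost-aware scheduling concrete, here is a minimal sketch: pick the cheapest candidate cluster configuration whose predicted runtime still meets the job's deadline. The configurations, prices, and runtimes are illustrative assumptions, not HadoopStack's actual scheduler; in practice the predicted runtime would come from a learned model as above.

```python
# Minimal sketch of cost- and deadline-aware cluster selection.
# All configurations, prices, and runtimes are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterConfig:
    name: str
    nodes: int
    hourly_cost: float        # $ per hour for the whole cluster
    est_runtime_hours: float  # predicted job completion time on this cluster

def pick_config(candidates: list[ClusterConfig],
                deadline_hours: float) -> Optional[ClusterConfig]:
    """Return the cheapest configuration whose predicted runtime
    meets the deadline, or None if no candidate qualifies."""
    feasible = [c for c in candidates if c.est_runtime_hours <= deadline_hours]
    if not feasible:
        return None
    # Total job cost = hourly price * predicted runtime.
    return min(feasible, key=lambda c: c.hourly_cost * c.est_runtime_hours)

candidates = [
    ClusterConfig("small",   4, 2.0, 6.0),
    ClusterConfig("medium",  8, 4.0, 3.5),
    ClusterConfig("large",  16, 8.0, 2.0),
]
best = pick_config(candidates, deadline_hours=4.0)
print(best.name if best else "no feasible configuration")  # -> "medium"
```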
Roadmap
• Integration of other Hadoop ecosystem projects – Mahout, Hive, Pig, etc.
• Integration of other data processing frameworks – Spark, GraphLab, and R
Motivation
• High cost of setting up Big Data clusters
• Inability of fixed clusters to grow and shrink with demand
• Need to span multiple clouds
User Workflow
Job Submission → Job Manager → Scheduling → Provisioning → Monitoring → Scaling → Reporting
• Job Submission – Multiple client interfaces: Web UI, command-line tools, API
• Job Manager – Support for multi-tenancy and for job workflows
• Scheduling – Cost- and deadline-aware scheduling; support for custom scheduler plugins
• Provisioning – Spawning on-demand clusters supporting multiple IaaS platforms; provider-specific features
• Monitoring – Support for multiple monitoring platforms; framework instrumentation for fine-grained monitoring
• Scaling – Automatic and user-triggered scaling; cost efficient via spot instances and scaling down (see the sketch below)
• Reporting – Reporting job status, resource usage, cost, etc.
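As a rough sketch of the Scaling stage, the policy below adds spot nodes when tasks back up and releases a node when the cluster is mostly idle. The metric names and thresholds are assumptions made for illustration; HadoopStack's real policy may differ.

```python
# Illustrative auto-scaling policy sketch; metric names and thresholds
# are assumptions, not HadoopStack's actual implementation.
from dataclasses import dataclass

@dataclass
class ClusterMetrics:
    pending_tasks: int   # tasks waiting for a free slot
    running_nodes: int
    avg_cpu_util: float  # 0.0 - 1.0, averaged across nodes

def scaling_decision(m: ClusterMetrics,
                     tasks_per_node: int = 8,
                     scale_down_util: float = 0.2) -> int:
    """Return a node delta: positive to add (spot) nodes,
    negative to remove nodes, 0 to do nothing."""
    if m.pending_tasks > 0:
        # Backlog: add enough spot nodes to absorb the pending tasks.
        return -(-m.pending_tasks // tasks_per_node)  # ceiling division
    if m.avg_cpu_util < scale_down_util and m.running_nodes > 1:
        # Mostly idle: release one node to cut cost.
        return -1
    return 0

print(scaling_decision(ClusterMetrics(20, 4, 0.9)))  # -> 3 (add spot nodes)
print(scaling_decision(ClusterMetrics(0, 4, 0.1)))   # -> -1 (scale down)
```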
Challenges
• Deadline-aware scheduling
• Exploiting IaaS-specific features – instance grouping in AWS, determining optimal instance configuration
• Predicting job completion time
• Characterizing jobs
• Tracking spot instance prices for the price/performance tradeoff (see the sketch below)
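The last challenge can be illustrated with a small sketch: normalize each instance type's current spot price by an estimated throughput and pick the cheapest price per unit of work. The instance names, prices, and throughput figures below are invented for illustration.

```python
# Illustrative price/performance comparison for spot instances.
# Prices and throughput figures are made-up assumptions.

# (instance_type, current_spot_price_per_hour, relative_throughput)
spot_offers = [
    ("m1.large",  0.08, 1.0),
    ("m1.xlarge", 0.14, 2.0),
    ("c1.xlarge", 0.20, 3.2),
]

def cost_per_unit_work(price: float, throughput: float) -> float:
    """Dollars spent per unit of work completed in an hour."""
    return price / throughput

best = min(spot_offers, key=lambda o: cost_per_unit_work(o[1], o[2]))
print(best[0])  # -> "c1.xlarge" under these sample numbers
```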
siel-iiith/hadoopstack
[email protected]