HadoopStack
Search and Information Extraction Lab (SIEL), LTRC, IIIT Hyderabad
Dharmesh Kakadia, Shashank Sahni
Code: siel-iiith/hadoopstack

HadoopStack allows enterprises to process Big Data across multiple clouds seamlessly. It reduces job completion time and improves resource utilization using machine-learning-based job scheduling.

Motivation
• High cost of setting up Big Data clusters
• Inability of fixed clusters to grow and shrink with demand
• Data and infrastructure spread across multiple clouds

Use Cases
• Hassle-free Big Data processing on the cloud
• A single platform for all Big Data processing needs
• Complicated workflow pipelines
• Leveraging both public and private infrastructure
• Managed deployment of multiple clusters on the cloud

Components
• Provisioning – spawning Hadoop clusters on demand
• Job Scheduling – deadline aware and cost aware
• Monitoring – leveraging usage metrics for auto-scaling and scheduling
• Job Management – single-pane-of-glass management for all jobs spread across multiple clouds

Roadmap
• Integrate other Hadoop ecosystem projects – Mahout, Hive, Pig, etc.
• Integrate other data processing frameworks – Spark, GraphLab, and R

Job Submission
• Multiple client interfaces: web UI, command-line tools, and an API (see the submission sketch below)

Job Manager
• Support for multi-tenancy
• Support for job workflows

Scheduling
• Cost- and deadline-aware scheduling (see the cluster-selection sketch below)
• Support for custom scheduler plugins

Provisioning
• Spawning on-demand clusters on multiple IaaS providers
• Support for provider-specific features

Monitoring
• Support for multiple monitoring platforms
• Framework instrumentation for fine-grained monitoring

Scaling
• Automatic and user-triggered scaling
• Cost efficient via spot instances and scaling down (see the scaling sketch below)

Reporting
• Reporting job status, resource usage, cost, etc.

User Workflow
[workflow diagram omitted]

Challenges
• Deadline-aware scheduling
• Exploiting IaaS-specific features – instance grouping in AWS, determining optimal instance configurations
• Predicting job completion time (see the prediction sketch below)
• Characterizing jobs
• Tracking spot instance prices for the price/performance tradeoff
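
Illustrative Sketches

Job submission: a minimal sketch of submitting a job through a REST-style API with deadline and cost hints. The endpoint path, port, and payload fields below are illustrative assumptions, not HadoopStack's actual API.

    # Hypothetical job-submission client; endpoint and fields are assumed.
    import json
    import urllib.request

    def submit_job(api_url, jar_path, deadline_minutes, max_cost_usd):
        """Submit a MapReduce job with deadline and cost constraints."""
        payload = {
            "jar": jar_path,                       # job binary to run
            "deadline_minutes": deadline_minutes,  # completion deadline hint
            "max_cost_usd": max_cost_usd,          # budget cap for the scheduler
        }
        req = urllib.request.Request(
            api_url + "/jobs",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)  # e.g. {"job_id": "..."}

    # Example usage (hypothetical endpoint):
    # info = submit_job("http://localhost:8080/api/v1", "wordcount.jar", 60, 5.0)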
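
Cluster selection: a minimal sketch of one way deadline- and cost-aware scheduling can work – among candidate cluster configurations, pick the cheapest one whose estimated completion time still meets the deadline. The Cluster fields and the throughput-based estimator are illustrative assumptions, not HadoopStack's scheduler.

    from dataclasses import dataclass

    @dataclass
    class Cluster:
        name: str
        nodes: int
        cost_per_hour: float           # total $/hour for the cluster
        throughput_gb_per_hour: float  # measured or predicted processing rate

    def estimated_hours(cluster, input_gb):
        return input_gb / cluster.throughput_gb_per_hour

    def pick_cluster(clusters, input_gb, deadline_hours):
        """Cheapest candidate that still meets the deadline, else None."""
        feasible = [c for c in clusters
                    if estimated_hours(c, input_gb) <= deadline_hours]
        if not feasible:
            return None  # the scheduler would scale up or report a miss
        return min(feasible,
                   key=lambda c: estimated_hours(c, input_gb) * c.cost_per_hour)

    candidates = [
        Cluster("small-on-demand", 4, 2.0, 50.0),
        Cluster("large-spot", 16, 4.0, 220.0),
    ]
    print(pick_cluster(candidates, input_gb=500, deadline_hours=4))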
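
Completion-time prediction: a minimal sketch of estimating job completion time from historical runs with ordinary least squares, in the spirit of the machine-learning-based scheduling above. The features and data are made up; a real model would use richer job and cluster characteristics.

    import numpy as np

    # Historical runs: [input_gb, num_nodes] -> completion_minutes (illustrative)
    X = np.array([[100, 4], [200, 4], [200, 8], [400, 8], [400, 16]], dtype=float)
    y = np.array([55.0, 108.0, 60.0, 115.0, 62.0])

    # Fit completion_minutes ≈ a*input_gb + b*num_nodes + c
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict_minutes(input_gb, num_nodes):
        return float(coef @ np.array([input_gb, num_nodes, 1.0]))

    print(predict_minutes(300, 8))  # estimate for an unseen configuration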
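
Scaling: a minimal sketch of a metric-driven scaling rule that prefers spot instances while the current spot price stays under a bid ceiling, and scales down when the cluster is idle. The thresholds, prices, and metric names are illustrative assumptions.

    def scaling_decision(cpu_utilization, pending_tasks, spot_price, bid_ceiling):
        if cpu_utilization > 0.85 and pending_tasks > 0:
            if spot_price <= bid_ceiling:
                return "add_spot_node"    # cheap capacity is available
            return "add_on_demand_node"   # spot too expensive; fall back
        if cpu_utilization < 0.30 and pending_tasks == 0:
            return "remove_node"          # scale down to cut cost
        return "hold"

    print(scaling_decision(0.92, pending_tasks=12, spot_price=0.07, bid_ceiling=0.10))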