HadoopStack: Big Data on Cloud
Dharmesh Kakadia, Shashank Sahni
Search and Information Extraction Lab, LTRC
International Institute of Information Technology, Hyderabad

HadoopStack allows enterprises to process Big Data across multiple clouds seamlessly. It reduces job completion time and improves resource utilization using machine-learning-based job scheduling.

Use Cases
• Hassle-free Big Data processing on the cloud
• A single platform for all Big Data processing needs
• Complicated workflow pipelines
• Leveraging both public and private infrastructure
• Managed deployment of multiple clusters on the cloud

Components
• Provisioning – spawning Hadoop clusters on demand
• Job Scheduling – deadline-aware and cost-aware
• Monitoring – leveraging usage metrics for auto-scaling and scheduling
• Job Management – single-pane-of-glass management for all jobs spread across multiple clouds

Roadmap
• Integrate other Hadoop ecosystem projects – Mahout, Hive, Pig, etc.
• Integrate other data processing frameworks – Spark, GraphLab, and R

Motivation
• High cost of setting up Big Data clusters
• Inability of fixed clusters to grow and shrink with demand
• The need to work across multiple clouds
Job Submission
• Multiple client interfaces: Web UI, command-line tools, API (a submission sketch follows)
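To make the API interface concrete, here is a minimal Python sketch that submits a job to a hypothetical REST endpoint. The URL, route, and payload fields (including the deadline and cost hints) are invented for illustration; they are not HadoopStack's actual API.

```python
import requests

# Hypothetical HadoopStack endpoint and job schema -- illustrative only.
API = "http://hadoopstack.example.com/api/v1"

job = {
    "name": "wordcount",
    "jar": "s3://my-bucket/wordcount.jar",
    "input": "s3://my-bucket/input/",
    "output": "s3://my-bucket/output/",
    "deadline_minutes": 60,   # hint for deadline-aware scheduling
    "max_cost_usd": 5.0,      # hint for cost-aware scheduling
}

resp = requests.post(API + "/jobs", json=job, timeout=30)
resp.raise_for_status()
print("submitted job:", resp.json()["id"])
```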
Job Manager
• Support for multi-tenancy
• Support for job workflows (illustrated below)
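A job workflow is naturally a dependency graph. The sketch below expresses one as a DAG and derives an execution order with the Python standard library; the job names and the dictionary format are hypothetical, not HadoopStack's workflow syntax.

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical workflow: each job maps to the jobs it depends on.
workflow = {
    "clean":     [],
    "join":      ["clean"],
    "aggregate": ["join"],
    "report":    ["aggregate"],
}

# static_order() yields a valid dependency-respecting execution order.
for job in TopologicalSorter(workflow).static_order():
    print("run", job)
```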
Scheduling
• Cost- and deadline-aware scheduling
• Support for custom scheduler plugins (a plugin sketch follows)
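One plausible shape for a scheduler plugin, assuming a plugin is simply an object that picks a cluster for each job. The interface and the predicted_runtime/hourly_cost members are assumptions of this sketch, not HadoopStack's real hook points.

```python
from abc import ABC, abstractmethod

class SchedulerPlugin(ABC):
    """Hypothetical plugin interface; the real hook points may differ."""

    @abstractmethod
    def place(self, job, clusters):
        """Return the cluster that should run `job`, or None to queue it."""

class DeadlineAwareScheduler(SchedulerPlugin):
    def place(self, job, clusters):
        # Among clusters predicted to meet the deadline, pick the cheapest
        # (deadline awareness and cost awareness in one rule).
        feasible = [c for c in clusters
                    if c.predicted_runtime(job) <= job.deadline]
        return min(feasible, key=lambda c: c.hourly_cost, default=None)
```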
Provisioning
• Spawning on-demand clusters supporting multiple IaaS providers (see the sketch below)
• Provider-specific features
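The multi-IaaS pattern can be sketched with Apache Libcloud, which puts many providers behind one driver API. Whether HadoopStack uses Libcloud internally is an assumption of this sketch, and the credentials and IDs are placeholders.

```python
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

def spawn_worker(provider, key, secret, size_id, image_id, name):
    """Create one cluster node on any Libcloud-supported IaaS."""
    driver = get_driver(provider)(key, secret)
    size = next(s for s in driver.list_sizes() if s.id == size_id)
    image = next(i for i in driver.list_images() if i.id == image_id)
    return driver.create_node(name=name, size=size, image=image)

# e.g. spawn_worker(Provider.EC2, "KEY", "SECRET", "m1.large", "ami-1234", "hs-worker-1")
```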
Monitoring
• Support for multiple monitoring platforms (example below)
• Framework instrumentation for fine-grained monitoring
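As one concrete monitoring backend, Ganglia (a common choice for Hadoop clusters) publishes metrics as XML on TCP port 8649 via its gmond daemon. The poster does not say which platforms HadoopStack supports, so treat this purely as an example of pulling a usage metric.

```python
import socket
import xml.etree.ElementTree as ET

def cluster_load(host, port=8649):
    """Average one-minute load across all hosts reported by a gmond."""
    with socket.create_connection((host, port), timeout=10) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    loads = [float(m.get("VAL"))
             for m in root.iter("METRIC") if m.get("NAME") == "load_one"]
    return sum(loads) / len(loads) if loads else 0.0
```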
Scaling
• Automatic and user-triggered scaling (a toy policy follows)
• Cost-efficient via spot instances and scaling down
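A toy policy showing how monitored utilization can drive scaling: scale out under sustained load, preferring cheap spot capacity, and scale in when workers sit idle. The thresholds and the decision format are assumptions, not HadoopStack's actual algorithm.

```python
SCALE_OUT_LOAD = 0.85   # above this utilization, add workers
SCALE_IN_LOAD = 0.20    # below this, remove idle workers to cut cost

def scaling_decision(avg_utilization, pending_jobs, idle_workers):
    if avg_utilization > SCALE_OUT_LOAD or pending_jobs > 0:
        # Request cheap spot capacity first; a real system would fall back
        # to on-demand instances if the spot request is not fulfilled.
        return ("scale_out", {"market": "spot", "count": 2})
    if avg_utilization < SCALE_IN_LOAD and idle_workers > 0:
        return ("scale_in", {"count": idle_workers})
    return ("hold", {})
```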
Reporting
• Reporting job status, resource usage, cost, etc.
User Workflow
(workflow diagram)

Challenges
• Deadline-aware scheduling
• Exploiting IaaS-specific features – instance grouping in AWS, determining optimal instance configurations
• Predicting job completion time
• Characterizing jobs
• Tracking spot instance prices for the price/performance tradeoff (see the sketch below)

Code: siel-iiith/hadoopstack
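The spot-tracking piece, at least, is easy to prototype with today's AWS SDK. The sketch below uses boto3's real describe_spot_price_history call; the region and instance type are placeholders, and relating price history to performance is left to the job-characterization work above.

```python
import boto3

# Pull recent spot prices for one instance type (region/type are placeholders).
ec2 = boto3.client("ec2", region_name="us-east-1")
history = ec2.describe_spot_price_history(
    InstanceTypes=["m5.large"],
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=20,
)
for entry in history["SpotPriceHistory"]:
    print(entry["Timestamp"], entry["AvailabilityZone"], entry["SpotPrice"])
```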