HadoopStack: Big Data Processing on Cloud
Dharmesh Kakadia Shashank Sahni Vasudeva Varma
SIEL@IIIT-H
HadoopStack allows enterprises to process Big Data
across multiple clouds seamlessly. It reduces job
completion time and improves resource utilization
using machine-learning-based job scheduling.
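As a toy illustration of what machine-learning-based scheduling can involve, the sketch below fits a regression model on features of past jobs to predict completion time, which a scheduler can then use. The features and training data are invented for illustration; this is not HadoopStack's actual model.

```python
# Toy sketch: predict job completion time from job features with a
# linear model. Features and training data are invented assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per past job: [input_size_gb, num_map_tasks, num_nodes]
X = np.array([
    [10,  40,  4],
    [50, 200,  8],
    [80, 320, 16],
    [20,  80,  8],
])
y = np.array([12.0, 35.0, 30.0, 10.0])  # observed runtimes (minutes)

model = LinearRegression().fit(X, y)

# The predicted runtime for a new job can feed a deadline/cost-aware scheduler.
new_job = np.array([[40, 160, 8]])
print(round(float(model.predict(new_job)[0]), 1))
```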
Use Cases
• Hassle-free Big Data processing on the cloud
• Single platform for all Big Data processing needs
• Running complicated workflow pipelines
• Leveraging both public and private infrastructure
• Managed deployment of multiple clusters on the cloud
Components
• Provisioning – Spawning Hadoop clusters on demand
• Job Scheduling (see the sketch after this list)
  • Deadline aware
  • Cost aware
• Monitoring – Leveraging usage metrics for auto-scaling and scheduling
• Job Management – Single pane of glass management for all jobs spread across multiple clouds
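To make the deadline- and cost-aware scheduling concrete, here is a minimal sketch: pick the cheapest candidate cluster configuration whose predicted runtime still meets the job's deadline. The configurations, prices, and runtimes are illustrative assumptions, not HadoopStack's actual scheduler; in practice the predicted runtime would come from a learned model as above.

```python
# Minimal sketch of cost- and deadline-aware cluster selection.
# All configurations, prices, and runtimes are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterConfig:
    name: str
    nodes: int
    hourly_cost: float        # $ per hour for the whole cluster
    est_runtime_hours: float  # predicted job completion time on this cluster

def pick_config(candidates: list[ClusterConfig],
                deadline_hours: float) -> Optional[ClusterConfig]:
    """Return the cheapest configuration whose predicted runtime
    meets the deadline, or None if no candidate qualifies."""
    feasible = [c for c in candidates if c.est_runtime_hours <= deadline_hours]
    if not feasible:
        return None
    # Total job cost = hourly price * predicted runtime.
    return min(feasible, key=lambda c: c.hourly_cost * c.est_runtime_hours)

candidates = [
    ClusterConfig("small",   4, 2.0, 6.0),
    ClusterConfig("medium",  8, 4.0, 3.5),
    ClusterConfig("large",  16, 8.0, 2.0),
]
best = pick_config(candidates, deadline_hours=4.0)
print(best.name if best else "no feasible configuration")  # -> "medium"
```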
Roadmap
• Integration of other Hadoop ecosystem projects – Mahout, Hive, Pig, etc.
• Integration of other data processing frameworks – Spark, GraphLab, and R
Motivation
• High cost of setting up Big Data clusters
• Inability of fixed clusters to grow and shrink with demand
• Need to span multiple clouds
User Workflow
Job Submission → Job Manager → Scheduling → Provisioning → Monitoring → Scaling → Reporting
• Job Submission – Multiple client interfaces: Web UI, command-line tools, API
• Job Manager – Support for multi-tenancy and for job workflows
• Scheduling – Cost- and deadline-aware scheduling; support for custom scheduler plugins
• Provisioning – Spawning on-demand clusters supporting multiple IaaS platforms; provider-specific features
• Monitoring – Support for multiple monitoring platforms; framework instrumentation for fine-grained monitoring
• Scaling – Automatic and user-triggered scaling; cost efficient via spot instances and scaling down (see the sketch below)
• Reporting – Reporting job status, resource usage, cost, etc.
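As a rough sketch of the Scaling stage, the policy below adds spot nodes when tasks back up and releases a node when the cluster is mostly idle. The metric names and thresholds are assumptions made for illustration; HadoopStack's real policy may differ.

```python
# Illustrative auto-scaling policy sketch; metric names and thresholds
# are assumptions, not HadoopStack's actual implementation.
from dataclasses import dataclass

@dataclass
class ClusterMetrics:
    pending_tasks: int   # tasks waiting for a free slot
    running_nodes: int
    avg_cpu_util: float  # 0.0 - 1.0, averaged across nodes

def scaling_decision(m: ClusterMetrics,
                     tasks_per_node: int = 8,
                     scale_down_util: float = 0.2) -> int:
    """Return a node delta: positive to add (spot) nodes,
    negative to remove nodes, 0 to do nothing."""
    if m.pending_tasks > 0:
        # Backlog: add enough spot nodes to absorb the pending tasks.
        return -(-m.pending_tasks // tasks_per_node)  # ceiling division
    if m.avg_cpu_util < scale_down_util and m.running_nodes > 1:
        # Mostly idle: release one node to cut cost.
        return -1
    return 0

print(scaling_decision(ClusterMetrics(20, 4, 0.9)))  # -> 3 (add spot nodes)
print(scaling_decision(ClusterMetrics(0, 4, 0.1)))   # -> -1 (scale down)
```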
Challenges
• Deadline-aware scheduling
• Exploiting IaaS-specific features – instance grouping in AWS, determining optimal instance configuration
• Predicting job completion time
• Characterizing jobs
• Tracking spot instance prices for the price/performance tradeoff (see the sketch below)
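The last challenge can be illustrated with a small sketch: normalize each instance type's current spot price by an estimated throughput and pick the cheapest price per unit of work. The instance names, prices, and throughput figures below are invented for illustration.

```python
# Illustrative price/performance comparison for spot instances.
# Prices and throughput figures are made-up assumptions.

# (instance_type, current_spot_price_per_hour, relative_throughput)
spot_offers = [
    ("m1.large",  0.08, 1.0),
    ("m1.xlarge", 0.14, 2.0),
    ("c1.xlarge", 0.20, 3.2),
]

def cost_per_unit_work(price: float, throughput: float) -> float:
    """Dollars spent per unit of work completed in an hour."""
    return price / throughput

best = min(spot_offers, key=lambda o: cost_per_unit_work(o[1], o[2]))
print(best[0])  # -> "c1.xlarge" under these sample numbers
```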
siel-iiith/hadoopstack
[email protected]