
HadoopStack: Big Data Processing on Cloud

This poster was presented at the Workshop on Understanding Big Data Analytics, ACM India Special Interest Group on Knowledge Discovery and Data Mining (iKDD)

dharmeshkakadia

February 15, 2013

Transcript

  1. HadoopStack: Big Data Processing on Cloud
    Dharmesh Kakadia, Shashank Sahni, Vasudeva Varma
    SIEL@IIIT-H
    HadoopStack allows enterprises to process Big Data
    across multiple clouds seamlessly. It reduces job
    completion time and improves resource utilization
    using machine learning based job scheduling.
    Use Cases
    •  Hassle-free Big Data processing on the cloud
    •  Single platform for all Big Data processing needs
    •  Complicated workflow pipelines
    •  Leveraging both public and private infrastructure
    •  Managed deployment of multiple clusters on the cloud
    Components
    •  Provisioning – Spawning Hadoop clusters on demand
    •  Job Scheduling (a minimal placement sketch follows this list)
    •  Deadline-aware
    •  Cost-aware
    •  Monitoring – Leveraging usage metrics for auto-scaling and scheduling
    •  Job Management – Single-pane-of-glass management for all jobs spread across multiple clouds
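    As a rough illustration of the cost- and deadline-aware placement decision described above, the sketch below picks the cheapest cluster that can still meet a job's deadline. The Cluster type, the estimate_runtime heuristic, and all figures are hypothetical assumptions for illustration, not HadoopStack's actual API.

```python
# Minimal sketch, assuming a hypothetical Cluster type and a crude runtime
# heuristic; this is NOT HadoopStack's real API, only an illustration of
# cost- and deadline-aware placement.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    slots: int             # concurrently usable task slots
    cost_per_hour: float   # 0.0 for already-owned private capacity

def estimate_runtime(job_work_hours: float, cluster: Cluster) -> float:
    """Crude estimate: total slot-hours of work divided by available slots."""
    return job_work_hours / cluster.slots

def pick_cluster(clusters, job_work_hours, deadline_hours):
    """Return (cluster, runtime, cost) for the cheapest deadline-feasible option."""
    feasible = []
    for c in clusters:
        runtime = estimate_runtime(job_work_hours, c)
        if runtime <= deadline_hours:
            feasible.append((runtime * c.cost_per_hour, runtime, c))
    if not feasible:
        return None  # the provisioner could spawn a larger on-demand cluster instead
    cost, runtime, best = min(feasible, key=lambda t: t[0])
    return best, runtime, cost

if __name__ == "__main__":
    clusters = [
        Cluster("private", slots=8, cost_per_hour=0.0),
        Cluster("public-cloud", slots=64, cost_per_hour=6.4),
    ]
    # 40 slot-hours of work with a 2-hour deadline: the private cluster is too
    # slow (5 h), so the public cloud cluster wins despite its cost.
    print(pick_cluster(clusters, job_work_hours=40.0, deadline_hours=2.0))
```

    A real scheduler would, per the abstract, learn such runtime estimates from past jobs with machine learning rather than using a fixed heuristic.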
    Roadmap
    •  Integrate other Hadoop ecosystem projects – Mahout, Hive, Pig, etc.
    •  Integration of other data processing frameworks – Spark, GraphLab, and R
    Motivation
    •  High cost of setting up Big Data clusters
    •  Inability to grow and shrink clusters on demand
    •  Need to span multiple clouds
    Job Submission
    •  Multiple client interfaces
    •  Web UI, command-line tools, API
    Job Manager
    •  Support for multi-tenancy
    •  Support for job workflows
    Scheduling
    •  Cost- and deadline-aware scheduling
    •  Support for custom scheduler plugins
    Provisioning
    •  Spawning on-demand clusters supporting multiple IaaS
    •  Provider-specific features
    Monitoring
    •  Support for multiple monitoring platforms
    •  Framework instrumentation for fine-grained monitoring
    Scaling
    •  Automatic and user-triggered scaling (a minimal auto-scaling sketch follows this list)
    •  Cost-efficient via spot instances and scaling down
    Reporting
    •  Reporting job status, resource usage, cost, etc.
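    The scaling component above can be read as a simple control loop over monitoring metrics. The sketch below is a minimal, assumed policy: the thresholds, growth factors, and the idea of requesting added workers as spot instances are illustrative choices, not the project's actual implementation.

```python
# Minimal sketch of a monitoring-driven auto-scaling policy. Thresholds and
# growth factors are illustrative assumptions, not HadoopStack's real values.
SCALE_OUT_UTIL = 0.85   # grow when average utilization exceeds this
SCALE_IN_UTIL = 0.30    # shrink when average utilization falls below this

def desired_workers(current: int, avg_utilization: float,
                    min_workers: int = 2, max_workers: int = 100) -> int:
    """Compute the next worker count from the cluster's average utilization."""
    if avg_utilization > SCALE_OUT_UTIL:
        target = current + max(1, current // 4)   # grow by ~25%
    elif avg_utilization < SCALE_IN_UTIL:
        target = current - max(1, current // 4)   # shrink by ~25%
    else:
        target = current                          # stay put inside the band
    return max(min_workers, min(max_workers, target))

# A 16-worker cluster running hot grows to 20 workers; an idle one shrinks to 12.
# For cost efficiency the added workers could be requested as spot instances.
print(desired_workers(16, 0.92))  # -> 20
print(desired_workers(16, 0.10))  # -> 12
```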
    User Workflow
    Challenges
    •  Deadline-aware scheduling
    •  Exploiting IaaS-specific features – instance grouping in AWS, determining optimal instance configuration
    •  Predicting job completion time (see the prediction sketch after this list)
    •  Characterizing jobs
    •  Tracking spot instance prices for the price/performance tradeoff
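    For the job-completion-time challenge above, one simple approach is to learn a regression model over features of past runs. The sketch below uses scikit-learn with made-up features (input size and input size per node) and made-up training data; it only illustrates the idea, not HadoopStack's predictor.

```python
# Minimal sketch: predict job completion time from past runs with a linear
# model. Features, data, and the use of scikit-learn are assumptions made
# for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Past runs: [input size in GB, GB per worker node] -> runtime in minutes.
X = np.array([
    [10.0, 10.0 / 4],
    [50.0, 50.0 / 4],
    [50.0, 50.0 / 16],
    [200.0, 200.0 / 16],
    [200.0, 200.0 / 64],
])
y = np.array([6.0, 28.0, 9.0, 33.0, 11.0])

model = LinearRegression().fit(X, y)

# Estimate how long a 100 GB job would take on a 32-node cluster.
pred = model.predict(np.array([[100.0, 100.0 / 32]]))[0]
print(f"predicted runtime: {pred:.1f} minutes")
```

    Such estimates would feed directly into the deadline-aware placement decision sketched earlier.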
    siel-iiith/hadoopstack
    [email protected]
