
HadoopStack: Big Data on Cloud

This poster was presented at the IIIT-H R&D Showcase.

dharmeshkakadia

February 09, 2013

Transcript

  1. HadoopStack: Big Data on Cloud
    International Institute of Information Technology, Hyderabad
    Search and Information Extraction Lab
    LTRC, IIIT-H
    HadoopStack allows enterprises to process Big Data
    across multiple clouds seamlessly. It reduces job
    completion time and improves resource utilization
    using machine learning based job scheduling.
    Use Cases
    •  Hassle-free Big Data processing on cloud
    •  Single platform for all Big Data processing needs
    •  Complicated workflow pipelines
    •  Leveraging both Public and Private Infrastructure
    •  Managed deployment of multiple clusters on
    cloud
    Components
    •  Provisioning – Spawning Hadoop clusters on
    demand.
    •  Job Scheduling – Deadline aware and cost aware
    •  Monitoring – Leveraging usage metrics for auto
    scaling and scheduling
    •  Job Management – Single-pane-of-glass management
    for all jobs spread across multiple clouds
    RoadMap
    •  Integrate other Hadoop Ecosystem projects –
    Mahout, Hive, Pig, etc.
    •  Integration of other data processing frameworks –
    Spark, GraphLab, and R
    Motivation
    •  High Cost of setting up Big Data clusters
    •  Inability to grow and shrink
    •  Multiple clouds
    Dharmesh Kakadia
    Shashank Sahni
    Job Submission
    •  Multiple client interfaces
    •  Web UI, Command-line tools, API
    Job Manager
    •  Support for multi-tenancy
    •  Support for job workflows
    Scheduling
    •  Cost and deadline aware Scheduling
    •  Support for custom scheduler plugins
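A minimal sketch of what cost and deadline aware scheduling could look like. The poster does not describe the actual algorithm, so everything here is an assumption for illustration: the `ClusterOption` type, the `pick_cluster` function, and the idea that a predicted runtime and an hourly price are available for each candidate cluster. The rule shown simply picks the cheapest cluster that is predicted to finish before the deadline.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterOption:
    """Hypothetical candidate cluster; not part of HadoopStack's API."""
    name: str
    est_runtime_hours: float  # assumed output of a job completion time predictor
    hourly_cost: float        # assumed price of running the whole cluster for one hour

    @property
    def total_cost(self) -> float:
        return self.est_runtime_hours * self.hourly_cost


def pick_cluster(options: Sequence[ClusterOption],
                 deadline_hours: float) -> Optional[ClusterOption]:
    """Return the cheapest option predicted to meet the deadline, or None."""
    feasible = [o for o in options if o.est_runtime_hours <= deadline_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o.total_cost)


# Illustrative numbers only.
options = [
    ClusterOption("private-small", est_runtime_hours=10.0, hourly_cost=0.5),
    ClusterOption("public-medium", est_runtime_hours=4.0, hourly_cost=1.5),
    ClusterOption("public-large", est_runtime_hours=2.0, hourly_cost=4.0),
]
choice = pick_cluster(options, deadline_hours=5.0)
print(choice.name if choice else "no cluster meets the deadline")
```

A custom scheduler plugin, under the same assumptions, would amount to swapping in a different function with the signature of `pick_cluster`.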
    Provisioning
    •  Spawning on-demand clusters supporting
    multiple IaaS
    •  Provider specific features
    Monitoring
    •  Support for multiple monitoring platforms.
    •  Framework instrumentation for fine-grained
    monitoring
    Scaling
    •  Automatic and user-triggered scaling
    •  Cost efficient via spot instances and scaling
    down
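The scaling bullets above can be illustrated with a small threshold-based sketch. This is not taken from the poster: the function names `desired_nodes` and `pick_market`, the utilization thresholds, and the node limits are all hypothetical values chosen for the example. It shows one simple way monitoring metrics could drive scaling up, scaling down, and a preference for spot instances when they are cheaper.

```python
def desired_nodes(current_nodes: int,
                  utilization: float,
                  low: float = 0.3,
                  high: float = 0.8,
                  min_nodes: int = 1,
                  max_nodes: int = 20) -> int:
    """Return the target cluster size for an observed utilization in [0, 1].

    All thresholds and limits are assumed defaults, not HadoopStack settings.
    """
    if utilization > high:
        target = current_nodes + 1   # busy cluster: add a node
    elif utilization < low:
        target = current_nodes - 1   # idle cluster: release a node to save cost
    else:
        target = current_nodes       # within the comfortable band: no change
    return max(min_nodes, min(max_nodes, target))


def pick_market(spot_price: float, on_demand_price: float) -> str:
    """Prefer spot capacity only while it is strictly cheaper than on-demand."""
    return "spot" if spot_price < on_demand_price else "on-demand"


print(desired_nodes(current_nodes=4, utilization=0.9))
print(pick_market(spot_price=0.03, on_demand_price=0.10))
```

User-triggered scaling would bypass `desired_nodes` and set the target size directly, still clamped to the same limits.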
    Reporting
    •  Reporting Job status, resource usage, cost
    etc.
    User Workflow
    Challenges
    •  Deadline aware scheduling
    •  Exploiting IaaS specific features – instance
    grouping in AWS, determining optimal instance
    configuration
    •  Predicting Job completion time
    •  Characterizing Jobs
    •  Tracking spot instances for the price/performance
    tradeoff
    siel-iiith/hadoopstack
