
Lessons from managing hadoop clusters on aws @Indix

Slides of the talk I presented at DevOpsDays India 2016.

Video - https://www.youtube.com/watch?v=eBbgylpRufQ


Ashwanth Kumar

November 05, 2016

Transcript

  1. lessons from managing hadoop clusters on aws @indix bit.ly/autoscaling-on-aws

  2. dev on ops duty @indix oss contributor ashwanthkumar.in @_ashwanthkumar ashwanth kumar
  3. assumptions • You all know about AWS Regions and Availability Zones • Spot instances on AWS (optional) • Have worked with / operated Hadoop clusters
  4. agenda • dev+ops culture @indix • autoscaling hadoop clusters • cost reduction using spot on aws
  5. dev+ops culture @indix “If the development team is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.” (On Designing and Deploying Internet-Scale Services, James Hamilton, LISA ’07)
  6. dev+ops culture @indix Development teams manage a. infrastructure provisioning b. configuration management c. deployment (incl. releases and testing at various environments) d. on-call roster and weekly rotation for all the systems they inherit and build.
  7. dev+ops culture @indix The Ops team is responsible for a. making infrastructure “human fault tolerant” b. AWS cost c. processes like Ops-Review for each system before it hits production d. common infrastructure like configuration management, logging, metric collection, alerting etc. That’s 3 folks in ops responsible for this work across 50+ developers.
  8. dev+ops culture @indix This talk’s work is a result of the operations and development teams coming together to solve some common problems of both. Dev problem: automatic scaling for apps to meet a certain SLA. Ops problem: keeping the cost under control when systems scale.
  9. dev problem autoscaling hadoop clusters

  10. hadoop @indix We have a lot of pipelines running on various versions of Hadoop clusters. Each has its own usage pattern: a staging cluster has workloads for only 3-4 hours a day, while a production cluster has workloads 24x7, running 100s of jobs.
  11. hadoop @indix We started having Scale Up and Scale Down stages in our pipelines.
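    A minimal sketch of what such a stage might look like, assuming the cluster's task nodes live in an AWS Auto Scaling group; the group name and capacity values here are hypothetical, not from the talk:

      // Sketch of a pipeline "Scale Up" stage; ASG name and capacity are hypothetical.
      import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
      import com.amazonaws.services.autoscaling.model.SetDesiredCapacityRequest

      object ScaleUpStage {
        def main(args: Array[String]): Unit = {
          val asg = AmazonAutoScalingClientBuilder.defaultClient()
          asg.setDesiredCapacity(new SetDesiredCapacityRequest()
            .withAutoScalingGroupName("hadoop-task-nodes") // hypothetical group name
            .withDesiredCapacity(50))                      // grow the cluster before the job
          // ... run the Hadoop jobs ...
          // A matching "Scale Down" stage resets this to the baseline; as the next
          // slide notes, a failed pipeline never reaches it and the cluster stays big.
        }
      }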
  12. hadoop @indix Initially it helped, but it started breaking when • a pipeline fails before completion, so the cluster never scales down • every new pipeline created had to have a scale-up and a scale-down stage • more than 1 pipeline started sharing the cluster
  13. hadoop cluster setup

  14. hadoop cluster setup using asg

  15. amazon asg - good parts • Scales to a large number of instances • Has cool-off (cooldown) capabilities to avoid scale storms • Auto-balances instances across AZs (subnets) • Integration with ELB and Elastic Beanstalk
  16. amazon asg - limitations • Can support only 1 launch configuration actively: single instance type, single instance lifecycle (Spot / OD) • Scaling policies tightly coupled with CloudWatch only
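    To make the single-launch-configuration limitation concrete, a sketch of creating one with the AWS Java SDK; the AMI id, names and bid price below are hypothetical:

      // An ASG is tied to exactly one active launch configuration, which pins both
      // the instance type and the lifecycle (setting SpotPrice makes every instance
      // in the group a spot instance; omitting it gives On-Demand).
      import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
      import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest

      object SingleLaunchConfig {
        def main(args: Array[String]): Unit = {
          val asg = AmazonAutoScalingClientBuilder.defaultClient()
          asg.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
            .withLaunchConfigurationName("hadoop-spot-lc") // hypothetical name
            .withImageId("ami-12345678")                   // hypothetical AMI
            .withInstanceType("m3.2xlarge")                // one instance type per config
            .withSpotPrice("0.10"))                        // one lifecycle per config
        }
      }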
  17. github.com/indix/vamana VAMANA application specific autoscaling

  18. vamana architecture [diagram: applications push demand and supply metrics; VAMANA gets the demand and supply metrics and sets the computed “Desired” value]
  19. demand vs supply metrics for hadoop • We collect supply metrics from the Cluster Summary table ◦ map_supply ◦ reduce_supply • Demand metrics are collected as the cumulative sum of map & reduce tasks of all running jobs ◦ map_demand ◦ reduce_demand
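    A sketch of the arithmetic these metrics enable, as I read the slides; this is not Vamana's actual code, and the slots-per-node figures are hypothetical configuration values:

      object DesiredCapacity {
        val mapSlotsPerNode    = 8   // hypothetical: map slots one node contributes
        val reduceSlotsPerNode = 4   // hypothetical: reduce slots one node contributes

        def desiredNodes(currentNodes: Int,
                         mapDemand: Long, mapSupply: Long,
                         reduceDemand: Long, reduceSupply: Long): Int = {
          // Extra nodes needed to close the demand-supply gap on each dimension
          val mapGap    = math.ceil((mapDemand - mapSupply).max(0).toDouble / mapSlotsPerNode).toInt
          val reduceGap = math.ceil((reduceDemand - reduceSupply).max(0).toDouble / reduceSlotsPerNode).toInt
          // Scale to satisfy the worse of the two gaps
          currentNodes + math.max(mapGap, reduceGap)
        }
      }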
  20. post vamana - cost reduction Savings ~ $30 per day
  21. vamana [the architecture diagram again: push demand and supply metrics; get demand and supply metrics; set the computed “Desired” value]

  22. vamana - pluggable app scalar [same diagram, with the pluggable app scalar called out]

  23. vamana - pluggable metric store [same diagram, with the pluggable metric store called out]

  24. vamana - pluggable auto scalar [same diagram, with the pluggable auto scalar called out]
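    A hypothetical sketch of the three pluggable seams the slides name; the real interfaces in github.com/indix/vamana may well differ:

      case class Metrics(demand: Long, supply: Long)

      trait MetricStore {   // where demand/supply metrics are pushed to and read from
        def fetch(clusterId: String): Metrics
      }
      trait AppScalar {     // app-specific "how many nodes do we need?" logic
        def computeDesired(current: Int, metrics: Metrics): Int
      }
      trait AutoScalar {    // the knob that actually resizes the fleet (e.g. an ASG)
        def setDesired(groupName: String, desired: Int): Unit
      }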
  25. ops problem cost reduction using spot

  26. hadoop cluster setup with vamana

  27. By default you would spin up On-Demand instances on AWS. If you’re running 100 m3.2xlarge (30G memory, 8 cores) instances, you’re spending 100 * 0.532 = $53.2 per hour, or 100 * 0.532 * 24 = $1276.8 per day.

  28. problem 1 - cost of on-demand instances By default you would spin up On-Demand instances on AWS. If you’re running 100 m3.2xlarge (30G memory, 8 cores) instances, you’re spending 100 * 0.532 = $53.2 per hour, or 100 * 0.532 * 24 = $1276.8 per day.
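    The slide's arithmetic, spelled out as a tiny script (0.532 is the us-east-1 On-Demand rate the slide quotes, circa late 2016):

      object OnDemandCost extends App {
        val onDemandPerHour = 0.532                // m3.2xlarge On-Demand $/hr (from the slide)
        val instances       = 100
        val perHour = instances * onDemandPerHour  // 53.2
        val perDay  = perHour * 24                 // 1276.8
        println(f"$$$perHour%.2f per hour, $$$perDay%.2f per day")
      }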
  29. use spot instances* (* wherever applicable)

  30. Best value for money. Using spot is a high-risk vs reward game. There’s the Spot Termination Notice, but unfortunately not all applications are AWS-aware, like Hadoop in our case.

  31. problem 2 - spot outages Best value for money. Using spot is a high-risk vs reward game. There’s the Spot Termination Notice, but unfortunately not all applications are AWS-aware, like Hadoop in our case.
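    A sketch of watching for the Spot Termination Notice the slide mentions, by polling the EC2 instance metadata endpoint (the standard ~2-minute warning); the drain action is a placeholder:

      import java.net.{HttpURLConnection, URL}

      object TerminationWatcher extends App {
        // Returns HTTP 404 until the instance is marked for termination
        val url = new URL("http://169.254.169.254/latest/meta-data/spot/termination-time")
        while (true) {
          val conn = url.openConnection().asInstanceOf[HttpURLConnection]
          conn.setConnectTimeout(2000)
          if (conn.getResponseCode == 200) {
            // Body holds the termination timestamp; decommission the node,
            // stop accepting tasks, let HDFS re-replicate, etc. (placeholder)
            println("Spot termination notice received, draining node...")
            sys.exit(0)
          }
          conn.disconnect()
          Thread.sleep(5000)   // poll every ~5 seconds
        }
      }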
  32. • choose a different instance type • split the spot instances across availability zones
  33. We had this running for a while with good success, until we saw our AWS bill gradually increasing, esp. under “Data Transfer”, because the HDFS write pipeline is not very aware of AWS cross-AZ data transfer costs.

  34. problem 3 - data transfer cost We had this running for a while with good success, until we saw our AWS bill gradually increasing, esp. under “Data Transfer”, because the HDFS write pipeline is not very aware of AWS cross-AZ data transfer costs.
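    A back-of-envelope sketch of why cross-AZ HDFS traffic hurts; the $0.01/GB-each-way regional transfer price and the daily write volume below are assumptions for illustration, not figures from the talk:

      object CrossAzCost extends App {
        val perGbEffective  = 0.02     // assumed: $0.01 out + $0.01 in per cross-AZ GB
        val dailyWritesGb   = 5000.0   // hypothetical: 5 TB written to HDFS per day
        val crossAzReplicas = 2        // replicas that land in another AZ (replication 3,
                                       // nodes spread across AZs)
        val dailyCost = dailyWritesGb * crossAzReplicas * perGbEffective
        println(f"~$$$dailyCost%.0f per day in data transfer alone")  // ~$200/day
      }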
  35. go back to a single availability zone

  36. go back to a single availability zone Find a way around Spot outages within an AZ
  37. learnings so far • You always need the full fleet (like Hadoop1 / Mesos / YARN) running 24x7 • You need to save cost by running Spot instances • Handle surge pricing for spot by switching AZs • Ability to fall back to On-Demand if needed to meet a certain SLA • Switch back to Spot once the surge ends (see the sketch below)
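    A sketch of the decision these learnings imply: stay in the cheapest spot market, but fall back to On-Demand when every AZ is surging past a threshold. The types and threshold logic are hypothetical:

      sealed trait Placement
      case class Spot(az: String, price: Double) extends Placement
      case object OnDemand extends Placement

      object FleetPlacement {
        def decide(spotPricesByAz: Map[String, Double], maxBid: Double): Placement = {
          val (cheapestAz, cheapestPrice) = spotPricesByAz.minBy(_._2)
          if (cheapestPrice <= maxBid) Spot(cheapestAz, cheapestPrice)  // stay on spot
          else OnDemand  // every AZ is surging: pay OD to keep the fleet and the SLA
        }
      }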
  38. github.com/indix/matsya optimize for cost and keep the fleet running

  39. matsya A Scala app that monitors spot prices and moves the ASG to the cheapest AZ. Meant to be run as a cron task. Posts notifications to Slack when migrating.
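    A sketch of that loop with the AWS Java SDK; the AZ-to-subnet map, group name and instance type are hypothetical, and the real logic lives in github.com/indix/matsya:

      import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
      import com.amazonaws.services.ec2.model.DescribeSpotPriceHistoryRequest
      import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
      import com.amazonaws.services.autoscaling.model.UpdateAutoScalingGroupRequest
      import scala.jdk.CollectionConverters._

      object MatsyaSketch extends App {
        val subnetByAz = Map(                      // hypothetical AZ -> subnet map
          "us-east-1a" -> "subnet-aaaa1111",
          "us-east-1c" -> "subnet-cccc3333")

        // Latest spot price per AZ for the fleet's instance type
        val ec2 = AmazonEC2ClientBuilder.defaultClient()
        val latestPriceByAz = subnetByAz.keys.map { az =>
          val history = ec2.describeSpotPriceHistory(new DescribeSpotPriceHistoryRequest()
            .withInstanceTypes("m3.2xlarge")
            .withProductDescriptions("Linux/UNIX")
            .withAvailabilityZone(az)
            .withMaxResults(1)).getSpotPriceHistory.asScala
          az -> history.head.getSpotPrice.toDouble
        }.toMap

        // Point the ASG at the cheapest AZ's subnet
        val cheapestAz = latestPriceByAz.minBy(_._2)._1
        AmazonAutoScalingClientBuilder.defaultClient().updateAutoScalingGroup(
          new UpdateAutoScalingGroupRequest()
            .withAutoScalingGroupName("hadoop-task-nodes")   // hypothetical name
            .withVPCZoneIdentifier(subnetByAz(cheapestAz)))  // move the ASG
        // (run from cron; notify Slack on migration, as the slide says)
      }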
  40. how matsya works?

  41. how matsya works? [diagram: an ASG]

  42. how matsya works? [diagram: a Spot ASG]

  43. how matsya works? [diagram: a Spot ASG across us-east-1a, us-east-1c, ...]

  44. how matsya works? [diagram: a Spot ASG across us-east-1a, us-east-1c, ...]

  45. how matsya works? [diagram: a Spot ASG across us-east-1a, us-east-1c, ..., plus an optional OD ASG]
  46. matsya @indix Deployed across all production clusters for 10 months now. Along with Vamana, it enabled us to achieve • ~40% reduction in monthly AWS bill • ~50% of AWS infrastructure on Spot • 100% of Hadoop MR workloads on Spot
  47. hadoop now

  48. takeaways • Make operations approachable to developers - this is more than using shiny tools • Think of autoscaling as a first-class functional feature across all systems • Use Spot on AWS to save costs, but be cognizant of the when, where and how. Questions? github.com/indix/matsya || github.com/indix/vamana
  49. references • http://serverfault.com/questions/448746/ec2-auto-scaling-with-spot-and-on-demand-instances • http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html • https://aws.amazon.com/ec2/spot/bid-advisor/ • http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html • http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/

  50. meta

  51. matsya - roadmap • Support for other Spot products, e.g. Spot Blocks • Multiple-region support • Minimum number of OD instances
  52. vamana - roadmap • Support more metric stores • Support for different types of app scalars • Spot Fleet integration • GCE integration
  53. aws spot primer AWS leases unused hardware at a lower cost as Spot instances. Spot prices are highly volatile, but highly cost effective if used right. Spot’s “demand vs supply” is local to its Spot Market.
  54. aws spot markets For a cluster, the spot markets can be viewed along the following dimensions: instance types, availability zones and regions. The number of spot markets is the cartesian product of these. Example - requirement of 36 CPUs per instance • Instance types - [d2.8xlarge, c4.8xlarge] • AZs - [us-east-1a, us-east-1b, us-east-1c, …] • Regions - [us-east, us-west, …] - 10 regions • Total in US-EAST (alone) => 2 * 5 = 10 spot markets
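    The slide's counting as code, assuming the five us-east-1 AZs that its 2 * 5 implies:

      object SpotMarkets extends App {
        val instanceTypes = Seq("d2.8xlarge", "c4.8xlarge")   // both offer 36 vCPUs
        val azs = Seq("us-east-1a", "us-east-1b", "us-east-1c",
                      "us-east-1d", "us-east-1e")             // assumed: 5 AZs in us-east-1
        // Cartesian product of instance types and AZs
        val markets = for { t <- instanceTypes; az <- azs } yield (t, az)
        println(s"${markets.size} spot markets in us-east-1 alone")  // 10
      }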