Lessons from managing Hadoop clusters on AWS @Indix

Slide 1

Slide 1 text

lessons from managing hadoop clusters on aws @indix bit.ly/autoscaling-on-aws

Slide 2

Slide 2 text

ashwanth kumar
dev on ops duty @indix
oss contributor
ashwanthkumar.in
@_ashwanthkumar

Slide 3

Slide 3 text

assumptions
You all
● know about AWS Regions and Availability Zones
● know about Spot instances on AWS (optional)
● have worked with or operated Hadoop clusters

Slide 4

Slide 4 text

agenda
● dev+ops culture @indix
● autoscaling hadoop clusters
● cost reduction using spot on aws

Slide 5

Slide 5 text

dev+ops culture @indix
“If the development team is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.”
- James Hamilton, On Designing and Deploying Internet-Scale Services, LISA ’07

Slide 6

Slide 6 text

dev+ops culture @indix
Development teams manage
a. infrastructure provisioning
b. configuration management
c. deployment (incl. releases and testing in various environments)
d. on-call roster and weekly rotation
for all the systems they inherit and build.

Slide 7

Slide 7 text

dev+ops culture @indix
Ops team is responsible for
a. making infrastructure “human fault tolerant”
b. AWS cost
c. processes like Ops-Review for each system before it hits production
d. common infrastructure like configuration management, logging, metric collection, alerting etc.
That’s 3 people in ops supporting the work of 50+ developers.

Slide 8

Slide 8 text

dev+ops culture @indix
The work in this talk is the result of operations and development teams coming together to solve problems common to both.
Dev problem - automatic scaling for apps to meet a certain SLA
Ops problem - keeping cost under control when systems scale

Slide 9

Slide 9 text

dev problem autoscaling hadoop clusters

Slide 10

Slide 10 text

hadoop @indix
We have a lot of pipelines running on Hadoop clusters of various versions
Each cluster has its own usage pattern
● The staging cluster has workloads for only 3-4 hours a day
● The production cluster has workloads 24x7, running 100s of jobs

Slide 11

Slide 11 text

hadoop @indix We started having Scale Up and Scale Down stages in our pipelines

Slide 12

Slide 12 text

hadoop @indix
Initially it helped, but it started breaking when
● a pipeline failed before completion, so the cluster never scaled down
● every new pipeline had to have its own scale up and scale down stages
● more than 1 pipeline started sharing the cluster

Slide 13

Slide 13 text

hadoop cluster setup

Slide 14

Slide 14 text

hadoop cluster setup using asg

Slide 15

Slide 15 text

amazon asg - good parts
● Scales to a large number of instances
● Has cooldown periods to avoid scale storms
● Automatically balances instances across AZs (subnets)
● Integrates with ELB (Elastic Load Balancing) and Elastic Beanstalk

Slide 16

Slide 16 text

amazon asg - limitations
Supports only 1 active launch configuration
● Single instance type
● Single instance life cycle - Spot / OD
Scaling policies are tightly coupled to CloudWatch only

Slide 17

Slide 17 text

github.com/indix/vamana VAMANA application specific autoscaling

Slide 18

Slide 18 text

vamana architecture
Diagram labels: Push Demand and Supply Metrics; Get Demand and Supply Metrics; Set the computed “Desired” Value

Slide 19

Slide 19 text

demand vs supply metrics for hadoop
● We collect supply metrics from the Cluster Summary table
○ map_supply
○ reduce_supply
● Demand metrics are collected as the cumulative sum of map & reduce tasks of all running jobs
○ map_demand
○ reduce_demand
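One way the four metrics above could be turned into a "Desired" node count is sketched below. This is an illustration only, not Vamana's actual algorithm: the function name, the slots-per-node inference and the min/max bounds are all assumptions made for the example.

```python
import math


def desired_nodes(map_demand, reduce_demand, map_supply, reduce_supply,
                  current_nodes, min_nodes=1, max_nodes=100):
    """Hypothetical 'Desired' computation from demand and supply metrics.

    Assumes every node contributes the same number of map and reduce slots,
    so slots-per-node can be inferred from the current supply.
    """
    if current_nodes <= 0 or map_supply <= 0 or reduce_supply <= 0:
        # Nothing to infer from; fall back to the floor.
        return min_nodes

    map_slots_per_node = map_supply / current_nodes
    reduce_slots_per_node = reduce_supply / current_nodes

    # Nodes needed to run all pending map tasks, and all pending reduce tasks.
    needed_for_maps = math.ceil(map_demand / map_slots_per_node)
    needed_for_reduces = math.ceil(reduce_demand / reduce_slots_per_node)

    # Size for the larger of the two, clamped to the allowed range.
    needed = max(needed_for_maps, needed_for_reduces)
    return max(min_nodes, min(max_nodes, needed))
```

For example, with 10 nodes supplying 80 map slots and 40 reduce slots, a demand of 400 map tasks and 100 reduce tasks asks for `desired_nodes(400, 100, 80, 40, 10)` nodes, and zero demand falls back to `min_nodes`.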

Slide 20

Slide 20 text

post vamana - cost reduction
Savings: ~$30 per day

Slide 21

Slide 21 text

vamana
Diagram labels: Push Demand and Supply Metrics; Get Demand and Supply Metrics; Set the computed “Desired” Value

Slide 22

Slide 22 text

vamana - pluggable app scalar
Diagram labels: Push Demand and Supply Metrics; Get Demand and Supply Metrics; Set the computed “Desired” Value

Slide 23

Slide 23 text

vamana - pluggable metric store
Diagram labels: Push Demand and Supply Metrics; Get Demand and Supply Metrics; Set the computed “Desired” Value

Slide 24

Slide 24 text

vamana - pluggable auto scalar
Diagram labels: Push Demand and Supply Metrics; Get Demand and Supply Metrics; Set the computed “Desired” Value
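The three pluggable pieces named on these slides (app scalar, metric store, auto scalar) could be wired together roughly as below. This is a minimal sketch and not Vamana's real interfaces: every class and method name here is hypothetical, and the "push" side that writes metrics into the store is left out.

```python
import math
from typing import Dict, List, Protocol, Tuple


class MetricStore(Protocol):
    """Hypothetical plug point: where demand and supply metrics are read from."""
    def get(self, cluster: str) -> Dict[str, float]: ...


class AppScalar(Protocol):
    """Hypothetical plug point: application-specific 'Desired' computation."""
    def desired(self, metrics: Dict[str, float]) -> int: ...


class AutoScalar(Protocol):
    """Hypothetical plug point: whatever applies the value (e.g. an ASG)."""
    def set_desired(self, cluster: str, desired: int) -> None: ...


def run_once(cluster: str, store: MetricStore, app: AppScalar,
             auto: AutoScalar) -> int:
    """One cycle: get metrics, compute 'Desired', set it on the autoscaler."""
    metrics = store.get(cluster)
    desired = app.desired(metrics)
    auto.set_desired(cluster, desired)
    return desired


# In-memory stand-ins, used only to exercise the loop.
class InMemoryStore:
    def __init__(self, data: Dict[str, Dict[str, float]]) -> None:
        self.data = data

    def get(self, cluster: str) -> Dict[str, float]:
        return self.data[cluster]


class SlotsPerNodeScalar:
    def __init__(self, slots_per_node: int) -> None:
        self.slots_per_node = slots_per_node

    def desired(self, metrics: Dict[str, float]) -> int:
        return math.ceil(metrics["map_demand"] / self.slots_per_node)


class RecordingAutoScalar:
    def __init__(self) -> None:
        self.calls: List[Tuple[str, int]] = []

    def set_desired(self, cluster: str, desired: int) -> None:
        self.calls.append((cluster, desired))
```

Keeping the three roles behind separate interfaces is what would let a different metric store or a non-ASG autoscaler be swapped in without touching the loop.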

Slide 25

Slide 25 text

ops problem cost reduction using spot

Slide 26

Slide 26 text

hadoop cluster setup with vamana

Slide 27

Slide 27 text

By default you would spin up On Demand instances on AWS
If you’re running 100 m3.2xlarge (30G memory, 8 cores) instances, you’re spending
● 100 * 0.532 = $53.2 per hour
● 100 * 0.532 * 24 = $1276.8 per day
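The arithmetic on this slide can be reproduced with a few lines. The $0.532 hourly rate and the fleet size of 100 come from the slide; the function name is just for illustration.

```python
HOURLY_RATE_USD = 0.532  # m3.2xlarge On Demand rate quoted on the slide


def on_demand_cost(instances, hours, hourly_rate=HOURLY_RATE_USD):
    """Cost in USD of running `instances` On Demand instances for `hours`."""
    return instances * hourly_rate * hours


per_hour = round(on_demand_cost(100, 1), 2)
per_day = round(on_demand_cost(100, 24), 2)
print(per_hour, per_day)
```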

Slide 28

Slide 28 text

Slide 29

Slide 29 text

use spot instances* * wherever applicable

Slide 30

Slide 30 text

Best value for money
Using spot is a high risk vs reward game
There’s a Spot Termination Notice, but unfortunately not all applications are AWS-aware, and in our case Hadoop is not

Slide 31

Slide 31 text

problem 2 - spot outages

Slide 32

Slide 32 text

● choose a different instance type
● split the spot instances across availability zones

Slide 33

Slide 33 text

We had this running for a while with good success
Until we saw our AWS bill gradually increasing, especially under “Data Transfer”
This is because the HDFS write pipeline is not aware of AWS cross-AZ data transfer costs, so block replicas written to DataNodes in another AZ are billed as cross-AZ traffic

Slide 34

Slide 34 text

Slide 35

Slide 35 text

go back to single availability zone

Slide 36

Slide 36 text

go back to single availability zone
Find a way around Spot outages within an AZ

Slide 37

Slide 37 text

learnings so far
● You always need the full fleet (like Hadoop1 / Mesos / YARN) running 24x7
● You need to save cost by running Spot instances
● Handle surge pricing for Spot by switching AZs
● Be able to fall back to On Demand if needed to meet a certain SLA
● Switch back to Spot once the surge ends

Slide 38

Slide 38 text

github.com/indix/matsya optimize for cost and keep the fleet running

Slide 39

Slide 39 text

matsya
● Scala app that monitors spot prices and moves the ASG to the cheapest AZ
● Meant to be run as a cron task
● Posts notifications to Slack when migrating
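The decision Matsya makes on each run might look like the sketch below: pick the cheapest AZ for Spot, and fall back to On Demand when even the cheapest Spot price is too high. This is not Matsya's code (which is Scala) or its configuration; the function name, the `threshold` parameter and the prices in the usage note are assumptions for illustration.

```python
def choose_market(spot_prices, on_demand_price, threshold=0.8):
    """Pick where the fleet should run for the next interval.

    spot_prices: dict mapping AZ name -> current spot price (USD per hour).
    threshold:   hypothetical cut-off; if the cheapest spot price exceeds
                 threshold * on_demand_price, fall back to On Demand.
    Returns a (lifecycle, az) tuple; az is None for On Demand.
    """
    if not spot_prices:
        # No spot price data at all: keep the fleet running on On Demand.
        return ("on-demand", None)

    cheapest_az = min(spot_prices, key=spot_prices.get)
    if spot_prices[cheapest_az] > threshold * on_demand_price:
        # Surge pricing in every AZ: On Demand is the safer choice.
        return ("on-demand", None)
    return ("spot", cheapest_az)
```

Called from cron with made-up prices such as `choose_market({"us-east-1a": 0.60, "us-east-1c": 0.12}, 0.532)`, this selects Spot in us-east-1c; once prices in every AZ rise above the cut-off, the same call returns the On Demand fallback, and it switches back to Spot when the surge ends.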

Slide 40

Slide 40 text

how matsya works?

Slide 41

Slide 41 text

how matsya works? ASG

Slide 42

Slide 42 text

how matsya works? Spot ASG

Slide 43

Slide 43 text

how matsya works? us-east-1a us-east-1c ... Spot ASG

Slide 44

Slide 44 text

how matsya works? us-east-1a us-east-1c ... Spot ASG

Slide 45

Slide 45 text

how matsya works? OD ASG us-east-1a us-east-1c ... Spot ASG (Optional)

Slide 46

Slide 46 text

matsya @indix
Deployed across all production clusters for 10 months now
Along with Vamana, it enabled us to achieve
● ~40% reduction in monthly AWS bill
● ~50% of AWS infrastructure on Spot
● 100% of Hadoop MR workloads on Spot

Slide 47

Slide 47 text

hadoop now

Slide 48

Slide 48 text

takeaways
● Make operations approachable to developers - this is more than using shiny tools
● Treat autoscaling as a first-class functional feature across all systems
● Use Spot on AWS to save costs, but be cognizant of when, where and how
Questions?
github.com/indix/matsya || github.com/indix/vamana

Slide 49

Slide 49 text

references
http://serverfault.com/questions/448746/ec2-auto-scaling-with-spot-and-on-demand-instances
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
https://aws.amazon.com/ec2/spot/bid-advisor/
http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/

Slide 50

Slide 50 text

Slide 51

Slide 51 text

matsya - roadmap
● Support for other Spot products - Spot Blocks
● Multiple Region support
● Minimum number of OD instances

Slide 52

Slide 52 text

vamana - roadmap
● Support more metric stores
● Support for different types of app scalar
● Spot Fleet integration
● GCE integration

Slide 53

Slide 53 text

aws spot primer
● AWS leases unused hardware at a lower cost as Spot instances
● Spot prices are highly volatile
● But highly cost effective if used right
● Spot’s “Demand vs Supply” is local to its Spot Market

Slide 54

Slide 54 text

aws spot markets
For a cluster, the spot markets can be viewed along the following dimensions
● # of Instance Types, Availability Zones and Regions
The set of spot markets is the cartesian product of these dimensions, so the number of markets is the product of the above numbers.
Example - Requirement for 36 CPUs per instance
● Instance Types - [d2.8xlarge, c4.8xlarge]
● AZs - [us-east-1a, us-east-1b, us-east-1c, …]
● Region - [us-east, us-west, …] - 10 regions
● Total in US-EAST (alone) => 2 instance types * 5 AZs = 10 spot markets
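The enumeration of spot markets within one region can be sketched with `itertools.product`. Only the three AZs named on the slide are listed here, so this prints 6 rather than the slide's 10; the slide's total assumes 5 AZs, and the remaining names are elided in the source and left out rather than guessed.

```python
from itertools import product

instance_types = ["d2.8xlarge", "c4.8xlarge"]
# Only the AZs spelled out on the slide; the elided ones are not filled in.
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]


def spot_markets(instance_types, azs):
    """Each (instance type, AZ) pair is its own spot market within a region."""
    return list(product(instance_types, azs))


markets = spot_markets(instance_types, azs)
print(len(markets))  # 2 instance types * 3 AZs listed here
```

Each of these markets has its own price and its own demand vs supply, which is why switching AZ or instance type can sidestep a surge in a single market.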