Matsya - Geeknight April 2016

learnings from running production infra on aws spot reliably j.mp/to-matsya-geeknight

Ashwanth Kumar
• Dev/Ops @ Indix
• Hindu Mythology Fan
• OSS Contributor
• ashwanthkumar.in

running hadoop clusters on aws (this is not Amazon EMR)

Typical Hadoop Slaves Setup on AWS
• An Auto Scaling Group (ASG) based set of instances running TaskTrackers and DataNodes
• Scaling up or down is as easy as updating the Desired value
• (or) You can also auto scale based on specific metrics from CloudWatch

Problem 1 - Cost of On Demand Instances
• By default you would spin up On Demand instances on AWS
• If you're running 100 c3.2xlarge (15G memory, 8 cores) instances, you're spending
  • 100 * 0.420 = $42 per hour
  • 100 * 0.420 * 24 = $1008 per day
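The arithmetic on this slide can be reproduced with a small helper. This is only an illustrative sketch: the function name `on_demand_cost` is made up for this example, and the $0.420 hourly price is the figure quoted on the slide, not a live AWS price.

```python
def on_demand_cost(instances: int, hourly_price: float, hours: float = 1.0) -> float:
    """Total On Demand cost in USD for a fleet over the given number of hours."""
    # Round to cents so floating point noise does not leak into the result.
    return round(instances * hourly_price * hours, 2)


# Figures from the slide: 100 c3.2xlarge instances at $0.420 per hour.
hourly = on_demand_cost(100, 0.420)
daily = on_demand_cost(100, 0.420, hours=24)
print(f"${hourly:.2f} per hour, ${daily:.2f} per day")
```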

use spot instances (wherever applicable)

AWS Spot Primer
• AWS leases unused hardware at a lower cost as Spot instances
• Spot prices are highly volatile
• But highly cost effective if used right
• Spot's "Demand vs Supply" is local to its Spot Market

Problem 2 - Spot Outages
• Using Spot is a high risk and high reward game
• While you get the best value in terms of cost for the same compute, there's no guarantee on when the machines are going to be taken away
• There's the Spot Termination Notice, but unfortunately not all applications are AWS-aware (Hadoop, in our case, is not)
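For an application that is AWS-aware, reacting to the Spot Termination Notice roughly means polling the instance metadata service. The sketch below assumes the `spot/termination-time` metadata path, which returns a UTC timestamp once termination is scheduled and a 404 otherwise. The helper names are hypothetical, and instances that enforce IMDSv2 session tokens would need an extra token request that is not shown here.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.request import urlopen

# EC2 instance metadata path for the Spot termination notice.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_body(url: str = TERMINATION_URL, timeout: float = 2.0) -> Optional[str]:
    """Return the raw response body, or None when nothing is scheduled or the endpoint is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError:
        # HTTPError (e.g. the 404 returned when no termination is scheduled),
        # URLError and socket timeouts are all subclasses of OSError.
        return None


def parse_termination_time(body: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp such as 2016-04-21T10:02:00Z; return None for anything else."""
    if not body:
        return None
    try:
        parsed = datetime.strptime(body, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)
```

A daemon on each node could call `parse_termination_time(fetch_termination_body())` every few seconds and start draining work when it returns a timestamp. Hadoop TaskTrackers and DataNodes do no such thing out of the box, which is the gap this slide points at.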

Split the spot instances across availability zones

AWS Spot Markets
For a cluster, the spot markets can be viewed in the following dimensions
• # of Instance Types, Regions and Availability Zones
The number of spot markets is a cartesian product of all the above numbers.
Example - Requirement for 36 CPUs per instance
• Instance Types - [d2.8xlarge, c4.8xlarge]
• AZs - [us-east-1a, us-east-1b, us-east-1c, …]
• Region - [us-east, us-west, …] - 10 regions
• Total in US-EAST (alone) => 2 * 1 * 5 = 10 spot markets
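The cartesian product in the example can be enumerated directly. In this sketch the instance types come from the slide. The slide elides the full AZ list, so the five AZ names are generated placeholders rather than real zone names.

```python
from itertools import product

instance_types = ["d2.8xlarge", "c4.8xlarge"]  # from the slide
regions = ["us-east"]                           # a single region, as in the example
# Placeholder names: the slide only says there are 5 AZs in this region.
availability_zones = [f"us-east-az{i}" for i in range(1, 6)]

# Each (instance type, region, AZ) triple is its own spot market with its own price.
spot_markets = list(product(instance_types, regions, availability_zones))

print(len(spot_markets))
for instance_type, region, az in spot_markets:
    print(instance_type, region, az)
```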

Problem 3 - Cost of Data Transfer
• We had this running for a while with good success
• We saw our AWS bill gradually increasing in the "Data Transfer" section
• Because the HDFS write pipeline is not very AWS-Cross-AZ-Data-Transfer-Cost aware
• With HDFS on Spot, each machine going down meant replication would kick in from the start
• Not only HDFS: in an MR job, reducers download data from each mapper

• Go back to a single availability zone
• Find a way around Spot outages within an AZ

github.com/ind9/matsya
Optimize for cost and keep the fleet running

Use Cases
• You always need the full fleet (like Hadoop / Mesos / YARN) running 24x7
• Ability to fall back to On Demand if needed to save costs
• Switch back to Spot once the surge ends

Matsya
• Scala app that monitors spot prices and moves the ASG to the cheapest AZ
• Meant to be run as a CRON task
• Can fall back to OD (if required)
• Posts notifications to Slack when migrating
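The decision Matsya makes on each CRON run can be pictured roughly as follows. This is a Python sketch for illustration, not Matsya's actual Scala code. The field names mirror the deck's sample configuration, but their semantics are assumptions: `max_threshold` is read as the fraction of the bid price above which a spot price counts as a breach, and `nr_of_times` as the number of consecutive breaches required before moving. The `choose_target` function and the `"on-demand"` return value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterConfig:
    bid_price: float
    max_threshold: float   # assumed: fraction of bid_price that counts as a breach
    nr_of_times: int       # assumed: consecutive breaches needed before moving
    fallback_to_od: bool


def breached(price: float, cfg: ClusterConfig) -> bool:
    """True when a spot price is above the tolerated fraction of the bid price."""
    return price > cfg.bid_price * cfg.max_threshold


def choose_target(current_az: str, history: Dict[str, List[float]], cfg: ClusterConfig) -> str:
    """Return the AZ to run in, "on-demand", or current_az when no move is needed.

    history maps each AZ to its recent spot prices, oldest first.
    """
    recent = history.get(current_az, [])[-cfg.nr_of_times:]
    # Stay put unless the last nr_of_times prices all breached the threshold.
    if len(recent) < cfg.nr_of_times or not all(breached(p, cfg) for p in recent):
        return current_az
    # Otherwise move to the cheapest AZ whose latest price is still acceptable.
    latest = {az: prices[-1] for az, prices in history.items() if prices}
    viable = {az: price for az, price in latest.items() if not breached(price, cfg)}
    if viable:
        return min(viable, key=lambda az: viable[az])
    # No acceptable spot market left: fall back to On Demand only if allowed.
    return "on-demand" if cfg.fallback_to_od else current_az
```

For example, with a bid price of 0.420 and a threshold of 0.90, a price counts as a breach once it exceeds 0.378. Three such prices in a row in the current AZ trigger a move to the cheapest AZ still under that line.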

How Matsya Works?
[Diagram, built up over several slides: a Spot ASG with availability zones us-east-1a, us-east-1c, ..., and an (optional) OD ASG]

Sample Configuration

matsya {
  working-dir = "local_run"
  slack-webhook = "http://hooks.slack.com/services/foo/bar/baz"
  clusters = [{
    name = "Staging Hadoop Cluster"
    spot-asg = "as-hadoop-staging-spot"
    od-asg = "as-hadoop-staging-od"
    machine-type = "c3.2xlarge"
    bid-price = 0.420
    od-price = 0.420
    max-threshold = 0.90
    nr-of-times = 3
    fallback-to-od = false
    subnets = {
      "us-east-1a" = "subnet-east-1a"
      "us-east-1b" = "subnet-east-1b"
      "us-east-1c" = "subnet-east-1c"
    }
  }]
}

Matsya at Indix
• Deployed across all production clusters for 6 months now
• Along with Vamana, enabled us to achieve
  • ~40% reduction in monthly AWS bill
  • ~50% of AWS Infrastructure is on Spot
  • 100% of Hadoop MR workloads are on Spot


Work In Progress
• Support for other Spot products - Spot Fleet and Spot Blocks
• More notification systems
• Multiple Region support
• Multiple Product support
• Minimum number of OD instances

Questions?
github.com/ind9/matsya

References
• http://serverfault.com/questions/448746/ec2-auto-scaling-with-spot-and-on-demand-instances
• http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
• https://aws.amazon.com/ec2/spot/bid-advisor/
