Slide 1

Slide 1 text

Learnings from running production infra on AWS Spot reliably
j.mp/to-matsya-geeknight

Slide 2

Slide 2 text

Ashwanth Kumar
● Dev/Ops @ Indix
● Hindu Mythology Fan
● OSS Contributor
ashwanthkumar.in

Slide 3

Slide 3 text

Running Hadoop clusters on AWS (this is not Amazon EMR)

Slide 4

Slide 4 text

Typical Hadoop Slaves Setup on AWS
● An Auto Scaling Group (ASG) based set of instances running TaskTrackers and DataNodes
● Scaling up or down is as easy as updating the Desired value
● (or) You can also auto scale based on specific metrics from CloudWatch

Slide 5

Slide 5 text

Problem 1 - Cost of On Demand Instances
By default you would spin up On Demand instances on AWS
If you’re running 100 c3.2xlarge (15G memory, 8 cores) instances, you’re spending
● 100 * 0.420 = $42 per hour
● 100 * 0.420 * 24 = $1008 per day
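The arithmetic on this slide can be reproduced with a short sketch. The fleet size (100) and the hourly price ($0.420) come from the slide; the function name is only illustrative.

```python
def on_demand_cost(instances: int, hourly_price: float, hours: int = 1) -> float:
    """Total On Demand cost in dollars for a fleet over the given number of hours."""
    # Round to cents to hide floating point noise in the product.
    return round(instances * hourly_price * hours, 2)


hourly = on_demand_cost(100, 0.420)
daily = on_demand_cost(100, 0.420, hours=24)
print(f"${hourly:.2f} per hour, ${daily:.2f} per day")
```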

Slide 6

Slide 6 text

use spot instances (wherever applicable)

Slide 7

Slide 7 text

AWS Spot Primer
● AWS leases unused hardware at a lower cost as Spot instances
● Spot prices are highly volatile
● But highly cost effective if used right
● Spot’s “Demand vs Supply” is local to its Spot Market

Slide 8

Slide 8 text

Problem 2 - Spot Outages
● Using spot is a high risk and high reward game
● While you get the best value in terms of cost for the same compute, there’s no guarantee on when the machines are going to be taken away
● There’s the Spot Termination Notice, but unfortunately not all applications are AWS-aware, like Hadoop in our case
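For context, this is roughly what an "AWS-aware" application would have to do to see the Spot Termination Notice: poll the instance metadata endpoint, which returns 404 until the instance is marked for termination and a UTC timestamp afterwards. The sketch below is a minimal illustration, not something Hadoop or Matsya ships. It assumes the metadata service is reachable without a session token (instances that enforce IMDSv2 need an extra token request), and the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# Instance metadata path that EC2 populates once a Spot instance is marked for termination.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_metadata(url: str, timeout: float = 1.0) -> Optional[str]:
    """Return the body at `url`, or None if the endpoint answers 404 or is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError):
        return None


def termination_time(fetch: Callable[[str], Optional[str]] = fetch_metadata) -> Optional[str]:
    """Return the scheduled termination timestamp, or None when no notice is pending."""
    body = fetch(TERMINATION_URL)
    # A real notice is a UTC timestamp such as 2015-01-05T18:02:00Z.
    if body and body.endswith("Z"):
        return body
    return None
```

A wrapper process could call `termination_time()` every few seconds and start draining the node when it returns a value; the notice leaves about two minutes before the instance is reclaimed.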

Slide 9

Slide 9 text

Split the spot instances across availability zones

Slide 10

Slide 10 text

AWS Spot Markets
For a cluster, the spot markets can be viewed in the following dimensions
● # of Instance Types, Regions and Availability Zones
The number of spot markets is the cartesian product of all the above numbers.
Example - Requirement for 36 CPUs per instance
● Instance Types - [d2.8xlarge, c4.8xlarge]
● AZs - [us-east-1a, us-east-1b, us-east-1c, …]
● Region - [us-east, us-west, …] - 10 regions
● Total in US-EAST (alone) => 2 * 1 * 5 = 10 spot markets
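The count on this slide can be checked by enumerating the markets directly. This is a small sketch: the instance types and the first three AZ names come from the slide, while the last two AZ names are placeholders standing in for the elided entries (the slide only implies there are five).

```python
from itertools import product

instance_types = ["d2.8xlarge", "c4.8xlarge"]
regions = ["us-east"]
# The slide lists three AZs and then "…"; the last two names are placeholders, not real AZ ids.
azs = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-az-4", "us-east-az-5"]

# Every (instance type, region, AZ) combination is its own spot market with its own price.
spot_markets = list(product(instance_types, regions, azs))
print(len(spot_markets))
```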

Slide 11

Slide 11 text

Problem 3 - Cost of Data Transfer
● We had this running for a while with good success
● We saw our AWS bill gradually increasing in the “Data Transfer” section
● Because the HDFS write pipeline is not aware of AWS cross-AZ data transfer costs
● With HDFS on Spot, each machine going down meant replication would kick in again from the start
● Not only HDFS: in an MR job, reducers download data from every mapper

Slide 12

Slide 12 text

Go back to a single availability zone
Find a way around Spot outages within an AZ

Slide 13

Slide 13 text

github.com/ind9/matsya Optimize for cost and keep the fleet running

Slide 14

Slide 14 text

Use Cases
● You always need the full fleet (like Hadoop / Mesos / YARN) running 24x7
● Ability to fall back to On Demand if needed to save costs
● Switch back to Spot once the surge ends

Slide 15

Slide 15 text

Matsya
● Scala app that monitors spot prices and moves the ASG to the cheapest AZ
● Meant to be run as a cron task
● Can fall back to OD (On Demand) if required
● Posts notifications to Slack when migrating
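The core idea of "move the ASG to the cheapest AZ" can be sketched in a few lines. Matsya itself is a Scala app; this Python snippet only illustrates the selection step, with made-up prices and a hypothetical function name, and it leaves out the AWS calls that would fetch prices and update the ASG.

```python
from typing import Dict


def cheapest_az(spot_prices: Dict[str, float]) -> str:
    """Return the AZ with the lowest current spot price (ties broken alphabetically)."""
    return min(sorted(spot_prices), key=spot_prices.__getitem__)


# Hypothetical prices for one instance type, keyed by AZ.
prices = {"us-east-1a": 0.310, "us-east-1b": 0.092, "us-east-1c": 0.125}
print(cheapest_az(prices))
```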

Slide 16

Slide 16 text

How Matsya Works?

Slide 17

Slide 17 text

How Matsya Works? ASG

Slide 18

Slide 18 text

How Matsya Works? Spot ASG

Slide 19

Slide 19 text

How Matsya Works? us-east-1a us-east-1c ... Spot ASG

Slide 20

Slide 20 text

How Matsya Works? us-east-1a us-east-1c ... Spot ASG

Slide 21

Slide 21 text

How Matsya Works? OD ASG us-east-1a us-east-1c ... Spot ASG (Optional)

Slide 22

Slide 22 text

Sample Configuration
matsya {
  working-dir = "local_run"
  slack-webhook = "http://hooks.slack.com/services/foo/bar/baz"
  clusters = [{
    name = "Staging Hadoop Cluster"
    spot-asg = "as-hadoop-staging-spot"
    od-asg = "as-hadoop-staging-od"
    machine-type = "c3.2xlarge"
    bid-price = 0.420
    od-price = 0.420
    max-threshold = 0.90
    nr-of-times = 3
    fallback-to-od = false
    subnets = {
      "us-east-1a" = "subnet-east-1a"
      "us-east-1b" = "subnet-east-1b"
      "us-east-1c" = "subnet-east-1c"
    }
  }]
}
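One plausible reading of how `bid-price`, `max-threshold`, `nr-of-times` and `fallback-to-od` interact is sketched below. The slides do not spell out the exact semantics, so treat this as an assumption rather than Matsya's actual behaviour: the current AZ is "in violation" when its spot price exceeds `bid-price * max-threshold`, and after `nr-of-times` consecutive violations the cluster moves to the cheapest AZ under the limit, or to On Demand if none qualifies and fallback is enabled. All names in the code are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ClusterConfig:
    bid_price: float
    max_threshold: float
    nr_of_times: int
    fallback_to_od: bool


def decide(cfg: ClusterConfig, current_az: str, prices: Dict[str, float],
           violations: int) -> Tuple[str, int]:
    """Return (action, updated consecutive-violation count) for one scheduled run."""
    limit = cfg.bid_price * cfg.max_threshold
    if prices[current_az] <= limit:
        return "stay", 0  # price is healthy again, reset the counter
    violations += 1
    if violations < cfg.nr_of_times:
        return "stay", violations  # tolerate short spikes
    best = min(sorted(prices), key=prices.__getitem__)
    if prices[best] <= limit:
        return f"move:{best}", 0
    if cfg.fallback_to_od:
        return "fallback-to-od", 0
    return "stay", violations  # nothing cheaper and no fallback configured


cfg = ClusterConfig(bid_price=0.420, max_threshold=0.90, nr_of_times=3, fallback_to_od=False)
# Hypothetical prices: the current AZ (us-east-1a) is above the 0.378 limit.
prices = {"us-east-1a": 0.400, "us-east-1b": 0.100, "us-east-1c": 0.200}
print(decide(cfg, "us-east-1a", prices, violations=2))
```

Because the tool is meant to run from cron, the violation count would have to be persisted between runs; the `working-dir` setting looks like a natural place for that state, though the slides do not confirm it.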

Slide 23

Slide 23 text

Matsya at Indix
Deployed across all production clusters for 6 months now
Along with Vamana, enabled us to achieve
● ~40% reduction in monthly AWS bill
● ~50% of AWS Infrastructure is on Spot
● 100% of Hadoop MR workloads are on Spot

Slide 24

Slide 24 text

Matsya at Indix

Slide 25

Slide 25 text

Work In Progress
● Support for other Spot products - Spot Fleet and Spot Blocks
● More notification systems
● Multiple Region support
● Multiple Product support
● Minimum number of OD instances
Questions?
github.com/ind9/matsya

Slide 26

Slide 26 text

References
● http://serverfault.com/questions/448746/ec2-auto-scaling-with-spot-and-on-demand-instances
● http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
● https://aws.amazon.com/ec2/spot/bid-advisor/