Slide 1

Slide 1 text

An Introduction j.mp/to-matsya-rootconf

Slide 2

Slide 2 text

Ashwanth Kumar
Dev/Ops @ Indix
Hindu Mythology Fan
OSS Contributor
ashwanthkumar.in

Slide 3

Slide 3 text

Typical Hadoop Setup on AWS
● An Auto Scaling Group (ASG) of instances running TaskTrackers and DataNodes
● Scaling up or down is as easy as updating the Desired value (assuming you don’t have any auto scaling metrics)
● You can also auto scale based on specific metrics from CloudWatch
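As a rough illustration of "updating the Desired value", here is a minimal Python sketch. It assumes a boto3 Auto Scaling client is passed in (for example `boto3.client("autoscaling")`); the helper name `scale_to` is hypothetical and not part of the talk.

```python
def scale_to(autoscaling_client, asg_name, desired):
    """Set the Desired value of an Auto Scaling Group.

    autoscaling_client is assumed to be a boto3 Auto Scaling client,
    e.g. boto3.client("autoscaling"). scale_to is a hypothetical helper.
    """
    if desired < 0:
        raise ValueError("desired capacity cannot be negative")
    # set_desired_capacity is the boto3 call that changes the Desired value.
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=desired,
    )
    return desired
```

With real credentials, a call such as `scale_to(boto3.client("autoscaling"), "as-hadoop-staging-spot", 100)` would ask the ASG to converge on 100 instances, provided the value lies within the group's Min and Max.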

Slide 4

Slide 4 text

Problem 1 - Cost of On Demand Instances
● By default you would spin up On Demand instances on AWS
● If you’re running 100 c3.2xlarge instances, you’re spending 100 * 0.420 = $42 per hour and 100 * 0.420 * 24 = $1008 per day
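The arithmetic on this slide can be reproduced with a few lines of Python. The function name `on_demand_cost` is made up for illustration; the instance count and the $0.420 hourly price are the figures from the slide.

```python
def on_demand_cost(instances, hourly_price, hours=1):
    """Total On Demand cost in dollars for a fleet over a number of hours."""
    return instances * hourly_price * hours


# Figures from the slide: 100 c3.2xlarge instances at $0.420 per hour.
hourly = on_demand_cost(100, 0.420)
daily = on_demand_cost(100, 0.420, hours=24)
print(f"${hourly:.2f} per hour, ${daily:.2f} per day")
# prints "$42.00 per hour, $1008.00 per day"
```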

Slide 5

Slide 5 text

Use Spot Instances (if applicable)

Slide 6

Slide 6 text

AWS Spot Primer
● AWS leases unused hardware at a lower cost as Spot instances
● No guarantees on how long they’re available
● Spot prices are highly volatile
● But highly cost effective if used right
● Spot’s “Demand vs Supply” is local to its Spot market

Slide 7

Slide 7 text

AWS Spot Markets
For a cluster, the spot markets can be viewed along the following dimensions
● Instance Types, Regions and Availability Zones
The number of spot markets is the product of the sizes of these dimensions.
Example - Requirement for 36 CPUs per instance
● Instance Types - [d2.8xlarge, c4.8xlarge]
● AZs - [us-east-1a, us-east-1b, us-east-1c, …]
● Region - [us-east, us-west, …] - 9 regions
● Total in US-EAST (alone) => 2 * 1 * 5 = 10 spot markets
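The product rule on this slide can be checked by enumerating the markets explicitly. This is only a sketch: the slide elides the full AZ list, so the five AZ names below are placeholders (`az-1` to `az-5`), not real AZ identifiers.

```python
from itertools import product

instance_types = ["d2.8xlarge", "c4.8xlarge"]  # from the slide
regions = ["us-east"]                          # one region, as in the example
# Placeholder names: the slide only says there are 5 AZs in this region.
azs = [f"az-{i}" for i in range(1, 6)]

# One spot market per (instance type, region, AZ) combination.
spot_markets = list(product(instance_types, regions, azs))
print(len(spot_markets))  # prints 10
```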

Slide 8

Slide 8 text

Problem 2 - Spot Outages
● High risk, high reward game
● While you get the best value in terms of cost for the same compute, there’s no guarantee on when the machines are going to be taken away
● There’s always the Spot Termination Notice, but unfortunately not all applications are AWS-aware; Hadoop, in our case, is not
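To make "AWS-aware" concrete, here is a hedged Python sketch of how an application could poll for the Spot Termination Notice through the EC2 instance metadata service. The function names and the idea of a "drain" step are assumptions for illustration, and instances that enforce IMDSv2 would additionally need a session token, which this sketch leaves out.

```python
import urllib.error
import urllib.request

# Instance metadata path that returns a timestamp once the instance is
# marked for termination, and an HTTP error (404) before that.
TERMINATION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/termination-time"
)


def fetch_termination_time(url=TERMINATION_URL, timeout=2):
    """Return the termination timestamp as text, or None if there is none."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError):
        # 404, timeout, or not running on EC2: treat as "no notice".
        return None


def should_drain(fetch=fetch_termination_time):
    """True when a termination notice is present (hypothetical helper)."""
    return bool(fetch())
```

An agent on each node could call `should_drain()` every few seconds and, on `True`, stop accepting new work. Stock Hadoop has no such hook, which is the gap this slide points at.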

Slide 9

Slide 9 text

Split the Spot Instances Across AZs

Slide 10

Slide 10 text

Problem 3 - Cost of Data Transfer
● We had this running for a while with good success
● We saw our AWS bill gradually increasing in the “Data Transfer” section
● Because the HDFS write pipeline is not aware of AWS cross-AZ data transfer costs
● With HDFS on Spot, each machine going down meant replication would kick in from the start

Slide 11

Slide 11 text

Back to Single AZ
Find a way around Spot outages within an AZ

Slide 12

Slide 12 text

Deep Dive github.com/ind9/matsya

Slide 13

Slide 13 text

Matsya - Motivation
● You run Spot clusters to save on costs
● Clusters span across AZs to protect against Spot price fluctuations
● Results in HUGE data transfer costs
● ASGs always try to distribute the machines evenly and don’t take cost into account

Slide 14

Slide 14 text

Matsya
● Goal - Always optimize for cost and keep the fleet running
● Scala app that monitors spot prices and moves the ASG to the cheapest AZ
● Meant to be run as a cron task
● Can fall back to OD (On Demand) if required
● Posts notifications to Slack when migrating
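The core idea of "move the ASG to the cheapest AZ" can be sketched in Python as below. This is not Matsya's actual code (Matsya is a Scala app); the function name and the prices are invented for illustration. In practice the prices could come from something like the EC2 spot price history API.

```python
def cheapest_az(spot_prices):
    """Return the AZ with the lowest current spot price.

    spot_prices maps AZ name -> current spot price in dollars per hour.
    """
    if not spot_prices:
        raise ValueError("no spot prices available")
    return min(spot_prices, key=spot_prices.get)


# Made-up prices for one instance type, purely for illustration.
prices = {"us-east-1a": 0.31, "us-east-1b": 0.12, "us-east-1c": 0.45}
print(cheapest_az(prices))  # prints us-east-1b
```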

Slide 15

Slide 15 text

How Matsya Works?

Slide 16

Slide 16 text

How Matsya Works? ASG

Slide 17

Slide 17 text

How Matsya Works? Spot ASG

Slide 18

Slide 18 text

How Matsya Works? us-east-1a us-east-1c ... Spot ASG

Slide 19

Slide 19 text

How Matsya Works? us-east-1a us-east-1c ... Spot ASG

Slide 20

Slide 20 text

How Matsya Works? OD ASG us-east-1a us-east-1c ... Spot ASG (Optional)

Slide 21

Slide 21 text

Matsya - Configuration
matsya {
  working-dir = "local_run"
  slack-webhook = "http://hooks.slack.com/services/foo/bar/baz"
  clusters = [{
    name = "Staging Hadoop Cluster"
    spot-asg = "as-hadoop-staging-spot"
    od-asg = "as-hadoop-staging-od"
    machine-type = "c3.2xlarge"
    bid-price = 0.420
    od-price = 0.420
    max-threshold = 0.99
    nr-of-times = 3
    fallback-to-od = false
    subnets = {
      "us-east-1a" = "subnet-east-1a"
      "us-east-1b" = "subnet-east-1b"
      "us-east-1c" = "subnet-east-1c"
    }
  }]
}
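One way to read the `bid-price`, `max-threshold`, `nr-of-times` and `fallback-to-od` settings above is sketched below in Python. The semantics here are an assumption, not confirmed by the slides: a "violation" is taken to mean the spot price exceeding `bid-price * max-threshold`, and a move is considered only after `nr-of-times` consecutive violations. The function name `decide` and the sample prices are hypothetical; Matsya's real logic may differ.

```python
def decide(recent_prices, other_az_prices, bid_price,
           max_threshold, nr_of_times, fallback_to_od):
    """Return an (action, target_az) pair for one cluster.

    recent_prices:   spot prices seen in the current AZ, oldest first
    other_az_prices: AZ name -> current spot price for candidate AZs
    Assumed semantics: act only after nr_of_times consecutive prices
    above bid_price * max_threshold.
    """
    limit = bid_price * max_threshold
    recent = recent_prices[-nr_of_times:]
    violated = len(recent) == nr_of_times and all(p > limit for p in recent)
    if not violated:
        return ("stay", None)
    # Only AZs priced at or under the limit are worth moving to.
    candidates = {az: p for az, p in other_az_prices.items() if p <= limit}
    if candidates:
        return ("move", min(candidates, key=candidates.get))
    if fallback_to_od:
        return ("fallback-to-od", None)
    return ("stay", None)


# Values from the configuration above, with made-up price observations.
action = decide([0.50, 0.50, 0.50],
                {"us-east-1b": 0.20, "us-east-1c": 0.30},
                bid_price=0.420, max_threshold=0.99,
                nr_of_times=3, fallback_to_od=False)
print(action)  # prints ('move', 'us-east-1b')
```

Requiring several consecutive violations before acting keeps a cron-driven tool from bouncing the cluster between AZs on a single price spike.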

Slide 22

Slide 22 text

Matsya at Indix
Deployed across the board for 6 months now
Along with Vamana, it enabled us to achieve
● ~50% of AWS infrastructure on Spot
● 100% of Hadoop MR workloads on Spot

Slide 23

Slide 23 text

Work In Progress
● Support for other Spot products - Spot Fleet and Spot Blocks
● More notification systems
● Multiple Region support
● Multiple Product support
● Minimum number of OD instances
Questions?
github.com/ind9/matsya

Slide 24

Slide 24 text

References
http://serverfault.com/questions/448746/ec2-auto-scaling-with-spot-and-on-demand-instances
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
https://aws.amazon.com/ec2/spot/bid-advisor/