Matsya - Apr '16

Slide deck I created for a flash talk at RootConf, but I never got a chance to present it there.

Ashwanth Kumar

April 14, 2016

Transcript

  1. Typical Hadoop Setup on AWS

    • An Auto Scaling Group (ASG) based set of instances running TaskTrackers and DataNodes
    • Scaling up or down is as easy as updating the Desired value (assuming you don't have any Auto Scaling metrics), as sketched below
    • You can also do auto scaling based on specific metrics from CloudWatch
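
    A minimal sketch of what "updating the Desired value" looks like programmatically, using the AWS SDK for Java from Scala. The group name is borrowed from the configuration example later in the deck and the capacity is hypothetical; this is not part of Matsya itself.

      import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
      import com.amazonaws.services.autoscaling.model.SetDesiredCapacityRequest

      object ResizeCluster {
        def main(args: Array[String]): Unit = {
          val autoScaling = AmazonAutoScalingClientBuilder.defaultClient()
          // Scale the Hadoop worker ASG to 100 instances (example group name and size)
          autoScaling.setDesiredCapacity(new SetDesiredCapacityRequest()
            .withAutoScalingGroupName("as-hadoop-staging-spot")
            .withDesiredCapacity(100))
        }
      }
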
  2. Problem 1 - Cost of On Demand Instances

    • By default you would spin up On Demand instances on AWS
    • If you're running 100 c3.2xlarge instances, you're spending 100 * $0.420 = $42 per hour and $1008 ($42 * 24) per day (worked out in the snippet below)
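
    The same arithmetic, spelled out as a tiny Scala snippet; $0.420/hour is the deck's example On Demand rate for c3.2xlarge.

      object OnDemandCost {
        def main(args: Array[String]): Unit = {
          val instances  = 100
          val hourlyRate = 0.420                  // USD per c3.2xlarge On Demand hour (deck's example)
          val perHour    = instances * hourlyRate
          val perDay     = perHour * 24
          println(f"$$$perHour%.2f per hour, $$$perDay%.2f per day")  // prints: $42.00 per hour, $1008.00 per day
        }
      }
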
  3. AWS Spot Primer

    • AWS leases unused hardware at a lower cost as Spot instances
    • No guarantees on how long they're available
    • Spot prices are highly volatile (see the price-history sketch below)
    • But highly cost effective if used right
    • Spot's "Demand vs Supply" is local to its Spot Market
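
    One way to see that volatility for yourself is to pull recent spot price history from the EC2 API. A minimal sketch using the AWS SDK for Java from Scala; the instance type, AZ and product description are just example parameters.

      import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
      import com.amazonaws.services.ec2.model.DescribeSpotPriceHistoryRequest
      import scala.jdk.CollectionConverters._

      object SpotPrices {
        def main(args: Array[String]): Unit = {
          val ec2 = AmazonEC2ClientBuilder.defaultClient()
          // Ask EC2 for recent spot prices of one instance type in one AZ
          val result = ec2.describeSpotPriceHistory(new DescribeSpotPriceHistoryRequest()
            .withInstanceTypes("c3.2xlarge")
            .withAvailabilityZone("us-east-1a")
            .withProductDescriptions("Linux/UNIX (Amazon VPC)"))
          result.getSpotPriceHistory.asScala.foreach { p =>
            println(s"${p.getTimestamp} ${p.getAvailabilityZone} ${p.getSpotPrice}")
          }
        }
      }
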
  4. AWS Spot Markets

    For a cluster, the spot markets can be viewed in the following dimensions:
    • Instance Types, Regions and Availability Zones
    The number of spot markets is the product of all the above numbers.
    Example - Requirement for 36 CPUs per instance
    • Instance Types - [d2.8xlarge, c4.8xlarge]
    • AZs - [us-east-1a, us-east-1b, us-east-1c, …]
    • Regions - [us-east, us-west, …] - 9 regions
    • Total in US-EAST (alone) => 2 instance types * 1 region * 5 AZs = 10 spot markets (enumerated in the sketch below)
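
    A small Scala sketch that enumerates the cross product the example describes. Only the three AZs named on the slide are listed here; with all 5 AZs of us-east-1, as in the deck's total, the count is 10.

      object SpotMarkets {
        // A "spot market" is one (instance type, availability zone) pair within a region.
        def markets(instanceTypes: Seq[String], azs: Seq[String]): Seq[(String, String)] =
          for (it <- instanceTypes; az <- azs) yield (it, az)

        def main(args: Array[String]): Unit = {
          val instanceTypes = Seq("d2.8xlarge", "c4.8xlarge")        // from the deck's example
          val azs = Seq("us-east-1a", "us-east-1b", "us-east-1c")    // the deck lists these plus more
          val all = markets(instanceTypes, azs)
          all.foreach(println)
          println(s"${all.size} spot markets")  // 2 types * 3 AZs = 6 here; 2 * 1 * 5 = 10 with all us-east-1 AZs
        }
      }
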
  5. Problem 2 - Spot Outages

    • A high risk and reward game
    • While you get the best value in terms of cost for the same compute, there's no guarantee on when the machines are going to be taken away
    • There's always the Spot Termination Notice (polled in the sketch below), but unfortunately not all applications are AWS-aware - like Hadoop in our case
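
    The Spot Termination Notice is exposed through the EC2 instance metadata endpoint: an instance that is about to be reclaimed gets a termination timestamp roughly two minutes in advance. A minimal polling sketch in plain Scala; the metadata URL is the standard path, everything else is illustrative.

      import scala.io.Source
      import scala.util.Try

      object TerminationWatcher {
        // Returns the termination timestamp if this spot instance has been marked for reclamation.
        def terminationNotice(): Option[String] =
          Try(Source.fromURL("http://169.254.169.254/latest/meta-data/spot/termination-time").mkString)
            .toOption          // the path 404s (throws) while no termination is scheduled
            .filter(_.nonEmpty)

        def main(args: Array[String]): Unit = {
          while (true) {
            terminationNotice().foreach { when =>
              println(s"Instance will be reclaimed at $when, start draining...")
            }
            Thread.sleep(5000) // the notice arrives ~2 minutes before the instance is taken away
          }
        }
      }
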
  6. Problem 3 - Cost of Data Transfer

    • We had this running for a while with good success
    • Then we saw our AWS bill gradually increasing in the "Data Transfer" section
    • Because the HDFS write pipeline is not aware of AWS cross-AZ data transfer costs
    • With HDFS on Spot, each machine going down meant re-replication kicked in from scratch
  7. Matsya - Motivation

    • You run Spot clusters to save on costs
    • Clusters span across AZs to protect against Spot price fluctuations
    • This results in HUGE data transfer costs
    • The ASG always tries to distribute the machines evenly and doesn't take cost into account
  8. Matsya

    • Goal - Always optimize for cost and keep the fleet running
    • A Scala app that monitors spot prices and moves the ASG to the cheapest AZ (sketched below)
    • Meant to be run as a CRON task
    • Can fall back to OD (if required)
    • Posts notifications to Slack when migrating
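
    A rough sketch of the idea described above, not Matsya's actual code: take the latest spot price per AZ, pick the cheapest one under the bid price, and point the ASG at that AZ's subnet by updating its VPCZoneIdentifier through the AWS SDK for Java. The price map and group name are hypothetical examples, and the max-threshold / nr-of-times checks from the real tool are omitted.

      import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
      import com.amazonaws.services.autoscaling.model.UpdateAutoScalingGroupRequest

      object CheapestAzMigration {
        // prices: latest spot price per AZ; subnets: AZ -> subnet id (as in the Matsya config)
        def cheapestAz(prices: Map[String, Double]): (String, Double) = prices.minBy(_._2)

        def main(args: Array[String]): Unit = {
          val prices  = Map("us-east-1a" -> 0.21, "us-east-1b" -> 0.09, "us-east-1c" -> 0.35) // hypothetical
          val subnets = Map("us-east-1a" -> "subnet-east-1a",
                            "us-east-1b" -> "subnet-east-1b",
                            "us-east-1c" -> "subnet-east-1c")
          val bidPrice = 0.420

          val (az, price) = cheapestAz(prices)
          if (price < bidPrice) {
            // Move the spot ASG into the cheapest AZ by swapping its subnet
            AmazonAutoScalingClientBuilder.defaultClient().updateAutoScalingGroup(
              new UpdateAutoScalingGroupRequest()
                .withAutoScalingGroupName("as-hadoop-staging-spot")
                .withVPCZoneIdentifier(subnets(az)))
            println(s"Migrated to $az at $$$price/hr")  // Matsya would post this to Slack instead
          }
        }
      }
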
  9. Matsya - Configuration

    matsya {
      working-dir = "local_run"
      slack-webhook = "http://hooks.slack.com/services/foo/bar/baz"
      clusters = [{
        name = "Staging Hadoop Cluster"
        spot-asg = "as-hadoop-staging-spot"
        od-asg = "as-hadoop-staging-od"
        machine-type = "c3.2xlarge"
        bid-price = 0.420
        od-price = 0.420
        max-threshold = 0.99
        nr-of-times = 3
        fallback-to-od = false
        subnets = {
          "us-east-1a" = "subnet-east-1a"
          "us-east-1b" = "subnet-east-1b"
          "us-east-1c" = "subnet-east-1c"
        }
      }]
    }
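
    The file above is HOCON, so it can be loaded with the Typesafe Config library. A minimal loading sketch for the fields shown; the case class and loader are illustrative and not necessarily how Matsya itself parses its configuration.

      import com.typesafe.config.ConfigFactory
      import scala.jdk.CollectionConverters._

      // Illustrative view of one entry of matsya.clusters (field names follow the sample config)
      case class ClusterConfig(name: String, spotAsg: String, odAsg: String, machineType: String,
                               bidPrice: Double, maxThreshold: Double, fallbackToOd: Boolean,
                               subnets: Map[String, String])

      object MatsyaConfig {
        def load(path: String): Seq[ClusterConfig] = {
          val root = ConfigFactory.parseFile(new java.io.File(path)).getConfig("matsya")
          root.getConfigList("clusters").asScala.toSeq.map { c =>
            ClusterConfig(
              name         = c.getString("name"),
              spotAsg      = c.getString("spot-asg"),
              odAsg        = c.getString("od-asg"),
              machineType  = c.getString("machine-type"),
              bidPrice     = c.getDouble("bid-price"),
              maxThreshold = c.getDouble("max-threshold"),
              fallbackToOd = c.getBoolean("fallback-to-od"),
              // each subnet entry maps an AZ name to the subnet the ASG should be moved into
              subnets      = c.getConfig("subnets").root().unwrapped().asScala
                              .map { case (az, id) => az -> id.toString }.toMap)
          }
        }
      }
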
  10. Matsya at Indix

    • Deployed across the board for 6 months now
    • Along with Vamana, it enabled us to achieve:
      • ~50% of our AWS infrastructure is on Spot
      • 100% of Hadoop MR workloads are on Spot
  11. Work In Progress

    • Support for other Spot products - Spot Fleet and Spot Blocks
    • More notification systems
    • Multiple Region support
    • Multiple Product support
    • Minimum number of OD instances

    Questions? github.com/ind9/matsya