Building the World's Largest Websites with Consul and Terraform

This is a talk given at Xebicon.nl

A lot of context is missing, but Xebia said they would be uploading video of the talk shortly; search for the talk title (the same as the title of this deck) to see if it's up. The talk is much more informative, including real-world cases from some of our biggest users.

Mitchell Hashimoto

June 04, 2015

Transcript

  1. Building the World's Largest Websites with Consul and Terraform
  2. @mitchellh Mitchell Hashimoto
  3. None
  4. DC EVOLUTION: Challenges of the modern datacenter
  5. RISING DATACENTER COMPLEXITY (diagram: a single DC)
  6. RISING DATACENTER COMPLEXITY (diagram: a single DC)
  7. RISING DATACENTER COMPLEXITY (diagram: a DC full of VMs)
  8. RISING DATACENTER COMPLEXITY (diagram: a DC full of VMs, each packed with containers)
  9. RISING DATACENTER COMPLEXITY (diagram: a DC plus external DNS, database, and CDN)
  10. RISING DATACENTER COMPLEXITY (diagram: DC-01 and DC-02)
  11. RISING DATACENTER COMPLEXITY (diagram: DC-01 and DC-02, each with VMs and containers)
  12. RISING DATACENTER COMPLEXITY (diagram: IaaS, PaaS, SaaS)
  13. RISING DATACENTER COMPLEXITY

  14. CONSUL consul.io

  15. Service discovery, configuration, and orchestration made easy. Distributed, highly available, and datacenter-aware.
  16. Questions that Consul Answers
      • Where is the service foo? (ex. Where is the database?)
      • What is the health status of service foo?
      • What is the health status of the machine/node foo?
      • What is the list of all currently running machines?
      • What is the configuration of service foo?
      • Is anyone else currently performing operation foo?
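
      A minimal sketch of how a few of these questions can be asked of a local agent over Consul's DNS and HTTP interfaces (the service name foo and the KV path are placeholders, not from the deck):

        # Where is the service foo, and is it healthy?
        $ dig @127.0.0.1 -p 8600 foo.service.consul SRV
        $ curl http://127.0.0.1:8500/v1/health/service/foo?passing

        # What is the configuration of service foo? (KV path is an assumption)
        $ curl http://127.0.0.1:8500/v1/kv/service/foo/config?raw
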
  17. terraform.io TERRAFORM

  18. Build, combine, and launch infrastructure safely and efficiently. terraform.io

  19. What if I asked you to…
      • create a completely isolated second environment to run an application (staging, QA, dev, etc.)?
      • deploy a complex new application?
      • update an existing complex application?
      • document how our infrastructure is architected?
      • delegate some ops to smaller teams? (Core IT vs. App IT)
  20. SCALABILITY, RESILIENCY, DETERMINISM

  21. SCALABILITY
      • Expectation of high QPS per resource
      • CPU, memory are valuable resources
      • One less server for utility = one more server for serving customers
      • Push vs. pull, a.k.a. edge-triggered changes
  22. RESILIENCY
      • Probability of failure goes up with scale
      • Embrace failure and make it acceptable
      • Constant change at some scale
      • Self-healing systems become much more important (automatic anti-entropy)
      • Central sources of truth become liabilities
  23. DETERMINISM
      • Understand the full effect of a change
      • Predictable (but not necessarily strict) ordering of a change
      • Limiting surprises that can cause downtime
  24. CHALLENGE #1: DECENTRALIZED SERVICE CONFIGURATION
  25. TRADITIONAL SERVICE CONFIGURATION: Pull-based, long intervals, computationally expensive (diagram: WEB 1, WEB 2, WEB N pull from a CONFIG MGMT SERVER at 14:00, 14:07, 14:03)
  26. CONSUL K/V + CONSUL-TEMPLATE: Push-based, "instant", predictable computational cost (diagram: CONSUL pushes to WEB 1, WEB 2, WEB N at 14:00:00.311, 14:00:00.731, 14:00:00.415)
  27. CONSUL-TEMPLATE: Template Example

      global
          daemon
          maxconn {{key "haproxy/maxconn"}}

      defaults
          mode {{key "haproxy/mode"}}{{range ls "haproxy/timeouts"}}
          timeout {{.Key}} {{.Value}}{{end}}

      listen http-in
          bind *:8000{{range service "release.web"}}
          server {{.Node}} {{.Address}}:{{.Port}}{{end}}

  28. CONSUL-TEMPLATE: Execute (as a service)

      $ consul-template \
          -consul demo.consul.io \
          -template "haproxy.ctmpl:/etc/haproxy/haproxy.conf:restart haproxy" -dry

  29. STEP BY STEP
      1. Config management puts down the configuration template
      2. consul-template runs as a service
      3. Edge triggers config changes, restarts the service
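
      Writing a key is enough to trigger the edge; for example, against a local agent (the value is illustrative), consul-template re-renders the template and restarts HAProxy:

        $ curl -X PUT -d '4096' http://127.0.0.1:8500/v1/kv/haproxy/maxconn
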
  30. CHALLENGE #2: SCALABLE SERVICE DISCOVERY

  31. ZERO TTL DNS
      • Long-held connections to minimize DNS overhead
      • Zero TTL ensures the most up-to-date information
  32. RESILIENCY
      • Low-TTL DNS records
      • Ensures availability even if Consul is unavailable
      • Required for short-held connections, since DNS lookup overhead is too high with zero TTL
  33. OPTION #1: CONSUL SETTINGS. Per-service, stale reads on non-leaders (diagram: WEB PROCESS sends a DNS query to the local CONSUL AGENT, answered by the CONSUL LEADER or a CONSUL STANDBY)
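
      A sketch of what this might look like in the agent configuration (values and service names are illustrative, not from the deck): per-service TTLs plus stale reads so non-leader servers can answer DNS queries.

        {
          "dns_config": {
            "allow_stale": true,
            "max_stale": "5s",
            "node_ttl": "0s",
            "service_ttl": {
              "*": "0s",
              "web": "5s"
            }
          }
        }
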
  34. OPTION #2: DNSMASQ + CONSUL. Global, works if Consul is down (diagram: WEB PROCESS sends a DNS query to DNSMASQ, which sits in front of the CONSUL AGENT)
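
      A common way to wire option #2, assuming dnsmasq handles normal resolution and forwards only the .consul domain to the local agent's DNS port (the file path is illustrative):

        # /etc/dnsmasq.d/10-consul
        # Forward *.consul queries to the local Consul agent.
        server=/consul/127.0.0.1#8600
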
  35. OPTION #3: APPLICATION-LEVEL CACHE. Works if almost everything is down, strict control over cache times (diagram: WEB PROCESS resolves through an IN-MEM CACHE backed by the CONSUL AGENT)
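
      A minimal sketch of option #3 in Go, assuming the application resolves names like web.service.consul through the local agent and wants direct control over how long answers are cached (all names here are hypothetical):

        package main

        import (
            "fmt"
            "net"
            "sync"
            "time"
        )

        // cachedResult holds one resolved answer and when it goes stale.
        type cachedResult struct {
            addrs   []string
            expires time.Time
        }

        // CachingResolver wraps DNS lookups with an in-memory cache whose
        // TTL the application controls directly.
        type CachingResolver struct {
            mu    sync.Mutex
            ttl   time.Duration
            cache map[string]cachedResult
        }

        func NewCachingResolver(ttl time.Duration) *CachingResolver {
            return &CachingResolver{ttl: ttl, cache: make(map[string]cachedResult)}
        }

        // Lookup returns cached addresses while they are fresh; otherwise it
        // asks the resolver (the local Consul agent or dnsmasq) and caches the
        // answer. If the lookup fails, it falls back to stale data when available.
        func (r *CachingResolver) Lookup(host string) ([]string, error) {
            r.mu.Lock()
            cached, ok := r.cache[host]
            r.mu.Unlock()
            if ok && time.Now().Before(cached.expires) {
                return cached.addrs, nil
            }

            addrs, err := net.LookupHost(host)
            if err != nil {
                if ok { // serve stale data if the discovery layer is down
                    return cached.addrs, nil
                }
                return nil, err
            }

            r.mu.Lock()
            r.cache[host] = cachedResult{addrs: addrs, expires: time.Now().Add(r.ttl)}
            r.mu.Unlock()
            return addrs, nil
        }

        func main() {
            resolver := NewCachingResolver(5 * time.Second)
            addrs, err := resolver.Lookup("web.service.consul")
            fmt.Println(addrs, err)
        }
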
  36. BEST OPTION? The first two options are usually good enough and will buy you a lot of runway.
  37. CHALLENGE #3: MONITORING AT SCALE

  38. TRADITIONAL MONITORING: Pushes information into a silo (diagram: WEB 1, WEB 2, WEB N push to a MONITORING SERVICE)
  39. TRADITIONAL MONITORING: Pushes information into a silo (same diagram)
  40. CONSUL MONITORING: Removes unhealthy nodes from the service discovery layer (diagram: WEB 1, WEB 2, WEB N checked by CONSUL)
  41. CONSUL MONITORING: Removes unhealthy nodes from the service discovery layer (same diagram)
  42. CONSUL MONITORING: dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6
  43. CONSUL MONITORING: dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6
  44. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6 (the unhealthy node is dropped)
  45. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  46. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  47. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  48. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  49. CONSUL MONITORING: dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6 (the recovered node is re-added)
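
      The health information behind this comes from ordinary check definitions registered with the local agent; a sketch of a service definition with a script check (name, port, and health endpoint are assumptions). Nodes whose checks fail simply stop appearing in answers for web.service.consul.

        {
          "service": {
            "name": "web",
            "port": 80,
            "check": {
              "script": "curl -sf http://localhost:80/health",
              "interval": "10s"
            }
          }
        }
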
  50. CONSUL MONITORING + ALERTING: Via Atlas (HashiCorp paid offering) (diagram: WEB 1 … WEB N report to CONSUL, which feeds ATLAS, which alerts via EMAIL, SLACK, PAGERDUTY)
  51. CONSUL MONITORING + ALERTING: Atlas UI
  52. CONSUL MONITORING + ALERTING: History
  53. CONSUL MONITORING + ALERTING: Email, PagerDuty, Slack, etc.

  54. CHALLENGE #4: SERVICE RESILIENCY VIA DISTRIBUTED LOCKING
  55. CONSUL DISTRIBUTED LOCKING: Ensure "at most N" tasks (diagram: WEB 1, WEB 2, WEB LEADER coordinating through CONSUL)
  56. CONSUL DISTRIBUTED LOCKING: Ensure "at most N" tasks (same diagram)
  57. CONSUL DISTRIBUTED LOCKING: Ensure "at most N" tasks (diagram: WEB 1, WEB LEADER, WEB N)
  58. API or CLI. CLI Example:
      $ consul lock locks/ ./configure-f5

  59. DISTRIBUTED LOCKS
      • Building block for distributed systems
      • Complexity hidden from downstream applications, like a mutex stdlib
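
      For example, the same command can be scheduled on every web node; whichever node acquires the lock runs the child process and the others block until it is released, so at most one copy runs at a time (the script name is hypothetical):

        $ consul lock locks/nightly-report ./generate-report.sh
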
  60. CHALLENGE #5: SERVICE ORCHESTRATION VIA EVENTS, WATCHES

  61. CONSUL EVENTS: Edge-triggered, sent to all nodes, extremely cheap (diagram: consul event deploy fanned out through CONSUL to WEB 1, WEB 2, WEB N)
  62. CONSUL WATCH: Waits for specific events, executes a script (diagram: consul event deploy delivered through CONSUL to WEB 1, WEB 2, WEB N)
  63. CONSUL EXEC: Runs a script on specific nodes (diagram: WEB 1, WEB 2, DATABASE; consul exec -service="web" ./script.sh)
  64. Consul Watch: Wait for events, do something
      $ consul watch -type=event -name=deploy ./deploy.sh …
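
      A sketch of what the handler might look like: the watch passes the matching events to the script as a JSON array on stdin, with each payload base64-encoded (jq is an assumption, not part of the deck):

        #!/bin/sh
        # deploy.sh: handler invoked by `consul watch -type=event -name=deploy`
        version=$(jq -r '.[-1].Payload' | base64 --decode)
        echo "deploying version ${version}"
        # ...fetch the release, restart the service, etc.
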
  65. USE CASES
      • Deploys
      • Operational tasks
      • Configure external services
  66. CHALLENGE #6: DETERMINISTIC LARGE-SCALE INFRASTRUCTURE CHANGE

  67. LARGE SCALE INFRA UPDATE
      • Unexpected inter-dependencies
      • Cross-cloud changes
      • Ordering for minimal disruption
      • Expected time for complete rollout
  68. Terraform Plan: What are you going to do?

      + digitalocean_droplet.web
          backups:              "" => "<computed>"
          image:                "" => "centos-5-8-x32"
          ipv4_address:         "" => "<computed>"
          ipv4_address_private: "" => "<computed>"
          name:                 "" => "tf-web"
          private_networking:   "" => "<computed>"
          region:               "" => "sfo1"
          size:                 "" => "512mb"
          status:               "" => "<computed>"

      + dnsimple_record.hello
          domain:    "" => "example.com"
          domain_id: "" => "<computed>"
          hostname:  "" => "<computed>"
          name:      "" => "test"

  69. Terraform Graph: What order are you going to do things?
      $ terraform graph …
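
      The output is in DOT format, so it can be rendered with Graphviz, for example:

        $ terraform graph | dot -Tpng > graph.png
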
  70. CHALLENGE #7: DELEGATING OPS TO MULTIPLE TEAMS

  71. OPS DELEGATION
      • "Core" operations teams
      • Application operations teams
      • Eliminate shadow ops
      • Safely make changes without negatively affecting others
      • Share operations knowledge
  72. Modules: Unit of knowledge sharing

      module "consul" {
        source  = "github.com/hashicorp/consul/terraform/aws"
        servers = 5
      }

      output "consul-address" {
        value = "${module.consul.server_address}"
      }
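
      Module sources are downloaded before planning, along the lines of:

        $ terraform get
        $ terraform plan
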
  73. Remote State: Unit of resource sharing

      resource "terraform_remote_state" "consul" {
        backend = "atlas"
        config {
          path = "hashicorp/consul-prod"
        }
      }

      output "consul-address" {
        value = "${terraform_remote_state.consul.addr}"
      }
  74. CHALLENGE #8: SERVICE COMPOSITION, INFRASTRUCTURE ORCHESTRATION

  75. SERVICE COMPOSITION
      • Modern infrastructures are almost always "multi-provider": DNS in CloudFlare, compute in AWS, etc.
      • Infrastructure change requires composing data from multiple services, executing change in multiple services
  76. Service Composition: Connecting multiple service providers

      resource "aws_instance" "web" {
        # …
      }

      resource "cloudflare_record" "www" {
        domain = "foo.com"
        name   = "www"
        value  = "${aws_instance.web.private_ip}"
        type   = "A"
      }
  77. Logical Resources: Now you're thinking in graphs

      resource "template_file" "data" {
        filename = "data.tpl"
        vars {
          address = "${var.addr}"
        }
      }

      resource "aws_instance" "web" {
        user_data = "${template_file.data.rendered}"
      }

      resource "cloudflare_record" "www" {
        domain = "foo.com"
        name   = "www"
        value  = "${aws_instance.web.private_ip}"
        type   = "A"
      }
  78. CHALLENGE #9: HISTORY OF INFRASTRUCTURE CHANGE

  79. HISTORY OF INFRASTRUCTURE CHANGE: Via Atlas (HashiCorp paid offering)

  80. HISTORY OF INFRA CHANGE
      • See who did what, when, and how
      • See what changed recently to diagnose some monitoring event
      • Treat infrastructure as a sort of application
  81. CHALLENGE #10: INFRASTRUCTURE COLLABORATION

  82. INFRA COLLABORATION
      • Achieve application-like collaboration with infrastructure change
      • Code reviews, safe merges
      • Understanding the effect of infrastructure changes
  83. INFRASTRUCTURE COLLABORATION: Approve/deny plans, similar to pull requests, but for infra
  84. Thanks! QUESTIONS?