Building the World's Largest Websites

Seth Vargo
September 11, 2015

Today we are plagued by hundreds of choices when architecting a modern data center. Should our machines be virtual or physical? Should we use containers or Docker? Should we use a public cloud provider or a private cloud provider? Which configuration management tool is best to use? What about IaaS, PaaS, and SaaS? It would be manageable if these were binary choices; however, we often find ourselves in a hybrid environment. As more operations choices are added to your data center, whether through company acquisitions, a growing development team, or general technical debt, managing complexity between legacy and new systems becomes a nightmare. Yet the end goal is still the same: safely deploy your application to your infrastructure. We need to tame our data centers by managing change across systems, enforcing policies, and establishing a workflow in which developers and operations engineers can build collaboratively. This talk will discuss the problems faced in the modern data center and how a set of innovative open source tools can be used to tame the rising complexity curve. We will discuss the tools and tactics employed by some of the largest web-based companies. Join me on an adventure with Vagrant, Consul, Terraform, and more as we take your data center from chaos to control.

Transcript

  1. BUILDING THE WORLD'S LARGEST WEBSITES with Consul and Terraform

  2. SETH VARGO @sethvargo

  3. None
  4. CHALLENGE #0 EVOLUTION OF THE MODERN DATACENTER

  5. RISING DATACENTER COMPLEXITY DC

  7. RISING DATACENTER COMPLEXITY (diagram: one DC containing 16 VMs)

  8. RISING DATACENTER COMPLEXITY (diagram: one DC containing 16 VMs and many boxes labeled C)

  9. RISING DATACENTER COMPLEXITY DC DNS Database CDN

  10. RISING DATACENTER COMPLEXITY DC-01 DC-02

  11. RISING DATACENTER COMPLEXITY (diagram: DC-01 and DC-02 containing 8 VMs and many boxes labeled C)

  12. RISING DATACENTER COMPLEXITY IaaS PaaS SaaS

  13. RISING DATACENTER COMPLEXITY

  14. CHALLENGE #1 DECENTRALIZED SERVICE CONFIG

  15. TRADITIONAL SERVICE CONFIGURATION Pull-based, long intervals, computationally expensive (diagram labels: CONFIG MGMT SERVER, WEB 1, WEB 2, WEB N, with timestamps 14:00, 14:07, 14:03)

  16. CONSUL

  17. SOLVES THE 4 BASIC PROBLEMS: SERVICE DISCOVERY, LOAD BALANCING, HEALTH CHECKING, KEY-VALUE CONFIGURATION

  18. CONSUL K/V + CONSUL-TEMPLATE Push-based, “instant”, predictable computational cost (diagram labels: CONSUL, WEB 1, WEB 2, WEB N, with timestamps 14:00:00.311, 14:00:00.731, 14:00:00.415)

  19. DISTRIBUTED K/V STORE Allows for per-datacenter configuration

  20. CONSUL-TEMPLATE Template Example

        global
            daemon
            maxconn {{key "haproxy/maxconn"}}

        defaults
            mode {{key "haproxy/mode"}}{{range ls "haproxy/timeouts"}}
            timeout {{.Key}} {{.Value}}{{end}}

        listen http-in
            bind *:8000{{range service "release.web"}}
            server {{.Node}} {{.Address}}:{{.Port}}{{end}}

  21. CONSUL-TEMPLATE Execute (as a service)

        $ consul-template \
            -consul demo.consul.io \
            -template "haproxy.ctmpl:/etc/haproxy/haproxy.conf:restart haproxy" -dry

  22. STEP BY STEP

        1. Config management tooling lays down configuration template
        2. consul-template runs as a service
        3. Edge triggers config changes, restarts service
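
The step-by-step flow on slide 22 can be illustrated with a minimal Python sketch. It is not how consul-template is implemented; it only mimics the edge-triggered idea: render the config from key-value data and a server list, and fire the restart hook only when the rendered output actually changes. The names `render_haproxy`, `EdgeTriggeredRenderer`, and `on_change` are hypothetical and do not come from the talk.

```python
from typing import Callable, Dict, List, Optional, Tuple

Server = Tuple[str, str, int]  # (node, address, port); hypothetical shape


def render_haproxy(kv: Dict[str, str], servers: List[Server]) -> str:
    """Render a small HAProxy-style config from key-value data and servers."""
    lines = [
        "global",
        "    daemon",
        f"    maxconn {kv['haproxy/maxconn']}",
        "defaults",
        f"    mode {kv['haproxy/mode']}",
        "listen http-in",
        "    bind *:8000",
    ]
    for node, address, port in servers:
        lines.append(f"    server {node} {address}:{port}")
    return "\n".join(lines) + "\n"


class EdgeTriggeredRenderer:
    """Calls on_change only when the rendered config differs from the last one."""

    def __init__(self, on_change: Callable[[str], None]) -> None:
        self._on_change = on_change
        self._last: Optional[str] = None

    def update(self, kv: Dict[str, str], servers: List[Server]) -> bool:
        rendered = render_haproxy(kv, servers)
        if rendered == self._last:
            return False  # nothing changed, so no restart
        self._last = rendered
        self._on_change(rendered)  # stand-in for "write file, restart haproxy"
        return True
```

Feeding the renderer the same data twice triggers the hook once; adding a server triggers it again. That is the property that makes the computational cost predictable compared with re-converging on a timer.
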
  23. CHALLENGE #2 SCALABLE SERVICE DISCOVERY

  24. ZERO TTL DNS: long-held connections to minimize DNS overhead; zero TTL ensures most up-to-date information

  25. RESILIENCY: low-TTL DNS records; ensures availability even if Consul is unavailable; required for short-held connections since DNS lookup overhead is too high with zero TTL

  26. OPTION #1: CONSUL SETTINGS Per-service, stale reads on non-leaders (diagram labels: WEB PROCESS, DNS query, CONSUL AGENT, CONSUL LEADER, CONSUL STANDBY)

  27. OPTION #2: DNSMASQ + CONSUL Global, works if Consul is down (diagram labels: WEB PROCESS, DNS query, DNSMASQ, CONSUL AGENT, CONSUL LEADER, CONSUL STANDBY)

  29. OPTION #3: APPLICATION-LEVEL CACHE Works if almost everything is down, strict control over cache times (diagram labels: WEB PROCESS, IN-MEM CACHE, DNS query, CONSUL AGENT, CONSUL LEADER, CONSUL STANDBY)
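
Option #3, the application-level cache, might look roughly like the Python sketch below. It assumes a caller-supplied `lookup` function (hypothetical; in practice it would perform the DNS query) and an injectable clock for testing. Fresh entries are served from memory, expired entries are refreshed, and if the refresh fails the last known answer is served instead. That is the sense in which the cache keeps working when almost everything else is down.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedResolver:
    """In-memory cache in front of a service lookup, with stale fallback."""

    def __init__(
        self,
        lookup: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, name: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no lookup needed
        try:
            addresses = self._lookup(name)
        except Exception:
            if cached is not None:
                return cached[1]  # discovery is down: serve the stale answer
            raise  # nothing cached, so the failure must surface
        self._cache[name] = (now, addresses)
        return addresses
```

The TTL is the "strict control over cache times" knob: a short TTL approaches zero-TTL freshness, a long one trades freshness for fewer lookups.
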
  30. CHALLENGE #3 MONITORING AT SCALE

  31. TRADITIONAL MONITORING Pushes information into a silo (diagram labels: MONITORING SERVICE, WEB 1, WEB 2, WEB N)

  35. TRADITIONAL MONITORING Pushes information into a silo (diagram adds one marker: U)

  36. TRADITIONAL MONITORING Pushes information into a silo (diagram adds markers: U, F, F)

  37. CONSUL MONITORING Removes unhealthy nodes from service discovery layer (diagram labels: CONSUL, WEB 1, WEB 2, WEB N)

  39. CONSUL MONITORING (dig web.service.consul: 10.0.1.4, 10.0.1.5, 10.0.1.6)

  41. CONSUL MONITORING (dig web.service.consul: 10.0.1.5, 10.0.1.6)

  42. CONSUL MONITORING (dig web.service.consul: 10.0.1.5, 10.0.1.6; host: web.service.consul)

  46. CONSUL MONITORING (dig web.service.consul: 10.0.1.4, 10.0.1.5, 10.0.1.6; host: web.service.consul)
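
The sequence in slides 37 to 46 (an address disappears from the `dig` answer and later returns) can be sketched in a few lines of Python. This is only an illustration of health-aware discovery, not Consul's implementation; the `Instance` type, the node names, and the mapping of addresses to nodes are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    node: str       # hypothetical node name
    address: str
    passing: bool   # result of the most recent health check


def discover(instances: List[Instance]) -> List[str]:
    """Return only the addresses whose health check is passing."""
    return [inst.address for inst in instances if inst.passing]


web = [
    Instance("web-1", "10.0.1.4", True),
    Instance("web-2", "10.0.1.5", True),
    Instance("web-n", "10.0.1.6", True),
]
all_healthy = discover(web)

web[0].passing = False          # the first node starts failing its check
after_failure = discover(web)

web[0].passing = True           # the node recovers
after_recovery = discover(web)
```

Because the health state lives in the discovery layer itself, clients that resolve the service name stop seeing the failed node without any separate alert-then-reconfigure step.
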
  47. CONSUL MONITORING + ALERTING via Atlas (diagram labels: ATLAS, CONSUL, WEB 1, WEB 2, WEB N)

  48. CONSUL MONITORING + ALERTING Atlas UI

  52. CONSUL MONITORING + ALERTING History

  53. CHALLENGE #4 SERVICE RESILIENCY VIA DISTRIBUTED LOCKING

  54. CONSUL LOCK Allows for a new kind of "HA"

        $ consul lock [options] prefix child...

  55. CONSUL LOCK Making standby HA much simpler (diagram labels: CONSUL, VAULT 1, VAULT 2, VAULT 3)

  56. CONSUL LOCK Making standby HA much simpler (diagram adds markers: L, L)

  57. CONSUL LOCK Making standby HA much simpler (diagram shows one marker: L)

  58. CONSUL LOCK Making standby HA much simpler (diagram labels add: L, LEADER ELECTION)

  59. CONSUL LOCK Solves the "exactly one of these must always be running" problem

  60. ROLLING RESTARTS/UPGRADES (diagram: five VMs, each followed by four boxes labeled C)
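
The "exactly one of these must always be running" pattern can be sketched with an in-memory lock table in Python. This is a single-process stand-in written for illustration: Consul's real locks are built on sessions and a replicated key-value store, none of which is modeled here, and the `LockTable` class and the key and candidate names are hypothetical.

```python
from typing import Dict, Optional


class LockTable:
    """In-memory stand-in for a lock service: one holder per key."""

    def __init__(self) -> None:
        self._holders: Dict[str, str] = {}

    def acquire(self, key: str, candidate: str) -> bool:
        """Take the lock if it is free; return True if candidate holds it."""
        holder = self._holders.setdefault(key, candidate)
        return holder == candidate

    def release(self, key: str, candidate: str) -> None:
        """Release the lock, but only if candidate is the current holder."""
        if self._holders.get(key) == candidate:
            del self._holders[key]

    def holder(self, key: str) -> Optional[str]:
        return self._holders.get(key)


locks = LockTable()
first = locks.acquire("service/vault/leader", "vault-1")    # becomes active
second = locks.acquire("service/vault/leader", "vault-2")   # stays on standby
locks.release("service/vault/leader", "vault-1")            # active node goes away
takeover = locks.acquire("service/vault/leader", "vault-2") # standby takes over
```

Each standby simply keeps trying to acquire the same key; whichever one holds it runs the workload, which is why no separate failover machinery is needed.
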
  61. CHALLENGE #5 SERVICE ORCHESTRATION VIA EVENTS & WATCHES

  62. CONSUL EVENTS Edge-triggered, sent to all nodes, computationally cheap (diagram labels: CONSUL, WEB 1, WEB 2, DATABASE; command: consul event deploy)

  63. CONSUL WATCH Watch and execute script for specific events (diagram labels: CONSUL, WEB 1, WEB 2, DATABASE; command: consul event deploy)

  64. CONSUL EXEC Run arbitrary commands on nodes (diagram labels: CONSUL, WEB 1, WEB 2, DATABASE; command: consul exec -service=web ./script.sh)

  65. CONSUL WATCH Wait for event, then do something

        $ consul watch -type=event -name=deploy ./deploy.sh

  66. USE CASES: Deploys, Operational tasks, Configuring external services
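
The event-and-watch model from this section can be approximated in Python as a tiny in-process dispatcher. It is a sketch of the idea only (a named event fires, and every handler watching that name runs once); it does not model Consul's gossip-based delivery, and `EventBus`, `watch`, and `fire` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class EventBus:
    """Runs every handler registered for an event name when that event fires."""

    def __init__(self) -> None:
        self._watchers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def watch(self, name: str, handler: Handler) -> None:
        self._watchers[name].append(handler)

    def fire(self, name: str, payload: str = "") -> int:
        """Deliver the event; return how many handlers ran."""
        handlers = list(self._watchers.get(name, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


log: List[str] = []
bus = EventBus()
bus.watch("deploy", lambda payload: log.append(f"web-1 deploys {payload}"))
bus.watch("deploy", lambda payload: log.append(f"web-2 deploys {payload}"))
delivered = bus.fire("deploy", "v2")
ignored = bus.fire("restart")  # nobody is watching this event name
```

Handlers do work only when an event arrives, which is the edge-triggered property the slides contrast with periodic polling.
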

  67. CHALLENGE #6 DETERMINISTIC LARGE-SCALE INFRASTRUCTURE CHANGE

  68. LARGE SCALE UPDATE PROBLEMS: UNEXPECTED INTER-DEPENDENCIES; CROSS-CLOUD CHANGES; ORDERING FOR MINIMAL DISRUPTION; EXPECTED TIME FOR COMPLETE ROLLOUT

  69. WHAT IF I ASKED YOU TO...

  73. WHAT IF I ASKED YOU TO...

    CREATE AN EPHEMERAL ENVIRONMENT (STAGING, ETC)?
    UPDATE AN EXISTING COMPLEX APPLICATION?
    DOCUMENT YOUR INFRASTRUCTURE ARCHITECTURE?
    DELEGATE SOME OPS TO SMALLER TEAMS (CORE VS. APP IT)?
  74. TERRAFORM

  75. TERRAFORM'S GOAL

  76. PROVIDE A SINGLE WORKFLOW

  77. WITH A UNIFIED VIEW

  78. USING INFRASTRUCTURE AS CODE

  79. THAT CAN BE ITERATED AND CHANGED SAFELY

  80. CAPABLE OF COMPLEX N-TIER APPLICATIONS

  81. HOW?

  82. DIGITAL OCEAN DROPLET WITH DNS USING DNS SIMPLE

        resource "digitalocean_droplet" "web" {
          name   = "tf-web"
          size   = "512mb"
          image  = "centos-5-8-x32"
          region = "sfo1"
        }

        resource "dnsimple_record" "hello" {
          domain = "example.com"
          name   = "test"
          value  = "${digitalocean_droplet.web.ipv4_address}"
          type   = "A"
        }
  86. HUMAN-FRIENDLY CONFIG* * JSON-COMPATIBLE FOR NON-HUMANS

  87. VCS-FRIENDLY FORMAT

  88. ENTIRE INFRASTRUCTURE... IN A SINGLE TEXT FILE

  89. TERRAFORM PLAN What are you going to do?

        $ terraform plan
        + digitalocean_droplet.web
            backups:              "" => "<computed>"
            image:                "" => "centos-5-8-x32"
            ipv4_address:         "" => "<computed>"
            ipv4_address_private: "" => "<computed>"
            name:                 "" => "tf-web"
            private_networking:   "" => "<computed>"
            region:               "" => "sfo1"
            size:                 "" => "512mb"
            status:               "" => "<computed>"
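
The idea behind a plan (compare what exists with what the configuration asks for, and list the actions before doing anything) can be sketched in Python. This is a deliberately simplified illustration, not Terraform's algorithm: resources are plain dictionaries, and the `plan` function and its data shapes are assumptions. The `+`, `~`, and `-` symbols follow the convention Terraform uses for create, update, and destroy.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, Dict[str, str]]  # resource name -> attributes


def plan(current: Resources, desired: Resources) -> List[Tuple[str, str]]:
    """Return (symbol, resource) actions needed to turn current into desired."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("+", name))   # create
        elif current[name] != desired[name]:
            actions.append(("~", name))   # update
    for name in sorted(current):
        if name not in desired:
            actions.append(("-", name))   # destroy
    return actions


fresh = plan(
    current={},
    desired={"digitalocean_droplet.web": {"name": "tf-web", "size": "512mb"}},
)
```

Computing the action list without side effects is what lets a plan be reviewed before anything is changed.
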
  90. TERRAFORM GRAPH What order are you going to do things?

        $ terraform graph
        digraph {
            compound = "true"
            newrank = "true"
            subgraph "root" {
                "[root] aws_instance.haproxy" [label = "aws_instance.haproxy", shape = "box"]
                "[root] aws_instance.web" [label = "aws_instance.web", shape = "box"]
                "[root] aws_internet_gateway.terraform-tutorial" [label = "aws_internet_gateway.terraform-tutorial", shape = "box"]
                "[root] aws_route_table.terraform-tutorial" [label =
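
The ordering question on this slide is a dependency-graph problem, which can be sketched with Python's standard-library `graphlib` (Python 3.9 or later). The dependency map below is an assumption reconstructed from the droplet and DNS record example earlier in the deck, where the record's `value` references the droplet's IP address; it is an illustration, not Terraform's internal graph.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def apply_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order resources so that each one comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# The DNS record interpolates the droplet's IP address, so the droplet
# has to exist before the record can be created.
order = apply_order(
    {
        "digitalocean_droplet.web": set(),
        "dnsimple_record.hello": {"digitalocean_droplet.web"},
    }
)
```

Here `order` lists the droplet before the record. Resources with no path between them have no required relative order, which is what allows independent branches of the graph to be handled in parallel.
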
  91. CHALLENGE #7 DELEGATING OPS TO MULTIPLE TEAMS

  92. OPERATIONS DELEGATION: "CORE" OPERATIONS TEAMS; APPLICATION OPERATIONS TEAMS; ELIMINATE SHADOW OPS; SAFELY MAKE CHANGES; SHARE OPERATIONS KNOWLEDGE

  93. TERRAFORM MODULES

        module "consul" {
          source  = "github.com/hashicorp/consul/terraform/aws"
          servers = 5
          version = "0.4.0"
        }

  94. TERRAFORM MODULES

        module "consul" {
          source  = "github.com/hashicorp/consul/terraform/aws"
          servers = 5
          version = "0.4.0"
        }

        resource "dnsimple_record" "consul" {
          domain = "example.com"
          name   = "consul"
          value  = "${module.consul.ip_address}"
          type   = "A"
        }
  95. TERRAFORM REMOTE STATE

        resource "terraform_remote_state" "consul" {
          backend = "atlas"
          config {
            path = "hashicorp/consul-prod"
          }
        }

        output "consul-address" {
          value = "${terraform_remote_state.consul.addr}"
        }
  96. CHALLENGE #8 SERVICE COMPOSITION, INFRASTRUCTURE ORCHESTRATION

  97. SERVICE COMPOSITION Modern infrastructures are almost always "multi-provider": DNS in CloudFlare, compute in AWS, etc. Infrastructure change requires composing data from multiple services, executing change in multiple services

  98. SERVICE COMPOSITION

        // Terraform allows you to combine multiple external providers and
        // their outputs into a single pipeline
        resource "aws_instance" "web" {
          // Existing resource attributes
        }

        resource "cloudflare_record" "www" {
          domain = "foo.com"
          name   = "www"
          value  = "${aws_instance.web.private_ip}"
          type   = "A"
        }
  99. LOGICAL RESOURCES

        // In addition to physical resources, Terraform also has logical
        // resources such as templates
        resource "template_file" "data" {
          filename = "data.tpl"
          vars {
            address = "${var.addr}"
          }
        }

        resource "aws_instance" "web" {
          user_data = "${template_file.data.rendered}"
        }
  100. CHALLENGE #9 HISTORY OF CHANGES

  101. HISTORY OF INFRASTRUCTURE CHANGE Atlas by HashiCorp

  102. HISTORY OF INFRASTRUCTURE CHANGE Atlas by HashiCorp Who is making changes?

  103. HISTORY OF INFRASTRUCTURE CHANGE Atlas by HashiCorp How did changes occur?

  104. HISTORY OF INFRASTRUCTURE CHANGE Atlas by HashiCorp SCM-like workflow

  105. CHALLENGE #10 INFRASTRUCTURE COLLABORATION

  106. INFRASTRUCTURE COLLABORATION Approve plans - similar to pull requests, but for infrastructure. SCM integration

  107. INFRASTRUCTURE COLLABORATION Approve plans - similar to pull requests, but for infrastructure. Infrastructure change review

  108. INFRASTRUCTURE COLLABORATION Approve plans - similar to pull requests, but for infrastructure. Ability to "gate" process
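
The "gate" idea (a plan must be approved before it can be applied, much like a pull request must be reviewed before merging) can be sketched in Python. Everything here is hypothetical and is not the Atlas API: the `Plan` type, the `approve` and `apply_plan` functions, and in particular the rule that authors cannot approve their own plans are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    author: str
    changes: List[str]
    approved_by: Optional[str] = None


def approve(plan: Plan, reviewer: str) -> None:
    """Record an approval; self-approval is rejected (an assumed policy)."""
    if reviewer == plan.author:
        raise PermissionError("a plan must be approved by someone else")
    plan.approved_by = reviewer


def apply_plan(plan: Plan) -> List[str]:
    """Apply the changes only if the plan has passed the approval gate."""
    if plan.approved_by is None:
        raise PermissionError("plan has not been approved")
    return [f"applied {change}" for change in plan.changes]


proposed = Plan(author="dev", changes=["+ digitalocean_droplet.web"])
approve(proposed, reviewer="ops")
result = apply_plan(proposed)
```

Keeping the approval on the plan object also leaves a record of who reviewed each change, which ties into the history questions from Challenge #9.
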
  109. SETH VARGO @sethvargo QUESTIONS?