Building the World's Largest Websites with Consul and Terraform

This is a talk given at Xebicon.nl

A lot of context is missing, but Xebia said they would be uploading video of the talk shortly; search for the talk title (the same as the title of this deck) to see if it's up. The talk is much more informative, including real-world cases from some of our biggest users.

Mitchell Hashimoto

June 04, 2015

Transcript

  1. Building the World's Largest Websites with Consul and Terraform
  2. @mitchellh Mitchell Hashimoto
  3. None
  4. DC EVOLUTION: Challenges of the modern datacenter
  5. RISING DATACENTER COMPLEXITY (diagram: a single DC)
  6. RISING DATACENTER COMPLEXITY (diagram: a single DC)
  7. RISING DATACENTER COMPLEXITY (diagram: a DC full of VMs)
  8. RISING DATACENTER COMPLEXITY (diagram: a DC full of VMs, each packed with containers)
  9. RISING DATACENTER COMPLEXITY (diagram: a DC plus external DNS, database, and CDN)
  10. RISING DATACENTER COMPLEXITY (diagram: DC-01 and DC-02)
  11. RISING DATACENTER COMPLEXITY (diagram: DC-01 and DC-02, each with VMs and containers)
  12. RISING DATACENTER COMPLEXITY (diagram: IaaS, PaaS, SaaS)
  13. RISING DATACENTER COMPLEXITY

  14. CONSUL consul.io

  15. Service discovery, configuration, and orchestration made easy. Distributed, highly available, and datacenter-aware.
  16. Questions that Consul Answers
      • Where is the service foo? (ex. Where is the database?)
      • What is the health status of service foo?
      • What is the health status of the machine/node foo?
      • What is the list of all currently running machines?
      • What is the configuration of service foo?
      • Is anyone else currently performing operation foo?
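
      A minimal sketch of how a few of these questions can be asked of a local agent over Consul's DNS and HTTP interfaces (the service name foo and the KV path are placeholders, not from the deck):

        # Where is the service foo, and is it healthy?
        $ dig @127.0.0.1 -p 8600 foo.service.consul SRV
        $ curl http://127.0.0.1:8500/v1/health/service/foo?passing

        # What is the configuration of service foo? (KV path is an assumption)
        $ curl http://127.0.0.1:8500/v1/kv/service/foo/config?raw
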
  17. terraform.io TERRAFORM

  18. Build, combine, and launch infrastructure safely and efficiently. terraform.io

  19. What if I asked you to…
      • create a completely isolated second environment to run an application (staging, QA, dev, etc.)?
      • deploy a complex new application?
      • update an existing complex application?
      • document how our infrastructure is architected?
      • delegate some ops to smaller teams? (Core IT vs. App IT)
  20. SCALABILITY, RESILIENCY, DETERMINISM

  21. SCALABILITY
      • Expectation of high QPS per resource
      • CPU, memory are valuable resources
      • One less server for utility = one more server for serving customers
      • Push vs. pull, a.k.a. edge-triggered changes
  22. RESILIENCY
      • Probability of failure goes up with scale
      • Embrace failure and make it acceptable
      • Constant change at some scale
      • Self-healing systems become much more important (automatic anti-entropy)
      • Central sources of truth become liabilities
  23. DETERMINISM
      • Understand the full effect of a change
      • Predictable (but not necessarily strict) ordering of a change
      • Limiting surprises that can cause downtime
  24. CHALLENGE #1: DECENTRALIZED SERVICE CONFIGURATION
  25. TRADITIONAL SERVICE CONFIGURATION: Pull-based, long intervals, computationally expensive (diagram: WEB 1, WEB 2, WEB N pull from a CONFIG MGMT SERVER at 14:00, 14:07, 14:03)
  26. CONSUL K/V + CONSUL-TEMPLATE: Push-based, "instant", predictable computational cost (diagram: CONSUL pushes to WEB 1, WEB 2, WEB N at 14:00:00.311, 14:00:00.731, 14:00:00.415)
  27. CONSUL-TEMPLATE: Template Example

      global
          daemon
          maxconn {{key "haproxy/maxconn"}}

      defaults
          mode {{key "haproxy/mode"}}{{range ls "haproxy/timeouts"}}
          timeout {{.Key}} {{.Value}}{{end}}

      listen http-in
          bind *:8000{{range service "release.web"}}
          server {{.Node}} {{.Address}}:{{.Port}}{{end}}

  28. CONSUL-TEMPLATE: Execute (as a service)

      $ consul-template \
          -consul demo.consul.io \
          -template "haproxy.ctmpl:/etc/haproxy/haproxy.conf:restart haproxy" -dry

  29. STEP BY STEP
      1. Config management puts down the configuration template
      2. consul-template runs as a service
      3. Edge triggers config changes, restarts the service
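
      Writing a key is enough to trigger the edge; for example, against a local agent (the value is illustrative), consul-template re-renders the template and restarts HAProxy:

        $ curl -X PUT -d '4096' http://127.0.0.1:8500/v1/kv/haproxy/maxconn
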
  30. CHALLENGE #2: SCALABLE SERVICE DISCOVERY

  31. ZERO TTL DNS
      • Long-held connections to minimize DNS overhead
      • Zero TTL ensures the most up-to-date information
  32. RESILIENCY
      • Low-TTL DNS records
      • Ensures availability even if Consul is unavailable
      • Required for short-held connections, since DNS lookup overhead is too high with zero TTL
  33. OPTION #1: CONSUL SETTINGS. Per-service, stale reads on non-leaders (diagram: WEB PROCESS sends a DNS query to the local CONSUL AGENT, answered by the CONSUL LEADER or a CONSUL STANDBY)
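
      A sketch of what this might look like in the agent configuration (values and service names are illustrative, not from the deck): per-service TTLs plus stale reads so non-leader servers can answer DNS queries.

        {
          "dns_config": {
            "allow_stale": true,
            "max_stale": "5s",
            "node_ttl": "0s",
            "service_ttl": {
              "*": "0s",
              "web": "5s"
            }
          }
        }
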
  34. OPTION #2: DNSMASQ + CONSUL. Global, works if Consul is down (diagram: WEB PROCESS sends a DNS query to DNSMASQ, which sits in front of the CONSUL AGENT)
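
      A common way to wire option #2, assuming dnsmasq handles normal resolution and forwards only the .consul domain to the local agent's DNS port (the file path is illustrative):

        # /etc/dnsmasq.d/10-consul
        # Forward *.consul queries to the local Consul agent.
        server=/consul/127.0.0.1#8600
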
  35. OPTION #3: APPLICATION-LEVEL CACHE. Works if almost everything is down, strict control over cache times (diagram: WEB PROCESS resolves through an IN-MEM CACHE backed by the CONSUL AGENT)
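
      A minimal sketch of option #3 in Go, assuming the application resolves names like web.service.consul through the local agent and wants direct control over how long answers are cached (all names here are hypothetical):

        package main

        import (
            "fmt"
            "net"
            "sync"
            "time"
        )

        // cachedResult holds one resolved answer and when it goes stale.
        type cachedResult struct {
            addrs   []string
            expires time.Time
        }

        // CachingResolver wraps DNS lookups with an in-memory cache whose
        // TTL the application controls directly.
        type CachingResolver struct {
            mu    sync.Mutex
            ttl   time.Duration
            cache map[string]cachedResult
        }

        func NewCachingResolver(ttl time.Duration) *CachingResolver {
            return &CachingResolver{ttl: ttl, cache: make(map[string]cachedResult)}
        }

        // Lookup returns cached addresses while they are fresh; otherwise it
        // asks the resolver (the local Consul agent or dnsmasq) and caches the
        // answer. If the lookup fails, it falls back to stale data when available.
        func (r *CachingResolver) Lookup(host string) ([]string, error) {
            r.mu.Lock()
            cached, ok := r.cache[host]
            r.mu.Unlock()
            if ok && time.Now().Before(cached.expires) {
                return cached.addrs, nil
            }

            addrs, err := net.LookupHost(host)
            if err != nil {
                if ok { // serve stale data if the discovery layer is down
                    return cached.addrs, nil
                }
                return nil, err
            }

            r.mu.Lock()
            r.cache[host] = cachedResult{addrs: addrs, expires: time.Now().Add(r.ttl)}
            r.mu.Unlock()
            return addrs, nil
        }

        func main() {
            resolver := NewCachingResolver(5 * time.Second)
            addrs, err := resolver.Lookup("web.service.consul")
            fmt.Println(addrs, err)
        }
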
  36. BEST OPTION? The first two options are usually good enough and will buy you a lot of runway.
  37. CHALLENGE #3: MONITORING AT SCALE

  38. TRADITIONAL MONITORING: Pushes information into a silo (diagram: WEB 1, WEB 2, WEB N push to a MONITORING SERVICE)
  39. TRADITIONAL MONITORING: Pushes information into a silo (same diagram)
  40. CONSUL MONITORING: Removes unhealthy nodes from the service discovery layer (diagram: WEB 1, WEB 2, WEB N checked by CONSUL)
  41. CONSUL MONITORING: Removes unhealthy nodes from the service discovery layer (same diagram)
  42. CONSUL MONITORING: dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6
  43. CONSUL MONITORING: dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6
  44. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6 (the unhealthy node is dropped)
  45. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  46. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  47. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  48. CONSUL MONITORING: dig web.service.consul returns 10.0.1.5, 10.0.1.6; host: web.service.consul
  49. CONSUL MONITORING: dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6 (the recovered node is re-added)
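
      The health information behind this comes from ordinary check definitions registered with the local agent; a sketch of a service definition with a script check (name, port, and health endpoint are assumptions). Nodes whose checks fail simply stop appearing in answers for web.service.consul.

        {
          "service": {
            "name": "web",
            "port": 80,
            "check": {
              "script": "curl -sf http://localhost:80/health",
              "interval": "10s"
            }
          }
        }
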
  50. CONSUL MONITORING + ALERTING: Via Atlas (HashiCorp paid offering) (diagram: WEB 1 … WEB N report to CONSUL, which feeds ATLAS, which alerts via EMAIL, SLACK, PAGERDUTY)
  51. CONSUL MONITORING + ALERTING: Atlas UI
  52. CONSUL MONITORING + ALERTING: History
  53. CONSUL MONITORING + ALERTING: Email, PagerDuty, Slack, etc.

  54. CHALLENGE #4: SERVICE RESILIENCY VIA DISTRIBUTED LOCKING
  55. CONSUL DISTRIBUTED LOCKING: Ensure "at most N" tasks (diagram: WEB 1, WEB 2, WEB LEADER coordinating through CONSUL)
  56. CONSUL DISTRIBUTED LOCKING: Ensure "at most N" tasks (same diagram)
  57. CONSUL DISTRIBUTED LOCKING: Ensure "at most N" tasks (diagram: WEB 1, WEB LEADER, WEB N)
  58. API or CLI. CLI Example:
      $ consul lock locks/ ./configure-f5

  59. DISTRIBUTED LOCKS
      • Building block for distributed systems
      • Complexity hidden from downstream applications, like a mutex stdlib
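
      For example, the same command can be scheduled on every web node; whichever node acquires the lock runs the child process and the others block until it is released, so at most one copy runs at a time (the script name is hypothetical):

        $ consul lock locks/nightly-report ./generate-report.sh
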
  60. CHALLENGE #5: SERVICE ORCHESTRATION VIA EVENTS, WATCHES

  61. CONSUL EVENTS: Edge-triggered, sent to all nodes, extremely cheap (diagram: consul event deploy fanned out through CONSUL to WEB 1, WEB 2, WEB N)
  62. CONSUL WATCH: Waits for specific events, executes a script (diagram: consul event deploy delivered through CONSUL to WEB 1, WEB 2, WEB N)
  63. CONSUL EXEC: Runs a script on specific nodes (diagram: WEB 1, WEB 2, DATABASE; consul exec -service="web" ./script.sh)
  64. Consul Watch: Wait for events, do something
      $ consul watch -type=event -name=deploy ./deploy.sh …
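
      A sketch of what the handler might look like: the watch passes the matching events to the script as a JSON array on stdin, with each payload base64-encoded (jq is an assumption, not part of the deck):

        #!/bin/sh
        # deploy.sh: handler invoked by `consul watch -type=event -name=deploy`
        version=$(jq -r '.[-1].Payload' | base64 --decode)
        echo "deploying version ${version}"
        # ...fetch the release, restart the service, etc.
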
  65. USE CASES
      • Deploys
      • Operational tasks
      • Configure external services
  66. CHALLENGE #6: DETERMINISTIC LARGE-SCALE INFRASTRUCTURE CHANGE

  67. LARGE SCALE INFRA UPDATE
      • Unexpected inter-dependencies
      • Cross-cloud changes
      • Ordering for minimal disruption
      • Expected time for complete rollout
  68. Terraform Plan: What are you going to do?

      + digitalocean_droplet.web
          backups:              "" => "<computed>"
          image:                "" => "centos-5-8-x32"
          ipv4_address:         "" => "<computed>"
          ipv4_address_private: "" => "<computed>"
          name:                 "" => "tf-web"
          private_networking:   "" => "<computed>"
          region:               "" => "sfo1"
          size:                 "" => "512mb"
          status:               "" => "<computed>"

      + dnsimple_record.hello
          domain:    "" => "example.com"
          domain_id: "" => "<computed>"
          hostname:  "" => "<computed>"
          name:      "" => "test"

  69. Terraform Graph: What order are you going to do things?
      $ terraform graph …
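
      The output is in DOT format, so it can be rendered with Graphviz, for example:

        $ terraform graph | dot -Tpng > graph.png
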
  70. CHALLENGE #7: DELEGATING OPS TO MULTIPLE TEAMS

  71. OPS DELEGATION
      • "Core" operations teams
      • Application operations teams
      • Eliminate shadow ops
      • Safely make changes without negatively affecting others
      • Share operations knowledge
  72. Modules: Unit of knowledge sharing

      module "consul" {
        source  = "github.com/hashicorp/consul/terraform/aws"
        servers = 5
      }

      output "consul-address" {
        value = "${module.consul.server_address}"
      }
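
      Module sources are downloaded before planning, along the lines of:

        $ terraform get
        $ terraform plan
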
  73. Remote State: Unit of resource sharing

      resource "terraform_remote_state" "consul" {
        backend = "atlas"
        config {
          path = "hashicorp/consul-prod"
        }
      }

      output "consul-address" {
        value = "${terraform_remote_state.consul.addr}"
      }
  74. CHALLENGE #8: SERVICE COMPOSITION, INFRASTRUCTURE ORCHESTRATION

  75. SERVICE COMPOSITION
      • Modern infrastructures are almost always "multi-provider": DNS in CloudFlare, compute in AWS, etc.
      • Infrastructure change requires composing data from multiple services, executing change in multiple services
  76. Service Composition: Connecting multiple service providers

      resource "aws_instance" "web" {
        # …
      }

      resource "cloudflare_record" "www" {
        domain = "foo.com"
        name   = "www"
        value  = "${aws_instance.web.private_ip}"
        type   = "A"
      }
  77. Logical Resources: Now you're thinking in graphs

      resource "template_file" "data" {
        filename = "data.tpl"
        vars {
          address = "${var.addr}"
        }
      }

      resource "aws_instance" "web" {
        user_data = "${template_file.data.rendered}"
      }

      resource "cloudflare_record" "www" {
        domain = "foo.com"
        name   = "www"
        value  = "${aws_instance.web.private_ip}"
        type   = "A"
      }
  78. CHALLENGE #9: HISTORY OF INFRASTRUCTURE CHANGE

  79. HISTORY OF INFRASTRUCTURE CHANGE: Via Atlas (HashiCorp paid offering)

  80. HISTORY OF INFRA CHANGE
      • See who did what, when, and how
      • See what changed recently to diagnose some monitoring event
      • Treat infrastructure as a sort of application
  81. CHALLENGE #10: INFRASTRUCTURE COLLABORATION

  82. INFRA COLLABORATION
      • Achieve application-like collaboration with infrastructure change
      • Code reviews, safe merges
      • Understanding the effect of infrastructure changes
  83. INFRASTRUCTURE COLLABORATION: Approve/deny plans, similar to pull requests, but for infra
  84. Thanks! QUESTIONS?