Building the World's Largest Websites with Consul and Terraform

This is a talk given at Xebicon.nl

A lot of context is missing from the slides alone, but Xebia said they would upload video of the talk shortly. Search for the talk title (the same as the title of this deck) to see if it's up. The talk is much more informative and includes real-world cases from some of our biggest users.

Mitchell Hashimoto

June 04, 2015

Transcript

  1. Building the World’s
    Largest Websites with
    Consul and Terraform

  2. @mitchellh
    Mitchell Hashimoto

  3. [Slide contains no text]

  4. DC EVOLUTION
    Challenges of the modern datacenter

  5. RISING DATACENTER COMPLEXITY
    DC

  6. RISING DATACENTER COMPLEXITY
    DC

  7. RISING DATACENTER COMPLEXITY
    [Diagram: one DC containing many VMs]

  8. RISING DATACENTER COMPLEXITY
    [Diagram: the same DC and VMs, now packed with many smaller units labeled C]

  9. RISING DATACENTER COMPLEXITY
    DC
    DNS
    Database
    CDN

  10. RISING DATACENTER COMPLEXITY
    DC-01 DC-02

  11. RISING DATACENTER COMPLEXITY
    [Diagram: two datacenters, DC-01 and DC-02, with VMs and units labeled C]

  12. RISING DATACENTER COMPLEXITY
    IaaS PaaS SaaS

  13. RISING DATACENTER COMPLEXITY

  14. CONSUL
    consul.io

  15. Service discovery, configuration, and
    orchestration made easy. Distributed,
    highly available, and datacenter-aware.

  16. Questions that Consul Answers
    • Where is the service foo? (ex. Where is the database?)
    • What is the health status of service foo?
    • What is the health status of the machine/node foo?
    • What is the list of all currently running machines?
    • What is the configuration of service foo?
    • Is anyone else currently performing operation foo?

  17. terraform.io
    TERRAFORM

  18. Build, combine, and launch
    infrastructure safely and efficiently.
    terraform.io

  19. What If I asked you to…
    • create a completely isolated second environment to run an application
      (staging, QA, dev, etc.)?
    • deploy a complex new application?
    • update an existing complex application?
    • document how our infrastructure is architected?
    • delegate some ops to smaller teams? (Core IT vs. App IT)

  20. SCALABILITY,
    RESILIENCY,
    DETERMINISM

  21. SCALABILITY
    • Expectation of high QPS per resource
    • CPU, memory are valuable resources
    • One less server for utility = one more server for
      serving customers
    • Push vs. Pull, a.k.a. edge triggered changes

  22. RESILIENCY
    • Probability of failure goes up for scale
    • Embrace failure and make it acceptable
    • Constant change at some scale
    • Self-healing systems become much more
      important (automatic anti-entropy)
    • Central sources of truth become liabilities

  23. DETERMINISM
    • Understand the full effect of a change
    • Predictable (but not necessarily strict) ordering
      of a change.
    • Limiting surprises that can cause downtime

  24. CHALLENGE #1
    DECENTRALIZED SERVICE
    CONFIGURATION

  25. CONFIG MGMT SERVER
    TRADITIONAL SERVICE CONFIGURATION
    Pull-based, long intervals, computationally expensive
    WEB 1
    WEB 2
    WEB N
    14:00
    14:07
    14:03

  26. CONSUL
    CONSUL K/V + CONSUL-TEMPLATE
    Push-based, “instant”, predictable computational cost
    WEB 1
    WEB 2
    WEB N
    14:00:00.311
    14:00:00.731
    14:00:00.415

  27. CONSUL-TEMPLATE
    Template Example
    global
        daemon
        maxconn {{key "haproxy/maxconn"}}
    defaults
        mode {{key "haproxy/mode"}}{{range ls "haproxy/timeouts"}}
        timeout {{.Key}} {{.Value}}{{end}}
    listen http-in
        bind *:8000{{range service "release.web"}}
        server {{.Node}} {{.Address}}:{{.Port}}{{end}}

  28. CONSUL-TEMPLATE
    Execute (as a service)
    $ consul-template \
        -consul demo.consul.io \
        -template "haproxy.ctmpl:/etc/haproxy/haproxy.conf:restart haproxy" \
        -dry

  29. STEP BY STEP
    1. Config management puts down
       configuration template
    2. consul-template runs as a service
    3. Edge triggers config changes, restarts
       service
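
The three steps above amount to an edge-triggered render loop. As a rough illustration only (this is not how consul-template itself is implemented; the `TemplateWatcher` class, the `$`-style placeholders, and the callback are hypothetical stand-ins), a Python sketch might re-render on every change notification and restart the service only when the rendered output actually differs:

```python
from string import Template


class TemplateWatcher:
    """Hypothetical helper: re-render a config template when watched keys change."""

    def __init__(self, template_text, on_change):
        self.template = Template(template_text)
        self.on_change = on_change  # e.g. a function that restarts the service
        self.last_rendered = None

    def notify(self, kv):
        """Handle one change notification carrying the current key/value snapshot."""
        rendered = self.template.substitute(kv)
        if rendered == self.last_rendered:
            return False  # output unchanged: no write, no restart
        self.last_rendered = rendered
        self.on_change(rendered)
        return True


# Record each "restart" instead of touching a real service.
restarts = []
watcher = TemplateWatcher("maxconn $maxconn\nmode $mode\n", restarts.append)
watcher.notify({"maxconn": "256", "mode": "http"})  # first render: restart
watcher.notify({"maxconn": "256", "mode": "http"})  # same output: nothing happens
watcher.notify({"maxconn": "512", "mode": "http"})  # value changed: restart
```

Comparing rendered output rather than raw keys keeps the cost predictable: a notification that does not change the file never restarts the service.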

  30. CHALLENGE #2
    SCALABLE SERVICE
    DISCOVERY

  31. ZERO TTL DNS
    • Long-held connections to minimize DNS
      overhead
    • Zero TTL ensures most up-to-date
      information

  32. RESILIENCY
    • Low-TTL DNS records
    • Ensures availability even if Consul is
      unavailable
    • Required for short-held connections since
      DNS lookup overhead is too high with zero
      TTL

  33. CONSUL AGENT
    OPTION #1: CONSUL SETTINGS
    Per-service, stale reads on non-leaders
    WEB PROCESS
    dns query
    CONSUL LEADER
    CONSUL STANDBY

  34. CONSUL AGENT
    OPTION #2: DNSMASQ + CONSUL
    Global, works if Consul is down
    WEB PROCESS
    dns query
    DNSMASQ

  35. OPTION #3: APPLICATION-LEVEL CACHE
    Works if almost everything is down, strict control over cache times
    WEB PROCESS
    dns query
    IN-MEM CACHE
    CONSUL AGENT

  36. BEST OPTION?
    The first two options are usually
    good enough, will buy you a lot of runway
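
Option #3 is the only one that lives inside the application, so a small sketch may help. The following is a minimal in-memory cache under stated assumptions: the `ServiceCache` class, its `lookup` callback, and the injected clock are all hypothetical names, and a real application would pass a function that performs the DNS query against the local Consul agent. It serves fresh entries within the TTL and falls back to the last known answer when the lookup fails:

```python
import time


class ServiceCache:
    """Hypothetical application-level cache with stale fallback on lookup failure."""

    def __init__(self, lookup, ttl_seconds, clock=time.monotonic):
        self.lookup = lookup      # function: service name -> list of addresses
        self.ttl = ttl_seconds    # the application chooses its own cache time
        self.clock = clock
        self.entries = {}         # name -> (expires_at, addresses)

    def resolve(self, name):
        now = self.clock()
        cached = self.entries.get(name)
        if cached and cached[0] > now:
            return cached[1]      # still fresh: no lookup at all
        try:
            addresses = self.lookup(name)
        except OSError:
            if cached:
                return cached[1]  # lookup path is down: serve the stale answer
            raise
        self.entries[name] = (now + self.ttl, addresses)
        return addresses


# Simulated clock and a lookup that works once, then fails.
now = [0.0]
calls = []


def fake_lookup(name):
    calls.append(name)
    if len(calls) > 1:
        raise OSError("agent unreachable")
    return ["10.0.1.5", "10.0.1.6"]


cache = ServiceCache(fake_lookup, ttl_seconds=5, clock=lambda: now[0])
first = cache.resolve("web.service.consul")
now[0] = 10.0  # entry has expired and the next lookup fails
second = cache.resolve("web.service.consul")
```

The trade-off is the one the slide names: the application keeps working when almost everything else is down, at the price of owning staleness policy itself.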

  37. CHALLENGE #3
    MONITORING AT SCALE

  38. MONITORING SERVICE
    TRADITIONAL MONITORING
    Pushes information into a silo
    WEB 1
    WEB 2
    WEB N

  39. MONITORING SERVICE
    TRADITIONAL MONITORING
    Pushes information into a silo
    WEB 1
    WEB 2
    WEB N

  40. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N

  41. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N

  42. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.4
    10.0.1.5
    10.0.1.6

  43. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.4
    10.0.1.5
    10.0.1.6

  44. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.5
    10.0.1.6

  45. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.5
    10.0.1.6
    host: web.service.consul

  46. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.5
    10.0.1.6
    host: web.service.consul

  47. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.5
    10.0.1.6
    host: web.service.consul

  48. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.5
    10.0.1.6
    host: web.service.consul

  49. CONSUL
    CONSUL MONITORING
    Removes unhealthy nodes from service discovery layer
    WEB 1
    WEB 2
    WEB N
    dig web.service.consul
    10.0.1.4
    10.0.1.5
    10.0.1.6
    host: web.service.consul
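
The sequence above (an address disappears from the `dig` answer, then returns) can be modeled as a filter over health-check results. This is a toy sketch, not Consul's code: the node records, the `http` check name, and the pairing of 10.0.1.4 with the first web node are assumptions made for illustration. It treats only a `critical` check as disqualifying, which is my understanding of the default behavior and should be verified against the Consul documentation:

```python
def healthy_addresses(nodes):
    """Return addresses of nodes that have no check in the 'critical' state."""
    return [
        node["address"]
        for node in nodes
        if all(status != "critical" for status in node["checks"].values())
    ]


# Hypothetical registry contents while one web node is failing its check.
nodes = [
    {"name": "web-1", "address": "10.0.1.4", "checks": {"http": "critical"}},
    {"name": "web-2", "address": "10.0.1.5", "checks": {"http": "passing"}},
    {"name": "web-n", "address": "10.0.1.6", "checks": {"http": "passing"}},
]
during_failure = healthy_addresses(nodes)

# The check recovers, so the node is returned to the answer set.
nodes[0]["checks"]["http"] = "passing"
after_recovery = healthy_addresses(nodes)
```

Because clients keep using the same name (`web.service.consul`), they need no change when a node drops out or comes back.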

  50. CONSUL
    CONSUL MONITORING + ALERTING
    Via Atlas (HashiCorp paid offering)
    WEB 1
    WEB N
    ATLAS
    EMAIL
    SLACK
    PAGERDUTY

  51. CONSUL MONITORING + ALERTING
    Atlas UI

  52. CONSUL MONITORING + ALERTING
    History

  53. CONSUL MONITORING + ALERTING
    Email, PagerDuty, Slack, etc.

  54. CHALLENGE #4
    SERVICE RESILIENCY
    VIA DISTRIBUTED
    LOCKING

  55. CONSUL
    CONSUL DISTRIBUTED LOCKING
    Ensure “at most N” tasks
    WEB 1
    WEB 2
    WEB LEADER

  56. CONSUL
    CONSUL DISTRIBUTED LOCKING
    Ensure “at most N” tasks
    WEB 1
    WEB 2
    WEB LEADER

  57. CONSUL
    CONSUL DISTRIBUTED LOCKING
    Ensure “at most N” tasks
    WEB 1
    WEB LEADER
    WEB N

  58. API Or CLI
    CLI Example
    $ consul lock locks/ ./configure-f5

  59. DISTRIBUTED LOCKS
    • Building block for distributed systems
    • Complexity hidden from downstream
      applications, like a mutex stdlib
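
To make the "mutex stdlib" comparison concrete, here is a single-process stand-in for the semantics a distributed lock offers, for the N = 1 case. It is a sketch under stated assumptions: `LockTable`, the session names, and the key `locks/configure-f5` are hypothetical, and nothing here talks to Consul. The point is the behavior shown in the slides: one holder at a time, and leadership moves when the holder's session goes away:

```python
class LockTable:
    """In-memory stand-in for session-based locks (not the Consul API)."""

    def __init__(self):
        self.holders = {}  # lock key -> session id of the current holder

    def acquire(self, key, session):
        """Take the lock if it is free; report whether this session holds it."""
        holder = self.holders.setdefault(key, session)
        return holder == session

    def release(self, key, session):
        """Release the lock, but only if this session is the holder."""
        if self.holders.get(key) == session:
            del self.holders[key]
            return True
        return False

    def invalidate(self, session):
        """Drop every lock held by a session, e.g. when its node fails."""
        for key in [k for k, s in self.holders.items() if s == session]:
            del self.holders[key]


table = LockTable()
web1_leads = table.acquire("locks/configure-f5", "web-1")   # web-1 becomes leader
web2_blocked = table.acquire("locks/configure-f5", "web-2") # web-2 must wait
table.invalidate("web-1")                                   # web-1 dies
web2_leads = table.acquire("locks/configure-f5", "web-2")   # leadership moves
```

An "at most N" variant would track up to N holders per key instead of one.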

  60. CHALLENGE #5
    SERVICE
    ORCHESTRATION
    VIA EVENTS, WATCHES

  61. CONSUL
    CONSUL EVENTS
    Edge-triggered, sent to all nodes, extremely cheap
    WEB 1
    WEB 2
    WEB N
    consul event deploy

  62. CONSUL
    CONSUL WATCH
    Waits for specific events, executes script
    WEB 1
    WEB 2
    WEB N
    consul event deploy

  63. CONSUL
    CONSUL EXEC
    Runs script on specific nodes
    WEB 1
    WEB 2
    DATABASE
    consul exec -service="web" ./script.sh

  64. Consul Watch
    Wait for events, do something
    $ consul watch -type=event -name=deploy ./deploy.sh
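
The handler given to `consul watch` can be any executable. As a hedged sketch of what a handler written in Python might do: to my understanding the watch passes a JSON array of events on standard input, with each event carrying a `Name` and a base64-encoded `Payload`; treat those field names as assumptions to check against the Consul watch documentation. The `parse_events` helper and the sample payload are hypothetical:

```python
import base64
import json


def parse_events(stdin_text):
    """Decode a JSON array of events into (name, payload) pairs."""
    events = json.loads(stdin_text) or []  # tolerate a JSON null
    return [
        (event["Name"], base64.b64decode(event.get("Payload") or "").decode("utf-8"))
        for event in events
    ]


# A sample input shaped like the assumed watch output, with a made-up payload.
sample = json.dumps(
    [{"Name": "deploy", "Payload": base64.b64encode(b"v1.2.3").decode("ascii")}]
)
parsed = parse_events(sample)
```

A real handler would call `parse_events(sys.stdin.read())` and then run the deploy step for each event it receives.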

  65. USE CASES
    • Deploys
    • Operational tasks
    • Configure external services

  66. CHALLENGE #6
    DETERMINISTIC LARGE-SCALE
    INFRASTRUCTURE
    CHANGE

  67. LARGE SCALE INFRA UPDATE
    • Unexpected inter-dependencies
    • Cross-cloud changes
    • Ordering for minimal disruption
    • Expected time for complete rollout

  68. Terraform Plan
    What are you going to do?
    + digitalocean_droplet.web
        backups: "" => "<computed>"
        image: "" => "centos-5-8-x32"
        ipv4_address: "" => "<computed>"
        ipv4_address_private: "" => "<computed>"
        name: "" => "tf-web"
        private_networking: "" => "<computed>"
        region: "" => "sfo1"
        size: "" => "512mb"
        status: "" => "<computed>"
    + dnsimple_record.hello
        domain: "" => "example.com"
        domain_id: "" => "<computed>"
        hostname: "" => "<computed>"
        name: "" => "test"

  69. Terraform Graph
    What order are you going to do things?
    $ terraform graph

  70. CHALLENGE #7
    DELEGATING OPS
    TO MULTIPLE TEAMS

  71. OPS DELEGATION
    • “Core” operations teams
    • Application operations teams
    • Eliminate shadow ops
    • Safely make changes without
      negatively affecting others
    • Share operations knowledge

  72. Modules
    Unit of knowledge sharing
    module "consul" {
        source = "github.com/hashicorp/consul/terraform/aws"
        servers = 5
    }
    output "consul-address" {
        value = "${module.consul.server_address}"
    }

  73. Remote State
    Unit of resource sharing
    resource "terraform_remote_state" "consul" {
        backend = "atlas"
        config {
            path = "hashicorp/consul-prod"
        }
    }
    output "consul-address" {
        value = "${terraform_remote_state.consul.addr}"
    }

  74. CHALLENGE #8
    SERVICE COMPOSITION,
    INFRASTRUCTURE
    ORCHESTRATION

  75. SERVICE COMPOSITION
    • Modern infrastructures are almost always
      “multi-provider”: DNS in CloudFlare, compute
      in AWS, etc.
    • Infrastructure change requires composing
      data from multiple services, executing change
      in multiple services

  76. Service Composition
    Connecting multiple service providers
    resource "aws_instance" "web" {
        # …
    }
    resource "cloudflare_record" "www" {
        domain = "foo.com"
        name = "www"
        value = "${aws_instance.web.private_ip}"
        type = "A"
    }

  77. Logical Resources
    Now you’re thinking in graphs
    resource "template_file" "data" {
        filename = "data.tpl"
        vars {
            address = "${var.addr}"
        }
    }
    resource "aws_instance" "web" {
        user_data = "${template_file.data.rendered}"
    }
    resource "cloudflare_record" "www" {
        domain = "foo.com"
        name = "www"
        value = "${aws_instance.web.private_ip}"
        type = "A"
    }
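
"Thinking in graphs" here means that each `${...}` reference in the configuration above is an edge, and the order of operations falls out of a topological sort. The sketch below is an illustration only and not Terraform's implementation: the regular expression, the `dependency_graph` helper, and the reduction of each resource to a list of attribute strings are assumptions made to keep it short. It uses `graphlib`, so it needs Python 3.9 or newer:

```python
import re
from graphlib import TopologicalSorter

# Matches "${type.name.attribute}" and captures "type.name".
# Two-part references such as "${var.addr}" deliberately do not match.
REF = re.compile(r"\$\{(\w+\.\w+)\.\w+\}")

# Attribute values taken from the three resources on the slide.
resources = {
    "template_file.data": ["${var.addr}"],
    "aws_instance.web": ["${template_file.data.rendered}"],
    "cloudflare_record.www": ["${aws_instance.web.private_ip}"],
}


def dependency_graph(resources):
    """Map each resource to the set of other resources it references."""
    graph = {}
    for name, values in resources.items():
        refs = {match for value in values for match in REF.findall(value)}
        graph[name] = {ref for ref in refs if ref in resources}
    return graph


graph = dependency_graph(resources)
order = list(TopologicalSorter(graph).static_order())
```

For this chain the only valid order is the template, then the instance, then the DNS record; with wider graphs, independent resources could be created in parallel.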

  78. CHALLENGE #9
    HISTORY OF
    INFRASTRUCTURE
    CHANGE

  79. HISTORY OF INFRASTRUCTURE CHANGE
    Via Atlas (HashiCorp paid offering)

  80. HISTORY OF INFRA CHANGE
    • See who did what, when, and how
    • See what changed recently to diagnose
      some monitoring event
    • Treat infrastructure as a sort of application

  81. CHALLENGE #10
    INFRASTRUCTURE
    COLLABORATION

  82. INFRA COLLABORATION
    • Achieve application-like collaboration with
      infrastructure change
    • Code reviews, safe merges
    • Understanding the effect of infrastructure
      changes

  83. INFRASTRUCTURE COLLABORATION
    Approve/deny plans, similar to pull requests, but for infra

  84. Thanks!
    QUESTIONS?