Slide 1

Slide 1 text

Building the World’s Largest Websites with Consul and Terraform

Slide 2

Slide 2 text

Mitchell Hashimoto
@mitchellh

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

DC EVOLUTION
Challenges of the modern datacenter

Slide 5

Slide 5 text

RISING DATACENTER COMPLEXITY
[Diagram: a single datacenter, DC]

Slide 6

Slide 6 text

RISING DATACENTER COMPLEXITY
[Diagram: a single datacenter, DC]

Slide 7

Slide 7 text

RISING DATACENTER COMPLEXITY
[Diagram: DC containing many VMs]

Slide 8

Slide 8 text

RISING DATACENTER COMPLEXITY
[Diagram: DC containing many VMs, which in turn contain many boxes labeled C]

Slide 9

Slide 9 text

RISING DATACENTER COMPLEXITY
[Diagram: DC alongside DNS, Database, CDN]

Slide 10

Slide 10 text

RISING DATACENTER COMPLEXITY
[Diagram: two datacenters, DC-01 and DC-02]

Slide 11

Slide 11 text

RISING DATACENTER COMPLEXITY
[Diagram: DC-01 and DC-02, each containing VMs, which in turn contain boxes labeled C]

Slide 12

Slide 12 text

RISING DATACENTER COMPLEXITY
IaaS, PaaS, SaaS

Slide 13

Slide 13 text

RISING DATACENTER COMPLEXITY

Slide 14

Slide 14 text

CONSUL
consul.io

Slide 15

Slide 15 text

Service discovery, configuration, and orchestration made easy. Distributed, highly available, and datacenter-aware.

Slide 16

Slide 16 text

Questions that Consul Answers
• Where is the service foo? (ex. Where is the database?)
• What is the health status of service foo?
• What is the health status of the machine/node foo?
• What is the list of all currently running machines?
• What is the configuration of service foo?
• Is anyone else currently performing operation foo?

Slide 17

Slide 17 text

TERRAFORM
terraform.io

Slide 18

Slide 18 text

Build, combine, and launch infrastructure safely and efficiently.
terraform.io

Slide 19

Slide 19 text

What if I asked you to…
• create a completely isolated second environment to run an application (staging, QA, dev, etc.)?
• deploy a complex new application?
• update an existing complex application?
• document how our infrastructure is architected?
• delegate some ops to smaller teams? (Core IT vs. App IT)

Slide 20

Slide 20 text

SCALABILITY, RESILIENCY, DETERMINISM

Slide 21

Slide 21 text

SCALABILITY
• Expectation of high QPS per resource
• CPU, memory are valuable resources
• One less server for utility = one more server for serving customers
• Push vs. pull, a.k.a. edge-triggered changes

Slide 22

Slide 22 text

RESILIENCY
• Probability of failure goes up with scale
• Embrace failure and make it acceptable
• Constant change at some scale
• Self-healing systems become much more important (automatic anti-entropy)
• Central sources of truth become liabilities

Slide 23

Slide 23 text

DETERMINISM
• Understand the full effect of a change
• Predictable (but not necessarily strict) ordering of a change
• Limiting surprises that can cause downtime

Slide 24

Slide 24 text

CHALLENGE #1
DECENTRALIZED SERVICE CONFIGURATION

Slide 25

Slide 25 text

TRADITIONAL SERVICE CONFIGURATION
Pull-based, long intervals, computationally expensive
[Diagram: CONFIG MGMT SERVER with WEB 1, WEB 2, ... WEB N; the nodes show different update times: 14:00, 14:07, 14:03]

Slide 26

Slide 26 text

CONSUL K/V + CONSUL-TEMPLATE
Push-based, “instant”, predictable computational cost
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N; the nodes show update times within the same second: 14:00:00.311, 14:00:00.731, 14:00:00.415]

Slide 27

Slide 27 text

CONSUL-TEMPLATE
Template Example

    global
        daemon
        maxconn {{key "haproxy/maxconn"}}

    defaults
        mode {{key "haproxy/mode"}}{{range ls "haproxy/timeouts"}}
        timeout {{.Key}} {{.Value}}{{end}}

    listen http-in
        bind *:8000{{range service "release.web"}}
        server {{.Node}} {{.Address}}:{{.Port}}{{end}}

Slide 28

Slide 28 text

CONSUL-TEMPLATE
Execute (as a service)

    $ consul-template \
        -consul demo.consul.io \
        -template "haproxy.ctmpl:/etc/haproxy/haproxy.conf:restart haproxy" \
        -dry

Slide 29

Slide 29 text

STEP BY STEP
1. Config management puts down configuration template
2. consul-template runs as a service
3. Edge triggers config changes, restarts service
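The three steps above can be sketched in a few lines of Python. This is only an illustration of the edge-triggered idea, not how consul-template is implemented: `render`, `EdgeTriggeredRenderer`, and the restart hook are hypothetical names, and the key/value data is passed in by hand rather than read from Consul.

```python
from typing import Callable, Dict, List, Optional


def render(kv: Dict[str, str], servers: List[str]) -> str:
    """Render a tiny HAProxy-like config from key/value data and a server list."""
    lines = [
        "global",
        f"    maxconn {kv['haproxy/maxconn']}",
        "listen http-in",
        "    bind *:8000",
    ]
    lines += [f"    server {addr}" for addr in servers]
    return "\n".join(lines) + "\n"


class EdgeTriggeredRenderer:
    """Re-render on each pushed change; run the restart hook only if the output differs."""

    def __init__(self, restart: Callable[[], None]) -> None:
        self.restart = restart
        self.current: Optional[str] = None

    def on_change(self, kv: Dict[str, str], servers: List[str]) -> bool:
        rendered = render(kv, servers)
        if rendered == self.current:
            return False  # nothing changed, so no restart
        self.current = rendered
        self.restart()
        return True


# Hypothetical restart hook that just records the command it would run.
restarts: List[str] = []
renderer = EdgeTriggeredRenderer(restart=lambda: restarts.append("restart haproxy"))
kv = {"haproxy/maxconn": "256"}

first = renderer.on_change(kv, ["10.0.1.4:80"])
second = renderer.on_change(kv, ["10.0.1.4:80"])
third = renderer.on_change(kv, ["10.0.1.4:80", "10.0.1.5:80"])
```

The work done per node is proportional to the number of changes, not to a polling interval, which is the "predictable computational cost" point made earlier.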

Slide 30

Slide 30 text

CHALLENGE #2
SCALABLE SERVICE DISCOVERY

Slide 31

Slide 31 text

ZERO TTL DNS
• Long-held connections to minimize DNS overhead
• Zero TTL ensures most up-to-date information

Slide 32

Slide 32 text

RESILIENCY
• Low-TTL DNS records
• Ensures availability even if Consul is unavailable
• Required for short-held connections since DNS lookup overhead is too high with zero TTL

Slide 33

Slide 33 text

OPTION #1: CONSUL SETTINGS
Per-service, stale reads on non-leaders
[Diagram: WEB PROCESS sends a dns query to the CONSUL AGENT; CONSUL LEADER and CONSUL STANDBY servers are shown behind it]

Slide 34

Slide 34 text

OPTION #2: DNSMASQ + CONSUL
Global, works if Consul is down
[Diagram: WEB PROCESS sends a dns query to DNSMASQ, with the CONSUL AGENT shown alongside]

Slide 35

Slide 35 text

OPTION #3: APPLICATION-LEVEL CACHE
Works if almost everything is down, strict control over cache times
[Diagram: WEB PROCESS with an IN-MEM CACHE; dns query to the CONSUL AGENT]
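A minimal sketch of what an application-level cache could look like, assuming the application controls the TTL and is willing to serve stale answers when the lookup fails. `DiscoveryCache`, `fake_lookup`, and the injected clock are hypothetical names for illustration; a real application would plug in its own DNS or HTTP lookup.

```python
import time
from typing import Callable, Dict, List, Tuple


class DiscoveryCache:
    """In-memory cache with a fixed TTL that falls back to stale data on lookup errors."""

    def __init__(
        self,
        lookup: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.lookup = lookup
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, name: str) -> List[str]:
        now = self.clock()
        cached = self.entries.get(name)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]  # fresh enough, skip the lookup entirely
        try:
            addrs = self.lookup(name)
        except Exception:
            if cached is not None:
                return cached[1]  # backend is down: serve the stale answer
            raise
        self.entries[name] = (now, addrs)
        return addrs


# Fake clock and fake backend so the behavior can be shown without a network.
now = [0.0]
backend_up = [True]
calls: List[str] = []


def fake_lookup(name: str) -> List[str]:
    calls.append(name)
    if not backend_up[0]:
        raise ConnectionError("discovery backend unavailable")
    return ["10.0.1.5", "10.0.1.6"]


cache = DiscoveryCache(fake_lookup, ttl_seconds=5.0, clock=lambda: now[0])
fresh = cache.resolve("web.service.consul")  # cache miss, performs a lookup
hit = cache.resolve("web.service.consul")    # within TTL, served from memory
now[0] = 10.0
backend_up[0] = False
stale = cache.resolve("web.service.consul")  # expired and lookup fails, stale answer
```

The trade-off is the one named on the slide: the application keeps working when the lookup path is down, at the cost of owning the cache times itself.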

Slide 36

Slide 36 text

BEST OPTION?
The first two options are usually good enough and will buy you a lot of runway.

Slide 37

Slide 37 text

CHALLENGE #3
MONITORING AT SCALE

Slide 38

Slide 38 text

TRADITIONAL MONITORING
Pushes information into a silo
[Diagram: WEB 1, WEB 2, ... WEB N and a MONITORING SERVICE]

Slide 39

Slide 39 text

TRADITIONAL MONITORING
Pushes information into a silo
[Diagram: WEB 1, WEB 2, ... WEB N and a MONITORING SERVICE]

Slide 40

Slide 40 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

Slide 41

Slide 41 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

Slide 42

Slide 42 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.4
    10.0.1.5
    10.0.1.6

Slide 43

Slide 43 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.4
    10.0.1.5
    10.0.1.6

Slide 44

Slide 44 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.5
    10.0.1.6

Slide 45

Slide 45 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.5
    10.0.1.6

    host: web.service.consul

Slide 46

Slide 46 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.5
    10.0.1.6

    host: web.service.consul

Slide 47

Slide 47 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.5
    10.0.1.6

    host: web.service.consul

Slide 48

Slide 48 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.5
    10.0.1.6

    host: web.service.consul

Slide 49

Slide 49 text

CONSUL MONITORING
Removes unhealthy nodes from service discovery layer
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    dig web.service.consul
    10.0.1.4
    10.0.1.5
    10.0.1.6

    host: web.service.consul
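The sequence on these slides (three addresses, then two while one node is unhealthy, then three again) can be modeled with a small registry that filters by health at lookup time. This is a sketch of the idea only; the `Registry` class and its methods are hypothetical and do not mirror Consul's API.

```python
from typing import Dict, List


class Registry:
    """Tracks service instances and returns only the healthy ones on lookup."""

    def __init__(self) -> None:
        # service name -> {address: healthy?}
        self.nodes: Dict[str, Dict[str, bool]] = {}

    def register(self, service: str, address: str) -> None:
        self.nodes.setdefault(service, {})[address] = True

    def set_health(self, service: str, address: str, healthy: bool) -> None:
        self.nodes[service][address] = healthy

    def resolve(self, service: str) -> List[str]:
        instances = self.nodes.get(service, {})
        return sorted(addr for addr, ok in instances.items() if ok)


registry = Registry()
for addr in ("10.0.1.4", "10.0.1.5", "10.0.1.6"):
    registry.register("web", addr)

before = registry.resolve("web")
registry.set_health("web", "10.0.1.4", False)  # health check starts failing
during = registry.resolve("web")
registry.set_health("web", "10.0.1.4", True)   # node recovers
after = registry.resolve("web")
```

Because the health result feeds the discovery answer directly, clients stop sending traffic to the failing node without anyone reading a dashboard first.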

Slide 50

Slide 50 text

CONSUL MONITORING + ALERTING
Via Atlas (HashiCorp paid offering)
[Diagram: WEB 1 ... WEB N, CONSUL, ATLAS, and the alert channels EMAIL, SLACK, PAGERDUTY]

Slide 51

Slide 51 text

CONSUL MONITORING + ALERTING
Atlas UI

Slide 52

Slide 52 text

CONSUL MONITORING + ALERTING
History

Slide 53

Slide 53 text

CONSUL MONITORING + ALERTING
Email, PagerDuty, Slack, etc.

Slide 54

Slide 54 text

CHALLENGE #4
SERVICE RESILIENCY VIA DISTRIBUTED LOCKING

Slide 55

Slide 55 text

CONSUL DISTRIBUTED LOCKING
Ensure “at most N” tasks
[Diagram: CONSUL with WEB 1, WEB 2, and WEB LEADER]

Slide 56

Slide 56 text

CONSUL DISTRIBUTED LOCKING
Ensure “at most N” tasks
[Diagram: CONSUL with WEB 1, WEB 2, and WEB LEADER]

Slide 57

Slide 57 text

CONSUL DISTRIBUTED LOCKING
Ensure “at most N” tasks
[Diagram: CONSUL with WEB 1, WEB LEADER, and WEB N; the leader role has moved to a different node]

Slide 58

Slide 58 text

API or CLI
CLI Example

    $ consul lock locks/ ./configure-f5

Slide 59

Slide 59 text

DISTRIBUTED LOCKS
• Building block for distributed systems
• Complexity hidden from downstream applications, like a mutex in a standard library
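The “at most N” guarantee can be shown with an in-process stand-in. This sketch uses a Python `threading.BoundedSemaphore` inside one process; it is an assumption-laden analogy for the semantics only and says nothing about how Consul implements distributed locks across machines. `AtMostN` and `hold` are hypothetical names.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, List


class AtMostN:
    """Allows at most n holders at a time; extra callers are told they did not get a slot."""

    def __init__(self, n: int) -> None:
        self._slots = threading.BoundedSemaphore(n)

    @contextmanager
    def hold(self) -> Iterator[bool]:
        # Non-blocking attempt: True if a slot was free, False otherwise.
        acquired = self._slots.acquire(blocking=False)
        try:
            yield acquired
        finally:
            if acquired:
                self._slots.release()


# With n=1 this behaves like leader election: one holder at a time.
lock = AtMostN(1)
results: List[bool] = []

with lock.hold() as is_leader:
    results.append(is_leader)       # first caller gets the slot
    with lock.hold() as is_second:
        results.append(is_second)   # slot is taken, so this caller is refused

with lock.hold() as is_next:
    results.append(is_next)         # slot was released, a new holder takes over
```

The caller only sees acquire and release, which is the sense in which the complexity stays hidden from downstream applications.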

Slide 60

Slide 60 text

CHALLENGE #5
SERVICE ORCHESTRATION VIA EVENTS, WATCHES

Slide 61

Slide 61 text

CONSUL EVENTS
Edge-triggered, sent to all nodes, extremely cheap
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    consul event deploy

Slide 62

Slide 62 text

CONSUL WATCH
Waits for specific events, executes script
[Diagram: CONSUL with WEB 1, WEB 2, ... WEB N]

    consul event deploy

Slide 63

Slide 63 text

CONSUL EXEC
Runs script on specific nodes
[Diagram: CONSUL with WEB 1, WEB 2, and DATABASE]

    consul exec -service="web" ./script.sh

Slide 64

Slide 64 text

Consul Watch
Wait for events, do something

    $ consul watch -type=event -name=deploy ./deploy.sh
    …
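The event and watch pairing is essentially publish/subscribe keyed by event name. A minimal in-process sketch, assuming handlers are plain functions rather than scripts; `EventBus`, `watch`, and `fire` are hypothetical names and not Consul's API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class EventBus:
    """Delivers a named event to every handler watching that name."""

    def __init__(self) -> None:
        self.watchers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def watch(self, name: str, handler: Handler) -> None:
        self.watchers[name].append(handler)

    def fire(self, name: str, payload: str = "") -> int:
        handlers = self.watchers.get(name, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)  # number of watchers that were notified


bus = EventBus()
log: List[str] = []

# Each node registers a handler for the "deploy" event, standing in for ./deploy.sh.
for node in ("web-1", "web-2", "web-n"):
    bus.watch("deploy", lambda payload, node=node: log.append(f"{node} deploys {payload}"))

notified = bus.fire("deploy", "v42")
ignored = bus.fire("reindex")  # nobody watches this name
```

Nodes do nothing until an event arrives, which is what keeps the mechanism cheap compared with polling for work.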

Slide 65

Slide 65 text

USE CASES
• Deploys
• Operational tasks
• Configure external services

Slide 66

Slide 66 text

CHALLENGE #6
DETERMINISTIC LARGE-SCALE INFRASTRUCTURE CHANGE

Slide 67

Slide 67 text

LARGE SCALE INFRA UPDATE
• Unexpected inter-dependencies
• Cross-cloud changes
• Ordering for minimal disruption
• Expected time for complete rollout

Slide 68

Slide 68 text

Terraform Plan
What are you going to do?

    + digitalocean_droplet.web
        backups:              "" => ""
        image:                "" => "centos-5-8-x32"
        ipv4_address:         "" => ""
        ipv4_address_private: "" => ""
        name:                 "" => "tf-web"
        private_networking:   "" => ""
        region:               "" => "sfo1"
        size:                 "" => "512mb"
        status:               "" => ""

    + dnsimple_record.hello
        domain:    "" => "example.com"
        domain_id: "" => ""
        hostname:  "" => ""
        name:      "" => "test"
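Conceptually, a plan is a diff between the current state and the desired configuration, computed before anything is changed. The sketch below is a simplified illustration of that idea, not Terraform's algorithm; the `plan` function and the dictionary-based state are assumptions made for the example, with `+`, `-`, and `~` marking create, destroy, and change.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, Dict[str, str]]


def plan(current: Resources, desired: Resources) -> List[Tuple[str, str]]:
    """Compare current and desired resources and list what would change."""
    changes: List[Tuple[str, str]] = []
    for name in sorted(desired.keys() - current.keys()):
        changes.append(("+", name))  # exists only in desired: create
    for name in sorted(current.keys() - desired.keys()):
        changes.append(("-", name))  # exists only in current: destroy
    for name in sorted(current.keys() & desired.keys()):
        if current[name] != desired[name]:
            changes.append(("~", name))  # exists in both but differs: update
    return changes


desired: Resources = {
    "digitalocean_droplet.web": {"name": "tf-web", "region": "sfo1", "size": "512mb"},
    "dnsimple_record.hello": {"domain": "example.com", "name": "test"},
}

initial_plan = plan(current={}, desired=desired)       # nothing exists yet
converged_plan = plan(current=desired, desired=desired)  # already applied
```

Reviewing the list before applying it is what gives the "understand the full effect of a change" property from the determinism slide.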

Slide 69

Slide 69 text

Terraform Graph
What order are you going to do things?

    $ terraform graph
    …
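The ordering question is a topological sort over the resource dependency graph. A minimal sketch using Python's standard `graphlib` module (Python 3.9+); the dependency of the DNS record on the droplet is an assumption made for this example, and `creation_order` is a hypothetical helper, not part of Terraform.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def creation_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return resources in an order where every dependency comes before its dependents."""
    return list(TopologicalSorter(dependencies).static_order())


# Each resource maps to the set of resources it depends on.
# Assumed for illustration: the DNS record needs the droplet's address.
dependencies: Dict[str, Set[str]] = {
    "digitalocean_droplet.web": set(),
    "dnsimple_record.hello": {"digitalocean_droplet.web"},
}
order = creation_order(dependencies)

# A dependency cycle has no valid order and is reported instead of guessed at.
try:
    creation_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Resources with no path between them in the graph have no ordering constraint, which is why the ordering is predictable without being strictly sequential.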

Slide 70

Slide 70 text

CHALLENGE #7
DELEGATING OPS TO MULTIPLE TEAMS

Slide 71

Slide 71 text

OPS DELEGATION
• “Core” operations teams
• Application operations teams
• Eliminate shadow ops
• Safely make changes without negatively affecting others
• Share operations knowledge

Slide 72

Slide 72 text

Modules
Unit of knowledge sharing

    module "consul" {
      source  = "github.com/hashicorp/consul/terraform/aws"
      servers = 5
    }

    output "consul-address" {
      value = "${module.consul.server_address}"
    }

Slide 73

Slide 73 text

Remote State
Unit of resource sharing

    resource "terraform_remote_state" "consul" {
      backend = "atlas"
      config {
        path = "hashicorp/consul-prod"
      }
    }

    output "consul-address" {
      value = "${terraform_remote_state.consul.addr}"
    }

Slide 74

Slide 74 text

CHALLENGE #8
SERVICE COMPOSITION, INFRASTRUCTURE ORCHESTRATION

Slide 75

Slide 75 text

SERVICE COMPOSITION
• Modern infrastructures are almost always “multi-provider”: DNS in CloudFlare, compute in AWS, etc.
• Infrastructure change requires composing data from multiple services, executing change in multiple services

Slide 76

Slide 76 text

Service Composition
Connecting multiple service providers

    resource "aws_instance" "web" {
      # …
    }

    resource "cloudflare_record" "www" {
      domain = "foo.com"
      name   = "www"
      value  = "${aws_instance.web.private_ip}"
      type   = "A"
    }

Slide 77

Slide 77 text

Logical Resources
Now you’re thinking in graphs

    resource "template_file" "data" {
      filename = "data.tpl"
      vars {
        address = "${var.addr}"
      }
    }

    resource "aws_instance" "web" {
      user_data = "${template_file.data.rendered}"
    }

    resource "cloudflare_record" "www" {
      domain = "foo.com"
      name   = "www"
      value  = "${aws_instance.web.private_ip}"
      type   = "A"
    }

Slide 78

Slide 78 text

CHALLENGE #9
HISTORY OF INFRASTRUCTURE CHANGE

Slide 79

Slide 79 text

HISTORY OF INFRASTRUCTURE CHANGE
Via Atlas (HashiCorp paid offering)

Slide 80

Slide 80 text

HISTORY OF INFRA CHANGE
• See who did what, when, and how
• See what changed recently to diagnose some monitoring event
• Treat infrastructure as a sort of application

Slide 81

Slide 81 text

CHALLENGE #10
INFRASTRUCTURE COLLABORATION

Slide 82

Slide 82 text

INFRA COLLABORATION
• Achieve application-like collaboration with infrastructure change
• Code reviews, safe merges
• Understanding the effect of infrastructure changes

Slide 83

Slide 83 text

INFRASTRUCTURE COLLABORATION
Approve/deny plans, similar to pull requests, but for infra

Slide 84

Slide 84 text

Thanks!
QUESTIONS?