HashiConf 2016 - Consul @ Target

Danny Parker
September 08, 2016

Slides from Target's talk at HashiConf 2016.


Transcript

  1. Consul @ Target

  2. About Target • 2 Enterprise Data Centers • 38 Distribution Centers • 1795 Stores • ~$70 billion revenue • target.com / mobile • $1 billion investment
  3. Target’s Engineering Journey • DevOps: Product-driven teams • Focus on in-house engineering talent • On track to hire 1000 engineers this year
  4. 2014 - Guest Facing Cloud POC • Small, low risk POC Application • How fast can we rebuild and deploy a public facing app? • Prove the value of speed to the business • IaaS templates + Chef cookbooks
  5. Cloud Architecture • 2 Regions x 2 Environments – target.com applications • Shared Platform Infrastructure - Product teams deploy at will • The cloud platform manages: • Deployment pipeline • Service discovery • Log aggregation • Metrics collection • Data delivery • [Diagram labels: Consul, Log Aggregation, Metrics, C* / Kafka, VPN, Applications x4]
  6. Consul POC • Compelling Features • Service registry • Service health checks • Real-time updates • Installed it over a weekend and started integrating it into our demo • 4x faster deploy time • Parallel deployment of all VMs • Chef runs no longer had issues with dependencies • Health checks / consul-template (HAProxy demo) • A++++ Would recommend • Started integrating it everywhere
  7. Monitoring Stack • Sensu / Graphite for alerts and metrics • Consul DNS on all servers on the platform: • monitoring-rabbitmq.service.consul:4567 • servicewatch: service sensu-client restart • graphite.service.consul:2003 • Result: • Rebuild the monitoring stack 10-20x a day to rapidly iterate • All clients automatically reconnect, no noticeable impact
  8. Old Internal Architecture • Started 4 years ago with Chef and 30 VMs • Chef (and lots of Chef search) configured: • Nginx • API -> API traffic • Logging/Monitoring (Sensu/ELK) • Database (Cassandra)
  9. Current Internal Architecture • 100+ APIs, microservices architecture • 3k+ instances • Nginx • consul-template for upstream • Consul KV for specific API settings • Microservices • One service might talk to many (aggregators) • Local HAProxy + consul-template • Consul has become the source of truth • [Diagram labels: Nginx/Varnish, Fastly, API, Consul, Cassandra]
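The "consul-template for upstream" idea on slide 9 can be illustrated with a small sketch. This is not consul-template itself and not Target's template; it only shows the kind of output such a template produces: an nginx `upstream` block regenerated from the current list of healthy instances. The service name and addresses are hypothetical.

```python
from typing import List, Tuple


def render_upstream(name: str, instances: List[Tuple[str, int]]) -> str:
    """Render an nginx `upstream` block from (address, port) pairs."""
    if not instances:
        # nginx rejects an upstream block with no servers, so fail loudly
        # instead of writing a config that would break a reload.
        raise ValueError(f"no healthy instances for upstream {name!r}")
    lines = [f"upstream {name} {{"]
    for address, port in sorted(instances):
        lines.append(f"    server {address}:{port};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical healthy instances, as a service catalog lookup might return them.
healthy = [("10.0.0.12", 8080), ("10.0.0.11", 8080)]
print(render_upstream("cart_api", healthy))
```

Sorting the instances keeps the rendered file stable between runs, so a reload is only needed when membership actually changes.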
  10. Preparing our APIs • Better health checks • /health • up/down • database connections • disk • Be careful - you could DDoS yourself with Consul • Consistency • Same ports, config locations, etc • Our APIs should be like “legos”
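A minimal sketch of the `/health` idea on slide 10, under stated assumptions: the class name, the check names, and the caching interval are all hypothetical. Caching the result for a few seconds is one possible way to avoid the "DDoS yourself" problem, where frequent checks from many agents hammer a shared dependency such as a database; the talk does not say which mitigation was used. A non-2xx status (other than 429) is what Consul's HTTP check treats as critical.

```python
import json
import shutil
import time
from typing import Callable, Dict, Optional, Tuple


def disk_ok(path: str = "/", min_free_fraction: float = 0.05) -> bool:
    """Return True if at least `min_free_fraction` of the disk is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction


class HealthEndpoint:
    """Computes the /health response and caches it for `ttl_seconds`."""

    def __init__(
        self,
        checks: Dict[str, Callable[[], bool]],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._checks = checks
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[Tuple[int, str]] = None
        self._cached_at = 0.0

    def respond(self) -> Tuple[int, str]:
        """Return (HTTP status, JSON body), re-running checks only after the TTL."""
        now = self._clock()
        if self._cached is None or now - self._cached_at >= self._ttl:
            results = {}
            for name, check in self._checks.items():
                try:
                    results[name] = bool(check())
                except Exception:
                    # A check that raises counts as failed.
                    results[name] = False
            healthy = all(results.values())
            body = json.dumps(
                {"status": "up" if healthy else "down", "checks": results},
                sort_keys=True,
            )
            self._cached = (200 if healthy else 503, body)
            self._cached_at = now
        return self._cached


# The database check is a placeholder; a real one would test a pooled connection.
endpoint = HealthEndpoint({"database": lambda: True, "disk": disk_ok})
status_code, body = endpoint.respond()
```

The `respond` method would be called from whatever web framework serves `/health`; it is kept framework-free here so the caching logic is visible.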
  11. Data Platform • Cassandra/ELK/Kafka • Provide teams with a Consul DNS name • We can alter the underlying nodes • No more Chef search or static code • Cassandra seed discovery • Ask Consul KV if the cluster exists • If not, you are the seed • Be careful with heavy health checks • nodetool status can take a long time
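The seed-discovery rule on slide 11 ("ask Consul KV if the cluster exists; if not, you are the seed") can be sketched as below. This is an illustration, not Target's implementation: `InMemoryKV` is a stand-in rather than a Consul client, and the key layout is hypothetical. The sketch uses an atomic put-if-absent so that two nodes booting at once cannot both become the seed; in Consul's HTTP KV API, a PUT with `cas=0` offers that semantics.

```python
from typing import Dict, Optional


class InMemoryKV:
    """Stand-in for a key/value store; not a Consul client."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put_if_absent(self, key: str, value: str) -> bool:
        """Store the value only if the key does not exist; report success."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def choose_seed(kv: InMemoryKV, cluster: str, my_address: str) -> str:
    """Return the seed address for `cluster`, claiming the role if it is free."""
    key = f"cassandra/{cluster}/seed"  # hypothetical key layout
    if kv.put_if_absent(key, my_address):
        return my_address  # the cluster did not exist: this node is the seed
    seed = kv.get(key)
    if seed is None:
        raise RuntimeError(f"seed key {key!r} vanished between put and get")
    return seed


kv = InMemoryKV()
first = choose_seed(kv, "orders", "10.0.0.5")   # first node claims the seed role
second = choose_seed(kv, "orders", "10.0.0.6")  # later nodes learn the seed
```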
  12. Deployment Statistics • Prod • 3600+ Consul agents • 450+ services • Non-Prod • 4000 Consul agents • 500 services • Every server deployed takes advantage of Consul
  13. “Immutable” Cassandra • Config files generated from Consul KV • cassandra-env.sh • cassandra.yaml • KV path from user data • Allows a generic image to be very flexible • Speeds build time • Reduces errors • ~100 node cluster
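As a rough illustration of slide 13, the sketch below renders a fragment of `cassandra.yaml` from key/value pairs found under a prefix, where the prefix plays the role of the "KV path from user data". A plain dict stands in for Consul KV, and the key layout and the three settings chosen are assumptions for the example; the talk does not show the actual templates.

```python
from string import Template
from typing import Dict

# Only three real cassandra.yaml settings are templated, to keep the sketch short.
CASSANDRA_YAML = Template(
    "cluster_name: '$cluster_name'\n"
    "num_tokens: $num_tokens\n"
    "listen_address: $listen_address\n"
)


def settings_under(kv: Dict[str, str], prefix: str) -> Dict[str, str]:
    """Return the entries below `prefix`, with the prefix stripped from each key."""
    prefix = prefix.rstrip("/") + "/"
    return {k[len(prefix):]: v for k, v in kv.items() if k.startswith(prefix)}


def render_cassandra_yaml(kv: Dict[str, str], prefix: str) -> str:
    """Fill the template; a missing setting raises KeyError instead of rendering a gap."""
    return CASSANDRA_YAML.substitute(settings_under(kv, prefix))


# Hypothetical KV contents; the prefix would arrive via instance user data.
kv = {
    "cassandra/orders/cluster_name": "orders",
    "cassandra/orders/num_tokens": "256",
    "cassandra/orders/listen_address": "10.0.0.5",
}
print(render_cassandra_yaml(kv, "cassandra/orders"))
```

Because the image only needs the prefix at boot, the same generic image can join any cluster whose settings exist in the store.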
  14. Consul Certificate Signing • Utilizes Consul KV • Security-managed server runs a Consul key watch • Consul clients post a CSR to a particular KV path • Server responds with a signed cert in a different KV path • Scriptable certificate signing • Valid certificates signed by internal PKI, security approved
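The request/response flow on slide 14 can be sketched with a plain dict standing in for Consul KV. Everything named here is hypothetical: the two path prefixes, the function names, and especially the `sign` callable, which in the real setup would be the internal PKI and here is only a stub passed in by the caller. The sketch shows the shape of the exchange, not the cryptography.

```python
from typing import Callable, Dict, Optional

CSR_PREFIX = "pki/csr/"    # hypothetical path where clients post requests
CERT_PREFIX = "pki/cert/"  # hypothetical path where the signer answers


def submit_csr(kv: Dict[str, str], node: str, csr_pem: str) -> None:
    """Client side: post a CSR under the request path."""
    kv[CSR_PREFIX + node] = csr_pem


def process_pending(kv: Dict[str, str], sign: Callable[[str], str]) -> int:
    """Signer side: what a key watch handler might do when the prefix changes."""
    pending = sorted(k for k in kv if k.startswith(CSR_PREFIX))
    for key in pending:
        node = key[len(CSR_PREFIX):]
        kv[CERT_PREFIX + node] = sign(kv[key])
        del kv[key]  # remove the request so it is not signed twice
    return len(pending)


def fetch_cert(kv: Dict[str, str], node: str) -> Optional[str]:
    """Client side: read the signed certificate, or None if not ready yet."""
    return kv.get(CERT_PREFIX + node)


kv: Dict[str, str] = {}
submit_csr(kv, "web-01", "<csr pem>")
process_pending(kv, sign=lambda csr: "<cert for " + csr + ">")  # stub signer
cert = fetch_cert(kv, "web-01")
```

In a real deployment, ACLs on the two paths would matter as much as the flow itself, since anyone able to write under the certificate path could impersonate the signer.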
  15. Network Considerations • Nodes that couldn’t satisfy full mesh • Internal • caused 500mb/s+ network traffic • disabled the Consul agent • Cloud • Lots of flapping services due to Raft protocol and network routing • Result: Make all private addresses routable to all others
  16. Consul DNS • Dnsmasq • resolv.conf points to dnsmasq • Dnsmasq directs consul zones to consul cluster • <name>.service.consul • Allow separate consul-template hosts file • Supports de-duplication
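To make the naming and routing on slide 16 concrete, here is a small sketch. The first function builds names in Consul's standard service lookup form, `[tag.]<name>.service.<domain>`. The second mimics the decision dnsmasq makes when it forwards only the consul zone to the Consul cluster; it is an illustration of the rule, not dnsmasq configuration, and the function names are made up for the example.

```python
from typing import Optional


def service_dns_name(service: str, tag: Optional[str] = None, domain: str = "consul") -> str:
    """Build a Consul service lookup name, optionally narrowed by a tag."""
    labels = [tag] if tag else []
    labels += [service, "service", domain]
    return ".".join(labels)


def goes_to_consul(hostname: str, domain: str = "consul") -> bool:
    """Return True if a query for `hostname` belongs to the consul zone."""
    name = hostname.rstrip(".").lower()  # DNS names are case-insensitive
    return name == domain or name.endswith("." + domain)


name = service_dns_name("graphite")
for host in (name, "target.com"):
    target = "consul cluster" if goes_to_consul(host) else "default resolver"
    print(f"{host} -> {target}")
```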
  17. Immutable Image Pipeline • Bake all consul-template templates into an image - immutable / autoscalable deployment • Envconsul => set environment variables at runtime and drive all config • Push Consul KV to clusters automatically through Jenkins and GitHub • Vault integration - trustme api • trustme provides a Consul token with the proper ACL based on tags / environment variables • Cloud and environment agnostic
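The envconsul bullet on slide 17 amounts to turning KV entries under a prefix into environment variables before the application starts. The sketch below shows that mapping with a dict standing in for Consul KV. The normalization rule (non-alphanumerics become underscores, names are upper-cased) is an assumption chosen for the example and is not claimed to match envconsul's defaults; the key names are hypothetical.

```python
import re
from typing import Dict


def kv_to_env(kv: Dict[str, str], prefix: str) -> Dict[str, str]:
    """Map keys below `prefix` to environment-variable style names."""
    prefix = prefix.rstrip("/") + "/"
    env = {}
    for key in sorted(kv):
        if not key.startswith(prefix):
            continue
        # Assumed rule: replace anything outside [A-Za-z0-9] and upper-case.
        name = re.sub(r"[^A-Za-z0-9]", "_", key[len(prefix):]).upper()
        env[name] = kv[key]
    return env


# Hypothetical settings for one API in one environment.
kv = {
    "apps/cart/db-host": "cassandra.service.consul",
    "apps/cart/pool/size": "10",
    "apps/search/db-host": "elk.service.consul",
}
child_env = kv_to_env(kv, "apps/cart")
```

The resulting dict could be merged over `os.environ` and passed as `env=` to `subprocess.run`, which keeps the image itself free of environment-specific config.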
  18. Lessons Learned • Work on your elevator pitch • Use Consul everywhere • Know what to do if you lose your masters • Focus on fundamentals