It's time for us to move: The story of migrating Hosted Chef to AWS

Hosted Chef is one of the biggest Chef installations there is, with tens of thousands of organizations managing hundreds of thousands of Chef clients. By 2015, Hosted Chef had been growing exponentially for several years, and it was quickly outgrowing its home. It was time for a change, and so last October we migrated Hosted Chef from its original data center into AWS. As if the migration of a large production service wasn't enough, we were using an aging code base with practices and procedures that were years old, with references to CouchDB and workarounds from Chef 0.9! It was time to modernize all of our cookbooks, start using modern features, and generally rewrite everything at the same time. This talk is the story of that migration, the decisions we made, the challenges we faced, and the spectacular results. I'll cover what worked and what didn't go so well, and along the way I'll share some critical insights that will be useful to anyone running a large Chef installation in a cloud environment such as AWS.

Mark Harrison

July 13, 2016

Transcript

  1. IT'S TIME FOR US TO MOVE: THE STORY OF MIGRATING HOSTED CHEF TO AWS

    Mark Harrison (@mivok), Senior Systems Administrator, CHEF
  2. HOSTED CHEF

    - Chef's Software as a Service product for Chef Server
    - The largest Chef installation in the world
    - ~100,000 orgs
    - ~400,000 nodes
    - Very spiky load profile
  3. "NORMAL" WEBSITE

  4. HOSTED CHEF

  5. HOSTED CIRCA 2010

  6. HOSTED CIRCA 2011

  7. WHY AWS?

    - Outgrowing our current infrastructure
    - Cost
    - Flexibility: not being locked into contracts
    - Ability to scale up as needed
  8. DEV TIME

    - Developers were spending lots of time supporting hosted
    - Hosted looked nothing like an on-prem install
    - The hosted chef cookbooks were showing their age
  9. NORMAL ON-PREM CHEF INSTALL

  10. OLD HOSTED

  11. DEV TIME

    Devs were spending time on:

    - Troubleshooting
    - Deploys
    - Upgrades/migrations
    - Hosted-specific code
  12. NEW HOSTED

  13. THE MIGRATION

  14. TASKS

    - Get things running in AWS
    - Migrate data
    - Monitor everything

  15. SETTING THINGS UP IN AWS

    $ sudo dpkg -i chef-server-core_12.X.Y-1.deb
    $ sudo chef-server-ctl reconfigure
    $ sudo chef-server-ctl start
    $ drink scotch
  16. SPLIT UP BACKEND SERVICES

  17. DATABASE - USE RDS

  18. SEARCH

    - 3 main components (*)
      - Solr
      - RabbitMQ
      - Chef-Expander
    - Keep these together on the backend machine

    * or just Elasticsearch with newer Chef versions
  19. REDIS/ELASTICACHE

  20. FRONTENDS

    Frontends run a few different services:

    - Erchef (chef API)
    - Bifrost (authorization)
    - Oc-id (external auth)
    - Reporting frontend
    - Manage (web UI)
    - Nginx (tying everything together)
  21. OTHER COMPONENTS

    - Cookbook storage
    - ELB
    - Support box

  22. PROVISIONING

  23. AUTOSCALING GROUPS

  24. AMIS

    - ASGs spin up instances based on an image (AMI)
    - We use packer to build the AMIs
    - Build images on every deploy
    - 90% of setup is done in the image
    - Per instance config done after boot with chef
  25. INSTANCE CONFIGURATION WITH CHEF

    - After boot, machines register with chef
    - Validation key method is used
    - We use policyfiles to manage config
    - Named run lists for build time vs run time config
  26. TERRAFORM

    - Manages all AWS resources
    - Makes provisioning easy, lets you recreate infrastructure
    - Same terraform config for each environment
  27. EXAMPLE

  28. BLUE/GREEN DEPLOYS

    - Rebuild image in packer
    - Bring up new instances in non-live ASG
    - Verify things are working
    - Swap out ASGs in load balancer
  29. AUTOSCALING GROUPS... AGAIN

    - Great for stateless services (frontends)
    - Less great for stateful services
    - Still usable, but you need to take more care
    - Still usable for single machines
  30. THE MIGRATION PROCESS

  31. MIGRATION TASKS

    - Move databases
      - chef server
      - chef reporting data
    - Move search/solr content
    - Flip DNS
  32. DATABASE MIGRATION - CHEF SERVER

  33. DATABASE MIGRATION - CHEF SERVER

    - Chef server DB is small (<100GB)
    - Couldn't do streaming replication to RDS
    - Had to transfer the whole db
    - To minimize transfer time, we set up temporary instances in AWS and set up replication to them
  34. DATABASE MIGRATION - REPORTING

  35. DATABASE MIGRATION - REPORTING

    - Reporting database is much bigger (~2-3TB)
    - Projected growth was more than RDS supported
    - We used normal instances running postgres
    - We could use streaming replication
  36. DATABASE MIGRATION - REPORTING

    - First we needed a base backup
    - Old data center had a gigabit connection
    - At gigabit speeds, 1TB takes ~3 hours
    - Things aren't perfect, so say 1 day for the full transfer
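The estimate on this slide can be checked with a little arithmetic. The sketch below is an illustration, not something from the talk: `transfer_hours` is a hypothetical helper, and it assumes decimal units (1 TB = 1,000,000 MB) and a fully used gigabit link (125 MB/s). At ideal line rate 1 TB comes out at roughly 2.2 hours, so the slide's ~3 hours leaves some room for overhead.

```shell
# transfer_hours SIZE_GB RATE_MB_PER_S
# Hypothetical helper: prints the transfer time in hours (one decimal place),
# assuming decimal units and a constant transfer rate.
transfer_hours() {
  awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / rate / 3600 }'
}

transfer_hours 1000 125   # 1 TB at gigabit line rate   -> 2.2
transfer_hours 2500 125   # 2.5 TB at gigabit line rate -> 5.6
```

Even for the whole ~2-3TB reporting database, the ideal figure is a matter of hours, which is why a one-day budget looked generous.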
  37. 3 DAYS LATER...

    - Latency transferring half way across the US
    - Normal hosted traffic ate 400-500Mb/s
    - Reporting DB server disks were saturated
    - Actual transfer rates were more like 10MB/s
    - Rsync died regularly
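At 10MB/s, 1TB takes about 28 hours, so a transfer of this size has to survive interruptions. The talk does not say how the rsync failures were handled; one common approach, sketched here as an assumption, is to wrap the copy in a retry loop and let rsync keep partial files so each attempt resumes instead of starting over. The `retry` helper, the host name, and the paths are all hypothetical.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example invocation (placeholder host and paths, left commented out):
# --partial keeps partially transferred files; --timeout aborts a stalled transfer.
# retry 50 30 rsync -a --partial --timeout=300 \
#   /path/to/basebackup/ new-reporting-db.example.com:/path/to/basebackup/
```

The loop only helps with transient failures; it does nothing about the underlying bandwidth and disk contention the slide describes.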
  38. MIGRATING DATA - REPORTING Image source: xkcd.com/612

  39. MIGRATING DATA - SOLR

  40. MIGRATING DATA - REDIS

  41. MIGRATION DAY - EARLIER IN THE DAY

    - Lower DNS TTLs to 60s
    - Do a test run
    - Make final config changes from the test domain to the real one
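Lowering the TTL ahead of time matters because resolvers may keep serving the old address for up to the previous TTL after a record changes. A minimal sketch for confirming the lower TTL is live before cutover: `max_ttl` is a hypothetical helper that reads `dig +noall +answer` output, and the host and name server names are placeholders.

```shell
# max_ttl: reads `dig +noall +answer` output on stdin and prints the largest TTL
# (the second column of each answer line). Prints 0 if there are no answers.
max_ttl() {
  awk '$2 ~ /^[0-9]+$/ && $2 > max { max = $2 } END { print max + 0 }'
}

# Example invocation (placeholder names, left commented out). Asking the
# authoritative server directly shows the configured TTL; a caching resolver
# would show the time remaining on its cached copy instead.
# dig +noall +answer api.example.com @ns1.example.com | max_ttl
```

If the printed value is still above 60, the old TTL has to expire from caches before a DNS flip will take effect quickly.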
  42. MIGRATION DAY

    - Update status.chef.io
    - Enable 503/maintenance mode
    - Migrate DBs and Solr
    - Test to make sure everything worked
    - Final load test
    - Flip DNS
    - Un-503 hosted
    - Update status.chef.io
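The order of these steps is the important part: DNS is only flipped after the tests pass. As an illustration only (the talk does not show any tooling for this), the sequence can be sketched as a runner that executes steps in order and stops at the first failure. Every step function below is a hypothetical placeholder that just echoes what it stands for.

```shell
# Placeholder steps; real ones would call your own tooling.
enable_maintenance()  { echo "enable 503/maintenance mode"; }
migrate_data()        { echo "migrate DBs and Solr"; }
verify_new_site()     { echo "test and load test"; }
flip_dns()            { echo "flip DNS"; }
disable_maintenance() { echo "un-503 hosted"; }

# run_cutover STEP...: runs each step in order, stopping at the first failure
# so that later steps (such as the DNS flip) never run after a failed check.
run_cutover() {
  local step
  for step in "$@"; do
    echo "==> $step"
    "$step" || { echo "step failed: $step; stopping" >&2; return 1; }
  done
}

run_cutover enable_maintenance migrate_data verify_new_site flip_dns disable_maintenance
```

Stopping before `flip_dns` leaves the old site as the one DNS points to, so a failed verification does not put users on a broken target.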
  43. LOAD TESTING

    - Home grown tool - chef swarm
    - Simulates many chef runs at once
    - Used to simulate peak load on new hosted
    - We tested against an org that we kept out of maintenance mode
  44. LOAD TESTING

  45. RESULTS

  46. RESULTS

  47. RESULTS

  48. QUESTIONS?