It's time for us to move: The story of migrating Hosted Chef to AWS

Hosted Chef is one of the biggest Chef installations there is, with tens of thousands of organizations managing hundreds of thousands of Chef clients. By 2015, Hosted Chef had been growing exponentially for several years, and it was quickly outgrowing its home. It was time for a change, and so last October we migrated Hosted Chef from its original data center into AWS. As if the migration of a large production service wasn't enough, we were using an aging code base with practices and procedures that were years old, with references to CouchDB and workarounds from Chef 0.9! It was time to modernize all of our cookbooks, start using modern features, and generally rewrite everything at the same time. This talk is the story of that migration, the decisions we made, the challenges we faced, and the spectacular results. I'll cover what worked and what didn't go so well, and along the way I'll share some critical insights that will be useful to anyone running a large Chef installation in a cloud environment such as AWS.

Mark Harrison

July 13, 2016

Transcript

  1. IT'S TIME FOR US TO MOVE: THE STORY OF MIGRATING HOSTED CHEF TO AWS

    Mark Harrison (@mivok), Senior Systems Administrator, CHEF
  2. HOSTED CHEF

    - Chef's Software as a Service product for Chef Server
    - The largest Chef installation in the world
    - ~100,000 orgs
    - ~400,000 nodes
    - Very spiky load profile
  3. WHY AWS?

    - Outgrowing our current infrastructure
    - Cost
    - Flexibility - not being locked into contracts
    - Ability to scale up as needed
  4. DEV TIME

    - Developers were spending lots of time supporting Hosted
    - Hosted looked nothing like an on-prem install
    - The Hosted Chef cookbooks were showing their age
  5. SETTING THINGS UP IN AWS

    $ sudo dpkg -i chef-server-core_12.X.Y-1.deb
    $ sudo chef-server-ctl reconfigure
    $ sudo chef-server-ctl start
    $ drink scotch
  6. SEARCH

    - 3 main components (*): Solr, RabbitMQ, Chef-Expander
    - Keep these together on the backend machine
    - (*) or just Elasticsearch with newer Chef versions
  7. FRONTENDS

    Frontends run a few different services:
    - Erchef (Chef API)
    - Bifrost (authorization)
    - Oc-id (external auth)
    - Reporting frontend
    - Manage (web UI)
    - Nginx (tying everything together)
  8. AMIS

    - ASGs spin up instances based on an image (AMI)
    - We use Packer to build the AMIs
    - Build images on every deploy
    - 90% of setup is done in the image
    - Per-instance config done after boot with Chef
  9. INSTANCE CONFIGURATION WITH CHEF

    - After boot, machines register with Chef
    - Validation key method is used
    - We use policyfiles to manage config
    - Named run lists for build time vs run time config
  10. TERRAFORM

    - Manages all AWS resources
    - Makes provisioning easy, lets you recreate infrastructure
    - Same Terraform config for each environment
  11. BLUE/GREEN DEPLOYS

    - Rebuild image in Packer
    - Bring up new instances in non-live ASG
    - Verify things are working
    - Swap out ASGs in load balancer
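The four blue/green steps on this slide can be sketched as a small in-memory model. This is only an illustration of the ordering (build, bring up, verify, then swap), not the tooling used for Hosted Chef: the `Group`, `LoadBalancer`, and `blue_green_deploy` names, the `healthy` callback, and the image ids are all hypothetical, and nothing here calls AWS.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Group:
    """Hypothetical stand-in for an autoscaling group."""
    name: str
    image: str
    instances: List[str] = field(default_factory=list)


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for a load balancer that points at one live group."""
    live: Group


def blue_green_deploy(lb: LoadBalancer, idle: Group, new_image: str,
                      size: int, healthy: Callable[[str], bool]) -> Group:
    """Deploy new_image to the idle group and swap it in only if it verifies.

    Returns the group that is idle once the deploy finishes.
    """
    # Bring up new instances in the non-live group from the rebuilt image.
    idle.image = new_image
    idle.instances = [f"{idle.name}-{i}" for i in range(size)]

    # Verify before touching live traffic.
    if not all(healthy(inst) for inst in idle.instances):
        idle.instances.clear()
        return idle  # verification failed: the live group was never changed

    # Swap the groups in the load balancer.
    previous, lb.live = lb.live, idle
    return previous


blue = Group("blue", "image-old", ["blue-0", "blue-1"])
green = Group("green", "image-old")
lb = LoadBalancer(live=blue)

spare = blue_green_deploy(lb, green, "image-new", size=2,
                          healthy=lambda inst: True)
print(f"live={lb.live.name} spare={spare.name}")
```

The point of the ordering is that a failed verification leaves the live group untouched, and the previous group is still around to swap back to.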
  12. AUTOSCALING GROUPS... AGAIN

    - Great for stateless services (frontends)
    - Less great for stateful services
    - Still usable, but you need to take more care
    - Still usable for single machines
  13. DATABASE MIGRATION - CHEF SERVER

    - Chef server DB is small (<100GB)
    - Couldn't do streaming replication to RDS
    - Had to transfer the whole DB
    - To minimize transfer time, we set up temporary instances in AWS and set up replication to them
  14. DATABASE MIGRATION - REPORTING

    - Reporting database is much bigger (~2-3TB)
    - Projected growth was more than RDS supported
    - We used normal instances running Postgres
    - We could use streaming replication
  15. DATABASE MIGRATION - REPORTING

    - First we needed a base backup
    - Old data center had Gigabit connection
    - At gigabit speeds, 1TB takes ~3 hours
    - Things aren't perfect, so say 1 day for the full transfer
  16. 3 DAYS LATER...

    - Latency transferring half way across the US
    - Normal hosted traffic ate 400-500Mb/s
    - Reporting DB server disks were saturated
    - Actual transfer rates were more like 10MB/s
    - Rsync died regularly
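The gap between the one-day estimate and the three days it actually took falls out of simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a 2.5 TB database, which is just the midpoint of the "~2-3TB" figure quoted earlier, not a number from the talk.

```python
def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a sustained transfer rate."""
    return size_bytes / rate_bytes_per_s / 3600


TB = 10**12               # decimal terabyte (assumption)
GIGABIT = 10**9 / 8       # 1 Gb/s in bytes per second, i.e. 125 MB/s
OBSERVED = 10 * 10**6     # ~10 MB/s, the observed rate from the slide
DB_SIZE = 2.5 * TB        # assumed midpoint of the ~2-3TB reporting DB

ideal_1tb_hours = transfer_hours(1 * TB, GIGABIT)
actual_days = transfer_hours(DB_SIZE, OBSERVED) / 24

print(f"1 TB at full gigabit line rate: {ideal_1tb_hours:.1f} hours")
print(f"2.5 TB at 10 MB/s: {actual_days:.1f} days")
```

At full line rate 1 TB takes a little over two hours, so "~3 hours" already includes some slack. At 10 MB/s, less than a tenth of line rate, the same database needs close to three days, before counting any rsync restarts.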
  17. MIGRATION DAY - EARLIER IN THE DAY

    - Lower DNS TTLs to 60s
    - Do a test run
    - Make final config changes from the test domain to the real one
  18. MIGRATION DAY

    - Update status.chef.io
    - Enable 503/maintenance mode
    - Migrate DBs and Solr
    - Test to make sure everything worked
    - Final load test
    - Flip DNS
    - Un-503 hosted
    - Update status.chef.io
  19. LOAD TESTING

    - Home grown tool - chef swarm
    - Simulates many chef runs at once
    - Used to simulate peak load on new hosted
    - We tested against an org that we kept out of maintenance mode
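The idea of simulating many Chef runs at once can be illustrated with a thread pool. This is not chef swarm, whose internals are not described in the talk: `fake_chef_run` and `swarm` are hypothetical names, and the run itself is a short sleep standing in for the API requests a real client would make.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_chef_run(node_id: int) -> float:
    """Stand-in for one client run; returns its duration in seconds."""
    started = time.perf_counter()
    # Placeholder for the API requests a real run would make against the server.
    time.sleep(random.uniform(0.001, 0.005))
    return time.perf_counter() - started


def swarm(node_count: int, concurrency: int) -> List[float]:
    """Run node_count simulated clients, at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fake_chef_run, range(node_count)))


durations = swarm(node_count=200, concurrency=50)
print(f"{len(durations)} runs, slowest {max(durations):.3f}s")
```

In a real load test the per-run durations and error counts would be compared against what production sees at peak. Here they only show the shape of the harness.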