DevOps Vancouver 2016: Scaling out Continuous Delivery

Continuous delivery can be a simple practice when you are a small development team with a low-scale web application. However, as teams scale up in size, commit and test volumes increase, and production architecture becomes more complex, continuous delivery becomes harder to achieve. Tools that once handled the load begin to fail, individual steps in the delivery process that were once fast turn into bottlenecks, and cultural practices and processes break down.

This talk will describe how Shopify faced these scaling challenges as the team and its application infrastructure grew. A year ago, it could take hours for a developer to go from commit to fully deployed application. In response, the team re-architected its entire delivery pipeline to handle these scaling challenges. We cover how moving to a container-based build and creating the open source Shipit deployment tool allowed the team to regain its continuous delivery cadence.

Today, every developer on a team of hundreds can deliver a change from commit to production in ten minutes. Each deploy includes execution of over 45,000 tests in a massively parallel build farm, and deploying the application to hundreds of servers in multiple data centres. Thirty or more production releases now run in a typical workday, and developer happiness has increased. We will cover the steps taken to get here, the technology choices made, and the stumbling blocks faced along the way.

John Arthorne

April 16, 2016

Transcript

  1. Key Platform Characteristics

    ~800 M daily HTTP traffic; +150% rpm flash sale traffic spikes; <100 ms storefront response time
  2. Continuous Delivery at Shopify

    Shopify has a long running culture of continuous delivery. 2013: Capistrano deploy by dedicated ops team. 2014: 3rd party CI, scalability and production fidelity a problem. 2015: Building out new generation deploy pipeline.
  3. A Pipeline Built for Speed and Scale

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s). Goals: commit to deployed in ~10 minutes; every developer can deploy to production.
  4. Why a Fast Pipeline is Important

    Less wait time for developers. Faster time to a fix for customers. Continuous uptime requires many small changes which magnifies wait times. But the #1 reason: keeping the batch size small.
  5. Batch Size vs Pipeline Speed

    160 commits merged to shopify master on a busy day. Commit every 3 minutes assuming 8 hour work day. 3 minute deploy required for smallest batch size. Builds have to keep getting faster to keep batch size down.
  6. Why Small Batches are Important

    Decreased chance of failure in a given batch. Faster time to find root cause when deploy causes problems. Forces optimization of release process. Higher chance of clean rollback. #1 reason: making developers feel invested in deploy process.
  7. Container Build

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s). Goals: commit to deployed in ~10 minutes; every developer can deploy to production.
  8. Locutus

    Locutus is a docker image building service for Shopify. Locutus receives GitHub webhooks on each push. On each push, it pulls the new source, builds a container, and pushes it to our docker registry. It has a few levels of caching to make builds faster and deploys smaller.
  9. Building Containers

    Baking 1500 cakes/day requires a lot of ovens (3.25 per minute average this week). Container layers optimized for both build time and reducing I/O on deploy. Built for commit to branch, then again for master. Duplicate build when there are pending base image changes (each change built 4 times).
  10. Testing

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s). Goals: commit to deployed in ~10 minutes; every developer can deploy to production.
  11. Buildkite

    Hosted build and test orchestration service. Test agents run in parallel on our own EC2 boxes. Agents pull tests from Redis queue. Ruby tests + browser tests run with Selenium/Chrome. 95 C4.8xlarge VMs; 1472 peak agents; 44k tests/build.
  12. Test Flow

    Flow diagram. Participants: BuildKite, GitHub, Locutus, Docker Registry, BK Agent. Step labels: Web Hook, Wait for Container, Container Ready, Start Tests, Fetch Container, Tests Done, Status Update, Tests.
  13. Scrooge

    Monitors Buildkite agents. Starts/stops EC2 nodes based on current demand. Ensures we only pay for the capacity we need. Also determines level of parallelism for a given test run. Tests are 48-192 way parallel.
  14. Deploy

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s). Goals: commit to deployed in ~10 minutes; every developer can deploy to production.
  15. Deploy Flow

    Entire deploy orchestrated using Capistrano. Revision file dictates which container to run. Containers restarted using sv-rollout / runit. Each node fetches its container from docker registry through local DC caching proxy. Containers start on each node ready to run. 40 peak deploys/day; 289 machines deployed; 1500+ production containers.
  16. ShipIt

    Open source deploy orchestration tool. Allows humans to participate in deployment process. Provides easy visibility, rollback, locking. Integrated with our Slack bot for ChatOps notifications and triggers.
  17. Deploying Containers

    Containers can make terrible deployment vehicles if not used carefully. Docker can be flaky, need to be fault tolerant, retry, use canary container. sv-rollout for container rollout, with canaries and parallel deploys. Image caching close to production machines critical.
  18. Putting it all Together: Pipeline Architecture

    Architecture diagram. Labels: GitHub, BuildKite, Locutus, Docker Registry, ShipIt, Amazon EC2, Redis Test Q, Buildkite Agents (bk1, bk2, bk95), Locutus Agents (loc1, loc2, loc7), DC’s (sb1, sb290), Docker Cache.
  19. Summary

    Maintaining a fast deploy pipeline is a challenge as a team scales up. The Shopify deploy pipeline is heavily optimized for speed to keep deploy batches small. Tests tend to scale well with a lot of parallelism and hardware thrown at them. Fast container build and deployment is a major challenge and requires careful optimization.
  20. We’re Hiring!

    70+ open positions in a wide range of disciplines. Ottawa, Montreal, Toronto, Waterloo, Remote. https://shopify.com/careers
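
The timing and batch-size figures on slides 3 and 5 can be checked with a short Python sketch. It is an illustration only, not something shown in the talk: the function names are hypothetical, and the last function assumes deploys run back to back, with each deploy shipping every commit merged while the previous one was in flight, which the slides do not state.

```python
# Stage timings in seconds, as listed on slide 3.
STAGE_SECONDS = {
    "git_push": 5,
    "container_build": 90,
    "automated_tests": 200,
    "deploy": 300,
}


def pipeline_seconds(stages):
    """Total commit-to-deployed time if the stages run one after another."""
    return sum(stages.values())


def minutes_between_commits(commits_per_day, workday_hours):
    """Average gap between merged commits over the working day."""
    return workday_hours * 60 / commits_per_day


def commits_per_deploy(pipeline_secs, commit_interval_minutes):
    """Average batch size under the ASSUMED back-to-back deploy model."""
    return (pipeline_secs / 60) / commit_interval_minutes


if __name__ == "__main__":
    total = pipeline_seconds(STAGE_SECONDS)
    gap = minutes_between_commits(commits_per_day=160, workday_hours=8)
    print(total)                                     # 595 seconds, roughly 10 minutes
    print(gap)                                       # 3.0 minutes between commits
    print(round(commits_per_deploy(total, gap), 1))  # 3.3 commits per deploy
```

Under that assumed model, a pipeline of about 10 minutes carries a little over three commits per deploy on a busy day, and only a pipeline of about 3 minutes would bring the average batch down to a single commit, which matches the slide 5 point that builds have to keep getting faster.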