DevOps Toronto 2016: Scaling out Continuous Delivery

John Arthorne

May 27, 2016

Transcript

  1. Scaling out Continuous Delivery

    DevOpsDays Toronto 2016
    John Arthorne, @jarthorne
    http://jarthorn.github.io/
  2. Shopify Makes Software for Commerce

  3. Key Platform Characteristics

    Daily HTTP traffic: ~850 M
    Flash sale traffic spikes: +150% rpm
    Storefront response time: <100 ms
  4. Continuous Delivery at Shopify

    Shopify has a long-running culture of rapid delivery
    2013: Capistrano deploy by dedicated ops team
    2014: 3rd party CI, struggling with scalability and production fidelity
    2015: Building out new generation deploy pipeline
  5. A Pipeline Built for Speed and Scale

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s)
    Goals:
    Commit to deployed in ~10 minutes
    Every developer can deploy to production
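The stage timings on this slide can be checked against the ~10 minute goal with a few lines of Python. This is only an illustrative sketch: the stage names and the pairing of each duration with a stage are read off the slide's diagram, and the helper names are invented for the example.

```python
# Stage durations in seconds, as read from the slide's pipeline diagram.
# The dictionary keys are illustrative names, not identifiers from Shopify's tooling.
STAGES = {
    "git_push": 5,
    "container_build": 90,
    "automated_tests": 200,
    "deploy": 300,
}


def total_seconds(stages):
    """Return the end-to-end pipeline time in seconds."""
    return sum(stages.values())


def share_of_total(stages):
    """Return each stage's fraction of the end-to-end time."""
    total = total_seconds(stages)
    return {name: seconds / total for name, seconds in stages.items()}


total = total_seconds(STAGES)
print(f"commit to deployed: {total}s (~{total / 60:.1f} min)")
for name, share in share_of_total(STAGES).items():
    print(f"{name}: {share:.0%}")
```

The four stages add up to 595 seconds, which is where the "~10 minutes" figure comes from. Deploy and tests together account for most of that time.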
  6. Why a Fast Pipeline is Important

    Less wait time for developers
    Faster time to a fix for customers
    Continuous uptime requires many small changes, which magnifies wait times
    But the #1 reason: keeping the batch size small
  7. Batch Size vs Pipeline Speed

    160 commits merged to shopify master on a busy day
    Commit every 3 minutes, assuming an 8-hour work day
    3-minute deploy required for smallest batch size
    Builds have to keep getting faster to keep batch size down
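The arithmetic behind this slide can be sketched as follows. The model is an assumption made for illustration, not something stated in the talk: deploys run one at a time, back to back, and every commit that lands while a deploy is running ships in the next one. The function names are hypothetical.

```python
import math


def minutes_between_commits(commits_per_day, work_hours=8):
    """Average gap between commits over the work day, in minutes."""
    return work_hours * 60 / commits_per_day


def commits_per_deploy(deploy_minutes, commit_interval_minutes):
    """Approximate batch size under the assumed serial-deploy model.

    Commits that arrive while one deploy is running are batched into the next.
    """
    return max(1, math.ceil(deploy_minutes / commit_interval_minutes))


interval = minutes_between_commits(160)
print(f"one commit every {interval:.0f} minutes")
for deploy_minutes in (3, 5, 10):
    batch = commits_per_deploy(deploy_minutes, interval)
    print(f"{deploy_minutes}-minute deploy: about {batch} commit(s) per batch")
```

Under this model a batch size of one is only reachable when a deploy takes no longer than the gap between commits, which matches the slide's 3-minute figure.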
  8. Why Small Batches are Important

    Decreased chance of failure in a given batch
    Faster time to find root cause when a deploy causes problems
    Forces optimization of release process
    Higher chance of clean rollback
    #1 reason: making developers feel invested in the deploy process
  9. Container Build

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s)
    Goals:
    Commit to deployed in ~10 minutes
    Every developer can deploy to production
  10. Locutus

    Locutus is a Docker image building service for Shopify.
    Locutus receives GitHub webhooks on each push.
    On each push, it pulls the new source, builds a container, and pushes it to our Docker registry.
    It has a few levels of caching to make builds faster and deploys smaller.
  11. Building Containers

    Building 1500 containers/day requires a lot of compute (>3 per minute during work day)
    Container layers optimized for build and deploy time
    Built for commit to branch, then again for master
  12. Testing

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s)
    Goals:
    Commit to deployed in ~10 minutes
    Every developer can deploy to production
  13. Buildkite

    Hosted build and test orchestration service
    Test agents run in parallel on our own EC2 boxes
    Agents pull tests from Redis queue
    Ruby tests + browser tests run with Selenium/Chrome
    c4.8xlarge VMs: 102
    Peak agents: 1472
    Tests/build: 45k
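The "agents pull tests from a queue" pattern on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not Shopify's implementation: an in-memory `queue.Queue` stands in for the Redis queue, threads stand in for Buildkite agents, and `run_agents` and `run_test` are hypothetical names.

```python
import queue
import threading


def run_agents(test_names, agent_count, run_test):
    """Run every test once by letting agents pull from one shared queue.

    An in-memory queue replaces Redis here, and threads replace real agents.
    """
    work = queue.Queue()
    for name in test_names:
        work.put(name)

    results = {}
    lock = threading.Lock()

    def agent(agent_id):
        # Each agent keeps pulling until the queue is drained, so faster
        # agents naturally pick up more tests than slower ones.
        while True:
            try:
                name = work.get_nowait()
            except queue.Empty:
                return
            outcome = run_test(name)
            with lock:
                results[name] = (agent_id, outcome)

    threads = [threading.Thread(target=agent, args=(i,)) for i in range(agent_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_agents([f"test_{i}" for i in range(100)], 8, lambda name: True)
print(len(results), "tests run")
```

Pulling from a queue, rather than assigning a fixed slice of tests to each agent up front, keeps all agents busy until the work runs out even when individual tests vary a lot in duration.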
  14. Buildkite Details

  15. Test Flow

    Diagram participants: Buildkite, GitHub, Locutus, Docker Registry, BK Agent
    Diagram labels: Web Hook, Wait for Container, Container Ready, Start Tests, Fetch Container, Tests Done, Status Update, Tests
  16. Scrooge

    Monitors Buildkite agents
    Starts/stops EC2 nodes based on current demand
    Ensures we only pay for the capacity we need
    Also determines level of parallelism for a given test run
    Tests are 48-192 way parallel
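The two decisions this slide attributes to Scrooge, how many nodes to run and how parallel a test run should be, can be sketched as below. Only the 48-192 range comes from the slide. The policy, the function names, and the `agents_per_node` and `max_nodes` values in the usage lines are assumptions made for illustration, since the talk does not describe Scrooge's actual logic.

```python
import math

# Parallelism bounds taken from the slide ("Tests are 48-192 way parallel").
MIN_PARALLELISM = 48
MAX_PARALLELISM = 192


def choose_parallelism(idle_agents):
    """Clamp a test run's parallelism to the 48-192 range (assumed policy)."""
    return max(MIN_PARALLELISM, min(MAX_PARALLELISM, idle_agents))


def desired_nodes(agents_needed, agents_per_node, max_nodes):
    """Number of nodes needed to host the requested agents, capped at max_nodes."""
    return min(max_nodes, math.ceil(agents_needed / agents_per_node))


def nodes_to_change(current_nodes, target_nodes):
    """Positive result: start that many nodes. Negative: stop that many."""
    return target_nodes - current_nodes


# Hypothetical numbers: 3 queued builds, 16 agents per node, 102-node ceiling.
needed = 3 * choose_parallelism(idle_agents=500)
target = desired_nodes(needed, agents_per_node=16, max_nodes=102)
print(f"agents needed: {needed}, target nodes: {target}")
print(f"change from 50 running nodes: {nodes_to_change(50, target)}")
```

Scaling down when demand drops is what keeps cost proportional to use, while the upper bound on parallelism stops a single build from claiming the whole fleet.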
  17. Deploy

    Pipeline stages: Git Push (5s), Container Build (90s), Automated Tests (200s), Deploy (300s)
    Goals:
    Commit to deployed in ~10 minutes
    Every developer can deploy to production
  18. Deploy Flow

    Deploy kicked off using Capistrano
    Revision file dictates which container to run
    Containers restarted using sv-rollout / runit
    Each node fetches its container from the Docker registry through a local DC caching proxy
    Containers start on each node ready to run
    Peak deploys/day: 40
    Machines deployed: 289
    Production containers: 1500+
  19. ShipIt

    Open source deploy orchestration tool
    Allows humans to participate in the deployment process
    Provides easy visibility, rollback, locking
    Integrated with our Slack bot for ChatOps notifications and triggers
  20. ShipIt Deploy View

  21. Deploying Containers

    Containers can make terrible deployment vehicles if not used carefully
    Docker can be flaky: deploys need to be fault tolerant, retry, and use a canary container
    sv-rollout handles container rollout, with canaries and parallel deploys
    Docker image caching close to production machines is critical
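The retry and canary ideas on this slide can be illustrated with a short sketch. It is not how sv-rollout works internally, which the talk does not describe. The `restart` and `healthy` callables, the function names, and the default values are all hypothetical, and the sketch restarts nodes one at a time for brevity instead of in parallel.

```python
import time


def with_retries(action, attempts=3, delay_seconds=0.0):
    """Call action until it succeeds or attempts (at least 1) are used up."""
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:  # flaky steps may fail in many different ways
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error


def rollout(nodes, restart, healthy, canary_count=1):
    """Restart canary nodes first and continue only if they come up healthy."""
    canaries, rest = nodes[:canary_count], nodes[canary_count:]
    for node in canaries:
        with_retries(lambda: restart(node))
        if not healthy(node):
            raise RuntimeError(f"canary {node} unhealthy, aborting rollout")
    for node in rest:
        with_retries(lambda: restart(node))
    return len(nodes)


restarted = []
count = rollout(["sb1", "sb2", "sb3"], restarted.append, lambda node: True)
print(f"rolled out to {count} nodes: {restarted}")
```

Checking the canary before touching the rest of the fleet limits a bad container to a single node, and the retry wrapper absorbs transient failures without failing the whole deploy.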
  22. Putting it all Together: Pipeline Architecture

    Diagram labels: GitHub, Buildkite, ShipIt, Locutus, Docker Registry, Docker Cache, Redis Test Q, Amazon EC2, DC’s
    Buildkite Agents: bk1, bk2, bk95
    Locutus Agents: loc1, loc2, loc7
    Other node labels: sb1, sb290
  23. Summary

    Maintaining a fast deploy pipeline is a challenge as a team scales up
    The Shopify deploy pipeline is heavily optimized for speed to keep deploy batches small
    Tests tend to scale well with a lot of parallelism and hardware thrown at them
    Fast container build and deployment is a major challenge and requires careful optimization
  24. We’re Hiring!

    50+ open positions in a wide range of disciplines
    Ottawa, Montreal, Toronto, Waterloo, Remote
    https://shopify.com/careers
  25. Questions?

    DevOpsDays Toronto 2016
    John Arthorne, @jarthorne
    http://jarthorn.github.io/