Testing Rails at Scale

Emil Stolarsky

May 04, 2016

Transcript

  1. Testing Rails at Scale, by @EmilStolarsky

  3. Shopify
    • 243,000+ shops
    • $14B+ total GMV
    • 300M+ unique visits/month
    • 1000+ employees

  4. CI Systems

  5. Scheduler, Compute

  6. Scheduler, Compute

  7. Scheduler, Compute

  8. Managed Provider

  9. Managed Provider
    • Multi-tenant
    • Closed system
    • Examples
      – CircleCI, Codeship, Hosted TravisCI

  10. Unmanaged Provider

  11. Unmanaged Provider
    • Self-hosted
    • Open system
    • Examples
      – Jenkins, TravisCI, Strider

  12. Daily CI Stats
    • 50,000+ containers booted for testing
    • 700 builds
    • 42,000+ tests per build
    • 5 min build time

  13. Shopify using a Hosted Provider
    • 20+ minute build times
    • Flakiness from resource starvation
    • Expensive

  14. A NEW HOPE

  15. Beginning of a Journey
    • Bring build times under 5 minutes
    • Restore confidence in our CI
    • Maintain current budget

  17. Diagram labels: c4.8xlarge, Webhooks, Code push, Agent Instructions

  18. Compute Cluster
    • 5.4 TB memory
    • 3240 CPU cores
    • 90 fleet size at peak
    • c4.8xlarge instance type

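
The cluster totals on the Compute Cluster slide follow from the fleet size. A quick check, assuming the per-instance shape AWS publishes for c4.8xlarge (36 vCPUs and 60 GiB of memory; these two figures are not on the slides):

```python
# Per-instance figures are an assumption taken from AWS's published
# c4.8xlarge spec; only the fleet size comes from the slide.
VCPUS_PER_INSTANCE = 36
MEMORY_GIB_PER_INSTANCE = 60
FLEET_SIZE_AT_PEAK = 90

total_cores = FLEET_SIZE_AT_PEAK * VCPUS_PER_INSTANCE
total_memory_gib = FLEET_SIZE_AT_PEAK * MEMORY_GIB_PER_INSTANCE

print(total_cores)              # 3240
print(total_memory_gib / 1000)  # 5.4 (treating 1000 GiB as roughly 1 TB)
```
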
  19. Instances
    • AWS Hosted
    • Managed with Chef
    • Memory bound
    • IO Optimizations

  20. SCROOGE

  21. Auto Scaling with Scrooge
    Diagram labels: c4.8xlarge, Capacity Requirements, Scrooge, Boot/Shutdown Nodes, c4.8xlarge

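
The slides do not show how Scrooge turns capacity requirements into boot and shutdown decisions. A minimal sketch of that kind of loop, assuming a fixed number of agents per node and a simple jobs-to-agents rule (all function names, parameters, and the sizing rule here are hypothetical, not Scrooge's actual logic):

```python
import math


def desired_node_count(queued_jobs, running_jobs, agents_per_node,
                       min_nodes=1, max_nodes=90):
    """Size the fleet so every queued or running job can have an agent.

    max_nodes defaults to the peak fleet size quoted on the slides;
    everything else is an illustrative assumption.
    """
    needed_agents = queued_jobs + running_jobs
    nodes = math.ceil(needed_agents / agents_per_node)
    return max(min_nodes, min(max_nodes, nodes))


def scaling_action(current_nodes, desired_nodes):
    """Translate the gap between current and desired size into one action."""
    delta = desired_nodes - current_nodes
    if delta > 0:
        return ("boot", delta)
    if delta < 0:
        return ("shutdown", -delta)
    return ("hold", 0)


desired = desired_node_count(queued_jobs=100, running_jobs=20, agents_per_node=10)
print(desired)                      # 12
print(scaling_action(10, desired))  # ('boot', 2)
```
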
  22. Optimizing Cost
    • AWS-specific optimizations
    • Improve utilization
    • Not one size fits all

  23. Graphing Productivity: Active Buildkite Agents

  24. Graphing Productivity: Active Buildkite Agents
    Graph annotations: ?, ?, ?

  25. Graphing Productivity: Active Buildkite Agents
    Graph annotations: ?, ?, Lunch rush #1

  26. Graphing Productivity: Active Buildkite Agents
    Graph annotations: ?, Commit + Push, Lunch rush #1

  27. Graphing Productivity: Active Buildkite Agents
    Graph annotations: Lunch rush #2, Commit + Push, Lunch rush #1

  28. Docker
    • Boot speedup
    • Test isolation
    • Distribution

  29. Building Containers with Locutus
    • Implements custom docker build API
    • Single EC2 machine
    • Forced debt repayment

  30. Test Distribution
    • Tests allocated based on container index
    • Ruby tests and browser tests are run on separate containers
    • Outliers inflated build times

  31. Artifacts
    • Artifacts are uploaded to S3 by Buildkite Agents
    • Events log into Kafka & StatsD
    • Data tools are used to identify flaky tests

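
The slides do not say how flaky tests are identified from the event data. One common heuristic, shown here as an assumption rather than the talk's method, is to flag any test that both passed and failed on the same revision. The record shape `(revision, test_name, passed)` is hypothetical:

```python
from collections import defaultdict


def find_flaky_tests(results):
    """Return tests that both passed and failed on the same revision.

    results: iterable of (revision, test_name, passed) tuples.
    """
    outcomes = defaultdict(set)
    for revision, test_name, passed in results:
        outcomes[(revision, test_name)].add(passed)
    flaky = {test for (_, test), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


events = [
    ("abc123", "test_checkout", True),
    ("abc123", "test_checkout", False),  # same code, different outcome
    ("abc123", "test_login", False),
    ("abc123", "test_login", False),     # consistently failing, not flaky
    ("def456", "test_cart", True),
]
print(find_flaky_tests(events))  # ['test_checkout']
```
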
  32. Diagram labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Pull Revision

  33. DOCKER STRIKES BACK

  34. Rebel base is under Attack
    • Shipping second provider brought confusion
    • Locutus capacity issues
    • Test times were still high

  35. Battling Confusion
    • Botched rollout
    • Instability further eroded developer confidence

  36. Clustering Locutus
    • Make it linearly scalable
    • Keep it stateless(-ish)

  37. Locutus Diagram
    Diagram labels: Worker Worker Worker Worker Worker Pool Cache Ring Coordinator Docker Registry Container push New containers

  38. Test Distribution v2
    • Loads all tests into Redis
    • Containers pull work off queue
    • No more container specialization

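
To see why a shared queue helps, the sketch below compares static index-based allocation with containers pulling from one queue. It is a simulation under assumptions: a `deque` stands in for the Redis queue, test durations are made up, and both function names are hypothetical.

```python
import heapq
from collections import deque


def static_build_time(test_durations, container_count):
    """Build time when tests are pre-assigned by container index."""
    return max(sum(test_durations[i::container_count])
               for i in range(container_count))


def queue_build_time(test_durations, container_count):
    """Build time when idle containers pull the next test off a shared queue."""
    queue = deque(test_durations)  # stand-in for the Redis queue
    idle_at = [(0.0, i) for i in range(container_count)]  # (time idle, container)
    heapq.heapify(idle_at)
    finish = 0.0
    while queue:
        now, container = heapq.heappop(idle_at)  # next container to go idle
        done = now + queue.popleft()
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, container))
    return finish


durations = [10, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # one slow outlier
print(static_build_time(durations, 2))  # 14
print(queue_build_time(durations, 2))   # 10.0
```

With the queue, the container that draws the slow test simply pulls fewer tests, so no container needs to be specialized for a particular kind of test.
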
  39. Diagram labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Code push webhook

  40. RETURN OF THE STABLE BUILD

  41. Docker
    • No one tests starting 10,000s of containers/day
    • Instability further eroded developer confidence
    • Every new version of Docker had major bugs

  42. Handling Infrastructure Failures
    • At non-trivial scale, you’re guaranteed failures
    • Swallow infrastructure failures, never test failures
    • We still see 100+ container failures a day

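
"Swallow infrastructure failures, never test failures" can be sketched as a retry wrapper that only catches one class of error. The exception classes, the retry limit, and the function name are hypothetical; the slides do not describe the actual mechanism.

```python
class InfrastructureError(Exception):
    """Hypothetical marker for CI machinery failures (e.g. a container fails to boot)."""


class FailedTests(Exception):
    """Hypothetical marker for a genuine test failure."""


def run_with_infra_retries(job, max_attempts=3):
    """Retry a job on infrastructure errors; let test failures propagate."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job()
        except InfrastructureError as error:
            last_error = error  # swallowed: the build should not go red for this
    raise last_error  # infrastructure kept failing; surface it after all attempts


attempts = []

def boot_is_unreliable():
    attempts.append(1)
    if len(attempts) < 3:
        raise InfrastructureError("container failed to boot")
    return "passed"

print(run_with_infra_retries(boot_is_unreliable))  # passed
print(len(attempts))                               # 3
```

`FailedTests` is deliberately not caught, so a real failure reaches the developer on the first attempt.
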
  43. Treating Servers as Pets
    1. Wait for reports to stream in of build issues
    2. Flag node as in maintenance
    3. Manually take node out of rotation
    4. SSH into the node and follow playbook steps to clean up disk

  44. Treating Servers as Cattle
    1. Auto detect the failures
    2. Node removes itself from rotation
    3. Node runs script to clean up disk

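
The three cattle steps can be sketched as a self-healing check that a node runs on itself. This is an illustration under assumptions: the 90% threshold, the callback names, and the final step of rejoining rotation are hypothetical and not stated on the slides.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Fraction of the disk at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def self_heal(usage_fraction, leave_rotation, cleanup_disk, join_rotation,
              threshold=0.9):
    """Detect a full disk, drain the node, clean up, and return to service.

    The callbacks are placeholders for whatever the scheduler and cleanup
    script actually expose; rejoining afterwards is an assumption.
    """
    if usage_fraction < threshold:
        return False       # 1. nothing detected
    leave_rotation()       # 2. node removes itself from rotation
    cleanup_disk()         # 3. node runs the cleanup script
    join_rotation()        # assumed: node goes back into rotation
    return True


steps = []
healed = self_heal(
    0.95,
    leave_rotation=lambda: steps.append("leave"),
    cleanup_disk=lambda: steps.append("cleanup"),
    join_rotation=lambda: steps.append("join"),
)
print(healed, steps)  # True ['leave', 'cleanup', 'join']
```

On a real node the first argument would come from `disk_usage_fraction()`.
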
  45. I love the internet.

  48. Test Distribution v3
    • Containers record the tests they ran
    • Allow flaky tests to be rerun
    • Ensure no tests are lost

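
One way to get these three properties is a queue that tracks which container holds each test, so work can be put back if a container dies and failures can be queued for another attempt. The class below is an in-memory sketch with hypothetical names; the slides do not show how the Redis-backed version is built.

```python
from collections import deque


class BuildQueue:
    """In-memory stand-in for a shared test queue with bookkeeping."""

    def __init__(self, tests):
        self.pending = deque(tests)
        self.in_progress = {}  # container id -> test it currently holds
        self.ran_by = {}       # container id -> tests it has picked up
        self.completed = {}    # test -> True (passed) or False (failed)

    def reserve(self, container_id):
        """Hand the next test to a container and record that it took it."""
        if not self.pending:
            return None
        test = self.pending.popleft()
        self.in_progress[container_id] = test
        self.ran_by.setdefault(container_id, []).append(test)
        return test

    def report(self, container_id, test, passed):
        """Record a result and release the container's reservation."""
        self.in_progress.pop(container_id, None)
        self.completed[test] = passed

    def requeue_lost(self, dead_container_id):
        """Put back the test a dead container was holding, if any."""
        test = self.in_progress.pop(dead_container_id, None)
        if test is not None:
            self.pending.append(test)
        return test

    def rerun_failures(self):
        """Queue failed tests for another attempt (callers bound the retries)."""
        failed = [t for t, passed in self.completed.items() if not passed]
        for test in failed:
            del self.completed[test]
            self.pending.append(test)
        return failed


queue = BuildQueue(["a_test.rb", "b_test.rb"])
queue.reserve("container-1")       # takes a_test.rb, then the container dies
queue.requeue_lost("container-1")  # a_test.rb goes back on the queue
print(list(queue.pending))         # ['b_test.rb', 'a_test.rb']
```
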
  49. Diagram labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Code push webhook

  50. CONCLUSION

  51. Don’t build your own CI
    • Build times <10 minutes
    • Small application

  52. Build your own CI
    • Build times >15 minutes
    • Monolithic Application
    • Parallelization Limits

  53. Lessons Learned
    • Commit 100%
    • Beware of Rabbit holes
    • Pets vs. Cattle

  54. Thanks! Follow me on Twitter: @EmilStolarsky

  55. Credits
    • Image of shipping containers: https://goo.gl/bXCn1X, https://goo.gl/cDDnYy
    • Images of Google DCs: https://goo.gl/UHVRc
    • Image of bank vault: https://goo.gl/fFN5EJ
    • Locutus: http://goo.gl/UyoJxx
    • Warehouse: https://goo.gl/5DiiR1
    • Egyptian Temple: https://goo.gl/GjbLcq
    • Star Wars: http://goo.gl/474wYG
    • Sinking container ship: http://goo.gl/U7rdR8, http://goo.gl/wlzlrm
    • Cats: http://goo.gl/9p2JXo, https://goo.gl/Ylhl60
    • Cattle: http://goo.gl/IBdXmx
    • Star Wars: http://goo.gl/LatPEj
    • Creative Commons License: https://goo.gl/sZ7V7x