
Testing Rails at Scale

Emil Stolarsky

May 04, 2016

Transcript

  2. Shopify • 243,000+ shops • $14B+ total GMV • 300M+ unique visits/month • 1,000+ employees
  3. Daily CI Stats • 50,000+ containers booted for testing • 700 builds • 42,000+ tests per build • 5 min build time
  4. Shopify Using a Hosted Provider • 20+ minute build times • Flakiness from resource starvation • Expensive
  5. Beginning of a Journey • Bring build times under 5 minutes • Restore confidence in our CI • Maintain current budget
  7. Compute Cluster • 5.4 TB memory • 3,240 CPU cores • 90 fleet size at peak • c4.8xlarge instance type
  8. Instances • AWS hosted • Managed with Chef • Memory bound • IO optimizations
  9. Building Containers with Locutus • Implements a custom docker build API • Single EC2 machine • Forced debt repayment
  10. Test Distribution • Tests allocated based on container index • Ruby tests and browser tests are run on separate containers • Outliers inflated build times
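The index-based allocation above can be sketched as a simple modulo split; the helper name and scheme here are an illustrative assumption, not Shopify's actual code. Because each container's batch is fixed up front, one slow outlier test inflates the whole build.

```ruby
# Hypothetical sketch of v1-style static allocation: each container
# runs the tests whose position matches its index modulo the total
# container count. Slow outliers land unevenly across containers.
def tests_for_container(all_tests, container_index, container_count)
  all_tests.each_with_index
           .select { |_test, i| i % container_count == container_index }
           .map(&:first)
end

tests = %w[a_test b_test c_test d_test e_test]
tests_for_container(tests, 0, 2) # container 0 of 2 gets tests 0, 2, 4
```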
  11. Artifacts • Artifacts are uploaded to S3 by Buildkite agents • Events are logged to Kafka & StatsD • Data tools are used to identify flaky tests
  12. Rebel Base Is Under Attack • Shipping a second provider brought confusion • Locutus capacity issues • Test times were still high
  13. Locutus Diagram • A coordinator distributes work across a worker pool backed by a cache ring • Workers push new containers to the Docker registry
  14. Test Distribution v2 • Loads all tests into Redis • Containers pull work off the queue • No more container specialization
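The queue model above can be sketched as follows. The real system used a Redis list; Ruby's thread-safe `Queue` stands in here so the sketch runs standalone, with threads playing the role of containers. Because every container pulls the next test as it finishes the last one, a slow test no longer drags down an entire statically assigned batch.

```ruby
# A minimal sketch of v2 queue-based distribution (Queue stands in
# for the Redis list; threads stand in for containers).
work_queue = Queue.new
%w[a_test b_test c_test d_test e_test f_test].each { |t| work_queue << t }

results = Array.new(3) { [] } # tests run, per "container"
containers = 3.times.map do |id|
  Thread.new do
    loop do
      test = begin
        work_queue.pop(true) # non-blocking; raises ThreadError when empty
      rescue ThreadError
        break # queue drained, this container is done
      end
      results[id] << test
    end
  end
end
containers.each(&:join)
```

Each test is popped exactly once, so the work splits dynamically based on how fast each container drains the queue.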
  15. Docker • No one tests starting 10,000s of containers/day • Instability further eroded developer confidence • Every new version of Docker had major bugs
  16. Handling Infrastructure Failures • At non-trivial scale, you're guaranteed failures • Swallow infrastructure failures, never test failures • We still see 100+ container failures a day
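The "swallow infrastructure failures, never test failures" rule can be sketched as distinguishing two exception classes: infrastructure errors are retried transparently, while genuine test failures always propagate. The class and method names are hypothetical, for illustration only.

```ruby
# Hypothetical failure-handling sketch: retry infrastructure errors
# (container boot, network), but let real test failures surface.
class InfrastructureError < StandardError; end
class TestFailure < StandardError; end

def run_with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue InfrastructureError
    retry if attempts < max_attempts
    raise # infrastructure is persistently broken; give up loudly
  end
  # TestFailure is deliberately NOT rescued: it always propagates.
end

# Usage: a flaky container boot succeeds on the second attempt.
boots = 0
run_with_retries do
  boots += 1
  raise InfrastructureError, "container failed to boot" if boots < 2
end
```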
  17. Treating Servers as Pets 1. Wait for reports of build issues to stream in 2. Flag node as in maintenance 3. Manually take node out of rotation 4. ssh into the node and follow playbook steps to clean up the disk
  18. Treating Servers as Cattle 1. Auto-detect the failures 2. Node removes itself from rotation 3. Node runs a script to clean up the disk
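The cattle flow above can be sketched as a self-healing check the node runs on itself; everything here (the function names, the directory-size health check, the stubbed rotation step) is an assumed illustration of the three steps, not the actual tooling.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical self-healing sketch: the node detects its own disk
# pressure, takes itself out of rotation, and cleans up unattended.
def disk_full?(path, threshold_bytes)
  Dir.glob(File.join(path, "**", "*"))
     .select { |f| File.file?(f) }
     .sum { |f| File.size(f) } > threshold_bytes
end

def self_heal(scratch_dir, threshold_bytes)
  # 1. Auto-detect the failure.
  return :healthy unless disk_full?(scratch_dir, threshold_bytes)
  # 2. Node removes itself from rotation (stubbed in this sketch).
  # 3. Node cleans up its scratch disk.
  FileUtils.rm_rf(Dir.glob(File.join(scratch_dir, "*")))
  :cleaned
end

dir = Dir.mktmpdir
File.write(File.join(dir, "old_layer.tar"), "x" * 1024)
self_heal(dir, 100) # disk over threshold, so the node cleans itself
```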
  21. Test Distribution v3 • Containers record the tests they ran • Allow flaky tests to be rerun • Ensure no tests are lost
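The v3 bookkeeping can be sketched as a ledger that tracks every dequeued test as in-flight until its container reports a result: tests stranded on a dead container get requeued, flaky failures get rerun, and nothing is silently lost. This in-memory model and its names are assumptions for illustration; the real system kept this state in Redis.

```ruby
# Hypothetical in-memory sketch of v3 test accounting.
class TestLedger
  def initialize(tests)
    @pending   = tests.dup # not yet handed out
    @in_flight = {}        # container_id => tests checked out, unreported
    @completed = {}
  end

  def checkout(container_id)
    test = @pending.shift or return nil
    (@in_flight[container_id] ||= []) << test
    test
  end

  def report(container_id, test, passed:)
    @in_flight[container_id].delete(test)
    if passed
      @completed[test] = :passed
    else
      @pending << test # requeue a possibly flaky failure for a rerun
    end
  end

  # Requeue everything a dead container had checked out.
  def requeue_dead(container_id)
    @pending.concat(@in_flight.delete(container_id) || [])
  end

  def all_accounted_for?
    @pending.empty? && @in_flight.values.all?(&:empty?)
  end
end

ledger = TestLedger.new(%w[a_test b_test])
ledger.checkout("c1")     # c1 takes a_test...
ledger.requeue_dead("c1") # ...then dies; a_test goes back on the queue
```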
  22. Don't Build Your Own CI • Build times <10 minutes • Small application
  23. Build Your Own CI • Build times >15 minutes • Monolithic application • Parallelization limits
  24. Thanks! Follow me on Twitter: @EmilStolarsky
  25. Credits
    • Image of shipping containers: https://goo.gl/bXCn1X, https://goo.gl/cDDnYy
    • Images of Google DCs: https://goo.gl/UHVRc
    • Image of bank vault: https://goo.gl/fFN5EJ
    • Locutus: http://goo.gl/UyoJxx
    • Warehouse: https://goo.gl/5DiiR1
    • Egyptian temple: https://goo.gl/GjbLcq
    • Star Wars: http://goo.gl/474wYG, http://goo.gl/LatPEj
    • Sinking container ship: http://goo.gl/U7rdR8, http://goo.gl/wlzlrm
    • Cats: http://goo.gl/9p2JXo, https://goo.gl/Ylhl60
    • Cattle: http://goo.gl/IBdXmx
    • Creative Commons license: https://goo.gl/sZ7V7x