Testing Rails at Scale


Emil Stolarsky

May 04, 2016

Transcript

Slide 2 (no text)

Slide 3: Shopify
• 243,000+ shops
• $14B+ total GMV
• 300M+ unique visits/month
• 1,000+ employees

Slide 9 (no text)

Slide 12: Daily CI Stats
• 50,000+ containers booted for testing
• 700 builds
• 42,000+ tests per build
• 5-minute build time

Slide 13: Shopify using a Hosted Provider
• 20+ minute build times
• Flakiness from resource starvation
• Expensive

Slide 15: Beginning of a Journey
• Bring build times under 5 minutes
• Restore confidence in our CI
• Maintain the current budget

Slide 16 (no text)

Slide 18: Compute Cluster
• 5.4 TB of memory
• 3,240 CPU cores
• 90 instances in the fleet at peak
• c4.8xlarge instance type

Slide 19: Instances
• AWS hosted
• Managed with Chef
• Memory bound
• I/O optimizations

Slide 29: Building Containers with Locutus
• Implements a custom Docker build API
• Single EC2 machine
• Forced debt repayment

Slide 30: Test Distribution
• Tests allocated based on container index (sketch below)
• Ruby tests and browser tests run on separate containers
• Outliers inflated build times

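A minimal sketch of what index-based allocation looks like; the environment variable names and the `rails test` invocation are assumptions for illustration, not Shopify's actual interface.

```ruby
# Index-based allocation: every container sees the same sorted file list and
# takes every Nth file. CONTAINER_INDEX / CONTAINER_COUNT are hypothetical.
index = Integer(ENV.fetch("CONTAINER_INDEX"))
total = Integer(ENV.fetch("CONTAINER_COUNT"))

test_files = Dir.glob("test/**/*_test.rb").sort
my_files   = test_files.select.with_index { |_, i| i % total == index }

# One unusually slow file (an "outlier") stretches the runtime of whichever
# container it lands on, and with it the whole build.
exec("bundle", "exec", "rails", "test", *my_files)
```
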
Slide 31: Artifacts
• Artifacts are uploaded to S3 by Buildkite agents
• Events are logged to Kafka and StatsD (sketch below)
• Data tools are used to identify flaky tests

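As a rough illustration of that event pipeline, a reporting hook could look like the following; the statsd-instrument and ruby-kafka gems, the topic, and the metric names are assumptions rather than the tooling described in the talk.

```ruby
require "json"
require "kafka"              # ruby-kafka gem (assumed)
require "statsd-instrument"  # statsd-instrument gem (assumed)

kafka = Kafka.new(["kafka1:9092"])

def report(kafka, result)
  # Counters feed quick StatsD dashboards...
  StatsD.increment("ci.tests.#{result[:status]}")
  # ...while the full event goes to Kafka, where offline data tools can join
  # per-test history and flag tests that flip between pass and fail.
  kafka.deliver_message(JSON.dump(result), topic: "ci-test-results")
end

report(kafka, { name: "CheckoutTest#test_total", status: "passed", duration_ms: 412 })
```
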
Slide 34: Rebel Base Is Under Attack
• Shipping a second provider brought confusion
• Locutus capacity issues
• Test times were still high

Slide 37: Locutus diagram showing a pool of workers with a cache ring, a coordinator, and the Docker registry that new containers are pushed to

Slide 38: Test Distribution v2
• Loads all tests into Redis (sketch below)
• Containers pull work off the queue
• No more container specialization

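A minimal sketch of the queue-based model, assuming the redis gem and an invented key layout: one process seeds a Redis list with every test file, and each container pops from it until the list is empty, so fast containers naturally absorb the slow outliers.

```ruby
require "redis"

redis = Redis.new(url: ENV.fetch("REDIS_URL"))
build = ENV.fetch("BUILD_ID")
queue = "build:#{build}:queue"

# One process seeds the queue for the whole build...
redis.rpush(queue, Dir.glob("test/**/*_test.rb").sort) if ENV["SEED_QUEUE"]

# ...and every container, no longer specialized by type, pops work until
# nothing is left.
while (file = redis.lpop(queue))
  system("bundle", "exec", "rails", "test", file) or warn("failures in #{file}")
end
```
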
Slide 41: Docker
• No one tests starting 10,000s of containers a day
• Instability further eroded developer confidence
• Every new version of Docker had major bugs

Slide 42: Handling Infrastructure Failures
• At non-trivial scale, you're guaranteed failures
• Swallow infrastructure failures, never test failures (sketch below)
• We still see 100+ container failures a day

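One way to picture the "swallow infrastructure failures, never test failures" rule; the error classes, retry policy, and helper name here are illustrative assumptions, not the actual CI code.

```ruby
# Errors treated as infrastructure problems rather than test results (assumed).
INFRA_ERRORS = [Errno::ECONNRESET, Errno::ETIMEDOUT, IOError].freeze

def with_infra_retries(max_attempts: 3)
  attempts = 0
  begin
    # Whatever the block reports (including a failing test run) passes through
    # untouched: test failures are never retried away.
    yield
  rescue *INFRA_ERRORS
    # Infrastructure hiccups (dead node, dropped connection) are retried
    # quietly; only after exhausting retries do they surface at all.
    attempts += 1
    retry if attempts < max_attempts
    raise
  end
end

passed = with_infra_retries { system("bundle", "exec", "rails", "test", "test/models/shop_test.rb") }
abort("real test failures, do not swallow") unless passed
```
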
Slide 43: Treating Servers as Pets
1. Wait for reports of build issues to stream in
2. Flag the node as in maintenance
3. Manually take the node out of rotation
4. SSH into the node and follow playbook steps to clean up the disk

Slide 44: Treating Servers as Cattle
1. Auto-detect the failures
2. The node removes itself from rotation
3. The node runs a script to clean up the disk (sketch below)

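A sketch of what that self-healing loop could look like on a node; the disk-usage check, the local rotation endpoint, and the cleanup command are hypothetical stand-ins for the real agent and playbook.

```ruby
require "net/http"

# Hypothetical threshold: treat the Docker partition as unhealthy above 90%.
def disk_almost_full?(path = "/var/lib/docker", limit = 90)
  `df --output=pcent #{path}`.lines.last.to_i >= limit
end

loop do
  if disk_almost_full?
    # 1. The node detects the problem itself instead of waiting for reports.
    # 2. It takes itself out of rotation via a (hypothetical) local agent API.
    Net::HTTP.post(URI("http://localhost:9000/rotation/out"), "")
    # 3. It runs the cleanup that used to be a manual playbook step
    #    (placeholder command).
    system("docker system prune --force")
    Net::HTTP.post(URI("http://localhost:9000/rotation/in"), "")
  end
  sleep 60
end
```
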
Slide 46 (no text)

Slide 47 (no text)

Slide 48: Test Distribution v3
• Containers record the tests they ran (sketch below)
• Allow flaky tests to be rerun
• Ensure no tests are lost

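A sketch of that bookkeeping on top of the v2 queue, again assuming the redis gem; the key layout and the retry threshold are invented for illustration.

```ruby
require "redis"

redis = Redis.new(url: ENV.fetch("REDIS_URL"))
build = ENV.fetch("BUILD_ID")

while (test = redis.lpop("build:#{build}:queue"))
  # Record the test before running it, so a crashed container cannot silently
  # lose work.
  redis.sadd("build:#{build}:ran", test)
  passed   = system("bundle", "exec", "rails", "test", test)
  attempts = redis.hincrby("build:#{build}:attempts", test, 1)

  # A failing test gets a bounded number of reruns before it counts as a real
  # failure instead of flakiness.
  redis.rpush("build:#{build}:queue", test) if !passed && attempts < 3
  redis.sadd("build:#{build}:failed", test) if !passed && attempts >= 3
end

# The scheduled set (assumed to be written when the queue was seeded) and the
# recorded set must match; anything missing means a test was lost and the
# build cannot go green.
missing = redis.smembers("build:#{build}:all") - redis.smembers("build:#{build}:ran")
abort("lost tests: #{missing.join(', ')}") unless missing.empty?
```
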
Slide 51: Don't build your own CI
• Build times < 10 minutes
• Small application

Slide 52: Build your own CI
• Build times > 15 minutes
• Monolithic application
• Parallelization limits

Slide 54: Thanks! Follow me on Twitter: @EmilStolarsky

Slide 55: Credits
• Image of shipping containers: https://goo.gl/bXCn1X, https://goo.gl/cDDnYy
• Images of Google DCs: https://goo.gl/UHVRc
• Image of bank vault: https://goo.gl/fFN5EJ
• Locutus: http://goo.gl/UyoJxx
• Warehouse: https://goo.gl/5DiiR1
• Egyptian temple: https://goo.gl/GjbLcq
• Star Wars: http://goo.gl/474wYG
• Sinking container ship: http://goo.gl/U7rdR8, http://goo.gl/wlzlrm
• Cats: http://goo.gl/9p2JXo, https://goo.gl/Ylhl60
• Cattle: http://goo.gl/IBdXmx
• Star Wars: http://goo.gl/LatPEj
• Creative Commons license: https://goo.gl/sZ7V7x