Testing Rails at Scale

Emil Stolarsky

May 04, 2016

Transcript

  1. Testing Rails at Scale
    By @EmilStolarsky

  2. (no text on slide)

  3. Shopify
    • 243,000+ shops
    • $14B+ total GMV
    • 300M+ unique visits/month
    • 1000+ employees

  4. CI Systems

  5. Scheduler
    Compute

  6. Scheduler
    Compute

  7. Scheduler
    Compute

  8. Managed Provider

  9. Managed Provider
    • Multi-tenant
    • Closed system
    • Examples: CircleCI, Codeship, hosted Travis CI

  10. Unmanaged Provider

  11. Unmanaged Provider
    • Self-hosted
    • Open system
    • Examples: Jenkins, Travis CI, Strider

  12. Daily CI Stats
    • 50,000+ containers booted for testing
    • 700 builds
    • 42,000+ tests per build
    • 5 min build time

  13. Shopify using a Hosted Provider
    • 20+ minute build times
    • Flakiness from resource starvation
    • Expensive

  14. A New Hope

  15. Beginning of a Journey
    • Bring build times under 5 minutes
    • Restore confidence in our CI
    • Maintain current budget

  16. (no text on slide)

  17. Diagram labels: Code push, Webhooks, Agent Instructions, c4.8xlarge

  18. Compute Cluster
    • 5.4 TB memory
    • 3240 CPU cores
    • 90 instances (fleet size at peak)
    • c4.8xlarge instance type

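The cluster totals on this slide can be cross-checked with a little arithmetic. The sketch below assumes the published c4.8xlarge figures of 36 vCPUs and 60 GiB of memory per instance; those per-instance numbers are not stated in the slides, and `fleet_totals` is a hypothetical helper.

```python
# Per-instance figures are the published c4.8xlarge specs (an assumption
# brought in from outside the slides); the fleet size is from the slide.
VCPUS_PER_INSTANCE = 36
MEMORY_GIB_PER_INSTANCE = 60
FLEET_SIZE_AT_PEAK = 90


def fleet_totals(instances):
    """Return (total vCPUs, total memory in GiB) for a homogeneous fleet."""
    return instances * VCPUS_PER_INSTANCE, instances * MEMORY_GIB_PER_INSTANCE


cores, memory_gib = fleet_totals(FLEET_SIZE_AT_PEAK)
print(cores, memory_gib)  # 3240 5400
```

90 instances give 3240 cores and 5400 GiB, which matches the slide's 3240 CPU cores and roughly 5.4 TB of memory.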

  19. Instances
    • AWS hosted
    • Managed with Chef
    • Memory bound
    • IO optimizations

  20. Scrooge

  21. Auto Scaling with Scrooge
    Diagram labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, c4.8xlarge

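The slides do not show how Scrooge turns capacity requirements into boot/shutdown decisions. A minimal sketch of one plausible approach follows; the function names, the jobs-per-node model, and the `min_nodes` default are all assumptions, and only the cap of 90 nodes echoes the peak fleet size from the deck.

```python
import math


def desired_nodes(queued_jobs, running_jobs, agents_per_node,
                  min_nodes=1, max_nodes=90):
    """Nodes needed so every queued or running job can have its own agent."""
    needed = math.ceil((queued_jobs + running_jobs) / agents_per_node)
    # Clamp to the allowed fleet size.
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, target_nodes):
    """Translate the gap between current and target size into an action."""
    if target_nodes > current_nodes:
        return ("boot", target_nodes - current_nodes)
    if target_nodes < current_nodes:
        return ("shutdown", current_nodes - target_nodes)
    return ("hold", 0)


target = desired_nodes(queued_jobs=100, running_jobs=20, agents_per_node=10)
print(scaling_action(current_nodes=10, target_nodes=target))  # ('boot', 2)
```

A real autoscaler would also need to account for instance boot time and billing granularity, which this sketch ignores.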

  22. Optimizing Cost
    • AWS-specific optimizations
    • Improve utilization
    • Not one size fits all

  23. Graphing Productivity
    Graph: Active Buildkite Agents

  24. Graphing Productivity
    Graph: Active Buildkite Agents
    Annotations: ?, ?, ?

  25. Graphing Productivity
    Graph: Active Buildkite Agents
    Annotations: ?, ?, Lunch rush #1

  26. Graphing Productivity
    Graph: Active Buildkite Agents
    Annotations: ?, Commit + Push, Lunch rush #1

  27. Graphing Productivity
    Graph: Active Buildkite Agents
    Annotations: Lunch rush #2, Commit + Push, Lunch rush #1

  28. Docker
    • Boot speedup
    • Test isolation
    • Distribution

  29. Building Containers with Locutus
    • Implements custom Docker build API
    • Single EC2 machine
    • Forced debt repayment

  30. Test Distribution
    • Tests allocated based on container index
    • Ruby tests and browser tests are run on separate containers
    • Outliers inflated build times


  31. Artifacts
    • Artifacts are uploaded to S3 by Buildkite Agents
    • Events are logged to Kafka & StatsD
    • Data tools are used to identify flaky tests

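The slides do not describe the data tooling used to spot flaky tests. One simple heuristic, sketched here under the assumption that each logged event carries a revision, a test name, and a status (a hypothetical schema), is to flag any test that both passed and failed on the same revision.

```python
from collections import defaultdict


def find_flaky_tests(events):
    """Return tests that both passed and failed on the same revision."""
    outcomes = defaultdict(set)
    for event in events:
        outcomes[(event["revision"], event["test"])].add(event["status"])
    return sorted(
        {test for (_, test), seen in outcomes.items() if {"pass", "fail"} <= seen}
    )


# Hypothetical event records, standing in for data read from the event log.
events = [
    {"revision": "abc123", "test": "CartTest#test_add", "status": "fail"},
    {"revision": "abc123", "test": "CartTest#test_add", "status": "pass"},
    {"revision": "abc123", "test": "OrderTest#test_total", "status": "fail"},
    {"revision": "def456", "test": "OrderTest#test_total", "status": "pass"},
]
print(find_flaky_tests(events))  # ['CartTest#test_add']
```

`OrderTest#test_total` is not flagged because its results differ across revisions, which could be a genuine fix rather than flakiness.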

  32. Diagram labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Pull Revision

  33. Docker Strikes Back

  34. Rebel Base Is Under Attack
    • Shipping second provider brought confusion
    • Locutus capacity issues
    • Test times were still high

  35. Battling Confusion
    • Botched rollout
    • Instability further eroded developer confidence

  36. Clustering Locutus
    • Make it linearly scalable
    • Keep it stateless(-ish)

  37. Locutus Diagram
    Diagram labels: Worker Pool (four Workers), Cache Ring, Coordinator, Docker Registry, Container push, New containers

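The diagram names a coordinator, a worker pool, and a cache ring, but the slides do not explain how builds are routed. One common way to keep workers mostly stateless while still reusing warm caches is consistent hashing; the sketch below illustrates that general technique and is not a description of Locutus itself. `HashRing`, the worker names, and the `replicas` parameter are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping a build key to a worker (needs >= 1 worker)."""

    def __init__(self, workers, replicas=64):
        # Each worker gets several virtual points to even out the load.
        self._ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def worker_for(self, key):
        """Pick the first ring point at or after the key's hash, wrapping around."""
        index = bisect.bisect_left(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[index][1]


ring = HashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("shopify/shopify"))
```

With this scheme, adding or removing a worker only remaps the keys that land on that worker's ring points, so most builds keep hitting a worker that already holds their cached layers.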

  38. Test Distribution v2
    • Loads all tests into Redis
    • Containers pull work off queue
    • No more container specialization

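The effect of a shared work queue can be simulated without Redis. In the sketch below an in-memory `deque` stands in for the Redis queue, and `run_build` and the durations are hypothetical; it models containers taking the next test as soon as they are free.

```python
from collections import deque


def run_build(test_durations, container_count):
    """Simulate containers pulling tests off a shared queue.

    Returns the total busy time of each container.
    """
    queue = deque(sorted(test_durations.items()))
    busy_until = [0.0] * container_count
    while queue:
        _test, duration = queue.popleft()
        # The container that frees up first takes the next test.
        container = busy_until.index(min(busy_until))
        busy_until[container] += duration
    return busy_until


# Hypothetical durations in minutes: one outlier file dominates.
durations = {"a_test.rb": 10, "b_test.rb": 1, "c_test.rb": 1,
             "d_test.rb": 1, "e_test.rb": 1}
print(run_build(durations, 2))  # [10.0, 4.0]
```

While one container is occupied by the 10-minute outlier, the other drains the rest of the queue, so the build takes 10 minutes: the outlier's own duration, which is the best possible with these inputs.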

  39. Diagram labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Code push webhook

  40. Return of the Stable Build

  41. Docker
    • No one tests starting tens of thousands of containers per day
    • Instability further eroded developer confidence
    • Every new version of Docker had major bugs

  42. Handling Infrastructure Failures
    • At non-trivial scale, you’re guaranteed failures
    • Swallow infrastructure failures, never test failures
    • We still see 100+ container failures a day

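The rule "swallow infrastructure failures, never test failures" can be sketched as a retry wrapper that only reacts to one class of error. `InfrastructureError` and `run_with_infra_retries` are hypothetical names, and the retry count is an assumption.

```python
class InfrastructureError(Exception):
    """Hypothetical marker for failures unrelated to the code under test,
    such as a container that failed to boot."""


def run_with_infra_retries(run_tests, max_attempts=3):
    """Retry only when the infrastructure failed.

    Whatever run_tests returns, including failing tests, is passed through
    untouched so real test failures are never hidden.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_tests()
        except InfrastructureError:
            if attempt == max_attempts:
                raise  # give up loudly rather than report a false result


attempts = []


def flaky_container():
    attempts.append(1)
    if len(attempts) < 3:
        raise InfrastructureError("container failed to boot")
    return {"passed": 41, "failed": ["CartTest#test_add"]}


print(run_with_infra_retries(flaky_container))
```

The failed test in the result still reaches the developer; only the two boot failures were absorbed.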

  43. Treating Servers as Pets
    1. Wait for reports of build issues to stream in
    2. Flag node as in maintenance
    3. Manually take node out of rotation
    4. SSH into the node and follow playbook steps to clean up the disk

  44. Treating Servers as Cattle
    1. Automatically detect the failures
    2. Node removes itself from rotation
    3. Node runs script to clean up the disk

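The three automated steps can be sketched as a small self-healing check that a node runs on itself. Everything here is an assumption for illustration: the 90% disk threshold, the callback names, and the final step of rejoining rotation, which the slide does not mention.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def heal_if_needed(usage_fraction, leave_rotation, clean_disk, join_rotation,
                   threshold=0.9):
    """Run the drain, clean, rejoin cycle when disk usage crosses the threshold.

    Returns True if the node healed itself, False if nothing was needed.
    """
    if usage_fraction < threshold:
        return False
    leave_rotation()   # step 2: node removes itself from rotation
    clean_disk()       # step 3: node runs its cleanup script
    join_rotation()    # assumption: node returns to service afterwards
    return True


log = []
healed = heal_if_needed(
    0.95,
    leave_rotation=lambda: log.append("leave"),
    clean_disk=lambda: log.append("clean"),
    join_rotation=lambda: log.append("join"),
)
print(healed, log)  # True ['leave', 'clean', 'join']
```

In practice the callbacks would talk to the scheduler and run the same steps the manual playbook describes, and `disk_usage_fraction()` would supply the first argument.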

  45. I love the internet.

  46. (no text on slide)

  47. (no text on slide)

  48. Test Distribution v3
    • Containers record the tests they ran
    • Allow flaky tests to be rerun
    • Ensure no tests are lost

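Once containers record what they ran, a build can be reconciled against the full test list. The sketch below is one way to do that bookkeeping; the `(container_id, test, passed)` record shape and the `reconcile` helper are hypothetical.

```python
def reconcile(all_tests, recorded_runs):
    """Compare the full test list with what containers recorded.

    recorded_runs is a list of (container_id, test, passed) tuples.
    Returns (lost, to_rerun): tests no container recorded, and tests that
    ran but never passed and are therefore candidates for a rerun.
    """
    seen = {test for _, test, _ in recorded_runs}
    passed = {test for _, test, ok in recorded_runs if ok}
    lost = sorted(set(all_tests) - seen)
    to_rerun = sorted(seen - passed)
    return lost, to_rerun


all_tests = ["a_test.rb", "b_test.rb", "c_test.rb", "d_test.rb"]
recorded = [
    ("container-1", "a_test.rb", True),
    ("container-1", "b_test.rb", False),
    ("container-2", "c_test.rb", False),
    ("container-3", "c_test.rb", True),   # rerun passed on another container
]
print(reconcile(all_tests, recorded))  # (['d_test.rb'], ['b_test.rb'])
```

Here `d_test.rb` was lost, for example because its container died before recording it, and `b_test.rb` failed without a passing rerun.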

  49. Diagram labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Code push webhook

  50. Conclusion

  51. Don’t build your own CI
    • Build times <10 minutes
    • Small application

  52. Build your own CI
    • Build times >15 minutes
    • Monolithic application
    • Parallelization limits

  53. Lessons Learned
    • Commit 100%
    • Beware of rabbit holes
    • Pets vs. Cattle

  54. Thanks!
    Follow me on Twitter @EmilStolarsky

  55. Credits
    • Image of shipping containers: https://goo.gl/bXCn1X, https://goo.gl/cDDnYy
    • Images of Google DCs: https://goo.gl/UHVRc
    • Image of bank vault: https://goo.gl/fFN5EJ
    • Locutus: http://goo.gl/UyoJxx
    • Warehouse: https://goo.gl/5DiiR1
    • Egyptian Temple: https://goo.gl/GjbLcq
    • Star Wars: http://goo.gl/474wYG
    • Sinking container ship: http://goo.gl/U7rdR8, http://goo.gl/wlzlrm
    • Cats: http://goo.gl/9p2JXo, https://goo.gl/Ylhl60
    • Cattle: http://goo.gl/IBdXmx
    • Star Wars: http://goo.gl/LatPEj
    • Creative Commons License: https://goo.gl/sZ7V7x