Slide 1

Slide 1 text

Testing Rails at Scale by @EmilStolarsky

Slide 2

Slide 2 text

2

Slide 3

Slide 3 text

3 Shopify • 243,000+ shops • $14B+ total GMV • 300M+ unique visits/month • 1000+ employees

Slide 4

Slide 4 text

4 CI Systems

Slide 5

Slide 5 text

5 Scheduler Compute

Slide 6

Slide 6 text

6 Scheduler Compute

Slide 7

Slide 7 text

7 Scheduler Compute

Slide 8

Slide 8 text

8 Managed Provider

Slide 9

Slide 9 text

9 Managed Provider • Multi-tenant • Closed system • Examples – CircleCI, Codeship, Hosted TravisCI

Slide 10

Slide 10 text

10 Unmanaged Provider

Slide 11

Slide 11 text

11 Unmanaged Provider • Self-hosted • Open system • Examples – Jenkins, TravisCI, Strider

Slide 12

Slide 12 text

12 Daily CI Stats • 50,000+ containers booted for testing • 700 builds • 42,000+ tests per build • 5 min build time

Slide 13

Slide 13 text

13 Shopify using a Hosted Provider • 20+ minute build times • Flakiness from resource starvation • Expensive

Slide 14

Slide 14 text

14 A NEW HOPE

Slide 15

Slide 15 text

15 Beginning of a Journey • Bring build times under 5 minutes • Restore confidence in our CI • Maintain current budget

Slide 16

Slide 16 text

16

Slide 17

Slide 17 text

17 c4.8xlarge Webhooks Code push Agent Instructions

Slide 18

Slide 18 text

18 Compute Cluster • 5.4 TB memory • 3240 CPU cores • 90 fleet size at peak • c4.8xlarge instance type
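The cluster totals on this slide can be cross-checked from the fleet size. The sketch below assumes the published c4.8xlarge specification (36 vCPUs and 60 GiB of memory per instance); those per-instance figures are not stated on the slide.

```python
# Assumed per-instance figures for c4.8xlarge (not given on the slide).
VCPUS_PER_INSTANCE = 36
MEMORY_GIB_PER_INSTANCE = 60

# Fleet size at peak, taken from the slide.
FLEET_SIZE_AT_PEAK = 90

total_cores = FLEET_SIZE_AT_PEAK * VCPUS_PER_INSTANCE
total_memory_gib = FLEET_SIZE_AT_PEAK * MEMORY_GIB_PER_INSTANCE

print(total_cores)       # 3240, matching the slide
print(total_memory_gib)  # 5400 GiB, roughly the 5.4 TB on the slide
```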

Slide 19

Slide 19 text

19 Instances • AWS Hosted • Managed with Chef • Memory bound • IO Optimizations

Slide 20

Slide 20 text

20 SCROOGE

Slide 21

Slide 21 text

21 Auto Scaling with Scrooge c4.8xlarge Capacity Requirements Scrooge Boot/Shutdown Nodes c4.8xlarge
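The slide only shows the flow (capacity requirements go in, boot/shutdown decisions come out), so the following is a minimal sketch of what such a decision could look like, not Scrooge's actual logic. The function name, the `agents_per_node` parameter, and the node limits are all hypothetical; the default upper limit of 90 is borrowed from the peak fleet size quoted earlier.

```python
import math


def nodes_to_change(required_agents, running_nodes, agents_per_node,
                    min_nodes=0, max_nodes=90):
    """Return how many nodes to boot (positive) or shut down (negative).

    All parameters are illustrative assumptions, not Scrooge's real inputs.
    """
    # Nodes needed to host the requested number of build agents.
    desired = math.ceil(required_agents / agents_per_node)
    # Clamp to the allowed fleet size.
    desired = max(min_nodes, min(max_nodes, desired))
    return desired - running_nodes


print(nodes_to_change(required_agents=250, running_nodes=10, agents_per_node=20))
print(nodes_to_change(required_agents=40, running_nodes=10, agents_per_node=20))
```

The first call asks for more capacity than is running and returns a positive number of nodes to boot; the second returns a negative number, meaning idle nodes can be shut down to save cost.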

Slide 22

Slide 22 text

22 Optimizing Cost • AWS-specific optimizations • Improve utilization • Not one size fits all

Slide 23

Slide 23 text

23 Graphing Productivity Active Buildkite Agents

Slide 24

Slide 24 text

24 Graphing Productivity Active Buildkite Agents ? ? ?

Slide 25

Slide 25 text

25 Graphing Productivity Active Buildkite Agents ? ? Lunch rush #1

Slide 26

Slide 26 text

26 Graphing Productivity Active Buildkite Agents ? Commit + Push Lunch rush #1

Slide 27

Slide 27 text

27 Graphing Productivity Active Buildkite Agents Lunch rush #2 Commit + Push Lunch rush #1

Slide 28

Slide 28 text

28 Docker • Boot speedup • Test isolation • Distribution

Slide 29

Slide 29 text

29 Building Containers with Locutus • Implements custom docker build API • Single EC2 machine • Forced debt repayment

Slide 30

Slide 30 text

30 Test Distribution • Tests allocated based on container index • Ruby tests and browser tests are run on separate containers • Outliers inflated build times

Slide 31

Slide 31 text

31 Artifacts • Artifacts are uploaded to S3 by Buildkite Agents • Events are logged to Kafka and StatsD • Data tools are used to identify flaky tests

Slide 32

Slide 32 text

32 Capacity Requirements Scrooge Boot/Shutdown Nodes Agent Instructions Webhooks Pull Containers Pull Revision

Slide 33

Slide 33 text

DOCKER STRIKES BACK

Slide 34

Slide 34 text

34 Rebel base is under attack • Shipping a second provider brought confusion • Locutus capacity issues • Test times were still high

Slide 35

Slide 35 text

35 Battling Confusion • Botched rollout • Instability further eroded developer confidence

Slide 36

Slide 36 text

36 Clustering Locutus • Make it linearly scalable • Keep it stateless(-ish)

Slide 37

Slide 37 text

37 Locutus Diagram Worker Worker Worker Worker Worker Pool Cache Ring Coordinator Docker Registry Container push New containers

Slide 38

Slide 38 text

38 Test Distribution v2 • Loads all tests into Redis • Containers pull work off queue • No more container specialization
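The queue-based scheme can be illustrated in a single process. In this sketch an in-memory `queue.Queue` stands in for the Redis queue and threads stand in for containers; both substitutions, along with the `run_build` and `run_test` names, are assumptions made so the example runs without external services.

```python
import queue
import threading


def run_build(test_names, container_count, run_test):
    """Load every test into one shared queue and let workers pull from it."""
    work = queue.Queue()
    for name in test_names:
        work.put(name)

    results = {}
    lock = threading.Lock()

    def container(index):
        # Each worker keeps pulling until the queue is empty, so a worker
        # stuck on a slow test simply takes fewer tests overall.
        while True:
            try:
                name = work.get_nowait()
            except queue.Empty:
                return
            outcome = run_test(name)
            with lock:
                results[name] = (index, outcome)

    threads = [threading.Thread(target=container, args=(i,))
               for i in range(container_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_build([f"test_{i}" for i in range(20)], 4, lambda name: True)
print(len(results))
```

Since any worker can take any test, no container needs to be specialized for a particular kind of test.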

Slide 39

Slide 39 text

39 Capacity Requirements Scrooge Boot/Shutdown Nodes Agent Instructions Webhooks Pull Containers Code push webhook

Slide 40

Slide 40 text

40 RETURN OF THE STABLE BUILD

Slide 41

Slide 41 text

41 Docker • No one tests starting 10,000s of containers/day • Instability further eroded developer confidence • Every new version of Docker had major bugs

Slide 42

Slide 42 text

42 Handling Infrastructure Failures • At non-trivial scale, you’re guaranteed failures • Swallow infrastructure failures, never test failures • We still see 100+ container failures a day
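One way to read "swallow infrastructure failures, never test failures" is to retry only when the container itself breaks and to report test results untouched. The sketch below assumes a hypothetical `InfrastructureError` exception and a `run_container` callable that returns test results; neither name comes from the talk.

```python
class InfrastructureError(Exception):
    """Hypothetical marker for failures of the container, not of the tests."""


def run_with_infra_retries(run_container, max_attempts=3):
    """Retry on infrastructure errors only; return test results as-is."""
    last_error = None
    for _ in range(max_attempts):
        try:
            # Test failures are part of the returned result and are never
            # retried or hidden here.
            return run_container()
        except InfrastructureError as error:
            last_error = error
    # Every attempt hit an infrastructure problem: surface it.
    raise last_error


attempts = []

def unreliable_container():
    attempts.append(1)
    if len(attempts) < 3:
        raise InfrastructureError("container failed to boot")
    return {"failed_tests": ["checkout_test"]}

print(run_with_infra_retries(unreliable_container))
```

In the usage example the container fails to boot twice and succeeds on the third attempt, and the genuinely failing test is still reported.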

Slide 43

Slide 43 text

43 Treating Servers as Pets 1. Wait for reports of build issues to stream in 2. Flag node as in maintenance 3. Manually take node out of rotation 4. SSH into the node and follow playbook steps to clean up the disk

Slide 44

Slide 44 text

44 Treating Servers as Cattle 1. Auto-detect the failures 2. Node removes itself from rotation 3. Node runs script to clean up the disk
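The three cattle steps can be sketched as a small self-healing check that a node runs on itself. Using disk usage as the failure signal, the 90% threshold, and the `leave_rotation` and `clean_disk` callbacks are all assumptions for illustration; the talk does not describe how detection was implemented.

```python
import shutil


def disk_used_fraction(path="/"):
    """Fraction of the disk at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def self_heal(used_fraction, leave_rotation, clean_disk, threshold=0.9):
    """Run the cattle steps if the node looks unhealthy.

    Returns True when remediation ran, False when the node was healthy.
    """
    # 1. Auto-detect the failure (here: disk nearly full, an assumed signal).
    if used_fraction < threshold:
        return False
    # 2. The node removes itself from rotation.
    leave_rotation()
    # 3. The node runs its cleanup script.
    clean_disk()
    return True


events = []
self_heal(0.95, lambda: events.append("left rotation"),
          lambda: events.append("cleaned disk"))
print(events)
```

In practice the first argument would come from something like `disk_used_fraction()`, and the callbacks would call whatever the scheduler and cleanup tooling provide.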

Slide 45

Slide 45 text

45 I love the internet.

Slide 46

Slide 46 text

46

Slide 47

Slide 47 text

47

Slide 48

Slide 48 text

48 Test Distribution v3 • Containers record the tests they ran • Allow flaky tests to be rerun • Ensure no tests are lost
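Once containers record what they ran, a build can be reconciled against the full test list. The sketch below is one possible bookkeeping scheme, assuming records arrive as `(test_name, passed)` pairs and that a failing test is allowed a fixed number of attempts; the record shape, the `reconcile` name, and the attempt limit are all hypothetical.

```python
def reconcile(expected_tests, records, max_attempts=2):
    """Classify tests as lost, needing a rerun, or definitively failed."""
    attempts = {}
    passed = set()
    for name, ok in records:
        attempts[name] = attempts.get(name, 0) + 1
        if ok:
            passed.add(name)

    # Tests nobody recorded were lost, e.g. because a container died.
    lost = set(expected_tests) - set(attempts)
    not_passed = set(attempts) - passed
    # Failed tests with attempts remaining get another chance.
    rerun = {name for name in not_passed if attempts[name] < max_attempts}
    failed = not_passed - rerun
    return {"lost": sorted(lost), "rerun": sorted(rerun), "failed": sorted(failed)}


records = [("a", True), ("b", False), ("c", False), ("c", False)]
print(reconcile(["a", "b", "c", "d"], records))
```

Here `d` was never recorded and would be put back in the queue, `b` has failed once and is eligible for a rerun, and `c` has used up its attempts and counts as a real failure.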

Slide 49

Slide 49 text

49 Capacity Requirements Scrooge Boot/Shutdown Nodes Agent Instructions Webhooks Pull Containers Code push webhook

Slide 50

Slide 50 text

CONCLUSION

Slide 51

Slide 51 text

51 Don’t build your own CI • Build times <10 minutes • Small application

Slide 52

Slide 52 text

52 Build your own CI • Build times >15 minutes • Monolithic Application • Parallelization Limits

Slide 53

Slide 53 text

53 Lessons Learned • Commit 100% • Beware of rabbit holes • Pets vs. cattle

Slide 54

Slide 54 text

54 Thanks! Follow me on Twitter @EmilStolarsky

Slide 55

Slide 55 text

55 Credits • Image of shipping containers: https://goo.gl/bXCn1X, https://goo.gl/cDDnYy • Images of Google DCs: https://goo.gl/UHVRc • Image of bank vault: https://goo.gl/fFN5EJ • Locutus: http://goo.gl/UyoJxx • Warehouse: https://goo.gl/5DiiR1 • Egyptian Temple: https://goo.gl/GjbLcq • Star wars: http://goo.gl/474wYG • Sinking container ship: http://goo.gl/U7rdR8, http://goo.gl/wlzlrm • Cats: http://goo.gl/9p2JXo, https://goo.gl/Ylhl60 • Cattle: http://goo.gl/IBdXmx • Star Wars: http://goo.gl/LatPEj • Creative Commons License: https://goo.gl/sZ7V7x