Testing Rails at Scale
Emil Stolarsky
May 04, 2016
Transcript
Testing Rails at Scale by @EmilStolarsky
2
3 Shopify • 243,000+ shops • $14B+ total GMV • 300M+ unique visits/month • 1000+ employees
4 CI Systems
5 Scheduler Compute
6 Scheduler Compute
7 Scheduler Compute
8 Managed Provider
9 Managed Provider • Multi-tenant • Closed system • Examples
– CircleCI, Codeship, Hosted TravisCI
10 Unmanaged Provider
11 Unmanaged Provider • Self-hosted • Open system • Examples
– Jenkins, TravisCI, Strider
12 Daily CI Stats • 50,000+ containers booted for testing • 700 builds • 42,000+ tests per build • 5 min build time
13 Shopify using a Hosted Provider • 20+ minute build times • Flakiness from resource starvation • Expensive
14 A NEW HOPE
15 Beginning of a Journey • Bring build times under 5 minutes • Restore confidence in our CI • Maintain current budget
16
17 [Architecture diagram; labels: Code push, Webhooks, Agent Instructions, c4.8xlarge]
18 Compute Cluster • Memory: 5.4 TB • CPU cores: 3240 • Fleet size at peak: 90 • Instance type: c4.8xlarge
19 Instances • AWS Hosted • Managed with Chef • Memory bound • IO Optimizations
20 SCROOGE
21 Auto Scaling with Scrooge [Diagram; labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, c4.8xlarge]
22 Optimizing Cost • AWS specific optimizations • Improve utilization • Not one size fits all
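The Scrooge slides describe an autoscaler that turns capacity requirements into boot/shutdown decisions for nodes. The slides do not show how Scrooge computes this, so the following is only a minimal sketch of one possible sizing rule. The function names, the `agents_per_node` parameter, and the idea of counting queued plus running jobs are all assumptions made for illustration; the default ceiling of 90 nodes is taken from the peak fleet size on the Compute Cluster slide.

```python
import math


def desired_node_count(queued_jobs, running_jobs, agents_per_node,
                       min_nodes=0, max_nodes=90):
    """Return how many nodes should be up for the current load.

    Hypothetical rule: one agent per job, agents packed onto nodes,
    clamped to a fixed fleet range.
    """
    needed_agents = queued_jobs + running_jobs
    nodes = math.ceil(needed_agents / agents_per_node)
    return max(min_nodes, min(max_nodes, nodes))


def scaling_action(current_nodes, desired_nodes):
    """Translate the gap between current and desired size into an action."""
    delta = desired_nodes - current_nodes
    if delta > 0:
        return ("boot", delta)
    if delta < 0:
        return ("shutdown", -delta)
    return ("hold", 0)


if __name__ == "__main__":
    target = desired_node_count(queued_jobs=100, running_jobs=20, agents_per_node=8)
    print(target, scaling_action(current_nodes=10, desired_nodes=target))
```

A real autoscaler would also need to account for boot latency and for the provider's billing granularity, which is presumably where the "AWS specific optimizations" bullet comes in.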
23 Graphing Productivity [Graph: Active Buildkite Agents]
24 Graphing Productivity [Graph: Active Buildkite Agents; annotations: ?, ?, ?]
25 Graphing Productivity [Graph: Active Buildkite Agents; annotations: ?, ?, Lunch rush #1]
26 Graphing Productivity [Graph: Active Buildkite Agents; annotations: ?, Commit + Push, Lunch rush #1]
27 Graphing Productivity [Graph: Active Buildkite Agents; annotations: Lunch rush #2, Commit + Push, Lunch rush #1]
28 Docker • Boot speedup • Test isolation • Distribution
29 Building Containers with Locutus • Implements custom docker build API • Single EC2 machine • Forced debt repayment
30 Test Distribution • Tests allocated based on container index • Ruby tests and browser tests are run on separate containers • Outliers inflated build times
31 Artifacts • Artifacts are uploaded to S3 by Buildkite Agents • Events are logged to Kafka & StatsD • Data tools are used to identify flaky tests
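The slide does not say how flaky tests were identified from the logged events. One common heuristic, sketched here purely as an assumption, is to flag any test that both passed and failed on the same revision, since the code under test did not change between those runs. The `(revision, test_name, passed)` event shape is hypothetical.

```python
from collections import defaultdict


def find_flaky_tests(events):
    """Return test names that both passed and failed on a single revision.

    events: iterable of (revision, test_name, passed) tuples (hypothetical shape).
    """
    outcomes = defaultdict(set)
    for revision, test_name, passed in events:
        outcomes[(revision, test_name)].add(passed)
    flaky = {test for (_, test), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


if __name__ == "__main__":
    events = [
        ("abc", "checkout_test", True),
        ("abc", "checkout_test", False),  # same revision, different result
        ("abc", "cart_test", False),
        ("def", "cart_test", True),       # different revision, so not flagged
    ]
    print(find_flaky_tests(events))
```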
32 [Architecture diagram; labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Pull Revision]
33 DOCKER STRIKES BACK
34 Rebel base is under Attack • Shipping second provider brought confusion • Locutus capacity issues • Test times were still high
35 Battling Confusion • Botched rollout • Instability further eroded developer confidence
36 Clustering Locutus • Make it linearly scalable • Keep it stateless(-ish)
37 Locutus Diagram [Diagram; labels: Coordinator, Worker Pool (Workers), Cache Ring, Docker Registry, Container push, New containers]
38 Test Distribution v2 • Loads all tests into Redis • Containers pull work off queue • No more container specialization
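The pull model on this slide can be sketched as follows. To keep the example self-contained, an in-memory `deque` stands in for the shared Redis list; that substitution, the `WorkQueue` class, and `run_container` are assumptions for illustration and not Shopify's code.

```python
from collections import deque


class WorkQueue:
    """In-memory stand-in for a shared Redis list (an assumption of this sketch)."""

    def __init__(self, tests):
        self._items = deque(tests)

    def pop(self):
        """Hand out the next test, or None once the queue is drained."""
        return self._items.popleft() if self._items else None


def run_container(queue, run_test):
    """A container keeps pulling tests until nothing is left."""
    executed = []
    while (test := queue.pop()) is not None:
        run_test(test)
        executed.append(test)
    return executed


if __name__ == "__main__":
    queue = WorkQueue(["model_test.rb", "cart_test.rb", "browser_checkout_test.rb"])
    print(run_container(queue, run_test=lambda name: None))
```

Since every container runs the same loop against the same queue, any container can take any test, which matches the "no more container specialization" bullet, and a fast container simply pulls more work instead of idling.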
39 [Architecture diagram; labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Code push webhook]
40 RETURN OF THE STABLE BUILD
41 Docker • No one tests starting 10,000s of containers/day • Instability further eroded developer confidence • Every new version of Docker had major bugs
42 Handling Infrastructure Failures • At non-trivial scale, you’re guaranteed failures • Swallow infrastructure failures, never test failures • We still see 100+ container failures a day
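One way to read "swallow infrastructure failures, never test failures" is a retry wrapper that distinguishes the two kinds of error. The sketch below assumes the runner can tell them apart and signals them with two hypothetical exception types; the retry limit is likewise an assumption.

```python
class InfrastructureError(Exception):
    """Hypothetical: the container, network, or node failed, not the code under test."""


class FailedTest(Exception):
    """Hypothetical: a test ran to completion and failed."""


def run_with_retries(job, max_attempts=3):
    """Retry a job on infrastructure errors only.

    A FailedTest is never caught here, so genuine test failures always surface.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except InfrastructureError:
            if attempt == max_attempts:
                raise  # give up once the retry budget is spent


if __name__ == "__main__":
    attempts = {"count": 0}

    def boots_on_third_try():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise InfrastructureError("container failed to boot")
        return "passed"

    print(run_with_retries(boots_on_third_try), attempts["count"])
```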
43 Treating Servers as Pets 1. Wait for reports of build issues to stream in 2. Flag node as in maintenance 3. Manually take node out of rotation 4. SSH into the node and follow playbook steps to clean up disk
44 Treating Servers as Cattle 1. Auto-detect the failures 2. Node removes itself from rotation 3. Node runs script to clean up disk
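The three cattle steps can be sketched as a small self-healing check that a node runs on itself. The disk-usage threshold, the function names, and the injected `leave_rotation` and `clean_disk` callables are all hypothetical; the slides do not describe the actual detection logic or scripts.

```python
import shutil


def disk_usage_fraction(path="."):
    """Fraction of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def self_heal(check_usage, leave_rotation, clean_disk, threshold=0.9):
    """Detect a full disk, leave rotation, then clean up.

    Returns True when remediation ran, False when the node was healthy.
    """
    if check_usage() < threshold:      # 1. auto-detect the failure
        return False
    leave_rotation()                   # 2. node removes itself from rotation
    clean_disk()                       # 3. node runs its cleanup script
    return True


if __name__ == "__main__":
    ran = self_heal(
        check_usage=disk_usage_fraction,
        leave_rotation=lambda: print("leaving rotation"),
        clean_disk=lambda: print("cleaning disk"),
    )
    print("remediated" if ran else "healthy")
```

Passing the actions in as callables keeps the decision logic testable without touching a real node.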
45 I love the internet.
46
47
48 Test Distribution v3 • Containers record the tests they ran • Allow flaky tests to be rerun • Ensure no tests are lost
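A queue that records which container took which test makes both goals on this slide possible: a test held by a dead container can be put back, and a container's recorded list can be replayed to rerun a suspected flaky test. The sketch below is an in-memory illustration under those assumptions; the class and method names are hypothetical and the real system used Redis.

```python
from collections import deque


class RecordingQueue:
    """In-memory sketch of a queue that tracks reservations per container."""

    def __init__(self, tests):
        self._pending = deque(tests)
        self._reserved = {}   # container_id -> tests taken but not yet finished
        self.completed = {}   # container_id -> tests finished, in run order

    def reserve(self, container_id):
        """Hand out the next test and record who holds it."""
        if not self._pending:
            return None
        test = self._pending.popleft()
        self._reserved.setdefault(container_id, []).append(test)
        return test

    def acknowledge(self, container_id, test):
        """Mark a reserved test as finished by this container."""
        self._reserved[container_id].remove(test)
        self.completed.setdefault(container_id, []).append(test)

    def requeue_lost(self, container_id):
        """Return a dead container's unfinished tests to the queue."""
        lost = self._reserved.pop(container_id, [])
        self._pending.extend(lost)
        return lost


if __name__ == "__main__":
    queue = RecordingQueue(["a", "b", "c"])
    queue.acknowledge("c1", queue.reserve("c1"))  # c1 finishes "a"
    queue.reserve("c1")                           # c1 takes "b", then dies
    print(queue.requeue_lost("c1"))               # "b" goes back on the queue
```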
49 [Architecture diagram; labels: Capacity Requirements, Scrooge, Boot/Shutdown Nodes, Agent Instructions, Webhooks, Pull Containers, Code push webhook]
50 CONCLUSION
51 Don’t build your own CI • Build times <10 minutes • Small application
52 Build your own CI • Build times >15 minutes • Monolithic Application • Parallelization Limits
53 Lessons Learned • Commit 100% • Beware of Rabbit holes • Pets vs. Cattle
54 Thanks! Follow me on Twitter @EmilStolarsky
55 Credits • Image of shipping containers: https://goo.gl/bXCn1X, https://goo.gl/cDDnYy •
Images of Google DCs: https://goo.gl/UHVRc • Image of bank vault: https://goo.gl/fFN5EJ • Locutus: http://goo.gl/UyoJxx • Warehouse: https://goo.gl/5DiiR1 • Egyptian Temple: https://goo.gl/GjbLcq • Star wars: http://goo.gl/474wYG • Sinking container ship: http://goo.gl/U7rdR8, http://goo.gl/wlzlrm • Cats: http://goo.gl/9p2JXo, https://goo.gl/Ylhl60 • Cattle: http://goo.gl/IBdXmx • Star Wars: http://goo.gl/LatPEj • Creative Commons License: https://goo.gl/sZ7V7x