Slide 1

Slide 1 text

From Zero To Capacity Planning 2016 Edition!

Slide 2

Slide 2 text

Ines Sombra (@Randommood)

Slide 3

Slide 3 text

Globally distributed and Highly available

Slide 4

Slide 4 text

NOT AN INFRA person

Slide 5

Slide 5 text

Why care about capacity planning? ✨ ✨
 - Instrument
 - Monitor & alert
 - Plan & predict

Slide 6

Slide 6 text

Capacity planning 101

Slide 7

Slide 7 text

Defining Capacity planning
 - Measuring, planning, & managing system growth
 - Determines what your system needs & when
 - From the observation of actual traffic. Use current performance as baseline for predictions
 - Must happen regardless of what you might optimize in the future

Slide 8

Slide 8 text

a Fastly POP

Slide 9

Slide 9 text

Meet Catharine ("I Rule!")
 - Evaluates global POP performance weekly & makes projections
 - Publishes a weekly capacity performance report
 - Plans for our physical capacity & transit capacity

Slide 10

Slide 10 text

Planning Our Capacity
Contextual metrics
 - Network Capacity (Gb)
 - Ordered Network Capability (Gb)
 - Planned Network Capacity (Gb)
 - RPS Capacity (k)
 - Network peak (Gb)
 - RPS peak (k)
 - Site CPU Peak (%)
 - Network Utilization (%)
Over 30%: flagged, Over 70%: Red status
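
A minimal sketch of how the flagging thresholds above might be applied. It assumes the thresholds refer to Network Utilization and that utilization is network peak divided by network capacity; the slide defines neither, and the function names and the "ok" status are hypothetical.

```python
def network_utilization(peak_gb: float, capacity_gb: float) -> float:
    """Return utilization as a percentage of capacity (assumed definition)."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * peak_gb / capacity_gb


def utilization_status(utilization_pct: float) -> str:
    """Map a utilization percentage to a status using the slide's thresholds."""
    if utilization_pct > 70:
        return "red"      # over 70%: red status
    if utilization_pct > 30:
        return "flagged"  # over 30%: flagged
    return "ok"           # hypothetical label for everything else


if __name__ == "__main__":
    # Illustrative numbers only, not real POP data.
    pct = network_utilization(peak_gb=45, capacity_gb=100)
    print(pct, utilization_status(pct))
```

Keeping the thresholds in one small function makes it easy to apply the same rule to every site in a weekly report.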

Slide 11

Slide 11 text

Fastly Insights
 - Our ability to correctly plan for capacity is critical to our bottom line
 - Capacity doesn’t just involve hardware; software & transfer optimizations matter
 - People affect capacity

Slide 12

Slide 12 text

Allspaw’s Admiration Society

Slide 13

Slide 13 text

Allspaw's Wisdom (from The Art of Capacity Planning)
 - We have to be this fast & reliable: X per second & Y% uptime
 - Measure how fast/reliable we are
 - Are we right now?
 - Yes! Figure out how to stay fast/reliable enough
 - No! Change / add / remove: hardware, software, architecture
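
A small sketch of the "are we right now?" check, assuming the target is expressed as a request rate (X per second) and an uptime percentage (Y%). The function names, parameters, and the 30-day period are illustrative assumptions, not part of the talk.

```python
def allowed_downtime_minutes(uptime_pct: float, period_days: float = 30) -> float:
    """Convert an uptime target (Y%) into a downtime budget for a period."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return period_days * 24 * 60 * (1 - uptime_pct / 100)


def meets_targets(measured_rps: float, measured_uptime_pct: float,
                  target_rps: float, target_uptime_pct: float) -> bool:
    """Answer 'are we right now?' for a rate target and an uptime target."""
    return measured_rps >= target_rps and measured_uptime_pct >= target_uptime_pct


if __name__ == "__main__":
    # Illustrative numbers only.
    print(round(allowed_downtime_minutes(99.9), 1), "minutes of downtime per 30 days")
    print(meets_targets(1200, 99.95, target_rps=1000, target_uptime_pct=99.9))
```

Turning the uptime percentage into minutes makes the target easier to compare against measured outages.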

Slide 14

Slide 14 text

Allspaw's Wisdom
 - System’s Ceiling: critical level of a resource that cannot be crossed without failure. Find yours
 - Another form of Capacity Planning: Controlled load testing
 - Predictions = ceilings + historical data
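
One way to combine a ceiling with historical data is sketched below: fit a straight line to weekly peak values and estimate how many weeks remain before the line crosses the ceiling. The linear trend is an assumption made for illustration (the talk does not prescribe a fitting method), and the function name is hypothetical.

```python
from typing import Optional, Sequence


def weeks_until_ceiling(history: Sequence[float], ceiling: float) -> Optional[float]:
    """Estimate weeks from the last observation until a linear trend hits the ceiling.

    history holds one peak value per week, oldest first.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing_week = (ceiling - intercept) / slope
    # Clamp to zero if the fitted line is already at or above the ceiling.
    return max(0.0, crossing_week - (n - 1))


if __name__ == "__main__":
    # Illustrative numbers only: weekly peaks and a ceiling found by load testing.
    print(weeks_until_ceiling([10, 20, 30, 40], ceiling=100))
```

A projection like this is only as good as the assumed trend, so it is worth re-running as new weekly data arrives.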

Slide 15

Slide 15 text

Allspaw's Wisdom
 - System architecture can affect your ability to add capacity
 - Identify & track your application’s metrics
 - Tying metrics to user behavior is helpful
 - If you don’t have ways to measure your current capacity you can’t plan

Slide 16

Slide 16 text

Findings & Putting things in practice

Slide 17

Slide 17 text

Unexpected Challenges
 - The goal when adding capacity is no service disruption
 - Localhost is the goddamn devil
 - Gap from metric/graph to insight can be huge
 - Slowness is the nemesis of distributed systems

Slide 18

Slide 18 text

More Insights
 - Capacity tied to murky organizational structure is both good & bad (but mostly bad)
 - Mind your system dependencies: practice defensive system design & architecture
 - New SLAs can be tricky
Capacity planning, alerting, monitoring

Slide 19

Slide 19 text

More Insights
 - Possible to have plenty of capacity and a slow site nonetheless
 - Projections & curve fitting are guesses
 - Keep track of API calls & their rates
 - Always gonna be spikes & hiccups. Take the bad with the good & plan for it
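
Keeping track of API calls and their rates could look something like the sliding-window counter below. This is a sketch under stated assumptions: the class name, the per-endpoint keying, and the 60-second window are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class CallRateTracker:
    """Track calls per endpoint and report the rate over a sliding window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._calls = defaultdict(deque)  # endpoint -> timestamps, oldest first

    def record(self, endpoint: str, timestamp: float) -> None:
        """Record one call to an endpoint at the given time (in seconds)."""
        self._calls[endpoint].append(timestamp)

    def rate(self, endpoint: str, now: float) -> float:
        """Return calls per second over the window ending at `now`."""
        calls = self._calls[endpoint]
        # Drop timestamps that have fallen out of the window.
        while calls and calls[0] <= now - self.window_seconds:
            calls.popleft()
        return len(calls) / self.window_seconds


if __name__ == "__main__":
    # Illustrative timestamps only.
    tracker = CallRateTracker(window_seconds=10)
    for t in (0, 1, 2, 11):
        tracker.record("/purge", t)
    print(tracker.rate("/purge", now=11.5))
```

In practice a metrics system would usually do this aggregation, but the idea is the same: count calls per endpoint over a fixed window so spikes show up as a rate.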

Slide 20

Slide 20 text

TL;DR
Capacity planning
 - Is a process, not a one-time event
 - Pushes you to better understand your system, its capacity & its boundaries - that is good!
 - Proactivity is best
Distributed systems
 - Request lifecycle gets tricky
 - System boundaries, dependencies & SLAs must be discussed
 - Your system’s capacity may bound other systems’ capacity

Slide 21

Slide 21 text

Thank you!
github.com/Randommood/ZerotoCapacityPlanning
Special Thanks to: Catharine Strauss, Alan Kasindorf, Matt Whiteley, Caitie McCaffrey, Thom Mahoney, Mike O’Neill, Devon O’Dell, Katherine Daniels, Nathan Taylor, Bruce Spang, and Greg Bako