Zero to Capacity Planning

Slide 1

Slide 1 text

From Zero To Capacity Planning

Slide 2

Slide 2 text

@Randommood INES Sombra

Slide 3

Slide 3 text

Globally distributed and Highly available

Slide 4

Slide 4 text

Why capacity planning? Or a journey of discovery and ingenuity

Slide 5

Slide 5 text

The views reflected in this talk are not to be considered a reflection of the skills of my coworkers, who are extremely nice human beings and way better at capacity planning than I am.

Not a monitoring person

Slide 6

Slide 6 text

The Road to Capacity Planning?
 - Instrument
 - Monitor & Alert
 - Plan & Predict

Slide 7

Slide 7 text

Our Path Today
 - 0: Day One
 - Checking The Edge
 - Books: Some Learning
 - Rituals & Myths: Asking Around
 - Findings: Our Discoveries
 - Bringing it Home

Slide 8

Slide 8 text

zero … Oh shit!

Slide 9

Slide 9 text

A convenient "situation"
 - Handles state
 - Many clients
 - Other systems depend on this service to be up, healthy, and available!
A bit F*cked

Slide 10

Slide 10 text

Our World: Edge, Core ✨ ✨

Slide 11

Slide 11 text

a Fastly POP

Slide 12

Slide 12 text

Meet Catharine: "I Rule the Edge!"
 - Evaluates global POP performance weekly & makes projections
 - Publishes the capacity performance report in a clear location
 - Plans for our physical capacity & transit capacity

Slide 13

Slide 13 text

Planning Our Capacity
Some metrics:
 - Network Capacity (Gb)
 - Ordered Network Capability (Gb)
 - Planned Network Capacity (Gb)
 - RPS Capacity (k)
 - Network peak (Gb)
 - RPS peak (k)
 - Site CPU Peak (%)
 - Network Utilization (%)
Over 30%: flagged. Over 70%: red status.
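The thresholds above can be sketched as a small status check. This is only an illustration: it assumes the 30% and 70% thresholds apply to Network Utilization, computed as network peak over network capacity. The "ok" label, the function names, and the POP figures are all hypothetical and do not come from the talk.

```python
def network_utilization(peak_gb: float, capacity_gb: float) -> float:
    """Return peak network load as a percentage of capacity."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return 100.0 * peak_gb / capacity_gb


def utilization_status(utilization_pct: float) -> str:
    """Map a utilization percentage to a status.

    Over 70% is "red" and over 30% is "flagged", as on the slide.
    "ok" is an assumed label for everything at or below 30%.
    """
    if utilization_pct > 70:
        return "red"
    if utilization_pct > 30:
        return "flagged"
    return "ok"


if __name__ == "__main__":
    # Hypothetical (peak, capacity) figures in Gb, for illustration only.
    pops = {"POP-A": (12.0, 100.0), "POP-B": (45.0, 100.0), "POP-C": (80.0, 100.0)}
    for name, (peak, capacity) in pops.items():
        pct = network_utilization(peak, capacity)
        print(f"{name}: {pct:.0f}% -> {utilization_status(pct)}")
```

A report built this way makes the weekly review mechanical: anything not "ok" gets a closer look before it becomes urgent.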

Slide 14

Slide 14 text

Edge Insights
 - Our ability to correctly plan for capacity is critical to our bottom line
 - Capacity doesn't just involve hardware; software optimizations matter
 - People affect capacity

Slide 15

Slide 15 text

Hitting The Books

Slide 16

Slide 16 text

Defining Capacity Planning
 - Measuring, planning, & managing system growth
 - Determines what your system needs & when
 - Works from the observation of actual traffic; use current performance as the baseline
 - Must happen regardless of what you might optimize

Slide 17

Slide 17 text

Allspaw's Wisdom (from The Art of Capacity Planning)
 - We have to be this fast & reliable: X per second & Y% uptime
 - Measure how fast/reliable we are
 - Are we fast/reliable enough right now?
   - Yes! Figure out how to stay fast/reliable enough
   - No! Change / add / remove hardware, software, architecture

Slide 18

Slide 18 text

Allspaw's Wisdom
 - System's ceiling: the critical level of a resource that cannot be crossed without failure. Find yours
 - Another form of capacity planning: controlled load testing
 - Predictions: ceilings + historical data
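One way to read "predictions: ceilings + historical data" is a straight-line fit over past peaks, extended until it crosses the ceiling. The sketch below assumes evenly spaced weekly samples and linear growth. Both are simplifications, and the function names and numbers are hypothetical rather than taken from the talk or the book.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_ceiling(weekly_peaks, ceiling):
    """Estimate weeks from the latest sample until the trend reaches the ceiling.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line is already at or above the ceiling.
    """
    slope, intercept = fit_line(weekly_peaks)
    if slope <= 0:
        return None
    last_week = len(weekly_peaks) - 1
    crossing_week = (ceiling - intercept) / slope
    return max(0.0, crossing_week - last_week)


if __name__ == "__main__":
    # Hypothetical weekly peaks of a resource, as a percentage of its ceiling.
    peaks = [40, 42, 44, 46, 48]
    print(weeks_until_ceiling(peaks, ceiling=60))
```

A fit like this is a guess, not a forecast: it says nothing about spikes, and it is only as good as the ceiling measured by load testing.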

Slide 19

Slide 19 text

Allspaw's Wisdom
 - System architecture can affect your ability to add capacity
 - Identify & track your application's metrics
 - Tying metrics to user behavior is helpful
 - If you don't have ways to measure your current capacity, you can't plan

Slide 20

Slide 20 text

Little's Law & Capacity Planning
L = λW
 - Capacity (L), throughput (λ), and latency (W)
 - Applies to stable systems
 - Use this information to better understand our workload and to define constraints
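As a worked example of L = λW, the sketch below computes how many requests are in flight at a given throughput and latency, and the throughput a fixed concurrency limit can sustain. The function names and the numbers are illustrative assumptions, and the relationship only holds for a stable system measured over a long enough window.

```python
def in_flight_requests(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W.

    throughput_rps is the arrival rate (lambda) in requests per second,
    latency_s is the average time a request spends in the system (W).
    """
    return throughput_rps * latency_s


def max_throughput(concurrency_limit: float, latency_s: float) -> float:
    """Rearranged form: lambda = L / W, for a fixed number of in-flight slots."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return concurrency_limit / latency_s


if __name__ == "__main__":
    # Hypothetical workload: 2000 requests/s at 50 ms average latency.
    print(f"in flight: {in_flight_requests(2000, 0.05):.0f}")
    # Hypothetical constraint: 100 worker slots; latency degrades to 200 ms.
    print(f"sustainable rps: {max_throughput(100, 0.2):.0f}")
```

The second call shows why latency matters for capacity: with the same number of slots, a fourfold increase in latency cuts sustainable throughput to a quarter.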

Slide 21

Slide 21 text

Literature Insights
 - Possible to have plenty of capacity and a slow site nonetheless
 - Projections & curve fitting are guesses
 - Keep track of API calls & their rate
 - There are always gonna be spikes & hiccups. Take the bad with the good & plan for it

Slide 22

Slide 22 text

Rituals & Myths

Slide 23

Slide 23 text

Crowdsourcing Capacity planning

Slide 24

Slide 24 text

Crowdsourcing Capacity planning

Slide 25

Slide 25 text

Industry Insights
 - Hard to extrapolate general advice into something applicable for my situation
 - Simplicity & ability to reason are the only things I could trust
 - Confusing community stance on the ROI of capacity planning

Slide 26

Slide 26 text

Findings & Putting Things in Practice

Slide 27

Slide 27 text

Steps Followed
Step One: Discovery
 - Documented system architecture & request lifecycle
 - Formalized: clients, SLAs, & operational requirements
Step Two: Gauging & Planning
 - Confirmed constraints & determined strategy
 - Parallelized capacity & optimization tasks
 - Organized a team

Slide 28

Slide 28 text

[Diagram: request flow between Edge and Core. Components shown: APP / API (x2), LB (x2), Coordinators A, B, and C, and caches LON, DFW, FRA, LAX, AMS, and SYD]

Slide 29

Slide 29 text

Steps Followed
Step Three: Capacity Expansion
 - Doubled RAM: our constrained resource
 - Horizontally scaled to 3 servers + 1 canary
Step Four: Re-evaluation
 - Start the process again
 - Tons of tuning left to do. We know we have suboptimal configs!

Slide 30

Slide 30 text

System Before

Slide 31

Slide 31 text

System After

Slide 32

Slide 32 text

System Before System After

Slide 33

Slide 33 text

System Before System After

Slide 34

Slide 34 text

Unexpected Challenges
 - Our goal when adding capacity was no service disruption
 - Localhost is the goddamn devil
 - The gap from metric/graph to insight can be huge
 - Slowness is the nemesis of distributed systems

Slide 35

Slide 35 text

The Oprah Problem: Everyone gets more capacity!
 - Developing operational insights into a non-owned system under pressure is not great
 - Use playbooks, debug.md, rotations, & rollout owners
 - Proactivity and clarity are your best tools

Slide 36

Slide 36 text

Some Insights
 - Anything API driven ought to carry a rate limit. We can easily DDoS ourselves!
 - Monitor and alert on expensive API actions
 - Mind your system dependencies: practice defensive system design & architecture
[Diagram labels: Capacity Planning, Alerting, Monitoring]
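The talk does not say how its rate limits were implemented. A token bucket is one common technique, sketched here as an assumption. The class name, parameters, and the per-call cost used to weight expensive API actions are all hypothetical.

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter (single-threaded sketch).

    Tokens refill at rate_per_s up to burst. Each call spends cost tokens,
    so expensive API actions can be charged more than cheap ones.
    """

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        if rate_per_s <= 0 or burst <= 0:
            raise ValueError("rate_per_s and burst must be positive")
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the call fits, else False."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Hypothetical limit: 5 calls per second with bursts of up to 10.
    bucket = TokenBucket(rate_per_s=5, burst=10)
    accepted = sum(bucket.allow() for _ in range(25))
    print(f"accepted {accepted} of 25 back-to-back calls")
```

Counting the rejected calls and alerting on them covers the second point above as well: a client that keeps hitting the limit is a capacity signal in its own right.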

Slide 37

Slide 37 text

Some Findings
 - Capacity tied to murky organizational structure is both good & bad (but mostly bad)
 - Mind your error descriptions! Cheeky today ⇒ misleading tomorrow!

Slide 38

Slide 38 text

My Insights
 - Finding my system's ceiling is still tricky
 - Services owned by engineers mean you need to level up on Ops skills
 - Back to re-evaluating the setup to get more out of this new capacity
 - Performance testing ought to be done on the core's side (& edge)

Slide 39

Slide 39 text

TL;DR
Capacity planning
 - Is a process, not a one-time event
 - Pushes you to better understand your system, its capacity & its boundaries. That is good!
 - Proactivity is best
Distributed systems
 - Request lifecycle gets tricky
 - System boundaries, dependencies & SLAs must be discussed
 - Your system's capacity may bound other systems' capacity

Slide 40

Slide 40 text

Thank you!
github.com/Randommood/ZerotoCapacityPlanning
Special thanks to: Catharine Strauss, Alan Kasindorf, Matt Whiteley, Caitie McCaffrey, Thom Mahoney, Mike O’Neill, Devon O’Dell, Katherine Daniels, Nathan Taylor, Bruce Spang, and Greg Bako

Slide 41

Slide 41 text

github.com/Randommood/ZerotoCapacityPlanning