Zero to Capacity Planning

Ines Sombra

June 15, 2015

Transcript

  1. From Zero To Capacity Planning

  2. @Randommood Ines Sombra

  3. Globally distributed and Highly available

  4. Why capacity planning? Or a journey of discovery and ingenuity

  5. The views reflected in this talk are not to be considered a reflection of the skills of my coworkers, who are extremely nice human beings and way better at capacity planning than I am. Not a monitoring person.
  6. The Road to Capacity Planning? Instrument. Monitor & Alert. Plan & Predict.
  7. Our Path Today: 0 (Day One). Checking the Edge. Books (Some Learning). Rituals & Myths (Asking Around). Findings (Our Discoveries). Bringing It Home.
  8. zero … Oh shit!

  9. A convenient “situation”: Handles state. Many clients. Other systems depend on this service to be up, healthy, and available! A bit F*cked.
  10. Our World: Edge, Core ✨ ✨

  11. a Fastly POP

  12. Meet Catharine ("I Rule the Edge!"): Evaluates weekly global POP performance & makes projections. Publishes capacity performance report in a clear location. Plans for our physical capacity & transit capacity.
  13. Planning Our Capacity. Some metrics:

    - Network Capacity (Gb)
    - Ordered Network Capability (Gb)
    - Planned Network Capacity (Gb)
    - RPS Capacity (k)
    - Network peak (Gb)
    - RPS peak (k)
    - Site CPU Peak (%)
    - Network Utilization (%). Over 30%: flagged. Over 70%: red status.
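The flag thresholds on this slide can be sketched as a small check. This is only an illustration: it assumes the 30% and 70% thresholds apply to network utilization computed as network peak divided by network capacity, and the function name `utilization_status` and the sample figures are hypothetical, not taken from the talk.

```python
def utilization_status(peak_gb: float, capacity_gb: float) -> str:
    """Classify a site by peak network utilization.

    Thresholds follow the slide: over 30% is flagged, over 70% is red.
    Treating utilization as peak / capacity is an assumption.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    utilization_pct = 100.0 * peak_gb / capacity_gb
    if utilization_pct > 70.0:
        return "red"
    if utilization_pct > 30.0:
        return "flagged"
    return "ok"


# Hypothetical sites: (network peak in Gb, network capacity in Gb).
sites = {"site-a": (12.0, 100.0), "site-b": (45.0, 100.0), "site-c": (80.0, 100.0)}
for name, (peak, capacity) in sites.items():
    print(name, utilization_status(peak, capacity))
```

A weekly report could run a check like this per POP and list only the flagged and red sites.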
  14. Edge Insights: Our ability to correctly plan for capacity is critical to our bottom line. Capacity doesn’t just involve hardware; software optimizations matter. People affect capacity.
  15. Hitting The Books

  16. Defining Capacity Planning: Measuring, planning, & managing system growth. Determines what your system needs & when. From the observation of actual traffic. Use current performance as baseline. Must happen regardless of what you might optimize.
  17. Allspaw's Wisdom (flowchart from The Art of Capacity Planning): We have to be this fast & reliable (X per second & Y% uptime). Measure how fast/reliable we are. Are we right now? Yes: figure out how to stay fast/reliable enough. No: change / add / remove hardware, software, architecture.
  18. Allspaw's Wisdom: System’s ceiling: critical level of a resource that cannot be crossed without failure. Find yours. Another form of capacity planning: controlled load testing. Predictions: ceilings + historical data.
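One way to read "ceilings + historical data" is to fit a trend to past peaks and estimate when it would reach the ceiling. The sketch below uses an ordinary least-squares line; the choice of a linear fit, the function names, and the sample numbers are all assumptions made for illustration, and any such projection is a guess rather than a forecast.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_ceiling(weekly_peaks, ceiling):
    """Weeks after the last observation until the fitted line hits the ceiling.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the ceiling.
    """
    xs = list(range(len(weekly_peaks)))
    slope, intercept = fit_line(xs, weekly_peaks)
    if slope <= 0:
        return None
    crossing_week = (ceiling - intercept) / slope
    return max(0.0, crossing_week - xs[-1])


# Hypothetical weekly peak memory use (GB) against a ceiling found by load testing.
print(weeks_until_ceiling([10, 12, 14, 16], ceiling=24))
```

The ceiling itself still has to come from measurement, for example from controlled load testing; the fit only says how quickly the observed trend approaches it.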
  19. Allspaw's Wisdom: System architecture can affect your ability to add capacity. Identify & track your application’s metrics. Tying metrics to user behavior is helpful. If you don’t have ways to measure your current capacity, you can’t plan.
  20. Little’s Law & Capacity Planning: L = λW, with Capacity (L), Throughput (λ), and Latency (W). Applies to stable systems. Use this information to better understand our workload and to define constraints.
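As a small worked example of L = λW, the sketch below computes the number of requests in flight from throughput and latency, and the throughput a fixed concurrency limit can sustain. The function names and the numbers are hypothetical, and the relation only holds for a stable system measured with long-run averages.

```python
def in_flight(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W (average requests in the system)."""
    return throughput_rps * latency_s


def sustainable_throughput(concurrency_limit: float, latency_s: float) -> float:
    """Rearranged Little's Law: lambda = L / W."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return concurrency_limit / latency_s


# Hypothetical service: 200 requests/s at 50 ms average latency.
print(in_flight(200, 0.05))
# If the same service can hold at most 10 requests at once, doubling
# latency to 100 ms halves the throughput it can sustain.
print(sustainable_throughput(10, 0.10))
```

Read this way, a latency regression shows up as a capacity problem even when traffic has not changed.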
  21. Literature Insights: Possible to have plenty of capacity and a slow site nonetheless. Projections & curve fitting are guesses. Keep track of API calls & their rate. Always gonna be spikes & hiccups. Take the bad with the good & plan for it.
  22. Rituals & Myths

  23. Crowdsourcing Capacity planning

  24. Crowdsourcing Capacity planning

  25. Industry Insights: Hard to extrapolate general advice into something applicable for my situation. Simplicity & ability to reason are the only things I could trust. Confusing community stance on the ROI of capacity planning.
  26. Findings & Putting Things in Practice

  27. Steps followed. Step One, Discovery: Documented system architecture & request lifecycle. Formalized: clients, SLAs, & operational requirements. Step Two, Gauging & Planning: Confirmed constraints & determined strategy. Parallelized capacity & optimization tasks. Organized a team.
  28. Request flow (diagram) between Edge and Core. Components shown: APP / API (×2), LB (×2), Coordinators A, B, and C, and caches LON, DFW, FRA, LAX, AMS, and SYD.
  29. Steps followed. Step Three, Capacity expansion: Doubled RAM (our constrained resource). Horizontally scaled to 3 servers + 1 canary. Step Four, Re-evaluation: Start process again. Tons of tuning left to do. We know we have suboptimal configs!
  30. System Before

  31. System After

  32. System Before System After

  33. System Before System After

  34. Unexpected Challenges: Our goal when adding capacity was no service disruption. Localhost is the goddamn devil. Gap from metric/graph to insight can be huge. Slowness is the nemesis of distributed systems.
  35. The Oprah Problem: Everyone gets more capacity! Developing operational insights into a non-owned system under pressure is not great. Use playbooks, debug.md, rotations, & rollout owners. Proactivity and clarity are your best tools.
  36. Some Insights: Anything API driven ought to carry a rate limit - we can easily DDoS ourselves! Monitor and alert on expensive API actions. Mind your system dependencies: practice defensive system design & architecture. (Diagram labels: Capacity Planning, Alerting, Monitoring.)
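The talk does not say how the rate limit was implemented. A token bucket is one common way to do it, sketched below as an illustration; the class name `TokenBucket`, its parameters, and the `cost` argument for expensive actions are assumptions, not the system described in the talk.

```python
import time


class TokenBucket:
    """Allow bursts up to `burst` tokens, refilled at `rate_per_s` tokens/second."""

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        if rate_per_s <= 0 or burst <= 0:
            raise ValueError("rate_per_s and burst must be positive")
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(burst)
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the call fits the budget."""
        now = self.clock()
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical usage: 5 calls/s with bursts of 10; an expensive action costs 5 tokens.
bucket = TokenBucket(rate_per_s=5, burst=10)
print(bucket.allow())        # cheap call
print(bucket.allow(cost=5))  # expensive call draws more of the budget
```

Giving expensive API actions a higher `cost` is one way to connect the rate limit to the "monitor and alert on expensive API actions" point: the rejected calls are a natural metric to alert on.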
  37. Some Findings: Capacity tied to murky organizational structure is both good & bad (but mostly bad). Mind your error descriptions! Cheeky today ⇒ misleading tomorrow!
  38. My Insights: Finding my system’s ceiling is still tricky. Services owned by engineers means you need to level up on Ops skills. Back to re-evaluate setup to get more out of this new capacity. Performance testing ought to be done on the core’s side (& edge).
  39. TL;DR. Capacity planning: Is a process, not a one-time event. Pushes you to better understand your system, its capacity & its boundaries - that is good! Proactivity is best. Distributed systems: Request lifecycle gets tricky. System boundaries, dependencies & SLAs must be discussed. Your system’s capacity may bound other systems’ capacity.
  40. Thank you! github.com/Randommood/ZerotoCapacityPlanning. Special thanks to: Catharine Strauss, Alan Kasindorf, Matt Whiteley, Caitie McCaffrey, Thom Mahoney, Mike O’Neill, Devon O’Dell, Katherine Daniels, Nathan Taylor, Bruce Spang, and Greg Bako.
  41. github.com/Randommood/ZerotoCapacityPlanning