Zero to Capacity Planning

Ines Sombra

June 15, 2015

Transcript

  1. From Zero To Capacity Planning

  2. Ines Sombra
    @Randommood

  3. Globally distributed and Highly available

  4. Why capacity planning?
    Or a journey of discovery and ingenuity

  5. Not a monitoring person
    The views reflected in this talk are not to be considered a reflection of the skills of my coworkers, who are extremely nice human beings and way better at capacity planning than I am.

  6. The Road to Capacity Planning
    Instrument
    Monitor & alert
    Plan & predict
    ?

  7. Our Path Today
    Findings
    Books
    0
    Day One
    Some Learning
    Our Discoveries
    Rituals & Myths
    Asking Around
    Bringing it Home
    Checking The Edge

  8. zero
    … Oh shit!

  9. A convenient “situation”
    Handles State
    Many Clients
    Other systems depend on this service to be: up, healthy, and available!
    A bit F*cked

  10. Our World
    Edge, Core
    ✨ ✨

  11. a Fastly POP

  12. Meet Catharine
    “I Rule the Edge!”
    Evaluates global POP performance weekly & makes projections
    Publishes capacity performance report in a clear location
    Plans for our physical capacity & transit capacity

  13. Planning Our Capacity
    Some metrics
    - Network Capacity (Gb)
    - Ordered Network Capability (Gb)
    - Planned Network Capacity (Gb)
    - RPS Capacity (k)
    - Network peak (Gb)
    - RPS peak (k)
    - Site CPU Peak (%)
    - Network Utilization (%)
    Over 30%: flagged. Over 70%: red status.
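
The flag and red thresholds above can be sketched as a small check. This is only an illustration: it assumes utilization is computed as network peak divided by network capacity (the slide does not define it), and the function name, POP figures, and the "ok" status are hypothetical.

```python
def utilization_status(network_peak_gb, network_capacity_gb,
                       flag_at=0.30, red_at=0.70):
    """Classify a POP by network utilization.

    Assumption: utilization = peak / capacity. The 30% and 70%
    thresholds come from the slide; everything else is illustrative.
    """
    if network_capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    utilization = network_peak_gb / network_capacity_gb
    if utilization > red_at:
        status = "red"
    elif utilization > flag_at:
        status = "flagged"
    else:
        status = "ok"
    return utilization, status


# Hypothetical POPs: (network peak in Gb, network capacity in Gb)
pops = {"LON": (12.0, 100.0), "DFW": (45.0, 100.0), "SYD": (36.0, 40.0)}
report = {name: utilization_status(peak, cap)
          for name, (peak, cap) in pops.items()}
```

A weekly report could be built by running this over every POP and listing the flagged and red sites first.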

  14. Edge Insights
    Our ability to correctly plan for capacity is critical to our bottom line
    Capacity doesn’t just involve hardware; software optimizations matter
    People affect capacity

  15. Hitting The Books

  16. Defining Capacity planning
    Measuring, planning, & managing system growth
    Determines what your system needs & when
    From the observation of actual traffic. Use current performance as baseline.
    Must happen regardless of what you might optimize

  17. Allspaw's Wisdom
    From The Art of Capacity Planning
    We have to be this fast & reliable: X per second & Y% uptime
    Measure how fast/reliable we are
    Are we right now? Yes! / No!
    Change / add / remove: hardware, software, architecture
    Figure out how to stay fast/reliable enough

  18. Allspaw's Wisdom
    System’s Ceiling: critical level of a resource that cannot be crossed without failure. Find yours
    Another form of Capacity Planning: Controlled load testing
    Predictions: ceilings + historical data
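
One way to combine a ceiling with historical data is to fit a straight line through past peaks and see when the trend crosses the ceiling. The sketch below assumes roughly linear growth and evenly spaced weekly samples; the function names and sample values are hypothetical, and any such projection is a guess rather than a guarantee.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_ceiling(weekly_peaks, ceiling):
    """Weeks after the last sample until the fitted trend reaches the ceiling.

    Returns None when the trend is flat or shrinking.
    """
    xs = list(range(len(weekly_peaks)))
    slope, intercept = fit_line(xs, weekly_peaks)
    if slope <= 0:
        return None
    crossing_week = (ceiling - intercept) / slope
    return max(0.0, crossing_week - xs[-1])


# Hypothetical weekly peak usage (%) of the constrained resource
weekly_peaks = [40.0, 42.0, 44.0, 46.0, 48.0]
weeks_left = weeks_until_ceiling(weekly_peaks, ceiling=80.0)
```

With these made-up samples the trend grows by 2 points per week, so the 80% ceiling is about 16 weeks past the last sample.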

  19. Allspaw's Wisdom
    System architecture can affect your ability to add capacity
    Identify & track your application’s metrics
    Tying metrics to user behavior is helpful
    If you don’t have ways to measure your current capacity you can’t plan

  20. Little’s Law & Capacity planning
    L = λW
    Capacity (L), Throughput (λ), and Latency (W)
    Applies to stable systems
    Use this information to better understand our workload and to define constraints
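
As a worked instance of L = λW, the sketch below reads L as the average number of requests in flight, which is one common interpretation of what the slide calls capacity. The function names and numbers are hypothetical, and the law only holds for a stable system observed over a long enough window.

```python
def concurrency(throughput_rps, latency_s):
    """Little's Law: L = lambda * W (average requests in flight)."""
    return throughput_rps * latency_s


def max_throughput(capacity, latency_s):
    """Rearranged: lambda = L / W, the throughput a fixed L can sustain."""
    return capacity / latency_s


# Hypothetical workload: 200 requests/s at 250 ms average latency
in_flight = concurrency(200.0, 0.25)

# Hypothetical constraint: 64 worker slots at the same latency
ceiling_rps = max_throughput(64, 0.25)
```

Here 200 requests/s at 250 ms means about 50 requests in flight, and 64 slots at that latency top out near 256 requests/s. If latency rises, the same slots sustain less throughput.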

  21. Literature Insights
    Possible to have plenty of capacity and a slow site nonetheless
    Projections & curve fitting are guesses
    Keep track of API calls & their rate
    Always gonna be spikes & hiccups. Take the bad with the good & plan for it

  22. Rituals & Myths

  23. Crowdsourcing Capacity planning

  24. Crowdsourcing Capacity planning

  25. Industry Insights
    Hard to extrapolate general advice into something applicable for my situation
    Simplicity & ability to reason are the only things I could trust
    Confusing community stance on the ROI of capacity planning

  26. Findings
    & Putting things in practice

  27. Steps followed
    Step One: Discovery
    Documented system architecture & request lifecycle
    Formalized: clients, SLAs, & operational requirements
    Step Two: Gauging & Planning
    Confirmed constraints & determined strategy
    Parallelized capacity & optimization tasks
    Organized a team

  28. Request flow
    Diagram of the request flow between Edge and Core
    APP / API (x2), LB (x2), Coordinators A, B, C
    Caches: LON, DFW, FRA, LAX, AMS, SYD

  29. Steps followed
    Step Three: Capacity expansion
    Doubled RAM: our constrained resource
    Horizontally scaled to 3 servers + 1 canary
    Step Four: Re-evaluation
    Start process again
    Tons of tuning left to do. We know we have suboptimal configs!

  30. System Before

  31. System After

  32. System Before System After

  33. System Before System After

  34. Unexpected Challenges
    Our goal when adding capacity was no service disruption.
    Localhost is the goddamn devil
    Gap from metric/graph to insight can be huge
    Slowness is the nemesis of distributed systems

  35. The Oprah Problem
    “Everyone gets more capacity!”
    Developing operational insights into a non-owned system under pressure is not great
    Use playbooks, debug.md, rotations, & rollout owners
    Proactivity and clarity are your best tools

  36. Some Insights
    Anything API driven ought to carry a rate limit. We can easily DDoS ourselves!
    Monitor and alert on expensive API actions
    Mind your system dependencies: practice defensive system design & architecture
    Diagram: capacity planning, alerting, monitoring
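
The rate-limit advice on this slide could be implemented in many ways; a token bucket is one common choice and is sketched here. The talk does not say which algorithm was used, so the class, its parameters, and the `cost` argument for expensive API actions are all hypothetical.

```python
import time


class TokenBucket:
    """Minimal single-threaded token bucket (illustrative only)."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s      # tokens added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock          # injectable clock, useful for tests
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would keep one bucket per client or API key and reject (or queue) a request when `allow()` returns False. Passing a larger `cost` for expensive API actions makes them drain the bucket faster, and rejected calls are a natural thing to count and alert on.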

  37. Some Findings
    Capacity tied to murky organizational structure is both good & bad (but mostly bad)
    Mind your error descriptions! Cheeky today ⇒ misleading tomorrow!

  38. My Insights
    Finding my system’s ceiling is still tricky
    Services owned by engineers means you need to level up on Ops skills
    Back to re-evaluate setup to get more out of this new capacity
    Performance testing ought to be done on the core’s side (& edge)

  39. TL;DR
    Capacity planning
    Is a process, not a one-time event
    Pushes you to better understand your system, its capacity & its boundaries. That is good!
    Proactivity is best
    Distributed systems
    Request lifecycle gets tricky
    System boundaries, dependencies & SLAs must be discussed
    Your system’s capacity may bound other systems’ capacity

  40. github.com/Randommood/ZerotoCapacityPlanning
    Special Thanks to: Catharine Strauss, Alan Kasindorf, Matt Whiteley, Caitie McCaffrey, Thom Mahoney, Mike O’Neill, Devon O’Dell, Katherine Daniels, Nathan Taylor, Bruce Spang, and Greg Bako
    Thank you!

  41. github.com/Randommood/ZerotoCapacityPlanning
