
Servers are doomed to fail

JBD
May 17, 2019

Transcript

  1. Servers are
    doomed to fail
    Jaana B. Dogan
    [email protected]
    @rakyll

  2. Serverless is also
    doomed to fail
    Jaana B. Dogan
    [email protected]
    @rakyll

  3. Systems are
    doomed to fail
    Jaana B. Dogan
    [email protected]
    @rakyll

  4. Is failure OK?
    Is failure an
    unexpected case?

  5. Failure is not an exception.
    Systems change all
    the time.

  6. “I haven’t touched the code
    for a century, it should just
    work.”
    Said no one ever.

  7. Failure is expected.
    Yes, it is.

  8. @rakyll
    monitoring
    debugging
    postmortem

  9. Monitoring is about telling whether
    something is broken.

  10. “99.99% of the requests
    should return in 100ms.”
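
A target like the one on this slide can be read as a check over observed request latencies. The following is a minimal sketch, assuming latencies are available as a plain list of milliseconds; the function names and default values are illustrative and do not come from the talk.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Return the fraction of requests that finished within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(latencies_ms, threshold_ms=100.0, target=0.9999):
    """True when at least `target` of the requests met the latency threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target
```

With a 99.99% target, two slow requests out of 10,000 are already enough to miss the objective, which is why the window and the request volume matter when choosing such a number.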

  11. Debugging is
    collaborative.

  12. Debugging comes in flavors.
    Logs, traces, metrics
    ...

  13. Blameless?
    Focus on identifying
    problems.

  14. Collaboration
    Design for
    collaboration.

  15. Design
    for failure
    Set SLOs, plan for
    instrumentation, plan
    for debugging.

  16. Cross-stack
    debugging
    Accountability
    across the stack with high
    cardinality data.
    speakerdeck.com/rakyll/rpc-metrics-at-google

  17. Correlation
    Jump from one kind of
    monitoring/debugging
    data to another.
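
One common way to link metrics to traces is to attach an example trace ID to each histogram bucket, so a latency spike on a dashboard leads directly to a concrete trace. The sketch below assumes that approach; the `LatencyHistogram` class and its methods are hypothetical and are not described in the talk.

```python
import bisect


class LatencyHistogram:
    """Latency histogram that keeps one example trace ID per bucket."""

    def __init__(self, bounds_ms):
        self.bounds_ms = sorted(bounds_ms)
        # One extra bucket for values above the largest bound.
        self.counts = [0] * (len(self.bounds_ms) + 1)
        self.exemplars = [None] * (len(self.bounds_ms) + 1)

    def record(self, latency_ms, trace_id):
        """Count the observation and remember its trace as the bucket exemplar."""
        i = bisect.bisect_left(self.bounds_ms, latency_ms)
        self.counts[i] += 1
        self.exemplars[i] = trace_id  # keep the most recent trace per bucket

    def slowest_exemplar(self):
        """Return a trace ID from the highest non-empty bucket, or None."""
        for i in range(len(self.counts) - 1, -1, -1):
            if self.counts[i]:
                return self.exemplars[i]
        return None
```

Someone looking at the slow tail of the histogram can call `slowest_exemplar()` and open that trace, moving from monitoring data to debugging data without searching by hand.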

  18. On-call
    debugging
    Jump from distributed
    tracing data to on-call
    information.
    Who to page?
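
A rough illustration of going from trace data to on-call information: given the spans of one trace and a service-to-rotation mapping, pick a rotation to page. The `Span` type, the mapping, and the heuristic (slowest failing span, else slowest span) are all assumptions made for this sketch, not something the talk specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    service: str
    duration_ms: float
    error: bool = False


def who_to_page(spans: List[Span], oncall_by_service: Dict[str, str]) -> Optional[str]:
    """Naive heuristic: page the owner of the slowest failing span,
    falling back to the slowest span when nothing failed."""
    failing = [s for s in spans if s.error]
    candidates = failing or spans
    if not candidates:
        return None
    worst = max(candidates, key=lambda s: s.duration_ms)
    return oncall_by_service.get(worst.service)
```

A real system would likely account for parent and child spans and for ownership metadata kept elsewhere; the point here is only that the trace already carries enough context to start the lookup.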

  19. Dynamic
    collection
    Capability to enable
    more collection in
    production when
    needed.
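
Dynamic collection can be as simple as a sampling rate that operators are able to raise while the process keeps running. The sketch below assumes a probabilistic sampler guarded by a lock; `DynamicSampler` is a hypothetical name, and how the new rate reaches the process (an admin endpoint, a config push) is left out.

```python
import random
import threading


class DynamicSampler:
    """Probabilistic sampler whose rate can be changed at runtime."""

    def __init__(self, rate=0.01):
        self._lock = threading.Lock()
        self._rate = self._validate(rate)

    @staticmethod
    def _validate(rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        return float(rate)

    def set_rate(self, rate):
        """Raise or lower the sampling rate without restarting the service."""
        rate = self._validate(rate)
        with self._lock:
            self._rate = rate

    def should_sample(self):
        """Decide whether the current request should be collected."""
        with self._lock:
            rate = self._rate
        return random.random() < rate
```

During an incident the rate might be set to `1.0` for a short time and lowered again afterwards, which keeps the steady-state overhead small.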

  20. Continuous
    collection
    Continuously collect
    signals, generate
    fleet-wide analysis
    reports.
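
As a small illustration of a fleet-wide report, per-host profiling samples can be merged and ranked. This sketch assumes each host reports a mapping from function name to sample count; the input shape and the `fleet_report` function are hypothetical.

```python
from collections import Counter


def fleet_report(samples_by_host, top_n=3):
    """Merge per-host sample counts and return the top entries as
    (name, total_samples, share_of_fleet) tuples."""
    total = Counter()
    for host_samples in samples_by_host.values():
        total.update(host_samples)  # adds counts for matching names
    grand_total = sum(total.values())
    if grand_total == 0:
        return []
    return [
        (name, count, count / grand_total)
        for name, count in total.most_common(top_n)
    ]
```

A function that looks cheap on any single host can still dominate such a report once the samples of the whole fleet are added up.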

  21. Introspection
    Introspection pages
    provided by the
    services.
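
An introspection page can be a plain HTTP endpoint that the service itself exposes. The sketch below uses only the Python standard library; the `/debug/status` path, the `STATS` dictionary, and the field names are assumptions chosen for illustration.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
# Hypothetical in-process counters that the service would update itself.
STATS = {"requests_total": 0, "errors_total": 0}


def render_status(stats, now=None):
    """Serialize the current stats plus uptime as a JSON document."""
    now = time.time() if now is None else now
    body = dict(stats, uptime_seconds=round(now - START_TIME, 1))
    return json.dumps(body, sort_keys=True).encode("utf-8")


class IntrospectionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/debug/status":
            self.send_error(404)
            return
        payload = render_status(STATS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve the introspection page on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), IntrospectionHandler).serve_forever()
```

Binding to localhost keeps the page off the public network in this sketch; a production service would need its own access controls.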

  22. @rakyll
    monitoring
    debugging
    postmortem

  23. Thank you
    Jaana B. Dogan
    Google
    [email protected]
