Servers are doomed to fail

JBD
May 17, 2019

Transcript

  1. Servers are
    doomed to fail
    Jaana B. Dogan
    [email protected]
    @rakyll

  2. Serverless is also
    doomed to fail
    Jaana B. Dogan
    [email protected]
    @rakyll

  3. Systems are
    doomed to fail
    Jaana B. Dogan
    [email protected]
    @rakyll

  4. Is failure OK?
    Is failure an
    unexpected case?

  5. Failure is not an exception.
    Systems change all
    the time.

  6. “I haven’t touched the code
    for a century, it should just
    work.”
    Said no one ever.

  7. Failure is expected.
    Yes, it is.

  9. @rakyll
    monitoring
    debugging
    postmortem

  10. Monitoring is about saying if
    something is broken.

  11. “99.99% of the requests
    should return in 100ms.”

  14. Debugging

  15. Debugging is
    collaborative.

  16. Debugging comes in flavors.
    Logs, traces, metrics, ...

  17. Postmortems

  20. Blameless?
    Focus on identifying
    problems.

  21. Collaboration
    Design for
    collaboration.

  22. Design
    for failure
    Set SLOs, plan for
    instrumentation, plan
    for debugging.

  23. Cross-stack
    debugging
    Accountability
    across the stack with
    high-cardinality data.
    speakerdeck.com/rakyll/rpc-metrics-at-google

  24. Correlation
    Jump from one kind of
    monitoring/debugging
    data to another.

  25. On-call
    debugging
    Jump from distributed
    tracing data to on-call
    information.
    Who to page?

  26. Dynamic
    collection
    Capability to enable
    more collection in
    production when
    needed.

  27. Continuous
    collection
    Continuously collect
    signals, generate
    fleet-wide analysis
    reports.

  28. Introspection
    Introspection pages
    provided from the
    services.

  29. @rakyll
    monitoring
    debugging
    postmortem

  30. Thank you
    Jaana B. Dogan
    Google
    [email protected]
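
The objective quoted on slide 11, "99.99% of the requests should return in 100ms," can be checked mechanically against measured latencies. The following is a minimal sketch, assuming latencies are available as a plain list of milliseconds. The function name `meets_latency_slo` and its parameters are illustrative and do not come from the talk.

```python
def meets_latency_slo(latencies_ms, threshold_ms=100.0, target=0.9999):
    """Hypothetical helper: report whether enough requests were fast enough.

    Returns (ok, fraction), where fraction is the share of requests that
    completed within threshold_ms and ok is True when fraction >= target.
    """
    if not latencies_ms:
        raise ValueError("no latency samples to evaluate")
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    fraction = within / len(latencies_ms)
    return fraction >= target, fraction


# 10,000 requests, 2 of them slower than 100 ms: 99.98% are fast enough,
# which misses a 99.99% objective.
samples = [20.0] * 9998 + [250.0, 400.0]
ok, fraction = meets_latency_slo(samples)
print(ok, fraction)
```

At this target, a single slow request in 10,000 is the entire error budget, which is one reason an objective like this has to be set deliberately rather than assumed.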

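
Slide 26 asks for the capability to enable more collection in production when needed. One possible shape for that is a sampler whose rate can be changed while the service is running. This is a sketch under stated assumptions, not the mechanism described in the talk: the class `DynamicSampler` and its methods are hypothetical, and a real service would expose `set_every` through some admin or configuration channel.

```python
import threading


class DynamicSampler:
    """Hypothetical sampler: collects 1 in every `every` events.

    The rate can be changed at runtime, so collection can be turned up
    during an incident and back down afterwards without a restart.
    """

    def __init__(self, every=100):
        self._lock = threading.Lock()
        self._every = every
        self._seen = 0

    def set_every(self, every):
        """Change the sampling interval; 1 means collect every event."""
        if every < 1:
            raise ValueError("every must be >= 1")
        with self._lock:
            self._every = every
            self._seen = 0

    def should_collect(self):
        """Return True for the events that should be recorded."""
        with self._lock:
            self._seen += 1
            if self._seen >= self._every:
                self._seen = 0
                return True
            return False


sampler = DynamicSampler(every=100)
normal = sum(sampler.should_collect() for _ in range(1000))

sampler.set_every(1)  # incident: collect everything
incident = sum(sampler.should_collect() for _ in range(1000))
print(normal, incident)
```

A counter-based sampler is used here instead of a random one so the behavior is deterministic and easy to reason about. A production system might prefer probabilistic or per-trace sampling.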