Servers are doomed to fail

JBD
May 17, 2019

Transcript

  1. Servers are doomed to fail Jaana B. Dogan jbd@google.com @rakyll

  2. Serverless is also doomed to fail Jaana B. Dogan jbd@google.com @rakyll

  3. Systems are doomed to fail Jaana B. Dogan jbd@google.com @rakyll

  4. Is failure OK? Is failure an unexpected case?

  5. Failure is not an exception. Systems change all the time.

  6. “I haven’t touched the code for a century, it should just work.” Said no one ever.

  7. Failure is expected. Yes, it is.

  8. None
  9. @rakyll monitoring debugging postmortem

  10. Monitoring is about saying if something is broken.

  11. “99.99% of the requests should return in 100ms.”

  12. @rakyll

  13. @rakyll

  14. Debugging

  15. Debugging is collaborative.

  16. Debugging comes in flavors. Logs, Traces, Metrics, ...

  17. Postmortems

  18. Postmortems

  19. Postmortems

  20. Blameless? Focus on identifying problems.

  21. Collaboration: Design for collaboration.

  22. Design for failure: Set SLOs, plan for instrumentation, plan for debugging.

  23. Cross-stack debugging: Accountability across stack with high cardinality data. speakerdeck.com/rakyll/rpc-metrics-at-google

  24. Correlation: Jump from monitoring/debugging data to data.

  25. On-call debugging: Jump from distributed tracing data to on-call information. Who to page?

  26. Dynamic collection: Capability to enable more collection in production when needed.

  27. Continuous collection: Continuously collect signals, generate fleet-wide analysis reports.

  28. Introspection: Introspection pages provided from the services.

  29. @rakyll monitoring debugging postmortem

  30. Thank you Jaana B. Dogan Google jbd@google.com
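
Slide 11 states a latency objective ("99.99% of the requests should return in 100ms"), and slide 10 frames monitoring as saying whether something is broken. As a rough illustration of how such an objective might be checked against observed data, here is a minimal sketch. The function names, the default threshold and target, and the sample latencies are all assumptions made for this example; they do not come from the talk and do not describe any particular monitoring system.

```python
from typing import Iterable


def fraction_within(latencies_ms: Iterable[float], threshold_ms: float) -> float:
    """Return the fraction of requests that finished within threshold_ms."""
    latencies = list(latencies_ms)
    if not latencies:
        # With no traffic there is nothing to judge; treating this as an
        # error is an assumption of the sketch, not a rule from the talk.
        raise ValueError("no requests observed")
    fast = sum(1 for ms in latencies if ms <= threshold_ms)
    return fast / len(latencies)


def meets_slo(
    latencies_ms: Iterable[float],
    threshold_ms: float = 100.0,  # "return in 100ms" (slide 11)
    target: float = 0.9999,       # "99.99% of the requests" (slide 11)
) -> bool:
    """Report whether the observed latencies satisfy the objective."""
    return fraction_within(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    # Hypothetical window: 10,000 requests, two of them slower than 100 ms.
    window = [42.0] * 9998 + [350.0, 1200.0]
    print(f"fraction within 100 ms: {fraction_within(window, 100.0):.4f}")
    print("SLO met" if meets_slo(window) else "SLO violated")
```

In practice a check like this would probably run over a rolling window of aggregated metrics rather than a list of raw latencies, but the comparison against the target is the same.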