Servers are doomed to fail
Jaana B. Dogan
@rakyll
Serverless is also doomed to fail
Systems are doomed to fail
Is failure OK? Is failure an unexpected case?
Failure is not an exception. Systems change all the time.
“I haven’t touched the code for a century; it should just work.” Said no one ever.
Failure is expected. Yes, it is.
Monitoring, debugging, postmortems
Monitoring is about telling whether something is broken.
“99.99% of the requests should return in 100 ms.”
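To make the SLO above concrete, here is a minimal sketch of checking it against a batch of request latencies. It assumes raw per-request latencies in milliseconds are available; production systems usually evaluate SLOs from latency histograms instead. The function name `slo_compliance` and the sample data are hypothetical.

```python
"""Check a latency SLO such as "99.99% of requests return in 100 ms".

Assumption: raw per-request latencies (ms) are available; real systems
usually compute this from histograms rather than raw samples.
"""

SLO_TARGET = 0.9999       # 99.99% of requests...
SLO_THRESHOLD_MS = 100.0  # ...should return within 100 ms


def slo_compliance(latencies_ms, threshold_ms=SLO_THRESHOLD_MS, target=SLO_TARGET):
    """Return (good_ratio, met) for a batch of request latencies."""
    if not latencies_ms:
        return 1.0, True  # no traffic, so nothing violated the objective
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    ratio = good / len(latencies_ms)
    return ratio, ratio >= target


if __name__ == "__main__":
    # Hypothetical sample: 9,998 fast requests and 2 slow ones.
    sample = [20.0] * 9_998 + [250.0, 300.0]
    ratio, met = slo_compliance(sample)
    print(f"good ratio: {ratio:.5f}, SLO met: {met}")
```

Two slow requests out of 10,000 already put this hypothetical batch below 99.99%, which is why such a check belongs in monitoring rather than in occasional manual review.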
Debugging
Debugging is collaborative.
Debugging comes in flavors: logs, traces, metrics...
Postmortems
Blameless? Focus on identifying problems.
Collaboration
Design for collaboration.
Design for failure
Set SLOs, plan for instrumentation, plan for debugging.
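Setting an SLO also fixes an error budget: the share of requests allowed to miss the objective. As a worked example, assuming a hypothetical window of N = 10^6 requests and the request-based SLO quoted earlier (99.99% within 100 ms), the budget is:

```latex
\text{error budget} = (1 - 0.9999) \times N = 10^{-4} \times 10^{6} = 100 \ \text{requests}
```

Planning instrumentation around that number tells you how many slow or failed requests you can absorb before the SLO is breached.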
Cross-stack debugging
Accountability across the stack with high-cardinality data. See speakerdeck.com/rakyll/rpc-metrics-at-google
Correlation
Jump from one kind of monitoring/debugging data to another.
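One common way to make that jump is a shared trace ID: a latency sample carries the ID of a request it came from (often called an exemplar), and logs and spans are indexed by the same ID. The sketch below assumes hypothetical in-memory stores (`Signals`, `LatencySample`) standing in for a metrics backend, a log store and a tracing backend.

```python
"""Correlate monitoring and debugging data through a shared trace ID.

Hypothetical in-memory stores; in practice these would be a metrics
backend (with exemplars), a log store, and a tracing backend.
"""
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LatencySample:
    service: str
    latency_ms: float
    trace_id: str  # exemplar: links the metric sample to a concrete request


class Signals:
    def __init__(self):
        self.samples = []
        self.logs = defaultdict(list)   # trace_id -> log lines
        self.spans = defaultdict(list)  # trace_id -> span names

    def record(self, service, latency_ms, trace_id, log_line, span_name):
        """Record one request's metric sample, log line and span together."""
        self.samples.append(LatencySample(service, latency_ms, trace_id))
        self.logs[trace_id].append(log_line)
        self.spans[trace_id].append(span_name)

    def slowest(self, service):
        """Jump from monitoring data to the trace ID of the slowest request."""
        candidates = [s for s in self.samples if s.service == service]
        best = max(candidates, key=lambda s: s.latency_ms, default=None)
        return best.trace_id if best is not None else None

    def context(self, trace_id):
        """Jump from a trace ID to the logs and spans of that request."""
        return {"logs": list(self.logs[trace_id]), "spans": list(self.spans[trace_id])}


if __name__ == "__main__":
    signals = Signals()
    signals.record("frontend", 12.0, "t1", "GET /home ok", "frontend.Home")
    signals.record("frontend", 480.0, "t2", "GET /cart slow: db retry", "frontend.Cart")
    trace_id = signals.slowest("frontend")
    print(trace_id, signals.context(trace_id))
```

The design choice is to propagate one identifier everywhere rather than join signals by timestamp, which is ambiguous under load.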
On-call debugging
Jump from distributed tracing data to on-call information: who to page?
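A minimal sketch of answering "who to page?" from a trace, assuming a hypothetical `OWNERS` map from service name to on-call rotation and spans that record their self time (time excluding child spans). The culprit heuristic is an assumption: prefer the last erroring span (spans listed in start order), otherwise the span with the largest self time.

```python
"""From a distributed trace to who to page.

Hypothetical data: a trace is a list of spans in start order, and OWNERS maps
each service to its on-call rotation. Real deployments would query a tracing
backend and an on-call scheduling system instead.
"""
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    self_time_ms: float  # time spent in this span, excluding its children
    error: bool = False


OWNERS = {  # hypothetical service -> on-call rotation mapping
    "frontend": "web-oncall",
    "cart": "commerce-oncall",
    "db-proxy": "storage-oncall",
}


def who_to_page(trace, owners=OWNERS):
    """Return the rotation owning the likely culprit span, or None if empty."""
    if not trace:
        return None
    failing = [s for s in trace if s.error]
    # Heuristic: the last failing span is assumed closest to the root cause;
    # without errors, blame the span that spent the most time itself.
    culprit = failing[-1] if failing else max(trace, key=lambda s: s.self_time_ms)
    return owners.get(culprit.service, "unowned")


if __name__ == "__main__":
    trace = [Span("frontend", 5.0), Span("cart", 8.0), Span("db-proxy", 400.0)]
    print(who_to_page(trace))
```

Returning "unowned" instead of failing makes gaps in ownership data visible, which is itself something to fix before an incident.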
Dynamic collection
The ability to enable more collection in production when needed.
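A sketch of dynamic collection: raising the log level and the trace sampling rate at runtime, without a redeploy. It assumes a hypothetical admin hook (for example an authenticated HTTP handler) calls `enable_verbose` and `disable_verbose`; the logger name, `TraceSampler` and the default rate of 0.001 are illustrative, while the `logging` calls are standard library.

```python
"""Turn up collection in production without a redeploy.

Assumes a hypothetical admin hook calls enable_verbose/disable_verbose;
the logger name and default sampling rate are illustrative.
"""
import logging
import random
import threading

logger = logging.getLogger("checkout")  # hypothetical service logger
logger.setLevel(logging.INFO)


class TraceSampler:
    """Probabilistic trace sampler whose rate can change at runtime."""

    def __init__(self, rate=0.001):
        self._lock = threading.Lock()
        self._rate = rate

    def set_rate(self, rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be in [0, 1]")
        with self._lock:
            self._rate = rate

    def should_sample(self):
        with self._lock:
            rate = self._rate
        return random.random() < rate  # random() is in [0, 1)


sampler = TraceSampler()


def enable_verbose():
    """Collect more: debug logs and every trace."""
    logger.setLevel(logging.DEBUG)
    sampler.set_rate(1.0)


def disable_verbose():
    """Return to the cheap defaults."""
    logger.setLevel(logging.INFO)
    sampler.set_rate(0.001)
```

Keeping the defaults cheap and the switch reversible is the point: you pay for detailed collection only while you are investigating.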
Continuous collection
Continuously collect signals and generate fleet-wide analysis reports.
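A minimal sketch of continuous collection: each host samples its signals on a timer in the background, and a report summarizes the latest value of one signal across the fleet. `read_signals`, `Collector` and `fleet_report` are hypothetical; a real system would ship snapshots to a central store rather than keep them in memory.

```python
"""Continuously collect per-host signals and roll them up into a fleet report.

Assumptions: read_signals stands in for whatever a host exports (CPU,
memory, error counts...); snapshots stay in memory only for illustration.
"""
import statistics
import threading
import time


class Collector:
    """Samples one host's signals every interval_s seconds in the background."""

    def __init__(self, host, read_signals, interval_s=60.0):
        self.host = host
        self.read_signals = read_signals
        self.interval_s = interval_s
        self.snapshots = []  # list of (timestamp, {signal: value})
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.snapshots.append((time.time(), self.read_signals()))
            self._stop.wait(self.interval_s)  # sleeps, but wakes early on stop()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def fleet_report(collectors, signal):
    """Summarize the latest value of one signal across every reporting host."""
    latest = {c.host: c.snapshots[-1][1][signal] for c in collectors if c.snapshots}
    if not latest:
        return {"hosts": 0}
    values = list(latest.values())
    return {
        "hosts": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
        "worst_host": max(latest, key=latest.get),
    }
```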
Introspection
Introspection pages served by the services themselves.
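A sketch of an introspection page served by the service itself, using only the standard library. The `/debug/vars` path mirrors the one Go's `expvar` package registers; the exported fields (`uptime_s`, the `requests`/`errors` counters) and the helper names are hypothetical.

```python
"""A minimal introspection page served by the service itself.

Assumptions: the /debug/vars path and the exported fields are illustrative;
everything else is standard library.
"""
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START = time.time()
COUNTERS = {"requests": 0, "errors": 0}  # hypothetical service counters
COUNTERS_LOCK = threading.Lock()


def incr(name, delta=1):
    with COUNTERS_LOCK:
        COUNTERS[name] = COUNTERS.get(name, 0) + delta


def snapshot():
    """Current process state as a JSON-serializable dict."""
    with COUNTERS_LOCK:
        counters = dict(COUNTERS)
    return {"uptime_s": round(time.time() - START, 3), "counters": counters}


class IntrospectionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/debug/vars":
            self.send_error(404)
            return
        body = json.dumps(snapshot()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet; a real service would log access


def serve(port=0):
    """Start the page on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), IntrospectionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = serve()
    incr("requests")
    url = f"http://127.0.0.1:{server.server_address[1]}/debug/vars"
    with urllib.request.urlopen(url) as resp:
        print(json.load(resp))
    server.shutdown()
```

Binding to 127.0.0.1 is a deliberate default here: introspection pages expose internals and should not be publicly reachable.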
Thank you
Jaana B. Dogan
Google