Slide 1

Slide 1 text

Servers are doomed to fail
Jaana B. Dogan, jbd@google.com, @rakyll

Slide 2

Slide 2 text

Serverless is also doomed to fail

Slide 3

Slide 3 text

Systems are doomed to fail

Slide 4

Slide 4 text

Is failure OK? Is failure an unexpected case?

Slide 5

Slide 5 text

Failure is not an exception. Systems change all the time.

Slide 6

Slide 6 text

“I haven’t touched the code for a century, it should just work.” Said no one ever.

Slide 7

Slide 7 text

Failure is expected. Yes, it is.

Slide 9

Slide 9 text

Monitoring, debugging, postmortem

Slide 10

Slide 10 text

Monitoring is about telling whether something is broken.

Slide 11

Slide 11 text

“99.99% of requests should return within 100 ms.”
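
A minimal sketch of how an objective like the one quoted above could be checked against recorded latencies. The function name `slo_met`, its parameters, and the choice to treat an empty window as compliant are assumptions made for illustration, not something the talk specifies.

```python
from typing import Sequence


def slo_met(latencies_ms: Sequence[float],
            threshold_ms: float = 100.0,
            target: float = 0.9999) -> bool:
    """Return True if at least `target` of the requests finished within `threshold_ms`."""
    if not latencies_ms:
        # Assumption: a window with no traffic does not violate the objective.
        return True
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms) >= target
```

A monitoring job would call this over a sliding window of request latencies and alert when it returns False.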

Slide 14

Slide 14 text

Debugging

Slide 15

Slide 15 text

Debugging is collaborative.

Slide 16

Slide 16 text

Debugging comes in flavors: logs, traces, metrics, ...

Slide 17

Slide 17 text

Postmortems

Slide 20

Slide 20 text

Blameless? Focus on identifying problems.

Slide 21

Slide 21 text

Collaboration
Design for collaboration.

Slide 22

Slide 22 text

Design for failure
Set SLOs, plan for instrumentation, plan for debugging.

Slide 23

Slide 23 text

Cross-stack debugging
Accountability across the stack with high-cardinality data.
speakerdeck.com/rakyll/rpc-metrics-at-google

Slide 24

Slide 24 text

Correlation
Jump from one piece of monitoring or debugging data to related data.
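
One way to picture this kind of jump is a latency histogram that remembers a sample trace ID per bucket, so a slow bucket in a dashboard can lead straight to a concrete trace. The sketch below is illustrative only: the class `LatencyHistogram`, the bucket boundaries, and the trace IDs are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical bucket upper bounds, in milliseconds.
BUCKETS_MS = (50, 100, 250, 1000, float("inf"))


@dataclass
class LatencyHistogram:
    counts: dict = field(default_factory=lambda: defaultdict(int))
    exemplars: dict = field(default_factory=dict)

    def record(self, latency_ms: float, trace_id: str) -> None:
        # Find the first bucket whose upper bound covers this latency.
        bucket = next(b for b in BUCKETS_MS if latency_ms <= b)
        self.counts[bucket] += 1
        # Keep the most recent trace ID seen in this bucket as its exemplar.
        self.exemplars[bucket] = trace_id

    def trace_for(self, bucket: float):
        """Return a trace ID to inspect for the given bucket, or None."""
        return self.exemplars.get(bucket)
```

Starting from the metric ("the 1000 ms bucket is growing"), `trace_for(1000)` gives a trace to open in the tracing tool.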

Slide 25

Slide 25 text

On-call debugging
Jump from distributed tracing data to on-call information. Who to page?
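
A rough sketch of the "who to page?" question, assuming each span in a trace carries a service name and an error flag, and that a service-to-on-call mapping exists somewhere. The `Span` type, the `who_to_page` helper, and the rotation data are hypothetical; the heuristic used here (page the owner of the deepest failing span) is one possible choice, not the talk's stated method.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool


def who_to_page(spans: Iterable[Span], oncall: Dict[str, str]) -> Optional[str]:
    """Return the on-call for the deepest failing span: an errored span with no errored child."""
    spans = list(spans)
    # IDs of spans that have at least one failing child.
    failing_parents = {s.parent_id for s in spans if s.error}
    for s in spans:
        if s.error and s.span_id not in failing_parents:
            return oncall.get(s.service)
    return None
```

When an error propagates up from a backend, every caller's span is marked failed as well; picking the failing span without a failing child points at the service where the error most likely originated.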

Slide 26

Slide 26 text

Dynamic collection
Capability to enable more collection in production when needed.
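
As an illustration of the idea, the sketch below shows a sampler whose rate can be raised at runtime without a redeploy. The class `DynamicSampler` and its "collect every Nth request" policy are assumptions chosen to keep the example small and deterministic.

```python
import threading


class DynamicSampler:
    """Collect one in every `every_n` requests; the rate can be changed while running."""

    def __init__(self, every_n: int = 100) -> None:
        self._lock = threading.Lock()
        self._every_n = every_n
        self._seen = 0

    def set_every_n(self, every_n: int) -> None:
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        with self._lock:
            self._every_n = every_n

    def should_collect(self) -> bool:
        with self._lock:
            self._seen += 1
            return self._seen % self._every_n == 0
```

During an incident an operator could call `set_every_n(1)` to collect every request, then restore the lower rate afterwards.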

Slide 27

Slide 27 text

Continuous collection
Continuously collect signals, generate fleet-wide analysis reports.
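
A small sketch of what a fleet-wide report might look like, assuming each host periodically uploads profile sample counts keyed by function name. The function `fleet_report`, the host names, and the sample data shape are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def fleet_report(host_profiles: Dict[str, Dict[str, int]],
                 top: int = 3) -> List[Tuple[str, int, float]]:
    """Merge per-host sample counts and return the top functions with their fleet-wide share."""
    total: Counter = Counter()
    for samples in host_profiles.values():
        total.update(samples)
    grand_total = sum(total.values())
    if grand_total == 0:
        return []
    return [(name, count, count / grand_total)
            for name, count in total.most_common(top)]
```

Aggregating this way surfaces hot spots that look minor on any single host but are significant across the whole fleet.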

Slide 28

Slide 28 text

Introspection
Introspection pages provided from the services.
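
A minimal sketch of a service exposing its own introspection page, using only the Python standard library. The `/statusz` path, the `STATS` dictionary, and the port are assumptions for illustration; a real service would expose whatever internal state is useful to its operators.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()
STATS = {"requests": 0, "errors": 0}  # hypothetical counters updated elsewhere


def render_status(stats: dict, started_at: float, now: float) -> str:
    """Render the service's current state as a JSON document."""
    return json.dumps({"uptime_s": round(now - started_at, 1), **stats},
                      sort_keys=True)


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/statusz":
            self.send_error(404)
            return
        body = render_status(STATS, START, time.time()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve the introspection page on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling `serve()` and opening `http://127.0.0.1:8080/statusz` would show the live counters for that process.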

Slide 29

Slide 29 text

Monitoring, debugging, postmortem

Slide 30

Slide 30 text

Thank you
Jaana B. Dogan, Google, jbd@google.com