@rakyll
"100% is the wrong
reliability target for
basically everything."
-- Benjamin Treynor Sloss, VP of Engineering, Google
Slide 5
"A service is available
if users cannot tell that
there was an outage."
Slide 6
SLOs
Principled way of saying what level of downtime is acceptable.
● Error rate
● Latency expectations
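To make the idea of an "acceptable level of downtime" concrete, here is a minimal sketch of how an availability target translates into a budget. The 99.9% target, the 30-day window, and both function names are assumptions chosen for illustration; none of them come from the talk. Only the error-rate side of an SLO is covered.

```python
# Illustrative only: the 99.9% target and 30-day window are assumptions,
# not numbers from the talk.

def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is violated)."""
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 0.0
    return 1.0 - failed_requests / allowed_failures
```

Under these assumed numbers, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, and 250 failures out of 1,000,000 requests would use up about a quarter of the budget.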
Slide 7
Analytics frontend server
Authentication Reporting Users ...
Spanner
Blob Store
Slide 8
Questions infra teams want to ask:
● Are we meeting the SLO for the other team?
● What’s the impact of a product on infra?
● How much do we need to scale up if product grows 10%?
Slide 9
High-Cardinality
Breaking down the metrics data...
Slide 10
Query the collected data in various ways:
● Latency distribution for RPCs originated at Google Analytics.
● Requests that took more than 100ms for customer #123.
● Compare the request latency initiated at web vs mobile frontend.
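The queries above can be sketched against a handful of tagged latency records. This is a hedged illustration only: the record layout, the field names (`originator`, `customer`, `frontend`, `latency_ms`), and the sample values are all hypothetical and are not taken from the talk or from any particular metrics library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records; every field name and value is an assumption.
RECORDS = [
    {"originator": "analytics", "customer": 123, "frontend": "web", "latency_ms": 140},
    {"originator": "analytics", "customer": 123, "frontend": "mobile", "latency_ms": 60},
    {"originator": "reporting", "customer": 456, "frontend": "web", "latency_ms": 90},
    {"originator": "analytics", "customer": 456, "frontend": "mobile", "latency_ms": 210},
]


def slow_requests(records, customer, threshold_ms=100):
    """Requests for one customer that took longer than the threshold."""
    return [
        r for r in records
        if r["customer"] == customer and r["latency_ms"] > threshold_ms
    ]


def mean_latency_by(records, tag):
    """Mean latency grouped by the value of an arbitrary tag."""
    groups = defaultdict(list)
    for r in records:
        groups[r[tag]].append(r["latency_ms"])
    return {value: mean(latencies) for value, latencies in groups.items()}
```

Calling `mean_latency_by(RECORDS, "frontend")` compares web against mobile, and the same function grouped by `"originator"` breaks the data down by the product that started the request. Being able to group by any tag after the fact is what makes the high-cardinality data useful.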
Slide 11
Analytics frontend server
Authentication Reporting Users ...
Spanner
Blob Store
originator=analytics;
...
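A minimal sketch of how a tag such as `originator=analytics;` might travel from the frontend down to the blob store, so that the lowest layer can attribute its errors to the product that started the request. The `key=value;` wire format is inferred from the slide, and every function name here (`encode_tags`, `decode_tags`, `auth_service`, `blob_store_read`, `analytics_frontend`) is hypothetical; this is not the API of any real system.

```python
from collections import Counter

# Errors observed at the blob store, keyed by the originator tag.
read_errors = Counter()


def encode_tags(tags):
    """Serialize tags as 'key=value;' pairs (format assumed from the slide)."""
    return "".join(f"{key}={value};" for key, value in sorted(tags.items()))


def decode_tags(header):
    """Parse a 'key=value;' string back into a dict."""
    tags = {}
    for pair in header.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            tags[key] = value
    return tags


def blob_store_read(header, fail=False):
    """Lowest layer: records a read error against the propagated originator."""
    tags = decode_tags(header)
    if fail:
        read_errors[tags.get("originator", "unknown")] += 1
        return None
    return b"data"


def auth_service(header, fail=False):
    """Middle layer: forwards the tags unchanged with its downstream call."""
    return blob_store_read(header, fail=fail)


def analytics_frontend(fail=False):
    """Top layer: attaches the originator tag once, at the edge."""
    header = encode_tags({"originator": "analytics"})
    return auth_service(header, fail=fail)
```

The point of the design is that only the frontend knows who originated the request. The intermediate services just pass the tags along, and the blob store never needs to know which products exist in order to break its errors down by them.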
Slide 12
Blob store read errors by originator
Slide 13
Dynamically choose aggregation
(split between recording and aggregation)
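One way to picture the split between recording and aggregation: instrumented code only records raw, tagged measurements, and the choice of how to aggregate them (count, sum, latency distribution) is made separately and can change without touching the instrumentation. The `Recorder` class and `distribution` helper below are hypothetical names for a sketch under that assumption, not the interface of any specific library.

```python
from bisect import bisect_right
from collections import defaultdict


class Recorder:
    """Stores raw measurements; knows nothing about how they will be aggregated."""

    def __init__(self):
        self.measurements = []  # list of (value, tags) pairs

    def record(self, value, **tags):
        self.measurements.append((value, tags))

    def aggregate(self, group_by, aggregation):
        """Apply an aggregation, chosen at query time, per value of one tag."""
        groups = defaultdict(list)
        for value, tags in self.measurements:
            groups[tags.get(group_by)].append(value)
        return {key: aggregation(values) for key, values in groups.items()}


def distribution(bounds):
    """Build a histogram aggregation with the given sorted bucket boundaries."""
    def aggregation(values):
        buckets = [0] * (len(bounds) + 1)
        for v in values:
            # Values equal to a boundary fall into the bucket above it.
            buckets[bisect_right(bounds, v)] += 1
        return buckets
    return aggregation
```

With this shape, the same recorded data can be viewed as a count (`recorder.aggregate("originator", len)`), a total (`sum`), or a latency histogram (`distribution([50, 100, 200])`), and a new aggregation can be added later without redeploying the code that records.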