Slide 1

Slide 1 text

RPC Metrics at Google JBD, Google (@rakyll)

Slide 2

Slide 2 text

gRPC Metrics at Google JBD, Google (@rakyll)

Slide 3

Slide 3 text

Request Metrics at Google JBD, Google (@rakyll)

Slide 4

Slide 4 text

"100% is the wrong reliability target for basically everything." -- Benjamin Treynor Sloss, VP of Engineering, Google

Slide 5

Slide 5 text

"A service is available if users cannot tell that there was an outage."

Slide 6

Slide 6 text

SLOs: a principled way of saying what level of downtime is acceptable. ● Error rate ● Latency expectations

Slide 7

Slide 7 text

[Diagram labels: Analytics frontend server, Authentication, Reporting, Users, ..., Spanner, Blob Store]

Slide 8

Slide 8 text

Questions infra teams want to ask: ● Are we meeting the SLO for the other team? ● What’s the impact of a product on infra? ● How much do we need to scale up if the product grows 10%?

Slide 9

Slide 9 text

High-Cardinality: breaking down the metrics data...

Slide 10

Slide 10 text

Query the collected data in various ways: ● Latency distribution for RPCs that originated at Google Analytics. ● Requests that took more than 100ms for customer #123. ● Compare the latency of requests initiated at the web vs. mobile frontend.

Slide 11

Slide 11 text

[Diagram labels: Analytics frontend server, Authentication, Reporting, Users, ..., Spanner, Blob Store; tag attached to the request: originator=analytics; ...]

Slide 12

Slide 12 text

Blob store read errors by originator

Slide 13

Slide 13 text

Dynamically choose aggregation (split between recording and aggregation)
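
One way to picture the split between recording and aggregation: call sites record raw measurements, and separately registered views decide how those measurements are aggregated. The sketch below is a simplified, single-goroutine illustration; the `measure`, `view`, `countView`, and `distributionView` types are hypothetical and do not reproduce any particular library's API.

```go
package main

import (
	"fmt"
	"sort"
)

// view is a hypothetical aggregation that consumes raw measurements.
type view interface {
	add(v float64)
}

// countView aggregates measurements into a simple count.
type countView struct{ n int }

func (c *countView) add(float64) { c.n++ }

// distributionView aggregates measurements into histogram buckets.
// bounds are inclusive upper bounds; the last bucket is the overflow.
type distributionView struct {
	bounds  []float64
	buckets []int
}

func newDistributionView(bounds ...float64) *distributionView {
	return &distributionView{bounds: bounds, buckets: make([]int, len(bounds)+1)}
}

func (d *distributionView) add(v float64) {
	d.buckets[sort.SearchFloat64s(d.bounds, v)]++
}

// measure is the recording side. It forwards each value to whatever
// views are subscribed at that moment. Not safe for concurrent use.
type measure struct{ views []view }

func (m *measure) record(v float64) {
	for _, vw := range m.views {
		vw.add(v)
	}
}

func (m *measure) subscribe(vw view) { m.views = append(m.views, vw) }

func main() {
	latencyMs := &measure{}
	count := &countView{}
	latencyMs.subscribe(count)
	latencyMs.record(5)

	// Later, without touching any recording call site, add a distribution.
	dist := newDistributionView(10, 100)
	latencyMs.subscribe(dist)
	latencyMs.record(20)
	latencyMs.record(120)

	fmt.Println("count:", count.n)
	fmt.Println("buckets:", dist.buckets)
}
```

The recording code never changes when a new aggregation is introduced, which is what makes the choice of aggregation a runtime decision.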

Slide 14

Slide 14 text

Exemplars
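
An exemplar is a sample data point kept alongside an aggregate, typically so that a histogram bucket can point at one concrete request (for example, by trace ID). The slide gives no detail, so the following is only a sketch under assumptions: the `histogram` type, the "keep the most recent sample" policy, and the trace ID strings are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// exemplar is one concrete observation kept as an example for a bucket.
type exemplar struct {
	value   float64
	traceID string
}

// bucket holds the aggregated count plus an optional exemplar.
type bucket struct {
	count  int
	sample *exemplar
}

// histogram buckets latencies by inclusive upper bounds, with a final
// overflow bucket. Not safe for concurrent use.
type histogram struct {
	bounds  []float64
	buckets []bucket
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, buckets: make([]bucket, len(bounds)+1)}
}

// observe counts the value and, when a trace ID is available, keeps it
// as the bucket's exemplar (assumed policy: most recent sample wins).
func (h *histogram) observe(v float64, traceID string) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.buckets[i].count++
	if traceID != "" {
		h.buckets[i].sample = &exemplar{value: v, traceID: traceID}
	}
}

func main() {
	h := newHistogram(10, 100)
	h.observe(5, "")
	h.observe(250, "trace-abc")
	for i, b := range h.buckets {
		if b.sample != nil {
			fmt.Printf("bucket %d: count=%d exemplar=%s\n", i, b.count, b.sample.traceID)
		} else {
			fmt.Printf("bucket %d: count=%d\n", i, b.count)
		}
	}
}
```

With an exemplar attached, an unusually slow bucket leads directly to a trace that can be inspected.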

Slide 15

Slide 15 text

/rpcz and /statz

Slide 16

Slide 16 text

http://server:7777/debug/rpcz

Slide 17

Slide 17 text

Export? Monarch, Prometheus, and more.
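
Supporting several backends usually comes down to a small exporter interface that the instrumentation calls, with one implementation per backend. The sketch below assumes such an interface; `point`, `exporter`, `fanout`, and `memoryExporter` are hypothetical names, and `promLine` only approximates a Prometheus-style text line (label escaping is simplified).

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// point is one aggregated metric value with its labels.
type point struct {
	name   string
	labels map[string]string
	value  float64
}

// exporter is the assumed boundary between instrumentation and backends.
type exporter interface {
	export(p point)
}

// promLine formats a point as a Prometheus-style text line, with labels
// sorted by key so the output is deterministic.
func promLine(p point) string {
	keys := make([]string, 0, len(p.labels))
	for k := range p.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+strconv.Quote(p.labels[k]))
	}
	line := p.name
	if len(pairs) > 0 {
		line += "{" + strings.Join(pairs, ",") + "}"
	}
	return line + " " + strconv.FormatFloat(p.value, 'g', -1, 64)
}

// memoryExporter collects formatted lines; a real exporter would serve
// or push them to its backend instead.
type memoryExporter struct{ lines []string }

func (m *memoryExporter) export(p point) { m.lines = append(m.lines, promLine(p)) }

// fanout sends every point to all configured exporters.
type fanout []exporter

func (f fanout) export(p point) {
	for _, e := range f {
		e.export(p)
	}
}

func main() {
	mem := &memoryExporter{}
	var out exporter = fanout{mem}
	out.export(point{
		name:   "rpc_errors_total",
		labels: map[string]string{"originator": "analytics"},
		value:  3,
	})
	fmt.Println(strings.Join(mem.lines, "\n"))
}
```

Keeping the interface this narrow means a backend can be added or swapped without changing any recording code.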

Slide 18

Slide 18 text

import "cloud.google.com/go/pubsub"

Slide 19

Slide 19 text

+

Slide 20

Slide 20 text

Thank you! JBD, Google @rakyll