RPC Metrics at Google

JBD
August 09, 2018


Transcript

  1. RPC Metrics at Google JBD, Google (@rakyll)

  2. gRPC Metrics at Google JBD, Google (@rakyll)

  3. Request Metrics at Google JBD, Google (@rakyll)

  4. @rakyll "100% is the wrong reliability target for basically everything."

    -- Benjamin Treynor Sloss, VP of Engineering, Google
  5. @rakyll "A service is available if users cannot tell that there was an outage."
  6. @rakyll SLOs: a principled way of saying what level of downtime is acceptable.

    • Error rate
    • Latency expectations
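
As a rough illustration of the two ingredients on this slide, the sketch below checks a window of requests against an error-rate limit and a latency expectation. The `SLO` and `Request` types, their field names, and the numbers in `main` are assumptions made for this example; they do not come from the talk.

```go
package main

import (
	"fmt"
	"time"
)

// SLO is a hypothetical objective: a cap on the error rate plus a
// latency threshold that a given fraction of requests must stay under.
type SLO struct {
	MaxErrorRate     float64       // e.g. 0.001 allows 0.1% of requests to fail
	LatencyThreshold time.Duration // e.g. 100ms
	LatencyTarget    float64       // fraction of requests that must be at or under the threshold
}

// Request is one observed request outcome.
type Request struct {
	Latency time.Duration
	Failed  bool
}

// Meets reports whether the observed requests satisfy the SLO.
func (s SLO) Meets(reqs []Request) bool {
	if len(reqs) == 0 {
		return true // no traffic, so nothing was violated
	}
	var failed, fast int
	for _, r := range reqs {
		if r.Failed {
			failed++
		}
		if r.Latency <= s.LatencyThreshold {
			fast++
		}
	}
	total := float64(len(reqs))
	return float64(failed)/total <= s.MaxErrorRate &&
		float64(fast)/total >= s.LatencyTarget
}

func main() {
	// Illustrative numbers only.
	slo := SLO{MaxErrorRate: 0.25, LatencyThreshold: 100 * time.Millisecond, LatencyTarget: 0.75}
	reqs := []Request{
		{Latency: 20 * time.Millisecond},
		{Latency: 40 * time.Millisecond},
		{Latency: 90 * time.Millisecond, Failed: true},
		{Latency: 250 * time.Millisecond},
	}
	fmt.Println("SLO met:", slo.Meets(reqs))
}
```
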
  7. @rakyll [Diagram of services: Analytics frontend server, Authentication, Reporting, Users, ..., Spanner, Blob Store]
  8. @rakyll Questions infra teams want to ask:

    • Are we meeting the SLO for the other team?
    • What’s the impact of a product on infra?
    • How much do we need to scale up if a product grows 10%?
  9. @rakyll High cardinality: breaking down the metrics data...

  10. @rakyll Query the collected data in various ways:

    • Latency distribution for RPCs originating at Google Analytics.
    • Requests that took more than 100 ms for customer #123.
    • Compare the latency of requests initiated at the web vs. mobile frontend.
  11. @rakyll [Diagram of services: Analytics frontend server, Authentication, Reporting, Users, ..., Spanner, Blob Store, annotated with the tag originator=analytics; ...]
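
One way to picture the `originator=analytics; ...` annotation is as a set of tags that travels with each request from the frontend down to the storage layer. The sketch below assumes the tags live in a Go `context.Context` and are serialized as `key=value` pairs separated by `;`. The helper names (`WithTag`, `TagsFrom`, `Encode`, `Decode`) and the wire format are hypothetical, chosen only to mirror the notation on the slide.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

// tagsKey is the private context key under which tags are stored.
type tagsKey struct{}

// WithTag returns a context that carries key=value on top of any existing tags.
func WithTag(ctx context.Context, key, value string) context.Context {
	old := TagsFrom(ctx)
	tags := make(map[string]string, len(old)+1)
	for k, v := range old {
		tags[k] = v // copy so the parent context's tags are never mutated
	}
	tags[key] = value
	return context.WithValue(ctx, tagsKey{}, tags)
}

// TagsFrom returns the tags stored in ctx, or nil if there are none.
func TagsFrom(ctx context.Context) map[string]string {
	tags, _ := ctx.Value(tagsKey{}).(map[string]string)
	return tags
}

// Encode serializes tags as "k1=v1;k2=v2", sorted by key, for an outgoing request.
func Encode(tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+tags[k])
	}
	return strings.Join(parts, ";")
}

// Decode parses the serialized form on the receiving server.
func Decode(s string) map[string]string {
	tags := map[string]string{}
	for _, part := range strings.Split(s, ";") {
		if k, v, ok := strings.Cut(part, "="); ok {
			tags[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return tags
}

func main() {
	// The frontend tags the request once.
	ctx := WithTag(context.Background(), "originator", "analytics")
	header := Encode(TagsFrom(ctx)) // sent along with the outgoing RPC
	received := Decode(header)      // recovered by a downstream service
	fmt.Println(header, received["originator"])
}
```

With something like this in place, a service deep in the stack can attribute the metrics it records to the product that started the request.
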
  12. @rakyll Blob store read errors by originator

  13. @rakyll Dynamically choose aggregation (split between recording and aggregation)
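
A minimal sketch of that split, assuming a hypothetical `Recorder` that only stores tagged measurements and leaves the choice of grouping to query time. Keeping every raw measurement in memory is a simplification for illustration; the talk does not say how the data is stored.

```go
package main

import (
	"fmt"
	"time"
)

// Measurement is one raw data point: a latency plus the tags of the request.
type Measurement struct {
	Latency time.Duration
	Tags    map[string]string
}

// Recorder only collects measurements. It does not decide how they
// will be aggregated later.
type Recorder struct {
	data []Measurement
}

// Record stores a measurement together with its tags.
func (r *Recorder) Record(latency time.Duration, tags map[string]string) {
	r.data = append(r.data, Measurement{Latency: latency, Tags: tags})
}

// MeanBy computes the mean latency grouped by the given tag key.
// The key is chosen when querying, not when recording.
func (r *Recorder) MeanBy(key string) map[string]time.Duration {
	sum := map[string]time.Duration{}
	n := map[string]int64{}
	for _, m := range r.data {
		v := m.Tags[key]
		sum[v] += m.Latency
		n[v]++
	}
	mean := make(map[string]time.Duration, len(sum))
	for v, s := range sum {
		mean[v] = s / time.Duration(n[v])
	}
	return mean
}

func main() {
	var r Recorder
	r.Record(40*time.Millisecond, map[string]string{"originator": "analytics", "frontend": "web"})
	r.Record(120*time.Millisecond, map[string]string{"originator": "analytics", "frontend": "mobile"})
	r.Record(80*time.Millisecond, map[string]string{"originator": "reporting", "frontend": "web"})

	// The same recorded data, broken down along two different dimensions.
	fmt.Println(r.MeanBy("originator"))
	fmt.Println(r.MeanBy("frontend"))
}
```
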

  14. @rakyll Exemplars
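
An exemplar is commonly understood as a concrete sample, such as a trace ID, kept next to an aggregated value so that an interesting bucket can be traced back to a real request. The sketch below assumes a latency histogram that keeps the most recent exemplar per bucket. The `Histogram` and `Exemplar` types and the bucket bounds are hypothetical and are not taken from the talk.

```go
package main

import "fmt"

// Exemplar is one concrete sample kept alongside an aggregate.
type Exemplar struct {
	ValueMs float64
	TraceID string
}

// Histogram counts latencies per bucket and remembers the most recent
// exemplar that landed in each bucket.
type Histogram struct {
	Bounds    []float64   // ascending upper bounds in ms; one extra overflow bucket follows
	Counts    []int       // len(Bounds)+1 entries
	Exemplars []*Exemplar // len(Bounds)+1 entries, nil where nothing was observed
}

// NewHistogram creates a histogram with the given upper bounds.
func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{
		Bounds:    bounds,
		Counts:    make([]int, len(bounds)+1),
		Exemplars: make([]*Exemplar, len(bounds)+1),
	}
}

// Observe adds a latency to its bucket and stores it as that bucket's exemplar.
func (h *Histogram) Observe(valueMs float64, traceID string) {
	i := 0
	for i < len(h.Bounds) && valueMs > h.Bounds[i] {
		i++
	}
	h.Counts[i]++
	h.Exemplars[i] = &Exemplar{ValueMs: valueMs, TraceID: traceID}
}

func main() {
	h := NewHistogram(10, 100, 1000)
	h.Observe(7, "trace-a")
	h.Observe(250, "trace-b")
	h.Observe(480, "trace-c")

	for i, c := range h.Counts {
		if ex := h.Exemplars[i]; ex != nil {
			fmt.Printf("bucket %d: count=%d exemplar=%s (%.0fms)\n", i, c, ex.TraceID, ex.ValueMs)
		}
	}
}
```
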

  15. @rakyll /rpcz and /statz

  16. @rakyll http://server:7777/debug/rpcz

  17. @rakyll Export? Monarch, Prometheus, and more.

  18. @rakyll import "cloud.google.com/go/pubsub"

  19. @rakyll +

  20. Thank you! JBD, Google jbd@google.com @rakyll