Service Level Monitoring

Rafael Jesus
October 18, 2018

Transcript

  1. Agenda

    Request Life Cycle; The Pillars of Observability; SRE's 4 golden signals; OpenCensus; Service Mesh; Hands on; Actionable Alerts; Next Steps
  2. Request Life Cycle

    A user request propagates through CDN, DNS, load balancers, API gateways, and microservices, which in turn perform more requests, send asynchronous messages, and so on.
  3. X-Request-ID

    Extract it from the HTTP request headers. Generate one if none is given. Use it for all log operations in the service. Correlate later.
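The X-Request-ID steps above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the logger name `service`, and the choice of a UUID4 as the generated ID are assumptions, not something the slides specify.

```python
import logging
import uuid


def get_or_create_request_id(headers):
    """Return the X-Request-ID from the headers, generating one if absent."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    normalized = {name.lower(): value for name, value in headers.items()}
    request_id = normalized.get("x-request-id")
    if not request_id:
        # Assumption: a random UUID4 is an acceptable request ID.
        request_id = str(uuid.uuid4())
    return request_id


def request_logger(request_id):
    """Return a logger that attaches request_id to every record it emits."""
    base = logging.getLogger("service")  # hypothetical logger name
    return logging.LoggerAdapter(base, {"request_id": request_id})


if __name__ == "__main__":
    logging.basicConfig(
        format="%(levelname)s request_id=%(request_id)s %(message)s",
        level=logging.INFO,
    )
    incoming = {"X-Request-ID": "abc-123"}
    log = request_logger(get_or_create_request_id(incoming))
    log.info("handling request")
```

Because every log line carries the same `request_id` field, the lines for one request can be correlated later in whatever log UI is in use.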
  4. Distributed Tracing

    Microservices are great for turning method calls into distributed computing problems. Distributed tracing shows you how a request propagates throughout your application or set of services, helping you understand the bottlenecks in your architecture by visualizing how data flows between all of your services.
  5. Contextualized Logs

    Correlate logs for seamless visualization in the UI. Hint: use request.id as a tag and trace the request end to end (e2e).
  6. The Challenges of Tracing

    Every component of a request needs to be modified to propagate tracing. This is more challenging in places with a polyglot architecture. A sampling strategy has to be chosen (Constant, Probabilistic, Rate Limiting, Remote, etc.). Engineers need to instrument the code (White Box).
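To make three of the sampling strategies named above concrete, here is a rough sketch of constant, probabilistic, and rate-limiting samplers. The class names and the token-bucket approach for rate limiting are illustrative assumptions; real tracing libraries have their own sampler APIs, which this does not reproduce.

```python
import random
import time


class ConstantSampler:
    """Always makes the same decision: sample everything or nothing."""

    def __init__(self, decision):
        self.decision = decision

    def should_sample(self):
        return self.decision


class ProbabilisticSampler:
    """Samples each trace independently with probability `rate`."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self.rng = rng or random.Random()

    def should_sample(self):
        # random() returns a float in [0.0, 1.0).
        return self.rng.random() < self.rate


class RateLimitingSampler:
    """Samples at most `max_per_second` traces, using a token bucket."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self.tokens = float(max_per_second)
        self.last = clock()

    def should_sample(self):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket size.
        refill = (now - self.last) * self.max_per_second
        self.tokens = min(self.max_per_second, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A remote strategy would fetch one of these configurations from a central service at runtime; that part is omitted here.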
  7. Metrics

    A metric is a piece of data that you would like to track, such as latency in a service or database. Metrics can help you understand the performance and overall quality of your application and set of services. Data-driven decisions win over decisions based on feelings or the opinion of the most senior employee in the room. Testing in Production.
  8. SRE's golden signals

    From the Google SRE book: Latency, Traffic, Errors, Saturation. Industry experience has shown that they contain all the information you need to know what's going on and where. They are critical for ops teams to monitor their systems and identify problems.
  9. Traffic

    Incoming Request Rate (HTTP); Incoming Network Traffic; Outgoing Request Rate (HTTP); Breakdown by path, method and code; Outgoing Query Rate (Postgres); Outgoing Message Rate (RabbitMQ); Outgoing Network Traffic
  10. Latency

    Incoming Request Latency (HTTP); Outgoing Request Latency (HTTP); Breakdown by path, method and code; Outgoing Query Latency (Postgres); Outgoing Message Latency (RabbitMQ); Outgoing Cache Latency (Redis)
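One way to picture the "breakdown by path, method and code" is a latency histogram keyed by those three labels. The sketch below is a hand-rolled illustration, not the API of any metrics library; the bucket boundaries and the `LatencyHistogram` name are hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical bucket upper bounds, in seconds.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


class LatencyHistogram:
    """Counts observations per (method, path, code) in latency buckets."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(sorted(buckets))
        # One extra slot at the end for observations above the last bound.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))

    def observe(self, method, path, code, seconds):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        index = bisect.bisect_left(self.buckets, seconds)
        self.counts[(method, path, code)][index] += 1
```

The same shape applies to the outgoing Postgres, RabbitMQ, and Redis latencies, with labels suited to each dependency instead of the HTTP ones.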
  11. Errors

    API Success Rate (non-4xx|5xx response): a measurement of user satisfaction. Are users happy with 4xx? API Error Rate Percentage (5xx response). API Error Rate By Code (HTTP).
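The three error metrics above can be computed from a list of response status codes. This is a small sketch under the definitions given on the slide (success means neither 4xx nor 5xx, error rate counts only 5xx); the function name and the returned dictionary keys are made up for illustration.

```python
from collections import Counter


def error_metrics(status_codes):
    """Compute success rate, 5xx error percentage, and errors by code."""
    total = len(status_codes)
    if total == 0:
        # No traffic: the rates are undefined rather than zero.
        return {"success_rate": None, "error_rate_pct": None, "errors_by_code": {}}
    by_code = Counter(code for code in status_codes if code >= 400)
    failures = sum(by_code.values())  # 4xx and 5xx together
    server_errors = sum(n for code, n in by_code.items() if code >= 500)
    return {
        "success_rate": (total - failures) / total,
        "error_rate_pct": 100.0 * server_errors / total,
        "errors_by_code": dict(by_code),
    }
```

For seven 200s, two 404s, and one 500, this gives a success rate of 0.7 and an error rate of 10%, which shows how 4xx responses lower the success rate without counting as server errors.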
  12. OpenCensus

    Provides libraries for metrics collection and tracing. Supported languages include Go, Java, C++, Ruby, Erlang, Python, and PHP. Supported backends include Datadog, Honeycomb, Jaeger, Zipkin, Stackdriver, and Prometheus. Instrument Traffic, Latency and Errors. Originates from Google, recommended by Google SREs.
  13. Service Meshes

    Integration of tracing functionality is almost effortless. Data planes implement tracing and stats collection at the proxy level. Applications that are part of the mesh need to forward headers to the next hop in the mesh. High adoption rate (GitHub, Red Hat, Walmart). Still in very early stages. Current players are: Istio, Linkerd, Envoy, Consul, Nginmesh.
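The header forwarding mentioned above can be sketched as a filter that copies tracing headers from the incoming request onto outgoing calls. The slides do not say which headers to forward; the list below assumes B3-style propagation as used by Envoy-based meshes such as Istio, and the function name is hypothetical.

```python
# Assumption: B3-style tracing headers, as used by Envoy-based meshes.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def headers_to_forward(incoming_headers):
    """Return only the tracing headers that should go to the next hop."""
    wanted = set(TRACE_HEADERS)
    # Header names are case-insensitive; keep the original spelling and value.
    return {
        name: value
        for name, value in incoming_headers.items()
        if name.lower() in wanted
    }
```

The returned dictionary would be merged into the headers of every outgoing HTTP call made while handling the request, so the proxies can stitch the spans into one trace.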
  14. Alerting

    Avoid Alert Fatigue. Pod was restarted; Error Rate >= 0.001%; CronJob Failed; CPU/Memory is too High; Health Check Failed
  15. TooManySlowConfigurationsQuery

    Paging on the 99th percentile can be too noisy and unactionable. Page when good requests (2xx) are having high response times.
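As a rough illustration of "page when good requests are slow", the check below looks only at 2xx responses and pages when too large a share of them exceed a latency threshold. All thresholds and the function name are hypothetical; the slides do not give the actual alert rule.

```python
def should_page(requests, latency_threshold_s=0.5, max_slow_ratio=0.05, min_requests=20):
    """Decide whether to page, given (status_code, latency_seconds) pairs.

    All default thresholds are illustrative assumptions.
    """
    # Only successful (2xx) requests count; failures are covered by error alerts.
    good = [latency for status, latency in requests if 200 <= status < 300]
    if len(good) < min_requests:
        # Too little traffic to draw a conclusion, so stay quiet.
        return False
    slow = sum(1 for latency in good if latency > latency_threshold_s)
    return slow / len(good) > max_slow_ratio
```

Compared with paging on a raw 99th percentile, a ratio over a minimum number of requests is less sensitive to a handful of outliers in low-traffic periods.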
  16. Next Steps

    SLO Workshop with SRE Team; Choosing a good SLI; Setting Goals for Service Reliability; SLO based alerting
  17. Conclusion

    As Brian Knox, who manages the Observability Team at DigitalOcean, said: "The goal of an Observability team is not to collect logs, metrics, or traces. It is to build a culture of engineering based on facts and feedback, and then spread that culture within the broader organization."