Service Level Monitoring

Agenda
- Request Life Cycle
- The Pillars of Observability
- SRE's 4 golden signals
- OpenCensus
- Service Mesh
- Hands on
- Actionable Alerts
- Next Steps

Request Life Cycle
A user request propagates through the CDN, DNS, load balancers, API gateways, and microservices, which in turn perform more requests, publish asynchronous messages, and so on.

Simple Request

More Complex Request

Three Pillars of Observability
- Structured Logging
- Metrics
- Traces

Logging
Immutable records of events that happened over time

X-Request-ID
- Extract it from the HTTP request headers
- Generate one if none is given
- Use it for all log operations in the service
- Correlate later
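The steps on this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code from the talk: the function names `get_request_id` and `request_logger` and the logger name `service` are hypothetical.

```python
import logging
import uuid

REQUEST_ID_HEADER = "X-Request-ID"


def get_request_id(headers):
    """Return the caller's request ID, or generate one if none was given."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    request_id = lowered.get(REQUEST_ID_HEADER.lower(), "").strip()
    return request_id or str(uuid.uuid4())


def request_logger(headers):
    """Build a logger that attaches the request ID to every record."""
    request_id = get_request_id(headers)
    base = logging.getLogger("service")  # hypothetical logger name
    # LoggerAdapter adds the `extra` dict to each log record it emits.
    return logging.LoggerAdapter(base, {"request_id": request_id})
```

With a formatter such as `logging.Formatter("%(request_id)s %(message)s")`, every line logged through the adapter carries the ID, which is what makes later correlation possible.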

Distributed Tracing
- Microservices are great for turning method calls into distributed computing problems
- Tracing shows you how a request propagates through your application or set of services, helping you understand the bottlenecks in your architecture by visualizing how data flows between all of your services

Jaeger Example

Contextualized Logs
- Correlate logs for seamless visualization in the UI
- Hint: use request.id as a tag and trace the request end to end

The Challenges of Tracing
- Every component of a request needs to be modified to propagate tracing
- More challenging in places with a polyglot architecture
- Sampling Strategy (Constant, Probabilistic, Rate Limiting, Remote, etc.)
- Engineers need to instrument the code (White Box)
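To make the sampling strategies concrete, here is a rough sketch of the constant, probabilistic and rate-limiting variants. The class names and the `should_sample` method are hypothetical and do not mirror the API of any particular tracing client. The rate limiter is written as a token bucket, which is one common way to implement it.

```python
import random
import time


class ConstSampler:
    """Constant strategy: sample every trace, or none at all."""

    def __init__(self, decision):
        self.decision = bool(decision)

    def should_sample(self):
        return self.decision


class ProbabilisticSampler:
    """Sample each trace independently with probability `rate`."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self._rng = rng or random.Random()

    def should_sample(self):
        # random() returns a float in [0, 1), so rate=1.0 always samples.
        return self._rng.random() < self.rate


class RateLimitingSampler:
    """Token bucket: allow at most `max_per_second` sampled traces."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self._clock = clock
        self._tokens = float(max_per_second)
        self._last = clock()

    def should_sample(self):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        refill = (now - self._last) * self.max_per_second
        self._tokens = min(self.max_per_second, self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

The remote strategy from the slide is omitted here because it depends on a central service handing the sampling configuration to each client.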

Metrics
- A metric is a piece of data that you would like to track, such as latency in a service or database
- Metrics help you understand the performance and overall quality of your application and set of services
- Data-driven decisions win over decisions based on feelings or the opinion of the most senior employee in the room
- Testing in Production

SRE's golden signals
- From the Google SRE book: Latency, Traffic, Errors, Saturation
- Industry experience has shown that they contain all the information you need to know what's going on and where
- They are critical for ops teams to monitor their systems and identify problems

Traffic
- Incoming Request Rate (HTTP)
- Incoming Network Traffic
- Outgoing Request Rate (HTTP)
- Breakdown by path, method and code
- Outgoing Query Rate (Postgres)
- Outgoing Message Rate (RabbitMQ)
- Outgoing Network Traffic
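As an illustration of a request rate broken down by path, method and code, the sketch below keeps an in-memory counter per label combination. In practice a metrics library and backend would do this. `RequestCounter` and `rate_per_second` are hypothetical names used only for this example.

```python
from collections import Counter


class RequestCounter:
    """Counts requests per (path, method, code) label combination."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, path, method, code):
        self._counts[(path, method, code)] += 1

    def total(self, path=None, method=None, code=None):
        """Sum the counts matching the given labels; None matches anything."""
        wanted = (path, method, code)
        return sum(
            count
            for labels, count in self._counts.items()
            if all(w is None or w == actual for w, actual in zip(wanted, labels))
        )


def rate_per_second(count_before, count_after, interval_seconds):
    """Turn two readings of an ever-increasing counter into a rate."""
    return (count_after - count_before) / interval_seconds
```

Counters only ever increase, so the rate is always derived from the difference between two readings taken a known interval apart.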

Traffic Dashboard

Latency
- Incoming Request Latency (HTTP)
- Outgoing Request Latency (HTTP)
- Breakdown by path, method and code
- Outgoing Query Latency (Postgres)
- Outgoing Message Latency (RabbitMQ)
- Outgoing Cache Latency (Redis)
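Latency is usually reported as a percentile per label combination instead of an average. The sketch below uses the nearest-rank percentile definition over raw samples. That choice is an assumption made for illustration, since production systems typically estimate percentiles from histogram buckets. The function names are hypothetical.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; clamp so that pct=0 returns the smallest sample.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_label(observations, pct):
    """Group (path, method, code, seconds) tuples and take a percentile."""
    grouped = defaultdict(list)
    for path, method, code, seconds in observations:
        grouped[(path, method, code)].append(seconds)
    return {labels: percentile(values, pct) for labels, values in grouped.items()}
```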

Latency Dashboard

Errors
- API Success Rate (non-4xx|5xx response)
- Measurement of user satisfaction
- Are users happy with 4xx?
- API Error Rate Percentage (5xx response)
- API Error Rate By Code (HTTP)
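The three error metrics on this slide can be derived from a single map of response counts per status code. A minimal sketch, assuming the input is a plain dict such as `{200: 90, 404: 5, 500: 5}`. The function names are hypothetical.

```python
def _percent(part, total):
    # With no traffic there is nothing to measure; report 0.0 and let the
    # caller decide how to treat "no data".
    return 100.0 * part / total if total else 0.0


def success_rate_percent(counts_by_code):
    """Share of responses that are neither 4xx nor 5xx."""
    total = sum(counts_by_code.values())
    good = sum(n for code, n in counts_by_code.items() if code < 400)
    return _percent(good, total)


def error_rate_percent(counts_by_code):
    """Share of responses that are 5xx."""
    total = sum(counts_by_code.values())
    errors = sum(n for code, n in counts_by_code.items() if 500 <= code <= 599)
    return _percent(errors, total)


def error_rate_by_code(counts_by_code):
    """Per-code share of all responses, for 5xx codes only."""
    total = sum(counts_by_code.values())
    return {
        code: _percent(n, total)
        for code, n in counts_by_code.items()
        if 500 <= code <= 599
    }
```

Note that 4xx responses lower the success rate without raising the error rate, which is the gap the "Are users happy with 4xx?" question points at.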

Errors Example

Saturation & Utilisation
How overloaded is something?
- Memory Usage
- CPU Usage
- Pods Count
- FPM Queue Size

Saturation & Utilisation Dashboard

OpenCensus
- Provides libraries for metrics collection and tracing
- Supported languages include Go, Java, C++, Ruby, Erlang, Python, and PHP
- Supported backends include Datadog, Honeycomb, Jaeger, Zipkin, Stackdriver, Prometheus
- Instrument Traffic, Latency and Errors
- Originates from Google, recommended by Google SREs

Service Meshes
- Integration of tracing functionality is almost effortless
- Data planes implement tracing and stats collection at the proxy level
- Applications that are part of the mesh need to forward headers to the next hop in the mesh
- High Adoption Rate (GitHub, Red Hat, Walmart)
- Still in very early stages
- Current players are: Istio, Linkerd, Envoy, Consul, Nginmesh
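The header-forwarding requirement is the one piece of work a mesh leaves to the application. A minimal sketch of it follows. The header list is an assumption based on the B3-style headers commonly documented for Envoy-based meshes, and the right set depends on the mesh and tracer in use. `forward_trace_headers` is a hypothetical helper name.

```python
# Assumed header set (B3-style, as commonly used with Envoy-based meshes).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def forward_trace_headers(incoming_headers):
    """Pick the tracing headers to copy onto the next outgoing request."""
    # Header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

The returned dict would be merged into the headers of every outgoing call made while handling the incoming request, so the proxies can stitch the hops into one trace.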

Alerting
Avoid Alert Fatigue
- Pod was restarted
- Error Rate >= 0.001%
- CronJob Failed
- CPU/Memory is too High
- Health Check Failed

From Google SRE workshop

Alert on Symptoms
- HighAPIErrorRate
- HighAPIResponseLatency
Advanced
- TooManyHTTP5xxErrors
- TooManySlowUsersCheckout
- TooManySlowUsersMenuQuery
- TooManySlowUsersSubscription
- ErrorBudgetWillExhaust
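An alert like ErrorBudgetWillExhaust is commonly built on a burn rate: how fast errors consume the budget relative to the pace the SLO allows. The sketch below is one possible formulation, not the rule used in the talk. The function names are hypothetical, and the 14.4 threshold and two-window check are assumed defaults that should be tuned per service.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the ratio the SLO allows.

    A value of 1.0 means the budget is consumed exactly at the allowed pace.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_will_exhaust(short_window_ratio, long_window_ratio, slo_target,
                        threshold=14.4):
    """Fire only when both a short and a long window burn too fast.

    Requiring both windows keeps brief spikes from paging anyone.
    """
    return (
        burn_rate(short_window_ratio, slo_target) >= threshold
        and burn_rate(long_window_ratio, slo_target) >= threshold
    )
```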

Google Page Example

TooManySlowConfigurationsQuery
- Paging on the 99th percentile can be too noisy and unactionable
- Page when good requests (2xx) have high response times
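One way to express "page when good requests are slow" is to measure the share of 2xx responses that exceed a latency threshold, leaving failed requests to the error alerts. A minimal sketch follows. The function name and the input shape, a list of `(status_code, seconds)` tuples, are assumptions for illustration.

```python
def slow_good_request_ratio(requests, threshold_seconds):
    """Fraction of 2xx responses slower than the threshold."""
    good = [seconds for code, seconds in requests if 200 <= code <= 299]
    if not good:
        # No successful requests in the window, so nothing to judge.
        return 0.0
    slow = sum(1 for seconds in good if seconds > threshold_seconds)
    return slow / len(good)
```

An alert would then compare this ratio against a target over a time window, for example paging when more than a chosen share of successful requests is slow.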

Hands on

Enable Prometheus

Enable Tracing

Incoming HTTP Traffic

Outgoing HTTP Traffic

HTTP Client

Next Steps
- SLO Workshop with SRE Team
- Choosing a good SLI
- Setting Goals for Service Reliability
- SLO based alerting
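As a small worked example of what setting a reliability goal implies, the sketch below converts an SLO target into an error budget, both as allowed downtime and as the share of budget left for a request-based SLI. The function names and the 30-day window are assumptions for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of full unavailability the SLO tolerates in the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For a 99.9% target over 30 days this works out to roughly 43.2 minutes of budget, and a service with 9,950 good requests out of 10,000 against a 99% target has about half of its budget left.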

Conclusion
As Brian Knox, who manages the Observability Team at DigitalOcean, said:
"The goal of an Observability team is not to collect logs, metrics, or traces. It is to build a culture of engineering based on facts and feedback, and then spread that culture within the broader organization."

Thank you

More Decks by Rafael Jesus
