Service Level Monitoring

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Agenda
- Request Life Cycle
- The Pillars of Observability
- SRE's 4 golden signals
- OpenCensus
- Service Mesh
- Hands on
- Actionable Alerts
- Next Steps

Slide 3

Slide 3 text

Request Life Cycle
A user request propagates through CDNs, DNS, load balancers, API gateways, and microservices, which in turn issue further requests, asynchronous messages, and so on.

Slide 4

Slide 4 text

Simple Request

Slide 5

Slide 5 text

More Complex Request

Slide 6

Slide 6 text

Three Pillars of Observability
- Structured Logging
- Metrics
- Traces

Slide 7

Slide 7 text

Logging
Immutable records of events that happened over time

Slide 8

Slide 8 text

X-Request-ID
- Extract it from the HTTP request headers
- Generate one if none is given
- Use it for all log operations in the service
- Correlate later
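The steps on this slide can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the function names `resolve_request_id` and `logger_for` and the logger name `service` are hypothetical, not part of any framework mentioned in the talk.

```python
import logging
import uuid

REQUEST_ID_HEADER = "X-Request-ID"


def resolve_request_id(headers):
    """Return the caller's request ID, or generate one if none was sent."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    # No usable ID was supplied: generate a fresh one (assumed format: UUID4 hex).
    return uuid.uuid4().hex


def logger_for(request_id):
    """Wrap the service logger so every record carries the request ID."""
    base = logging.getLogger("service")  # hypothetical logger name
    return logging.LoggerAdapter(base, {"request_id": request_id})


if __name__ == "__main__":
    logging.basicConfig(
        format="%(levelname)s request_id=%(request_id)s %(message)s",
        level=logging.INFO,
    )
    log = logger_for(resolve_request_id({"x-request-id": "abc-123"}))
    log.info("handling request")
```

Because the ID is attached through the adapter rather than passed to each call, every log line written for the request can be correlated later without the handler code repeating it.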

Slide 9

Slide 9 text

Distributed Tracing
- Microservices are great for turning method calls into distributed computing problems
- Tracing shows you how a request propagates throughout your application or set of services, helping you understand the bottlenecks in your architecture by visualizing how data flows between all of your services

Slide 10

Slide 10 text

Jaeger Example

Slide 11

Slide 11 text

Contextualized Logs
- Correlate logs for seamless visualization in the UI
- Hint: use request.id as a tag and trace the request end to end (e2e)

Slide 12

Slide 12 text

The Challenges of Tracing
- Every component of a request needs to be modified to propagate tracing
- More challenging in places with a polyglot architecture
- Sampling strategy (Constant, Probabilistic, Rate Limiting, Remote, etc.)
- Engineers need to instrument the code (white box)
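To make the sampling strategies named above concrete, here is a small sketch of three of them. It is not the code of any particular tracer; the class names and the `should_sample` method are assumptions chosen for illustration. The probabilistic sampler decides from the trace ID itself, so that every service seeing the same trace reaches the same decision.

```python
import time


class ConstSampler:
    """Constant strategy: sample every trace, or none."""

    def __init__(self, decision):
        self.decision = bool(decision)

    def should_sample(self, trace_id):
        return self.decision


class ProbabilisticSampler:
    """Sample a fixed fraction of traces, decided from the 64-bit trace ID."""

    def __init__(self, rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.threshold = int(rate * (1 << 64))

    def should_sample(self, trace_id):
        # Keep the low 64 bits and compare against the threshold.
        return (trace_id & ((1 << 64) - 1)) < self.threshold


class RateLimitingSampler:
    """Token bucket: sample at most `max_per_second` traces on average."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.rate = float(max_per_second)
        self.capacity = max(1.0, self.rate)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def should_sample(self, trace_id):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A remote strategy would fetch one of these configurations from a central service at runtime instead of hard-coding it.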

Slide 13

Slide 13 text

Metrics
- A metric is a piece of data that you would like to track, such as latency in a service or database; metrics help you understand the performance and overall quality of your application and set of services
- Data-driven decisions win over decisions based on feelings or on the opinion of the most senior employee in the room
- Testing in Production

Slide 14

Slide 14 text

SRE's golden signals
- From the Google SRE book: Latency, Traffic, Errors, Saturation
- Industry experience has shown that they contain all the information you need to know what's going on and where
- They are critical for ops teams to monitor their systems and identify problems

Slide 15

Slide 15 text

Traffic
- Incoming Request Rate (HTTP)
- Incoming Network Traffic
- Outgoing Request Rate (HTTP)
- Breakdown by path, method and code
- Outgoing Query Rate (Postgres)
- Outgoing Message Rate (RabbitMQ)
- Outgoing Network Traffic

Slide 16

Slide 16 text

Traffic Dashboard

Slide 17

Slide 17 text

Latency
- Incoming Request Latency (HTTP)
- Outgoing Request Latency (HTTP)
- Breakdown by path, method and code
- Outgoing Query Latency (Postgres)
- Outgoing Message Latency (RabbitMQ)
- Outgoing Cache Latency (Redis)
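As a rough illustration of "breakdown by path, method and code", the sketch below keeps latency samples per (method, path, code) and reads a percentile back with the nearest-rank method. The `LatencyRecorder` class is hypothetical; a real setup would normally use the histogram type of its metrics library rather than storing raw samples.

```python
import math
from collections import defaultdict


class LatencyRecorder:
    """Keep latency samples (in milliseconds) per (method, path, code)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def observe(self, method, path, code, millis):
        self._samples[(method, path, code)].append(millis)

    def percentile(self, method, path, code, q):
        """Nearest-rank percentile for 0 < q <= 100; None if no samples exist."""
        samples = sorted(self._samples.get((method, path, code), []))
        if not samples:
            return None
        # Nearest rank: the smallest value with at least q% of samples at or below it.
        rank = max(1, math.ceil(q * len(samples) / 100))
        return samples[rank - 1]
```

Keeping the status code in the key makes it possible to look at the latency of successful and failed requests separately.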

Slide 18

Slide 18 text

Latency Dashboard

Slide 19

Slide 19 text

Errors
- API Success Rate (non-4xx|5xx response)
- Measurement of user satisfaction
- Are users happy with 4xx?
- API Error Rate Percentage (5xx response)
- API Error Rate By Code (HTTP)
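The three error metrics above can be derived from the same stream of response codes. The sketch below assumes the definitions given on the slide (success means neither 4xx nor 5xx; error rate counts only 5xx); the function name `error_metrics` and the returned keys are illustrative.

```python
from collections import Counter


def error_metrics(status_codes):
    """Summarise HTTP status codes into success rate, 5xx error rate and per-code counts."""
    by_code = Counter(status_codes)
    total = sum(by_code.values())
    if total == 0:
        return {"success_rate": None, "error_rate_pct": None, "by_code": {}}
    client_errors = sum(n for code, n in by_code.items() if 400 <= code <= 499)
    server_errors = sum(n for code, n in by_code.items() if 500 <= code <= 599)
    return {
        # Fraction of responses that were neither 4xx nor 5xx.
        "success_rate": (total - client_errors - server_errors) / total,
        # Percentage of responses that were 5xx.
        "error_rate_pct": 100 * server_errors / total,
        "by_code": dict(by_code),
    }
```

Whether 4xx responses should count against the success rate is the question the slide raises; this sketch counts them, which treats a rejected request as an unhappy user.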

Slide 20

Slide 20 text

Errors Example

Slide 21

Slide 21 text

Saturation & Utilisation
How overloaded is something?
- Memory Usage
- CPU Usage
- Pods Count
- FPM Queue Size

Slide 22

Slide 22 text

Saturation & Utilisation Dashboard

Slide 23

Slide 23 text

OpenCensus
- Provides libraries for metrics collection and tracing
- Supported languages include Go, Java, C++, Ruby, Erlang, Python, and PHP
- Supported backends include Datadog, Honeycomb, Jaeger, Zipkin, Stackdriver, Prometheus
- Instrument Traffic, Latency and Errors
- Originates from Google, recommended by Google SREs

Slide 24

Slide 24 text

ocsql

Slide 25

Slide 25 text

Service Meshes
- Integration of tracing functionality is almost effortless
- Data planes implement tracing and stats collection at the proxy level
- Applications that are part of the mesh need to forward headers to the next hop in the mesh
- High adoption rate (GitHub, Red Hat, Walmart)
- Still in very early stages
- Current players are: Istio, Linkerd, Envoy, Consul, Nginmesh
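The one thing the application still has to do in a mesh is copy the tracing headers from the incoming request onto its outgoing calls. A minimal sketch follows; the header list is the B3-style set commonly forwarded in Envoy-based meshes such as Istio, and is an assumption here, since other meshes or tracers may expect different headers. The function name `forward_trace_headers` is hypothetical.

```python
# Assumed header set (B3-style, as used by Envoy-based meshes); adjust for your mesh.
PROPAGATED_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def forward_trace_headers(incoming):
    """Pick the tracing headers out of an incoming request's headers.

    The result is meant to be merged into the headers of every outgoing
    call made while handling that request.
    """
    wanted = set(PROPAGATED_HEADERS)
    return {
        name.lower(): value
        for name, value in incoming.items()
        if name.lower() in wanted
    }
```

If a service drops these headers, the proxies on either side still emit spans, but they can no longer be joined into a single trace.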

Slide 26

Slide 26 text

Alerting
Avoid Alert Fatigue
- Pod was restarted
- Error Rate >= 0.001%
- CronJob Failed
- CPU/Memory is too High
- Health Check Failed

Slide 27

Slide 27 text

From Google SRE workshop

Slide 28

Slide 28 text

Alert on Symptoms
- HighAPIErrorRate
- HighAPIResponseLatency
Advanced
- TooManyHTTP5xxErrors
- TooManySlowUsersCheckout
- TooManySlowUsersMenuQuery
- TooManySlowUsersSubscription
- ErrorBudgetWillExhaust
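An alert like ErrorBudgetWillExhaust is usually built on the burn rate of the error budget. The sketch below shows the arithmetic only; the 30-day window, the 48-hour paging horizon, and all function names are assumptions for illustration, not values from the talk.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def hours_until_exhausted(error_ratio, slo_target, window_hours, budget_remaining=1.0):
    """Hours until the remaining budget is gone if the current error ratio persists."""
    rate = burn_rate(error_ratio, slo_target)
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate


def error_budget_will_exhaust(error_ratio, slo_target,
                              window_hours=30 * 24, page_within_hours=48):
    """Page when the budget would run out within the paging horizon (assumed 48 h)."""
    return hours_until_exhausted(error_ratio, slo_target, window_hours) <= page_within_hours
```

With a 99% SLO, an error ratio of 1% spends the budget exactly as fast as allowed, while 20% would consume a 30-day budget in about a day and a half.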

Slide 29

Slide 29 text

Google Page Example

Slide 30

Slide 30 text

TooManySlowConfigurationsQuery
- Paging on the 99th percentile can be too noisy and unactionable
- Page when good requests (2xx) are having high response times
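One way to express "page when good requests are slow" is to look at the fraction of successful responses that exceed a latency threshold, rather than at a single percentile. The sketch below assumes a 1-second threshold and a 5% limit purely for illustration; the function names are hypothetical.

```python
def slow_success_fraction(requests, threshold_seconds):
    """Fraction of 2xx requests slower than the threshold.

    `requests` is an iterable of (status_code, duration_seconds) pairs.
    """
    good = [duration for code, duration in requests if 200 <= code <= 299]
    if not good:
        return 0.0
    slow = sum(1 for duration in good if duration > threshold_seconds)
    return slow / len(good)


def should_page(requests, threshold_seconds=1.0, max_slow_fraction=0.05):
    """Page only when too many otherwise successful requests are slow."""
    return slow_success_fraction(requests, threshold_seconds) > max_slow_fraction
```

Failed requests are left out on purpose: they are covered by the error-rate alerts, and including them here would page twice for the same symptom.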

Slide 31

Slide 31 text

Hands on

Slide 32

Slide 32 text

Enable Prometheus

Slide 33

Slide 33 text

Enable Tracing

Slide 34

Slide 34 text

Incoming HTTP Traffic

Slide 35

Slide 35 text

Outgoing HTTP Traffic

Slide 36

Slide 36 text

HTTP Client

Slide 37

Slide 37 text

Next Steps
- SLO Workshop with SRE Team
- Choosing a good SLI
- Setting Goals for Service Reliability
- SLO based alerting

Slide 38

Slide 38 text

Conclusion
As Brian Knox, who manages the Observability Team at DigitalOcean, said:
“The goal of an Observability team is not to collect logs, metrics, or traces. It is to build a culture of engineering based on facts and feedback, and then spread that culture within the broader organization.”

Slide 39

Slide 39 text

Thank you