Slide 1

Slide 1 text

Observability: Metrics, Monitoring, Alerting
Piyush Goel, Feb 2019

Slide 2

Slide 2 text

Disclaimer: Content is not original. Check References for the source.

Slide 3

Slide 3 text

When is a developer’s job done?
1. After dev-complete?
2. After Staging push?
3. After QA sign-off?
4. After Prod Release?

Slide 4

Slide 4 text

When is a developer’s job done?
1. After dev-complete?
2. After Staging push?
3. After QA sign-off?
4. After Prod Release?
Answer: None of the above!

Slide 5

Slide 5 text

Release to Production is just the beginning!

Slide 6

Slide 6 text

Release to Production is just the beginning!
“40% to 90% of the total costs of software are incurred after launch.”
● Facts and Fallacies of Software Engineering, Glass R (2002), Addison-Wesley, p. 115
● Which Factors affect Software Projects maintenance costs more? Acta Informatica Medica

Slide 7

Slide 7 text

Systems Will Fail!

Slide 8

Slide 8 text

Systems Will Fail … Be Prepared for it!

Slide 9

Slide 9 text

The number of failure points in a distributed system increases with each new component.
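A rough way to see this is to multiply availabilities. The sketch below assumes a request needs every component on its path and that components fail independently, which real systems only approximate. The 99.9% figure and the function name are hypothetical.

```python
from math import prod

def serial_availability(component_availabilities):
    """Availability of a request path that needs every component to be up.

    Assumes independent failures; correlated failures make this optimistic
    or pessimistic depending on the system.
    """
    return prod(component_availabilities)

# Hypothetical figure: each component is up 99.9% of the time.
for component_count in (1, 5, 10, 30):
    combined = serial_availability([0.999] * component_count)
    print(f"{component_count:>2} components -> {combined:.4f}")
```

Under these assumptions, every component added to the path lowers the combined availability, even when each one looks reliable on its own.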

Slide 10

Slide 10 text

What to monitor?

Slide 11

Slide 11 text

What to monitor?
Let’s take an example: Platform APIs

Slide 12

Slide 12 text

What to monitor?
Let’s take an example: Platform APIs
● API Latency (Avg, 95th percentile, 99th percentile)
● CPU, Load Avg
● Memory
● Swap
● JVM Heap Size (via JMX)
● HTTP Status Codes (2xx, 3xx, 4xx)
● Exceptions
● External API call Latencies
● And more...

Slide 13

Slide 13 text

What to monitor?
Let’s take an example: Platform APIs
● API Latency (Avg, 95th percentile, 99th percentile)
● CPU, Load Avg
● Memory
● Swap
● JVM Heap Size (via JMX)
● HTTP Status Codes (2xx, 3xx, 4xx)
● Exceptions
● External API call Latencies
● And more...
× Servers × Clusters
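The latency bullet lists the average next to the 95th and 99th percentiles because they answer different questions. A minimal sketch using Python's standard `statistics` module; the sample data and the `latency_summary` name are made up for illustration.

```python
import statistics

def latency_summary(latencies_ms):
    """Return the average, 95th and 99th percentile of a latency sample."""
    # quantiles(n=100) returns 99 cut points: index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Hypothetical sample: 97 fast requests and 3 very slow ones.
samples = [20.0] * 97 + [2000.0] * 3
summary = latency_summary(samples)
print(summary)
```

In this sample the average describes neither the fast nor the slow requests, and only the 99th percentile exposes the slow tail, which is why several percentiles are tracked per server and per cluster.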

Slide 15

Slide 15 text

Let’s step back and understand monitoring.

Slide 16

Slide 16 text

Monitoring
● Capturing the state of the system to determine its health.
  ○ HealthChecks
    ■ Is the service running?
    ■ Can I do more work?
  ○ Metrics
    ■ System
    ■ Application
    ■ Functional
● Alerts
  ○ Anomalous behaviors - How do you define an anomaly?

Slide 17

Slide 17 text

Monitoring
● Alerts - Known Failures
  ○ Knowledge Based
  ○ Reactive (post-outages)
What about the unknown failures?

Slide 18

Slide 18 text

Observability

Slide 19

Slide 19 text

Observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

Slide 20

Slide 20 text

Observability The bygone era!!

Slide 21

Slide 21 text

Observability ● https://www.vividcortex.com/blog/monitoring-isnt-observability

Slide 22

Slide 22 text

Observability
Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.
● https://theagileadmin.com/2018/02/16/monitoring-and-observability/

Slide 23

Slide 23 text

Observability - Internal States
● Context Specific
  ○ Web Servers
    ■ Availability
    ■ Incoming Request Rate
    ■ Latency
    ■ HTTP Failures
  ○ Micro-Services
    ■ Success Rate
    ■ Functionalities
  ○ Message Queue
    ■ Queue Length
    ■ Consumer/Producer Count
Needs instrumentation, added while writing the code!
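Instrumenting while writing code can be as small as a decorator around each operation. The sketch below records success count, failure count, and latency in in-memory structures. `counters`, `timings_ms`, `instrumented`, and `create_order` are hypothetical names, and a real service would send these values to a metrics backend instead.

```python
import time
from collections import Counter, defaultdict
from functools import wraps

# In-memory stand-ins for a metrics client (illustrative only).
counters = Counter()
timings_ms = defaultdict(list)

def instrumented(name):
    """Record success/failure counts and latency for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                counters[f"{name}.failure"] += 1
                raise  # instrumentation must not swallow the error
            else:
                counters[f"{name}.success"] += 1
                return result
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                timings_ms[name].append(elapsed_ms)
        return wrapper
    return decorator

@instrumented("orders.create")
def create_order(item_count):
    if item_count <= 0:
        raise ValueError("an order needs at least one item")
    return {"items": item_count}
```

Calling `create_order(2)` increments `orders.create.success`; calling `create_order(0)` increments `orders.create.failure` and still raises the `ValueError`. Both calls append one latency sample.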

Slide 24

Slide 24 text

Observability - External Outputs ● Health Checks ● Metrics ● Logging ● Tracing

Slide 25

Slide 25 text

Observability - Health Checks
● HealthChecks
  ○ Is the service running?
  ○ Can I do more work?
● Methods
  ○ Broadcast - Gossip Protocols (Cassandra)
  ○ Register - Service Discovery
  ○ Health endpoints - ELB, HAProxy, Nginx
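The two health-check questions map naturally onto two endpoints that a load balancer can poll. A minimal sketch with Python's standard `http.server`; the paths `/health/live` and `/health/ready`, the capacity limit, and the in-flight counter are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_IN_FLIGHT = 100      # hypothetical capacity limit
in_flight_requests = 0   # a real service would update this per request

def liveness():
    """'Is the service running?' If this code runs, the process is up."""
    return 200, {"status": "up"}

def readiness(in_flight, capacity=MAX_IN_FLIGHT):
    """'Can I do more work?' Answer 503 once the service is at capacity."""
    if in_flight >= capacity:
        return 503, {"status": "busy", "in_flight": in_flight}
    return 200, {"status": "ready", "in_flight": in_flight}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/live":
            code, body = liveness()
        elif self.path == "/health/ready":
            code, body = readiness(in_flight_requests)
        else:
            code, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8080):
    """Blocks forever; call this from the service's entry point."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping the two checks separate lets a balancer stop routing to a busy instance without treating it as dead.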

Slide 26

Slide 26 text

Observability - Metrics
● External State at a broad scope (Time dimension)
  ○ System
  ○ Application
    ■ Success Rate/Failure Rate
    ■ Latency (internal/external)
    ■ Error Codes
    ■ Exceptions
  ○ Business/Functional
    ■ Order Rate (Regular, Cancel, Return)
    ■ Payments
    ■ Conversion Rates
    ■ Coupon Issuance/Redemption
    ■ Points Issued/Redeemed

Slide 27

Slide 27 text

Observability - Metrics
● Meaningful Metrics** - Generous
● Alerts - Judicious
● Low Cardinality
  ○ Keep a Watch!
  ○ Don’t emit for users/orders. We use Logs for that!
● Provide system summary
● Questions:
  ○ How many transactions failed?
  ○ How many logins succeeded?
** https://queue.acm.org/detail.cfm?id=3309571 - Must Read
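One way to keep cardinality low is to allow only labels with a small, fixed set of values and to reject per-user or per-order labels at the point of emission. The sketch below is an in-memory illustration; `MetricRegistry`, `ALLOWED_LABELS`, and the label names are hypothetical and do not belong to any particular metrics library.

```python
from collections import Counter

# Labels with a small, fixed set of values. IDs belong in logs, not metrics.
ALLOWED_LABELS = {"endpoint", "status_class"}

class MetricRegistry:
    def __init__(self):
        self.counts = Counter()

    def increment(self, name, **labels):
        unexpected = set(labels) - ALLOWED_LABELS
        if unexpected:
            raise ValueError(f"label not allowed on metrics: {sorted(unexpected)}")
        # Each distinct (name, labels) pair becomes one time series.
        key = (name, tuple(sorted(labels.items())))
        self.counts[key] += 1

    def series_count(self):
        """Number of distinct time series; the figure to keep a watch on."""
        return len(self.counts)

    def total(self, name, **labels):
        """Sum a metric across all series that match the given labels."""
        wanted = set(labels.items())
        return sum(
            value
            for (metric, series_labels), value in self.counts.items()
            if metric == name and wanted <= set(series_labels)
        )
```

With this shape, "How many transactions failed?" becomes `registry.total("transactions", status_class="5xx")`, while an attempt to emit `user_id=...` fails fast instead of creating one series per user.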

Slide 28

Slide 28 text

Observability - Metrics
● Tools
  ○ Graphite
  ○ InfluxDB
  ○ Prometheus
  ○ OpenTSDB
  ○ Scuba (Facebook)
  ○ Apache Druid

Slide 29

Slide 29 text

Observability - Logging
● Understanding at a smaller scope
  ○ Request, customer, transaction
● Ask Questions:
  ○ Why couldn’t the customer place an order?
  ○ Why did the transaction fail?
● Centralised - ElasticSearch, Splunk
● Searchable - Indexed
● Correlatable - Common Key (Request Id)
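Searchable, correlatable logs usually mean structured records that all carry the same request ID. A minimal sketch with Python's standard `logging` module; the field names, the `orders` logger name, and the `place_order` flow are assumptions for illustration.

```python
import io
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a central store can index the fields."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "order_id": getattr(record, "order_id", None),
        })

def build_logger(stream):
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def place_order(logger, order_id, request_id=None):
    # The common key: every log line for this request carries the same ID.
    request_id = request_id or str(uuid.uuid4())
    context = {"request_id": request_id, "order_id": order_id}
    logger.info("order received", extra=context)
    logger.error("payment declined", extra=context)
    return request_id

# Usage: write to an in-memory stream, then filter by request ID.
stream = io.StringIO()
logger = build_logger(stream)
place_order(logger, order_id="o-1", request_id="req-123")
records = [json.loads(line) for line in stream.getvalue().splitlines()]
failed = [r for r in records if r["request_id"] == "req-123" and r["level"] == "ERROR"]
```

Filtering on the request ID answers "Why couldn’t the customer place an order?" for one specific request, which is the high-cardinality detail that metrics deliberately leave out.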

Slide 30

Slide 30 text

Observability - Logging

Slide 31

Slide 31 text

Observability - Logging (Anatomy)

Slide 32

Slide 32 text

Observability - Logging
● Tools
  ○ Splunk
  ○ ELK
  ○ SumoLogic
  ○ RLA

Slide 33

Slide 33 text

Observability - Tracing
● Dissect a request into sub-paths (spans).
● Profile system usage at a span level.
● Extract Insights
Tools:
● Google Dapper (https://ai.google/research/pubs/pub36356)
● Twitter Zipkin (https://zipkin.io/)
● Jaeger (https://www.jaegertracing.io/)
● New Relic
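The idea of dissecting a request into spans can be sketched without any tracing library. The toy `Tracer` below is hypothetical and single-threaded, and it only records timings in memory; real tracers such as the tools listed above also propagate the trace ID across service boundaries.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer for one request handled on one thread."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.finished = []   # completed spans, in the order they finished
        self._stack = []     # IDs of spans that are currently open

    @contextmanager
    def span(self, name):
        span_id = uuid.uuid4().hex[:8]
        parent_id = self._stack[-1] if self._stack else None
        self._stack.append(span_id)
        start = time.perf_counter()
        try:
            yield span_id
        finally:
            self._stack.pop()
            self.finished.append({
                "trace_id": self.trace_id,
                "span_id": span_id,
                "parent_id": parent_id,
                "name": name,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

def handle_checkout(tracer):
    # One request, dissected into a root span and two child spans.
    with tracer.span("checkout"):
        with tracer.span("cart.load"):
            time.sleep(0.01)   # stand-in for real work
        with tracer.span("payment.charge"):
            time.sleep(0.02)   # stand-in for an external call
```

After `handle_checkout(tracer)`, `tracer.finished` holds one record per span, and the `parent_id` links are enough to rebuild the call tree and see which sub-path took the time.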

Slide 34

Slide 34 text

Observability Spectrum

Slide 35

Slide 35 text

We have Metrics. What to alert on?

Slide 36

Slide 36 text

Service Level Objectives (SLO)

Slide 37

Slide 37 text

Service Level Objectives (SLO)
* https://landing.google.com/sre/sre-book/chapters/service-level-objectives/
* https://www.youtube.com/watch?v=tEylFyxbDLE
● Defines a Quantifiable Goal for a service.
● Measure the goal - Represents the User Experience/Delight Factor.
● First step before writing a new service. Work backwards.
● Have as few SLOs as possible.
  ○ Represents the system behaviour.
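A quantifiable goal turns alerting into arithmetic: given a target success ratio, the number of failures the service can afford in a window follows directly. The sketch below borrows the error-budget idea from SRE practice; the 99.9% target, the request counts, and the function name are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compare observed failures with what the SLO allows for a window.

    slo_target is the required success ratio, for example 0.999.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "availability": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": failed_requests <= allowed_failures,
    }

# Hypothetical window: 1,000,000 requests, 400 of them failed, target 99.9%.
report = error_budget(0.999, 1_000_000, 400)
print(report)
```

An alert can then fire on the remaining budget shrinking quickly rather than on every individual metric, which fits the advice to keep the number of SLOs small.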

Slide 38

Slide 38 text

Service Level Objectives (SLO) - Exercise
● Cart Service
● Payments Service
● Card Generation
● Order Management Service
● Communication Engine

Slide 39

Slide 39 text

Happy Observing!

Slide 40

Slide 40 text

References
● Debugging Production Systems: https://www.youtube.com/watch?v=YlrAakN90D0
● Pierre Vincent - How to build observable Distributed systems? https://www.youtube.com/watch?v=ACL_YVPD3gw
● Charity Majors - Observability for Emerging Infra: What Got You Here Won't Get You There https://www.youtube.com/watch?v=1wjovFSCGhE
● Caitie McCaffrey - Of the Order of Billions: Building Observability at Twitter https://www.youtube.com/watch?v=SC6XuD1tgcQ
● https://eng.uber.com/observability-at-scale/