Monitoring Microservices with Minimal Effort (GDG DevFest Ankara 2018)

Slide 1

Slide 1 text

Monitoring microservices on Kubernetes with minimal effort
Ahmet Alp Balkan (@ahmetb)
November 17, 2018

Slide 2

Slide 2 text

About me
● Software Engineer at Google Cloud
● Worked at Microsoft Azure (2012-2016) on porting Docker to Windows & Linux stuff
● Kubernetes/GKE, Knative developer experience
● Twitter/GitHub: @ahmetb

Slide 3

Slide 3 text

Why monitor applications?

Slide 4

Slide 4 text

You don't know if it works if you're not monitoring.

Slide 5

Slide 5 text

Service availability contracts
● SLO (service-level objective): an agreed way to measure the performance of a service, between two parties (often internal)
● SLA (service-level agreement): an SLO backed by a legal contract
● Error budget: how much can you afford to violate the SLA/SLO?
Examples (consumer → provider):
● Team A → Team B's service
● Ucuzabilet → Turkish Airlines
● You → Google Cloud Storage API

Slide 6

Slide 6 text

Error Budgets
● Team X owns a critical service at Google, used by other teams.

Slide 7

Slide 7 text

Error Budgets
● Team X owns a critical service at Google, used by other teams.
● ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month).

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Error Budgets
● Team X owns a critical service at Google, used by other teams.
● ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month).
● ServiceA went down for 30 minutes today.
● Team X can't ship new features to ServiceA for 6 months, until it has accrued more error budget (otherwise it risks violating its SLO/SLA).
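The arithmetic behind this example can be sketched in a few lines of Python. The budget (60 minutes/year) and the outage (30 minutes) come from the slide; the function name and the assumption that the budget accrues evenly each month are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def months_to_recover(budget_min_per_year: float, outage_min: float) -> float:
    """Months of accrued budget an outage consumes (assumes even monthly accrual)."""
    monthly_budget = budget_min_per_year / 12
    return outage_min / monthly_budget


budget = 60.0  # minutes of SLO violation allowed per year (from the slide)
outage = 30.0  # minutes ServiceA was down today (from the slide)

# Availability target implied by the yearly budget.
availability_target = 1 - budget / MINUTES_PER_YEAR

print(f"implied availability target: {availability_target:.4%}")
print(f"months of budget consumed: {months_to_recover(budget, outage):.0f}")
```

A 30-minute outage against a budget of roughly 5 minutes/month uses up six months of budget, which is where the 6-month feature freeze comes from.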

Slide 10

Slide 10 text

What to monitor?

Slide 11

Slide 11 text

Asking the right questions
Is my website working?

Slide 12

Slide 12 text

Asking the right questions
Is my home page working?

Slide 13

Slide 13 text

Asking the right questions
Is my home page responding successfully?

Slide 14

Slide 14 text

Asking the right questions
Is my home page responding successfully to 99.5% of requests?

Slide 15

Slide 15 text

Asking the right questions
Is my home page responding successfully to 99.5% of requests within 100 milliseconds?

Slide 16

Slide 16 text

Asking the right questions
Is my home page responding successfully to 99.5% of requests within 100 milliseconds from the servers in us-east1?
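The final form of the question is precise enough to be checked mechanically. The sketch below is one possible way to do that over a batch of request records. The `Request` record, the `meets_slo` function, and the definition of "successful" as an HTTP status below 400 are assumptions, not something the talk specifies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    path: str
    status: int
    latency_ms: float
    region: str


def meets_slo(requests: List[Request], path: str = "/", region: str = "us-east1",
              max_latency_ms: float = 100.0, target: float = 0.995) -> Optional[bool]:
    """True if at least `target` of the matching requests succeeded within the bound."""
    relevant = [r for r in requests if r.path == path and r.region == region]
    if not relevant:
        return None  # no traffic, so the question cannot be answered
    # Assumption: "successful" means status < 400 and latency within the bound.
    good = sum(1 for r in relevant
               if r.status < 400 and r.latency_ms <= max_latency_ms)
    return good / len(relevant) >= target


def sample(failures: int, total: int = 1000) -> List[Request]:
    """Hypothetical traffic: `failures` HTTP 500 responses out of `total` requests."""
    ok = [Request("/", 200, 40.0, "us-east1")] * (total - failures)
    bad = [Request("/", 500, 40.0, "us-east1")] * failures
    return ok + bad


print(meets_slo(sample(4)))  # 99.6% good -> True
print(meets_slo(sample(6)))  # 99.4% good -> False
```

Each refinement on the preceding slides maps to one parameter here: the page (`path`), the success ratio (`target`), the latency bound (`max_latency_ms`), and the serving location (`region`).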

Slide 17

Slide 17 text

Ways to monitor services

Slide 18

Slide 18 text

Microservices

Slide 19

Slide 19 text

At Google, everything is a service
● Google Cloud Storage frontend → service
● Google Fonts API → service
● Google search index backend → internal service
● Human resources database → internal service
● Cafeteria menus → internal service
~O(10^10) requests per second in Google's private network.
Mostly gRPC/Protobuf-style networking (not HTTP REST/JSON APIs)

Slide 20

Slide 20 text

                                microservices   monoliths
● Develop independently         YES             NO
● Scale independently           YES             NO
● Fail independently            YES             NO
● Number of things to monitor   MANY            ONE

Slide 21

Slide 21 text

At Google, we don't write the BEST CODE, but we have world-class systems OBSERVABILITY.

Slide 22

Slide 22 text

Demo

Slide 23

Slide 23 text

Pick your own adventure Open source

Slide 24

Slide 24 text

Demo: Hipster Shop github.com/GoogleCloudPlatform/microservices-demo

Slide 25

Slide 25 text

Metrics

Slide 26

Slide 26 text

Time-series Metrics
Measurement of a value over time:
● Gauges: current value of an indicator (example: current memory usage in MB)
● Counters: only-increasing values (example: request count)
Examples:
● 99th percentile latency of POST requests to /login over the past 5 minutes
● success rate of GET requests over the past day
● average memory usage in the past 30 minutes
● "number of orders completed" in the past hour
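To make the two metric types concrete, here is a minimal sketch in Python. The `Counter` and `Gauge` classes and the `increase` helper are hypothetical illustrations, not the API of Prometheus, Stackdriver, or any other library. The helper shows how a question like "number of orders completed in the past hour" can be answered from timestamped samples of a counter.

```python
class Counter:
    """Only-increasing value, e.g. the number of requests served."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Current value of an indicator, e.g. memory usage in MB."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


def increase(samples):
    """Growth of a counter between the first and last (timestamp, value) sample."""
    (_, first_value), (_, last_value) = samples[0], samples[-1]
    return last_value - first_value


orders_completed = Counter()
memory_mb = Gauge()

# Hypothetical samples taken 30 minutes (1800 seconds) apart.
samples = []
for timestamp, new_orders, mem in [(0, 100, 512.0), (1800, 60, 640.0), (3600, 60, 580.0)]:
    orders_completed.inc(new_orders)
    memory_mb.set(mem)
    samples.append((timestamp, orders_completed.value))

print("orders completed in the past hour:", increase(samples))
print("current memory usage (MB):", memory_mb.value)
```

The counter only ever goes up, so the interesting number is its growth over a window. The gauge simply reports the latest reading.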

Slide 27

Slide 27 text

Time-series Metrics
"orders_completed" counter example, as seen by the metrics collector:
● server 1, GET /_metrics: orders_completed[server=1] 3
● server 2, GET /_metrics: orders_completed[server=2] 12
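The collector's job in this picture can be sketched as follows. This is a simplified model that assumes the collector pulls from each server. `scrape` is a hypothetical stand-in for an HTTP `GET /_metrics` call and uses in-memory dictionaries instead of the network.

```python
def scrape(server_id, counters):
    """Hypothetical stand-in for GET /_metrics on one server.

    Returns a list of (name, labels, value) samples.
    """
    return [(name, {"server": server_id}, value) for name, value in counters.items()]


def collect(targets):
    """Scrape every target and sum each metric across servers."""
    totals = {}
    for server_id, counters in targets.items():
        for name, _labels, value in scrape(server_id, counters):
            totals[name] = totals.get(name, 0) + value
    return totals


# Values taken from the slide: server 1 reports 3, server 2 reports 12.
targets = {
    "1": {"orders_completed": 3},
    "2": {"orders_completed": 12},
}

print(collect(targets))  # {'orders_completed': 15}
```

Because each sample keeps its `server` label until aggregation, the same data can answer both "how many orders in total?" and "how many orders on server 2?".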

Slide 28

Slide 28 text

Anatomy of a metrics page
● name
● labels
● value
orders_created[server=A, region=us-central1, version=14] 563
http_requests[status=200, method=GET, path='/', server=A, region=us-central1, version=14] 156183
http_requests[status=200, method=GET, path='/login', server=A, region=us-central1, version=14] 560
http_requests[status=500, method=GET, path='/login', server=A, region=us-central1, version=14] 2
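As a rough illustration of how name, labels, and value fit together, the sketch below parses lines in the notation used on this slide and derives a success rate for `/login`. The `name[labels] value` notation is the slide's own shorthand rather than a real exposition format, and `parse_line` is a hypothetical helper.

```python
import re

# Matches the slide's shorthand: name[key=value, key=value, ...] value
LINE_RE = re.compile(r"^(?P<name>\w+)\[(?P<labels>[^\]]*)\]\s+(?P<value>\d+)$")


def parse_line(line):
    """Split one metric line into (name, labels dict, integer value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric line: {line!r}")
    labels = {}
    for pair in match.group("labels").split(","):
        if not pair.strip():
            continue  # tolerate an empty label list
        key, _, value = pair.strip().partition("=")
        labels[key] = value.strip("'")
    return match.group("name"), labels, int(match.group("value"))


page = """\
http_requests[status=200, method=GET, path='/login', server=A, region=us-central1, version=14] 560
http_requests[status=500, method=GET, path='/login', server=A, region=us-central1, version=14] 2
"""

samples = [parse_line(line) for line in page.splitlines()]
login = [(labels, value) for _name, labels, value in samples if labels["path"] == "/login"]
ok = sum(value for labels, value in login if labels["status"] == "200")
total = sum(value for _labels, value in login)

print(f"/login success rate: {ok / total:.2%}")  # 560 of 562 -> 99.64%
```

Labels are what make this possible: the same `http_requests` metric can be sliced by status, path, server, region, or version without defining a new metric for each combination.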

Slide 29

Slide 29 text

Benefits of metrics
Measure if you're meeting your SLOs
Create alerts
Answer difficult questions (even in a Google datacenter):
● What's the average uptime of the machines in the top 10%?
● How many packet drops happened in the past few minutes, and from which machines?

Slide 30

Slide 30 text

Metrics collection
● Prometheus → Grafana (UI) → Prometheus alerting → PagerDuty, …
● Google Stackdriver → Stackdriver alerting → Stackdriver console (UI)

Slide 31

Slide 31 text

Want easy request metrics? Istio gives you request/response metrics without changing any application code.

Slide 32

Slide 32 text

Tracing

Slide 33

Slide 33 text

Tracing
Which services does a request travel through, and for how long?
Tracing allows you to understand call patterns between your services and find bottlenecks.
● Exercise: how do you optimize Facebook's home page load time?

Slide 34

Slide 34 text

How Tracing works
You need to update your code:
● (→) incoming requests: get the trace ID from the request
● outgoing (→) requests: pass the trace ID along with the request
Example:
● Client → Service A: GET http://A/foo with Trace-Header: 123
● Service A → Service B: GET http://B/bar with Trace-Header: 123
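The two code changes can be sketched as a pair of small helpers. The header name `Trace-Header` comes from the slide. The helper names, the choice to start a new trace when no ID arrives, and the injected `call_service_b` function are assumptions for illustration. Real tracing libraries such as OpenCensus handle this propagation for you and use their own header formats.

```python
import uuid

TRACE_HEADER = "Trace-Header"  # header name as shown on the slide


def trace_id_from(incoming_headers):
    """Incoming request: reuse the caller's trace ID, or start a new trace."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id, extra=None):
    """Outgoing request: pass the same trace ID downstream."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def handle_foo(incoming_headers, call_service_b):
    """Service A's handler for GET /foo, which in turn calls GET http://B/bar.

    `call_service_b` is a hypothetical function (url, headers) -> response,
    injected so the example runs without a network.
    """
    trace_id = trace_id_from(incoming_headers)
    return call_service_b("http://B/bar", outgoing_headers(trace_id))


def fake_service_b(url, headers):
    """Stand-in for Service B that echoes the trace ID it received."""
    return f"{url} saw trace {headers[TRACE_HEADER]}"


print(handle_foo({"Trace-Header": "123"}, fake_service_b))
# http://B/bar saw trace 123
```

Because every hop forwards the same ID, a tracing backend can later stitch the individual requests into one end-to-end trace.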

Slide 35

Slide 35 text

Tracing solutions
OpenCensus: write once, export anywhere
● Google Stackdriver
● Azure AppInsights
● [...]

Slide 36

Slide 36 text

Example request trace
frontend.GET./home    120ms
ServiceA.Calculate    71ms
ServiceF.ComputeX     28ms
ServiceD.GetX         32ms
ServiceF.ComputeX     28ms
ServiceF.ComputeY     38ms
ServiceH.GetZ         24ms
ServiceF.ComputeY     20ms

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Profiling

Slide 39

Slide 39 text

Profiling
In which functions/methods does my process spend its time?
● Helps you identify "slow paths" in your "fast paths"
● You need to enable profiling in your application
Want easy process profiling?
● Try Stackdriver Profiling.
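For a feel of what "enabling profiling in your application" looks like, here is a small sketch using `cProfile` and `pstats` from Python's standard library. This is not Stackdriver Profiling, which is a hosted service with its own agents. The three functions being profiled are hypothetical stand-ins for application code.

```python
import cProfile
import io
import pstats


def build_with_concat(n):
    """Hypothetical workload: builds a string by repeated concatenation."""
    text = ""
    for i in range(n):
        text += str(i)
    return len(text)


def build_with_join(n):
    """Hypothetical workload: builds the same string with str.join."""
    return len("".join(str(i) for i in range(n)))


def handle_request():
    """Hypothetical request handler that calls both workloads."""
    return build_with_concat(20000) + build_with_join(20000)


# Enable profiling only around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Report the five entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists call counts and time per function, which is the function-level view that metrics and traces do not provide.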

Slide 40

Slide 40 text

Example CPU profile

Slide 41

Slide 41 text

Service Topology

Slide 42

Slide 42 text

Service Topology Graph
Which services call which other services?
● Identify dependencies between your services.
● Answer hard questions about service communication easily.

Slide 43

Slide 43 text

Service Topology Graph
Which services call which other services?
● Identify dependencies between your services.
● Answer hard questions about service communication easily.
Examples:
● Who is making requests to ServiceA?
● How many requests per second (RPS) go from ServiceA ⇒ ServiceB?
● How does the latency of A⇒B compare to C⇒B?
● What % of A⇒B requests go to B in us-west, and what % to us-east?
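Questions like these reduce to simple aggregations once every call is recorded with its caller, callee, and destination region. The sketch below uses made-up call records and hypothetical helper names. In practice a service mesh such as Istio gathers this data for you.

```python
from collections import Counter

# Hypothetical call records: (caller, callee, callee_region)
calls = [
    ("ServiceA", "ServiceB", "us-west"),
    ("ServiceA", "ServiceB", "us-east"),
    ("ServiceA", "ServiceB", "us-west"),
    ("ServiceC", "ServiceB", "us-east"),
    ("frontend", "ServiceA", "us-east"),
]


def callers_of(service, records):
    """Who is making requests to `service`?"""
    return sorted({caller for caller, callee, _ in records if callee == service})


def edge_counts(records):
    """Number of requests per caller ⇒ callee edge of the topology graph."""
    return Counter((caller, callee) for caller, callee, _ in records)


def region_split(caller, callee, records):
    """Fraction of caller ⇒ callee requests served in each region."""
    regions = Counter(region for c, d, region in records if c == caller and d == callee)
    total = sum(regions.values())
    return {region: count / total for region, count in regions.items()}


print(callers_of("ServiceB", calls))
print(edge_counts(calls)[("ServiceA", "ServiceB")])
print(region_split("ServiceA", "ServiceB", calls))
```

Dividing the edge counts by the length of the observation window gives the requests-per-second figure for each edge.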

Slide 44

Slide 44 text

Want easy Service Topology? Istio gives you a service topology graph without changing any application code.

Slide 45

Slide 45 text

Summary
● Read the SRE Book (free) by Google to learn about SLOs/SLAs/error budgets
● If you're using Kubernetes, use Istio (...to get metrics without changing code)
● Types of monitoring
● Metrics
  ○ Good for monitoring SLOs/SLAs (+alerting), or app health
  ○ Try Prometheus or Stackdriver Metrics
● Tracing
  ○ ...which service in the call graph takes how much time
  ○ Try OpenCensus + Stackdriver Trace
● Profiling
  ○ Function-level performance diagnostics in a process

Slide 46

Slide 46 text

Thanks
● Play with github.com/GoogleCloudPlatform/microservices-demo
● Say hello on Twitter: @ahmetb
● Google Cloud Startup Program ($3000 credits)
  ○ Special offer for DevFest: http://goo.gl/XyeCQ
  ○ g.co/cloudstartups