Monitoring Microservices with Minimal Effort (GDG DevFest Ankara 2018)

November 17, 2018 Monitoring microservices on Kubernetes with minimal effort
Ahmet Alp Balkan (@ahmetb)

About me
• Software Engineer at Google Cloud
• Worked at Microsoft Azure (2012-2016) on porting Docker to Windows & Linux stuff.
• Kubernetes/GKE, Knative developer experience
• Twitter/GitHub: @ahmetb

Why monitor applications?

You don't know if it works if you're not monitoring.

Service availability contracts
• SLO (service-level objective): an agreed way to measure the performance of a service, between two parties (often internal)
• SLA (service-level agreement): an SLO backed by a legal contract.
• Error budget: how much can you afford to violate the SLA/SLO?
Examples (consumer and provider): Team A and Team B's service; Ucuzabilet and Turkish Airlines; You and Google Cloud Storage API

Error Budgets
• TeamX owns a critical service (ServiceA) at Google, used by other teams.
• ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month).
• ServiceA went down for 30 minutes today.
• TeamX can't ship new features to ServiceA for 6 months, until they have more error budget (or they will risk violating their SLO/SLA).
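
The 6-month figure follows directly from the numbers on the slide. A minimal Python sketch of that arithmetic, assuming the budget accrues evenly at 5 minutes per month (the function names are hypothetical, chosen for illustration):

```python
import math

# Numbers taken from the slide; even monthly accrual is an assumption.
BUDGET_MINUTES_PER_YEAR = 60
BUDGET_MINUTES_PER_MONTH = BUDGET_MINUTES_PER_YEAR / 12  # 5 minutes/month


def months_to_recover(outage_minutes: float) -> int:
    """Months of budget accrual needed to pay back an outage."""
    return math.ceil(outage_minutes / BUDGET_MINUTES_PER_MONTH)


def availability_target(budget_minutes_per_year: float) -> float:
    """Fraction of a 365-day year during which the SLO must be met."""
    minutes_per_year = 365 * 24 * 60
    return 1 - budget_minutes_per_year / minutes_per_year


if __name__ == "__main__":
    print(months_to_recover(30))  # 6
    print(f"{availability_target(BUDGET_MINUTES_PER_YEAR):.4%}")
```

A 30-minute outage spends six months' worth of a 5-minutes-per-month budget, which is why feature launches pause for that long.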

What to monitor?

Asking the right questions
• Is my website working?
• Is my home page working?
• Is my home page responding successfully?
• Is my home page responding successfully to 99.5% of the requests?
• Is my home page responding successfully to 99.5% of the requests within 100 milliseconds?
• Is my home page responding successfully to 99.5% of the requests within 100 milliseconds from the servers in us-east1?
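
The final, most specific question can be evaluated mechanically. A minimal sketch, assuming a hypothetical `Request` record and treating any status below 500 as "successful" (the slides do not define success, so that threshold is an assumption):

```python
from dataclasses import dataclass


@dataclass
class Request:
    path: str
    region: str
    status: int
    latency_ms: float


def sli(requests, path="/", region="us-east1", max_latency_ms=100.0):
    """Fraction of matching requests that succeeded within the latency bound."""
    matching = [r for r in requests if r.path == path and r.region == region]
    if not matching:
        return None  # no traffic, nothing to measure
    good = [
        r for r in matching
        # Assumption: "successful" means no server error (status < 500).
        if r.status < 500 and r.latency_ms <= max_latency_ms
    ]
    return len(good) / len(matching)


def meets_slo(requests, target=0.995):
    """True when the measured SLI reaches the 99.5% target."""
    value = sli(requests)
    return value is not None and value >= target
```

Each refinement of the question on the slide maps to one filter here: the path, the success condition, the latency bound, and the serving region.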

Ways to monitor services

Microservices

At Google, everything is a service
• Google Cloud Storage frontend → service
• Google Fonts API → service
• Google search index backend → internal service
• Human resources database → internal service
• Cafeteria menus → internal service
~O(10^10) requests per second in Google's private network.
Mostly gRPC/Protobuf-style networking (not HTTP REST/JSON APIs)

                                 microservices   monoliths
• Develop independently          YES             NO
• Scale independently            YES             NO
• Fail independently             YES             NO
• Number of things to monitor    MANY            ONE

At Google, we don't write the BEST CODE, but we have world-class systems OBSERVABILITY.

Pick your own adventure Open source

Demo: Hipster Shop github.com/GoogleCloudPlatform/microservices-demo

Metrics

Time-series Metrics
Measurement of a value over time:
• Gauges: current value of an indicator (example: current memory usage in MB)
• Counters: only-increasing values (example: request count)
Examples:
• 99th percentile latency of POST requests to /login over the past 5 minutes
• success rate of GET requests over the past day
• average memory usage in the past 30 minutes
• "number of orders completed" in the past hour
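
The difference between the two metric types, and how a question like "orders completed in the past hour" falls out of a counter, can be sketched in a few lines of Python. The class and function names below are hypothetical and not taken from any particular metrics library; `increase` assumes samples are sorted by time and that the counter never resets.

```python
class Counter:
    """Only-increasing value, for example a request count."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Current value of an indicator, for example memory usage in MB."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


def increase(samples, since):
    """Growth of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    since:   start of the window, as a timestamp in seconds.
    """
    in_window = [value for timestamp, value in samples if timestamp >= since]
    if len(in_window) < 2:
        return 0
    return in_window[-1] - in_window[0]
```

Because a counter only ever goes up, subtracting the first sample in the window from the last gives the number of events in that window.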

Time-series Metrics
"orders_completed" counter example:
server 1: GET /_metrics → orders_completed[server=1] 3
server 2: GET /_metrics → orders_completed[server=2] 12
(both pages are read by the metrics collector)
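
A minimal sketch of what the collector in this picture does, assuming each server is represented by a hypothetical zero-argument callable instead of a real HTTP GET to /_metrics:

```python
import time


class MetricsCollector:
    """Polls each server's metrics source and keeps timestamped samples."""

    def __init__(self, targets):
        # targets maps a server label to a zero-argument callable returning
        # that server's current orders_completed value. In a real setup the
        # callable would perform GET /_metrics over HTTP (assumption).
        self.targets = targets
        self.samples = []  # (timestamp, server, value)

    def scrape(self, now=None):
        """Read every target once and record the values."""
        timestamp = time.time() if now is None else now
        for server, read_value in self.targets.items():
            self.samples.append((timestamp, server, read_value()))

    def latest_total(self):
        """Sum of the most recent value seen from every server."""
        latest = {}
        for _, server, value in self.samples:
            latest[server] = value  # later samples overwrite earlier ones
        return sum(latest.values())


if __name__ == "__main__":
    collector = MetricsCollector({"1": lambda: 3, "2": lambda: 12})
    collector.scrape()
    print(collector.latest_total())  # 15
```

Keeping the `server` label on every sample is what lets the collector answer both "how many orders in total" and "how many on server 2".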

Anatomy of a metrics page
• name
• labels
• value
orders_created[server=A, region=us-central1, version=14] 563
http_requests[status=200, method=GET, path='/', server=A, region=us-central1, version=14] 156183
http_requests[status=200, method=GET, path='/login', server=A, region=us-central1, version=14] 560
http_requests[status=500, method=GET, path='/login', server=A, region=us-central1, version=14] 2
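
To make the name/labels/value structure concrete, here is a small Python sketch that parses lines in the notation used on this slide and derives an error rate from them. The bracket notation is the slide's own illustration, not a real exposition format, and `parse_line` and `error_rate` are hypothetical helpers.

```python
import re

LINE_RE = re.compile(
    r"^(?P<name>\w+)\[(?P<labels>[^\]]*)\]\s+(?P<value>\d+(?:\.\d+)?)$"
)


def parse_line(line):
    """Split one metrics line into (name, labels dict, value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric line: {line!r}")
    labels = {}
    for pair in match.group("labels").split(","):
        if not pair.strip():
            continue
        key, _, raw = pair.partition("=")
        labels[key.strip()] = raw.strip().strip("'\"")
    return match.group("name"), labels, float(match.group("value"))


def error_rate(lines, path):
    """Share of http_requests for a path whose status is 5xx."""
    total = errors = 0.0
    for line in lines:
        name, labels, value = parse_line(line)
        if name != "http_requests" or labels.get("path") != path:
            continue
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


PAGE = [
    "orders_created[server=A, region=us-central1, version=14] 563",
    "http_requests[status=200, method=GET, path='/', server=A, region=us-central1, version=14] 156183",
    "http_requests[status=200, method=GET, path='/login', server=A, region=us-central1, version=14] 560",
    "http_requests[status=500, method=GET, path='/login', server=A, region=us-central1, version=14] 2",
]
```

With the values on the slide, `error_rate(PAGE, "/login")` is 2 out of 562 requests, which is the kind of number an SLO alert would be built on.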

Benefits of metrics
• Measure if you're meeting your SLOs
• Create alerts
• Answer difficult questions (even in a Google datacenter):
  ◦ what's the average uptime of the machines that are in the top 10%
  ◦ how many packet drops happened in the past minutes, from which machines

Metrics collection
• Prometheus → Grafana (UI) → Prometheus alerting → PagerDuty, …
• Google Stackdriver → Stackdriver alerting → Stackdriver console (UI)

Want easy request metrics? Istio gives you request/response metrics without
changing any application code.

Tracing

Tracing
Which services does a request travel through, and for how long?
Tracing allows you to understand call patterns between your services, and find bottlenecks.
• Exercise: how do you optimize Facebook home page load time?

How Tracing works
You need to update your code:
• (→) incoming requests: get trace ID from request
• outgoing (→) requests: pass trace ID to the request
Example: Service A receives GET http://A/foo with Trace-Header: 123, then calls Service B with GET http://B/bar and Trace-Header: 123.
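
The two code changes can be sketched in Python as follows. The header name `Trace-Header` is the one written on the slide; the helper names, the `call_service_b` callable, and the choice to start a new trace when no ID arrives are assumptions made for illustration.

```python
import uuid

TRACE_HEADER = "Trace-Header"  # header name as written on the slide


def incoming_trace_id(request_headers):
    """Get the trace ID from an incoming request.

    Assumption: when the caller sent no trace ID, start a new trace.
    """
    return request_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id, extra=None):
    """Build headers for an outgoing request that carry the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def handle_foo(request_headers, call_service_b):
    """Service A's handler for GET /foo.

    call_service_b is a hypothetical callable (url, headers) standing in
    for a real HTTP client.
    """
    trace_id = incoming_trace_id(request_headers)
    return call_service_b("http://B/bar", outgoing_headers(trace_id))
```

Passing the HTTP client in as a callable keeps the sketch free of network code; in a real service the same two helpers would wrap the server's request handler and its outgoing HTTP or gRPC client.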

Tracing solutions
• OpenCensus (write once, export anywhere)
• Google Stackdriver
• Azure AppInsights
• [...]

Example request trace
frontend.GET./home 120ms
ServiceA.Calculate 71ms
ServiceF.ComputeX 28ms
ServiceD.GetX 32ms
ServiceF.ComputeX 28ms
ServiceF.ComputeY 38ms
ServiceH.GetZ 24ms
ServiceF.ComputeY 20ms

Profiling

Profiling
In which functions/methods of my process is time spent?
• Helps you identify "slow paths" in your "fast paths"
• You need to enable profiling in your application
Want easy process profiling?
• Try Stackdriver Profiling.
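
As an illustration of what "enabling profiling" looks like inside a process, here is a small sketch using Python's standard-library `cProfile` module. This is a local, in-process stand-in and not Stackdriver Profiling; the workload functions are hypothetical.

```python
import cProfile
import io
import pstats


def slow_path(n):
    """Deliberately expensive work."""
    return sum(i * i for i in range(n))


def fast_path(n):
    """Cheap work."""
    return n * n


def handle_request():
    fast_path(1000)
    return slow_path(200_000)


def profile(func):
    """Run func under the profiler and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()


if __name__ == "__main__":
    print(profile(handle_request))
```

Sorting by cumulative time puts `handle_request` and `slow_path` near the top of the report, which is how a "slow path" hiding inside a request handler becomes visible.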

Example CPU profile

Service Topology

Service Topology Graph
Which services call which other services?
• Identify dependencies between your services.
• Answer hard questions about service communication easily.
Examples:
• Who is making requests to ServiceA?
• How many requests-per-second (RPS) for ServiceA ⇒ ServiceB?
• How does the latency of A⇒B compare to C⇒B?
• What % of A⇒B requests go to B in us-west, what % to us-east?
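
Several of these questions reduce to simple aggregations once every call between services is recorded. A minimal Python sketch, assuming a hypothetical record format of `(caller, callee, callee_region)` per observed request:

```python
from collections import Counter


def callers_of(calls, service):
    """Who is making requests to the given service?"""
    return sorted({caller for caller, callee, _ in calls if callee == service})


def requests_per_second(calls, caller, callee, window_seconds):
    """RPS for caller ⇒ callee over an observation window."""
    count = sum(1 for c, d, _ in calls if c == caller and d == callee)
    return count / window_seconds


def region_split(calls, caller, callee):
    """Percentage of caller ⇒ callee requests served in each region."""
    regions = Counter(r for c, d, r in calls if c == caller and d == callee)
    total = sum(regions.values())
    if total == 0:
        return {}
    return {region: 100.0 * n / total for region, n in regions.items()}


# Hypothetical sample of calls observed during a 2-second window.
CALLS = (
    [("ServiceA", "ServiceB", "us-west")] * 3
    + [("ServiceA", "ServiceB", "us-east")] * 1
    + [("ServiceC", "ServiceB", "us-east")] * 2
    + [("frontend", "ServiceA", "us-east")] * 4
)
```

The records themselves are the hard part to collect by hand in every service, which is the gap a service mesh fills.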

Want easy Service Topology? Istio gives you a service topology
graph without changing any application code.

Summary
• Read the SRE Book (free) by Google to learn about SLOs/SLAs/error budgets
• If you're using Kubernetes, use Istio (...to get metrics without changing code)
• Types of monitoring
• Metrics
  ◦ Good for monitoring SLOs/SLAs (+alerting), or app health
  ◦ Try Prometheus or Stackdriver Metrics
• Tracing
  ◦ ...which service in the call graph takes how much time
  ◦ Try OpenCensus + Stackdriver Trace
• Profiling
  ◦ Function-level performance diagnostics in a process

Thanks
• Play with github.com/GoogleCloudPlatform/microservices-demo
• Say hello on Twitter: @ahmetb
• Google Cloud Startup Program ($3000 credits)
  ◦ Special offer for DevFest: http://goo.gl/XyeCQ
  ◦ g.co/cloudstartups
