
Monitoring Microservices with Minimal Effort (GDG DevFest Ankara 2018)

I gave this talk at GDG DevFest Ankara 2018. It covers how we use Istio and OpenCensus to automatically extract request metrics and traces from applications, and how to further extend a simple microservices application to export profiling data and metrics to Google Stackdriver APM.

Find more at http://cloud.google.com/stackdriver and https://opencensus.io/.

Ahmet Alp Balkan

November 17, 2018

Transcript

  1. November 17, 2018 Monitoring microservices on Kubernetes with minimal effort

    Ahmet Alp Balkan (@ahmetb)
  2. About me • Software Engineer at Google Cloud • Worked at Microsoft

    Azure (2012-2016) on porting Docker to Windows & Linux stuff. • Kubernetes/GKE, Knative developer experience • Twitter/GitHub: @ahmetb
  3. Why monitor applications?

  4. You don't know if it works if you're not monitoring.

  5. Service availability contracts • SLO (service-level objective): an agreed way to measure the performance of

    a service, between two parties (often internal) • SLA (service-level agreement): an SLO backed by a legal contract. • Error budget: how much can you afford to violate the SLA/SLO? Diagram labels: Team A / Team B's service; Ucuzabilet / Turkish Airlines; You / Google Cloud Storage API
  6. Error Budgets • Team X owns a critical service at

    Google, used by other teams.
  7. Error Budgets • Team X owns a critical service at

    Google, used by other teams. • ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month).
  8. Error Budgets • Team X owns a critical service at

    Google, used by other teams. • ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month). • ServiceA went down for 30 minutes today.
  9. Error Budgets • Team X owns a critical service at

    Google, used by other teams. • ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month). • ServiceA went down for 30 minutes today. • Team X can't ship new features to ServiceA for 6 months, until they have more error budget (otherwise they risk violating their SLO/SLA).
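The arithmetic behind the error budget example can be sketched in a few lines of Python. The 60 minutes/year budget and the 30-minute outage come from the slides; the variable names and the "implied availability" figure are only illustrative.

```python
# Figures from the slides: ~60 minutes of allowed SLO violation per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

budget_per_year_min = 60
budget_per_month_min = budget_per_year_min / 12  # ~5 minutes/month

# Availability target implied by that budget (illustrative, not from the slides).
implied_availability = 1 - budget_per_year_min / MINUTES_PER_YEAR

# A single 30-minute outage, expressed in months of budget consumed.
outage_min = 30
months_of_budget_used = outage_min / budget_per_month_min

print(f"budget per month: {budget_per_month_min} minutes")
print(f"implied availability: {implied_availability:.4%}")
print(f"months of budget consumed by the outage: {months_of_budget_used}")
```

A 30-minute outage against a 5 minutes/month budget uses up six months of budget, which is where the "can't ship for 6 months" figure comes from.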
  10. What to monitor?

  11. Asking the right questions Is my website working?

  12. Asking the right questions Is my home page working?

  13. Asking the right questions Is my home page responding successfully?

  14. Asking the right questions Is my home page responding successfully

    to 99.5% of requests?
  15. Asking the right questions Is my home page responding successfully

    to 99.5% of requests within 100 milliseconds?
  16. Asking the right questions Is my home page responding successfully

    to 99.5% of requests within 100 milliseconds from the servers in us-east1?
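The final question above is precise enough to evaluate mechanically. The following is a minimal sketch, assuming request records are already available as Python objects; the `Request` type, the `slo_met` helper, and the definition of "successful" as a 2xx/3xx status are assumptions for illustration, not something the talk specifies.

```python
from dataclasses import dataclass


@dataclass
class Request:
    path: str
    region: str
    status: int
    latency_ms: float


def slo_met(requests, path="/", region="us-east1",
            max_latency_ms=100.0, target=0.995):
    """Return True if the share of good requests meets the target.

    A request is "good" here if it returned a 2xx/3xx status (an assumption)
    within max_latency_ms. Returns None when there is no matching traffic.
    """
    relevant = [r for r in requests if r.path == path and r.region == region]
    if not relevant:
        return None
    good = sum(
        1 for r in relevant
        if 200 <= r.status < 400 and r.latency_ms <= max_latency_ms
    )
    return good / len(relevant) >= target


# 996 fast successes, 3 server errors and 1 slow response out of 1000 requests.
sample = (
    [Request("/", "us-east1", 200, 40.0)] * 996
    + [Request("/", "us-east1", 500, 20.0)] * 3
    + [Request("/", "us-east1", 200, 250.0)]
)
print(slo_met(sample))
```

Each refinement on the slides (home page, success, 99.5%, 100 ms, us-east1) maps to one parameter or filter in the function.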
  17. Ways to monitor services

  18. Microservices

  19. At Google, everything is a service • Google Cloud Storage

    frontend → service • Google Fonts API → service • Google search index backend → internal service • Human resources database → internal service • Cafeteria menus → internal service ~O(10^10) requests per second in Google’s private network. Mostly gRPC/Protobuf-style networking (not HTTP REST/JSON APIs)
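  20. Microservices vs. monoliths • Develop independently: microservices YES, monoliths NO

    • Scale independently: microservices YES, monoliths NO • Fail independently: microservices YES, monoliths NO • Number of things to monitor: microservices MANY, monoliths ONE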
  21. At Google, we don't write the BEST CODE, but we

    have world-class systems OBSERVABILITY.
  22. Demo

  23. Pick your own adventure Open source

  24. Demo: Hipster Shop github.com/GoogleCloudPlatform/microservices-demo

  25. Metrics

  26. Time-series Metrics Measurement of a value over time: • Gauges:

    current value of an indicator (example: current memory usage MB) • Counters: only-increasing values (example: request count) Examples: • 99th percentile latency of POST requests to /login over the past 5 minutes • success rate of GET requests over past day • average memory usage in the past 30 minutes • "number of orders completed" in the past hour
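The difference between the two metric types can be sketched as follows. The `Counter` and `Gauge` classes and the `increase` helper are hypothetical and written only for this illustration; real client libraries such as Prometheus or OpenCensus have their own APIs.

```python
class Counter:
    """A value that only ever goes up (example: request count)."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """The current value of an indicator (example: memory usage in MB)."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


def increase(samples, start, end):
    """Growth of a counter between two timestamps.

    samples is a time-ordered list of (unix_seconds, counter_value) pairs.
    Counter resets (process restarts) are ignored in this sketch.
    """
    in_window = [v for t, v in samples if start <= t <= end]
    if len(in_window) < 2:
        return 0
    return in_window[-1] - in_window[0]


orders_completed = Counter()
orders_completed.inc()
orders_completed.inc(2)

memory_mb = Gauge()
memory_mb.set(512.0)
memory_mb.set(480.5)  # gauges may go down as well as up

# "number of orders completed" in the past hour, from periodic counter samples
samples = [(0, 100), (1800, 130), (3600, 175)]
print(increase(samples, 0, 3600))
```

Questions like "orders completed in the past hour" are answered by subtracting two samples of the same counter, which is why counters only need to expose a running total.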
  27. Time-series Metrics "orders_completed" counter example: the metrics collector scrapes each server.

    • server 1: GET /_metrics → orders_completed[server=1] 3 • server 2: GET /_metrics → orders_completed[server=2] 12
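The pull model in this example can be simulated without a network. In the sketch below, the two `scrape_server_*` functions are hypothetical stand-ins for an HTTP GET of each server's `/_metrics` page, and `collect` plays the role of the metrics collector; the values 3 and 12 are the ones shown on the slide.

```python
# Hypothetical stand-ins for fetching each server's /_metrics page over HTTP.
def scrape_server_1():
    return {"orders_completed": 3}


def scrape_server_2():
    return {"orders_completed": 12}


def collect(targets):
    """Scrape every target and key each sample by (metric name, server label)."""
    samples = {}
    for server, scrape in targets.items():
        for name, value in scrape().items():
            samples[(name, server)] = value
    return samples


samples = collect({"1": scrape_server_1, "2": scrape_server_2})

# The per-server series stay separate; totals are computed at query time.
total_orders = sum(
    value for (name, _server), value in samples.items()
    if name == "orders_completed"
)
print(samples)
print(total_orders)
```

Keeping the `server` label on each sample lets the collector answer both "how many orders overall?" and "how many on server 2?" from the same data.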
  28. Anatomy of a metrics page • name • labels • value

    orders_created[server=A, region=us-central1, version=14] 563
    http_requests[status=200, method=GET, path='/', server=A, region=us-central1, version=14] 156183
    http_requests[status=200, method=GET, path='/login', server=A, region=us-central1, version=14] 560
    http_requests[status=500, method=GET, path='/login', server=A, region=us-central1, version=14] 2
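To make the name/labels/value structure concrete, here is a small sketch that parses the lines from this slide and uses the labels to compute an error ratio for `/login`. The bracketed syntax is the slide's simplified notation, not the actual Prometheus exposition format, and `parse_line` is a hypothetical helper written for that notation only.

```python
import re

# Matches the slide's simplified notation: name[key=value, ...] value
LINE_RE = re.compile(r"^(?P<name>\w+)\[(?P<labels>[^\]]*)\]\s+(?P<value>\d+)$")

PAGE = """\
orders_created[server=A, region=us-central1, version=14] 563
http_requests[status=200, method=GET, path='/', server=A, region=us-central1, version=14] 156183
http_requests[status=200, method=GET, path='/login', server=A, region=us-central1, version=14] 560
http_requests[status=500, method=GET, path='/login', server=A, region=us-central1, version=14] 2
"""


def parse_line(line):
    """Split one metrics line into (name, labels dict, integer value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized metrics line: {line!r}")
    labels = {}
    for pair in match.group("labels").split(","):
        key, _, value = pair.strip().partition("=")
        labels[key] = value.strip("'")
    return match.group("name"), labels, int(match.group("value"))


metrics = [parse_line(line) for line in PAGE.splitlines()]

# Use labels to slice the data: all /login requests, and the failed ones.
login_total = sum(
    value for name, labels, value in metrics
    if name == "http_requests" and labels["path"] == "/login"
)
login_errors = sum(
    value for name, labels, value in metrics
    if name == "http_requests" and labels["path"] == "/login"
    and labels["status"] == "500"
)
print(login_errors / login_total)
```

Labels are what turn one counter into many queryable series: the same `http_requests` metric answers questions per path, per status, per region, or per version.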
  29. Benefits of metrics Measure if you're meeting your SLOs Create

    alerts Answer difficult questions (even in a Google datacenter): • What's the average uptime of the machines that are in the top 10%? • How many packet drops happened in the past minutes, and from which machines?
  30. Metrics collection • Prometheus → Grafana (UI) → Prometheus alerting → PagerDuty,

    … • Google Stackdriver → Stackdriver alerting → Stackdriver console (UI)
  31. Want easy request metrics? Istio gives you request/response metrics without

    changing any application code.
  32. Tracing

  33. Tracing Which services does a request travel through, and for how

    long? Tracing allows you to understand call patterns between your services, and find bottlenecks. • Exercise: how do you optimize Facebook home page load time?
  34. How Tracing works You need to update your code: •

    (→)incoming requests: get trace ID from request • outgoing(→) requests: pass trace ID to the request Service A Service B GET http://A/foo Trace-Header: 123 GET http://B/bar Trace-Header: 123
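The two code changes listed on this slide can be sketched with plain functions standing in for HTTP services. The `Trace-Header` name is the one drawn on the slide; real tracing systems define their own header names. `service_a`, `service_b`, and `extract_trace_id` are hypothetical helpers, and the fallback of generating a new ID when none is present is an assumption.

```python
import uuid

TRACE_HEADER = "Trace-Header"  # header name as shown on the slide


def extract_trace_id(headers):
    """Incoming request: reuse the caller's trace ID, or start a new trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


def service_b(headers, seen):
    trace_id = extract_trace_id(headers)
    seen.append(("B", "/bar", trace_id))
    return "bar"


def service_a(headers, seen):
    trace_id = extract_trace_id(headers)
    seen.append(("A", "/foo", trace_id))
    # Outgoing request: pass the same trace ID on to the next service.
    service_b({TRACE_HEADER: trace_id}, seen)
    return "foo"


# GET http://A/foo with Trace-Header: 123, which in turn calls http://B/bar.
seen = []
service_a({TRACE_HEADER: "123"}, seen)
print(seen)
```

Because both services record the same ID, a tracing backend can later stitch their individual records into one end-to-end view of the request.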
  35. Tracing solutions OpenCensus Google Stackdriver Azure AppInsights [...] write once

    export anywhere
  36. Example request trace frontend.GET./home 120ms ServiceA.Calculate 71ms ServiceF.ComputeX 28ms ServiceD.GetX

    32ms ServiceF.ComputeX 28ms ServiceF.ComputeY 38ms ServiceH.GetZ 24ms ServiceF.ComputeY 20 ms
  37. Example request trace frontend.GET./home 120ms ServiceA.Calculate 71ms ServiceF.ComputeX 28ms ServiceD.GetX

    32ms ServiceF.ComputeX 28ms ServiceF.ComputeY 38ms ServiceH.GetZ 24ms ServiceF.ComputeY 20 ms
  38. Profiling

  39. Profiling In which functions/methods is time spent in my "process"? • Helps

    you identify "slow paths" in your "fast paths" • You need to enable profiling in your application Want easy process profiling? • Try Stackdriver Profiling.
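To give a feel for what "enabling profiling" means, the sketch below uses Python's built-in `cProfile` module rather than the Stackdriver product mentioned on the slide; it is a local stand-in for the same idea. The `slow_path`, `fast_path`, and `handle_request` functions are hypothetical workloads.

```python
import cProfile
import io
import pstats


def slow_path(n):
    # Deliberately expensive: sums n squares one at a time.
    return sum(i * i for i in range(n))


def fast_path(n):
    return n * n


def handle_request():
    fast_path(10)
    return slow_path(200_000)


# Enable the profiler only around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Render the five most expensive entries by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report attributes time to individual functions, which is how a "slow path" hiding inside an otherwise fast handler becomes visible.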
  40. Example CPU profile

  41. Service Topology

  42. Service Topology Graph Which services call which other services? • Identify dependencies between

    your services. • Answer hard questions about service communication easily.
  43. Service Topology Graph Which services call which other services? •

    Identify dependencies between your services. • Answer hard questions about service communication easily. Examples: • Who is making requests to ServiceA? • How many requests-per-second (RPS) for ServiceA ⇒ ServiceB? • How does the latency of A⇒B compare to C⇒B? • What % of A⇒B requests go to B in us-west, what % to us-east?
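The example questions above become simple queries once every request is recorded as a caller/callee edge. The sketch below assumes such records exist as tuples; the `REQUESTS` data and the helper functions are made up for illustration and are not how Istio stores or exposes this information.

```python
from collections import defaultdict

# Hypothetical request log: (caller, callee, callee region, latency in ms)
REQUESTS = [
    ("A", "B", "us-west", 20.0),
    ("A", "B", "us-east", 30.0),
    ("A", "B", "us-west", 40.0),
    ("C", "B", "us-east", 10.0),
    ("frontend", "A", "us-west", 5.0),
]


def callers_of(service, requests):
    """Who is making requests to this service?"""
    return sorted({src for src, dst, _, _ in requests if dst == service})


def mean_latency(src, dst, requests):
    """Average latency of src => dst calls, or None if there were none."""
    latencies = [ms for s, d, _, ms in requests if s == src and d == dst]
    return sum(latencies) / len(latencies) if latencies else None


def region_split(src, dst, requests):
    """Fraction of src => dst requests handled in each region."""
    counts = defaultdict(int)
    for s, d, region, _ in requests:
        if s == src and d == dst:
            counts[region] += 1
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}


print(callers_of("B", REQUESTS))
print(mean_latency("A", "B", REQUESTS), mean_latency("C", "B", REQUESTS))
print(region_split("A", "B", REQUESTS))
```

Requests-per-second for an edge would follow the same pattern, dividing the number of matching records by the length of the observation window.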
  44. Want easy Service Topology? Istio gives you a service topology

    graph without changing any application code.
  45. Summary • Read the SRE Book (free) by Google to learn

    about SLOs/SLAs/error budgets • If you're using Kubernetes, use Istio (...to get metrics without changing code) • Types of monitoring • Metrics ◦ Good for monitoring SLOs/SLAs (+alerting), or app health ◦ Try Prometheus or Stackdriver Metrics • Tracing ◦ ...which service in the call graph takes how much time ◦ Try OpenCensus + Stackdriver Trace • Profiling ◦ Function-level performance diagnostics in a process
  46. Thanks • Play with github.com/GoogleCloudPlatform/microservices-demo • Say hello on Twitter: @ahmetb

    • Google Cloud Startup Program ($3000 credits) ◦ Special offer for DevFest: http://goo.gl/XyeCQ ◦ g.co/cloudstartups