Troubleshooting Kubernetes Applications

Michael Hausenblas
September 30, 2018

Talk at O'Reilly Velocity New York, see http://troubleshooting.kubernetes.sh



  1. Troubleshooting Kubernetes Applications

    Michael Hausenblas (@mhausenblas), Developer Advocate, Red Hat
    2018-10-03, O’Reilly Velocity NYC
    troubleshooting.kubernetes.sh
  2. Scope

    • Focusing on prototyping, developing, and testing applications with Kubernetes from an appops perspective
    • Beware, the talk is not really (much) about …
      • troubleshooting installation issues
      • performance testing or optimising containerized microservices
      • SRE-style troubleshooting (check out what Googlers say on this topic)

    Hit me up on Twitter: @mhausenblas
  5. Monoliths vs. microservices

    [Figure: a timeline comparing a monolith, released as a whole (v1, then v2), with microservices µS1, µS2, and µS3, each versioned independently (up to v5).]
  6. Moving parts: physical view

  7. Moving parts: logical view

  8. Honestly, not much new under the sun …

    www.computerhistory.org/tdih/september/9/
  9. Honestly, not much new under the sun …

    www.ibiblio.org/harris/500milemail.html
  10. TOP 10 failures

  12. The TOP 10 list

    1. invalid YAML specification
    2. can’t reach service/pod
    3. wrong container image
    4. no access to container registry
    5. supposedly long-running application exits
  13. The TOP 10 list

    6. missing/bad config or secret
    7. failed mounts
    8. lifecycle issues/probes fail
    9. wrong or missing permissions
    10. looking at the wrong place, aka: where is localhost?
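For failure no. 1, a cheap pre-flight check can catch structural mistakes before a manifest reaches the cluster. The following is only a sketch under stated assumptions, not something shown in the talk: it reads the manifest as JSON (which kubectl also accepts, and which avoids a YAML parser that is not in the Python standard library) and reports missing top-level fields. The function name `check_manifest` and the sample manifest are hypothetical.

```python
import json

# Top-level fields every Kubernetes object needs.
REQUIRED_FIELDS = ("apiVersion", "kind", "metadata")


def check_manifest(text):
    """Return a list of problems found in a JSON manifest (empty if none)."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(manifest, dict):
        return ["top level must be an object"]
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest
    ]
    metadata = manifest.get("metadata")
    if isinstance(metadata, dict) and not (
        metadata.get("name") or metadata.get("generateName")
    ):
        problems.append("metadata needs a name or generateName")
    return problems


# Hypothetical manifest with a typo: "kind" is misspelled as "knd".
sample = '{"apiVersion": "v1", "knd": "Pod", "metadata": {"name": "demo"}}'
print(check_manifest(sample))
```

A check like this only covers the outermost structure; the API server remains the authority on whether a specification is valid.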
  14. Poking around

  15. OODA loop

    • Observe: What’s in the logs? What’s the baseline?
    • Orient: Formulate hypotheses. Don’t jump to conclusions.
    • Decide: Sort hypotheses by likelihood. Pick one of the hypotheses.
    • Act: Test the hypothesis you picked. If confirmed: fix it, else: continue.
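The Decide and Act steps of the loop can be sketched in a few lines. This is an illustration only: the function `ooda`, the likelihood numbers, and the stand-in test are hypothetical and merely mirror the wording of the slide.

```python
def ooda(hypotheses, test):
    """Test hypotheses from most to least likely; return the first confirmed one.

    hypotheses: list of (description, likelihood) pairs from the Orient step.
    test: callable that takes a description and returns True if confirmed (Act).
    """
    # Decide: sort by likelihood, most likely first.
    ranked = sorted(hypotheses, key=lambda pair: pair[1], reverse=True)
    for description, _likelihood in ranked:
        if test(description):  # Act: confirmed, so this is what to fix
            return description
    return None  # nothing confirmed: go back to Observe


# Hypothetical hypotheses and a stand-in test that confirms only one of them.
candidates = [
    ("wrong container image", 0.2),
    ("missing secret", 0.5),
    ("failing liveness probe", 0.3),
]
confirmed = ooda(candidates, lambda h: h == "failing liveness probe")
print(confirmed)
```

Returning `None` when no hypothesis is confirmed corresponds to the "else: continue" branch: go around the loop again with fresh observations.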
  16. The How

    • Using kubectl get events
    • Using kubectl describe
    • Using kubectl exec
    • Using kubectl logs
    • Full-blown observability approaches (note: we’ll get back to that later)
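As a rough sketch of how the four kubectl commands above are typically invoked, the helper below builds their argument lists. The helper `kubectl_argv`, the pod name `webserver`, and the default namespace are assumptions for illustration; the resulting list could be handed to `subprocess.run` on a machine where kubectl is installed.

```python
def kubectl_argv(action, pod=None, namespace="default", command=None):
    """Build the argument list for one of the four kubectl commands above."""
    base = ["kubectl", "--namespace", namespace]
    if action == "events":
        return base + ["get", "events"]
    if action == "describe":
        return base + ["describe", "pod", pod]
    if action == "logs":
        return base + ["logs", pod]
    if action == "exec":
        # Everything after "--" is run inside the container.
        return base + ["exec", pod, "--"] + list(command or ["env"])
    raise ValueError(f"unknown action: {action}")


# "webserver" is a hypothetical pod name.
for action in ("events", "describe", "logs", "exec"):
    print(" ".join(kubectl_argv(action, pod="webserver")))
```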
  17. Apps run in pods, and pods sometimes fail …

  18. What’s (in) a

  19. [Flowchart: a pod’s path through the control plane and the worker node]

    Steps:
    • kubectl apply
    • API Server stores desired state
    • Scheduler sees new pod, selects node
    • Scheduler assigns pod to a fitting node
    • kubelet watches API server and notices new pod
    • kubelet asks container runtime via CRI to launch container(s)
    • container runtime pulls image
    • container runtime runs images
    • kubelet takes over pod lifecycle (probes)
    • pod runs until deleted or evicted
    • garbage collection

    Checks along the way:
    • etcd happy? If not: ask cluster admin
    • does the pod get scheduled? If not: fork out more $$$
    • container runtime happy? If not: ask cluster admin
    • can access container registry? If not: fix access to registry
    • is container starting up? (init containers) If not: debug app
    • container crashing after startup? If so: debug app
    • probes fine? no leaking resources? Otherwise: soak testing, monitoring
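The checks in this flowchart can be read as an ordered decision list: walk the questions in order and stop at the first one that signals trouble. The sketch below covers a subset of the questions; the names `CHECKS` and `next_action` and the sample answers are hypothetical.

```python
# (question, answer that signals trouble, suggested action), in flowchart order.
CHECKS = [
    ("etcd happy?", False, "ask cluster admin"),
    ("does the pod get scheduled?", False, "fork out more $$$"),
    ("container runtime happy?", False, "ask cluster admin"),
    ("can access container registry?", False, "fix access to registry"),
    ("is container starting up?", False, "debug app"),
    ("container crashing after startup?", True, "debug app"),
]


def next_action(answers):
    """Return the action for the first troubling answer, or None if all is fine.

    answers maps a question to True (yes) or False (no). A question without
    an answer stops the walk, because later checks depend on earlier ones.
    """
    for question, trouble, action in CHECKS:
        if question not in answers:
            return f"answer first: {question}"
        if answers[question] == trouble:
            return action
    return None


# Hypothetical situation: everything works up to the image pull.
answers = {
    "etcd happy?": True,
    "does the pod get scheduled?": True,
    "container runtime happy?": True,
    "can access container registry?": False,
}
print(next_action(answers))
```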
  20. I think I’m having image issues …

    kubectl get events to the rescue?
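When the event list is long, filtering for warnings narrows things down quickly. The sketch below assumes the events were fetched with `kubectl get events -o json` and keeps only entries whose `type` is `Warning`; the function name `warning_events` and the sample events are hypothetical and heavily trimmed.

```python
import json


def warning_events(events_json):
    """Return (object name, reason, message) for each Warning event."""
    items = json.loads(events_json).get("items", [])
    return [
        (e["involvedObject"]["name"], e["reason"], e["message"])
        for e in items
        if e.get("type") == "Warning"
    ]


# Hypothetical, heavily trimmed output of `kubectl get events -o json`.
sample = json.dumps({
    "items": [
        {"type": "Normal", "reason": "Scheduled", "message": "assigned pod to node",
         "involvedObject": {"name": "webserver"}},
        {"type": "Warning", "reason": "Failed", "message": "Failed to pull image",
         "involvedObject": {"name": "webserver"}},
    ]
})
for name, reason, message in warning_events(sample):
    print(f"{name}: {reason}: {message}")
```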
  21. hands-on time!

  22. Dunno, just keeps crashing …

    kubectl describe and exec
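Much of what kubectl describe shows for a crashing pod is also available in the pod's status. As a sketch, the function below pulls the waiting reason and the last exit code per container out of a pod object as returned by `kubectl get pod NAME -o json`. The function name `container_problems` and the sample pod are hypothetical.

```python
def container_problems(pod):
    """Summarize waiting reasons and last exit codes per container.

    pod: the parsed output of `kubectl get pod NAME -o json`.
    """
    problems = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            problems.append((status["name"], "waiting", waiting.get("reason")))
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            problems.append(
                (status["name"], "last exit code", terminated.get("exitCode"))
            )
    return problems


# Hypothetical, trimmed pod object for a container that keeps crashing.
sample_pod = {
    "status": {
        "containerStatuses": [{
            "name": "app",
            "restartCount": 7,
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"exitCode": 1}},
        }]
    }
}
print(container_problems(sample_pod))
```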
  23. hands-on time!

  24. Oh my Lanta, something’s wrong with the app …

    kubectl logs
  25. hands-on time!

  26. Storage

  27. What and how

    • Storage in Kubernetes (CSI)
    • Failure modes
    • See stateful.kubernetes.sh
  28. hands-on time!

  29. Network

  30. What and how

    • Container networking in Kubernetes (CNI)
    • Failure modes
    • See mhausenblas.info/cn-ref/
  32. hands-on time!

  33. Security

  34. Access control (RBAC) and policies

    • Use kubectl auth can-i to check RBAC permissions
    • Make yourself familiar with:
      • Pod Security Policies, which might constrain your app too much
      • Network Policies, which might be too strict for your app’s communication needs
    • See kubernetes-security.info
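A permission check with kubectl auth can-i can be scripted along the following lines. This is a sketch under assumptions: the helpers `can_i_argv` and `is_allowed` are hypothetical, the service account shown is made up, and the parsing relies on the command printing an answer that starts with "yes" or "no".

```python
def can_i_argv(verb, resource, namespace=None, as_user=None):
    """Build the argument list for `kubectl auth can-i VERB RESOURCE`."""
    argv = ["kubectl", "auth", "can-i", verb, resource]
    if namespace:
        argv += ["--namespace", namespace]
    if as_user:
        argv += ["--as", as_user]  # impersonate another user or service account
    return argv


def is_allowed(stdout):
    """Interpret the command's answer; anything other than "yes" counts as no."""
    return stdout.strip().lower().startswith("yes")


# Hypothetical check: may the default service account in "dev" list pods there?
argv = can_i_argv(
    "list", "pods", namespace="dev", as_user="system:serviceaccount:dev:default"
)
print(" ".join(argv))
```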
  35. hands-on time!

  36. Observability

  37. Metrics

    [Figure: metrics diagram with the labels node, container runtime, app, alerts, dashboards, storage, and event router.]
  38. Metrics

    • Use industry standards in cloud native land: Prometheus + Grafana
    • Out-of-the-box low-level metrics (CPU, memory)
    • Options for app-specific (custom) metrics:
      • instrumentation of the app
      • service mesh-based approaches (aka: magic), for example Linkerd2
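To give an idea of what "instrumentation of the app" boils down to, here is a deliberately minimal sketch of a counter rendered in the Prometheus text exposition format, using only the standard library. In a real app one would more likely reach for an official Prometheus client library; the class `Counter` and the metric name `app_requests_total` are hypothetical.

```python
class Counter:
    """A monotonically increasing counter, rendered in the Prometheus text format."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

    def render(self):
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


# Hypothetical app metric: count handled requests.
requests_total = Counter("app_requests_total", "Total requests handled.")
requests_total.inc()
requests_total.inc()
print(requests_total.render(), end="")
```

The rendered text is what the app would serve on its metrics endpoint for Prometheus to scrape.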
  39. kudos to demo.robustperception.io

  40. kudos to linkerd.io/2

  41. kudos to linkerd.io/2 and grafana.com
  42. kudos to okd.io

  43. Aggregated logs

  44. Aggregated logs

    • In the app, log to stdout or, if you can’t, use an adapter
    • Use the industry standard in cloud native land: ELK/EFK stack
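Logging to stdout from the app might look like the sketch below. Emitting one JSON object per line is an assumption made here because it tends to be easy for log aggregators to parse; the slide itself only asks for stdout. The names `JsonFormatter`, `make_logger`, and the logger name `app` are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def make_logger(name, stream=None):
    """Create a logger that writes JSON lines to stdout (or the given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


log = make_logger("app")  # "app" is a hypothetical logger name
log.info("listening on port %d", 8080)
```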
  45. kudos to okd.io and elastic.co
  46. Distributed tracing

    • Roots: need to overcome limitations of “time-synced logs”
    • Specifications: OpenCensus and OpenTracing
    • Tooling: Zipkin, Jaeger, Stackdriver
    • A must-have in a microservices setup
  48. hands-on time!

  49. Vaccination

  50. Quizzie time!

    You wrote an application server. For load-balancing purposes, where would you put a reverse proxy such as NGINX?

    A. Into the container (same Dockerfile)
    B. Into a sidecar container (same pod)
    C. Into a separate pod
  51. Proactive measures

    Architect your apps the cloud native way by …
    • knowing and using the Kubernetes primitives (services, deployments)
    • implementing retries & timeouts (in-tree or via service mesh)
    • avoiding hardcoded (start-up) dependencies
    • listening on (not
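An in-tree version of "retries & timeouts" could look roughly like this. It is a sketch, not a recommendation of specific values: the function `call_with_retries`, the delays, and the flaky dependency are hypothetical, and the wrapped call is assumed to enforce its own timeout and raise when it is exceeded.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff.

    fn is expected to enforce its own timeout (for example a socket or
    HTTP client timeout) and raise when it is exceeded.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...


# Hypothetical dependency that fails twice before it becomes available.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("dependency not ready yet")
    return "ok"


print(call_with_retries(flaky, sleep=lambda seconds: None))
```

Retrying instead of failing at the first refused connection is also one way to avoid a hardcoded start-up order between services.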
  52. Proactive measures

    • Apply chaos engineering as long as all is well and learn from it where and how your system fails
    • Provide debug tools in the image, but also mind footprint and security!
    • Automate all the things: Autoscaler, Brigade, Draft, Forge, Helm, knative, ksync, odo, Operators, Skaffold, watchpod, etc.
  53. hands-on time!

  54. Resources

  55. Kubernetes Security: Operating Kubernetes Clusters and Applications Safely

    Liz Rice & Michael Hausenblas
  56. Articles

    • Kubernetes Troubleshooting site
    • GKE Troubleshooting docs
    • Debugging microservices - Squash vs. Telepresence
    • Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)
    • Troubleshooting Kubernetes Using Logs
    • Debug a Go Application in Kubernetes from IDE
    • Troubleshooting Kubernetes Networking Issues
    • Video: CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster (KubeCon NA 2017)
    • Slide deck: Evolution of Monitoring and Prometheus
  57. Articles

    • 10 Most Common Reasons Kubernetes Deployments Fail: Part 1 and Part 2
    • Kubernetes Application Operator Basics
    • Kubernetes: five steps to well-behaved apps
    • Kubernetes Best Practices
    • Developing on Kubernetes
    • Debugging Microservices: How Google SREs Resolve Outages
    • Debugging Microservices: Lessons from Google, Facebook, Lyft
    • Troubleshooting Java applications on OpenShift
    • Debugging Kubernetes PVCs
  58. Official Kubernetes docs

    • kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
    • kubernetes.io/docs/tasks/debug-application-cluster/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-stateful-set/
    • kubernetes.io/docs/tasks/debug-application-cluster/local-debugging/
  59. plus.google.com/+RedHat linkedin.com/company/red-hat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHatNews learn.openshift.com