Troubleshooting Kubernetes Applications

Michael Hausenblas (@mhausenblas), Developer Advocate, Red Hat
2018-10-03, O’Reilly Velocity NYC
troubleshooting.kubernetes.sh

Hit me up on Twitter: @mhausenblas

Scope
• Focusing on prototyping, developing, and testing applications with Kubernetes from an appops perspective
• Beware, the talk is not really (much) about:
  • troubleshooting installation issues
  • performance testing or optimising containerized microservices
  • SRE-style troubleshooting (check out what Googlers say on this topic)


Monoliths vs. microservices
[Diagram: over time, the monolith ships as v1 and then v2, while the microservices µS1, µS2, and µS3 release versions independently (µS1 reaches v4, µS2 reaches v5, µS3 reaches v6).]

Moving parts: physical view

Moving parts: logical view

Honestly, not much new under the sun …
www.computerhistory.org/tdih/september/9/

Honestly, not much new under the sun …
www.ibiblio.org/harris/500milemail.html

TOP 10 failures

The TOP 10 list
1. invalid YAML specification
2. can’t reach service/pod
3. wrong container image
4. no access to container registry
5. supposedly long-running application exits

The TOP 10 list (continued)
6. missing/bad config or secret
7. failed mounts
8. lifecycle issues/probes fail
9. wrong or missing permissions
10. looking at the wrong place aka: where is localhost?
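To make the last item ("where is localhost?") concrete, here is a minimal sketch using only the Python standard library. The helper name `open_listener` is made up for this illustration. A server bound to 127.0.0.1 only accepts connections from inside the pod's own network namespace, whereas one bound to 0.0.0.0 also accepts connections arriving on the pod IP.

```python
import socket


def open_listener(host: str, port: int = 0) -> socket.socket:
    """Open a TCP listening socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


if __name__ == "__main__":
    # Loopback only: reachable from containers in the same pod, but not
    # from other pods or through a service.
    loopback_only = open_listener("127.0.0.1")
    # All interfaces: also reachable on the pod IP.
    all_interfaces = open_listener("0.0.0.0")
    print("loopback listener:", loopback_only.getsockname())
    print("wildcard listener:", all_interfaces.getsockname())
    loopback_only.close()
    all_interfaces.close()
```

If an app works with kubectl exec plus a local request but not through its service, the bind address is one of the first things worth checking.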

Poking around

OODA loop
• Observe: What’s in the logs? What’s the baseline?
• Orient: Formulate hypotheses. Don’t jump to conclusions.
• Decide: Sort hypotheses by likelihood. Pick one of the hypotheses.
• Act: Test the hypothesis you picked. If confirmed: fix it, else: continue.
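The Decide and Act steps can be sketched in a few lines of Python. This is only an illustration of the idea, not a tool from the talk: `Hypothesis`, `ooda_pass`, and the likelihood numbers are all made up, and the test functions stand in for whatever check you would run by hand.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    description: str
    likelihood: float         # subjective estimate between 0 and 1
    test: Callable[[], bool]  # returns True if the hypothesis is confirmed


def ooda_pass(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Decide and act: test hypotheses from most to least likely.

    Returns the first confirmed hypothesis, or None if every one was
    refuted, in which case it is back to observing and orienting.
    """
    ranked = sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)
    for hypothesis in ranked:
        if hypothesis.test():
            return hypothesis
    return None


if __name__ == "__main__":
    candidates = [
        Hypothesis("wrong container image tag", 0.6, lambda: False),
        Hypothesis("missing secret", 0.3, lambda: True),
        Hypothesis("probe misconfigured", 0.1, lambda: False),
    ]
    confirmed = ooda_pass(candidates)
    print(confirmed.description if confirmed else "nothing confirmed, observe again")
```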

The How
• Using kubectl get events
• Using kubectl describe
• Using kubectl exec
• Using kubectl logs
• Full-blown observability approaches (note: we get back to that later)
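As a rough sketch, the four kubectl commands above could be wrapped in a small Python helper that builds the argument lists, ready to hand to subprocess.run(). The helper names `kubectl` and `first_look` and the pod name `web-0` are assumptions for illustration. The flags used (--namespace, --sort-by, --previous, -it) are regular kubectl flags, and the exec example assumes the image ships a shell.

```python
import shlex
from typing import List, Optional


def kubectl(*args: str, namespace: Optional[str] = None) -> List[str]:
    """Build a kubectl argv list.

    The namespace flag goes before the subcommand so that it can never end
    up after a `--` separator and be passed to the command run by exec.
    """
    argv = ["kubectl"]
    if namespace:
        argv += ["--namespace", namespace]
    return argv + list(args)


def first_look(pod: str, namespace: Optional[str] = None) -> List[List[str]]:
    """The four commands from the list above, in the same order."""
    return [
        # Recent events in the namespace, oldest first
        kubectl("get", "events", "--sort-by=.metadata.creationTimestamp", namespace=namespace),
        # Status, conditions, and events of a single pod
        kubectl("describe", "pod", pod, namespace=namespace),
        # Interactive shell inside the container
        kubectl("exec", "-it", pod, "--", "sh", namespace=namespace),
        # Logs of the previous container instance, handy after a crash
        kubectl("logs", pod, "--previous", namespace=namespace),
    ]


if __name__ == "__main__":
    for argv in first_look("web-0", namespace="dev"):
        print(shlex.join(argv))
```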

Apps run in pods, and pods sometimes fail …

What’s (in) a pod?

[Flowchart: from kubectl apply to a running pod, across control plane and worker node]

Steps, in order:
• kubectl apply
• API Server stores desired state (etcd)
• Scheduler sees new pod, selects node
• Scheduler assigns pod to a fitting node
• kubelet watches API server and notices new pod
• kubelet asks container runtime via CRI to launch container(s)
• container runtime pulls image
• container runtime runs images
• kubelet takes over pod lifecycle (probes)
• pod runs until deleted or evicted
• garbage collection

Questions to ask along the way:
• etcd happy? If not: ask cluster admin
• does the pod get scheduled? If not: fork out more $$$
• container runtime happy? If not: ask cluster admin
• can access container registry? If not: fix access to registry
• is container starting up? (init containers) If not: debug app
• container crashing after startup? If yes: debug app
• probes fine? no leaking resources? Soak testing, monitoring
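The flowchart's questions read like a chain of checks, so here is a hedged sketch of it as a Python function. The class `PodObservation`, its field names, and `next_step` are invented for this example. The pairing of each question with its action follows my reading of the flowchart and may not match the original slide exactly.

```python
from dataclasses import dataclass


@dataclass
class PodObservation:
    """Answers to the flowchart questions, gathered by hand or from kubectl."""
    etcd_ok: bool = True
    scheduled: bool = True
    runtime_ok: bool = True
    registry_reachable: bool = True
    starting_up: bool = True
    crashing_after_startup: bool = False
    probes_ok: bool = True
    leaking_resources: bool = False


def next_step(obs: PodObservation) -> str:
    """Return the action for the first question with a bad answer."""
    if not obs.etcd_ok:
        return "ask cluster admin"
    if not obs.scheduled:
        return "fork out more $$$"
    if not obs.runtime_ok:
        return "ask cluster admin"
    if not obs.registry_reachable:
        return "fix access to registry"
    if not obs.starting_up:
        return "debug app (check init containers)"
    if obs.crashing_after_startup:
        return "debug app"
    if not obs.probes_ok or obs.leaking_resources:
        return "soak testing, monitoring"
    return "pod runs until deleted or evicted"


if __name__ == "__main__":
    print(next_step(PodObservation(registry_reachable=False)))
```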

I think I’m having image issues …
kubectl get events to the rescue?

hands-on time!

Dunno, just keeps crashing …
kubectl describe and exec

hands-on time!

Oh my Lanta, something’s wrong with the app …
kubectl logs

hands-on time!

Storage

What and how
• Storage in Kubernetes (CSI)
• Failure modes
• See stateful.kubernetes.sh

hands-on time!

Network

What and how
• Container networking in Kubernetes (CNI)
• Failure modes
• See mhausenblas.info/cn-ref/


hands-on time!

Security

Access control (RBAC) and policies
• Use kubectl auth can-i to check RBAC permissions
• Make yourself familiar with:
  • Pod Security Policies, might constrain your app too much
  • Network Policies, might be too strict for your app’s communication needs
• See kubernetes-security.info
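Here is a small sketch of how a kubectl auth can-i check might be assembled, for example to test what a service account is allowed to do. The function name `can_i` and the namespace and account names are hypothetical. The --namespace and --as flags are standard kubectl flags, and system:serviceaccount:<namespace>:<name> is how Kubernetes names service account users.

```python
import shlex
from typing import List, Optional


def can_i(verb: str, resource: str,
          namespace: Optional[str] = None,
          as_user: Optional[str] = None) -> List[str]:
    """Build the argv for `kubectl auth can-i <verb> <resource>`."""
    argv = ["kubectl", "auth", "can-i", verb, resource]
    if namespace:
        argv += ["--namespace", namespace]
    if as_user:
        # Impersonate another user or service account for the check
        argv += ["--as", as_user]
    return argv


if __name__ == "__main__":
    # Can the (hypothetical) "default" service account in "dev" list pods?
    check = can_i("list", "pods", namespace="dev",
                  as_user="system:serviceaccount:dev:default")
    print(shlex.join(check))
    # To run it for real: subprocess.run(check), which prints "yes" or "no".
```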

hands-on time!

Observability

Metrics
[Diagram components: node, container runtime, app, alerts, dashboards, storage, event router]

Metrics
• Use industry standards in cloud native land: Prometheus + Grafana
• Out-of-the-box low-level metrics (CPU, memory)
• Options for app-specific (custom) metrics:
  • instrumentation of the app
  • service mesh-based approaches (aka: magic), for example Linkerd2
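To give a feel for what "instrumentation of the app" produces, this sketch hand-rolls a counter and renders it in the Prometheus text exposition format. In a real app you would more likely reach for an official client library. The class `RequestCounter` and the metric name are illustrative only, and the sketch skips label-value escaping.

```python
from collections import Counter
from typing import Tuple


class RequestCounter:
    """A labelled, monotonically increasing counter (illustrative only)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._counts: Counter = Counter()

    def inc(self, **labels: str) -> None:
        # Sort the labels so the same label set always maps to the same series
        key: Tuple[Tuple[str, str], ...] = tuple(sorted(labels.items()))
        self._counts[key] += 1

    def render(self) -> str:
        """Render HELP/TYPE lines followed by one line per series."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._counts.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests = RequestCounter("http_requests_total", "Total HTTP requests handled.")
    requests.inc(method="GET", code="200")
    requests.inc(method="GET", code="200")
    requests.inc(method="POST", code="500")
    print(requests.render(), end="")
```

Serving this text on an HTTP endpoint is what lets Prometheus scrape the app.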

[Screenshot] kudos to demo.robustperception.io

[Screenshot] kudos to linkerd.io/2

[Screenshot] kudos to linkerd.io/2 and grafana.com

[Screenshot] kudos to okd.io


Aggregated logs
• In the app, log to stdout; if you can’t, use an adapter
• Use the industry standard in cloud native land: ELK/EFK stack
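A minimal sketch of "log to stdout" with Python's standard logging module, writing one JSON object per line so a log shipper can parse it. `JsonFormatter`, `make_logger`, and the field names are choices made for this example, not a required schema.

```python
import json
import logging
import sys
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Format each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })


def make_logger(name: str, stream: TextIO = sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace handlers to avoid duplicate lines
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("shop")
    log.info("order %s accepted", 42)
```

Anything the container writes to stdout shows up in kubectl logs and can be picked up by the cluster's log collector without the app knowing about it.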

[Screenshot] kudos to okd.io and elastic.co

Distributed tracing
• Roots: need to overcome limitations of “time-synced logs”
• Specifications: OpenCensus and OpenTracing
• Tooling: Zipkin, Jaeger, Stackdriver
• A must-have in a microservices setup

hands-on time!

Vaccination

Quizzie time!
You wrote an application server. For load-balancing purposes, where would you put a reverse proxy such as NGINX?
A. Into the container (same Dockerfile)
B. Into a sidecar container (same pod)
C. Into a separate pod

Proactive measures
Architect your apps the cloud native way by …
• knowing and using the Kubernetes primitives (services, deployments)
• implementing retries & timeouts (in-tree or via service mesh)
• avoiding hardcoded (start-up) dependencies
• listening on 0.0.0.0 (not 127.0.0.1)
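For the "retries & timeouts" bullet, this is one possible in-app shape: a per-attempt timeout plus exponential backoff with jitter. It is a sketch, not the approach the talk prescribes. The name `call_with_retries` and the default values are assumptions, and catching OSError is a simplification that covers connection and timeout errors from the standard library.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[float], T],
                      attempts: int = 4,
                      timeout: float = 2.0,
                      base_delay: float = 0.1,
                      max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation(timeout), retrying on network-style errors.

    The wait before each retry is drawn at random from a window that
    doubles after every failure, capped at max_delay.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            return operation(timeout)
        except OSError as err:  # includes ConnectionError and TimeoutError
            last_error = err
            if attempt < attempts - 1:
                window = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, window))
    assert last_error is not None
    raise last_error
```

The operation receives the timeout so it can pass it on, for instance to urllib.request.urlopen(url, timeout=timeout). With a dependency wrapped like this, the app can start before its backend is ready instead of crashing on a hardcoded start-up dependency.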

Proactive measures
• Apply chaos engineering while all is well, and learn from it where and how your system fails
• Provide debug tools in the image, but mind the footprint and security
• Automate all the things: Autoscaler, Brigade, Draft, Forge, Helm, knative, ksync, odo, Operators, Skaffold, watchpod, etc.

hands-on time!

Resources

Book: Kubernetes Security: Operating Kubernetes Clusters and Applications Safely, by Liz Rice & Michael Hausenblas

Articles
• Kubernetes Troubleshooting site
• GKE Troubleshooting docs
• Debugging microservices - Squash vs. Telepresence
• Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)
• Troubleshooting Kubernetes Using Logs
• Debug a Go Application in Kubernetes from IDE
• Troubleshooting Kubernetes Networking Issues
• Video: CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster (KubeCon NA 2017)
• Slide deck: Evolution of Monitoring and Prometheus

Articles (continued)
• 10 Most Common Reasons Kubernetes Deployments Fail: Part 1 and Part 2
• Kubernetes Application Operator Basics
• Kubernetes: five steps to well-behaved apps
• Kubernetes Best Practices
• Developing on Kubernetes
• Debugging Microservices: How Google SREs Resolve Outages
• Debugging Microservices: Lessons from Google, Facebook, Lyft
• Troubleshooting Java applications on OpenShift
• Debugging Kubernetes PVCs

Official Kubernetes docs
• kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
• kubernetes.io/docs/tasks/debug-application-cluster/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-stateful-set/
• kubernetes.io/docs/tasks/debug-application-cluster/local-debugging/

