Slide 1

Slide 1 text

Troubleshooting Kubernetes apps
Michael Hausenblas (@mhausenblas), Developer Advocate, Red Hat
2019-02-08, KubeCologne
troubleshooting.kubernetes.sh

Slide 2

Slide 2 text

Hit me up on Twitter: @mhausenblas
Scope
• Focusing on prototyping, developing, and testing applications with Kubernetes from an appops perspective (tools & techniques)
• But not really (much) about …
  • troubleshooting installation or upgrading issues
  • performance testing or optimising containerized microservices
  • SRE-style troubleshooting (check out what Googlers say on this topic)

Slide 3

Slide 3 text

Monoliths vs. microservices
[Diagram: over time, the monolith moves from v1 to v2 as a single unit, while the microservices µS1, µS2, and µS3 are versioned independently (µS1 reaches v4, µS2 reaches v5, µS3 reaches v6).]

Slide 4

Slide 4 text


Slide 5

Slide 5 text

Moving parts: logical view

Slide 6

Slide 6 text

Moving parts: physical view

Slide 7

Slide 7 text

TOP 10 failures

Slide 8

Slide 8 text

The TOP 10 list
1. invalid YAML specification
2. wrong or missing permissions
3. wrong container image
4. no access to container registry
5. supposedly long-running application exits

Slide 9

Slide 9 text

The TOP 10 list
6. missing/bad config or secret
7. lifecycle issues (probes fail)
8. can’t reach service
9. looking at the wrong place: where is localhost?
10. failed mounts

Slide 10

Slide 10 text

Not just poking around …

Slide 11

Slide 11 text

OODA loop
• Observe: What’s in the logs? Establish baseline.
• Orient: Formulate hypotheses. Don’t jump to conclusions.
• Decide: Sort hypotheses by likelihood. Pick one of the hypotheses.
• Act: Test the hypothesis you picked. If confirmed: fix it, else: continue.
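The Decide and Act steps can be sketched in a few lines of Python. This is only an illustration, not a tool from the talk: the `Hypothesis` class, the likelihood numbers, and the `test` callables are hypothetical placeholders for whatever checks you would actually run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    description: str
    likelihood: float          # rough guess between 0 and 1 (assumed scale)
    test: Callable[[], bool]   # returns True if the hypothesis is confirmed


def decide_and_act(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Test hypotheses from most to least likely; return the first confirmed one."""
    for h in sorted(hypotheses, key=lambda h: h.likelihood, reverse=True):
        if h.test():
            return h   # confirmed: go and fix it
    return None        # nothing confirmed: go back to Observe


if __name__ == "__main__":
    # Placeholder checks; in practice each test would inspect events, logs, etc.
    candidates = [
        Hypothesis("config/data missing", 0.2, lambda: False),
        Hypothesis("application logic/runtime error", 0.5, lambda: False),
        Hypothesis("resource issue (OOM, etc.)", 0.3, lambda: True),
    ]
    confirmed = decide_and_act(candidates)
    print(confirmed.description if confirmed else "back to Observe")
```

Sorting before testing is the whole point: the cheapest win is usually to rule the most likely cause in or out first.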

Slide 12

Slide 12 text


Slide 13

Slide 13 text


Slide 14

Slide 14 text

• Deployment seems OK
• Pod seems OK (image found, scheduled, launched)
• I see log output, so container is running
• Keeps crashing after launch

Slide 15

Slide 15 text

• Could be a resource issue (OOM, etc.)
• Could be config/data missing
• Could be an application logic/runtime error

Slide 16

Slide 16 text

1. Could be an application logic/runtime error
2. Could be a resource issue (OOM, etc.)
3. Could be config/data missing

Slide 17

Slide 17 text

command:
- sh
- '-c'
- echo "I will just print something here and then exit" && sleep 1000

Slide 18

Slide 18 text

The How
• Using kubectl get events
• Using kubectl describe
• Using kubectl exec
• Using kubectl logs (or kubetail, stern)
• Full-blown observability approaches
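As a rough sketch, the four kubectl commands above can be assembled for a given pod like this. The helper name `triage_commands`, the pod name `web-1`, and the choice of flags (`--sort-by`, `--previous`, `-it … -- sh`) are assumptions for illustration; the helper only builds command strings and does not talk to a cluster.

```python
import shlex
from typing import List


def triage_commands(pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl commands for a first look at a misbehaving pod."""
    ns = ["-n", namespace]
    commands = [
        # Recent events, oldest first, to see scheduling and image pull problems.
        ["kubectl", "get", "events", *ns, "--sort-by=.metadata.creationTimestamp"],
        # Pod details: state, restart count, last termination reason.
        ["kubectl", "describe", "pod", pod, *ns],
        # Logs of the previous container instance (only works after a restart).
        ["kubectl", "logs", pod, *ns, "--previous"],
        # Interactive shell, assuming the image ships `sh`.
        ["kubectl", "exec", "-it", pod, *ns, "--", "sh"],
    ]
    return [shlex.join(c) for c in commands]


if __name__ == "__main__":
    for line in triage_commands("web-1"):
        print(line)
```

Copy the printed lines into a terminal, or pass the argument lists to `subprocess.run` if you want to automate the first pass.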

Slide 19

Slide 19 text

Observability

Slide 20

Slide 20 text

Metrics
[Diagram labels: node, container runtime, app, event router, storage, alerts, dashboards]

Slide 21

Slide 21 text

Metrics
• Out-of-the-box low-level metrics (CPU, memory)
• Application-specific metrics (full-blown instrumentation vs service mesh-based approaches)
• Options
  • Roll your own, use the industry standards: Prometheus + Grafana
  • Cloud provider native

Slide 22

Slide 22 text

kudos to okd.io

Slide 23

Slide 23 text

kudos to linkerd.io/2 and grafana.com

Slide 24

Slide 24 text

kudos to linkerd.io/2

Slide 25

Slide 25 text

Aggregated logs

Slide 26

Slide 26 text

Aggregated logs
• In the app, log to stdout or, if you can’t, use an adapter
• Options
  • Roll your own, use the industry standards: ELK/EFK stack
  • Cloud provider native such as CloudWatch or Stackdriver
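A minimal sketch of "log to stdout" in an app, using only Python's standard `logging` module. Emitting one JSON object per line is an assumption made here because log aggregators can usually parse it; the field names and the `make_logger` helper are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line (field names are illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def make_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes to stdout (or to `stream`, e.g. in tests)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]   # replace handlers so repeated calls don't duplicate output
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("myapp")
    log.info("listening on %s:%d", "0.0.0.0", 8080)
```

Writing to stdout rather than to a file inside the container lets `kubectl logs` and the cluster's log collector pick the output up without extra configuration in the app.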

Slide 27

Slide 27 text

kudos to okd.io and elastic.co

Slide 28

Slide 28 text

Distributed tracing and debugging
• Roots: need to overcome limitations of “time-synced logs”
• Specifications: OpenCensus and OpenTracing
• Tooling: Zipkin, Jaeger, Stackdriver
• A must-have in a microservices setup
• Debugging: use KubeSquash

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Apps run in pods, and pods sometimes fail …

Slide 31

Slide 31 text

What’s (in) a pod?

Slide 32

Slide 32 text

[Flowchart: pod lifecycle across control plane and worker node, steps numbered 1 to 9 in the original figure]
Lifecycle steps:
• kubectl apply
• API Server stores desired state
• Scheduler sees new pod, selects node
• Scheduler assigns pod to a fitting node
• kubelet watches API server and notices new pod
• kubelet asks container runtime via CRI to launch container(s)
• container runtime pulls image
• container runtime runs images
• kubelet takes over pod lifecycle (probes)
• pod runs until deleted or evicted
• garbage collection
Checks along the way:
• etcd happy? If NO: ask cluster admin
• does the pod get scheduled? If NO: fork out more $$$
• container runtime happy? If NO: ask cluster admin
• can access container registry? If NO: fix access to registry
• is container starting up? (init containers) If NO: debug app
• container crashing after startup? If YES: debug app
• probes fine? no leaking resources? If NO: soak testing, monitoring
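One way to use the flowchart is to map the STATUS column of `kubectl get pods` to the check it most likely corresponds to. The status strings below are ones kubectl reports; the mapping to flowchart checks is an interpretation, not something the deck spells out, and the names `NEXT_STEP` and `next_step` are hypothetical.

```python
from typing import Dict

# Assumed mapping from a pod status (as shown by `kubectl get pods`)
# to the flowchart check worth looking at first.
NEXT_STEP: Dict[str, str] = {
    "Pending": "does the pod get scheduled? (check events and node capacity)",
    "ErrImagePull": "can access container registry? (check image name and pull secrets)",
    "ImagePullBackOff": "can access container registry? (check image name and pull secrets)",
    "Init:CrashLoopBackOff": "is container starting up? (debug the init containers)",
    "CrashLoopBackOff": "container crashing after startup? (debug app, read the logs)",
    "OOMKilled": "probes fine? no leaking resources? (check memory limits, monitor)",
}


def next_step(status: str) -> str:
    """Return the suggested check for a pod status, or a generic fallback."""
    return NEXT_STEP.get(
        status,
        "no shortcut: observe (events, describe, logs) and form hypotheses",
    )


if __name__ == "__main__":
    print(next_step("ImagePullBackOff"))
```

The lookup is only a starting point for the Observe step; it does not replace reading the events and logs.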

Slide 33

Slide 33 text


Slide 34

Slide 34 text

invalid YAML
https://stackoverflow.com/questions/43532990/kubernetes-error-validating-data-found-invalid-field-env-for-v1-podspec

Slide 35

Slide 35 text

permission issues
https://stackoverflow.com/questions/52095161/accessing-kubernetes-api-from-pod-fails-although-roles-are-configured-is-configu

Slide 36

Slide 36 text

Authentication & authorization
Authentication:
• static password/token file
• X509 client certs
• proxy+header
• OpenID Connect
• custom via Webhook
Authorization:
• Node (kubelet)
• ABAC (outdated)
• RBAC
• Webhook (external)

Slide 37

Slide 37 text

Access control (RBAC) and policies
• Use kubectl auth can-i to check RBAC permissions
• Make yourself familiar with:
  • Pod Security Policies, might constrain your app too much
  • Network Policies, might be too strict for your app’s communication needs
• See kubernetes-security.info
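A small sketch of how a `kubectl auth can-i` check might be assembled, including impersonating a service account with kubectl's `--as` flag. The helper name `can_i_command` and the example namespace `dev` and service account `builder` are hypothetical; the function only builds the command string.

```python
import shlex
from typing import Optional


def can_i_command(verb: str, resource: str, namespace: str,
                  service_account: Optional[str] = None) -> str:
    """Build a `kubectl auth can-i` command, optionally for a service account."""
    cmd = ["kubectl", "auth", "can-i", verb, resource, "-n", namespace]
    if service_account:
        # Service accounts are addressed as system:serviceaccount:<namespace>:<name>.
        cmd.append(f"--as=system:serviceaccount:{namespace}:{service_account}")
    return shlex.join(cmd)


if __name__ == "__main__":
    print(can_i_command("list", "pods", "dev", service_account="builder"))
```

Checking as the service account the pod actually runs under, rather than as yourself, is what matters when an app gets "forbidden" responses from the API server.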

Slide 38

Slide 38 text

not scheduled
https://stackoverflow.com/questions/48495263/scheduler-is-not-scheduling-pod-for-daemonset-in-master-node

Slide 39

Slide 39 text

[Flowchart: pod lifecycle across control plane and worker node, repeated from slide 32]

Slide 40

Slide 40 text

registry issues
https://stackoverflow.com/questions/51139988/trying-to-create-a-kubernetes-deployment-but-it-shows-0-pods-available

Slide 41

Slide 41 text

wrong image
https://stackoverflow.com/questions/45436075/kubernetes-docker-no-image-found-error-while-rolling-update/45436799

Slide 42

Slide 42 text

I think I’m having image issues …
kubectl get events to the rescue?

Slide 43

Slide 43 text

startup issues

Slide 44

Slide 44 text

no long-running app
https://stackoverflow.com/questions/41604499/my-kubernetes-pods-keep-crashing-with-crashloopbackoff-but-i-cant-find-any-lo

Slide 45

Slide 45 text

Dunno, just keeps crashing …
kubectl describe and exec

Slide 46

Slide 46 text

Oh my Lanta, something’s wrong with the app …
kubectl logs

Slide 47

Slide 47 text

Networking

Slide 48

Slide 48 text

localhost?
https://stackoverflow.com/questions/51662015/service-not-exposing-in-kubernetes

Slide 49

Slide 49 text

wrong port
https://stackoverflow.com/questions/52289583/kubernetes-can-not-curl-minikube-pod/52289956

Slide 50

Slide 50 text

What and how
• Container networking in Kubernetes (CNI)
• App-level or infra (CNI, DNS, etc.)?
• See mhausenblas.info/cn-ref

Slide 51

Slide 51 text

Stateful apps

Slide 52

Slide 52 text

node-local volume issue
https://stackoverflow.com/questions/51206154/how-to-find-out-why-mounting-an-emptydir-volume-fails-in-kubernetes

Slide 53

Slide 53 text

PV/PVC issue
https://stackoverflow.com/questions/53476478/persistentvolumeclaim-fails-to-create-on-alicloud-kubernetes/53479574

Slide 54

Slide 54 text

What and how
• Storage in Kubernetes (CSI)
• Understand storage offerings (vendor docs!)
• Failure modes
• See stateful.kubernetes.sh

Slide 55

Slide 55 text

Vaccination

Slide 56

Slide 56 text

Quizzie time!
You wrote an application server. For load-balancing purposes, where would you put a reverse proxy such as NGINX?
A. Into the container (same Dockerfile)
B. Into a sidecar container (same pod)
C. Into a separate pod

Slide 57

Slide 57 text

Proactive measures
Architect your apps the cloud native way by …
• knowing and using the Kubernetes primitives (services, deployments)
• implementing retries & timeouts (in-tree or via service mesh)
• avoiding hardcoded (start-up) dependencies
• listening on 0.0.0.0 (not 127.0.0.1)
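A minimal sketch of in-app retries, assuming exponential backoff and a callable that enforces its own timeout and raises `ConnectionError` or `TimeoutError` on failure. The function name `call_with_retries` and its defaults are illustrative; a service mesh can provide the same behaviour without code changes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on connection/timeout errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)   # wait 0.5s, 1s, 2s, ... between tries
    raise AssertionError("unreachable")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_dependency() -> str:
        # Stand-in for a dependency that is not ready on the first two calls.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("dependency not ready yet")
        return "connected"

    print(call_with_retries(flaky_dependency, base_delay=0.01))
```

Retrying at call time, instead of insisting that a dependency is up at start-up, is also one way to avoid the hardcoded start-up dependencies mentioned above.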

Slide 58

Slide 58 text

Proactive measures
• Apply chaos engineering while all is well, and learn from it where and how your system fails
• Provide debug tools in the image, but mind footprint and security!
• Automate all the things: Autoscaler, Brigade, Draft, Forge, Helm, knative, ksync, odo, Operators, Skaffold, watchpod, etc.

Slide 59

Slide 59 text

Resources

Slide 60

Slide 60 text

Kubernetes Security: Operating Kubernetes Clusters and Applications Safely
Liz Rice & Michael Hausenblas

Slide 61

Slide 61 text

Articles, slide decks, videos
• Kubernetes Troubleshooting site
• Debugging microservices - Squash vs. Telepresence
• Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)
• Troubleshooting Kubernetes Using Logs
• Debug a Go Application in Kubernetes from IDE
• Troubleshooting Kubernetes Networking Issues
• Video: CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster
• Video: Troubleshooting & Debugging Microservices in Kubernetes
• Slide deck: Evolution of Monitoring and Prometheus

Slide 62

Slide 62 text

Articles, slide decks, videos
• 10 Most Common Reasons Kubernetes Deployments Fail: Part 1 and Part 2
• Kubernetes Application Operator Basics
• Kubernetes: five steps to well-behaved apps
• Kubernetes Best Practices
• Developing on Kubernetes
• Debugging Microservices: How Google SREs Resolve Outages
• Debugging Microservices: Lessons from Google, Facebook, Lyft
• Troubleshooting Java applications on OpenShift
• Debugging Kubernetes PVCs

Slide 63

Slide 63 text

Official Kubernetes docs
• kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
• kubernetes.io/docs/tasks/debug-application-cluster/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
• kubernetes.io/docs/tasks/debug-application-cluster/debug-stateful-set/
• kubernetes.io/docs/tasks/debug-application-cluster/local-debugging/

Slide 64

Slide 64 text

plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
learn.openshift.com