KubeCologne keynote—Troubleshooting Kubernetes apps

Michael Hausenblas

February 08, 2019

Transcript

  1. Troubleshooting Kubernetes apps
    Michael Hausenblas @mhausenblas

    Developer Advocate, Red Hat

    2019-02-08, KubeCologne
    troubleshooting.kubernetes.sh

  2. Hit me up on Twitter: @mhausenblas
    Scope
    • Focusing on prototyping, developing, and testing applications with Kubernetes from an appops perspective (tools & techniques)
    • But not really (much) about …
      • troubleshooting installation or upgrading issues
      • performance testing or optimising containerized microservices
      • SRE-style troubleshooting (check out what Googlers say on this topic)

  3. Monoliths vs. microservices
    [Figure: over time, the monolith moves from v1 to v2 as a whole, while the microservices µS1, µS2, and µS3 are versioned independently (µS1 reaches v4, µS2 reaches v5, µS3 reaches v6).]

  4. [No text on this slide]

  5. Moving parts: logical view

  6. Moving parts: physical view

  7. TOP 10 failures

  8. The TOP 10 list
    1. invalid YAML specification
    2. wrong or missing permissions
    3. wrong container image
    4. no access to container registry
    5. supposedly long-running application exits

  9. The TOP 10 list (continued)
    6. missing/bad config or secret
    7. lifecycle issues (probes fail)
    8. can’t reach service
    9. looking in the wrong place: where is localhost?
    10. failed mounts

  10. Not just poking around …

  11. The OODA loop
    • Observe: What’s in the logs? Establish a baseline.
    • Orient: Formulate hypotheses. Don’t jump to conclusions.
    • Decide: Sort hypotheses by likelihood. Pick one of the hypotheses.
    • Act: Test the hypothesis you picked. If confirmed, fix it; otherwise continue.
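
The Decide and Act steps of the OODA loop can be sketched in a few lines of Python. This is only an illustration: the `Hypothesis` type, the likelihood numbers, and the stand-in test functions are assumptions, chosen to mirror the crashing-pod example on the following slides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    description: str
    likelihood: float            # subjective estimate between 0 and 1 (assumed)
    test: Callable[[], bool]     # returns True if the hypothesis is confirmed


def first_confirmed(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Decide: sort by likelihood. Act: test one at a time until one is confirmed."""
    for hypothesis in sorted(hypotheses, key=lambda h: h.likelihood, reverse=True):
        if hypothesis.test():
            return hypothesis    # confirmed: this is the one to fix
    return None                  # nothing confirmed: go back to Observe


# Hypothetical hypotheses for a pod that keeps crashing after launch.
# The lambdas stand in for real checks (reading logs, describing the pod, ...).
candidates = [
    Hypothesis("resource issue (OOM, etc.)", 0.3, lambda: False),
    Hypothesis("application logic/runtime error", 0.5, lambda: True),
    Hypothesis("config/data missing", 0.2, lambda: False),
]

confirmed = first_confirmed(candidates)
print(confirmed.description if confirmed else "nothing confirmed, observe again")
```

Returning `None` rather than raising keeps the loop explicit: an empty result means going back to Observe with new data.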

  12. [No text on this slide]

  13. [No text on this slide]

  14.
    • Deployment seems OK
    • Pod seems OK (image found, scheduled, launched)
    • I see log output, so the container is running
    • Keeps crashing after launch

  15.
    • Could be a resource issue (OOM, etc.)
    • Could be config/data missing
    • Could be an application logic/runtime error

  16.
    1. Could be an application logic/runtime error
    2. Could be a resource issue (OOM, etc.)
    3. Could be config/data missing

  17.
    command:
    - sh
    - '-c'
    # sleep keeps the container running for 1000 seconds after the echo
    - echo "I will just print something here and then exit" && sleep 1000

  18. The How
    • Using kubectl get events
    • Using kubectl describe
    • Using kubectl exec
    • Using kubectl logs (or kubetail, stern)
    • Full-blown observability approaches
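
For orientation, here is how the kubectl commands listed on this slide might be put together for a single pod. This is a minimal sketch in Python that only builds the argument lists and never contacts a cluster. The helper name `triage_commands`, the pod name `web-0`, and the namespace `demo` are hypothetical; the flags used (`-n`, `--sort-by`, `--previous`, `-it`) are standard kubectl options.

```python
import shlex
from typing import List


def triage_commands(pod: str, namespace: str = "default") -> List[List[str]]:
    """Return kubectl invocations for a first look at a misbehaving pod."""
    ns = ["-n", namespace]
    return [
        # Events in the namespace, oldest first (scheduling, image pulls, probes).
        ["kubectl", "get", "events", "--sort-by=.metadata.creationTimestamp", *ns],
        # Pod status, container states, and the events that belong to this pod.
        ["kubectl", "describe", "pod", pod, *ns],
        # Logs of the current container instance ...
        ["kubectl", "logs", pod, *ns],
        # ... and of the previous instance, which helps after a crash.
        ["kubectl", "logs", pod, "--previous", *ns],
        # Interactive shell in the container (assumes the image ships a shell).
        ["kubectl", "exec", "-it", pod, *ns, "--", "sh"],
    ]


for argv in triage_commands("web-0", "demo"):
    print(shlex.join(argv))
```

Keeping each command as an argument list makes it easy to pass to `subprocess.run` later without shell quoting issues.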

  19. Observability

  20. Metrics
    [Diagram labels: node, container runtime, app, alerts, dashboards, storage, event router]

  21. Metrics
    • Out-of-the-box low-level metrics (CPU, memory)
    • Application-specific metrics (full-blown instrumentation vs. service mesh-based approaches)
    • Options
      • Roll your own, use the industry standards Prometheus + Grafana
      • Cloud provider native
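
To make "application-specific metrics" a bit more concrete, the following sketch shows what hand-rolled instrumentation could emit in the Prometheus text exposition format. It uses only the Python standard library. The metric name `app_requests_total` and the helper functions are assumptions; in a real app you would more likely reach for a Prometheus client library than format the lines yourself.

```python
from collections import Counter

# Hypothetical application metric: number of handled requests per path.
requests_total = Counter()


def observe_request(path: str) -> None:
    """Instrumentation hook: call once per handled request."""
    requests_total[path] += 1


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Handled requests by path.",
        "# TYPE app_requests_total counter",
    ]
    for path, count in sorted(requests_total.items()):
        lines.append(f'app_requests_total{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


observe_request("/")
observe_request("/")
observe_request("/healthz")
print(render_metrics(), end="")
```

An app would typically serve this text on an HTTP endpoint that Prometheus scrapes; a service mesh can provide request-level metrics without any such code in the app.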

  22. kudos to okd.io

  23. kudos to linkerd.io/2 and grafana.com

  24. kudos to linkerd.io/2

  25. Aggregated logs

  26. Aggregated logs
    • In the app, log to stdout or, if you can’t, use an adapter
    • Options
      • Roll your own, use the industry standards: ELK/EFK stack
      • Cloud provider native such as CloudWatch or Stackdriver
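
Logging to stdout from within the app could look like the sketch below, which writes one JSON object per line so that a log aggregator can parse the output. The `JsonFormatter` class, the `make_logger` helper, and the field names are assumptions for illustration; the deck does not prescribe a log format.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def make_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes to stdout (or to `stream`, e.g. in tests)."""
    # StreamHandler defaults to stderr, so stdout is passed explicitly.
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()      # avoid duplicate lines if called twice
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False     # keep the root logger from printing it again
    return logger


make_logger("shop").info("order %s accepted", 42)
```

Because the container runtime captures stdout, the same lines are what `kubectl logs` shows and what a node-level log collector would pick up.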

  27. kudos to okd.io and elastic.co

  28. Distributed tracing and debugging
    • Roots: need to overcome limitations of “time-synced logs”
    • Specifications: OpenCensus and OpenTracing
    • Tooling: Zipkin, Jaeger, Stackdriver
    • A must-have in a microservices setup
    • Debugging: use KubeSquash

  29. [No text on this slide]

  30. Apps run in pods, and pods sometimes fail …

  31. What’s (in) a pod?

  32. [Flowchart: the path of a pod from kubectl apply to garbage collection, and the checks along the way]
    Stages
    • Control plane: kubectl apply; API Server stores desired state; Scheduler sees new pod, selects node; Scheduler assigns pod to a fitting node
    • Worker node: kubelet watches API server and notices new pod; kubelet asks container runtime via CRI to launch container(s); container runtime pulls image; container runtime runs images; kubelet takes over pod lifecycle (probes); pod runs until deleted or evicted; garbage collection
    Checks
    • etcd happy? If not: ask cluster admin
    • does the pod get scheduled? If not: fork out more $$$
    • container runtime happy? If not: ask cluster admin
    • can access container registry? If not: fix access to registry
    • is container starting up? (init containers) If not: debug app
    • container crashing after startup? If so: debug app
    • probes fine?
    • no leaking resources? If not: soak testing, monitoring
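
Some of the checks in this flowchart line up with status reasons that Kubernetes reports for a pod or its containers. The lookup below is a rough sketch of that idea. The reason strings (`Unschedulable`, `ErrImagePull`, `ImagePullBackOff`, `CrashLoopBackOff`) are real Kubernetes values, but pairing them with the flowchart branches is an assumption made here, and an image pull failure can just as well mean a wrong image name as a registry access problem.

```python
# Status reasons reported by Kubernetes, paired with the check and remedy from
# the flowchart. The pairing is an assumption made for this sketch.
NEXT_STEP = {
    "Unschedulable": "does the pod get scheduled? -> fork out more $$$",
    "ErrImagePull": "can access container registry? -> fix access to registry",
    "ImagePullBackOff": "can access container registry? -> fix access to registry",
    "CrashLoopBackOff": "container crashing after startup? -> debug app",
}


def next_step(reason: str) -> str:
    """Return the flowchart branch to follow for a given status reason."""
    return NEXT_STEP.get(reason, "no match: observe more (events, describe, logs)")


for reason in ("ImagePullBackOff", "CrashLoopBackOff", "Completed"):
    print(f"{reason}: {next_step(reason)}")
```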

  33. [No text on this slide]

  34. invalid YAML
    https://stackoverflow.com/questions/43532990/kubernetes-error-validating-data-found-invalid-field-env-for-v1-podspec

  35. permission issues
    https://stackoverflow.com/questions/52095161/accessing-kubernetes-api-from-pod-fails-although-roles-are-configured-is-configu

  36. Authentication & authorization
    Authentication
    • static password/token file
    • X509 client certs
    • proxy+header
    • OpenID Connect
    • custom via Webhook
    Authorization
    • Node (kubelet)
    • ABAC (outdated)
    • RBAC
    • Webhook (external)

  37. Access control (RBAC) and policies
    • Use kubectl auth can-i to check RBAC permissions
    • Make yourself familiar with:
      • Pod Security Policies, which might constrain your app too much
      • Network Policies, which might be too strict for your app’s communication needs
    • See kubernetes-security.info
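
A permission check with kubectl auth can-i might be assembled as in the sketch below. It only builds the command line and does not run it. The helper name `can_i`, the namespace `demo`, and the service account `default` are hypothetical; `--as` is kubectl's impersonation flag, and using it requires that your own user is allowed to impersonate.

```python
from typing import List, Optional


def can_i(verb: str, resource: str, namespace: str,
          service_account: Optional[str] = None) -> List[str]:
    """Build a `kubectl auth can-i` call, optionally impersonating a service account."""
    argv = ["kubectl", "auth", "can-i", verb, resource, "-n", namespace]
    if service_account is not None:
        # Service accounts are addressed as system:serviceaccount:<namespace>:<name>.
        argv.append(f"--as=system:serviceaccount:{namespace}:{service_account}")
    return argv


print(" ".join(can_i("list", "pods", "demo", service_account="default")))
```

Running the resulting command answers with yes or no, which makes it a quick way to test the hypothesis "wrong or missing permissions" for the identity your pod actually runs as.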

  38. not scheduled
    https://stackoverflow.com/questions/48495263/scheduler-is-not-scheduling-pod-for-daemonset-in-master-node

  39. [Flowchart repeated from slide 32]

  40. registry issues
    https://stackoverflow.com/questions/51139988/trying-to-create-a-kubernetes-deployment-but-it-shows-0-pods-available

  41. wrong image
    https://stackoverflow.com/questions/45436075/kubernetes-docker-no-image-found-error-while-rolling-update/45436799

  42. I think I’m having image issues …
    kubectl get events to the rescue?

  43. startup issues

  44. no long-running app
    https://stackoverflow.com/questions/41604499/my-kubernetes-pods-keep-crashing-with-crashloopbackoff-but-i-cant-find-any-lo

  45. Dunno, just keeps crashing …
    kubectl describe and exec

  46. Oh my Lanta, something’s wrong with the app …
    kubectl logs

  47. Networking

  48. localhost?
    https://stackoverflow.com/questions/51662015/service-not-exposing-in-kubernetes
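
One common form of the localhost problem, and presumably the reason the deck later recommends listening on 0.0.0.0 rather than 127.0.0.1, is an app that binds to the loopback address only and therefore cannot be reached via the pod IP. The sketch below illustrates the effect of the bind address with plain sockets. The helper name `reachable` is hypothetical, and everything runs on the local machine, not in a cluster.

```python
import socket


def reachable(bind_host: str, connect_host: str) -> bool:
    """Listen on bind_host (ephemeral port) and try to connect via connect_host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((bind_host, 0))          # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.settimeout(1.0)
            try:
                client.connect((connect_host, port))
                return True
            except OSError:                  # refused, unreachable, or timed out
                return False


# A listener on all interfaces accepts connections that arrive via loopback.
print(reachable("0.0.0.0", "127.0.0.1"))
```

On a typical Linux host, `reachable("127.0.0.1", "127.0.0.2")` is expected to return False, because a listener bound to one specific address does not accept connections addressed to another one. The same applies to a containerized app bound to 127.0.0.1 when traffic arrives on the pod IP.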

  49. wrong port
    https://stackoverflow.com/questions/52289583/kubernetes-can-not-curl-minikube-pod/52289956

  50. What and how
    • Container networking in Kubernetes (CNI)
    • App-level or infra (CNI, DNS, etc.)?
    • See mhausenblas.info/cn-ref

  51. Stateful apps

  52. node-local volume issue
    https://stackoverflow.com/questions/51206154/how-to-find-out-why-mounting-an-emptydir-volume-fails-in-kubernetes

  53. PV/PVC issue
    https://stackoverflow.com/questions/53476478/persistentvolumeclaim-fails-to-create-on-alicloud-kubernetes/53479574

  54. What and how
    • Storage in Kubernetes (CSI)
    • Understand storage offerings (vendor docs!)
    • Failure modes
    • See stateful.kubernetes.sh

  55. Vaccination

  56. Quizzie time!
    You wrote an application server. For load-balancing purposes, where would you put a reverse proxy such as NGINX?
    A. Into the container (same Dockerfile)
    B. Into a sidecar container (same pod)
    C. Into a separate pod

  57. Proactive measures
    Architect your apps the cloud native way by …
    • knowing and using the Kubernetes primitives (services, deployments)
    • implementing retries & timeouts (in-tree or via service mesh)
    • avoiding hardcoded (start-up) dependencies
    • listening on 0.0.0.0 (not 127.0.0.1)
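
Retries and timeouts implemented in the app itself could look roughly like this. The sketch retries an operation with exponential backoff and gives up after a number of attempts or an overall deadline. The function name `call_with_retries` and all default values are assumptions, and per-call timeouts (for example on an HTTP request) would still have to be set inside the operation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    deadline: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff until it succeeds,
    the attempts are used up, or the overall deadline (seconds) has passed."""
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            out_of_time = time.monotonic() - start >= deadline
            if attempt == attempts - 1 or out_of_time:
                raise                          # give up and surface the last error
            sleep(base_delay * 2 ** attempt)   # 0.1 s, 0.2 s, 0.4 s, ...
    raise ValueError("attempts must be at least 1")
```

Wrapping the connection to a dependency in such a call is one way to avoid a hard start-up dependency: the app keeps trying for a while instead of crashing when the other service is not ready yet. The `sleep` parameter exists so that tests can run without waiting.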

  58. Proactive measures
    • Apply chaos engineering as long as all is well and learn from it where and how your system fails
    • Provide debug tools in the image, but mind footprint and security!
    • Automate all the things: Autoscaler, Brigade, Draft, Forge, Helm, knative, ksync, odo, Operators, Skaffold, watchpod, etc.

  59. Resources

  60. Kubernetes Security: Operating Kubernetes Clusters and Applications Safely
    Liz Rice & Michael Hausenblas

  61. Articles, slide decks, videos
    • Kubernetes Troubleshooting site
    • Debugging microservices - Squash vs. Telepresence
    • Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)
    • Troubleshooting Kubernetes Using Logs
    • Debug a Go Application in Kubernetes from IDE
    • Troubleshooting Kubernetes Networking Issues
    • Video: CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster
    • Video: Troubleshooting & Debugging Microservices in Kubernetes
    • Slide deck: Evolution of Monitoring and Prometheus

  62. Articles, slide decks, videos (continued)
    • 10 Most Common Reasons Kubernetes Deployments Fail: Part 1 and Part 2
    • Kubernetes Application Operator Basics
    • Kubernetes: five steps to well-behaved apps
    • Kubernetes Best Practices
    • Developing on Kubernetes
    • Debugging Microservices: How Google SREs Resolve Outages
    • Debugging Microservices: Lessons from Google, Facebook, Lyft
    • Troubleshooting Java applications on OpenShift
    • Debugging Kubernetes PVCs

  63. Official Kubernetes docs
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
    • kubernetes.io/docs/tasks/debug-application-cluster/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-stateful-set/
    • kubernetes.io/docs/tasks/debug-application-cluster/local-debugging/

  64. plus.google.com/+RedHat
    linkedin.com/company/red-hat
    youtube.com/user/RedHatVideos
    facebook.com/redhatinc
    twitter.com/RedHatNews
    learn.openshift.com