Troubleshooting Kubernetes Applications

Michael Hausenblas
September 30, 2018

Talk at O'Reilly Velocity New York, see http://troubleshooting.kubernetes.sh

Transcript

  1. Troubleshooting Kubernetes Applications
    Michael Hausenblas @mhausenblas

    Developer Advocate, Red Hat

    2018-10-03, O’Reilly Velocity NYC
    troubleshooting.kubernetes.sh

  2. Hit me up on Twitter: @mhausenblas
    Scope
    • Focusing on prototyping, developing, and testing applications with Kubernetes from an appops perspective
    • Beware, the talk is not really (much) about …
      • troubleshooting installation issues
      • performance testing or optimising containerized microservices
      • SRE-style troubleshooting (check out what Googlers say on this topic)

  5. Monoliths vs. microservices
    [Diagram: release timelines. Over the same period the monolith ships v1 and v2, while µS1 ships v1 to v4, µS2 ships v1 to v5, and µS3 ships v1 to v6.]

  6. Moving parts: physical view

  7. Moving parts: logical view

  8. Honestly, not much new under the sun …
    www.computerhistory.org/tdih/september/9/

  9. Honestly, not much new under the sun …
    www.ibiblio.org/harris/500milemail.html

  10. TOP 10 failures

  12. The TOP 10 list
    1. invalid YAML specification (EXAMPLE)
    2. can’t reach service/pod (EXAMPLE 1, EXAMPLE 2)
    3. wrong container image (EXAMPLE)
    4. no access to container registry (EXAMPLE)
    5. supposedly long-running application exits (EXAMPLE)

  13. The TOP 10 list
    6. missing/bad config or secret (EXAMPLE)
    7. failed mounts (EXAMPLE)
    8. lifecycle issues/probes fail (EXAMPLE)
    9. wrong or missing permissions (EXAMPLE)
    10. looking at the wrong place aka: where is localhost? (EXAMPLE)
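
Item 10 ("where is localhost?") often comes down to the address a server binds to. A process listening on 127.0.0.1 is reachable only from inside the pod's own network namespace, not via the pod IP that services and other pods connect to. The following is a minimal sketch, assuming a plain TCP server; the `open_listener` helper is hypothetical, and port 0 simply asks the OS for any free port.

```python
import socket


def open_listener(host: str, port: int = 0) -> socket.socket:
    """Open a listening TCP socket on host:port (port 0 means any free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


if __name__ == "__main__":
    # Loopback only: reachable from within the pod's network namespace,
    # but not through the pod IP.
    loopback = open_listener("127.0.0.1")

    # All interfaces: also reachable through the pod IP.
    everywhere = open_listener("0.0.0.0")

    print("loopback listener bound to", loopback.getsockname())
    print("all-interfaces listener bound to", everywhere.getsockname())

    loopback.close()
    everywhere.close()
```

When debugging from a shell inside the container, the same distinction applies in reverse: `localhost` there means the pod, not your workstation and not the node.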

  14. Poking around

  15. OODA loop
    Observe
    • What’s in the logs?
    • What’s the baseline?
    Orient
    • Formulate hypotheses.
    • Don’t jump to conclusions.
    Decide
    • Sort hypotheses by likelihood.
    • Pick one of the hypotheses.
    Act
    • Test the hypothesis you picked.
    • If confirmed: fix it, else: continue.

  16. The How
    • Using kubectl get events
    • Using kubectl describe
    • Using kubectl exec
    • Using kubectl logs
    • Full-blown observability approaches (note: we get back to that later)
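
The four kubectl subcommands above can be collected into a first-look checklist. The sketch below only builds the command lines and prints them, so it runs without a cluster. The `triage_commands` helper, the pod name, and the namespace are hypothetical; the kubectl flags used (`--namespace`, `--sort-by`, `--previous`, `-it`) are standard ones.

```python
import shlex


def triage_commands(pod: str, namespace: str = "default") -> list:
    """Return kubectl invocations (as argv lists) for a first look at a pod."""
    ns = ["--namespace", namespace]
    return [
        # Recent events in the namespace, oldest first.
        ["kubectl", "get", "events", *ns, "--sort-by=.metadata.creationTimestamp"],
        # Spec, conditions, container states, and pod-level events.
        ["kubectl", "describe", "pod", pod, *ns],
        # Output of the current container, then of the previous (crashed) one.
        ["kubectl", "logs", pod, *ns],
        ["kubectl", "logs", pod, *ns, "--previous"],
        # Interactive shell in the container (requires a shell in the image).
        ["kubectl", "exec", "-it", pod, *ns, "--", "sh"],
    ]


if __name__ == "__main__":
    for argv in triage_commands("my-app-7d4b9c", "dev"):  # hypothetical names
        print(shlex.join(argv))
```

Keeping the commands as argv lists means they can later be passed to `subprocess.run` unchanged, without shell quoting issues.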

  17. Apps run in pods, and pods sometimes fail …

  18. What’s (in) a pod?

  19. [Flowchart: pod lifecycle across control plane and worker node, with troubleshooting checks]
    Pod lifecycle
    • kubectl apply
    • API Server stores desired state
    • Scheduler sees new pod, selects node
    • Scheduler assigns pod to a fitting node
    • kubelet watches API server and notices new pod
    • kubelet asks container runtime via CRI to launch container(s)
    • container runtime pulls image
    • container runtime runs images
    • kubelet takes over pod lifecycle (probes)
    • pod runs until deleted or evicted
    • garbage collection
    Checks (YES/NO) and remedies
    • etcd happy? If not: ask cluster admin
    • does the pod get scheduled? If not: fork out more $$$
    • container runtime happy? If not: ask cluster admin
    • can access container registry? If not: fix access to registry
    • is container starting up? (init containers) If not: debug app
    • container crashing after startup? If so: debug app
    • probes fine? no leaking resources? (soak testing, monitoring)
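
The checks in this flowchart can be approximated by reading the pod's status. The sketch below assumes a pod object as printed by `kubectl get pod NAME -o json`; the `diagnose` function and the sample data are hypothetical, while the field names and waiting reasons (`PodScheduled`, `ErrImagePull`, `ImagePullBackOff`, `CrashLoopBackOff`, `CreateContainerConfigError`) are ones Kubernetes reports. It covers only a few branches and is not a replacement for `kubectl describe`.

```python
import json


def diagnose(pod: dict) -> str:
    """Map a pod object to a rough hint about where in the lifecycle it is stuck."""
    status = pod.get("status", {})

    # Does the pod get scheduled?
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return "not scheduled: " + cond.get("message", cond.get("reason", "unknown"))

    containers = status.get("containerStatuses", [])
    # Look at init containers first, then the regular containers.
    for cs in status.get("initContainerStatuses", []) + containers:
        name = cs.get("name")
        reason = cs.get("state", {}).get("waiting", {}).get("reason", "")
        if reason in ("ErrImagePull", "ImagePullBackOff"):
            return f"{name}: cannot pull image {cs.get('image')}; check image name and registry access"
        if reason == "CrashLoopBackOff":
            return f"{name}: crashing after startup ({cs.get('restartCount', 0)} restarts); check logs"
        if reason == "CreateContainerConfigError":
            return f"{name}: container config error; check config maps and secrets"

    # Probes fine?
    if status.get("phase") == "Running" and not all(cs.get("ready") for cs in containers):
        return "running but not ready: check readiness probe"
    return "no obvious problem in pod status"


# Hypothetical, heavily trimmed pod object for illustration.
sample = json.loads("""
{"status": {"phase": "Pending",
            "containerStatuses": [{"name": "web", "image": "example.org/web:v1",
                                   "ready": false, "restartCount": 0,
                                   "state": {"waiting": {"reason": "ImagePullBackOff"}}}]}}
""")
print(diagnose(sample))
```

In practice the input would come from `json.loads` on the output of `kubectl get pod NAME -o json`.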

  20. I think I’m having image issues …
    kubectl get events to the rescue?

  21. hands-on time!

  22. Dunno, just keeps crashing …
    kubectl describe and exec

  23. hands-on time!

  24. Oh my Lanta, something’s wrong with the app …
    kubectl logs

  25. hands-on time!

  26. What and how
    • Storage in Kubernetes (CSI)
    • Failure modes
    • See stateful.kubernetes.sh

  27. hands-on time!

  28. What and how
    • Container networking in Kubernetes (CNI)
    • Failure modes
    • See mhausenblas.info/cn-ref/

  30. hands-on time!

  31. Access control (RBAC) and policies
    • Use kubectl auth can-i to check RBAC permissions
    • Make yourself familiar with:
      • Pod Security Policies, which might constrain your app too much
      • Network Policies, which might be too strict for your app’s communication needs
    • See kubernetes-security.info
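
A permission check with `kubectl auth can-i` is most useful when run as the identity the app actually uses, typically a service account. The sketch below builds such a command and, optionally, runs it. The helper names, namespace, and service account are hypothetical; `--as` is kubectl's impersonation flag, and using it requires that your own user is allowed to impersonate.

```python
import shlex
import subprocess


def can_i_command(verb, resource, namespace, service_account=None):
    """Build a `kubectl auth can-i` invocation, optionally as a service account."""
    argv = ["kubectl", "auth", "can-i", verb, resource, "--namespace", namespace]
    if service_account:
        # Service accounts are addressed as system:serviceaccount:<ns>:<name>.
        argv.append(f"--as=system:serviceaccount:{namespace}:{service_account}")
    return argv


def can_i(argv, timeout=10.0) -> bool:
    """Run the command against the current cluster; kubectl answers yes or no."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip() == "yes"


if __name__ == "__main__":
    # Only print the command here; call can_i(...) when a cluster is reachable.
    print(shlex.join(can_i_command("list", "pods", "dev", "web")))
```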

  32. hands-on time!

  33. Observability

  34. Metrics
    [Diagram with components labelled: node, container runtime, app, storage, event router, alerts, dashboards]

  35. Metrics
    • Use industry standards in cloud native land: Prometheus + Grafana
    • Out-of-the-box low-level metrics (CPU, memory)
    • Options for app-specific (custom) metrics:
      • instrumentation of the app
      • service mesh-based approaches (aka: magic), for example Linkerd2
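
To make "instrumentation of the app" concrete, here is a minimal sketch of a labelled counter rendered in the Prometheus text exposition format, using only the standard library. The `Counter` class and the metric name are hypothetical stand-ins; a real application would normally use an official Prometheus client library and serve this text on a `/metrics` endpoint.

```python
import threading


def _escape(value: str) -> str:
    """Escape a label value as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class Counter:
    """A tiny thread-safe counter with labels (illustration only)."""

    def __init__(self, name: str, help_text: str):
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted tuple of (label, value) pairs -> count
        self._lock = threading.Lock()

    def inc(self, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + 1

    def render(self) -> str:
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            for key, value in sorted(self._values.items()):
                labels = ",".join(f'{k}="{_escape(v)}"' for k, v in key)
                suffix = "{" + labels + "}" if labels else ""
                lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


requests_total = Counter("app_requests_total", "Requests handled, by path and status.")
requests_total.inc(path="/", status="200")
requests_total.inc(path="/", status="200")
requests_total.inc(path="/cart", status="500")
print(requests_total.render(), end="")
```

Counting by status label is what later allows a dashboard or alert to ask for the error rate rather than only CPU and memory.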

  36. Kudos to demo.robustperception.io

  37. Kudos to linkerd.io/2

  38. Kudos to linkerd.io/2 and grafana.com

  39. Kudos to okd.io

  40. Aggregated logs

  41. Aggregated logs
    • In the app, log to stdout or, if you can’t, use an adapter
    • Use the industry standard in cloud native land: ELK/EFK stack
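
Logging to stdout lets the container runtime capture the output, so `kubectl logs` and a cluster-level log collector can pick it up without the app knowing about either. A minimal sketch with Python's standard `logging` module follows, emitting one JSON object per line. The `JsonFormatter` class, the field names, and the logger name are assumptions chosen for illustration, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            entry["exc"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or the given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger("shop")  # hypothetical application name
log.info("order %s accepted", 42)
```

One event per line keeps multi-line stack traces inside a single record, which makes them easier to search once aggregated.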

  42. Kudos to okd.io and elastic.co

  43. Distributed tracing
    • Roots: need to overcome limitations of “time-synced logs”
    • Specifications: OpenCensus and OpenTracing
    • Tooling: Zipkin, Jaeger, Stackdriver
    • A must-have in a microservices setup
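
The core idea behind tracing, as opposed to correlating logs by timestamp, is that every unit of work carries a shared trace ID plus a link to its parent. The sketch below illustrates that with plain Python objects. It does not use the OpenTracing or OpenCensus APIs; the `Span` class, `start_trace`, `render_tree`, and the service names are all hypothetical, and in a real system the IDs would travel between services in request headers.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    service: str
    operation: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None

    def child(self, service: str, operation: str) -> "Span":
        """Start a span in a downstream service; it inherits the trace ID."""
        return Span(service, operation, self.trace_id, parent_id=self.span_id)


def start_trace(service: str, operation: str) -> Span:
    """Create the root span of a new trace."""
    return Span(service, operation, trace_id=uuid.uuid4().hex)


def render_tree(spans: List[Span]) -> List[str]:
    """Rebuild the call tree from parent links, independent of any clocks."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f"{span.service}: {span.operation}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


root = start_trace("frontend", "GET /checkout")
cart = root.child("cart", "load cart")
db = cart.child("db", "SELECT items")
pay = root.child("payment", "charge card")

# Spans may reach a collector in any order; the tree is still recoverable.
print("\n".join(render_tree([db, pay, root, cart])))
```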

  45. hands-on time!

  46. Quizzie time!
    You wrote an application server. For load-balancing purposes, where would you put a reverse proxy such as NGINX?
    A. Into the container (same Dockerfile)
    B. Into a sidecar container (same pod)
    C. Into a separate pod

  47. Proactive measures
    Architect your apps the cloud native way by …
    • knowing and using the Kubernetes primitives (services, deployments)
    • implementing retries & timeouts (in-tree or via service mesh)
    • avoiding hardcoded (start-up) dependencies
    • listening on 0.0.0.0 (not 127.0.0.1)
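
Retries and timeouts implemented in the app itself might look like the sketch below: every call is bounded by a timeout, and failures are retried with exponential backoff and jitter. This also softens start-up dependencies, since a dependency that is not ready yet just costs a few retries. The `retry` and `fetch` helpers and their default values are assumptions for illustration, not tuned recommendations.

```python
import random
import time
import urllib.request


def retry(call, attempts=5, base_delay=0.2, max_delay=5.0,
          retry_on=(OSError,), sleep=time.sleep):
    """Call `call()` until it succeeds, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retry storms


def fetch(url: str, timeout: float = 2.0) -> bytes:
    """HTTP GET with an explicit timeout instead of waiting indefinitely."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Stand-in for a dependency that needs a moment to come up.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("dependency not ready yet")
        return "ok"

    print(retry(flaky, base_delay=0.01), "after", calls["n"], "attempts")
```

A real call would look like `retry(lambda: fetch(url))`. Note that `urllib` raises `HTTPError` (an `OSError` subclass) for 4xx responses too, so a production version would narrow `retry_on` to errors worth retrying.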

  48. Proactive measures
    • Apply chaos engineering while all is well, and learn from it where and how your system fails
    • Provide debug tools in the image, but keep footprint and security in mind
    • Automate all the things: Autoscaler, Brigade, Draft, Forge, Helm, knative, ksync, odo, Operators, Skaffold, watchpod, etc.

  49. hands-on time!

  50. Kubernetes Security: Operating Kubernetes Clusters and Applications Safely
    Liz Rice & Michael Hausenblas

  51. Articles
    • Kubernetes Troubleshooting site
    • GKE Troubleshooting docs
    • Debugging microservices - Squash vs. Telepresence
    • Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)
    • Troubleshooting Kubernetes Using Logs
    • Debug a Go Application in Kubernetes from IDE
    • Troubleshooting Kubernetes Networking Issues
    • Video: CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster (KubeCon NA 2017)
    • Slide deck: Evolution of Monitoring and Prometheus

  52. Articles
    • 10 Most Common Reasons Kubernetes Deployments Fail: Part 1 and Part 2
    • Kubernetes Application Operator Basics
    • Kubernetes: five steps to well-behaved apps
    • Kubernetes Best Practices
    • Developing on Kubernetes
    • Debugging Microservices: How Google SREs Resolve Outages
    • Debugging Microservices: Lessons from Google, Facebook, Lyft
    • Troubleshooting Java applications on OpenShift
    • Debugging Kubernetes PVCs

  53. Official Kubernetes docs
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
    • kubernetes.io/docs/tasks/debug-application-cluster/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-stateful-set/
    • kubernetes.io/docs/tasks/debug-application-cluster/local-debugging/

  54. plus.google.com/+RedHat
    linkedin.com/company/red-hat
    youtube.com/user/RedHatVideos
    facebook.com/redhatinc
    twitter.com/RedHatNews
    learn.openshift.com
