

Aoi Takahashi
June 16, 2025

The Grand Adventure of Production Apps: Build, Break, and Survive!

"I've started to understand the basics of Kubernetes, but when it comes to running it in production, I can't quite imagine what kind of issues might arise..."
To ease these concerns, this session will use original characters, illustrations, and animations in a “cute manga” style to visually demonstrate how production applications can break and how to troubleshoot them.
In our story, the main application is personified as the hero and sets out on a grand journey, only to be "suddenly attacked by a monster" at the most inopportune moment. By following this storyline, you will learn both how applications fail and how to fix them. Additionally, just as an adventurer equips better armor to prepare for future battles, we will explore common issues and explain how to prevent them from happening in the first place.
Join us on an exciting adventure in "manga-style troubleshooting" and gain the confidence to tackle production Kubernetes challenges head-on!


Transcript

  1. Build, Break, and Survive! ~ A Kawaii Manga Journey Through the Ups and Downs of Production Apps ~

    The Grand Adventure Of Production Apps
    AOI TAKAHASHI
  2. About me

    • Aoi Takahashi
    • Working as a Site Reliability Engineer at an IT company
    • I live with two dogs (Golden Retrievers) 🐶
    • Love reading and writing manga!
    • 𝕏: @_a0i
    • GitHub: github.com/aoi1
    • LinkedIn
  3. • I wrote a book, 「つくって、壊して、直して学ぶKubernetes入門」, which means “Build, Break, Fix: A Playful Way to Learn Kubernetes”

    • In this session, I’ll guide you through the world of building, breaking, and fixing, the very heart of the book this talk is based on.
  4. The Target of This Session

    • People who know the basics of Kubernetes.
      ◦ I will not explain what a Pod is, what a Service is, etc.
    • People who have little experience in running applications in production with Kubernetes.
    I hope to help these people run their applications with a little more confidence.
  5. Table of Contents

    Chapter 1. - The Beginning - Before the Hero Draws the Sword (kubectl)
    Chapter 2. - Level Up by Battling Monsters - Learn Troubleshooting Techniques for Real-World Production Issues
    Chapter 3. - And so the journey continues... - Recommended Books and Sessions for the Journey Ahead
  6. - The Beginning -

    The hero embarked on a quest to slay the monsters wreaking havoc on production applications. This is the tale of a lone hero who rose to save the production environment. Monsters known as incidents strike at applications, even today… Can the hero protect the service and emerge victorious...?
  7. Hero's Sword: kubectl

    The first thing you need to do is become proficient with kubectl, the most fundamental and versatile weapon.
  8. Don’t forget!

    You should specify the namespace; otherwise, kubectl uses the namespace of the current context, which is default unless you have changed it.
    🙁 kubectl get pods
    ✅ kubectl get pods --namespace mynamespace (short form: -n mynamespace)
    ✅ kubectl config set-context --current --namespace=mynamespace, then kubectl get pods
    ✅ kubens mynamespace, then kubectl get pods
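Before running a command without --namespace, it can help to confirm which namespace the current context points at. The following is a minimal sketch; the function name current_namespace is a hypothetical helper, not something from the talk.

```shell
# Print the namespace configured in the current kubectl context.
# Falls back to "default" when the context does not set one.
# (current_namespace is a hypothetical helper name.)
current_namespace() {
  ns="$(kubectl config view --minify --output 'jsonpath={..namespace}')"
  echo "${ns:-default}"
}

# Usage (needs a configured kubeconfig):
#   current_namespace
#   kubectl get pods --namespace "$(current_namespace)"
```

Printing the namespace explicitly is a cheap guard against running a command in the wrong place.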
  9. Other Tools

    kubectx: kubectx is a CLI tool that simplifies switching between multiple Kubernetes contexts. A context in Kubernetes defines the cluster, user, and namespace to use.
    kubens: kubens complements kubectx by simplifying namespace switching, so you do not have to specify --namespace repeatedly.
  10. Other Tools

    k9s: k9s is a powerful terminal-based UI that lets you interactively manage your Kubernetes cluster.
  11. Hero’s Learnings: Mastering the Basics of kubectl

    • kubectl is your primary weapon
      ◦ Get comfortable with the basics: kubectl get, kubectl describe
    • Always specify the correct namespace; don't get lost in the default realm!
  12. First monster: ImagePullBackOff

    Here, we’ll start by learning how to fight our first monster. We’ll do this by deliberately reproducing a common error, the ImagePullBackOff state, and then fixing it.
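One way to reproduce this monster on a test cluster is sketched below. Everything named here is an assumption for illustration: the Pod name hero, the namespace mynamespace (which must already exist), and the tag nginx:no-such-tag, which is assumed not to exist in the registry. The fix uses kubectl set image as a non-interactive stand-in for an interactive edit.

```shell
# Hypothetical names throughout: pod "hero", namespace "mynamespace".
# The tag "no-such-tag" is assumed not to exist, so the pull should fail.
reproduce_image_pull_error() {
  kubectl run hero --image=nginx:no-such-tag --namespace mynamespace
}

# Look at the Pod status, then at the events at the bottom of describe,
# which usually say why the pull failed.
inspect_hero() {
  kubectl get pod hero --namespace mynamespace
  kubectl describe pod hero --namespace mynamespace
}

# Point the container (named "hero" by kubectl run) at a tag that exists.
fix_image() {
  kubectl set image pod/hero hero=nginx:stable --namespace mynamespace
}

# Usage, on a disposable cluster only:
#   reproduce_image_pull_error && inspect_hero
#   fix_image && kubectl get pod hero --namespace mynamespace
```

The status typically shows ErrImagePull first and moves to ImagePullBackOff as the kubelet backs off between retries.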
  13. About kubectl edit

    Here, we used kubectl edit to fix the issue right from the command line. But a word of caution: kubectl edit makes direct changes to your live environment. In real-world operations, changes should flow through your deployment workflow, where proper manifests are applied safely, according to your team’s practices.
  14. - Level Up by Battling Monsters -

    What’s important in troubleshooting is being able to effectively repeat cycles of forming hypotheses and testing them. But in order to form good hypotheses, you also need to understand the big picture. For example, having a “map” of a typical setup for deploying a stateless application can help smooth your journey.
  15. Hero’s Learnings: Battling Common Incidents

    • Troubleshooting = Hypothesize → Test → Observe → Repeat
    • Know what to look for:
      ◦ Pod status
      ◦ Container states
      ◦ Logs & events
    • Misconfigured probes can hurt your app; use them wisely!
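The "what to look for" checklist above can be sketched as a single triage helper. The function name triage_pod and its argument order are hypothetical; the kubectl subcommands and flags are standard ones.

```shell
# Collect the usual first clues for one Pod.
# Usage: triage_pod NAMESPACE POD   (triage_pod is a hypothetical helper)
triage_pod() {
  ns="$1"
  pod="$2"
  # Pod status, restarts, node placement.
  kubectl get pod "$pod" --namespace "$ns" --output wide
  # Container states, probe settings, and recent events for the Pod.
  kubectl describe pod "$pod" --namespace "$ns"
  # Last 50 log lines from every container in the Pod.
  kubectl logs "$pod" --namespace "$ns" --all-containers --tail=50
  # Events that reference this Pod, oldest first.
  kubectl get events --namespace "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Usage: triage_pod mynamespace hero
```

Each command feeds one step of the hypothesize, test, observe loop: the output either supports the current hypothesis or suggests the next one.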
  16. - Upgrade Your Gear -

    Improving observability enables faster troubleshooting. Examples: the three pillars
    • Logs - Storing logs for a long period helps with historical investigations.
    • Metrics - Making metrics accessible helps with setting up and responding to alerts.
    • Traces - Making traces accessible allows you to identify which requests had issues.
  17. - And so the journey continues... -

    The journey is far from over. The troubleshooting methods introduced in this session are just a small part. Out in the real world, you'll face many more challenges. To overcome them, it's important to draw on the wisdom of those who came before you.
  18. - Recommended Books, Communities -

    • 10 Weird Ways to Blow Up Your Kubernetes - Melanie Cebula & Bruce Sherrod, Airbnb
      https://www.youtube.com/watch?v=FrQ8Lwm9_j8
    • 10 More Weird Ways to Blow Up Your Kubernetes - Jian Cheung & Joseph Kim, Airbnb
      https://www.youtube.com/watch?v=4CT0cI62YHk
    Books in English:
    • Cloud Native DevOps with Kubernetes: Building, Deploying, and Scaling Modern Applications in the Cloud by John Arundel, Justin Domingus
    • Kubernetes Patterns: Reusable Elements for Designing Cloud-Native Applications by Bilgin Ibryam, Roland Huß
    Books in Japanese:
    • Kubernetes完全ガイド (Kubernetes Complete Guide) by Aoyama Masaya
    Or earn a certification: CKAD, CKA
  19. END

  20. Hint: Sample of Status

    ErrImagePull: There was some issue when pulling an image.
    Error: The container has some error.
    OOMKilled: The container was killed because it ran out of memory (OOM).
  21. Hint: Setting the probes

    Liveness probe: Liveness probes determine when to restart a container. For example, liveness probes could catch a deadlock when an application is running but unable to make progress.
    Readiness probe: Readiness probes determine when a container is ready to start accepting traffic.
    Startup probe: A startup probe verifies whether the application within a container is started.
    Probes are originally intended to make applications more robust, but misconfigurations can sometimes prevent the application from starting. Of course, there are also cases where a probe intentionally keeps the application from starting or receiving traffic, which is the protection working as designed.
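To make the three probe types concrete, here is a minimal sketch that prints a Pod manifest with all of them set. The Pod name probe-demo, the nginx:stable image, the probed path, and every timing value are illustrative assumptions, not recommendations from the talk.

```shell
# Print a Pod manifest with startup, liveness, and readiness probes.
# All names and numbers below are assumptions chosen for illustration.
probe_demo_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx:stable
    ports:
    - containerPort: 80
    # Startup: allow up to 12 x 5s = 60s to start; the other probes
    # are held back until this one succeeds.
    startupProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
      failureThreshold: 12
    # Liveness: restart the container after 3 consecutive failures.
    livenessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 10
      failureThreshold: 3
    # Readiness: only route Service traffic while this succeeds.
    readinessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
EOF
}

# Usage: probe_demo_manifest | kubectl apply --namespace mynamespace -f -
```

A liveness probe that is stricter than the application's real startup time is a common way to end up in a restart loop, which is why the startup probe gets the generous budget here.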
  22. Hint: Pod Phases & Container States

    Pending: The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
    Running: The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.
    Succeeded: All containers in the Pod have terminated in success, and will not be restarted.
    Failed: All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system, and is not set for automatic restarting.
    Unknown: For some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running.
  23. Hint: Container States

    Waiting: If a container is not in either the Running or Terminated state, it is Waiting. The container is still running the operations it requires in order to complete start up.
    Running: The Running status indicates that a container is executing without issues.
    Terminated: A container in the Terminated state began execution and then either ran to completion or failed for some reason.
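The Pod phase and the per-container states described in these hints can be read straight from the Pod's status with a jsonpath query. The sketch below assumes hypothetical helper names (pod_phase, container_states, last_termination_reason); the status fields they read are part of the standard Pod API.

```shell
# Each helper takes: NAMESPACE POD   (helper names are hypothetical)

# Print the Pod phase: Pending, Running, Succeeded, Failed, or Unknown.
pod_phase() {
  kubectl get pod "$2" --namespace "$1" --output 'jsonpath={.status.phase}'
}

# Print one line per container: its name and its current state
# (waiting, running, or terminated, with details).
container_states() {
  kubectl get pod "$2" --namespace "$1" \
    --output 'jsonpath={range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
}

# Print why each container last terminated, if it did
# (a reason such as OOMKilled or Error).
last_termination_reason() {
  kubectl get pod "$2" --namespace "$1" \
    --output 'jsonpath={.status.containerStatuses[*].lastState.terminated.reason}'
}

# Usage:
#   pod_phase mynamespace hero
#   container_states mynamespace hero
#   last_termination_reason mynamespace hero
```

Note that the STATUS column of kubectl get pods often shows a container-level reason such as ImagePullBackOff rather than the Pod phase, so the two views can differ.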