
How Istio helped us investigate failures on our microservices

Here are the slides used for the IstioCon 2021 Lightning Talk.

https://events.istio.io/istiocon-2021/sessions/how-istio-helped-us-investigate-failures-on-our-microservices/
---
We introduced Istio to our microservices. Istio’s logs, metrics, and features are very helpful when we need to investigate failures in detail.

One day we had big trouble due to a node failure, and it was very hard to find out why our application had not recovered automatically. Thanks to Istio, we finally found the root cause in our application logic, and we were also able to reproduce the same failure in the development environment with Istio. I’d like to share this story.

s-shirayama

February 24, 2021


Transcript

  1. #IstioCon The goal of this session
     I want people who are considering a Service Mesh to think that Istio looks good, by giving an example of the usefulness of Istio in our system.
  2. #IstioCon Background
     • Microservices increase system complexity in general.
     • It wasn't easy for the development team to improve logging or the architecture because of their focus on service development.
     • SRE decided to deploy Istio to combat the system complexity.
  3. #IstioCon Today’s Story
     Istio brought us observability and testability of the network, which led us to solve a complex system failure.
  4. #IstioCon Simple diagram of service architecture
     [Diagram: a GKE cluster running Main Service (GraphQL), Service A, Service B, ..., each with an Istio-Proxy sidecar, plus the Istio Control Plane.]
  5. #IstioCon One day, the node went down
     • Main Service pods were redundant.
     • The node on which a Main Service pod was running went down.
     • The Pod and Node recovered automatically after a while (automatic repair by k8s/GKE).
  6. #IstioCon But, one endpoint on Main Service didn’t work
     • One endpoint on Main Service (which calls Service A internally) didn’t respond.
     • The Service A pod was restarted manually. Then, the endpoint started working properly.
  7. #IstioCon The additional clue for the root cause from Istio-Proxy’s log
     • Fact 1: The node on which a Main Service pod was running went down.
     • Fact 2: One Main Service endpoint stopped working until the Service A pod was restarted manually.
     • (No useful application logs to reveal what was happening...)
     • Fact 3: Main Service kept waiting for responses from Service A.
     Istio-Proxy access log (Main Service → Service A):
     • StatusCode: 503
     • ResponseFlag: Upstream Connection Termination
     • Duration: long (until the Service A pod restart)
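
     For reference, an Envoy (Istio-Proxy) access log line carries these fields roughly as in the simplified, hypothetical example below (method, path, and numbers are made up, not the actual log from the incident). "UC" is the short response flag for upstream connection termination, and the last number shown is the request duration in milliseconds:

     [2021-02-24T10:00:00.000Z] "POST /service-a/endpoint HTTP/1.1" 503 UC 0 95 600000 ...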
  8. #IstioCon Why did Main Service wait for responses from Service A?
     Hypotheses:
     1. The node going down caused Service A to wait forever for Main Service’s response.
     2. The queue was full, and Service A waited to enqueue the data.
     3. Service A didn’t return responses to Main Service.
     [Diagram: Main Service calls Service A’s API, which enqueues data into a queue; a daemon dequeues it.]
  9. #IstioCon Reproduce the failure in a test environment
     • Difficult to reproduce the same situation...
     • Istio’s “Fault Injection” feature makes it easy to reproduce the same situation in a test environment: inject a delay into requests to Service A at the Istio-Proxy (see the sketch below).
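
     As a minimal sketch of Istio fault injection (the VirtualService name, namespace, host, and delay value below are hypothetical, not our actual configuration), injecting a delay into calls to Service A can look like this:

     apiVersion: networking.istio.io/v1alpha3
     kind: VirtualService
     metadata:
       name: service-a-delay                  # hypothetical name
       namespace: test                        # hypothetical test namespace
     spec:
       hosts:
       - service-a.test.svc.cluster.local     # hypothetical host for Service A
       http:
       - fault:
           delay:
             percentage:
               value: 100.0                   # delay every request
             fixedDelay: 60s                  # simulate Service A hanging for a long time
         route:
         - destination:
             host: service-a.test.svc.cluster.local

     Applying a VirtualService like this lets the test environment mimic Service A not responding, without touching Service A itself.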
  10. #IstioCon Recap: Istio helped us investigate failures on our microservices
      • Istio improves observability → Istio-Proxy’s logs helped us form a hypothesis about the cause of the failure.
      • Istio improves testability → Istio’s Fault Injection feature made it easy to reproduce the failure.