Root Cause Analysis for Middleware Issues by Kubernetes Resource Events / KST-2026

Tomoyuki KOYAMA

February 07, 2026

Transcript

  1. Root Cause Analysis for Middleware Issues by Kubernetes Resource Events

    Tomoyuki Koyama*†, Takayuki Kushida*, Soichiro Ikuno*
    *Tokyo University of Technology, †Mercari, Inc.
    January 23, 2026 / KST-2026 @Pattaya, Thailand
  2. Introduction

    • A system failure occurred on a production web application, “Doktor”
      • Root cause: Rook Ceph* (software-defined storage middleware)
      • A Rook Ceph node was down
    • Root Cause Analysis (RCA)
      • The system administrator identifies the root cause by examining logs, metrics, and traces
      • RCA is a manual, time-consuming investigation
      • The time to complete RCA should be reduced

    * https://rook.io/

    [Figure: system overview. Labels: users, Web UI of Doktor, Kubernetes node, App Pod, Mongo Pod, PV, OSD Pod, Rook Ceph, author MS, error, monitoring server (logs, metrics, traces), system admin (finding).]
  3. Issue

    • Distributed architecture (e.g., microservices)
      • Consists of several subsystems with dependencies among them
      • Subsystems have a variety of fault points
      • Variety of fault points → time-consuming investigation
    • Software-based infrastructure
      • Network and storage are defined as middleware
      • Infrastructure failures frequently propagate to application failures [Li, 2022]
    → The system administrator takes time for RCA

    Cascading Failure@Top-3 [Li, 2022]
    Rank  From        To           Cases
    1     Storage     Application  73
    2     Network     Application  65
    3     Middleware  Application  45

    [Figure: an example of infrastructure resource dependencies]

    Li, Xiaoyun, et al. "Going through the life cycle of faults in clouds: Guidelines on fault handling." 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE). 2022.
  4. Related study

    Single-modal RCA [1,2]
    • A single data source is used for RCA
    • Insufficient information for analyzing middleware status

    Multi-modal RCA [3,4]
    • Multiple data sources are used for RCA
    • Lacks analysis of Kubernetes resource dependencies

    [Figure: data sources (metrics, logs, traces) mapped to root causes for single-modal and multi-modal RCA]

    [1] L. Wu, et al., "MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems," 2021 IEEE/ACM International Workshop on Cloud Intelligence, Madrid, Spain, 2021, pp. 31-36.
    [2] L. Pham, et al., "BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection," Proc. ACM on Software Engineering, vol. 1, no. FSE, pp. 2214-2237, Jul. 2024.
    [3] H. Wang, et al., "Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings," 2021 36th IEEE/ACM International Conference on Automated Software Engineering, Melbourne, Australia, 2021, pp. 419-429.
    [4] C. Lee, et al., "Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data," 2023 IEEE/ACM 45th International Conference on Software Engineering, Melbourne, Australia, 2023, pp. 1750-1762.
  5. Proposed method: “EventRCA”

    • A root cause analysis method for middleware issues
    • Multi-modal approach
      • Data sources: traces, events, resource definitions
    • Build a resource dependency graph from traces and resource definitions
    • Aggregate the number of events for each resource
    • Detect event increases for each end-to-end route
    • Output the cause and score

    [Figure: EventRCA overview with steps (1) to (5). Labels: Kubernetes cluster, microservices, events (e1, e2, e3), traces, resource definition, dependency graph (Service, Pod, PVC, PV), score calculation, scores.]

    Example of scores
    0.0495 stats/ConfigMap/istio-ca-root-cert
    0.0495 stats/LocalStorage/empty-dir
    0.0495 /StorageProvisioner/rook-ceph.rbd.csi.ceph.com

    Example of events
    Type    Reason     Age   From               Message
    ----    ------     ----  ----               -------
    Normal  Scheduled  69s   default-scheduler  Successfully assigned default/ubuntu to clematis-worker2
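
The steps listed on this slide (build a dependency graph, count events per resource, detect increases along a route, output scores) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the resource names, the event dictionaries, and in particular the scoring rule (each resource's share of the total event increase on the route) are hypothetical, since the slides do not give the actual formula.

```python
from collections import Counter

# Hypothetical dependency edges (resource -> resources it depends on), as
# might be derived from traces and resource definitions. Names are made up.
DEPENDENCIES = {
    "Service/author": ["Pod/author"],
    "Pod/author": ["PVC/author-data"],
    "PVC/author-data": ["PV/author-data"],
    "PV/author-data": [],
}


def reachable(graph, root):
    """Return every resource reachable from root, including root itself."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def count_events(events):
    """Aggregate the number of events per resource name."""
    return Counter(e["resource"] for e in events)


def score_route(graph, root, baseline_events, incident_events):
    """Rank resources on one route by their increase in event count.

    Assumption: score = increase / total increase on the route. This
    normalization is a placeholder, not the formula used by EventRCA.
    """
    before = count_events(baseline_events)
    after = count_events(incident_events)
    increases = {r: max(after[r] - before[r], 0) for r in reachable(graph, root)}
    total = sum(increases.values())
    if total == 0:
        return []
    scores = {r: inc / total for r, inc in increases.items() if inc > 0}
    # Highest score first; ties broken by resource name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical event samples from a quiet window and an incident window.
baseline = [{"resource": "Pod/author"}]
incident = [
    {"resource": "Pod/author"},
    {"resource": "Pod/author"},
    {"resource": "PV/author-data"},
    {"resource": "PV/author-data"},
    {"resource": "PV/author-data"},
    {"resource": "Pod/unrelated"},  # not on the route, so it is ignored
]
ranking = score_route(DEPENDENCIES, "Service/author", baseline, incident)
```

With these sample inputs, `ranking` places `PV/author-data` (score 0.75) ahead of `Pod/author` (score 0.25). Events on resources outside the route do not contribute, which is the point of restricting the count to the dependency graph.
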
  6. Use-case scenario

    • Integrate the proposed method into the web UI for distributed tracing
    • The system administrator finds the root cause of a system failure
    • The proposed method provides the root cause candidates as a list view
    • The list view allows the system administrator to identify the root cause without a time-consuming investigation

    [Figure: workflow in the web UI for distributed tracing, steps (1) to (3). Labels: system administrator, fault target, single trace view (spans), (2) click timestamp of error, (3) transition, score list view (score, resources, events), scores database. Example rows: 0.81 (Pod) OSD 1 ...; 0.64 (PV) author MongoDB ...]
  7. Experimental method and environment

    • Evaluation metrics
      • Mean Reciprocal Rank (MRR)
      • Hits@k, k = {1, 2, 3, 4, 5}
      • Execution time
    • Application: Doktor, a microservice application
    • Baseline method: MicroRCA
    • Failure scenarios
      • S1: Reproduced inconsistency error of Rook Ceph
        • One Kubernetes node was shut down to simulate an unresponsive node (no ICMP response)
      • S2: Actual inconsistency error of Rook Ceph
    • Environment
      • 4 VMs on VMware ESXi
      • K3s cluster (1 master and 3 workers)
      • Istio (service mesh)
      • SigNoz (monitoring server)

    [Figure: evaluation pipeline. Labels: failure scenario, fault targets, Doktor, Kubernetes cluster, error, SigNoz, traces, events, EventRCA, score list (resource name, score), Evaluator, MRR, Hits@k.]
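
For readers unfamiliar with Hits@k, the sketch below shows one common way to compute it: the fraction of trials in which the true root cause appears among the top k candidates. The function name, the trial data, and the resource names are hypothetical and are not taken from the experiment.

```python
def hits_at_k(ranked_lists, true_causes, k):
    """Fraction of trials whose true root cause is within the top-k candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_lists, true_causes)
        if truth in ranked[:k]
    )
    return hits / len(ranked_lists)


# Two hypothetical trials, each a ranked candidate list (best first).
trials = [
    ["Pod/osd-1", "PV/author", "ConfigMap/x"],
    ["ConfigMap/x", "PV/author", "Pod/osd-1"],
]
truths = ["Pod/osd-1", "Pod/osd-1"]

# Evaluate for k = 1..5, matching the range used on the slide.
hits = {k: hits_at_k(trials, truths, k) for k in (1, 2, 3, 4, 5)}
```

Here `hits[1]` and `hits[2]` are 0.5 because only the first trial ranks the true cause that high, while `hits[3]` through `hits[5]` are 1.0.
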
  8. Experimental results

    • Proposed method (EventRCA): the root cause is contained in the top 3
    • MRR (Mean Reciprocal Rank)
      • A metric for ranking evaluation
      • $\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$

    Root cause candidates of Top-3
    k=1 ConfigMap/istio-ca-root-cert
    k=2 LocalStorage/empty-dir
    k=3 StorageProvisioner/rook-ceph.rbd.csi.ceph.com

    [Figure: three plots. MRR for s1 and s2, MicroRCA vs. EventRCA (higher is better). Hits@k for k = 1 to 5, series MicroRCA-s1, EventRCA-s1, MicroRCA-s2, EventRCA-s2 (higher is better). Execution time in seconds, MicroRCA vs. EventRCA (lower is better).]
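
As a small worked example of MRR, the sketch below averages the reciprocal rank of the true root cause over several trials. The trial data is hypothetical, and scoring a trial as 0 when the true cause is missing from the list is a common convention assumed here, not something stated on the slides.

```python
def mean_reciprocal_rank(ranked_lists, true_causes):
    """Average of 1/rank of the true root cause over all trials.

    Assumption: a trial contributes 0 when the true cause is not listed.
    """
    total = 0.0
    for ranked, truth in zip(ranked_lists, true_causes):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)  # ranks are 1-based
    return total / len(ranked_lists)


# Two hypothetical trials: the true cause is ranked 1st, then 4th.
trials = [
    ["Pod/osd-1", "PV/author", "ConfigMap/x", "Service/web"],
    ["ConfigMap/x", "PV/author", "Service/web", "Pod/osd-1"],
]
truths = ["Pod/osd-1", "Pod/osd-1"]
mrr = mean_reciprocal_rank(trials, truths)
```

For these two trials the reciprocal ranks are 1/1 and 1/4, so `mrr` is (1 + 0.25) / 2 = 0.625.
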
  9. Conclusion

    Objective
    • Enable effective root cause analysis for middleware issues in Kubernetes
    Issue
    • Existing RCA methods lack accuracy and context awareness
    Proposed method
    • Uses events, dependency graphs, and traces
    • Detects event increases and root causes by focusing on resource dependencies
    Results
    • Increased MRR and Hits@k
    • Execution time: proposed method < baseline method