of Large Language Models on Distributed Infrastructures: A Survey”, arXiv, 2024.
• Qian Ding, “Transformers in SRE Land: Evolving to Manage AI Infrastructure”, USENIX SREcon25 Americas, 2025.
• Deepak Narayanan, et al., “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM”, the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2021.
• Yusheng Zheng, et al., “Extending Applications Safely and Efficiently”, USENIX OSDI, 2025.
• Yiwei Yang, et al., “eGPU: Extending eBPF Programmability and Observability to GPUs”, Workshop on Heterogeneous Composable and Disaggregated Systems (HCDS), 2025.

4. Summary
Detection for Large-scale Distributed Model Training
• [Xu+, IWQoS2025] eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems
• [Jiang+, FSE2025] L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis
• [Jiang+, DSN2025] LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms
• [Cui+, arXiv2025] XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale
• [Dong+, NSDI2025] Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production