Slide 56
References
• Jiangfei Duan, et al., “Efficient Training of Large Language Models on Distributed Infrastructures: A Survey”, arXiv, 2024.
• Qian Ding, “Transformers in SRE Land: Evolving to Manage AI Infrastructure”, USENIX SREcon25 Americas, 2025.
• Deepak Narayanan, et al., “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM”, International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2021.
• Yusheng Zheng, et al., “Extending Applications Safely and Efficiently”, USENIX OSDI, 2025.
• Yiwei Yang, et al., “eGPU: Extending eBPF Programmability and Observability to GPUs”,
Workshop on Heterogeneous Composable and Disaggregated Systems (HCDS), 2025.