Keskar, N. S., D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” arXiv preprint arXiv:1609.04836, 2016. https://arxiv.org/abs/1609.04836

Global batch size: B_global = B_local × N, where N is the number of workers.

Larger clusters → larger global batch size.

Two key challenges:
▪ Lower generalization at very large batch sizes (e.g., > 8000): models tend to converge to sharp minima → lower test performance (Keskar et al., 2016).
▪ Scalability drops as the number of workers grows, especially when communication overhead is high.
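
The relationship is easy to check numerically. Below is a minimal sketch (plain Python, with hypothetical per-worker batch sizes and worker counts) of how B_global grows with N in a data-parallel setup, and where it crosses the roughly 8000-sample range noted above.

```python
# Hypothetical sketch: global batch size under data parallelism,
# B_global = B_local * N (one synchronized optimizer step sees all N local batches).

def global_batch_size(local_batch_size: int, num_workers: int) -> int:
    """Each of the N workers processes its own local mini-batch per step,
    so one optimizer step effectively uses B_local * N samples."""
    return local_batch_size * num_workers

local_batch = 32  # hypothetical per-worker batch size
for n_workers in (8, 64, 256, 512):
    b_global = global_batch_size(local_batch, n_workers)
    note = "  <- beyond ~8000, generalization tends to drop (sharp minima)" if b_global > 8000 else ""
    print(f"N = {n_workers:4d}  ->  B_global = {b_global}{note}")
```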