2. GLU Variants Improve Transformer
3. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
4. The Problem with Metrics is a Fundamental Problem for AI
5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
6. A Survey on Neural Network Interpretability
7. Soft-DTW: a Differentiable Loss Function for Time-Series
8. RepVGG: Making VGG-style ConvNets Great Again
9. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
10. A Modern Introduction to Online Learning