(Video, sampling rate: 3 fps), VGGish (Audio)

Evaluation Metrics
• BLEU, METEOR, ROUGE-L, CIDEr-D (MS Challenge Metrics)

Model
• Encoder dim (low, high): (512, 256) for video, (128, 64) for audio
• Chunk size: 10 for video, 4 for audio
• Decoder dim: 256 for global, 1024 for local
• Embedding dim: 512
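
The model settings listed above can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code: the class names, field names, and the `num_chunks` helper are hypothetical, and the helper assumes that chunks are non-overlapping, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityConfig:
    """Per-modality encoder settings (field names are illustrative)."""
    encoder_dim_low: int   # "low" encoder dimension from the slide
    encoder_dim_high: int  # "high" encoder dimension from the slide
    chunk_size: int        # number of feature steps grouped into one chunk


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters as listed on the slide."""
    video: ModalityConfig
    audio: ModalityConfig
    decoder_dim_global: int = 256
    decoder_dim_local: int = 1024
    embedding_dim: int = 512
    video_sampling_fps: int = 3


CONFIG = ModelConfig(
    video=ModalityConfig(encoder_dim_low=512, encoder_dim_high=256, chunk_size=10),
    audio=ModalityConfig(encoder_dim_low=128, encoder_dim_high=64, chunk_size=4),
)


def num_chunks(num_steps: int, chunk_size: int) -> int:
    """Count chunks for a sequence, ASSUMING non-overlapping chunks
    with a shorter final chunk when the length is not a multiple."""
    if num_steps < 0 or chunk_size <= 0:
        raise ValueError("num_steps must be >= 0 and chunk_size must be > 0")
    return -(-num_steps // chunk_size)  # ceiling division
```

Under that non-overlap assumption, a 20-second clip sampled at 3 fps yields 60 video feature steps, so `num_chunks(60, CONFIG.video.chunk_size)` gives 6 chunks.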