Slide 11
Proposed Method / Auditory Features
• Compress the Mel spectrogram so that its number of samples
matches the number of video frames
• For each frame 𝑡, find the nearest Mel spectrogram index 𝑖
• Extract a 5-sample segment centered at 𝑖
and apply a 2D CNN to obtain the auditory features 𝒂_𝑡
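[Figure: for video frame 𝑡, the nearest index 𝑖 in the Mel spectrogram 𝑆 is located; the segment 𝑆[𝑖 − 2, 𝑖 − 1, 𝑖, 𝑖 + 1, 𝑖 + 2] is extracted and passed through a 2D CNN to produce the auditory features 𝒂_𝑡, a real-valued vector.]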
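The frame-to-index lookup and the 5-sample extraction can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the timestamp-based mapping (via `fps`, `sample_rate`, `hop_length`), and the edge clamping at the spectrogram boundaries are all hypothetical choices that the slide does not specify. The 2D CNN is left out because the slide gives no architecture.

```python
import math


def nearest_mel_index(t, fps, sample_rate, hop_length, num_mel_steps):
    """Map video frame t to the nearest Mel spectrogram time index i.

    Assumption: audio and video start at the same instant, frame t sits at
    t / fps seconds, and Mel step i sits at i * hop_length / sample_rate seconds.
    """
    position = t * sample_rate / (fps * hop_length)
    i = math.floor(position + 0.5)  # round half up to the nearest index
    # Assumption: clamp indices that fall outside the spectrogram.
    return min(max(i, 0), num_mel_steps - 1)


def extract_segment(mel, i, half_width=2):
    """Return the steps S[i-2 .. i+2] (5 samples when half_width is 2).

    mel is a list of time steps, each a list of Mel-bin values.
    Assumption: near the edges, the first or last step is repeated.
    """
    last = len(mel) - 1
    return [
        mel[min(max(j, 0), last)]
        for j in range(i - half_width, i + half_width + 1)
    ]


# Toy spectrogram: 100 time steps, 3 Mel bins, each step filled with its own index.
mel = [[float(k)] * 3 for k in range(100)]

# Hypothetical settings: 25 fps video, 16 kHz audio, hop of 160 samples.
i = nearest_mel_index(t=3, fps=25, sample_rate=16000, hop_length=160,
                      num_mel_steps=len(mel))
segment = extract_segment(mel, i)
print(i, [step[0] for step in segment])
```

With these settings each video frame spans four Mel steps, so frame 3 maps to index 12 and the segment covers steps 10 through 14. The resulting 5-step patch is what would be passed to the 2D CNN to obtain 𝒂_𝑡.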