Slide 12
Sound Event Detection (SED)
Two major approaches to the task
Audio Tagging: clip-level labeling of the audio input
Sound Event Detection: segment-level (time-annotated) labeling of the audio
[Diagram, Audio Tagging] input (waveform, melspec, …) -> Feature Extractor (feature extraction: CNN, etc.) -> feature map -> aggregate in time axis (max, mean, attention, …) -> Classifier -> clip-level prediction
[Diagram, Sound Event Detection] input (waveform, melspec, …) -> Feature Extractor (feature extraction: CNN, etc.) -> feature map -> Pointwise Classifier -> frame-level prediction -> aggregate in time axis (max, mean, attention, …) -> clip-level prediction
The SED model has two outputs: a clip-wise prediction and a segment-wise prediction.
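The difference between the two approaches comes down to where the aggregation over the time axis happens. The sketch below is a minimal NumPy illustration, not the slide author's implementation: the feature map is random data standing in for a CNN output, the classifier is a single linear layer with a sigmoid, and all names (`audio_tagging`, `sound_event_detection`, `W`, `b`, and the sizes `T`, `C`, `K`) are hypothetical. Mean and max pooling are used as the aggregation; attention pooling is left out for brevity.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def audio_tagging(feature_map, W, b):
    """Aggregate over time first, then classify once per clip."""
    pooled = feature_map.mean(axis=0)         # (C,)  time axis collapsed
    return sigmoid(pooled @ W + b)            # (K,)  clip-level prediction


def sound_event_detection(feature_map, W, b):
    """Classify every frame, then aggregate the frame predictions."""
    framewise = sigmoid(feature_map @ W + b)  # (T, K) frame-level prediction
    clipwise = framewise.max(axis=0)          # (K,)   clip-level prediction
    return clipwise, framewise


# Assumed toy sizes: T time frames, C feature channels, K event classes.
rng = np.random.default_rng(0)
T, C, K = 50, 16, 3
feature_map = rng.normal(size=(T, C))         # stand-in for a CNN feature map
W = rng.normal(size=(C, K)) * 0.1             # hypothetical classifier weights
b = np.zeros(K)

clip_at = audio_tagging(feature_map, W, b)
clip_sed, frame_sed = sound_event_detection(feature_map, W, b)
print(clip_at.shape, clip_sed.shape, frame_sed.shape)  # (3,) (3,) (50, 3)
```

Because the SED variant keeps the time axis until after classification, the frame-level output can be thresholded to obtain onset and offset times, while the clip-level output can still be trained against clip-level labels.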