Sound Event Detection (SED)
Two major approaches to the task
Audio Tagging: clip-level labeling of the audio input
  input (waveform, melspec, …)
  -> Feature Extractor (CNN, etc.)
  -> Feature map
  -> Aggregate in time axis (max, mean, attention, …)
  -> Classifier
  -> Clip-level prediction

Sound Event Detection: segment-level (time-annotated) labeling of the audio
  input (waveform, melspec, …)
  -> Feature Extractor (CNN, etc.)
  -> Feature map
  -> Pointwise Classifier
  -> Frame-level prediction
  -> Aggregate in time axis (max, mean, attention, …)
  -> Clip-level prediction
  There are two outputs: the clip-wise prediction and the segment-wise prediction.
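The difference between the two pipelines is where the time axis is aggregated: before the classifier (audio tagging) or after a per-frame classifier (SED). The sketch below illustrates this in plain Python. It is a toy under stated assumptions, not the slide author's implementation: the feature map is taken as already computed, the classifier is a single linear layer followed by a sigmoid, and the names `audio_tagging`, `sound_event_detection`, `aggregate`, and `linear` are hypothetical. Only max and mean aggregation are shown; attention is left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weights, bias):
    """One logit per class for a single feature vector (assumed linear classifier)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def aggregate(frames, mode="max"):
    """Collapse the time axis of a T x D list of vectors into one D-vector."""
    columns = list(zip(*frames))
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unsupported aggregation: {mode}")


def audio_tagging(feature_map, weights, bias, mode="mean"):
    """Aggregate the feature map over time first, then classify once per clip."""
    pooled = aggregate(feature_map, mode)
    return [sigmoid(z) for z in linear(pooled, weights, bias)]


def sound_event_detection(feature_map, weights, bias, mode="max"):
    """Classify every frame (pointwise), then aggregate the frame predictions."""
    frame_probs = [
        [sigmoid(z) for z in linear(frame, weights, bias)] for frame in feature_map
    ]
    clip_probs = aggregate(frame_probs, mode)
    return clip_probs, frame_probs


# Toy feature map: T = 4 frames, D = 2 features (values are made up).
feature_map = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # C = 2 classes
bias = [0.0, 0.0]

clip_only = audio_tagging(feature_map, weights, bias)
clip_probs, frame_probs = sound_event_detection(feature_map, weights, bias)
```

Here `audio_tagging` returns only a clip-level vector, while `sound_event_detection` returns both outputs; the frame-level list is what makes it possible to say when in the clip each class was active.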