The task

Audio Tagging
  Clip-level labeling of the audio input.
  Pipeline: input (waveform, melspec, …) -> Feature Extractor (feature extraction: CNN, etc.) -> feature map -> aggregate along the time axis (max, mean, attention, …) -> Classifier -> clip-level prediction

Sound Event Detection
  Segment-level (time-annotated) labeling of the audio.
  Pipeline: input (waveform, melspec, …) -> Feature Extractor (feature extraction: CNN, etc.) -> feature map -> Pointwise Classifier -> frame-level prediction -> aggregate along the time axis (max, mean, attention, …) -> clip-level prediction
  There are two outputs: a clip-wise prediction and a segment-wise prediction.
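The difference between the two pipelines comes down to where the time-axis aggregation sits relative to the classifier. The following is a minimal NumPy sketch of that idea, not the implementation behind the slide. It makes several assumptions: the feature map is a random array of shape (time frames, feature dimension) standing in for a CNN output, the classifier is a single linear layer followed by a sigmoid, audio tagging uses mean pooling, and sound event detection uses max pooling. All function and variable names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def audio_tagging_head(feature_map, weight, bias):
    """Aggregate over time first, then classify: (T, D) -> (C,)."""
    pooled = feature_map.mean(axis=0)        # (D,), mean pooling is an assumption
    return sigmoid(pooled @ weight + bias)   # (C,) clip-level prediction


def sed_head(feature_map, weight, bias):
    """Classify every frame, then aggregate: (T, D) -> (C,) and (T, C)."""
    framewise = sigmoid(feature_map @ weight + bias)  # (T, C) frame-level prediction
    clipwise = framewise.max(axis=0)                  # (C,), max pooling is an assumption
    return clipwise, framewise


# Stand-in for the feature extractor output: T frames, D features, C classes.
rng = np.random.default_rng(0)
T, D, C = 50, 16, 3
feature_map = rng.normal(size=(T, D))
weight = rng.normal(size=(D, C))
bias = np.zeros(C)

clip_only = audio_tagging_head(feature_map, weight, bias)
clipwise, framewise = sed_head(feature_map, weight, bias)
print(clip_only.shape, clipwise.shape, framewise.shape)  # (3,) (3,) (50, 3)
```

Because the SED head applies the same classifier to every frame before pooling, it returns both outputs at once, whereas the audio tagging head discards the time axis before classification and can only produce the clip-level one.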