
APSIPA 2023 Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies

Yuya Yamamoto
October 31, 2023

Presentation slides for "Yuya Yamamoto, Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies", presented at APSIPA ASC 2023.


Transcript

  1. Toward Leveraging Pre-Trained Self-Supervised Frontends
    for Automatic Singing Voice Understanding Tasks:
    Three Case Studies
    Yuya Yamamoto
    University of Tsukuba, Japan


  2.
    Singing voice understanding (SVU)
    Background
    [Figure: components extracted from a vocal recording: singer, song (note, lyric, pitch, ...), expression, ...]
    Automatically analyze the components in the singing voice
    → beneficial for music pedagogy, discovery, creation, etc.
    Challenges: the singing voice is complex, and datasets tend to be small


  3.
    Existing approaches
    Background
    - Hand-crafted rules: ⭕ low data necessity / ❌ low performance (depends on the quality of the modeling)
    - Deep learning: ⭕ high performance / ❌ high data necessity


  4.
    Leveraging pretrained self-supervised models
    Background
    Pretraining (self-supervised learning) → transfer learning for downstream tasks
    Rapidly emerging in the speech (’20-: Wav2Vec2.0 [Baevski 20], etc.) and music (’21-: CLMR [Spijkervet 21], JukeMIR [Castellon 21], etc.) domains
    Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS 2020.
    Spijkervet, J., & Burgoyne, J. A. (2021). Contrastive learning of musical representations. ISMIR 2021.
    Castellon, R., Donahue, C., & Liang, P. (2021). Codified audio language modeling learns useful representations for music information retrieval. ISMIR 2021.


  5.
    How about pretrained SSL for SVU? → Not really explored yet… 😭
    Proposal
    The singing voice 🎙 has sides of both speech and music, so models from either domain can be leveraged for it:
    - Speech: Wav2Vec2.0, WavLM
    - Music: MERT, Map-Music2Vec
    We compared pretrained SSL models from the speech and music domains.


  6.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    - Wav2Vec2.0 [Baevski 20] (speech): contrastive; succeeded in ASR tasks
    - WavLM [Chen 22] (speech): masked prediction + denoising; extended to various speech tasks
    - MERT [Li 23] (music): masked prediction + CQT spectrogram reconstruction; music tasks
    - Map-Music2Vec [Li 22] (music): BYOL (Data2Vec); music tasks (good at capturing local information)


  7.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    Wav2Vec2.0: contrastive learning between the quantized CNN output at masked positions and the transformer output
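    A minimal sketch of this objective in PyTorch (an InfoNCE-style reproduction with illustrative names, not the authors' code):

    ```python
    # wav2vec 2.0-style contrastive objective: the transformer output at a
    # masked step must be closer (in cosine similarity) to the true
    # quantized latent than to K sampled distractors.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, target, distractors, temperature=0.1):
        # context: (B, D), target: (B, D), distractors: (B, K, D)
        candidates = torch.cat([target.unsqueeze(1), distractors], dim=1)  # (B, K+1, D)
        logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
        labels = torch.zeros(context.size(0), dtype=torch.long)  # true latent sits at index 0
        return F.cross_entropy(logits, labels)
    ```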


  8.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    WavLM: masked prediction ① + denoising ②
    ①: predict the cluster ID at the masked positions (as in HuBERT)
    ②: overlay other voices as noise on the input and predict the original voice
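    A minimal sketch of objective ① in PyTorch (illustrative names; the voice-overlay noising of ② is omitted):

    ```python
    # HuBERT-style masked prediction: cross-entropy against discrete
    # cluster IDs, computed only at the masked frames.
    import torch.nn.functional as F

    def masked_cluster_loss(logits, cluster_ids, mask):
        # logits: (B, T, C), cluster_ids: (B, T), mask: (B, T) boolean
        return F.cross_entropy(logits[mask], cluster_ids[mask])
    ```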


  9.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    MERT: masked prediction, with targets changed from MFCC to log-mel spectrogram + chroma,
    plus CQT spectrogram reconstruction → to enhance pitch & harmony representations
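    A minimal sketch of computing a log-CQT reconstruction target with librosa (the CQT settings here are assumptions, not MERT's exact configuration):

    ```python
    import librosa
    import numpy as np

    # Load a bundled example clip and compute its log-magnitude CQT.
    y, sr = librosa.load(librosa.ex("trumpet"))
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))
    log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)  # (n_bins, n_frames) target
    ```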


  10.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    Map-Music2Vec: BYOL (as in Data2Vec): teacher-student masked prediction;
    the student predicts the teacher's representation, and the teacher's parameters
    are updated as an EMA of the student's
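    A minimal sketch of the EMA teacher update in PyTorch (illustrative names):

    ```python
    import torch

    @torch.no_grad()
    def ema_update(teacher, student, tau=0.999):
        # Teacher parameters are an exponential moving average of the student's.
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1.0 - tau)
    ```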


  11.
    Chose the 12-encoder-layer version of each model
    Method: Compared models
    - Wav2Vec2.0: Base model
    - WavLM: Base+ model
    - MERT: public-v0 model
    - Map-Music2Vec: the only released model
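    A minimal sketch of loading the four frontends via Hugging Face transformers (the checkpoint names are assumptions based on the public model hubs, not taken from the slide):

    ```python
    from transformers import AutoModel

    CHECKPOINTS = {
        "Wav2Vec2.0":    "facebook/wav2vec2-base",     # speech, Base
        "WavLM":         "microsoft/wavlm-base-plus",  # speech, Base+
        "MERT":          "m-a-p/MERT-v0-public",       # music, public-v0
        "Map-Music2Vec": "m-a-p/music2vec-v1",         # music
    }

    frontends = {name: AutoModel.from_pretrained(ckpt, trust_remote_code=True)
                 for name, ckpt in CHECKPOINTS.items()}
    ```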


  12.
    Utilize all intermediate encoder layers
    Method: weighted sum
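    A minimal sketch of the weighted sum in PyTorch (SUPERB-style learnable layer weights; the names and layer count are illustrative):

    ```python
    import torch
    import torch.nn as nn

    class LayerWeightedSum(nn.Module):
        """Softmax-weighted sum over all intermediate encoder layers."""
        def __init__(self, num_layers=13):  # e.g., 12 encoder layers + CNN output
            super().__init__()
            self.weights = nn.Parameter(torch.zeros(num_layers))

        def forward(self, hidden_states):
            # hidden_states: per-layer (B, T, D) tensors, e.g. from
            # model(wav, output_hidden_states=True).hidden_states
            stacked = torch.stack(tuple(hidden_states), dim=0)    # (L, B, T, D)
            w = torch.softmax(self.weights, dim=0)                # (L,)
            return (w[:, None, None, None] * stacked).sum(dim=0)  # (B, T, D)
    ```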


  13.
    Method: two-stage fine-tuning
    1. Train with the SSL model frozen
    [Figure: 🔥 marks the trainable modules]


  14.
    Method: two-stage fine-tuning
    2. Unfreeze the parameters of the encoder layers
    [Figure: 🔥 marks the trainable modules; the encoder layers are now trainable too]
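    A minimal sketch of the two-stage schedule (illustrative function; whether the CNN feature extractor is also unfrozen in stage 2 is not specified on the slide):

    ```python
    def set_stage(ssl_model, layer_sum, head, stage):
        # Stage 1: SSL encoder frozen; only layer weights + task head train.
        # Stage 2: the encoder parameters are unfrozen as well.
        for p in ssl_model.parameters():
            p.requires_grad = (stage == 2)
        for module in (layer_sum, head):
            for p in module.parameters():
                p.requires_grad = True
    ```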


  15.
    Experiments on three SVU tasks
    Experiment: overview
    - Who?  Singer identification on Artist20
    - What? Note transcription on MIR-ST500
    - How?  Singing technique classification on VocalSet
    Image credits:
    https://www.universalmusic.com/aerosmith-and-universal-music-group-announce-historic-strategic-global-alliance/
    https://www.universal-music.co.jp/madonna/products/uics-1247/
    https://www.universal-music.co.jp/the-beatles/products/tycp-60013/


  16.
    20-way classification of the singer, given 5-s chunks
    Experiment: Singer identification
    - Artist20 dataset [Ellis 07]
      - 20 famous artists singing in English
      - Each artist has 6 albums
      - Train : Dev : Test = 4 : 1 : 1, split by album
      - Source-separated vocals, chunked into 5 s
      - Silence removed by RMS thresholding
    - Baseline: CRNN model [Hsieh 20]
    - Evaluation metrics (on the test set): F1-score, top-2 and top-3 accuracy (see the sketch below)
    Results:
    - All SSL models outperformed the baseline
    - Speech models are better on F1-score
    Ellis, D. (2007). Classifying music audio with timbral and chroma features. ISMIR 2007.
    Hsieh, T. H., Cheng, K. H., Fan, Z. C., Yang, Y. C., & Yang, Y. H. (2020). Addressing the confounds of accompaniments in singer identification. ICASSP 2020.
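    A minimal sketch of these metrics with scikit-learn (macro averaging for F1 is an assumption; the slide only says "F1-score"):

    ```python
    import numpy as np
    from sklearn.metrics import f1_score, top_k_accuracy_score

    def singer_id_scores(y_true, y_prob):
        # y_true: (N,) integer labels; y_prob: (N, 20) class probabilities
        y_pred = np.argmax(y_prob, axis=1)
        return {"F1": f1_score(y_true, y_pred, average="macro"),
                "Top-2": top_k_accuracy_score(y_true, y_prob, k=2),
                "Top-3": top_k_accuracy_score(y_true, y_prob, k=3)}
    ```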


  17.
    Transcribe the performance MIDI of the vocal part
    Experiment: Note transcription
    - MIR-ST500 [Wang 21]
      - 500 Chinese songs (400 for train, 100 for test), source-separated
      - Performance-MIDI annotation for each 5-s chunk
    Wang, J. Y., & Jang, J. S. R. (2021). On the preparation and validation of a large-scale dataset of singing transcription. ICASSP 2021.


  18.
    Evaluation metrics [Molina 14]
    Experiment: Note transcription
    - Estimate the onset, offset, and pitch (MIDI number) of each note
    - F1-scores that consider...
      - COn: onset only
      - COnP: onset + pitch
      - COnPOff: onset + pitch + offset
    - Tolerances: pitch ±50 cents; onset ±50 ms; offset ±20% of the note duration (see the sketch below)
    Molina, E., Barbancho-Perez, A. M., Tardon-Garcia, L. J., & Barbancho-Perez, I. (2014). Evaluation framework for automatic singing transcription. ISMIR 2014.
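    A minimal sketch of these metrics via mir_eval (intervals are (onset, offset) pairs in seconds; pitches are in Hz):

    ```python
    import mir_eval.transcription as mt

    def molina_f1(ref_intervals, ref_pitches, est_intervals, est_pitches):
        con = mt.onset_precision_recall_f1(
            ref_intervals, est_intervals, onset_tolerance=0.05)[2]
        conp = mt.precision_recall_f1_overlap(  # offset_ratio=None ignores offsets
            ref_intervals, ref_pitches, est_intervals, est_pitches,
            onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=None)[2]
        conpoff = mt.precision_recall_f1_overlap(  # offsets within 20% of duration
            ref_intervals, ref_pitches, est_intervals, est_pitches,
            onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2)[2]
        return {"COn": con, "COnP": conp, "COnPOff": conpoff}
    ```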


  19.
    Transcribe the performance MIDI of the vocal part
    Experiment: Note transcription
    - Baselines:
      - EfficientNet-b0 [Wang 21]
      - JDCnote [Kum 22]: utilizes pseudo-labels obtained via vocal melody extraction
      - Wav2Vec2-Large [Gu 23]: Wav2Vec2.0 Large model, last layer only
    Results:
    - Music models are better, especially on pitch
    - Comparable with Wav2Vec2-Large → the weighted sum is effective?
    Kum, S., Lee, J., Kim, K. L., Kim, T., & Nam, J. (2022). Pseudo-label transfer from frame-level to note-level in a teacher-student framework for singing transcription from polyphonic music. ICASSP 2022.
    Gu, X., Zeng, W., Zhang, J., Ou, L., & Wang, Y. (2023). Deep audio-visual singing voice transcription based on self-supervised learning models. arXiv preprint arXiv:2304.12082.


  20.
    Identify 10 techniques from given 3-s chunks
    Experiment: Singing technique classification
    - VocalSet 10-way singing technique classification [Wilkins 18]
      - 20 singers (15 for train, 5 for test), 10 techniques, isolated voice
      - Input is chunked into 3 s
      - Imbalanced technique distribution
    Wilkins, J., Seetharaman, P., Wahl, A., & Pardo, B. (2018). VocalSet: A singing voice dataset. ISMIR 2018.


  21.
    Identify 10 techniques from given 3-s chunks
    Experiment: Singing technique classification
    - Baselines:
      - 1DCNN: raw waveform + 1-D convolutions [Wilkins 18]
      - OblongCNN: multi-resolution spectrograms + oblong-shaped 2-D convolutions [Yamamoto 21]
      - D-CNN-cRT: OblongCNN + deformable convolution + classifier retraining [Yamamoto 22]
    - Every model (including the SSL ones) is trained with a weighted CE loss (see the sketch below)
    Results:
    - Map-Music2Vec is the best-performing model → locality is important for this task?
    - Imbalanced learning leaves room for improvement
    Wilkins, J., Seetharaman, P., Wahl, A., & Pardo, B. (2018). VocalSet: A singing voice dataset. ISMIR 2018.
    Yamamoto, Y., Nam, J., Terasawa, H., & Hiraga, Y. (2021). Investigating time-frequency representations for audio feature extraction in singing technique classification. APSIPA ASC 2021.
    Yamamoto, Y., Nam, J., & Terasawa, H. (2022). Deformable CNN and imbalance-aware feature learning for singing technique classification. Interspeech 2022.
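    A minimal sketch of the weighted CE loss in PyTorch (the inverse-frequency weighting and the example class counts are assumptions, not from the slide):

    ```python
    import torch
    import torch.nn as nn

    counts = torch.tensor([720., 240., 180., 600., 90., 310., 150., 400., 120., 260.])
    weights = counts.sum() / (len(counts) * counts)  # inverse-frequency class weights
    criterion = nn.CrossEntropyLoss(weight=weights)  # 10-way technique classification
    ```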


  22.
    Which layers contributed more?
    Evaluation: inspection of the learnt weights
    Inspect each layer's weight after training
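    A minimal sketch of reading back the learnt weights (assumes the LayerWeightedSum module sketched earlier):

    ```python
    import torch

    layer_sum = LayerWeightedSum(num_layers=13)  # a trained instance in practice
    with torch.no_grad():
        w = torch.softmax(layer_sum.weights, dim=0)
    for i, wi in enumerate(w.tolist()):
        print(f"layer {i:2d}: weight = {wi:.3f}")
    ```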


  23.
    Evaluation: inspection of the learnt weights
    - Singer identification: early layers are more important, as with speaker information [Chen 22]
    - Transcription: the importance is scattered → many kinds of information are incorporated?
      MERT showed a different pattern → due to the CQT reconstruction?
    - Technique classification: early layers are more important → pitch, loudness, and timbre;
      might be affected by the class imbalance


  24.
    We need further investigation
    Discussion & future work
    - Models from both domains have a certain potential for SVU
      - Every model showed comparable performance on each task
    - How can we adapt them to the singing voice further?
      - Ways of fine-tuning (e.g., Adapter, LoRA), domain adaptation, etc.
      - Settings of the downstream tasks
    - Investigation of more SVU tasks
      - Unexplored components: phoneme, lyrics, loudness, vocal mixing, etc.
      - Variations: singer diarization, pitch extraction, technique detection, etc.


  25.
    Pretrained SSL models for singing voice understanding (SVU)
    Take-home message
    - Tackled the difficulty and low-resource problem of SVU
      - Compared Wav2Vec2, WavLM, MERT, Map-Music2Vec
      - Layer-weighted sum + two-stage fine-tuning
    - Comparable to or outperforming SoTA models
    - The best-performing model differs by task
      - Music models are good at transcription
      - Speech models are good at singer identification
    - Future work: further adaptation to the singing voice, more tasks
    THANK YOU!!
