
APSIPA 2023 Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies

Yuya Yamamoto
October 31, 2023


The presentation slide of "Yuya Yamamoto, Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies" at APSIPA ASC 2023.


Transcript

  1. Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies. Yuya Yamamoto, University of Tsukuba, Japan.
  2. Background: singing voice understanding (SVU). A vocal recording carries several components: the singer, the song (notes, lyrics, pitch, ...), expression, and more. Automatically analyzing these components is beneficial for music pedagogy, discovery, creation, etc. Two challenges: singing voice is complex, and datasets tend to be small.
  3. Background: existing approaches. Hand-crafted rules: ⭕ low data requirements, ❌ low performance (depends on the quality of the modeling). Deep learning: ⭕ high performance, ❌ high data requirements.
  4. Background: leveraging pretrained self-supervised models. Pretraining by self-supervised learning (SSL) followed by transfer learning to downstream tasks is rapidly emerging in both the speech domain ('20-: wav2vec 2.0 [Baevski 20], etc.) and the music domain ('21-: CLMR [Spijkervet 21], JukeMIR [Castellon 21], etc.). References: Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS 2020. Spijkervet, J., & Burgoyne, J. A. (2021). Contrastive learning of musical representations. ISMIR 2021. Castellon, R., Donahue, C., & Liang, P. (2021). Codified audio language modeling learns useful representations for music information retrieval. ISMIR 2021.
  5. Proposal: how about pretrained SSL for SVU? Singing voice has aspects of both speech 🎙 and music, so models from either domain (Wav2Vec2.0, WavLM, MERT, Map-Music2Vec) could be leveraged for singing voice, yet this has not really been explored 😭. We compared pretrained SSL models from the speech and music domains.
  6. Method: compared models. 4 models (2 from speech, 2 from music):
     - Wav2Vec2.0 [Baevski 20] (speech): contrastive objective; succeeded in ASR tasks.
     - WavLM [Chen 22] (speech): masked prediction + denoising; extends to various speech tasks.
     - MERT [Li 23] (music): masked prediction + CQT spectrogram reconstruction; music tasks.
     - MapMusic2Vec [Li 22] (music): BYOL (Data2Vec); music tasks (good at capturing local information).
  7. Wav2Vec2.0 detail: contrastive learning between the masked, quantized CNN output and the transformer output.
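The contrastive objective just described can be sketched as follows: for a masked frame, the transformer's context vector should be closer (by cosine similarity) to the true quantized CNN latent than to distractors drawn from other frames. A minimal NumPy sketch of this InfoNCE-style loss; the dimensions and temperature are illustrative, not the values used by wav2vec 2.0:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity along the last axis (broadcasts over rows).
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked frame.

    context:     (d,) transformer output at the masked position
    target:      (d,) true quantized CNN latent for that position
    distractors: (k, d) quantized latents sampled from other positions
    """
    candidates = np.vstack([target[None, :], distractors])   # (k+1, d), true target first
    logits = cosine(context[None, :], candidates) / temperature
    log_probs = logits - np.log(np.exp(logits).sum())        # log-softmax over candidates
    return -log_probs[0]                                     # true target is index 0

rng = np.random.default_rng(0)
d, k = 16, 5
target = rng.normal(size=d)
context = target + 0.1 * rng.normal(size=d)   # context vector close to its target
distractors = rng.normal(size=(k, d))
loss = contrastive_loss(context, target, distractors)
```

Minimizing this loss pulls the context vector toward its own quantized latent and pushes it away from the distractors.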
  8. WavLM detail: masked prediction ① + denoising ②. ①: predict the cluster assignment at the masked position (as in HuBERT). ②: mix in other voices as noise and predict the original voice.
  9. MERT detail: masked prediction with the target changed from MFCC to log-mel spectrogram + chroma, plus CQT spectrogram reconstruction to enhance pitch and harmony representations.
  10. MapMusic2Vec detail: BYOL-style teacher-student training with masked prediction; the student predicts the teacher's representation, and the teacher's parameters are updated as an exponential moving average (EMA) of the student's.
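The EMA teacher update described above can be sketched in a few lines; the momentum value here is illustrative, not the one used by the model:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.9):
    """BYOL-style update: the teacher tracks an exponential moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [np.zeros(4)]   # teacher starts at 0
student = [np.ones(4)]    # student stays at 1 for this toy example
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
# after n steps the teacher has moved to 1 - momentum**n of the way toward the student
```

No gradients flow into the teacher; only this averaging moves it, which is what stabilizes the target representations.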
  11. Method: compared models. We chose the 12-encoder-layer version of each model: Wav2Vec2.0 Base, WavLM Base+, MERT public-v0, and MapMusic2Vec (only one version is available).
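The summary slide mentions combining the encoder layers with a learnable weighted sum. A minimal NumPy sketch of that mechanism over the 13 hidden states (CNN output + 12 encoder layers); the softmax parameterization of the weights is an assumption based on common SUPERB-style practice, not confirmed by the slides:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_layer_sum(hidden_states, layer_logits):
    """Combine per-layer features with learnable weights.

    hidden_states: (n_layers, time, dim) stacked frontend outputs
    layer_logits:  (n_layers,) learnable parameters, softmaxed into weights
    """
    w = softmax(layer_logits)                      # weights sum to 1
    return np.tensordot(w, hidden_states, axes=1)  # (time, dim)

n_layers, T, D = 13, 50, 768
hs = np.random.default_rng(0).normal(size=(n_layers, T, D))
out = weighted_layer_sum(hs, np.zeros(n_layers))   # zero logits -> uniform weights
```

With all logits at zero the result is the plain mean over layers; training then lets each downstream task emphasize the layers it finds useful, which is what the layer-weight inspection later in the talk visualizes.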
  12. Experiment overview: three SVU tasks. Singer identification (who?) on Artist20, note transcription (what?) on MIR-ST500, and technique classification (how?) on VocalSet.
  13. Experiment: singer identification. 20-way classification of the singer, given 5-s chunks.
     - Artist20 dataset [Ellis 07]: 20 famous artists singing in English; each artist has 6 albums; train:dev:test = 4:1:1, split by album.
     - Vocals are source-separated and cut into 5-s chunks; silence is removed by RMS thresholding.
     - Baseline: CRNN model [Hsieh 20].
     - Evaluation metrics: F1-score, top-2 and top-3 accuracy on the test set.
     Results: all SSL models outperformed the baseline; speech models are better on F1-score. References: Ellis, D. (2007). Classifying music audio with timbral and chroma features. ISMIR 2007. Hsieh, T. H., Cheng, K. H., Fan, Z. C., Yang, Y. C., & Yang, Y. H. (2020). Addressing the confounds of accompaniments in singer identification. ICASSP 2020.
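The preprocessing above (5-s chunking with RMS-based silence removal) could look like the following sketch; the sample rate, non-overlapping chunking, and RMS threshold are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def rms(x):
    # Root-mean-square energy of a signal.
    return np.sqrt(np.mean(x ** 2))

def voiced_chunks(vocal, sr=16000, chunk_sec=5.0, rms_threshold=0.01):
    """Split a separated vocal track into fixed-length chunks, dropping near-silent ones."""
    chunk_len = int(chunk_sec * sr)
    chunks = [vocal[i:i + chunk_len]
              for i in range(0, len(vocal) - chunk_len + 1, chunk_len)]
    return [c for c in chunks if rms(c) > rms_threshold]

sr = 16000
t = np.arange(5 * sr) / sr
singing = 0.5 * np.sin(2 * np.pi * 220 * t)   # 5 s standing in for voiced singing
silence = np.zeros(5 * sr)                    # 5 s of silence
chunks = voiced_chunks(np.concatenate([singing, silence]), sr=sr)
# only the voiced 5-s chunk survives the RMS threshold
```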
  14. Experiment: note transcription. Transcribe the performance MIDI of the vocal part. MIR-ST500 [Wang 21]: 500 Chinese songs (400 for training, 100 for testing), source-separated, with performance-MIDI annotations; input is processed in 5-s chunks. Reference: Wang, J. Y., & Jang, J. S. R. (2021). On the preparation and validation of a large-scale dataset of singing transcription. ICASSP 2021.
  15. Experiment: note transcription. Evaluation metrics [Molina 14]: the model estimates onset, offset, and pitch (MIDI number), scored by F1-scores of increasing strictness: COn (onset only), COnP (+ pitch), COnPOff (+ offset). Tolerances: pitch ±50 cents, onset ±50 ms, offset ±20% of the note duration. Reference: Molina, E., Barbancho-Perez, A. M., Tardon-Garcia, L. J., & Barbancho-Perez, I. (2014). Evaluation framework for automatic singing transcription. ISMIR 2014.
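The tolerance rules above can be sketched as a matcher for one estimated note against one reference note (the greedy note matching and other mir_eval details are omitted; the 50-ms floor on the offset tolerance is an assumption borrowed from common practice, not stated on the slide):

```python
def matches_con(ref, est):
    """COn: onset within ±50 ms of the reference onset."""
    return abs(est["onset"] - ref["onset"]) <= 0.05

def matches_conp(ref, est):
    """COnP: COn plus pitch within ±50 cents (0.5 MIDI semitones)."""
    return matches_con(ref, est) and abs(est["pitch"] - ref["pitch"]) <= 0.5

def matches_conpoff(ref, est):
    """COnPOff: COnP plus offset within ±20% of the reference note duration."""
    dur = ref["offset"] - ref["onset"]
    tol = max(0.2 * dur, 0.05)  # 50-ms floor is an assumption, not from the slide
    return matches_conp(ref, est) and abs(est["offset"] - ref["offset"]) <= tol

# A note 30 ms late, 40 cents sharp, released 100 ms late still counts under all three.
ref = {"onset": 1.00, "offset": 2.00, "pitch": 60.0}
est = {"onset": 1.03, "offset": 2.10, "pitch": 60.4}
```

F1 is then computed from the counts of matched reference and estimated notes at each strictness level.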
  16. Experiment: note transcription. Baselines:
     - EfficientNet-b0 [Wang 21].
     - JDCnote [Kum 22]: utilizes pseudo-labels via vocal melody extraction.
     - Wav2Vec2-Large [Gu 23]: Wav2Vec2.0 Large model, last layer only.
     Results: music models are better, especially on pitch, and comparable with Wav2Vec2-Large → is the layer weight sum effective? References: Kum, S., Lee, J., Kim, K. L., Kim, T., & Nam, J. (2022). Pseudo-label transfer from frame-level to note-level in a teacher-student framework for singing transcription from polyphonic music. ICASSP 2022. Gu, X., Zeng, W., Zhang, J., Ou, L., & Wang, Y. (2023). Deep audio-visual singing voice transcription based on self-supervised learning models. arXiv preprint arXiv:2304.12082.
  17. Experiment: singing technique classification. Identify 10 techniques from 3-s chunks. VocalSet 10-way singing technique classification [Wilkins 18]: 20 singers (15 for training, 5 for testing), 10 techniques, isolated voice; input is chunked into 3 s; the technique distribution is imbalanced. Reference: Wilkins, J., Seetharaman, P., Wahl, A., & Pardo, B. (2018). VocalSet: A singing voice dataset. ISMIR 2018.
  18. Experiment: singing technique classification. Baselines:
     - 1DCNN: raw waveform + 1D convolutions [Wilkins 18].
     - OblongCNN: multi-resolution spectrograms + oblong-shaped 2D convolutions [Yamamoto 21].
     - D-CNN-cRT: OblongCNN + deformable convolution + classifier retraining [Yamamoto 22].
     Every model (including the SSL ones) is trained with a weighted cross-entropy loss. Results: MapMusic2Vec performs best → is locality important for this task? Imbalanced learning leaves room for improvement. References: Yamamoto, Y., Nam, J., Terasawa, H., & Hiraga, Y. (2021). Investigating time-frequency representations for audio feature extraction in singing technique classification. APSIPA 2021. Yamamoto, Y., Nam, J., & Terasawa, H. (2022). Deformable CNN and imbalance-aware feature learning for singing technique classification. Interspeech 2022.
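The weighted cross-entropy used to counter the imbalanced technique distribution can be sketched in NumPy; inverse-frequency class weights are an assumption about how the weights were set, not a detail given on the slide:

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency class weights, normalized so the average weight is 1."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)

def weighted_ce(logits, labels, weights):
    """Mean weighted cross-entropy over a batch. logits: (n, c), labels: (n,)."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_example = -weights[labels] * log_probs[np.arange(len(labels)), labels]
    return per_example.mean()

labels = np.array([0, 0, 0, 1])   # class 0 is three times as frequent as class 1
w = class_weights(labels, 2)      # the rarer class gets the larger weight
loss = weighted_ce(np.zeros((4, 2)), labels, w)
```

Scaling each example's loss by its class weight makes errors on rare techniques cost proportionally more, which is the standard remedy for the imbalance noted above.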
  19. Evaluation: inspection of the learned layer weights.
     - Singer identification: early layers are more important, as with speaker information [Chen 22].
     - Transcription: importance is scattered → many kinds of information are incorporated? MERT shows a different pattern → due to the CQT reconstruction?
     - Technique classification: early layers are more important → pitch, loudness, and timbre; this might be affected by the class imbalance.
  20. Discussion & future work: we need further investigation.
     - Models from both domains have a certain potential for SVU; every model showed comparable performance on each task.
     - How can we adapt them to singing voice further? Fine-tuning methods (e.g., Adapter, LoRA), domain adaptation, etc., and the settings of the downstream tasks.
     - Investigation of more SVU tasks: unexplored components (phoneme, lyrics, loudness, vocal mixing, etc.) and task variations (singer diarization, pitch extraction, technique detection, etc.).
  21. Take-home message: pretrained SSL models for singing voice understanding (SVU).
     - Tackled the difficulty and low-resource problem of SVU.
     - Compared Wav2Vec2, WavLM, MERT, and Map-Music2Vec, using a layer weight sum + two-stage finetuning.
     - Comparable to or outperforming SoTA models; the best-performing models differ by task: music models are good at transcription, speech models at singer ID.
     - Future work: further adaptation to singing voice and more tasks.
     THANK YOU!!