
APSIPA 2023 Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies

Yuya Yamamoto
October 31, 2023

Presentation slides for "Yuya Yamamoto, Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies", presented at APSIPA ASC 2023.


Transcript

  1. Toward Leveraging Pre-Trained Self-Supervised Frontends
    for Automatic Singing Voice Understanding Tasks:
    Three Case Studies
    Yuya Yamamoto
    University of Tsukuba, Japan


  2.
    Singing voice understanding (SVU)
    Background
    [Figure: components extracted from a vocal recording: singer, song (note, lyric, pitch, ...), expression, ...]
    Automatically analyze the components in the singing voice
    → beneficial for music pedagogy, discovery, creation, etc.
    Challenges: the singing voice is complex, and datasets tend to be small


  3.
    Existing approaches
    Background
    - Hand-crafted rules: ⭕ low data necessity / ❌ low performance (depends on the quality of the modeling)
    - Deep learning: ⭕ high performance / ❌ high data necessity


  4.
    Leveraging pretrained self-supervised models
    Background
    Pretraining (self-supervised learning) → transfer learning for downstream tasks
    Rapidly emerging in the speech (’20-: Wav2Vec2.0 [Baevski 20], etc.) and music (’21-: CLMR [Spijkervet 21], JukeMIR [Castellon 21], etc.) domains
    Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS 2020.
    Spijkervet, J., & Burgoyne, J. A. (2021). Contrastive learning of musical representations. ISMIR 2021.
    Castellon, R., Donahue, C., & Liang, P. (2021). Codified audio language modeling learns useful representations for music information retrieval. ISMIR 2021.


  5.
    How about pretrained SSL for SVU? → Not really explored yet… 😭
    Proposal
    The singing voice 🎙 has sides of both speech and music, so models from either domain can be leveraged for it:
    - Speech: Wav2Vec2.0, WavLM
    - Music: MERT, Map-Music2Vec
    We compared pretrained SSL models from the speech and music domains.


  6.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    - Wav2Vec2.0 [Baevski 20] (speech): contrastive; succeeded in ASR tasks
    - WavLM [Chen 22] (speech): masked prediction + denoising; extended to various speech tasks
    - MERT [Li 23] (music): masked prediction + CQT spectrogram reconstruction; music tasks
    - Map-Music2Vec [Li 22] (music): BYOL (Data2Vec); music tasks (good at capturing local information)


  7.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    Wav2Vec2.0: contrastive learning between the quantized CNN output at masked positions and the transformer output
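    A minimal sketch of this objective in PyTorch (an InfoNCE-style reproduction with illustrative names, not the authors' code):

    ```python
    # wav2vec 2.0-style contrastive objective: the transformer output at a
    # masked step must be closer (in cosine similarity) to the true
    # quantized latent than to K sampled distractors.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, target, distractors, temperature=0.1):
        # context: (B, D), target: (B, D), distractors: (B, K, D)
        candidates = torch.cat([target.unsqueeze(1), distractors], dim=1)  # (B, K+1, D)
        logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
        labels = torch.zeros(context.size(0), dtype=torch.long)  # true latent sits at index 0
        return F.cross_entropy(logits, labels)
    ```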


  8.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    WavLM: masked prediction ① + denoising ②
    ①: predict the cluster ID at the masked positions (as in HuBERT)
    ②: overlay other voices as noise on the input and predict the original voice
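    A minimal sketch of objective ① in PyTorch (illustrative names; the voice-overlay noising of ② is omitted):

    ```python
    # HuBERT-style masked prediction: cross-entropy against discrete
    # cluster IDs, computed only at the masked frames.
    import torch.nn.functional as F

    def masked_cluster_loss(logits, cluster_ids, mask):
        # logits: (B, T, C), cluster_ids: (B, T), mask: (B, T) boolean
        return F.cross_entropy(logits[mask], cluster_ids[mask])
    ```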


  9.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    MERT: masked prediction, with targets changed from MFCC to log-mel spectrogram + chroma,
    plus CQT spectrogram reconstruction → to enhance pitch & harmony representations
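    A minimal sketch of computing a log-CQT reconstruction target with librosa (the CQT settings here are assumptions, not MERT's exact configuration):

    ```python
    import librosa
    import numpy as np

    # Load a bundled example clip and compute its log-magnitude CQT.
    y, sr = librosa.load(librosa.ex("trumpet"))
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))
    log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)  # (n_bins, n_frames) target
    ```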


  10.
    4 models (2 from speech, 2 from music)
    Method: Compared models
    (table repeated from the previous slide)
    Map-Music2Vec: BYOL (as in Data2Vec): teacher-student masked prediction;
    the student predicts the teacher's representation, and the teacher's parameters
    are updated as an EMA of the student's
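    A minimal sketch of the EMA teacher update in PyTorch (illustrative names):

    ```python
    import torch

    @torch.no_grad()
    def ema_update(teacher, student, tau=0.999):
        # Teacher parameters are an exponential moving average of the student's.
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1.0 - tau)
    ```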


  11.
    Chose the 12-encoder-layer version of each model
    Method: Compared models
    - Wav2Vec2.0: Base model
    - WavLM: Base+ model
    - MERT: public-v0 model
    - Map-Music2Vec: the only released model
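    A minimal sketch of loading the four frontends via Hugging Face transformers (the checkpoint names are assumptions based on the public model hubs, not taken from the slide):

    ```python
    from transformers import AutoModel

    CHECKPOINTS = {
        "Wav2Vec2.0":    "facebook/wav2vec2-base",     # speech, Base
        "WavLM":         "microsoft/wavlm-base-plus",  # speech, Base+
        "MERT":          "m-a-p/MERT-v0-public",       # music, public-v0
        "Map-Music2Vec": "m-a-p/music2vec-v1",         # music
    }

    frontends = {name: AutoModel.from_pretrained(ckpt, trust_remote_code=True)
                 for name, ckpt in CHECKPOINTS.items()}
    ```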


  12.
    Utilize all intermediate encoder layers
    Method: weighted sum
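    A minimal sketch of the weighted sum in PyTorch (SUPERB-style learnable layer weights; the names and layer count are illustrative):

    ```python
    import torch
    import torch.nn as nn

    class LayerWeightedSum(nn.Module):
        """Softmax-weighted sum over all intermediate encoder layers."""
        def __init__(self, num_layers=13):  # e.g., 12 encoder layers + CNN output
            super().__init__()
            self.weights = nn.Parameter(torch.zeros(num_layers))

        def forward(self, hidden_states):
            # hidden_states: per-layer (B, T, D) tensors, e.g. from
            # model(wav, output_hidden_states=True).hidden_states
            stacked = torch.stack(tuple(hidden_states), dim=0)    # (L, B, T, D)
            w = torch.softmax(self.weights, dim=0)                # (L,)
            return (w[:, None, None, None] * stacked).sum(dim=0)  # (B, T, D)
    ```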


  13.
    Method: two-stage fine-tuning
    1. Train with the SSL model frozen
    [Figure: 🔥 marks the trainable modules]


  14.
    Method: two-stage fine-tuning
    2. Unfreeze the parameters of the encoder layers
    [Figure: 🔥 marks the trainable modules; the encoder layers are now trainable too]
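    A minimal sketch of the two-stage schedule (illustrative function; whether the CNN feature extractor is also unfrozen in stage 2 is not specified on the slide):

    ```python
    def set_stage(ssl_model, layer_sum, head, stage):
        # Stage 1: SSL encoder frozen; only layer weights + task head train.
        # Stage 2: the encoder parameters are unfrozen as well.
        for p in ssl_model.parameters():
            p.requires_grad = (stage == 2)
        for module in (layer_sum, head):
            for p in module.parameters():
                p.requires_grad = True
    ```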


  15.
    Experiments on three SVU tasks
    Experiment: overview
    - Who?  Singer identification on Artist20
    - What? Note transcription on MIR-ST500
    - How?  Singing technique classification on VocalSet
    Image credits:
    https://www.universalmusic.com/aerosmith-and-universal-music-group-announce-historic-strategic-global-alliance/
    https://www.universal-music.co.jp/madonna/products/uics-1247/
    https://www.universal-music.co.jp/the-beatles/products/tycp-60013/


  16.
    20-way classification of the singer, given 5-s chunks
    Experiment: Singer identification
    - Artist20 dataset [Ellis 07]
      - 20 famous artists singing in English
      - Each artist has 6 albums
      - Train : Dev : Test = 4 : 1 : 1, split by album
      - Source-separated vocals, chunked into 5 s
      - Silence removed by RMS thresholding
    - Baseline: CRNN model [Hsieh 20]
    - Evaluation metrics (on the test set): F1-score, top-2 and top-3 accuracy (see the sketch below)
    Results:
    - All SSL models outperformed the baseline
    - Speech models are better on F1-score
    Ellis, D. (2007). Classifying music audio with timbral and chroma features. ISMIR 2007.
    Hsieh, T. H., Cheng, K. H., Fan, Z. C., Yang, Y. C., & Yang, Y. H. (2020). Addressing the confounds of accompaniments in singer identification. ICASSP 2020.
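    A minimal sketch of these metrics with scikit-learn (macro averaging for F1 is an assumption; the slide only says "F1-score"):

    ```python
    import numpy as np
    from sklearn.metrics import f1_score, top_k_accuracy_score

    def singer_id_scores(y_true, y_prob):
        # y_true: (N,) integer labels; y_prob: (N, 20) class probabilities
        y_pred = np.argmax(y_prob, axis=1)
        return {"F1": f1_score(y_true, y_pred, average="macro"),
                "Top-2": top_k_accuracy_score(y_true, y_prob, k=2),
                "Top-3": top_k_accuracy_score(y_true, y_prob, k=3)}
    ```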


  17.
    Transcribe the performance MIDI of the vocal part
    Experiment: Note transcription
    - MIR-ST500 [Wang 21]
      - 500 Chinese songs (400 for train, 100 for test), source-separated
      - Performance-MIDI annotation for each 5-s chunk
    Wang, J. Y., & Jang, J. S. R. (2021). On the preparation and validation of a large-scale dataset of singing transcription. ICASSP 2021.


  18.
    Evaluation metrics [Molina 14]
    Experiment: Note transcription
    - Estimate the onset, offset, and pitch (MIDI number) of each note
    - F1-scores that consider...
      - COn: onset only
      - COnP: onset + pitch
      - COnPOff: onset + pitch + offset
    - Tolerances: pitch ±50 cents; onset ±50 ms; offset ±20% of the note duration (see the sketch below)
    Molina, E., Barbancho-Perez, A. M., Tardon-Garcia, L. J., & Barbancho-Perez, I. (2014). Evaluation framework for automatic singing transcription. ISMIR 2014.
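    A minimal sketch of these metrics via mir_eval (intervals are (onset, offset) pairs in seconds; pitches are in Hz):

    ```python
    import mir_eval.transcription as mt

    def molina_f1(ref_intervals, ref_pitches, est_intervals, est_pitches):
        con = mt.onset_precision_recall_f1(
            ref_intervals, est_intervals, onset_tolerance=0.05)[2]
        conp = mt.precision_recall_f1_overlap(  # offset_ratio=None ignores offsets
            ref_intervals, ref_pitches, est_intervals, est_pitches,
            onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=None)[2]
        conpoff = mt.precision_recall_f1_overlap(  # offsets within 20% of duration
            ref_intervals, ref_pitches, est_intervals, est_pitches,
            onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2)[2]
        return {"COn": con, "COnP": conp, "COnPOff": conpoff}
    ```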


  19.
    Transcribe the performance MIDI of the vocal part
    Experiment: Note transcription
    - Baselines:
      - EfficientNet-b0 [Wang 21]
      - JDCnote [Kum 22]: utilizes pseudo-labels obtained via vocal melody extraction
      - Wav2Vec2-Large [Gu 23]: Wav2Vec2.0 Large model, last layer only
    Results:
    - Music models are better, especially on pitch
    - Comparable with Wav2Vec2-Large → the weighted sum is effective?
    Kum, S., Lee, J., Kim, K. L., Kim, T., & Nam, J. (2022). Pseudo-label transfer from frame-level to note-level in a teacher-student framework for singing transcription from polyphonic music. ICASSP 2022.
    Gu, X., Zeng, W., Zhang, J., Ou, L., & Wang, Y. (2023). Deep audio-visual singing voice transcription based on self-supervised learning models. arXiv preprint arXiv:2304.12082.


  20.
    Identify 10 techniques from given 3-s chunks
    Experiment: Singing technique classification
    - VocalSet 10-way singing technique classification [Wilkins 18]
      - 20 singers (15 for train, 5 for test), 10 techniques, isolated voice
      - Input is chunked into 3 s
      - Imbalanced technique distribution
    Wilkins, J., Seetharaman, P., Wahl, A., & Pardo, B. (2018). VocalSet: A singing voice dataset. ISMIR 2018.


  21.
    Identify 10 techniques from given 3-s chunks
    Experiment: Singing technique classification
    - Baselines:
      - 1DCNN: raw waveform + 1-D convolutions [Wilkins 18]
      - OblongCNN: multi-resolution spectrograms + oblong-shaped 2-D convolutions [Yamamoto 21]
      - D-CNN-cRT: OblongCNN + deformable convolution + classifier retraining [Yamamoto 22]
    - Every model (including the SSL ones) is trained with a weighted CE loss (see the sketch below)
    Results:
    - Map-Music2Vec is the best-performing model → locality is important for this task?
    - Imbalanced learning leaves room for improvement
    Wilkins, J., Seetharaman, P., Wahl, A., & Pardo, B. (2018). VocalSet: A singing voice dataset. ISMIR 2018.
    Yamamoto, Y., Nam, J., Terasawa, H., & Hiraga, Y. (2021). Investigating time-frequency representations for audio feature extraction in singing technique classification. APSIPA ASC 2021.
    Yamamoto, Y., Nam, J., & Terasawa, H. (2022). Deformable CNN and imbalance-aware feature learning for singing technique classification. Interspeech 2022.
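    A minimal sketch of the weighted CE loss in PyTorch (the inverse-frequency weighting and the example class counts are assumptions, not from the slide):

    ```python
    import torch
    import torch.nn as nn

    counts = torch.tensor([720., 240., 180., 600., 90., 310., 150., 400., 120., 260.])
    weights = counts.sum() / (len(counts) * counts)  # inverse-frequency class weights
    criterion = nn.CrossEntropyLoss(weight=weights)  # 10-way technique classification
    ```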


  22.
    Which layers contributed more?
    Evaluation: inspection of the learnt weights
    Inspect each layer's weight after training
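    A minimal sketch of reading back the learnt weights (assumes the LayerWeightedSum module sketched earlier):

    ```python
    import torch

    layer_sum = LayerWeightedSum(num_layers=13)  # a trained instance in practice
    with torch.no_grad():
        w = torch.softmax(layer_sum.weights, dim=0)
    for i, wi in enumerate(w.tolist()):
        print(f"layer {i:2d}: weight = {wi:.3f}")
    ```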


  23.
    Evaluation: inspection of the learnt weights
    - Singer identification: early layers are more important, as with speaker information [Chen 22]
    - Transcription: the importance is scattered → many kinds of information are incorporated?
      MERT showed a different pattern → due to the CQT reconstruction?
    - Technique classification: early layers are more important → pitch, loudness, and timbre;
      might be affected by the class imbalance


  24.
    We need further investigation
    Discussion & future work
    - Models from both domains have a certain potential for SVU
      - Every model showed comparable performance on each task
    - How can we adapt them to the singing voice further?
      - Ways of fine-tuning (e.g., Adapter, LoRA), domain adaptation, etc.
      - Settings of the downstream tasks
    - Investigation of more SVU tasks
      - Unexplored components: phoneme, lyrics, loudness, vocal mixing, etc.
      - Variations: singer diarization, pitch extraction, technique detection, etc.


  25.
    Pretrained SSL models for singing voice understanding (SVU)
    Take-home message
    - Tackled the difficulty and low-resource problem of SVU
      - Compared Wav2Vec2, WavLM, MERT, Map-Music2Vec
      - Layer-weighted sum + two-stage fine-tuning
    - Comparable to or outperforming SoTA models
    - The best-performing model differs by task
      - Music models are good at transcription
      - Speech models are good at singer identification
    - Future work: further adaptation to the singing voice, more tasks
    THANK YOU!!
