
SIGMUS130-yamamoto


Presentation slides from the 130th SIGMUS meeting (音楽情報科学研究会), March 17, 2021

Yuya Yamamoto

March 17, 2021

Transcript

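  1. Comparison of hand-crafted features and deep learning features for singing technique classification
    Yuya Yamamoto1, Juhan Nam2, Hiroko Terasawa1, Yuzuru Hiraga1
    1: University of Tsukuba, 2: KAIST
    2021.03.17, 130th SIGMUS meeting (音楽情報科学研究会), oral presentation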

  2. Please help us!!
    • The points we would especially like to discuss have been written up in advance on Scrapbox. We would be grateful for your feedback!

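  3. Overview
    Singing technique classification
    We compare hand-crafted features, designed from expert knowledge, with features extracted automatically by deep learning, on the singing technique classification problem
    Features extracted by a CNN VS hand-crafted features related to pitch and timbre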

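  4. Background: singing technique classification
    • Singing technique: an expressive singing method that a singer realizes by varying pitch, loudness, timbre, and so on
    • Being able to classify techniques helps us understand how a singer sings
    • Since the VocalSet dataset [Wilkins 18] appeared, deep learning methods have been successful at singing technique classification [Luo 20, Pishdadian 19, and others]
    • Earlier work on identifying a single singing technique (e.g. vibrato [Dridger 16, and others]) instead used features based on expert knowledge (hand-crafted features)
    • This work compares these hand-crafted features with deep learning features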

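  5. Prior work: comparing features for music classification [Abeßer 19]
    • On the problem of classifying pitch contours, hand-crafted features (Pymus, Bitteli) were compared with features learned by deep learning (CNN)
    • Conclusion: "even a CNN without explicit modeling can achieve discriminative power on par with hand-crafted features"
    [Abeßer 19] J. Abeßer et al. Fundamental frequency contour classification: A comparison between hand-crafted and cnn-based features. ICASSP2019
    What happens when timbre is added?
    Which classes is each approach good or bad at?
    We additionally examine questions such as these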

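  6. Dataset used for classification: VocalSet [Wilkins 18]
    • A large-scale dataset of recorded singing techniques
    • 20 male and female singers, covering 10 different singing techniques
    • About 10.1 hours in total
    • One label per file
    • Sung over a variety of phrases such as long tones, arpeggios, and scales
    • The labels are imbalanced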

  7. Method

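  8. Problem setting
    • Examine how well each feature set performs on a 10-class singing technique classification problem
    • Control the experiment by using the same classifier in both conditions, so that only the feature extraction differs
    Fixed: classifier
    Compared: feature extraction method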

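  9. Feature extraction method 1: hand-crafted features
    • From the features used for singer identification [Kroher 14], we selected those related to pitch variation and timbre
    MFCCs are adopted for timbre and vibrato features for pitch variation
    Timbre features: MFCC, 20 dimensions
    Pitch-variation features: vibrato features, 2 dimensions, depth (extent) and speed (rate)
    -> a 22-dimensional vector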
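
As a rough illustration of the two vibrato dimensions, the sketch below estimates rate and extent from an F0 contour and appends them to a 20-dimensional MFCC vector to form the 22-dimensional feature. The estimator (zero crossings of the mean-removed contour in cents) and the names `vibrato_rate_extent`, `handcrafted_vector`, and `frame_rate` are assumptions made for illustration; the slides do not say how these features were actually computed.

```python
import math


def vibrato_rate_extent(f0_hz, frame_rate):
    """Estimate vibrato rate (Hz) and extent (cents) from an F0 contour.

    Illustrative estimator only: rate from zero crossings of the
    mean-removed contour, extent as half the peak-to-peak deviation.
    """
    # Convert Hz to cents relative to the geometric-mean pitch.
    log_f0 = [math.log2(f) for f in f0_hz]
    center = sum(log_f0) / len(log_f0)
    cents = [1200.0 * (v - center) for v in log_f0]

    # Each full vibrato cycle crosses zero twice.
    crossings = sum(1 for a, b in zip(cents, cents[1:]) if a * b < 0)
    duration = (len(cents) - 1) / frame_rate
    rate = crossings / (2.0 * duration)

    extent = (max(cents) - min(cents)) / 2.0
    return rate, extent


def handcrafted_vector(mfcc_20, f0_hz, frame_rate):
    """Concatenate 20 MFCCs with the 2 vibrato features (22 dims)."""
    if len(mfcc_20) != 20:
        raise ValueError("expected 20 MFCC coefficients")
    rate, extent = vibrato_rate_extent(f0_hz, frame_rate)
    return list(mfcc_20) + [rate, extent]


# Synthetic contour: 6 Hz vibrato, +/-50 cents around 220 Hz, 100 frames/s.
frame_rate = 100
f0 = [
    220.0 * 2 ** (50 / 1200 * math.sin(2 * math.pi * 6 * (n + 0.5) / frame_rate))
    for n in range(200)
]
rate, extent = vibrato_rate_extent(f0, frame_rate)
vector = handcrafted_vector([0.0] * 20, f0, frame_rate)  # MFCCs are placeholders
```

A real F0 track is noisy and contains unvoiced frames, so a practical version would need smoothing and voicing handling before counting zero crossings.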

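  10. Feature extraction method 2: automatic feature extraction via deep learning
    A 4-layer CNN that takes a mel spectrogram as input and produces the features
    • Training: trained in the same way as ordinary supervised learning
    • Prediction: the output of the first fully connected layer is used as the feature vector and fed to the classifier
    • For fairness, the feature vector has 22 dimensions, the same as the hand-crafted features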

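  11. Experiment
    • Classification experiment on VocalSet
    • VocalSet is split into training : test = 8 : 2
    • Random Forest [Breiman 01] is used as the classifier in both conditions
    • The number of decision trees is 50
    • In practice we use Balanced Random Forest [Chen 04], which accounts for the imbalance in label counts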
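
The core idea of Balanced Random Forest [Chen 04] is that each tree is grown on a bootstrap sample that draws equally from the classes, so rare classes are not swamped. The sketch below shows only that sampling step plus an 8:2 split. The function names, the seed, the toy label counts, and the multi-class rule (every class sampled down to the size of the rarest class) are assumptions; the slides do not state whether the split was stratified or grouped by singer.

```python
import random
from collections import defaultdict


def split_80_20(items, seed=0):
    """Shuffle and split items into 80% train / 20% test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def balanced_bootstrap(labels, rng):
    """Return indices for one tree: equal draws (with replacement) per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Every class is sampled down to the size of the rarest class.
    n_per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_per_class))
    return sample


# Toy imbalanced label set (hypothetical counts, not VocalSet's).
labels = ["straight"] * 60 + ["vibrato"] * 30 + ["vocal_fry"] * 10
train_idx, test_idx = split_80_20(range(len(labels)))
train_labels = [labels[i] for i in train_idx]

rng = random.Random(0)
n_trees = 50  # number of trees stated on the slide
bootstraps = [balanced_bootstrap(train_labels, rng) for _ in range(n_trees)]
```

Each entry of `bootstraps` would be the training subset for one decision tree; growing the trees themselves is left to whichever Random Forest implementation is used.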

  12. Results

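  13. Results and analysis
    1. How accurately was each technique classified?
    • Overall and per-class accuracy
    2. Do the features yield representations that can discriminate between the techniques?
    • Dimensionality-reduction plot with tSNE
    3. Which features were effective for classification?
    • Inspection of Random Forest feature importance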

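  14. Result ①: accuracy
    • Overall accuracy
    • Hand-crafted features: 0.710
    • Deep learning features: 0.736
    • Per-class accuracy
    • Straight, Vibrato, Vocal fry: Hand-crafted > deep learning
    • All other singing techniques: Hand-crafted < deep learning
    [Figure legend: HC (hand-crafted), Deep (deep learning)]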
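
For reference, overall and per-class accuracy can be computed as sketched below. Per-class accuracy is taken here to mean the fraction of clips of each true class that were classified correctly; that reading, the function names, and the toy predictions are assumptions, not details from the slides. The last line reproduces the gap between the two reported overall accuracies.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """For each true class, the fraction of its clips predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical predictions for six clips.
y_true = ["straight", "straight", "vibrato", "vibrato", "vocal_fry", "belt"]
y_pred = ["straight", "vibrato", "vibrato", "vibrato", "vocal_fry", "straight"]

acc = overall_accuracy(y_true, y_pred)  # 4 of 6 clips correct
by_class = per_class_accuracy(y_true, y_pred)

# Gap between the two reported overall accuracies, in percentage points.
gap_points = round((0.736 - 0.710) * 100, 1)
```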

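  15. Result ②: dimensionality reduction of the feature vectors with tSNE
    The feature vectors were reduced with tSNE and the within-class scatter was inspected visually
    With the deep learning method (right), samples of the same class and of similar classes cluster together better -> better suited to classification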

  16. Result ③: Random Forest feature importance
    Random Forest feature importance was computed for the hand-crafted features
    The vibrato features and the low-order MFCC coefficients have relatively high importance
    [Figure: hand-crafted feature importance, axis range 0 to 0.1. Features in plotted order: Vibrato extent, Vibrato rate, MFCC-3, MFCC-1, MFCC-2, MFCC-7, MFCC-10, MFCC-14, MFCC-5, MFCC-20, MFCC-6, MFCC-12, MFCC-11, MFCC-4, MFCC-9, MFCC-13, MFCC-19, MFCC-7, MFCC-18, MFCC-17, MFCC-16, MFCC-15]
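
The slides do not say which importance measure was used. Assuming the common impurity-based definition, in which a feature's importance accumulates the weighted Gini impurity decrease of every split that uses it (averaged over trees), the contribution of a single split can be sketched as follows. The function names and toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Weighted Gini decrease from splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children


# Toy data: vibrato extent (cents) separates the classes; a noise feature does not.
labels = ["straight", "straight", "vibrato", "vibrato"]
vibrato_extent = [3.0, 5.0, 60.0, 80.0]
noise_feature = [0.1, 0.9, 0.2, 0.8]

gain_extent = impurity_decrease(vibrato_extent, labels, threshold=30.0)
gain_noise = impurity_decrease(noise_feature, labels, threshold=0.5)
```

In this toy case the split on vibrato extent removes all impurity while the split on the noise feature removes none, which is the kind of difference that an importance ranking summarizes.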

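  17. Discussion of results
    • From the classification accuracy and the tSNE plots
    • The deep learning method is on par with the hand-crafted method
    -> Can features useful for classification be learned without supplying much explicit expert knowledge?
    • From the feature importance analysis
    • In the hand-crafted method, the vibrato features and the low-order MFCC coefficients have high importance
    -> Given that accuracy was also high for Straight and Vibrato, are the vibrato features alone enough to tell these two apart?
    -> Are hand-crafted features still usable for techniques whose properties are well understood?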

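  18. Summary
    • What we did
    • Investigated the performance of features extracted automatically by deep learning
    • Compared them with hand-crafted features
    • Results
    • The deep learning method's accuracy is 2.6 percentage points higher (0.736 vs 0.710), so it is on par with the hand-crafted method
    • It performs especially well on singing techniques far removed from ordinary singing
    • The hand-crafted method performs well on Straight and Vibrato
    • Feature importance suggests the vibrato features may be critical for classifying those two
    • Future work
    • Experiments in more settings (more inputs, datasets, and other conditions to compare)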

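  19. Future work
    • Add more conditions to compare
    • Hand-crafted
    • Add other features with a record of success, such as Formant, Jitter, and Shimmer, to the comparison
    • Add the Wavelet Scattering Transform, the method that achieved SoTA in instrument playing technique recognition [Wang 20]
    • Deep learning
    • Change the input features: signal waveform, STFT spectrogram, F0, etc.
    • Change the architecture, e.g. placing an RNN after the CNN layers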

  20. References
    • [Wilkins 18] J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, "VocalSet: A singing voice dataset," in ISMIR 2018, 2018.
    • [Abeßer 19] J. Abeßer and M. Müller, "Fundamental Frequency Contour Classification: A Comparison between Hand-crafted and CNN-based Features," in ICASSP 2019, 2019.
    • [Kroher 14] N. Kroher and E. Gomez, "Automatic singer identification for improvisational styles based on vibrato, timbre and statistical performance descriptors," in ICMC-SMC 2014, 2014.
    • [Breiman 01] L. Breiman, "Random forests," Machine Learning, Vol. 45, No. 1, pp. 5-32, 2001.
    • [Chen 04] C. Chen, A. Liaw, and L. Breiman, "Using Random Forest to Learn Imbalanced Data," Technical Report, No. 666, 2004.
    • [Luo 20] Y.-J. Luo, C.-C. Hsu, K. Agres, and D. Herremans, "Singing voice conversion with disentangled representations of singer and vocal technique using variational autoencoders," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3277–3281, IEEE, 2020.
    • [Pishdadian 19] F. Pishdadian, B. Kim, P. Seetharaman, and B. Pardo, "Classifying non-speech vocals: Deep vs signal processing representations," in Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
    • [Dridger 16] J. Driedger, S. Balke, S. Ewert, and M. Müller, "Template-based vibrato analysis in music signals," in ISMIR 2016, 2016.
    • [Wang 20] C. Wang, V. Lostanlen, E. Benetos, and E. Chew, "Playing technique recognition by joint time–frequency scattering," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 881–885, IEEE, 2020.

  21. Supplementary slides

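  22. What lies beyond this WORK?
    [Diagram: from singing technique classification toward singing technique detection in a cappella singing, and then in singing with accompaniment; the remaining labels refer to the dataset and to handling accompaniment]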

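  23. Noisy labels and weak labels in VocalSet
    • Noisy labels (the label information is inaccurate)
    • Weak labels (the technique appears in only part of the clip)
    Examples: Breathy, arpeggio, male voice; Belt, scale, female voice
    Pitch variation that clearly resembles vibrato is visible, yet there is no Vibrato label
    Vocal fry is actually produced only in the frames inside the light-blue box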

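  24. Q. Couldn't hand-crafted features be designed to handle each technique in detail?
    • A. Not impossible, but there would be no end to it once future applications are considered
    • The range of singing techniques is very wide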

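  25. Confusion matrix
    [Figure: confusion matrices, panels: Hand-crafted, Deep learning]
    The misclassification tendencies of the two are similar
    -> Are there cases that cannot be classified at all? (see p.23)
    With hand-crafted features, many clips were classified as Straight or Vibrato
    -> Is it failing to capture the fine differences between techniques?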
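
A confusion matrix of the kind compared on this slide can be tallied as in the sketch below, which also picks out the most frequent misclassification. The row/column convention (rows are true classes), the function names, and the toy labels are assumptions for illustration.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count clips per (true class, predicted class); rows are true classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def most_confused(matrix, classes):
    """Return (true, predicted, count) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (classes[i], classes[j], count)
    return best


# Hypothetical predictions in which other classes collapse into "straight".
classes = ["straight", "vibrato", "breathy"]
y_true = ["straight", "straight", "vibrato", "vibrato", "breathy", "breathy", "breathy"]
y_pred = ["straight", "straight", "vibrato", "straight", "straight", "straight", "breathy"]

matrix = confusion_matrix(y_true, y_pred, classes)
worst = most_confused(matrix, classes)
```

Comparing the off-diagonal cells of two such matrices is one way to check whether two feature sets fail on the same class pairs.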