SIGMUS130-yamamoto

Presentation slides from the 130th meeting of the Special Interest Group on Music and Computer (SIGMUS), March 17, 2021.

Yuya Yamamoto

March 17, 2021

Transcript

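  1. Comparison of hand-crafted features and deep-learning features for singing technique classification
    Yuya Yamamoto1, Juhan Nam2, Hiroko Terasawa1, Yuzuru Hiraga1
    1: University of Tsukuba, 2: KAIST
    2021.03.17, 130th meeting of the Special Interest Group on Music and Computer (SIGMUS), oral presentation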
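  2. Background: singing technique classification
    • Singing technique: an expressive vocal device that a singer realizes by varying pitch, loudness, timbre, and so on
    • Being able to classify techniques helps in understanding how a singer sings
    • Since the release of the VocalSet dataset [Wilkins 18], deep-learning methods have been successful at singing technique classification [Luo 20, Pishdadian 19, and others]
    • Earlier work on classifying a single singing technique (for example vibrato [Dridger 16 and others]) relied on features built from expert knowledge (hand-crafted features)
    • This work compares these hand-crafted features with deep-learning features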
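  3. Prior work: comparing features for music classification [Abeßer 19]
    • On the task of classifying pitch contours, hand-crafted features (Pymus, Bitteli) were compared with features learned by deep learning (CNN)
    • Conclusion: "Even a CNN with no explicit modeling can reach the same discriminative power as hand-crafted features"
    • Additionally examined here: what happens when timbre comes into play? Which classes is each approach good or bad at?
    [Abeßer 19] J. Abeßer et al. Fundamental frequency contour classification: A comparison between hand-crafted and CNN-based features. ICASSP 2019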
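  4. Dataset used for classification: VocalSet [Wilkins 18]
    • A large-scale dataset of recorded singing techniques
    • 20 male and female singers, covering 10 different singing techniques
    • About 10.1 hours in total
    • One label per file
    • Sung over a wide variety of phrases: long tones, arpeggios, scales, and so on
    • The labels are imbalanced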
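  5. Feature extraction method 1: hand-crafted features
    • Features related to pitch variation and timbre are selected from those used for singer identification [Kroher 14]
    • MFCCs are adopted for timbre and vibrato features for pitch variation
    • Timbre features: MFCC, 20 dimensions
    • Pitch-variation features: vibrato features, 2 dimensions, depth (extent) and speed (rate)
    • Result: a 22-dimensional vector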
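As a rough illustration of how a 22-dimensional vector like the one on this slide could be assembled, the sketch below estimates vibrato rate and extent from an F0 contour and appends them to 20 MFCCs. The zero-crossing estimator, the function names (`vibrato_features`, `handcrafted_vector`), and the feature order are assumptions made for illustration; the slides do not say how the features taken from [Kroher 14] were actually computed.

```python
import math


def vibrato_features(f0_hz, frame_rate_hz):
    """Return (rate_hz, extent_cents) for a voiced F0 contour.

    Assumption: f0_hz contains only voiced frames (all values > 0).
    This is a simple zero-crossing estimator, not the method of [Kroher 14].
    """
    if len(f0_hz) < 2:
        raise ValueError("need at least two F0 frames")
    # Work in cents around the mean log-pitch so deviations are symmetric.
    log_f0 = [math.log2(f) for f in f0_hz]
    center = sum(log_f0) / len(log_f0)
    cents = [1200.0 * (v - center) for v in log_f0]
    # Rate: every full vibrato cycle crosses the center line twice.
    crossings = sum(1 for a, b in zip(cents, cents[1:]) if (a < 0) != (b < 0))
    duration_s = (len(cents) - 1) / frame_rate_hz
    rate_hz = crossings / (2.0 * duration_s)
    # Extent (depth): half of the peak-to-peak pitch deviation.
    extent_cents = (max(cents) - min(cents)) / 2.0
    return rate_hz, extent_cents


def handcrafted_vector(mfcc20, f0_hz, frame_rate_hz):
    """Concatenate 20 MFCCs with the 2 vibrato features (22 values)."""
    if len(mfcc20) != 20:
        raise ValueError("expected 20 MFCC coefficients")
    rate_hz, extent_cents = vibrato_features(f0_hz, frame_rate_hz)
    return list(mfcc20) + [rate_hz, extent_cents]
```

The MFCCs are taken as given here; in practice they would come from an audio feature library and be summarized per file before concatenation.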
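  6. Experiment
    • Classification experiment on VocalSet
    • VocalSet is split into training and test sets at a ratio of 8:2
    • Both conditions use a Random Forest [Breiman 01] as the classifier
    • The number of decision trees is 50
    • In practice, a Balanced Random Forest [Chen 04], which accounts for the imbalance in label counts, is used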
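The split and the class-balancing idea on this slide can be sketched without any machine-learning library. The code below is a minimal illustration, assuming a per-class (stratified) 8:2 split and a multi-class reading of the balanced bootstrap from [Chen 04], in which each tree sees the same number of examples from every class. Neither assumption is confirmed by the slides, and the function names are hypothetical.

```python
import random
from collections import defaultdict

N_TREES = 50  # number of decision trees stated on the slide


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Split example indices into train/test, class by class (assumed)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_ratio))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


def balanced_bootstrap(train_idx, labels, rng):
    """Draw one class-balanced bootstrap sample for a single tree.

    Every class contributes as many draws (with replacement) as the
    smallest class has training examples.
    """
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.choices(by_class[label], k=n_min))
    return sample
```

A forest would then train one tree per sample, for example on `[balanced_bootstrap(train, labels, rng) for _ in range(N_TREES)]`.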
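  7. Results and analysis
    1. How accurately was each technique classified?
      • Overall and per-class accuracy
    2. Do the features yield representations that can discriminate between the classes?
      • Dimensionality-reduction plots with tSNE
    3. Which features were effective for classification?
      • Inspection of Random Forest feature importance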
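The first analysis item, overall and per-class accuracy, amounts to the computation sketched below. The function name `accuracy_report` is hypothetical, and per-class accuracy is taken here to mean the fraction of each class's test examples that were predicted correctly, which is an assumption about how the slides define it.

```python
from collections import Counter


def accuracy_report(y_true, y_pred):
    """Return (overall_accuracy, {class: per_class_accuracy})."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    totals = Counter(y_true)
    # Count correct predictions, keyed by the true class.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    overall = sum(correct.values()) / len(y_true)
    per_class = {label: correct[label] / n for label, n in totals.items()}
    return overall, per_class
```

With imbalanced labels, the overall figure is dominated by the frequent classes, which is why the per-class breakdown is reported alongside it.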
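  8. Results ①: accuracy
    • Overall accuracy
      • Hand-crafted features: 0.710
      • Deep-learning features: 0.736
    • Per-class accuracy
      • Straight, Vibrato, Vocal fry: hand-crafted > deep learning
      • All other singing techniques: hand-crafted < deep learning
    [Figure: per-class accuracy chart; legend: HC (hand-crafted), Deep (deep learning)]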
  9. Results ③: Random Forest feature importance
    • Random Forest feature importance was computed for the hand-crafted features
    • The vibrato features and the low-order MFCC coefficients have relatively high importance
    [Figure: bar chart of hand-crafted feature importance, axis from 0 to 0.1. In descending order: Vibrato extent, Vibrato rate, MFCC-3, MFCC-1, MFCC-2, followed by the remaining MFCC coefficients]
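  10. Discussion of the results
    • From the classification accuracy and the tSNE plots
      • The deep-learning method holds up well against the hand-crafted method
        -> Can features useful for classification be learned without supplying much explicit expert knowledge?
    • From the feature-importance analysis
      • In the hand-crafted method, the vibrato features and the low-order MFCC coefficients have high importance
        -> Taken together with the high accuracy on Straight and Vibrato, are the vibrato features alone enough to tell these two apart?
        -> Are hand-crafted features still usable for techniques whose properties are well understood?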
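  11. Summary
    • What was done
      • Investigated the performance of features extracted automatically by deep learning
      • Compared them with hand-crafted features
    • Results
      • The deep-learning method is 2.6 points more accurate, so it holds up well against the hand-crafted method
      • It performs especially well on singing techniques that depart strongly from ordinary singing
      • The hand-crafted method performs very well on Straight and Vibrato
      • Judging from feature importance, are the vibrato features critical for distinguishing those two?
    • Next steps
      • Experiments in more settings (more inputs, datasets, and other conditions to compare)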
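The gap quoted in the summary follows directly from the two overall accuracies reported in the results (0.710 and 0.736). The short sketch below, with a hypothetical helper name, separates the absolute difference in percentage points from the relative improvement, since the two are easy to confuse.

```python
def accuracy_gain(baseline, improved):
    """Return (gain in percentage points, relative gain in percent)."""
    diff = improved - baseline
    return diff * 100.0, diff / baseline * 100.0


# Overall accuracies taken from the results slide.
points, relative = accuracy_gain(baseline=0.710, improved=0.736)
print(f"{points:.1f} points absolute, {relative:.1f}% relative")
```

The absolute difference is 2.6 percentage points, while the improvement relative to the hand-crafted baseline is about 3.7%.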
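  12. Future work
    • Add more conditions to the comparison
    • Hand-crafted
      • Add other features with a record of success, such as Formant, Jitter, and Shimmer
      • Add the Wavelet Scattering Transform, the method that achieved SoTA in instrument playing-technique recognition [Wang 20]
    • Deep learning
      • Change the input features: raw waveform, STFT spectrogram, F0, and so on
      • Change the architecture, for example by placing an RNN after the CNN layers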
  13. References
    • [Wilkins 18] J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, "VocalSet: A singing voice dataset," in ISMIR 2018, 2018.
    • [Abeßer 19] J. Abeßer and M. Müller, "Fundamental frequency contour classification: A comparison between hand-crafted and CNN-based features," in ICASSP 2019, 2019.
    • [Kroher 14] N. Kroher and E. Gómez, "Automatic singer identification for improvisational styles based on vibrato, timbre and statistical performance descriptors," in ICMC-SMC 2014, 2014.
    • [Breiman 01] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
    • [Chen 04] C. Chen, A. Liaw, and L. Breiman, "Using random forest to learn imbalanced data," Technical Report No. 666, 2004.
    • [Luo 20] Y.-J. Luo, C.-C. Hsu, K. Agres, and D. Herremans, "Singing voice conversion with disentangled representations of singer and vocal technique using variational autoencoders," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3277–3281, IEEE, 2020.
    • [Pishdadian 19] F. Pishdadian, B. Kim, P. Seetharaman, and B. Pardo, "Classifying non-speech vocals: Deep vs signal processing representations," in Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
    • [Dridger 16] J. Driedger, S. Balke, S. Ewert, and M. Müller, "Template-based vibrato analysis in music signals," in ISMIR 2016, 2016.
    • [Wang 20] C. Wang, V. Lostanlen, E. Benetos, and E. Chew, "Playing technique recognition by joint time–frequency scattering," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 881–885, IEEE, 2020.
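  14. What lies beyond this work?
    • Singing technique classification
    • Singing technique detection from a cappella singing
    • Singing technique detection from singing with accompaniment
    • Barriers along the way: the dataset barrier, and the barrier of handling accompaniment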
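  15. Noisy labels and weak labels in VocalSet
    • Noisy labels (the label information is inaccurate)
    • Weak labels (the technique appears only in part of the recording)
    • Examples shown: "Breathy, arpeggio, male voice" and "Belt, scale, female voice"
    • Pitch variation that is clearly vibrato-like is present, yet there is no Vibrato label
    • Vocal fry is actually produced only in the frames inside the boxed region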