
GMI44 @ Survey Report on Music and Language Research

Yuya Yamamoto
March 11, 2024

Presentation given at the 44th GMI Workshop, held on March 11, 2024.


Transcript

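  1. Self-introduction
    • Yuya Yamamoto
    • Third-year PhD student, Terasawa Lab, Graduate School of the University of Tsukuba (expected to complete the program this academic year)
    • Research on the analysis and detection of singing techniques
    • I like compiling information:
      • https://github.com/yamathcy/awesome-music-informatics
      • https://github.com/yamathcy/ISMIR-2023-Papers
      • etc. …
    • Prof. Kitahara kindly invited me, and I signed up to present on an impulse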
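  2. Today's contents
    • A survey report on research at the intersection of natural language processing and music information processing
      • Music -> Text (captioning, question answering, etc.)
      • Multimodal representation learning of music + text
    • Lyrics generation and text-based music generation are out of scope this time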
  3. Music and Language (:= M&L)
    • A field that connects music information processing with natural language processing
      • Extracting information from music as text (captioning, Q&A)
      • Searching for music with words (music search), etc.
    • Like Vision & Language, it is one area of multimodal learning with language
    • Examples:
      • Captioning: "It's a jazz music piece with slow tempo. Piano, muted-trumpet, tenor sax, bass, and drums are appeared."
      • Q&A: "Question: Is this music a vocal song? Answer: No. It's an instrumental song."
      • Music search: query "a jazz piece in which a muted trumpet and a tenor sax play the melody", score: 0.91
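  4. Why natural language?
    • 1. It allows open-vocabulary descriptions of information
      • ↔ Classification problems such as tagging are bound to the classes of the dataset
      • (In theory) fine nuances can be described
    • 2. It implicitly solves several problems at the same time
      • ↔ In classification problems, a separate model has to be built for each task
    • Figure, tagging: rock, guitar, intense, '80s, …
    • Figure, captioning: "This is an '80s rock song with a fast tempo, in which the drums pound intensely along with the sound of the guitar. …"
    • Figure: the same caption covers mood classification, genre classification, and instrument classification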
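  5. M&L is hot right now
    • At the international conference ISMIR, M&L research has increased since around 2021
      • It also appears among the best paper nominees every year
    • MusicLM [Agostinelli 23] was announced in early 2023, and the number of studies grew accordingly
      • The same paper also released MusicCaps, a paired music-caption dataset
    • The field has expanded beyond the audio domain into the symbolic domain
      • CLaMP [SWu 23]
    [Agostinelli 23] MusicLM: Generating Music From Text. Andrea Agostinelli et al. arXiv.
    [SWu 23] CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval. Shangda Wu et al. ISMIR.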
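  6. Before that…
    • Some basic terminology
      • Encoder: converts the input media information into a lower-dimensional vector
        • e.g. BERT, RoBERTa
      • Embedding: a dense vector representation of media information. The distance between vectors in the space indicates semantic closeness
      • Decoder: infers the sequence of output media information from the embedding
        • e.g. GPT, LLaMA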
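  7. Music -> Text ① Captioning type
    • Input: music audio signal
    • The signal is passed through an Audio Encoder to obtain an embedding vector
    • Inside the Text Decoder, the audio embedding is fused with the Text Decoder's intermediate outputs
    • The Text Decoder generates the text token by token
    • MusCaps [Manco 21], LP-MusicCaps [Doh 23], etc.
    • Diagram: input audio signal → Audio Encoder → Audio Embedding → some mechanism for fusing the two modalities → Text Decoder. Decoder input (word tokens from the previous steps): "[Start] Rock song with …". Output text: "Rock song with piano …"
    [Manco 21] MusCaps: Generating Captions for Music Audio. Ilaria Manco et al. IJCNN.
    [Doh 23] LP-MusicCaps: LLM-Based Pseudo Music Captioning. Seungheon Doh et al. ISMIR.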
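The token-by-token generation described on this slide can be sketched as a greedy decoding loop. This is a toy illustration only: `BIGRAM` is a hypothetical table standing in for a trained Text Decoder, and `audio_bias` is a hypothetical stand-in for the fused audio embedding (here it simply adds to the token scores). None of these names come from MusCaps or LP-MusicCaps.

```python
VOCAB = ["[Start]", "[End]", "rock", "jazz", "song", "with", "piano", "guitar"]

# Hypothetical next-token scores standing in for a trained Text Decoder.
BIGRAM = {
    "[Start]": {"rock": 1.0, "jazz": 1.0},
    "rock": {"song": 2.0},
    "jazz": {"song": 2.0},
    "song": {"with": 2.0, "[End]": 1.0},
    "with": {"piano": 1.0, "guitar": 1.0},
    "piano": {"[End]": 2.0},
    "guitar": {"[End]": 2.0},
}


def decoder_step(audio_bias, prev_token):
    """Score every vocabulary token from the previous token plus the audio conditioning."""
    return {
        tok: BIGRAM[prev_token].get(tok, -5.0) + audio_bias.get(tok, 0.0)
        for tok in VOCAB
    }


def greedy_caption(audio_bias, max_len=10):
    """Generate a caption one token at a time, always taking the best-scoring token."""
    tokens = ["[Start]"]
    while len(tokens) < max_len:
        scores = decoder_step(audio_bias, tokens[-1])
        next_token = max(scores, key=scores.get)
        if next_token == "[End]":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])  # drop the [Start] token


caption = greedy_caption({"rock": 0.5, "piano": 0.5})
```

Changing only the audio conditioning changes the caption, which is the role the audio embedding plays in the real architecture. Real systems use a learned decoder and often beam search or sampling instead of pure greedy selection.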
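  8. Techniques for modality fusion
    • Concat: simply concatenate the vectors of the two modalities
    • FiLM [Perez 18]: one modality determines the coefficients of an affine transformation applied to the other modality
    • Cross-attention: a method specific to Transformer-based models. In M&L, the Query and the Key & Value are obtained from different modalities
    [Perez 18] FiLM: Visual Reasoning with a General Conditioning Layer. Perez et al. AAAI 2018.
    https://vaclavkosar.com/ml/cross-attention-in-transformer-architecture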
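The three fusion techniques can be compared in a minimal pure-Python sketch. It rests on simplifying assumptions: in a real model the FiLM coefficients `gamma` and `beta` are predicted from the conditioning modality by a learned layer, and cross-attention applies learned projections to obtain queries, keys, and values. Here they are passed in directly, and all function names are illustrative.

```python
import math


def concat_fusion(audio_vec, text_vec):
    """Concat: join the two modality vectors into one longer vector."""
    return list(audio_vec) + list(text_vec)


def film_fusion(features, gamma, beta):
    """FiLM: element-wise affine transform; gamma and beta come from the other modality."""
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from one modality, keys and values from the other."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Text-side queries attend over two audio frames (keys and values).
attended = cross_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

In the last call the query matches the first key more closely, so the output leans toward the first value vector.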
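  9. Dataset: MusicCaps [Agostinelli 23]
    • A paired dataset of music and captions
    • The first open dataset of music-caption pairs
    • Professional musicians annotated 5521 music clips from YouTube videos with descriptive text, about 15 hours in total
    • https://www.kaggle.com/datasets/googleai/musiccaps
    • Used for training and evaluating music captioning
    • Example: audio clip from a YouTube video, with the caption "A bagpipe ensemble playing a quick melody in unison, employing trills to ornament it"
    • Note: the amount of data is somewhat insufficient
    [Agostinelli 23] MusicLM: Generating Music From Text. Andrea Agostinelli et al. arXiv.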
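  10. Expanding the data by generating pseudo-captions from the tags of existing tagging datasets: LP-MusicCaps [Doh 23]
    • GPT-3.5 Turbo generates pseudo-captions from the tag-only datasets MagnaTagATune [Law 09] and the Million Song Dataset [Bertin-Mahieux 11]
    • This yields about 4000 hours of paired data, more than 200 times the size of MusicCaps
    • Performance improved even though the captions are pseudo-captions
    [Doh 23] LP-MusicCaps: LLM-Based Pseudo Music Captioning. Seungheon Doh et al. ISMIR.
    [Law 09] Evaluation of Algorithms Using Games: The Case of Music Tagging. E. Law et al. ISMIR.
    [Bertin-Mahieux 11] The Million Song Dataset. T. Bertin-Mahieux et al. ISMIR.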
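The tag-to-caption step can be pictured as building an instruction for an LLM from a track's tag list. The sketch below only constructs the prompt text. The wording is a hypothetical example, not the instruction used in LP-MusicCaps, and the call that would send it to a model is deliberately left out.

```python
def build_pseudo_caption_prompt(tags):
    """Build an instruction asking an LLM to turn a tag list into one caption."""
    # Normalise, de-duplicate, and sort the tags so the prompt is reproducible.
    cleaned = sorted({tag.strip().lower() for tag in tags if tag.strip()})
    if not cleaned:
        raise ValueError("at least one tag is required")
    return (
        "Write a one-sentence description of a music clip "
        "using only the following tags, without adding new facts: "
        + ", ".join(cleaned)
    )


prompt = build_pseudo_caption_prompt(["Rock", "guitar", "intense", "rock "])
```

Restricting the model to the given tags is one plausible way to limit invented details in the generated captions.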
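  11. Music -> Text ② Q&A type
    • Input: music audio signal and text
    • The audio signal is passed through an Audio Encoder to obtain an audio embedding
    • Partway through, the audio embedding is fused with the intermediate outputs of the text side
    • The input text is either
      • ① processed by a Text Encoder, or
      • ② given to the Text Decoder as an instruction
    • The Text Decoder outputs the text
    • ① BART-fusion [Zhang 22], ② MU-LLaMA [Liu 24], etc.
    [Zhang 22] Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model. ISMIR.
    [Liu 24] Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning. Liu et al. ICASSP (accepted).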
  12. Music Understanding LLaMA [Liu 24]
    • Enables Q&A about music signals
    • Builds on pretrained models
      • Audio Encoder: MERT [Li 24]
      • Text Decoder: LLaMA-2 [Touvron 23]
    • The output of MERT is passed through a mechanism called an Adapter and then fed into the final layer(s) of the decoder
    [Touvron 23] Llama 2: Open Foundation and Fine-Tuned Chat Models. Touvron et al. arXiv.
    [Li 24] MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training. Y. Li et al. ICLR.
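  13. Dataset: MusicQA [Liu 24]
    • Q&A pairs are generated by an LLM from MusicCaps (captions) and MagnaTagATune (tags)
    • Questions are created in two ways
      • Closed-ended: a question is given in the instruction, and the model generates its answer
      • Open-ended: the model generates the question itself and then also generates the answer
    [Liu 24] Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning. Liu et al. ICASSP (accepted).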
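  14. What multimodal representation learning makes possible
    • Music retrieval from text
      • The text embedding of a sentence can be linked to the audio embeddings of the music in a library
      • Example query: "a jazz piece in which a muted trumpet and a tenor sax play the melody"
    • Acquiring music features that carry contextual information
      • The learned features are already discriminative, so zero-shot learning and transfer learning to low-resource tasks look promising
    • It cannot generate music on its own, but MusicLM uses it as one component, the module that ties text prompts to music [Agostinelli 23]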
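Text-based music retrieval over a shared embedding space reduces to ranking the library's audio embeddings by similarity to the query's text embedding. The sketch below assumes cosine similarity and uses made-up three-dimensional vectors. In practice both embeddings would come from trained text and audio encoders, and the track names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means the same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(text_embedding, library, top_k=2):
    """Return the top_k (track name, score) pairs, most similar first."""
    scored = [
        (cosine_similarity(text_embedding, audio_embedding), name)
        for name, audio_embedding in library.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Toy audio embeddings for a small music library (illustrative values only).
LIBRARY = {
    "jazz_trumpet_sax": [0.9, 0.1, 0.0],
    "rock_guitar": [0.0, 0.2, 0.9],
    "solo_piano": [0.3, 0.9, 0.1],
}

# Toy text embedding for a query about a jazz piece with trumpet and sax.
results = search([1.0, 0.2, 0.0], LIBRARY)
```

Because the audio embeddings can be computed once and stored, only the query text needs to be encoded at search time.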
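  15. There are several ways to achieve this
    • Triplet learning
      • Prepare an anchor, a positive example (something similar), and a negative example (something different), and make the anchor-negative distance larger than the anchor-positive distance
      • Zero-shot music tagging [Choi 19], tag-based music retrieval [Won 21], etc.
    • Contrastive learning
      • Minimize the distance between the anchor and the positive example, and maximize the distance between the anchor and the negative examples
      • A framework that proved successful in CLIP [Radford 21], an image-text model
    • The two have also been compared: contrastive learning can optimize over more samples at once [Doh 23]
    [Choi 19] Zero-shot Learning for Audio-based Music Classification and Tagging. J. Choi et al. ISMIR.
    [Won 21] Multimodal Metric Learning for Tag-based Music Retrieval. M. Won et al. ICASSP.
    [Radford 21] Learning Transferable Visual Models From Natural Language Supervision. Radford et al. arXiv.
    [Doh 23] Toward Universal Text-to-Music Retrieval. Doh et al. ICASSP.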
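A common way to write the contrastive objective is the symmetric, CLIP-style cross-entropy over a batch of paired embeddings, sketched below in pure Python. The temperature value and the toy vectors are assumptions for illustration, and the exact loss used by any particular paper cited here may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    """Symmetric cross-entropy: pair i is the positive, all other pairs in the batch are negatives."""
    audio = [l2_normalize(v) for v in audio_embs]
    text = [l2_normalize(v) for v in text_embs]
    n = len(audio)
    logits = [
        [sum(a * t for a, t in zip(audio[i], text[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio -> text direction: each row should peak on its diagonal entry.
    loss_a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> audio direction: each column should peak on its diagonal entry.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_a2t + loss_t2a) / 2


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Every other item in the batch serves as a negative for each anchor, which is one way to read the remark that contrastive learning optimizes over more samples at once than a single triplet does.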
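  16. Triplet learning
    • Prepare samples in groups of three and call them the anchor, the positive sample, and the negative sample
    • The positive sample is the one more similar to the anchor and the negative sample is the one less similar. Optimize so that the difference between the two distances is at least a given value δ (the margin)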
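The margin condition is usually written as the hinge loss max(0, d(anchor, positive) - d(anchor, negative) + δ), which is zero once the negative is at least δ farther from the anchor than the positive. Below is a minimal sketch assuming Euclidean distance. The distance function and the margin value are choices made for the example, not taken from the slides.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Positive is close (distance 1) and negative is far (distance 5): no loss.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the "positive" is farther than the "negative", so the loss is large.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Training on many such triplets pulls similar items together and pushes dissimilar ones apart in the embedding space.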
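  17. Contrastive learning models for M&L: MuLan [Huang 22] / MusCALL [Manco 22]
    • Multimodal learning models for music (audio) and text
    • Both are based on contrastive learning
      • (By a remarkable coincidence, the two overlapped in topic at ISMIR 2022)
    • Performance was confirmed on music tagging via zero-shot classification, music tagging via transfer learning, and music retrieval
    [Manco 22] Contrastive Audio-Language Learning for Music. I. Manco et al. ISMIR.
    [Huang 22] MuLan: A Joint Embedding of Music Audio and Natural Language. Q. Huang et al. ISMIR.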
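  18. Contrastive learning models for M&L: CLaMP [SWu 23]
    • A multimodal learning model for music (symbolic) and text
    • Based on contrastive learning
    • Also proposes M3, a self-supervised model for sheet music based on ABC notation
    • Proposes WikiMusicText (WikiMT), a dataset of captions for sheet music
    • Makes text-to-score retrieval possible
    • Good performance in zero-shot classification and transfer learning
    [SWu 23] CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval. Shangda Wu et al. ISMIR.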
  19. Summary of available datasets
    • Captioning
      • MusicCaps https://www.kaggle.com/datasets/googleai/musiccaps
      • LP-MusicCaps-MC https://huggingface.co/datasets/seungheondoh/LP-MusicCaps-MC
      • LP-MusicCaps-MSD https://huggingface.co/datasets/seungheondoh/LP-MusicCaps-MSD
      • LP-MusicCaps-MTT https://huggingface.co/datasets/seungheondoh/LP-MusicCaps-MTT
      • MusicBench https://huggingface.co/datasets/amaai-lab/MusicBench
      • The song describer dataset https://github.com/mulab-mir/song-describer-dataset
      • WikiMuTe https://huggingface.co/datasets/davanstrien/WikiMuTe
    • Question-answering
      • Music-AVQA https://gewu-lab.github.io/MUSIC-AVQA/
      • MusicQA https://crypto-code.github.io/MU-LLaMA-Demo/#dataget
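  20. Thoughts, a.k.a. nagging doubts ①: Why natural language?
    • Is the benefit of linking music with natural language different from that of linking speech or images with it?
      • Technically, it is interesting to fuse with natural language processing, which is booming
      • Is there anything gained by fusing "with music" in particular?
    • Can music and natural language be linked in the first place?
      • As Prof. Hirata also mentioned in his talk yesterday, the mapping between the two is not invertible
      • Beyond being interesting, what will it actually bring?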
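  21. Thoughts, a.k.a. nagging doubts ②: When you want to specialize for your own task
    • What should a caption be required to contain?
      • Music has many different attributes
      • How much of it should be described?
      • The information can be broadly divided into what can be known from the audio itself (Acoustics), how the music was (or is) consumed (Cultural), and metadata (Editorial) [Pachet 05]
    • How should captions be structured?
      • Is it acceptable for a caption to describe the whole song as one unit?
      • The state of the music changes from moment to moment
    [Pachet 05] Pachet, F., 2005. Knowledge management and musical metadata. Idea Group, 12.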
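  22. Summary: a survey of multimodal learning between music and natural language processing
    • Introduced recent trends
      • Music -> Text (captioning, question answering, etc.)
      • Multimodal representation learning of music + text
    • Laid out some simple questions I have been wondering about
      • Many of them are back-to-first-principles questions, so I hope they serve as a spark for discussion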