
Speech Synthesis from Visual Text Stretched According to F0 [ASJ 2024 Spring]

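Oral presentation slides from the 2024 Spring Meeting of the Acoustical Society of Japan.
------ Title -------
(3-2-4) Speech synthesis from visual text stretched according to F0 - ☆Hien Ohnaka, Ryoichi Miyazaki (National Institute of Technology, Tokuyama College), Shinnosuke Takamichi (Graduate School of Information Science and Technology, The University of Tokyo)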

Hien Ohnaka

March 08, 2024

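  1. Speech Synthesis from Visual Text Stretched According to F0
    ☆Hien Ohnaka, Ryoichi Miyazaki (Department of Computer Science and Electronic Engineering, National Institute of Technology, Tokuyama College), Shinnosuke Takamichi (Graduate School of Information Science and Technology, The University of Tokyo)
    Acoustical Society of Japan, 2024 Spring Meeting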
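  2. Research background
    • Recent text-to-speech (TTS) synthesis
      - The naturalness of synthesized read speech has reached a near-human level [Shen+].
        - Application: using TTS in pronunciation training [Garcia+]
      - A growing body of research aims to make more diverse speech synthesizable.
        - Manipulating synthesized speech by manually controlling the F0 that is predicted as an intermediate representation [Ren+]
        - Controlling speaking style with natural language [Guo+], among others
        - Application: TTS for situations in which users control the speech themselves (speech design), such as video content production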

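  3. Research background: speech systems that use visual text
    • Duration control based on stretching visual text
    • Pronunciation instruction using visual text deformed according to prosody [Rude+]
      - Characters are deformed according to the length, pitch, and loudness of the speech.
      - Pronunciation improved when learners were instructed with the deformed visual text.
    Figure (quoted from a figure in [Rude+]): deformed visual text used for pronunciation instruction.
    Figure: "うれしいです" fed to a DNN (input: visual text) versus "うれしい"です fed to a DNN (input: visual onomatopoeia), where the image width controls the duration.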
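  4. Overview of this work
    • Motivation
      - We focus on the stretched expressions of onomatopoeia and on the effectiveness of deformed visual text in pronunciation instruction.
      → Speech synthesis whose prosody can be controlled through visual text might enable both TTS for better speech design and pronunciation instruction.
    • Proposed method: speech synthesis with F0 control based on stretching image height
      - The F0 of the synthesized speech can be controlled manually and visually by stretching the image height.
    • Experiments
      - Better F0 control performance than the existing method
      - Subjective evaluation also confirmed that the visual text and the synthesized speech correspond well.
    Figure: case 1, basic heights ("音声合成楽しい"), yields read-style pitch; case 2, flattening all heights, yields robotic-style pitch. Pipeline: edit image height, then visual feature extractor, then speech synthesis model & vocoder.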
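  5. Proposed method: overview
    • The input characters are converted into visual text stretched according to F0, and feeding that visual text to the model yields a mel spectrogram.
      - The system consists of three modules: a character-level F0 predictor, an image height stretcher, and a speech synthesis model.
    Figure: "おはよう" → char.-level F0 estimator → predicted char.-level F0 → visual text generation → stretched visual text → speech synthesis model (visual feature extractor, encoder, variance adaptor, decoder, with a speaker embedding obtained from the speaker ID) → mel spectrogram.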
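  6. Proposed method: character-level F0 predictor
    • A module that estimates, from the given text, the F0 corresponding to read speech
      - Training: minimizes the mean squared error between the estimated F0 and the true F0
      - Inference: outputs read-style F0
    • Model structure
      - Input: text and accent labels
      - Uses the same encoder as FastSpeech 2 and the pitch predictor of its variance adaptor
    Figure: "おはよう" → text analysis → hiragana and accent → text embedding and accent embedding → encoder → pitch predictor (with a speaker embedding obtained from the speaker ID) → char.-level F0.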
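The training objective named on this slide, the mean squared error between estimated and true character-level F0, can be sketched as below. This is only an illustration: the function name `f0_mse` and the plain-list inputs are hypothetical, and the slides do not say how unvoiced characters or padding are treated.

```python
def f0_mse(estimated_f0, true_f0):
    """Mean squared error between two equal-length character-level F0 sequences.

    Hypothetical helper for illustration; not the authors' training code.
    """
    if len(estimated_f0) != len(true_f0) or not true_f0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(e - t) ** 2 for e, t in zip(estimated_f0, true_f0)]
    return sum(squared_errors) / len(squared_errors)


# One value per character, in Hz (made-up numbers).
loss = f0_mse([200.0, 210.0], [200.0, 200.0])
```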
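  7. Proposed method: image height stretcher
    • Takes the F0 and the text, and generates visual text whose height is stretched
    • Generation method
      - The stretched image height h_i is determined from the F0 f_i, using the dataset mean F0 f_mean as the reference.
      - The higher the F0, the taller the image.
      - Training: the stretch is determined from the true F0.
      - Inference: the stretch is determined from the F0 produced by the character-level F0 predictor.
    Equation annotations (the equation itself is not preserved in the transcript): default height; mel-scale conversion.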
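The height rule on this slide survives only as its annotations (default height, mel-scale conversion, f_mean as the reference). One reading consistent with them is that h_i is the default height H scaled by the mel-scale ratio of f_i to f_mean. The sketch below assumes exactly that, and additionally assumes the common HTK-style mel formula; neither is confirmed by the slides. `DEFAULT_HEIGHT_PX` is a hypothetical value, while 196.2 Hz is the f_mean given in the experimental conditions.

```python
import math

F_MEAN_HZ = 196.2        # dataset mean F0 reported in the experimental conditions
DEFAULT_HEIGHT_PX = 32   # hypothetical default height H; the real value is not in the transcript


def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel conversion (assumed; the slides do not name a formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def stretched_height(f0_hz: float,
                     default_height_px: int = DEFAULT_HEIGHT_PX,
                     f_mean_hz: float = F_MEAN_HZ) -> int:
    """Assumed rule: h_i = H * mel(f_i) / mel(f_mean), rounded to whole pixels."""
    ratio = hz_to_mel(f0_hz) / hz_to_mel(f_mean_hz)
    return max(1, round(default_height_px * ratio))


# A character at the mean F0 keeps the default height; a higher F0 gives a taller glyph.
heights = [stretched_height(f) for f in (150.0, 196.2, 260.0)]
```

Under this assumption the relative height h_i / H is directly comparable to a relative F0 on the mel scale, which is how the evaluation slides plot the two methods on a shared axis.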
  8. Proposed method: speech synthesis model
    • A model similar to visual text to speech [Nakano+]
      - First stage: visual feature extractor
        - Expected to capture both the character shape (phoneme) and the degree of stretching (F0)
      - Later stage: a model similar to FastSpeech 2 [Ren+]
    Figure (a) Speech synthesis model: visual feature extractor → encoder → variance adaptor → decoder → mel spectrogram, with a speaker embedding.
    Figure (b) Visual feature extractor: slicing visual text → (Conv2d, batch norm., ReLU) x 3 → reshape → linear → visual feature.
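The first step of the visual feature extractor in the figure is "slicing visual text". A minimal sketch of that step follows, assuming each character occupies an equal-width column of the image (plausible for the monospaced font in the experimental conditions, but an assumption nonetheless). The function name and the array layout are hypothetical, and the convolutional layers that follow are not shown.

```python
import numpy as np


def slice_visual_text(image: np.ndarray, n_chars: int) -> np.ndarray:
    """Split a (height, width) text image into n_chars equal-width slices.

    Returns an array of shape (n_chars, height, char_width).
    """
    height, width = image.shape
    if n_chars < 1 or width % n_chars != 0:
        raise ValueError("image width must be a positive multiple of n_chars")
    char_width = width // n_chars
    # (height, n_chars, char_width) -> (n_chars, height, char_width)
    return image.reshape(height, n_chars, char_width).transpose(1, 0, 2)


# A toy 2 x 6 "image" holding three characters, each 2 pixels wide.
toy_image = np.arange(12).reshape(2, 6)
slices = slice_visual_text(toy_image, n_chars=3)
```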
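  9. Evaluation experiments
    • Purpose
      - Evaluate the quality of the synthesized speech (naturalness and F0 controllability)
      - Evaluate how well changes in image stretching correspond to changes in the F0 of the synthesized speech
    • Three experiments were conducted
      1. Basic quality evaluation using naturalness MOS and CER
      2. Objective evaluation of F0 controllability
      3. Subjective evaluation of how well image-height stretching corresponds to changes in synthesized-speech F0 under manual control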
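  10. Experimental conditions
    Data
      - Dataset: JVS corpus [Takamichi+], [?] speakers
      - Amount of data: training [?] utterances, inference [?] utterances, test [?] utterances
      - Visual text font: monospaced IPAexGothic (font size H is [?] px)
    Feature extraction
      - Alignment: phoneme alignments provided with the JVS corpus
      - F0 extraction: WORLD [Morise+] (f_mean: 196.2 [Hz])
      - Acoustic features: [?]-dimensional mel spectrogram
    Model settings
      - Visual feature extractor: CNN with kernel size [?]
      - Waveform generation: pretrained HiFi-GAN [Kong+]
    Conventional method (baseline)
      - FastSpeech 2 [Ren+] with hiragana input (the F0 predicted by the variance adaptor can be controlled manually)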
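  11. Evaluation of basic quality
    • Score table for the conventional and proposed methods (the table itself is not preserved in the transcript)
      - Naturalness: naturalness MOS rated by [?] native speakers of Japanese
      - CER: hiragana-level character recognition rate obtained with the Whisper [Radford+] base model
    • The proposed method is slightly worse than the conventional method in both naturalness and CER.
      → The likely cause is that stretching the image changes the character shapes, so the phonemic information is blurred compared with text input and the intelligibility of the utterance content drops.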
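For reference, a character-level score of the kind reported on this slide is usually computed as an edit distance between the reference text and the recognizer output, divided by the reference length. The sketch below shows that standard definition at the hiragana level; it is not the authors' evaluation script, the function names are hypothetical, and the recognition step itself (Whisper) is left out.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One missing hiragana out of four reference characters.
score = character_error_rate("おはよう", "おはよ")
```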
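  12. Objective evaluation of F0 controllability
    • Evaluated from two viewpoints
      1. Automatic controllability: can the model synthesize speech whose F0 corresponds to automatically generated visual text?
      2. Manual controllability: when the height of the visual text is edited, does the F0 follow the edit?
      - For the conventional method, F0 was controlled by manipulating the F0 sequence predicted by the variance adaptor.
    Figure: text → F0 estimator → generate height-stretched visual text → visual feature extractor → speech synthesis model & vocoder, with the automatic-control and manual-control paths marked.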
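  13. Objective evaluation of F0 controllability: automatic controllability
    • Scatter plot in which each point is one character
      - Horizontal axis
        - Conventional method: mel-scale ratio of the intermediate-representation F0 to f_mean
        - Proposed method: ratio of each character's height h_i to the default height H
      - Vertical axis: mel-scale ratio of the synthesized-speech F0 to f_mean
    • Result
      - Both the proposed and the conventional method can properly reflect the given F0 information in the synthesized speech.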
  14. Objective evaluation of F0 controllability: manual controllability
    Figure: panels for stretch ratio (uniform), stretch ratio (one char. in a sentence), and stretch ratio (three char. in a sentence), each comparing Proposed and Conventional. Horizontal axis: stretch ratio (0.5 to 2.0). Vertical axis: relative F0 on the mel-scale (1.0 indicates predicted F0s). Label bins of relative height (relative predicted pitch): up to 0.33, 0.33-0.78, 0.78-1.22, 1.22-1.67, 1.67-2.11, 2.11 and above.
      - Horizontal axis
        - Conventional method: the ratio by which [?] consecutive characters' worth of the predicted F0 sequence is multiplied
        - Proposed method: the ratio by which [?] consecutive characters of the visual text are stretched
      - Vertical axis: mel-scale ratio of the synthesized-speech F0 with manual control to that without
      - Labels: ratio between the default height H and the height after the change
    • Result
      - With the conventional method, the synthesized-speech F0 fails to follow when the change is large and the absolute value moves far from f_mean.
      → This means accuracy is low when a low (high) F0 is made even lower (higher).
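The manual-control conditions in these plots (uniform, one character, three characters in a sentence) all apply one ratio to a run of consecutive characters: to per-character image heights for the proposed method, or to the predicted F0 sequence for the conventional one. A minimal sketch of that manipulation follows. The helper name `scale_window` and the example values are hypothetical, and the real systems operate on images and model-internal F0 rather than plain lists.

```python
def scale_window(values, start, length, ratio):
    """Multiply `length` consecutive entries, beginning at index `start`, by `ratio`.

    Entries outside the window are returned unchanged.
    """
    if start < 0 or length < 1 or start + length > len(values):
        raise ValueError("window must lie inside the sequence")
    return [v * ratio if start <= i < start + length else v
            for i, v in enumerate(values)]


heights = [32, 32, 32, 32]                                  # hypothetical per-character heights in px
uniform = scale_window(heights, 0, len(heights), 0.5)       # stretch every character
one_char = scale_window(heights, 2, 1, 2.0)                 # stretch a single character
three_chars = scale_window(heights, 1, 3, 1.5)              # stretch three consecutive characters
```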
  15. Objective evaluation of F0 controllability: manual controllability
    Figure: panels for stretch ratio (uniform), stretch ratio (one char. in a sentence), and stretch ratio (three char. in a sentence), each comparing Proposed and Conventional. Horizontal axis: stretch ratio (0.5 to 2.0). Vertical axis: relative F0 on the mel-scale (1.0 indicates predicted F0s). Label bins of relative height (relative predicted pitch): up to 0.33, 0.33-0.78, 0.78-1.22, 1.22-1.67, 1.67-2.11, 2.11 and above.
      - Horizontal axis
        - Proposed method: the ratio by which [?] characters of the visual text are stretched (manual control)
        - Conventional method: the ratio by which [?] characters' worth of the predicted F0 is multiplied
      - Vertical axis: mel-scale ratio of the synthesized-speech F0 with manual control to that without
      - Labels: ratio between the default height H and the height after the change
    • Result
      - With the proposed method, the tracking problem seen with the conventional method is mitigated and the F0 follows the edits well.
      → Better results than the conventional method in terms of F0 controllability
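  16. Summary
    • Purpose: realize speech synthesis whose prosody can be controlled through visual text
    • Proposed method: speech synthesis with F0 control based on stretching image height
    • Evaluation experiments: investigated the basic performance of the proposed method
      - Slightly worse naturalness than the conventional method, but better F0 controllability
      → Confirmed that the F0 of synthesized speech can be controlled through visual text
    • Future work
      - Experiments to investigate the effectiveness of the proposed method for speech design
      - Extension to deformed visual text that also covers loudness and duration, and investigation of its effectiveness for pronunciation instruction, among others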