[ASJ2020a] Comparison of FastSpeech-based neural TTS models with full-context label input

Takuma OKAMOTO
September 11, 2020

Transcript

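  1. Comparison of FastSpeech-based neural TTS models with full-context label input
     Takuma Okamoto (presenter), Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
     National Institute of Information and Communications Technology (NICT); Nagoya University
     ASJ Annual Meeting (Autumn), Sept. 2020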

    
  2. Outline
     - Introduction
     - Purpose
     - FastSpeech-based neural TTS models with full-context label input: AlignTTS, JDI-T
     - Experiments
     - Conclusions
     - Announcement

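  3. Introduction: end-to-end neural text-to-speech (TTS)

    Sequence-to-sequence (seq2seq) models: Tacotron 2, Transformer, Deep Voice
    - Generate acoustic features directly from text (or phonemes); no phoneme alignment is needed.
    - Issues: attention prediction failures cause skipped or repeated phonemes, which is fatal for production services; the models are autoregressive.

    State-of-the-art fully end-to-end TTS: EATS, FastSpeech 2
    - Generate the speech waveform directly from phonemes with a single network; no phoneme alignment is needed; non-autoregressive.
    - Issues: is there still room to improve audio quality? Can a model that passes through the fundamental frequency really be called end-to-end?

    Stable neural TTS models
    - BLSTM+Taco2dec (T. Okamoto et al., ASRU): uses forced alignments estimated by an HMM together with the Tacotron 2 decoder → stable and high quality.
      Issues: phoneme alignment must be trained separately with an HMM; the model is autoregressive.
    - FastSpeech (Y. Ren et al., NeurIPS): a feed-forward Transformer; non-autoregressive (very fast generation); phoneme alignment comes from a teacher Transformer. With knowledge distillation (teacher-student training) it reaches quality comparable to the autoregressive Transformer while generating stably.
      Issues: knowledge distillation is required; the results may have saturated because the LJSpeech corpus was used.

    [Figure: real-time factors on GPU and CPUs]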
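The non-autoregressive generation described for FastSpeech relies on expanding phoneme-level hidden states to frame level with per-phoneme durations (the "Length Regulator" in the model diagrams). The following is a minimal pure-Python sketch of that idea, not the authors' implementation; the function name `length_regulate` and the toy values are assumptions made for illustration.

```python
from typing import List, Sequence


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme-level vector durations[i] times.

    hidden:    one vector per phoneme (hypothetical encoder output)
    durations: number of acoustic frames assigned to each phoneme
    """
    if len(hidden) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[List[float]] = []
    for vec, d in zip(hidden, durations):
        if d < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector d times so that the output has sum(durations) frames.
        frames.extend(list(vec) for _ in range(d))
    return frames


# Toy example: three phonemes with 2-dimensional hidden states.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]
frames = length_regulate(hidden, durations)
print(len(frames))  # number of frames equals sum(durations)
```

Because every frame is produced in one pass from known durations, no attention is needed at synthesis time, which is why the quality of the phoneme alignment used for training matters.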
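  4. Previous results: Japanese neural TTS with full-context labels

    - Full-context label input gives higher quality than phoneme-only input.
    - FastSpeech without knowledge distillation does not reach the quality of autoregressive models.
    - For FastSpeech, phoneme durations are predicted more accurately by a separate network.
    - Demo samples: https://ast-astrec.nict.go.jp/demo_samples/icassp_…_okamoto/index.html
    - Source: Okamoto et al., Proc. ASJ Spring Meeting (the meeting was cancelled because of COVID-19).

    [Figure: mean opinion scores, panels (a) and (b), for phoneme-only and full-context label input. Acoustic models: Tacotron 2, Transformer, BLSTM+Taco2dec, original FastSpeech. Vocoders: WaveGlow, WG(256), PWG, STRAIGHT. Analysis-synthesis is shown for reference.]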
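  5. Purpose: FastSpeech-type neural TTS models that need no knowledge distillation

    - AlignTTS (Z. Zeng et al., ICASSP): estimates phoneme alignment with a mixture density network; phoneme durations are predicted by a separate network; character input (English).
    - JDI-T, Jointly trained Duration Informed Transformer (D. Lim et al.): trains an autoregressive Transformer and FastSpeech jointly; phoneme input (Korean).
    - FastSpeech 2 (Y. Ren et al.): uses the Montreal Forced Aligner (MFA) for phoneme alignment; uses the fundamental frequency as an intermediate feature; phoneme input (English).
    - FastPitch (A. Łańcucki): uses Tacotron 2 estimates for phoneme alignment; uses the fundamental frequency as an intermediate feature; phoneme input (English).

    Purpose: compare FastSpeech-type Japanese TTS models with full-context label input.
    - Compare 5 phoneme alignment methods: HMM, MFA, Tacotron 2, AlignTTS, and JDI-T.
      → Each model in the literature uses a different alignment method.
    - Compare the FastSpeech model structures of AlignTTS and JDI-T.
      → The FastSpeech (feed-forward Transformer) structure differs between AlignTTS and JDI-T.

    Question: can very fast generation models for Japanese neural TTS be made high quality?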
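  6. AlignTTS

    [Figure: AlignTTS architecture. (1) FFT block: multi-head attention, add & norm, 1D convolution, add & norm. (2) Feed-forward Transformer: full-context label → 1 × 1 conv layer → positional encoding → N× FFT blocks → length regulator → positional encoding → N× FFT blocks → linear layer → mel-spectrogram. (3) Mixture density network: linear layers produce {(µ_j, Σ_j)}_m from the encoder output H; the forward algorithm over p(y_i | µ_j, Σ_j) for frames {y_i}_n gives the alignment loss −log α_{n,m}, used only in training. (4) Duration predictor: full-context label → 1 × 1 conv layer → positional encoding → N× FFT blocks → linear layer → duration sequence.]

    Training procedure (original)
    - Step 1: train the encoder and the mixture density network. The mixture density network learns the mean and variance of a multivariate normal distribution for each phoneme: the mean corresponds to the average mel-spectrogram of phoneme "X", and the variance to the spread of the spectrogram of phoneme "X". The phoneme alignment is obtained with the Viterbi algorithm.
    - Step 2: train the decoder. The encoder is frozen and only the decoder is trained.
    - Step 3: joint optimization. FastSpeech and the mixture density network are trained together, and the phoneme alignment is updated at every training iteration.
    - Step 4: train the phoneme duration prediction network on the final phoneme alignment.

    For details of the training procedure, see the material from the ICASSP acoustics and speech paper-reading meeting: https://bit.ly/…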
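The Viterbi step in the AlignTTS training procedure can be illustrated with a small sketch: given one diagonal Gaussian (mean, variance) per phoneme and a sequence of frames, find the monotonic path that assigns every frame to a phoneme and count the frames per phoneme. This is an illustration under stated assumptions, not the AlignTTS code: the function names, the diagonal-covariance simplification, and the toy 1-dimensional data are all hypothetical.

```python
import math
from typing import List, Sequence


def diag_gauss_loglik(y: Sequence[float], mean: Sequence[float],
                      var: Sequence[float]) -> float:
    """Log-density of frame y under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(y, mean, var))


def viterbi_durations(frames: Sequence[Sequence[float]],
                      means: Sequence[Sequence[float]],
                      variances: Sequence[Sequence[float]]) -> List[int]:
    """Best monotonic alignment: start at phoneme 0, end at the last phoneme,
    and at each frame either stay on the phoneme or advance by one."""
    T, M = len(frames), len(means)
    if T < M:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * M for _ in range(T)]
    back = [[0] * M for _ in range(T)]
    score[0][0] = diag_gauss_loglik(frames[0], means[0], variances[0])
    for t in range(1, T):
        for j in range(min(t + 1, M)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                best, back[t][j] = move, j - 1
            else:
                best, back[t][j] = stay, j
            if best > neg_inf:
                score[t][j] = best + diag_gauss_loglik(
                    frames[t], means[j], variances[j])
    # Backtrack from the last phoneme at the last frame, counting frames.
    durations = [0] * M
    j = M - 1
    for t in range(T - 1, -1, -1):
        durations[j] += 1
        j = back[t][j]
    return durations


# Toy data: 6 one-dimensional frames, 3 phonemes with means 0, 1 and 2.
frames = [[0.0], [0.1], [1.0], [0.9], [1.1], [2.0]]
means = [[0.0], [1.0], [2.0]]
variances = [[0.1], [0.1], [0.1]]
durations = viterbi_durations(frames, means, variances)
print(durations)
```

The training loss in the diagram uses the forward algorithm (a sum over all monotonic paths) rather than the single best path; the sketch covers only the alignment extraction.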
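  7. JDI-T

    Training
    - FastSpeech and the decoder of an autoregressive Transformer are optimized jointly.
    - L1 loss: acoustic feature prediction.
    - L2 loss: phoneme duration prediction.
    - Encouraging diagonal attention: a CTC loss, which recovers the phoneme sequence from the decoder output, and a guided attention loss.

    Synthesis
    - Only the FastSpeech part is used for inference.

    [Figure: JDI-T architecture. Full-context label → 1 × 1 conv layer → encoder pre-net → positional encoding → N× FFT blocks (H) → duration predictor and length regulator (H_align) → positional encoding → N× FFT blocks → linear layer → mel-spectrogram. Training-only branch: decoder pre-net, positional encoding, attention mechanism, decoder, linear layers. Losses shown: CTC loss, guided attention loss, L1 loss, L2 loss, L1 loss.]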
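The guided attention loss named above penalizes attention mass that lies far from the diagonal. The slides do not give the formula, so the sketch below uses one commonly used formulation, a penalty of 1 − exp(−(n/N − t/T)² / (2g²)) averaged over the attention matrix; the width `g = 0.2`, the function names, and the toy matrices are assumptions and may differ from what JDI-T uses.

```python
import math
from typing import List, Sequence


def guided_attention_weights(n_phonemes: int, n_frames: int,
                             g: float = 0.2) -> List[List[float]]:
    """Penalty matrix W[n][t]: 0 on the diagonal, approaching 1 far from it."""
    return [[1.0 - math.exp(-((n / n_phonemes - t / n_frames) ** 2)
                            / (2.0 * g * g))
             for t in range(n_frames)]
            for n in range(n_phonemes)]


def guided_attention_loss(attention: Sequence[Sequence[float]],
                          g: float = 0.2) -> float:
    """Mean of attention * penalty over all (phoneme, frame) pairs."""
    n_phonemes, n_frames = len(attention), len(attention[0])
    w = guided_attention_weights(n_phonemes, n_frames, g)
    total = sum(attention[n][t] * w[n][t]
                for n in range(n_phonemes) for t in range(n_frames))
    return total / (n_phonemes * n_frames)


# Toy attention matrices (rows: phonemes, columns: decoder frames).
diagonal = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
anti_diagonal = [[0.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0]]
loss_diag = guided_attention_loss(diagonal)
loss_anti = guided_attention_loss(anti_diagonal)
print(loss_diag < loss_anti)
```

A perfectly diagonal attention incurs no penalty, while a reversed alignment is penalized, which is the behaviour wanted when the attention is later used as a source of phoneme durations.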
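  8. Experimental conditions

    Acoustic models
    - Comparison of FastSpeech-type models (feed-forward Transformer, FFT): AlignTTS-type or JDI-T-type. They differ in the number of channels, the structure of the FFT block, the total number of FFT blocks, and so on.
      Note: for each, only the FastSpeech part is trained on its own, without phoneme duration prediction.
    - Vanilla JDI-T: includes the autoregressive Transformer decoder and the phoneme duration predictor.

    Phoneme alignment methods
    - HMM (HTS, Merlin)
    - Montreal Forced Aligner (MFA): GMM, LDA (Kaldi)
    - AlignTTS: the phoneme alignment obtained from step 1 of training only, without the joint optimization of step 3.
    - JDI-T: only the FastSpeech encoder and the autoregressive Transformer decoder are trained.
    - Tacotron 2: after training, the phoneme alignment is taken from the attention weights obtained when running inference on the training set.
    → For each alignment, an FFT-type phoneme duration predictor is trained separately (for FFT and for BLSTM+Taco2dec).

    Training data: Japanese female professional speaker, 23,828 utterances (about 18 hours), … kHz
    Full-context labels: 39 phoneme dimensions plus accent information, 48 dimensions in total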
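For the Tacotron 2 alignment method, the slides state only that phoneme alignment is taken from the attention weights at inference on the training set. One common way to turn attention weights into durations, shown here as an assumption rather than the authors' exact procedure, is to assign each decoder frame to the phoneme with the largest attention weight and count frames per phoneme. The function name and the toy matrix are hypothetical.

```python
from typing import List, Sequence


def durations_from_attention(attention: Sequence[Sequence[float]],
                             n_phonemes: int) -> List[int]:
    """attention[t][n] is the weight of decoder frame t on phoneme n.

    Each frame votes for its highest-weight phoneme; the vote counts are
    used as phoneme durations (in frames).
    """
    durations = [0] * n_phonemes
    for row in attention:
        if len(row) != n_phonemes:
            raise ValueError("each attention row needs one weight per phoneme")
        best = max(range(n_phonemes), key=lambda n: row[n])
        durations[best] += 1
    return durations


# Toy example: 5 decoder frames attending over 3 phonemes.
attention = [
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
durations = durations_from_attention(attention, n_phonemes=3)
print(durations)
```

The durations always sum to the number of decoder frames, so they can be used directly as targets for a separately trained duration predictor.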
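  9. Results of real-time factors (RTFs)

    Measurement conditions: PyTorch implementation
    - GPU: NVIDIA Tesla V100
    - CPU: Intel Xeon 6152
    - Acoustic models: up to … cores
    - WaveGlow: … cores (the maximum for one node)

    Results
    - FastSpeech-type models are faster than the autoregressive models.
    - Even on CPUs, the FFT acoustic models reach an RTF of about 0.026 (Table 1).

    Real-time neural TTS on CPUs
    - LPCNet @ … kHz → Total RTF: …
    - K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga and H. Kawai, "Investigation of training data size for real-time neural vocoders on CPUs," Acoust. Sci. Tech. (accepted, to appear)

    Table 1. Results of inference real-time factors (RTFs) of neural network models with an NVIDIA Tesla V100 and Intel Xeon 6152 CPUs. "FFT" denotes feed-forward Transformer.

        Model                 GPU      CPUs
        BLSTM+Taco2dec        0.015    0.21
        Tacotron 2            0.063    0.22
        Transformer           0.55     3.2
        FFT (AlignTTS)        0.005    0.026
        FFT (JDI-T)           0.005    0.026
        FFT duration model    0.0007   0.0024
        WaveGlow              0.066    2.1

    [Excerpt from the accompanying paper, cropped on the slide so that only fragments are legible: all models other than Vanilla JDI-T use the duration prediction network (4); for comparison, autoregressive models with full-context label input (Tacotron 2, Transformer, BLSTM+Taco2dec [6]) are also included, and the 5 phoneme alignment types are also applied to BLSTM+Taco2dec; 23,828 utterances (about 18 hours) form the training set; the frame shift is 12.5 ms; the full-context labels have 39 phoneme dimensions plus accent dimensions, 48 in total; the listeners were adult native speakers of Japanese, as in [5, 6].]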
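The real-time factor is the processing time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below computes an RTF and, as a rough estimate, adds the per-module RTFs from Table 1 to approximate a full pipeline. The additive model assumes the modules run one after another with no overlap, which the slides do not state; the function names are hypothetical.

```python
from typing import Dict, Iterable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Per-module RTFs copied from Table 1.
TABLE1_GPU: Dict[str, float] = {
    "FFT (AlignTTS)": 0.005,
    "FFT duration model": 0.0007,
    "WaveGlow": 0.066,
}
TABLE1_CPU: Dict[str, float] = {
    "FFT (AlignTTS)": 0.026,
    "FFT duration model": 0.0024,
    "WaveGlow": 2.1,
}


def pipeline_rtf(table: Dict[str, float], modules: Iterable[str]) -> float:
    """Assumed sequential pipeline: the module RTFs simply add up."""
    return sum(table[m] for m in modules)


modules = ["FFT duration model", "FFT (AlignTTS)", "WaveGlow"]
gpu_total = pipeline_rtf(TABLE1_GPU, modules)
cpu_total = pipeline_rtf(TABLE1_CPU, modules)
print(f"GPU total RTF: {gpu_total:.4f}")
print(f"CPU total RTF: {cpu_total:.4f}")
```

Under this assumption the acoustic and duration models are a small share of the total, and WaveGlow on CPUs is the part that exceeds real time, which is consistent with the slide turning to LPCNet for real-time synthesis on CPUs.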
  10. MOS results and demo samples

    [Figure: mean opinion scores by alignment method (HMM, MFA, Tacotron 2, AlignTTS, JDI-T) for the non-autoregressive acoustic models FFT (AlignTTS) and FFT (JDI-T) and for the autoregressive BLSTM+Taco2dec, together with the seq2seq models Tacotron 2 and Transformer, Vanilla JDI-T, the original speech, and WaveGlow analysis-synthesis.]

    FastSpeech-type models do not reach the quality of the autoregressive models.
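  11. Discussions and conclusions

    Discussion of the results
    - FastSpeech-type (non-autoregressive) models do not reach the audio quality of autoregressive models.
      Autoregression over acoustic features is important.
      → Audio quality may be improved with auxiliary information such as the fundamental frequency (FastSpeech 2, FastPitch).
    - The FFT structure (AlignTTS or JDI-T) makes no difference.
    - Alignment methods: no significant difference except for AlignTTS → the phoneme alignment from JDI-T is good.
      AlignTTS was also given full-context labels → the different labels had an adverse effect → to be examined with phoneme-only input.

    Conclusions
    - Compared FastSpeech-type neural TTS models with full-context label input.
    - Compared 5 phoneme alignment methods for both the AlignTTS-type and JDI-T-type models.
    - Confirmed fast generation on CPUs.
    - Audio quality does not reach that of the autoregressive models (Tacotron 2, Transformer, BLSTM+Taco2dec).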
  12. Announcement

    … (Thu) to … (Fri): Keihanna R&D Fair (online)
    Introduction of neural speech rate conversion technology
    Okamoto et al., "An attempt at neural speech rate conversion using a multi-speaker WaveNet vocoder," SP workshop (cancelled because of COVID-19)

    [Figure: WaveNet vocoder with residual blocks, skip connections, and a softmax output p(x_n | x_0, ..., x_{n−1}); mel-spectrogram conditioning through an upsample layer and a bidirectional GRU, with resampling for rate conversion applied at position (a) or (b). Mean opinion scores against speech rate conversion rate for STRAIGHT, SD-WaveNet, SI-WaveNet with (a) and (b), and WSOLA. Trained using the JVS corpus. Acoustic features are resampled for speech rate conversion.]