Slide 1

Slide 1 text

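Comparison of FastSpeech-based neural TTS models with full-context label input
Takuma Okamoto (presenter), Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
National Institute of Information and Communications Technology (NICT), Nagoya University
ASJ Annual Meeting, Autumn (September)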

Slide 2

Slide 2 text

Outline
  Introduction
  Purpose
  FastSpeech-based neural TTS models with full-context label input
    AlignTTS
    JDI-T
  Experiments
  Conclusions
  Announcement

Slide 3

Slide 3 text

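Introduction
End-to-end neural text-to-speech (TTS) synthesis
  Sequence-to-sequence (Seq2seq) models: Tacotron 2, Transformer, Deep Voice
    Generate acoustic features directly from text (or phonemes); no phoneme alignment is required
    Issues: attention prediction failures cause skipped and repeated phonemes → fatal for real services; autoregressive models
  State-of-the-art fully end-to-end TTS: EATS, FastSpeech 2s
    Generate the speech waveform directly from phonemes with a single network; no phoneme alignment is required; non-autoregressive models
    Issues: speech quality still has room for improvement; can a model that goes through the fundamental frequency really be called end-to-end?
Stable neural TTS models
  BLSTM+Taco2dec: T. Okamoto et al., ASRU
    Uses forced alignment estimated by an HMM together with the Tacotron 2 decoder → stable and high quality
    Issues: HMM phoneme alignment must be trained separately; autoregressive model
  FastSpeech: Y. Ren et al., NeurIPS
    Feed-forward Transformer; non-autoregressive (ultra-fast generation); phoneme alignment from a teacher Transformer
    Knowledge distillation (teacher-student training) gives stable generation with quality comparable to the autoregressive Transformer
    Issues: knowledge distillation is required; the results may have hit a ceiling because the LJSpeech corpus was used
[Figure: real-time factors on GPU and CPUs]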

Slide 4

Slide 4 text

Previous results
Japanese neural TTS with full-context labels
  Full-context label input gives higher quality than phoneme-only input
  FastSpeech without knowledge distillation falls short of autoregressive models
  In FastSpeech, phoneme durations are predicted more accurately by a separate network
  Demo samples: https://ast-astrec.nict.go.jp/demo_samples/icassp_…_okamoto/index.html
[Figure: mean opinion scores, panels (a) and (b). Acoustic models: Tacotron 2, Transformer, BLSTM+Taco2dec, FastSpeech; vocoders: WaveGlow, WG(256), PWG, STRAIGHT; phoneme-only vs. full-context label input; references: Original and analysis-synthesis]
Source: Okamoto et al., Proc. ASJ Spring Meeting (cancelled due to COVID-19)

Slide 5

Slide 5 text

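Purpose
FastSpeech-based neural TTS models that need no knowledge distillation
  AlignTTS: Z. Zeng et al., ICASSP
    Phoneme alignment estimated by a mixture density network; phoneme durations predicted by a separate network; character input (English)
  JDI-T (Jointly trained Duration Informed Transformer): D. Lim et al.
    Autoregressive Transformer and FastSpeech trained jointly; phoneme input (Korean)
  FastSpeech 2: Y. Ren et al.
    Phoneme alignment from the Montreal Forced Aligner (MFA); fundamental frequency used as an intermediate feature; phoneme input (English)
  FastPitch: A. Łańcucki
    Phoneme alignment from Tacotron 2 estimates; fundamental frequency used as an intermediate feature; phoneme input (English)
Purpose: compare FastSpeech-based Japanese TTS models with full-context label input
  Compare five phoneme alignment methods: HMM, MFA, Tacotron 2, AlignTTS, JDI-T
    → because each model uses a different alignment method
  Compare the FastSpeech model structures (AlignTTS, JDI-T)
    → because the FastSpeech (feed-forward Transformer) structure differs between AlignTTS and JDI-T
Question: can ultra-fast generation models for Japanese neural TTS reach high speech quality?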

Slide 6

Slide 6 text

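AlignTTS
[Figure: AlignTTS architecture. (1) FFT block: Multi-Head Attention, Add & Norm, Conv 1D, Add & Norm. (2) Feed-Forward Transformer: full-context label → 1 × 1 conv layer → positional encoding → N× FFT blocks (output H) → length regulator (output Halign) → positional encoding → N× FFT blocks → linear layer → mel-spectrogram. (3) Mix Density Network (training only): linear layers on H give Gaussian parameters $\{(\mu_j, \Sigma_j)\}^m$; with mel frames $\{y_i\}^n$ and likelihoods $p(y_i \mid \mu_j, \Sigma_j)$, the forward algorithm gives the alignment loss $-\log \alpha_{n,m}$. (4) Duration Predictor: full-context label → 1 × 1 conv layer → positional encoding → N× FFT blocks → linear layer → duration sequence]
Training method (original)
  Step 1: train the encoder and the mixture density network
    The mixture density network learns the mean and variance of a multivariate Gaussian
      Mean → the average mel-spectrogram of phoneme "x"
      Variance → the spread of the spectrogram of phoneme "x"
    The phoneme alignment is obtained with the Viterbi algorithm
  Step 2: train the decoder
    The encoder is frozen and only the decoder is trained
  Step 3: joint optimization
    FastSpeech and the mixture density network are trained together
    The phoneme alignment is updated at every training iteration
  Step 4: train the phoneme duration predictor with the final phoneme alignment
For details of the training method, see the material from the ICASSP acoustics and speech reading group: https://bit.ly/… (surviving characters: Wwse)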
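The alignment loss and the Viterbi alignment of Step 1 can be sketched as follows. This is a minimal illustration, not the AlignTTS implementation: it assumes diagonal covariances, a strictly monotonic alignment in which each frame either stays on the current phoneme or advances to the next one, and at least one frame per phoneme. The function names are hypothetical.

```python
import math

def log_gauss(y, mu, var):
    """Log density of a diagonal Gaussian (assumed form of p(y_i | mu_j, Sigma_j))."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(y, mu, var))

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def alignment_loss_and_durations(frames, means, variances):
    """Return (-log alpha[n-1][m-1], Viterbi durations per phoneme)."""
    n, m = len(frames), len(means)
    assert n >= m, "need at least one frame per phoneme"
    neg = -math.inf
    logp = [[log_gauss(frames[i], means[j], variances[j]) for j in range(m)]
            for i in range(n)]
    alpha = [[neg] * m for _ in range(n)]  # forward (sum over all paths), log domain
    delta = [[neg] * m for _ in range(n)]  # Viterbi (best single path), log domain
    back = [[0] * m for _ in range(n)]     # phoneme index at the previous frame
    alpha[0][0] = delta[0][0] = logp[0][0]
    for i in range(1, n):
        for j in range(m):
            move_a = alpha[i - 1][j - 1] if j > 0 else neg
            alpha[i][j] = logaddexp(alpha[i - 1][j], move_a) + logp[i][j]
            stay_d = delta[i - 1][j]
            move_d = delta[i - 1][j - 1] if j > 0 else neg
            if move_d > stay_d:
                delta[i][j], back[i][j] = move_d + logp[i][j], j - 1
            else:
                delta[i][j], back[i][j] = stay_d + logp[i][j], j
    # Backtrack from the last frame and last phoneme to recover the best path.
    path = [m - 1]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    durations = [path.count(j) for j in range(m)]
    return -alpha[n - 1][m - 1], durations
```

With one-dimensional toy frames clustered around two phoneme means, the returned durations count how many frames the best path assigns to each phoneme, and they always sum to the number of frames.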

Slide 7

Slide 7 text

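JDI-T
Training method
  FastSpeech and the decoder of an autoregressive Transformer are optimized jointly
    L1 loss: acoustic feature estimation
    L2 loss: phoneme duration estimation
  Encouraging a diagonal attention matrix
    CTC loss: the phoneme sequence is recovered from the decoder output
    Guided attention loss
At synthesis time
  Only FastSpeech is used for inference
[Figure: JDI-T architecture. FastSpeech path: full-context label → 1 × 1 conv layer → encoder pre-net → positional encoding → N× FFT blocks (output H) → length regulator (output Halign) → positional encoding → N× FFT blocks → linear layer → mel-spectrogram (L1 loss). Duration predictor (L2 loss). Training-only path: attention mechanism, decoder pre-net, positional encoding, autoregressive decoder and linear layers, with L1 loss, CTC loss, and guided attention loss]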
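The guided attention loss named above can be sketched as follows. This is a minimal illustration of the commonly used formulation (a penalty that grows with distance from the diagonal), not the JDI-T code; the width `g = 0.2` and the function name are assumptions, and the slide does not say which values were used.

```python
import math

def guided_attention_loss(attn, g=0.2):
    """Mean of attn[t][n] * W[t][n], with W = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).

    attn[t][n] is the attention weight of decoder frame t on input token n.
    g is an assumed width of the diagonal band (hypothetical default).
    """
    num_frames, num_tokens = len(attn), len(attn[0])
    total = 0.0
    for t in range(num_frames):
        for n in range(num_tokens):
            offset = n / num_tokens - t / num_frames
            penalty = 1.0 - math.exp(-(offset ** 2) / (2.0 * g * g))
            total += attn[t][n] * penalty
    return total / (num_frames * num_tokens)
```

A perfectly diagonal attention matrix incurs no penalty, while attention mass far from the diagonal raises the loss, which is what pushes the autoregressive decoder toward monotonic alignments.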

Slide 8

Slide 8 text

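Experimental conditions
Acoustic models
  Comparison of FastSpeech-type models (feed-forward Transformer: FFT)
    AlignTTS-type or JDI-T-type model: the number of channels, the FFT block structure, the total number of FFT blocks, etc. differ
    Note: in each case only the FastSpeech part is trained on its own, without phoneme duration prediction
    Vanilla JDI-T: includes the autoregressive Transformer decoder and phoneme duration prediction
Phoneme alignment methods
  HMM (HTS, Merlin)
  Montreal Forced Aligner (MFA): GMM-LDA (Kaldi)
  AlignTTS: phoneme alignment obtained from Step 1 training only, without the Step 3 joint optimization
  JDI-T: only the FastSpeech encoder and the autoregressive Transformer decoder are trained
  Tacotron 2: phoneme alignment taken from the attention weights when the trained model runs inference on the training set
  → For each alignment, an FFT-type phoneme duration predictor is trained separately (FFT, BLSTM+Taco2dec)
Training data: a professional female Japanese speaker, 23,828 utterances (about 18 hours), … kHz
Full-context label: 39 phoneme dimensions plus accent information → 48 dimensions in total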
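Turning attention weights into phoneme durations, as in the Tacotron 2 alignment method above, can be sketched as follows, together with the length regulator that consumes those durations. This is a minimal illustration that assumes each decoder frame is assigned to the phoneme with the largest attention weight; the slides do not state the exact extraction rule, and the function names are hypothetical.

```python
def durations_from_attention(attn, num_phonemes):
    """Count, for each phoneme, the decoder frames whose attention peaks on it.

    attn[t][n] is the attention weight of decoder frame t on phoneme n.
    """
    durations = [0] * num_phonemes
    for frame_weights in attn:
        best = max(range(num_phonemes), key=lambda n: frame_weights[n])
        durations[best] += 1
    return durations

def length_regulator(hidden, durations):
    """Repeat each phoneme-level hidden state by its duration (H → Halign)."""
    return [h for h, d in zip(hidden, durations) for _ in range(d)]
```

The durations sum to the number of decoder frames, so the expanded sequence has the same length as the target mel-spectrogram.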

Slide 9

Slide 9 text

Results of real-time factors (RTFs)
Measurement conditions: PyTorch implementation
  GPU: NVIDIA Tesla V100
  CPU: Intel Xeon 6152
    Acoustic models: up to … cores
    WaveGlow: … cores (the maximum for one node)
Results
  FastSpeech-type models are faster than autoregressive models
  The RTF stays at about 0.026 even on CPUs (Table 1)
Real-time neural TTS on CPUs
  LPCNet @ … kHz → total RTF: …
  K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga, and H. Kawai, "Investigation of training data size for real-time neural vocoders on CPUs," Acoust. Sci. Tech. (accepted, to appear)
Excerpt from the accompanying paper (cropped on the slide): all models except Vanilla JDI-T use the duration predictor network (4); the autoregressive models with full-context label input (Tacotron 2, Transformer, BLSTM+Taco2dec) are included for comparison, and the five phoneme alignment methods are also applied to BLSTM+Taco2dec; 23,828 utterances (about 18 hours) form the training set; the frame shift is 12.5 ms; the full-context labels have 39 phoneme dimensions plus accent dimensions, 48 in total.
Table 1: Results of inference real-time factors (RTFs) of neural network models with an NVIDIA Tesla V100 and Intel Xeon 6152 CPUs. "FFT" denotes feed-forward Transformer.
  Model                 GPU      CPUs
  BLSTM+Taco2dec        0.015    0.21
  Tacotron 2            0.063    0.22
  Transformer           0.55     3.2
  FFT (AlignTTS)        0.005    0.026
  FFT (JDI-T)           0.005    0.026
  FFT duration model    0.0007   0.0024
  WaveGlow              0.066    2.1
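The real-time factor and the cost of a full pipeline can be sketched as follows. The figures are copied from Table 1; the assumption that the stages run one after another, so that their RTFs simply add, is ours and is not stated on the slide. The function names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

# RTFs from Table 1 (NVIDIA Tesla V100 GPU, Intel Xeon 6152 CPUs).
TABLE1 = {
    "FFT (AlignTTS)": {"gpu": 0.005, "cpu": 0.026},
    "FFT duration model": {"gpu": 0.0007, "cpu": 0.0024},
    "WaveGlow": {"gpu": 0.066, "cpu": 2.1},
}

def pipeline_rtf(stages, device):
    """Sum the per-stage RTFs, assuming the stages run sequentially."""
    return sum(TABLE1[stage][device] for stage in stages)

stages = ["FFT duration model", "FFT (AlignTTS)", "WaveGlow"]
gpu_total = pipeline_rtf(stages, "gpu")
cpu_total = pipeline_rtf(stages, "cpu")
```

Under that assumption the pipeline totals about 0.072 on the GPU and about 2.13 on CPUs. The acoustic model is far below real time on both, and the CPU total is dominated by WaveGlow, which is why the slide points to a CPU-friendly vocoder such as LPCNet.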

Slide 10

Slide 10 text

MOS results and demo samples
[Figure: mean opinion scores by acoustic model and alignment method. Non-autoregressive: FFT (AlignTTS) and FFT (JDI-T), each with HMM, MFA, Tacotron 2, AlignTTS, and JDI-T alignments, plus Vanilla JDI-T. Autoregressive: BLSTM+Taco2dec with the same five alignments. Seq2seq: Tacotron 2 and Transformer. References: Original and WaveGlow (analysis-synthesis)]
FastSpeech-type models fall short of the autoregressive models

Slide 11

Slide 11 text

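Discussions and conclusions
Discussion of the results
  FastSpeech-type (non-autoregressive) models fall short of autoregressive models in speech quality
    Autoregression over acoustic features matters → quality may be improved with auxiliary information such as the fundamental frequency (FastSpeech 2, FastPitch)
  The FFT structure makes no difference (AlignTTS or JDI-T)
  Alignment methods
    No significant difference except for AlignTTS → the JDI-T phoneme alignment is good
    AlignTTS was also run with full-context labels → the change of label type may have hurt it → to be examined with phoneme-only input
Conclusions
  Compared FastSpeech-type neural TTS models with full-context label input
    Compared five phoneme alignment methods for the AlignTTS-type and JDI-T-type models
    Confirmed fast generation on CPUs
    Speech quality falls short of the autoregressive models (Tacotron 2, Transformer, BLSTM+Taco2dec)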

Slide 12

Slide 12 text

Announcement
Keihanna R&D Fair @ Online, … (Thu) to … (Fri)
  Introduction of neural speech rate conversion technology
  Okamoto et al., "An attempt at neural speech rate conversion using a multi-speaker WaveNet vocoder," SP workshop (cancelled due to COVID-19)
[Figure: WaveNet vocoder (residual blocks with 2 × 1 dilated CNN, tanh and σ gates, 1 × 1 CNN, skip connections, ReLU, softmax output p(x_n | x_0, …, x_{n−1})) conditioned on a mel-spectrogram through (a) a bidirectional GRU followed by an upsample layer or (b) an upsample layer followed by a bidirectional GRU; acoustic features are resampled for speech rate conversion; trained using the JVS corpus]
[Figure: mean opinion score vs. speech rate conversion rate for STRAIGHT, SD-WaveNet, SI-WaveNet (a), SI-WaveNet (b), and WSOLA]