[ASJ2020a] Comparison of FastSpeech-based neural TTS models with full-context label input
Takuma OKAMOTO
September 11, 2020
Transcript
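Comparison of FastSpeech-based neural TTS models with full-context label input
Takuma Okamoto (presenter), Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
National Institute of Information and Communications Technology (NICT); Nagoya University
Sept. 2020, ASJ Annual Meeting (Autumn)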

Outline
- Introduction
- Purpose
- FastSpeech-based neural TTS models with full-context label input
  - AlignTTS
  - JDI-T
- Experiments
- Conclusions
- Announcement
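
Introduction

End-to-end neural text-to-speech (TTS)
- Sequence-to-sequence (seq2seq) models: Tacotron 2, Transformer, Deep Voice
  - Generate acoustic features from text (or phonemes); no phoneme alignment is needed.
  - Issues: when the attention mechanism mispredicts, phonemes are skipped or repeated → fatal for a production service. The models are autoregressive.
- State-of-the-art fully end-to-end TTS: EATS, FastSpeech 2s
  - Generate the speech waveform from phonemes with a single network; no phoneme alignment is needed; non-autoregressive.
  - Issues: speech quality still has room for improvement. Can a model that goes through the fundamental frequency really be called end-to-end?

Stable neural TTS models
- BLSTM+Taco2dec: T. Okamoto et al., ASRU
  - Uses forced alignments estimated with an HMM together with the Tacotron 2 decoder → stable and high quality.
  - Issues: the HMM phoneme alignment has to be trained separately; autoregressive.
- FastSpeech: Y. Ren et al., NeurIPS
  - Feed-forward Transformer, non-autoregressive (fast generation); phoneme alignment comes from a teacher Transformer.
  - Knowledge distillation (teacher-student training) gives quality equivalent to the autoregressive Transformer with stable generation.
  - Issues: knowledge distillation is required; the results may have plateaued because they were obtained on the LJSpeech corpus.

[Figure: real-time factor on GPU and CPUs]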
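
Previous results

Japanese neural TTS with full-context labels
- Full-context label input gives higher quality than phoneme-only input.
- FastSpeech without knowledge distillation does not reach the quality of autoregressive models.
- For FastSpeech, estimating phoneme durations with a separate network is more accurate.
- Demo samples: https://ast-astrec.nict.go.jp/demo_samples/icassp_…_okamoto/index.html

[Figure: mean opinion scores for (a) phoneme-only input and (b) full-context label input. Acoustic models: Tacotron 2, Transformer, BLSTM+Taco2dec, FastSpeech. Vocoders: WaveGlow, WG(256), PWG. References: original speech, STRAIGHT, analysis-synthesis.]
Source: Okamoto et al., ASJ Spring Meeting proceedings (the meeting was cancelled because of COVID-19).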
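
Purpose

FastSpeech-based neural TTS models that need no knowledge distillation
- AlignTTS: Z. Zeng et al., ICASSP
  - Estimates the phoneme alignment with a mixture density network; phoneme durations are estimated by a separate network; character input (English).
- JDI-T (Jointly trained Duration Informed Transformer): D. Lim et al.
  - Trains an autoregressive Transformer and FastSpeech jointly; phoneme input (Korean).
- FastSpeech 2: Y. Ren et al.
  - Uses the Montreal Forced Aligner (MFA) for phoneme alignment; uses the fundamental frequency as an intermediate feature; phoneme input (English).
- FastPitch: A. Łańcucki
  - Uses Tacotron 2 estimates for phoneme alignment; uses the fundamental frequency as an intermediate feature; phoneme input (English).

Purpose: compare FastSpeech-based Japanese TTS models with full-context label input
- Compare five phoneme alignment methods: HMM, MFA, Tacotron 2, AlignTTS and JDI-T
  → because each of the models above uses a different alignment method.
- Compare the FastSpeech model structures (AlignTTS vs. JDI-T)
  → because AlignTTS and JDI-T use different structures for the FastSpeech part (feed-forward Transformer).

Can fast-generation models for Japanese neural TTS be made to sound better?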
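
AlignTTS

[Figure: AlignTTS with full-context label input.
(1) FFT block: multi-head attention → Add & Norm → Conv 1D → Add & Norm.
(2) Feed-forward Transformer: full-context label → 1 × 1 conv layer → + positional encoding → N× FFT block → H → length regulator → H_align → + positional encoding → N× FFT block → linear layer → mel-spectrogram.
(3) Mixture density network (training only): H → linear layers → {(μ_j, Σ_j)}_m; with the mel frames {y_i}_n, the forward algorithm over p(y_i | μ_j, Σ_j) yields the alignment loss −log α_{n,m}.
(4) Duration predictor: full-context label → 1 × 1 conv layer → + positional encoding → N× FFT block → linear layer → duration sequence.]

Training procedure (original)
- Step 1: train the encoder and the mixture density network.
  - The mixture density network learns the mean and variance of a multivariate normal distribution.
    - Mean → the average mel-spectrogram of phoneme "X".
    - Variance → the spread of the spectrogram of phoneme "X".
  - The phoneme alignment is obtained with the Viterbi algorithm.
- Step 2: train the decoder.
  - The encoder is frozen and only the decoder is trained.
- Step 3: joint optimization.
  - FastSpeech and the mixture density network are trained together.
  - The phoneme alignment is updated as training proceeds.
- Step 4: train the phoneme duration network on the final phoneme alignment.

For details of the training procedure, see the material from the ICASSP acoustics and speech paper-reading meeting: https://bit.ly/…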
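
The alignment loss named in the figure, −log α_{n,m}, can be illustrated with a small log-domain forward recursion. This is a sketch under stated assumptions, not the authors' implementation: it takes a precomputed matrix `log_p[i][j] = log p(y_i | μ_j, Σ_j)` (frame i, phoneme j), assumes every alignment starts at the first phoneme, ends at the last one and either stays or advances by one phoneme per frame, and all function names are hypothetical. The second function reads phoneme durations off the best path with a Viterbi search, as in the alignment part of step 1.

```python
import math


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a > b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def alignment_loss(log_p):
    """Return -log alpha[n-1][m-1], summing over all monotonic alignments.

    log_p[i][j] is the log-likelihood of frame i under phoneme j (assumed input).
    """
    n, m = len(log_p), len(log_p[0])
    alpha = [[-math.inf] * m for _ in range(n)]
    alpha[0][0] = log_p[0][0]
    for i in range(1, n):
        for j in range(m):
            stay = alpha[i - 1][j]
            move = alpha[i - 1][j - 1] if j > 0 else -math.inf
            alpha[i][j] = log_add(stay, move) + log_p[i][j]
    return -alpha[n - 1][m - 1]


def viterbi_durations(log_p):
    """Return frames per phoneme on the single best monotonic path.

    Assumes at least as many frames as phonemes (n >= m).
    """
    n, m = len(log_p), len(log_p[0])
    best = [[-math.inf] * m for _ in range(n)]
    advanced = [[False] * m for _ in range(n)]  # True if the path moved to a new phoneme
    best[0][0] = log_p[0][0]
    for i in range(1, n):
        for j in range(m):
            stay = best[i - 1][j]
            move = best[i - 1][j - 1] if j > 0 else -math.inf
            if move > stay:
                best[i][j] = move + log_p[i][j]
                advanced[i][j] = True
            else:
                best[i][j] = stay + log_p[i][j]
    durations = [0] * m
    j = m - 1
    for i in range(n - 1, -1, -1):  # backtrack from the last frame
        durations[j] += 1
        if advanced[i][j]:
            j -= 1
    return durations
```

In a real system the recursion would run on batched tensors so that the loss stays differentiable; plain lists are used here only to keep the recursion readable.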
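
JDI-T

Training procedure
- FastSpeech and the decoder of an autoregressive Transformer are optimized jointly.
  - L1 loss: acoustic feature estimation.
  - L2 loss: phoneme duration estimation.
- Encouraging a diagonal attention matrix:
  - CTC loss: the phoneme sequence is estimated back from the decoder output.
  - Guided attention loss.

At synthesis time
- Only the FastSpeech part is used for inference.

[Figure: JDI-T with full-context label input. FastSpeech path: full-context label → 1 × 1 conv layer → encoder pre-net → + positional encoding → N× FFT block → H → length regulator → H_align → + positional encoding → N× FFT block → linear layer → mel-spectrogram (L1 loss), with a duration predictor (L2 loss). Training-only path: decoder pre-net → + positional encoding → attention mechanism (guided attention loss) → decoder → linear layers (L1 loss, CTC loss).]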
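
The guided attention loss listed above penalizes attention mass that lies far from the diagonal. The following is a minimal sketch, assuming the commonly used soft-diagonal weighting W[t][n] = 1 − exp(−(t/T − n/N)² / (2g²)); the width `g = 0.2` and the function names are illustrative assumptions, not values taken from the slides.

```python
import math


def guided_attention_weights(n_in, n_out, g=0.2):
    """Penalty matrix W[t][n]: 0 on the diagonal, approaching 1 far from it.

    n_in is the number of phonemes, n_out the number of decoder frames.
    """
    return [
        [1.0 - math.exp(-((t / n_out - n / n_in) ** 2) / (2.0 * g * g))
         for n in range(n_in)]
        for t in range(n_out)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[t][n] * W[t][n] over all frames t and phonemes n."""
    n_out, n_in = len(attention), len(attention[0])
    weights = guided_attention_weights(n_in, n_out, g)
    total = sum(
        attention[t][n] * weights[t][n]
        for t in range(n_out)
        for n in range(n_in)
    )
    return total / (n_out * n_in)
```

A perfectly diagonal attention matrix gives a loss of zero, while an anti-diagonal one is penalized, which is the behaviour the slide describes as promoting diagonal attention.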
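
Experimental conditions

Acoustic models
- Comparison of FastSpeech-type models (feed-forward Transformer: FFT)
  - AlignTTS-type or JDI-T-type model: they differ in the number of channels, the structure of the FFT block and the total number of FFT blocks.
  - Note: in each case only the FastSpeech part is trained on its own, without phoneme duration estimation.
  - Vanilla JDI-T: includes the autoregressive Transformer decoder and phoneme duration estimation.

Phoneme alignment methods
- HMM (HTS, Merlin)
- Montreal Forced Aligner (MFA): GMM + LDA (Kaldi)
- AlignTTS: the phoneme alignment obtained from step 1 training only, without the joint optimization of step 3.
- JDI-T: only the FastSpeech encoder and the autoregressive Transformer decoder are trained.
- Tacotron 2: after training, the phoneme alignment is taken from the attention weights obtained by running inference on the training set.
→ For each alignment, an FFT-type phoneme duration model is trained separately (FFT, BLSTM+Taco2dec).

Training data: Japanese female professional speaker; 23,828 utterances (about 18 hours); sampling frequency … kHz.
Full-context labels: 39 phoneme dimensions + 9 accent-information dimensions → 48 dimensions in total.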
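
To show how an attention-based alignment (the Tacotron 2 entry above) turns into phoneme durations, and how those durations are then consumed by the length regulator of a FastSpeech-type model, here is a small sketch. It assumes the usual recipe of assigning each output frame to the phoneme with the largest attention weight; the function names and the list-of-lists data layout are hypothetical.

```python
def durations_from_attention(attention, n_phonemes):
    """Count, for each phoneme, the frames whose attention peak falls on it.

    attention[t][j] is the weight of phoneme j at output frame t (assumed input).
    """
    durations = [0] * n_phonemes
    for frame_weights in attention:
        peak = max(range(n_phonemes), key=lambda j: frame_weights[j])
        durations[peak] += 1
    return durations


def length_regulate(hidden, durations):
    """Repeat each phoneme-level vector durations[j] times (frame-level sequence)."""
    if len(hidden) != len(durations):
        raise ValueError("one duration is required per phoneme")
    expanded = []
    for vector, d in zip(hidden, durations):
        for _ in range(max(0, int(d))):
            expanded.append(list(vector))  # copy so frames do not share storage
    return expanded
```

For example, `length_regulate([[1.0], [2.0]], durations_from_attention([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]], 2))` expands the first phoneme to two frames and the second to one, so the total number of frames equals the number of attention rows.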
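
Results of real-time factors (RTFs)

Measurement conditions: PyTorch implementation
- GPU: NVIDIA Tesla V100
- CPU: Intel Xeon 6152
  - Acoustic models: up to … cores
  - WaveGlow: … cores (the maximum of one node)

Results
- FastSpeech-type models ≪ autoregressive models in RTF (0.026 on CPUs for the FFT acoustic models in Table 1).
- Real-time neural TTS on CPUs: LPCNet @ … kHz → total RTF: …
  K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga and H. Kawai, "Investigation of training data size for real-time neural vocoders on CPUs," Acoust. Sci. Tech. (accepted, to appear).

[Excerpt from the accompanying paper, only partly visible on the slide: AlignTTS also uses a separate network for durations, so every model except vanilla JDI-T uses the duration network (4). For comparison, the autoregressive models with full-context label input (Tacotron 2, Transformer, BLSTM+Taco2dec [6]) are also included, and the five phoneme alignments are applied to BLSTM+Taco2dec as well. 23,828 utterances (about 18 hours) form the training set. Mel-spectrograms follow refs. [5, 6] with a frame shift of 12.5 ms. The full-context labels have 39 phoneme dimensions plus accent dimensions, 48 dimensions in total. The listeners were adult native Japanese speakers, as in refs. [5, 6].]

Table 1: Inference real-time factors (RTFs) of the neural network models with an NVIDIA Tesla V100 and Intel Xeon 6152 CPUs. "FFT" denotes feed-forward Transformer.

Model                 GPU      CPUs
BLSTM+Taco2dec        0.015    0.21
Tacotron 2            0.063    0.22
Transformer           0.55     3.2
FFT (AlignTTS)        0.005    0.026
FFT (JDI-T)           0.005    0.026
FFT duration model    0.0007   0.0024
WaveGlow              0.066    2.1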
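
As a reading aid for Table 1, the sketch below shows how a real-time factor is computed and how per-stage RTFs combine into a pipeline total. It assumes the stages run one after another on the same device, so their RTFs simply add; the dictionary keys and function names are hypothetical, and the numbers are copied from Table 1.

```python
# RTF values copied from Table 1 (GPU: NVIDIA Tesla V100, CPUs: Intel Xeon 6152).
TABLE1_RTF = {
    "BLSTM+Taco2dec": {"gpu": 0.015, "cpu": 0.21},
    "Tacotron 2": {"gpu": 0.063, "cpu": 0.22},
    "Transformer": {"gpu": 0.55, "cpu": 3.2},
    "FFT (AlignTTS)": {"gpu": 0.005, "cpu": 0.026},
    "FFT (JDI-T)": {"gpu": 0.005, "cpu": 0.026},
    "FFT duration model": {"gpu": 0.0007, "cpu": 0.0024},
    "WaveGlow": {"gpu": 0.066, "cpu": 2.1},
}


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio.

    Values below 1 mean faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def pipeline_rtf(stages, device):
    """Total RTF of sequentially executed stages (assumption: no overlap)."""
    return sum(TABLE1_RTF[stage][device] for stage in stages)


fastspeech_pipeline = ["FFT duration model", "FFT (AlignTTS)", "WaveGlow"]
cpu_total = pipeline_rtf(fastspeech_pipeline, "cpu")
gpu_total = pipeline_rtf(fastspeech_pipeline, "gpu")
```

Under this additive assumption the WaveGlow vocoder dominates the CPU total, which is consistent with the slide pointing to a lighter vocoder such as LPCNet for real-time synthesis on CPUs.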

MOS results and demo samples

[Figure: mean opinion scores by acoustic model and alignment. Non-autoregressive: FFT (AlignTTS) and FFT (JDI-T), each with HMM, MFA, Tacotron 2, AlignTTS and JDI-T alignments. Autoregressive: BLSTM+Taco2dec with the same five alignments. Seq2seq: vanilla JDI-T, Tacotron 2, Transformer. References: original speech and WaveGlow analysis-synthesis.]

FastSpeech-type models do not reach the quality of the autoregressive models.
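
Discussions and conclusions

Results and discussion
- FastSpeech-type (non-autoregressive) models do not reach the speech quality of autoregressive models.
  - Autoregression over the acoustic features matters → quality could be improved with auxiliary information such as the fundamental frequency (FastSpeech 2, FastPitch).
- The FFT structure (AlignTTS or JDI-T) makes no difference.
- Alignment method
  - No significant difference except for AlignTTS → the JDI-T phoneme alignment is good.
  - AlignTTS was used with full-context labels → the different label type may have hurt it → to be examined with phoneme-only input.

Conclusions
- Compared FastSpeech-based neural TTS models with full-context label input.
- Compared five phoneme alignment methods on the AlignTTS-type and JDI-T-type models.
- Confirmed fast generation on CPUs.
- Speech quality does not reach that of the autoregressive models (Tacotron 2, Transformer, BLSTM+Taco2dec).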

Announcement

- Keihanna R&D Fair @ online (dates garbled in the transcript; the last day is a Friday)
- Introduction of neural speech-rate conversion technology
  - Okamoto et al., "An attempt at neural speech-rate conversion using a multi-speaker WaveNet vocoder," SP research meeting (cancelled because of COVID-19).

[Figure: WaveNet vocoder (residual blocks with 2 × 1 dilated CNN, tanh/σ gating, 1 × 1 CNN and skip connections; output softmax p(x_n | x_0, …, x_{n−1})) conditioned on mel-spectrograms through an upsampling layer and a bidirectional GRU. Two configurations, (a) and (b), resample the acoustic features for speech-rate conversion. Mean opinion scores versus speech-rate conversion rate for STRAIGHT, SD-WaveNet, SI-WaveNet (a)/(b) and WSOLA; trained using the JVS corpus.]