Slide 9
Slide 9 text
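Measurement conditions: PyTorch implementation
GPU: NVIDIA Tesla V100
CPU: Intel Xeon 6152
Acoustic models: up to […] cores used
WaveGlow: […] cores used (the maximum for […] node)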
Results
FastSpeech-type models vs. autoregressive models
RTF of about […] when using CPUs
Real-time neural TTS using CPUs
LPCNet @ […] kHz
→ Total RTF: […]
K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga, and H. Kawai, "Investigation of training data size for real-time neural vocoders on CPUs," Acoust. Sci. Tech. (accepted, to appear).
Results of real-time factors (RTFs)
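… results have been obtained, and AlignTTS …
… uses a separate network. …
… except for Vanilla JDI-T, all …
… use the duration estimation network of (4) …
… In addition, for comparison, autoregressive models …
… context-label-input Tacotron …
… er and BLSTM+Taco2dec [6] are introduced.
… o2dec with five kinds of phoneme alignment …
… 23,828 speech utterances (18 hours) as the training set …
… as the test set, and the sampling frequency …
… Mel-spectrograms, references [5, 6] …
… with a frame shift of 12.5 ms …
… text labels: 39 phoneme dimensions and accent …
… dimensions, for a total of 48 dimensions. The estimated mel …
… neural […] that converts the […] into a speech waveform …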
Table 1: Results of inference real-time factors (RTFs) of neural network models with an NVIDIA Tesla V100 and Intel Xeon 6152 CPUs. "FFT" denotes feed-forward Transformer.

Model                 GPU      CPUs
BLSTM+Taco2dec        0.015    0.21
Tacotron 2            0.063    0.22
Transformer           0.55     3.2
FFT (AlignTTS)        0.005    0.026
FFT (JDI-T)           0.005    0.026
FFT duration model    0.0007   0.0024
WaveGlow              0.066    2.1
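The slide reports a total RTF for an acoustic model combined with a vocoder. As a minimal sketch, assuming the two stages run one after another so that their RTFs simply add (an assumption, not something the slide states), the Table 1 values can be combined as below. WaveGlow is used as the vocoder only because it is the one vocoder with values in Table 1; the duration-model row is left out, and `total_rtf` and `is_real_time` are hypothetical helper names.

```python
# RTFs copied from Table 1 as (GPU, CPUs) pairs.
ACOUSTIC_RTF = {
    "BLSTM+Taco2dec": (0.015, 0.21),
    "Tacotron 2": (0.063, 0.22),
    "Transformer": (0.55, 3.2),
    "FFT (AlignTTS)": (0.005, 0.026),
    "FFT (JDI-T)": (0.005, 0.026),
}
WAVEGLOW_RTF = (0.066, 2.1)


def total_rtf(acoustic: float, vocoder: float) -> float:
    """Sum per-stage RTFs. Assumption: the stages run sequentially."""
    return acoustic + vocoder


def is_real_time(rtf: float) -> bool:
    """An RTF below 1 means synthesis is faster than playback."""
    return rtf < 1.0


for name, (gpu, cpu) in ACOUSTIC_RTF.items():
    gpu_total = total_rtf(gpu, WAVEGLOW_RTF[0])
    cpu_total = total_rtf(cpu, WAVEGLOW_RTF[1])
    print(f"{name:16s} GPU {gpu_total:.3f} (real-time: {is_real_time(gpu_total)})"
          f"  CPUs {cpu_total:.3f} (real-time: {is_real_time(cpu_total)})")
```

Under this assumption, FFT (JDI-T) with WaveGlow comes to about 0.071 on the GPU and about 2.126 on CPUs, so the CPU pipeline would be slower than real time because of WaveGlow alone.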
… native speakers of Japanese and, as in references [5, 6], …