Slide 39
References
[1] Arik, Sercan, et al. "Deep Voice 2: Multi-speaker neural text-to-speech." arXiv preprint arXiv:1705.08947 (2017).
[2] Chen, Mingjian, et al. "MultiSpeech: Multi-speaker text to speech with transformer." arXiv preprint arXiv:2006.04664 (2020).
[3] Cooper, Erica, et al. "Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings." ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
[4] Chen, Yutian, et al. "Sample efficient adaptive text-to-speech." arXiv preprint arXiv:1809.10460 (2018).
[5] Wang, Yuxuan, et al. "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis." International Conference on Machine Learning. PMLR, 2018.
[6] Hsu, Wei-Ning, et al. "Hierarchical generative modeling for controllable speech synthesis." arXiv preprint arXiv:1810.07217 (2018).
[7] Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." arXiv preprint arXiv:1806.04558 (2018).
[8] Saito, Yuki, Shinnosuke Takamichi, and Hiroshi Saruwatari. "DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis." arXiv preprint arXiv:1907.08294 (2019).
[9] Chien, Chung-Ming, et al. "Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech." arXiv preprint arXiv:2103.04088 (2021).