neural text-to-speech." arXiv preprint arXiv:1705.08947 (2017).
[2] Chen, Mingjian, et al. "MultiSpeech: Multi-speaker text to speech with transformer." arXiv preprint arXiv:2006.04664 (2020).
[3] Cooper, Erica, et al. "Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings." ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
[4] Chen, Yutian, et al. "Sample efficient adaptive text-to-speech." arXiv preprint arXiv:1809.10460 (2018).
[5] Wang, Yuxuan, et al. "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis." International Conference on Machine Learning. PMLR, 2018.
[6] Hsu, Wei-Ning, et al. "Hierarchical generative modeling for controllable speech synthesis." arXiv preprint arXiv:1810.07217 (2018).
[7] Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." arXiv preprint arXiv:1806.04558 (2018).
[8] Saito, Yuki, Shinnosuke Takamichi, and Hiroshi Saruwatari. "DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis." arXiv preprint arXiv:1907.08294 (2019).
[9] Chien, Chung-Ming, et al. "Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech." arXiv preprint arXiv:2103.04088 (2021).