[1] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[2] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. ICML, 2018.
[3] W. Ping et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, 2019.
[4] R. Yamamoto et al., “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020.
[5] J. Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232.
[6] S. Ö. Arık et al., “Fast spectrogram inversion using multi-head convolutional neural networks,” IEEE Signal Process. Lett., 2019.
[7] K. Kumar et al., “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Proc. NeurIPS, 2019, pp. 14881–14892.
[8] G. Yang et al., “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” arXiv preprint arXiv:2005.05106, 2020.
[9] J. Yang et al., “VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network,” in Proc. INTERSPEECH, 2020.
[10] J. Kong et al., “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020.
[11] Y.-C. Wu et al., “Quasi-Periodic Parallel WaveGAN vocoder: A non-autoregressive pitch-dependent dilated convolution model for parametric speech generation,” in Proc. INTERSPEECH, 2020.
[12] Y. Ren et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
[13] J. Chen et al., “HiFiSinger: Towards high-fidelity neural singing voice synthesis,” arXiv preprint arXiv:2009.01776, 2020.

Icon made by Pixel perfect from www.flaticon.com