
TTS Skins: Speaker Conversion via ASR

peisuke
November 20, 2020

Presentation slides for the Interspeech 2020 speech paper reading group


Transcript

  1. TTS Skins: Speaker Conversion via ASR. Authors: A. Polyak, L. Wolf, Y. Taigman. Presenter: @peisuke
  2. 2016 ABEJA 2016 • Twitter: @peisuke • GitHub: https://github.com/peisuke • Qiita: https://qiita.com/peisuke • SlideShare: https://www.slideshare.net/FujimotoKeisuke
  3. • TTS Skins: Speaker Conversion via ASR • ASR WaveNet • ASR
  4. • Text-to-Speech 100 • Text-to-Speech • TTS

  5. • ASR F0

  6. • Jasper: An End-to-End Convolutional Neural Acoustic Model • https://github.com/NVIDIA/OpenSeq2Seq • 1DConv-BN-ReLU • Skip-Connection • Pre-trained
  7. • WaveNet • condition • https://github.com/NVIDIA/nv-WaveNet • F0
  8. • Lookup table (PyTorch Embedding) • F0 • fine tuning
  9. • LibriTTS, VCTK • Many-to-many (seen, unseen) • TTS • MOS • Mel cepstral distortion • Speaker classification • WaveNet AutoEncoder • PPG
  10. Seen • Seen-to-seen • A B • Identification F0

    | Method | LibriTTS MOS | LibriTTS MCD | LibriTTS Identification | VCTK MOS | VCTK MCD | VCTK Identification |
    |---|---|---|---|---|---|---|
    | Full method | 3.78±0.83 | - | 96.12 | 4.08±0.75 | 8.76±1.72 | 98.97 |
    | w/o F0 | 3.61±0.83 | - | 96.96 | 3.59±0.96 | 8.99±1.5 | 96.89 |
    | AE baseline | 2.89±0.88 | - | 29.19 | 3.46±1.07 | 9.45±1.63 | 69.26 |
    | PPG | 2.82±0.91 | - | 94.01 | 2.67±0.93 | 9.19±1.50 | 98.77 |
    | PPG2 | 2.87±1.00 | - | 95.77 | 3.03±1.06 | 9.18±1.52 | 96.24 |
  11. Unseen • Unseen-to-seen • A B

    | Method | LibriTTS MOS | LibriTTS MCD | LibriTTS Identification | VCTK MOS | VCTK MCD | VCTK Identification |
    |---|---|---|---|---|---|---|
    | Full method | 3.70±0.80 | - | 97.10 | 4.05±0.74 | 8.94±1.53 | 98.33 |
    | w/o F0 | 3.67±0.82 | - | 97.15 | 3.62±0.99 | 9.25±1.62 | 95.69 |
    | AE baseline | 3.02±0.89 | - | 32.55 | 3.83±0.91 | 9.65±1.51 | 66.20 |
    | PPG | 2.79±0.93 | - | 94.05 | 2.89±0.93 | 9.45±1.45 | 97.45 |
    | PPG2 | 2.71±0.93 | - | 95.43 | 3.19±1.04 | 9.79±1.86 | 97.25 |
  12. TTS • TTS • TTS

    | Method | LibriTTS MOS | LibriTTS MCD | LibriTTS Identification | VCTK MOS | VCTK MCD | VCTK Identification |
    |---|---|---|---|---|---|---|
    | Original TTS | 4.25±0.77 | 10.12±1.27 | - | 4.37±0.80 | 14.52±2.40 | - |
    | Full method | 3.67±0.81 | 8.13±0.95 | 96.06 | 4.17±0.88 | 12.68±2.17 | 99.25 |
    | w/o F0 | 3.47±0.76 | 8.43±0.97 | 96.66 | 3.75±1.07 | 13.06±2.26 | 96.36 |
    | AE baseline | 3.02±0.84 | 9.38±1.09 | 60.26 | 3.85±1.05 | 13.81±2.29 | 75.56 |
    | PPG | 2.91±0.94 | 8.52±0.93 | 96.63 | 3.50±0.83 | 12.45±1.92 | 98.36 |
    | PPG2 | 2.85±0.87 | 8.76±1.06 | 95.08 | 3.66±1.03 | 12.57±2.10 | 97.62 |
  13. • The Voice Conversion Challenge 2018 • 1 81 4-5 • Hub Spoke

    | System | Hub MOS | Hub Similarity | Spoke MOS | Spoke Similarity |
    |---|---|---|---|---|
    | Ours | 3.84±0.85 | 2.87±1.14 | 4.00±0.55 | 3.14±0.97 |
    | N10 | 3.92±0.75 | 2.83±1.20 | 3.98±0.52 | 3.13±0.97 |
    | N17 | 3.27±0.95 | 2.77±1.17 | 3.40±0.88 | 3.05±0.96 |
  14. • TTS • ASR F0 Conditional WaveNet
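
The result slides report Mel cepstral distortion (MCD) alongside MOS and speaker identification. As a rough illustration of what that metric measures, here is a minimal sketch of the commonly used MCD formula in plain Python. The slides do not say how many cepstral coefficients were used or how frames were aligned, so the function name `mel_cepstral_distortion`, the exclusion of the 0th (energy) coefficient, and the requirement that the two sequences are already time-aligned are all assumptions made for this example, not details taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients. Assumption: the 0th coefficient (energy) is skipped, which
    is a common convention but is not confirmed by the slides.
    """
    if len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be time-aligned and of equal length")
    if not ref_frames:
        raise ValueError("at least one frame is required")

    # Constant factor of the usual MCD definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D (0th skipped).
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(squared)

    # Average the per-frame distortion over all frames.
    return total / len(ref_frames)


if __name__ == "__main__":
    # Hypothetical toy frames, not data from the paper.
    reference = [[1.0, 0.5, -0.2], [0.3, 0.1, 0.4]]
    converted = [[0.9, 0.4, -0.1], [0.2, 0.2, 0.3]]
    print(mel_cepstral_distortion(reference, converted))
```

Lower values mean the converted speech is spectrally closer to the reference. In practice the two utterances usually differ in length, so an alignment step such as dynamic time warping would come before this computation; that step is left out of the sketch.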