
TTS Skins: Speaker Conversion via ASR

peisuke
November 20, 2020

Slides presented at the Interspeech 2020 speech paper reading group


Transcript

  1. TTS Skins: Speaker Conversion via ASR
    Authors: A. Polyak, L. Wolf, Y. Taigman
    presenter: @peisuke


  2. 2016
    ABEJA 2016
    Twitter @peisuke
    Github https://github.com/peisuke
    Qiita https://qiita.com/peisuke
    SlideShare https://www.slideshare.net/FujimotoKeisuke



  3. • TTS Skins: Speaker Conversion via ASR


    • ASR WaveNet

    • ASR


  4. • Text-to-Speech
    100

    • Text-to-Speech

    • TTS


  5. • ASR
    F0



  6. • Jasper: An End-to-End Convolutional Neural Acoustic Model
    • https://github.com/NVIDIA/OpenSeq2Seq

    • 1D Conv-BN-ReLU
    • Skip-Connection
    • Pre-trained


  7. • WaveNet
    • condition
    • https://github.com/NVIDIA/nv-WaveNet



    • F0
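
For reference, conditioning in a WaveNet layer is commonly written as the gated activation from the original WaveNet paper. The symbols are assumptions introduced here, not taken from the slides: x is the layer input, c is the conditioning signal upsampled to the audio rate (presumably the ASR features and F0 in this method), W and V are learned filters of layer k, * denotes convolution, and ⊙ is the element-wise product. The exact form used in nv-wavenet or in the paper may differ.

```latex
z_k = \tanh\left(W_{f,k} * x + V_{f,k} * c\right) \odot \sigma\left(W_{g,k} * x + V_{g,k} * c\right)
```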



    • Look-up table (PyTorch Embedding)


    • F0

    • fine tuning
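
A minimal sketch of what a speaker look-up table does, written in plain Python so it runs without dependencies. In practice this role would be played by `torch.nn.Embedding`. The names `SpeakerLookupTable` and `build_conditioning` are hypothetical, and concatenating the speaker vector with per-frame ASR features and F0 is an assumption about how the conditioning signal might be assembled, not a detail confirmed by the slides.

```python
import random


class SpeakerLookupTable:
    """Toy stand-in for an embedding layer: one vector per speaker id."""

    def __init__(self, num_speakers, dim, seed=0):
        rng = random.Random(seed)
        # In a real model these rows are trainable parameters.
        self.table = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_speakers)
        ]

    def __call__(self, speaker_id):
        # A look-up is just row indexing by integer speaker id.
        return self.table[speaker_id]


def build_conditioning(asr_frames, f0_frames, speaker_vec):
    """Concatenate per-frame ASR features, per-frame F0, and the speaker
    vector (repeated for every frame). Assumed layout, for illustration."""
    if len(asr_frames) != len(f0_frames):
        raise ValueError("ASR features and F0 must have the same number of frames")
    return [
        list(asr) + [f0] + list(speaker_vec)
        for asr, f0 in zip(asr_frames, f0_frames)
    ]


table = SpeakerLookupTable(num_speakers=3, dim=4)
asr = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 frames, 2 ASR features each
f0 = [110.0, 0.0, 115.0]                     # 0.0 marks an unvoiced frame here
cond = build_conditioning(asr, f0, table(1))
print(len(cond), len(cond[0]))               # 3 frames, 2 + 1 + 4 = 7 values per frame
```

Switching the target voice then amounts to passing a different speaker id to the table while keeping the ASR features and F0 unchanged.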



  9. • LibriTTS VCTK

    • Many-to-many seen unseen
    • TTS


    • MOS
    • Mel cepstral distortion
    • Speaker classification

    • WaveNet AutoEncoder
    • PPG
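
Of the metrics listed above, Mel cepstral distortion (MCD) is the easiest to show in code. The sketch below uses the commonly cited definition, averaged over frames. It assumes the two sequences are already time-aligned (for example by dynamic time warping) and that the 0th coefficient (energy) is excluded; the function name is hypothetical and the evaluation setup in the paper may differ in these details.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB between two aligned sequences of mel-cepstral vectors.

    Each frame is a list of coefficients; index 0 (energy) is skipped.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and have equal length")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared_diff = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
print(mel_cepstral_distortion(reference, reference))  # identical inputs give 0.0
```

Lower values mean the converted speech is spectrally closer to the reference recording of the target speaker.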


  10. Seen
    • Seen-to-seen
    • A B

    • Identification F0
                  LibriTTS                               VCTK
    Method        MOS        MCD         Identification  MOS        MCD         Identification
    Full method   3.78±0.83  -           96.12           4.08±0.75  8.76±1.72   98.97
    w/o F0        3.61±0.83  -           96.96           3.59±0.96  8.99±1.5    96.89
    AE baseline   2.89±0.88  -           29.19           3.46±1.07  9.45±1.63   69.26
    PPG           2.82±0.91  -           94.01           2.67±0.93  9.19±1.50   98.77
    PPG2          2.87±1.00  -           95.77           3.03±1.06  9.18±1.52   96.24


  11. Unseen
    • Unseen-to-seen
    • A B

                  LibriTTS                               VCTK
    Method        MOS        MCD         Identification  MOS        MCD         Identification
    Full method   3.70±0.80  -           97.10           4.05±0.74  8.94±1.53   98.33
    w/o F0        3.67±0.82  -           97.15           3.62±0.99  9.25±1.62   95.69
    AE baseline   3.02±0.89  -           32.55           3.83±0.91  9.65±1.51   66.20
    PPG           2.79±0.93  -           94.05           2.89±0.93  9.45±1.45   97.45
    PPG2          2.71±0.93  -           95.43           3.19±1.04  9.79±1.86   97.25


  12. TTS
    • TTS
    • TTS
                  LibriTTS                               VCTK
    Method        MOS        MCD         Identification  MOS        MCD         Identification
    Original TTS  4.25±0.77  10.12±1.27  -               4.37±0.80  14.52±2.40  -
    Full method   3.67±0.81  8.13±0.95   96.06           4.17±0.88  12.68±2.17  99.25
    w/o F0        3.47±0.76  8.43±0.97   96.66           3.75±1.07  13.06±2.26  96.36
    AE baseline   3.02±0.84  9.38±1.09   60.26           3.85±1.05  13.81±2.29  75.56
    PPG           2.91±0.94  8.52±0.93   96.63           3.50±0.83  12.45±1.92  98.36
    PPG2          2.85±0.87  8.76±1.06   95.08           3.66±1.03  12.57±2.10  97.62


  13. • The voice conversion challenge 2018
    • 1 81 4-5
    • Hub Spoke

            Hub                     Spoke
    System  MOS        Similarity   MOS        Similarity
    Ours    3.84±0.85  2.87±1.14    4.00±0.55  3.14±0.97
    N10     3.92±0.75  2.83±1.20    3.98±0.52  3.13±0.97
    N17     3.27±0.95  2.77±1.17    3.40±0.88  3.05±0.96



  14. • TTS
    • ASR
    F0 Conditional WaveNet
