
[ASJ2020a] Comparison of FastSpeech-based neural TTS models with full-context label input

Takuma OKAMOTO
September 11, 2020

Transcript

  1. Comparison of FastSpeech-based neural TTS models with full-context label input
    ○Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, Hisashi Kawai
    National Institute of Information and Communications Technology (NICT), Nagoya University

    ASJ Annual Meeting (Autumn), September 2020

  2. Outline
    Introduction
    Purpose
    FastSpeech-based neural TTS models with full-context label input
    AlignTTS
    JDI-T
    Experiments
    Conclusions
    Announcement

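  3. Introduction
    End-to-end neural text-to-speech synthesis (TTS)

    Sequence-to-sequence (Seq2seq) models: Tacotron, Transformer, Deep Voice
    Generate acoustic features directly from text (or phonemes); no phoneme alignment is needed
    Issues: attention prediction failures cause skipped or repeated phonemes, which is fatal for a production service; autoregressive models
    State-of-the-art fully end-to-end TTS: EATS, FastSpeech 2
    Generate the speech waveform directly from phonemes with a single network; no phoneme alignment is needed; non-autoregressive models
    Issues: quality still has room for improvement; can a model that goes through the fundamental frequency really be called end-to-end?

    Stable neural TTS models
    BLSTM+Taco2dec: T. Okamoto et al., ASRU
    Uses forced alignment estimated by an HMM together with the Tacotron 2 decoder → stable and high quality
    Issues: phoneme alignment must be trained separately with an HMM; autoregressive model
    FastSpeech: Y. Ren et al., NeurIPS
    Feed-forward Transformer; non-autoregressive (ultra-fast generation); phoneme alignment from a teacher Transformer
    Knowledge distillation (teacher-student training) gives quality equivalent to the autoregressive Transformer with stable generation
    Issues: knowledge distillation is required; the results may have hit a ceiling because the LJSpeech corpus was used

    [Figure: real-time factor on GPU and CPUs]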

  4. Previous results
    Japanese neural TTS using full-context labels
    Full-context labels give higher quality than phoneme-only input
    FastSpeech without knowledge distillation does not match the autoregressive models
    For FastSpeech, phoneme durations are predicted more accurately by a separate network
    Demo samples: https://ast-astrec.nict.go.jp/demo_samples/icassp_…_okamoto/index.html

    [Figure: mean opinion scores for (a) phoneme-only input and (b) full-context label input. Acoustic models: Tacotron 2, Transformer, BLSTM+Taco2dec, FastSpeech. Vocoders: WaveGlow, WG(256), PWG. References: Original, STRAIGHT, analysis-synthesis]
    Okamoto et al., Proc. ASJ Spring Meeting (*cancelled due to COVID-19)

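  5. Purpose
    FastSpeech-based neural TTS models that need no knowledge distillation
    AlignTTS: Z. Zeng et al., ICASSP
    Estimates the phoneme alignment with a mixture density network; phoneme durations are predicted by a separate network; character input (English)

    JDI-T (Jointly trained Duration Informed Transformer): D. Lim et al.
    Trains an autoregressive Transformer and FastSpeech jointly; phoneme input (Korean)

    FastSpeech 2: Y. Ren et al.
    Uses the Montreal Forced Aligner (MFA) for phoneme alignment; uses the fundamental frequency as an intermediate feature; phoneme input (English)

    FastPitch: A. Łańcucki
    Uses Tacotron 2 estimates for phoneme alignment; uses the fundamental frequency as an intermediate feature; phoneme input (English)

    Purpose: compare FastSpeech-based Japanese TTS models with full-context label input
    Compare five phoneme alignment methods: HMM, MFA, Tacotron 2, AlignTTS, and JDI-T
    → because each model uses a different alignment method
    Compare the FastSpeech model structures (AlignTTS vs. JDI-T)
    → because the FastSpeech (feed-forward Transformer) structure differs between AlignTTS and JDI-T

    Can ultra-fast generation models for Japanese neural TTS be made high quality?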

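  6. AlignTTS
    [Figure: AlignTTS with full-context label input. (1) FFT block: multi-head attention, add & norm, 1D conv, add & norm. (2) Feed-forward Transformer: 1 × 1 conv layer, positional encoding, N× FFT blocks (output H), length regulator (output H_align), positional encoding, N× FFT blocks, linear layer, mel-spectrogram. (3) Mixture density network, training only: linear layers give {(µ_j, Σ_j)}_m, and the forward algorithm over p(y_i | µ_j, Σ_j) for frames {y_i}_n gives the alignment loss −log α_{n,m}. (4) Duration predictor: 1 × 1 conv layer, positional encoding, N× FFT blocks, linear layer, duration sequence]

    Training procedure (original)
    Step 1: train the encoder and the mixture density network
    The mixture density network learns the mean and variance of a multivariate Gaussian for each phoneme
      Mean → the average mel-spectrogram of phoneme "X"
      Variance → the spread of the spectrogram of phoneme "X"
    The phoneme alignment is obtained with the Viterbi algorithm
    Step 2: train the decoder
    The encoder is frozen and only the decoder is trained
    Step 3: joint optimization
    FastSpeech and the mixture density network are trained jointly
      The phoneme alignment is updated at every training iteration
    Step 4
    Finally, the phoneme duration predictor is trained on the fixed phoneme alignment

    Details of the training procedure
    See the materials from the ICASSP acoustics and speech paper-reading meeting
    https://bit.ly/…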
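The alignment loss and the Viterbi alignment described for AlignTTS can be sketched in a few lines. This is a minimal illustration, not the AlignTTS implementation: it assumes one diagonal-covariance Gaussian per phoneme, a strictly monotonic alignment that starts at the first phoneme and ends at the last, and at least as many frames as phonemes. The names `log_gauss`, `logaddexp`, and `align` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_gauss(y, mu, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(y, mu, var)
    )


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def align(frames, means, variances):
    """Return (alignment loss -log alpha[n-1][m-1], Viterbi durations)."""
    n, m = len(frames), len(means)
    logp = [
        [log_gauss(frames[i], means[j], variances[j]) for j in range(m)]
        for i in range(n)
    ]
    alpha = [[NEG_INF] * m for _ in range(n)]  # forward (sum over paths)
    delta = [[NEG_INF] * m for _ in range(n)]  # Viterbi (best path)
    back = [[0] * m for _ in range(n)]         # 1 = moved on from phoneme j-1
    alpha[0][0] = delta[0][0] = logp[0][0]
    for i in range(1, n):
        for j in range(min(i + 1, m)):
            stay_a = alpha[i - 1][j]
            move_a = alpha[i - 1][j - 1] if j > 0 else NEG_INF
            alpha[i][j] = logaddexp(stay_a, move_a) + logp[i][j]
            stay_d = delta[i - 1][j]
            move_d = delta[i - 1][j - 1] if j > 0 else NEG_INF
            if move_d > stay_d:
                delta[i][j] = move_d + logp[i][j]
                back[i][j] = 1
            else:
                delta[i][j] = stay_d + logp[i][j]
    # Backtrack from the last frame and last phoneme to count frames per phoneme.
    durations = [0] * m
    j = m - 1
    for i in range(n - 1, -1, -1):
        durations[j] += 1
        j -= back[i][j]
    return -alpha[n - 1][m - 1], durations


# Toy example: 1-D "spectra", two phonemes with means 0.0 and 1.0.
frames = [[0.0], [0.1], [1.0], [0.9], [1.1]]
means = [[0.0], [1.0]]
variances = [[0.1], [0.1]]
loss, durations = align(frames, means, variances)
print(loss, durations)  # durations == [2, 3]
```

The forward pass sums over all monotonic paths and is what a training loss would differentiate; the Viterbi pass picks the single best path, whose per-phoneme frame counts become the duration targets.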

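  7. JDI-T
    Training procedure
    FastSpeech and the autoregressive Transformer decoder are optimized jointly
    L1 loss: acoustic feature estimation
    L2 loss: phoneme duration estimation
    Encouraging diagonal attention
    CTC loss: recover the phoneme sequence from the decoder output
    Guided attention loss
    At synthesis time
    Only the FastSpeech part is used for inference

    [Figure: JDI-T with full-context label input. 1 × 1 conv layer, encoder pre-net, positional encoding, N× FFT blocks (output H), duration predictor, length regulator (output H_align), positional encoding, FFT blocks, linear layer, mel-spectrogram. Training-only branch: decoder pre-net, positional encoding, attention mechanism, decoder, linear layers. Losses: L1 loss (two outputs), L2 loss, CTC loss against the phoneme sequence, guided attention loss]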
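One widely used way to encourage diagonal attention is a guided attention penalty that grows with the distance from the diagonal. The sketch below follows that common formulation; whether JDI-T uses exactly this weighting, and the width `g = 0.2`, are assumptions made here for illustration, and `guided_attention_loss` is a hypothetical name.

```python
import math


def guided_attention_loss(attn, g=0.2):
    """Mean of attn[t][n] * W[t][n], where W is near 0 on the diagonal.

    attn[t][n] is the attention weight of decoder frame t on input position n.
    """
    num_frames = len(attn)
    num_inputs = len(attn[0])
    total = 0.0
    for t, row in enumerate(attn):
        for n, a in enumerate(row):
            offset = n / num_inputs - t / num_frames
            weight = 1.0 - math.exp(-(offset ** 2) / (2.0 * g * g))
            total += a * weight
    return total / (num_frames * num_inputs)


# A perfectly diagonal attention matrix is not penalized at all,
# while an anti-diagonal one is.
diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
anti_diagonal = [[1.0 if n == 3 - t else 0.0 for n in range(4)] for t in range(4)]
print(guided_attention_loss(diagonal), guided_attention_loss(anti_diagonal))
```

Because the penalty only applies to the attention of the autoregressive decoder, it affects training but not synthesis, where only the FastSpeech part runs.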

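  8. Experimental conditions
    Acoustic models
    Comparison of FastSpeech-based models (feed-forward Transformer: FFT)
    AlignTTS-type or JDI-T-type model: they differ in the number of channels, the FFT block structure, the total number of FFT blocks, etc.
    * In each case only the FastSpeech part is trained on its own, without phoneme duration prediction
    Vanilla JDI-T: includes the autoregressive Transformer decoder and phoneme duration prediction
    Phoneme alignment methods
    HMM (HTS, Merlin)

    Montreal Forced Aligner (MFA): GMM-LDA (Kaldi)

    AlignTTS: phoneme alignment obtained from step 1 training only, without the joint optimization of step 3
    JDI-T: only the FastSpeech encoder and the autoregressive Transformer decoder are trained
    Tacotron 2: phoneme alignment taken from the attention weights when the trained model is run on the training set
    → An FFT-type phoneme duration predictor is trained separately for each alignment (FFT, BLSTM+Taco2dec)

    Training data: Japanese professional female speaker, 23,828 utterances (about 18 hours), … kHz
    Full-context label: 39 phoneme dimensions + 9 accent-information dimensions → 48 dimensions in total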
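Every alignment method above ends up as one duration (a frame count) per phoneme, which the length regulator then uses to expand phoneme-level hidden states to frame level. The sketch below shows one plausible way to do both steps; counting the argmax of each attention row is the FastSpeech-style convention and is an assumption here, not something the slides specify. `durations_from_attention` and `length_regulate` are hypothetical names.

```python
def durations_from_attention(attn, num_phonemes):
    """Count, for each phoneme, the output frames that attend to it most.

    attn[t][j] is the attention weight of output frame t on phoneme j.
    """
    durations = [0] * num_phonemes
    for row in attn:
        best = max(range(num_phonemes), key=lambda k: row[k])
        durations[best] += 1
    return durations


def length_regulate(hidden, durations):
    """Repeat each phoneme-level vector as many times as its duration."""
    expanded = []
    for vector, duration in zip(hidden, durations):
        expanded.extend([vector] * duration)
    return expanded


# Toy attention matrix: 5 output frames over 3 phonemes.
attn = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
durations = durations_from_attention(attn, 3)
frames = length_regulate(["a", "b", "c"], durations)
print(durations, frames)  # [2, 1, 2] ['a', 'a', 'b', 'c', 'c']
```

The durations always sum to the number of output frames, so the expanded sequence has the same length as the target mel-spectrogram.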

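  9. Results of real-time factors (RTFs)
    Measurement conditions: PyTorch implementation
    GPU: NVIDIA Tesla V100
    CPU: Intel Xeon 6152
    Acoustic models: up to … cores used
    WaveGlow: … cores used (the maximum for the node)

    Results
    FastSpeech-based models are faster than the autoregressive models (Table 1)
    Even on CPUs the RTF stays low (0.026 for the FFT models in Table 1)
    Real-time neural TTS on CPUs
    LPCNet @ … kHz
    → Total RTF: …

    K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga, and H. Kawai, "Investigation of training data size for real-time neural vocoders on CPUs," Acoust. Sci. Tech. (accepted, to appear)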


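    [Cropped excerpt from the accompanying paper; only fragments are legible. Recoverable details: all models except Vanilla JDI-T use the duration prediction network of (4); for comparison, autoregressive models with full-context label input (Tacotron 2, Transformer, and BLSTM+Taco2dec [6]) are also included, and the five phoneme alignment methods are applied to BLSTM+Taco2dec as well; 23,828 utterances (about 18 hours) form the training set; mel-spectrograms follow [5, 6] with a frame shift of 12.5 ms; the full-context labels have 39 phoneme dimensions plus accent dimensions, 48 dimensions in total; a neural vocoder converts the estimated mel-spectrograms into speech waveforms]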
    Table 1: Results of inference real-time factors (RTFs) of neural network models with an NVIDIA Tesla V100 and Intel Xeon 6152 CPUs. "FFT" denotes feed-forward Transformer.

    Model                 GPU      CPUs
    BLSTM+Taco2dec        0.015    0.21
    Tacotron 2            0.063    0.22
    Transformer           0.55     3.2
    FFT (AlignTTS)        0.005    0.026
    FFT (JDI-T)           0.005    0.026
    FFT duration model    0.0007   0.0024
    WaveGlow              0.066    2.1
    [Cropped excerpt, continued: the listeners were … adult native Japanese speakers, as in [5, 6], …]
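The real-time factor is the processing time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below adds up the stage RTFs from Table 1 for one pipeline (duration model, FFT acoustic model, WaveGlow). Summing the stages assumes they run one after another on the same hardware, which is an assumption of this example; `real_time_factor` and `total_rtf` are hypothetical helper names.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return processing_seconds / audio_seconds


def total_rtf(*stage_rtfs):
    """Total RTF of stages that run sequentially (assumption: no overlap)."""
    return sum(stage_rtfs)


# Values from Table 1: FFT duration model, FFT (AlignTTS), WaveGlow.
gpu_total = total_rtf(0.0007, 0.005, 0.066)
cpu_total = total_rtf(0.0024, 0.026, 2.1)

print(f"GPU total RTF: {gpu_total:.4f}")  # 0.0717, faster than real time
print(f"CPU total RTF: {cpu_total:.4f}")  # 2.1284, slower than real time
```

With these figures the acoustic model is a negligible part of the CPU cost; the WaveGlow vocoder dominates, which is consistent with the slide turning to LPCNet for real-time synthesis on CPUs.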

  10. MOS results and demo samples

    [Figure: mean opinion scores by acoustic model and alignment method. Non-autoregressive: FFT (AlignTTS) and FFT (JDI-T), each with HMM, MFA, Tacotron 2, AlignTTS, and JDI-T alignments, plus Vanilla JDI-T. Autoregressive: BLSTM+Taco2dec with the same five alignments, and the Seq2seq models Tacotron 2 and Transformer. References: Original, WaveGlow (analysis-synthesis)]
    FastSpeech-based models do not match the autoregressive models

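  11. Discussions and conclusions
    Discussion of the results
    FastSpeech-based (non-autoregressive) models do not match the autoregressive models in quality
    Autoregression over acoustic features matters
    → Quality could be improved with auxiliary information such as the fundamental frequency (FastSpeech 2, FastPitch)

    No difference between the FFT structures (AlignTTS or JDI-T)

    Alignment methods
    No significant difference except for AlignTTS → the JDI-T phoneme alignment is good
    AlignTTS was also given full-context labels → the different label type had a negative effect → to be examined with phoneme-only input

    Conclusions: comparison of FastSpeech-based neural TTS models with full-context label input
    Five phoneme alignment methods were compared on the AlignTTS-type and JDI-T-type models
    Fast generation on CPUs was confirmed
    Quality does not match the autoregressive models (Tacotron 2, Transformer, BLSTM+Taco2dec)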


  12. Announcement
    Keihanna R&D Fair @ Online, … to … (Fri.)
    Introduction of neural speech-rate conversion technology
    Okamoto et al., "An attempt at neural speech-rate conversion using a multi-speaker WaveNet vocoder," SP workshop (*cancelled due to COVID-19)

    [Figure: WaveNet vocoder (residual blocks with 2 × 1 dilated CNN, tanh/σ gating, and 1 × 1 CNN; skip connections; ReLU, 1 × 1 CNN, ReLU, 1 × 1 CNN, softmax giving p(x_n | x_0, ..., x_{n−1})) conditioned on the mel-spectrogram through a bidirectional GRU and an upsample layer, with two variants (a) and (b) that resample the acoustic features for speech-rate conversion]
    [Figure: mean opinion score against speech-rate conversion rate for STRAIGHT, SD-WaveNet, SI-WaveNet (a) and (b), and WSOLA]
    Trained using the JVS corpus
    Resampling acoustic features for speech-rate conversion