Target-speaker voice activity detection:
A novel approach for multi-speaker diarization in a dinner party scenario

Hitoshi Suda
November 20, 2020

Slides for an INTERSPEECH 2020 paper reading group.

Transcript

  1. Target-speaker voice activity detection:
     A novel approach for multi-speaker diarization in a dinner party scenario
     Hitoshi Suda (The University of Tokyo) @ INTERSPEECH 2020 reading group
  2.  Self-introduction
      • Doctoral student, Daisuke Saito Lab, The University of Tokyo
      • Non-parallel NMF voice conversion
      • Presented at INTERSPEECH
      • Singer diarization
  3.  The paper
      • Ivan Medennikov et al.
        Target-speaker voice activity detection:
        A novel approach for multi-speaker diarization in a dinner party scenario
      • From the Russian company STC-innovations and
        ITMO University in Saint Petersburg
      Why I chose this paper
      • It introduces state-of-the-art diarization technology
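  4.  Summary
      • By implementing VAD for target speakers, the paper achieves
        high-performance diarization on the CHiME-6 challenge
      • TS-VAD with a known number of speakers is run on multi-channel audio
      • Evaluation under different acoustic environments, or with an unknown number of speakers, is left as future work
      Impressions
      • My impression is that the result comes from solid implementation and well-refined initial values
      • The initial clustering stage is quite elaborately engineered
      • I get the feeling that it only works because the number of speakers is known
      • I wonder whether overlapped regions were really recognized properly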
  5.  Outline
      • Problem setting
      • Existing methods
      • Proposed method
      • Target-speaker VAD
      • Multi-speaker target-speaker VAD
  6.  Diarization
      • The problem of determining "who speaks when?" in conversational speech
      [Figure: speech timeline labeled Speaker A, Speaker B, Speaker C, Speaker D]

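  7.  Evaluation metric: diarization error rate (DER)
      • The summed duration of the error types, relative to the total speech duration
      • Lower is better
      • The speaker permutation is chosen so that DER is minimized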
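
The DER definition on this slide can be illustrated with a small frame-level sketch. It assumes the usual three error types (missed speech, false alarm, speaker confusion), frame-level scoring, and no forgiveness collar; the function name and the toy data are hypothetical and not taken from the paper or from any scoring tool.

```python
from itertools import permutations


def frame_level_der(ref, hyp, ref_speakers, hyp_speakers):
    """Frame-level DER, minimised over hypothesis-to-reference speaker mappings.

    ref, hyp: lists of sets, one set of active speaker labels per frame.
    Assumes len(hyp_speakers) <= len(ref_speakers).
    """
    total_ref = sum(len(r) for r in ref)  # total speech time, counted per speaker
    best_errors = None
    for perm in permutations(ref_speakers, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        errors = 0
        for r, h in zip(ref, hyp):
            mapped = {mapping[s] for s in h}
            # Missed speech, false alarm and speaker confusion in one expression
            errors += max(len(r), len(mapped)) - len(r & mapped)
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref


# Toy example: one missed overlap frame and one false-alarm frame
ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set(), {"B"}]
hyp = [{"x"}, {"x"}, {"x"}, {"y"}, {"y"}, {"y"}]
der = frame_level_der(ref, hyp, ["A", "B"], ["x", "y"])
print(round(der, 3))  # 0.333
```

Trying every mapping is only practical for a handful of speakers; real scoring tools typically solve the mapping as an assignment problem instead.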

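  8.  CHiME-6 Challenge
      • A challenge on ASR for conversational speech
      • The workshop was a satellite event of ICASSP
      • Audio: […] sessions of real dinner parties in real homes (apparently acted in practice)
      • Train/dev/eval: […]
      • About […] hours, respectively
      • […] Kinect devices with […] channels each
      • Each session has a fixed set of […] participants
      • Participants can move freely between rooms
      https://chimechallenge.github.io/chime6/index.html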
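  9.  CHiME-6 Challenge
      • A challenge on ASR for conversational speech
      • There are two tasks (tracks):
        Track 1 may use the ground-truth diarization;
        Track 2 also requires diarization (recognition from audio only)
      • The data are the same for both
      • The final evaluation metric is word error rate
      • In Ranking A the recognition model is shared; in Ranking B systems are built including the recognition model
      • Systems are evaluated for every combination of track × Ranking A/B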
  10.  Task 2 (audio only), Ranking A (recognition model given)
       Table columns: team, method, WER (dev, eval)
       • STC-Innovations / ITMO: clustering + TS-VAD
       • Johns Hopkins: clustering + overlap post-processing
       • USTC: multi-channel speech separation + clustering
       • Paderborn University: End-to-end Neural Diarization
       • Brno University of Technology: VBx + Guided Source Separation
       • Cloudwalk / XMU: clustering
       • City University of New York: clustering + overlap post-processing
       • Academia Sinica: clustering
       • Baseline: clustering
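  11.  Diarization is hard
       • Speakers are often far from the microphones
       • Noise is often present
       • The speech is conversational
       • Utterances overlap
       [Figure: proportion of time by number of simultaneous speakers, for train / dev / eval; 0 means nobody is speaking. CHiME-6 dataset]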
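
The statistic plotted on this slide, the share of time with 0, 1, 2, ... people speaking at once, can be computed from frame-level activity labels. The following is a minimal sketch; the input layout (one list of 0/1 flags per frame) and the toy values are assumptions for illustration, not CHiME-6 data.

```python
from collections import Counter


def simultaneous_speaker_ratio(activity):
    """Share of frames for each number of simultaneously active speakers.

    activity: list of frames, each a list of 0/1 flags (one flag per speaker).
    """
    counts = Counter(sum(frame) for frame in activity)
    n_frames = len(activity)
    return {k: counts[k] / n_frames for k in sorted(counts)}


# Toy example with two speakers and four frames
activity = [[0, 0], [1, 0], [1, 1], [0, 1]]
ratio = simultaneous_speaker_ratio(activity)
print(ratio)  # {0: 0.25, 1: 0.5, 2: 0.25}
```
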
  12.  Outline
       • Problem setting
       • Existing methods
       • Proposed method
       • Target-speaker VAD
       • Multi-speaker target-speaker VAD
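  13.  Classical approach: clustering
       • Segment the audio into reasonably short pieces, then cluster the speaker representation of each segment
       • Overlapped regions can be handled in post-processing [K. Boakye+], among others
       • DER (dev set): […]
       [Figure: pipeline of VAD → segmentation → speaker-representation extraction → clustering]
       K. Boakye et al. Overlapped speech detection for improved speaker diarization in multi-party meetings. In Proc. ICASSP.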
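
The last stage of the clustering pipeline can be sketched as follows. This is a toy average-linkage agglomerative clustering with cosine distance and a known number of speakers; the choice of linkage, the 2-D "speaker representations" and all names are assumptions for illustration, not the challenge baseline.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cluster(embeddings, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members
                d = sum(cosine_distance(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


# One toy embedding per segment; segments 0, 1, 4 point one way, 2, 3 another
segments = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (1.0, 0.0)]
labels = agglomerative_cluster(segments, n_clusters=2)
print(labels)
```

Because each segment receives exactly one label, a pipeline like this cannot by itself represent overlapped speech, which is why the slide mentions separate post-processing for overlap.
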
  14.  End-to-end Neural Diarization (EEND) [Y. Fujita+]
       • Outputs diarization results directly from the spectrogram
       • Requires a large amount of training data
       • Vulnerable to out-of-domain data
       • Cannot be applied to long recordings
       • The speaker permutation has to be resolved in a later stage
       • DER (dev set): […]
       [Figure from Y. Fujita et al., Fig. 2: two-speaker SA-EEND model trained with permutation-free loss]
       Y. Fujita et al. End-to-end neural speaker diarization with self-attention. In Proc. ASRU.
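
The permutation-free loss named in the figure caption can be sketched in plain Python: compute binary cross entropy for every assignment of output channels to reference speakers and keep the minimum. The averaging over frames and speakers and all names here are assumptions for illustration; this is not the EEND implementation.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross entropy for one posterior p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def permutation_free_loss(posteriors, labels):
    """Minimum mean BCE over all output-to-speaker permutations.

    posteriors, labels: lists indexed as [frame][speaker].
    """
    n_spk = len(labels[0])
    best = float("inf")
    for perm in permutations(range(n_spk)):
        loss = sum(bce(p_t[s], y_t[perm[s]])
                   for p_t, y_t in zip(posteriors, labels)
                   for s in range(n_spk))
        best = min(best, loss / (len(labels) * n_spk))
    return best


# The network's output order is arbitrary, so both label orders score the same
posteriors = [[0.9, 0.1], [0.8, 0.2]]
loss_a = permutation_free_loss(posteriors, [[1, 0], [1, 0]])
loss_b = permutation_free_loss(posteriors, [[0, 1], [0, 1]])
```
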
  15.  Outline
       • Problem setting
       • Existing methods
       • Proposed method
       • Target-speaker VAD
       • Multi-speaker target-speaker VAD
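  16.  The idea behind target-speaker VAD
       • Take the speaker representation of a specific speaker as input and classify, at each time step, whether that speaker is speaking
       • Similar methods: VoiceFilter [Q. Wang+], SpeakerBeam [M. Delcroix+], and others
       • These estimate a soft spectrogram mask that extracts only the speaker with the given speaker representation
       [Figure: classifier that takes a speaker representation as input]
       M. Delcroix et al. Single Channel Target Speaker Extraction and Recognition with Speaker Beam. In Proc. ICASSP.
       Q. Wang et al. VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking. In Proc. INTERSPEECH.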
  17.  Baseline: Personal VAD [S. Ding+]
       • A system that detects a specific speaker's speech on the fly
       • Overlap is not considered
       • Three-class classification: non-speech, target speaker speech, and non-target speaker speech
       • Down-weighting confusions between non-speech and non-target speaker speech improves the final recognition accuracy
       • The speaker verification score can be omitted
       [Figure from S. Ding et al.: (b) score conditioned training, (c) embedding conditioned training, (d) score and embedding conditioned training]
       S. Ding et al. Personal VAD: Speaker-Conditioned Voice Activity Detection. In Proc. Speaker Odyssey.
  18.  Single-speaker TS-VAD
       • Similar to Personal VAD
       • DER: […] (after post-processing)
       • Feeding in the other speakers' i-vectors at the same time did not improve it much
       BLSTMP: Bidirectional LSTM Projection
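  19.  Fundamental problems with running VAD separately per speaker (my conjecture)
       • The model needs to detect non-target speakers' speech
       • This is not stated explicitly in the paper, but I suspect it holds empirically
       • It is vulnerable to speakers with similar speaker representations
       • Intuitively, performance should be better when the other speakers are known
       • Yet making the model aware of the other speakers did not help experimentally
       • The imbalance of focusing on one specific speaker is a weak point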
  20.  Outline
       • Problem setting
       • Existing methods
       • Proposed method
       • Target-speaker VAD
       • Multi-speaker target-speaker VAD
  21.  Multi-speaker TS-VAD
       • Estimates VAD for all speakers simultaneously
       • CHiME-6 has the same number of speakers in every recording
       • The model is split into a part whose weights are shared across speakers and a part that outputs the final result
       • The upper and lower stages are trained jointly
       • The output is a per-speaker classification
       • So the output has one dimension per speaker
       • The loss is the sum of the per-speaker cross entropies
       • DER after post-processing: […]
       SD: Speaker Detection
       [Figure from the paper, Figure 1: single-channel TS-VAD scheme. The parameters of the SD block are shared across speakers, and it is trained jointly with the whole TS-VAD model.]
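
The loss described on this slide, a sum of per-speaker cross entropies, can be sketched as below. Unlike a permutation-free loss, the output order is fixed by the order of the input speaker representations, so no search over permutations is needed. Averaging over frames, the two-speaker toy data and all names are assumptions for illustration.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross entropy for one posterior p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def ts_vad_loss(posteriors, labels):
    """Sum over speakers of the frame-averaged binary cross entropy.

    posteriors, labels: lists indexed as [frame][speaker]; output k is tied
    to the k-th input speaker representation, so the order is fixed.
    """
    n_frames = len(posteriors)
    n_speakers = len(posteriors[0])
    total = 0.0
    for s in range(n_speakers):
        per_speaker = sum(binary_cross_entropy(posteriors[t][s], labels[t][s])
                          for t in range(n_frames))
        total += per_speaker / n_frames
    return total


# Second frame is an overlap: both speakers are active at once
posteriors = [[0.9, 0.1], [0.9, 0.9]]
loss_good = ts_vad_loss(posteriors, [[1, 0], [1, 1]])
loss_bad = ts_vad_loss(posteriors, [[0, 1], [0, 0]])
```

Because each speaker has an independent output, overlapped frames are expressed simply by more than one output being high.
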
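  22.  TS-VAD for an arbitrary number of speakers
       • Multi-speaker TS-VAD is fixed to a set number of speakers
       • A variant for a variable number of speakers is also proposed
       • The intermediate layers can be replaced with some architecture that accepts variable-length input
       • It is not evaluated experimentally
       • It seems likely to run into the same problems as the single-speaker case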
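  23.  Multi-channel multi-speaker VAD
       • In CHiME-6, up to […] channels of audio are available
       • During training, […] randomly selected channels are used
       • DER improves by about […] over the single-channel case
       • Apparently, simply averaging over the channels does not change the result much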
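
Two operations mentioned on this slide, random channel selection during training and averaging over channels, can be sketched as follows. The channel counts, the seed and the function names are placeholders chosen for illustration; the actual multi-channel model in the paper is more involved than a plain average.

```python
import random


def pick_training_channels(n_available, n_used, rng):
    """Randomly choose n_used distinct channel indices out of n_available."""
    return sorted(rng.sample(range(n_available), n_used))


def average_over_channels(per_channel_posteriors):
    """Average frame-level posteriors of one speaker across channels."""
    n_channels = len(per_channel_posteriors)
    return [sum(values) / n_channels for values in zip(*per_channel_posteriors)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
channels = pick_training_channels(n_available=8, n_used=3, rng=rng)

# Two channels, two frames
averaged = average_over_channels([[0.2, 0.8], [0.4, 0.6]])
```
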
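  24.  Estimating i-vectors
       • TS-VAD cannot run without knowing the speaker representations
       • Suitable i-vectors have to be estimated by some means
       [Figure: classifier that takes a speaker representation as input]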

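  25.  Estimating i-vectors
       • i-vectors are computed from the diarization result obtained with the clustering approach
       • Iterating TS-VAD → i-vector estimation improves performance further
       • Improvement was observed up to iteration […]
       • Performance is on par with i-vectors extracted from manual segmentation
       • DER: […]
       [Figure: clustering → initial i-vector generation → TS-VAD → i-vector generation]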
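
The alternation on this slide (estimate speaker representations from the current diarization, rerun detection, repeat) can be sketched with toy stand-ins. Here the "i-vector" is just the mean feature of a speaker's frames and the "TS-VAD" is a nearest-representation rule, which makes the loop behave like k-means; both stand-ins and all names are hypothetical and ignore overlap, unlike the real system.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def estimate_representations(features, assignment, n_speakers):
    """Stand-in for i-vector extraction: average each speaker's frames."""
    representations = []
    for s in range(n_speakers):
        frames = [f for f, a in zip(features, assignment) if a == s]
        if not frames:
            raise ValueError(f"speaker {s} has no frames")
        representations.append(mean_vector(frames))
    return representations


def toy_ts_vad(features, representations):
    """Stand-in for TS-VAD: label each frame with the nearest representation."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(representations)),
                key=lambda s: squared_distance(f, representations[s]))
            for f in features]


def refine(features, initial_assignment, n_speakers, n_iterations):
    """Alternate representation estimation and detection."""
    assignment = list(initial_assignment)
    for _ in range(n_iterations):
        representations = estimate_representations(features, assignment, n_speakers)
        assignment = toy_ts_vad(features, representations)
    return assignment


# The initial "clustering" result mislabels frame 2; one pass corrects it
features = [[0.0], [0.1], [0.2], [1.0], [1.1], [0.9]]
initial = [0, 0, 1, 1, 1, 1]
refined = refine(features, initial, n_speakers=2, n_iterations=2)
print(refined)  # [0, 0, 0, 1, 1, 1]
```
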
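  26.  Post-processing
       • Thresholding: zero out probabilities below a given value
       • Median filtering: take the median over a sliding window, which acts as a low-pass filter
       • Removal of short pauses and short utterances
       • Viterbi search over an HMM
       • This performs on par with thresholding + median filtering
       • Guided source separation [C. Boeddeker+] is applied before ASR
       C. Boeddeker et al. Front-end processing for the CHiME-5 dinner party scenario. In Proc. CHiME Workshop.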
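
The first three post-processing steps can be sketched on a single speaker's frame-level probabilities. The threshold, window size, minimum run length and edge handling (padding with the edge value) are assumptions chosen for this toy example, not the settings used in the paper.

```python
from statistics import median


def threshold(probs, value):
    """Zero out probabilities below the given value."""
    return [p if p >= value else 0.0 for p in probs]


def median_filter(probs, window):
    """Sliding-window median; edges are padded with the edge value."""
    assert window % 2 == 1, "use an odd window size"
    half = window // 2
    padded = [probs[0]] * half + list(probs) + [probs[-1]] * half
    return [median(padded[t:t + window]) for t in range(len(probs))]


def remove_short_runs(flags, target, min_len):
    """Flip interior runs of `target` (0 or 1) shorter than min_len frames."""
    out = list(flags)
    t = 0
    while t < len(out):
        end = t
        while end < len(out) and out[end] == out[t]:
            end += 1
        if out[t] == target and end - t < min_len and t > 0 and end < len(out):
            out[t:end] = [1 - target] * (end - t)
        t = end
    return out


probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.7, 0.05, 0.1, 0.6, 0.1, 0.0]
smoothed = median_filter(threshold(probs, 0.5), window=3)
flags = [1 if p > 0 else 0 for p in smoothed]
print(flags)  # [0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Fill short pauses first, then drop short utterances
raw = [0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
cleaned = remove_short_runs(remove_short_runs(raw, target=0, min_len=2),
                            target=1, min_len=2)
```

In this toy run the median filter already fills the one-frame dip at frame 3 and removes the isolated blip at frame 8, so the run-length step is shown separately on `raw`.
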
  27.  The final system

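  28.  Performance evaluation
       • Training data: CHiME-6 ([…] hours) + VoxCeleb ([…] hours) + Mixup
       • The experimental conditions are barely described, but the Kaldi recipe is public
       • Fusion is a weighted average of single-channel models for […] channels
         and […] multi-channel models
       Table columns: development (DER, JER), evaluation (DER, JER)
       • CHiME-6 Baseline
       • Clustering
         • + TS-VAD
           • + TS-VAD
             • + TS-VAD-MC
       • Fusion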
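
The fusion described on this slide is a weighted average of model outputs. A minimal sketch, assuming frame-level posteriors for one speaker and hypothetical weights (the actual weights and the number of fused models are not given here):

```python
def fuse(posteriors_per_model, weights):
    """Weighted average of frame-level posteriors from several models."""
    total_weight = sum(weights)
    n_frames = len(posteriors_per_model[0])
    return [
        sum(w * model[t] for w, model in zip(weights, posteriors_per_model)) / total_weight
        for t in range(n_frames)
    ]


# Two models, two frames; the second model is trusted three times as much
fused = fuse([[0.0, 1.0], [1.0, 1.0]], weights=[1, 3])
print(fused)  # [0.75, 1.0]
```
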
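  29.  Summary
       • By implementing VAD for target speakers, the paper achieves
         high-performance diarization on the CHiME-6 challenge
       • TS-VAD with a known number of speakers is run on multi-channel audio
       • Evaluation under different acoustic environments, or with an unknown number of speakers, is left as future work
       Impressions
       • My impression is that the result comes from solid implementation and well-refined initial values
       • The initial clustering stage is quite elaborately engineered
       • I get the feeling that it only works because the number of speakers is known
       • I wonder whether overlapped regions were really recognized properly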