Target-speaker voice activity detection:
A novel approach for multi-speaker diarization in a dinner party scenario

Hitoshi Suda
November 20, 2020

Slides for an INTERSPEECH 2020 paper reading group.

Transcript

  1. Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario
     Hitoshi Suda (The University of Tokyo) @ INTERSPEECH 2020 reading group
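  2. The paper
     - Ivan Medennikov, et al., "Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario"
     - Authors are from the Russian company STC-innovations and ITMO University in Saint Petersburg
     Why I chose this paper
     - To introduce a state-of-the-art diarization technique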
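  3. Summary
     - By implementing VAD for target speakers, the authors achieved high-performance diarization on the CHiME-6 challenge
     - TS-VAD with a known number of speakers was run in a multi-channel setting
     - Evaluation under different acoustic environments, or with an unknown number of speakers, is left as future work
     Impressions
     - The result seems to come from a solid implementation and well-refined initial values
     - The initial clustering stage is engineered in considerable detail
     - I get the feeling that it only works because the number of speakers is known
     - I wonder whether the overlapped regions were really recognized properly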
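  4. CHiME-6 Challenge (1/2)
     - A challenge on ASR for conversational speech
     - The workshop was a satellite event of ICASSP […]
     - The audio covers […] real dinner parties in real homes (apparently they were actually acted)
     - Train/dev/eval: […]
     - About […] hours each
     - […]-channel Kinect × […]
     - A fixed cast of […] participants per party
     - Participants can move freely between rooms
     https://chimechallenge.github.io/chime6/index.html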
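  5. CHiME-6 Challenge (2/2)
     - A challenge on ASR for conversational speech
     - There are two tasks (tracks): one may use the ground-truth diarization, the other also requires diarization (recognition from audio only)
     - The data is the same for both
     - The final evaluation metric is word error rate
     - In Ranking A the recognition model is shared; in Ranking B systems are built including the recognition model
     - Systems are evaluated in every combination of track × Ranking A/B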
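As a reminder of what the metric on this slide measures, here is a minimal sketch of word error rate as word-level edit distance divided by the reference length. The function name and the example sentences are made up for illustration, and the challenge's official scoring tools may differ in details such as text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word ("the") and one inserted word ("now") over 4 reference words.
    print(word_error_rate("pass the salt please", "pass salt please now"))
```

When diarization is part of the task, segmentation and speaker-attribution errors feed into this number, which is why the diarization front end matters so much for the final ranking.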
  6. Task […] (audio only), Ranking A (recognition model given)
     WER on dev and eval by method (the numeric values are missing from this transcript):
     - STC-Innovations / ITMO: clustering + TS-VAD
     - Johns Hopkins: clustering + overlap post-processing
     - USTC: multi-channel speech separation + clustering
     - Paderborn University: End-to-end Neural Diarization
     - Brno University of Technology: VBx + Guided Source Separation
     - Cloudwalk / XMU: clustering
     - City University of New York: clustering + overlap post-processing
     - Academia Sinica: clustering
     - Baseline: clustering
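  7. Diarization is hard
     - Diarization is difficult
     - Speakers are often far from the microphones
     - Noise is often present
     - The speech is conversational
     - Utterances overlap
     [Figure: share of time by number of simultaneous speakers, down to nobody speaking, for the train, dev, and eval sets of the CHiME-6 dataset]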
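The overlap statistic in that figure can be computed from frame-level reference labels. The sketch below assumes a hypothetical `activity[s][t]` layout (1 when speaker `s` talks in frame `t`) and toy data; it does not reproduce the figure's actual numbers.

```python
from collections import Counter


def simultaneous_speaker_shares(activity):
    """Share of frames for each number of simultaneously active speakers.

    activity[s][t] is 1 when speaker s talks in frame t, else 0.
    """
    num_frames = len(activity[0])
    counts = Counter(
        sum(speaker[t] for speaker in activity) for t in range(num_frames)
    )
    return {n: counts[n] / num_frames for n in sorted(counts)}


if __name__ == "__main__":
    toy = [
        [1, 1, 0, 0],  # speaker A
        [0, 1, 1, 0],  # speaker B
    ]
    # Key 0 is silence, key 1 is a single speaker, key 2 is overlap.
    print(simultaneous_speaker_shares(toy))
```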
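  8. The classical approach: clustering
     - Segment the audio at a suitable granularity, then cluster the speaker embeddings of the segments
     - Overlapped regions can be handled as post-processing [K. Boakye et al.], among other options
     - DER (dev set): […]
     Pipeline: VAD → segmentation → speaker embedding extraction → clustering
     K. Boakye, et al. Overlapped speech detection for improved speaker diarization in multiparty meetings. In Proc. ICASSP, pp. […]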
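To make the last stage of this pipeline concrete, here is a minimal sketch of clustering segment-level speaker embeddings. It assumes average-linkage agglomerative clustering on cosine similarity with a stopping threshold; the threshold value and the toy 2-dimensional embeddings are illustrative assumptions, not the settings of any system on these slides.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering; returns one label per segment."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_sim, a, b = max(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_sim < threshold:
            break  # remaining clusters are treated as different speakers
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(cluster_segments(toy))
```

Because every segment receives exactly one label, a scheme like this cannot represent overlapped speech by itself, which is why the slide mentions overlap handling as a separate post-processing step.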
  9. End-to-end Neural Diarization (EEND)
     - Outputs the diarization result directly from the spectrogram
     - Requires a large amount of training data
     - Vulnerable to out-of-domain data
     - Cannot be used on long recordings
     - The speaker permutation has to be resolved in a later stage
     - DER (dev set): […]
     [Figure: two-speaker SA-EEND model trained with permutation-free loss, from Y. Fujita et al.]
     Y. Fujita, et al. End-to-end neural speaker diarization with self-attention. In Proc. ASRU, pp. […]
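The permutation issue arises because the output slots of an EEND-style model have no fixed speaker identity. A minimal sketch of a permutation-free binary cross-entropy follows: it tries every assignment of output slots to reference speakers and keeps the cheapest one. The function names and toy numbers are assumptions for illustration, not code from the EEND paper.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def permutation_free_loss(posteriors, labels):
    """Minimum mean BCE over all speaker permutations.

    posteriors[t][s]: predicted probability for output slot s at frame t.
    labels[t][s]: 0/1 reference activity of speaker s at frame t.
    Returns (loss, perm) where perm[s] is the reference speaker matched to slot s.
    """
    num_speakers = len(labels[0])
    best = None
    for perm in permutations(range(num_speakers)):
        total = sum(
            bce(frame_p[s], frame_y[perm[s]])
            for frame_p, frame_y in zip(posteriors, labels)
            for s in range(num_speakers)
        )
        mean = total / (len(labels) * num_speakers)
        if best is None or mean < best[0]:
            best = (mean, perm)
    return best


if __name__ == "__main__":
    posteriors = [[0.9, 0.1], [0.8, 0.7]]
    labels = [[0, 1], [1, 1]]
    print(permutation_free_loss(posteriors, labels))
```

The search is factorial in the number of speakers, which is manageable for two speakers but is one reason the approach is awkward to scale.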
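  10. The idea behind target-speaker VAD
     - Feed in the speaker embedding of a particular speaker and classify, at each time step, whether that speaker is talking
     - Similar methods: VoiceFilter [Q. Wang et al.], SpeakerBeam [M. Delcroix et al.], among others
     - These estimate a soft spectrogram mask that extracts only the speaker with the given embedding
     [Figure labels: classifier, speaker embedding]
     M. Delcroix, et al. Single Channel Target Speaker Extraction and Recognition with Speaker Beam. In Proc. ICASSP, pp. […]
     Q. Wang, et al. VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking. In Proc. INTERSPEECH, pp. […]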
  11. Baseline: Personal VAD
     - A system that detects a particular speaker's speech on the fly
     - Overlap is not considered
     - Three-class classification: non-speech, target speaker speech, and non-target speaker speech
     - Down-weighting confusions between non-speech and non-target speaker speech improves the final recognition accuracy
     - The speaker verification score can be omitted
     [Figure from S. Ding et al.: (b) score conditioned training, (c) embedding conditioned training, (d) score and embedding conditioned training; blocks: input audio, feature extraction, speaker verification, embedding, concat, Personal VAD, decision]
     S. Ding, et al. Personal VAD: Speaker-Conditioned Voice Activity Detection. In Proc. Speaker Odyssey, pp. […]
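  12. Fundamental problems with running VAD separately for each speaker (my conjecture)
     - Non-target speakers' speech also needs to be detected
     - The paper does not say so explicitly, but this is probably what they found empirically
     - The approach is vulnerable to speakers whose embeddings are close to each other
     - Performance should (intuitively) be better when the other speakers are known
     - Yet making the model aware of the other speakers did not improve results in the experiments
     - The imbalance of attending to only one particular speaker is a weakness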
  13. Multi-speaker TS-VAD
     - Estimates VAD for […] speakers simultaneously
     - Every CHiME-6 recording has […] speakers
     - The model is split into a part whose weights are shared across speakers and a part that outputs the final result
     - The upper and lower stages are trained at the same time
     - The output is a […]-class classification for each speaker
     - With […] speakers, the output is […]-dimensional
     - The loss is the sum of […] cross-entropies
     - DER after post-processing: […]
     SD: Speaker Detection
     [Figure 1 of the paper: single-channel TS-VAD scheme. Parameters of the SD block are shared across speakers, and it is trained jointly with the whole TS-VAD model.]
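The output layer and loss described on this slide can be sketched as follows: a small softmax per speaker, with the frame loss being the sum of the per-speaker cross-entropies. The sketch assumes two classes per speaker (silence, speech) and uses four speakers in the example; both are illustrative assumptions, and the network that would produce the logits is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def ts_vad_frame_loss(logits_per_speaker, labels):
    """Sum over speakers of the per-speaker cross-entropy for one frame.

    logits_per_speaker[s]: [silence_logit, speech_logit] for speaker s (assumed layout).
    labels[s]: 0 if speaker s is silent in this frame, 1 if speaking.
    """
    loss = 0.0
    for logits, label in zip(logits_per_speaker, labels):
        probs = softmax(logits)
        loss -= math.log(probs[label])
    return loss


if __name__ == "__main__":
    # Four speakers (assumed count): the output is 4 x 2 = 8 values per frame.
    logits = [[2.0, -1.0], [-0.5, 1.5], [0.0, 0.0], [1.0, 1.0]]
    labels = [0, 1, 1, 0]
    print(ts_vad_frame_loss(logits, labels))
```

Since each speaker has an independent decision, any number of speakers can be marked active in the same frame, which is how this formulation represents overlapped speech without a permutation search.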
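  14. Post-processing
     - Thresholding: probabilities below a given value are set to zero
     - Median filtering: take the median over a sliding window, which acts as a low-pass filter
     - Short pauses and short utterances are removed
     - Alternatively, build an HMM and run a Viterbi search
     - This gives performance on par with thresholding + median filtering
     - Guided source separation [C. Boeddeker et al.] is applied before ASR
     C. Boeddeker, et al. Front-end processing for the CHiME-5 dinner party scenario. In Proc. CHiME Workshop.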
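The thresholding and median-filtering steps can be sketched for a single speaker's frame-level speech probabilities as below. The threshold, window size, and minimum run length are placeholder values chosen for the example, not the paper's settings, and the HMM/Viterbi variant is not shown.

```python
from statistics import median


def threshold(probs, floor=0.4):
    """Set probabilities below `floor` to zero."""
    return [p if p >= floor else 0.0 for p in probs]


def median_filter(values, window=5):
    """Sliding median with an odd window; the edges are padded by repetition."""
    half = window // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[t:t + window]) for t in range(len(values))]


def flip_short_runs(active, value, min_len):
    """Flip interior runs of `value` shorter than min_len.

    Runs touching either end of the recording are left unchanged.
    """
    out = list(active)
    start = 0
    for t in range(1, len(active) + 1):
        if t == len(active) or active[t] != active[start]:
            is_interior = start > 0 and t < len(active)
            if is_interior and active[start] == value and t - start < min_len:
                out[start:t] = [not value] * (t - start)
            start = t
    return out


def postprocess(probs, floor=0.4, window=5, min_len=3):
    """Threshold, smooth, then fill short pauses and drop short utterances."""
    smoothed = median_filter(threshold(probs, floor), window)
    active = [p > 0.0 for p in smoothed]
    active = flip_short_runs(active, False, min_len)  # fill short pauses
    return flip_short_runs(active, True, min_len)     # drop short utterances


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1]
    print(postprocess(probs, floor=0.4, window=3, min_len=2))
```

In the example, the one-frame dip at index 2 and the isolated spike at index 9 are both smoothed away by the median filter, leaving a single speech segment.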
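  15. Performance evaluation
     - Training data: CHiME-6 ([…] hours) + VoxCeleb ([…] hours) + Mixup
     - The experimental conditions are barely described, but the Kaldi recipe is public
     - Fusion is a weighted average of single-channel models for […] channels and […] multi-channel models
     DER and JER on the development and evaluation sets (the numeric values are missing from this transcript):
     - CHiME-6 Baseline
     - Clustering
       - + TS-VAD
         - + TS-VAD
         - + TS-VAD-MC
     - Fusion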
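The fusion step described above is a weighted average of model outputs. A minimal sketch for one speaker's frame-level probabilities follows; the weights and the number of models are placeholders, and how the paper chose its weights is not covered by these slides.

```python
def fuse_posteriors(model_posteriors, weights):
    """Weighted average of per-frame speech probabilities from several models.

    model_posteriors[m][t]: probability from model m at frame t.
    weights[m]: non-negative weight of model m (normalized inside).
    """
    if len(model_posteriors) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    num_frames = len(model_posteriors[0])
    return [
        sum(w * probs[t] for w, probs in zip(weights, model_posteriors)) / total
        for t in range(num_frames)
    ]


if __name__ == "__main__":
    single_channel = [1.0, 0.0]  # hypothetical output of a single-channel model
    multi_channel = [0.0, 1.0]   # hypothetical output of a multi-channel model
    print(fuse_posteriors([single_channel, multi_channel], [3, 1]))
```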
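  16. Summary
     - By implementing VAD for target speakers, the authors achieved high-performance diarization on the CHiME-6 challenge
     - TS-VAD with a known number of speakers was run in a multi-channel setting
     - Evaluation under different acoustic environments, or with an unknown number of speakers, is left as future work
     Impressions
     - The result seems to come from a solid implementation and well-refined initial values
     - The initial clustering stage is engineered in considerable detail
     - I get the feeling that it only works because the number of speakers is known
     - I wonder whether the overlapped regions were really recognized properly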