transforms the input feature into a D-dimensional vector. Encoder^(p)(·) is the p-th encoder block, which takes an input sequence of D-dimensional vectors and outputs a D-dimensional vector e_t^(p) at time index t. We use P encoder blocks followed by the output layer for frame-wise posteriors.

The architecture of the encoder block is depicted in Fig. 2. The configuration of the encoder block is almost the same as that of the Speech-Transformer introduced in [44], but without positional encoding. The encoder block has two sub-layers: the first is a multi-head self-attention layer, and the second is a position-wise feed-forward layer.

Multi-head self-attention layer

The multi-head self-attention layer transforms a sequence of input vectors as follows. The sequence of vectors (e_t^(p-1) | t = 1, ..., T) is converted into a T × D matrix, followed by layer normalization:

\bar{E}^{(p-1)} = \mathrm{LayerNorm}([e^{(p-1)}_1 \cdots e^{(p-1)}_T]^\top) \in \mathbb{R}^{T \times D}.  (6)

Fig. 2. Two-speaker SA-EEND model trained with the permutation-free loss.

End-to-end Neural Diarization (EEND)
• Outputs the diarization result directly from the spectrogram
• Requires a large amount of training data
• Fragile on data from mismatched domains
• Cannot handle long-duration data
• The speaker permutation must be resolved at a later stage
• DER (dev set):

Y. Fujita et al., "End-to-end neural speaker diarization with self-attention," in Proc. ASRU, pp. –. [Y. Fujita+]
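The permutation-free loss in Fig. 2 can be sketched as follows: the frame-wise binary cross-entropy is evaluated under both speaker orderings and the minimum is kept, so the model is not penalized for swapping its two output channels. This is a minimal NumPy illustration; the function names and array shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Frame-wise binary cross-entropy, averaged over time."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def permutation_free_loss(labels, probs):
    """labels, probs: (2, T) arrays of 0/1 targets and sigmoid outputs.

    Evaluates the loss under both speaker permutations and returns
    the minimum, as in the permutation-free training of Fig. 2.
    """
    perm1 = bce(labels[0], probs[0]) + bce(labels[1], probs[1])
    perm2 = bce(labels[0], probs[1]) + bce(labels[1], probs[0])
    return min(perm1, perm2)
```

For more than two speakers the number of permutations grows factorially, which is why two-speaker SA-EEND can simply enumerate both orderings.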
Note that the parameters of the SD block are shared across speakers, and it is trained jointly with the whole TS-VAD model.

Figure 1: Single-channel TS-VAD scheme

As we performed all the experiments in the Kaldi

Multi-speaker TS-VAD (SD: Speaker Detection)
• Estimates the VAD of four speakers simultaneously
• In CHiME, every recording contains four speakers
• Split into a part whose weights are shared across speakers and a part that outputs the final result
• The upper and lower stages are trained jointly
• The output is a two-class classification for each speaker
• With four speakers, the output is eight-dimensional
• The loss is the sum of the four cross-entropies
• DER after post-processing:
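The loss described in the bullets above, one speech/non-speech cross-entropy per speaker summed over speakers, can be sketched as follows. This is a pure-NumPy illustration; the array shapes and the function name are assumptions for clarity, not the TS-VAD authors' code.

```python
import numpy as np

def ts_vad_loss(labels, probs, eps=1e-7):
    """labels, probs: (S, T) arrays for S speakers and T frames.

    Each row is an independent two-class (speech / non-speech)
    problem, so the total loss is the sum over speakers of the
    frame-averaged binary cross-entropy.
    """
    p = np.clip(probs, eps, 1.0 - eps)
    per_speaker = -np.mean(
        labels * np.log(p) + (1 - labels) * np.log(1 - p), axis=1)
    return float(per_speaker.sum())
```

Because every speaker gets its own output pair, no permutation search is needed here: the i-vector conditioning fixes which output channel corresponds to which speaker.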
• Removal of short pauses and short utterances
• Viterbi search using an HMM
• Achieves the same performance as thresholding + median filtering
• Guided source separation [C. Boeddeker+] is applied before ASR

C. Boeddeker et al., "Front-end processing for the CHiME-5 dinner party scenario," in Proc. CHiME Workshop.
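The thresholding + median-filtering baseline that the HMM/Viterbi smoothing is compared against can be sketched as below: binarize the per-frame posteriors, then take a sliding median so that isolated spikes and gaps (very short spurious segments) are removed. The function name, threshold, and window length are illustrative assumptions.

```python
import numpy as np

def smooth_decisions(probs, threshold=0.5, win=11):
    """Binarize frame posteriors, then apply a sliding median filter.

    probs: 1-D array of per-frame speech probabilities for one speaker.
    A median over an odd-length window acts as a majority vote, so a
    segment shorter than half the window cannot survive on its own.
    """
    binary = (np.asarray(probs) > threshold).astype(int)
    half = win // 2
    padded = np.pad(binary, half, mode="edge")  # repeat edge values
    return np.array([
        int(np.median(padded[i:i + win]))  # majority vote in the window
        for i in range(len(binary))
    ])
```

The HMM alternative replaces this local vote with a Viterbi search over speech/non-speech states, which lets transition probabilities penalize short segments globally rather than per window.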