Self-Conditioned CTC and Its Extensions

Tatsuya Komatsu (Senior Research Scientist, LINE Corporation)

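This talk introduces Self-Conditioned CTC, a non-autoregressive speech recognition method developed at LINE. Self-Conditioned CTC is a CTC-based speech recognition method that uses the recognition results at intermediate layers of the neural network as auxiliary information. It keeps the speed of CTC-based methods while reaching recognition accuracy on par with autoregressive methods, which can model dependencies between text tokens. The talk covers the details of Self-Conditioned CTC, the methods that followed it, and applications beyond speech recognition.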

Note: These are the slides presented at Tokyo BISH Bash #08.
https://tokyo-bish-bash.connpass.com/event/283268/

LINE Developers

June 15, 2023

Transcript

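  1. Today's topic: Self-Conditioned CTC

    • Self-conditioned CTC
    • A non-autoregressive speech recognition method based on CTC [Nozaki, Interspeech]
    • Fast inference, with accuracy comparable to autoregressive speech recognition
    • This talk covers the basic method and its extensions
    [Figure: encoder stack from input audio to output tokens, with Linear + Softmax + CTC at an intermediate layer and at the final layer, and a Linear self-conditioning path back into the encoder]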
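  2. Non-autoregressive speech recognition based on CTC

    • Connectionist Temporal Classification [Graves]: a recognition method that allows fast inference
    • Trained so that lining up the most probable character at each frame and then collapsing the sequence gives the correct transcript
    • Example: the neural network emits こ ん ん に ち は は frame by frame (with repeated characters), which collapses to こんにちは
    • Candidate probabilities at a frame whose top output is こ: 1. p(こ) = 0.7, 2. p(_) = 0.2, 3. p(藤) = 0.0001
    • Each character is estimated independently (conditional independence assumption), so CTC is fast but not that accurate
    • Autoregressive methods such as Attention Enc-Dec [Chorowski] and RNN-Transducer [Graves] are accurate but not very fast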
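The collapse step described on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names `ctc_collapse` and `greedy_decode` are hypothetical, and `_` is assumed to be the blank symbol, as on the slide.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, matching the "_" shown on the slide


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_probs, vocab, blank=BLANK):
    """Pick the most probable symbol in every frame, then collapse."""
    best = [vocab[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(best, blank)


frames = ["こ", "ん", "ん", "に", "ち", "は", "は"]
print(ctc_collapse(frames))  # こんにちは
```

Because every frame is decided on its own, decoding is a single parallel pass, which is where the speed comes from. A blank between two identical labels keeps them separate, so `["a", "_", "a"]` collapses to `aa` rather than `a`.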
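  3. Non-autoregressive speech recognition based on CTC

    • Consists of an acoustic encoder (Transformer/Conformer) and a CTC decoder
    • Assumes conditional independence between characters
    • Fast, but not that accurate
    • Goal: raise accuracy while keeping the speed
    [Figure: Input audio → Encoder stack → Linear → Softmax → CTC → Output token]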
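  4. Prior work: Intermediate CTC [Lee]

    • Computes a CTC loss at an intermediate layer and uses it as regularization
    • Simple, but very effective
    [Figure: encoder stack from input audio to output tokens, with Linear + Softmax + CTC at an intermediate layer and at the final layer]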
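One way to read "CTC loss at an intermediate layer as regularization" is as an interpolation between the final-layer loss and the intermediate-layer losses. The sketch below assumes that form; the function name, the averaging over intermediate layers, and the default `weight` are illustrative assumptions, not values taken from the talk.

```python
def intermediate_ctc_objective(final_loss, intermediate_losses, weight=0.5):
    """Interpolate the final CTC loss with the mean intermediate CTC loss.

    final_loss: CTC loss computed on the last encoder layer.
    intermediate_losses: CTC losses computed on tapped intermediate layers.
    weight: share given to the intermediate losses (assumed hyperparameter).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    if not intermediate_losses:
        return final_loss
    inter = sum(intermediate_losses) / len(intermediate_losses)
    return (1.0 - weight) * final_loss + weight * inter


print(intermediate_ctc_objective(2.0, [4.0, 6.0], weight=0.5))  # 3.5
```

The intermediate branch is only needed for this extra loss term, so a model trained this way can skip it at inference time and run as a plain CTC model.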
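  5. Prior work: Intermediate CTC [Lee]

    • Computes a CTC loss at an intermediate layer and uses it as regularization
    • Simple, but very effective
    • Question: could the recognition results at the intermediate layer be put to better use?
    [Figure: encoder stack from input audio to output tokens, with Linear + Softmax + CTC at an intermediate layer and at the final layer]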
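  6. Self-conditioned CTC [Nozaki]

    • Feeds the recognition result of an intermediate layer back into the later layers
    • This lets the later layers encode the audio while taking the recognition result into account
    • As a result, the model can learn relationships between characters
    [Figure: encoder stack from input audio to output tokens, with Linear + Softmax + CTC at an intermediate layer and at the final layer, plus a Linear self-conditioning path from the intermediate prediction back into the encoder]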
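The feedback path can be sketched with NumPy as follows. Everything here is a toy under stated assumptions: `ToyLayer` stands in for a Transformer/Conformer block, `D` (hidden size) and `V` (vocabulary size) are illustrative, the output projection `w_out` is assumed to be shared between the intermediate and final predictions, and `tap_layers` (the layers whose predictions are fed back) is a hypothetical parameter.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyLayer:
    """Stand-in for one encoder block (a residual tanh layer)."""

    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.1, size=(dim, dim))

    def __call__(self, x):
        return x + np.tanh(x @ self.w)


def self_conditioned_forward(x, layers, w_out, w_back, tap_layers):
    """x: (T, D). Returns final posteriors (T, V) and intermediate posteriors."""
    intermediates = []
    for index, layer in enumerate(layers, start=1):
        x = layer(x)
        if index in tap_layers:
            posterior = softmax(x @ w_out)   # (T, V) intermediate prediction
            intermediates.append(posterior)
            x = x + posterior @ w_back       # project back to (T, D) and add
    return softmax(x @ w_out), intermediates


rng = np.random.default_rng(0)
T, D, V = 5, 8, 4
layers = [ToyLayer(D, rng) for _ in range(6)]
w_out = rng.normal(scale=0.1, size=(D, V))   # Linear D -> V
w_back = rng.normal(scale=0.1, size=(V, D))  # Linear V -> D
final, inter = self_conditioned_forward(
    rng.normal(size=(T, D)), layers, w_out, w_back, tap_layers={2, 4}
)
print(final.shape, len(inter))  # (5, 4) 2
```

The whole computation is still a single forward pass over all frames, so the non-autoregressive speed is kept while later layers see what earlier layers predicted.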
  7. Another look at Self-conditioned CTC

    • Each stage takes the (intermediate) result so far and improves it
    • Is the model internally doing the same kind of processing as iterative refinement methods?
    [Figure: Input → groups of Encoder Layers; after each group, Linear (D→V) + Softmax gives an intermediate Output trained with CTC against the target label, and Linear (V→D) feeds it back into the next group; the last Linear (D→V) + Softmax gives the final Output]
  8. Some ideas for extensions

    ① Deliberately inject errors into the intermediate recognition results, so the model learns "refine by repetition" more effectively → InterAug [Nakagome, Interspeech]
    ② Alternate between kanji and readings from one intermediate layer to the next, so the model learns how kanji and readings relate → Alternate Conditioning [Fujita, SLT]
    ③ Exploit the repetitive structure: merge the parts that play the same role to make the model lighter → Folded Encoder [Komatsu, ICASSP]
    [Figure: the Self-conditioned CTC architecture, where each stage takes the (intermediate) result and improves it]
  9. ① InterAug [Nakagome, Interspeech]

    • Deliberately injects errors into the intermediate-layer predictions
    • The predictions are improved through the repeated intermediate-layer CTC stages
    • The model explicitly learns what to "improve"
    [Figure: the Self-conditioned CTC architecture with InterAug applied to the two intermediate outputs]
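To make "deliberately inject errors" concrete, the sketch below corrupts a frame-wise intermediate prediction by substituting wrong tokens in randomly chosen frames. This is only one possible corruption and is an assumption for illustration; the kinds of errors actually used by InterAug are not recoverable from the slide, and `corrupt_intermediate` and `error_rate` are hypothetical names.

```python
import numpy as np


def corrupt_intermediate(posteriors, error_rate, rng):
    """Replace randomly chosen frames with a one-hot on a wrong token.

    posteriors: (T, V) intermediate token posteriors.
    error_rate: probability that a frame is corrupted (assumed hyperparameter).
    Returns the noisy posteriors and a boolean mask of corrupted frames.
    """
    num_frames, vocab_size = posteriors.shape
    if vocab_size < 2:
        raise ValueError("need at least two tokens to substitute")
    noisy = posteriors.copy()
    hit = rng.random(num_frames) < error_rate
    best = posteriors.argmax(axis=1)
    for t in np.flatnonzero(hit):
        # Shift by 1..V-1 so the substituted token always differs from the original.
        wrong = (best[t] + rng.integers(1, vocab_size)) % vocab_size
        noisy[t] = 0.0
        noisy[t, wrong] = 1.0
    return noisy, hit


rng = np.random.default_rng(0)
clean = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])
noisy, hit = corrupt_intermediate(clean, error_rate=1.0, rng=rng)
print((noisy.argmax(axis=1) != clean.argmax(axis=1)).all())  # True
```

During training, the later layers would be conditioned on `noisy` instead of the clean prediction, while the loss still targets the correct transcript, so they have to learn to repair the injected errors.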
  10. ③ Folded Encoder [Komatsu, ICASSP]

    • The linear layers share the same parameters
    • Inputs and outputs are projected onto the same space
    • The encoder groups therefore have similar input-output relationships
    [Figure: the Self-conditioned CTC architecture with the input and output of each encoder group marked]
  11. ③ Folded Encoder [Komatsu, ICASSP]

    • Encoder blocks that play the same role share their parameters, which makes the model lighter
    • The shared block is applied repeatedly
    [Figure: a Base Encoder followed by a Folded Encoder with shared parameters, each followed by Linear (D→V) + Softmax with CTC and Linear (V→D) feedback]
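Parameter sharing across repeated encoder groups can be illustrated with a toy encoder. This is a sketch under assumptions: `Block` is a stand-in for a real encoder block, and the choice of 2 shared blocks applied 3 times is arbitrary, not the configuration from the talk.

```python
import numpy as np


class Block:
    """Stand-in for one encoder block (a residual tanh layer)."""

    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.1, size=(dim, dim))

    def __call__(self, x):
        return x + np.tanh(x @ self.w)

    @property
    def num_params(self):
        return self.w.size


def run_encoder(x, blocks):
    for block in blocks:
        x = block(x)
    return x


def count_unique_params(blocks):
    """Count parameters once per distinct block object."""
    unique = {id(block): block for block in blocks}
    return sum(block.num_params for block in unique.values())


rng = np.random.default_rng(0)
D = 16
unfolded = [Block(D, rng) for _ in range(6)]   # 6 independent blocks
shared = [Block(D, rng) for _ in range(2)]
folded = shared * 3                            # the same 2 blocks, applied 3 times

x = rng.normal(size=(5, D))
print(run_encoder(x, unfolded).shape, run_encoder(x, folded).shape)  # (5, 16) (5, 16)
print(count_unique_params(unfolded), count_unique_params(folded))    # 1536 512
```

Both encoders have the same depth, so the amount of computation is unchanged; only the number of stored parameters drops.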
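  12. Other extensions

    • Formulates why performance improves, and conditions on better intermediate predictions at inference time
      • Better intermediates [Komatsu, Interspeech]
    • Additional intermediate-layer regularization during training
      • InterDecoder [Komatsu, SLT]
    • Application to speaker diarization
      • Self-Conditioned Non-Autoregressive Attractor [Fujita, ICASSP]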
  13. Better Intermediates [Komatsu, Interspeech]: Searched intermediate conditioning

    • Improve the intermediate prediction with an external LM and beam search
    ① Frame-wise token probability (T×V)
    ② Estimate the token sequence (L×1) with beam search and the external LM
    ③ Align the estimated tokens to the frames with Viterbi alignment (T×1), then embed them (T×D) for self-conditioning
    [Figure: Conformer-CTC with self-conditioning; the intermediate Linear + Softmax output goes through CTC beam search with an External LM, Viterbi alignment, and an Embedding before being fed back into the encoder]
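Step ③, spreading an estimated token sequence of length L back over T frames, can be done with a standard CTC Viterbi (forced) alignment. The sketch below is a generic textbook version, not the implementation from the talk; `ctc_viterbi_align` is a hypothetical name, blank is assumed to have index 0, and the token sequence is assumed to come from an earlier search step.

```python
import numpy as np


def ctc_viterbi_align(log_probs, tokens, blank=0):
    """Return the most likely frame-wise label path that collapses to `tokens`.

    log_probs: (T, V) frame-wise log posteriors.
    tokens: list of L token ids (no blanks).
    """
    num_frames = log_probs.shape[0]
    ext = [blank]
    for token in tokens:
        ext += [token, blank]                 # blank-extended sequence, 2L + 1 states
    num_states = len(ext)
    score = np.full((num_frames, num_states), -np.inf)
    back = np.zeros((num_frames, num_states), dtype=int)
    score[0, 0] = log_probs[0, ext[0]]
    if num_states > 1:
        score[0, 1] = log_probs[0, ext[1]]
    for t in range(1, num_frames):
        for s in range(num_states):
            candidates = [s]                  # stay in the same state
            if s >= 1:
                candidates.append(s - 1)      # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)      # skip the blank between different tokens
            prev = max(candidates, key=lambda k: score[t - 1, k])
            back[t, s] = prev
            score[t, s] = score[t - 1, prev] + log_probs[t, ext[s]]
    s = num_states - 1                        # a path may end in the last blank or last token
    if num_states > 1 and score[-1, num_states - 2] > score[-1, num_states - 1]:
        s = num_states - 2
    if not np.isfinite(score[-1, s]):
        raise ValueError("not enough frames to align the token sequence")
    path = []
    for t in range(num_frames - 1, -1, -1):
        path.append(ext[s])
        s = back[t, s]
    return path[::-1]


probs = np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
alignment = ctc_viterbi_align(np.log(probs), tokens=[1, 2])
print(alignment)  # [1, 1, 0, 2]
```

The resulting frame-wise ids (T×1) can then be embedded into T×D vectors and used as the conditioning signal in place of the raw intermediate posteriors.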
  14. Better Intermediates [Komatsu, Interspeech]: Multi-pass conditioning

    ① Frame-wise token probability (T×V)
    ② Align the outputs of the previous inference pass (L×1) to the frames with Viterbi alignment, then embed them (T×D) for self-conditioning in the next pass
    [Figure: three Conformer-CTC passes over the same input audio; the 1st pass output conditions the 2nd pass, and the 2nd pass output conditions the 3rd pass]
  15. Application to speaker diarization

    • Neural Diarization with Non-autoregressive Intermediate Attractors [Fujita, ICASSP]
    • BEFORE (autoregressive attractor): the audio sequence X passes through TransEnc_1 … TransEnc_L to give embeddings E_L; the attractors A are generated autoregressively by LSTM_enc and LSTM_dec; the speaker labels are Y = Sigmoid(A^T E_L)
    • AFTER (non-autoregressive intermediate attractors): at layer l, the attractors are A_l = Attn(Q, E_l, E_l) and the intermediate prediction is Y_l = Sigmoid(A_l^T E_l); E_l is combined (+) with a speaker-wise conditioning term built from W A_l before being passed to TransEnc_{l+1}
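The two formulas on this slide, A = Attn(Q, E, E) and Y = Sigmoid(A^T E), can be checked for shapes with a small NumPy sketch. It assumes single-head scaled dot-product attention without learned projections, row-major embeddings of shape (T, D), and one query per speaker slot; these choices and all names are illustrative, and the conditioning step that feeds the prediction back into the next layer is left out because its exact form is not recoverable from the slide.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attractors_from_queries(queries, embeddings):
    """A = Attn(Q, E, E): each query attends over all frames.

    queries: (S, D), one query per speaker slot (assumed to be learned).
    embeddings: (T, D) frame embeddings from an encoder layer.
    """
    dim = embeddings.shape[1]
    weights = softmax(queries @ embeddings.T / np.sqrt(dim), axis=-1)  # (S, T)
    return weights @ embeddings                                        # (S, D)


def speaker_activities(attractors, embeddings):
    """Y = Sigmoid(E A^T): per-frame, per-speaker activity in (0, 1)."""
    return sigmoid(embeddings @ attractors.T)                          # (T, S)


rng = np.random.default_rng(0)
T, D, S = 10, 8, 3
E = rng.normal(size=(T, D))
Q = rng.normal(size=(S, D))
A = attractors_from_queries(Q, E)
Y = speaker_activities(A, E)
print(A.shape, Y.shape)  # (3, 8) (10, 3)
```

All attractors are computed at once from the queries, with no recurrent decoder stepping through speakers one at a time, which is what makes the scheme non-autoregressive.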
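  16. Summary

    • Introduced Self-Conditioned CTC
      • Predicting at intermediate layers and feeding the prediction back to the later layers is a highly effective structure
    • Introduced several extensions
      • InterAug: trains with errors injected at the intermediate layers
      • Alternate Conditioning: learns the relationship between written form and phonemes
      • Folded Encoder: reduces the parameter count of the repetitive structure
      • Better Intermediates: conditions on better intermediate predictions
      • InterDecoder: additional regularization for the intermediate layers
      • Self-Conditioned Diarization: application to speaker diarization
    • Extensions not covered in this talk
      • Hierarchical Conditioning [Higuchi, ICASSP]: conditions hierarchically with varying token granularity
      • Gated Interlayer Collaboration [ICASSP]: conditioning with an added gate structure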