
Research and Development at LINE in the Speech Domain

Yusuke Kida, Tatsuya Komatsu (LINE)
Slides presented at a sponsor workshop at an IEICE joint technical committee meeting.
※ Engineering Acoustics (EA) / Signal Processing (SIP) / Speech (SP) / Spoken Language Processing (IPSJ-SLP)
https://www.ieice.org/ken/program/index.php?mode=program&tgs_regid=e0598d8c96d210d495b5658d544b1d74ca6b13b372932df5134795a99a6c2e9c&tgid=EA&layout=&lang=


LINE Developers

March 01, 2022

Transcript

  1. Research and Development at LINE in the Speech Domain — LINE Speech Team: Yusuke Kida, Tatsuya Komatsu

  2. LINE's AI Technology Areas — LINE AI: Speech, Video, Voice, NLU, Data, OCR, Vision

     (Diagram mapping technologies to products: Face, LINE Shopping Lens, Adult Image Filter, Scene Classification, Ad Image Filter, Visual Search, Analogous Image, Product Image, Lip Reading, Fashion Image, Spot Clustering, Food Image, Indonesia LINE Split Bill, LINE MUSIC Playlist OCR, LINE CONOMI, Handwritten Font, Receipt OCR, Credit Card OCR, Bill OCR, Document Intelligence, Identification, Face Sign eKYC, Face Sign, Auto Cut, Auto Cam, Transcription, Telephone-Network Voice Recognition, Single-Demand STT, Simple Voice, High-Quality Voice, Voice Style Transfer, Active Learning, Federated Learning, Action Recognition, Pose Estimation, Speech Note, Vlive Auto Highlight, Content Center AI, CLOVA Dubbing, LINE AiCall, Gatebox, Papago, Video Insight, LINE CLOVA AI, Interactive Avatar, Media 3D Avatar, LINE Profile)
  3. LINE's Speech Products — LINE AiCall (call answering), CLOVA Note (transcription)

  4. CLOVA Note Demo

  5. Technologies in CLOVA Note: End-to-End ASR based on Self-Supervised Learning; Speaker Classification; Keyword Boosting
  6. LINE's Speech Technologies (Research / Product): End-to-End and Hybrid ASR, Text-to-Speech, Acoustic Event Detection, Multi-Channel Speech Processing
  7. Acceptances at International Conferences and Journals (total count lost in this transcript)

     Title — Conf/Journal — Author
     - End-to-End Learning for Convolutive Multi-Channel Wiener Filtering — ICASSP 2021 — M. Togami
     - Disentangled Speaker and Language Representations Using Mutual Information Minimization and Domain Adaptation for Cross-Lingual TTS — ICASSP 2021 — T. Komatsu
     - Refinement of Direction of Arrival Estimators by Majorization-Minimization Optimization on the Array Manifold — ICASSP 2021 — R. Scheibler
     - Surrogate Source Model Learning for Determined Source Separation — ICASSP 2021 — R. Scheibler
     - Joint Dereverberation and Separation with Iterative Source Steering — ICASSP 2021 — R. Scheibler, M. Togami
     - Parallel Waveform Synthesis Based on Generative Adversarial Networks with Voicing-Aware Conditional Discriminators — ICASSP 2021 — R. Yamamoto
     - TTS-by-TTS: TTS-Driven Data Augmentation for Fast and High-Quality Speech Synthesis — ICASSP 2021 — R. Yamamoto
     - Independent Vector Analysis via Log-Quadratically Penalized Quadratic Minimization — IEEE TSP — R. Scheibler
     - Multichannel Separation and Classification of Sound Events — EUSIPCO 2021 — R. Scheibler, T. Komatsu, M. Togami
     - Multi-Source Domain Adaptation with Sinkhorn Barycenter — EUSIPCO 2021 — T. Komatsu
     - Acoustic Event Detection with Classifier Chains — INTERSPEECH 2021 — T. Komatsu
     - Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions — INTERSPEECH 2021 — T. Komatsu
     - Sound Source Localization with Majorization Minimization — INTERSPEECH 2021 — M. Togami
     - Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation — INTERSPEECH 2021 — Y. Nakagome, M. Togami
     - High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-plus-Noise Model — INTERSPEECH 2021 — R. Yamamoto
     - Phrase Break Prediction with Bidirectional Encoder Representations in Japanese Text-to-Speech Synthesis — INTERSPEECH 2021 — K. Futamata, B. Park, R. Yamamoto, K. Tachibana
     - Over-Determined Semi-Blind Speech Source Separation — APSIPA 2021 — M. Togami
     - Comparison of Low Complexity Self-Attention Mechanisms for Acoustic Event Detection — APSIPA 2021 — T. Komatsu, R. Scheibler
     - A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation — ASRU 2021 — T. Komatsu
     - Computationally-Efficient Overdetermined Blind Source Separation Based on Iterative Source Steering — IEEE SPL — R. Scheibler
  8. Speech Recognition

  9. Recent Work on Speech Recognition
     • Non-autoregressive ASR
     • CTC-based non-autoregressive ASR: fast, because output tokens are produced in parallel
     • Parallel output → hard to model dependencies between output tokens
     • We study fast, high-accuracy CTC-based approaches
     • Work in recent years:
       - Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions [Nozaki and Komatsu, INTERSPEECH 2021]
       - A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation [Higuchi, ASRU 2021]
       - Non-Autoregressive ASR with Self-Conditioned Folded Encoders [Komatsu, ICASSP, accepted]
  10. Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions [Nozaki and Komatsu, INTERSPEECH 2021]
      Self-conditioned CTC: CTC recognition is also performed at intermediate Transformer layers, and the predictions are fed back to the subsequent layers. The final result is produced while taking the intermediate results into account, and the intermediate hypotheses improve as they pass through the layers.
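The feedback loop of self-conditioned CTC can be sketched in a few lines of numpy. All sizes, the random weights, and the tanh "layers" are toy stand-ins, not the paper's architecture; what the sketch shows is the mechanism: an intermediate posterior is computed at each layer, projected back to the model dimension, and added to the hidden states before the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V, N_LAYERS = 20, 8, 6, 4   # frames, model dim, vocab size, layers (toy)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy stand-ins for encoder layers and the two projections (random weights).
layers = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]
to_vocab = rng.standard_normal((D, V)) * 0.1   # shared CTC output projection
to_hidden = rng.standard_normal((V, D)) * 0.1  # projects predictions back

x = rng.standard_normal((T, D))
intermediate_posteriors = []
for W in layers:
    h = np.tanh(x @ W)                 # one "encoder layer"
    z = softmax(h @ to_vocab)          # intermediate CTC posterior, shape (T, V)
    intermediate_posteriors.append(z)
    x = h + z @ to_hidden              # self-conditioning: feed predictions back
final_posterior = intermediate_posteriors[-1]
print(final_posterior.shape)  # (20, 6)
```

During training, each intermediate posterior would also receive its own CTC loss; the sketch only traces the forward pass.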

  11. A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation [Higuchi, ASRU 2021]
      A comparison paper by Higuchi of Waseda University. Self-conditioned CTC achieved the top performance among several state-of-the-art methods.

  12. Non-Autoregressive ASR with Self-Conditioned Folded Encoders [Komatsu, ICASSP, accepted]
      A study on making self-conditioned CTC lighter. Noting that the layers of a Transformer (Conformer) encoder behave similarly — recognition uses the same linear layer and the feature spaces are alike — a small number of layers is reused ("folded") to reduce parameters. Achieves performance on par with the conventional model at a reduced parameter count (the exact figure is lost in this transcript).
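The folding idea — applying a small set of shared layers several times instead of stacking distinct ones — can be sketched as follows; the layer counts and weights are illustrative, and only the parameter count changes, not the number of layer applications:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 20, 8  # frames, model dim (toy)

def count_params(weight_list):
    return sum(w.size for w in weight_list)

# Conventional encoder: 12 distinct layers.
conventional = [rng.standard_normal((D, D)) for _ in range(12)]
# Folded encoder: 2 layers, each reused 6 times (12 applications total).
folded = [rng.standard_normal((D, D)) for _ in range(2)]

x = rng.standard_normal((T, D))
for W in folded:
    for _ in range(6):          # reuse the same weights several times
        x = np.tanh(x @ W)

print(count_params(conventional), count_params(folded))  # 768 128
```

The depth of computation is unchanged; only the stored weights shrink, which is why the approach pairs naturally with self-conditioning: the fed-back predictions give each reuse of a layer a different input.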
  13. Acoustic Event Detection

  14. Acoustic Event Detection (AED)
      • A task that estimates the types and onset/offset times of acoustic events (multi-label classification)
      • The DCASE Workshop/Challenge publishes baselines and datasets
      • We placed highly in a DCASE challenge task (rank and year lost in this transcript): Convolution-Augmented Transformer for Semi-Supervised Sound Event Detection [Miyazaki, DCASE]
      • Publications in recent years:
        - Acoustic Event Detection with Classifier Chains [Komatsu, INTERSPEECH 2021]
        - Comparison of Low Complexity Self-Attention Mechanisms for Acoustic Event Detection [Komatsu, APSIPA 2021]
        - Self-Supervised Learning Method Using Multiple Sampling Strategies for General-Purpose Audio Representation [Kuroyanagi, ICASSP, accepted]
  15. Acoustic Event Detection with Classifier Chains [Komatsu, INTERSPEECH 2021]
      Conventional AED classifies all events in parallel and struggles to model the relationships between them. We propose an AED classifier based on the chain rule that accounts for inter-event dependencies.
      Conventional: each event classified independently (linear layer + sigmoid). Proposed: events classified autoregressively, one after another.
      Performance gains confirmed on several datasets, with especially large improvements on real recorded data.
      (Diagram: Input → Feature Extractor → Classifier → Event A / Event B / Event C; in the proposed chain, each event prediction feeds into the next.)
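A classifier chain for multi-label AED can be sketched as below: classifier k sees the shared features plus the k binary decisions already made for earlier events, which is the chain-rule factorization of the joint label distribution. Dimensions and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EVENTS = 16, 3  # feature dim, number of event classes (toy)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One linear classifier per event; classifier k also consumes the k
# previous binary decisions -- unlike parallel linear+sigmoid heads.
weights = [rng.standard_normal(D + k) * 0.1 for k in range(N_EVENTS)]
biases = rng.standard_normal(N_EVENTS) * 0.1

def classify_chain(features):
    """Predict event activities one at a time, conditioning on earlier ones."""
    decisions = []
    for k in range(N_EVENTS):
        inp = np.concatenate([features, np.array(decisions)])
        p = sigmoid(inp @ weights[k] + biases[k])
        decisions.append(float(p > 0.5))
    return decisions

print(classify_chain(rng.standard_normal(D)))
```

At training time the previous decisions would typically be teacher-forced from ground-truth labels; inference proceeds autoregressively as shown.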
  16. Comparison of Low Complexity Self-Attention Mechanisms for Acoustic Event Detection [Komatsu and Scheibler, APSIPA 2021]
      All experiments are conducted based on Transformer-based AED [Miyazaki, ICASSP].
      • Investigates the relationship between Transformer self-attention and sequence length in AED
      • Reports a performance evaluation of several recent low-complexity self-attention methods
  17. Self-Supervised Learning Method Using Multiple Sampling Strategies for General-Purpose Audio Representation [Kuroyanagi and Komatsu, ICASSP, accepted]
      • A study of loss functions for learning general-purpose audio representations via self-supervised learning
      • In self-supervised learning, the definition of the loss function is crucial
      • Prior work uses a loss built from a single viewpoint, e.g. assuming that all sounds within the same audio file (a clip a few seconds long) are similar
      • Some tasks violate that assumption
      • Proposed method: design losses from multiple viewpoints → multi-task self-supervised learning yields a more general-purpose representation
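The multi-viewpoint idea can be illustrated with a toy objective: compute a similarity-based loss under more than one positive-sampling strategy and combine them. The cosine loss, the two strategies, and the combination below are illustrative assumptions, not the paper's exact losses:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def similarity_loss(anchor, positive):
    """Pull a positive pair together: loss = 1 - cosine similarity."""
    return 1.0 - cosine(anchor, positive)

# Toy embeddings of audio segments: seg_a1/seg_a2 come from the same file;
# seg_b comes from a different file but is a positive under another strategy.
seg_a1, seg_a2, seg_b = (rng.standard_normal(16) for _ in range(3))

# Viewpoint 1: positives sampled from within the same audio file.
loss_same_file = similarity_loss(seg_a1, seg_a2)
# Viewpoint 2: positives sampled by a different strategy across files.
loss_cross_file = similarity_loss(seg_a1, seg_b)

# Multi-task objective: combine the losses from the sampling strategies.
total_loss = loss_same_file + loss_cross_file
print(total_loss >= 0.0)
```

Each strategy encodes a different assumption about which sounds should be close; training against their sum avoids committing to a single assumption that some downstream tasks violate.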
  18. Multi-Channel Signal Processing

  19. SDR — Medium Rare with Fast Computations [Scheibler]
      Goal: accelerate evaluation of source separation algorithms.
      (Diagram: MIX → SEPARATE → estimated sources, compared against reference sources in the evaluation (SDR) step — we make this step faster!)
      Faster evaluation → more experiments → better algorithms.
  20. SDR — Medium Rare with Fast Computations [Scheibler] (cont.)
      Signal-to-Distortion Ratio (SDR):
      • Similar to SNR
      • Allows distortion by a short filter h (L = 512)

        SDR = 10 log ( ‖h ⋆ s‖² / ‖h ⋆ s − ŝ‖² ),   s: reference, ŝ: estimate

      The filter h is obtained by solving a linear system: min_h ‖h ⋆ s − ŝ‖².

      Algorithm: 1) compute statistics, 2) solve for h, 3) filter s, 4) compute SDR.

        Step                   Conventional                  Proposed
        1) Compute statistics  O(T log T)                    O(T log T)
        2) Solve for h         O(L³) (Gaussian elimination)  O(L log L) (conjugate gradient descent)
        3) Filter s            O(T log L)                    —
        4) Compute SDR         O(T)                          O(L), with L ≪ T

      (T: number of samples, L: filter length.)
      Significant savings, because T can be long — 30 s at 16 kHz is 480,000 samples — and L is also large at 512 taps.
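The filtered-reference SDR can be computed directly from the definition: build the convolution matrix of the reference, solve the least-squares problem min_h ‖h ⋆ s − ŝ‖² (here naively with `lstsq`, i.e. the slow baseline that the complexity analysis on the slide improves upon), and plug the filtered reference into the SDR ratio. Toy filter length; assumes only numpy:

```python
import numpy as np

def sdr_with_filter(ref, est, filt_len=8):
    """SDR allowing an FIR distortion filter h of length filt_len (naive solve)."""
    T = len(ref)
    # Convolution matrix: column k is the reference delayed by k samples,
    # so A @ h equals h convolved with ref (truncated to T samples).
    A = np.column_stack([np.concatenate([np.zeros(k), ref[:T - k]])
                         for k in range(filt_len)])
    h, *_ = np.linalg.lstsq(A, est, rcond=None)  # min_h ||h * ref - est||^2
    target = A @ h                               # h * s: allowed-distortion reference
    err = target - est
    return 10 * np.log10(np.sum(target**2) / np.sum(err**2))

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
# An estimate that is the reference through a 2-sample delay, plus weak noise:
s_hat = np.concatenate([[0.0, 0.0], s[:-2]]) + 0.01 * rng.standard_normal(1000)
print(round(sdr_with_filter(s, s_hat), 1))  # high SDR: the delay is absorbed by h
```

Because the delay fits inside the allowed filter, it does not count as distortion; only the additive noise lowers the score. The fast method on the slide replaces the dense solve with a structured (Toeplitz) solve via conjugate gradient.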
  21. SDR — Medium Rare with Fast Computations [Scheibler] (cont.)
      We release an open-source implementation in Python with numpy/torch backends:
        pip install fast-bss-eval
      Benchmarks vs. open implementations (mir_eval, sigsep/bss_eval) show speedups of roughly 8x, 12x, and 15x.
  22. Text-to-Speech

  23. Voice Team — Controllable, high-quality, and expressive TTS: high-quality neural vocoder, controllable TTS with emotion strength, text analyzer
  24. Controllable TTS with Emotion Strength
      Add emotion strength as one of the input parameters of the acoustic model.
      Pipeline: Text (ありがとう → [ a- ri^ ga- to- o- ]) → Text analyzer → Acoustic model, conditioned on an emotion strength in [0, 1] (e.g. Happiness) → Vocoder → Waveform.
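Conditioning an acoustic model on an emotion strength scalar can be sketched as appending a scaled emotion vector to the input features of every frame. The label set, one-hot scheme, and dimensions below are assumptions for illustration, not the production model's interface:

```python
import numpy as np

EMOTIONS = ["neutral", "happiness", "sadness"]  # hypothetical label set
EMB_DIM = len(EMOTIONS)

def condition_inputs(phoneme_feats, emotion, strength):
    """Append a one-hot emotion vector scaled by strength in [0, 1] to each frame."""
    assert 0.0 <= strength <= 1.0
    onehot = np.zeros(EMB_DIM)
    onehot[EMOTIONS.index(emotion)] = strength
    T = phoneme_feats.shape[0]
    cond = np.tile(onehot, (T, 1))   # same conditioning broadcast to every frame
    return np.concatenate([phoneme_feats, cond], axis=1)

feats = np.zeros((6, 4))             # e.g. 6 phoneme frames, 4-dim linguistic feats
out = condition_inputs(feats, "happiness", 0.8)
print(out.shape)  # (6, 7)
```

Sliding the strength value between 0 and 1 at synthesis time then interpolates the expressiveness of the output, which is the control knob the slide describes.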
  25. Prediction of Emotion Strength
      Two ways to predict emotion strength from speech are considered:
      • Human annotators label emotion strength (weak–strong, e.g. for happiness and sadness) by listening.
        Pros: more expressive TTS. Cons: high cost.
      • A feature extractor predicts emotion strength with a ranking algorithm.
        Pros: annotates automatically. Cons: in some cases, decreases expressiveness in TTS.
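A ranking algorithm for emotion strength can be sketched with a RankNet-style pairwise loss: a scorer is trained so that utterances judged more emphatic score higher, and the score then serves as the strength. This is a generic sketch under that assumption, not the specific algorithm behind the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # acoustic feature dimension (toy)

w = np.zeros(D)  # linear scorer: strength score = w . x

def pairwise_update(x_strong, x_weak, lr=0.1):
    """One gradient step on the logistic pairwise ranking loss log(1 + exp(-diff)),
    pushing score(x_strong) above score(x_weak)."""
    global w
    diff = w @ (x_strong - x_weak)
    grad = -(x_strong - x_weak) / (1.0 + np.exp(diff))  # dL/dw
    w = w - lr * grad

# Synthetic training pairs: the true strength is the first feature.
for _ in range(500):
    a, b = rng.standard_normal(D), rng.standard_normal(D)
    if a[0] >= b[0]:
        pairwise_update(a, b)
    else:
        pairwise_update(b, a)

strong, weak = np.zeros(D), np.zeros(D)
strong[0], weak[0] = 2.0, -2.0
print(w @ strong > w @ weak)  # the learned scorer ranks the stronger sample higher
```

Pairwise supervision is attractive here because "A sounds happier than B" is an easier and more consistent judgment to collect (or derive automatically) than an absolute strength value.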
  26. Acoustic Model Training
      Train the acoustic model conditioned on the predicted emotion strength and the emotion label. The acoustic model and vocoder are trained independently.
      Pipeline: Text (ありがとう → [ a- ri^ ga- to- o- ]) → Text analyzer → Acoustic model, conditioned on the emotion strength (0–1, e.g. Happiness) and emotion label predicted from the training speech → Vocoder → Speech.
  27. Speech Samples — controlling emotion strength (Happy / Sad)

  28. Announcement 1: Hiring
      ■ Responsibilities
      ・R&D on speech processing technologies
      ・End-to-end ASR models, decoders, and related technologies
      ・DNN-HMM ASR models (acoustic models, language models), decoders, and related technologies
      ・Acoustic models and neural vocoders for end-to-end TTS
      ・Technologies leading to new products, such as acoustic event detection and multimodal processing
      ・Devising technologies that seed future products, and developing prototype systems
      ・Server operation, deployment, and building and improving the operations environment (MLOps)
      ■ Why this position is attractive
      ・You can take part in productizing state-of-the-art technologies, including end-to-end ASR and expressive TTS
      ・You can influence the lives of a huge user base through existing products such as the LINE app and smart speakers, and also join the launch of entirely new products, as LINE AiCall once was
      ・You will do R&D alongside many engineers and researchers with publication experience at top international conferences such as ICASSP and INTERSPEECH
      ■ Development environment
      ・Languages: Python, C/C++, shell script, etc.
      ・OS: Linux, Mac OS X
      ・Distributed processing: Apache Spark, Hadoop
      ・Containers: Docker, Kubernetes
      ・CI/CD: Drone, Argo, Jenkins
      ・ML libraries: PyTorch, TensorFlow, etc.
      ・Other: GitHub Enterprise, Confluence, Slack
  29. Announcement 2: Summer Internship (planned)
      We plan to recruit development/research interns to tackle real problems and tasks in speech technology.
      Period: several weeks over the summer (weekends and holidays off); paid (dates and stipend figures lost in this transcript); location: online.
      Example themes: E2E ASR and TTS; environmental sound analysis; multimodal processing (e.g. audio-visual ASR).
      The entry site is planned to open late in the month — we look forward to your applications.
  30. Announcement 3: We Have a Presentation at NVIDIA GTC 2022!!
      Title: Building Streaming End-to-end Speech Recognition Service with Triton Inference Server
  31. (no transcript for this slide)