Modern Research in the Field of Speech Translation

Machinelearner
November 26, 2020


Transcript

  1. Speech Translation: Audio → Model → Translated text

    • Cascaded: Model = ASR* + MT**
    • End-to-End: Model = End-to-End
    • Cascaded approach: ✚ SOTA quality for ASR and MT ✚ Large quantity of data ✚ Independent training − Error propagation
    *Automatic Speech Recognition **Machine Translation
  2. Main groups of articles

    1. DATA: Create multilingual datasets
    2. End-to-End outperforms cascading (ASR+MT) + Transformer
    3. Improving cascaded (ASR+MT) models
    4. Multilingual outperforms bilingual
    5. Toolkit
  3. 1. DATA: Create multilingual datasets

    Gender bias
    • 70% of speakers in TED Talks are men
    • Women's representation in the EU Parliament has peaked at 40%
    • Most ASR and MT data are generated by male speakers
  4. 2. End-to-End outperforms cascading with Transformer

    • Attempts to adapt the Transformer to audio
    • A few experiments outperform the cascaded approach
    • Adapt the positional encoding scheme to the Speech Transformer
    • Speed up inference
  5. 2. End-to-End outperforms cascading with Transformer

    • Downsample the input with a CNN to make training on GPUs feasible
    • Model the bidimensional nature of a spectrogram
    • Add a distance penalty to the attention to bias it towards local context
    Figure labels: capture local 2D-invariant features; model context
    SAN: Self-Attention Network; MHA: Multi-Head Attention
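The distance penalty mentioned on this slide can be illustrated with a small NumPy sketch. The slide does not give the penalty's exact form, so the logarithmic penalty `log(1 + |i - j|)`, the function name `distance_penalized_attention`, and the `penalty_scale` parameter are all assumptions made for illustration only.

```python
import numpy as np


def distance_penalized_attention(q, k, v, penalty_scale=1.0):
    """Single-head scaled dot-product self-attention with a distance penalty.

    q, k, v: arrays of shape (T, d), one row per time frame.
    The penalty form log(1 + |i - j|) is an assumption, not taken from the slides.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                   # (T, T) similarity scores
    pos = np.arange(T)
    dist = np.abs(pos[:, None] - pos[None, :])      # |i - j| for every frame pair
    scores = scores - penalty_scale * np.log1p(dist)  # far-away frames score lower
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v, weights
```

With identical queries and keys, the attention weights then decay with the distance between frames, which is the local-context bias the bullet describes.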
  6. 2. End-to-End outperforms cascading with Transformer

    • Text sequences correlate more strictly with position, while audio sequences do not
      Text: What? Why? Who? Where? When?
      Audio: …… what …… who … when ….
    • Speech sequences are 10 to 60 times longer than the transcript character sequence
  7. 2. End-to-End outperforms cascading with Transformer

    • Overloaded decoder:
      • Pretrain encoder
      • Split encoder into 3 parts
      • Multitasking
      • Knowledge distillation
      • Simultaneous inference
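As an illustration of the knowledge distillation item, here is a minimal sketch of a word-level distillation loss. It assumes a setup the slide does not spell out: a teacher (for example an MT model reading the transcript) and a student ST model each produce per-position logits over the same target vocabulary, and the student is trained to match the teacher's softened distribution. The names `log_softmax`, `distillation_loss`, and `temperature` are hypothetical.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over target positions.

    Both inputs have shape (positions, vocabulary).
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_teacher) * (log_p_teacher - log_p_student), axis=-1)
    return float(kl.mean())
```

In practice this term is usually combined with the ordinary cross-entropy on the reference translation; the weighting between the two is a design choice not covered here.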
  8. 2. End-to-End outperforms cascading

    • Encoder should learn:
      • Acoustic knowledge
      • Semantic knowledge
    • Filter redundant states
  9. 4. Multilingual outperforms bilingual

    • Pretrain the encoder on a better source language
    • Use translations to other languages
    Figure labels: Concat; Merge
  10. 4. Multilingual outperforms bilingual

    • Pretrain the encoder on a better source language
    • Use translations to other languages
    Figure labels: Concat; Merge
  11. 3. Improving cascaded (ASR+MT) models

    Pretrain ASR
    • Pretrain the model on a high-resource ASR task
    • Fine-tune its parameters for ST
  12. 3. Improving cascaded (ASR+MT) models

    Pretrain ASR and NMT simultaneously:
    • Use an adversarial regularizer in the loss function to bring the ASR encoder output closer to the input of the MT decoder
  13. Measure quality for translation with ASR + MT

    • Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
    • WER: Output ASR(A) [EN] vs. Reference [EN]
    • BLEU: Output MT(Output ASR [EN]) [RU] vs. Reference [RU]
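The WER comparison between the ASR output and the English reference can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration with whitespace tokenization and no text normalization (both simplifying assumptions); the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words.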
  14. Translate reference with MT and compare with final output

    • Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
    • Ideal model: Reference [EN] → MT → Output MT(Reference [EN]) [RU]
    • WER: Output ASR(A) [EN] vs. Reference [EN]
    • BLEU: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]
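  15. Absolute scale for measuring quality

    • Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
    • Ideal model (100%): Reference [EN] → MT → Output MT(Reference [EN]) [RU]
    • WER: Output ASR(A) [EN] vs. Reference [EN]
    • BLEU: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]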
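  16. Comparing ASR+MT and E2E

    • ASR+MT: Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
    • E2E: Audio → E2E model → Output E2E model(A [EN]) [RU]
    • WER: Output ASR(A) [EN] vs. Reference [EN]
    • BLEU(ASR+MT): Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]
    • BLEU(E2E): Output E2E model(A [EN]) [RU] vs. Output MT(Reference [EN]) [RU]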
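The comparison scheme on this slide scores both systems against the MT translation of the reference transcript rather than against a human translation. A minimal sketch follows, assuming the models are plain callables. The names `compare_against_ideal` and `unigram_precision` are hypothetical, and `unigram_precision` is only a toy stand-in for BLEU, not BLEU itself.

```python
from collections import Counter


def unigram_precision(hypothesis, reference):
    """Toy stand-in for BLEU: clipped unigram precision of the hypothesis."""
    hyp = hypothesis.split()
    if not hyp:
        return 0.0
    ref_counts = Counter(reference.split())
    matched = sum(min(count, ref_counts[word]) for word, count in Counter(hyp).items())
    return matched / len(hyp)


def compare_against_ideal(audio, reference_en, asr, mt, e2e, metric=unigram_precision):
    """Score the cascaded and E2E outputs against MT(reference transcript)."""
    ideal_ru = mt(reference_en)    # Output MT(Reference [EN]) [RU]
    cascaded_ru = mt(asr(audio))   # Output MT(Output ASR [EN]) [RU]
    e2e_ru = e2e(audio)            # Output E2E model(A [EN]) [RU]
    return {
        "asr_mt": metric(cascaded_ru, ideal_ru),
        "e2e": metric(e2e_ru, ideal_ru),
    }
```

Because the same MT model produces the pseudo-reference, the cascaded score isolates the loss caused by ASR errors, while the E2E score shows how far the single model is from that same target.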
  17. Current research

    (Chart: BLEU, on a 0 to 100 scale, for articles 1 to 5 with their datasets and models; series: E2E Translation, ASR + MT; labeled values: 63, 68)
    Datasets shown: Fisher-CallHome, LibriSpeech, ST-TED, How2, MuST-C, Libri-trans, CoVoST
  18. Problems of current approaches

    • No standard dataset for measuring quality
    • No experiments with noisy data
    • Very little data for measuring quality
    • No research on how the number of utterances influences quality
    Figure labels: MT; ASR; ST
  19. Add dimensions for measuring quality

    (Chart: BLEU, 0 to 100, vs. number of utterances: 10^4, 10^5, 5⋅10^5, 10^6, 5⋅10^6; series: E2E Translation, ASR + MT)
    (Chart: BLEU, 0 to 100, vs. noise ratio: 0.1, 0.2, 0.3, 0.4; series: E2E Translation, ASR + MT)
  20. BIG DATA

    • Use synthetic data: <text> → TTS → <audio> → MT (SOTA) → <translated text>
    Diagram labels: STT model; Tacotron; MT; ST →
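One way to read the synthetic-data pipeline on this slide is: each source text is synthesized to audio by a TTS model (Tacotron on the slide) and translated by a strong MT model, and the resulting (audio, translated text) pair becomes a training example for ST. The sketch below follows that reading, which is an assumption; `SyntheticSTExample`, `build_synthetic_corpus`, `fake_tts`, and `fake_mt` are hypothetical names, and the two stubs stand in for real models.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class SyntheticSTExample:
    audio: Any         # synthesized speech for the source text
    source_text: str   # original text fed to TTS and MT
    target_text: str   # MT output, used as the translation label


def build_synthetic_corpus(
    texts: Iterable[str],
    tts: Callable[[str], Any],
    mt: Callable[[str], str],
) -> List[SyntheticSTExample]:
    """Pair TTS audio with MT translations of the same source text."""
    return [
        SyntheticSTExample(audio=tts(text), source_text=text, target_text=mt(text))
        for text in texts
    ]


# Hypothetical stubs standing in for a TTS model and an MT model.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_mt(text: str) -> str:
    return text.upper()


corpus = build_synthetic_corpus(["hello world", "good morning"], fake_tts, fake_mt)
```

Keeping the source text in each example makes it possible to reuse the same corpus for ASR pretraining as well as for ST.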
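  21. BIG DATA

    • Bilingual translations, for example:
      [RU] Да и сам Лорд Кентервиль, человек весьма щепетильный, счел своим долгом упомянуть об этом мистеру Отису, когда они пришли к соглашению.
      [EN] Lord Canterville himself, a most scrupulous man, felt it his duty to mention this to Mr. Otis when they came to an agreement.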
  22. Summary

    • Use many languages for training;
    • Measure only the part of the quality loss that comes from using two models instead of one;
    • Collect lots of high-quality, diverse data!