Modern Research in the Field of Speech Translation

Alexandra Filimokhina

Speech Translation: Audio → Model → Translated text
Cascaded: Model = ASR* + MT**
End-to-End: Model = End-to-End model
*Automatic Speech Recognition
**Machine Translation
Cascaded approach:
+ SOTA quality for ASR and MT
+ Large quantity of data
+ Independent training
- Error propagation

Main groups of articles
1. DATA: Create multilingual datasets
2. End-to-End outperforms cascaded (ASR+MT) + Transformer
3. Improving cascaded (ASR+MT) models
4. Multilingual outperforms bilingual
5. Toolkits

1. DATA: Create multilingual datasets
Sources: crowdsourcing, TED Talks, the Bible, telephone conversations (☎), European Parliament debates

1. DATA: Create multilingual datasets
Gender Bias
• 70% of speakers on TED Talks are men
• The peak of women's representation in the EU Parliament has been 40%
• Most of the ASR and MT data are generated by male speakers

2. End-to-End outperforms cascaded with Transformer
• Attempts to adapt the Transformer for audio
• A few experiments outperform the cascaded approach
• Adapt the positional encoding scheme to the Speech Transformer
• Speed up inference

2. End-to-End outperforms cascaded with Transformer
• Downsample the input with a CNN to make training on GPUs feasible
• Model the bidimensional nature of a spectrogram
• Add a distance penalty to the attention to bias it towards local context
Diagram labels: capture local 2D-invariant features; model context
SAN - Self-Attention Network
MHA - Multi-Head Attention
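The distance penalty on attention can be sketched as follows. This is a minimal pure-Python illustration, not the implementation from any of the surveyed articles: the slide does not say which penalty function is used, so the logarithmic form in `distance_penalty` is an assumption, and the function names are hypothetical.

```python
import math

def distance_penalty(i, j):
    """Assumed logarithmic penalty: grows with the distance between positions."""
    d = abs(i - j)
    return math.log(d) if d > 0 else 0.0

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def penalized_attention_weights(queries, keys):
    """Scaled dot-product attention weights with the penalty subtracted from the logits."""
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) - distance_penalty(i, j)
            for j, k in enumerate(keys)
        ]
        weights.append(softmax(logits))
    return weights

# Four identical frames: without the penalty every position would get weight 0.25.
frames = [[0.0, 0.0] for _ in range(4)]
weights = penalized_attention_weights(frames, frames)
```

With identical frames the content scores are all equal, so any difference between the weights comes from the penalty alone: nearby positions receive more attention than distant ones.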

2. End-to-End outperforms cascaded with Transformer
• Text sequences have a stricter correlation with position, while audio does not
  Text: What? Why? Who? Where? When?
  Audio: …… what …… who … when….
• Speech sequences are 10-60 times longer than the transcript character sequence

2. End-to-End outperforms cascaded with Transformer
• Overloaded decoder:
• Pretrain encoder
• Split encoder into 3 parts
• Multitasking
• Knowledge distillation
• Simultaneous inference

2. End-to-End outperforms cascaded
• Encoder should learn:
• Acoustic knowledge
• Semantic knowledge
• Filter redundant states

2. End-to-End outperforms cascaded

4. Multilingual outperforms bilingual
• Pretrain encoder on better source language
• Use translations to other languages
Variants shown: Concat, Merge

4. Multilingual outperforms bilingual

3. Improving cascaded (ASR+MT) models
Pretrain ASR
• Pretrain the model on high-resource ASR
• Fine-tune its parameters for ST

3. Improving cascaded (ASR+MT) models
Pretrain ASR and NMT simultaneously:
• Use an adversarial regularizer in the loss function to bring the ASR encoder output closer to the input of the MT decoder

5. Toolkits

Where are we?
• Baseline: Transformer
• Quality?
• Experiments?
• Metrics?

Measure quality for translation with ASR + MT
Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Reference [RU]
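The WER comparison between the ASR output and the English reference can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration with whitespace tokenization and no text normalization (both assumptions); the function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.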

Translate reference with MT and compare with final output
Before: Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Reference [RU]
With ideal model: Reference [EN] → MT → Output MT(Reference [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]

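Absolute scale for measuring quality
Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
Ideal model (100%): Reference [EN] → MT → Output MT(Reference [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]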
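Scoring the cascade against the ideal model's output, rather than against a human reference, can be sketched as below. The `bleu` function is a simplified corpus-level BLEU (single reference, whitespace tokens, no smoothing) written for illustration; an established scorer would normally be used instead. The sample sentences are hypothetical and in English only for readability.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any order with no matches gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)

# Hypothetical outputs: MT(Reference [EN]) is the ideal model, MT(ASR(audio)) is the cascade.
ideal_outputs = ["the cat sat on the mat today"]
cascade_outputs = ["the cat sat on a mat today"]

ideal_score = bleu(ideal_outputs, ideal_outputs)      # the ideal model defines the top of the scale
cascade_score = bleu(cascade_outputs, ideal_outputs)  # loss relative to the ideal model
```

Since the MT system is the same in both pipelines, the gap between `cascade_score` and `ideal_score` reflects errors introduced before translation, which is the point of the absolute scale.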

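Comparing MT+ASR and E2E
Cascade: Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
E2E: Audio → E2E model → Output E2E model(A [EN]) [RU]
Ideal model: Reference [EN] → MT → Output MT(Reference [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU (ASR+MT): Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]
• BLEU (E2E): Output E2E model(A [EN]) [RU] vs. Output MT(Reference [EN]) [RU]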

Current research
[Chart: BLEU (0-100) for E2E Translation vs. ASR + MT across articles 1-5, by datasets + models]
Datasets used: Fisher-CallHome, LibriSpeech, ST-TED, How2, MuST-C, Libri-trans, CoVoST

Problems of current approaches
• No standard dataset for measuring quality
• No experiments with noisy data
• Very little data for measuring quality
• No research on how the number of utterances influences quality
Diagram labels: MT, ASR, ST

Add dimensions for measuring quality
[Two plots comparing E2E Translation and ASR + MT: BLEU (0-100) vs. number of utterances (10^4 to 5⋅10^6), and BLEU (0-100) vs. noise ratio (0.1 to 0.4)]

BIG DATA
• Use synthetic data: <text> → TTS → <audio> → MT (SOTA) → <translated text>
Diagram labels: STT model, Tacotron, MT → ST
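One way to read the synthetic-data idea is sketched below: each source text is synthesized to audio by a TTS system and translated by an MT system, and the resulting (audio, translated text) pairs become training data for an ST model. Applying MT to the text rather than to the audio is an assumption about the slide's diagram. `build_synthetic_st_corpus`, `fake_tts`, and `fake_mt` are hypothetical names; the stand-ins only make the sketch runnable and do not represent Tacotron or any real MT system.

```python
from typing import Callable, Iterable, List, Tuple

def build_synthetic_st_corpus(
    source_texts: Iterable[str],
    tts: Callable[[str], bytes],
    mt: Callable[[str], str],
) -> List[Tuple[bytes, str]]:
    """Pair synthesized source-language audio with a machine translation of the same text."""
    corpus = []
    for text in source_texts:
        audio = tts(text)       # in practice, a TTS system such as Tacotron
        translation = mt(text)  # in practice, a strong MT system
        corpus.append((audio, translation))
    return corpus

# Stand-in components, purely illustrative.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

def fake_mt(text: str) -> str:
    return text.upper()

pairs = build_synthetic_st_corpus(["hello world"], fake_tts, fake_mt)
```

Passing the TTS and MT systems in as callables keeps the pairing logic independent of any particular toolkit.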

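BIG DATA
• Bilingual translations: Да и сам Лорд Кентервиль, человек весьма щепетильный, счел своим долгом упомянуть об этом мистеру Отису, когда они пришли к соглашению.
  (English: Even Lord Canterville himself, a most scrupulous man, felt it his duty to mention this to Mr. Otis when they came to an agreement.)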

Summary
• Use many languages for training;
• Estimate only the part of the loss that occurs because two models are used instead of one;
• Collect lots of high-quality, diverse data!
