Modern research in the field of
Speech Translation
Alexandra Filimokhina
Slide 2
Speech Translation model: Audio → Translated text
Model =
• Cascaded: ASR* + MT**
• End-to-End (both options are sketched below)
*Automatic Speech Recognition
**Machine Translation
Cascaded approach:
✚ SOTA quality for ASR and MT
✚ Large quantity of data
✚ Independent training
− Error propagation
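To make the two setups concrete, here is a minimal sketch; asr, mt, and e2e_st are hypothetical model objects with illustrative transcribe/translate methods, not anything defined on the slides.

```python
# Minimal sketch of the two model shapes; asr, mt and e2e_st are placeholder
# objects with hypothetical transcribe/translate methods.

def cascaded_speech_translation(audio, asr, mt):
    """Cascaded: run ASR first, then feed its text output into MT.
    Errors made by the ASR step propagate into the MT step."""
    transcript = asr.transcribe(audio)      # Audio -> source-language text
    return mt.translate(transcript)         # source text -> target-language text

def end_to_end_speech_translation(audio, e2e_st):
    """End-to-End: a single model maps audio directly to translated text."""
    return e2e_st.translate(audio)          # Audio -> target-language text
```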
Slide 3
Main groups of articles
1. DATA: Create multilingual datasets
2. End-To-End outperforms cascading (ASR+MT) with Transformer
3. Improving cascaded (ASR+MT) models
4. Multilingual outperforms Bilingual
5. Toolkits
1. DATA: Create multilingual datasets
Gender Bias
• 70% of speakers in TED Talks are men
• The peak of women's representation in the EU Parliament has been 40%
• Most ASR and MT data is generated by male speakers
Slide 6
• Attempts to adapt the Transformer to audio
• A few experiments outperform the cascading approach
• Adapt the positional encoding scheme to the Speech Transformer
• Speed up inference
2. End-To-End outperforms cascading with Transformer
Slide 7
• Downsample the input with a CNN to make training feasible on GPUs (capture local 2D-invariant features)
• Model the bidimensional nature of a spectrogram
• Add a distance penalty to the attention to bias it towards local context (model context; see the sketch below)
2. End-To-End outperforms cascading with Transformer
SAN – Self-Attention Network, MHA – Multi-Head Attention
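A minimal sketch of the first and last bullets, assuming PyTorch; the module names, layer sizes, and the logarithmic form of the penalty are illustrative rather than taken from any specific paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDownsampler(nn.Module):
    """Two strided 2D convolutions: each halves the time and frequency axes,
    so the self-attention layers later run over a 4x shorter sequence."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, time, freq); add a channel dim for Conv2d
        x = F.relu(self.conv1(spectrogram.unsqueeze(1)))
        x = F.relu(self.conv2(x))
        b, c, t, f = x.shape
        # flatten channels and frequency into the model dimension: (batch, time/4, c*f)
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)

def local_biased_attention(q, k, v):
    """Scaled dot-product attention with a logarithmic distance penalty
    subtracted from the logits, biasing each frame towards nearby frames."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))        # (batch, t, t)
    t = scores.size(-1)
    positions = torch.arange(t, device=scores.device)
    distance = (positions[None, :] - positions[:, None]).abs().float()
    scores = scores - torch.log1p(distance)                          # penalise far positions
    return torch.softmax(scores, dim=-1) @ v
```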
Slide 8
• Text sequences have a stricter correlation with position than audio does
Text: What? Why? Who? Where? When?
Audio: …… what …… who … when ……
• Speech sequences are 10–60 times longer than the corresponding transcript character sequences
2. End-To-End outperforms cascading with Transformer
Slide 9
• Overloaded decoder; proposed remedies:
• Pretrain the encoder
• Split the encoder into 3 parts
• Multitasking
• Knowledge distillation (sketched below)
• Simultaneous inference
2. End-To-End outperforms cascading with Transformer
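A minimal sketch of the knowledge-distillation idea, assuming PyTorch: the end-to-end student is trained both on the gold target tokens and on the soft output distribution of a text-based MT teacher; the function name, temperature, and loss weighting are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Mix hard-label cross-entropy with soft-label KL to the MT teacher.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    target_ids: (batch, seq_len) gold target tokens
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), target_ids.reshape(-1))
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```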
Slide 10
• The encoder should learn:
• Acoustic knowledge
• Semantic knowledge
• Filter redundant states
2. End-To-End outperforms cascading
Slide 11
2. End-To-End outperforms cascading
Slide 12
• Pretrain the encoder on a better-resourced source language
• Use translations into other languages
Concat: / Merge: (see the sketch below)
4. Multilingual outperforms Bilingual
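One plausible reading of the Concat/Merge labels is how a target-language embedding is combined with the encoder states. This is a minimal sketch of both variants, assuming PyTorch; the class name, dimensions, and language inventory are illustrative.

```python
import torch
import torch.nn as nn

class LanguageConditioning(nn.Module):
    """Inject a target-language embedding either by concatenation or by merging."""
    def __init__(self, d_model: int = 512, n_languages: int = 8, mode: str = "merge"):
        super().__init__()
        self.mode = mode
        self.lang_embedding = nn.Embedding(n_languages, d_model)
        # after concatenation the feature size doubles, so project back to d_model
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, encoder_states: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, time, d_model); lang_id: (batch,)
        lang = self.lang_embedding(lang_id).unsqueeze(1)       # (batch, 1, d_model)
        lang = lang.expand(-1, encoder_states.size(1), -1)     # broadcast over time
        if self.mode == "concat":
            return self.proj(torch.cat([encoder_states, lang], dim=-1))
        return encoder_states + lang                           # "merge": add element-wise
```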
Slide 13
4. Multilingual outperforms Bilingual
Slide 14
4. Multilingual outperforms Bilingual
Slide 15
Pretrain ASR
• Pretrain the model on a high-resource ASR task
• Fine-tune its parameters for ST (see the sketch below)
3. Improving cascaded (ASR+MT) models
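A minimal sketch of this recipe, assuming PyTorch; the checkpoint file, the build_speech_translation_model constructor, and the "encoder." key prefix are all hypothetical.

```python
import torch

# Hypothetical checkpoint of a model trained on a high-resource ASR task
asr_checkpoint = torch.load("asr_pretrained.pt", map_location="cpu")

st_model = build_speech_translation_model()   # hypothetical constructor

# Copy only the encoder parameters from the ASR model into the ST model
encoder_state = {k[len("encoder."):]: v
                 for k, v in asr_checkpoint["model"].items()
                 if k.startswith("encoder.")}
st_model.encoder.load_state_dict(encoder_state)

# Optionally freeze the pretrained encoder for the first fine-tuning epochs
for p in st_model.encoder.parameters():
    p.requires_grad = False
```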
Slide 16
Pretrain ASR and NMT simultaneously:
• Use an adversarial regularizer in the loss function to push the ASR encoder output closer to the input expected by the MT decoder (see the sketch below)
3. Improving cascaded (ASR+MT) models
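A minimal, heavily simplified sketch of one way such an adversarial regularizer can look, assuming PyTorch; the discriminator architecture, the pooled representations, and the weighting are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small discriminator that tries to tell speech-encoder states from the
# text-side representations the MT decoder normally consumes.
discriminator = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

def adversarial_regularizer(speech_states, text_states):
    """speech_states, text_states: (batch, d) pooled encoder representations."""
    real = discriminator(text_states.detach())
    fake = discriminator(speech_states.detach())
    # Discriminator objective: label text representations 1, speech representations 0
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
              F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    # Encoder objective: produce speech states the discriminator mistakes for text
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(speech_states), torch.ones_like(real))
    return d_loss, g_loss

# Per training step (lambda_adv is a tuning knob):
#   total_loss = asr_loss + mt_loss + lambda_adv * g_loss
# while d_loss is used to update only the discriminator.
```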
Slide 17
5. Toolkits
Slide 18
Where are we?
• Baseline: Transformer
• Quality?
• Experiments?
• Metrics?
Slide 19
Measure quality for translation with ASR + MT
ASR: Audio → Output ASR(A) [EN], scored against Reference [EN] with WER
MT: Output ASR [EN] → Output MT(Output ASR [EN]) [RU], scored against Reference [RU] with BLEU
Slide 20
Translate the reference with MT and compare with the final output
ASR + MT: Audio → Output ASR(A) [EN] → Output MT(Output ASR [EN]) [RU]
Ideal model: Reference [EN] → Output MT(Reference [EN]) [RU]
WER: Output ASR(A) [EN] vs. Reference [EN]
BLEU: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]
Slide 21
Ideal model = 100%
Ideal model: Reference [EN] → Output MT(Reference [EN]) [RU]
ASR + MT: Audio → Output ASR(A) [EN] → Output MT(Output ASR [EN]) [RU], scored with WER and BLEU as before
Absolute scale for measuring quality
Slide 22
ASR + MT: Audio → Output ASR(A) [EN] → Output MT(Output ASR [EN]) [RU], giving BLEU_ASR+MT against Output MT(Reference [EN]) [RU]
E2E model: Audio → Output E2E model(A [EN]) [RU], giving BLEU_E2E against Output MT(Reference [EN]) [RU]
WER: Output ASR(A) [EN] vs. Reference [EN]
Comparing ASR + MT and E2E (a code sketch of this evaluation follows)
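A minimal sketch of the evaluation scheme above, assuming the jiwer and sacrebleu packages; all system outputs below are placeholder strings.

```python
import jiwer
import sacrebleu

reference_en = ["this is the english reference transcript"]
reference_ru = ["это русский эталонный перевод"]

asr_output_en = ["this is the english reference transkript"]   # Output ASR(A) [EN]
cascade_output_ru = ["это русский перевод каскада"]            # Output MT(Output ASR [EN]) [RU]
e2e_output_ru = ["это русский перевод e2e модели"]             # Output E2E model(A [EN]) [RU]
ideal_output_ru = ["это русский перевод эталона"]              # Output MT(Reference [EN]) [RU]

# ASR quality
wer = jiwer.wer(reference_en, asr_output_en)

# BLEU against the human reference ...
bleu_vs_human = sacrebleu.corpus_bleu(cascade_output_ru, [reference_ru])
# ... and against MT(Reference [EN]), the "ideal model" that would score 100
bleu_cascade = sacrebleu.corpus_bleu(cascade_output_ru, [ideal_output_ru])
bleu_e2e = sacrebleu.corpus_bleu(e2e_output_ru, [ideal_output_ru])

print(f"WER={wer:.2f}  BLEU vs human={bleu_vs_human.score:.1f}  "
      f"BLEU(cascade)={bleu_cascade.score:.1f}  BLEU(E2E)={bleu_e2e.score:.1f}")
```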
• No standard dataset for measuring quality
• No experiments with noisy data
• Very little data for measuring quality
• No research on how the number of utterances influences quality
Problems of current approaches
Slide 25
(Plots: BLEU vs. number of utterances, 10^4 to 5·10^6, and BLEU vs. noise ratio, 0.1 to 0.4, each comparing E2E Translation with ASR + MT)
Add dimensions for measuring quality
Slide 26
BIG DATA
• Use synthetic data for the ST model: run TTS (Tacotron) to synthesize audio and a SOTA MT system to synthesize translations, turning existing corpora (e.g. MT data) into ST data (see the sketch below)
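A minimal sketch of the two synthetic-data routes; synthesize_speech and translate are hypothetical wrappers around a TTS model (e.g. Tacotron-style) and a SOTA MT system, and the function names are illustrative.

```python
# Turn an MT parallel corpus (source text, target text) into synthetic ST data
# (audio, target text) by synthesizing the source side with a TTS model.
def build_synthetic_st_from_mt(mt_corpus, synthesize_speech):
    """mt_corpus: iterable of (source_text, target_text) pairs."""
    st_corpus = []
    for source_text, target_text in mt_corpus:
        audio = synthesize_speech(source_text)   # hypothetical TTS call
        st_corpus.append((audio, target_text))
    return st_corpus

# The symmetric route: take an ASR corpus (audio, transcript) and generate the
# target side with a SOTA MT system.
def build_synthetic_st_from_asr(asr_corpus, translate):
    return [(audio, translate(transcript)) for audio, transcript in asr_corpus]
```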
Slide 27
BIG DATA
• Bilingual translations, e.g. a Russian translation of an English book:
"Да и сам Лорд Кентервиль, человек весьма щепетильный, счел своим долгом упомянуть об этом мистеру Отису, когда они пришли к соглашению."
(Indeed, Lord Canterville himself, a most scrupulous man, considered it his duty to mention this to Mr. Otis when they came to an agreement.)
Slide 28
Summary
• Use many languages for training;
• Estimate only the part of the quality loss that comes from using two models instead of one;
• Collect lots of high-quality, diverse data!