Slide 1

Slide 1 text

Recent Research in Speech Translation
Alexandra Filimokhina

Slide 2

Slide 2 text

Speech Translation
Task: Audio → Model → Translated text
Cascaded: Model = ASR* + MT**
End-to-End: Model = End-to-End
*ASR: Automatic Speech Recognition
**MT: Machine Translation
Cascaded approach:
✚ SOTA quality for ASR and MT
✚ Large quantity of data
✚ Independent training
− Error propagation
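A minimal sketch of the cascaded setup and of error propagation. The toy ASR and MT functions, the romanized lexicon, and all names are hypothetical and stand in for real models.

```python
from typing import Callable, List

Audio = List[float]  # assumption: audio is a plain list of samples


def make_cascaded(asr: Callable[[Audio], str],
                  mt: Callable[[str], str]) -> Callable[[Audio], str]:
    """Compose an ASR model and an MT model into one speech translator."""
    def translate(audio: Audio) -> str:
        transcript = asr(audio)   # step 1: audio -> source-language text
        return mt(transcript)     # step 2: source text -> target text
    return translate


def toy_asr(audio: Audio) -> str:
    # Hypothetical recognizer that mishears "world" as "word".
    return "hello word"


def toy_mt(text: str) -> str:
    # Hypothetical word-by-word EN -> RU lexicon (romanized for simplicity).
    lexicon = {"hello": "privet", "world": "mir", "word": "slovo"}
    return " ".join(lexicon.get(token, token) for token in text.split())


cascaded = make_cascaded(toy_asr, toy_mt)
translation = cascaded([0.1, 0.2])  # the ASR mistake survives translation
```

Here the MT step translates the wrong transcript faithfully, so the ASR error reaches the final output unchanged. An end-to-end model has no intermediate transcript where such an error could be introduced or inspected.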

Slide 3

Slide 3 text

Main groups of articles
1. DATA: Create multilingual datasets
2. End-to-End outperforms cascaded (ASR+MT) with Transformer
3. Improving cascaded (ASR+MT) models
4. Multilingual outperforms bilingual
5. Toolkits

Slide 4

Slide 4 text

1. DATA: Create multilingual datasets
Data sources: crowdsourcing, TED Talks, the Bible, telephone conversations (☎), European Parliament debates

Slide 5

Slide 5 text

1. DATA: Create multilingual datasets
Gender bias
• 70% of TED Talks speakers are men
• Women's representation in the European Parliament has peaked at 40%
• Most ASR and MT data comes from male speakers

Slide 6

Slide 6 text

2. End-to-End outperforms cascaded with Transformer
• Attempts to adapt the Transformer to audio
• A few experiments outperform the cascaded approach
• Adapt the positional encoding scheme to the Speech Transformer
• Speed up inference

Slide 7

Slide 7 text

2. End-to-End outperforms cascaded with Transformer
• Downsample the input with a CNN to make training feasible on GPUs
• Model the two-dimensional nature of a spectrogram
• Add a distance penalty to the attention to bias it towards local context
Diagram labels: capture local 2D-invariant features; model context
SAN: Self-Attention Network
MHA: Multi-Head Attention
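A minimal sketch of the distance-penalty idea, assuming a logarithmic penalty log(1 + |i − j|) subtracted from the raw attention scores before the softmax. The slide does not state the exact form of the penalty; the function names and the `strength` parameter are hypothetical.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def penalized_attention_weights(scores: List[List[float]],
                                strength: float = 1.0) -> List[List[float]]:
    """Attention weights where position j is penalized by its distance to i.

    scores[i][j] is the raw query-key score for query i and key j.
    Assumed penalty: strength * log(1 + |i - j|).
    """
    weights = []
    for i, row in enumerate(scores):
        penalized = [score - strength * math.log1p(abs(i - j))
                     for j, score in enumerate(row)]
        weights.append(softmax(penalized))
    return weights
```

With uniform raw scores, the weights now decay with distance from the query position, which is the local bias the slide describes. Setting `strength` to 0 recovers ordinary softmax attention.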

Slide 8

Slide 8 text

2. End-to-End outperforms cascaded with Transformer
• Text sequences correlate more strictly with position than audio does
  Text: What? Why? Who? Where? When?
  Audio: …… what …… who … when ….
• Speech sequences are 10 to 60 times longer than the transcript character sequence
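One way to narrow the length gap is to shorten the frame sequence before attention is applied. The sketch below uses plain frame stacking, which is a simpler stand-in for the CNN downsampling mentioned on slide 7, not the method from the cited work. The function name and the `factor` parameter are hypothetical.

```python
from typing import List

Frame = List[float]  # assumption: one feature vector per audio frame


def stack_frames(frames: List[Frame], factor: int) -> List[Frame]:
    """Concatenate every `factor` consecutive frames into one longer frame.

    The sequence becomes `factor` times shorter and each frame `factor`
    times wider. Trailing frames that do not fill a group are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    usable = len(frames) - len(frames) % factor
    stacked = []
    for start in range(0, usable, factor):
        merged: Frame = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked
```

Because self-attention cost grows quadratically with sequence length, a reduction by a factor of 4 cuts the attention matrix to one sixteenth of its size.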

Slide 9

Slide 9 text

2. End-to-End outperforms cascaded with Transformer
• Overloaded decoder:
  • Pretrain encoder
  • Split encoder into 3 parts
  • Multitasking
  • Knowledge distillation
  • Simultaneous inference
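As an illustration of the knowledge distillation item, here is a minimal sketch of a token-level distillation loss. It assumes a text MT model acts as teacher and the speech translation model as student, and that the loss interpolates between the gold label and the teacher distribution. The slide gives no formula, so the function names and the `alpha` weight are hypothetical.

```python
import math
from typing import List


def cross_entropy(target_probs: List[float],
                  predicted_probs: List[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy of a predicted distribution against a target one."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def distillation_loss(student_probs: List[float],
                      teacher_probs: List[float],
                      gold_index: int,
                      alpha: float = 0.5) -> float:
    """Mix the usual loss on the gold token with a loss on the teacher output.

    alpha = 0 trains on gold labels only; alpha = 1 imitates the teacher only.
    """
    one_hot = [1.0 if i == gold_index else 0.0
               for i in range(len(student_probs))]
    hard_loss = cross_entropy(one_hot, student_probs)
    soft_loss = cross_entropy(teacher_probs, student_probs)
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```

The soft term is smallest when the student reproduces the teacher's full distribution, so the student also learns which wrong tokens the teacher considers plausible.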

Slide 10

Slide 10 text

2. End-to-End outperforms cascaded
• Encoder should learn:
  • Acoustic knowledge
  • Semantic knowledge
  • Filter redundant states

Slide 11

Slide 11 text

2. End-to-End outperforms cascaded

Slide 12

Slide 12 text

4. Multilingual outperforms bilingual
• Pretrain the encoder on a better source language
• Use translations into other languages
Diagram panels: Concat, Merge

Slide 13

Slide 13 text

4. Multilingual outperforms bilingual
• Pretrain the encoder on a better source language
• Use translations into other languages
Diagram panels: Concat, Merge

Slide 14

Slide 14 text

4. Multilingual outperforms bilingual

Slide 15

Slide 15 text

3. Improving cascaded (ASR+MT) models
Pretrain ASR
• Pretrain the model on a high-resource ASR task
• Fine-tune its parameters for ST

Slide 16

Slide 16 text

3. Improving cascaded (ASR+MT) models
Pretrain ASR and NMT simultaneously:
• Use an adversarial regularizer in the loss function to bring the ASR encoder output closer to the input of the MT decoder

Slide 17

Slide 17 text

5. Toolkits

Slide 18

Slide 18 text

Where are we?
• Baseline: Transformer
• Quality?
• Experiments?
• Metrics?

Slide 19

Slide 19 text

Measure quality for translation with ASR + MT
Pipeline: Audio (A) → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Reference [RU]
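For reference, word error rate is the word-level edit distance between the ASR output and the reference transcript, divided by the reference length. The sketch below is a plain dynamic-programming version with whitespace tokenization, which is an assumption; real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing in output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it is an error rate rather than a percentage of correct words.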

Slide 20

Slide 20 text

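Translate reference with MT and compare with final output
Standard evaluation:
• Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Reference [RU]
Evaluation against the ideal model:
• Ideal model: Reference [EN] → MT → Output MT(Reference [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]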

Slide 21

Slide 21 text

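Absolute scale for measuring quality
• Ideal model: Reference [EN] → MT → Output MT(Reference [EN]) [RU], taken as 100%
• Cascade: Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
• WER: Output ASR(A) [EN] vs. Reference [EN]
• BLEU: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]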
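A minimal sketch of this absolute scale: the MT output on the reference transcript serves as the pseudo-reference, so a cascade with a perfect ASR scores exactly 100. The `simple_bleu` function is a simplified, unsmoothed single-sentence stand-in for a standard BLEU implementation, and `relative_quality` and its arguments are hypothetical names.

```python
import math
from collections import Counter
from typing import Callable, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU in [0, 1] against one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clipped overlap: an n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


def relative_quality(asr: Callable[[object], str],
                     mt: Callable[[str], str],
                     audio: object,
                     reference_source: str) -> float:
    """Score the cascade on a 0-100 scale where the ideal model is 100."""
    pseudo_reference = mt(reference_source)  # ideal model: MT on the reference
    cascade_output = mt(asr(audio))          # real model: MT on the ASR output
    return 100.0 * simple_bleu(cascade_output, pseudo_reference)
```

Since both outputs come from the same MT model, any gap below 100 is attributable to ASR errors alone rather than to MT quality.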

Slide 22

Slide 22 text

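Comparing MT+ASR and E2E
• Common reference: Reference [EN] → MT → Output MT(Reference [EN]) [RU]
• Cascade: Audio → ASR → Output ASR(A) [EN] → MT → Output MT(Output ASR [EN]) [RU]
  • WER: Output ASR(A) [EN] vs. Reference [EN]
  • BLEU_ASR+MT: Output MT(Output ASR [EN]) [RU] vs. Output MT(Reference [EN]) [RU]
• E2E: Audio → E2E model → Output E2E model(A [EN]) [RU]
  • BLEU_E2E: Output E2E model(A [EN]) [RU] vs. Output MT(Reference [EN]) [RU]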

Slide 23

Slide 23 text

Current research
[Chart: BLEU (0 to 100) by article (1 to 5), Datasets + Models, E2E Translation vs. ASR + MT; values shown: 63, 68]
Datasets used in the articles: Fisher-CallHome, LibriSpeech, ST-TED, How2, MuST-C, Libri-trans, CoVoST

Slide 24

Slide 24 text

Problems of current approaches
• No standard dataset for measuring quality
• No experiments with noisy data
• Very little data for measuring quality
• No research on how the number of utterances influences quality
Diagram labels: MT, ASR, ST

Slide 25

Slide 25 text

Add dimensions for measuring quality
[Chart 1: BLEU (0 to 100) vs. number of utterances (10^4, 10^5, 5⋅10^5, 10^6, 5⋅10^6), E2E Translation vs. ASR + MT]
[Chart 2: BLEU (0 to 100) vs. noise ratio (0.1, 0.2, 0.3, 0.4), E2E Translation vs. ASR + MT]
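To evaluate along the noise dimension, test audio has to be corrupted at a controlled level. The slide does not define the noise ratio, so the sketch below assumes a linear mix, (1 − r) · signal + r · noise, with uniform white noise. The function name, the seed, and the noise type are all assumptions.

```python
import random
from typing import List


def mix_noise(signal: List[float], noise_ratio: float, seed: int = 0) -> List[float]:
    """Blend a waveform with uniform white noise in [-1, 1].

    noise_ratio = 0 returns the clean signal; noise_ratio = 1 returns pure noise.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps test sets reproducible
    return [(1.0 - noise_ratio) * sample + noise_ratio * rng.uniform(-1.0, 1.0)
            for sample in signal]


def noisy_test_sets(signal: List[float], ratios: List[float]) -> dict:
    """Build one corrupted copy of the signal per noise ratio."""
    return {ratio: mix_noise(signal, ratio) for ratio in ratios}
```

Scoring both systems on each corrupted copy yields one BLEU value per noise ratio, which is the data needed for the second chart.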

Slide 26

Slide 26 text

BIG DATA
• Use synthetic data
[Diagram: STT model → TTS → … → MT (SOTA) → Tacotron; further labels: MT, ST]

Slide 27

Slide 27 text

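BIG DATA
• Bilingual translations:
  Да и сам Лорд Кентервиль, человек весьма щепетильный, счел своим долгом упомянуть об этом мистеру Отису, когда они пришли к соглашению.
  [EN: And Lord Canterville himself, a most scrupulous man, felt it his duty to mention this to Mr. Otis when they came to an agreement.]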

Slide 28

Slide 28 text

Summary
• Use many languages for training
• Measure only the part of the quality loss caused by using two models instead of one
• Collect lots of high-quality, diverse data!