[Figure: Cascaded = ASR* + MT**; End-to-End = Model]
* Automatic Speech Recognition
** Machine Translation
✚ SOTA quality for ASR and MT
✚ Large quantity of data
✚ Independent training
− Error propagation
GPUs
• model the bidimensional nature of a spectrogram
• add a distance penalty to the attention to bias it towards local context
[Figure labels: capture local 2D-invariant features; model context]
2. End-to-end outperforms cascading with Transformer
SAN - Self-Attention Network
MHA - Multi-Head Attention
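The distance penalty mentioned above can be illustrated with a small sketch. The slide does not give the exact form of the penalty, so this example assumes a simple linear one: the attention logit between positions i and j is reduced by `penalty * |i - j|` before the softmax. The function names and the `penalty` parameter are hypothetical and only serve the illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, penalty=0.0):
    """Scaled dot-product attention weights with a distance penalty.

    Assumption: the penalty is linear in the distance |i - j| between the
    query position i and the key position j. Other forms are possible.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            # Distant positions get a lower logit, so attention stays local.
            logits.append(score - penalty * abs(i - j))
        weights.append(softmax(logits))
    return weights


# Five identical frames: content alone cannot distinguish the positions.
frames = [[1.0, 0.0] for _ in range(5)]
plain = attention_weights(frames, frames)               # uniform attention
local = attention_weights(frames, frames, penalty=1.0)  # peaked around i == j
```

With identical frames, the unpenalized weights are uniform, while the penalized weights for the middle frame decay with distance from it. This is the local bias the bullet describes.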
[Figure: evaluation scheme]
Audio → Output ASR [EN]; compared with Reference [EN] by WER
Output MT (Output ASR [EN]) [RU]; compared with Output MT (Reference [EN]) [RU] by BLEU
Ideal model: translate the reference with MT and compare with the final output
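The WER comparison in the scheme above can be sketched as a word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. This is a minimal illustration using whitespace tokenization; the function name `wer` is an assumption, and a real evaluation would typically normalize casing and punctuation first. The translation side of the scheme would be scored analogously with a BLEU implementation, which is not shown here.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```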
with noisy data
• Very little data for measuring quality
• No research on how the number of utterances influences quality
Problems of current approaches
[Diagram labels: MT, ASR, ST]
[Figure: BLEU as a function of noise ratio (0 to 0.4) and number of utterances (10^6 to 5⋅10^6), for E2E Translation and ASR + MT]
Add dimensions for measuring quality