• systems using our own implementation of NMT with attention over the source sequence (Bahdanau et al., 2014)
  ◦ bidirectional GRUs
  ◦ AdaDelta
  ◦ all systems are trained for 500,000 iterations, with validation every 5,000 steps; the best single model from validation is used
  ◦ ℓ2 regularization (α = 10⁻⁵)
  ◦ dropout on the output layers with rate 0.5
  ◦ beam size 10
• corpus
  ◦ English–German
    ▪ 4.4M segments from Europarl and CommonCrawl
  ◦ English–French
    ▪ 4.9M segments from Europarl and CommonCrawl
  ◦ English–Portuguese
    ▪ 28.5M segments from Europarl, JRC-Acquis and OpenSubtitles
• subword
  ◦ a shared subword representation for each language pair, built by extracting a vocabulary of 80,000 symbols from the concatenated source and target data
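The shared subword vocabulary described above is the usual byte-pair-encoding recipe (the notes do not name the algorithm, so this is an assumption): merge the most frequent adjacent symbol pair repeatedly over the concatenated source and target text. A minimal sketch of the merge-learning step, not the authors' implementation:

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn BPE-style merge operations from an iterable of text lines.

    Each word is represented as a tuple of symbols plus an end-of-word
    marker; each step merges the most frequent adjacent symbol pair.
    """
    vocab = Counter()
    for line in corpus:
        for word in line.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite the vocabulary with the chosen pair merged.
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

To get a shared vocabulary as in the notes, the corpus passed in would be the concatenation of source and target sides, and `num_merges` would be chosen so the final symbol inventory reaches roughly 80,000 types.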
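Decoding with a beam (size 10 above) keeps the k best partial hypotheses at each step instead of committing greedily. A generic sketch over a hypothetical scoring callback (`next_logprobs` is an assumption standing in for the NMT decoder's softmax):

```python
import math


def beam_search(next_logprobs, beam_size, max_len, eos="</s>"):
    """Generic beam search.

    next_logprobs(prefix) returns a dict mapping each candidate next
    token to its log-probability given the prefix (a tuple of tokens).
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_logprobs(prefix).items():
                hyp = (prefix + (tok,), score + lp)
                (finished if tok == eos else candidates).append(hyp)
        if not candidates:
            break
        # Keep only the beam_size best partial hypotheses.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]
    pool = finished if finished else beams
    return max(pool, key=lambda h: h[1])
```

On a toy model where "a" is the likelier first token but leads to an unlikely continuation, the beam recovers the globally better "b" hypothesis that greedy decoding would miss.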