distribution between -0.01 and 0.01.
• We use stochastic gradient descent with the recently proposed adaptive learning-rate method AdaDelta (Zeiler, 2012).
• The mini-batch size is set to 50 to speed up convergence.
• We train on 1000 mini-batches from one language pair before switching to the next language pair.
• The word representation dimensionality is 1000 for both the source and target languages.
• The size of the hidden layer is set to 1000.
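The settings above map directly onto a training configuration. The following is a minimal sketch, assuming PyTorch (the paper does not specify a framework); the vocabulary size, the toy encoder, the example language pairs, and the dummy loss are illustrative stand-ins, while the initialization range, the AdaDelta optimizer, the mini-batch size of 50, and the 1000-mini-batch round-robin schedule over language pairs follow the list above.

```python
import itertools

import torch
import torch.nn as nn

# Hyper-parameters taken from the list above.
EMBED_DIM = 1000         # word representation dimensionality
HIDDEN_DIM = 1000        # hidden layer size
BATCH_SIZE = 50          # mini-batch size
BATCHES_PER_PAIR = 1000  # mini-batches per language pair before switching
VOCAB_SIZE = 30000       # assumed; not stated in this section


class ToyEncoder(nn.Module):
    """Stand-in module for illustration; not the paper's architecture."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return h


model = ToyEncoder()

# Initialize all parameters uniformly in [-0.01, 0.01].
for p in model.parameters():
    nn.init.uniform_(p, -0.01, 0.01)

# AdaDelta (Zeiler, 2012) adapts the learning rate per parameter.
optimizer = torch.optim.Adadelta(model.parameters())

# Round-robin schedule: 1000 mini-batches on one pair, then switch.
language_pairs = ["en-fr", "en-nl", "en-es"]  # illustrative pairs
for pair in itertools.cycle(language_pairs):
    for _ in range(BATCHES_PER_PAIR):
        # Placeholder batch; a real loader would yield data for `pair`.
        tokens = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, 20))
        loss = model(tokens).pow(2).mean()  # dummy loss, stands in for NLL
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    break  # remove this to keep cycling through the language pairs
```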