Wei He, Dianhai Yu and Haifeng Wang. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1723–1732, 2015. Literature introduction ('18/04/20), Nagaoka University of Technology, Natural Language Processing Laboratory, Yumeto Inaoka
• Weights are initialized from a uniform distribution between -0.01 and 0.01.
• We use stochastic gradient descent with the recently proposed learning-rate decay strategy AdaDelta (Zeiler, 2012).
• The mini-batch size in our model is set to 50 so that convergence is fast.
• We train 1000 mini-batches of data on one language pair before switching to the next language pair (see the schedule sketch after this list).
• For word-representation dimensionality, we use 1000 for both the source and target languages.
• The hidden-layer size is set to 1000.
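A minimal sketch (not the authors' code) of how these settings and the alternating multi-task training schedule could be wired together. The function names, the placeholder AdaDelta update, and the example language pairs are assumptions for illustration only.

```python
import itertools


# Hyperparameters listed above
INIT_RANGE = 0.01        # weights drawn uniformly from [-0.01, 0.01]
MINIBATCH_SIZE = 50      # sentence pairs per mini-batch
BATCHES_PER_PAIR = 1000  # mini-batches on one language pair before switching
EMBEDDING_DIM = 1000     # word-representation size (source and target)
HIDDEN_DIM = 1000        # hidden-layer size


def adadelta_update_on_minibatch(pair, batch_size):
    """Placeholder for one AdaDelta update on `batch_size` sentence pairs of `pair`."""
    pass


def multi_task_schedule(language_pairs, total_batches):
    """Cycle over the language pairs, training BATCHES_PER_PAIR mini-batches on each."""
    trained = 0
    for pair in itertools.cycle(language_pairs):
        for _ in range(BATCHES_PER_PAIR):
            adadelta_update_on_minibatch(pair, MINIBATCH_SIZE)
            trained += 1
            if trained >= total_batches:
                return


if __name__ == "__main__":
    # The language pairs here are placeholders, not necessarily those used in the paper.
    multi_task_schedule(["en-fr", "en-es", "en-nl"], total_batches=5000)
```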