P.51
[Rennie+,2017] Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., & Goel, V. (2017, July).
Self-critical sequence training for image captioning. CVPR2017.
[Li+,2017] Li, J., Monroe, W., & Jurafsky, D. (2017). Learning to Decode for Future
Success. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1701.06549
[Khandelwal+,2021] Khandelwal, A. (2021). WeaSuL: Weakly Supervised Dialogue
Policy Learning: Reward Estimation for Multi-turn Dialogue. INLG2021.
P.52 [Ziegler+,2019] Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A.,
Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from
Human Preferences. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1909.08593
P.53 [Choshen+,2020] Choshen, L., Fox, L., Aizenbud, Z., & Abend, O. (2020). On the
weaknesses of reinforcement learning for neural machine translation. ICLR2020.
P.54 [Stiennon+,2020] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C.,
Radford, A., Amodei, D., & Christiano, P. (2020). Learning to summarize from human feedback.
NeurIPS2020.
P.57 [Xie+,2018] Xie, Y., et al. (2018). A fast proximal point method for computing exact
Wasserstein distance. arXiv preprint arXiv:1802.04307.
References