Offline Reinforcement Learning Tutorial @ 強化学習若手の会 (2021/03/22)

The longer the horizon, the more the quadratic-order error becomes a problem (once the policy makes a mistake, every subsequent step goes wrong as well): the error grows with the number of steps at which the action chosen by the current policy fails to match the optimal one (= the behavior policy). [S. Ross+, 2011] A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. https://arxiv.org/abs/1011.0686
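The quadratic growth can be stated as the bound from [S. Ross+, 2011]; the sketch below assumes the simple 0–1 per-step cost setting.

```latex
% If the cloned policy \hat{\pi} disagrees with the behavior policy \pi^*
% with probability at most \epsilon on states visited by \pi^*, then over
% a horizon of T steps the expected cumulative cost J satisfies
J(\hat{\pi}) \;\le\; J(\pi^*) + T^{2}\,\epsilon .
```

Intuitively, one early mistake can push the learned policy into states the behavior policy never visits, after which it may err at every remaining step, giving the $T^2$ dependence; interactive data collection (DAgger, from the same paper) reduces the error to roughly linear in $T$.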
Preventing divergence between the policies: a divergence (distance between distributions) constraint keeps the learned policy close to the behavior policy. [N. Jaques+, 2019] Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog. https://arxiv.org/abs/1907.00456
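The constraint above can be sketched in a few lines. This is a simplified, discrete-action illustration of the general idea (the KL-control formulation in [N. Jaques+, 2019] is applied to dialog models); the function names and the `alpha` weight are illustrative, not taken from the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete action distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p=0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def penalized_objective(q_values, policy, behavior_policy, alpha=1.0):
    """Expected Q-value under the current policy, minus a KL penalty
    that keeps it close to the behavior policy (larger alpha = safer)."""
    expected_q = float(np.dot(policy, q_values))
    return expected_q - alpha * kl_divergence(policy, behavior_policy)
```

With a large `alpha`, a policy that drifts toward high (possibly overestimated) Q-values scores worse than one that stays near the behavior distribution, which is exactly the "safe vs. risky" trade-off the constraint encodes.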
Fu. “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”. arXiv preprint, 2020. https://arxiv.org/abs/2005.01643
[A. Kumar+, 2019] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. “Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction”. NeurIPS, 2019. https://arxiv.org/abs/1906.00949
[S. Ross+, 2011] Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”. AISTATS, 2011. https://arxiv.org/abs/1011.0686
[D. Precup+, 2000] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation”. ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
[N. Jiang & L. Li, 2016] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning”. ICML, 2016. https://arxiv.org/abs/1511.03722
Novikov, Tom Le Paine, Sergio Gomez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matt Hoffman, Ofir Nachum, George Tucker, Nicolas Heess, and Nando de Freitas. “RL Unplugged: Benchmarks for Offline Reinforcement Learning”. arXiv preprint, 2020. https://arxiv.org/abs/2006.13888
[A. Kumar, 2019] A. Kumar. “Data-Driven Deep Reinforcement Learning”. BAIR blog, 2019. https://bair.berkeley.edu/blog/2019/12/05/bear/
[A. Kendall+, 2019] Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. “Learning to Drive in a Day”. ICRA, 2019. https://arxiv.org/abs/1807.00412
[H. Zhu+, 2017] Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. “Optimized Cost per Click in Taobao Display Advertising”. KDD, 2017. https://arxiv.org/abs/1703.02091
[B. Zoph & Q. V. Le, 2016] Barret Zoph and Quoc V. Le. “Neural Architecture Search with Reinforcement Learning”. ICLR, 2016. https://arxiv.org/abs/1611.01578
J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. “Mastering the game of Go with deep neural networks and tree search”. Nature, 2016. http://airesearch.com/wp-content/uploads/2016/01/deepmind-mastering-go.pdf
[J. Gao, 2016] Jim Gao. “Machine Learning Applications for Data Center Optimization”. Google whitepaper, 2016. https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/42542.pdf
[B. Baker+, 2019] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. “Emergent Tool Use From Multi-Agent Autocurricula”. ICLR, 2020. https://arxiv.org/abs/1909.07528
Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, and Thomas Paine. “Benchmarks for Deep Off-Policy Evaluation”. https://openreview.net/forum?id=kWSeGEeHvF8