Use RL to discover an RL algorithm (e.g. something like PPO) automatically
Recap of a simple RL algorithm
1. Initialize parameters of policy and of value function
2. While true
a. Run the policy for one episode and collect a trajectory
b. Update the policy and value function parameters using the collected trajectory
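The recap above can be sketched in code. The exact update rule is not given here, so this sketch assumes one common choice, REINFORCE with a learned value baseline, on a hypothetical four-state corridor task. The environment, the step sizes, and the names `softmax`, `discounted_returns`, `run_episode`, and `train` are illustrative assumptions, not taken from the paper.

```python
import math
import random

N_STATES = 4          # hypothetical corridor: states 0..3, state 3 is terminal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
GAMMA = 0.9           # discount factor (assumed)


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def run_episode(theta, rng, max_steps=20):
    """Step 2a: run the policy for one episode, collect (state, action, reward)."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1
        next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        trajectory.append((state, action, reward))
        state = next_state
        if state == N_STATES - 1:
            break
    return trajectory


def train(episodes=3000, alpha=0.2, beta=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize policy logits and value estimates (tabular, all zeros).
    theta = [[0.0, 0.0] for _ in range(N_STATES)]
    value = [0.0] * N_STATES
    # Step 2: bounded number of episodes here instead of "while true".
    for _ in range(episodes):
        trajectory = run_episode(theta, rng)
        returns = discounted_returns([r for _, _, r in trajectory], GAMMA)
        # Step 2b (assumed form): policy-gradient step weighted by return minus baseline.
        for (s, a, _), g in zip(trajectory, returns):
            advantage = g - value[s]
            probs = softmax(theta[s])
            for b in range(len(ACTIONS)):
                grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * advantage * grad_log_pi
            value[s] += beta * advantage   # move the baseline toward the observed return
    return theta, value
```

Calling `train()` returns the learned logits and value estimates. With these settings the policy should come to favor moving right from the start state, though the exact numbers depend on the seed and step sizes.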
Discovering Reinforcement Learning Algorithms
Junhyuk Oh, Matteo Hessel, et al., DeepMind [NeurIPS paper]
[Figure callouts: "not crashed"; "future is ok"; "it's obvious that the future will be bad"; "the action avoided the forecasted crash, awesome"]