Slide 3
What Reinforcement Learning Can Do
... engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).
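As a rough sketch of the training signal described above (not the authors' code), the update below fits a Q-network to the bootstrapped one-step target r + γ·max_a' Q_target(s', a') by stochastic gradient descent on a minibatch of replayed transitions; the names `q_net`, `target_net`, and the batch layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One stochastic-gradient step on a minibatch of replayed transitions.

    batch: tensors (states, actions, rewards, next_states, dones),
    where states/next_states have shape (B, 4, 84, 84).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), cut off at episode end.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.smooth_l1_loss(q_values, target)  # Huber loss to keep gradients bounded
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A separate `target_net`, periodically copied from `q_net`, together with the error-clipping (Huber) loss, are the kind of stabilisation choices the excerpt alludes to.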
We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available [12, 15]. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; see Fig. 3).
[Figure: Schematic illustration of the convolutional neural network. The input to the network is an 84 × 84 × 4 image produced by the preprocessing, followed by three convolutional layers and two fully connected layers with a single output for each valid action; each hidden layer is followed by a rectifier nonlinearity, max(0, x). Details of the architecture are explained in the Methods.]
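A minimal PyTorch sketch of a network shaped like the one in the caption: an 84 × 84 × 4 input, three convolutional layers, two fully connected layers, a single output per valid action, and rectifier nonlinearities. The filter counts (32, 64, 64) and the 512-unit hidden layer follow the published Methods; the class name and the input-scaling step are our assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network: 84x84x4 input, one Q-value output per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one output per valid action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))  # scale byte-valued frames to [0, 1]
```

With stacked 4-frame 84 × 84 inputs, the three convolutions reduce the feature map to 7 × 7 × 64 before the fully connected layers.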
[Figure: Training curves over 200 training epochs, plotting the average score per episode and the average predicted action value (Q).]
Games [Mnih+ 15]
... sampled from real systems. On the other hand, if the environment is extremely stochastic, a limited amount of previously acquired data might not be able to capture the real environment's properties and could lead to inappropriate policy updates. However, rigid dynamics models, such as a humanoid robot model, do not usually include large stochasticity. Therefore, our approach is suitable for real-robot learning in high-dimensional systems such as humanoid robots.
... performance. We proposed recursively using the off-policy PGPE method to improve the policies, and applied our approach to cart-pole swing-up and basketball-shooting tasks. In the former, we introduced a real-virtual hybrid task environment composed of a motion controller and virtually simulated cart-pole dynamics. By using the hybrid environment, we can potentially design a wide variety of different task environments. Note that complicated arm movements of the humanoid robot need to be learned for the cart-pole swing-up. Furthermore, by using our proposed method, the challenging basketball-shooting task was successfully accomplished.
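For intuition about the policy-improvement step, here is a minimal sketch of a plain (on-policy) PGPE-style update rather than the recursive off-policy variant used in the paper: policy parameters are sampled from a Gaussian search distribution, each sample is scored by a rollout return, and the distribution's mean and standard deviation are moved along the baseline-subtracted log-likelihood gradient. `rollout_return` and the learning rates are placeholders.

```python
import numpy as np

def pgpe_update(mu, sigma, rollout_return, n_samples=20, lr_mu=0.1, lr_sigma=0.05):
    """One PGPE-style update of a Gaussian search distribution over policy parameters.

    mu, sigma: current mean and std of the parameter distribution (1-D arrays).
    rollout_return: callable mapping a parameter vector to an episode return.
    """
    thetas = mu + sigma * np.random.randn(n_samples, mu.size)  # sample parameter vectors
    returns = np.array([rollout_return(theta) for theta in thetas])
    baseline = returns.mean()                                   # simple return baseline

    diff = thetas - mu
    # Log-likelihood gradients of the Gaussian with respect to mu and sigma.
    grad_mu = ((returns - baseline)[:, None] * diff / sigma**2).mean(axis=0)
    grad_sigma = ((returns - baseline)[:, None] * (diff**2 - sigma**2) / sigma**3).mean(axis=0)

    mu = mu + lr_mu * grad_mu
    sigma = np.maximum(sigma + lr_sigma * grad_sigma, 1e-3)     # keep exploration noise positive
    return mu, sigma
```

Because exploration noise is placed on the parameters rather than on individual actions, each rollout executes a deterministic policy, which is what makes this family of methods attractive on physical robots.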
Future work will develop a method based on a transfer learning [28] approach to efficiently reuse the previous experiences acquired in different target tasks.
Figure 13. The humanoid robot CB-i [7]. (Photo courtesy of ATR.)
Robot control [Sugimoto+ 16]
[Figure 2: Overall collections system architecture (IBM Research / Center for Business Optimization). Event notifications from external systems feed an Event Listener and a State Generator, which maintain the current taxpayer state and the taxpayer state history from taxpayer profiles, feature definitions, and the case inventory; a Modeling and Optimization Engine (Modeler and Optimizer) and a Rule Processor, constrained by allocation rules, resource constraints, and business rules, generate recommended actions; a Segment Selector and an Action Handler turn recommendations into new cases, with a Scheduler raising state-time-expired events.]
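As a loose illustration of the event-driven flow in the architecture figure, the sketch below takes a taxpayer state, scores candidate actions, and filters them through business rules before recording a recommendation; every name here (`TaxpayerState`, `recommend_action`, and so on) is hypothetical rather than the system's actual API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Sequence

@dataclass
class TaxpayerState:
    tp_id: str
    features: dict          # e.g. {"Feat1": 5, "Feat2": "A", "FeatN": 1500}
    state_date: date

@dataclass
class Recommendation:
    tp_id: str
    action: str
    rec_date: date

def recommend_action(state: TaxpayerState,
                     candidate_actions: Sequence[str],
                     score: Callable[[TaxpayerState, str], float],
                     allowed: Callable[[TaxpayerState, str], bool]) -> Recommendation:
    """Pick the highest-scoring action that passes the business rules."""
    feasible = [a for a in candidate_actions if allowed(state, a)]
    if not feasible:
        raise ValueError(f"no feasible action for taxpayer {state.tp_id}")
    best = max(feasible, key=lambda a: score(state, a))
    return Recommendation(state.tp_id, best, date.today())
```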
Debt collection optimization [Abe+ 10]
Go [Silver+ 16]