We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout—taking high-dimensional data (210×160 colour video at 60 Hz) as input—to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner—illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details). We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available (refs 12, 15). In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; see Fig. 3).

[Figure: schematic illustration of the convolutional neural network. The input is an 84×84×4 image produced by the preprocessing, followed by three convolutional layers and two fully connected layers with a single output for each valid action; each hidden layer is followed by a rectifier nonlinearity, that is, max(0, x).]

[Figure: training curves showing average score per episode and average predicted action-value (Q) as a function of training epochs.]

Games [Mnih+ 15]
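The figure caption above fully specifies the shape of the Q-network. The following PyTorch module is a minimal sketch of that description (84×84×4 input, three convolutional layers, two fully connected layers, one output per valid action, rectifier nonlinearities), not the authors' implementation; the filter counts, kernel sizes and strides follow the values reported in the paper's Methods.

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network sketch: stacked frames in, one Q-value per action out."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),                   # one output per valid action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: preprocessed frame stack of shape (batch, 4, 84, 84)
        return self.head(self.features(x))

# Greedy action selection for a single preprocessed frame stack.
q_net = DQN(n_actions=18)          # e.g. 18 for the full Atari joystick action set
frames = torch.zeros(1, 4, 84, 84)
action = q_net(frames).argmax(dim=1).item()

Everything beyond the layer shapes (experience replay, the target network, the frame preprocessing itself) is omitted from this sketch.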
… sampled from real systems. On the other hand, if the environment is extremely stochastic, a limited amount of previously acquired data might not capture the real environment's properties and could lead to inappropriate policy updates. However, rigid dynamics models, such as a humanoid robot model, do not usually include large stochasticity. Therefore, our approach is suitable for real-robot learning in high-dimensional systems such as humanoid robots.

… performance. We proposed recursively using the off-policy PGPE method to improve the policies and applied our approach to cart-pole swing-up and basketball-shooting tasks. In the former, we introduced a real-virtual hybrid task environment composed of a motion controller and virtually simulated cart-pole dynamics. Using this hybrid environment, we can potentially design a wide variety of different task environments. Note that the complicated arm movements of the humanoid robot need to be learned for the cart-pole swing-up. Furthermore, using our proposed method, the challenging basketball-shooting task was successfully accomplished.

Future work will develop a method based on a transfer learning [28] approach to efficiently reuse previous experiences acquired in different target tasks.

Acknowledgment: This work was supported by MEXT KAKENHI Grant 23120004, MIC-SCOPE, "Development of BMI Technologies for Clinical Application" carried out under SRPBS by AMED, and NEDO. Part of this study was supported by JSPS KAKENHI Grant 26730141. This work was also supported by NSFC 61502339.

[Figure 13: the humanoid robot CB-i [7]. (Photo courtesy of ATR.)]

Robot control [Sugimoto+ 16]

[Figure 2: overall collections system architecture (IBM Research / Center for Business Optimization). Event notifications from external systems update the current taxpayer state and a taxpayer state history; a state generator feeds the modeling and optimization engine (modeler and optimizer), which, together with allocation rules, business rules and resource constraints, drives a segment selector and rule processor that write recommended actions (taxpayer ID, recommended action, dates) for an action handler and case scheduler to execute. A rough sketch of this data flow appears at the end of this section.]

Optimization of debt collection [Abe+ 10]

Go [Silver+ 16]
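As referenced in the [Abe+ 10] figure description above, the collections architecture reduces to a small set of data structures: taxpayer states, optimizer-produced action budgets per rule-defined segment, and recommended-action records. The Python sketch below only illustrates that flow under assumed, simplified names and fields; it is not the system's actual schema or code.

from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, List

@dataclass
class TaxpayerState:
    tp_id: str
    features: Dict[str, float]                  # e.g. {"Feat1": 5.0, "Feat2": 0.0}

@dataclass
class Segment:
    # Rule condition over features, e.g. lambda s: (C1(s) and C2(s)) or C3(s).
    matches: Callable[[TaxpayerState], bool]
    # Remaining action budget for this segment, as produced by the optimizer
    # under the resource constraints, e.g. {"A1": 200, "A2": 50}.
    action_budget: Dict[str, int]

@dataclass
class RecommendedAction:
    tp_id: str
    action: str
    rec_date: date = field(default_factory=date.today)

def rule_processor(states: List[TaxpayerState],
                   segments: List[Segment]) -> List[RecommendedAction]:
    """Assign each taxpayer the first affordable action of its first matching segment."""
    recommendations: List[RecommendedAction] = []
    for state in states:
        for segment in segments:
            if not segment.matches(state):
                continue
            for action, budget in segment.action_budget.items():
                if budget > 0:
                    segment.action_budget[action] -= 1   # consume constrained resource
                    recommendations.append(RecommendedAction(state.tp_id, action))
                    break
            break                                        # only the first matching segment applies
    return recommendations

The step that produces the per-segment budgets (the modeler and optimizer boxes in the figure, operating under the resource constraints) is the part addressed by [Abe+ 10] and is not sketched here.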