Slide 3
What Reinforcement Learning Can Do
... engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).
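As a rough sketch of the training signal described above (not the authors' code), the update below fits a Q-network to the bootstrapped one-step target r + γ·max_a' Q_target(s', a') by stochastic gradient descent on a minibatch of replayed transitions; the names `q_net`, `target_net`, and the batch layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One stochastic-gradient step on a minibatch of replayed transitions.

    batch: tensors (states, actions, rewards, next_states, dones),
    where states/next_states have shape (B, 4, 84, 84).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), cut off at episode end.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.smooth_l1_loss(q_values, target)  # Huber loss to keep gradients bounded
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A separate `target_net`, periodically copied from `q_net`, together with the error-clipping (Huber) loss, are the kind of stabilisation choices the excerpt alludes to.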
We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available [12, 15]. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; see Fig. 3).
[Figure: Schematic illustration of the convolutional neural network. The input to the network is an 84 × 84 × 4 image produced by the preprocessing, followed by three convolutional layers and two fully connected layers with a single output for each valid action; each hidden layer is followed by a rectifier nonlinearity, max(0, x). Details of the architecture are explained in the Methods.]
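A minimal PyTorch sketch of a network shaped like the one in the caption: an 84 × 84 × 4 input, three convolutional layers, two fully connected layers, a single output per valid action, and rectifier nonlinearities. The filter counts (32, 64, 64) and the 512-unit hidden layer follow the published Methods; the class name and the input-scaling step are our assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network: 84x84x4 input, one Q-value output per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9  -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one output per valid action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))  # scale byte-valued frames to [0, 1]
```

With stacked 4-frame 84 × 84 inputs, the three convolutions reduce the feature map to 7 × 7 × 64 before the fully connected layers.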
[Figure: Training curves over 200 training epochs, plotting the average score per episode and the average predicted action value (Q).]
Games [Mnih+ 15]
... sampled from real systems. On the other hand, if the environment is extremely stochastic, a limited amount of previously acquired data might not be able to capture the real environment's properties and could lead to inappropriate policy updates. However, rigid dynamics models, such as a humanoid robot model, do not usually include large stochasticity. Therefore, our approach is suitable for real-robot learning in high-dimensional systems such as humanoid robots.
... performance. We proposed recursively using the off-policy PGPE method to improve the policies, and applied our approach to cart-pole swing-up and basketball-shooting tasks. In the former, we introduced a real-virtual hybrid task environment composed of a motion controller and virtually simulated cart-pole dynamics. By using the hybrid environment, we can potentially design a wide variety of different task environments. Note that complicated arm movements of the humanoid robot need to be learned for the cart-pole swing-up. Furthermore, by using our proposed method, the challenging basketball-shooting task was successfully accomplished.
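For intuition about the policy-improvement step, here is a minimal sketch of a plain (on-policy) PGPE-style update rather than the recursive off-policy variant used in the paper: policy parameters are sampled from a Gaussian search distribution, each sample is scored by a rollout return, and the distribution's mean and standard deviation are moved along the baseline-subtracted log-likelihood gradient. `rollout_return` and the learning rates are placeholders.

```python
import numpy as np

def pgpe_update(mu, sigma, rollout_return, n_samples=20, lr_mu=0.1, lr_sigma=0.05):
    """One PGPE-style update of a Gaussian search distribution over policy parameters.

    mu, sigma: current mean and std of the parameter distribution (1-D arrays).
    rollout_return: callable mapping a parameter vector to an episode return.
    """
    thetas = mu + sigma * np.random.randn(n_samples, mu.size)  # sample parameter vectors
    returns = np.array([rollout_return(theta) for theta in thetas])
    baseline = returns.mean()                                   # simple return baseline

    diff = thetas - mu
    # Log-likelihood gradients of the Gaussian with respect to mu and sigma.
    grad_mu = ((returns - baseline)[:, None] * diff / sigma**2).mean(axis=0)
    grad_sigma = ((returns - baseline)[:, None] * (diff**2 - sigma**2) / sigma**3).mean(axis=0)

    mu = mu + lr_mu * grad_mu
    sigma = np.maximum(sigma + lr_sigma * grad_sigma, 1e-3)     # keep exploration noise positive
    return mu, sigma
```

Because exploration noise is placed on the parameters rather than on individual actions, each rollout executes a deterministic policy, which is what makes this family of methods attractive on physical robots.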
Future work will develop a method based on a transfer learning [28] approach to efficiently reuse the previous experiences acquired in different target tasks.
Figure 13. The humanoid robot CB-i [7]. (Photo courtesy of ATR.)
Robot control [Sugimoto+ 16]
[Figure 2: Overall collections system architecture (IBM Research / Center for Business Optimization). Event notifications from external systems feed an Event Listener and a State Generator, which maintain the current taxpayer state and the taxpayer state history from taxpayer profiles, feature definitions, and the case inventory; a Modeling and Optimization Engine (Modeler and Optimizer) and a Rule Processor, constrained by allocation rules, resource constraints, and business rules, generate recommended actions; a Segment Selector and an Action Handler turn recommendations into new cases, with a Scheduler raising state-time-expired events.]
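As a loose illustration of the event-driven flow in the architecture figure, the sketch below takes a taxpayer state, scores candidate actions, and filters them through business rules before recording a recommendation; every name here (`TaxpayerState`, `recommend_action`, and so on) is hypothetical rather than the system's actual API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Sequence

@dataclass
class TaxpayerState:
    tp_id: str
    features: dict          # e.g. {"Feat1": 5, "Feat2": "A", "FeatN": 1500}
    state_date: date

@dataclass
class Recommendation:
    tp_id: str
    action: str
    rec_date: date

def recommend_action(state: TaxpayerState,
                     candidate_actions: Sequence[str],
                     score: Callable[[TaxpayerState, str], float],
                     allowed: Callable[[TaxpayerState, str], bool]) -> Recommendation:
    """Pick the highest-scoring action that passes the business rules."""
    feasible = [a for a in candidate_actions if allowed(state, a)]
    if not feasible:
        raise ValueError(f"no feasible action for taxpayer {state.tp_id}")
    best = max(feasible, key=lambda a: score(state, a))
    return Recommendation(state.tp_id, best, date.today())
```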
Debt collection optimization [Abe+ 10]
Go [Silver+ 16]