
SchNet: A continuous-filter convolutional neural network for modeling quantum interactions

Kazuki Fujikawa
January 19, 2018


Transcript

  1. DEEP LEARNING JP [DL Papers] http://deeplearning.jp/
     SchNet: A continuous-filter convolutional neural network for modeling quantum interactions
     Kazuki Fujikawa, DeNA
  2. Summary
     • Bibliographic information
       – NIPS 2017
       – Schütt, K., Kindermans, P. J., Felix, H. E. S., Chmiela, S., Tkatchenko, A., and Müller, K. R.
     • Overview
       – A paper on graph convolution
       – The convolution uses inter-node distances rather than the graph connectivity
         • This makes it possible to model interactions with nodes located at arbitrary positions in 3D space
         • It is particularly effective in cases such as:
           – the same graph structure admits different spatial arrangements, and the properties change accordingly
           – graph distance and real-space distance diverge
  3. Outline
     • Background
     • Related work
       – Message Passing Neural Networks and their variants
     • Proposed method
       – Continuous-filter convolutional layer
       – Interaction block
       – SchNet
     • Experiments and results
  4. Outline
     • Background
     • Related work
       – Message Passing Neural Networks and their variants
     • Proposed method
       – Continuous-filter convolutional layer
       – Interaction block
       – SchNet
     • Experiments and results
  5. Background
     • In drug discovery and materials chemistry, molecular properties are key information when searching for optimal molecules
       – Approximations such as DFT (Density Functional Theory) are commonly used
       – Their computational cost is very high, so only a limited part of the search space can be explored
     • A machine learning model that predicts properties quickly and accurately would therefore be useful
       – Using DFT results as training data, predicting properties with machine learning has recently become an active research topic
     (Figure from Gilmer+, ICML2017: DFT takes on the order of 10^3 seconds per molecule, while a Message Passing Neural Network takes about 10^-2 seconds.)
  6. Background
     • Deep learning has advanced in fields such as image recognition and natural language processing
       – Efficient feature extraction with CNNs / RNNs has been a major contributor
       – Images / text can be regarded as regular (grid-like) graphs
     • Applying the same approach directly to molecular graphs is difficult
       – Node degrees are not constant, edges carry attributes, etc.
     (Figures from "機は熟した!グラフ構造に対する Deep Learning、Graph Convolutionのご紹介" (http://tech-blog.abeja.asia/entry/2017/04/27/105613) and Wikipedia (https://ja.wikipedia.org/wiki/酢酸).)
  7. Outline
     • Background
     • Related work
       – Message Passing Neural Networks and their variants
     • Proposed method
       – Continuous-filter convolutional layer
       – Interaction block
       – SchNet
     • Experiments and results
  8. Related work
     • Message Passing Neural Networks (MPNN) [Gilmer+, ICML2017]
       – Gilmer et al. generalized feature extraction methods that are effective on graphs with irregular node degrees
       – In each layer, the feature vector assigned to each node is updated using the feature vectors of its neighboring nodes and edges
       – After L such layers, each node's feature vector reflects the information of the nodes and edges within its L-hop neighborhood
     (Figure from Gilmer+, ICML2017: an MPNN predicts quantum properties of an organic molecule by modeling a computationally expensive DFT calculation; DFT ~10^3 seconds vs. MPNN ~10^-2 seconds.)
  9. Related work
     • Message Passing Neural Networks (MPNN) [Gilmer+, ICML2017]
       – Message passing phase
         • Message function $M_t(h_v^t, h_w^t, e_{vw})$
           – builds the information each node propagates to its neighboring nodes
         • Update function $U_t(h_v^t, m_v^{t+1})$
           – each node receives the messages from its neighbors and updates its own state
       – Formally (Gilmer+, ICML2017):
         $m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw})$, $\quad h_v^{t+1} = U_t(h_v^t, m_v^{t+1})$
         where $N(v)$ denotes the neighbors of $v$ in graph $G$; the readout phase computes a graph-level feature $\hat{y} = R(\{h_v^T \mid v \in G\})$, and $M_t$, $U_t$, $R$ are learned differentiable functions ($R$ must be invariant to permutations of the node states).
     (Diagram: node v aggregates messages from neighbors u1, u2; excerpt from Gilmer+, ICML2017.)
     A minimal sketch of one message-passing step is shown below.
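The following is a minimal NumPy sketch of one MPNN message-passing step with sum aggregation. It is not the authors' code: the graph, the feature sizes, and the one-layer MLPs standing in for $M_t$ and $U_t$ are made up for illustration.

```python
# Hypothetical minimal sketch of one MPNN message-passing step (sum aggregation).
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d_node, d_edge = 3, 8, 4
h = rng.normal(size=(n_nodes, d_node))            # node states h_v^t
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]          # directed edges (v, w)
e = {k: rng.normal(size=d_edge) for k in edges}   # edge features e_vw

W_m = rng.normal(scale=0.1, size=(2 * d_node + d_edge, d_node))  # toy message MLP
W_u = rng.normal(scale=0.1, size=(2 * d_node, d_node))           # toy update MLP

def M(h_v, h_w, e_vw):
    # Message function M_t: here a one-layer MLP over the concatenation
    return np.tanh(np.concatenate([h_v, h_w, e_vw]) @ W_m)

def U(h_v, m_v):
    # Update function U_t: a one-layer MLP over [h_v, m_v]
    return np.tanh(np.concatenate([h_v, m_v]) @ W_u)

m = np.zeros_like(h)
for v, w in edges:                  # m_v^{t+1} = sum_{w in N(v)} M_t(h_v, h_w, e_vw)
    m[v] += M(h[v], h[w], e[(v, w)])
h_next = np.array([U(h[v], m[v]) for v in range(n_nodes)])  # h_v^{t+1} = U_t(h_v, m_v^{t+1})
print(h_next.shape)  # (3, 8)
```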
  10. Related work
     • Message Passing Neural Networks (MPNN) [Gilmer+, ICML2017]
       – Readout phase
         • Readout function: $R(\{h_v^{(T)} \mid v \in G\})$
           – aggregates the node states obtained after the message passing phase into a single representation for the whole graph
  11. Related work
     • CNN for Learning Molecular Fingerprints [Duvenaud+, NIPS2015]
       – Message passing phase
         • Message function: $M_t(h_v^t, h_w^t, e_{vw}) = \mathrm{concat}(h_w^t, e_{vw})$
         • Update function: $U_t(h_v^t, m_v^{t+1}) = \sigma\big(H_t^{\deg(v)} m_v^{t+1}\big)$
           – $H_t^{\deg(v)}$: a weight matrix prepared per step $t$ and per node degree $\deg(v)$
     (Diagram and MPNN equations repeated from Gilmer+, ICML2017; see slide 9.)
     A sketch of these message/update functions follows.
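Below is a hypothetical sketch of the neural-fingerprint message/update pair on a single node with two neighbors; the dimensions, the degree-specific matrices `H_deg`, and the helper names are invented for illustration.

```python
# Hypothetical sketch of the Duvenaud+ (NIPS 2015) step:
# message = concat(h_w, e_vw), update = sigma(H_t^{deg(v)} m_v).
import numpy as np

rng = np.random.default_rng(1)
d_node, d_edge = 8, 4
d_msg = d_node + d_edge
max_deg = 4
# One weight matrix per node degree (and, in the full model, per step t)
H_deg = {deg: rng.normal(scale=0.1, size=(d_msg, d_node)) for deg in range(1, max_deg + 1)}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def message(h_w, e_vw):
    return np.concatenate([h_w, e_vw])             # M_t = concat(h_w, e_vw)

def update(m_v, deg_v):
    return sigmoid(m_v @ H_deg[deg_v])             # U_t = sigma(H_t^{deg(v)} m_v)

# Node v with two neighbors u1, u2 (toy features)
h_u1, h_u2 = rng.normal(size=d_node), rng.normal(size=d_node)
e_vu1, e_vu2 = rng.normal(size=d_edge), rng.normal(size=d_edge)
m_v = message(h_u1, e_vu1) + message(h_u2, e_vu2)  # sum over neighbors
h_v_next = update(m_v, deg_v=2)
print(h_v_next.shape)  # (8,)
```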
  12. Related work
     • CNN for Learning Molecular Fingerprints [Duvenaud+, NIPS2015]
       – Readout phase
         • Readout function: $R(\{h_v^T \mid v \in G\}) = f\big(\sum_{v,t} \mathrm{softmax}(W_t h_v^t)\big)$
  13. Related work
     • Gated Graph Neural Networks (GG-NN) [Li+, ICLR2016]
       – Message passing phase
         • Message function: $M_t(h_v^t, h_w^t, e_{vw}) = A_{e_{vw}} h_w^t$
           – $A_{e_{vw}}$: a weight matrix defined per edge type (single bond, double bond, etc.)
         • Update function: $U_t(h_v^t, m_v^{t+1}) = \mathrm{GRU}(h_v^t, m_v^{t+1})$
     (Diagram and MPNN equations repeated from Gilmer+, ICML2017; see slide 9.)
     A sketch is shown below.
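A sketch of the GG-NN step under toy assumptions follows: per-edge-type matrices $A_{e_{vw}}$ for the messages and a hand-rolled minimal GRU cell for the update (one common gate convention; not the authors' implementation).

```python
# Hypothetical sketch of a GG-NN step: per-edge-type matrices A_e for messages,
# a minimal GRU cell for the update. Toy dimensions, random weights.
import numpy as np

rng = np.random.default_rng(2)
d = 8
edge_types = ["single", "double"]
A = {t: rng.normal(scale=0.1, size=(d, d)) for t in edge_types}   # A_{e_vw}

# Minimal GRU cell: h' = (1 - z) * h + z * tanh(W m + U (r * h))
Wz, Uz = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))
Wr, Ur = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))
Wh, Uh = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(h, m):
    z = sigmoid(Wz @ m + Uz @ h)
    r = sigmoid(Wr @ m + Ur @ h)
    h_tilde = np.tanh(Wh @ m + Uh @ (r * h))
    return (1.0 - z) * h + z * h_tilde

h_v = rng.normal(size=d)
neighbors = [(rng.normal(size=d), "single"), (rng.normal(size=d), "double")]
m_v = sum(A[etype] @ h_w for h_w, etype in neighbors)   # M_t = A_{e_vw} h_w, summed
h_v_next = gru(h_v, m_v)                                # U_t = GRU(h_v, m_v)
print(h_v_next.shape)  # (8,)
```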
  14. Related work
     • Gated Graph Neural Networks (GG-NN) [Li+, ICLR2016]
       – Readout phase
         • Readout function:
           $R(\{h_v^T \mid v \in G\}) = \tanh\big(\sum_v \sigma\big(i(h_v^T, h_v^0)\big) \odot \tanh\big(j(h_v^T, h_v^0)\big)\big)$
           – $i, j$: neural networks; $\sigma\big(i(h_v^T, h_v^0)\big)$ plays the role of a soft attention over nodes
  15. Related work
     • Deep Tensor Neural Networks (DTNN) [Schütt+, Nature Communications 2017]
       – Message passing phase
         • Message function: $M_t(h_v^t, h_w^t, e_{vw}) = \tanh\big(W^{fc}\big((W^{cf} h_w^t + b_1) \odot (W^{df} e_{vw} + b_2)\big)\big)$
           – $W^{fc}, W^{cf}, W^{df}$: shared weight matrices; $b_1, b_2$: bias terms
         • Update function: $U_t(h_v^t, m_v^{t+1}) = h_v^t + m_v^{t+1}$
     (Diagram and MPNN equations repeated from Gilmer+, ICML2017; see slide 9.)
     A sketch follows.
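The sketch below instantiates the DTNN message and residual update with toy dimensions and randomly initialized shared weights; it is illustrative only.

```python
# Hypothetical sketch of the DTNN message/update:
# M = tanh(W_fc ((W_cf h_w + b1) * (W_df e_vw + b2))), U = h_v + m_v.
import numpy as np

rng = np.random.default_rng(3)
d_node, d_edge, d_fac = 8, 4, 16
W_cf = rng.normal(scale=0.1, size=(d_fac, d_node))   # node -> factor space
W_df = rng.normal(scale=0.1, size=(d_fac, d_edge))   # edge (distance) -> factor space
W_fc = rng.normal(scale=0.1, size=(d_node, d_fac))   # factor -> node space
b1, b2 = np.zeros(d_fac), np.zeros(d_fac)

def message(h_w, e_vw):
    return np.tanh(W_fc @ ((W_cf @ h_w + b1) * (W_df @ e_vw + b2)))

h_v = rng.normal(size=d_node)
h_u1, h_u2 = rng.normal(size=d_node), rng.normal(size=d_node)
e_vu1, e_vu2 = rng.normal(size=d_edge), rng.normal(size=d_edge)

m_v = message(h_u1, e_vu1) + message(h_u2, e_vu2)    # sum over neighbors
h_v_next = h_v + m_v                                  # residual update
print(h_v_next.shape)  # (8,)
```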
  16. Related work
     • Deep Tensor Neural Networks (DTNN) [Schütt+, Nature Communications 2017]
       – Readout phase
         • Readout function: $R(\{h_v^T \mid v \in G\}) = \sum_v \mathrm{NN}(h_v^T)$
  17. Related work
     • Edge Network + Set2Set (enn-s2s) [Gilmer+, ICML2017]
       – Message passing phase
         • Message function: $M_t(h_v^t, h_w^t, e_{vw}) = A(e_{vw}) h_w^t$
           – $A(e_{vw})$: a neural network that maps the edge vector $e_{vw}$ to a matrix
         • Update function: $U_t(h_v^t, m_v^{t+1}) = \mathrm{GRU}(h_v^t, m_v^{t+1})$
           – same as GG-NN [Li+, ICLR2016]
     (Diagram and MPNN equations repeated from Gilmer+, ICML2017; see slide 9.)
     A sketch of the edge network follows.
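A small illustrative sketch of the edge network follows: a two-layer MLP (hypothetical sizes and weights) maps the edge feature vector to a $d \times d$ matrix that multiplies the neighbor state.

```python
# Hypothetical sketch of the enn-s2s "edge network": an MLP maps the edge feature
# vector e_vw to a d x d matrix applied to the neighbor state.
import numpy as np

rng = np.random.default_rng(4)
d_node, d_edge, d_hidden = 8, 4, 16
W1 = rng.normal(scale=0.1, size=(d_hidden, d_edge))
W2 = rng.normal(scale=0.1, size=(d_node * d_node, d_hidden))

def edge_network(e_vw):
    # A(e_vw): two-layer MLP producing a (d_node x d_node) matrix
    hidden = np.tanh(W1 @ e_vw)
    return (W2 @ hidden).reshape(d_node, d_node)

h_w = rng.normal(size=d_node)
e_vw = rng.normal(size=d_edge)
msg = edge_network(e_vw) @ h_w      # M_t = A(e_vw) h_w; the update is a GRU as in GG-NN
print(msg.shape)  # (8,)
```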
  18. Related work
     • Edge Network + Set2Set (enn-s2s) [Gilmer+, ICML2017]
       – Readout phase
         • Readout function: $R(\{h_v^T \mid v \in G\}) = \mathrm{set2set}(\{h_v^T \mid v \in G\})$
           – the vector $q_t^*$ produced by set2set [Vinyals+, ICLR2016] is fed to a subsequent NN
           – the paper also adds other tweaks, e.g., to how the input features are constructed
     (Excerpt from Vinyals+, ICLR2016, Read-Process-and-Write model: $q_t = \mathrm{LSTM}(q_{t-1}^*)$, $e_{i,t} = f(m_i, q_t)$, $a_{i,t} = \exp(e_{i,t}) / \sum_j \exp(e_{j,t})$, $r_t = \sum_i a_{i,t} m_i$, $q_t^* = [q_t\; r_t]$, where the $m_i$ are memory vectors, $f$ computes a scalar (e.g., a dot product), and the LSTM takes no inputs, giving an order-invariant aggregation.)
     A sketch of the set2set process block follows.
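Below is a rough sketch of the set2set process block. It follows the common implementation convention in which $q^*_{t-1}$ is fed as the input of an LSTM cell whose hidden state is $q_t$; the memories and all weights are toy values, and the scoring function $f$ is a plain dot product.

```python
# Hypothetical sketch of the set2set process block (Vinyals+, ICLR 2016).
import numpy as np

rng = np.random.default_rng(5)
d, T = 8, 3
M = rng.normal(size=(5, d))                      # memories m_i (e.g., final node states)

Wx = rng.normal(scale=0.1, size=(4 * d, 2 * d))  # LSTM input weights (input = q*)
Wh = rng.normal(scale=0.1, size=(4 * d, d))      # LSTM recurrent weights
b = np.zeros(4 * d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c):
    i, f, o, g = np.split(Wx @ x + Wh @ h + b, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c_new), c_new

q_star = np.zeros(2 * d)                         # q*_0
q, c = np.zeros(d), np.zeros(d)
for t in range(T):
    q, c = lstm_cell(q_star, q, c)               # q_t = LSTM(q*_{t-1})
    e = M @ q                                    # e_{i,t} = m_i . q_t (dot product)
    a = np.exp(e - e.max()); a /= a.sum()        # a_{i,t} = softmax_i(e_{i,t})
    r = a @ M                                    # r_t = sum_i a_{i,t} m_i
    q_star = np.concatenate([q, r])              # q*_t = [q_t ; r_t]
print(q_star.shape)  # (2d,) order-invariant readout for the whole graph
```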
  19. Outline
     • Background
     • Related work
       – Message Passing Neural Networks and their variants
     • Proposed method
       – Continuous-filter convolutional layer
       – Interaction block
       – SchNet
     • Experiments and results
  20. Proposed method: Continuous-filter convolution (cfconv)
     • A filter that weights neighbor contributions by the distance between nodes
       – the distances to emphasize are learned from data
     (Figure 2 (right) of the paper: the cfconv layer with its filter-generating network, which takes $d_{ij} = \lVert \mathbf{r}_i - \mathbf{r}_j \rVert$ as input; Figure 3: 10x10 Å cuts through the 64 radial filters of each interaction block of SchNet trained on ethanol MD. To keep the energy model rotationally invariant, the interatomic distance is expanded with radial basis functions $e_k(\mathbf{r}_i - \mathbf{r}_j) = \exp(-\gamma \lVert d_{ij} - \mu_k \rVert^2)$ with centers $0\,\text{Å} \le \mu_k \le 30\,\text{Å}$ spaced every 0.1 Å and $\gamma = 10\,\text{Å}$; the expanded distances are fed into two dense layers with softplus activations to produce the filter weights $W(\mathbf{r}_i - \mathbf{r}_j)$.)
  21. Proposed method: Continuous-filter convolution (cfconv)
     • A filter that weights neighbor contributions by the distance between nodes
       – the distances to emphasize are learned from data
     (Same figure and paper excerpt as slide 20.)
     • Filter-generating networks
       – 300 RBF kernels are prepared with $\mu_1 = 0.1\,\text{Å}, \mu_2 = 0.2\,\text{Å}, \ldots, \mu_{300} = 30\,\text{Å}$ and $\gamma = 10\,\text{Å}$
       – the kernel whose center $\mu_k$ is closest to $d_{ij}$ approaches 1, and kernels further away approach 0 (a soft one-hot representation of the distance)
     A sketch of the RBF expansion follows.
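A minimal sketch of the RBF expansion with the hyperparameters quoted on the slide (300 centers from 0.1 Å to 30 Å, $\gamma = 10$) follows; it is not the authors' code.

```python
# Hypothetical sketch of the radial-basis expansion of an interatomic distance.
import numpy as np

mu = np.arange(0.1, 30.0 + 1e-9, 0.1)      # 300 centers mu_k (in Angstrom)
gamma = 10.0                               # gamma value as quoted on the slide

def rbf_expand(d_ij):
    # e_k(r_i - r_j) = exp(-gamma * (d_ij - mu_k)^2): a soft one-hot code of the distance
    return np.exp(-gamma * (d_ij - mu) ** 2)

d_ij = 1.09                                # e.g., roughly a C-H bond length in Angstrom
e = rbf_expand(d_ij)
print(e.shape, mu[e.argmax()])             # (300,) and the center closest to d_ij (~1.1)
```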
  22. Proposed method: Continuous-filter convolution (cfconv)
     • A filter that weights neighbor contributions by the distance between nodes
       – the distances to emphasize are learned from data
     (Same figure and paper excerpt as slide 20.)
     • Filter-generating networks
       – the output vector of the filter network is multiplied element-wise with the node's embedding vector
       – each unit's activation thus filters the node's embedding vector
     A cfconv sketch follows.
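Putting the pieces together, here is a hypothetical sketch of one cfconv layer: the filter-generating network (two dense layers with shifted softplus over the RBF-expanded distance) produces a filter that is multiplied element-wise with each neighbor's features before summation. All sizes and weights are toy values.

```python
# Hypothetical sketch of one continuous-filter convolution (cfconv).
import numpy as np

rng = np.random.default_rng(6)
n_atoms, d_feat, n_rbf = 4, 8, 300
X = rng.normal(size=(n_atoms, d_feat))          # atom-wise features x_j
R = rng.normal(size=(n_atoms, 3))               # 3D positions r_j

mu, gamma = np.arange(0.1, 30.0 + 1e-9, 0.1), 10.0
W1 = rng.normal(scale=0.1, size=(n_rbf, d_feat))
W2 = rng.normal(scale=0.1, size=(d_feat, d_feat))

def ssp(x):
    return np.log(0.5 * np.exp(x) + 0.5)        # shifted softplus used in SchNet

def filter_net(d_ij):
    e = np.exp(-gamma * (d_ij - mu) ** 2)       # RBF expansion of the distance
    return ssp(ssp(e @ W1) @ W2)                # two dense layers -> W(r_i - r_j)

def cfconv(i):
    # x_i' = sum_j x_j * W(r_i - r_j): element-wise product with the generated filter
    out = np.zeros(d_feat)
    for j in range(n_atoms):                    # sum over all atoms j
        out += X[j] * filter_net(np.linalg.norm(R[i] - R[j]))
    return out

print(cfconv(0).shape)  # (8,)
```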
  23. Proposed method: Continuous-filter convolution (cfconv)
     • A filter that weights neighbor contributions by the inter-node distance
       – the distances to emphasize are learned from data
     (Same figure and paper excerpt as slide 20.)
  24. Proposed method: Interaction block
     • A message passing layer that contains the cfconv layer
       – the cfconv layer updates each node's feature vector while taking the interactions between nodes into account
       – interactions can be represented without any restriction on the node distance (a difference from DTNN and others)
     (Figure 2 of the paper: architectural overview (left), interaction block (middle), cfconv with filter-generating network (right); Figure 3: 10x10 Å cuts through the 64 radial filters of each interaction block trained on ethanol MD.)
     A sketch of an interaction block follows.
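The sketch below wraps a cfconv function in one interaction block, following the layer ordering summarized from Figure 2 (atom-wise dense, cfconv, atom-wise dense + shifted softplus, atom-wise dense, residual addition); weights and sizes are illustrative.

```python
# Hypothetical sketch of one SchNet-style interaction block built around a cfconv.
import numpy as np

rng = np.random.default_rng(7)
d_feat = 8
Wa, Wb, Wc = (rng.normal(scale=0.1, size=(d_feat, d_feat)) for _ in range(3))

def ssp(x):
    return np.log(0.5 * np.exp(x) + 0.5)

def interaction_block(X, cfconv_fn):
    # X: (n_atoms, d_feat); cfconv_fn(X) applies the distance-based filtering to all atoms
    V = X @ Wa                    # atom-wise dense layer
    V = cfconv_fn(V)              # continuous-filter convolution
    V = ssp(V @ Wb)               # atom-wise dense + shifted softplus
    V = V @ Wc                    # atom-wise dense
    return X + V                  # residual update of the atom representations

# Usage with a stand-in cfconv (identity) just to show the data flow
X = rng.normal(size=(4, d_feat))
print(interaction_block(X, lambda V: V).shape)  # (4, 8)
```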
  25. Proposed method: SchNet
     • Interaction and atom-wise layers are stacked, and in the end each atom outputs a single scalar value
     • The per-atom scalars are summed over all atoms to obtain the prediction for the whole molecule
     (Figure 2 of the paper: illustration of SchNet with an architectural overview (left), the interaction block (middle) and the continuous-filter convolution with filter-generating network (right). The shifted softplus is defined as $\mathrm{ssp}(x) = \ln(0.5 e^x + 0.5)$.)
     A sketch of the output head follows.
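A minimal sketch of the output head described on this slide: an atom-wise network maps each atom representation to a scalar, and the molecular prediction is the sum over atoms. Weights are toy values.

```python
# Hypothetical sketch of the sum-over-atoms output head.
import numpy as np

rng = np.random.default_rng(8)
d_feat = 8
W1 = rng.normal(scale=0.1, size=(d_feat, d_feat // 2))
W2 = rng.normal(scale=0.1, size=(d_feat // 2, 1))

def ssp(x):
    return np.log(0.5 * np.exp(x) + 0.5)

def predict_energy(X):
    # X: (n_atoms, d_feat) final atom representations
    per_atom = ssp(X @ W1) @ W2        # atom-wise layers -> one scalar per atom
    return per_atom.sum()              # E_hat = sum of atomic contributions

X = rng.normal(size=(5, d_feat))
print(float(predict_energy(X)))
```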
  26. Proposed method: Loss
     • The loss function is defined as the sum of two terms:
       $\ell\big(\hat{E}, (E, F_1, \ldots, F_n)\big) = \lVert E - \hat{E} \rVert^2 + \frac{\rho}{n} \sum_{i=0}^{n} \Big\lVert F_i - \Big(-\frac{\partial \hat{E}}{\partial R_i}\Big) \Big\rVert^2$
       • the first term is the squared error of the energy prediction
       • the second term is the squared error of the interatomic force prediction, computed per atom and summed
       • $\rho$: a hyperparameter controlling how strongly the forces are weighted ($\rho = 0$ for pure energy training, $\rho = 100$ for combined energy and force training in the paper)
     • The predicted forces are obtained by differentiating the energy model with respect to the atom positions [Chmiela+, 2017]:
       $\hat{F}_i(Z_1, \ldots, Z_n, \mathbf{r}_1, \ldots, \mathbf{r}_n) = -\frac{\partial \hat{E}}{\partial \mathbf{r}_i}(Z_1, \ldots, Z_n, \mathbf{r}_1, \ldots, \mathbf{r}_n)$
       – this yields an energy-conserving force field by construction; since SchNet's energy is rotationally invariant, the force predictions are rotationally equivariant
     A sketch of the combined loss follows.
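For concreteness, a small numerical sketch of the combined loss is given below, assuming the predicted forces $\hat{F}_i = -\partial\hat{E}/\partial R_i$ have already been obtained (in practice via automatic differentiation); all numbers are made up.

```python
# Hypothetical numerical sketch of the combined energy/force loss.
import numpy as np

def schnet_loss(E, F, E_hat, F_hat, rho=100.0):
    # E, E_hat: reference and predicted energies; F, F_hat: (n_atoms, 3) forces
    n = F.shape[0]
    energy_term = (E - E_hat) ** 2
    force_term = (rho / n) * np.sum((F - F_hat) ** 2)
    return energy_term + force_term

# Toy values: 3 atoms, made-up energies in kcal/mol
rng = np.random.default_rng(9)
E, E_hat = -97.2, -97.0
F = rng.normal(size=(3, 3))                  # reference forces
F_hat = F + 0.01 * rng.normal(size=(3, 3))   # predicted forces
print(schnet_loss(E, F, E_hat, F_hat))
```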
  27. Outline
     • Background
     • Related work
       – Message Passing Neural Networks and their variants
     • Proposed method
       – Continuous-filter convolutional layer
       – Interaction block
       – SchNet
     • Experiments and results
  28. Experiments: QM9
     • A data set containing 17 DFT-computed properties of molecules
       – only one property is predicted here: U0, the total energy of the molecule at absolute zero
       – at equilibrium the interatomic forces are zero, so forces do not need to be predicted
     • Baselines
       – DTNN [Schütt+, Nature Communications 2017], enn-s2s [Gilmer+, ICML2017], enn-s2s-ens5 (an ensemble of enn-s2s)
     • Results
       – SchNet consistently achieved state-of-the-art results
       – with 110k training examples, the mean absolute error was 0.31 kcal/mol

       Table 1: Mean absolute errors for energy predictions in kcal/mol on the QM9 data set with given training set size N.
         N          SchNet   DTNN [18]   enn-s2s [19]   enn-s2s-ens5 [19]
         50,000     0.59     0.94        –              –
         100,000    0.34     0.84        –              –
         110,462    0.31     –           0.45           0.33
  29. Experiments: MD17
     • A data set of molecular dynamics (MD) simulations
       – trajectory data for a single molecule (benzene, etc.)
         • data are collected for 8 molecules, each trained as a separate task
         • even for the same molecule, positions, energies, and interatomic forces differ between samples
       – the molecular energy and the interatomic forces are predicted and evaluated with mean absolute error
     • Baselines
       – DTNN [Schütt+, Nature Communications 2017], GDML [Chmiela+, 2017]
     • Results
       – N = 1,000
         • GDML was better on most tasks
         • GDML is a kernel-regression-based model whose cost grows quadratically with the number of samples and the number of atoms in the molecule, so it could not be trained with N = 50,000
       – N = 50,000
         • SchNet outperforms DTNN on most tasks
         • SchNet scales much better (than GDML), and its accuracy improves as the amount of data grows

       Table 2: Mean absolute errors for energy and force predictions in kcal/mol and kcal/mol/Å, respectively, for models trained on 1,000 and 50,000 MD examples of small organic molecules. SchNet was trained on energies only ("energy") or on energies and forces ("both").
                                      N = 1,000                      N = 50,000
                               GDML [17]   SchNet   SchNet    DTNN [18]   SchNet   SchNet
                               (forces)    (energy) (both)    (energy)    (energy) (both)
       Benzene        energy     0.07        1.19     0.08       0.04       0.08     0.07
                      forces     0.23       14.12     0.31        –         1.23     0.17
       Toluene        energy     0.12        2.95     0.12       0.18       0.16     0.09
                      forces     0.24       22.31     0.57        –         1.79     0.09
       Malonaldehyde  energy     0.16        2.03     0.13       0.19       0.13     0.08
                      forces     0.80       20.41     0.66        –         1.51     0.08
       Salicylic acid energy     0.12        3.27     0.20       0.41       0.25     0.10
                      forces     0.28       23.21     0.85        –         3.72     0.19
       Aspirin        energy     0.27        4.20     0.37        –         0.25     0.12
                      forces     0.99       23.54     1.35        –         7.36     0.33
       Ethanol        energy     0.15        0.93     0.08        –         0.07     0.05
                      forces     0.79        6.56     0.39        –         0.76     0.05
       Uracil         energy     0.11        2.26     0.14        –         0.13     0.10
                      forces     0.24       20.08     0.56        –         3.28     0.11
       Naphtalene     energy     0.12        3.58     0.16        –         0.20     0.11
                      forces     0.23       25.36     0.58        –         2.58     0.11
  30. Experiments: ISO17
     • A data set of molecular dynamics (MD) simulations
       – trajectory data for 129 isomers of C7O2H10
         • unlike MD17, data from different molecules are included in the same task
       – two tasks are prepared:
         • known molecules / unknown conformation: the test data contain known molecules in unseen conformations
         • unknown molecules / unknown conformation: the test data contain unseen molecules in unseen conformations
       – baseline
         • mean predictor (the per-molecule mean of the training data?)
     • Results
       – known molecules / unknown conformation
         • training with energy + forces reaches an accuracy comparable to that on QM9
       – unknown molecules / unknown conformation
         • energy + forces outperformed energy-only training
           – adding forces to the training objective does not simply fit a single molecule; it generalizes across chemical compound space
         • there is still a clear gap compared with known molecules, so further improvement is needed

       Table 3: Mean absolute errors on C7O2H10 isomers in kcal/mol.
                                                          mean predictor   SchNet (energy)   SchNet (energy+forces)
       known molecules / unknown conformation    energy       14.89             0.52                0.36
                                                 forces       19.56             4.13                1.00
       unknown molecules / unknown conformation  energy       15.54             3.11                2.40
                                                 forces       19.15             5.71                2.18
  31. Summary
     • Proposed the continuous-filter convolutional (cfconv) layer
       – a feature extraction layer that is effective for graphs with irregular inter-node distances, such as the atoms in a molecule
     • Proposed SchNet
       – using cfconv layers, it models the interactions of atoms located at arbitrary positions in 3D space
     • Proposed ISO17, a benchmark data set
     • Experiments on QM9, MD17, and ISO17 confirmed the effectiveness of the method
       – for energy prediction on non-equilibrium molecules, adding force prediction to the training objective improved performance
       – achieving robust training that predicts well for non-equilibrium and unseen molecules remains future work
  32. References
     • SchNet
       – Schütt, Kristof T., et al. "SchNet: A continuous-filter convolutional neural network for modeling quantum interactions." Advances in Neural Information Processing Systems. 2017.
     • MPNN variants
       – Gilmer, Justin, et al. "Neural message passing for quantum chemistry." Proceedings of the 34th International Conference on Machine Learning, pages 1263–1272, 2017.
       – Duvenaud, David K., et al. "Convolutional networks on graphs for learning molecular fingerprints." Advances in Neural Information Processing Systems. 2015.
       – Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. "Gated graph sequence neural networks." ICLR, 2016.
       – Schütt, Kristof T., et al. "Quantum-chemical insights from deep tensor neural networks." Nature Communications 8 (2017): 13890.
     • Others
       – Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to sequence for sets." ICLR, 2016.
       – Chmiela, S., Tkatchenko, A., Sauceda, H. E., Poltavsky, I., Schütt, K. T., and Müller, K. R. "Machine learning of accurate energy-conserving molecular force fields." Science Advances, 3(5), e1603015, 2017.