Slide 15
Related Work
• Deep Tensor Neural Networks (DTNN) [Schütt+, Nature Communications 2017]
– Message passing phase
• Message function: M_t(h_v^t, h_w^t, e_vw) = tanh(W^fc ((W^cf h_w^t + b_1) ⊙ (W^df e_vw + b_2)))
– W^fc, W^cf, W^df: shared weight matrices; b_1, b_2: bias terms
• Update function: U_t(h_v^t, m_v^(t+1)) = h_v^t + m_v^(t+1)
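As a concrete reading of the message and update functions above, here is a minimal NumPy sketch of one DTNN message-passing step. The dimensions, random initialization, and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_e, d_f = 8, 4, 16   # hidden, edge, and factor sizes (assumed for illustration)

W_cf = 0.1 * rng.normal(size=(d_f, d_h))  # W^cf: projects the neighbor state h_w
W_df = 0.1 * rng.normal(size=(d_f, d_e))  # W^df: projects the edge feature e_vw
W_fc = 0.1 * rng.normal(size=(d_h, d_f))  # W^fc: maps the factor back to hidden size
b_1, b_2 = np.zeros(d_f), np.zeros(d_f)

def message(h_w, e_vw):
    """M_t: tanh(W^fc ((W^cf h_w + b_1) ⊙ (W^df e_vw + b_2))); note it does not use h_v."""
    return np.tanh(W_fc @ ((W_cf @ h_w + b_1) * (W_df @ e_vw + b_2)))

def update(h_v, m_v):
    """U_t: plain residual update, h_v^(t+1) = h_v^t + m_v^(t+1)."""
    return h_v + m_v

# One step at node v with two neighbors u1 and u2, as in the figure below:
h_v, h_u1, h_u2 = (rng.normal(size=d_h) for _ in range(3))
e_vu1, e_vu2 = (rng.normal(size=d_e) for _ in range(2))
m_v = message(h_u1, e_vu1) + message(h_u2, e_vu2)  # Σ over the neighbors of v
h_v_next = update(h_v, m_v)                        # new state h_v^(t+1)
```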
[Figure: one message-passing step at node v with neighbors u1 and u2. Starting from initial states h(0), the outputs of the message functions M_t(h_v^t, h_u1^t, e_vu1) and M_t(h_v^t, h_u2^t, e_vu2) are summed (Σ) into m_v^(t+1), which the update function U_t(h_v^t, m_v^(t+1)) uses to produce the new state of v.]
From "Neural Message Passing for Quantum Chemistry" [Gilmer+, ICML 2017]:

[...] time steps and is defined in terms of message functions M_t and vertex update functions U_t. During the message passing phase, hidden states h_v^t at each node in the graph are updated based on messages m_v^(t+1) according to

m_v^(t+1) = Σ_{w ∈ N(v)} M_t(h_v^t, h_w^t, e_vw)   (1)
h_v^(t+1) = U_t(h_v^t, m_v^(t+1))   (2)

where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to

ŷ = R({h_v^T | v ∈ G}).   (3)

The message functions M_t, vertex update functions U_t, and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function M_t, vertex update function U_t, and readout function R used. Note one could also learn edge features in an MPNN by introducing hidden states for all edges in the graph, h_evw^t, and updating them analogously to equations 1 and 2. Of the existing MPNNs, only Kearnes et al. (2016) has used this idea.
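To make the two phases concrete, the following is a minimal sketch of a generic MPNN forward pass following equations (1)-(3). The graph representation (dicts of node states, neighbor lists, and edge features) and the function signatures are illustrative assumptions rather than anything prescribed by the paper.

```python
def mpnn_forward(h0, neighbors, edge_feats, M, U, R, T):
    """h0: node -> initial state; neighbors: node -> list of adjacent nodes;
    edge_feats: (v, w) -> edge feature; M, U, R: message, update, readout; T: steps."""
    h = dict(h0)
    for _ in range(T):
        # Message passing phase: eq. (1), then eq. (2).
        m = {v: sum(M(h[v], h[w], edge_feats[(v, w)]) for w in neighbors[v])
             for v in h}
        h = {v: U(h[v], m[v]) for v in h}
    # Readout phase, eq. (3): R must be invariant to the order of the node states.
    return R(list(h.values()))
```

Plugging the DTNN message and update functions from the earlier sketch into this loop gives the DTNN instance of the framework; only the choice of M, U, and R differs between the models surveyed below.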
[...] the Gated Recurrent Unit introduced in Cho et al. (2014). This work used weight tying, so the same update function is used at each time step t. Finally,

R = Σ_{v ∈ V} σ(i(h_v^(T), h_v^0)) ⊙ j(h_v^(T))   (4)

where i and j are neural networks, and ⊙ denotes element-wise multiplication.
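A minimal sketch of this gated readout, assuming i and j are small learned networks (approximated here by fixed random linear maps) and σ is the logistic sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_out = 8, 3                              # assumed sizes
W_i = 0.1 * rng.normal(size=(d_out, 2 * d_h))  # stand-in for the network i
W_j = 0.1 * rng.normal(size=(d_out, d_h))      # stand-in for the network j

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def readout(h_T, h_0):
    """R = Σ_v σ(i(h_v^(T), h_v^0)) ⊙ j(h_v^(T)), as in eq. (4)."""
    return sum(sigmoid(W_i @ np.concatenate([hT, h0])) * (W_j @ hT)
               for hT, h0 in zip(h_T, h_0))
```

The sigmoid factor can be read as a soft gate that decides how much each node contributes to the graph-level output.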
Interaction Networks, Battaglia et al. (2016)

This work considered both the case where there is a target at each node in the graph, and where there is a graph level target. It also considered the case where there are node level effects applied at each time step, in such a case the update function takes as input the concatenation (h_v, x_v, m_v) where x_v is an external vector representing some outside influence on the vertex v. The message function M(h_v, h_w, e_vw) is a neural network which takes the concatenation (h_v, h_w, e_vw). The vertex update function U(h_v, x_v, m_v) is a neural network which takes as input the concatenation (h_v, x_v, m_v). Finally, in the case where there is a graph level output, R = f(Σ_{v ∈ G} h_v^T) where f is a neural network which takes the sum of the final hidden states h_v^T. Note the original work only defined the model for T = 1.
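A sketch of these definitions under simplifying assumptions: the message and update networks are approximated by single linear layers with a ReLU over the stated concatenations, and the graph-level readout applies a linear f to the sum of the final hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, d_e, d_out = 8, 2, 4, 1                    # assumed sizes
W_m = 0.1 * rng.normal(size=(d_h, 2 * d_h + d_e))    # message net over (h_v, h_w, e_vw)
W_u = 0.1 * rng.normal(size=(d_h, d_h + d_x + d_h))  # update net over (h_v, x_v, m_v)
W_f = 0.1 * rng.normal(size=(d_out, d_h))            # readout net f

def relu(z):
    return np.maximum(z, 0.0)

def M(h_v, h_w, e_vw):
    return relu(W_m @ np.concatenate([h_v, h_w, e_vw]))

def U(h_v, x_v, m_v):
    return relu(W_u @ np.concatenate([h_v, x_v, m_v]))

def R(final_states):
    return W_f @ sum(final_states)   # f applied to Σ_v h_v^T
```

Because U here also takes the external input x_v, the generic loop sketched earlier would need a small extension to pass x_v through at each step.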
Molecular Graph Convolutions, Kearnes et al. (2016) [...]