Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network

Slide 1

Slide 1 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. AI System Dept. System & Design Management Unit Kazuki Fujikawa 21 0 0 2 0 0 2 @ .2 2 82 2 2 @ 7 66 5 45 - 5 4 4 - 6- 4 - 4 6- 6 5 8-6 8 -5 -/ 4 / 68 4. - 2 2 21 2 @ 7

Slide 9

Slide 9 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Message Function: !" (ℎ% " , ℎ'( " , )%'( ) Σ Message Function: !" (ℎ% " , ℎ'( " , )%'( ) Neural Message Passing for time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function M , vertex update function U , and readout func- n + : , : : 1 C 5 2 l icN ]E S eF[Uh • , : l , : !" ℎ% " , ℎ' " , )%' = ,-.,/0(ℎ' " , )%' ) l 0 5 : 1" ℎ% " , 2% "34 = 5(6 " 789 (%)2% "34) • 6 " 789 (%) 0D ; N ISda deg (;) MNfg L P v u1 u2 h(0) v h(0) u1 h(0) u2 Neural Message Passing for Quantum Chemistry time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function Mt , vertex update function Ut , and readout function R used. Note one could also learn edge features in an MPNN by introducing hidden states for all edges in the graph ht evw and updating them analogously to equations 1 and 2. Of the existing MPNNs, only Kearnes et al. (2016) Recurrent Unit introduced in Cho et al. (2014). This work used weight tying, so the same update function is used at each time step t. Finally, R = X v2V ⇣ i(h(T ) v , h0 v ) ⌘ ⇣ j(h(T ) v ) ⌘ (4) where i and j are neural networks, and denotes element- wise multiplication. Interaction Networks, Battaglia et al. (2016) This work considered both the case where there is a target at each node in the graph, and where there is a graph level target. It also considered the case where there are node level effects applied at each time step, in such a case the update function takes as input the concatenation (hv , xv , mv) where xv is an external vector representing some outside influence on the vertex v. The message function M(hv , hw , evw) is a neural network which takes the concatenation (hv , hw , evw). The vertex update function U(hv , xv , mv) is a neural network which takes as input the concatenation (hv , xv , mv). Finally, in the case where there is a graph level output, R = f( P v2G hT v ) where f is a neural network which takes the sum of the final hidden states hT v . Note the original work only defined the model for T = 1. Molecular Graph Convolutions, Kearnes et al. (2016) )%'( )%'? Update Function: 1" (ℎ% " , 2% "34)

Slide 11

Slide 11 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n , , 0 0 F C ,, 00 . - .1 ( 2 + 6 M ,12U] h iU g • CC : CC : C CC : 6 ) !" ℎ$ " , ℎ& " , '$& = )*& ℎ& " • )*& ) Rde fa G a G 6 I [cL N 2 6 ) +" ℎ$ " , ,$ "-. = /0+ ℎ$ " , ,$ "-. Message Function: !" (ℎ$ " , ℎ&2 " , '$&2 ) Σ Message Function: !" (ℎ$ " , ℎ&2 " , '$&2 ) Neural Message Passing for time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function M , vertex update function U , and readout func- v u1 u2 h(0) v h(0) u1 h(0) u2 Neural Message Passing for Quantum Chemistry time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function Mt , vertex update function Ut , and readout function R used. Note one could also learn edge features in an MPNN by introducing hidden states for all edges in the graph ht evw and updating them analogously to equations 1 and 2. Of the existing MPNNs, only Kearnes et al. (2016) Recurrent Unit introduced in Cho et al. (2014). This work used weight tying, so the same update function is used at each time step t. Finally, R = X v2V ⇣ i(h(T ) v , h0 v ) ⌘ ⇣ j(h(T ) v ) ⌘ (4) where i and j are neural networks, and denotes element- wise multiplication. Interaction Networks, Battaglia et al. (2016) This work considered both the case where there is a target at each node in the graph, and where there is a graph level target. It also considered the case where there are node level effects applied at each time step, in such a case the update function takes as input the concatenation (hv , xv , mv) where xv is an external vector representing some outside influence on the vertex v. The message function M(hv , hw , evw) is a neural network which takes the concatenation (hv , hw , evw). The vertex update function U(hv , xv , mv) is a neural network which takes as input the concatenation (hv , xv , mv). Finally, in the case where there is a graph level output, R = f( P v2G hT v ) where f is a neural network which takes the sum of the final hidden states hT v . Note the original work only defined the model for T = 1. Molecular Graph Convolutions, Kearnes et al. (2016) '$&2 '$&4 Update Function: +" (ℎ$ " , ,$ "-.)

Slide 13

Slide 13 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n ( , : 0 (, +1 0 [ T ] US • ) 0 0 7: 0 ) 0 :1 7 : !" ℎ$ " , ℎ& " , '$& = tanh -./ -/.ℎ0 " + 23 ⊙ -6.'$0 + 27 • -./, -/., -6. D23 , 27 M N • 20 :1 7 : 8" ℎ$ " , 9$ ":3 = ℎ$ " + 9$ ":3 Message Function: !" (ℎ$ " , ℎ&< " , '$&< ) Σ Message Function: !" (ℎ$ " , ℎ&< " , '$&< ) Neural Message Passing f time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message v u1 u2 h(0) v h(0) u1 h(0) u2 Neural Message Passing for Quantum Chemistry time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function Mt , vertex update function Ut , and readout function R used. Note one could also learn edge features in an MPNN by introducing hidden states for all edges in the graph ht evw and updating them analogously to equations 1 and 2. Of the existing MPNNs, only Kearnes et al. (2016) Recurrent Unit introduced in Cho et al. (2014). This work used weight tying, so the same update function is used at each time step t. Finally, R = X v2V ⇣ i(h(T ) v , h0 v ) ⌘ ⇣ j(h(T ) v ) ⌘ (4) where i and j are neural networks, and denotes element- wise multiplication. Interaction Networks, Battaglia et al. (2016) This work considered both the case where there is a target at each node in the graph, and where there is a graph level target. It also considered the case where there are node level effects applied at each time step, in such a case the update function takes as input the concatenation (hv , xv , mv) where xv is an external vector representing some outside influence on the vertex v. The message function M(hv , hw , evw) is a neural network which takes the concatenation (hv , hw , evw). The vertex update function U(hv , xv , mv) is a neural network which takes as input the concatenation (hv , xv , mv). Finally, in the case where there is a graph level output, R = f( P v2G hT v ) where f is a neural network which takes the sum of the final hidden states hT v . Note the original work only defined the model for T = 1. '$&< '$&> Update Function: 8" (ℎ$ " , 9$ ":3)

Slide 15

Slide 15 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n + 0 2 2 E E , - ( [ 1 6 G 2 2 M • EE6 C6EE C 6E [ EE6 :G 7 ) !" ℎ$ " , ℎ& " , '$& = )('$+ )ℎ& " • )('$+ )) N S R '$+ M IL00 [ C 6 :G 7 ) -" ℎ$ " , .$ "/0 = GRU ℎ$ " , .$ "/0 • ,,00 - 1 U Message Function: !" (ℎ$ " , ℎ&4 " , '$&4 ) Σ Message Function: !" (ℎ$ " , ℎ&4 " , '$&4 ) Neural Message Passing f time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message v u1 u2 h(0) v h(0) u1 h(0) u2 Neural Message Passing for Quantum Chemistry time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function Mt , vertex update function Ut , and readout function R used. Note one could also learn edge features in an MPNN by introducing hidden states for all edges in the graph ht evw and updating them analogously to equations 1 and 2. Of the existing MPNNs, only Kearnes et al. (2016) Recurrent Unit introduced in Cho et al. (2014). This work used weight tying, so the same update function is used at each time step t. Finally, R = X v2V ⇣ i(h(T ) v , h0 v ) ⌘ ⇣ j(h(T ) v ) ⌘ (4) where i and j are neural networks, and denotes element- wise multiplication. Interaction Networks, Battaglia et al. (2016) This work considered both the case where there is a target at each node in the graph, and where there is a graph level target. It also considered the case where there are node level effects applied at each time step, in such a case the update function takes as input the concatenation (hv , xv , mv) where xv is an external vector representing some outside influence on the vertex v. The message function M(hv , hw , evw) is a neural network which takes the concatenation (hv , hw , evw). The vertex update function U(hv , xv , mv) is a neural network which takes as input the concatenation (hv , xv , mv). Finally, in the case where there is a graph level output, R = f( P v2G hT v ) where f is a neural network which takes the sum of the final hidden states hT v . Note the original work only defined the model for T = 1. '$&4 '$&5 Update Function: -" (ℎ$ " , .$ "/0)

Slide 17

Slide 17 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n + 2 5 0 2 , 02 2 027 0 /Ci 10 t m dchUh M Rl • t m h f C Ma[ R L • dchUh t r PRnsM R b]eR oO c l L g h g Ryut representation, less fundamental changes (e.g., formal charge, association/disassociation of salts) are neglected. Loss or gain of a hydrogen is represented by 32 easy-to-compute features of that reactant atom alone, ∈ 5 ai 32. Loss or gain of a bond is represented by a concatenation of the features of the atoms involved and four features of the bond, ∈ 5 a b a [ , , ] i ij j 68. Because edits occur at the reaction center by definition, the overall representation of a candidate reaction depends only on the atoms and bonds at the reaction core. There is no explicit inclusion of other molecular features, e.g., adjacency to certain functional groups. However, in our featurization, we include rapidly calculable structural and electronic features of the reactants’ atoms that reflect the local chemical environment first and foremost, but also reflect the surrounding molecular context.19−22 The chosen features can be found in Table 2 and Table S2 and are discussed in more detail in the Supporting Information. The design of the neural network is motivated by the likelihood of a reaction being a function of the atom/bond changes that are required for it to occur. Individual edits are first analyzed in isolation using a neural network unique to the Figure 3. Edit-based model architecture for scoring candidate ACS Central Science Research Article in of a bond is es of the atoms ∈ 5 a ] j 68. y definition, the depends only on re is no explicit cency to certain on, we include features of the nvironment first ding molecular in Table 2 and the Supporting otivated by the the atom/bond vidual edits are k unique to the dense layers are mediate feature es vectors of all ural network to each candidate oposed reaction e free energy of candidates are uces a vector of ting their values with kB T = 1. An evel features is reaction shown ecture, the four assessing their picted in Figure on tests after an to describe the ed with bias and 001 parameters, attempts to rank cts alone; no corresponding molecules are prints of length is used prior to odel is shown in able S4). which trains the 10%/20% training/validation/testing split and ceased training once the validation loss did not improve for five epochs. The edit-based model achieves an test accuracy of 68.5%, averaged across all folds. In this context, accuracy refers to the percentage of reaction examples where the recorded product was assigned a rank of 1. The baseline model was similarly trained and tested in a 5-fold CV, reaching an accuracy of 33.3%, suggesting that the set of recorded products in the data set is fairly homogeneous. The hybrid model, combining the edit-based representation with the proposed products’ fingerprint representations, achieves an accuracy of 71.8%. These results are displayed in Table 1. Figure 3. Edit-based model architecture for scoring candidate reactions. Reactions are represented by four types of edits. Initial atom- and bond-level attributes are converted into feature representations, which are summed and used to calculate that candidate reaction’s likelihood score. Table 1. Comparison between Baseline, Edit-Based, and Hybrid Models in Terms of Categorical Crossentropy Loss and Accuracya model loss acc. (%) top-3 (%) top-5 (%) top-10 (%) random guess 5.46 0.8 2.3 3.8 7.6 baseline 3.28 33.3 48.2 55.8 65.9 edit-based 1.34 68.5 84.8 89.4 93.6 hybrid 1.21 71.8 86.7 90.8 94.6 aTop-n refers to the percentage of examples where the recorded product was ranked within the top n candidates. scalabity. Wei et al.10 describe the use of neural networks to predict the outcome of reactions based on reactant fingerprints, but limit their study to 16 types of reactions covering a very narrow scope of possible alkyl halide and alkene reactions. Given two reactants and one reagent, the model was trained to identify which of 16 templates was most applicable. The data set used for cross-validation comes from artificially generated examples with limited chemical functionality, rather than experimental data. Quite recently, Segler and Waller describe two approaches to forward synthesis prediction. The first is a knowledge-graph approach that uses the concept of half reactions to generate possible products given exactly two reactants by looking at the known reactions in which each of those reactants participates.11 required; (3) a new reaction representation focused on the fundamental transformation at the reaction site rather than constituent reactant and product fingerprints; (4) the implementation and validation of a neural network-based model that learns when certain modes of reactivity are more or less likely to occur than other potential modes. Despite the literature bias toward reporting only high-yielding reactions, we develop a successful workflow that can be performed without any manual curation using actual reactions reported in the USPTO literature. ■ APPROACH Overview. Our model predicts the outcome of a chemical reaction in a two-step manner: (1) applying overgeneralized Figure 1. Model framework combining forward enumeration and candidate ranking. The primary aim of this work is the creation of the parametrized scoring model, which is trained to maximize the probability assigned to the recorded experimental outcome. ACS Central Science Research Article

Slide 25

Slide 25 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n - U c NW h i , , - ML Nd • , !" ℎ$ " , ℎ& " , '$& = )(+ ,-.,/0 ℎ& " , '$& ) i ) gef : + a • 2" ℎ$ " , 3$ "45 = )(65 ℎ$ "75 + 69 3$ "45) i ) gef :65 69 a Message Function: )(+ ,-.,/0 ℎ& " , '$&: ) Σ Message Function: )(+ ,-.,/0 ℎ& " , '$&; ) Neural Message Passing for time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function M , vertex update function U , and readout func- v u1 u2 h(0) v h(0) u1 h(0) u2 Neural Message Passing for Quantum Chemistry time steps and is defined in terms of message functions Mt and vertex update functions Ut . During the message passing phase, hidden states ht v at each node in the graph are updated based on messages mt+1 v according to mt+1 v = X w2N(v) Mt(ht v , ht w , evw) (1) ht+1 v = Ut(ht v , mt+1 v ) (2) where in the sum, N(v) denotes the neighbors of v in graph G. The readout phase computes a feature vector for the whole graph using some readout function R according to ˆ y = R({hT v | v 2 G}). (3) The message functions Mt , vertex update functions Ut , and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. In what follows, we define previous models in the literature by specifying the message function Mt , vertex update function Ut , and readout function R used. Note one could also learn edge features in an MPNN by introducing hidden states for all edges in the graph ht evw and updating them analogously to equations 1 and 2. Of the existing MPNNs, only Kearnes et al. (2016) Recurrent Unit introduced in Cho et al. (2014). This work used weight tying, so the same update function is used at each time step t. Finally, R = X v2V ⇣ i(h(T ) v , h0 v ) ⌘ ⇣ j(h(T ) v ) ⌘ (4) where i and j are neural networks, and denotes element- wise multiplication. Interaction Networks, Battaglia et al. (2016) This work considered both the case where there is a target at each node in the graph, and where there is a graph level target. It also considered the case where there are node level effects applied at each time step, in such a case the update function takes as input the concatenation (hv , xv , mv) where xv is an external vector representing some outside influence on the vertex v. The message function M(hv , hw , evw) is a neural network which takes the concatenation (hv , hw , evw). The vertex update function U(hv , xv , mv) is a neural network which takes as input the concatenation (hv , xv , mv). Finally, in the case where there is a graph level output, R = f( P v2G hT v ) where f is a neural network which takes the sum of the final hidden states hT v . Note the original work only defined the model for T = 1. Molecular Graph Convolutions, Kearnes et al. (2016) '$&: '$&; Update Function: )(65 ℎ$ "75 + 69 3$ "45)

Slide 28

Slide 28 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n N m : : u a e . . d W y gr v a . . . . e c r!"# ML W v u1 u2 u3 u4 h(0) v h(0) u1 h(0) u2 $# $"% &"# '( ') ') + . . ℎ"+ (-) ℎ"/ (-) u1 u2 u3 u4 v u1 u2 u3 &"# : n i p o l : : s tv 3.1.2 Finding Reaction Centers with WLN We present two models to predict reactivity: the local and global models. Our local model is based directly on the atom representations cu and cv in predicting label yuv . The global model, on the other hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents1. We incorporate these distal effects into the global model through an attention mechanism. Local Model Let cu, cv be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network: suv = uT ⌧(Macu + Macv + Mbbuv) (5) where (·) is the sigmoid function, and buv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them. Global Model Let ↵uv be the attention score of atom v on atom u. The global context representation ˜ cu of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module: ˜ cu = X v ↵uvcv; ↵uv = uT ⌧(Pacu + Pacv + Pbbuv) (6) suv = uT ⌧(Ma ˜ cu + Ma ˜ cv + Mbbuv) (7) Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u. Training Both models are trained to minimize the following loss function: 3.1.2 Finding Reaction Centers with WLN We present two models to predict reactivity: the local and global models. Our local model is based directly on the atom representations cu and cv in predicting label yuv . The global model, on the other hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents1. We incorporate these distal effects into the global model through an attention mechanism. Local Model Let cu, cv be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network: suv = uT ⌧(Macu + Macv + Mbbuv) (5) where (·) is the sigmoid function, and buv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them. Global Model Let ↵uv be the attention score of atom v on atom u. The global context representation ˜ cu of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module: ˜ cu = X v ↵uvcv; ↵uv = uT ⌧(Pacu + Pacv + Pbbuv) (6) suv = uT ⌧(Ma ˜ cu + Ma ˜ cv + Mbbuv) (7) Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u. Training Both models are trained to minimize the following loss function: L(T ) = X R2T X u6=v2R yuv log(suv) + (1 yuv) log(1 suv) (8) Here we predict each label independently because of the large number of variables. For a given reaction with N atoms, we need to predict the reactivity score of O(N2) pairs. This quadratic complexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless, we found independent prediction yields sufficiently good performance. 3.2 Candidate Generation We select the top K atom pairs with the highest predicted reactivity score and designate them, collectively, as the reaction center. The set of candidate products are then obtained by enumerating all possible bond configuration changes within the set. While the resulting set of candidate products is exponential in K, many can be ruled out by invoking additional constraints. For example, every atom has a maximum number of neighbors they can connect to (valence constraint). We also leverage the statistical bias that reaction centers are very unlikely to consist of disconnected components (connectivity constraint). Some multi-step reactions do exist that violate the connectivity constraint. As we will show, the set of candidates arising from this procedure is more compact than those arising mML

Slide 29

Slide 29 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n MN G W A ( ) AMN L ) v u1 u2 u3 u4 h(0) v h(0) u1 h(0) u2 !" !#$ !#% !#& !#' (#" )* )+ )+ + ( (#" )* )+ + (#" )* )+ + (#" )* )+ + ℎ#- (/) ℎ#1 (/) hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents1. We incorporate these distal effects into the global model through an attention mechanism. Local Model Let cu, cv be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network: suv = uT ⌧(Macu + Macv + Mbbuv) (5) where (·) is the sigmoid function, and buv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them. Global Model Let ↵uv be the attention score of atom v on atom u. The global context representation ˜ cu of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module: ˜ cu = X v ↵uvcv; ↵uv = uT ⌧(Pacu + Pacv + Pbbuv) (6) suv = uT ⌧(Ma ˜ cu + Ma ˜ cv + Mbbuv) (7) Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u. Training Both models are trained to minimize the following loss function: L(T ) = X R2T X u6=v2R yuv log(suv) + (1 yuv) log(1 suv) (8) Here we predict each label independently because of the large number of variables. For a given reaction with N atoms, we need to predict the reactivity score of O(N2) pairs. This quadratic complexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless, we found independent prediction yields sufficiently good performance. 3.2 Candidate Generation We select the top K atom pairs with the highest predicted reactivity score and designate them, collectively, as the reaction center. The set of candidate products are then obtained by enumerating all possible bond configuration changes within the set. While the resulting set of candidate products is exponential in K, many can be ruled out by invoking additional constraints. For example, every atom has a maximum number of neighbors they can connect to (valence constraint). We also leverage the statistical bias that reaction centers are very unlikely to consist of disconnected components (connectivity constraint). Some multi-step reactions do exist that violate the connectivity constraint. As we will show, the set of candidates arising from this procedure is more compact than those arising from templates without sacrificing coverage.

Slide 30

Slide 30 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n d NG rg Ml ( ) o WMc ) M t G e WM a • n N WM L yA s bGi v u1 u2 u3 u4 h(0) v h(0) u1 h(0) u2 ( ℎ"# (%) ℎ"' (%) Σ ( 3.1.2 Finding Reaction Centers with WLN We present two models to predict reactivity: the local and global models. Our local model is based directly on the atom representations cu and cv in predicting label yuv . The global model, on the other hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents1. We incorporate these distal effects into the global model through an attention mechanism. Local Model Let cu, cv be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network: suv = uT ⌧(Macu + Macv + Mbbuv) (5) where (·) is the sigmoid function, and buv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them. Global Model Let ↵uv be the attention score of atom v on atom u. The global context representation ˜ cu of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module: ˜ cu = X v ↵uvcv; ↵uv = uT ⌧(Pacu + Pacv + Pbbuv) (6) suv = uT ⌧(Ma ˜ cu + Ma ˜ cv + Mbbuv) (7) Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u. Training Both models are trained to minimize the following loss function: L(T ) = X R2T X u6=v2R yuv log(suv) + (1 yuv) log(1 suv) (8) Here we predict each label independently because of the large number of variables. For a given reaction with N atoms, we need to predict the reactivity score of O(N2) pairs. This quadratic complexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless, we found independent prediction yields sufficiently good performance. 3.2 Candidate Generation We select the top K atom pairs with the highest predicted reactivity score and designate them, collectively, as the reaction center. The set of candidate products are then obtained by enumerating all possible bond configuration changes within the set. While the resulting set of candidate products is exponential in K, many can be ruled out by invoking additional constraints. For example, every atom has a maximum number of neighbors they can connect to (valence constraint). We also leverage the statistical bias that reaction centers are very unlikely to consist of disconnected components (connectivity constraint). Some multi-step reactions do exist that violate the connectivity constraint.

Slide 31

Slide 31 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n LG d n li b c ) ) oe ) W G A a M p p a M g ) oe v u1 u2 u3 u4 h(0) v h(0) u1 h(0) u2 ) ) ) ( & ℎ"# (%) ℎ"' (%) (̃* (̃"+ ,"* -. -/ -/ + u1 u2 u3 u4 v u1 u2 u3 ) ) 3.1.2 Finding Reaction Centers with WLN We present two models to predict reactivity: the local and global models. Our local model is based directly on the atom representations cu and cv in predicting label yuv . The global model, on the other hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents1. We incorporate these distal effects into the global model through an attention mechanism. Local Model Let cu, cv be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network: suv = uT ⌧(Macu + Macv + Mbbuv) (5) where (·) is the sigmoid function, and buv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them. Global Model Let ↵uv be the attention score of atom v on atom u. The global context representation ˜ cu of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module: ˜ cu = X v ↵uvcv; ↵uv = uT ⌧(Pacu + Pacv + Pbbuv) (6) suv = uT ⌧(Ma ˜ cu + Ma ˜ cv + Mbbuv) (7) Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u. Training Both models are trained to minimize the following loss function: L(T ) = X R2T X u6=v2R yuv log(suv) + (1 yuv) log(1 suv) (8) Here we predict each label independently because of the large number of variables. For a given reaction with N atoms, we need to predict the reactivity score of O(N2) pairs. This quadratic complexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless, we found independent prediction yields sufficiently good performance. 3.2 Candidate Generation We select the top K atom pairs with the highest predicted reactivity score and designate them, collectively, as the reaction center. The set of candidate products are then obtained by enumerating all possible bond configuration changes within the set. While the resulting set of candidate products is 3.1.2 Finding Reaction Centers with WLN We present two models to predict reactivity: the local and global models. Our local model is based directly on the atom representations cu and cv in predicting label yuv . The global model, on the other hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents1. We incorporate these distal effects into the global model through an attention mechanism. Local Model Let cu, cv be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network: suv = uT ⌧(Macu + Macv + Mbbuv) (5) where (·) is the sigmoid function, and buv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them. Global Model Let ↵uv be the attention score of atom v on atom u. The global context representation ˜ cu of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module: ˜ cu = X v ↵uvcv; ↵uv = uT ⌧(Pacu + Pacv + Pbbuv) (6) suv = uT ⌧(Ma ˜ cu + Ma ˜ cv + Mbbuv) (7) Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u. Training Both models are trained to minimize the following loss function: L(T ) = X R2T X u6=v2R yuv log(suv) + (1 yuv) log(1 suv) (8) Here we predict each label independently because of the large number of variables. For a given reaction with N atoms, we need to predict the reactivity score of O(N2) pairs. This quadratic complexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless, we found independent prediction yields sufficiently good performance. 3.2 Candidate Generation We select the top K atom pairs with the highest predicted reactivity score and designate them, collectively, as the reaction center. The set of candidate products are then obtained by enumerating all possible bond configuration changes within the set. While the resulting set of candidate products is exponential in K, many can be ruled out by invoking additional constraints. For example, every atom has a maximum number of neighbors they can connect to (valence constraint). We also leverage the statistical bias that reaction centers are very unlikely to consist of disconnected components (connectivity constraint). Some multi-step reactions do exist that violate the connectivity constraint. As we will show, the set of candidates arising from this procedure is more compact than those arising ONL

Slide 37

Slide 37 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n ) ) ) ) ) ) () ) - W L N D : : !" ($%) = (" ($%) − (" (*) W !" ($% ) ( . ) 3 : v1 v2 v3 v4 v5 v1 v2 v3 v4 v5 - - - - - Σ e each candidate pair (r, p). The first model naively ectors of all atom representations obtained from a . Our second and improved model, called WLDN, ween these differences vectors. d atom representation of atom v in candidate product ertaining to atom v as follows: s(pi) = uT ⌧(M X v2pi d(pi) v ) (9) -mapped so we can use v to refer to the same atom. e difference vectors, resulting in a single vector for her neural network to score the candidate product pi . DN) Instead of simply summing all difference vec- d a difference graph. A difference graph D(r, pi) is atoms and bonds as pi , with atom v’s feature vector aph has several benefits. First, in D(r, pi), atom v’s e to the reaction center, thus focusing the processing Second, D(r, pi) explicates neighbor dependencies molecule pi . We define difference vector d(pi) v pertaining to atom v as follows: d(pi) v = c(pi) v c(r) v ; s(pi) = uT ⌧(M X v2pi d(pi) v ) (9) Recall that the reactants and products are atom-mapped so we can use v to refer to the same atom. The pooling operation is a simple sum over these difference vectors, resulting in a single vector for each (r, pi) pair. This vector is then fed into another neural network to score the candidate product pi . Weisfeiler-Lehman Difference Network (WLDN) Instead of simply summing all difference vectors, the WLDN operates on another graph called a difference graph. A difference graph D(r, pi) is defined as a molecular graph which has the same atoms and bonds as pi , with atom v’s feature vector replaced by d(pi) v . Operating on the difference graph has several benefits. First, in D(r, pi), atom v’s feature vector deviates from zero only if it is close to the reaction center, thus focusing the processing on the reaction center and its immediate context. Second, D(r, pi) explicates neighbor dependencies between difference vectors. The WLDN maps this graph-based representation into a fixed-length vector, by applying a separately parameterized WLN on top of D(r, pi): h(pi,l) v = ⌧ 0 @U1h(pi,l 1) v + U2 X u2N(v) ⌧ ⇣ V[h(pi,l 1) u , fuv] ⌘ 1 A (1  l  L) (10) d(pi,L) v = X u2N(v) W(0)h(pi,L) u W(1)fuv W(2)h(pi,L) v (11) where h(pi,0) v = d(pi) v . The final score of pi is s(pi) = uT ⌧(M P v2pi d(pi,L) v ). Training Both models are trained to minimize the softmax log-likelihood objective over the scores {s(p0), s(p1), · · · , s(pm)} where s(p0) corresponds to the target. 4 Experiments Data As a source of data for our experiments, we used reactions from USPTO granted patents, collected by Lowe [13]. After removing duplicates and erroneous reactions, we obtained a set of 480K reactions, to which we refer in the paper as USPTO. This dataset is divided into 400K, 40K, and 40K for training, development, and testing purposes. In addition, for comparison purposes we report the results on the subset of 15K reaction from this dataset (referred as USPTO-15K) used by Coley et al. [3]. They selected this subset to include reactions covered by the 1.7K most common templates. We follow their split, with 10.5K, 1.5K, and 3K for training, development, and testing. Setup for Reaction Center Identification The output of this component consists of K atom pairs with the highest reactivity scores. We compute the coverage as the proportion of reactions where all atom pairs in the true reaction center are predicted by the model, i.e., where the recorded product is found in the model-generated candidate set. The model features reflect basic chemical properties of atoms and bonds. Atom-level features include its elemental identity, degree of connectivity, number of attached hydrogen atoms, implicit valence, and aromaticity. Bond-level features include bond type (single, double, triple, or aromatic), whether it is conjugated, and whether the bond is part of a ring. Both our local and global models are build upon a Weisfeiler-Lehman Network, with unrolled depth d(pi) v = c(pi) v c(r) v ; s(pi) = uT ⌧(M X v2pi d(pi) v ) (9) Recall that the reactants and products are atom-mapped so we can use v to refer to the same atom. The pooling operation is a simple sum over these difference vectors, resulting in a single vector for each (r, pi) pair. This vector is then fed into another neural network to score the candidate product pi . Weisfeiler-Lehman Difference Network (WLDN) Instead of simply summing all difference vectors, the WLDN operates on another graph called a difference graph. A difference graph D(r, pi) is defined as a molecular graph which has the same atoms and bonds as pi , with atom v’s feature vector replaced by d(pi) v . Operating on the difference graph has several benefits. First, in D(r, pi), atom v’s feature vector deviates from zero only if it is close to the reaction center, thus focusing the processing on the reaction center and its immediate context. Second, D(r, pi) explicates neighbor dependencies between difference vectors. The WLDN maps this graph-based representation into a fixed-length vector, by applying a separately parameterized WLN on top of D(r, pi): h(pi,l) v = ⌧ 0 @U1h(pi,l 1) v + U2 X u2N(v) ⌧ ⇣ V[h(pi,l 1) u , fuv] ⌘ 1 A (1  l  L) (10) d(pi,L) v = X u2N(v) W(0)h(pi,L) u W(1)fuv W(2)h(pi,L) v (11) where h(pi,0) v = d(pi) v . The final score of pi is s(pi) = uT ⌧(M P v2pi d(pi,L) v ). Training Both models are trained to minimize the softmax log-likelihood objective over the scores {s(p0), s(p1), · · · , s(pm)} where s(p0) corresponds to the target. 4 Experiments Data As a source of data for our experiments, we used reactions from USPTO granted patents, collected by Lowe [13]. After removing duplicates and erroneous reactions, we obtained a set of 480K reactions, to which we refer in the paper as USPTO. This dataset is divided into 400K, 40K, and 40K for training, development, and testing purposes. In addition, for comparison purposes we report the results on the subset of 15K reaction from this dataset (referred as USPTO-15K) used by Coley et al. [3]. They selected this subset to include reactions covered by the 1.7K most common templates. We follow their split, with 10.5K, 1.5K, and 3K for training, development, and testing. Setup for Reaction Center Identification The output of this component consists of K atom pairs with the highest reactivity scores. We compute the coverage as the proportion of reactions where all atom pairs in the true reaction center are predicted by the model, i.e., where the recorded product is found in the model-generated candidate set. The model features reflect basic chemical properties of atoms and bonds. Atom-level features include its elemental identity, degree of connectivity, number of attached hydrogen atoms, implicit valence, and aromaticity. Bond-level features include bond type (single, double, triple, or aromatic), whether it is conjugated, and whether the bond is part of a ring.

Slide 46

Slide 46 text

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n pc O U SA • er n 8A a i A n s 8 n p 0 1 + 2 A A Human Evaluation Setup Here we describe in detail the human evaluation results in Table 3. The evaluation dataset consists of eight groups, defined by the reaction template popularity as binned in Figure 4b, each with ten instances selected randomly from the USPTO test set. We invited in total ten chemists to predict the product given reactants in all groups. Table 3: Human and model performance on 80 reactions randomly selected from the USPTO test set to cover a diverse range of reaction types. The first 8 are experts with rich experience in organic chemistry (graduate and postdoctoral chemists) and the last two are graduate students in chemical engineering who use organic chemistry concepts regularly but have less formal training. Our model performs at the expert chemist level in terms of top 1 accuracy. Chemist 56.3 50.0 40.0 63.8 66.3 65.0 40.0 58.8 25.0 16.3 Our Model 69.1 Figure 5: Details of human performance i i TP 0 ++ 1 + ou A Human Evaluation Setup Here we describe in detail the human evaluation results in Table 3. The evaluation dataset consists of eight groups, defined by the reaction template popularity as binned in Figure 4b, each with ten instances selected randomly from the USPTO test set. We invited in total ten chemists to predict the product given reactants in all groups. Table 3: Human and model performance on 80 reactions randomly selected from the USPTO test set to cover a diverse range of reaction types. The first 8 are experts with rich experience in organic chemistry (graduate and postdoctoral chemists) and the last two are graduate students in chemical engineering who use organic chemistry concepts regularly but have less formal training. Our model performs at the expert chemist level in terms of top 1 accuracy. Chemist 56.3 50.0 40.0 63.8 66.3 65.0 40.0 58.8 25.0 16.3 Our Model 69.1 Figure 5: Details of human performance (a) Histogram showing the distribution of question difficulties as evaluated by the average expert performance across all ten performers. (b) Comparison of model performance against human performance for sets of questions as grouped by the average human accuracy shown in Figure 5a .

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text