
Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network

Kazuki Fujikawa
January 21, 2018


Slides presented at the NIPS2017 paper-reading meetup held at Preferred Networks.


Transcript

  1. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network

    Kazuki Fujikawa, AI System Dept., System & Design Management Unit, DeNA Co., Ltd.
  2. (Introductory slide; the extracted text is not legible.)
  3. Overview of the approach

    Figure 2 of the paper summarizes the pipeline: (1) a model is trained to identify pairwise atom interactions in the reaction center; (2) the top K atom pairs are picked and chemically feasible bond configurations between these atoms are enumerated, each configuration generating a candidate outcome of the reaction; (3) another model is trained to score these candidates and find the true product. The method predicts the transformation from reactants to products in a single step.
  4. Reaction centers and reaction templates

    Figure 1 of the paper shows an example reaction whose reaction center is (27,28), (7,27), and (8,27): bond (27,28) is deleted, and (7,27) and (8,27) are connected by aromatic bonds to form a new ring. The corresponding reaction template consists not only of the reaction center but also of nearby functional groups that explicitly specify the context. Applying a template involves graph matching, which makes examining large numbers of templates prohibitively expensive.
  5. (Related-work slide; the extracted text is not legible.)
  6. Neural Message Passing for Quantum Chemistry (Gilmer, Schoenholz, Riley, Vinyals, Dahl, 2017)

    Message Passing Neural Networks (MPNNs) provide a common framework that covers many existing graph neural network models. In the quantum-chemistry setting the motivation is speed: DFT takes on the order of 10^3 seconds per molecule, while a trained message passing neural net predicts the same targets in about 10^-2 seconds.
  7. The MPNN framework, and molecular fingerprint networks (Duvenaud et al., 2015) as an MPNN

    An MPNN runs a message-passing phase for T time steps, defined by message functions M_t and vertex update functions U_t:
    m_v^(t+1) = Σ_{w∈N(v)} M_t(h_v^t, h_w^t, e_vw)
    h_v^(t+1) = U_t(h_v^t, m_v^(t+1))
    A readout phase then computes a graph-level feature vector ŷ = R({h_v^T | v ∈ G}), where R must be invariant to permutations of the node states so that the MPNN is invariant to graph isomorphism. The graph convolution of Duvenaud et al. fits this framework with message function M_t(h_v^t, h_w^t, e_vw) = concat(h_w^t, e_vw) and update function U_t(h_v^t, m_v^(t+1)) = σ(H_t^{deg(v)} m_v^(t+1)), where H_t^{deg(v)} is a learned matrix for each time step t and vertex degree deg(v).
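As a concrete reading of the two update equations above, here is a minimal NumPy sketch of the message-passing phase with a sum readout. The dictionary-based graph layout and the simple linear/tanh stand-ins for M_t and U_t are illustrative assumptions, not the exact functions of any model discussed on these slides.

```python
import numpy as np

def mpnn_forward(node_feats, edge_feats, neighbors, W_msg, W_upd, T=3):
    """Generic MPNN sketch (framework of Gilmer et al., 2017).

    node_feats: dict v -> initial state h_v^0, shape (d,)
    edge_feats: dict (v, w) -> edge feature e_vw, shape (d_e,)
    neighbors:  dict v -> list of neighbor nodes
    W_msg:      stands in for M_t, shape (d_m, 2*d + d_e)
    W_upd:      stands in for U_t, shape (d, d + d_m)
    """
    h = {v: f.astype(float) for v, f in node_feats.items()}
    for _ in range(T):
        m = {}
        for v in h:
            # m_v^{t+1} = sum_{w in N(v)} M_t(h_v^t, h_w^t, e_vw)
            msgs = [W_msg @ np.concatenate([h[v], h[w], edge_feats[(v, w)]])
                    for w in neighbors[v]]
            m[v] = np.sum(msgs, axis=0) if msgs else np.zeros(W_msg.shape[0])
        # h_v^{t+1} = U_t(h_v^t, m_v^{t+1})
        h = {v: np.tanh(W_upd @ np.concatenate([h[v], m[v]])) for v in h}
    # Readout y_hat = R({h_v^T | v in G}); a permutation-invariant sum here.
    return np.sum(list(h.values()), axis=0), h
```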
  8. Molecular fingerprint networks: readout function

    The readout of the Duvenaud et al. model accumulates a softmax over every node state at every time step:
    R({h_v^T | v ∈ G}) = f(Σ_{v,t} softmax(W_t h_v^t))
    where f is a neural network and the W_t are learned readout matrices.
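A short sketch of this readout: node states from every time step and every node are pushed through a per-step softmax and summed. The outer network f is left out (treated as the identity), which is a simplification.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def fingerprint_readout(h_per_step, W_per_step):
    """R = f( sum_{v,t} softmax(W_t h_v^t) ), with f omitted here.

    h_per_step: list over t of dicts v -> h_v^t
    W_per_step: list over t of readout matrices W_t
    """
    acc = None
    for W_t, h_t in zip(W_per_step, h_per_step):
        for v, h_v in h_t.items():
            s = softmax(W_t @ h_v)
            acc = s if acc is None else acc + s
    return acc  # a full model would apply the outer neural network f to this sum
```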
  9. Gated Graph Neural Networks (GG-NN, Li et al., 2016)

    As an MPNN, GG-NN uses the message function M_t(h_v^t, h_w^t, e_vw) = A_{e_vw} h_w^t, where A_{e_vw} is a learned matrix per edge (bond) type, and the update function U_t(h_v^t, m_v^(t+1)) = GRU(h_v^t, m_v^(t+1)), the Gated Recurrent Unit of Cho et al. (2014), with the same update function (tied weights) used at every time step.
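A minimal sketch of this update: a per-edge-type matrix produces each message and a GRU cell updates the node state. The explicit little GRU implementation and the parameter layout are assumptions made to keep the example self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(h, m, Wz, Uz, Wr, Ur, Wh, Uh):
    # Standard GRU update h' = GRU(h, m), written out explicitly.
    z = sigmoid(Wz @ m + Uz @ h)
    r = sigmoid(Wr @ m + Ur @ h)
    h_tilde = np.tanh(Wh @ m + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def ggnn_step(h, neighbors, edge_type, A, gru_params):
    """One GG-NN propagation step.

    h:         dict v -> hidden state, shape (d,)
    edge_type: dict (v, w) -> integer edge type
    A:         array (num_edge_types, d, d); A[e] plays the role of A_{e_vw}
    gru_params: tuple (Wz, Uz, Wr, Ur, Wh, Uh), each (d, d)
    """
    new_h = {}
    for v in h:
        # m_v^{t+1} = sum_{w in N(v)} A_{e_vw} h_w^t
        msgs = [A[edge_type[(v, w)]] @ h[w] for w in neighbors[v]]
        m = np.sum(msgs, axis=0) if msgs else np.zeros_like(h[v])
        # h_v^{t+1} = GRU(h_v^t, m_v^{t+1})
        new_h[v] = gru_cell(h[v], m, *gru_params)
    return new_h
```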
  10. GG-NN: readout function

    The GG-NN readout gates each node's contribution by comparing its final and initial states:
    R = tanh(Σ_{v∈V} σ(i(h_v^(T), h_v^(0))) ⊙ tanh(j(h_v^(T))))
    where i and j are neural networks and ⊙ denotes element-wise multiplication.
  11. Deep Tensor Neural Networks (Schütt et al., 2017)

    As an MPNN, the message function conditions on both the neighbor state and the edge feature:
    M_t(h_v^t, h_w^t, e_vw) = tanh(W^fc((W^cf h_w^t + b_1) ⊙ (W^df e_vw + b_2)))
    where the W matrices and the biases b_1, b_2 are learned parameters. The update function is a simple residual addition, U_t(h_v^t, m_v^(t+1)) = h_v^t + m_v^(t+1).
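The message and update functions on this slide translate almost line for line into code; the weight names just mirror the formula and are otherwise arbitrary, so treat this as a sketch.

```python
import numpy as np

def dtnn_message(h_w, e_vw, W_fc, W_cf, b1, W_df, b2):
    # M_t(h_v, h_w, e_vw) = tanh( W_fc ((W_cf h_w + b1) * (W_df e_vw + b2)) )
    return np.tanh(W_fc @ ((W_cf @ h_w + b1) * (W_df @ e_vw + b2)))

def dtnn_update(h_v, m_v):
    # U_t(h_v, m_v) = h_v + m_v  (residual addition)
    return h_v + m_v
```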
  12. Deep Tensor Neural Networks: readout function

    The readout applies a small neural network to each final node state and sums the results: R = Σ_v NN(h_v^T).
  13. MPNN variant of Gilmer et al.: edge networks

    The edge network message function generalizes GG-NN: M_t(h_v^t, h_w^t, e_vw) = A(e_vw) h_w^t, where A(e_vw) is a neural network that maps the edge feature vector e_vw to a d × d matrix. The update function is again a GRU, U_t(h_v^t, m_v^(t+1)) = GRU(h_v^t, m_v^(t+1)).
  14. MPNN variant of Gilmer et al.: set2set readout

    The readout is the set2set model of Vinyals et al. (Read-Process-and-Write): R = set2set({h_v^T | v ∈ G}). An LSTM that takes no input evolves a query, q_t = LSTM(q*_{t-1}); the query attends over the node memories m_i via e_{i,t} = f(m_i, q_t) and a_{i,t} = softmax_i(e_{i,t}); the readout r_t = Σ_i a_{i,t} m_i is concatenated back into the state, q*_t = [q_t, r_t]. Because the attention is content-based, the result does not change if the memories are shuffled, so the readout is order-invariant.
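A rough sketch of that set2set readout: a tiny input-less LSTM evolves the query, a dot-product attention reads from the node memories, and the final state [q_T, r_T] is returned. The single-matrix LSTM parameterization and the dot-product choice for f are assumptions to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W, U, b):
    # W: (4d, len(x)), U: (4d, d), b: (4d,); gate order i, f, o, g.
    z = W @ x + U @ h + b
    d = h.shape[0]
    i, f, o, g = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d]), np.tanh(z[3*d:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def set2set_readout(memories, W, U, b, steps=3):
    """memories: list of node vectors m_i, each shape (d,). Returns q*_T of shape (2d,)."""
    d = memories[0].shape[0]
    q_star = np.zeros(2 * d)                      # q*_0 = [q_0, r_0]
    h, c = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        h, c = lstm_cell(q_star, h, c, W, U, b)   # q_t = LSTM(q*_{t-1}), no external input
        e = np.array([m @ h for m in memories])   # e_{i,t} = f(m_i, q_t), here a dot product
        a = np.exp(e - e.max()); a /= a.sum()     # a_{i,t} = softmax_i(e_{i,t})
        r = sum(a_i * m for a_i, m in zip(a, memories))  # r_t = sum_i a_{i,t} m_i
        q_star = np.concatenate([h, r])           # q*_t = [q_t, r_t]
    return q_star
```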
  15. Prior work: edit-based scoring of candidate reactions (Coley et al., 2017, ACS Central Science)

    Coley et al. combine forward enumeration of candidate products with a learned ranking model (their Figure 1). A candidate reaction is represented by the atom- and bond-level edits at the reaction core: loss or gain of a hydrogen is encoded by 32 features of that reactant atom, and loss or gain of a bond by the concatenated features of the two atoms plus four bond features (68 in total). A neural network scores candidates from these edit representations (their Figure 3). In 5-fold cross-validation on USPTO-derived data, the edit-based model reaches 68.5% top-1 accuracy versus 33.3% for a fingerprint baseline; a hybrid of the two reaches 71.8%.

    model         loss   acc. (%)  top-3 (%)  top-5 (%)  top-10 (%)
    random guess  5.46      0.8       2.3        3.8        7.6
    baseline      3.28     33.3      48.2       55.8       65.9
    edit-based    1.34     68.5      84.8       89.4       93.6
    hybrid        1.21     71.8      86.7       90.8       94.6

    Earlier work is narrower: Wei et al. predict which of only 16 reaction templates applies, using artificially generated examples, and template application in general requires graph matching, which scales poorly to large template sets.
  16. (Summary of prior approaches; the extracted text is not legible.)
  17. Proposed method: pipeline diagram

    The reactant atoms (u1–u5) are encoded, reactive atom pairs are identified, candidate products are enumerated, and a Weisfeiler-Lehman Difference Network (WLDN) ranks the candidates (diagram).
  18. Proposed method: pipeline diagram (continued)
  19. Proposed method: pipeline diagram (continued)
  20. Proposed method: pipeline diagram (continued)
  21. Proposed method: pipeline diagram (continued)
  22. Reaction center identification: the WLN as an MPNN

    The Weisfeiler-Lehman Network (WLN) used to encode reactant atoms fits the MPNN framework with message function M_t(h_v^t, h_w^t, e_vw) = τ(V concat(h_w^t, e_vw)), where V is a learned matrix and τ a non-linearity, and update function U_t(h_v^t, m_v^(t+1)) = τ(U_1 h_v^t + U_2 m_v^(t+1)), with learned matrices U_1 and U_2 shared across layers.
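Read as an MPNN, the relabeling iteration on this slide can be sketched as follows. The matrix names follow the slide (U1, U2, V); taking τ to be ReLU and keying the edge features by ordered pairs are assumptions for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def wln_relabel(h, f_edge, neighbors, U1, U2, V, L=3):
    """Iterated WLN relabeling:
    h_v^{(l)} = tau( U1 h_v^{(l-1)} + U2 sum_{u in N(v)} tau(V [h_u^{(l-1)}, f_uv]) ).

    h:      dict v -> h_v^{(0)} (initial atom features f_v)
    f_edge: dict (u, v) -> bond feature f_uv (assumed present for both orderings)
    """
    for _ in range(L):
        new_h = {}
        for v in h:
            msgs = [relu(V @ np.concatenate([h[u], f_edge[(u, v)]])) for u in neighbors[v]]
            m = np.sum(msgs, axis=0) if msgs else np.zeros(U2.shape[1])
            new_h[v] = relu(U1 @ h[v] + U2 @ m)
        h = new_h
    return h
```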
  23. WLN: atom representations

    A node v with features f_v, neighbors N(v), and edge features f_uv is relabeled iteratively,
    h_v^(l) = τ(U_1 h_v^(l-1) + U_2 Σ_{u∈N(v)} τ(V[h_u^(l-1), f_uv])), 1 ≤ l ≤ L, with h_v^(0) = f_v,
    mimicking the discrete relabeling of the Weisfeiler-Lehman isomorphism test with continuous vectors. The final atom representation compares each neighboring edge against a learned set of reference environments:
    c_v = Σ_{u∈N(v)} W^(0) h_u^(L) ⊙ W^(1) f_uv ⊙ W^(2) h_v^(L)
    and the whole-graph representation is simply c_G = Σ_v c_v.
  24. Proposed method: pipeline diagram (shown again)
  25. Finding reaction centers: local and global models

    Two models predict a reactivity score s_uv for each atom pair (u, v) from the WLN atom representations c_u and c_v. The local model scores pairs directly; the global model additionally incorporates distal chemical effects (e.g., reagents outside the reaction center) through an attention mechanism. Both are trained with a binary cross-entropy loss over all atom pairs,
    L(T) = − Σ_{R∈T} Σ_{u≠v∈R} [ y_uv log(s_uv) + (1 − y_uv) log(1 − s_uv) ],
    predicting each label independently, since a reaction with N atoms has O(N^2) pairs and higher-order dependencies would be too expensive; independent prediction nonetheless works well in practice.
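A minimal sketch of that pairwise training loss; the dict-of-pairs data layout and the epsilon for numerical stability are assumptions.

```python
import numpy as np

def reaction_center_loss(s, y):
    """Binary cross-entropy over all atom pairs (the loss on this slide).

    s, y: dicts (u, v) -> predicted reactivity score in (0, 1) / binary label.
    """
    eps = 1e-12  # guards log(0)
    return -sum(y[p] * np.log(s[p] + eps) + (1 - y[p]) * np.log(1 - s[p] + eps) for p in s)
```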
  26. Local model: scoring every atom pair

    Each atom u is paired with every other atom v, and the pair is scored from the WLN representations c_u and c_v (diagram).
  27. Local model: reactivity score

    The reactivity score of a pair (u, v) is computed by a small network on top of the WLN atom representations:
    s_uv = σ(u^T τ(M_a c_u + M_a c_v + M_b b_uv))
    where σ is the sigmoid function and b_uv is an auxiliary feature vector for the pair (e.g., whether the two atoms belong to different molecules, or which bond type connects them).
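The local score above, written out directly; taking τ to be ReLU is an assumption, and u_vec stands for the learned vector u in the formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def local_reactivity_score(c_u, c_v, b_uv, Ma, Mb, u_vec):
    """s_uv = sigmoid( u^T tau(Ma c_u + Ma c_v + Mb b_uv) )."""
    return sigmoid(u_vec @ relu(Ma @ c_u + Ma @ c_v + Mb @ b_uv))
```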
  28. Global model: attention over reactant atoms

    The global model incorporates distal effects through attention. The attention score of atom v on atom u is α_uv = σ(u^T τ(P_a c_u + P_a c_v + P_b b_uv)), and the global context representation of u is the weighted sum over all reactant atoms, c̃_u = Σ_v α_uv c_v. The pair score is then s_uv = σ(u^T τ(M_a c̃_u + M_a c̃_v + M_b b_uv)). A sigmoid rather than a softmax is used for the attention because several atoms may be relevant to a given atom u.
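A sketch of the global model's attention and scoring. It assumes τ = ReLU, that the pair features b are available for every ordered atom pair, and a brute-force loop over all pairs; these are illustrative choices, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def global_reactivity_scores(c, b, Pa, Pb, Ma, Mb, u_vec):
    """c: dict atom -> WLN vector c_u; b: dict (u, v) -> pair features b_uv (all ordered pairs)."""
    atoms = list(c)
    c_tilde = {}
    for u in atoms:
        # alpha_uv = sigmoid(u^T tau(Pa c_u + Pa c_v + Pb b_uv)); sigmoid, not softmax
        alpha = {v: sigmoid(u_vec @ relu(Pa @ c[u] + Pa @ c[v] + Pb @ b[(u, v)])) for v in atoms}
        c_tilde[u] = sum(alpha[v] * c[v] for v in atoms)   # c~_u = sum_v alpha_uv c_v
    # s_uv = sigmoid(u^T tau(Ma c~_u + Ma c~_v + Mb b_uv))
    return {(x, y): sigmoid(u_vec @ relu(Ma @ c_tilde[x] + Ma @ c_tilde[y] + Mb @ b[(x, y)]))
            for x in atoms for y in atoms if x != y}
```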
  29. Proposed method: pipeline diagram (shown again)
  30. Candidate generation

    The top K atom pairs with the highest predicted reactivity scores are designated collectively as the reaction center, and candidate products are enumerated from the chemically feasible bond configuration changes within this set (cf. the overview in Figure 2).
  31. Proposed method: pipeline diagram (shown again)
  32. Candidate ranking: difference vectors

    For each candidate product p_i, the difference vector of atom v is d_v^(p_i) = c_v^(p_i) − c_v^(r), the difference between its WLN representation in the candidate product and in the reactants (reactants and products are atom-mapped, so v refers to the same atom). The simpler ranking model sums these difference vectors and scores the candidate as s(p_i) = u^T τ(M Σ_{v∈p_i} d_v^(p_i)).
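The sum-pooling scorer above, written out as a sketch; τ = ReLU is an assumption and u_vec stands for the learned vector u.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def candidate_score_sum_pooling(c_product, c_reactant, M, u_vec):
    """d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)};  s(p_i) = u^T tau(M sum_v d_v^{(p_i)}).

    c_product, c_reactant: dicts atom -> WLN vector for the candidate product and for the
    reactants; the atom mapping means the same key refers to the same atom in both.
    """
    d = {v: c_product[v] - c_reactant[v] for v in c_product}
    return u_vec @ relu(M @ np.sum(list(d.values()), axis=0))
```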
  33. Candidate ranking: difference vectors for each candidate (diagram)

    Each enumerated candidate assigns the selected atom pairs new bond configurations y ∈ {0, 1, 2, 3}; for every candidate p_i, the atom-wise difference vectors d_v^(p_i) = c_v^(p_i) − c_v^(r) are computed against the reactant representations (diagram).
  34. Weisfeiler-Lehman Difference Network (WLDN)

    Instead of simply summing the difference vectors, the WLDN operates on a difference graph D(r, p_i), which has the same atoms and bonds as p_i but with atom v's feature vector replaced by d_v^(p_i). This focuses the computation on the reaction center (the features deviate from zero only near it) and makes neighbor dependencies between difference vectors explicit. A separately parameterized WLN is applied on top of D(r, p_i):
    h_v^(p_i,l) = τ(U_1 h_v^(p_i,l-1) + U_2 Σ_{u∈N(v)} τ(V[h_u^(p_i,l-1), f_uv])), 1 ≤ l ≤ L, with h_v^(p_i,0) = d_v^(p_i),
    d_v^(p_i,L) = Σ_{u∈N(v)} W^(0) h_u^(p_i,L) ⊙ W^(1) f_uv ⊙ W^(2) h_v^(p_i,L),
    and the final score is s(p_i) = u^T τ(M Σ_{v∈p_i} d_v^(p_i,L)). Both ranking models are trained with a softmax log-likelihood objective over the candidate scores {s(p_0), s(p_1), ..., s(p_m)}, where s(p_0) corresponds to the recorded product.

    Data and setup: reactions from USPTO granted patents collected by Lowe; after removing duplicates and erroneous reactions, 480K reactions remain (USPTO), split 400K / 40K / 40K for training, development, and testing. The 15K subset used by Coley et al. (USPTO-15K) is also evaluated, split 10.5K / 1.5K / 3K. For reaction center identification, coverage is the proportion of reactions whose true reaction center atom pairs are all among the top K predictions, i.e., where the recorded product appears in the generated candidate set. Atom features cover element, degree, number of attached hydrogens, implicit valence, and aromaticity; bond features cover bond type (single, double, triple, aromatic), conjugation, and ring membership.
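The ranking objective described above is a softmax log-likelihood over the candidate scores, with index 0 corresponding to the recorded product; a minimal sketch:

```python
import numpy as np

def candidate_ranking_loss(scores):
    """scores: array [s(p_0), s(p_1), ..., s(p_m)], with s(p_0) the recorded (true) product.

    Loss = -log softmax(scores)[0], i.e., the negative log-likelihood of the true product.
    """
    z = scores - scores.max()                 # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[0]
```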
  35. WLDN: scoring the candidates (diagram)

    Each candidate's difference graph is processed by the separately parameterized WLN and scored (diagram).
  36. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  37. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  38. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    Figure 3: A reaction that reduces the carbonyl carbon of an amide by removing bond 4-23 (red circle). Reactivity at this site would be highly unlikely without the presence of borohydride (atom 25, blue circle). The global model correctly predicts bond 4-23 as the most susceptible to change, while the local model does not even include it in the top ten predictions. The attention map of the global model shows that atoms 1, 25, and 26 were determinants of atom 4's predicted reactivity.

    (a) Reaction Center Prediction Performance. Coverage is reported by picking the top K (K=6, 8, 10) reactivity pairs. |θ| is the number of model parameters.

                  Method    |θ|     K=6    K=8    K=10
      USPTO-15K   Local     572K    80.1   85.0   87.7
                  Local     1003K   81.6   86.1   89.1
                  Global    756K    86.7   90.1   92.2
      USPTO       Local     572K    83.0   87.2   89.6
                  Local     1003K   82.4   86.7   89.1
                  Global    756K    89.8   92.0   93.3
      Avg. Num. of Candidates (USPTO)
                  Template  482.3 out of 5006
                  Global    60.9 (K=6)   246.5 (K=8)   1076 (K=10)

    (b) Candidate Ranking Performance. Precision at ranks 1, 3, 5 is reported. (*) denotes that the true product was added if not covered by the previous stage.

                  Method         Cov.    P@1    P@3    P@5
      USPTO-15K   Coley et al.   100.0   72.1   86.6   90.7
                  WLN            90.1    74.9   84.6   86.3
                  WLDN           90.1    76.7   85.6   86.8
                  WLN (*)        100.0   81.4   92.5   94.8
                  WLDN (*)       100.0   84.1   94.1   96.1
      USPTO       WLN            92.0    73.5   86.1   89.0
                  WLDN           92.0    74.0   86.7   89.5
                  WLN (*)        100.0   76.7   91.0   94.6
                  WLDN (*)       100.0   77.8   91.9   95.4

    Table 1: Results on USPTO-15K and USPTO datasets.
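A small sketch of how the metrics in Table 1 could be computed from model outputs, assuming that coverage and P@k are both read as top-k hit rates over the test set; this is an interpretation of the table captions, not code from the paper.

```python
def precision_at_k(ranked_candidates, true_products, k):
    """Percentage of reactions whose recorded product appears among the
    top-k ranked candidates (reading of P@1/3/5 in Table 1b)."""
    hits = sum(1 for cands, truth in zip(ranked_candidates, true_products)
               if truth in cands[:k])
    return 100.0 * hits / len(true_products)

def center_coverage_at_k(predicted_pairs, true_centers, k):
    """Percentage of reactions whose true reaction-center atom pairs are all
    contained in the top-k predicted atom pairs (coverage in Table 1a)."""
    hits = sum(1 for pred, center in zip(predicted_pairs, true_centers)
               if set(center) <= set(pred[:k]))
    return 100.0 * hits / len(true_centers)
```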
  39. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    (a) Reaction Center Prediction Performance. Coverage is reported by picking the top K (K=6, 8, 10) reactivity pairs. |θ| is the number of model parameters.

                  Method    |θ|     K=6    K=8    K=10
      USPTO-15K   Local     572K    80.1   85.0   87.7
                  Local     1003K   81.6   86.1   89.1
                  Global    756K    86.7   90.1   92.2
      USPTO       Local     572K    83.0   87.2   89.6
                  Local     1003K   82.4   86.7   89.1
                  Global    756K    89.8   92.0   93.3
      Avg. Num. of Candidates (USPTO)
                  Template  482.3 out of 5006
                  Global    60.9 (K=6)   246.5 (K=8)   1076 (K=10)
  40. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  41. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    (b) Candidate Ranking Performance. Precision at ranks 1, 3, 5 is reported. (*) denotes that the true product was added if not covered by the previous stage.

                  Method         Cov.    P@1    P@3    P@5
      USPTO-15K   Coley et al.   100.0   72.1   86.6   90.7
                  WLN            90.1    74.9   84.6   86.3
                  WLDN           90.1    76.7   85.6   86.8
                  WLN (*)        100.0   81.4   92.5   94.8
                  WLDN (*)       100.0   84.1   94.1   96.1
      USPTO       WLN            92.0    73.5   86.1   89.0
                  WLDN           92.0    74.0   86.7   89.5
                  WLN (*)        100.0   76.7   91.0   94.6
                  WLDN (*)       100.0   77.8   91.9   95.4
  42. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    Human Evaluation Setup  Here we describe in detail the human evaluation results in Table 3. The evaluation dataset consists of eight groups, defined by the reaction template popularity as binned in Figure 4b, each with ten instances selected randomly from the USPTO test set. We invited in total ten chemists to predict the product given reactants in all groups.

    Table 3: Human and model performance on 80 reactions randomly selected from the USPTO test set to cover a diverse range of reaction types. The first 8 are experts with rich experience in organic chemistry (graduate and postdoctoral chemists) and the last two are graduate students in chemical engineering who use organic chemistry concepts regularly but have less formal training. Our model performs at the expert chemist level in terms of top 1 accuracy.

      Chemist     56.3  50.0  40.0  63.8  66.3  65.0  40.0  58.8  25.0  16.3
      Our Model   69.1

    Figure 5: Details of human performance. (a) Histogram showing the distribution of question difficulties as evaluated by the average expert performance across all ten performers. (b) Comparison of model performance against human performance for sets of questions as grouped by the average human accuracy shown in Figure 5a.
  43. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  44. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n References

    Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network:
    • Jin, Wengong, et al. "Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network." Advances in Neural Information Processing Systems. 2017.
    Template-based reaction prediction:
    • Wei, Jennifer N., David Duvenaud, and Alán Aspuru-Guzik. "Neural Networks for the Prediction of Organic Chemistry Reactions." ACS Central Science 2.10 (2016): 725-732.
    • Segler, Marwin H. S., and Mark P. Waller. "Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction." Chemistry - A European Journal 23.25 (2017): 5966-5971.
    • Segler, Marwin, Mike Preuss, and Mark P. Waller. "Towards 'AlphaChem': Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies." 2017.
    • Coley, Connor W., et al. "Prediction of Organic Reaction Outcomes Using Machine Learning." ACS Central Science 3.5 (2017): 434-443.
    Template-free reaction prediction:
    • Kayala, Matthew A., et al. "Learning to Predict Chemical Reactions." Journal of Chemical Information and Modeling 51.9 (2011): 2209-2222.
  45. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n References

    Graph representation:
    • Rogers, David, and Mathew Hahn. "Extended-Connectivity Fingerprints." Journal of Chemical Information and Modeling 50.5 (2010): 742-754.
    • Shervashidze, Nino, et al. "Weisfeiler-Lehman Graph Kernels." Journal of Machine Learning Research 12 (2011): 2539-2561.
    Graph CNN:
    • Gilmer, Justin, et al. "Neural Message Passing for Quantum Chemistry." ICML 2017, pages 1263-1272.
    • Duvenaud, David K., et al. "Convolutional Networks on Graphs for Learning Molecular Fingerprints." Advances in Neural Information Processing Systems. 2015.
    • Li, Yujia, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. "Gated Graph Sequence Neural Networks." ICLR 2016.
    • Schütt, Kristof T., et al. "Quantum-Chemical Insights from Deep Tensor Neural Networks." Nature Communications 8 (2017): 13890.
    • Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order Matters: Sequence to Sequence for Sets." ICLR 2016.