of the neighbors of a node. Our objective is to design convolution operators that can be applied to graphs without a regular structure, and without imposing a particular order on the neighbors of a given node. To summarize, we would like to learn a mapping at each node in the graph which has the form $z_i = \sigma\bigl(W;\, x_i, \{x_{n_1}, \ldots, x_{n_k}\}\bigr)$, where $\{n_1, \ldots, n_k\}$ are the neighbors of node $i$ that define the receptive field of the convolution, $\sigma$ is a non-linear activation function, and $W$ are its learned parameters; the dependence on the neighboring nodes as a set represents our intention to learn a function that is order-independent. We present the following two realizations of this operator, which provide the output of a set of filters in a neighborhood of a node of interest that we refer to as the "center node":
$$z_i = \sigma\left( W^C x_i + \frac{1}{|N_i|} \sum_{j \in N_i} W^N x_j + b \right), \qquad (1)$$
where $N_i$ is the set of neighbors of node $i$, $W^C$ is the weight matrix associated with the center node, $W^N$ is the weight matrix associated with neighboring nodes, and $b$ is a vector of biases, one for each filter. The dimensionality of the weight matrices is determined by the dimensionality of the inputs and the number of filters. The computational complexity of this operator on a graph with $n$ nodes, a
[Figure 1: Graph convolution on protein structures. Left: Each residue in a protein is a node in a graph, where the neighborhood of a node is the set of neighboring nodes in the protein structure; each node has features computed from its amino acid sequence and structure, and edges have features describing the relative distance and angle between residues. Right: Schematic description of the convolution operator, which has as its receptive field a set of neighboring residues and produces an activation associated with the center residue.]
neighborhood of size $k$, $F_{\mathrm{in}}$ input features, and $F_{\mathrm{out}}$ output features is $O(k F_{\mathrm{in}} F_{\mathrm{out}} n)$. Construction of the neighborhood is straightforward using a preprocessing step that takes $O(n^2 \log n)$.
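As a concrete illustration of Equation (1), the following is a minimal NumPy sketch of the node-average operator. The function and argument names (node_average_conv, neighbors, W_C, W_N) are our own illustrative choices rather than part of the paper, and the per-node loop could be fully vectorized when every neighborhood has the same fixed size k.

```python
import numpy as np

def node_average_conv(x, neighbors, W_C, W_N, b, activation=np.tanh):
    """Sketch of the node-average operator in Equation (1).

    x         : (n, F_in) array of node features.
    neighbors : list of length n; neighbors[i] is the list of neighbor indices of node i.
    W_C, W_N  : (F_in, F_out) weight matrices for the center node and its neighbors.
    b         : (F_out,) bias vector, one entry per filter.
    Returns an (n, F_out) array of activations z.
    """
    n = x.shape[0]
    z = np.empty((n, W_C.shape[1]))
    for i in range(n):
        nbr = neighbors[i]
        center_term = x[i] @ W_C
        neighbor_term = (x[nbr] @ W_N).mean(axis=0) if len(nbr) > 0 else 0.0
        z[i] = activation(center_term + neighbor_term + b)
    return z
```

Each node touches its k neighbors once and projects F_in-dimensional features onto F_out filters, so the cost of one application matches the O(k F_in F_out n) bound stated above.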
In order to provide for some differentiation between neighbors, we incorporate features on the edges
between each neighbor and the center node as follows:
$$z_i = \sigma\left( W^C x_i + \frac{1}{|N_i|} \sum_{j \in N_i} W^N x_j + \frac{1}{|N_i|} \sum_{j \in N_i} W^E A_{ij} + b \right), \qquad (2)$$
where $W^E$ is the weight matrix associated with edge features.
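Under the same illustrative conventions as the sketch above, Equation (2) can be realized by adding an averaged edge term. Here A is assumed to be a dense (n, n, F_edge) array of edge features, which is only one possible storage layout.

```python
import numpy as np

def node_edge_average_conv(x, A, neighbors, W_C, W_N, W_E, b, activation=np.tanh):
    """Sketch of Equation (2): node-average convolution with averaged edge features.

    x        : (n, F_node) node features.
    A        : (n, n, F_edge) edge features; A[i, j] describes the edge between nodes i and j.
    W_C, W_N : (F_node, F_out) weights for the center node and its neighbors.
    W_E      : (F_edge, F_out) weights for edge features.
    b        : (F_out,) bias vector, one entry per filter.
    """
    n = x.shape[0]
    z = np.empty((n, W_C.shape[1]))
    for i in range(n):
        nbr = neighbors[i]
        node_term = (x[nbr] @ W_N).mean(axis=0) if len(nbr) > 0 else 0.0
        edge_term = (A[i, nbr] @ W_E).mean(axis=0) if len(nbr) > 0 else 0.0
        z[i] = activation(x[i] @ W_C + node_term + edge_term + b)
    return z
```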
For comparison with order-independent methods we propose an order-dependent method, where
order is determined by distance from the center node. In this method each neighbor has unique weight
matrices for nodes and edges:
$$z_i = \sigma\left( W^C x_i + \frac{1}{|N_i|} \sum_{j \in N_i} W^N_j x_j + \frac{1}{|N_i|} \sum_{j \in N_i} W^E_j A_{ij} + b \right). \qquad (3)$$
Here $W^N_j$ and $W^E_j$ are the weight matrices associated with the $j$th node and with the edge connecting the $j$th node to the center node, respectively. This operator is inspired by the PATCHY-SAN method of Niepert et al. [16]. It is more flexible than the order-independent convolutional operators, allowing the learning of distinctions between neighbors at the cost of significantly more parameters.
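Continuing the same illustrative sketch, the order-dependent operator of Equation (3) gives each neighbor position its own weight matrices. This assumes every neighborhood has been truncated or padded to a fixed size k and sorted by distance from the center node; the stacked arrays W_N_pos and W_E_pos are our own naming.

```python
import numpy as np

def order_dependent_conv(x, A, neighbors, W_C, W_N_pos, W_E_pos, b, activation=np.tanh):
    """Sketch of Equation (3): per-position weights for an ordered, fixed-size neighborhood.

    neighbors[i] : the k neighbors of node i, sorted by distance from node i.
    W_N_pos      : (k, F_node, F_out) node weight matrices, one W^N_j per neighbor position j.
    W_E_pos      : (k, F_edge, F_out) edge weight matrices, one W^E_j per neighbor position j.
    """
    n = x.shape[0]
    k = W_N_pos.shape[0]
    z = np.empty((n, W_C.shape[1]))
    for i in range(n):
        nbr = neighbors[i]  # length k, ordered by distance
        node_term = sum(x[nbr[j]] @ W_N_pos[j] for j in range(k)) / k
        edge_term = sum(A[i, nbr[j]] @ W_E_pos[j] for j in range(k)) / k
        z[i] = activation(x[i] @ W_C + node_term + edge_term + b)
    return z
```

The per-position matrices multiply the parameter count by roughly k relative to Equation (2), which is the cost of the added flexibility noted above.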
Multiple layers of these graph convolution operators can be stacked; this learns features that characterize the graph at increasing levels of abstraction and allows information to propagate through the graph, thereby integrating information across regions of increasing size. Furthermore, these operators are rotation-invariant if the features have this property. In convolutional networks, inputs are often downsampled based on the size and stride of the receptive field, and it is also common to use pooling to further reduce the size of the input. Our graph operators, on the other hand, maintain the structure of the graph, which is necessary for the protein interface prediction problem, where we classify pairs of nodes from different graphs rather than entire graphs. Using architectures that consist only of convolutional layers, without downsampling, is common practice in the area of graph convolutional networks, especially when classification is performed at the node or edge level. This practice has support from the success of networks without pooling layers in object recognition [23]. The downside of not downsampling is higher memory and computational cost.
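To make the stacking concrete, here is a hypothetical composition of two node-average layers using the node_average_conv sketch from above; the filter counts (256, 512) mirror the two-layer configuration reported in Table 2, while the graph sizes and weights are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, F_in, k = 50, 70, 10                               # placeholder graph and feature sizes
x = rng.normal(size=(n, F_in))
neighbors = [list(rng.choice(n, size=k, replace=False)) for _ in range(n)]

h = x
for F_prev, F_next in zip([F_in, 256], [256, 512]):   # filter counts per layer
    W_C = rng.normal(scale=0.1, size=(F_prev, F_next))
    W_N = rng.normal(scale=0.1, size=(F_prev, F_next))
    b = np.zeros(F_next)
    h = node_average_conv(h, neighbors, W_C, W_N, b)  # defined in the earlier sketch

# h now holds (n, 512) per-node activations; the graph structure is unchanged,
# so node-level (and pairwise) classification can be applied directly on h.
```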
Related work. Several authors have recently proposed graph convolutional operators that generalize
Method                                   Convolutional Layers
                                         1               2               3               4
No Convolution                           0.812 (0.007)   0.810 (0.006)   0.808 (0.006)   0.796 (0.006)
Diffusion (DCNN) (2 hops) [5]            0.790 (0.014)   –               –               –
Diffusion (DCNN) (5 hops) [5]            0.828 (0.018)   –               –               –
Single Weight Matrix (MFN [9])           0.865 (0.007)   0.871 (0.013)   0.873 (0.017)   0.869 (0.017)
Node Average (Equation (1))              0.864 (0.007)   0.882 (0.007)   0.891 (0.005)   0.889 (0.005)
Node and Edge Average (Equation (2))     0.876 (0.005)   0.898 (0.005)   0.895 (0.006)   0.889 (0.007)
DTNN [21]                                0.867 (0.007)   0.880 (0.007)   0.882 (0.008)   0.873 (0.012)
Order Dependent (Equation (3))           0.854 (0.004)   0.873 (0.005)   0.891 (0.004)   0.889 (0.008)
Table 2: Median area under the receiver operating characteristic curve (AUC) across all complexes in the test set for various graph convolutional methods. Results shown are the average and standard deviation over ten runs with different random seeds. Networks have the following numbers of filters for 1, 2, 3, and 4 layers before merging, respectively: (256), (256, 512), (256, 256, 512), (256, 256, 512, 512). The exception is the DTNN method, which by necessity produces an output with the same dimensionality as its input. Unlike the other methods, diffusion convolution performed best with an RBF with a standard deviation of 2 Å. After merging, all networks have a dense layer with 512 hidden units followed by a binary classification layer. Boldface values indicate the best performance for each method.