Slide 23
Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
Matching Networks [Vinyals+, NIPS2016]
■ The Fully Conditional Embedding g
⁃ Embeds each support example x_i while taking the whole support set S into account
[Figure 1: Matching Networks architecture (screenshot from the paper; surrounding body text clipped)]
[Diagram: each example (x_i, y_i) in the Support Set (S) is embedded by g' and fed through a bidirectional LSTM; the two LSTM outputs and g'(x_i) are summed (+)]
g’: neural network (e.g., VGG or Inception)
a(h_{k-1}, g(x_i)) = \mathrm{softmax}(h_{k-1}^\top g(x_i))
noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content"-based attention, and the softmax in eq. 6 normalizes w.r.t. g(x_i). The read-out r_{k-1} from g(S) is concatenated to h_{k-1}. Since we do K steps of "reads", attLSTM(f'(\hat{x}), g(S), K) = h_K, where h_k is as described in eq. 3.
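The attention read-out described above can be sketched in a few lines. This is a minimal NumPy illustration (not the authors' code), assuming `g_S` holds the embedded support set g(S) as rows and `h_prev` is the previous LSTM output h_{k-1}:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_readout(h_prev, g_S):
    """One 'read' step of content-based attention over g(S), as in eq. 6.

    h_prev: (d,) previous LSTM output h_{k-1}
    g_S:    (|S|, d) embeddings g(x_i) of the support set
    Returns the read-out r_{k-1}, a convex combination of the g(x_i).
    """
    a = softmax(g_S @ h_prev)   # a(h_{k-1}, g(x_i)) for each i
    return a @ g_S              # r_{k-1} = sum_i a_i * g(x_i)

# toy usage
rng = np.random.default_rng(0)
g_S = rng.normal(size=(5, 8))   # 5 support examples, 8-dim embeddings
h = rng.normal(size=8)
r = attention_readout(h, g_S)
```

In the full attLSTM this read-out r_{k-1} would then be concatenated to h_{k-1} before the next of the K read steps.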
A.2 The Fully Conditional Embedding g

In section 2.1.2 we described the encoding function for the elements in the support set S, g(x_i, S), as a bidirectional LSTM. More precisely, let g'(x_i) be a neural network (similar to f' above, e.g. a VGG or Inception model). Then we define

g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)

with:

\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})
\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})

where, as in above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion for \overleftarrow{h}_i starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.
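As a rough sketch of this fully conditional embedding, the following NumPy code runs a forward and a backward recurrence over the support set and adds the skip connection. A simple tanh recurrence stands in for the full LSTM cell of [23], so this only illustrates the data flow (bidirectional pass plus skip connection), not the exact gating:

```python
import numpy as np

def fully_conditional_embedding(g_prime, W_in, W_h):
    """Sketch of g(x_i, S) = h_fwd_i + h_bwd_i + g'(x_i).

    g_prime: (|S|, d) outputs of the base network g' on the support set
    W_in, W_h: (d, d) weights of a toy tanh recurrence
               (standing in for the paper's LSTM)
    """
    n, d = g_prime.shape
    h_fwd = np.zeros((n, d))
    h_bwd = np.zeros((n, d))
    h = np.zeros(d)
    for i in range(n):                # forward recursion, i = 1 .. |S|
        h = np.tanh(g_prime[i] @ W_in + h @ W_h)
        h_fwd[i] = h
    h = np.zeros(d)
    for i in reversed(range(n)):      # backward recursion starts from i = |S|
        h = np.tanh(g_prime[i] @ W_in + h @ W_h)
        h_bwd[i] = h
    return h_fwd + h_bwd + g_prime    # skip connection to the input

# toy usage
rng = np.random.default_rng(1)
gp = rng.normal(size=(4, 6))          # 4 support examples, 6-dim g'(x_i)
W_in = rng.normal(size=(6, 6)) * 0.1
W_h = rng.normal(size=(6, 6)) * 0.1
emb = fully_conditional_embedding(gp, W_in, W_h)
```

Because every h_fwd_i and h_bwd_i depends on the other support examples, each g(x_i, S) is conditioned on the whole set S, which is the point of this embedding.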
B ImageNet Class Splits

Here we define the two class splits used in our full ImageNet experiments – these classes were excluded for training during our one-shot experiments described in section 4.1.2.

Lrand = {n01498041, n01537544, n01580077, n01592084, n01632777, n01644373, n01665541, n01675722, n01688243, n01729977, …
n01818515, n01843383, n01883070, n01950731, n02002724, n02013706, n02092339, n02093256, n02095314, n02097130, … (list clipped in this extract)
⁃ Let g(x_i, S) be the sum of g'(x_i) and the outputs of the Bi-LSTM