
Matching Networks for One Shot Learning

Kazuki Fujikawa
January 19, 2017

Slides from a poster presentation at the NIPS2016 reading group at Preferred Networks.
Original URL: https://www.slideshare.net/KazukiFujikawa/matching-networks-for-one-shot-learning-71257100

Transcript

  1. Matching Networks for One Shot Learning
     https://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning
     Paper introduction, NIPS2016 reading group @ Preferred Networks, 2017/01/19
     Kazuki Fujikawa, AI System Dept., System Management Unit, DeNA Co., Ltd.

  2. Abstract
     ■  One-shot learning with attention and memory
        ⁃  Learn a concept from one or only a few training examples
        ⁃  Train a fully end-to-end nearest neighbor classifier, incorporating the best characteristics of both parametric and non-parametric models
        ⁃  Improved one-shot accuracy on Omniglot from 88.0% to 93.2% compared to competing approaches
     [Figure 1: Matching Networks architecture]

  3. AGENDA
     ■  Introduction
     ■  Related work
        ⁃  One-shot learning
        ⁃  Attention mechanisms
     ■  Matching Networks
     ■  Experiments
        ⁃  Omniglot
        ⁃  ImageNet
        ⁃  Penn Treebank

  4. Supervised Learning
     ■  Learn a correspondence between training data and labels
        ⁃  Requires a large labeled dataset for training (e.g. CIFAR-10 [Krizhevsky+, 2009]: 6,000 examples / class)
        ⁃  It is hard for classifiers to learn new concepts from little data
     [Figure: training phase vs. predicting phase, illustrated with CIFAR-10 classes; https://www.cs.toronto.edu/~kriz/cifar.html]

  5. One-shot Learning
     ■  Learn a concept from one or only a few training examples
        ⁃  The classifier can be (pre-)trained on datasets whose labels are not used in the predicting phase
     [Figure: (pre-)training phase vs. predicting phase (one-shot learning phase), illustrated with CIFAR-10 classes; https://www.cs.toronto.edu/~kriz/cifar.html]

  6. One-shot Learning
     ■  Task: N-way k-shot learning
        ⁃  Separate label sets for training and testing
        ⁃  None of the labels used in the testing phase (one-shot learning phase) are used in the training phase
     [Figure: T: training task and T': testing task, drawn from disjoint CIFAR-10 label sets; https://www.cs.toronto.edu/~kriz/cifar.html]

  7. One-shot Learning
     ■  Task: N-way k-shot learning
        ⁃  T' is used for one-shot learning
        ⁃  T can be used freely for training (e.g. multiclass classification)
     [Figure: T: training task and T': testing task, disjoint CIFAR-10 label sets; https://www.cs.toronto.edu/~kriz/cifar.html]

  8. One-shot Learning
     ■  Task: N-way k-shot learning
        ⁃  L': label set, obtained by sampling N labels from T'
        ⁃  In this figure L' has 3 classes ({automobile, cat, deer}), hence "3-way k-shot learning"
     [Figure: L' = {automobile, cat, deer} sampled from the testing task T'; https://www.cs.toronto.edu/~kriz/cifar.html]

  9. One-shot Learning
     ■  Task: N-way k-shot learning
        ⁃  L': label set, obtained by sampling N labels from T'
        ⁃  S': support set, obtained by sampling k examples from L'
        ⁃  x̂: query, a single example sampled from L'
        ⁃  Task: classify x̂ into the 3 classes {automobile, cat, deer} using the support set S'
     [Figure: label set L', support set S', and query x̂ sampled from the testing task T'; https://www.cs.toronto.edu/~kriz/cifar.html]

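     The sampling above can be made concrete with a minimal Python sketch. It assumes a dict `test_data` mapping each label in T' to a list of examples; the dict and the per-class sampling of k examples are assumptions made for illustration, not part of the slides.

        import random

        def sample_one_shot_task(test_data, n_way=3, k_shot=1):
            """Build an N-way k-shot task: label set L', support set S', and a query x_hat."""
            # L': sample N labels from the testing task T'
            label_set = random.sample(list(test_data.keys()), n_way)
            # S': sample k examples for each label in L'
            support_set = [(x, label)
                           for label in label_set
                           for x in random.sample(test_data[label], k_shot)]
            # query: one more example whose label is in L'
            query_label = random.choice(label_set)
            x_hat = random.choice(test_data[query_label])
            return label_set, support_set, (x_hat, query_label)
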
  10. Related Work (One-shot Learning)
     ■  Convolutional Siamese Network [Koch+, 2015]
        ⁃  Learn an image representation with a siamese neural network
        ⁃  Reuse features from the network for one-shot learning
     [Figure: two CNNs with shared weights answering "same class?"]

  11. Related Work (One-shot Learning)
     ■  Memory-Augmented Neural Networks (MANN) [Santoro+, 2016]
        ⁃  Quickly encode and retrieve new information using an external memory, inspired by the Neural Turing Machine

  12. Related Work (One-shot Learning)
     ■  Siamese Learnet [Bertinetto+, NIPS2016]
        ⁃  Learn the parameters of a network so as to incorporate domain-specific information from a few examples
     [Figure: siamese and learnet architectures that predict the parameters of a network from a single exemplar, replacing static convolutions with dynamic convolutions]

  13. Related Work (Attention Mechanism)
     ■  Sequence to Sequence with Attention [Bahdanau+, 2014]
        ⁃  Attend to the words in the source sentence that are relevant to generating the next target word
     [Figure: model generating the t-th target word y_t given a source sentence (x_1, ..., x_T); context vector c_i = Σ_j α_ij h_j, with α_ij = exp(e_ij) / Σ_k exp(e_ik) and alignment scores e_ij = a(s_{i-1}, h_j)]

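     A minimal numpy sketch of this attention step, computing the weights α and the context vector c from the previous decoder state and the encoder annotations. The parameter matrices Wa, Ua and vector va, and all shapes, are assumptions for illustration.

        import numpy as np

        def additive_attention(s_prev, H, Wa, Ua, va):
            """Bahdanau-style attention.

            s_prev: previous decoder state s_{t-1}, shape (d_s,)
            H:      encoder annotations h_1..h_T, shape (T, d_h)
            Wa (d_a, d_s), Ua (d_a, d_h), va (d_a,): alignment model parameters
            """
            # e_j = va^T tanh(Wa s_{t-1} + Ua h_j): how well input position j matches output position t
            e = np.tanh(H @ Ua.T + Wa @ s_prev) @ va
            alpha = np.exp(e - e.max())
            alpha /= alpha.sum()              # softmax over source positions
            c = alpha @ H                     # context vector: weighted sum of annotations
            return alpha, c
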
  14. Related Work (Attention Mechanism)
     ■  Pointer Networks [Vinyals+, 2015]
        ⁃  Generate the output sequence using a distribution over the dictionary of inputs
     [Figure: (a) sequence-to-sequence vs. (b) Ptr-Net; in the Ptr-Net the content-based attention softmax over the input positions is used directly as the output distribution]

  15. Related Work (Attention Mechanism)
     ■  Sequence to Sequence for Sets [Vinyals+, ICLR2016]
        ⁃  Handle input sets with an extension of the seq2seq framework: the Read-Process-and-Write model
     [Figure: Read-Process-and-Write model; the process block is an LSTM with no inputs whose state q_t attends over memories m_i via a_{i,t} = softmax(f(m_i, q_t)), r_t = Σ_i a_{i,t} m_i, q*_t = [q_t, r_t]]

  16. Matching Networks [Vinyals+, NIPS2016]
     ■  Motivation
        ⁃  One-shot learning needs rapid learning from new examples while retaining the ability to handle common examples
           •  Simple parametric models such as deep classifiers need further optimization to handle new examples
           •  Non-parametric models such as k-nearest neighbors require no optimization, but performance depends on the chosen metric
        ⁃  It could therefore be effective to train an end-to-end nearest-neighbor-based classifier

  17. Matching Networks [Vinyals+, NIPS2016]
     ■  Train the classifier through one-shot learning episodes (see the sketch below)
        ⁃  L: label set, obtained by sampling N labels from T (here {dog, horse, ship})
        ⁃  S: support set, obtained by sampling k examples from L
        ⁃  B: batch, obtained by sampling b examples from L
     [Figure: T: training task, T': testing task; https://www.cs.toronto.edu/~kriz/cifar.html]

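     A minimal sketch of one training episode under this strategy. `train_data` (a dict mapping each label in T to its examples) and `matching_net_loss` (the cross-entropy of P_theta(y | x, S) over the batch) are hypothetical names introduced for illustration; how the b batch examples are split across classes is also an assumption.

        import random

        def sample_training_episode(train_data, n_way=3, k_shot=1, batch_per_class=5):
            """Sample a label set L from T, then a support set S and a batch B from L."""
            label_set = random.sample(list(train_data.keys()), n_way)            # L
            support_set, batch = [], []
            for label in label_set:
                examples = random.sample(train_data[label], k_shot + batch_per_class)
                support_set += [(x, label) for x in examples[:k_shot]]           # S
                batch += [(x, label) for x in examples[k_shot:]]                 # B
            return support_set, batch

        # training loop: minimise the error predicting the labels of B conditioned on S,
        # i.e. maximise E_L[ E_{S,B} [ sum over (x, y) in B of log P_theta(y | x, S) ] ]
        # for step in range(num_steps):
        #     S, B = sample_training_episode(train_data)
        #     loss = matching_net_loss(S, B)
        #     loss.backward(); optimizer.step()
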
  18. Matching Networks [Vinyals+, NIPS2016]
     ■  System Overview
        ⁃  The model predicts P(ŷ | x̂, S) = Σ_i a(x̂, x_i) y_i, a linear combination of the support labels weighted by an attention mechanism a
        ⁃  Embedding functions f and g are parameterized as a simple CNN (e.g. VGG or Inception) or as the fully conditional embedding functions described later
     [Figure 1: Matching Networks architecture; query x̂ embedded by f(x̂, S), support examples x_i embedded by g(x_i)]

  19. Matching Networks [Vinyals+, NIPS2016]
     ■  The Attention Kernel
        ⁃  Compute a softmax over the cosine distance c between f(x̂, S) and g(x_i):
           a(x̂, x_i) = exp(c(f(x̂), g(x_i))) / Σ_{j=1}^k exp(c(f(x̂), g(x_j)))
           •  Similar to a nearest neighbor calculation
        ⁃  Train the network with a cross-entropy loss
     [Figure 1: Matching Networks architecture]

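     A minimal numpy sketch of this attention kernel and the resulting label distribution. `f_x` (the embedded query), `G` (the matrix of support embeddings g(x_i)) and `Y` (one-hot support labels) are names and shapes assumed for illustration.

        import numpy as np

        def matching_net_predict(f_x, G, Y, eps=1e-8):
            """P(y_hat | x_hat, S) = sum_i a(x_hat, x_i) y_i, with a = softmax over cosine similarity.

            f_x: embedded query f(x_hat, S), shape (d,)
            G:   embedded support examples g(x_i), shape (k, d)
            Y:   one-hot support labels y_i, shape (k, n_classes)
            """
            cos = (G @ f_x) / (np.linalg.norm(G, axis=1) * np.linalg.norm(f_x) + eps)
            a = np.exp(cos - cos.max())
            a /= a.sum()                     # softmax over the support set
            return a @ Y                     # label distribution for x_hat

        # training minimises the cross-entropy: -log P(y_hat = true label | x_hat, S)
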
  20. Matching Networks [Vinyals+, NIPS2016]
     ■  The Fully Conditional Embedding g
        ⁃  Embed x_i in consideration of the whole support set S:
           g(x_i, S) = h_i(fwd) + h_i(bwd) + g'(x_i), where h_i(fwd) and h_i(bwd) are the outputs of a Bi-LSTM run over g'(x_1), ..., g'(x_|S|)
           (g': a neural network such as VGG or Inception)
     [Figure: support examples x_i encoded by g' and fed to a Bi-LSTM with a skip connection]

  21. Matching Networks [Vinyals+, NIPS2016]
     ■  The Fully Conditional Embedding g
        ⁃  Embed x_i in consideration of S
           •  Step 1: embed each x_i into a vector with g' (g': a neural network such as VGG or Inception)
     [Figure: each support example x_i mapped to g'(x_i)]

  22. Matching Networks [Vinyals+, NIPS2016]
     ■  The Fully Conditional Embedding g
        ⁃  Embed x_i in consideration of S
           •  Step 2: feed the g'(x_i) into a Bi-LSTM over the support set
     [Figure: g'(x_i) fed into the forward and backward LSTMs]

  23. Matching Networks [Vinyals+, NIPS2016]
     ■  The Fully Conditional Embedding g
        ⁃  Embed x_i in consideration of S
           •  Step 3: let g(x_i, S) be the sum of g'(x_i) and the forward and backward outputs of the Bi-LSTM (a sketch follows below)
     [Figure: skip connection adding g'(x_i) to the Bi-LSTM outputs to give g(x_i, S)]

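     A minimal PyTorch-style sketch of this embedding, assuming `g_prime` is any image encoder (e.g. a VGG- or Inception-like CNN) that returns one embed_dim vector per support example; treating the whole support set as a single sequence in one batch is a simplification.

        import torch
        import torch.nn as nn

        class FullyConditionalEmbeddingG(nn.Module):
            """g(x_i, S) = h_i(fwd) + h_i(bwd) + g'(x_i), with a Bi-LSTM run over the support set."""

            def __init__(self, g_prime, embed_dim):
                super().__init__()
                self.g_prime = g_prime          # e.g. a CNN such as VGG or Inception
                self.bilstm = nn.LSTM(embed_dim, embed_dim, bidirectional=True, batch_first=True)

            def forward(self, support_images):              # (k, C, H, W)
                g0 = self.g_prime(support_images)           # g'(x_i), shape (k, embed_dim)
                h, _ = self.bilstm(g0.unsqueeze(0))         # (1, k, 2 * embed_dim)
                h_fwd, h_bwd = h[0].chunk(2, dim=-1)        # forward / backward outputs
                return h_fwd + h_bwd + g0                   # skip connection to g'(x_i)
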
  24. Matching Networks [Vinyals+, NIPS2016]
     ■  The Fully Conditional Embedding f
        ⁃  Embed the query x̂ in consideration of S through K "processing" steps:
           ĥ_k, c_k = LSTM(f'(x̂), [h_{k-1}, r_{k-1}], c_{k-1})   (2)
           h_k = ĥ_k + f'(x̂)   (3)
           r_{k-1} = Σ_{i=1}^{|S|} a(h_{k-1}, g(x_i)) g(x_i)   (4)
           a(h_{k-1}, g(x_i)) = softmax(h_{k-1}^T g(x_i))   (5)
           f(x̂, S) = h_K
     [Figure: attention LSTM over the query embedding f'(x̂), reading the support embeddings g(x_i) via a weighted sum]

  25. Matching Networks [Vinyals+, NIPS2016]
     ■  The Fully Conditional Embedding f
        ⁃  Embed the query x̂ in consideration of S
           •  Step 1: h_1 = LSTM(f'(x̂), [ĥ_0, r_0], c_0) + f'(x̂); h_1 is calculated without using S
     [Figure: first processing step of the attention LSTM, driven only by the query embedding f'(x̂)]

  26. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Matching Networks [Vinyals+,

NIPS2016] n  The Fully Conditional Embedding f ⁃  Embed the query x̂ conditioned on the support set S
•  Calculate the relevance between h_1 and each support embedding g(x_i) with a softmax over inner products:
   a(h_1, g(x_i)) = exp(h_1ᵀ g(x_i)) / Σ_{j=1}^{|S|} exp(h_1ᵀ g(x_j))
[Diagram: the hidden state h_1 is compared against every g(x_i) in the support set; the softmax yields the attention weights a(h_1, g(x_i))]
  27. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Matching Networks [Vinyals+,

NIPS2016] n  The Fully Conditional Embedding f ⁃  Embed the query x̂ conditioned on the support set S
•  r_1 is a sum of the support embeddings g(x_i), weighted by their relevance to h_1:
   r_1 = Σ_{i=1}^{|S|} a(h_1, g(x_i)) g(x_i)
[Diagram: the attention weights a(h_1, g(x_i)) combine the g(x_i) into the read-out r_1 (weighted sum)]
  28. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Matching Networks [Vinyals+,

NIPS2016] n  The Fully Conditional Embedding f ⁃  Embed the query x̂ conditioned on the support set S
•  The hidden state of the next step is calculated using S, because the read-out r_1 now enters the LSTM:
   h_2 = LSTM(f'(x̂), [h_1, r_1], c_1) + f'(x̂)
[Diagram: f'(x̂) and the pair [h_1, r_1] feed the LSTM; adding the skip connection f'(x̂) gives the next hidden state]
  29. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Matching Networks [Vinyals+,

NIPS2016] n  The Fully Conditional Embedding f ⁃  Embed the query x̂ conditioned on the support set S
•  Repeat the attention-and-LSTM update for K "processing" steps and take the final hidden state as the output: f(x̂, S) = h_K (a full sketch of this recurrence follows below)
[Diagram: the recurrence over steps k with read-out r_{k-1} and attention a(h_{k-1}, g(x_i)); the final state h_K is the fully conditional embedding of the query]
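Putting slides 24-29 together, here is a self-contained NumPy sketch of the fully conditional embedding f(x̂, S) = h_K (Eqs. (2)-(5)). The LSTM cell is written out by hand so that its recurrent input is the concatenation [h_{k-1}, r_{k-1}] exactly as in Eq. (2); the weight shapes and parameter names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fce_f(fx, G, params, K=3):
    """Fully conditional embedding f(x_hat, S) = h_K (Eqs. (2)-(5)).

    fx     : (d,)      f'(x_hat), the query embedded by the base network f'
    G      : (|S|, d)  embedded support set, row i = g(x_i)
    params : dict of LSTM weights (assumed names):
             Wx : (4d, d)   acts on the input f'(x_hat)
             Wh : (4d, 2d)  acts on the recurrent input [h_{k-1}, r_{k-1}]
             b  : (4d,)
    """
    d = fx.shape[0]
    h = np.zeros(d)          # h_0
    c = np.zeros(d)          # c_0
    r = np.zeros(d)          # r_0 (no information from S yet, cf. slide 25)
    Wx, Wh, b = params["Wx"], params["Wh"], params["b"]
    for _ in range(K):
        # Eq. (2): one LSTM step with recurrent input [h_{k-1}, r_{k-1}]
        gates = Wx @ fx + Wh @ np.concatenate([h, r]) + b
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_hat = sigmoid(o) * np.tanh(c)
        # Eq. (3): skip connection from the query embedding
        h = h_hat + fx
        # Eqs. (4)-(5): attention read over the support set, used at the next step
        a = softmax(G @ h)
        r = a @ G
    return h                  # f(x_hat, S) = h_K

# toy usage
rng = np.random.default_rng(0)
d, n_support = 64, 5
params = {"Wx": rng.normal(scale=0.1, size=(4 * d, d)),
          "Wh": rng.normal(scale=0.1, size=(4 * d, 2 * d)),
          "b":  np.zeros(4 * d)}
f_embed = fce_f(rng.normal(size=d), rng.normal(size=(n_support, d)), params, K=3)
print(f_embed.shape)          # (64,)
```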
  30. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Settings n 

Datasets (N-way k-shot episodes are sampled from each of these; see the sketch below)
⁃  Image classification
•  Omniglot [Lake+, 2011]
•  ImageNet [Deng+, 2009]
⁃  Language modeling
•  Penn Treebank [Marcus+, 1993]
ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
[Excerpt from the paper (Sec. 4.1.3) describing the one-shot language modeling task on Penn Treebank; the example is reproduced on slide 36]
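All the experiments below are evaluated on N-way k-shot episodes sampled from these datasets. A minimal sketch of how such an episode could be sampled; the toy data and the function name are illustrative, not the authors' code.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=1, rng=random):
    """Sample one N-way k-shot episode: a support set S and held-out queries.

    data_by_class : dict label -> list of examples
    returns (support, queries); support holds k_shot (example, label) pairs
    per sampled class, queries holds unseen examples of the same classes.
    """
    labels = rng.sample(sorted(data_by_class), n_way)       # label set L'
    support, queries = [], []
    for lab in labels:
        examples = rng.sample(data_by_class[lab], k_shot + n_query)
        support += [(x, lab) for x in examples[:k_shot]]
        queries += [(x, lab) for x in examples[k_shot:]]
    return support, queries

# toy usage with a fake dataset: 20 classes, 30 examples each
data = {f"class_{c}": [f"example_{c}_{i}" for i in range(30)] for c in range(20)}
support, queries = sample_episode(data, n_way=5, k_shot=1)
print(len(support), len(queries))   # 5 support items, 5 query items
```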
  31. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Settings (Omniglot)

n  Baseline
⁃  Matching on raw pixels (cosine nearest neighbour; see the sketch below)
⁃  Matching on discriminative features from VGG (Baseline classifier)
⁃  MANN
⁃  Convolutional Siamese Network
n  Datasets
⁃  training: 1200 characters
⁃  testing: 423 characters
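The "matching on raw pixels" baseline is, in effect, cosine nearest-neighbour matching between the query image and the support images; a minimal sketch with flattened image vectors. The Baseline classifier performs the same matching on VGG features instead of raw pixels. Function and variable names are illustrative.

```python
import numpy as np

def cosine_nn_predict(query, support_images, support_labels):
    """Predict the label of `query` as that of its nearest support image
    under cosine similarity (the PIXELS / Cosine baseline).

    query          : (d,)    flattened query image
    support_images : (n, d)  flattened support images
    support_labels : (n,)    labels of the support images
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    S = support_images / (np.linalg.norm(support_images, axis=1, keepdims=True) + 1e-8)
    sims = S @ q                                   # cosine similarity to each support image
    return support_labels[int(np.argmax(sims))]

# toy 5-way 1-shot episode with random "images"
rng = np.random.default_rng(0)
support = rng.random((5, 28 * 28))
labels = np.array([0, 1, 2, 3, 4])
query = support[2] + 0.01 * rng.random(28 * 28)    # close to class 2
print(cosine_nn_predict(query, support, labels))   # -> 2
```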
  32. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Results (Omniglot)

n  Fully Conditional Embedding (FCE) did not seem to help much on Omniglot
n  The Baseline classifier and the Siamese Net improved with fine-tuning

Table 1: Results on the Omniglot dataset.
Model                            Matching Fn  Fine Tune  5-way 1-shot  5-way 5-shot  20-way 1-shot  20-way 5-shot
PIXELS                           Cosine       N          41.7%         63.2%         26.7%          42.6%
BASELINE CLASSIFIER              Cosine       N          80.0%         95.0%         69.5%          89.1%
BASELINE CLASSIFIER              Cosine       Y          82.3%         98.4%         70.6%          92.0%
BASELINE CLASSIFIER              Softmax      Y          86.0%         97.6%         72.9%          92.3%
MANN (NO CONV) [21]              Cosine       N          82.8%         94.9%         -              -
CONVOLUTIONAL SIAMESE NET [11]   Cosine       N          96.7%         98.4%         88.0%          96.5%
CONVOLUTIONAL SIAMESE NET [11]   Cosine       Y          97.3%         98.4%         88.1%          97.0%
MATCHING NETS (OURS)             Cosine       N          98.1%         98.9%         93.8%          98.5%
MATCHING NETS (OURS)             Cosine       Y          97.9%         98.7%         93.5%          98.7%
(The reported accuracies are averages over many sampled N-way episodes; a schematic of that evaluation loop follows below.)
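A schematic of the evaluation loop behind Table 1, simplified to a single query per episode, with a placeholder `predict_fn` standing in for whichever matching rule is being tested (raw-pixel cosine matching, baseline-classifier features, or Matching Nets). The toy data and classifier are illustrative only.

```python
import random

def evaluate_one_shot(data_by_class, predict_fn, n_way=5, k_shot=1,
                      n_episodes=1000, rng=random):
    """Estimate N-way k-shot accuracy by averaging over sampled episodes.

    data_by_class : dict label -> list of examples (test classes only)
    predict_fn    : callable(query, support) -> predicted label, where
                    support is a list of (example, label) pairs
    """
    correct = 0
    for _ in range(n_episodes):
        labels = rng.sample(sorted(data_by_class), n_way)
        support, held_out = [], {}
        for lab in labels:
            picks = rng.sample(data_by_class[lab], k_shot + 1)
            support += [(x, lab) for x in picks[:k_shot]]
            held_out[lab] = picks[k_shot]          # unseen example of this class
        query_label = rng.choice(labels)
        query = held_out[query_label]
        correct += int(predict_fn(query, support) == query_label)
    return correct / n_episodes

# toy usage: a "classifier" that matches on the first character of the example id
data = {f"char_{c}": [f"{c}_{i}" for i in range(20)] for c in range(10)}
predict = lambda q, support: max(support, key=lambda s: s[0][0] == q[0])[1]
print(evaluate_one_shot(data, predict, n_way=5, k_shot=1, n_episodes=200))
```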
  33. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Settings (ImageNet)

n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from InceptionV3 (Baseline classifier)
n  Datasets (disjoint class splits; see the sketch below)
⁃  miniImageNet (size: 84x84)
•  training: 80 classes
•  testing: 20 classes
⁃  randImageNet
•  training: randomly picked classes (882 classes)
•  testing: remaining classes (118 classes)
⁃  dogsImageNet
•  training: all non-dog classes (882 classes)
•  testing: dog classes (118 classes)
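The three ImageNet setups differ only in how the 1000 ImageNet classes are split into disjoint training and test label sets (miniImageNet is analogous, with an 80/20 class split at 84x84 resolution). A schematic sketch; the class list and the `is_dog` predicate are placeholders, not part of the paper's code.

```python
import random

def split_classes(all_classes, mode, is_dog=None, seed=0):
    """Return (train_classes, test_classes) with disjoint label sets.

    mode = "rand": 882 random training classes, the remaining 118 for testing
    mode = "dogs": all non-dog classes for training, dog classes for testing
    `is_dog` is a placeholder predicate mapping a class name to True/False.
    """
    rng = random.Random(seed)
    if mode == "rand":
        shuffled = list(all_classes)
        rng.shuffle(shuffled)
        return shuffled[118:], shuffled[:118]
    if mode == "dogs":
        test = [c for c in all_classes if is_dog(c)]
        train = [c for c in all_classes if not is_dog(c)]
        return train, test
    raise ValueError(mode)

# toy usage with made-up class names
classes = [f"class_{i}" for i in range(1000)]
train, test = split_classes(classes, "rand")
print(len(train), len(test))    # 882 118
```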
  34. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Results (miniImageNet)

Figure 2: Example of two 5-way problem instances on ImageNet. The images in the set S' contain classes never seen during training. Our model makes far fewer mistakes than the Inception baseline.

Table 2: Results on miniImageNet.
Model                  Matching Fn    Fine Tune  5-way 1-shot  5-way 5-shot
PIXELS                 Cosine         N          23.0%         26.6%
BASELINE CLASSIFIER    Cosine         N          36.6%         46.0%
BASELINE CLASSIFIER    Cosine         Y          36.2%         52.2%
BASELINE CLASSIFIER    Softmax        Y          38.4%         51.2%
MATCHING NETS (OURS)   Cosine         N          41.2%         56.2%
MATCHING NETS (OURS)   Cosine         Y          42.4%         58.0%
MATCHING NETS (OURS)   Cosine (FCE)   N          44.2%         57.0%
MATCHING NETS (OURS)   Cosine (FCE)   Y          46.6%         60.0%

n  Matching Networks outperformed all baselines
n  Fully Conditional Embedding (FCE) proved effective at improving performance on this task
  35. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Results (randImageNet,

dogsImageNet)
Table 3: Results on full ImageNet on rand and dogs one-shot tasks. Note that ≠L_rand and ≠L_dogs are sets of classes which are seen during training, but are provided for completeness.
Model                  Matching Fn     Fine Tune   L_rand   ≠L_rand   L_dogs   ≠L_dogs
PIXELS                 Cosine          N           42.0%    42.8%     41.4%    43.0%
INCEPTION CLASSIFIER   Cosine          N           87.6%    92.6%     59.8%    90.0%
MATCHING NETS (OURS)   Cosine (FCE)    N           93.2%    97.0%     58.8%    96.4%
INCEPTION ORACLE       Softmax (Full)  Y (Full)    ≈99%     ≈99%      ≈99%     ≈99%

n  Matching Networks outperformed the Inception Classifier on L_rand, but degraded on L_dogs
n  The performance drop on L_dogs may be caused by the mismatch between the label distributions at training and testing time
⁃  The training support sets are sampled from a random distribution over labels, whereas the test support sets from L_dogs contain similar, fine-grained classes
⁃  The authors suggest that adapting the training strategy to sample S from fine-grained sets of labels, instead of sampling uniformly from the leaves of the ImageNet class tree, could yield improvements; this is left as future work
  36. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Settings (Penn

Treebank)
n  Task: fill in a blank in a query sentence with a label from the support set
⁃  Each support sentence has one missing word and a corresponding 1-hot label; given a query sentence with a missing word, the model chooses the label of the support sentence that best matches the query
⁃  Sentences are encoded with LSTMs (g for the support set, f for the query), and the prediction is the attention-weighted sum of the support labels (this matching step is sketched below):
   P(ŷ | x̂, S) = Σ_{i=1}^{k} a(x̂, x_i) y_i
   a(x̂, x_i) = exp(c(f(x̂), g(x_i))) / Σ_{j=1}^{k} exp(c(f(x̂), g(x_j))),  c: cosine distance
⁃  Example from the paper (Sec. 4.1.3): the five support sentences have the blanked-out words "prominent", "series", "dollar", "towel", "comprehensive"; for the query "in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday ...", the correct choice is "dollar"
⁃  The words are split into a randomly sampled 9000 for training and 1000 for testing, so neither the words nor the sentences used during test time have been seen during training
[Diagram: support sentences x_i and the query x̂ encoded by LSTMs; attention a over g(x_i) yields f(x̂, S) and the predicted label]
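A compact NumPy sketch of the matching rule shown on this slide, P(ŷ | x̂, S) = Σ_i a(x̂, x_i) y_i with a softmax over cosine similarities. The embeddings f(x̂) and g(x_i) are taken as given (random vectors here; in the Penn Treebank task they would come from the LSTM sentence encoders). Names are illustrative, not the paper's code.

```python
import numpy as np

def matching_predict(f_query, g_support, y_onehot):
    """P(y_hat | x_hat, S) = sum_i a(x_hat, x_i) y_i with cosine-softmax attention.

    f_query   : (d,)     embedding f(x_hat) of the query
    g_support : (k, d)   embeddings g(x_i) of the support items
    y_onehot  : (k, N)   one-hot labels of the support items (N classes)
    """
    q = f_query / (np.linalg.norm(f_query) + 1e-8)
    G = g_support / (np.linalg.norm(g_support, axis=1, keepdims=True) + 1e-8)
    cos = G @ q                                  # cosine similarity c(f(x_hat), g(x_i))
    a = np.exp(cos - cos.max())
    a /= a.sum()                                 # softmax attention a(x_hat, x_i)
    return a @ y_onehot                          # label distribution over the N classes

# toy 5-way 1-shot example
rng = np.random.default_rng(0)
g = rng.normal(size=(5, 32))
y = np.eye(5)                                    # one one-hot label per support item
probs = matching_predict(g[3] + 0.05 * rng.normal(size=32), g, y)
print(probs.argmax(), probs.round(2))            # most of the mass on class 3
```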
  37. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Experimental Settings and

Results (Penn Treebank)
n  Baseline
⁃  Oracle LSTM-LM
•  Trained on all the words (not one-shot)
•  Consider this model as an upper bound
n  Datasets
⁃  training: 9000 words
⁃  testing: 1000 words
n  Results (5-way accuracy)
   Model            1-shot    2-shot    3-shot
   Matching Nets    32.4%     36.1%     38.2%
   Oracle LSTM-LM   (72.8%)   -         -
  38. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Conclusion n  They

proposed Matching Networks: a nearest-neighbor-based approach trained fully end-to-end
n  Keypoints
⁃  "One-shot learning is much easier if you train the network to do one-shot learning" [Vinyals+, 2016]
⁃  Matching Networks have a non-parametric structure, and can therefore incorporate new examples rapidly
n  Findings
⁃  Matching Networks improved performance on Omniglot, miniImageNet, and randImageNet, but degraded on dogsImageNet
⁃  One-shot learning over fine-grained sets of labels remains difficult, and could be an exciting challenge in this area
  39. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. References n  Matching

Networks
⁃  Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
n  One-shot Learning
⁃  Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss. University of Toronto, 2015.
⁃  Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." Proceedings of The 33rd International Conference on Machine Learning. 2016.
⁃  Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural Information Processing Systems. 2016.
n  Attention Mechanisms
⁃  Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
⁃  Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
⁃  Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to sequence for sets." ICLR 2016.
  40. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. References n  Datasets

⁃  Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).
⁃  Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." Computer Vision and Pattern Recognition (CVPR 2009). IEEE, 2009.
⁃  Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
⁃  Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large annotated corpus of English: The Penn Treebank." Computational Linguistics 19.2 (1993): 313-330.