Slide 58
Slide 58 text
Experiments
Table 2: Omniglot results. MAML results are taken from the original work of Finn et al. [15], and first-order
MAML and Reptile results are from Nichol et al. [43]. iMAML with gradient descent (GD) uses 16 and 25 steps
for 5-way and 20-way tasks respectively. iMAML with Hessian-free uses 5 CG steps to compute the search
direction and performs line-search to pick step size. Both versions of iMAML use λ = 2.0 for regularization,
and 5 CG steps to compute the task meta-gradient.
Algorithm 5-way 1-shot 5-way 5-shot 20-way 1-shot 20-way 5-shot
MAML [15] 98.7 ± 0.4% 99.9 ± 0.1% 95.8 ± 0.3% 98.9 ± 0.2%
first-order MAML [15] 98.3 ± 0.5% 99.2 ± 0.2% 89.4 ± 0.5% 97.9 ± 0.1%
Reptile [43] 97.68 ± 0.04% 99.48 ± 0.06% 89.43 ± 0.14% 97.12 ± 0.32%
iMAML, GD (ours) 99.16 ± 0.35% 99.67 ± 0.12% 94.46 ± 0.42% 98.69 ± 0.1%
iMAML, Hessian-Free (ours) 99.50 ± 0.26% 99.74 ± 0.11% 96.18 ± 0.36% 99.14 ± 0.1%
dataset for different numbers of class labels and shots (in the N-way, K-shot setting), and compare
two variants of iMAML with published results of the most closely related algorithms: MAML,
FOMAML, and Reptile. While these methods are not state-of-the-art on this benchmark, they provide an apples-to-apples comparison for studying the use of implicit gradients in optimization-based
meta-learning. For a fair comparison, we use the identical convolutional architecture as these prior
works. Note however that architecture tuning can lead to better results for all algorithms [27].
The first variant of iMAML we consider involves solving the inner level problem (the regularized
objective function in Eq. 4) using gradient descent. The meta-gradient is computed using conjugate
gradient, and the meta-parameters are updated using Adam. This presents the most straightforward
comparison with MAML, which would follow a similar procedure, but backpropagate through the
path of optimization as opposed to invoking implicit differentiation. The second variant of iMAML
uses a second order method for the inner level problem. In particular, we consider the Hessian-free
or Newton-CG [44, 36] method. This method makes a local quadratic approximation to the objective
function (in our case, G(φ′, θ)), and approximately computes the Newton search direction using CG.
Since CG requires only Hessian-vector products, this way of approximating the Newton search direction
is scalable to large deep neural networks. The step size can be computed using regularization,
damping, trust-region, or line-search. We use a line-search on the training loss in our experiments to
also illustrate how our method can handle non-differentiable inner optimization loops. We refer the
readers to Nocedal & Wright [44] and Martens [36] for a more detailed exposition of this optimization
algorithm. Similar approaches have also gained prominence in reinforcement learning [52, 47].
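To make the meta-gradient computation described above concrete, the following is a minimal NumPy sketch (not the authors' implementation), assuming the implicit-gradient result used by iMAML: the task meta-gradient is obtained by solving (I + ∇²L̂(φ*)/λ) x = ∇L_test(φ*) with a few CG iterations, which touches the inner-loss Hessian only through Hessian-vector products. The names cg_meta_gradient, hvp, and n_cg_steps are illustrative, not taken from any released code:

import numpy as np

def cg_meta_gradient(hvp, g, lam, n_cg_steps=5):
    """Approximately solve (I + H/lam) x = g with conjugate gradient.

    hvp: callable v -> H @ v, the Hessian-vector product of the inner
         training loss at the inner solution phi*
    g:   gradient of the outer (test) loss at phi*
    lam: strength of the proximal regularization term
    """
    Ax = lambda v: v + hvp(v) / lam      # matrix-vector product for (I + H/lam)
    x = np.zeros_like(g)
    r = g - Ax(x)                        # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(n_cg_steps):
        if rs < 1e-12:                   # already converged
            break
        Ap = Ax(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x                             # approximate task meta-gradient

# Toy check on a quadratic inner loss with constant Hessian A, where the
# linear system can also be solved exactly for comparison.
rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
A = M @ M.T + np.eye(10)                 # symmetric positive-definite Hessian
g = rng.standard_normal(10)              # stand-in for the outer-loss gradient
lam = 2.0
approx = cg_meta_gradient(lambda v: A @ v, g, lam, n_cg_steps=25)
exact = np.linalg.solve(np.eye(10) + A / lam, g)
print(np.allclose(approx, exact, atol=1e-6))   # True once CG has converged

Because the matrix (I + H/λ) is never formed, the memory cost is the same as a few extra gradient evaluations, which is what makes this practical for large networks.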
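The Hessian-free inner loop can be sketched in the same spirit. The following is a minimal NumPy illustration under the assumption that the inner objective is G(φ, θ) = L̂(φ) + (λ/2)‖φ − θ‖², with the backtracking line search performed on this regularized training objective (one plausible reading of the line-search described above); newton_cg_inner, cg_solve, loss_grad, and loss_hvp are hypothetical helper names:

import numpy as np

def cg_solve(Ax, b, n_steps):
    """Approximately solve the SPD system Ax(x) = b with conjugate gradient."""
    x = np.zeros_like(b)
    r = b - Ax(x)
    p = r.copy()
    rs = r @ r
    for _ in range(n_steps):
        if rs < 1e-12:
            break
        Ap = Ax(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def newton_cg_inner(theta, loss, loss_grad, loss_hvp, lam,
                    n_newton_steps=5, n_cg_steps=5):
    """Minimize G(phi) = loss(phi) + (lam/2)||phi - theta||^2 starting from theta."""
    G = lambda p: loss(p) + 0.5 * lam * np.sum((p - theta) ** 2)
    phi = theta.copy()
    for _ in range(n_newton_steps):
        g = loss_grad(phi) + lam * (phi - theta)              # gradient of G at phi
        Hv = lambda v, phi=phi: loss_hvp(phi, v) + lam * v    # HVP of G at phi
        direction = cg_solve(Hv, -g, n_cg_steps)              # approximate Newton step
        # Backtracking line search on G to pick the step size.
        step, G0 = 1.0, G(phi)
        while G(phi + step * direction) > G0 and step > 1e-4:
            step *= 0.5
        phi = phi + step * direction
    return phi

# Toy check: quadratic training loss 0.5 * phi' A phi - b' phi, for which the
# regularized minimizer solves (A + lam I) phi = b + lam * theta exactly.
rng = np.random.default_rng(1)
M = rng.standard_normal((8, 8))
A = M @ M.T + np.eye(8)
b = rng.standard_normal(8)
theta = rng.standard_normal(8)
lam = 2.0
phi_star = newton_cg_inner(theta,
                           loss=lambda p: 0.5 * p @ A @ p - b @ p,
                           loss_grad=lambda p: A @ p - b,
                           loss_hvp=lambda p, v: A @ v,
                           lam=lam, n_cg_steps=8)
exact = np.linalg.solve(A + lam * np.eye(8), b + lam * theta)
print(np.allclose(phi_star, exact, atol=1e-5))   # True for this quadratic example

The line search only needs evaluations of the objective, not its gradient, which is why this inner loop can tolerate non-differentiable steps as noted above.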
Table 3: Mini-ImageNet 5-way-1-shot accuracy
Algorithm 5-way 1-shot
MAML 48.70 ± 1.84 %
first-order MAML 48.07 ± 1.75 %
Reptile 49.97 ± 0.32 %
iMAML GD (ours) 48.96 ± 1.84 %
iMAML HF (ours) 49.30 ± 1.88 %
Tables 2 and 3 present the results on Omniglot and Mini-ImageNet, respectively. On the Omniglot
domain, we find that the GD version of iMAML is competitive with the full MAML algorithm,
and substantially better than its approximations (i.e., first-order MAML and Reptile), especially
for the harder 20-way tasks. We also find that iMAML with Hessian-free optimization performs
substantially better than the other methods, suggesting that powerful optimizers in the inner loop
can offer benefits to meta-learning. In the Mini-ImageNet domain, we find that iMAML performs
better than MAML and FOMAML. We used λ = 0.5 and 10 gradient steps in the inner loop. We did
not perform an extensive hyperparameter sweep, and expect that the results can improve with better
hyperparameters. 5 CG steps were used to compute the meta-gradient. The Hessian-free version also
uses 5 CG steps for the search direction. Additional experimental details are in Appendix F.
Related Work
Our work considers the general meta-learning problem [51, 55, 41], including few-shot learning
[30, …]. Meta-learning approaches can generally be categorized into metric-learning approaches that
learn an embedding space where non-parametric nearest neighbors works well [29, 57, 54, 45, 3],
black-box approaches that train a recurrent or recursive neural network to take datapoints as input
• Comparison of accuracy on Omniglot (top) and Mini-ImageNet (bottom)
• iMAML using Hessian-free optimization for the inner problem is the strongest
• FOMAML and Reptile, which are likewise approximation-based methods, show a large drop in accuracy on the harder task (20-way 1-shot)