
MAML and Its Derivatives: A Survey

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML) (ICML 2017)
On First-Order Meta-Learning Algorithms (Reptile) (OpenAI 2018)
RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL BAYES (LLAMA) (ICLR 2018)
Bayesian Model-Agnostic Meta-Learning (BMAML) (NeurIPS 2018)
Probabilistic Model-Agnostic Meta-Learning (PLATIPUS) (NeurIPS 2018)
HOW TO TRAIN YOUR MAML (MAML++) (ICLR 2019)
Meta-Learning with Implicit Gradients (iMAML) (NeurIPS 2019)
Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation (MMAML) (NeurIPS 2019)

Yusuke-Takagi-Q

March 18, 2020

Transcript

  1. MAML and Its Derivatives: A Survey
    Yusuke Takagi
    Nagoya Institute of Technology
    Takeuchi & Karasuyama Lab
    2020/03/18


  2. Outline
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML

  3. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  4. Papers covered
    • MAML, one of the meta-learning algorithms, has many derivatives
    • These slides cover the following papers:
    • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML) (ICML 2017)
    • On First-Order Meta-Learning Algorithms (Reptile) (OpenAI 2018)
    • RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL BAYES (LLAMA) (ICLR 2018)
    • Bayesian Model-Agnostic Meta-Learning (BMAML) (NeurIPS 2018)
    • Probabilistic Model-Agnostic Meta-Learning (PLATIPUS) (NeurIPS 2018)
    • HOW TO TRAIN YOUR MAML (MAML++) (ICLR 2019)
    • Meta-Learning with Implicit Gradients (iMAML) (NeurIPS 2019)
    • Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation (MMAML) (NeurIPS 2019)

  5. Papers covered
    • Reinterpreting MAML as hierarchical Bayes
    • LLAMA
    • Reducing MAML's gradient-computation cost
    • (FOMAML), Reptile, iMAML
    • Extending MAML
    • BMAML, PLATIPUS, MAML++, MMAML

  6. Notes
    • These slides explain the methods mainly through image-classification tasks
    • Notation is unified across almost all methods
    • note: it often differs from the symbols used in each paper
    • Unless otherwise noted, figures are taken from the corresponding paper or made by the author

  7. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  8. Notation
    The following notation is used consistently in the rest of the slides.
    • Task T_i: a pair of a dataset D_i and a loss function L_i
    • D_i = {x_ik, y_ik}_{k=1}^K
    • During training, D_i is split into D_i^tr and D_i^test
    • L_i(ϕ_i, D_i)
    • Each task T_i is drawn from a task distribution P(T)
    • for image classification this is usually the few-shot learning setting
    • θ: the initialization (meta) parameters
    • ϕ_i: the task-specific parameters of T_i
    • Model f_{ϕ_i}(x): X → Y
    • Other symbols are explained as they appear

  9. MAML overview
    [Screenshots: the abstract and Figure 1 of the MAML paper (Finn et al., 2017), showing MAML
    optimizing a representation θ that can quickly adapt to new tasks, and Figure 1 of Rajeswaran
    et al. (2019), contrasting how MAML, first-order MAML, and implicit MAML compute the
    meta-gradient.]
    [Fig: Finn et al. 2017 / Rajeswaran et al. 2019]
    • Learn an initialization θ from which a new task T_i drawn from P(T) can be learned as quickly
      as possible and from as few samples as possible
    • "new task" = classification over classes not used during training
    • Model-Agnostic
    • apart from differentiability, no assumptions are made on the form of the model or the loss
    • Task-Agnostic
    • applicable to many kinds of tasks: regression, classification, reinforcement learning, etc.

  10. MAML algorithm
    The MAML algorithm itself is very simple and repeats the following steps:
    • Sample several tasks T_i from P(T)
    • Set ϕ_i^0 = θ and learn the task-specific parameters ϕ_i of each task T_i by gradient descent:
      for s = 1, ..., S
        ϕ_i^s = ϕ_i^{s-1} − α ∇_{ϕ_i^{s-1}} L_i(ϕ_i^{s-1}, D_i^tr)
    • Update θ so as to reduce each task's test error:
        θ = θ − β ∇_θ Σ_i L_i(ϕ_i^S, D_i^test)
    The whole procedure of sampling tasks and updating θ is called the outer loop;
    the training on each individual task is called the inner loop.
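    As a rough illustration (not the authors' code), a minimal PyTorch-style sketch of this inner/outer
    loop; it assumes PyTorch 2.x for torch.func.functional_call, and task_batch is assumed to be an
    iterable of (D_tr, D_test) tensor pairs:

      import torch

      def inner_adapt(model, loss_fn, params, D_tr, alpha, steps):
          """Inner loop: S gradient steps on the task's training split, keeping the
          computation graph so the outer loop can differentiate through the updates."""
          x_tr, y_tr = D_tr
          for _ in range(steps):
              loss = loss_fn(torch.func.functional_call(model, params, (x_tr,)), y_tr)
              grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
              params = {name: p - alpha * g for (name, p), g in zip(params.items(), grads)}
          return params

      def maml_outer_step(model, loss_fn, meta_opt, task_batch, alpha=0.01, steps=5):
          """Outer loop: sum of test-split losses at the adapted parameters,
          differentiated with respect to the shared initialization theta."""
          meta_opt.zero_grad()
          theta = dict(model.named_parameters())
          meta_loss = 0.0
          for D_tr, D_test in task_batch:                  # tasks T_i sampled from P(T)
              phi = inner_adapt(model, loss_fn, theta, D_tr, alpha, steps)
              x_te, y_te = D_test
              meta_loss = meta_loss + loss_fn(torch.func.functional_call(model, phi, (x_te,)), y_te)
          meta_loss.backward()                             # backpropagates through the inner loop
          meta_opt.step()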

  11. FOMAML
    • Problem with MAML ⇒ the outer-loop gradient computation is heavy
    • it requires Hessian (second-derivative) computations
    • A first-order approximation of this gradient is also proposed in the MAML paper (FOMAML)
    • it is lightweight because the inner-loop gradients do not have to be stored
        θ = θ − β Σ_i ∇_{ϕ_i^S} L_i(ϕ_i^S, D_i^test)
    [Figure 1 (Rajeswaran et al., 2019): MAML differentiates through the optimization path (green),
    while first-order MAML approximates dϕ_i/dθ by the identity.]
    [Fig: Rajeswaran et al. 2019]
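    Continuing the hypothetical sketch above, the first-order variant only changes how the gradient
    reaches θ: the inner loop is run without retaining the graph, and the test-split gradient taken at
    ϕ_i^S is applied to θ directly (a sketch under the same assumptions as before):

      import torch

      def fomaml_outer_step(model, loss_fn, meta_opt, task_batch, alpha=0.01, steps=5):
          """First-order MAML: d(phi)/d(theta) is treated as the identity, so the
          test-split gradient taken at phi_i^S is applied to theta directly."""
          meta_opt.zero_grad()
          theta = dict(model.named_parameters())
          for D_tr, D_test in task_batch:
              # inner loop WITHOUT create_graph: no second derivatives are tracked
              phi = {k: p.detach().clone().requires_grad_(True) for k, p in theta.items()}
              x_tr, y_tr = D_tr
              for _ in range(steps):
                  loss = loss_fn(torch.func.functional_call(model, phi, (x_tr,)), y_tr)
                  grads = torch.autograd.grad(loss, list(phi.values()))
                  phi = {k: (p - alpha * g).detach().requires_grad_(True)
                         for (k, p), g in zip(phi.items(), grads)}
              x_te, y_te = D_test
              test_loss = loss_fn(torch.func.functional_call(model, phi, (x_te,)), y_te)
              test_grads = torch.autograd.grad(test_loss, list(phi.values()))
              for p, g in zip(theta.values(), test_grads):  # apply the task gradient to theta
                  p.grad = g.detach() if p.grad is None else p.grad + g.detach()
          meta_opt.step()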

  12. Experiments
    • The problem setting is what is called few-shot learning
    • in N-way K-shot, one task/dataset is an N-class classification problem with K images per class
    • which N classes are distinguished changes from task to task
    • at meta-test time, classification is over classes that did not exist during training
    • For 2-way it looks roughly like this:
    • training tasks: pick 2 classes from {dog, cat, monkey, bird}, e.g. (dog vs cat), (cat vs bird), ...
    • meta-test task: (horse vs tiger)
    • the model is required to classify unseen classes quickly
    • Datasets: Omniglot and Mini-ImageNet
    • Omniglot: handwritten characters
    • Mini-ImageNet: natural images
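    For illustration only, a toy sketch of how one N-way K-shot episode could be sampled (the
    images_by_class dict and the class names are assumptions, not part of either benchmark's API):

      import random

      def sample_episode(images_by_class, n_way=2, k_shot=5, k_query=15):
          """Sample one N-way K-shot task: K support and k_query query images per class."""
          classes = random.sample(list(images_by_class), n_way)
          support, query = [], []
          for label, cls in enumerate(classes):
              imgs = random.sample(images_by_class[cls], k_shot + k_query)
              support += [(img, label) for img in imgs[:k_shot]]
              query += [(img, label) for img in imgs[k_shot:]]
          return support, query   # play the roles of D_i^tr and D_i^test for task T_i

      # e.g. meta-training classes: dog, cat, monkey, bird; meta-test classes: horse, tiger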

  13. Experiments: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    Table 1 (Finn et al., 2017): few-shot classification on held-out Omniglot characters (top) and the
    MiniImagenet test set (bottom). The ± shows 95% confidence intervals over tasks. The MiniImagenet
    evaluation of the baselines and matching networks is from Ravi & Larochelle (2017).

    Omniglot (Lake et al., 2011)                    5-way 1-shot    5-way 5-shot    20-way 1-shot   20-way 5-shot
    MANN, no conv (Santoro et al., 2016)            82.8%           94.9%           –               –
    MAML, no conv (ours)                            89.7 ± 1.1%     97.5 ± 0.6%     –               –
    Siamese nets (Koch, 2015)                       97.3%           98.4%           88.2%           97.0%
    matching nets (Vinyals et al., 2016)            98.1%           98.9%           93.8%           98.5%
    neural statistician (Edwards & Storkey, 2017)   98.1%           99.5%           93.2%           98.1%
    memory mod. (Kaiser et al., 2017)               98.4%           99.6%           95.0%           98.6%
    MAML (ours)                                     98.7 ± 0.4%     99.9 ± 0.1%     95.8 ± 0.3%     98.9 ± 0.2%

    MiniImagenet (Ravi & Larochelle, 2017)          5-way 1-shot    5-way 5-shot
    fine-tuning baseline                            28.86 ± 0.54%   49.79 ± 0.79%
    nearest neighbor baseline                       41.08 ± 0.70%   51.04 ± 0.65%
    matching nets (Vinyals et al., 2016)            43.56 ± 0.84%   55.31 ± 0.73%
    meta-learner LSTM (Ravi & Larochelle, 2017)     43.44 ± 0.77%   60.60 ± 0.71%
    MAML, first order approx. (ours)                48.07 ± 1.75%   63.15 ± 0.91%
    MAML (ours)                                     48.70 ± 1.84%   63.11 ± 0.92%
    • Beats many existing few-shot methods
    • FOMAML trains with little loss of accuracy compared to MAML
    • and is more than 30% faster than MAML

  14. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  15. MAML and hierarchical Bayes
    [Figure 1 (Grant et al., 2018): (left) the computational graph of the MAML algorithm; (right) the
    probabilistic graphical model for which MAML provides an inference procedure. Each task-specific
    parameter ϕ_j depends on the shared meta-level parameter θ, so estimating θ constrains the
    estimation of each ϕ_j; θ can be estimated by maximizing the marginal likelihood obtained by
    integrating out the ϕ_j (empirical Bayes).]
    • Sample data x_{j1}, ..., x_{jN}, x_{jN+1}, ..., x_{jN+M} ∼ p_{T_j}(x)
    • the first N points are training data, the remaining M points are test data
    • MAML was the problem of maximizing the following likelihood (written here as a loss):

        L(θ) = (1/J) Σ_j [ (1/M) Σ_m − log p( x_{jN+m} | θ − α ∇_θ (1/N) Σ_n − log p(x_{jn} | θ) ) ]   (3.1)

  16. MAML and hierarchical Bayes
    • Assume each task's parameters ϕ_j are influenced by the task-specific parameters of the other
      tasks
    • To model this, assume each task-specific parameter depends on a shared parameter θ
    • Writing the observed data of all tasks as X, this can be expressed as

        p(X | θ) = Π_j ( ∫ p(x_{j1}, ..., x_{jN} | ϕ_j) p(ϕ_j | θ) dϕ_j )   (3.2)

  17. MAML and hierarchical Bayes
    • Finding the θ that maximizes (3.2) is known as empirical Bayes
    • (3.2) is computationally intractable, so a point estimate ϕ̂_j is often used instead
    • In that case the negative log marginal likelihood becomes

        − log p(X | θ) ≈ Σ_j [ − log p(x_{jN+1}, ..., x_{jN+M} | ϕ̂_j) ]   (3.3)

    • Setting ϕ̂_j = θ + α ∇_θ log p(x_{j1}, ..., x_{jN} | θ) recovers exactly the form of (3.1)
    • MAML training corresponds to marginal-likelihood maximization in a hierarchical Bayesian model
      in which each task-specific parameter is approximated by a point estimate

  18. MAML and hierarchical Bayes
    • The number of inner-loop steps and the degree of adaptation to each task are in a trade-off
    • Consider what early stopping of the inner-loop gradient descent means
    • task indices are omitted from here on
    • Taking a second-order approximation of each task's negative log-likelihood
      ℓ(ϕ) = − log p(x_{j1}, ..., x_{jN} | ϕ) around its minimizer ϕ*:

        ℓ(ϕ) ≈ ℓ̃(ϕ) := (1/2) ∥ϕ − ϕ*∥²_{H⁻¹} + ℓ(ϕ*)

    • H = ∇²_ϕ ℓ(ϕ*)
    • ∥z∥_Q = z⊤ Q⁻¹ z

  19. MAML and hierarchical Bayes
    • Using a curvature matrix B gives the following update rule
    • with B = (∇²_ϕ ℓ̃(ϕ_{k−1}))⁻¹ this is Newton's method

        ϕ_k = ϕ_{k−1} − B ∇_ϕ ℓ̃(ϕ_{k−1})

    • Starting from ϕ_0 = θ and performing k update steps yields the ϕ that solves

        min_ϕ ( ∥ϕ − ϕ*∥²_{H⁻¹} + ∥ϕ_0 − ϕ∥²_Q )   (3.9)

    • Q = O Λ⁻¹ ((I − BΛ)⁻ᵏ − I) O⊤
    • Λ = O⊤ H O = diag(λ_1, ..., λ_n)
    • B = O⊤ B⁻¹ O = diag(b_1, ..., b_n)
    • λ_i, b_i ≥ 0, i = 1, ..., n

  20. MAML and hierarchical Bayes
    • (3.9) is a minimization problem with a constraint that the solution stays close to the
      initialization (= early stopping)
    • (3.9) corresponds to maximizing the posterior p(ϕ | x_1, ..., x_N, θ) ∝ p(x_1, ..., x_N | ϕ) p(ϕ | θ)
    • and, in particular, to choosing a Gaussian prior p(ϕ | θ) with mean θ and covariance Q
      (Santos, 1996)
    • Hence early stopping is tied to the choice of prior
    • this fact is used later in the proposed method

  21. LLAMA
    • When the posterior over ϕ is a broad, gentle distribution, a point estimate does not work well
    • Instead of MAP estimation in (3.2), the paper proposes using a Laplace approximation
    • Lightweight Laplace Approximation for Meta-Adaptation (LLAMA)

  22. LLAMA
    • Applying a Laplace approximation to the integral in (3.2) around the mode ϕ*_j:

        ∫ p(X_j | ϕ_j) p(ϕ_j | θ) dϕ_j ≈ p(X_j | ϕ*_j) p(ϕ*_j | θ) det(H_j / 2π)^{−1/2}

    • Using the inner-loop estimate ϕ̂_j as the mode:

        − log p(X | θ) ≈ Σ_j [ − log p(X_j | ϕ̂_j) − log p(ϕ̂_j | θ) + (1/2) log det(H_j) ]

    • − log p(ϕ̂_j | θ) corresponds to the implicit constraint imposed by early stopping
    • (1/2) log det(H_j) corresponds to a penalty on model complexity

  23. LLAMA
    • The Hessian H_j of the posterior can be written as

        H_j = ∇²_{ϕ_j} [− log p(X_j | ϕ_j)] + ∇²_{ϕ_j} [− log p(ϕ_j | θ)]

    • The prior p(ϕ_j | θ) can be approximated by a Gaussian, but its covariance is not diagonal and
      is therefore hard to handle
    • in the experiments it is approximated by a Gaussian with isotropic covariance of precision τ
    • τ is chosen by cross-validation

  24. LLAMA
    • The Hessian of the log-likelihood is also intractable as-is, so it must be approximated
    • approximate it with the Fisher information matrix
    • In the natural-gradient literature there is a method called Kronecker-factored approximate
      curvature (K-FAC) that approximates the inverse Fisher information matrix
    • it uses a block-diagonal approximation together with Kronecker products
    • Using K-FAC, the determinant of the Fisher information matrix can be approximated efficiently

  25. LLAMA
    • Combining the Laplace approximation and K-FAC gives the following LLAMA subroutine
    • Ĥ is an approximation of H
    • η is a hyperparameter

    Subroutine ML-LAPLACE(θ, T)
      Draw N samples x_1, ..., x_N ∼ p_T(x)
      Initialize ϕ ← θ
      for k in 1, ..., K do
        Update ϕ ← ϕ + α ∇_ϕ log p(x_1, ..., x_N | ϕ)
      end
      Draw M samples x_{N+1}, ..., x_{N+M} ∼ p_T(x)
      Estimate quadratic curvature Ĥ
      return − log p(x_{N+1}, ..., x_{N+M} | ϕ) + η log det(Ĥ)

    Subroutine 4: subroutine for computing a Laplace approximation of the marginal likelihood.
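    A very rough sketch of this subroutine for a small model, substituting a diagonal Fisher estimate
    for the K-FAC curvature purely for illustration (nll_fn, the inner step count K, and the curvature
    stand-in are assumptions, not the paper's implementation):

      import torch

      def ml_laplace(nll_fn, theta, x_tr, x_te, alpha=0.01, K=5, eta=1e-6):
          """Inner-loop adaptation followed by a Laplace-style curvature penalty.
          nll_fn(params, batch) returns the negative log-likelihood of a batch."""
          phi = {k: v.clone() for k, v in theta.items()}
          for _ in range(K):   # phi <- phi + alpha * grad log p  ==  phi - alpha * grad NLL
              g = torch.autograd.grad(nll_fn(phi, x_tr), list(phi.values()), create_graph=True)
              phi = {k: p - alpha * gi for (k, p), gi in zip(phi.items(), g)}
          g_te = torch.autograd.grad(nll_fn(phi, x_te), list(phi.values()), create_graph=True)
          log_det_H = sum((gi ** 2 + 1e-8).log().sum() for gi in g_te)   # log det of a diagonal Fisher
          return nll_fn(phi, x_te) + eta * log_det_H

    Summing this value over tasks and backpropagating to θ would play the role of the meta-objective.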

  26. Experiments
    • Approximating the posterior over the task-specific parameters ϕ_j with the Laplace
      approximation makes it possible to inspect uncertainty
    • tasks: regression of sin x with amplitude in [0.1, 5.0] and phase in [0, π]
    • each task provides 10 observations
    [Figure 5 (Grant et al., 2018): samples from the posterior of a model meta-trained on different
    sinusoids, given a few datapoints from a new, unseen sinusoid; the sampled functions are all
    sinusoidal and uncertainty grows where the datapoints are less informative.]
    Left: standard MAML. Right: regressions using parameters sampled from the Laplace-approximated
    posterior obtained after adapting ϕ_j on the 10 points given at meta-test time.

  27. Experiments

    Table 1 (Grant et al., 2018): one-shot classification on the miniImageNet test set, averaged over
    600 test episodes with 95% confidence intervals.

    Model                                               5-way 1-shot acc. (%)
    Fine-tuning                                         28.86 ± 0.54
    Nearest Neighbor                                    41.08 ± 0.70
    Matching Networks FCE (Vinyals et al., 2016)        43.56 ± 0.84
    Meta-Learner LSTM (Ravi & Larochelle, 2017)         43.44 ± 0.77
    SNAIL (Anonymous, 2018)                             45.1
    Prototypical Networks (Snell et al., 2017)          46.61 ± 0.78
    mAP-DLM (Triantafillou et al., 2017)                49.82 ± 0.78
    MAML (Finn et al., 2017)                            48.70 ± 1.84
    LLAMA (Ours)                                        49.40 ± 1.83

    • Classification on Mini-ImageNet with the proposed method (LLAMA)
    • τ = 0.001 (fixed) and η = 10⁻⁶ (selected by cross-validation)
    • all other hyperparameters follow the MAML paper
    • Beats MAML

  28. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  29. Problems with MAML as claimed by this paper
    • MAML's outer-loop differentiation is expensive
    • in both computation and memory
    • A first-order approximation that reduces this cost is proposed (Reptile)
    • more intuitive than FOMAML
    • no reinforcement-learning experiments

  30. Reptile
    • As in MAML, the task-specific parameters ϕ_i are learned with

        ϕ^s = ϕ^{s−1} − α ∇_{ϕ^{s−1}} L(ϕ^{s−1}, D^tr)

    • Then the initialization θ is updated with the rule below
    • the way Reptile moves the initialization is more natural than FOMAML

        θ ← θ + ϵ (ϕ^S − θ)

    [Diagram: how FOMAML and Reptile move the initialization θ relative to the inner-loop trajectory.]
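    A minimal sketch of one Reptile outer step under the same hypothetical setup as the earlier MAML
    sketch (averaging the interpolation over the task batch is an implementation choice, not something
    the slide prescribes):

      import torch

      def reptile_outer_step(model, loss_fn, task_batch, alpha=0.01, epsilon=0.1, steps=5):
          """Reptile: run ordinary SGD on each task, then move theta toward the adapted weights."""
          theta = {k: v.detach().clone() for k, v in model.named_parameters()}
          delta = {k: torch.zeros_like(v) for k, v in theta.items()}
          for D_tr, _ in task_batch:                        # Reptile only needs the training split
              phi = {k: v.clone().requires_grad_(True) for k, v in theta.items()}
              x_tr, y_tr = D_tr
              for _ in range(steps):                        # plain inner-loop gradient steps
                  loss = loss_fn(torch.func.functional_call(model, phi, (x_tr,)), y_tr)
                  grads = torch.autograd.grad(loss, list(phi.values()))
                  phi = {k: (p - alpha * g).detach().requires_grad_(True)
                         for (k, p), g in zip(phi.items(), grads)}
              for k in delta:
                  delta[k] += (phi[k].detach() - theta[k]) / len(task_batch)
          with torch.no_grad():                             # theta <- theta + eps * mean(phi_S - theta)
              for k, p in model.named_parameters():
                  p.add_(epsilon * delta[k])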

  31. Reptile
    • Writing the gradient at each inner-loop step as g_s, ϕ^S is

        ϕ^S = θ − α (g_1 + · · · + g_{S−1})

    • so the update of the initialization θ can also be interpreted as

        θ ← θ − ϵα (g_1 + · · · + g_{S−1})

    [Diagram: the Reptile update seen as a sum of the inner-loop gradients.]

  32. Joint training and Reptile
    • The simplest meta-learning baseline is joint training
    • learn parameters θ that minimize the following expectation so that all tasks are handled
      reasonably well:

        min_θ E_τ [ L_τ(θ, D_τ^tr) ]

    • However, joint training does not learn well
    • e.g. for tasks that regress various sine functions, joint training ends up always predicting 0

  33. Joint training and Reptile
    • Reptile is similar to joint training
    • with only one inner-loop step, Reptile coincides with joint training
    • as explained from the next slide, differences appear once multiple steps are taken

  34. Comparison of MAML, FOMAML, and Reptile
    Note: in this discussion the notation follows the original (Reptile) paper.
    • To keep the expressions simple, define:
    • L_i: the loss on the i-th minibatch of the same task
    • ϕ_1: the initialization parameters (θ in the previous notation)
    • g_i = L′_i(ϕ_i)
    • ϕ_{i+1} = ϕ_i − α g_i
    • ḡ_i = L′_i(ϕ_1)
    • H̄_i = L″_i(ϕ_1)
    • i ∈ [1, k]

  35. Comparison of MAML, FOMAML, and Reptile
    • Approximate the gradient g_i around ϕ_1:

        g_i = L′_i(ϕ_i) = L′_i(ϕ_1) + L″_i(ϕ_1)(ϕ_i − ϕ_1) + O(α²)
            = ḡ_i − α H̄_i Σ_{j=1}^{i−1} g_j + O(α²)
            = ḡ_i − α H̄_i Σ_{j=1}^{i−1} ḡ_j + O(α²)

    • line 1 → line 2: ϕ_i − ϕ_1 = −α Σ_{j=1}^{i−1} g_j
    • line 2 → line 3: g_j = ḡ_j + O(α)

  36. Comparison of MAML, FOMAML, and Reptile
    • Defining U_i(ϕ) = ϕ − α L′_i(ϕ), the MAML outer-loop gradient g_MAML is

        g_MAML = ∂/∂ϕ_1 L_k(ϕ_k)
               = ∂/∂ϕ_1 L_k(U_{k−1}(U_{k−2}( ... (U_1(ϕ_1)))))
               = U′_1(ϕ_1) · · · U′_{k−1}(ϕ_{k−1}) L′_k(ϕ_k)
               = (I − α L″_1(ϕ_1)) · · · (I − α L″_{k−1}(ϕ_{k−1})) L′_k(ϕ_k)
               = ( Π_{j=1}^{k−1} (I − α L″_j(ϕ_j)) ) g_k

  37. Comparison of MAML, FOMAML, and Reptile
    • From g_k = ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j + O(α²) and L″_j(ϕ_j) = H̄_j + O(α),

        g_MAML = ( Π_{j=1}^{k−1} (I − α H̄_j) ) ( ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j ) + O(α²)
               = ( I − α Σ_{j=1}^{k−1} H̄_j ) ( ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j ) + O(α²)
               = ḡ_k − α Σ_{j=1}^{k−1} H̄_j ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j + O(α²)

  38. Comparison of MAML, FOMAML, and Reptile
    • For k = 2, the gradients of MAML, FOMAML, and Reptile become

        g_MAML    = ḡ_2 − α H̄_2 ḡ_1 − α H̄_1 ḡ_2 + O(α²)
        g_FOMAML  = g_2 = ḡ_2 − α H̄_2 ḡ_1 + O(α²)
        g_Reptile = g_1 + g_2 = ḡ_1 + ḡ_2 − α H̄_2 ḡ_1 + O(α²)

    • Now consider taking the expectation of these gradients over several tasks
    • below, E_{τ,1,2}[...] denotes the expectation over the task τ and over the minibatches L_1, L_2

  39. Comparison of MAML, FOMAML, and Reptile
    • Define the quantity called AvgGrad:

        AvgGrad = E_{τ,1}[ ḡ_1 ]

    • it is the direction in which moving the initialization parameters reduces the task loss
    • the same objective that joint training minimizes

  40. Comparison of MAML, FOMAML, and Reptile
    • Also define the quantity called AvgGradInner:

        AvgGradInner = E_{τ,1,2}[ H̄_2 ḡ_1 ]
                     = E_{τ,1,2}[ H̄_1 ḡ_2 ]
                     = (1/2) E_{τ,1,2}[ H̄_2 ḡ_1 + H̄_1 ḡ_2 ]
                     = (1/2) E_{τ,1,2}[ ∂/∂ϕ_1 (ḡ_1 · ḡ_2) ]

    • this is the direction that increases the inner product between gradients of different minibatches
    • it pushes toward an initialization from which, for any task, the updates computed on different
      minibatches point in similar directions (≈ training proceeds well)
    • which helps generalization within a task

  41. Comparison of MAML, FOMAML, and Reptile
    • Expressing the expected gradients for k = 2 in terms of AvgGrad and AvgGradInner:

        E[g_MAML]    = (1) AvgGrad − (2α) AvgGradInner + O(α²)
        E[g_FOMAML]  = (1) AvgGrad − (α)  AvgGradInner + O(α²)
        E[g_Reptile] = (2) AvgGrad − (α)  AvgGradInner + O(α²)

  42. Comparison of MAML, FOMAML, and Reptile
    • For k > 2:

        E[g_MAML]    = (1) AvgGrad − (2(k − 1)α)      AvgGradInner + O(α²)
        E[g_FOMAML]  = (1) AvgGrad − ((k − 1)α)       AvgGradInner + O(α²)
        E[g_Reptile] = (k) AvgGrad − ((1/2)k(k − 1)α) AvgGradInner + O(α²)

    • The ratio of AvgGradInner to AvgGrad is largest for MAML, then FOMAML, then Reptile
    • FOMAML has half the AvgGradInner weight of MAML, so its performance is somewhat lower
    • Reptile has a larger AvgGrad weight, so its loss is expected to decrease quickly

  43. Reptile and the manifolds of task-optimal solutions
    [Figure 2 (Nichol et al., 2018): the iterates move alternately toward the optimal-solution manifolds
    W*_1 and W*_2 and converge to the point that minimizes the average squared distance to them; unlike
    the expected-loss objective, the expected-distance objective typically has a single minimizer.]
    • Reptile converges to an initialization θ close to each task's manifold of optimal solutions W*_τ
    • closeness is measured in Euclidean distance
    • manifolds are considered because there are infinitely many optimal solutions
    • the paper itself calls this an informal argument
    • details omitted

  44. Experiments
    • Few-shot learning on Mini-ImageNet and Omniglot
    • top table: Mini-ImageNet
    • bottom table: Omniglot
    • Accuracy on par with MAML and FOMAML
    • Transduction is explained on the next slide
    Table 1 (Nichol et al., 2018): results on Mini-ImageNet. MAML and 1st-order MAML results are from
    Finn et al. (2017).

    Algorithm                         1-shot 5-way     5-shot 5-way
    MAML + Transduction               48.70 ± 1.84%    63.11 ± 0.92%
    1st-order MAML + Transduction     48.07 ± 1.75%    63.15 ± 0.91%
    Reptile                           47.07 ± 0.26%    62.74 ± 0.37%
    Reptile + Transduction            49.97 ± 0.32%    65.99 ± 0.58%

    Table 2 (Nichol et al., 2018): results on Omniglot. MAML results are from Finn et al. (2017);
    1st-order MAML results were generated with the same code and hyperparameters as MAML.

    Algorithm                         1-shot 5-way     5-shot 5-way     1-shot 20-way    5-shot 20-way
    MAML + Transduction               98.7 ± 0.4%      99.9 ± 0.1%      95.8 ± 0.3%      98.9 ± 0.2%
    1st-order MAML + Transduction     98.3 ± 0.5%      99.2 ± 0.2%      89.4 ± 0.5%      97.9 ± 0.1%
    Reptile                           95.39 ± 0.09%    98.90 ± 0.10%    88.14 ± 0.15%    96.65 ± 0.33%
    Reptile + Transduction            97.68 ± 0.04%    99.48 ± 0.06%    89.43 ± 0.14%    97.12 ± 0.32%

  45. Experiments
    • Transductive learning: a setting in which the (unlabeled) test data are available in advance and
      can be used during learning
    • with transduction: at test time the entire test set is used at once to predict labels
    • without transduction: test samples are predicted one at a time
    • MAML always uses the current batch's statistics for batch normalization
      ⇒ if the whole test set is fed as one batch, information from the entire test set can be used for
      each prediction
      ⇒ which improves accuracy
    • The experimental setup of the MAML paper therefore implicitly amounts to transductive learning
    • the paper notes that care should be taken with how batch normalization is handled
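    A small sketch of what the two evaluation modes mean in practice for a batch-norm network in
    training mode (model, x_support, and x_query are assumed to be given; this is an illustration of
    the setting, not code from the paper):

      import torch

      def predict_transductive(model, x_query):
          """One forward pass over the whole query set: batch-norm statistics mix
          information across all query samples (the setting MAML implicitly uses)."""
          return model(x_query)

      def predict_non_transductive(model, x_support, x_query):
          """Predict each query sample separately: batch-norm statistics come only from
          the support set plus that single query sample, so nothing is shared across queries."""
          preds = []
          for x in x_query:
              batch = torch.cat([x_support, x.unsqueeze(0)], dim=0)
              preds.append(model(batch)[-1])    # logits of the lone query sample
          return torch.stack(preds)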

  46. Experiments
    [Figure 4 (Nichol et al., 2018): hyperparameter sweeps on 5-shot 5-way Omniglot: final test
    performance vs. number of inner-loop iterations, inner-loop batch size, and outer-loop settings;
    section 6.3 examines the overlap between inner-loop mini-batches (shared-tail vs. separate-tail).]
    • Experiments varying the number of inner-loop iterations and the mini-batch size
    • shared-tail: the inner-loop test samples are drawn arbitrarily from the training data
    • separate-tail: the inner loop uses a proper train/test split
    • replacement / cycling: whether inner-loop mini-batches are re-drawn every time or cycled
    • Reptile ...
    • needs no train/test split in the inner loop
    • and does not need the mini-batches to be re-drawn every time

  47. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  48. Problems with MAML as claimed by this paper
    • MAML's outer-loop differentiation is expensive
    • in both computation and memory
    • Proposes a method that reduces this cost using implicit differentiation and conjugate gradient
      (iMAML)
    • FOMAML and Reptile approximate by simply not computing second derivatives
    • iMAML instead approximates the second-derivative term itself, which improves accuracy
    • no reinforcement-learning experiments

  49. iMAML
    • MAML training could be written as

        θ ← θ − η (1/M) Σ_{i=1}^M ∇_θ L_i(Alg_i(θ))

    • Alg_i(θ) = Alg_i(θ, D_i^tr) = ϕ_i
    • L_i(ϕ_i) = L_i(ϕ_i, D_i^test)
    • Now suppose we obtain the inner-loop optimum ϕ′ defined by

        Alg*_i(θ) := argmin_{ϕ′∈Φ} G_i(ϕ′, θ),  where  G_i(ϕ′, θ) = L̂_i(ϕ′) + (λ/2) ∥ϕ′ − θ∥²

    • L̂_i(ϕ) := L_i(ϕ, D_i^tr)
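    For concreteness, the proximally regularized inner objective can be sketched as below (phi and
    theta are assumed to be matching lists of tensors, and train_loss any differentiable loss of phi;
    minimizing it with any optimizer plays the role of Alg*_i(θ)):

      def inner_objective(train_loss, phi, theta, lam):
          """G_i(phi', theta): task training loss plus a proximal term keeping
          the adapted parameters phi' close to the meta-parameters theta."""
          prox = 0.5 * lam * sum(((p - t) ** 2).sum() for p, t in zip(phi, theta))
          return train_loss(phi) + prox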

  50. iMAML
    • Then, by the chain rule, the update for θ is

        θ ← θ − η (1/M) Σ_{i=1}^M (dAlg*_i(θ)/dθ) ∇_ϕ L_i(Alg*_i(θ))

    • Computing dAlg*_i(θ)/dθ is expensive, so we would like to approximate it
      ⇒ use implicit differentiation

  51. iMAML
    • If the ϕ_i obtained in the inner loop is the optimum ϕ′, then

        ∇_{ϕ′} G(ϕ′, θ)|_{ϕ′=ϕ_i} = 0  ⇒  ∇_{ϕ_i} L̂_i(ϕ_i) + λ(ϕ_i − θ) = 0
                                      ⇒  ϕ_i = θ − (1/λ) ∇_{ϕ_i} L̂_i(ϕ_i)

    • Differentiating this implicitly with respect to θ:

        dϕ_i/dθ = I − (1/λ) ∇² L̂_i(ϕ_i) dϕ_i/dθ   ⇒   dAlg*_i(θ)/dθ = ( I + (1/λ) ∇² L̂_i(ϕ_i) )⁻¹

    • This can be computed from the final inner-loop parameters ϕ_i alone

  52. iMAML
    Computing ( I + (1/λ) ∇² L̂_i(ϕ_i) )⁻¹ has two problems:
    1. the ϕ_i actually obtained is only an approximate optimum
    2. for large models the matrix inverse is intractable
    • For 1, the paper's appendix discusses the resulting error
    • 2 is addressed with the conjugate gradient method
    • an algorithm for solving linear systems whose coefficient matrix is symmetric positive definite

  53. iMAML
    • Denote by g_i the following quantity appearing in the update of θ:

        ( I + (1/λ) ∇² L̂_i(ϕ_i) )⁻¹ ∇_ϕ L_i(Alg*_i(θ)) = g_i

    • The desired g_i is the solution of the linear system

        ( I + (1/λ) ∇² L̂_i(ϕ_i) ) g_i = ∇_ϕ L_i(Alg*_i(θ))

    • which can be obtained by solving the following minimization problem with the conjugate gradient
      method:

        min_w  w⊤ ( I + (1/λ) ∇² L̂_i(ϕ_i) ) w − w⊤ ∇_ϕ L_i(Alg*_i(θ))
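    A rough sketch of this meta-gradient computation with conjugate gradient and Hessian-vector
    products (the helper names, the flattened-parameter treatment, and the fixed number of CG steps
    are illustrative assumptions, not the authors' implementation):

      import torch

      def _flat(tensors):
          return torch.cat([t.reshape(-1) for t in tensors])

      def hvp(train_loss_fn, phi, v):
          """Hessian-vector product of the inner training loss at phi with a flat vector v."""
          g = torch.autograd.grad(train_loss_fn(phi), phi, create_graph=True)
          return _flat(torch.autograd.grad(_flat(g) @ v, phi))

      def imaml_meta_grad(train_loss_fn, test_loss_fn, phi, lam=2.0, cg_steps=5):
          """Approximately solve (I + H/lambda) g = grad_test with conjugate gradient (CG)."""
          b = _flat(torch.autograd.grad(test_loss_fn(phi), phi))
          A = lambda v: v + hvp(train_loss_fn, phi, v) / lam
          x = torch.zeros_like(b)
          r = b.clone()                      # residual for the zero initial guess
          p = r.clone()
          for _ in range(cg_steps):
              Ap = A(p)
              a = (r @ r) / (p @ Ap)
              x = x + a * p
              r_new = r - a * Ap
              p = r_new + ((r_new @ r_new) / (r @ r)) * p
              r = r_new
          return x                           # flat meta-gradient to be applied to theta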

  54. iMAML
    • Empirically, conjugate gradient (CG) nearly converges after about 5 iterations
    • the cost of one CG step is about the same as one inner-loop GD step
      ⇒ lower computational cost than MAML
    • The iMAML algorithm:

    Algorithm 1 Implicit Model-Agnostic Meta-Learning (iMAML)
    1: Require: distribution over tasks P(T), outer step size η, regularization strength λ
    2: while not converged do
    3:   Sample mini-batch of tasks {T_i}_{i=1}^B ∼ P(T)
    4:   for each task T_i do
    5:     Compute task meta-gradient g_i = Implicit-Meta-Gradient(T_i, θ, λ)
    6:   end for
    7:   Average the above gradients to get ∇̂F(θ) = (1/B) Σ_{i=1}^B g_i
    8:   Update meta-parameters with gradient descent: θ ← θ − η ∇̂F(θ)   // (or Adam)
    9: end while

    Algorithm 2 Implicit Meta-Gradient Computation
    1: Input: task T_i, meta-parameters θ, regularization strength λ
    2: Hyperparameters: optimization accuracy thresholds δ and δ′
    3: Obtain task parameters ϕ_i using an iterative optimization solver such that ∥ϕ_i − Alg*_i(θ)∥ ≤ δ
    4: Compute the partial outer-level gradient v_i = ∇_ϕ L_i(ϕ_i)
    5: Use an iterative solver (e.g. CG) along with reverse-mode differentiation (to compute
       Hessian-vector products) to compute g_i such that ∥g_i − (I + (1/λ) ∇² L̂_i(ϕ_i))⁻¹ v_i∥ ≤ δ′
    6: Return: g_i

  55. Advantages of iMAML
    • The number of inner-loop steps can be increased
    • MAML needs memory to store the computation graph, so the steps cannot be increased much
    • Second-order optimization methods can be used in the inner loop
    • e.g. Hessian-free or Newton-CG
    • using them inside MAML would require computing third-order derivatives

  56. Experiments
    [Figure 2(a) (Rajeswaran et al., 2019): accuracy of the computed meta-gradient against the exact
    meta-gradient on a synthetic example, as a function of the number of inner-loop gradient steps.]
    • Compares meta-gradient accuracy on a regression problem whose functions are linear in the
      parameters
    • the meta-gradient d/dθ L_i(Alg*_i(θ)) at the true optimum Alg*_i is compared with the
      meta-gradient obtained at the ϕ_i produced by each number of inner-loop GD steps
    • iMAML is more accurate than MAML

  57. Experiments
    [Figure 2(b) (Rajeswaran et al., 2019): computation and memory trade-offs of iMAML, MAML, and
    FOMAML with a 4-layer CNN on the 20-way 5-shot Omniglot task; iMAML is implemented in PyTorch and
    compared against a PyTorch implementation of MAML.]
    • GPU memory efficiency and computation time compared in the 20-way 5-shot Omniglot setting
    • like FOMAML, GPU memory does not depend on the number of inner-loop GD steps
    • computation time grows more slowly than MAML's
    • but it cannot beat FOMAML because of the additional conjugate-gradient computation

  58. Experiments
    Table 2 (Rajeswaran et al., 2019): Omniglot results. MAML and first-order MAML results are from
    Finn et al. (2017); Reptile results are from Nichol et al. (2018). iMAML (GD) uses 16 and 25 inner
    steps for 5-way and 20-way tasks; iMAML (Hessian-free) uses 5 CG steps for the search direction
    with line search. Both versions use λ = 2.0 and 5 CG steps for the task meta-gradient.

    Algorithm                        5-way 1-shot     5-way 5-shot     20-way 1-shot    20-way 5-shot
    MAML                             98.7 ± 0.4%      99.9 ± 0.1%      95.8 ± 0.3%      98.9 ± 0.2%
    first-order MAML                 98.3 ± 0.5%      99.2 ± 0.2%      89.4 ± 0.5%      97.9 ± 0.1%
    Reptile                          97.68 ± 0.04%    99.48 ± 0.06%    89.43 ± 0.14%    97.12 ± 0.32%
    iMAML, GD (ours)                 99.16 ± 0.35%    99.67 ± 0.12%    94.46 ± 0.42%    98.69 ± 0.1%
    iMAML, Hessian-Free (ours)       99.50 ± 0.26%    99.74 ± 0.11%    96.18 ± 0.36%    99.14 ± 0.1%

    Table 3 (Rajeswaran et al., 2019): Mini-ImageNet 5-way 1-shot accuracy.

    Algorithm                        5-way 1-shot
    MAML                             48.70 ± 1.84%
    first-order MAML                 48.07 ± 1.75%
    Reptile                          49.97 ± 0.32%
    iMAML GD (ours)                  48.96 ± 1.84%
    iMAML HF (ours)                  49.30 ± 1.88%
    • Accuracy comparison on Omniglot (top) and Mini-ImageNet (bottom)
    • iMAML with Hessian-free inner-loop optimization is the strongest
    • the other approximate methods, FOMAML and Reptile, lose much more accuracy on the harder task
      (20-way 1-shot)

  59. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  60. Problems with meta-learning (mainly MAML) as claimed by this paper
    • Tasks with only a few examples have large ambiguity
    • and can lead to overfitting
      ⇒ the paper argues that a more robust method is needed
    • Addresses these issues with Bayesian techniques (BMAML)
    • Stein Variational Gradient Descent (SVGD)
    • Chaser loss

  61. Stein Variational Gradient Descent (SVGD)
    • The inner loop of BMAML is trained with SVGD
    • unlike variational approximation, SVGD does not have to assume a parametric family or a
      factorization for the true posterior
    • SVGD maintains a set of parameters called particles, Θ = {θ^m}_{m=1}^M, and updates each
      particle θ_t ∈ Θ_t at step t by

        θ_{t+1} ← θ_t + ϵ_t ϕ(θ_t)
        where  ϕ(θ_t) = (1/M) Σ_{j=1}^M [ k(θ_t^j, θ_t) ∇_{θ_t^j} log p(θ_t^j) + ∇_{θ_t^j} k(θ_t^j, θ_t) ]

    • ϵ_t is the step size
    • k(x, x′) is a positive-definite kernel
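    A minimal sketch of one SVGD update with an RBF kernel (the fixed bandwidth h is an assumption;
    a median heuristic is commonly used instead):

      import torch

      def rbf_kernel(X, h=1.0):
          """RBF kernel matrix K and the repulsive term sum_j grad_{x_j} k(x_j, x_i); X is (M, d)."""
          K = torch.exp(-torch.cdist(X, X) ** 2 / (2 * h ** 2))
          grad_K = (K.sum(1, keepdim=True) * X - K @ X) / h ** 2
          return K, grad_K

      def svgd_step(particles, log_prob, eps=1e-2, h=1.0):
          """One SVGD update. particles: (M, d) tensor that is part of the autograd graph;
          log_prob maps (M, d) to the per-particle log posterior of shape (M,).
          create_graph=True keeps the graph so later meta-losses can backpropagate through steps."""
          score = torch.autograd.grad(log_prob(particles).sum(), particles, create_graph=True)[0]
          K, grad_K = rbf_kernel(particles, h)
          phi = (K @ score + grad_K) / particles.shape[0]   # kernelized Stein direction
          return particles + eps * phi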

  62. BMAML
    • MAML could be viewed as the following hierarchical Bayes model:

        p(D_T^test | θ_0, D_T^tr) = Π_{τ∈T} ( ∫ p(D_τ^test | ϕ_τ) p(ϕ_τ | D_τ^tr, θ_0) dϕ_τ )

    • Extend this into a form where SVGD can be used (the initialization θ_0 becomes Θ_0, a set of
      M particles θ_0^m):

        p(D_T^test | Θ_0, D_T^tr) ≈ Π_{τ∈T} ( (1/M) Σ_{m=1}^M p(D_τ^test | ϕ_τ^m) ),
        where  ϕ_τ^m ∼ p(ϕ_τ | D_τ^tr, Θ_0)

    • the ϕ_τ^m are obtained by SVGD

  63. BMAML
    • Θ_0 could therefore be trained with

        Θ_0 ← Θ_0 − β ∇_{Θ_0} Σ_{τ∈T} [ − log ( (1/M) Σ_{m=1}^M p(D_τ^test | ϕ_τ^m) ) ]

    • However, this training scheme is unstable and prone to overfitting
    • and it is unsatisfying that the inner loop is Bayesian while the outer loop is not
    • so a scheme that also preserves ambiguity in the outer loop is adopted

  64. BMAML
    • Approximate task posterior obtained by the inner loop: p_τ^n ≡ p^n(ϕ_τ | D_τ^tr; Θ_0)
    • n is the number of inner-loop steps
    • True task posterior: p_τ^∞ ≡ p(ϕ_τ | D_τ^tr ∪ D_τ^test)
    • We want a Θ_0 for which p_τ^n is close to p_τ^∞ ⇒ solve

        argmin_{Θ_0} Σ_τ d_p(p_τ^n ∥ p_τ^∞) ≈ argmin_{Θ_0} Σ_τ d_s(Θ_τ^n(Θ_0) ∥ Θ_τ^∞)

    • Θ_τ^n and Θ_τ^∞ are particle sets sampled from p_τ^n and p_τ^∞ respectively
    • (both are written Θ to match the paper, but since they are parameters from inner-loop training
      their contents are the ϕ^m)
    • d_p(p ∥ q) is a dissimilarity between two distributions
    • d_s(s_1 ∥ s_2) is a distance between two sets

  65. BMAML
    • Problem: p_τ^∞ and Θ_τ^∞ are unknown
    • So Θ_τ^∞ is replaced by Θ_τ^{n+s}
    • Θ_τ^{n+s} is obtained as follows:
    1. Starting from the initialization Θ_0, run n SVGD steps using D_τ^tr to obtain Θ_τ^n
    2. Add D_τ^test to the training data and run s more SVGD steps from Θ_τ^n

  66. BMAML
    • The outer-loop loss is therefore

        L_BMAML(Θ_0) = Σ_{τ∈T_t} d_s(Θ_τ^n ∥ Θ_τ^{n+s}) = Σ_{τ∈T_t} Σ_{m=1}^M ∥θ_τ^{n,m} − θ_τ^{n+s,m}∥²_2

    • the paper reports that small values such as n = s = 1 already work well
    • For large models the number of parameters that must be kept becomes a problem
      ⇒ with large models, the feature-extractor parameters are shared across all particles and only
      the classifier is kept as M particles
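    A rough sketch of the chaser loss for a single task, reusing the hypothetical svgd_step from the
    earlier block; the SVGD steps keep the autograd graph so the loss can be backpropagated to Θ_0,
    and the leader is detached (stop-gradient) as described in the paper:

      def chaser_loss_one_task(particles0, log_post_tr, log_post_all, n=1, s=1):
          """particles0: (M, d) initial particles Theta_0 (requires grad).
          log_post_tr uses D^tr only; log_post_all uses D^tr together with D^test."""
          chaser = particles0
          for _ in range(n):                       # chaser Theta^n: n SVGD steps on the train split
              chaser = svgd_step(chaser, log_post_tr)
          leader = chaser
          for _ in range(s):                       # leader Theta^{n+s}: s more steps with train + test
              leader = svgd_step(leader, log_post_all)
          return ((chaser - leader.detach()) ** 2).sum()   # pull the chaser toward the stopped leader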

  67. Experiments
    [Figure 2 (Yoon et al., 2018), miniImagenet: (a) few-shot classification with different numbers of
    particles, (b) different numbers of meta-training tasks, (c) an active-learning setting.]
    • Experiments on Mini-ImageNet
    • (a): accuracy comparison while varying the number of particles
    • EMAML is an ensemble of several independently trained MAML models
    • (b): the relation between the number of meta-training tasks and training iterations

  68. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  69. Problems with meta-learning (mainly MAML) as claimed by this paper
    • Settings such as few-shot learning involve large ambiguity
    • even if meta-training yields the best possible initialization, it may not contain enough
      information to solve a new task
      ⇒ a method that can propose multiple solutions is desirable
      ⇒ proposes a method that accounts for scalability and uncertainty (PLATIPUS)
    • amortized variational inference
    • no reinforcement-learning experiments

  70. MAML recap
    • MAML could be viewed in a hierarchical Bayes framework:

        p(y_i^test | x_i^tr, y_i^tr, x_i^test) = ∫ p(y_i^test | x_i^test, ϕ_i) p(ϕ_i | x_i^tr, y_i^tr, θ) dϕ_i
                                               ≈ p(y_i^test | x_i^test, ϕ*_i)

    • ϕ*_i is the MAP estimate
    • in MAML it is estimated by gradient descent in the inner loop

  71. PLATIPUS
    • In MAML the initialization θ was deterministic
    • The proposed method considers a distribution p(θ) over initializations and trains it with
      amortized variational inference, as used in VAEs
    • A VAE roughly works as follows:
    • encode the latent variable z from the input x with a network q_ψ
    • decode x from the sampled z
    • train by reducing the reconstruction error while keeping the approximate posterior q_ψ(z|x)
      close to the prior p(z)
    • The idea is to apply this scheme to MAML

  72. PLATIPUS
    • Treat the initialization θ as a latent variable and place a prior p(θ) on it
    • p(θ) is a Gaussian with mean µ_θ and diagonal covariance σ²_θ
    • µ_θ and σ²_θ are learnable parameters
    • unlike in a VAE, the parameters of the prior are also learned
    • In MAML, the task-specific parameters ϕ_i are obtained by MAP estimation
    • If the true MAP estimate ϕ*_i were available, θ and (x_i^tr, y_i^tr) would be independent
      ⇒ so the approximate posterior is conditioned only on the test data: q_ψ(θ | x_i^test, y_i^test)

  73. PLATIPUS
    • Define the approximate posterior q_ψ(θ | x_i^test, y_i^test) of p(θ) as

        q_ψ(θ | x_i^test, y_i^test) = N( µ_θ + γ_q ∇_{µ_θ} log p(y_i^test | x_i^test, µ_θ), v_q )

    • q_ψ is the inference distribution with parameters ψ
    • v_q is a diagonal covariance matrix and a learnable parameter
    • γ_q is a learning rate

  74. PLATIPUS
    • In reality the true MAP estimate is not available, so θ and (x_i^tr, y_i^tr) are not independent
    • To account for this dependence, the "prior" is modified to

        p_i(θ_i | x_i^tr, y_i^tr) = N( µ_θ + γ_p ∇_{µ_θ} log p(y_i^tr | x_i^tr, µ_θ), σ²_θ )

    • γ_p is a learning rate
    • empirically, this correction reportedly gives better results

  75. PLATIPUS
• Training maximizes the following variational lower bound on the approximate likelihood
• i.e., raise the classification accuracy while pulling the approximate posterior toward the prior

log p(y_i^test | x_i^test, x_i^tr, y_i^tr) ≥ E_{θ∼q_ψ}[ log p(y_i^test | x_i^test, φ_i^⋆) ] - D_KL( q_ψ(θ | x_i^test, y_i^test) ∥ p(θ_i | x_i^tr, y_i^tr) )

• The algorithm of the proposed method (the paper highlights the differences from MAML in red); a code sketch of one meta-training step follows below

Algorithm 1: Meta-training
Require: p(T): distribution over tasks
1: initialize Θ := {µ_θ, σ_θ², v_q, γ_p, γ_q}
2: while not done do
3:   Sample batch of tasks T_i ∼ p(T)
4:   for all T_i do
5:     D^tr, D^test = T_i
6:     Evaluate ∇_{µ_θ} L(µ_θ, D^test)
7:     Sample θ ∼ q = N(µ_θ - γ_q ∇_{µ_θ} L(µ_θ, D^test), v_q)
8:     Evaluate ∇_θ L(θ, D^tr)
9:     Compute adapted parameters with gradient descent: φ_i = θ - α ∇_θ L(θ, D^tr)
10:    Let p(θ | D^tr) = N(µ_θ - γ_p ∇_{µ_θ} L(µ_θ, D^tr), σ_θ²)
11:  Compute ∇_Θ Σ_{T_i} [ L(φ_i, D^test) + D_KL(q(θ | D^test) ∥ p(θ | D^tr)) ]
12:  Update Θ using Adam

Algorithm 2: Meta-testing
Require: training data D^tr_T for a new task T; learned Θ
1: Sample θ from the prior p(θ | D^tr)
2: Evaluate ∇_θ L(θ, D^tr)
3: Compute adapted parameters with gradient descent: φ = θ - α ∇_θ L(θ, D^tr)
    67 / 90
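To make the algorithm above concrete, the following is a minimal PyTorch-style sketch of one PLATIPUS meta-training step for a single task. It assumes the model parameters live in one flat tensor and that a helper loss_fn(params, data) returns the negative log-likelihood on that data; both are simplifications for illustration, not the authors' implementation.

import torch

def platipus_meta_step(mu_theta, log_sigma_sq, log_v_q, gamma_q, gamma_p,
                       alpha, loss_fn, D_train, D_test, kl_weight=0.15):
    # q(theta | D_test): Gaussian centered at a gradient step on the test loss
    test_grad = torch.autograd.grad(loss_fn(mu_theta, D_test), mu_theta,
                                    create_graph=True)[0]
    mu_q = mu_theta - gamma_q * test_grad
    v_q = log_v_q.exp()
    theta = mu_q + v_q.sqrt() * torch.randn_like(mu_q)  # reparameterized sample

    # inner loop: adapt the sampled initialization on the training set
    train_grad = torch.autograd.grad(loss_fn(theta, D_train), theta,
                                     create_graph=True)[0]
    phi = theta - alpha * train_grad

    # corrected prior p(theta | D_train): a gradient step on the training loss
    prior_grad = torch.autograd.grad(loss_fn(mu_theta, D_train), mu_theta,
                                     create_graph=True)[0]
    mu_p = mu_theta - gamma_p * prior_grad
    sigma_sq = log_sigma_sq.exp()

    # KL between the diagonal Gaussians q(theta | D_test) and p(theta | D_train)
    kl = 0.5 * (sigma_sq.log() - v_q.log()
                + (v_q + (mu_q - mu_p) ** 2) / sigma_sq - 1.0).sum()

    # per-task negative ELBO; sum over the task batch and backpropagate into
    # Theta = {mu_theta, log_sigma_sq, log_v_q, gamma_q, gamma_p} outside
    return loss_fn(phi, D_test) + kl_weight * kl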


76. Classification experiment
• To show the benefit of handling ambiguity, a quantity called Coverage is computed in addition to accuracy
• The dataset is celebA
• The task is to separate positives from negatives, but the positives in the meta-test training set share three ground-truth attributes
• Example: 1: wearing a hat, 2: mouth open, 3: young; images satisfying all three are positives, images satisfying none are negatives
• The meta-test test set takes images satisfying two of the three attributes as positives (an ambiguous task)
• This gives three possible classification tests (1 & 2, 1 & 3, 2 & 3)
• Sample an initialization from the learned prior, compute the log-likelihood on each of the three tests, and assign the sample to the test with the highest value
• In other words, measure which test each sample is useful for
• Repeat this several times and average the number of tests that received at least one sample (this is the Coverage; see the sketch below)
• The minimum is 1 and the maximum is 3
    68 / 90
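The Coverage number described above can be computed in a few lines. The sketch below assumes two hypothetical helpers, sample_prior() for drawing an initialization from the learned prior and log_likelihood(theta, task) for scoring it on one of the candidate tests; it is not the authors' evaluation code.

import numpy as np

def coverage(sample_prior, log_likelihood, tasks, n_samples=10, n_repeats=100):
    # Average number of candidate tests (out of len(tasks)) that receive at
    # least one sampled initialization, where each sample is assigned to the
    # test with the highest log-likelihood. Ranges from 1 to len(tasks).
    counts = []
    for _ in range(n_repeats):
        assigned = set()
        for _ in range(n_samples):
            theta = sample_prior()
            scores = [log_likelihood(theta, task) for task in tasks]
            assigned.add(int(np.argmax(scores)))
        counts.append(len(assigned))
    return float(np.mean(counts))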


77. Classification experiment
[Figure 6 from the PLATIPUS paper: sampled classifiers for an ambiguous meta-test task. In the meta-test training set (a), PLATIPUS observes five positives sharing three attributes (Mouth Open, Young, Wearing Hat) and five negatives; a classifier using any two attributes can correctly classify the training set. (b) shows the three possible two-attribute tasks the training set can correspond to, together with the labels predicted by the best sampled classifier for each; different samples capture the three possible explanations, with some paying attention to hats and others not.]

Table 1: Ambiguous celebA (5-shot)
Method | Accuracy | Coverage (max = 3) | Average NLL
MAML | 89.00 ± 1.78% | 1.00 ± 0.00 | 0.73 ± 0.06
MAML + noise | 84.3 ± 1.60% | 1.89 ± 0.04 | 0.68 ± 0.05
PLATIPUS (ours, KL weight = 0.05) | 88.34 ± 1.06% | 1.59 ± 0.03 | 0.67 ± 0.05
PLATIPUS (ours, KL weight = 0.15) | 87.8 ± 1.03% | 1.94 ± 0.04 | 0.56 ± 0.04
(Table caption from the paper: PLATIPUS covers almost twice as many modes as MAML with comparable accuracy; MAML + noise adds noise to the gradient without performing variational inference, which improves coverage but lowers accuracy and average log-likelihood.)
• PLATIPUS loses to MAML in accuracy, but its Coverage is close to 2, so it does account for the ambiguity
• Coverage close to 2 ⇒ even when a test slightly different from training appears, there is a chance of sampling a useful initialization
• Since it still loses in accuracy, it is unclear whether there is an advantage for plain image classification
• MAML determines the initialization deterministically, so its Coverage is always 1
    69 / 90


78. Regression experiment
[Figure 2 from the paper: samples from PLATIPUS trained for 5-shot regression, shown as colored dotted lines; the tasks regress to sinusoid and linear functions, shown in gray; MAML, shown in black, is a deterministic procedure and hence learns a single function rather than reasoning about the distribution over potential functions; even though PLATIPUS is trained for 5-shot regression, it can reason over its uncertainty when given a variable number of datapoints at test time.]
• The proposed method produces multiple sampled functions, as the figure shows
• MAML deterministically yields a single function
• The paper also reports active-learning experiments
    70 / 90


  79. Next Section
1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


80. Problems with MAML that this paper points out
• MAML aims to find a single initialization
• That requires the target parameters of the individual tasks to be close to each other
• When the task distribution is scattered and the modes are far apart, multiple initializations would be preferable
⇒ Proposal of a method that uses a task embedding to adapt quickly to tasks drawn from multiple datasets (MMAML)
    71 / 90


  81. MMAML
[Figure 1: Model overview. The modulation network produces a task embedding v, which is used to generate parameters {τ_i} that modulate the task network; the task network then adapts the modulated parameters to fit the target task.]

Algorithm 1: MMAML meta-training procedure
1: Input: task distribution P(T), hyper-parameters α and β
2: Randomly initialize θ and ω
3: while not DONE do
4:   Sample batches of tasks T_j ∼ P(T)
5:   for all j do
6:     Infer v = h({x, y}^K; ω_h) with K samples from D^train_{T_j}
7:     Generate parameters τ = {g_i(v; ω_g) | i = 1, ..., N} to modulate each block of the task network f
8:     Evaluate ∇_θ L_{T_j}(f(x; θ, τ); D^train_{T_j}) w.r.t. the K samples
9:     Compute adapted parameters with gradient descent: θ'_{T_j} = θ - α ∇_θ L_{T_j}(f(x; θ, τ); D^train_{T_j})
10:  end for
11:  Update θ with ∇_θ Σ_{T_j ∼ P(T)} L_{T_j}(f(x; θ', τ); D^val_{T_j})
12:  Update ω_g with ∇_{ω_g} Σ_{T_j ∼ P(T)} L_{T_j}(f(x; θ', τ); D^val_{T_j})
13:  Update ω_h with ∇_{ω_h} Σ_{T_j ∼ P(T)} L_{T_j}(f(x; θ', τ); D^val_{T_j})
14: end while
• The MMAML pipeline is:
1. Compute the task embedding vector v with the Modulation Network
2. Use v to modulate the Task Network parameters, producing an initialization suited to the task
    72 / 90


  82. Modulation Network
• Given a task's K data points and labels {x_k, y_k}_{k=1,...,K}, a network with parameters w_h computes the embedding vector v:

v = h({(x_k, y_k) | k = 1, ..., K}; w_h)

• From this v, generate the τ_i that modulate each block of the Task Network (each convolutional or fully connected layer):

τ_i = g_i(v; w_{g_i}), where i = 1, ..., N

• N is the total number of blocks in the Task Network (a code sketch follows below)
    73 / 90
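A rough sketch of how such a modulation network could look is given below. The pooled MLP encoder and the layer sizes are illustrative assumptions (the paper uses an LSTM for regression and a CNN for images); only the overall structure, an encoder h producing v followed by one generator g_i per block, reflects the method.

import torch
import torch.nn as nn

class ModulationNetwork(nn.Module):
    def __init__(self, xy_dim, embed_dim, block_dims):
        super().__init__()
        # h({(x_k, y_k)}; w_h): encodes the K support pairs into one embedding v
        self.encoder = nn.Sequential(nn.Linear(xy_dim, 64), nn.ReLU(),
                                     nn.Linear(64, embed_dim))
        # one small generator g_i per task-network block: tau_i = g_i(v; w_{g_i})
        self.generators = nn.ModuleList(
            [nn.Linear(embed_dim, d) for d in block_dims])

    def forward(self, x_support, y_support):
        pair = torch.cat([x_support, y_support], dim=-1)   # (K, x_dim + y_dim)
        v = self.encoder(pair).mean(dim=0)                  # average-pool over K
        taus = [g(v) for g in self.generators]
        return v, taus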


  83. Modulation Network
• The parameters θ_i of the i-th block of the initialization are modulated by τ_i as follows:

φ_i = θ_i ⊙ τ_i

• For ⊙ there are an attention-based variant and feature-wise linear modulation (FiLM); the paper chooses the latter
• FiLM apparently gave better accuracy
• The generated τ_i are kept fixed during the inner loop
    74 / 90


  84. FiLM
• FiLM splits the vector τ into τ_γ and τ_β and applies the following transformation to a network layer F_θ:

F_φ = F_θ ⊗ τ_γ + τ_β

• where ⊗ is channel-wise multiplication (see the sketch below)
• This can be seen as a generalization of the attention-based modulation
    75 / 90
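A minimal sketch of the FiLM operation on a convolutional feature map, assuming τ has already been split into a per-channel scale tau_gamma and shift tau_beta:

import torch

def film(features, tau_gamma, tau_beta):
    # features: (batch, channels, H, W); tau_gamma, tau_beta: (channels,)
    # channel-wise multiply and add: F_phi = F_theta * tau_gamma + tau_beta
    return features * tau_gamma.view(1, -1, 1, 1) + tau_beta.view(1, -1, 1, 1)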


85. Experiments: baselines
• MAML
• Multi-MAML
  • Prepare as many MAML models as there are domains; each model is trained only on tasks from its own domain
  • At test time, the MAML corresponding to the task's domain is used
  • It is therefore an upper bound for MAML and not usable in practice
  • If MAML is much worse than Multi-MAML, MAML cannot cope with a scattered task distribution
• LSTM Learner (regression only)
  • Regression with an LSTM
  • Since an LSTM is used as the Modulation Network for regression, an LSTM-only variant is included for comparison


86. Results: regression
Table 1: Mean squared error (MSE) on multimodal 5-shot regression with 2, 3, and 5 modes (Gaussian noise with µ = 0 and σ = 0.3 is applied; Multi-MAML uses ground-truth task modes to select the corresponding MAML model; MMAML with FiLM modulation outperforms the other methods by a margin).
Method | 2 modes (post modulation / post adaptation) | 3 modes | 5 modes
MAML | - / 1.085 | - / 1.231 | - / 1.668
Multi-MAML | - / 0.433 | - / 0.713 | - / 1.082
LSTM Learner | 0.362 / - | 0.548 / - | 0.898 / -
MMAML (Softmax) | 1.548 / 0.361 | 2.213 / 0.444 | 2.421 / 0.939
MMAML (FiLM) | 2.421 / 0.336 | 1.923 / 0.444 | 2.166 / 0.868
• Regression tasks are generated from 5 domains
• sine, linear, quadratic, absolute value of a linear function, tanh
• Comparison with the baselines when 2, 3, and 5 of these domains are used
• An LSTM is used as the Modulation Network
• The data are sorted by x and fed in order
• MMAML with FiLM gives the best results


87. Results: image classification
Table 2: Classification test accuracies on multimodal few-shot image classification with 2, 3, and 5 modes (Multi-MAML uses ground-truth dataset labels to select the corresponding MAML model; MMAML outperforms MAML and is comparable with Multi-MAML in all scenarios).
Method | 2 modes: 5-way 1-shot / 5-way 5-shot / 20-way 1-shot | 3 modes: 5-way 1-shot / 5-way 5-shot / 20-way 1-shot | 5 modes: 5-way 1-shot / 5-way 5-shot / 20-way 1-shot
MAML | 66.80% / 77.79% / 44.69% | 54.55% / 67.97% / 28.22% | 44.09% / 54.41% / 28.85%
Multi-MAML | 66.85% / 73.07% / 53.15% | 55.90% / 62.20% / 39.77% | 45.46% / 55.92% / 33.78%
MMAML (ours) | 69.93% / 78.73% / 47.80% | 57.47% / 70.15% / 36.27% | 49.06% / 60.83% / 33.97%
• N-way K-shot tasks
• Five datasets are used: Omniglot, Mini-ImageNet, FC100, CUB, AIRCRAFT
• Comparison with the baselines when 2, 3, and 5 of these datasets are used
• A CNN is used as the Modulation Network and an MLP to generate τ
• Strong on 5-way 1-shot and 5-shot, but struggles in the 20-way setting
• Perhaps it cannot fully cope with harder tasks?
    78 / 90


88. Results: visualizing the task embeddings
[Figure 3: tSNE plots of the task embeddings produced by the model for randomly sampled tasks, with color indicating the mode of the task distribution: (a) regression, (b) image classification, (c)-(d) RL Reacher. Tasks from different modes form clear clusters in the embedding space, showing that MMAML identifies the task mode and produces a meaningful embedding; in regression, the distance between modes roughly matches the similarity of the functions.]
• The task embeddings for regression and image classification are visualized with tSNE
• Regression
  • Similar domains (quadratic function and absolute value of a linear function) lie close to each other
  • Domains unlike the others (sine, tanh) form their own clusters
• For classification, one can confirm that the domains are clearly separated


  89. Next Section
1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


90. Problems with MAML
• This paper identifies six problems with MAML and proposes a fix for each of them (MAML++)
• Training Instability
• Second Order Derivative Cost
• Absence of Batch Normalization Statistic Accumulation
• Shared (across step) Batch Normalization Bias
• Shared Inner Loop (across step and across parameter) Learning Rate
• Fixed Outer Loop Learning Rate
• The following slides describe these problems and their fixes
    80 / 90


  91. Training Instability
• MAML computes the outer-loop gradient after several optimization steps in the inner loop
• MAML's convolutional layers have no skip-connections
⇒ The backward pass multiplies many gradients together, so depending on the model and hyperparameters, exploding or vanishing gradients occur easily and training is unstable
    81 / 90


92. Training Instability → Multi-Step Loss Optimization (MSL)
• Multi-Step Loss Optimization (MSL) is proposed as the fix
• MAML updates the initialization θ using only φ^S, the parameters obtained after S inner-loop steps
• B is the number of tasks

θ = θ - β ∇_θ Σ_{b=1}^{B} L_b(φ_b^S, D_b^test)

• MSL instead updates θ with a weighted sum of the losses computed with the intermediate φ^s from every inner-loop step (sketch below):

θ = θ - β ∇_θ Σ_{b=1}^{B} Σ_{s=1}^{S} v_s L_b(φ_b^s, D_b^test)
    82 / 90
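A minimal PyTorch-style sketch of the multi-step loss, assuming a flat parameter tensor and a loss_fn(params, data) helper (both simplifications, not the paper's code):

import torch

def msl_outer_loss(theta, inner_lr, loss_fn, D_train, D_test, weights):
    # Take S inner-loop steps and accumulate a weighted test loss after every
    # step instead of only after the last one; len(weights) == S.
    phi = theta
    total = 0.0
    for v_s in weights:
        grad = torch.autograd.grad(loss_fn(phi, D_train), phi,
                                   create_graph=True)[0]
        phi = phi - inner_lr * grad
        total = total + v_s * loss_fn(phi, D_test)   # per-step test loss
    return total   # backpropagate this to update theta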


93. Training Instability → Multi-Step Loss Optimization (MSL)
• v_s are weights adjusted by annealing (sketch below)
• Concretely:
  • At the start all steps get the same weight
  • The weight is then gradually shifted toward the later inner-loop steps
  • At the end only the S-th step is used, as in the original
• This stabilizes training (see the figure below)
[Figure 1 from the MAML++ paper (ICLR 2019): 3 seeds of the original strided MAML vs. strided MAML++; 2 out of 3 seeds with the original strided MAML appear to become unstable.]
    83 / 90
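One possible annealing schedule for the weights v_s is sketched below; the linear schedule is an illustrative choice, not the exact one used in the paper:

import numpy as np

def msl_weights(epoch, anneal_epochs, num_steps):
    # frac = 0 -> uniform weights, frac = 1 -> all weight on the final step
    frac = min(epoch / anneal_epochs, 1.0)
    w = np.full(num_steps, (1.0 - frac) / num_steps)
    w[-1] += frac          # move the remaining mass to step S
    return w               # always sums to 1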


94. Second Order Derivative Cost → Derivative-Order Annealing (DA)
• A major problem with MAML is that it requires computing second-order derivatives (the Hessian)
• First-order approximations such as FOMAML degrade performance
• So FOMAML is used for the first 50 epochs and full MAML afterwards (sketch below)
• This speeds up training, and the initial FOMAML phase acts like pre-training, so training also becomes more stable
    84 / 90
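In code, derivative-order annealing amounts to toggling whether the inner-loop gradient keeps its graph; a minimal sketch with a flat parameter tensor assumed:

import torch

def inner_step(phi, loss, alpha, use_second_order):
    # use_second_order = (epoch >= 50): first-order (FOMAML-like) updates early,
    # full second-order MAML afterwards
    grad = torch.autograd.grad(loss, phi, create_graph=use_second_order)[0]
    if not use_second_order:
        grad = grad.detach()   # cut the graph: first-order approximation
    return phi - alpha * grad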


  95. Absence of Batch Normalization Statistic Accumulation
• The original MAML (implementation) uses the statistics of the current batch for Batch Normalization
• In the original MAML implementation, Batch Normalization is not learned in the inner loop
• The model therefore has to cope with each batch's mean and variance, which hurts efficiency
⇒ Use running statistics accumulated per inner-loop step instead
• Per-Step Batch Normalization Running Statistics (BNRS)
    85 / 90


  96. Shared (across step) Batch Normalization Bias
• The Batch Normalization bias is not learned in the inner loop; the same value is reused at every step
• In the original MAML implementation, Batch Normalization is not learned in the inner loop
• In reality, once the model parameters are updated in the inner loop, the feature distribution changes as well
⇒ Learn a separate bias for every inner-loop step (sketch below)
• Per-Step Batch Normalization Weights and Biases (BNWB)
    86 / 90
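A minimal sketch covering both BNRS and BNWB: one BatchNorm module, and hence one set of running statistics and one learnable weight/bias pair, per inner-loop step, selected by the current step index. This is an illustrative layout, not the authors' implementation.

import torch
import torch.nn as nn

class PerStepBatchNorm(nn.Module):
    def __init__(self, num_features, num_steps):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_steps)])

    def forward(self, x, step):
        # step-specific running statistics, weight and bias
        return self.bns[step](x)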


  97. Shared Inner Loop (across step and across parameter) Learning Rate
• MAML uses a fixed inner-loop learning rate
• Finding an appropriate learning rate is costly
• Learning a per-parameter learning rate and gradient direction has been reported to improve performance, but it increases the number of parameters to learn
• Li et al., 2017
⇒ Learning a single rate and direction shared within each layer keeps the growth in parameters down
⇒ Additionally, learning a different rate and direction for every inner-loop step can be expected to reduce overfitting (sketch below)
• Learning Per-Layer Per-Step Learning Rates and Gradient Directions (LSLR)
    (LSLR)
    87 / 90

    View Slide

  98. Fixed Outer Loop Learning Rate
    • MAML ͸ Outer-loop ΋ֶश཰͕ݻఆ
    • ֶश཰ͷΞχʔϦϯά͸Ϟσϧͷ൚Խੑೳʹد༩͢Δ͜ͱ͕஌ΒΕͯ
    ͍Δ
    ⇒ Outer-loop ͷֶश཰ʹ cosine ΞχʔϦϯάΛಋೖ
    • Cosine Annealing of Meta-Optimizer Learning Rate (CA)
    88 / 90
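A minimal sketch of the outer loop with a cosine-annealed meta learning rate; meta_objective(model, tasks) is an assumed helper that returns the summed per-task test loss after inner-loop adaptation, and the learning-rate value is illustrative.

import torch

def meta_train(model, task_loader, meta_objective, num_epochs, meta_lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=meta_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                           T_max=num_epochs)
    for _ in range(num_epochs):
        for tasks in task_loader:
            optimizer.zero_grad()
            meta_objective(model, tasks).backward()
            optimizer.step()
        scheduler.step()   # anneal the outer-loop learning rate once per epoch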


99. Experiments
Table 1: MAML++ Omniglot 20-way few-shot results. (Caption from the paper: the authors' MAML reproduction replicates all results except 20-way 1-shot, a problem other authors have also reported (Jamal et al., 2018); their own baselines are reported to give relative intuition on how each fix affects test accuracy; each proposed improvement individually improves on MAML, and the full method improves on the existing state of the art.)
Omniglot 20-way Few-Shot Classification
Approach | 1-shot | 5-shot
Siamese Nets | 88.2% | 97.0%
Matching Nets | 93.8% | 98.5%
Neural Statistician | 93.2% | 98.1%
Memory Mod. | 95.0% | 98.6%
Meta-SGD | 95.93 ± 0.38% | 98.97 ± 0.19%
Meta-Networks | 97.00% | -
MAML (original) | 95.8 ± 0.3% | 98.9 ± 0.2%
MAML (local replication) | 91.27 ± 1.07% | 98.78%
MAML++ | 97.65 ± 0.05% | 99.33 ± 0.03%
MAML + MSL | 91.53 ± 0.69% | -
MAML + LSLR | 95.77 ± 0.38% | -
MAML + BNWB + BNRS | 95.35 ± 0.23% | -
MAML + CA | 93.03 ± 0.44% | -
MAML + DA | 92.3 ± 0.55% | -
• Omniglot 20-way experiment
• MAML++ beats all baseline methods
• Incorporating all of the proposed fixes is confirmed to improve performance
    89 / 90


100. Experiments
Table 2: MAML++ Mini-ImageNet results. (Caption from the paper: MAML++ denotes MAML plus all proposed fixes; the MAML reproduction replicates the original results; MAML++ sets a new state of the art across all tasks, already exceeds all other methods with only 1 inner-loop step, and improves further with additional steps.)
Mini-ImageNet 5-way Few-Shot Classification
Method | Inner Steps | 1-shot | 5-shot
Matching Nets | - | 43.56% | 55.31%
Meta-SGD | 1 | 50.47 ± 1.87% | 64.03 ± 0.94%
Meta-Networks | - | 49.21% | -
MAML (original paper) | 5 | 48.70 ± 1.84% | 63.11 ± 0.92%
MAML (local reproduction) | 5 | 48.25 ± 0.62% | 64.39 ± 0.31%
MAML++ | 1 | 51.05 ± 0.31% | -
MAML++ | 2 | 51.49 ± 0.25% | -
MAML++ | 3 | 51.11 ± 0.11% | -
MAML++ | 4 | 51.65 ± 0.34% | -
MAML++ | 5 | 52.15 ± 0.26% | 68.32 ± 0.44%
• Mini-ImageNet 5-way experiment
• MAML++ beats the baselines even with a single inner-loop step
• Increasing the number of inner-loop steps gives even better performance
    90 / 90
