
MAML and Its Derivatives: A Survey

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML) (ICML 2017)
On First-Order Meta-Learning Algorithms (Reptile) (OpenAI 2018)
RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL BAYES (LLAMA) (ICLR 2018)
Bayesian Model-Agnostic Meta-Learning (BMAML) (NeurIPS 2018)
Probabilistic Model-Agnostic Meta-Learning (PLATIPUS) (NeurIPS 2018)
HOW TO TRAIN YOUR MAML (MAML++) (ICLR 2019)
Meta-Learning with Implicit Gradients (iMAML) (NeurIPS 2019)
Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation (MMAML) (NeurIPS 2019)

Yusuke-Takagi-Q

March 18, 2020

Transcript

  1. MAML and Its Derivatives: A Survey
    Yusuke Takagi
    Nagoya Institute of Technology
    Takeuchi & Karasuyama Lab
    2020/03/18


  2. Outline
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML

  3. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  4. Papers covered
    • MAML, one of the meta-learning algorithms, has many derivatives
    • These slides cover the following papers:
    • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML) (ICML 2017)
    • On First-Order Meta-Learning Algorithms (Reptile) (OpenAI 2018)
    • RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL BAYES (LLAMA) (ICLR 2018)
    • Bayesian Model-Agnostic Meta-Learning (BMAML) (NeurIPS 2018)
    • Probabilistic Model-Agnostic Meta-Learning (PLATIPUS) (NeurIPS 2018)
    • HOW TO TRAIN YOUR MAML (MAML++) (ICLR 2019)
    • Meta-Learning with Implicit Gradients (iMAML) (NeurIPS 2019)
    • Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation (MMAML) (NeurIPS 2019)

  5. Papers covered
    • Reinterpreting MAML as hierarchical Bayes
    • LLAMA
    • Reducing MAML's gradient-computation cost
    • (FOMAML), Reptile, iMAML
    • Extending MAML
    • BMAML, PLATIPUS, MAML++, MMAML

  6. Notes
    • These slides explain the methods mainly through image-classification tasks
    • Notation is unified across almost all methods
    • note: it often differs from the symbols used in each paper
    • Unless otherwise noted, figures are taken from the corresponding paper or made by the author

  7. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  8. Notation
    The following notation is used consistently in the rest of the slides.
    • Task T_i: a pair of a dataset D_i and a loss function L_i
    • D_i = {x_ik, y_ik}_{k=1}^K
    • During training, D_i is split into D_i^tr and D_i^test
    • L_i(ϕ_i, D_i)
    • Each task T_i is drawn from a task distribution P(T)
    • for image classification this is usually the few-shot learning setting
    • θ: the initialization (meta) parameters
    • ϕ_i: the task-specific parameters of T_i
    • Model f_{ϕ_i}(x): X → Y
    • Other symbols are explained as they appear

  9. MAML overview
    [Screenshots: the abstract and Figure 1 of the MAML paper (Finn et al., 2017), showing MAML
    optimizing a representation θ that can quickly adapt to new tasks, and Figure 1 of Rajeswaran
    et al. (2019), contrasting how MAML, first-order MAML, and implicit MAML compute the
    meta-gradient.]
    [Fig: Finn et al. 2017 / Rajeswaran et al. 2019]
    • Learn an initialization θ from which a new task T_i drawn from P(T) can be learned as quickly
      as possible and from as few samples as possible
    • "new task" = classification over classes not used during training
    • Model-Agnostic
    • apart from differentiability, no assumptions are made on the form of the model or the loss
    • Task-Agnostic
    • applicable to many kinds of tasks: regression, classification, reinforcement learning, etc.

  10. MAML algorithm
    The MAML algorithm itself is very simple and repeats the following steps:
    • Sample several tasks T_i from P(T)
    • Set ϕ_i^0 = θ and learn the task-specific parameters ϕ_i of each task T_i by gradient descent:
      for s = 1, ..., S
        ϕ_i^s = ϕ_i^{s-1} − α ∇_{ϕ_i^{s-1}} L_i(ϕ_i^{s-1}, D_i^tr)
    • Update θ so as to reduce each task's test error:
        θ = θ − β ∇_θ Σ_i L_i(ϕ_i^S, D_i^test)
    The whole procedure of sampling tasks and updating θ is called the outer loop;
    the training on each individual task is called the inner loop.
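    As a rough illustration (not the authors' code), a minimal PyTorch-style sketch of this inner/outer
    loop; it assumes PyTorch 2.x for torch.func.functional_call, and task_batch is assumed to be an
    iterable of (D_tr, D_test) tensor pairs:

      import torch

      def inner_adapt(model, loss_fn, params, D_tr, alpha, steps):
          """Inner loop: S gradient steps on the task's training split, keeping the
          computation graph so the outer loop can differentiate through the updates."""
          x_tr, y_tr = D_tr
          for _ in range(steps):
              loss = loss_fn(torch.func.functional_call(model, params, (x_tr,)), y_tr)
              grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
              params = {name: p - alpha * g for (name, p), g in zip(params.items(), grads)}
          return params

      def maml_outer_step(model, loss_fn, meta_opt, task_batch, alpha=0.01, steps=5):
          """Outer loop: sum of test-split losses at the adapted parameters,
          differentiated with respect to the shared initialization theta."""
          meta_opt.zero_grad()
          theta = dict(model.named_parameters())
          meta_loss = 0.0
          for D_tr, D_test in task_batch:                  # tasks T_i sampled from P(T)
              phi = inner_adapt(model, loss_fn, theta, D_tr, alpha, steps)
              x_te, y_te = D_test
              meta_loss = meta_loss + loss_fn(torch.func.functional_call(model, phi, (x_te,)), y_te)
          meta_loss.backward()                             # backpropagates through the inner loop
          meta_opt.step()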

  11. FOMAML
    • Problem with MAML ⇒ the outer-loop gradient computation is heavy
    • it requires Hessian (second-derivative) computations
    • A first-order approximation of this gradient is also proposed in the MAML paper (FOMAML)
    • it is lightweight because the inner-loop gradients do not have to be stored
        θ = θ − β Σ_i ∇_{ϕ_i^S} L_i(ϕ_i^S, D_i^test)
    [Figure 1 (Rajeswaran et al., 2019): MAML differentiates through the optimization path (green),
    while first-order MAML approximates dϕ_i/dθ by the identity.]
    [Fig: Rajeswaran et al. 2019]
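    Continuing the hypothetical sketch above, the first-order variant only changes how the gradient
    reaches θ: the inner loop is run without retaining the graph, and the test-split gradient taken at
    ϕ_i^S is applied to θ directly (a sketch under the same assumptions as before):

      import torch

      def fomaml_outer_step(model, loss_fn, meta_opt, task_batch, alpha=0.01, steps=5):
          """First-order MAML: d(phi)/d(theta) is treated as the identity, so the
          test-split gradient taken at phi_i^S is applied to theta directly."""
          meta_opt.zero_grad()
          theta = dict(model.named_parameters())
          for D_tr, D_test in task_batch:
              # inner loop WITHOUT create_graph: no second derivatives are tracked
              phi = {k: p.detach().clone().requires_grad_(True) for k, p in theta.items()}
              x_tr, y_tr = D_tr
              for _ in range(steps):
                  loss = loss_fn(torch.func.functional_call(model, phi, (x_tr,)), y_tr)
                  grads = torch.autograd.grad(loss, list(phi.values()))
                  phi = {k: (p - alpha * g).detach().requires_grad_(True)
                         for (k, p), g in zip(phi.items(), grads)}
              x_te, y_te = D_test
              test_loss = loss_fn(torch.func.functional_call(model, phi, (x_te,)), y_te)
              test_grads = torch.autograd.grad(test_loss, list(phi.values()))
              for p, g in zip(theta.values(), test_grads):  # apply the task gradient to theta
                  p.grad = g.detach() if p.grad is None else p.grad + g.detach()
          meta_opt.step()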

  12. Experiments
    • The problem setting is what is called few-shot learning
    • in N-way K-shot, one task/dataset is an N-class classification problem with K images per class
    • which N classes are distinguished changes from task to task
    • at meta-test time, classification is over classes that did not exist during training
    • For 2-way it looks roughly like this:
    • training tasks: pick 2 classes from {dog, cat, monkey, bird}, e.g. (dog vs cat), (cat vs bird), ...
    • meta-test task: (horse vs tiger)
    • the model is required to classify unseen classes quickly
    • Datasets: Omniglot and Mini-ImageNet
    • Omniglot: handwritten characters
    • Mini-ImageNet: natural images
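    For illustration only, a toy sketch of how one N-way K-shot episode could be sampled (the
    images_by_class dict and the class names are assumptions, not part of either benchmark's API):

      import random

      def sample_episode(images_by_class, n_way=2, k_shot=5, k_query=15):
          """Sample one N-way K-shot task: K support and k_query query images per class."""
          classes = random.sample(list(images_by_class), n_way)
          support, query = [], []
          for label, cls in enumerate(classes):
              imgs = random.sample(images_by_class[cls], k_shot + k_query)
              support += [(img, label) for img in imgs[:k_shot]]
              query += [(img, label) for img in imgs[k_shot:]]
          return support, query   # play the roles of D_i^tr and D_i^test for task T_i

      # e.g. meta-training classes: dog, cat, monkey, bird; meta-test classes: horse, tiger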

  13. Experiments: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    Table 1 (Finn et al., 2017): few-shot classification on held-out Omniglot characters (top) and the
    MiniImagenet test set (bottom). The ± shows 95% confidence intervals over tasks. The MiniImagenet
    evaluation of the baselines and matching networks is from Ravi & Larochelle (2017).

    Omniglot (Lake et al., 2011)                    5-way 1-shot    5-way 5-shot    20-way 1-shot   20-way 5-shot
    MANN, no conv (Santoro et al., 2016)            82.8%           94.9%           –               –
    MAML, no conv (ours)                            89.7 ± 1.1%     97.5 ± 0.6%     –               –
    Siamese nets (Koch, 2015)                       97.3%           98.4%           88.2%           97.0%
    matching nets (Vinyals et al., 2016)            98.1%           98.9%           93.8%           98.5%
    neural statistician (Edwards & Storkey, 2017)   98.1%           99.5%           93.2%           98.1%
    memory mod. (Kaiser et al., 2017)               98.4%           99.6%           95.0%           98.6%
    MAML (ours)                                     98.7 ± 0.4%     99.9 ± 0.1%     95.8 ± 0.3%     98.9 ± 0.2%

    MiniImagenet (Ravi & Larochelle, 2017)          5-way 1-shot    5-way 5-shot
    fine-tuning baseline                            28.86 ± 0.54%   49.79 ± 0.79%
    nearest neighbor baseline                       41.08 ± 0.70%   51.04 ± 0.65%
    matching nets (Vinyals et al., 2016)            43.56 ± 0.84%   55.31 ± 0.73%
    meta-learner LSTM (Ravi & Larochelle, 2017)     43.44 ± 0.77%   60.60 ± 0.71%
    MAML, first order approx. (ours)                48.07 ± 1.75%   63.15 ± 0.91%
    MAML (ours)                                     48.70 ± 1.84%   63.11 ± 0.92%
    • Beats many existing few-shot methods
    • FOMAML trains with little loss of accuracy compared to MAML
    • and is more than 30% faster than MAML

  14. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  15. MAML and hierarchical Bayes
    [Figure 1 (Grant et al., 2018): (left) the computational graph of the MAML algorithm; (right) the
    probabilistic graphical model for which MAML provides an inference procedure. Each task-specific
    parameter ϕ_j depends on the shared meta-level parameter θ, so estimating θ constrains the
    estimation of each ϕ_j; θ can be estimated by maximizing the marginal likelihood obtained by
    integrating out the ϕ_j (empirical Bayes).]
    • Sample data x_{j1}, ..., x_{jN}, x_{jN+1}, ..., x_{jN+M} ∼ p_{T_j}(x)
    • the first N points are training data, the remaining M points are test data
    • MAML was the problem of maximizing the following likelihood (written here as a loss):

        L(θ) = (1/J) Σ_j [ (1/M) Σ_m − log p( x_{jN+m} | θ − α ∇_θ (1/N) Σ_n − log p(x_{jn} | θ) ) ]   (3.1)

  16. MAML and hierarchical Bayes
    • Assume each task's parameters ϕ_j are influenced by the task-specific parameters of the other
      tasks
    • To model this, assume each task-specific parameter depends on a shared parameter θ
    • Writing the observed data of all tasks as X, this can be expressed as

        p(X | θ) = Π_j ( ∫ p(x_{j1}, ..., x_{jN} | ϕ_j) p(ϕ_j | θ) dϕ_j )   (3.2)

  17. MAML and hierarchical Bayes
    • Finding the θ that maximizes (3.2) is known as empirical Bayes
    • (3.2) is computationally intractable, so a point estimate ϕ̂_j is often used instead
    • In that case the negative log marginal likelihood becomes

        − log p(X | θ) ≈ Σ_j [ − log p(x_{jN+1}, ..., x_{jN+M} | ϕ̂_j) ]   (3.3)

    • Setting ϕ̂_j = θ + α ∇_θ log p(x_{j1}, ..., x_{jN} | θ) recovers exactly the form of (3.1)
    • MAML training corresponds to marginal-likelihood maximization in a hierarchical Bayesian model
      in which each task-specific parameter is approximated by a point estimate

  18. MAML and hierarchical Bayes
    • The number of inner-loop steps and the degree of adaptation to each task are in a trade-off
    • Consider what early stopping of the inner-loop gradient descent means
    • task indices are omitted from here on
    • Taking a second-order approximation of each task's negative log-likelihood
      ℓ(ϕ) = − log p(x_{j1}, ..., x_{jN} | ϕ) around its minimizer ϕ*:

        ℓ(ϕ) ≈ ℓ̃(ϕ) := (1/2) ∥ϕ − ϕ*∥²_{H⁻¹} + ℓ(ϕ*)

    • H = ∇²_ϕ ℓ(ϕ*)
    • ∥z∥_Q = z⊤ Q⁻¹ z

  19. MAML and hierarchical Bayes
    • Using a curvature matrix B gives the following update rule
    • with B = (∇²_ϕ ℓ̃(ϕ_{k−1}))⁻¹ this is Newton's method

        ϕ_k = ϕ_{k−1} − B ∇_ϕ ℓ̃(ϕ_{k−1})

    • Starting from ϕ_0 = θ and performing k update steps yields the ϕ that solves

        min_ϕ ( ∥ϕ − ϕ*∥²_{H⁻¹} + ∥ϕ_0 − ϕ∥²_Q )   (3.9)

    • Q = O Λ⁻¹ ((I − BΛ)⁻ᵏ − I) O⊤
    • Λ = O⊤ H O = diag(λ_1, ..., λ_n)
    • B = O⊤ B⁻¹ O = diag(b_1, ..., b_n)
    • λ_i, b_i ≥ 0, i = 1, ..., n

  20. MAML and hierarchical Bayes
    • (3.9) is a minimization problem with a constraint that the solution stays close to the
      initialization (= early stopping)
    • (3.9) corresponds to maximizing the posterior p(ϕ | x_1, ..., x_N, θ) ∝ p(x_1, ..., x_N | ϕ) p(ϕ | θ)
    • and, in particular, to choosing a Gaussian prior p(ϕ | θ) with mean θ and covariance Q
      (Santos, 1996)
    • Hence early stopping is tied to the choice of prior
    • this fact is used later in the proposed method

  21. LLAMA
    • When the posterior over ϕ is a broad, gentle distribution, a point estimate does not work well
    • Instead of MAP estimation in (3.2), the paper proposes using a Laplace approximation
    • Lightweight Laplace Approximation for Meta-Adaptation (LLAMA)

  22. LLAMA
    • Applying a Laplace approximation to the integral in (3.2) around the mode ϕ*_j:

        ∫ p(X_j | ϕ_j) p(ϕ_j | θ) dϕ_j ≈ p(X_j | ϕ*_j) p(ϕ*_j | θ) det(H_j / 2π)^{−1/2}

    • Using the inner-loop estimate ϕ̂_j as the mode:

        − log p(X | θ) ≈ Σ_j [ − log p(X_j | ϕ̂_j) − log p(ϕ̂_j | θ) + (1/2) log det(H_j) ]

    • − log p(ϕ̂_j | θ) corresponds to the implicit constraint imposed by early stopping
    • (1/2) log det(H_j) corresponds to a penalty on model complexity

  23. LLAMA
    • The Hessian H_j of the posterior can be written as

        H_j = ∇²_{ϕ_j} [− log p(X_j | ϕ_j)] + ∇²_{ϕ_j} [− log p(ϕ_j | θ)]

    • The prior p(ϕ_j | θ) can be approximated by a Gaussian, but its covariance is not diagonal and
      is therefore hard to handle
    • in the experiments it is approximated by a Gaussian with isotropic covariance of precision τ
    • τ is chosen by cross-validation

  24. LLAMA
    • The Hessian of the log-likelihood is also intractable as-is, so it must be approximated
    • approximate it with the Fisher information matrix
    • In the natural-gradient literature there is a method called Kronecker-factored approximate
      curvature (K-FAC) that approximates the inverse Fisher information matrix
    • it uses a block-diagonal approximation together with Kronecker products
    • Using K-FAC, the determinant of the Fisher information matrix can be approximated efficiently

  25. LLAMA
    • Combining the Laplace approximation and K-FAC gives the following LLAMA subroutine
    • Ĥ is an approximation of H
    • η is a hyperparameter

    Subroutine ML-LAPLACE(θ, T)
      Draw N samples x_1, ..., x_N ∼ p_T(x)
      Initialize ϕ ← θ
      for k in 1, ..., K do
        Update ϕ ← ϕ + α ∇_ϕ log p(x_1, ..., x_N | ϕ)
      end
      Draw M samples x_{N+1}, ..., x_{N+M} ∼ p_T(x)
      Estimate quadratic curvature Ĥ
      return − log p(x_{N+1}, ..., x_{N+M} | ϕ) + η log det(Ĥ)

    Subroutine 4: subroutine for computing a Laplace approximation of the marginal likelihood.
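    A very rough sketch of this subroutine for a small model, substituting a diagonal Fisher estimate
    for the K-FAC curvature purely for illustration (nll_fn, the inner step count K, and the curvature
    stand-in are assumptions, not the paper's implementation):

      import torch

      def ml_laplace(nll_fn, theta, x_tr, x_te, alpha=0.01, K=5, eta=1e-6):
          """Inner-loop adaptation followed by a Laplace-style curvature penalty.
          nll_fn(params, batch) returns the negative log-likelihood of a batch."""
          phi = {k: v.clone() for k, v in theta.items()}
          for _ in range(K):   # phi <- phi + alpha * grad log p  ==  phi - alpha * grad NLL
              g = torch.autograd.grad(nll_fn(phi, x_tr), list(phi.values()), create_graph=True)
              phi = {k: p - alpha * gi for (k, p), gi in zip(phi.items(), g)}
          g_te = torch.autograd.grad(nll_fn(phi, x_te), list(phi.values()), create_graph=True)
          log_det_H = sum((gi ** 2 + 1e-8).log().sum() for gi in g_te)   # log det of a diagonal Fisher
          return nll_fn(phi, x_te) + eta * log_det_H

    Summing this value over tasks and backpropagating to θ would play the role of the meta-objective.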

  26. Experiments
    • Approximating the posterior over the task-specific parameters ϕ_j with the Laplace
      approximation makes it possible to inspect uncertainty
    • tasks: regression of sin x with amplitude in [0.1, 5.0] and phase in [0, π]
    • each task provides 10 observations
    [Figure 5 (Grant et al., 2018): samples from the posterior of a model meta-trained on different
    sinusoids, given a few datapoints from a new, unseen sinusoid; the sampled functions are all
    sinusoidal and uncertainty grows where the datapoints are less informative.]
    Left: standard MAML. Right: regressions using parameters sampled from the Laplace-approximated
    posterior obtained after adapting ϕ_j on the 10 points given at meta-test time.

  27. Experiments

    Table 1 (Grant et al., 2018): one-shot classification on the miniImageNet test set, averaged over
    600 test episodes with 95% confidence intervals.

    Model                                               5-way 1-shot acc. (%)
    Fine-tuning                                         28.86 ± 0.54
    Nearest Neighbor                                    41.08 ± 0.70
    Matching Networks FCE (Vinyals et al., 2016)        43.56 ± 0.84
    Meta-Learner LSTM (Ravi & Larochelle, 2017)         43.44 ± 0.77
    SNAIL (Anonymous, 2018)                             45.1
    Prototypical Networks (Snell et al., 2017)          46.61 ± 0.78
    mAP-DLM (Triantafillou et al., 2017)                49.82 ± 0.78
    MAML (Finn et al., 2017)                            48.70 ± 1.84
    LLAMA (Ours)                                        49.40 ± 1.83

    • Classification on Mini-ImageNet with the proposed method (LLAMA)
    • τ = 0.001 (fixed) and η = 10⁻⁶ (selected by cross-validation)
    • all other hyperparameters follow the MAML paper
    • Beats MAML

  28. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  29. Problems with MAML as claimed by this paper
    • MAML's outer-loop differentiation is expensive
    • in both computation and memory
    • A first-order approximation that reduces this cost is proposed (Reptile)
    • more intuitive than FOMAML
    • no reinforcement-learning experiments

  30. Reptile
    • As in MAML, the task-specific parameters ϕ_i are learned with

        ϕ^s = ϕ^{s−1} − α ∇_{ϕ^{s−1}} L(ϕ^{s−1}, D^tr)

    • Then the initialization θ is updated with the rule below
    • the way Reptile moves the initialization is more natural than FOMAML

        θ ← θ + ϵ (ϕ^S − θ)

    [Diagram: how FOMAML and Reptile move the initialization θ relative to the inner-loop trajectory.]
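    A minimal sketch of one Reptile outer step under the same hypothetical setup as the earlier MAML
    sketch (averaging the interpolation over the task batch is an implementation choice, not something
    the slide prescribes):

      import torch

      def reptile_outer_step(model, loss_fn, task_batch, alpha=0.01, epsilon=0.1, steps=5):
          """Reptile: run ordinary SGD on each task, then move theta toward the adapted weights."""
          theta = {k: v.detach().clone() for k, v in model.named_parameters()}
          delta = {k: torch.zeros_like(v) for k, v in theta.items()}
          for D_tr, _ in task_batch:                        # Reptile only needs the training split
              phi = {k: v.clone().requires_grad_(True) for k, v in theta.items()}
              x_tr, y_tr = D_tr
              for _ in range(steps):                        # plain inner-loop gradient steps
                  loss = loss_fn(torch.func.functional_call(model, phi, (x_tr,)), y_tr)
                  grads = torch.autograd.grad(loss, list(phi.values()))
                  phi = {k: (p - alpha * g).detach().requires_grad_(True)
                         for (k, p), g in zip(phi.items(), grads)}
              for k in delta:
                  delta[k] += (phi[k].detach() - theta[k]) / len(task_batch)
          with torch.no_grad():                             # theta <- theta + eps * mean(phi_S - theta)
              for k, p in model.named_parameters():
                  p.add_(epsilon * delta[k])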

  31. Reptile
    • Writing the gradient at each inner-loop step as g_s, ϕ^S is

        ϕ^S = θ − α (g_1 + · · · + g_{S−1})

    • so the update of the initialization θ can also be interpreted as

        θ ← θ − ϵα (g_1 + · · · + g_{S−1})

    [Diagram: the Reptile update seen as a sum of the inner-loop gradients.]

  32. Joint training and Reptile
    • The simplest meta-learning baseline is joint training
    • learn parameters θ that minimize the following expectation so that all tasks are handled
      reasonably well:

        min_θ E_τ [ L_τ(θ, D_τ^tr) ]

    • However, joint training does not learn well
    • e.g. for tasks that regress various sine functions, joint training ends up always predicting 0

  33. Joint training and Reptile
    • Reptile is similar to joint training
    • with only one inner-loop step, Reptile coincides with joint training
    • as explained from the next slide, differences appear once multiple steps are taken

  34. Comparison of MAML, FOMAML, and Reptile
    Note: in this discussion the notation follows the original (Reptile) paper.
    • To keep the expressions simple, define:
    • L_i: the loss on the i-th minibatch of the same task
    • ϕ_1: the initialization parameters (θ in the previous notation)
    • g_i = L′_i(ϕ_i)
    • ϕ_{i+1} = ϕ_i − α g_i
    • ḡ_i = L′_i(ϕ_1)
    • H̄_i = L″_i(ϕ_1)
    • i ∈ [1, k]

  35. Comparison of MAML, FOMAML, and Reptile
    • Approximate the gradient g_i around ϕ_1:

        g_i = L′_i(ϕ_i) = L′_i(ϕ_1) + L″_i(ϕ_1)(ϕ_i − ϕ_1) + O(α²)
            = ḡ_i − α H̄_i Σ_{j=1}^{i−1} g_j + O(α²)
            = ḡ_i − α H̄_i Σ_{j=1}^{i−1} ḡ_j + O(α²)

    • line 1 → line 2: ϕ_i − ϕ_1 = −α Σ_{j=1}^{i−1} g_j
    • line 2 → line 3: g_j = ḡ_j + O(α)

  36. Comparison of MAML, FOMAML, and Reptile
    • Defining U_i(ϕ) = ϕ − α L′_i(ϕ), the MAML outer-loop gradient g_MAML is

        g_MAML = ∂/∂ϕ_1 L_k(ϕ_k)
               = ∂/∂ϕ_1 L_k(U_{k−1}(U_{k−2}( ... (U_1(ϕ_1)))))
               = U′_1(ϕ_1) · · · U′_{k−1}(ϕ_{k−1}) L′_k(ϕ_k)
               = (I − α L″_1(ϕ_1)) · · · (I − α L″_{k−1}(ϕ_{k−1})) L′_k(ϕ_k)
               = ( Π_{j=1}^{k−1} (I − α L″_j(ϕ_j)) ) g_k

  37. Comparison of MAML, FOMAML, and Reptile
    • From g_k = ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j + O(α²) and L″_j(ϕ_j) = H̄_j + O(α),

        g_MAML = ( Π_{j=1}^{k−1} (I − α H̄_j) ) ( ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j ) + O(α²)
               = ( I − α Σ_{j=1}^{k−1} H̄_j ) ( ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j ) + O(α²)
               = ḡ_k − α Σ_{j=1}^{k−1} H̄_j ḡ_k − α H̄_k Σ_{j=1}^{k−1} ḡ_j + O(α²)

  38. Comparison of MAML, FOMAML, and Reptile
    • For k = 2, the gradients of MAML, FOMAML, and Reptile become

        g_MAML    = ḡ_2 − α H̄_2 ḡ_1 − α H̄_1 ḡ_2 + O(α²)
        g_FOMAML  = g_2 = ḡ_2 − α H̄_2 ḡ_1 + O(α²)
        g_Reptile = g_1 + g_2 = ḡ_1 + ḡ_2 − α H̄_2 ḡ_1 + O(α²)

    • Now consider taking the expectation of these gradients over several tasks
    • below, E_{τ,1,2}[...] denotes the expectation over the task τ and over the minibatches L_1, L_2

  39. Comparison of MAML, FOMAML, and Reptile
    • Define the quantity called AvgGrad:

        AvgGrad = E_{τ,1}[ ḡ_1 ]

    • it is the direction in which moving the initialization parameters reduces the task loss
    • the same objective that joint training minimizes

  40. Comparison of MAML, FOMAML, and Reptile
    • Also define the quantity called AvgGradInner:

        AvgGradInner = E_{τ,1,2}[ H̄_2 ḡ_1 ]
                     = E_{τ,1,2}[ H̄_1 ḡ_2 ]
                     = (1/2) E_{τ,1,2}[ H̄_2 ḡ_1 + H̄_1 ḡ_2 ]
                     = (1/2) E_{τ,1,2}[ ∂/∂ϕ_1 (ḡ_1 · ḡ_2) ]

    • this is the direction that increases the inner product between gradients of different minibatches
    • it pushes toward an initialization from which, for any task, the updates computed on different
      minibatches point in similar directions (≈ training proceeds well)
    • which helps generalization within a task

  41. Comparison of MAML, FOMAML, and Reptile
    • Expressing the expected gradients for k = 2 in terms of AvgGrad and AvgGradInner:

        E[g_MAML]    = (1) AvgGrad − (2α) AvgGradInner + O(α²)
        E[g_FOMAML]  = (1) AvgGrad − (α)  AvgGradInner + O(α²)
        E[g_Reptile] = (2) AvgGrad − (α)  AvgGradInner + O(α²)

  42. Comparison of MAML, FOMAML, and Reptile
    • For k > 2:

        E[g_MAML]    = (1) AvgGrad − (2(k − 1)α)      AvgGradInner + O(α²)
        E[g_FOMAML]  = (1) AvgGrad − ((k − 1)α)       AvgGradInner + O(α²)
        E[g_Reptile] = (k) AvgGrad − ((1/2)k(k − 1)α) AvgGradInner + O(α²)

    • The ratio of AvgGradInner to AvgGrad is largest for MAML, then FOMAML, then Reptile
    • FOMAML has half the AvgGradInner weight of MAML, so its performance is somewhat lower
    • Reptile has a larger AvgGrad weight, so its loss is expected to decrease quickly

  43. Reptile and the manifolds of task-optimal solutions
    [Figure 2 (Nichol et al., 2018): the iterates move alternately toward the optimal-solution manifolds
    W*_1 and W*_2 and converge to the point that minimizes the average squared distance to them; unlike
    the expected-loss objective, the expected-distance objective typically has a single minimizer.]
    • Reptile converges to an initialization θ close to each task's manifold of optimal solutions W*_τ
    • closeness is measured in Euclidean distance
    • manifolds are considered because there are infinitely many optimal solutions
    • the paper itself calls this an informal argument
    • details omitted

  44. Experiments
    • Few-shot learning on Mini-ImageNet and Omniglot
    • top table: Mini-ImageNet
    • bottom table: Omniglot
    • Accuracy on par with MAML and FOMAML
    • Transduction is explained on the next slide
    Table 1 (Nichol et al., 2018): results on Mini-ImageNet. MAML and 1st-order MAML results are from
    Finn et al. (2017).

    Algorithm                         1-shot 5-way     5-shot 5-way
    MAML + Transduction               48.70 ± 1.84%    63.11 ± 0.92%
    1st-order MAML + Transduction     48.07 ± 1.75%    63.15 ± 0.91%
    Reptile                           47.07 ± 0.26%    62.74 ± 0.37%
    Reptile + Transduction            49.97 ± 0.32%    65.99 ± 0.58%

    Table 2 (Nichol et al., 2018): results on Omniglot. MAML results are from Finn et al. (2017);
    1st-order MAML results were generated with the same code and hyperparameters as MAML.

    Algorithm                         1-shot 5-way     5-shot 5-way     1-shot 20-way    5-shot 20-way
    MAML + Transduction               98.7 ± 0.4%      99.9 ± 0.1%      95.8 ± 0.3%      98.9 ± 0.2%
    1st-order MAML + Transduction     98.3 ± 0.5%      99.2 ± 0.2%      89.4 ± 0.5%      97.9 ± 0.1%
    Reptile                           95.39 ± 0.09%    98.90 ± 0.10%    88.14 ± 0.15%    96.65 ± 0.33%
    Reptile + Transduction            97.68 ± 0.04%    99.48 ± 0.06%    89.43 ± 0.14%    97.12 ± 0.32%

  45. Experiments
    • Transductive learning: a setting in which the (unlabeled) test data are available in advance and
      can be used during learning
    • with transduction: at test time the entire test set is used at once to predict labels
    • without transduction: test samples are predicted one at a time
    • MAML always uses the current batch's statistics for batch normalization
      ⇒ if the whole test set is fed as one batch, information from the entire test set can be used for
      each prediction
      ⇒ which improves accuracy
    • The experimental setup of the MAML paper therefore implicitly amounts to transductive learning
    • the paper notes that care should be taken with how batch normalization is handled
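    A small sketch of what the two evaluation modes mean in practice for a batch-norm network in
    training mode (model, x_support, and x_query are assumed to be given; this is an illustration of
    the setting, not code from the paper):

      import torch

      def predict_transductive(model, x_query):
          """One forward pass over the whole query set: batch-norm statistics mix
          information across all query samples (the setting MAML implicitly uses)."""
          return model(x_query)

      def predict_non_transductive(model, x_support, x_query):
          """Predict each query sample separately: batch-norm statistics come only from
          the support set plus that single query sample, so nothing is shared across queries."""
          preds = []
          for x in x_query:
              batch = torch.cat([x_support, x.unsqueeze(0)], dim=0)
              preds.append(model(batch)[-1])    # logits of the lone query sample
          return torch.stack(preds)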

  46. Experiments
    [Figure 4 (Nichol et al., 2018): hyperparameter sweeps on 5-shot 5-way Omniglot: final test
    performance vs. number of inner-loop iterations, inner-loop batch size, and outer-loop settings;
    section 6.3 examines the overlap between inner-loop mini-batches (shared-tail vs. separate-tail).]
    • Experiments varying the number of inner-loop iterations and the mini-batch size
    • shared-tail: the inner-loop test samples are drawn arbitrarily from the training data
    • separate-tail: the inner loop uses a proper train/test split
    • replacement / cycling: whether inner-loop mini-batches are re-drawn every time or cycled
    • Reptile ...
    • needs no train/test split in the inner loop
    • and does not need the mini-batches to be re-drawn every time

  47. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  48. Problems with MAML as claimed by this paper
    • MAML's outer-loop differentiation is expensive
    • in both computation and memory
    • Proposes a method that reduces this cost using implicit differentiation and conjugate gradient
      (iMAML)
    • FOMAML and Reptile approximate by simply not computing second derivatives
    • iMAML instead approximates the second-derivative term itself, which improves accuracy
    • no reinforcement-learning experiments

  49. iMAML
    • MAML training could be written as

        θ ← θ − η (1/M) Σ_{i=1}^M ∇_θ L_i(Alg_i(θ))

    • Alg_i(θ) = Alg_i(θ, D_i^tr) = ϕ_i
    • L_i(ϕ_i) = L_i(ϕ_i, D_i^test)
    • Now suppose we obtain the inner-loop optimum ϕ′ defined by

        Alg*_i(θ) := argmin_{ϕ′∈Φ} G_i(ϕ′, θ),  where  G_i(ϕ′, θ) = L̂_i(ϕ′) + (λ/2) ∥ϕ′ − θ∥²

    • L̂_i(ϕ) := L_i(ϕ, D_i^tr)
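    For concreteness, the proximally regularized inner objective can be sketched as below (phi and
    theta are assumed to be matching lists of tensors, and train_loss any differentiable loss of phi;
    minimizing it with any optimizer plays the role of Alg*_i(θ)):

      def inner_objective(train_loss, phi, theta, lam):
          """G_i(phi', theta): task training loss plus a proximal term keeping
          the adapted parameters phi' close to the meta-parameters theta."""
          prox = 0.5 * lam * sum(((p - t) ** 2).sum() for p, t in zip(phi, theta))
          return train_loss(phi) + prox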

  50. iMAML
    • Then, by the chain rule, the update for θ is

        θ ← θ − η (1/M) Σ_{i=1}^M (dAlg*_i(θ)/dθ) ∇_ϕ L_i(Alg*_i(θ))

    • Computing dAlg*_i(θ)/dθ is expensive, so we would like to approximate it
      ⇒ use implicit differentiation

  51. iMAML
    • If the ϕ_i obtained in the inner loop is the optimum ϕ′, then

        ∇_{ϕ′} G(ϕ′, θ)|_{ϕ′=ϕ_i} = 0  ⇒  ∇_{ϕ_i} L̂_i(ϕ_i) + λ(ϕ_i − θ) = 0
                                      ⇒  ϕ_i = θ − (1/λ) ∇_{ϕ_i} L̂_i(ϕ_i)

    • Differentiating this implicitly with respect to θ:

        dϕ_i/dθ = I − (1/λ) ∇² L̂_i(ϕ_i) dϕ_i/dθ   ⇒   dAlg*_i(θ)/dθ = ( I + (1/λ) ∇² L̂_i(ϕ_i) )⁻¹

    • This can be computed from the final inner-loop parameters ϕ_i alone

  52. iMAML
    Computing ( I + (1/λ) ∇² L̂_i(ϕ_i) )⁻¹ has two problems:
    1. the ϕ_i actually obtained is only an approximate optimum
    2. for large models the matrix inverse is intractable
    • For 1, the paper's appendix discusses the resulting error
    • 2 is addressed with the conjugate gradient method
    • an algorithm for solving linear systems whose coefficient matrix is symmetric positive definite

  53. iMAML
    • Denote by g_i the following quantity appearing in the update of θ:

        ( I + (1/λ) ∇² L̂_i(ϕ_i) )⁻¹ ∇_ϕ L_i(Alg*_i(θ)) = g_i

    • The desired g_i is the solution of the linear system

        ( I + (1/λ) ∇² L̂_i(ϕ_i) ) g_i = ∇_ϕ L_i(Alg*_i(θ))

    • which can be obtained by solving the following minimization problem with the conjugate gradient
      method:

        min_w  w⊤ ( I + (1/λ) ∇² L̂_i(ϕ_i) ) w − w⊤ ∇_ϕ L_i(Alg*_i(θ))
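    A rough sketch of this meta-gradient computation with conjugate gradient and Hessian-vector
    products (the helper names, the flattened-parameter treatment, and the fixed number of CG steps
    are illustrative assumptions, not the authors' implementation):

      import torch

      def _flat(tensors):
          return torch.cat([t.reshape(-1) for t in tensors])

      def hvp(train_loss_fn, phi, v):
          """Hessian-vector product of the inner training loss at phi with a flat vector v."""
          g = torch.autograd.grad(train_loss_fn(phi), phi, create_graph=True)
          return _flat(torch.autograd.grad(_flat(g) @ v, phi))

      def imaml_meta_grad(train_loss_fn, test_loss_fn, phi, lam=2.0, cg_steps=5):
          """Approximately solve (I + H/lambda) g = grad_test with conjugate gradient (CG)."""
          b = _flat(torch.autograd.grad(test_loss_fn(phi), phi))
          A = lambda v: v + hvp(train_loss_fn, phi, v) / lam
          x = torch.zeros_like(b)
          r = b.clone()                      # residual for the zero initial guess
          p = r.clone()
          for _ in range(cg_steps):
              Ap = A(p)
              a = (r @ r) / (p @ Ap)
              x = x + a * p
              r_new = r - a * Ap
              p = r_new + ((r_new @ r_new) / (r @ r)) * p
              r = r_new
          return x                           # flat meta-gradient to be applied to theta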

  54. iMAML
    • Empirically, conjugate gradient (CG) nearly converges after about 5 iterations
    • the cost of one CG step is about the same as one inner-loop GD step
      ⇒ lower computational cost than MAML
    • The iMAML algorithm:

    Algorithm 1 Implicit Model-Agnostic Meta-Learning (iMAML)
    1: Require: distribution over tasks P(T), outer step size η, regularization strength λ
    2: while not converged do
    3:   Sample mini-batch of tasks {T_i}_{i=1}^B ∼ P(T)
    4:   for each task T_i do
    5:     Compute task meta-gradient g_i = Implicit-Meta-Gradient(T_i, θ, λ)
    6:   end for
    7:   Average the above gradients to get ∇̂F(θ) = (1/B) Σ_{i=1}^B g_i
    8:   Update meta-parameters with gradient descent: θ ← θ − η ∇̂F(θ)   // (or Adam)
    9: end while

    Algorithm 2 Implicit Meta-Gradient Computation
    1: Input: task T_i, meta-parameters θ, regularization strength λ
    2: Hyperparameters: optimization accuracy thresholds δ and δ′
    3: Obtain task parameters ϕ_i using an iterative optimization solver such that ∥ϕ_i − Alg*_i(θ)∥ ≤ δ
    4: Compute the partial outer-level gradient v_i = ∇_ϕ L_i(ϕ_i)
    5: Use an iterative solver (e.g. CG) along with reverse-mode differentiation (to compute
       Hessian-vector products) to compute g_i such that ∥g_i − (I + (1/λ) ∇² L̂_i(ϕ_i))⁻¹ v_i∥ ≤ δ′
    6: Return: g_i

  55. Advantages of iMAML
    • The number of inner-loop steps can be increased
    • MAML needs memory to store the computation graph, so the steps cannot be increased much
    • Second-order optimization methods can be used in the inner loop
    • e.g. Hessian-free or Newton-CG
    • using them inside MAML would require computing third-order derivatives

  56. Experiments
    [Figure 2(a) (Rajeswaran et al., 2019): accuracy of the computed meta-gradient against the exact
    meta-gradient on a synthetic example, as a function of the number of inner-loop gradient steps.]
    • Compares meta-gradient accuracy on a regression problem whose functions are linear in the
      parameters
    • the meta-gradient d/dθ L_i(Alg*_i(θ)) at the true optimum Alg*_i is compared with the
      meta-gradient obtained at the ϕ_i produced by each number of inner-loop GD steps
    • iMAML is more accurate than MAML

  57. Experiments
    [Figure 2(b) (Rajeswaran et al., 2019): computation and memory trade-offs of iMAML, MAML, and
    FOMAML with a 4-layer CNN on the 20-way 5-shot Omniglot task; iMAML is implemented in PyTorch and
    compared against a PyTorch implementation of MAML.]
    • GPU memory efficiency and computation time compared in the 20-way 5-shot Omniglot setting
    • like FOMAML, GPU memory does not depend on the number of inner-loop GD steps
    • computation time grows more slowly than MAML's
    • but it cannot beat FOMAML because of the additional conjugate-gradient computation

  58. Experiments
    Table 2 (Rajeswaran et al., 2019): Omniglot results. MAML and first-order MAML results are from
    Finn et al. (2017); Reptile results are from Nichol et al. (2018). iMAML (GD) uses 16 and 25 inner
    steps for 5-way and 20-way tasks; iMAML (Hessian-free) uses 5 CG steps for the search direction
    with line search. Both versions use λ = 2.0 and 5 CG steps for the task meta-gradient.

    Algorithm                        5-way 1-shot     5-way 5-shot     20-way 1-shot    20-way 5-shot
    MAML                             98.7 ± 0.4%      99.9 ± 0.1%      95.8 ± 0.3%      98.9 ± 0.2%
    first-order MAML                 98.3 ± 0.5%      99.2 ± 0.2%      89.4 ± 0.5%      97.9 ± 0.1%
    Reptile                          97.68 ± 0.04%    99.48 ± 0.06%    89.43 ± 0.14%    97.12 ± 0.32%
    iMAML, GD (ours)                 99.16 ± 0.35%    99.67 ± 0.12%    94.46 ± 0.42%    98.69 ± 0.1%
    iMAML, Hessian-Free (ours)       99.50 ± 0.26%    99.74 ± 0.11%    96.18 ± 0.36%    99.14 ± 0.1%

    Table 3 (Rajeswaran et al., 2019): Mini-ImageNet 5-way 1-shot accuracy.

    Algorithm                        5-way 1-shot
    MAML                             48.70 ± 1.84%
    first-order MAML                 48.07 ± 1.75%
    Reptile                          49.97 ± 0.32%
    iMAML GD (ours)                  48.96 ± 1.84%
    iMAML HF (ours)                  49.30 ± 1.88%
    • Accuracy comparison on Omniglot (top) and Mini-ImageNet (bottom)
    • iMAML with Hessian-free inner-loop optimization is the strongest
    • the other approximate methods, FOMAML and Reptile, lose much more accuracy on the harder task
      (20-way 1-shot)

  59. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  60. Problems with meta-learning (mainly MAML) as claimed by this paper
    • Tasks with only a few examples have large ambiguity
    • and can lead to overfitting
      ⇒ the paper argues that a more robust method is needed
    • Addresses these issues with Bayesian techniques (BMAML)
    • Stein Variational Gradient Descent (SVGD)
    • Chaser loss

  61. Stein Variational Gradient Descent (SVGD)
    • The inner loop of BMAML is trained with SVGD
    • unlike variational approximation, SVGD does not have to assume a parametric family or a
      factorization for the true posterior
    • SVGD maintains a set of parameters called particles, Θ = {θ^m}_{m=1}^M, and updates each
      particle θ_t ∈ Θ_t at step t by

        θ_{t+1} ← θ_t + ϵ_t ϕ(θ_t)
        where  ϕ(θ_t) = (1/M) Σ_{j=1}^M [ k(θ_t^j, θ_t) ∇_{θ_t^j} log p(θ_t^j) + ∇_{θ_t^j} k(θ_t^j, θ_t) ]

    • ϵ_t is the step size
    • k(x, x′) is a positive-definite kernel
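    A minimal sketch of one SVGD update with an RBF kernel (the fixed bandwidth h is an assumption;
    a median heuristic is commonly used instead):

      import torch

      def rbf_kernel(X, h=1.0):
          """RBF kernel matrix K and the repulsive term sum_j grad_{x_j} k(x_j, x_i); X is (M, d)."""
          K = torch.exp(-torch.cdist(X, X) ** 2 / (2 * h ** 2))
          grad_K = (K.sum(1, keepdim=True) * X - K @ X) / h ** 2
          return K, grad_K

      def svgd_step(particles, log_prob, eps=1e-2, h=1.0):
          """One SVGD update. particles: (M, d) tensor that is part of the autograd graph;
          log_prob maps (M, d) to the per-particle log posterior of shape (M,).
          create_graph=True keeps the graph so later meta-losses can backpropagate through steps."""
          score = torch.autograd.grad(log_prob(particles).sum(), particles, create_graph=True)[0]
          K, grad_K = rbf_kernel(particles, h)
          phi = (K @ score + grad_K) / particles.shape[0]   # kernelized Stein direction
          return particles + eps * phi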

  62. BMAML
    • MAML could be viewed as the following hierarchical Bayes model:

        p(D_T^test | θ_0, D_T^tr) = Π_{τ∈T} ( ∫ p(D_τ^test | ϕ_τ) p(ϕ_τ | D_τ^tr, θ_0) dϕ_τ )

    • Extend this into a form where SVGD can be used (the initialization θ_0 becomes Θ_0, a set of
      M particles θ_0^m):

        p(D_T^test | Θ_0, D_T^tr) ≈ Π_{τ∈T} ( (1/M) Σ_{m=1}^M p(D_τ^test | ϕ_τ^m) ),
        where  ϕ_τ^m ∼ p(ϕ_τ | D_τ^tr, Θ_0)

    • the ϕ_τ^m are obtained by SVGD

  63. BMAML
    • Θ_0 could therefore be trained with

        Θ_0 ← Θ_0 − β ∇_{Θ_0} Σ_{τ∈T} [ − log ( (1/M) Σ_{m=1}^M p(D_τ^test | ϕ_τ^m) ) ]

    • However, this training scheme is unstable and prone to overfitting
    • and it is unsatisfying that the inner loop is Bayesian while the outer loop is not
    • so a scheme that also preserves ambiguity in the outer loop is adopted

  64. BMAML
    • Approximate task posterior obtained by the inner loop: p_τ^n ≡ p^n(ϕ_τ | D_τ^tr; Θ_0)
    • n is the number of inner-loop steps
    • True task posterior: p_τ^∞ ≡ p(ϕ_τ | D_τ^tr ∪ D_τ^test)
    • We want a Θ_0 for which p_τ^n is close to p_τ^∞ ⇒ solve

        argmin_{Θ_0} Σ_τ d_p(p_τ^n ∥ p_τ^∞) ≈ argmin_{Θ_0} Σ_τ d_s(Θ_τ^n(Θ_0) ∥ Θ_τ^∞)

    • Θ_τ^n and Θ_τ^∞ are particle sets sampled from p_τ^n and p_τ^∞ respectively
    • (both are written Θ to match the paper, but since they are parameters from inner-loop training
      their contents are the ϕ^m)
    • d_p(p ∥ q) is a dissimilarity between two distributions
    • d_s(s_1 ∥ s_2) is a distance between two sets

  65. BMAML
    • Problem: p_τ^∞ and Θ_τ^∞ are unknown
    • So Θ_τ^∞ is replaced by Θ_τ^{n+s}
    • Θ_τ^{n+s} is obtained as follows:
    1. Starting from the initialization Θ_0, run n SVGD steps using D_τ^tr to obtain Θ_τ^n
    2. Add D_τ^test to the training data and run s more SVGD steps from Θ_τ^n

  66. BMAML
    • The outer-loop loss is therefore

        L_BMAML(Θ_0) = Σ_{τ∈T_t} d_s(Θ_τ^n ∥ Θ_τ^{n+s}) = Σ_{τ∈T_t} Σ_{m=1}^M ∥θ_τ^{n,m} − θ_τ^{n+s,m}∥²_2

    • the paper reports that small values such as n = s = 1 already work well
    • For large models the number of parameters that must be kept becomes a problem
      ⇒ with large models, the feature-extractor parameters are shared across all particles and only
      the classifier is kept as M particles
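    A rough sketch of the chaser loss for a single task, reusing the hypothetical svgd_step from the
    earlier block; the SVGD steps keep the autograd graph so the loss can be backpropagated to Θ_0,
    and the leader is detached (stop-gradient) as described in the paper:

      def chaser_loss_one_task(particles0, log_post_tr, log_post_all, n=1, s=1):
          """particles0: (M, d) initial particles Theta_0 (requires grad).
          log_post_tr uses D^tr only; log_post_all uses D^tr together with D^test."""
          chaser = particles0
          for _ in range(n):                       # chaser Theta^n: n SVGD steps on the train split
              chaser = svgd_step(chaser, log_post_tr)
          leader = chaser
          for _ in range(s):                       # leader Theta^{n+s}: s more steps with train + test
              leader = svgd_step(leader, log_post_all)
          return ((chaser - leader.detach()) ** 2).sum()   # pull the chaser toward the stopped leader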

  67. Experiments
    [Figure 2 (Yoon et al., 2018), miniImagenet: (a) few-shot classification with different numbers of
    particles, (b) different numbers of meta-training tasks, (c) an active-learning setting.]
    • Experiments on Mini-ImageNet
    • (a): accuracy comparison while varying the number of particles
    • EMAML is an ensemble of several independently trained MAML models
    • (b): the relation between the number of meta-training tasks and training iterations

  68. Next Section
    1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


  69. Problems with meta-learning (mainly MAML) as claimed by this paper
    • Settings such as few-shot learning involve large ambiguity
    • even if meta-training yields the best possible initialization, it may not contain enough
      information to solve a new task
      ⇒ a method that can propose multiple solutions is desirable
      ⇒ proposes a method that accounts for scalability and uncertainty (PLATIPUS)
    • amortized variational inference
    • no reinforcement-learning experiments

  70. MAML recap
    • MAML could be viewed in a hierarchical Bayes framework:

        p(y_i^test | x_i^tr, y_i^tr, x_i^test) = ∫ p(y_i^test | x_i^test, ϕ_i) p(ϕ_i | x_i^tr, y_i^tr, θ) dϕ_i
                                               ≈ p(y_i^test | x_i^test, ϕ*_i)

    • ϕ*_i is the MAP estimate
    • in MAML it is estimated by gradient descent in the inner loop

  71. PLATIPUS
    • In MAML the initialization θ was deterministic
    • The proposed method considers a distribution p(θ) over initializations and trains it with
      amortized variational inference, as used in VAEs
    • A VAE roughly works as follows:
    • encode the latent variable z from the input x with a network q_ψ
    • decode x from the sampled z
    • train by reducing the reconstruction error while keeping the approximate posterior q_ψ(z|x)
      close to the prior p(z)
    • The idea is to apply this scheme to MAML

  72. PLATIPUS
    • Treat the initialization θ as a latent variable and place a prior p(θ) on it
    • p(θ) is a Gaussian with mean µ_θ and diagonal covariance σ²_θ
    • µ_θ and σ²_θ are learnable parameters
    • unlike in a VAE, the parameters of the prior are also learned
    • In MAML, the task-specific parameters ϕ_i are obtained by MAP estimation
    • If the true MAP estimate ϕ*_i were available, θ and (x_i^tr, y_i^tr) would be independent
      ⇒ so the approximate posterior is conditioned only on the test data: q_ψ(θ | x_i^test, y_i^test)

  73. PLATIPUS
    • Define the approximate posterior q_ψ(θ | x_i^test, y_i^test) of p(θ) as

        q_ψ(θ | x_i^test, y_i^test) = N( µ_θ + γ_q ∇_{µ_θ} log p(y_i^test | x_i^test, µ_θ), v_q )

    • q_ψ is the inference distribution with parameters ψ
    • v_q is a diagonal covariance matrix and a learnable parameter
    • γ_q is a learning rate

  74. PLATIPUS
    • In reality the true MAP estimate is not available, so θ and (x_i^tr, y_i^tr) are not independent
    • To account for this dependence, the "prior" is modified to

        p_i(θ_i | x_i^tr, y_i^tr) = N( µ_θ + γ_p ∇_{µ_θ} log p(y_i^tr | x_i^tr, µ_θ), σ²_θ )

    • γ_p is a learning rate
    • empirically, this correction reportedly gives better results

  75. PLATIPUS
• Training maximizes the following variational lower bound on the approximate likelihood
• i.e., raise the classification accuracy while pulling the approximate posterior toward the prior

log p(y_i^test | x_i^test, x_i^tr, y_i^tr) ≥ E_{θ∼q_ψ}[ log p(y_i^test | x_i^test, φ_i^⋆) ] - D_KL( q_ψ(θ | x_i^test, y_i^test) ∥ p(θ_i | x_i^tr, y_i^tr) )

• The algorithm of the proposed method (the paper highlights the differences from MAML in red); a code sketch of one meta-training step follows below

Algorithm 1: Meta-training
Require: p(T): distribution over tasks
1: initialize Θ := {µ_θ, σ_θ², v_q, γ_p, γ_q}
2: while not done do
3:   Sample batch of tasks T_i ∼ p(T)
4:   for all T_i do
5:     D^tr, D^test = T_i
6:     Evaluate ∇_{µ_θ} L(µ_θ, D^test)
7:     Sample θ ∼ q = N(µ_θ - γ_q ∇_{µ_θ} L(µ_θ, D^test), v_q)
8:     Evaluate ∇_θ L(θ, D^tr)
9:     Compute adapted parameters with gradient descent: φ_i = θ - α ∇_θ L(θ, D^tr)
10:    Let p(θ | D^tr) = N(µ_θ - γ_p ∇_{µ_θ} L(µ_θ, D^tr), σ_θ²)
11:  Compute ∇_Θ Σ_{T_i} [ L(φ_i, D^test) + D_KL(q(θ | D^test) ∥ p(θ | D^tr)) ]
12:  Update Θ using Adam

Algorithm 2: Meta-testing
Require: training data D^tr_T for a new task T; learned Θ
1: Sample θ from the prior p(θ | D^tr)
2: Evaluate ∇_θ L(θ, D^tr)
3: Compute adapted parameters with gradient descent: φ = θ - α ∇_θ L(θ, D^tr)
    67 / 90
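To make the algorithm above concrete, the following is a minimal PyTorch-style sketch of one PLATIPUS meta-training step for a single task. It assumes the model parameters live in one flat tensor and that a helper loss_fn(params, data) returns the negative log-likelihood on that data; both are simplifications for illustration, not the authors' implementation.

import torch

def platipus_meta_step(mu_theta, log_sigma_sq, log_v_q, gamma_q, gamma_p,
                       alpha, loss_fn, D_train, D_test, kl_weight=0.15):
    # q(theta | D_test): Gaussian centered at a gradient step on the test loss
    test_grad = torch.autograd.grad(loss_fn(mu_theta, D_test), mu_theta,
                                    create_graph=True)[0]
    mu_q = mu_theta - gamma_q * test_grad
    v_q = log_v_q.exp()
    theta = mu_q + v_q.sqrt() * torch.randn_like(mu_q)  # reparameterized sample

    # inner loop: adapt the sampled initialization on the training set
    train_grad = torch.autograd.grad(loss_fn(theta, D_train), theta,
                                     create_graph=True)[0]
    phi = theta - alpha * train_grad

    # corrected prior p(theta | D_train): a gradient step on the training loss
    prior_grad = torch.autograd.grad(loss_fn(mu_theta, D_train), mu_theta,
                                     create_graph=True)[0]
    mu_p = mu_theta - gamma_p * prior_grad
    sigma_sq = log_sigma_sq.exp()

    # KL between the diagonal Gaussians q(theta | D_test) and p(theta | D_train)
    kl = 0.5 * (sigma_sq.log() - v_q.log()
                + (v_q + (mu_q - mu_p) ** 2) / sigma_sq - 1.0).sum()

    # per-task negative ELBO; sum over the task batch and backpropagate into
    # Theta = {mu_theta, log_sigma_sq, log_v_q, gamma_q, gamma_p} outside
    return loss_fn(phi, D_test) + kl_weight * kl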


76. Classification experiment
• To show the benefit of handling ambiguity, a quantity called Coverage is computed in addition to accuracy
• The dataset is celebA
• The task is to separate positives from negatives, but the positives in the meta-test training set share three ground-truth attributes
• Example: 1: wearing a hat, 2: mouth open, 3: young; images satisfying all three are positives, images satisfying none are negatives
• The meta-test test set takes images satisfying two of the three attributes as positives (an ambiguous task)
• This gives three possible classification tests (1 & 2, 1 & 3, 2 & 3)
• Sample an initialization from the learned prior, compute the log-likelihood on each of the three tests, and assign the sample to the test with the highest value
• In other words, measure which test each sample is useful for
• Repeat this several times and average the number of tests that received at least one sample (this is the Coverage; see the sketch below)
• The minimum is 1 and the maximum is 3
    68 / 90
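The Coverage number described above can be computed in a few lines. The sketch below assumes two hypothetical helpers, sample_prior() for drawing an initialization from the learned prior and log_likelihood(theta, task) for scoring it on one of the candidate tests; it is not the authors' evaluation code.

import numpy as np

def coverage(sample_prior, log_likelihood, tasks, n_samples=10, n_repeats=100):
    # Average number of candidate tests (out of len(tasks)) that receive at
    # least one sampled initialization, where each sample is assigned to the
    # test with the highest log-likelihood. Ranges from 1 to len(tasks).
    counts = []
    for _ in range(n_repeats):
        assigned = set()
        for _ in range(n_samples):
            theta = sample_prior()
            scores = [log_likelihood(theta, task) for task in tasks]
            assigned.add(int(np.argmax(scores)))
        counts.append(len(assigned))
    return float(np.mean(counts))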


77. Classification experiment
[Figure 6 from the PLATIPUS paper: sampled classifiers for an ambiguous meta-test task. In the meta-test training set (a), PLATIPUS observes five positives sharing three attributes (Mouth Open, Young, Wearing Hat) and five negatives; a classifier using any two attributes can correctly classify the training set. (b) shows the three possible two-attribute tasks the training set can correspond to, together with the labels predicted by the best sampled classifier for each; different samples capture the three possible explanations, with some paying attention to hats and others not.]

Table 1: Ambiguous celebA (5-shot)
Method | Accuracy | Coverage (max = 3) | Average NLL
MAML | 89.00 ± 1.78% | 1.00 ± 0.00 | 0.73 ± 0.06
MAML + noise | 84.3 ± 1.60% | 1.89 ± 0.04 | 0.68 ± 0.05
PLATIPUS (ours, KL weight = 0.05) | 88.34 ± 1.06% | 1.59 ± 0.03 | 0.67 ± 0.05
PLATIPUS (ours, KL weight = 0.15) | 87.8 ± 1.03% | 1.94 ± 0.04 | 0.56 ± 0.04
(Table caption from the paper: PLATIPUS covers almost twice as many modes as MAML with comparable accuracy; MAML + noise adds noise to the gradient without performing variational inference, which improves coverage but lowers accuracy and average log-likelihood.)
• PLATIPUS loses to MAML in accuracy, but its Coverage is close to 2, so it does account for the ambiguity
• Coverage close to 2 ⇒ even when a test slightly different from training appears, there is a chance of sampling a useful initialization
• Since it still loses in accuracy, it is unclear whether there is an advantage for plain image classification
• MAML determines the initialization deterministically, so its Coverage is always 1
    69 / 90


78. Regression experiment
[Figure 2 from the paper: samples from PLATIPUS trained for 5-shot regression, shown as colored dotted lines; the tasks regress to sinusoid and linear functions, shown in gray; MAML, shown in black, is a deterministic procedure and hence learns a single function rather than reasoning about the distribution over potential functions; even though PLATIPUS is trained for 5-shot regression, it can reason over its uncertainty when given a variable number of datapoints at test time.]
• The proposed method produces multiple sampled functions, as the figure shows
• MAML deterministically yields a single function
• The paper also reports active-learning experiments
    70 / 90


  79. Next Section
1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


80. Problems with MAML that this paper points out
• MAML aims to find a single initialization
• That requires the target parameters of the individual tasks to be close to each other
• When the task distribution is scattered and the modes are far apart, multiple initializations would be preferable
⇒ Proposal of a method that uses a task embedding to adapt quickly to tasks drawn from multiple datasets (MMAML)
    71 / 90


  81. MMAML
[Figure 1: Model overview. The modulation network produces a task embedding v, which is used to generate parameters {τ_i} that modulate the task network; the task network then adapts the modulated parameters to fit the target task.]

Algorithm 1: MMAML meta-training procedure
1: Input: task distribution P(T), hyper-parameters α and β
2: Randomly initialize θ and ω
3: while not DONE do
4:   Sample batches of tasks T_j ∼ P(T)
5:   for all j do
6:     Infer v = h({x, y}^K; ω_h) with K samples from D^train_{T_j}
7:     Generate parameters τ = {g_i(v; ω_g) | i = 1, ..., N} to modulate each block of the task network f
8:     Evaluate ∇_θ L_{T_j}(f(x; θ, τ); D^train_{T_j}) w.r.t. the K samples
9:     Compute adapted parameters with gradient descent: θ'_{T_j} = θ - α ∇_θ L_{T_j}(f(x; θ, τ); D^train_{T_j})
10:  end for
11:  Update θ with ∇_θ Σ_{T_j ∼ P(T)} L_{T_j}(f(x; θ', τ); D^val_{T_j})
12:  Update ω_g with ∇_{ω_g} Σ_{T_j ∼ P(T)} L_{T_j}(f(x; θ', τ); D^val_{T_j})
13:  Update ω_h with ∇_{ω_h} Σ_{T_j ∼ P(T)} L_{T_j}(f(x; θ', τ); D^val_{T_j})
14: end while
• The MMAML pipeline is:
1. Compute the task embedding vector v with the Modulation Network
2. Use v to modulate the Task Network parameters, producing an initialization suited to the task
    72 / 90


  82. Modulation Network
• Given a task's K data points and labels {x_k, y_k}_{k=1,...,K}, a network with parameters w_h computes the embedding vector v:

v = h({(x_k, y_k) | k = 1, ..., K}; w_h)

• From this v, generate the τ_i that modulate each block of the Task Network (each convolutional or fully connected layer):

τ_i = g_i(v; w_{g_i}), where i = 1, ..., N

• N is the total number of blocks in the Task Network (a code sketch follows below)
    73 / 90
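A rough sketch of how such a modulation network could look is given below. The pooled MLP encoder and the layer sizes are illustrative assumptions (the paper uses an LSTM for regression and a CNN for images); only the overall structure, an encoder h producing v followed by one generator g_i per block, reflects the method.

import torch
import torch.nn as nn

class ModulationNetwork(nn.Module):
    def __init__(self, xy_dim, embed_dim, block_dims):
        super().__init__()
        # h({(x_k, y_k)}; w_h): encodes the K support pairs into one embedding v
        self.encoder = nn.Sequential(nn.Linear(xy_dim, 64), nn.ReLU(),
                                     nn.Linear(64, embed_dim))
        # one small generator g_i per task-network block: tau_i = g_i(v; w_{g_i})
        self.generators = nn.ModuleList(
            [nn.Linear(embed_dim, d) for d in block_dims])

    def forward(self, x_support, y_support):
        pair = torch.cat([x_support, y_support], dim=-1)   # (K, x_dim + y_dim)
        v = self.encoder(pair).mean(dim=0)                  # average-pool over K
        taus = [g(v) for g in self.generators]
        return v, taus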


  83. Modulation Network
• The parameters θ_i of the i-th block of the initialization are modulated by τ_i as follows:

φ_i = θ_i ⊙ τ_i

• For ⊙ there are an attention-based variant and feature-wise linear modulation (FiLM); the paper chooses the latter
• FiLM apparently gave better accuracy
• The generated τ_i are kept fixed during the inner loop
    74 / 90


  84. FiLM
• FiLM splits the vector τ into τ_γ and τ_β and applies the following transformation to a network layer F_θ:

F_φ = F_θ ⊗ τ_γ + τ_β

• where ⊗ is channel-wise multiplication (see the sketch below)
• This can be seen as a generalization of the attention-based modulation
    75 / 90
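A minimal sketch of the FiLM operation on a convolutional feature map, assuming τ has already been split into a per-channel scale tau_gamma and shift tau_beta:

import torch

def film(features, tau_gamma, tau_beta):
    # features: (batch, channels, H, W); tau_gamma, tau_beta: (channels,)
    # channel-wise multiply and add: F_phi = F_theta * tau_gamma + tau_beta
    return features * tau_gamma.view(1, -1, 1, 1) + tau_beta.view(1, -1, 1, 1)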


85. Experiments: baselines
• MAML
• Multi-MAML
  • Prepare as many MAML models as there are domains; each model is trained only on tasks from its own domain
  • At test time, the MAML corresponding to the task's domain is used
  • It is therefore an upper bound for MAML and not usable in practice
  • If MAML is much worse than Multi-MAML, MAML cannot cope with a scattered task distribution
• LSTM Learner (regression only)
  • Regression with an LSTM
  • Since an LSTM is used as the Modulation Network for regression, an LSTM-only variant is included for comparison


86. Results: regression
Table 1: Mean squared error (MSE) on multimodal 5-shot regression with 2, 3, and 5 modes (Gaussian noise with µ = 0 and σ = 0.3 is applied; Multi-MAML uses ground-truth task modes to select the corresponding MAML model; MMAML with FiLM modulation outperforms the other methods by a margin).
Method | 2 modes (post modulation / post adaptation) | 3 modes | 5 modes
MAML | - / 1.085 | - / 1.231 | - / 1.668
Multi-MAML | - / 0.433 | - / 0.713 | - / 1.082
LSTM Learner | 0.362 / - | 0.548 / - | 0.898 / -
MMAML (Softmax) | 1.548 / 0.361 | 2.213 / 0.444 | 2.421 / 0.939
MMAML (FiLM) | 2.421 / 0.336 | 1.923 / 0.444 | 2.166 / 0.868
• Regression tasks are generated from 5 domains
• sine, linear, quadratic, absolute value of a linear function, tanh
• Comparison with the baselines when 2, 3, and 5 of these domains are used
• An LSTM is used as the Modulation Network
• The data are sorted by x and fed in order
• MMAML with FiLM gives the best results


87. Results: image classification
Table 2: Classification test accuracies on multimodal few-shot image classification with 2, 3, and 5 modes (Multi-MAML uses ground-truth dataset labels to select the corresponding MAML model; MMAML outperforms MAML and is comparable with Multi-MAML in all scenarios).
Method | 2 modes: 5-way 1-shot / 5-way 5-shot / 20-way 1-shot | 3 modes: 5-way 1-shot / 5-way 5-shot / 20-way 1-shot | 5 modes: 5-way 1-shot / 5-way 5-shot / 20-way 1-shot
MAML | 66.80% / 77.79% / 44.69% | 54.55% / 67.97% / 28.22% | 44.09% / 54.41% / 28.85%
Multi-MAML | 66.85% / 73.07% / 53.15% | 55.90% / 62.20% / 39.77% | 45.46% / 55.92% / 33.78%
MMAML (ours) | 69.93% / 78.73% / 47.80% | 57.47% / 70.15% / 36.27% | 49.06% / 60.83% / 33.97%
• N-way K-shot tasks
• Five datasets are used: Omniglot, Mini-ImageNet, FC100, CUB, AIRCRAFT
• Comparison with the baselines when 2, 3, and 5 of these datasets are used
• A CNN is used as the Modulation Network and an MLP to generate τ
• Strong on 5-way 1-shot and 5-shot, but struggles in the 20-way setting
• Perhaps it cannot fully cope with harder tasks?
    78 / 90


88. Results: visualizing the task embeddings
[Figure 3: tSNE plots of the task embeddings produced by the model for randomly sampled tasks, with color indicating the mode of the task distribution: (a) regression, (b) image classification, (c)-(d) RL Reacher. Tasks from different modes form clear clusters in the embedding space, showing that MMAML identifies the task mode and produces a meaningful embedding; in regression, the distance between modes roughly matches the similarity of the functions.]
• The task embeddings for regression and image classification are visualized with tSNE
• Regression
  • Similar domains (quadratic function and absolute value of a linear function) lie close to each other
  • Domains unlike the others (sine, tanh) form their own clusters
• For classification, one can confirm that the domains are clearly separated


  89. Next Section
1 Relationships between MAML and its derivatives
    2 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    3 RECASTING GRADIENT-BASED META-LEARNING AS HIERARCHICAL
    BAYES
    4 On First-Order Meta-Learning Algorithms
    5 Meta-Learning with Implicit Gradients
    6 Bayesian Model-Agnostic Meta-Learning
    7 Probabilistic Model-Agnostic Meta-Learning
    8 Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation
    9 HOW TO TRAIN YOUR MAML


90. Problems with MAML
• This paper identifies six problems with MAML and proposes a fix for each of them (MAML++)
• Training Instability
• Second Order Derivative Cost
• Absence of Batch Normalization Statistic Accumulation
• Shared (across step) Batch Normalization Bias
• Shared Inner Loop (across step and across parameter) Learning Rate
• Fixed Outer Loop Learning Rate
• The following slides describe these problems and their fixes
    80 / 90


  91. Training Instability
• MAML computes the outer-loop gradient after several optimization steps in the inner loop
• MAML's convolutional layers have no skip-connections
⇒ The backward pass multiplies many gradients together, so depending on the model and hyperparameters, exploding or vanishing gradients occur easily and training is unstable
    81 / 90


92. Training Instability → Multi-Step Loss Optimization (MSL)
• Multi-Step Loss Optimization (MSL) is proposed as the fix
• MAML updates the initialization θ using only φ^S, the parameters obtained after S inner-loop steps
• B is the number of tasks

θ = θ - β ∇_θ Σ_{b=1}^{B} L_b(φ_b^S, D_b^test)

• MSL instead updates θ with a weighted sum of the losses computed with the intermediate φ^s from every inner-loop step (sketch below):

θ = θ - β ∇_θ Σ_{b=1}^{B} Σ_{s=1}^{S} v_s L_b(φ_b^s, D_b^test)
    82 / 90
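A minimal PyTorch-style sketch of the multi-step loss, assuming a flat parameter tensor and a loss_fn(params, data) helper (both simplifications, not the paper's code):

import torch

def msl_outer_loss(theta, inner_lr, loss_fn, D_train, D_test, weights):
    # Take S inner-loop steps and accumulate a weighted test loss after every
    # step instead of only after the last one; len(weights) == S.
    phi = theta
    total = 0.0
    for v_s in weights:
        grad = torch.autograd.grad(loss_fn(phi, D_train), phi,
                                   create_graph=True)[0]
        phi = phi - inner_lr * grad
        total = total + v_s * loss_fn(phi, D_test)   # per-step test loss
    return total   # backpropagate this to update theta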


93. Training Instability → Multi-Step Loss Optimization (MSL)
• v_s are weights adjusted by annealing (sketch below)
• Concretely:
  • At the start all steps get the same weight
  • The weight is then gradually shifted toward the later inner-loop steps
  • At the end only the S-th step is used, as in the original
• This stabilizes training (see the figure below)
[Figure 1 from the MAML++ paper (ICLR 2019): 3 seeds of the original strided MAML vs. strided MAML++; 2 out of 3 seeds with the original strided MAML appear to become unstable.]
    83 / 90
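One possible annealing schedule for the weights v_s is sketched below; the linear schedule is an illustrative choice, not the exact one used in the paper:

import numpy as np

def msl_weights(epoch, anneal_epochs, num_steps):
    # frac = 0 -> uniform weights, frac = 1 -> all weight on the final step
    frac = min(epoch / anneal_epochs, 1.0)
    w = np.full(num_steps, (1.0 - frac) / num_steps)
    w[-1] += frac          # move the remaining mass to step S
    return w               # always sums to 1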


94. Second Order Derivative Cost → Derivative-Order Annealing (DA)
• A major problem with MAML is that it requires computing second-order derivatives (the Hessian)
• First-order approximations such as FOMAML degrade performance
• So FOMAML is used for the first 50 epochs and full MAML afterwards (sketch below)
• This speeds up training, and the initial FOMAML phase acts like pre-training, so training also becomes more stable
    84 / 90
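In code, derivative-order annealing amounts to toggling whether the inner-loop gradient keeps its graph; a minimal sketch with a flat parameter tensor assumed:

import torch

def inner_step(phi, loss, alpha, use_second_order):
    # use_second_order = (epoch >= 50): first-order (FOMAML-like) updates early,
    # full second-order MAML afterwards
    grad = torch.autograd.grad(loss, phi, create_graph=use_second_order)[0]
    if not use_second_order:
        grad = grad.detach()   # cut the graph: first-order approximation
    return phi - alpha * grad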


  95. Absence of Batch Normalization Statistic Accumulation
• The original MAML (implementation) uses the statistics of the current batch for Batch Normalization
• In the original MAML implementation, Batch Normalization is not learned in the inner loop
• The model therefore has to cope with each batch's mean and variance, which hurts efficiency
⇒ Use running statistics accumulated per inner-loop step instead
• Per-Step Batch Normalization Running Statistics (BNRS)
    85 / 90


  96. Shared (across step) Batch Normalization Bias
• The Batch Normalization bias is not learned in the inner loop; the same value is reused at every step
• In the original MAML implementation, Batch Normalization is not learned in the inner loop
• In reality, once the model parameters are updated in the inner loop, the feature distribution changes as well
⇒ Learn a separate bias for every inner-loop step (sketch below)
• Per-Step Batch Normalization Weights and Biases (BNWB)
    86 / 90
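A minimal sketch covering both BNRS and BNWB: one BatchNorm module, and hence one set of running statistics and one learnable weight/bias pair, per inner-loop step, selected by the current step index. This is an illustrative layout, not the authors' implementation.

import torch
import torch.nn as nn

class PerStepBatchNorm(nn.Module):
    def __init__(self, num_features, num_steps):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_steps)])

    def forward(self, x, step):
        # step-specific running statistics, weight and bias
        return self.bns[step](x)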


  97. Shared Inner Loop (across step and across parameter) Learning Rate
• MAML uses a fixed inner-loop learning rate
• Finding an appropriate learning rate is costly
• Learning a per-parameter learning rate and gradient direction has been reported to improve performance, but it increases the number of parameters to learn
• Li et al., 2017
⇒ Learning a single rate and direction shared within each layer keeps the growth in parameters down
⇒ Additionally, learning a different rate and direction for every inner-loop step can be expected to reduce overfitting (sketch below)
• Learning Per-Layer Per-Step Learning Rates and Gradient Directions (LSLR)
    (LSLR)
    87 / 90

    View Slide

  98. Fixed Outer Loop Learning Rate
    • MAML ͸ Outer-loop ΋ֶश཰͕ݻఆ
    • ֶश཰ͷΞχʔϦϯά͸Ϟσϧͷ൚Խੑೳʹد༩͢Δ͜ͱ͕஌ΒΕͯ
    ͍Δ
    ⇒ Outer-loop ͷֶश཰ʹ cosine ΞχʔϦϯάΛಋೖ
    • Cosine Annealing of Meta-Optimizer Learning Rate (CA)
    88 / 90
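A minimal sketch of the outer loop with a cosine-annealed meta learning rate; meta_objective(model, tasks) is an assumed helper that returns the summed per-task test loss after inner-loop adaptation, and the learning-rate value is illustrative.

import torch

def meta_train(model, task_loader, meta_objective, num_epochs, meta_lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=meta_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                           T_max=num_epochs)
    for _ in range(num_epochs):
        for tasks in task_loader:
            optimizer.zero_grad()
            meta_objective(model, tasks).backward()
            optimizer.step()
        scheduler.step()   # anneal the outer-loop learning rate once per epoch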


99. Experiments
Table 1: MAML++ Omniglot 20-way few-shot results. (Caption from the paper: the authors' MAML reproduction replicates all results except 20-way 1-shot, a problem other authors have also reported (Jamal et al., 2018); their own baselines are reported to give relative intuition on how each fix affects test accuracy; each proposed improvement individually improves on MAML, and the full method improves on the existing state of the art.)
Omniglot 20-way Few-Shot Classification
Approach | 1-shot | 5-shot
Siamese Nets | 88.2% | 97.0%
Matching Nets | 93.8% | 98.5%
Neural Statistician | 93.2% | 98.1%
Memory Mod. | 95.0% | 98.6%
Meta-SGD | 95.93 ± 0.38% | 98.97 ± 0.19%
Meta-Networks | 97.00% | -
MAML (original) | 95.8 ± 0.3% | 98.9 ± 0.2%
MAML (local replication) | 91.27 ± 1.07% | 98.78%
MAML++ | 97.65 ± 0.05% | 99.33 ± 0.03%
MAML + MSL | 91.53 ± 0.69% | -
MAML + LSLR | 95.77 ± 0.38% | -
MAML + BNWB + BNRS | 95.35 ± 0.23% | -
MAML + CA | 93.03 ± 0.44% | -
MAML + DA | 92.3 ± 0.55% | -
• Omniglot 20-way experiment
• MAML++ beats all baseline methods
• Incorporating all of the proposed fixes is confirmed to improve performance
    89 / 90


100. Experiments
Table 2: MAML++ Mini-ImageNet results. (Caption from the paper: MAML++ denotes MAML plus all proposed fixes; the MAML reproduction replicates the original results; MAML++ sets a new state of the art across all tasks, already exceeds all other methods with only 1 inner-loop step, and improves further with additional steps.)
Mini-ImageNet 5-way Few-Shot Classification
Method | Inner Steps | 1-shot | 5-shot
Matching Nets | - | 43.56% | 55.31%
Meta-SGD | 1 | 50.47 ± 1.87% | 64.03 ± 0.94%
Meta-Networks | - | 49.21% | -
MAML (original paper) | 5 | 48.70 ± 1.84% | 63.11 ± 0.92%
MAML (local reproduction) | 5 | 48.25 ± 0.62% | 64.39 ± 0.31%
MAML++ | 1 | 51.05 ± 0.31% | -
MAML++ | 2 | 51.49 ± 0.25% | -
MAML++ | 3 | 51.11 ± 0.11% | -
MAML++ | 4 | 51.65 ± 0.34% | -
MAML++ | 5 | 52.15 ± 0.26% | 68.32 ± 0.44%
• Mini-ImageNet 5-way experiment
• MAML++ beats the baselines even with a single inner-loop step
• Increasing the number of inner-loop steps gives even better performance
    90 / 90
