Slide 1

Meta-learning from Tasks with Heterogeneous Attribute Spaces
Tomoharu Iwata, Atsutoshi Kumagai (NeurIPS 2020)
December 16, 2021

Slide 2

Contents

1. Preface
2. Proposed Method
3. Training
4. Experiments & Results

Slide 3

Preface

So far, many meta-learning methods have been proposed:
• Model-Agnostic Meta-Learning (MAML)
• Probabilistic Active Meta-Learning (PAML)
• Meta reinforcement learning, etc.

However, these methods assume that all training and target tasks share the same attribute (feature) space, so they are inapplicable when attribute sizes differ across tasks.

Example of tasks whose attribute sizes differ: $y = w x$ vs. $y = w_1 x_1 + w_2 x_2$

Slide 4

Preface

We propose a heterogeneous meta-learning method that trains a model on tasks with various attribute spaces. Given a few labeled instances, it can solve unseen tasks whose attribute spaces differ from those of the training tasks.

Slide 5

Proposed Method: Dataset

Training phase

$\{\mathcal{D}_d\}_{d=1}^{D}$: datasets of multiple tasks with heterogeneous attribute spaces
• $\mathcal{D}_d = \{(\mathbf{x}_{dn}, \mathbf{y}_{dn})\}_{n=1}^{N_d}$: the set of pairs of observed attribute vectors $\mathbf{x}_{dn} \in \mathbb{R}^{I_d}$ and response vectors $\mathbf{y}_{dn} \in \mathbb{R}^{J_d}$ of the $n$th instance in task $d$
• $N_d$ is the number of instances, $I_d$ the number of attributes, and $J_d$ the number of responses
• The numbers of instances, attributes, and responses can differ across tasks: $N_d \neq N_{d'}$, $I_d \neq I_{d'}$, and $J_d \neq J_{d'}$

Slide 6

Proposed Method: Dataset

Test phase

$\mathcal{D}_{d^*} = \{(\mathbf{x}_{d^*n}, \mathbf{y}_{d^*n})\}_{n=1}^{N_{d^*}}$: dataset of a target task (support set)
• $N_{d^*}$: the number of instances, which is small
• The target task is not contained in the given training tasks: $d^* \notin \{1, \dots, D\}$

We want to predict the response $\mathbf{y}_{d^*n}$ for an observed attribute vector $\mathbf{x}_{d^*n}$ (query) in the target task.

Slide 7

Proposed Method: Model

Our method is composed of two networks:

Inference network: infers latent representations of each attribute and each response from a few labeled instances

Prediction network: predicts the responses of unlabeled instances using the inferred representations

Slide 8

Proposed Method: Model

$S = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$: support set in a task
• $\mathbf{x}_n = (x_{ni})_{i=1}^{I}$: $I$-dimensional observed attribute vector
• $\mathbf{y}_n = (y_{nj})_{j=1}^{J}$: $J$-dimensional observed response vector

Categorical values are represented using one-hot encoding.
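For illustration, a minimal PyTorch sketch of the one-hot encoding step (the category ids and sizes here are made up):

```python
import torch
import torch.nn.functional as F

# Hypothetical categorical attribute with three categories, stored as
# integer ids for N = 4 instances.
categories = torch.tensor([0, 2, 1, 0])
one_hot = F.one_hot(categories, num_classes=3).float()  # shape (4, 3)
print(one_hot)  # each row has a single 1 marking the category
```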

Slide 9

Proposed Method: Model

$V = \{\mathbf{v}_i\}_{i=1}^{I}$: latent attribute vectors
• $\mathbf{v}_i$: the representation of the $i$th attribute

$C = \{\mathbf{c}_j\}_{j=1}^{J}$: latent response vectors
• $\mathbf{c}_j$: the representation of the $j$th response

$\mathbf{z}$: latent instance vector

Slide 10

Inference Network

Input: support set $S$ of a task
Output: latent attribute vectors $V$ and latent response vectors $C$

Slide 11

Inference Network

First, we calculate the initial attribute representation $\bar{\mathbf{v}}_i$ and the initial response representation $\bar{\mathbf{c}}_j$
• $\bar{f}_v$, $\bar{g}_v$, $\bar{f}_c$, $\bar{g}_c$: feed-forward neural networks

Initial attribute / response representations $\bar{\mathbf{v}}_i$ / $\bar{\mathbf{c}}_j$:
$$\bar{\mathbf{v}}_i = \bar{g}_v\Big(\frac{1}{N}\sum_{n=1}^{N} \bar{f}_v(x_{ni})\Big), \qquad \bar{\mathbf{c}}_j = \bar{g}_c\Big(\frac{1}{N}\sum_{n=1}^{N} \bar{f}_c(y_{nj})\Big) \tag{1}$$
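A minimal PyTorch sketch of Eq. (1), assuming each scalar value $x_{ni}$ / $y_{nj}$ enters $\bar{f}_v$ / $\bar{f}_c$ as a one-dimensional input; the toy support-set shapes and network depths are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

H = 32                          # latent size (the slides use 32 units)
N, I, J = 5, 3, 2               # toy support set: 5 instances, 3 attributes, 2 responses
X, Y = torch.randn(N, I), torch.randn(N, J)

# f_bar / g_bar networks; the depth here is illustrative.
f_bar_v = nn.Sequential(nn.Linear(1, H), nn.ReLU(), nn.Linear(H, H))
g_bar_v = nn.Sequential(nn.Linear(H, H), nn.ReLU(), nn.Linear(H, H))
f_bar_c = nn.Sequential(nn.Linear(1, H), nn.ReLU(), nn.Linear(H, H))
g_bar_c = nn.Sequential(nn.Linear(H, H), nn.ReLU(), nn.Linear(H, H))

# Eq. (1): average over the N support instances, per attribute and per response.
v_bar = g_bar_v(f_bar_v(X.T.unsqueeze(-1)).mean(dim=1))  # shape (I, H)
c_bar = g_bar_c(f_bar_c(Y.T.unsqueeze(-1)).mean(dim=1))  # shape (J, H)
```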

Slide 12

Inference Network

The initial attribute representations $\bar{\mathbf{v}}_i$ and response representations $\bar{\mathbf{c}}_j$ do not contain information about the relationships with the other attributes and responses.
⇒ By concatenating the representations and their values, we incorporate information on each attribute and response with a permutation-invariant neural network.

Slide 13

Inference Network

Next, we calculate the representation $\mathbf{u}_n$ of the $n$th instance
• $f_u$, $g_u$: feed-forward neural networks

Representation of the $n$th instance $\mathbf{u}_n$:
$$\mathbf{u}_n = g_u\Big(\frac{1}{I}\sum_{i=1}^{I} f_u([\bar{\mathbf{v}}_i, x_{ni}]) + \frac{1}{J}\sum_{j=1}^{J} f_u([\bar{\mathbf{c}}_j, y_{nj}])\Big) \tag{2}$$
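A sketch of Eq. (2) under the same toy shapes, with random stand-ins for the Eq. (1) outputs; note that the same $f_u$ is applied to both attribute pairs $[\bar{\mathbf{v}}_i, x_{ni}]$ and response pairs $[\bar{\mathbf{c}}_j, y_{nj}]$:

```python
import torch
import torch.nn as nn

H = 32
N, I, J = 5, 3, 2
X, Y = torch.randn(N, I), torch.randn(N, J)
v_bar, c_bar = torch.randn(I, H), torch.randn(J, H)   # stand-ins for Eq. (1) outputs

f_u = nn.Sequential(nn.Linear(H + 1, H), nn.ReLU(), nn.Linear(H, H))
g_u = nn.Sequential(nn.Linear(H, H), nn.ReLU(), nn.Linear(H, H))

# Concatenate each initial representation with its observed value:
# att[n, i] = [v_bar_i, x_ni] and res[n, j] = [c_bar_j, y_nj].
att = torch.cat([v_bar.expand(N, I, H), X.unsqueeze(-1)], dim=-1)
res = torch.cat([c_bar.expand(N, J, H), Y.unsqueeze(-1)], dim=-1)

# Eq. (2): average over attributes and over responses, add, then apply g_u.
u = g_u(f_u(att).mean(dim=1) + f_u(res).mean(dim=1))  # shape (N, H)
```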

Slide 14

Inference Network

Finally, we calculate the attribute representation $\mathbf{v}_i$ and the response representation $\mathbf{c}_j$
• $f_v$, $g_v$, $f_c$, $g_c$: feed-forward neural networks

Attribute / response representations $\mathbf{v}_i$ / $\mathbf{c}_j$:
$$\mathbf{v}_i = g_v\Big(\frac{1}{N}\sum_{n=1}^{N} f_v([\mathbf{u}_n, x_{ni}])\Big), \qquad \mathbf{c}_j = g_c\Big(\frac{1}{N}\sum_{n=1}^{N} f_c([\mathbf{u}_n, y_{nj}])\Big) \tag{3}$$
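A sketch of Eq. (3), again with random stand-ins for the earlier outputs; the averaging now runs over the $N$ support instances rather than over attributes:

```python
import torch
import torch.nn as nn

H = 32
N, I = 5, 3
X = torch.randn(N, I)
u = torch.randn(N, H)                                  # stand-in for Eq. (2) outputs

f_v = nn.Sequential(nn.Linear(H + 1, H), nn.ReLU(), nn.Linear(H, H))
g_v = nn.Sequential(nn.Linear(H, H), nn.ReLU(), nn.Linear(H, H))

# Eq. (3): pair each instance representation u_n with the value x_ni,
# then average over the N support instances for each attribute i.
pairs = torch.cat([u.unsqueeze(1).expand(N, I, H), X.unsqueeze(-1)], dim=-1)
v = g_v(f_v(pairs).mean(dim=0))                        # shape (I, H)
# c_j is obtained the same way from [u_n, y_nj] with f_c and g_c.
```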

Slide 15

Prediction Network

Input: query $\mathbf{x}$, latent attribute vectors $V$, and latent response vectors $C$
Output: the predicted response $\hat{\mathbf{y}}$

Slide 16

Prediction Network

We obtain the latent instance vector $\mathbf{z}$ given an observed attribute vector $\mathbf{x} = (x_i)_{i=1}^{I}$ and the latent attribute vectors $V$
• $f_z$, $g_z$: feed-forward neural networks

Latent instance vector $\mathbf{z}$:
$$\mathbf{z} = g_z\Big(\frac{1}{I}\sum_{i=1}^{I} f_z([\mathbf{v}_i, x_i])\Big) \tag{4}$$
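A sketch of Eq. (4); the latent attribute vectors are random stand-ins for the inference network's output:

```python
import torch
import torch.nn as nn

H, I = 32, 3
x = torch.randn(I)            # query attribute vector
v = torch.randn(I, H)         # stand-in for the inferred latent attribute vectors

f_z = nn.Sequential(nn.Linear(H + 1, H), nn.ReLU(), nn.Linear(H, H))
g_z = nn.Sequential(nn.Linear(H, H), nn.ReLU(), nn.Linear(H, H))

# Eq. (4): average f_z([v_i, x_i]) over the query's I attributes.
z = g_z(f_z(torch.cat([v, x.unsqueeze(-1)], dim=-1)).mean(dim=0))  # shape (H,)
```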

Slide 17

Prediction Network

We predict the response $\hat{\mathbf{y}}$ for query $\mathbf{x}$ with the latent instance vector $\mathbf{z}$ and the latent response vectors $C$
• $f_y$: feed-forward neural network
• $\Phi$: parameters

Predicted response $\hat{\mathbf{y}}$:
$$\hat{y}_j(\mathbf{x}, S; \Phi) = f_y([\mathbf{c}_j, \mathbf{z}]) \tag{5}$$

The prediction depends on the support set $S$ and on the parameters $\Phi$ of all the neural networks above.
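A sketch of Eq. (5); since the input of $f_y$ is the concatenation $[\mathbf{c}_j, \mathbf{z}]$, its input width is twice the latent size here:

```python
import torch
import torch.nn as nn

H, J = 32, 2
z = torch.randn(H)            # stand-in for the latent instance vector from Eq. (4)
c = torch.randn(J, H)         # stand-in for the latent response vectors

f_y = nn.Sequential(nn.Linear(2 * H, H), nn.ReLU(), nn.Linear(H, 1))

# Eq. (5): one scalar prediction per response dimension j from [c_j, z].
y_hat = f_y(torch.cat([c, z.expand(J, H)], dim=-1)).squeeze(-1)  # shape (J,)
```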

Slide 18

Training

We estimate the neural network parameters $\Phi$ by minimizing the loss
• Support set $S$ and query set $Q$ are generated from the training datasets $\{\mathcal{D}_d\}_{d=1}^{D}$
• $N_Q$, $J_Q$: the numbers of instances / responses in query set $Q$

Estimated parameters $\hat{\Phi}$:
$$\hat{\Phi} = \arg\min_{\Phi} \, \mathbb{E}_{\mathcal{D}_d}\big[\mathbb{E}_{(S,Q)\sim\mathcal{D}_d}\big[E(Q \mid S; \Phi)\big]\big] \tag{6}$$
where
$$E(Q \mid S; \Phi) = \frac{1}{N_Q J_Q} \sum_{(\mathbf{x},\mathbf{y}) \in Q} \sum_{j=1}^{J_Q} \big(y_j - \hat{y}_j(\mathbf{x}, S; \Phi)\big)^2 \tag{7}$$
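A minimal episodic-training sketch of Eqs. (6)-(7); `model` is only a placeholder for the full inference + prediction pipeline sketched above (the real prediction is conditioned on the support set), and `sample_episode` is a hypothetical helper that draws a task and splits it into support and query sets:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # placeholder for the full network

def sample_episode():
    """Hypothetical helper: draw a training task and split it into
    a support set (S_x, S_y) and a query set (Q_x, Q_y)."""
    return torch.randn(5, 1), torch.randn(5, 1), torch.randn(27, 1), torch.randn(27, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # settings from slide 24

for step in range(100):
    S_x, S_y, Q_x, Q_y = sample_episode()
    y_hat = model(Q_x)                    # real model: y_hat = model(S_x, S_y, Q_x)
    loss = ((Q_y - y_hat) ** 2).mean()    # Eq. (7): squared error over the query set
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```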

Slide 19

Training

Algorithm: [the slide shows pseudocode of the episodic training procedure]

Slide 20

Experiments: Synthetic Data

We first evaluated the proposed method on simple synthetic regression tasks with one- or two-dimensional attribute spaces and one-dimensional response spaces:
• One third of the tasks were generated from a one-dimensional linear model $y = w_d x$
• One third of the tasks were generated from a one-dimensional sine curve $y = \sin(x + 3 w_d)$
• The remaining tasks were generated from the two-dimensional model $y = w_{d1} x_1 + \sin(x_2 + 3 w_{d2})$

[Figure: sampled tasks from each generator; panels "$y = w_d x$", "$y = \sin(x + 3 w_d)$", "$y = w_{d1} x_1 + \sin(x_2 + 3 w_{d2})$", and "All Tasks"]

Slide 21

Experiments: Synthetic Data

Attributes & parameters
• Attributes $x$, $x_1$, $x_2$: generated uniformly at random from $[-3, 3]$
• Task-specific model parameters $w_d$, $w_{d1}$, $w_{d2}$: generated uniformly at random from $[-1, 1]$
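A NumPy sketch of these task generators; the helper name is hypothetical, and drawing the generator uniformly with `rng.integers(3)` is my reading of "one third each":

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic_task(n):
    """Hypothetical helper: draw one task from the three generators above.
    Returns attributes X of shape (n, I) and responses y of shape (n,)."""
    kind = rng.integers(3)
    if kind == 0:                                   # y = w_d x
        w = rng.uniform(-1, 1)
        X = rng.uniform(-3, 3, size=(n, 1))
        y = w * X[:, 0]
    elif kind == 1:                                 # y = sin(x + 3 w_d)
        w = rng.uniform(-1, 1)
        X = rng.uniform(-3, 3, size=(n, 1))
        y = np.sin(X[:, 0] + 3 * w)
    else:                                           # y = w_d1 x_1 + sin(x_2 + 3 w_d2)
        w1, w2 = rng.uniform(-1, 1, size=2)
        X = rng.uniform(-3, 3, size=(n, 2))
        y = w1 * X[:, 0] + np.sin(X[:, 1] + 3 * w2)
    return X, y
```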

Slide 22

Experiments: Synthetic Data

Tasks
• Training tasks: 10,000
• Validation tasks: 30
• Target tasks: 300

Instances
• Number of support instances $N_S$: 5
• Number of query instances $N_Q$: 27

Slide 23

Experiments: Proposed Method Settings

Networks
• $(\bar{f}_v, \bar{f}_c)$, $(\bar{g}_v, \bar{g}_c)$, $(f_v, f_c)$, $(g_v, g_c)$: three-layered feed-forward neural networks, all with 32 hidden units
• $f_y$: three-layered feed-forward neural network with 1 unit in the output layer and 32 units in the hidden layers
• Activation function: $\mathrm{ReLU}(x) = \max(0, x)$
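A sketch of one such three-layered feed-forward network; `make_mlp` and the input width chosen for `f_y` are illustrative, since the input dimension depends on where each net is used:

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim=32, hidden=32):
    """Three linear layers with ReLU activations and 32 hidden units,
    following the slide's description."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

f_y = make_mlp(in_dim=64, out_dim=1)  # f_y: 1 output unit; input is [c_j, z]
```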

Slide 24

Experiments: Proposed Method Settings

Other settings
• Optimizer: Adam with learning rate $10^{-3}$
• Dropout rate: 0.1
• Batch size: $B = 256$

Slide 25

Results

Predictions by the proposed method on six target tasks:
• Two-dimensional linear relationships (a), (b)
• Two-dimensional nonlinear relationships (c), (d)
• Three-dimensional relationships with a single model

Red circles: the five target support instances
Blue crosses: true target query instances
Green crosses: target query instances predicted by the proposed method

Slide 26

Results

t-SNE visualization of the latent attribute vectors $\mathbf{v}_{di}$ for target support sets in the synthetic datasets
• Red: $x$ in $y = w_d x$
• Green: $x$ in $y = \sin(x + 3 w_d)$
• Blue and magenta: $x_1$ and $x_2$ in $y = w_{d1} x_1 + \sin(x_2 + 3 w_{d2})$

Latent attribute vectors with the same attribute property were located close to each other.

Slide 27

Experiments: OpenML

Data
• OpenML: an open online platform for machine learning
• Instances per task: 10–300
• Attributes per task: 2–30

Tasks
• Training tasks: 37
• Validation tasks: 5
• Target tasks: 17
• Total tasks: 59

Slide 28

Experiments: OpenML

Instances
• Number of support instances $N_S$: 3
• Number of query instances $N_Q$: 29

Other
• Batch size $B = 37$
• Other settings are the same as for the simple synthetic regression tasks

Slide 29

Results

Compared methods:
• DS (deep set)
• FT (fine-tuning)
• MAML (model-agnostic meta-learning)
• NP (conditional neural process)
• Ridge (linear regression with L2 regularization)
• Lasso (linear regression with L1 regularization)
• BR (Bayesian ridge regression)
• KR (kernel ridge regression with a linear kernel)
• GP (Gaussian process regression with an RBF kernel)
• NN (neural network)

Slide 30

Results

The mean squared error was averaged over 30 experiments with different training, validation, and target splits. The proposed method achieved the lowest error among the compared meta-learning and regression methods.

Slide 31

Results

Left: averaged mean squared errors when changing the number of instances in the support set at test time
Right: averaged mean squared errors with different numbers of training tasks

Slide 32

Results

Training computational time in hours: the training time of the proposed method was shorter than that of MAML, since the proposed method does not require iterative gradient-descent steps to adapt to a support set. In the test phase, the proposed method efficiently predicted responses without optimization, by feeding the support and query sets into the trained neural networks.

Slide 33

Conclusion

We proposed a neural-network-based meta-learning method that learns from multiple tasks with different attribute spaces and predicts responses given a few instances in unseen tasks. In experiments with synthetic datasets and 59 OpenML datasets, we demonstrated that the proposed method, after being trained on tasks with heterogeneous attribute spaces, can predict responses in new tasks given only a few labeled instances.

Slide 34

Future Work

1. Improve the efficiency of the training procedure
2. Investigate different types of neural networks that handle variable-length inputs, such as attention mechanisms, for inferring latent attribute and response vectors
3. Use prior knowledge about attributes, such as correspondence information across tasks and descriptions of attributes