Slide 43
Evaluating the transferability of pretrained models
Example: Taskonomy [Zamir+ '18], the pipeline from task-specific training to drawing edges between tasks
[Figure 2 of [Zamir+ '18], shown on the slide: (I) Task-specific Modeling (supervised networks per task), (II) Transfer Modeling (frozen encoders feeding 1st-, 2nd-, and 3rd-order transfer functions among tasks such as 2D/2.5D segmentation, 3D keypoints, normals, reshading, and layout), (III) Task Affinity Normalization (AHP task affinities), and (IV) Compute Taxonomy (Binary Integer Program), laid out over the input space, output space, and task (representation) space of the full task dictionary.]
Figure 2: Computational modeling of task relations and creating the taxonomy. From left to right: I. Train task-specific networks. II. Train (first order and higher) transfer functions among tasks in a latent space. III. Get normalized transfer affinities using AHP (Analytic Hierarchy Process). IV. Find global transfer taxonomy using BIP (Binary Integer Program).
T − T ∩ S are the tasks that we want solved but cannot train ("target-only"), T ∩ S are the tasks that we want solved but could play as source too, and S − T ∩ S are the "source-only" tasks which we may not directly care about solving (e.g., jigsaw puzzle) but can be optionally used if they increase the performance on T.
The task taxonomy (taskonomy) is a computationally found directed hypergraph that captures the notion of task transferability over any given task dictionary. An edge between a group of source tasks and a target task represents a feasible transfer case, and its weight is the prediction of its performance.
Slide annotation: (I) is simple supervised learning; (II) transfers over every combination of tasks (including the multi-source case).
We can programmatically compute the ground truth for many tasks without human labeling. For the tasks that still require labels (e.g., scene classes), we generate them using Knowledge Distillation [41] from known methods [101, 55, 54, 75]. See the supplementary material for full details of the process and a user study on the final quality of labels generated using Knowledge Distillation (showing < 7% error).
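As an illustration of this label-generation step, the sketch below distills soft predictions from a pretrained teacher (e.g., a published scene classifier) into pseudo-labels for new images. This is a minimal sketch of standard knowledge distillation; the temperature value and helper names are illustrative assumptions, not details from the paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label distillation: KL divergence between temperature-softened
    teacher and student distributions. T=4.0 is a hypothetical choice."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)

@torch.no_grad()
def pseudo_label(teacher, images):
    """Generate soft pseudo-labels (e.g., scene classes) for unlabeled images."""
    teacher.eval()
    return F.softmax(teacher(images), dim=1)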
3.1. Step I: Task-Specific Modeling
We train a fully supervised task-specific network for each task in S. Task-specific networks have an encoder-decoder architecture homogeneous across all tasks, where the encoder is large enough to extract powerful representations, and the decoder is large enough to achieve a good performance but is much smaller than the encoder.
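A minimal sketch of such a homogeneous encoder-decoder pair is shown below, assuming a ResNet-50 convolutional encoder and a small upsampling decoder; the channel counts and layer choices are placeholders, not the exact architecture used in the paper.

import torch.nn as nn
from torchvision.models import resnet50

class TaskSpecificNet(nn.Module):
    """Large shared-shape encoder plus a much smaller per-task decoder."""
    def __init__(self, out_channels):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep the convolutional trunk (2048 x 8 x 8 for a 256 x 256 input);
        # drop the average pooling and classification head.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.decoder = nn.Sequential(  # small decoder: a few upsampling convs
            nn.Conv2d(2048, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, out_channels, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))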
3.2. Step II: Transfer Modeling
Given a source task s and a target task t, where s ∈ S and t ∈ T, a transfer network learns a small readout function for t given a statistic computed for s (see Fig. 4). The statistic is the representation for image I from the encoder of s: E_s(I). The readout function D_{s→t} is parameterized by θ_{s→t} minimizing the loss L_t:

$$ D_{s \to t} := \arg\min_{\theta} \; \mathbb{E}_{I \in \mathcal{D}} \, L_t\!\left( D_{\theta}\!\left( E_s(I) \right),\, f_t(I) \right), \tag{1} $$
where f_t(I) is the ground truth of t for image I. E_s(I) may or may not be sufficient for solving t, depending on the relation between t and s (examples in Fig. 5). Thus, the performance of D_{s→t} is a useful metric of task affinity. We train transfer functions for all feasible source-target combinations.
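The sketch below trains one such readout function with the source encoder frozen, following Eq. (1); the optimizer, learning rate, and epoch count are placeholders, and `loss_fn` stands in for the task-appropriate loss L_t.

import torch

def train_transfer(source_encoder, readout, loss_fn, loader, epochs=3, lr=1e-4):
    """Train D_{s->t} on frozen source representations E_s(I), per Eq. (1).
    `loader` yields (image, f_t(image)) pairs; hyperparameters are illustrative."""
    source_encoder.eval()
    for p in source_encoder.parameters():   # freeze E_s
        p.requires_grad_(False)
    opt = torch.optim.Adam(readout.parameters(), lr=lr)
    for _ in range(epochs):
        for images, targets in loader:
            with torch.no_grad():
                reps = source_encoder(images)      # E_s(I)
            pred = readout(reps)                   # D_theta(E_s(I))
            loss = loss_fn(pred, targets)          # L_t(D_theta(E_s(I)), f_t(I))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return readout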
Accessibility: For a transfer to be successful, the latent representation of the source should both be inclusive of sufficient information for solving the target and have the information accessible, i.e., easily extractable (otherwise, the raw image or its compression-based representations would be the optimal source). Multiple representations can also contain complementary information for solving a target task (see examples in Fig. 6), so we include higher-order transfers, which are the same as first order but receive multiple representations in the input. Thus, our transfers are functions D : ℘(S) → T, where ℘ is the powerset operator.
As there is a combinatorial explosion in the number of feasible higher-order transfers ($|T| \times \binom{|S|}{k}$ for k-th order), we employ a sampling procedure with the goal of filtering out higher-order transfers that are less likely to yield good results, without training them. We use a beam search: for transfers of order k ≤ 5 to a target, we select its 5 best sources (according to 1st-order performances) and include all of their order-k combinations. For k ≥ 5, we use a beam of size 1 and compute the transfer from the top k sources. We also tested transitive transfers (s → t1 → t2), which showed they do not improve the results and thus were not included in our model (results in supplementary material).
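A minimal sketch of this beam-search filtering, assuming a dictionary `first_order_perf[(source, target)]` of already-comparable 1st-order transfer scores (higher is better); the data structure and the `max_order` cutoff are assumptions for illustration.

from itertools import combinations

def candidate_higher_order_sources(target, sources, first_order_perf,
                                   beam=5, max_small_order=5, max_order=8):
    """Enumerate which higher-order source sets to actually train:
    for order k <= max_small_order, all k-subsets of the `beam` best 1st-order
    sources; for larger k, a beam of size 1, i.e. the single top-k source set."""
    ranked = sorted(sources, key=lambda s: first_order_perf[(s, target)], reverse=True)
    candidates = []
    for k in range(2, max_small_order + 1):
        candidates.extend(combinations(ranked[:beam], k))
    for k in range(max_small_order + 1, max_order + 1):
        candidates.append(tuple(ranked[:k]))
    return candidates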
3.3. Step III: Ordinal Normalization using Analytic Hierarchy Process (AHP)
We want to have an affinity matrix of transferabilities across tasks. Aggregating the raw losses/evaluations L_{s→t} from transfer functions into a matrix is obviously problematic, as they have vastly different scales and live in different spaces (see Fig. 7, left). Hence, a proper normalization is needed. A naive solution would be to linearly rescale each row of the matrix to the range [0, 1]. This approach fails when the actual output quality increases at different speeds w.r.t. the loss. As the loss-quality curve is generally unknown, such approaches to normalization are ineffective.
Instead, we use an ordinal approach in which the output quality and loss are only assumed to change monotonically. For each t, we construct W_t, a pairwise tournament matrix between all feasible sources for transferring to t. The element at (i, j) is the percentage of images in a held-out test set, D_test, on which s_i transferred to t better than s_j did; the resulting ratio w_{i,j} is given in Eq. (2) below.
[Figure 4 of [Zamir+ '18]: a frozen source-task encoder E_s (e.g., curvature) produces the representation E_s(I), which a small transfer function D_{s→t} maps to the target-task output (e.g., surface normals); 2nd- and 3rd-order variants receive multiple representations.]
Figure 4: Transfer Function. A small readout function is trained to map the frozen source representation to the target task's output.
[Figure 5 of [Zamir+ '18]: qualitative transfer results to surface normal estimation and 2.5D segmentation, comparing the input image, ground truth, the fully supervised task-specific network, and transfers from sources such as layout and reshading.]
Figure 5: Transfer results to normals and 2.5D segmentation.
Slide annotation: the affinity matrix is built from the win rates of s_i vs. s_j (comparing how much each task improved the target through transfer).
[Figure 6 of [Zamir+ '18]: examples of 2nd-order transfers, e.g. {3D Keypoints + Surface Normals} and {Occlusion Edges + Curvature} used jointly as sources.]
Figure 6: Higher-Order Transfers. Representations can contain complementary information. E.g., by transferring simultaneously from 3D Edges and Curvature, individual stairs were brought out. See our publicly available interactive transfer visualization page for more examples.
$$ w_{i,j} = \frac{\mathbb{E}_{I \in D_{\mathrm{test}}}\!\left[ D_{s_i \to t}(I) > D_{s_j \to t}(I) \right]}{\mathbb{E}_{I \in D_{\mathrm{test}}}\!\left[ D_{s_i \to t}(I) < D_{s_j \to t}(I) \right]}. \tag{2} $$
We quantify the final transferability of s_i to t as the corresponding (i-th) component of the principal eigenvector of W_t (normalized to sum to 1). The elements of the principal eigenvector are a measure of centrality, and are proportional to the amount of time that an infinite-length random walk on W_t will spend at any given source [59]. We stack the principal eigenvectors of W_t for all t ∈ T to get an affinity matrix P ('p' for performance); see Fig. 7, right.
This approach is derived from Analytic Hierarchy Process [76], a method widely used in operations research to create a total order based on multiple pairwise comparisons.
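A minimal numpy sketch of this normalization for one target t, assuming per-image transfer losses `losses[s]` (lower is better) for every candidate source s; clipping degenerate win rates is an assumption to keep the ratio in Eq. (2) finite.

import numpy as np

def ahp_affinities(losses):
    """losses: dict {source_name: 1-D array of per-image test losses for target t}.
    Returns (sources, p_t), where p_t is the principal eigenvector of the pairwise
    tournament matrix W_t, normalized to sum to 1 (the AHP transferabilities)."""
    sources = sorted(losses)
    n = len(sources)
    W = np.ones((n, n))
    for i, si in enumerate(sources):
        for j, sj in enumerate(sources):
            if i == j:
                continue
            wins_i = np.mean(losses[si] < losses[sj])   # s_i transfers better than s_j
            wins_j = np.mean(losses[si] > losses[sj])   # s_j transfers better than s_i
            # Clip to avoid division by zero when one source always wins (assumption).
            W[i, j] = np.clip(wins_i, 1e-3, None) / np.clip(wins_j, 1e-3, None)
    eigvals, eigvecs = np.linalg.eig(W)
    v = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)  # principal eigenvector
    return sources, v / v.sum()

Stacking the returned vectors over all targets t ∈ T gives the affinity matrix P used in Step IV.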
3.4. Step IV: Computing the Global Taxonomy
[Figure 7 of [Zamir+ '18]: two heatmaps whose rows and columns span the full task dictionary (autoencoding, scene and object classification, curvature, denoising, 2D and occlusion edges, 2D/3D keypoints, reshading, z-depth, distance, normals, egomotion, camera poses (fix/nonfix), vanishing points, 2D/2.5D/semantic segmentation, layout, matching, jigsaw, in-painting, colorization, random projection, task-specific).]
Figure 7: First-order task affinity matrix before (left) and after (right) Analytic Hierarchy Process (AHP) normalization. Lower means better transferred. For visualization, we use the standard affinity-distance method dist = e^{−β·P} (where β = 20 and e is element-wise matrix exponentiation). See supplementary material for the full matrix with higher-order transfers.
r_i specifies the relative importance of each target task and ℓ_i specifies the relative cost of acquiring labels for each task. The BIP is parameterized by a vector x where each transfer and each task is represented by a binary variable indicating which nodes are picked to be sources and which transfers are selected. The canonical form for a BIP is:

$$ \text{maximize } c^{\top} x, \quad \text{subject to } A x \preceq b \ \text{ and } \ x \in \{0,1\}^{|E|+|V|}. $$

Each element c_i for a transfer is the product of the importance of its target task and its transfer performance:

$$ c_i := r_{\mathrm{target}(i)} \cdot p_i. $$

Hence, the collective performance on all targets is the summation of their individual AHP performances, p_i, weighted by the user-specified importances, r_i.
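A minimal sketch of solving such a binary integer program with scipy's generic MILP solver (not the solver used in the paper); the toy objective vector, constraint row, and supervision budget below are made-up placeholders, and the full taxonomy BIP would add further constraints (e.g., every target covered by exactly one selected transfer, and every selected transfer's sources selected).

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Toy instance: 4 candidate transfers + 3 candidate source tasks -> 7 binary variables.
# c holds r_target(i) * p_i for transfer variables and 0 for source-task variables.
c = np.array([0.9, 0.7, 0.4, 0.3, 0.0, 0.0, 0.0])

# One illustrative constraint: number of selected source tasks <= supervision budget.
budget = 2
A = np.array([[0, 0, 0, 0, 1, 1, 1]])
b_upper = np.array([budget])

res = milp(
    c=-c,                                  # milp minimizes, so negate to maximize c^T x
    constraints=LinearConstraint(A, ub=b_upper),
    integrality=np.ones_like(c),           # all variables integer
    bounds=Bounds(0, 1),                   # ... and restricted to {0, 1}
)
assert res.success
print("selected transfers / sources:", res.x.round().astype(int))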
Slide annotation: c_i combines the task importance r_target(i) and the AHP score p_i; together with three constraints, the integer program is solved to draw the edges of the taxonomy.
Fig: [Zamir+ ‘18] Figure 2
Fig: [Zamir+ ‘18] Figure 4
Matsui (Nagoya Univ.), Fundamentals of Transfer Learning, Transfer Learning Methods, 37 / 54