
Trends on Distribution Shift in NeurIPS2022

Hiroki Naganuma
February 17, 2023


This talk focuses on the problem of distribution shift in deep learning. In the standard learning framework, the training and test data are assumed to be i.i.d., so minimizing the empirical loss is expected to also minimize the expected loss (Empirical Risk Minimization). In the real world, however, the test distribution commonly differs from the training distribution, and such changes in the environment between training and inference violate this assumption of supervised learning. Addressing distribution shift is therefore essential for practical applications of deep learning. The first half of the talk gives an overview of the field and its trends rather than the details of individual papers; the second half briefly introduces a few selected papers.


Transcript

  1. 1 Hiroki Naganuma 1,2 / 1 Université de Montréal, 2 Mila / [email protected]
     Out-of-Distribution Generalization, Calibration, Uncertainty and Optimization - Survey of these topics from the bird's-eye view - / 【RICOS×ZOZO】NeurIPS2022 Paper Reading Session
     February 17th, 2023, Trends on Distribution Shift in NeurIPS2022 @Zoom / YouTube Live
  2. About Me / Hiroki Naganuma (長沼 大樹)
     Ph.D. student in Computer Science @Université de Montréal, Mila - Quebec Artificial Intelligence Institute
     External Researcher @ZOZO Research
     Research Interests
     • Out-of-Distribution Generalization
     • Uncertainty and Calibration
     • Optimal Transport for Domain Generalization
     • Model Selection Criteria (Information Criteria)
     • Optimization Algorithms
     • Optimization Methods for GANs (Generative Adversarial Networks)
     • Implicit Bias and Optimization
  3. 3 Introduction and Motivation / Problem Setting: Training Error and Generalization Error
     Generalization error: what we actually want to minimize (≈ prediction performance), but it is inaccessible.
     Training error: what we minimize instead, assuming the training data follows the same distribution as the test data (i.i.d.).
     Introduction
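The formulas shown on this slide do not survive the transcript; a standard way to write the two quantities, consistent with the per-environment risk R_e(f) defined later in the deck (my reconstruction, not copied from the slides), would be:

```latex
% Generalization (expected) error: the quantity we care about, but inaccessible
R(f) = \mathbb{E}_{(X, Y) \sim \mathbb{P}}\big[\ell(f(X), Y)\big]

% Training (empirical) error: what ERM minimizes instead, assuming the n training
% samples are drawn i.i.d. from the same distribution \mathbb{P} as the test data
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
```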
  4. 4 Limitation of the IID Assumption / Introduction and Motivation
     Cow vs. Camel example: in the IID (majority) data, cows appear on grass and camels in the desert, so a classifier trained with ERM under the IID assumption can rely on the spurious background feature instead of the invariant animal features. Data from a different distribution (e.g., a cow in the desert) is then predicted as Camel, and a camel on grass is predicted as Cow.
     Introduction
  5. 5 Introduction and Motivation / Out-of-Distribution Generalization Example
     The ability to generalize to data outside the training (IID) distribution is called OOD generalization. Figure taken from [Robert Geirhos et al., 2019].
     Introduction
  6. 7 Introduction and Motivation / Difference between Domain Generalization and Domain Adaptation
     Domain Generalization: the target (test) domain is not observed during training. Domain Adaptation: data from the target domain (typically unlabeled) is available during training.
     Introduction
  7. 8 Introduction and Motivation / Out-of-Distribution Generalization
     $R_{\mathrm{OOD}}(f) = \max_{e \in \mathcal{E}_{\mathrm{all}}} R_e(f)$, where $R_e(f) := \mathbb{E}_{(X^e, Y^e) \sim \mathbb{P}^e}\big[\ell(f(X^e), Y^e)\big]$
     Assumption: $\mathcal{E}_{\mathrm{all}}$ is unknown.
     Problem: even if we have access to a prior distribution over potential test environments, there are two problems in minimizing the resulting objective.
     A) Calculation: 1. Can the prior distribution be explicitly expressed? 2. It is doubtful that the posterior can actually be integrated.
     B) Even if two predictors have the same worst-case performance, their performance cannot be compared on easier (non-worst-case) environments.
     Introduction
  8. 9 Introduction and Motivation / What Out-of-Distribution Generalization Aims For
     Assumption: $\mathcal{E}_{\mathrm{all}}$ is unknown.
     Aim 1: obtain strong guarantees on the OOD risk in the regime where the number of observed training environments/samples is not tending to infinity but is actually rather small.
     Aim 2: failing that, obtain at least good empirical OOD performance.
     Introduction
  9. 10 Introduction and Motivation / Hint for OOD Generalization (1/2)
     Empirical Risk Minimization: ERM relies on too strong an i.i.d. assumption.
     • Interested only in average-case performance
     • Test data is assumed to be drawn i.i.d. from the training distribution
     • Uses no knowledge of the structure of the data
     Robust Optimization: equivalent to a weighted average of the errors in each environment.
     • There may be a better algorithm, because robust optimization does not take the structure between the environment distributions into account
     • Robustness at training time does not in general imply robustness at test time → adversarial examples → cases like the cow and its background
     Introduction
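As a minimal illustration of the contrast between the two objectives on this slide (my own sketch, not code from the talk), the per-environment losses could be aggregated as follows, assuming `losses_per_env` holds one scalar loss tensor per training environment:

```python
import torch

def erm_objective(losses_per_env):
    # ERM: minimize the average risk over environments
    # (justified only if train and test data are i.i.d. from the same distribution)
    return torch.stack(losses_per_env).mean()

def robust_objective(losses_per_env):
    # Robust optimization: minimize the worst-case risk over the *training*
    # environments, a surrogate for max_{e in E_all} R_e(f); robustness on the
    # training environments need not transfer to unseen test environments
    return torch.stack(losses_per_env).max()
```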
  10. 11 Introduction and Motivation / Hint for OOD Generalization (2/2)
     Domain Adaptation assumption: we can extract some variable that is useful for prediction across environments and that does not change across environments.
     Idea: look for a feature representation Φ(X) such that the conditional distribution P(Y | Φ(X)) does not change across environments (invariant prediction).
     Introduction
  11. 12 Introduction and Motivation / How do we find an invariant prediction?
     Strategy: Invariant Risk Minimization. A predictor with the following characteristics is useful for OOD:
     • its prediction rule does not change across environments
     • its error is small across environments
     A concrete instantiation of this strategy is sketched below.
     Introduction
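The deck does not show a concrete objective for this strategy; one widely used instantiation is the IRMv1 penalty of Arjovsky et al. (2019), sketched here as an illustration (an assumed form for this talk, not taken from the slides):

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, targets):
    # IRMv1-style penalty: squared gradient of the per-environment risk with
    # respect to a fixed "dummy" classifier scale w = 1.0. A small gradient means
    # the shared predictor is already (locally) optimal in this environment,
    # i.e. the prediction rule is invariant there.
    scale = torch.ones(1, device=logits.device, requires_grad=True)
    risk = F.cross_entropy(logits * scale, targets)
    grad = torch.autograd.grad(risk, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

# Sketch of the full objective over training environments:
#   loss = mean_e R_e(f) + lambda * mean_e irmv1_penalty(logits_e, targets_e)
```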
  12. 13 Introduction and Motivation / Out-of-Distribution Generalization
     $R_{\mathrm{OOD}}(f) = \max_{e \in \mathcal{E}_{\mathrm{all}}} R_e(f)$, where $R_e(f) := \mathbb{E}_{(X^e, Y^e) \sim \mathbb{P}^e}\big[\ell(f(X^e), Y^e)\big]$
     Approaches compared: Empirical Risk Minimization / Robust Optimization / Invariant Risk Minimization (written out below).
     Introduction
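For reference, the three objectives juxtaposed on this slide can be written out as follows (a standard summary I am adding; the bilevel IRM form follows Arjovsky et al., 2019):

```latex
% Empirical Risk Minimization: average risk over the observed training environments
\min_{f} \; \frac{1}{|\mathcal{E}_{\mathrm{tr}}|} \sum_{e \in \mathcal{E}_{\mathrm{tr}}} R_e(f)

% Robust Optimization: worst-case risk over the observed training environments
\min_{f} \; \max_{e \in \mathcal{E}_{\mathrm{tr}}} R_e(f)

% Invariant Risk Minimization: find a representation \Phi such that the same
% classifier w is optimal in every training environment
\min_{\Phi, w} \; \sum_{e \in \mathcal{E}_{\mathrm{tr}}} R_e(w \circ \Phi)
\quad \text{s.t.} \quad w \in \arg\min_{\bar{w}} R_e(\bar{w} \circ \Phi) \;\; \forall e \in \mathcal{E}_{\mathrm{tr}}
```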
  13. 14 Trend: NeurIPS 2022 Workshop on Distribution Shifts
     Scope
     • Examples of real-world distribution shifts in various application areas
     • Methods for improving robustness to distribution shifts
     • Empirical and theoretical characterization of distribution shifts
     • Benchmarks and evaluations
     95 Accepted Papers
  14. 15 Trend: Distribution Shifts (Principles and Methods)
     How to generalize under distribution shift (covering both clarification of principles and proposed methods)
     • Shortcut Learning in Deep Neural Networks: shortcut learning, i.e. "cheating" by exploiting information that should not be used to solve the task, is observed in animals and is widespread in current ML/DL; as a result, models generalize in an unintended direction and cannot extrapolate outside the training distribution (o.o.d.).
     • Towards a Theoretical Framework of Out-of-Distribution Generalization: proves generalization error bounds for OOD.
     • An Information-theoretic Approach to Distribution Shifts: which features should be selected to generalize OOD? Argues that an information-theoretic approach works well for this question.
     • Fishr: Invariant Gradient Variances for Out-of-distribution Generalization: proposes a regularizer that enforces domain invariance in the space of loss gradients.
     • Predicting Unreliable Predictions by Shattering a Neural Network: models with fewer activation regions generalize more easily, and models with more abstracted knowledge generalize more easily.
     Connection between invariance and causality
  15. 16 Trend: Distribution Shifts (Datasets)
     Datasets
     • Noise or Signal: The Role of Image Backgrounds in Object Recognition: background dependence of image classifiers / the Background Challenge.
     • On the Impact of Spurious Correlation for Out-of-distribution Detection: presents a model of data shift that accounts for both invariant and spurious features.
     • WILDS: A Benchmark of in-the-Wild Distribution Shifts: a suite of OOD datasets covering images, language, and graphs.
     • In Search of Lost Domain Generalization: DomainBed: a suite of domain generalization datasets.
  16. 17 Trend: Distribution Shifts (Empirical Evaluation)
     Empirical Evaluation
     • OoD-Bench: Benchmarking and Understanding Out-of-Distribution Generalization Datasets and Algorithms: categorizes OOD datasets along the axes of diversity shift and correlation shift.
     • A Fine-Grained Analysis on Distribution Shift: evaluates the robustness of algorithms across multiple different distribution shifts.
     • Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data: splits shifts into three types and compares ID and OOD performance across architectures.
     • How does a neural network's architecture impact its robustness to noisy labels?: explores how network architecture affects robustness to noisy labels.
     • On Calibration and Out-of-domain Generalization: finds a relationship between OOD performance and model calibration.
  17. 18 Outline
     01 Introduction and Motivation / Trend
     02 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
     03 Diverse Weights Averaging for Out-of-Distribution Generalization
     04 Assaying Out-Of-Distribution Generalization in Transfer Learning
  18. Background: Inductive Bias of Optimization in Deep Learning
     • In typical deep learning problem settings there are many global (= local) minima
     • The generalization performance of each global minimum is different
     • Different training choices, such as hyperparameters and the optimizer, converge to different global minima
     Figure: [Matthew Hutson 2018], Figure: [Naresh Kumar 2019]
     20 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
  19. Background: Intuition about the Characteristics of Optimizers
     Figure: [Difan Zou et al.]
     • Different optimizers have different convergence rates and generalization performance
     • Some experiments imply that Adam memorizes the noise in the training data
     21 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
  20. 22 Related Work: Optimizer Comparison in the IID Setting
     With sufficient tuning, adaptive optimization methods outperform non-adaptive ones, but only slightly.
     Comprehensive experiments under the IID assumption. Figure: [Guodong Zhang et al. 2019]
  21. Out-of-Distribution Generalization Datasets (Computer Vision)
     DomainBed, Background Challenge. Figure: [Ishaan Gulrajani and David Lopez-Paz 2020], Figure: [Kai Xiao et al., 2020]
     23 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
  22. Out-of-Distribution Generalization Datasets (NLP)
     CivilComments, Amazon. Figure: [Pang Wei Koh, 2020]
     24 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
  23. Optimizers Subjected to Our Analysis
     25 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
  24. 35 Diverse Weights Averaging for Out-of-Distribution Generalization
     Theoretical Contribution: Bias-Variance Analysis in OOD (1/2)
     Bias and correlation shift: the bias term in OOD increases when the class posteriors of the train and test distributions mismatch. Assumption: large NNs.
     (Figure: bias/variance diagram with models θ*, θ̂, θ̌.)
  25. 36 Diverse Weights Averaging for Out-of-Distribution Generalization
     Theoretical Contribution: Bias-Variance Analysis in OOD (2/2)
     Variance and diversity shift: the variance term in OOD increases when the input marginals of the train and test distributions mismatch. Assumption: NNs with diagonally dominant NTK.
     (Figure: bias/variance diagram with models θ*, θ̂, θ̌.)
  26. 37 Diverse Weights Averaging for Out-of-Distribution Generalization
     Theoretical Contribution: Apply the Following Theory to OOD
     Bias-Variance-Covariance Decomposition for Ensembling (Prof. Naonori Ueda, NTT Labs, 1996), written out below.
     • Bias term: increasing the number of members M does not lower it, so correlation shift cannot be dealt with by averaging
     • Variance term: goes down as M goes up, so diversity shift can be handled
     • Covariance term: should be kept small (diverse members) for low generalization error
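The decomposition referenced on this slide is not reproduced in the transcript; written out from memory (so treat the exact form as my reconstruction rather than the deck's), it reads:

```latex
% Bias-variance-covariance decomposition of the squared error of an ensemble
% \bar{f}(x) = (1/M) \sum_{m=1}^{M} f_m(x) of M predictors:
\mathbb{E}\big[(\bar{f}(x) - y)^2\big]
  = \overline{\mathrm{bias}}^2
  + \frac{1}{M}\,\overline{\mathrm{var}}
  + \Big(1 - \frac{1}{M}\Big)\,\overline{\mathrm{cov}}
% The bias term is independent of M (averaging does not help under correlation shift),
% the variance term shrinks as 1/M (helps under diversity shift), and the covariance
% term must be kept small by making the averaged members diverse.
```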
  27. 41 Assaying Out-Of-Distribution Generalization in Transfer Learning
     Motivation
     • Different communities (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) study essentially the same question yet draw different conclusions
     • The authors evaluate all of them on the same exhaustive benchmark to obtain a unified understanding
     Overview
     • Large-scale experiments!
  28. 42 Assaying Out-Of-Distribution Generalization in Transfer Learning
     Background (ECE: Expected Calibration Error)
     Modern DNNs do well on ranking metrics (such as accuracy and AUC), but are known to be poor at uncertainty estimation (calibration, ECE). This hinders the use of DNNs in automated driving, medical image diagnostics, and recommendation systems.
     Figure taken from [Chuan Guo et al., 2017]
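ECE itself is not defined on the slide; a minimal NumPy sketch of the standard binned estimator (my own illustration, with hypothetical inputs: `confidences` holds the predicted max-softmax probabilities and `correct` is a 0/1 array of hits):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    # Standard binned ECE: weighted average over bins of |accuracy - mean confidence|.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()           # empirical accuracy in the bin
            avg_conf = confidences[in_bin].mean()  # average predicted confidence
            ece += in_bin.mean() * abs(acc - avg_conf)
    return ece
```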
  29. 43 Assaying Out-Of-Distribution Generalization in Transfer Learning
     Full Results
     The results presented hereafter are based on this correlation coefficient, aggregated over the horizontal axis, etc.
  30. 44 Assaying Out-Of-Distribution Generalization in Transfer Learning
     Experimental Results (results can significantly change for different shift types)
     Takeaway: ID and OOD accuracy only show a linear trend on specific tasks. We observe three additional settings: underspecification (vertical line), no generalization (horizontal line), and random generalization (large point cloud). We did not observe any trade-off between accuracy and robustness, where more accurate models would overfit to "spurious features" that do not generalize. Robustness methods have to be tested in many different settings. Currently, there seems to be no single method that is superior in all OOD settings.
  31. 45 Assaying Out-Of-Distribution Generalization in Transfer Learning
     Experimental Results (what are good proxies for measuring robustness to distribution shifts?)
     Takeaway: Accuracy is the strongest ID predictor of OOD robustness, and models that generalize well in distribution tend to also be more robust. Evaluating accuracy on additional held-out OOD data is an even stronger predictor.
  32. 46 Assaying Out-Of-Distribution Generalization in Transfer Learning
     Experimental Results (on the transfer of metrics from ID to OOD data)
     Takeaway: Among all metrics, adversarial robustness transfers best from ID to OOD data, which suggests that models respond similarly to adversarial attacks on ID and OOD data. Calibration transfers worst, which means that models that are well calibrated on ID data are not necessarily well calibrated on OOD data.
  33. 47 05 Acknowledgement
     Kartik Ahuja, Rio Yokota, Ioannis Mitliagkas, Kohta Ishikawa, Ikuro Sato, Tetsuya Motokawa, Shiro Takagi, Kilian Fatras, Masanari Kimura, Charles Guille-Escuret
  34. 49 Reference (1/5) • Estimating and Explaining Model Performance When

    Both Covariates and Labels Shift • Diverse Weights Averaging for Out-of-Distribution Generalization • Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization • Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts • Assaying Out-Of-Distribution Generalization in Transfer Learning • Improving Multi-Task Generalization via Regularizing Spurious Correlation • MEMO: Test Time Robustness via Adaptation and Augmentation • When are Local Queries Useful for Robust Learning? • The Missing Invariance Principle Found: the Reciprocal Twin of Invariant Risk Minimization • Adapting to Online Label Shift with Provable Guarantees • Hard ImageNet: Segmentations for Objects with Strong Spurious Cues • Invariance Learning based on Label Hierarchy • Hyperparameter Sensitivity in Deep Outlier Detection: Analysis and a Scalable Hyper-Ensemble Solution • Domain Generalization without Excess Empirical Risk • Representing Spatial Trajectories as Distributions • Multitasking Models are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve • SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG • Task Discovery: Finding the Tasks that Neural Networks Generalize on • Domain Adaptation under Open Set Label Shift • Explicit Tradeoffs between Adversarial and Natural Distributional Robustness • When does dough become a bagel? Analyzing the remaining mistakes on ImageNet • NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation • RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out-of-Distribution Robustness • Unsupervised Learning under Latent Label Shift Distribution Shift
  35. 50 Reference (2/5) • Towards Improving Calibration in Object Detection

    Under Domain Shift • Single Model Uncertainty Estimation via Stochastic Data Centering • Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors • On Uncertainty, Tempering, and Data Augmentation in Bayesian Classification • Joint Entropy Search for Maximally-Informed Bayesian Optimization • Scalable Sensitivity and Uncertainty Analysis for Causal-Effect Estimates of Continuous-Valued Interventions Calibration Generalization • Neuron with Steady Response Leads to Better Generalization • Chroma-VAE: Mitigating Shortcut Learning with Generative Classifiers • Redundant representations help generalization in wide neural networks • What is a Good Metric to Study Generalization of Minimax Learners? • Rethinking Generalization in Few-Shot Classification • Generalization for multiclass classification in overparameterized linear models • LISA: Learning Interpretable Skill Abstractions from Language • Geoclidean: Few-Shot Generalization in Euclidean Geometry
  36. 51 Reference (3/5) • On the Interpretability of Regularisation for

    Neural Networks Through Model Gradient Similarity • The Effects of Regularisation and Data Augmentation are Class Dependent • Tikhonov Regularization is Optimal Transport Robust under Martingale Constraints • Feature Learning in L2-regularized DNNs: Attraction/Repulsion and Sparsity Regularization Adversarial Robustness • Noise attention learning: enhancing noise robustness by gradient scaling • Why do artificially generated data help adversarial robustness? • On the Adversarial Robustness of Mixture of Experts Optimizer • Target-based Surrogates for Stochastic Optimization • Sharper Convergence Guarantees for Asynchronous SGD for Distributed Federated Learning • First-Order Algorithms for Min-Max Optimization in Geodesic Metric Spaces • Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization • Adam Can Converge Without Any Modification On Update Rules • Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization • A consistently adaptive trust-region method • Better SGD using Second-order Momentum
  37. 52 Reference (4/5) Theory • Memorization and Optimization in Deep

    Neural Networks with Minimum Over-parameterization • Effects of Data Geometry in Early Deep Learning • Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting • Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials • What Can Transformers Learn In-Context? A Case Study of Simple Function Classes • On Margin Maximization in Linear and ReLU Networks • A Combinatorial Perspective on the Optimization of Shallow ReLU Networks • Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers • On the Double Descent of Random Features Models Trained with SGD • Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization) • Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions • Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks • Chaotic Dynamics are Intrinsic to Neural Network Training with SGD • On the non-universality of deep learning: quantifying the cost of symmetry • High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation • Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit Robustness • Where do Models go Wrong? Parameter-Space Saliency Maps for Explainability • What Can Transformers Learn In-Context? A Case Study of Simple Function Classes • Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation • To be Robust or to be Fair: Towards Fairness in Adversarial Training
  38. 53 Reference (5/5) Others • BLaDE: Robust Exploration via Diffusion

    Models • Active Learning Classifiers with Label and Seed Queries • Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel • Beyond Not-Forgetting: Continual Learning with Backward Knowledge Transfer • Learning Options via Compression • A Simple Decentralized Cross-Entropy Method • Turbocharging Solution Concepts: Solving NEs, CEs and CCEs with Neural Equilibrium Solvers • Defining and Characterizing Reward Hacking • Exploiting the Relationship Between Kendall's Rank Correlation and Cosine Similarity for Attribution Protection • Not All Bits have Equal Value: Heterogeneous Precisions via Trainable Noise • AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models • TVLT: Textless Vision-Language Transformer • SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections • TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers • MoCoDA: Model-based Counterfactual Data Augmentation • Explainability Via Causal Self-Talk • Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models • Dataset Inference for Self-Supervised Models • PDEBENCH: An Extensive Benchmark for Scientific Machine Learning • Pruning has a disparate impact on model accuracy