Slide 1

Slide 1 text

Hiroki Naganuma 1,2
1 Université de Montréal, 2 Mila
[email protected]
Out-of-Distribution Generalization, Calibration, Uncertainty and Optimization - Survey of these topics from the bird's-eye view -
[RICOS×ZOZO] NeurIPS 2022 Paper Reading Group
February 17th, 2023, Trends on Distribution Shift in NeurIPS 2022, @Zoom / YouTube Live

Slide 2

Slide 2 text

About Me
Hiroki Naganuma / 長沼 大樹
Ph.D. student in Computer Science @Université de Montréal, Mila - Quebec Artificial Intelligence Institute
External Researcher @ZOZO Research
Research Interests
• Out-of-Distribution Generalization
• Uncertainty and Calibration
• Optimal Transport for Domain Generalization
• Model Selection Criteria (Information Criteria)
• Optimization Algorithms
• Optimization Methods for GANs (Generative Adversarial Networks)
• Implicit Bias and Optimization

Slide 3

Slide 3 text

3 Introduction and Motivation
Problem Setting / Training Error and Generalization Error
Generalization Error: what we want to minimize (≈ prediction performance), but inaccessible.
Training Error: what we minimize instead, under the assumption that the training data follows the same distribution as the test data (i.i.d.).
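The slide's formulas did not survive extraction; a standard formulation consistent with the definitions above would be:

```latex
% Generalization error: expected loss under the (unknown) data distribution
R(f) = \mathbb{E}_{(X,Y) \sim \mathbb{P}}\left[\ell\left(f(X), Y\right)\right]
% Training error: average loss over n i.i.d. training samples
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(f(x_i), y_i\right)
```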

Slide 4

Slide 4 text

4 Limitation of the IID Assumption
Spurious vs. invariant features: Cow / Camel, Desert / Grass.
A classifier trained with ERM under the IID assumption latches onto the background:
• IID data (majority): cows on grass and camels in deserts are classified correctly.
• Data from a different distribution (e.g., a camel on grass) is predicted as Cow.
• Data from a different distribution (e.g., a cow in the desert) is predicted as Camel.

Slide 5

Slide 5 text

5 Introduction and Motivation
Out-of-Distribution Generalization Example
The ability to generalize to (predict well on) out-of-distribution data is called OOD generalization. Figure taken from [Robert Geirhos et al. 2019].

Slide 6

Slide 6 text

6 Introduction and Motivation
Invariant Features and Shortcut Features

Slide 7

Slide 7 text

7 Introduction and Motivation
Difference between Domain Generalization and Domain Adaptation
[Figure: side-by-side comparison, Domain Generalization vs. Domain Adaptation]

Slide 8

Slide 8 text

8 Introduction and Motivation
Out-of-Distribution Generalization

$R^{\mathrm{OOD}}(f) = \max_{e \in \mathcal{E}_{\mathrm{all}}} R^{e}(f)$, where $R^{e}(f) := \mathbb{E}_{(X^{e}, Y^{e}) \sim \mathbb{P}^{e}}\left[\ell\left(f(X^{e}), Y^{e}\right)\right]$

Assumption: the set of all environments $\mathcal{E}_{\mathrm{all}}$ is unknown.
Problem: even if we had access to a prior distribution over potential test environments, there would be two problems with minimizing the objective above:
A) Calculation: (1) can the prior distribution be explicitly expressed? (2) it is doubtful that the posterior can be integrated.
B) Even if two predictors have the same worst-case performance, their performance cannot be compared in the simplest cases.
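As a minimal sketch of this objective (assuming per-environment data loaders and an element-wise loss; the names are illustrative, not from the slides):

```python
import torch

def ood_risk(model, env_loaders, loss_fn):
    """Empirical version of R_OOD(f) = max_e R_e(f).

    env_loaders: dict mapping an environment id to a DataLoader of (x, y).
    loss_fn: element-wise loss, e.g. nn.CrossEntropyLoss(reduction='none').
    Note: E_all is unknown in practice, so the max can only be taken over
    the environments we actually observe.
    """
    risks = {}
    model.eval()
    with torch.no_grad():
        for env, loader in env_loaders.items():
            total, n = 0.0, 0
            for x, y in loader:
                total += loss_fn(model(x), y).sum().item()
                n += y.shape[0]
            risks[env] = total / n  # empirical estimate of R_e(f)
    return max(risks.values()), risks
```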

Slide 9

Slide 9 text

9 Introduction and Motivation
What Out-of-Distribution Generalization Aims For
Assumption: the set of all environments is unknown.
Aim 1: Obtain strong guarantees in the regime where the number of observed training environments is not tending to infinity but is actually rather small.
Aim 2: Failing that, obtain at least good empirical performance.

Slide 10

Slide 10 text

10 Introduction and Motivation
Hints for OOD Generalization (1/2)
Empirical Risk Minimization: ERM rests on an i.i.d. assumption that is too strong:
• interested only in average-case performance
• test data is assumed to be drawn IID from the training distribution
• no knowledge of the structure of the data
→ fails in cases like the cow and its background.
Robust Optimization: equivalent to minimizing a weighted average of the errors in each environment.
→ There may be better algorithms, because robust optimization does not take the structure between the environment distributions into account.
→ As adversarial examples show, robustness at training time does not in general imply robustness at test time.
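A minimal sketch of the worst-environment ("robust optimization") objective, assuming one mini-batch per training environment (illustrative names, not the slides' code):

```python
import torch

def robust_training_loss(model, env_batches, loss_fn):
    """min_f max_e R_e(f): gradients flow only through the currently
    worst-performing environment. env_batches is a list of (x, y) pairs,
    one mini-batch per training environment."""
    env_risks = torch.stack([loss_fn(model(x), y).mean() for x, y in env_batches])
    return env_risks.max()
```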

Slide 11

Slide 11 text

11 Introduction and Motivation
Hints for OOD Generalization (2/2)
Domain Adaptation. Assumption: we can extract some variable, a feature Φ(X), that is useful for prediction and whose relationship to the target does not change across environments.
Idea: look for features Φ(X) such that the conditional distribution P(Y | Φ(X)) does not change across environments (invariant prediction).

Slide 12

Slide 12 text

12 Introduction and Motivation
How do we find invariant predictors?
Strategy: Invariant Risk Minimization.
A predictor with the following characteristics is useful for OOD:
• its prediction rule does not change across training environments
• its error is small across training environments
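A minimal sketch of the IRMv1 penalty from Arjovsky et al. (2019), which operationalizes this idea: the gradient of each environment's risk with respect to a fixed dummy classifier scale should be near zero (assumes loss_fn returns a scalar, e.g. mean-reduced cross-entropy):

```python
import torch

def irmv1_penalty(logits, y, loss_fn):
    """Squared gradient norm of the risk w.r.t. a dummy scale w = 1.0.
    A small penalty means the classifier is simultaneously (locally)
    optimal for this environment."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = loss_fn(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2).sum()

def irm_objective(model, env_batches, loss_fn, lam=1.0):
    # sum of per-environment risks plus the invariance penalty
    risks, penalties = [], []
    for x, y in env_batches:
        logits = model(x)
        risks.append(loss_fn(logits, y))
        penalties.append(irmv1_penalty(logits, y, loss_fn))
    return torch.stack(risks).mean() + lam * torch.stack(penalties).mean()
```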

Slide 13

Slide 13 text

13 Introduction and Motivation
Out-of-Distribution Generalization

$R^{\mathrm{OOD}}(f) = \max_{e \in \mathcal{E}_{\mathrm{all}}} R^{e}(f)$, where $R^{e}(f) := \mathbb{E}_{(X^{e}, Y^{e}) \sim \mathbb{P}^{e}}\left[\ell\left(f(X^{e}), Y^{e}\right)\right]$

Approaches: Empirical Risk Minimization / Robust Optimization / Invariant Risk Minimization
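The slide's side-by-side formulas were lost in extraction; a standard statement of the three objectives over training environments (following Arjovsky et al. 2019) would be:

```latex
\text{ERM:} \quad \min_{f} \ \sum_{e \in \mathcal{E}_{tr}} R^{e}(f)
\text{Robust Optimization:} \quad \min_{f} \ \max_{e \in \mathcal{E}_{tr}} R^{e}(f)
\text{IRM:} \quad \min_{\Phi,\, w} \ \sum_{e \in \mathcal{E}_{tr}} R^{e}(w \circ \Phi)
  \quad \text{s.t. } w \in \arg\min_{\bar{w}} R^{e}(\bar{w} \circ \Phi) \ \ \forall e \in \mathcal{E}_{tr}
```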

Slide 14

Slide 14 text

14 Trend: NeurIPS 2022 Workshop on Distribution Shifts
Scope
• Examples of real-world distribution shifts in various application areas
• Methods for improving robustness to distribution shifts
• Empirical and theoretical characterization of distribution shifts
• Benchmarks and evaluations
95 Accepted Papers

Slide 15

Slide 15 text

15 Trend: Distribution Shifts (Principles and Methods)
How to generalize under distribution shift (covering both clarification of principles and proposed methods)
• Shortcut Learning in Deep Neural Networks: shortcut learning, i.e., "cheating" by exploiting information that should not be used to solve the task, is observed in animals and is widespread in current ML/DL. As a result, models generalize in unintended directions and cannot extrapolate outside the training distribution (o.o.d.)
• Towards a Theoretical Framework of Out-of-Distribution Generalization: proves generalization error bounds for OOD
• An Information-theoretic Approach to Distribution Shifts: which features should be selected for OOD generalization? Argues that an information-theoretic approach handles this well
• Fishr: Invariant Gradient Variances for Out-of-distribution Generalization: proposes a regularizer that enforces domain invariance in the space of loss gradients
• Predicting Unreliable Predictions by Shattering a Neural Network: models with fewer activation regions generalize more easily, and models whose knowledge is more abstracted generalize more easily
Connection between invariance and causality

Slide 16

Slide 16 text

16 Trend: Distribution Shifts (Datasets)
Dataset
• Noise or Signal: The Role of Image Backgrounds in Object Recognition: background dependence of image classifiers / the Background Challenge
• On the Impact of Spurious Correlation for Out-of-distribution Detection: presents a model of data shift that accounts for both invariant and spurious features
• WILDS: A Benchmark of in-the-Wild Distribution Shifts: a suite of OOD datasets covering images, language, and graphs
• In Search of Lost Domain Generalization: DomainBed: a suite of domain generalization datasets

Slide 17

Slide 17 text

17 Trend: Distribution Shifts (Empirical Evaluation)
Empirical Evaluation
• OoD-Bench: Benchmarking and Understanding Out-of-Distribution Generalization Datasets and Algorithms: categorizes OOD datasets along the axes of diversity shift and correlation shift
• A Fine-Grained Analysis on Distribution Shift: evaluates the robustness of algorithms across multiple different distribution shifts
• Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data: divides shifts into three types and compares ID and OOD performance across architectures
• How does a neural network's architecture impact its robustness to noisy labels?: explores how a network's architecture affects its robustness to noisy labels
• On Calibration and Out-of-domain Generalization: finds a relationship between OOD performance and model calibration

Slide 18

Slide 18 text

18 Outline
01 Introduction and Motivation / Trend
02 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
03 Diverse Weights Averaging for Out-of-Distribution Generalization
04 Assaying Out-Of-Distribution Generalization in Transfer Learning

Slide 19

Slide 19 text

19 02 Empirical Study on Optimizer Selection 
for Out-of-Distribution Generalization

Slide 20

Slide 20 text

Background: Inductive Bias of Optimization in Deep Learning
• In typical deep learning problem settings, there are many global (= local) minima
• The generalization performance of each global minimum is different
• Different training choices, such as hyperparameters and the optimizer, converge to different global minima
Figure: [Matthew Hutson 2018], Figure: [Naresh Kumar 2019]
20 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Slide 21

Slide 21 text

Background: Intuition about the Characteristics of Optimizers
Figure: [Difan Zou et al.]
• Different optimizers have different convergence rates and generalization performance
• Some experiments imply that Adam memorizes the noise in the training data
21 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Slide 22

Slide 22 text

22 Related Work: Optimizer Comparison in the IID Setting
With sufficient tuning, adaptive optimization methods slightly outperform non-adaptive ones, but not by much.
Comprehensive experiments under the IID assumption. Figure: [Guodong Zhang et al. 2019]

Slide 23

Slide 23 text

Out-of-Distribution Generalization Datasets (Computer Vision)
DomainBed. Figure: [Ishaan Gulrajani and David Lopez-Paz 2020]
Background Challenge. Figure: [Kai Xiao et al. 2020]
23 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Slide 24

Slide 24 text

Out-of-Distribution Generalization Datasets (NLP)
CivilComments and Amazon. Figure: [Pang Wei Koh, 2020]
24 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Slide 25

Slide 25 text

Optimizers Covered in Our Analysis
25 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Slide 26

Slide 26 text

26 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization Experimental Results (CV+NLP)

Slide 27

Slide 27 text

27 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization Experimental Results (CV+NLP)

Slide 28

Slide 28 text

Experimental Results (NLP) 28 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Slide 29

Slide 29 text

29 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization Experimental Results (ColoredMNIST)

Slide 30

Slide 30 text

Experimental Results (Correlation Behaviour) 30 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Slide 31

Slide 31 text

31 03 Diverse Weights Averaging 
for Out-of-Distribution Generalization

Slide 32

Slide 32 text

32 Diverse Weights Averaging for Out-of-Distribution Generalization Background

Slide 33

Slide 33 text

33 Diverse Weights Averaging for Out-of-Distribution Generalization Proposal
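The proposal slide itself is an image; as a rough, hypothetical illustration of the idea in the paper's title, i.e., uniformly averaging the weights of several models fine-tuned from a shared initialization with different hyperparameters and seeds (function name illustrative, not from the paper):

```python
import copy
import torch

def average_weights(models):
    """Diverse weight averaging, in the spirit of DiWA: uniformly average
    the parameters of M fine-tuned models. Sketch only; assumes identical
    architectures (shared state-dict keys/shapes) and ignores non-parameter
    buffers such as BatchNorm running statistics."""
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, p in avg.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            p.copy_(stacked.mean(dim=0))
    return avg
```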

Slide 34

Slide 34 text

34 Diverse Weights Averaging for Out-of-Distribution Generalization Setup Diversity Shift Correlation Shift

Slide 35

Slide 35 text

35 Diverse Weights Averaging for Out-of-Distribution Generalization
Theoretical Contribution: Bias-Variance Analysis in OOD (1/2)
[Diagram: bias-variance illustration with models θ*, θ̂, θ̌]
Bias and correlation shift: the bias in OOD increases when the class posteriors mismatch. Assumption: large NNs.

Slide 36

Slide 36 text

36 Diverse Weights Averaging for Out-of-Distribution Generalization
Theoretical Contribution: Bias-Variance Analysis in OOD (2/2)
[Diagram: bias-variance illustration with models θ*, θ̂, θ̌]
Variance and diversity shift: the variance in OOD increases when the input marginals mismatch. Assumption: NNs with a diagonally dominant NTK.
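For reference, the standard decomposition these two slides instantiate (the slides' own equations were lost in extraction); for squared loss and a model trained on a random draw of data:

```latex
\mathbb{E}_{\hat\theta}\big[(f_{\hat\theta}(x) - y)^2\big]
  = \underbrace{\big(\mathbb{E}_{\hat\theta}[f_{\hat\theta}(x)] - y\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{\hat\theta}\big[\big(f_{\hat\theta}(x) - \mathbb{E}_{\hat\theta}[f_{\hat\theta}(x)]\big)^2\big]}_{\text{variance}}
```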

Slide 37

Slide 37 text

37 Diverse Weights Averaging for Out-of-Distribution Generalization
Theoretical Contribution: Applying the Following Theory to OOD
Bias-Variance-Covariance Decomposition for Ensembling (Naonori Ueda, NTT Labs, 1996)
• Bias term: increasing M does not lower this part of the error → correlation shift cannot be dealt with
• Variance term: this part of the error goes down as M goes up → diversity shift can be handled
• Covariance term: should be controlled to keep the generalization error low
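The decomposition referenced here (its formula did not survive extraction) is, in the usual notation for an ensemble of M members:

```latex
% Ueda & Nakano (1996): for the ensemble average \bar{f} = \frac{1}{M}\sum_{m=1}^{M} f_m,
\mathbb{E}\big[(\bar{f}(x) - y)^2\big]
  = \overline{\mathrm{bias}}^{\,2}
  + \frac{1}{M}\,\overline{\mathrm{var}}
  + \Big(1 - \frac{1}{M}\Big)\,\overline{\mathrm{cov}}
% bias term: independent of M; variance term: shrinks as M grows;
% covariance term: must be kept small for a low generalization error.
```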

Slide 38

Slide 38 text

38 Diverse Weights Averaging for Out-of-Distribution Generalization Ablation Study

Slide 39

Slide 39 text

39 Diverse Weights Averaging for Out-of-Distribution Generalization Experimental Results / Benchmark

Slide 40

Slide 40 text

40 04 Assaying Out-Of-Distribution Generalization in Transfer Learning

Slide 41

Slide 41 text

41 Assaying Out-Of-Distribution Generalization in Transfer Learning
Motivation
• Different communities (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) are studying the same questions yet drawing different conclusions
• The authors evaluate all of them on the same exhaustive benchmark to reach a unified understanding
Overview
• Large-scale experiments!

Slide 42

Slide 42 text

42 Assaying Out-Of-Distribution Generalization in Transfer Learning
Background (ECE: Expected Calibration Error)
Modern DNNs do well on ranking performance (e.g., accuracy, AUC) but are known to be poor at uncertainty estimation (calibration, ECE). This prevents the use of DNNs in automated driving, medical image diagnostics, and recommendation systems. Figure taken from [Chuan Guo et al. 2017].
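A minimal sketch of the standard binned ECE computation referenced here (equal-width confidence bins; illustrative code, not the paper's):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: |accuracy - mean confidence| per equal-width confidence
    bin, weighted by the fraction of samples in the bin. `confidences` are
    the predicted max-class probabilities; `correct` is a 0/1 array of hits."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight * calibration gap
    return ece
```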

Slide 43

Slide 43 text

43 Assaying Out-Of-Distribution Generalization in Transfer Learning
Full Results
The results presented hereafter are based on this correlation coefficient, aggregated along the horizontal axis, etc.

Slide 44

Slide 44 text

44 Assaying Out-Of-Distribution Generalization in Transfer Learning
Experimental Results (results can significantly change for different shift types)
Takeaway: ID and OOD accuracy only show a linear trend on specific tasks. We observe three additional settings: underspecification (vertical line), no generalization (horizontal line), and random generalization (large point cloud). We did not observe any trade-off between accuracy and robustness, where more accurate models would overfit to "spurious features" that do not generalize. Robustness methods have to be tested in many different settings; currently, there seems to be no single method that is superior in all OOD settings.

Slide 45

Slide 45 text

45 Assaying Out-Of-Distribution Generalization in Transfer Learning
Experimental Results (what are good proxies for measuring robustness to distribution shifts?)
Takeaway: Accuracy is the strongest ID predictor of OOD robustness, and models that generalize well in-distribution tend to also be more robust. Evaluating accuracy on additional held-out OOD data is an even stronger predictor.

Slide 46

Slide 46 text

46 Assaying Out-Of-Distribution Generalization in Transfer Learning
Experimental Results (on the transfer of metrics from ID to OOD data)
Takeaway: Among all metrics, adversarial robustness transfers best from ID to OOD data, which suggests that models respond similarly to adversarial attacks on ID and OOD data. Calibration transfers worst, which means that models that are well calibrated on ID data are not necessarily well calibrated on OOD data.

Slide 47

Slide 47 text

47 05 Acknowledgement Kartik Ahuja Rio Yokota Ioannis Mitliagkas Kohta Ishikawa Ikuro Sato Tetsuya Motokawa Shiro Takagi Kilian Fatras Masanari Kimura Charles Guille-Escuret

Slide 48

Slide 48 text

48 Thank you for listening

Slide 49

Slide 49 text

49 Reference (1/5)
Distribution Shift
• Estimating and Explaining Model Performance When Both Covariates and Labels Shift
• Diverse Weights Averaging for Out-of-Distribution Generalization
• Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
• Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts
• Assaying Out-Of-Distribution Generalization in Transfer Learning
• Improving Multi-Task Generalization via Regularizing Spurious Correlation
• MEMO: Test Time Robustness via Adaptation and Augmentation
• When are Local Queries Useful for Robust Learning?
• The Missing Invariance Principle Found - the Reciprocal Twin of Invariant Risk Minimization
• Adapting to Online Label Shift with Provable Guarantees
• Hard ImageNet: Segmentations for Objects with Strong Spurious Cues
• Invariance Learning based on Label Hierarchy
• Hyperparameter Sensitivity in Deep Outlier Detection: Analysis and a Scalable Hyper-Ensemble Solution
• Domain Generalization without Excess Empirical Risk
• Representing Spatial Trajectories as Distributions
• Multitasking Models are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve
• SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG
• Task Discovery: Finding the Tasks that Neural Networks Generalize on
• Domain Adaptation under Open Set Label Shift
• Explicit Tradeoffs between Adversarial and Natural Distributional Robustness
• When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
• NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation
• RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out-of-Distribution Robustness
• Unsupervised Learning under Latent Label Shift

Slide 50

Slide 50 text

50 Reference (2/5)
Calibration
• Towards Improving Calibration in Object Detection Under Domain Shift
• Single Model Uncertainty Estimation via Stochastic Data Centering
• Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors
• On Uncertainty, Tempering, and Data Augmentation in Bayesian Classification
• Joint Entropy Search for Maximally-Informed Bayesian Optimization
• Scalable Sensitivity and Uncertainty Analysis for Causal-Effect Estimates of Continuous-Valued Interventions
Generalization
• Neuron with Steady Response Leads to Better Generalization
• Chroma-VAE: Mitigating Shortcut Learning with Generative Classifiers
• Redundant representations help generalization in wide neural networks
• What is a Good Metric to Study Generalization of Minimax Learners?
• Rethinking Generalization in Few-Shot Classification
• Generalization for multiclass classification with overparameterized linear models
• LISA: Learning Interpretable Skill Abstractions from Language
• Geoclidean: Few-Shot Generalization in Euclidean Geometry

Slide 51

Slide 51 text

51 Reference (3/5)
Regularization
• On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity
• The Effects of Regularisation and Data Augmentation are Class Dependent
• Tikhonov Regularization is Optimal Transport Robust under Martingale Constraints
• Feature Learning in L2-regularized DNNs: Attraction/Repulsion and Sparsity
Adversarial Robustness
• Noise attention learning: enhancing noise robustness by gradient scaling
• Why do artificially generated data help adversarial robustness?
• On the Adversarial Robustness of Mixture of Experts
Optimizer
• Target-based Surrogates for Stochastic Optimization
• Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning
• First-Order Algorithms for Min-Max Optimization in Geodesic Metric Spaces
• Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization
• Adam Can Converge Without Any Modification On Update Rules
• Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization
• A consistently adaptive trust-region method
• Better SGD using Second-order Momentum

Slide 52

Slide 52 text

52 Reference (4/5)
Theory
• Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization
• Effects of Data Geometry in Early Deep Learning
• Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting
• Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials
• What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
• On Margin Maximization in Linear and ReLU Networks
• A Combinatorial Perspective on the Optimization of Shallow ReLU Networks
• Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers
• On the Double Descent of Random Features Models Trained with SGD
• Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization)
• Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions
• Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks
• Chaotic Dynamics are Intrinsic to Neural Network Training with SGD
• On the non-universality of deep learning: quantifying the cost of symmetry
• High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation
• Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
Robustness
• Where do Models go Wrong? Parameter-Space Saliency Maps for Explainability
• What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
• Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation
• To be Robust or to be Fair: Towards Fairness in Adversarial Training

Slide 53

Slide 53 text

53 Reference (5/5)
Others
• BLaDE: Robust Exploration via Diffusion Models
• Active Learning Classifiers with Label and Seed Queries
• Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel
• Beyond Not-Forgetting: Continual Learning with Backward Knowledge Transfer
• Learning Options via Compression
• A Simple Decentralized Cross-Entropy Method
• Turbocharging Solution Concepts: Solving NEs, CEs and CCEs with Neural Equilibrium Solvers
• Defining and Characterizing Reward Hacking
• Exploiting the Relationship Between Kendall's Rank Correlation and Cosine Similarity for Attribution Protection
• Not All Bits have Equal Value: Heterogeneous Precisions via Trainable Noise
• AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
• TVLT: Textless Vision-Language Transformer
• SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections
• TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers
• MoCoDA: Model-based Counterfactual Data Augmentation
• Explainability Via Causal Self-Talk
• Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
• Dataset Inference for Self-Supervised Models
• PDEBENCH: An Extensive Benchmark for Scientific Machine Learning
• Pruning has a disparate impact on model accuracy