
Trends on Distribution Shift in NeurIPS2022

Hiroki Naganuma
February 17, 2023


This talk focuses on the problem of distribution shift in deep learning. In the conventional learning framework, training and test data are assumed to be i.i.d., so minimizing the empirical loss is expected to minimize the expected loss (Empirical Risk Minimization). In the real world, however, the test distribution commonly differs from the training distribution, and such environment changes between training and inference violate the assumptions of supervised learning. Addressing distribution shift is therefore essential for real-world applications of deep learning. The first half gives an overview of the field and its trends rather than the details of individual papers; the second half briefly introduces several papers.


  1. 1
    Hiroki Naganuma¹,²
    ¹Université de Montréal, ²Mila
    [email protected]
    Out-of-Distribution Generalization, Calibration, Uncertainty and Optimization
    - Survey of these topics from the bird's-eye view -
    [RICOS × ZOZO] NeurIPS 2022 Paper Reading Session
    February 17th, 2023
    Trends on Distribution Shift in NeurIPS2022
    @Zoom / YouTube Live


  2. About Me
    Hiroki Naganuma / 長沼 大樹
    Ph.D. student in Computer Science
    @Université de Montréal, Mila - Quebec Artificial Intelligence Institute

    External Researcher
    @ZOZO Research

    Research Interest
    • Out-of-Distribution Generalization
    • Uncertainty and Calibration
    • Optimal Transport for Domain Generalization
    • Model Selection Criteria (Information Criteria)
    • Optimization Algorithms
    • Optimization Methods for GANs (Generative Adversarial Networks)
    • Implicit Bias and Optimization
    2


  3. 3
    Introduction and Motivation
    Problem Setting / Training Error and Generalization Error
    Generalization error: what we want to minimize (≈ prediction performance), but inaccessible.
    Training error: what we minimize instead, assuming the training data follows the same distribution as the test data (i.i.d.).
    Introduction
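    For concreteness, the two risks can be written as follows (a standard formulation, using notation consistent with the per-environment risk R^e(f) defined later in this deck):

```latex
% Generalization (expected) risk: what we want to minimize, but inaccessible.
R(f) = \mathbb{E}_{(X, Y) \sim \mathbb{P}}\big[\ell\big(f(X), Y\big)\big]
% Training (empirical) risk over n i.i.d. samples: what ERM minimizes instead.
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
```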


  4. 4
    Limitation of the IID Assumption
    Introduction and Motivation
    A classifier based on ERM under the IID assumption latches onto the spurious feature (the background: grass vs. desert) instead of the invariant feature (the animal: cow vs. camel).
    On IID data (the majority), predictions are correct; on data from a different distribution, a cow in the desert is predicted as Camel, and a camel on grass is predicted as Cow.
    Introduction


  5. 5
    Introduction and Motivation
    The generalization performance when predicting out-of-distribution data is called OOD generalization.
    Figure taken from [Robert Geirhos et al., 2019]
    Introduction
    Out-of-Distribution Generalization Example


  6. 6
    Introduction and Motivation
    Invariant Feature and Shortcut Feature
    Introduction


  7. 7
    Difference between Domain Generalization and Domain Adaptation
    Introduction and Motivation
    Introduction
    Domain Generalization Domain Adaptation


  8. 8
    Introduction and Motivation
    Out-of-Distribution Generalization
    Introduction
    Problem: R^OOD(f) = max_{e∈ℰ_all} R^e(f), where R^e(f) := 𝔼_{X^e,Y^e∼ℙ^e}[ℓ(f(X^e), Y^e)]
    Assumption: ℰ_all is unknown.
    Even if we have access to a prior distribution over potential test environments, there are two problems in minimizing the resulting objective:
    A) Calculation
      1. Can the prior distribution be explicitly expressed?
      2. It is doubtful that the posterior distribution can be integrated.
    B) Even if two predictors have the same worst-case performance, they cannot be compared outside the worst case (e.g., in the simplest environments).
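    As a minimal illustration (a sketch under my own assumptions, not code from the talk): the worst-case risk is easy to evaluate when the set of environments is finite and observable; the whole difficulty of OOD generalization is that ℰ_all is not.

```python
# Sketch: evaluating R_OOD(f) = max_e R_e(f) over a finite, *observable*
# set of environments, each given as a PyTorch DataLoader of (x, y) pairs.
import torch
import torch.nn.functional as F

def environment_risk(model, loader):
    """Average cross-entropy loss R_e(f) on one environment."""
    total, n = 0.0, 0
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            total += F.cross_entropy(model(x), y, reduction="sum").item()
            n += x.size(0)
    return total / n

def worst_case_risk(model, env_loaders):
    """Max of the per-environment risks: the inner quantity of R_OOD."""
    return max(environment_risk(model, loader) for loader in env_loaders)
```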


  9. 9
    Introduction and Motivation
    What Out-of-Distribution Generalization Aims For
    Introduction
    Assumption: ℰ_all is unknown.
    Aim 1: Obtain strong guarantees on R^OOD(f), in a regime where the number of training environments is not tending to infinity but is actually rather small.
    Aim 2: Obtain at least good empirical performance.

  10. 10
    Introduction and Motivation
    Introduction
    Hint for OOD Generalization (1/2)
    Empirical Risk Minimization: ERM uses too strong an i.i.d. assumption.
    • Targets average-case rather than worst-case OOD performance
    • Assumes test data is drawn i.i.d. from the training distribution
    • Uses no knowledge of the environment structure of the data
    Robust Optimization: equivalent to minimizing a weighted average of the errors in each environment (a sketch follows below).
    • There may be better algorithms, since Robust Optimization does not take into account the structure between the distributions of the training environments.
    Adversarial Examples: robustness at training time does not in general imply robustness at test time → cases like the cow and its background.
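    A minimal sketch of robust optimization over training environments (my simplification: plain minimax on minibatches, not any specific paper's algorithm):

```python
# Sketch: one training step of min_f max_e R_e(f). Each step backpropagates
# only through the currently worst environment's minibatch loss.
import torch
import torch.nn.functional as F

def robust_optimization_step(model, optimizer, env_batches):
    """env_batches: list of (x, y) minibatches, one per training environment."""
    losses = torch.stack([F.cross_entropy(model(x), y) for x, y in env_batches])
    worst = losses.max()  # worst-case environment loss
    optimizer.zero_grad()
    worst.backward()
    optimizer.step()
    return losses.detach()  # per-environment losses, for monitoring
```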


  11. 11
    Introduction and Motivation
    Introduction
    Hint for OOD Generalization (2/2)
    Domain Adaptation
    Assumption: we can extract some variable that is useful for prediction across environments and that doesn't change across environments.
    Idea: look for features Φ(X) such that the relationship between Y and Φ(X) doesn't change across environments (invariant prediction).

  12. 12
    Introduction and Motivation
    Introduction
    How do we find an invariant prediction?
    Strategy: Invariant Risk Minimization
    A predictor with the following characteristics is useful for OOD (see the sketch below):
    • Its prediction does not change over the set of all environments ℰ_all
    • Its error becomes small over the training environments ℰ_tr
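    A hedged sketch of how this strategy is typically operationalized: the IRMv1 penalty of Arjovsky et al. (2019), in my simplified form rather than the talk's code. The predictor is pushed toward a point where rescaling its logits cannot reduce any environment's risk.

```python
# Sketch of the IRMv1 objective: average environment risk plus a penalty on
# the gradient of each environment's risk w.r.t. a dummy scale on the logits.
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    scale = torch.ones(1, device=logits.device, requires_grad=True)
    loss = F.cross_entropy(logits * scale, y)
    (grad,) = torch.autograd.grad(loss, scale, create_graph=True)
    return (grad ** 2).sum()

def irmv1_loss(model, env_batches, lam=1.0):
    """env_batches: list of (x, y) minibatches, one per training environment."""
    risks, penalties = [], []
    for x, y in env_batches:
        logits = model(x)
        risks.append(F.cross_entropy(logits, y))
        penalties.append(irm_penalty(logits, y))
    return torch.stack(risks).mean() + lam * torch.stack(penalties).mean()
```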


  13. 13
    Introduction and Motivation
    Out-of-Distribution Generalization
    Introduction
    R^OOD(f) = max_{e∈ℰ_all} R^e(f), where R^e(f) := 𝔼_{X^e,Y^e∼ℙ^e}[ℓ(f(X^e), Y^e)]
    Empirical Risk Minimization / Robust Optimization / Invariant Risk Minimization


  14. 14
    Trend: NeurIPS 2022 Workshop on Distribution Shifts
    Scope


    •Examples of real-world distribution shifts in various application areas


    •Methods for improving robustness to distribution shifts


    •Empirical and theoretical characterization of distribution shifts


    •Benchmarks and evaluations
    95 Accepted Papers


  15. 15
    Trend: Distribution Shifts (Principle and Methods)
    How to generalize under distribution shift (covering both clarification of principles and proposed methods)
    • Shortcut Learning in Deep Neural Networks:
      Shortcut learning, i.e. "cheating" by exploiting information that should not be used to solve the task, is observed in animals and is also widespread in current ML/DL. As a result, models generalize in the wrong direction and cannot extrapolate outside the training distribution (o.o.d.).
    • Towards a Theoretical Framework of Out-of-Distribution Generalization:
      Proves generalization error bounds for OOD.
    • An Information-theoretic Approach to Distribution Shifts:
      Which features should be selected to generalize OOD? Argues that an information-theoretic approach to this question works well.
    • Fishr: Invariant Gradient Variances for Out-of-distribution Generalization:
      Proposes a regularizer that enforces domain invariance in the space of loss gradients (see the sketch after this list).
    • Predicting Unreliable Predictions by Shattering a Neural Network:
      Models with fewer activation regions generalize more easily; models whose knowledge is more abstracted generalize more easily.
    Connection between invariance and causality
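    A heavily simplified sketch of a Fishr-style penalty (Rame et al., 2022): matching the per-environment variances of per-sample gradients of the final classifier layer. This is my illustrative reduction of the idea, not the paper's official implementation.

```python
# Sketch: per-environment gradient variances of the last layer are pushed
# toward their mean across environments (domain invariance in gradient space).
import torch
import torch.nn.functional as F

def gradient_variance(logits, y, classifier_weight):
    """Variance over the batch of per-sample loss gradients w.r.t. the
    classifier weights (a per-sample loop; slow, but fine for a sketch)."""
    per_sample_grads = []
    for i in range(y.size(0)):
        loss_i = F.cross_entropy(logits[i:i + 1], y[i:i + 1])
        (g,) = torch.autograd.grad(loss_i, classifier_weight, create_graph=True)
        per_sample_grads.append(g.flatten())
    return torch.stack(per_sample_grads).var(dim=0)

def fishr_style_penalty(env_variances):
    """Squared distance of each environment's gradient variance to the mean."""
    mean_variance = torch.stack(env_variances).mean(dim=0)
    return sum(((v - mean_variance) ** 2).mean() for v in env_variances)
```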


  16. 16
    Trend: Distribution Shifts (Datasets)
    Dataset
    • Noise or Signal: The Role of Image Backgrounds in Object Recognition:
      Background dependence of image classifiers / the Backgrounds Challenge.
    • On the Impact of Spurious Correlation for Out-of-distribution Detection:
      Presents a model of data shift that accounts for both invariant and spurious features.
    • WILDS: A Benchmark of in-the-Wild Distribution Shifts:
      A suite of OOD datasets covering images, language, and graphs.
    • In Search of Lost Domain Generalization: DomainBed:
      A suite of domain generalization datasets.


  17. 17
    Trend: Distribution Shifts (Empirical Evaluation)
    Empirical Evaluation
    • OoD-Bench: Benchmarking and Understanding Out-of-Distribution Generalization Datasets and Algorithms:
      Categorizes OOD datasets along the axes of diversity shift and correlation shift.
    • A Fine-Grained Analysis on Distribution Shift:
      Evaluates the robustness of algorithms across multiple different distribution shifts.
    • Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data:
      Splits shifts into three types and compares ID and OOD performance across architectures.
    • How does a neural network's architecture impact its robustness to noisy labels?:
      Explores how a network's architecture affects its robustness to noisy labels.
    • On Calibration and Out-of-domain Generalization:
      Finds a relationship between OOD performance and model calibration.


  18. 18
    01 Introduction and Motivation / Trend


    02 Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


    03 Diverse Weights Averaging for Out-of-Distribution Generalization


    04 Assaying Out-Of-Distribution Generalization in Transfer Learning
    Outline


  19. 19
    02 Empirical Study on Optimizer Selection

    for Out-of-Distribution Generalization


  20. Background: Inductive Bias of Optimization in Deep Learning
    • In the general deep learning setting, there are many global (= local) minima
    • The generalization performance of each global minimum is different
    • Different training choices, such as hyperparameters and the optimizer, converge to different global minima
    Figure: [Matthew Hutson 2018]  Figure: [Naresh Kumar 2019]
    20
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


  21. Background: Intuition for the Characteristics of Optimizers
    Figure: [Difan Zou et al.]
    • Different optimizers have different convergence rates and generalization performance
    • Some experiments imply that Adam memorizes the noise in the training data
    21
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


  22. 22
    Related Work: Optimizer Comparison in the IID Setting
    With sufficient tuning, adaptive optimization methods slightly outperform non-adaptive ones, but not by much.
    Comprehensive Experiments Under the IID Assumption
    Figure: [Guodong Zhang et al 2019]


  23. Out-of-Distribution Generalization Datasets (Computer Vision)
    DomainBed Background Challenge
    Figure: [Ishaan Gulrajani and David Lopez-Paz 2020]  Figure: [Kai Xiao et al, 2020]
    23
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


  24. Out-of-Distribution Generalization Datasets (NLP)
    Civil Comments
    Amazon
    Figure: [Pang Wei Koh, 2020]
    24
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


  25. Optimizers Subjected to Our Analysis
    25
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


  26. 26
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
    Experimental Results (CV+NLP)


  27. 27
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
    Experimental Results (CV+NLP)


  28. Experimental Results (NLP)
    28
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


  29. 29
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
    Experimental Results (ColoredMNIST)


  30. Experimental Results (Correlation Behaviour)
    30
    Empirical Study on Optimizer Selection for Out-of-Distribution Generalization


  31. 31
    03 Diverse Weights Averaging

    for Out-of-Distribution Generalization


  32. 32
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Background


  33. 33
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Proposal
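    The core operation is simple enough to sketch. A minimal illustration (my reading of the recipe, not the authors' code): fine-tune M models from a shared pre-trained initialization with diverse hyperparameters/seeds, then uniformly average their weights into a single model.

```python
# Sketch: uniform weight averaging of M models that share one architecture
# (and, for averaging to make sense, a common pre-trained initialization).
import copy
import torch

def average_weights(models):
    """Return a new model whose floating-point parameters/buffers are the
    uniform average of those of `models`."""
    avg_state = copy.deepcopy(models[0].state_dict())
    for key, value in avg_state.items():
        if value.dtype.is_floating_point:
            avg_state[key] = torch.stack(
                [m.state_dict()[key] for m in models]
            ).mean(dim=0)
    averaged = copy.deepcopy(models[0])
    averaged.load_state_dict(avg_state)
    return averaged
```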


  34. 34
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Setup
    Diversity Shift
    Correlation Shift


  35. 35
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Theoretical Contribution: Bias-Variance Analysis in OOD (1/2)
    [Diagram: bias-variance decomposition involving θ*, θ̂, θ̌]
    Bias and correlation shift:
    > Bias in OOD increases when the class posteriors mismatch.
    Assumption: large NNs


  36. 36
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Theoretical Contribution: Bias-Variance Analysis in OOD (2/2)
    [Diagram: bias-variance decomposition involving θ*, θ̂, θ̌]
    Variance and diversity shift:
    > Variance in OOD increases when the input marginals mismatch.
    Assumption: NNs with diagonally dominant NTK


  37. 37
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Theoretical Contribution: Apply the Following Theory to OOD
    Bias-Variance-Covariance Decomposition for Ensembling (Naonori Ueda, NTT, 1996)
    • Bias term: increasing M does not lower it → correlation shift cannot be dealt with.
    • Variance term: goes down as M goes up → diversity shift can be handled.
    • Covariance term: should be controlled for low generalization error.
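    For reference, the decomposition itself, in its standard form for an average of M predictors \bar{f} = (1/M) \sum_m f_m; the term-by-term remarks above follow from the 1/M and (1 - 1/M) factors:

```latex
\mathbb{E}\big[(\bar{f}(x) - y)^2\big]
  = \overline{\mathrm{bias}}^{\,2}
  + \frac{1}{M}\,\overline{\mathrm{var}}
  + \Big(1 - \frac{1}{M}\Big)\,\overline{\mathrm{cov}}
```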


  38. 38
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Ablation Study


  39. 39
    Diverse Weights Averaging for Out-of-Distribution Generalization
    Experimental Results / Benchmark


  40. 40
    04 Assaying Out-Of-Distribution Generalization
    in Transfer Learning


  41. 41
    Assaying Out-Of-Distribution Generalization in Transfer Learning
    Motivation


    • Different communities (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) are considering the same things and drawing different conclusions
    • They evaluated with the same exhaustive benchmark for a unified understanding
    Overview
    • Large-scale experiments!


  42. 42
    Assaying Out-Of-Distribution Generalization in Transfer Learning
    Background (ECE: Expected Calibration Error)
    Modern DNNs do well on ranking performance (e.g., accuracy, AUC), but are known to be bad at uncertainty (calibration, ECE).
    This prevents the use of DNNs in automated driving, medical image diagnostics, and recommendation systems.
    Figure taken from [Chuan Guo et al., 2017]
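    A minimal sketch of how ECE is commonly computed (equal-width confidence bins; my illustrative formulation, not the paper's code):

```python
# Sketch: ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)| over confidence bins.
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracy = correct[in_bin].mean()        # acc(B_m)
            confidence = confidences[in_bin].mean()  # conf(B_m)
            ece += (in_bin.sum() / n) * abs(accuracy - confidence)
    return ece
```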


  43. 43
    Assaying Out-Of-Distribution Generalization in Transfer Learning
    Full Results
    The results presented hereafter are based on this correlation coefficient, aggregated along the horizontal axis, etc.


  44. 44
    Assaying Out-Of-Distribution Generalization in Transfer Learning
    Experimental Results (Results can significantly change for different shift types)
    Takeaway: ID and OOD accuracy only show a linear trend on specific tasks. We observe three additional settings: underspecification (vertical line), no generalization (horizontal line), and random generalization (large point cloud). We did not observe any trade-off between accuracy and robustness, where more accurate models would overfit to "spurious features" that do not generalize. Robustness methods have to be tested in many different settings. Currently, there seems to be no single method that is superior in all OOD settings.


  45. 45
    Assaying Out-Of-Distribution Generalization in Transfer Learning
    Experimental Results (What are good proxies for measuring robustness to distribution shifts?)
    Takeaway: Accuracy is the strongest ID predictor of OOD robustness and models that generalize well in distribution
    tend to also be more robust. Evaluating accuracy on additional held-out OOD data is an even stronger predictor.


  46. 46
    Assaying Out-Of-Distribution Generalization in Transfer Learning
    Experimental Results (On the transfer of metrics from ID to OOD data)
    Takeaway: Among all metrics adversarial robustness transfers best from ID to OOD data, which suggests that models
    respond similarly to adversarial attacks on ID and OOD data. Calibration transfers worst, which means that models
    that are well calibrated on ID data are not necessarily well calibrated on OOD data.


  47. 47
    05 Acknowledgement
    Kartik Ahuja
    Rio Yokota
    Ioannis Mitliagkas
    Kohta Ishikawa Ikuro Sato
    Tetsuya Motokawa
    Shiro Takagi
    Kilian Fatras
    Masanari Kimura Charles Guille-Escuret


  48. 48
    Thank you for listening


  49. 49
    Reference (1/5)
    Distribution Shift
    • Estimating and Explaining Model Performance When Both Covariates and Labels Shift
    • Diverse Weights Averaging for Out-of-Distribution Generalization
    • Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
    • Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts
    • Assaying Out-Of-Distribution Generalization in Transfer Learning
    • Improving Multi-Task Generalization via Regularizing Spurious Correlation
    • MEMO: Test Time Robustness via Adaptation and Augmentation
    • When are Local Queries Useful for Robust Learning?
    • The Missing Invariance Principle Found: the Reciprocal Twin of Invariant Risk Minimization
    • Adapting to Online Label Shift with Provable Guarantees
    • Hard ImageNet: Segmentations for Objects with Strong Spurious Cues
    • Invariance Learning based on Label Hierarchy
    • Hyperparameter Sensitivity in Deep Outlier Detection: Analysis and a Scalable Hyper-Ensemble Solution
    • Domain Generalization without Excess Empirical Risk
    • Representing Spatial Trajectories as Distributions
    • Multitasking Models are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve
    • SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG
    • Task Discovery: Finding the Tasks that Neural Networks Generalize on
    • Domain Adaptation under Open Set Label Shift
    • Explicit Tradeoffs between Adversarial and Natural Distributional Robustness
    • When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
    • NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation
    • RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out-of-Distribution Robustness
    • Unsupervised Learning under Latent Label Shift


  50. 50
    Reference (2/5)
    Calibration
    • Towards Improving Calibration in Object Detection Under Domain Shift
    • Single Model Uncertainty Estimation via Stochastic Data Centering
    • Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors
    • On Uncertainty, Tempering, and Data Augmentation in Bayesian Classification
    • Joint Entropy Search for Maximally-Informed Bayesian Optimization
    • Scalable Sensitivity and Uncertainty Analysis for Causal-Effect Estimates of Continuous-Valued Interventions
    Generalization
    • Neuron with Steady Response Leads to Better Generalization
    • Chroma-VAE: Mitigating Shortcut Learning with Generative Classifiers
    • Redundant representations help generalization in wide neural networks
    • What is a Good Metric to Study Generalization of Minimax Learners?
    • Rethinking Generalization in Few-Shot Classification
    • Generalization for multiclass classification in overparameterized linear models
    • LISA: Learning Interpretable Skill Abstractions from Language
    • Geoclidean: Few-Shot Generalization in Euclidean Geometry


  51. 51
    Reference (3/5)
    Regularization
    • On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity
    • The Effects of Regularisation and Data Augmentation are Class Dependent
    • Tikhonov Regularization is Optimal Transport Robust under Martingale Constraints
    • Feature Learning in L2-regularized DNNs: Attraction/Repulsion and Sparsity
    Adversarial Robustness
    • Noise attention learning: enhancing noise robustness by gradient scaling
    • Why do artificially generated data help adversarial robustness?
    • On the Adversarial Robustness of Mixture of Experts
    Optimizer
    • Target-based Surrogates for Stochastic Optimization
    • Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning
    • First-Order Algorithms for Min-Max Optimization in Geodesic Metric Spaces
    • Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization
    • Adam Can Converge Without Any Modification On Update Rules
    • Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization
    • A consistently adaptive trust-region method
    • Better SGD using Second-order Momentum


  52. 52
    Reference (4/5)
    Theory
    • Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization
    • Effects of Data Geometry in Early Deep Learning
    • Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting
    • Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials
    • What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
    • On Margin Maximization in Linear and ReLU Networks
    • A Combinatorial Perspective on the Optimization of Shallow ReLU Networks
    • Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers
    • On the Double Descent of Random Features Models Trained with SGD
    • Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization)
    • Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions
    • Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks
    • Chaotic Dynamics are Intrinsic to Neural Network Training with SGD
    • On the non-universality of deep learning: quantifying the cost of symmetry
    • High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation
    • Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
    Robustness
    • Where do Models go Wrong? Parameter-Space Saliency Maps for Explainability
    • What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
    • Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation
    • To be Robust or to be Fair: Towards Fairness in Adversarial Training


  53. 53
    Reference (5/5)
    Others
    • BLaDE: Robust Exploration via Diffusion Models
    • Active Learning Classifiers with Label and Seed Queries
    • Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel
    • Beyond Not-Forgetting: Continual Learning with Backward Knowledge Transfer
    • Learning Options via Compression
    • A Simple Decentralized Cross-Entropy Method
    • Turbocharging Solution Concepts: Solving NEs, CEs and CCEs with Neural Equilibrium Solvers
    • Defining and Characterizing Reward Hacking
    • Exploiting the Relationship Between Kendall's Rank Correlation and Cosine Similarity for Attribution Protection
    • Not All Bits have Equal Value: Heterogeneous Precisions via Trainable Noise
    • AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
    • TVLT: Textless Vision-Language Transformer
    • SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections
    • TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers
    • MoCoDA: Model-based Counterfactual Data Augmentation
    • Explainability Via Causal Self-Talk
    • Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
    • Dataset Inference for Self-Supervised Models
    • PDEBENCH: An Extensive Benchmark for Scientific Machine Learning
    • Pruning has a disparate impact on model accuracy