Hongseok Namkoong (Columbia University, New York, USA) On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets
Venue: Humboldt University of Berlin, Dorotheenstraße 24
AI Hongseok Namkoong (namkoong@gsb.columbia.edu), Decision, Risk, and Operations Division, Columbia Business School. Based on joint work with Tiffany Cai, Peng Cui, Jiashuo Liu, and Tianyu Wang.
optimization: Solve worst-case problem • Idea: Do well almost all the time, instead of on average! • Application of optimal transport, e.g., Kuhn, Esfahani, Nguyen, Shafieezadeh-Abadeh (2019)
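To make the "worst case instead of average" idea concrete, here is a minimal sketch. It does not use the optimal-transport (Wasserstein) ambiguity set cited on the slide; as a simpler stand-in it assumes a CVaR-style set in which an adversary may reweight the n observed losses, with no single weight exceeding 1/(alpha·n). The function names and the parameter `alpha` are illustrative, not taken from the talk.

```python
import numpy as np

def average_loss(losses):
    """Standard empirical risk: the plain average of the losses."""
    return float(np.mean(losses))

def cvar_worst_case_loss(losses, alpha):
    """Worst-case reweighted average loss (CVaR at level alpha).

    Assumption: the adversary picks weights q_i >= 0 summing to 1 with
    q_i <= 1 / (alpha * n). The maximizer puts the largest allowed mass
    on the largest losses, so a greedy pass over sorted losses suffices.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    sorted_losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    n = sorted_losses.size
    budget = alpha * n                 # number of "sample units" to cover
    full = int(np.floor(budget))       # samples that receive the full cap
    total = sorted_losses[:full].sum()
    if full < n:
        # leftover fractional mass goes to the next-largest loss
        total += (budget - full) * sorted_losses[full]
    return float(total / budget)

losses = [1.0, 2.0, 3.0, 4.0]
avg = average_loss(losses)
worst_half = cvar_worst_case_loss(losses, alpha=0.5)
```

With `alpha=1.0` the worst case coincides with the average; smaller `alpha` focuses on the hardest fraction of examples.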
have not contributed to any major success in ML/AI • My experience: good for last-layer interventions (e.g., fairness adjustments), but these ideas do not scale! ◦ Key issue: Data, data, data… • Today: What impact can theory-driven principles have in ML/AI?
red line? • Algorithmic interventions do not provide this robustness • Only larger training data does; as a result, recent works in AI largely focus on scaling data from the internet • No principled understanding of datasets • Caveat: This is a one-slide summary of an entire field; naturally, I omit nuances.
binding constraint outside of the internet • We cannot just “scale” data; need to understand which data to collect • To start, let’s examine implicit assumptions so far ◦ AI researchers focus on building a universally robust model, just like humans! ◦ Implicitly, this view focuses on covariate shift (X-shift), e.g., image recognition ◦ One-size-fits-all mindset
Y|X-shifts when there are unobserved factors whose distribution changes across time & space • X-shifts: changes in sampling, underrepresented groups • Y|X-shifts: changes in labeling, poorly chosen X, confounders
• Conjecture: Y|X-shifts are more prominent in practice • For Y|X-shifts, we don’t expect a single model to perform well across distributions • Requires application-specific understanding of distributional differences
Even tabular benchmarks mainly focus on X-shifts • Look at loss ratio of deployed model vs. best model for target
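The loss ratio mentioned here can be sketched as follows. This is an illustrative setup only: it assumes squared loss, a least-squares linear model standing in for both the deployed model and the "best model for target", and in-sample evaluation on the target data. None of these choices comes from the talk, and the function names are hypothetical.

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit a linear model by ordinary least squares (no intercept)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of a linear model on (X, y)."""
    return float(np.mean((X @ coef - y) ** 2))

def target_loss_ratio(X_train, y_train, X_target, y_target):
    """Target loss of the train-fitted model over that of the target-fitted model.

    A ratio near 1 means the deployed model is about as good as anything
    in the model class on the target; a large ratio means a model fit to
    the target would do much better, as happens under strong Y|X-shifts.
    """
    deployed = fit_least_squares(X_train, y_train)
    best_for_target = fit_least_squares(X_target, y_target)
    return mse(deployed, X_target, y_target) / mse(best_for_target, X_target, y_target)

# Synthetic Y|X-shift: same covariates, but the sign of the relationship flips.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y_train = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
y_target = -2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
ratio_shifted = target_loss_ratio(X, y_train, X, y_target)
ratio_same = target_loss_ratio(X, y_train, X, y_train)
```

Because the covariates are identical in both samples, any gap in this toy example comes from the change in Y|X alone.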
• Out of 169 train-target pairs with significant performance degradation, 80% are primarily attributed to Y|X-shifts. • CS benchmarking view breaks down: we can’t just compare models based on their out-of-distribution performance! • Infeasible to simultaneously perform well across train and target • We need to build an understanding of why the distribution changed! • WhyShift: https://github.com/namkoong-lab/whyshift
Accuracy-on-the-line doesn’t hold under strong Y|X-shifts • [Figure: ImageNet] • Cf. “Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization.”
• Train & target performance correlated only when X-shifts dominate
◦ They make assumptions about data distributions but do not check them ◦ Need application-specific understanding of real shift patterns • We need a modeling language for distribution shifts! One size fits all
ambiguity set arbitrary; primarily driven by mathematical convenience and details “left to the modeler” • Little thought given to model class DRO revisited
of algorithmic design knobs on model performance: • Model Class (Tree, Linear, MLP) • Ambiguity Set (Distance Type, Radius) • Shift Pattern (Y|X-ratio) • Validation Type (Average, Worst) • Task/State fixed effect
Predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance. Lower: Predict whether annual income > $50K Target performance: single state
whether a low-income individual, not eligible for Medicare, has coverage from public health insurance. Lower: Predict whether annual income > $50K Target performance: worst state
age subgroups: [20,25), [25,30), …, [75,100) • Consider DRO methods that model shifts on a subset of covariates • Variable selection for ambiguity set: top-k with largest subgroup differences • Performance varies a lot over variables selected [Figure: Marginal DRO and Wasserstein DRO panels; x-axis: k, up to all]
view ◦ Distribution shift: out-of-distribution performance is worse than in-distribution performance! ◦ But this just means P: train Q: target • Attribute performance degradation: not all shifts matter • Different shifts warrant different interventions Diagnosing Model Performance Under Distribution Shift https://github.com/namkoong-lab/disde https://arxiv.org/abs/2303.02011
given X: $E_Q[L \mid X]$ and $E_P[L \mid X]$ (L: loss, P: train, Q: target)
given X: $E_Q[L \mid X]$ and $E_P[L \mid X]$ • You can only compare Y|X on shared X: where P has no mass, $E_P[L \mid X]$ is not well-defined; where Q has no mass, $E_Q[L \mid X]$ is not well-defined
[Figure: density of X (X = age), with the shared distribution $S_X$ marked. L: loss, P: train, Q: target, S: shared]
Performance on the training distribution vs. performance on the target distribution $E_Q[E_Q[L \mid X]]$ • Decompose into X-shift vs. Y|X-shift
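The split into X-shift and Y|X-shift can be written as a telescoping sum through the shared distribution S. The display below is a sketch assembled from the terms that appear on these slides and from the DISDE paper linked above; the bracket labels are paraphrases, and the training performance is written as $E_P[E_P[L \mid X]]$ by analogy with the target term.

```latex
\begin{align*}
E_Q[E_Q[L \mid X]] - E_P[E_P[L \mid X]]
  &= \underbrace{E_S[E_P[L \mid X]] - E_P[E_P[L \mid X]]}_{\text{X-shift } (P \to S)} \\
  &\quad + \underbrace{E_S[E_Q[L \mid X]] - E_S[E_P[L \mid X]]}_{\text{Y|X-shift on } S} \\
  &\quad + \underbrace{E_Q[E_Q[L \mid X]] - E_S[E_Q[L \mid X]]}_{\text{X-shift } (S \to Q)}
\end{align*}
```

Each intermediate term is evaluated only where both conditional losses are defined, which is why the comparison of Y|X goes through S rather than directly between P and Q.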
Terms: $E_S[E_P[L \mid X]]$, $E_S[E_Q[L \mid X]]$, $E_Q[E_Q[L \mid X]]$, $E_P[E_Q[L \mid X]]$, $E_Q[E_P[L \mid X]]$ • Diagnosis: S has more X’s that are harder to predict than P • Potential interventions: Use domain adaptation, e.g., importance weighting
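Importance weighting, the intervention named here, can be sketched in a few lines. The example assumes the covariate densities under P and Q are known at each training point, which in practice they are not (they would have to be estimated), and it uses a self-normalized estimator. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def importance_weighted_loss(losses, p_density, q_density):
    """Estimate the target-average loss from training samples.

    Each training loss is weighted by q(x) / p(x), so covariate regions
    that are overrepresented in training are down-weighted. Weights are
    normalized to sum to one (self-normalized importance weighting).
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(q_density, dtype=float) / np.asarray(p_density, dtype=float)
    return float(np.sum(weights * losses) / np.sum(weights))

# Toy X-shift: young people are 80% of training data but 50% of the target.
# Assumed per-example losses: 0.1 for young, 0.5 for old.
losses = np.array([0.1] * 8 + [0.5] * 2)
p_density = np.array([0.8] * 8 + [0.2] * 2)   # P(group) at each training point
q_density = np.array([0.5] * 8 + [0.5] * 2)   # Q(group) at each training point

train_average = float(np.mean(losses))
target_estimate = importance_weighted_loss(losses, p_density, q_density)
```

This correction only helps with X-shift on the region where P has data; it cannot repair a change in Y|X.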
• Diagnosis: Y|X moves farther from predicted model • Potential interventions: Re-collect data or modify covariates
Decompose change in performance • Diagnosis: Q has “new” X’s that are harder to predict than S • Potential interventions: Collect + label more data on “new” examples
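The three diagnoses (P to S, Y|X on S, S to Q) can be computed numerically. This is a minimal sketch for a discrete covariate, not the estimator in the disde repository. It assumes the conditional losses under P and Q are given, and it assumes a shared distribution proportional to p(x)q(x)/(p(x)+q(x)), which is one possible choice that puts mass only where both P and Q do. All function and key names are hypothetical.

```python
import numpy as np

def shared_distribution(p, q):
    """Assumed shared distribution: s(x) proportional to p(x) q(x) / (p(x) + q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    denom = p + q
    s = np.divide(p * q, denom, out=np.zeros_like(denom), where=denom > 0)
    if s.sum() == 0:
        raise ValueError("P and Q have disjoint support; no shared X exists")
    return s / s.sum()

def expect(weights, cond_loss):
    """Average cond_loss under weights, skipping x where the weight is zero
    (cond_loss may be undefined there, e.g. NaN)."""
    weights = np.asarray(weights, dtype=float)
    cond_loss = np.asarray(cond_loss, dtype=float)
    mask = weights > 0
    return float(np.sum(weights[mask] * cond_loss[mask]))

def decompose_shift(p, q, loss_given_x_p, loss_given_x_q):
    """Split the train-to-target loss change into three telescoping terms."""
    s = shared_distribution(p, q)
    e_pp = expect(p, loss_given_x_p)   # training performance
    e_sp = expect(s, loss_given_x_p)   # P's Y|X on shared X
    e_sq = expect(s, loss_given_x_q)   # Q's Y|X on shared X
    e_qq = expect(q, loss_given_x_q)   # target performance
    return {
        "x_shift_P_to_S": e_sp - e_pp,
        "yx_shift_on_S": e_sq - e_sp,
        "x_shift_S_to_Q": e_qq - e_sq,
        "total": e_qq - e_pp,
    }

# Three age groups; P never sees the oldest, Q never sees the youngest.
nan = float("nan")
terms = decompose_shift(
    p=[0.5, 0.5, 0.0],
    q=[0.0, 0.5, 0.5],
    loss_given_x_p=[0.1, 0.2, nan],
    loss_given_x_q=[nan, 0.4, 0.3],
)
```

In this toy case most of the degradation is attributed to Y|X-shift on the shared group, so by the slides' logic reweighting alone would not be expected to help.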
Q: general population • Performance attributed to X shift (S → Q), meaning “new examples” such as older people
adaptation may be effective • Employment prediction case study [X shift] P: age ≤25 overrepresented, Q: evenly sampled population
missing covariate: education affects employment • Employment prediction case study [Y|X shift] P: West Virginia, Q: Maryland
[Figure panels: language features; With language features] • [Y|X shift] P: California (CA), Q: Puerto Rico (PR) • CA model does not use language. • Y|X shift because of missing covariate: language affects outcome → better performance in PR
X vs Y|X shift • Can help articulate modeling assumptions + data collection • We need a modeling language for a data-centric view of AI • Limitations: shared space not easy to understand in high dimensions • Optimal transport can provide a flexible modeling language • What is the right geometry to model distribution shifts?
Distribution Shift Decomposition (DISDE): Cai, Namkoong, and Yadlowsky, Diagnosing Model Performance Under Distribution Shift, Major revision in Operations Research, https://github.com/namkoong-lab/disde
Liu, Wang, Cui, and Namkoong, On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets, NeurIPS 2023, https://github.com/namkoong-lab/whyshift