Slide 1

Slide 1 text

We need a modeling language for a data-centric view of AI
Hongseok Namkoong ([email protected])
Decision, Risk, and Operations Division, Columbia Business School
Based on joint works with Tiffany Cai, Peng Cui, Jiashuo Liu, and Tianyu Wang

Slide 2

Slide 2 text

AI builds on data as infrastructure

Slide 3

Slide 3 text

Pattern recognition will reflect existing biases

Slide 4

Slide 4 text

● Standard approach: Solve the average-case risk minimization problem
● Distributionally robust optimization (DRO): Solve the worst-case problem
● Idea: Do well almost all the time, instead of on average!
● An application of optimal transport; e.g., Kuhn, Esfahani, Nguyen, Shafieezadeh-Abadeh (2019)
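In symbols (notation mine, not from the slide), the contrast between the two objectives is:

\[
\text{average-case:}\;\; \min_{\theta}\ \mathbb{E}_{P_0}\big[\ell(\theta; X, Y)\big]
\qquad\qquad
\text{worst-case (DRO):}\;\; \min_{\theta}\ \sup_{Q:\, D(Q, P_0) \le \rho} \mathbb{E}_{Q}\big[\ell(\theta; X, Y)\big],
\]

where \(P_0\) is the training distribution, \(\rho\) is the radius of the ambiguity set, and \(D\) is, e.g., a Wasserstein (optimal-transport) distance as in Kuhn et al. (2019).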

Slide 5

Slide 5 text

“Robust” AI
● Many algorithmic solutions toward robustness, generalization, and fairness
● The ones shown are just from my own body of work on the topic, so that I can dish on them later!

Slide 6

Slide 6 text

Self-reflections on my research
● While intellectually satisfying, these algorithms have not contributed to any major success in ML/AI
● My experience: good for last-layer interventions (e.g., fairness adjustments), but these ideas do not scale!
○ Key issue: Data, data, data…
● Today: What impact can theory-driven principles have in ML/AI?

Slide 7

Slide 7 text

[Figure: plot with axis label "Error"]

Slide 8

Slide 8 text

Slide credit: Ludwig Schmidt

Slide 9

Slide 9 text

ImageNet V2: big drop in accuracy on the new test set
● Slide credit: Ludwig Schmidt

Slide 10

Slide 10 text

Improving effective robustness
● How do we go up the red line? Algorithmic interventions do not provide this robustness
● Only larger training data does; as a result, recent works in AI largely focus on scaling data from the internet
● No principled understanding of datasets
Caveat: This is a one-slide summary of an entire field; naturally, I omit nuances.

Slide 11

Slide 11 text

Modeling language for datasets
● Cost of data collection is a binding constraint outside of the internet
● We cannot just “scale” data; we need to understand which data to collect
● To start, let’s examine the implicit assumptions so far
○ AI researchers focus on building a universally robust model, just like humans!
○ Implicitly, this view focuses on covariate shift (X-shift), e.g., image recognition
○ One-size-fits-all mindset

Slide 12

Slide 12 text

X-shifts vs. Y|X-shifts
● X-shifts: changes in sampling, underrepresented groups
● Y|X-shifts: changes in labeling, poorly chosen X, confounders
● We expect Y|X-shifts when there are unobserved factors whose distribution changes across time & space
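Writing the joint distribution as \(P(X, Y) = P(X)\,P(Y \mid X)\), the two kinds of shift can be stated as (notation mine):

\[
\text{X-shift:}\quad P_{\text{train}}(X) \neq P_{\text{target}}(X) \;\text{ while }\; P_{\text{train}}(Y \mid X) = P_{\text{target}}(Y \mid X),
\qquad
\text{Y|X-shift:}\quad P_{\text{train}}(Y \mid X) \neq P_{\text{target}}(Y \mid X).
\]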

Slide 13

Slide 13 text

X-shifts vs. Y|X-shifts
● We expect Y|X-shifts when there are unobserved factors whose distribution changes across time & space
● Conjecture: Y|X-shifts are more prominent in practice
● For Y|X-shifts, we don’t expect a single model to perform well across distributions
● Requires application-specific understanding of distributional differences

Slide 14

Slide 14 text

Even tabular benchmarks mainly focus on X-shifts
● Look at the loss ratio of the deployed model vs. the best model for the target

Slide 15

Slide 15 text

Even tabular benchmarks mainly focus on X-shifts
● Look at the loss ratio of the deployed model vs. the best model for the target
Liu, Wang, Cui, Namkoong, On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets
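A minimal sketch of this loss-ratio diagnostic, assuming tabular arrays and scikit-learn gradient-boosted trees; the model choice and the function name are illustrative, not from the paper or the WhyShift package:

```python
# Loss ratio: deployed model (trained on source P) vs. the best model for the target Q.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def loss_ratio(X_src, y_src, X_tgt, y_tgt, seed=0):
    # Model as deployed: trained only on the source distribution P.
    deployed = GradientBoostingClassifier(random_state=seed).fit(X_src, y_src)

    # "Best model for the target": retrain on target data, evaluate on held-out target data.
    X_fit, X_eval, y_fit, y_eval = train_test_split(
        X_tgt, y_tgt, test_size=0.3, random_state=seed)
    oracle = GradientBoostingClassifier(random_state=seed).fit(X_fit, y_fit)

    deployed_loss = log_loss(y_eval, deployed.predict_proba(X_eval))
    oracle_loss = log_loss(y_eval, oracle.predict_proba(X_eval))
    return deployed_loss / oracle_loss  # ratios well above 1 flag severe degradation
```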

Slide 16

Slide 16 text

WhyShift: https://github.com/namkoong-lab/whyshift
● 7 spatiotemporal and demographic shifts from 5 tabular datasets
● Out of 169 train-target pairs with significant performance degradation, 80% are primarily attributed to Y|X-shifts
● The CS benchmarking view breaks down: we can’t just compare models based on their out-of-distribution performance!
● Infeasible to simultaneously perform well across train and target
● We need to build an understanding of why the distribution changed!

Slide 17

Slide 17 text

Accuracy-on-the-line doesn’t hold under strong Y|X-shifts
● Train & target performance are correlated only when X-shifts dominate
[Figure: ImageNet accuracy-on-the-line plot]
References: Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization; On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets

Slide 18

Slide 18 text

Accuracy-on-the-line doesn’t hold under strong Y|X-shifts
● Train & target performance are correlated only when X-shifts dominate
References: Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization; On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets

Slide 19

Slide 19 text

One size fits all
● Existing algorithms (e.g., DRO) do not provide reliable gains
○ They make assumptions about data distributions but do not check them
○ Need application-specific understanding of real shift patterns
● We need a modeling language for distribution shifts!

Slide 20

Slide 20 text

DRO revisited
● Distributionally robust optimization: Solve the worst-case problem
● Choice of ambiguity set is arbitrary; primarily driven by mathematical convenience, with details “left to the modeler”
● Little thought given to the model class

Slide 21

Slide 21 text

Empirical analysis of 10,000+ DRO models
● Examine the impact of algorithmic design knobs on model performance:
○ Model class (tree, linear, MLP)
○ Ambiguity set (distance type, radius)
○ Shift pattern (Y|X-ratio)
○ Validation type (average, worst)
○ Task/state fixed effect
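One way to read this design (my reading of the slide, not necessarily the exact specification used in the analysis) is a regression of target performance on the design knobs with task/state fixed effects:

\[
\text{perf}_i \;=\; \beta_{\text{model class}(i)} \;+\; \beta_{\text{ambiguity set}(i)} \;+\; \beta_{\text{validation}(i)} \;+\; \gamma_{\text{shift pattern}(i)} \;+\; \alpha_{\text{task/state}(i)} \;+\; \varepsilon_i,
\]

where each \(\beta\) is a coefficient on a categorical design choice and \(\alpha\) absorbs task/state-level differences.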

Slide 22

Slide 22 text

Target performance: single state
● Model class most important!
● Trees >>> DRO ambiguity set

Slide 23

Slide 23 text

Target performance: single state
● Effect of ambiguity set is inconsistent across different outcomes
Upper panel: Predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance
Lower panel: Predict whether annual income > $50K

Slide 24

Slide 24 text

Target performance: worst state
● Even for worst-state performance, DRO is unreliable
Upper panel: Predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance
Lower panel: Predict whether annual income > $50K

Slide 25

Slide 25 text

Toward better ambiguity sets
● Consider covariate shifts induced by age subgroups: [20,25), [25,30), …, [75,100)
● Consider DRO methods that model shifts on only a subset of covariates
● Variable selection for the ambiguity set: top-k covariates with the largest subgroup differences
● Performance varies a lot over the variables selected
[Figure: Marginal DRO and Wasserstein DRO performance as k varies, including the all-variables setting]
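A minimal sketch of one way to implement the top-k selection, assuming a pandas DataFrame with an age-subgroup column; the standardized gap in subgroup means used as the "subgroup difference" score is my assumption, not necessarily the criterion used in the study:

```python
# Sketch: rank covariates by how much they differ across age subgroups,
# then keep the top-k as the variables the ambiguity set is allowed to shift.
import pandas as pd

def top_k_shift_variables(df: pd.DataFrame, group_col: str, k: int) -> list[str]:
    candidates = [c for c in df.columns if c != group_col]
    scores = {}
    for col in candidates:
        std = df[col].std()
        if std == 0:
            scores[col] = 0.0
            continue
        # Standardized gap between the most- and least-extreme subgroup means.
        group_means = df.groupby(group_col)[col].mean()
        scores[col] = (group_means.max() - group_means.min()) / std
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Usage: df has an "age_group" column with bins like "[20,25)", "[25,30)", ...
# selected = top_k_shift_variables(df, group_col="age_group", k=5)
```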

Slide 26

Slide 26 text

AI pipeline
● AI development cycle: data collection → model training → validation & monitoring

Slide 27

Slide 27 text

Today: A step toward a modeling language
● Current ML view
○ Distribution shift: out-of-distribution performance is worse than in-distribution performance!
○ But this just means performance under Q is worse than under P (P: train, Q: target)
● Attribute performance degradation: not all shifts matter
● Different shifts warrant different interventions
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 28

Slide 28 text

[Figure: densities p(x) and q(x) over X = age, and the expected loss given X, E_P[L|X] and E_Q[L|X]]
(L: loss, P: train, Q: target)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 29

Slide 29 text

You can only compare Y|X on shared X
[Figure: densities p(x) and q(x) over X = age, and the expected loss given X, E_P[L|X] and E_Q[L|X]; where P has no mass, E_P[L|X] is not well-defined, and where Q has no mass, E_Q[L|X] is not well-defined]
(L: loss, P: train, Q: target)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 30

Slide 30 text

Define shared distribution
[Figure: densities p(x) and q(x) over X = age, and the shared density s(x)]
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde
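One natural construction of the shared distribution S, which I believe is close to the paper's choice (an assumption; see the DISDE paper for the exact definition), places mass only where both P and Q do, with density proportional to the pointwise minimum:

\[
s(x) \;=\; \frac{\min\{p(x),\, q(x)\}}{\int \min\{p(x'),\, q(x')\}\, dx'}.
\]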

Slide 31

Slide 31 text

Decompose change in performance
● Performance on the training distribution: E_P[E_P[L|X]]
● Performance on the target distribution: E_Q[E_Q[L|X]]
● Decompose the gap into X-shift vs. Y|X-shift terms
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde
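Written out, the decomposition that the next three slides walk through is a telescoping sum through the shared distribution S:

\[
\mathbb{E}_Q\big[\mathbb{E}_Q[L \mid X]\big] - \mathbb{E}_P\big[\mathbb{E}_P[L \mid X]\big]
= \underbrace{\mathbb{E}_S\big[\mathbb{E}_P[L \mid X]\big] - \mathbb{E}_P\big[\mathbb{E}_P[L \mid X]\big]}_{\text{X-shift from } P \text{ to } S}
+ \underbrace{\mathbb{E}_S\big[\mathbb{E}_Q[L \mid X]\big] - \mathbb{E}_S\big[\mathbb{E}_P[L \mid X]\big]}_{\text{Y|X-shift on shared } X}
+ \underbrace{\mathbb{E}_Q\big[\mathbb{E}_Q[L \mid X]\big] - \mathbb{E}_S\big[\mathbb{E}_Q[L \mid X]\big]}_{\text{X-shift from } S \text{ to } Q}.
\]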

Slide 32

Slide 32 text

Decompose change in performance
[Diagram: decomposition terms E_P[E_P[L|X]], E_S[E_P[L|X]], E_S[E_Q[L|X]], E_Q[E_Q[L|X]], E_P[E_Q[L|X]], E_Q[E_P[L|X]]]
● Diagnosis: S has more X’s that are harder to predict than P
● Potential interventions: use domain adaptation, e.g., importance weighting
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 33

Slide 33 text

Decompose change in performance
[Diagram: decomposition terms E_P[E_P[L|X]], E_S[E_P[L|X]], E_S[E_Q[L|X]], E_Q[E_Q[L|X]], E_P[E_Q[L|X]], E_Q[E_P[L|X]]]
● Diagnosis: Y|X moves farther from the fitted model
● Potential interventions: re-collect data or modify covariates
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 34

Slide 34 text

Decompose change in performance
[Diagram: decomposition terms E_P[E_P[L|X]], E_S[E_P[L|X]], E_S[E_Q[L|X]], E_Q[E_Q[L|X]], E_P[E_Q[L|X]], E_Q[E_P[L|X]]]
● Diagnosis: Q has “new” X’s that are harder to predict than S
● Potential interventions: collect + label more data on “new” examples
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 35

Slide 35 text

Decompose change in performance
[Diagram: full decomposition with terms E_P[E_P[L|X]], E_S[E_P[L|X]], E_S[E_Q[L|X]], E_Q[E_Q[L|X]], E_P[E_Q[L|X]], E_Q[E_P[L|X]]]
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde
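A minimal sketch of how one might estimate the three terms, assuming per-example losses from P and Q, a domain classifier for the density ratio, and the pointwise-minimum shared distribution sketched above; this is my own plug-in illustration, not the estimator or API of the disde package:

```python
# Plug-in estimate of a DISDE-style three-term decomposition (illustrative sketch).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def decompose_shift(X_p, L_p, X_q, L_q, seed=0):
    # Conditional loss models: mu_P(x) ~ E_P[L|X=x], mu_Q(x) ~ E_Q[L|X=x].
    mu_p = GradientBoostingRegressor(random_state=seed).fit(X_p, L_p)
    mu_q = GradientBoostingRegressor(random_state=seed).fit(X_q, L_q)

    # Density ratio q(x)/p(x) via a probabilistic domain classifier (0 = P, 1 = Q).
    X_all = np.vstack([X_p, X_q])
    domain = np.concatenate([np.zeros(len(X_p)), np.ones(len(X_q))])
    clf = GradientBoostingClassifier(random_state=seed).fit(X_all, domain)

    def q_over_p(X):
        prob_q = np.clip(clf.predict_proba(X)[:, 1], 1e-3, 1 - 1e-3)
        return (prob_q / (1 - prob_q)) * (len(X_p) / len(X_q))

    # Self-normalized weights mapping P-samples (resp. Q-samples) onto S ~ min(p, q).
    w_p = np.minimum(1.0, q_over_p(X_p))
    w_p = w_p / w_p.sum()
    w_q = np.minimum(1.0, 1.0 / q_over_p(X_q))
    w_q = w_q / w_q.sum()

    perf_P = np.mean(L_p)                    # E_P[E_P[L|X]]
    perf_Q = np.mean(L_q)                    # E_Q[E_Q[L|X]]
    S_mu_P = np.dot(w_p, mu_p.predict(X_p))  # E_S[E_P[L|X]]
    S_mu_Q = np.dot(w_q, mu_q.predict(X_q))  # E_S[E_Q[L|X]]

    return {
        "x_shift_P_to_S": S_mu_P - perf_P,   # harder X's under S than under P
        "y_given_x_shift": S_mu_Q - S_mu_P,  # Y|X changes on shared X
        "x_shift_S_to_Q": perf_Q - S_mu_Q,   # "new" X's under Q
    }  # the three terms sum to perf_Q - perf_P
```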

Slide 36

Slide 36 text

Employment prediction case study [X shift]
● P: only age ≤ 25, Q: general population
● Performance degradation attributed to the X-shift term (S → Q), meaning “new examples” such as older people
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 37

Slide 37 text

Employment prediction case study [X shift]
● P: age ≤ 25 overrepresented, Q: evenly sampled population
● Substantial portion attributed to the X-shift term (P → S), suggesting domain adaptation may be effective
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 38

Slide 38 text

Employment prediction case study [Y|X shift]
● P: West Virginia (WV), Q: Maryland
● The WV model does not use education; the Y|X-shift arises because of this missing covariate: education affects employment
(L: loss, P: train, Q: target, S: shared)
Diagnosing Model Performance Under Distribution Shift, https://arxiv.org/abs/2303.02011, https://github.com/namkoong-lab/disde

Slide 39

Slide 39 text

Better data can be more effective than better algorithms!
● [Y|X shift] P: California (CA), Q: Puerto Rico (PR)
● The CA model does not use language; the Y|X-shift arises because of this missing covariate: language affects the outcome
● Adding language features → better performance in PR
[Figure panels: “No language features” vs. “With language features”]

Slide 40

Slide 40 text

We need a modeling language for a data-centric view of AI
● Distribution Shift Decomposition (DISDE): a diagnostic for understanding why performance dropped, in terms of X- vs. Y|X-shift
● Can help articulate modeling assumptions + data collection
● Limitations: the shared space is not easy to understand in high dimensions
● Optimal transport can provide a flexible modeling language
● What is the right geometry to model distribution shifts?
Cai, Namkoong, and Yadlowsky, Diagnosing Model Performance Under Distribution Shift, major revision in Operations Research, https://github.com/namkoong-lab/disde
Liu, Wang, Cui, and Namkoong, On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets, NeurIPS 2023, https://github.com/namkoong-lab/whyshift