Slide 1

Slide 1 text

We need a modeling language for a data-centric view of AI
Hongseok Namkoong
Decision, Risk, and Operations Division, Columbia Business School
Based on joint works with Tiffany Cai, Peng Cui, Jiashuo Liu, and Tianyu Wang

Slide 2

Slide 2 text

AI builds on data as infrastructure

Slide 3

Slide 3 text

Pattern recognition will reflect existing biases

Slide 4

Slide 4 text

● Standard approach: Solve average-case risk minimization
● Distributionally robust optimization: Solve worst-case problem
● Idea: Do well almost all the time, instead of on average!
Application of optimal transport, e.g., Kuhn, Esfahani, Nguyen, Shafieezadeh-Abadeh (2019)
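The slide's formulas did not survive extraction. For reference, the two problems it contrasts are commonly written as below. The notation is an assumption introduced here: parameters $\theta$, loss $\ell$, empirical distribution $\widehat{P}_n$, ambiguity set $\mathcal{P}$, Wasserstein distance $W$, and radius $\rho$. The Wasserstein ball is one common optimal-transport choice of ambiguity set, not necessarily the one shown on the slide.

```latex
\begin{align*}
\text{Average case:} \quad
  & \min_{\theta} \; \mathbb{E}_{(X,Y) \sim \widehat{P}_n}\big[\ell(\theta; X, Y)\big] \\
\text{Worst case (DRO):} \quad
  & \min_{\theta} \; \sup_{Q \in \mathcal{P}} \; \mathbb{E}_{(X,Y) \sim Q}\big[\ell(\theta; X, Y)\big],
  \qquad \mathcal{P} = \big\{ Q : W(Q, \widehat{P}_n) \le \rho \big\}
\end{align*}
```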

Slide 5

Slide 5 text

“Robust” AI
● Many algorithmic solutions toward robustness, generalization, and fairness
● These are just my body of work on the topic, so that I can dish on them later!

Slide 6

Slide 6 text

Self-reflections on my research
● While intellectually satisfying, these algos have not contributed to any major success in ML/AI
● My experience: good for last-layer interventions (e.g., fairness adjustments), but these ideas do not scale!
  ○ Key issue: Data, data, data…
● Today: What impact can theory-driven principles have in ML/AI?

Slide 7

Slide 7 text


Slide 8

Slide 8 text

Slide credit: Ludwig Schmidt

Slide 9

Slide 9 text

ImageNet V2
● Big drop
Slide credit: Ludwig Schmidt

Slide 10

Slide 10 text

Improving effective robustness
● How do we go up the red line? Algorithmic interventions do not provide this robustness
● Only larger training data does; as a result, recent works in AI largely focus on scaling data from the internet
● No principled understanding of datasets
Caveat: This is a one-slide summary of an entire field; naturally, I omit nuances.

Slide 11

Slide 11 text

Modeling language for datasets
● Cost of data collection is a binding constraint outside of the internet
● We cannot just “scale” data; we need to understand which data to collect
● To start, let’s examine implicit assumptions so far
  ○ AI researchers focus on building a universally robust model, just like humans!
  ○ Implicitly, this view focuses on covariate shift (X-shift), e.g., image recognition
  ○ One-size-fits-all mindset

Slide 12

Slide 12 text

X-shifts vs. Y|X-shifts
● On the other hand, we expect Y|X-shifts when there are unobserved factors whose distribution changes across time & space
X-shifts: changes in sampling, underrepresented groups
Y|X-shifts: changes in labeling, poorly chosen X, confounders

Slide 13

Slide 13 text

X-shifts vs. Y|X-shifts
● On the other hand, we expect Y|X-shifts when there are unobserved factors whose distribution changes across time & space
● Conjecture: Y|X-shifts are more prominent in practice
● For Y|X-shifts, we don’t expect a single model to perform well across distributions
● Requires application-specific understanding of distributional differences
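A toy illustration of the first and third bullets, under assumptions that are not from the talk: a binary unobserved factor U, a logistic outcome model, and a standard normal X that is identical in both environments. Only the prevalence of U changes, so there is no X-shift, yet P(Y=1 | X) moves and the model that is optimal on the source is no longer optimal on the target.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_y_given_x(x, p_u):
    """P(Y=1 | X=x) after marginalizing out the unobserved binary factor U.

    Assumed toy model: Y ~ Bernoulli(sigmoid(x + 2*U)), U ~ Bernoulli(p_u).
    """
    return (1.0 - p_u) * sigmoid(x) + p_u * sigmoid(x + 2.0)

def expected_log_loss(pred, truth, weights):
    """Average cross-entropy of predictions `pred` when labels follow `truth`."""
    ce = -(truth * np.log(pred) + (1.0 - truth) * np.log(1.0 - pred))
    return float(np.sum(weights * ce))

# Same covariate distribution in both environments: a discretized N(0, 1).
x = np.linspace(-4.0, 4.0, 401)
weights = np.exp(-0.5 * x**2)
weights /= weights.sum()

source = prob_y_given_x(x, p_u=0.1)  # U is rare where the model was trained
target = prob_y_given_x(x, p_u=0.9)  # U is common where the model is deployed

# Deploy the source-optimal predictor on the target and compare it with the
# best possible predictor for the target.
loss_deployed_on_target = expected_log_loss(source, target, weights)
loss_best_for_target = expected_log_loss(target, target, weights)
loss_ratio = loss_deployed_on_target / loss_best_for_target
```

Here `loss_ratio` exceeds 1 even though the distribution of X never changed, which is the sense in which no single model can serve both environments.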

Slide 14

Slide 14 text

Even tabular benchmarks mainly focus on X-shifts
● Look at loss ratio of deployed model vs. best model for target

Slide 15

Slide 15 text

Even tabular benchmarks mainly focus on X-shifts
● Look at loss ratio of deployed model vs. best model for target
Liu, Wang, Cui, Namkoong, On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets

Slide 16

Slide 16 text

WhyShift (arXiv, GitHub)
● 7 spatiotemporal and demographic shifts from 5 tabular datasets
● Out of 169 train-target pairs with significant performance degradation, 80% are primarily attributed to Y|X-shifts.
● CS benchmarking view breaks down: we can’t just compare models based on their out-of-distribution performance!
● Infeasible to simultaneously perform well across train and target
● We need to build an understanding of why the distribution changed!

Slide 17

Slide 17 text

Accuracy-on-the-line doesn’t hold under strong Y|X-shifts
● Train & target performance correlated only when X-shifts dominate
Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization.
On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets
ImageNet

Slide 18

Slide 18 text

Accuracy-on-the-line doesn’t hold under strong Y|X-shifts
● Train & target performance correlated only when X-shifts dominate
Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization.
On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets

Slide 19

Slide 19 text

One size fits all
● Existing algos (e.g. DRO) do not provide reliable gains
  ○ They make assumptions about data distributions but do not check them
  ○ Need application-specific understanding of real shift patterns
● We need a modeling language for distribution shifts!

Slide 20

Slide 20 text

DRO revisited
● Distributionally robust optimization: Solve worst-case problem
● Choice of ambiguity set arbitrary; primarily driven by mathematical convenience and details “left to the modeler”
● Little thought given to model class
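To make the "worst-case problem" and the arbitrariness of the ambiguity set concrete, here is a minimal sketch for one specific set, swapped in for illustration and not taken from the talk: all reweightings of the n training points in which no point gets more than 1/(alpha·n) mass. For that set the worst-case loss is the conditional value-at-risk (the mean of the worst alpha-fraction of losses). The function names and the `alpha` parameter are hypothetical.

```python
import numpy as np

def average_loss(losses):
    """Standard average-case objective."""
    return float(np.mean(losses))

def cvar_worst_case_loss(losses, alpha):
    """Worst-case mean loss over reweightings with per-sample mass <= 1/(alpha*n).

    The adversary piles as much mass as allowed onto the largest losses.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = np.sort(np.asarray(losses, dtype=float))[::-1]  # largest first
    n = ordered.size
    budget = alpha * n            # total mass, measured in units of samples
    k = int(np.floor(budget))     # samples that receive the full allowed mass
    total = ordered[:k].sum()
    if k < n:
        total += (budget - k) * ordered[k]  # fractional mass on the next sample
    return float(total / budget)

losses = np.array([1.0, 2.0, 3.0, 4.0])
avg = average_loss(losses)                    # alpha = 1 recovers this value
worst_half = cvar_worst_case_loss(losses, 0.5)
worst_quarter = cvar_worst_case_loss(losses, 0.25)
```

The objective changes with `alpha` alone, before any change of distance type, which is the modeling choice the slide describes as left to the modeler.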

Slide 21

Slide 21 text

Empirical analysis of 10,000+ DRO models
● Examine the impact of algorithmic design knobs on model performance
Model Class (Tree, Linear, MLP)
Ambiguity Set (Distance Type, Radius)
Shift Pattern (Y|X-ratio)
Validation Type (Average, Worst)
Task/State fixed effect

Slide 22

Slide 22 text

Target performance: single state
● Model class most important!
● Trees >>> DRO ambiguity set

Slide 23

Slide 23 text

Target performance: single state
● Effect of ambiguity set inconsistent across different outcomes
Upper: Predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance.
Lower: Predict whether annual income > $50K

Slide 24

Slide 24 text

Target performance: worst state
● Even for worst-state performance, DRO is unreliable
Upper: Predict whether a low-income individual, not eligible for Medicare, has coverage from public health insurance.
Lower: Predict whether annual income > $50K

Slide 25

Slide 25 text

Toward better ambiguity sets
● Consider covariate shifts induced by age subgroups: [20,25), [25,30), …, [75,100)
● Use DRO methods that model shifts on a subset of covariates
● Variable selection for ambiguity set: top-k with largest subgroup differences
● Performance varies a lot over variables selected
[Figure: panels for Marginal DRO and Wasserstein DRO; horizontal axis: number of selected variables k, up to all]

Slide 26

Slide 26 text

AI pipeline
AI development cycle: Data collection, Model training, Validation & Monitoring

Slide 27

Slide 27 text

Today: A step toward a modeling language
● Current ML view
  ○ Distribution shift: out-of-distribution performance is worse than in-distribution performance!
  ○ But this just means P ≠ Q
● Attribute performance degradation: not all shifts matter
● Different shifts warrant different interventions
P: train, Q: target
Diagnosing Model Performance Under Distribution Shift

Slide 28

Slide 28 text

[Figure over X = age. Panel 1: density of X under train and target, $P_X$ and $Q_X$. Panel 2: expected loss given X, $\mathbb{E}_Q[L \mid X]$ and $\mathbb{E}_P[L \mid X]$.]
L: loss, P: train, Q: target

Slide 29

Slide 29 text

You can only compare Y|X on shared X
[Figure over X = age. Panel 1: density of X under train and target, $P_X$ and $Q_X$. Panel 2: expected loss given X, $\mathbb{E}_Q[L \mid X]$ and $\mathbb{E}_P[L \mid X]$.]
Annotations: $\mathbb{E}_P[L \mid X]$ not well-defined and $\mathbb{E}_Q[L \mid X]$ not well-defined (each where the corresponding distribution puts no mass on X)

Slide 30

Slide 30 text

Define Shared Distribution
[Figure: two panels over X = age, each showing density of X: $P_X$ and $Q_X$, and the shared distribution $S_X$]
L: loss, P: train, Q: target, S: shared
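The slide's formula for S did not survive extraction. A minimal sketch of one possible construction on a discrete X, assuming the shared density is proportional to p(x)·q(x) / (p(x) + q(x)): this puts mass only where both train and target have mass. The function name, age grid, and probabilities below are hypothetical.

```python
import numpy as np

def shared_distribution(p, q):
    """Shared distribution S with s(x) proportional to p(x) q(x) / (p(x) + q(x)).

    This form is an assumption made for illustration.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    total = p + q
    # Where both p and q are zero, define s as zero instead of dividing by zero.
    s = np.divide(p * q, total, out=np.zeros_like(total), where=total > 0)
    if s.sum() == 0:
        raise ValueError("P and Q have disjoint supports; no shared distribution")
    return s / s.sum()

ages = np.array([20, 30, 40, 50, 60])
p = np.array([0.5, 0.3, 0.2, 0.0, 0.0])  # train: skews young, nobody over 40
q = np.array([0.0, 0.2, 0.2, 0.3, 0.3])  # target: skews old, nobody aged 20
s = shared_distribution(p, q)
```

In this example `s` is zero at ages 20, 50, and 60, so conditional losses are only ever compared at ages 30 and 40, where both are well-defined.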

Slide 31

Slide 31 text

Decompose change in performance
Performance on the training distribution: $\mathbb{E}_P[\mathbb{E}_P[L \mid X]]$
Performance on the target distribution: $\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]]$
Decompose into X-shift vs. Y|X-shift

Slide 32

Slide 32 text

Decompose change in performance
Quantities in the diagram: $\mathbb{E}_P[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_P[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_P[L \mid X]]$
Diagnosis: S has more X’s that are harder to predict than P
Potential interventions: Use domain adaptation, e.g. importance weighting
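A minimal sketch of the importance-weighting intervention named above, assuming a discrete X and a known density ratio; in practice the ratio has to be estimated. The function name and all numbers are hypothetical. Each training loss is reweighted by q(x)/p(x) so that the training sample mimics the covariate distribution it is being adapted to.

```python
import numpy as np

def importance_weighted_loss(losses, x_index, p, q):
    """Estimate the loss under covariate density q from samples drawn under p.

    losses:  per-sample losses on training data.
    x_index: integer index of each sample's X value.
    p, q:    densities over X values; p must be positive wherever samples exist.
    """
    losses = np.asarray(losses, dtype=float)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    weights = q[x_index] / p[x_index]
    return float(np.mean(weights * losses))

p = np.array([0.5, 0.5])            # train: both X values equally common
q = np.array([0.2, 0.8])            # shifted: the harder X value dominates
x_index = np.array([0, 0, 1, 1])    # training samples, matching p
losses = np.array([1.0, 1.0, 3.0, 3.0])

unweighted = float(np.mean(losses))
reweighted = importance_weighted_loss(losses, x_index, p, q)
```

The reweighted estimate matches the loss under q computed directly (0.2·1 + 0.8·3), whereas the unweighted average understates it. This only corrects an X-shift; it cannot help when Y|X itself changes.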

Slide 33

Slide 33 text

Decompose change in performance
Quantities in the diagram: $\mathbb{E}_P[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_P[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_P[L \mid X]]$
Diagnosis: Y|X moves farther from predicted model
Potential interventions: Re-collect data or modify covariates

Slide 34

Slide 34 text

Decompose change in performance
Quantities in the diagram: $\mathbb{E}_P[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_P[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_P[L \mid X]]$
Diagnosis: Q has “new” X’s that are harder to predict than S
Potential interventions: Collect + label more data on “new” examples

Slide 35

Slide 35 text

Decompose change in performance
Quantities in the diagram: $\mathbb{E}_P[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_P[L \mid X]]$, $\mathbb{E}_S[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_P[\mathbb{E}_Q[L \mid X]]$, $\mathbb{E}_Q[\mathbb{E}_P[L \mid X]]$
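The diagram itself is lost, but the quantities listed suggest a telescoping sum that passes through the shared distribution S. A sketch of that identity, with the labels on each term assumed from the three diagnoses in this part of the talk (S vs. P, Y|X, Q vs. S):

```latex
\begin{align*}
\mathbb{E}_Q\big[\mathbb{E}_Q[L \mid X]\big] - \mathbb{E}_P\big[\mathbb{E}_P[L \mid X]\big]
  &= \underbrace{\mathbb{E}_S\big[\mathbb{E}_P[L \mid X]\big] - \mathbb{E}_P\big[\mathbb{E}_P[L \mid X]\big]}_{X\text{-shift } (P \to S)} \\
  &\quad + \underbrace{\mathbb{E}_S\big[\mathbb{E}_Q[L \mid X]\big] - \mathbb{E}_S\big[\mathbb{E}_P[L \mid X]\big]}_{Y \mid X\text{-shift}} \\
  &\quad + \underbrace{\mathbb{E}_Q\big[\mathbb{E}_Q[L \mid X]\big] - \mathbb{E}_S\big[\mathbb{E}_Q[L \mid X]\big]}_{X\text{-shift } (S \to Q)}
\end{align*}
```

Every term on the right only evaluates $\mathbb{E}_P[L \mid X]$ under P or S and $\mathbb{E}_Q[L \mid X]$ under Q or S, so each is well-defined as long as S is supported where both P and Q have mass.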

Slide 36

Slide 36 text

Employment prediction case study
[X shift] P: only age ≤25, Q: general population
Performance degradation attributed to X shift (S → Q), meaning “new examples” such as older people
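A synthetic instance of this case, with made-up numbers rather than the study's data: P contains only ages ≤25, Q covers all ages, the conditional loss is identical wherever both are defined, and older ages are harder to predict. The shared density is assumed proportional to p·q/(p+q), and the three terms are the telescoping differences through S. All names and values are hypothetical.

```python
import numpy as np

def expect(weights, values):
    """Expectation of `values` under `weights`, skipping zero-weight points
    (where the conditional loss may be undefined)."""
    mask = weights > 0
    return float(np.sum(weights[mask] * values[mask]))

def decompose(p, q, s, loss_p, loss_q):
    """Return (X-shift P->S, Y|X-shift, X-shift S->Q)."""
    x_shift_p_to_s = expect(s, loss_p) - expect(p, loss_p)
    y_given_x_shift = expect(s, loss_q) - expect(s, loss_p)
    x_shift_s_to_q = expect(q, loss_q) - expect(s, loss_q)
    return x_shift_p_to_s, y_given_x_shift, x_shift_s_to_q

ages = np.array([20, 25, 40, 60])
p = np.array([0.5, 0.5, 0.0, 0.0])       # train: only age <= 25
q = np.array([0.25, 0.25, 0.25, 0.25])   # target: general population
s = p * q / (p + q)                       # assumed shared density (unnormalized)
s = s / s.sum()

# E_P[L|X] is undefined (NaN) for ages P never sees; E_Q[L|X] agrees with it
# on the young ages (no Y|X-shift) and is higher for the older ages.
loss_p = np.array([0.10, 0.10, np.nan, np.nan])
loss_q = np.array([0.10, 0.10, 0.40, 0.40])

terms = decompose(p, q, s, loss_p, loss_q)
total_gap = expect(q, loss_q) - expect(p, loss_p)
```

In this toy setup the first two terms vanish and the whole gap lands on the S → Q term, matching the diagnosis that the degradation comes from "new" examples.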

Slide 37

Slide 37 text

Employment prediction case study
[X shift] P: age ≤25 overrepresented, Q: evenly sampled population
Substantial portion attributed to X shift (P → S), suggesting domain adaptation may be effective

Slide 38

Slide 38 text

Employment prediction case study
[Y|X shift] P: West Virginia, Q: Maryland
WV model does not use education. Y|X shift because of missing covariate: education affects employment

Slide 39

Slide 39 text

Better data can be more effective than better algorithms!
[Y|X shift] P: California (CA), Q: Puerto Rico (PR)
CA model does not use language. Y|X shift because of missing covariate: language affects outcome
[Figure panels: No language features; With language features → better performance in PR]

Slide 40

Slide 40 text

We need a modeling language for a data-centric view of AI
Distribution Shift Decomposition (DISDE)
● Diagnostic for understanding why performance dropped in terms of X vs Y|X shift
● Can help articulate modeling assumptions + data collection
● Limitations: shared space not easy to understand in high dimensions
● Optimal transport can provide a flexible modeling language
● What is the right geometry to model distribution shifts?
Cai, Namkoong, and Yadlowsky, Diagnosing Model Performance Under Distribution Shift, Major revision in Operations Research
Liu, Wang, Cui, and Namkoong, On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets, NeurIPS 2023