Slide 1

Slide 1 text

Tabular foundation models for analytics: challenges and progress
Gaël Varoquaux

Slide 2

Slide 2 text

About me
- 15 years of machine learning for health and social sciences
- Co-founded scikit-learn
- I don’t like wearing ties
G Varoquaux 1

Slide 3

Slide 3 text

Tabular learning is everywhere
- Detecting credit-card fraud
- Evaluating major health risks
- Predicting airplane delays
- Detecting cyber attacks
- ...
Machine learning from tables has a huge impact on our lives

Slide 4

Slide 4 text

“AI”: amazing breakthroughs on images, sound, text
But the most precious data is in tables
Learning on tables is not as easy as it seems

Slide 5

Slide 5 text

1 Learning on tables
Sex  Experience  Age  Employee Position Title
M    10 yrs      42   Master Police Officer
F    23 yrs      NA   Social Worker IV
M    3 yrs       28   Police Officer III
F    16 yrs      45   Police Aide
M    13 yrs      48   Electrician I
M    6 yrs       36   Bus Operator
M    NA          62   Bus Operator

Slide 6

Slide 6 text

Tabular data
- Categorical data
- Columns of different nature, different distributions
sex  experience  age  company size
m    10          42   3
f    23          NA   121
m    3           28   12
f    16          45   23
m    13          48   3 231
m    6           36   593
m    NA          62   32
f    9           35   NA
f    NA          39   238

Slide 7

Slide 7 text

Tabular data: feature distributions
Distribution heterogeneity
Feature distributions:
- All different
- Very non-Gaussian (some long-tailed)

Slide 8

Slide 8 text

Good data preparation facilitates statistical modeling
Gaussianization, standardization, outlier removal...
[Table: the experience, age and company size columns of the example table]
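A minimal Python sketch of two of the preparation steps named on this slide: standardization and a rank-based Gaussianization. The function names and the company-size values are illustrative assumptions (values loosely taken from the example table, missing entries dropped, "3 231" read as 3231). In practice, library transformers would typically be used instead.

```python
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def gaussianize(values):
    """Rank-based Gaussianization: send each value's empirical quantile
    through the inverse CDF of a standard normal distribution.
    Ties are broken by input order (a simplification)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, index in enumerate(order):
        ranks[index] = rank
    normal = NormalDist()
    # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is defined.
    return [normal.inv_cdf((rank + 0.5) / n) for rank in ranks]


# Hypothetical long-tailed feature (company sizes, missing values removed).
company_size = [3, 121, 12, 23, 3231, 593, 32, 238]
standardized = standardize(company_size)
gaussianized = gaussianize(company_size)
```

Standardization keeps the long tail (the largest company remains an outlier), whereas the rank-based transform only keeps the ordering of the values.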

Slide 9

Slide 9 text


Slide 10

Slide 10 text

Chaining transformations to facilitate learning?
That sounds like a job for deep learning

Slide 11

Slide 11 text

On tabular data: tree models > deep learning [Grinsztajn... 2022]
Tree-based methods outperform tailored deep architectures
sklearn HistGradientBoosting...
Faster and better predictor

Slide 12

Slide 12 text

Tree models!?
- Feature-by-feature decision
- Sequence of simple decisions
- Natural categorical handling

Slide 13

Slide 13 text

Tree models!?
- Feature-by-feature decision
- Sequence of simple decisions
- Natural categorical handling
- Missing-value handling easier than imputing [Le Morvan... 2021, Morvan and Varoquaux 2024]
[Figure: decision tree whose splits (x10 < -1.5?, x2 < 2?, x7 < 0.3?, x1 < 0.5?) each send missing values down one branch (Yes/Missing or No/Missing), ending in a leaf that predicts +1.3]
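The missing-value routing shown in the tree figure can be sketched as below. This is a toy illustration, not the scikit-learn implementation: the tree layout, the leaf values other than +1.3, and the names `TREE`, `is_missing` and `predict` are assumptions made for the example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Hypothetical tree: each internal node tests `feature < threshold` and
# stores the branch that missing values follow; leaves are plain numbers.
TREE = {
    "feature": "x10", "threshold": -1.5, "missing_goes_left": True,
    "left": {
        "feature": "x2", "threshold": 2.0, "missing_goes_left": False,
        "left": 1.3, "right": -0.4,
    },
    "right": {
        "feature": "x7", "threshold": 0.3, "missing_goes_left": True,
        "left": 0.2, "right": -1.0,
    },
}


def predict(node, row):
    """Walk down the tree; a missing feature follows the node's default branch."""
    while isinstance(node, dict):
        value = row.get(node["feature"])
        if is_missing(value):
            go_left = node["missing_goes_left"]
        else:
            go_left = value < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node
```

For instance, `predict(TREE, {"x10": None, "x2": 1.0})` follows the "Yes/Missing" branch at the root and needs no imputation step.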

Slide 14

Slide 14 text

Tabular data’s geometry [Grinsztajn... 2022]
Feature distributions:
- All different
- Very non-Gaussian
Trees don’t care

Slide 15

Slide 15 text

Tabular data’s geometry [Grinsztajn... 2022]
Feature distributions:
- All different
- Very non-Gaussian
Euclidean distance?!
The data’s natural geometry is neither smooth nor vectorial

Slide 16

Slide 16 text

Deep learning struggles on tables
Gradient-boosted trees ♥
And scikit-learn is still the most popular⋆
⋆ https://www.kaggle.com/kaggle-survey-2022

Slide 17

Slide 17 text

Maybe I’m just living in the past
- Co-founded scikit-learn 15 years ago
- Trying to still be cool

Slide 18

Slide 18 text

The numbers don’t lie
Number of monthly downloads⋆: PyTorch 34M, scikit-learn 80M, pandas 282M
⋆ from pypistats.org

Slide 19

Slide 19 text

The numbers don’t lie
Number of monthly downloads⋆: PyTorch 34M, scikit-learn 80M, pandas 282M
Pandas!? Just a dataframe library
Data preparation > scikit-learn
⋆ from pypistats.org

Slide 20

Slide 20 text

Data preparation = more than numbers
- Open-ended strings?
- Information normalization?
- Merging in information from external tables?
Employer   Experience  Employee Position Title
LAPD       10 yrs      Master Police Officer
LA city    23 yrs      Social Worker IV
LAPD       3 yrs       Police Officer III
LAPD       16 yrs      Police Aide
NA         13 yrs      Electrician I
Greyhound  6 yrs       Bus Operator
Greyhound  NA          Bus Operator
LA city    9 yrs       Social Worker III

Slide 21

Slide 21 text

Strings!
More data preparation? A challenge? An opportunity
Deep learning is good at text
Sex  Experience  Age  Employee Position Title
M    10 yrs      42   Master Police Officer
F    23 yrs      NA   Social Worker IV
M    3 yrs       28   Police Officer III
F    16 yrs      45   Police Aide
M    13 yrs      48   Electrician I
M    6 yrs       36   Bus Operator
M    NA          62

Slide 22

Slide 22 text

2 Deep learning strikes back
Table foundation models

Slide 23

Slide 23 text

Successes of neural networks in vision, text...
- Large data
- Pretrained

Slide 24

Slide 24 text

Successes of neural networks in vision, text...
- Large data
- Pretrained
Can we bring background information to tables?

Slide 25

Slide 25 text

Enriching via strings
Real-estate market: predict the price of a property
Age  Surface  # rooms  City
102  76       4        Paris
18   123      6        Vitry
12   155      7        Reno
53   23       1        NYC
39   32       3        London
23   52       4        Vancouver
39   114      7        Prince Rupert

Slide 26

Slide 26 text

Enriching via strings
Recognized strings (entities) enable bringing in background information
Source table:
City   Pop.
Paris  2.2M
Vitry  33k
Enriched table:
City   Rent   Population
Paris  1100€  2.2M
Vitry  700€   33k
Paris  1300€  2.2M

Slide 27

Slide 27 text

Enriching faces representation and summarization challenges
- Varying available features, granularity of information
- The richer the data source, the worse it gets
Source tables:
City   Pop.
Paris  2.2M
Vitry  33k
Person ID  City   Salary
P1         Paris  50k€
P2         Paris  40k€
P3         Vitry  34k€
P4         Vitry  38k€
(GroupBy + Avg gives the mean salary per city)
City             Department
Paris            Paris
Vitry-sur-Seine  Val-de-Marne
Department    Poverty rate
Paris         15.2%
Val-de-Marne  13.3%
Enriched table:
City   Rent   Poverty rate  Population  Mean salary
Paris  1100€  15.2%         2.2M        45k€
Vitry  700€   13.3%         33k         36k€
Paris  1300€  15.2%         2.2M        45k€
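The "GroupBy + Avg" and multi-hop joins of this slide can be sketched in plain Python. The values come from the slide; the variable and function names are assumptions, and the city key is simplified to "Vitry" (the slide's department table spells it "Vitry-sur-Seine", which in real data would first require entity alignment).

```python
from collections import defaultdict
from statistics import mean

# Source tables, with values taken from the slide.
salaries = [("P1", "Paris", 50_000), ("P2", "Paris", 40_000),
            ("P3", "Vitry", 34_000), ("P4", "Vitry", 38_000)]
city_population = {"Paris": "2.2M", "Vitry": "33k"}
city_department = {"Paris": "Paris", "Vitry": "Val-de-Marne"}  # simplified key
department_poverty = {"Paris": 15.2, "Val-de-Marne": 13.3}
rents = [("Paris", 1100), ("Vitry", 700), ("Paris", 1300)]


def mean_salary_by_city(rows):
    """GroupBy + Avg: summarize a one-to-many table to one value per city."""
    groups = defaultdict(list)
    for _person_id, city, salary in rows:
        groups[city].append(salary)
    return {city: mean(values) for city, values in groups.items()}


def enrich(rent_rows):
    """Join each (city, rent) row with the city-level background information."""
    mean_salary = mean_salary_by_city(salaries)
    return [
        {
            "city": city,
            "rent": rent,
            # Two-hop join: city -> department -> poverty rate.
            "poverty_rate": department_poverty[city_department[city]],
            "population": city_population[city],
            "mean_salary": mean_salary[city],
        }
        for city, rent in rent_rows
    ]


enriched = enrich(rents)
```

Each extra source table needs its own join path and its own aggregation choice, which is the summarization challenge the slide points to.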

Slide 28

Slide 28 text

Semantic heterogeneity
- Heterogeneous features: surface, age, # rooms
- Discrete objects, high cardinality: 36 000 towns in France
- Different tables, different objects: properties, persons, cities, departments

Slide 29

Slide 29 text

Vectorial embeddings
Object → ℝ^p
- Capture implicit regularities
- Distill information in a relational data source
KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]

Slide 30

Slide 30 text

KEN: Overall approach [Cvetkov-Iliev... 2023]
Strategy: data represented as graphs (RDF triplets)
[Figure: database tables (City: Name, State; County: Name, Lat, St, Pop) converted to a triplet representation (head, relation, tail), then used for training with negative sampling]

Slide 31

Slide 31 text

KEN: Overall approach [Cvetkov-Iliev... 2023]
Strategy:
- Data represented as graphs (RDF triplets)
- Adapt knowledge-graph embedding approaches
- Capture relations and numerical attributes
[Figure: triplet representation (head, relation, tail); training of the embeddings with negative sampling (numerical attribute embedding, entity embedding, relation operator); analysis of the embeddings; transfer to a second database (County, Votes)]

Slide 32

Slide 32 text

From tables to (knowledge) graphs
Knowledge graph = list of triples (head, relation, tail) or (h, r, t), e.g. (Paris, capitalOf, France)
The two representations are (almost) equivalent:
Table representation (City is the “head” column):
City           Population  State
San Francisco  0.87M       California
San Diego      1.4M        California
Triple / knowledge-graph representation:
(San Francisco, Population, 0.87M)
(San Francisco, State, California)
(San Diego, Population, 1.4M)
(San Diego, State, California)
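The table-to-triples conversion of this slide can be sketched as follows. The function name `table_to_triples` is hypothetical, and skipping missing cells is an assumption (one way in which the two representations are only "almost" equivalent).

```python
def table_to_triples(rows, head_column):
    """Turn each row into (head, relation, tail) triples.

    The head is the value of `head_column`, every other column name is a
    relation, and the cell value is the tail. Missing cells (None) yield
    no triple.
    """
    triples = []
    for row in rows:
        head = row[head_column]
        for column, value in row.items():
            if column == head_column or value is None:
                continue
            triples.append((head, column, value))
    return triples


cities = [
    {"City": "San Francisco", "Population": "0.87M", "State": "California"},
    {"City": "San Diego", "Population": "1.4M", "State": "California"},
]
triples = table_to_triples(cities, head_column="City")
```

On the two rows of the slide, this yields the four triples listed there, in the same order.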

Slide 33

Slide 33 text

Entity embeddings: contextual
Contextual: two entities have close embeddings if they co-occur
- In NLP: word2vec = word co-occurrences
- In knowledge graphs: RDF2vec
Input triples:
(Facebook, FoundedIn, Massachusetts)
(Facebook, HeadquartersIn, California)
(MathWorks, FoundedIn, California)
(MathWorks, HeadquartersIn, Massachusetts)
(Google, FoundedIn, California)
(Google, HeadquartersIn, California)
(Apple, FoundedIn, California)
(Apple, HeadquartersIn, California)
[Figure: embeddings of the entities and relations above. a) Contextual: RDF2vec embeddings. b) Relational: knowledge-graph embeddings]

Slide 34

Slide 34 text

Entity embeddings: contextual < relational
Contextual: two entities have close embeddings if they co-occur
Relational: two entities are close if they have the same relations to other entities
Input triples:
(Facebook, FoundedIn, Massachusetts)
(Facebook, HeadquartersIn, California)
(MathWorks, FoundedIn, California)
(MathWorks, HeadquartersIn, Massachusetts)
(Google, FoundedIn, California)
(Google, HeadquartersIn, California)
(Apple, FoundedIn, California)
(Apple, HeadquartersIn, California)
[Figure: embeddings of the entities and relations above. a) Contextual: RDF2vec embeddings. b) Relational: knowledge-graph embeddings]

Slide 35

Slide 35 text

Knowledge-graph embeddings to capture relations
TransE [Bordes... 2013] represents relation r as a translation vector $r \in \mathbb{R}^p$ between entity embeddings h and t
Scoring function: $f(h, r, t) = -\lVert h + r - t \rVert$
[Figure: Paris and Rome mapped to France and Italy by the same capitalOf translation]
Training: optimize h, r, t to minimize a margin loss:
$L = \sum_{(h,r,t) \in G} \; \sum_{(h',t') \,:\, (h',r,t') \notin G,\ h'=h \text{ or } t'=t} \left[ f(h', r, t') - f(h, r, t) + \gamma \right]_+$
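A minimal sketch of the TransE scoring function and of one term of the margin loss, in plain Python. The 2-dimensional embeddings are hand-picked toy values, not trained ones, and the function names are assumptions.

```python
import math


def transe_score(h, r, t):
    """f(h, r, t) = -||h + r - t|| (Euclidean norm)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(positive, negative, gamma=1.0):
    """One term of the loss: [f(h', r, t') - f(h, r, t) + gamma]_+ ."""
    return max(0.0, transe_score(*negative) - transe_score(*positive) + gamma)


# Toy embeddings (assumed): capitalOf is the same translation for both pairs.
paris, france = (1.0, 1.0), (1.0, 2.0)
rome, italy = (3.0, 1.0), (3.0, 2.0)
capital_of = (0.0, 1.0)

true_triple = (paris, capital_of, france)
corrupted_triple = (paris, capital_of, italy)  # tail replaced: negative sample
```

Here the true triple scores 0 and the corrupted one scores -2, so the loss term vanishes as soon as the margin γ is below 2.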

Slide 36

Slide 36 text

KEN: embeddings to capture diverse information [Cvetkov-Iliev... 2023]
We use MuRE [Balazevic... 2019]: good e.g. for one-to-many relations
Scoring function: $f(h, r, t) = -d(\rho_r \odot h,\ t + r_r)^2 + b_h + b_t$
- $\rho_r \odot h$: contraction / projection
- $t + r_r$: translation
[Figure: embeddings of Google, Apple, Facebook, MathWorks, California and Massachusetts with the FoundedIn and HeadquartersIn relations]
Enables rich relational geometry
Different directions = different information
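A minimal sketch of the MuRE scoring function in plain Python, taking d to be the Euclidean distance. The embeddings and relation parameters are hand-picked toy values (assumptions), chosen to show how a contraction to zero along one direction lets several entities satisfy the same relation.

```python
def mure_score(h, t, rho_r, r_r, b_h=0.0, b_t=0.0):
    """f(h, r, t) = -d(rho_r * h, t + r_r)^2 + b_h + b_t, with d Euclidean."""
    scaled_head = [p * x for p, x in zip(rho_r, h)]      # contraction / projection
    translated_tail = [x + y for x, y in zip(t, r_r)]    # translation
    squared_distance = sum((a - b) ** 2 for a, b in zip(scaled_head, translated_tail))
    return -squared_distance + b_h + b_t


# Toy 2-d embeddings (assumed): the two companies differ only along axis 2.
google, apple = (1.0, 2.0), (1.0, -3.0)
california = (0.5, 0.0)

# Hypothetical HeadquartersIn parameters: axis 2 is contracted to zero.
rho_hq, r_hq = (1.0, 0.0), (0.5, 0.0)

score_google = mure_score(google, california, rho_hq, r_hq)
score_apple = mure_score(apple, california, rho_hq, r_hq)
```

With the contraction, both companies get the same (maximal) score for California; with `rho_r = (1.0, 1.0)`, a pure translation, they could not both match the same tail.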

Slide 37

Slide 37 text

Entity embeddings that distill background information
KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]
X ∈ ℝ^p
soda-inria.github.io/ken embeddings
6 million common entities: cities, people, companies...

Slide 38

Slide 38 text

Table foundation model
CARTE⋆
- Breakthrough: transfer across tables ⇒ pretraining
- Deep learning ≫ tree models
⋆ Context Aware Representation of Table Entries

Slide 39

Slide 39 text

Pretraining for data tables?
What prior for a bunch of numbers?
72  68  174  1
64  79  181  1
56  59  166  0
81  62  161  1

Slide 40

Slide 40 text

Pretraining for data tables?
What prior for a bunch of numbers?
72  68  174  1
64  79  181  1
56  59  166  0
81  62  161  1
And now?
Cardiovascular cohort
Age  Weight  Height  Comorbidity         Cardiovascular event
72   68      174     Diabetes            1
64   79      181     Cardiac arrhythmia  1
56   59      166     NA                  0
81   62      161     Asthma              1

Slide 41

Slide 41 text

Pretraining for data tables?
What prior for a bunch of numbers?
72  68  174  1
64  79  181  1
56  59  166  0
81  62  161  1
And now?
Cardiovascular cohort
Age  Weight  Height  Comorbidity         Cardiovascular event
72   68      174     Diabetes            1
64   79      181     Cardiac arrhythmia  1
56   59      166     NA                  0
81   62      161     Asthma              1
Column names and string entries contextualize the numbers

Slide 42

Slide 42 text

Data integration challenges
Cell level
- Entity alignment: Londres / London
- String / language modeling
Column level
- Different schemas
- Local relational structure

Slide 43

Slide 43 text

CARTE: graph representation [Kim... 2024]
Title   ISSN      Publisher                Country         Region            H index
Nature  14764687  Nature Publishing Group  United Kingdom  Western Europe    1331
JMLR    15337928  NaN                      United States   Northern America  239
[Figure: graph of the JMLR row. Feature initialization: language model, with ⊙ for numerical values]
Graph representation to bridge tables
- Cell values ⇒ string embeddings on nodes
- Column titles ⇒ string embeddings on edges

Slide 44

Slide 44 text

CARTE: graph-attention network [Kim... 2024]
Title   ISSN      Publisher                Country         Region            H index
Nature  14764687  Nature Publishing Group  United Kingdom  Western Europe    1331
JMLR    15337928  NaN                      United States   Northern America  239
[Figure: graph of the JMLR row. Feature initialization: language model, with ⊙ for numerical values, followed by graph attention]
Attention:
- varying # columns
- invariance to column order
New attention with relational information
Attention key & query: E ⊙ X (adapted from KEN), with E the edge and X the cell value

Slide 45

Slide 45 text

CARTE: graph-attention network [Kim... 2024]
Title   ISSN      Publisher                Country         Region            H index
Nature  14764687  Nature Publishing Group  United Kingdom  Western Europe    1331
JMLR    15337928  NaN                      United States   Northern America  239
[Figure: graph of the JMLR row. Feature initialization: language model, with ⊙ for numerical values, followed by graph attention]
Attention:
- varying # columns
- invariance to column order
New attention with relational information
Attention key & query: E ⊙ X (adapted from KEN), with E the edge and X the cell value
Pre-training
- Contrastive learning: actual graphlet vs negative
- Large knowledge base

Slide 46

Slide 46 text

CARTE predicts best!
Few-shot learning [Kim... 2024]
[Figure: a. Regression: 40 datasets. b. Classification: 11 datasets]
Legend:
- TabVec: skrub’s TableVectorizer
- XGB: XGBoost
- RF: RandomForest
- CN: Concat Numerical
- EN: Embed Numerical

Slide 47

Slide 47 text

CARTE predicts best!
With a marked benefit [Kim... 2024]
Rank over 42 datasets (lower rank = better prediction):
[9.047] CARTE
[11.218] S-LLM-CN-XGB-Bagging
[14.678] S-LLM-CN-XGB
[16.454] S-LLM-EN-XGB-Bagging
[16.929] CatBoost
[17.158] TabVec-Logistic
[17.455] S-LLM-CN-HGB-Bagging
[17.912] TabVec-Logistic-Bagging
[19.559] TabVec-LLM-XGB-Bagging
[20.287] S-LLM-EN-XGB
[20.903] CatBoost-Bagging
[21.058] TabVec-LLM-XGB
[21.901] TabVec-FT-XGB-Bagging
[22.191] TabVec-LLM-HGB-Bagging
[22.480] TabVec-XGB
[22.482] TabVec-XGB-Bagging
[22.552] S-LLM-EN-HGB-Bagging
[22.554] TarEnc-TabPFN

Slide 48

Slide 48 text

CARTE: learning across multiple tables [Kim... 2024]
Fine-tuning on multiple tables together
[Figure: ranking (axis from 8 to 2) of S-LLM, CatBoost and CARTE in three settings: single dataset, matched, not matched. Labels in extracted order: S-LLM-Single dataset, S-LLM-Not matched, S-LLM-Matched, CatBoost-Not matched, CatBoost-Single dataset, CARTE-Matched, CARTE-Not matched, CARTE-Single dataset, CatBoost-Matched]

Slide 49

Slide 49 text

CARTE naturally handles missingness [Kim... 2024]
Columns: train size (missing fraction)
Percentage decrease created by missing values
Methods     64 (0.1)    64 (0.3)    512 (0.1)   512 (0.3)
CARTE       13.28%      38.35%      10.19%      24.42%
CatBoost    21.70%      53.32%      12.23%      29.70%
TabVec-XGB  15.11%      51.27%      12.61%      30.35%
TabVec-RF   7.68%       44.43%      12.77%      29.79%
Normalized absolute score
Methods     64 (0.1)    64 (0.3)    512 (0.1)   512 (0.3)
CARTE       0.44(0.20)  0.29(0.18)  0.75(0.12)  0.61(0.14)
CatBoost    0.31(0.22)  0.17(0.17)  0.65(0.15)  0.50(0.15)
TabVec-XGB  0.19(0.20)  0.11(0.15)  0.65(0.17)  0.50(0.17)
TabVec-RF   0.23(0.21)  0.14(0.16)  0.63(0.15)  0.49(0.15)

Slide 50

Slide 50 text

Different benchmarks, different domains [Kim... 2024]
CARTE = tables with strings
                                                              CARTE’s benchmark  TabLLM’s benchmark
Fraction of numerical cols.                                   0.194              0.613
Fraction of cols. with |C| > 10 (high-cardinality categories)  0.625              0.043

Slide 51

Slide 51 text

Different benchmarks, different domains [Kim... 2024]
CARTE = tables with strings
                                                              CARTE’s benchmark  TabLLM’s benchmark
Fraction of numerical cols.                                   0.194              0.613
Fraction of cols. with |C| > 10 (high-cardinality categories)  0.625              0.043
Modeling numbers matters
[Figure: ranking on the TabLLM benchmark (axis from 4 to 1) of TabLLM, XGBoost, TabPFN and CARTE]
Need joint models of strings & numbers

Slide 52

Slide 52 text

3 A brave new world of foundation models

Slide 53

Slide 53 text

CARTE, deep learning
Table foundation model
- Powerful
- Resource hungry
Try it out!⋆
⋆ pypi.org/project/carte-ai

Slide 54

Slide 54 text

Resource usage matters

Slide 55

Slide 55 text

A worrying trend in AI [Varoquaux... 2024]
[Figure: training cost of a single run, 2010 to 2020, axis from $1 to $1M]
[Figure: inference FLOP, 2010 to 2020, axis from 10^6 to 10^15, compared with the FLOPS in a $100 GPU]
Ever-rising costs and footprint

Slide 56

Slide 56 text

A second look at those benchmarks [Kim... 2024]
Rank (lower rank = better prediction):
[9.047] CARTE
[11.218] S-LLM-CN-XGB-Bagging
[14.678] S-LLM-CN-XGB
[16.454] S-LLM-EN-XGB-Bagging
[16.929] CatBoost
[17.158] TabVec-Logistic
[17.455] S-LLM-CN-HGB-Bagging
[17.912] TabVec-Logistic-Bagging
[19.559] TabVec-LLM-XGB-Bagging
[20.287] S-LLM-EN-XGB
[20.903] CatBoost-Bagging
[21.058] TabVec-LLM-XGB
[21.901] TabVec-FT-XGB-Bagging
[22.191] TabVec-LLM-HGB-Bagging
[22.480] TabVec-XGB
[22.482] TabVec-XGB-Bagging
[22.552] S-LLM-EN-HGB-Bagging
[22.554] TarEnc-TabPFN
[22.573] TabVec-RandomForest
[22.610] TabVec-FT-XGB
CARTE is cool, but baselines are strong:
- Preprocessing matters
- LLMs + trees
- skrub’s TableVectorizer
- TableVectorizer + logistic regression

Slide 57

Slide 57 text

Exploring the trade-offs
CARTE, deep learning
- Powerful
- Difficult to install⋆
- Resource hungry
skrub
- Easy to install and adopt
- TableVectorizer
⋆ pypi.org/project/carte-ai

Slide 58

Slide 58 text

Tabular Learning
- Learning from tables is everywhere, and data preparation is key
- Tabular foundation models: background knowledge, implicit priors
- Skrub: lightweight tabular learning, skrub-data.org
@GaelVaroquaux

Slide 59

Slide 59 text

References I
J. Abécassis, E. Dumas, J. Alberge, and G. Varoquaux. From prediction to prescription: Machine learning and causal inference. 2024.
J. Alberge, V. Maladière, O. Grisel, J. Abécassis, and G. Varoquaux. Teaching models to survive: Proper scoring rule and stochastic optimization with competing risks. arXiv preprint arXiv:2406.14085, 2024.
I. Balazevic, C. Allen, and T. Hospedales. Multi-relational Poincaré graph embeddings. Neural Information Processing Systems, 32:4463, 2019.
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning, pages 1–34, 2023.

Slide 60

Slide 60 text

References II
K. Dadi, G. Varoquaux, J. Houenou, D. Bzdok, B. Thirion, and D. Engemann. Population modeling with machine learning can enhance measures of mental health. GigaScience, 10(10):giab071, 2021.
J. Dockès, G. Varoquaux, and J.-B. Poline. Preventing dataset shift from breaking machine-learning biomarkers. GigaScience, 10(9):giab055, 2021.
M. Doutreligne and G. Varoquaux. How to select predictive models for causal inference? 2023. URL https://hal.science/hal-03946902.
L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
J. Kim, L. Grinsztajn, and G. Varoquaux. CARTE: pretraining and transfer for tabular learning. ICML, 2024.

Slide 61

Slide 61 text

References III
M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict with missing values? NeurIPS, 2021.
M. L. Morvan and G. Varoquaux. Imputation for prediction: beware of diminishing returns. arXiv preprint arXiv:2407.19804, 2024.
X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.
G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):48, 2022.
G. Varoquaux, A. S. Luccioni, and M. Whittaker. Hype, sustainability, and the price of the bigger-is-better paradigm in AI. arXiv preprint arXiv:2409.14160, 2024.

Slide 62

Slide 62 text

References IV
J. K. Winkler, C. Fink, F. Toberer, A. Enk, T. Deinlein, R. Hofmann-Wellenhof, L. Thomas, A. Lallas, A. Blum, W. Stolz, ... Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatology, 155(10):1135–1141, 2019.