Table foundation models for analytics

Tabular foundation models for analytics: challenges and progress
Gaël Varoquaux

About me
15 years of machine learning for health and social sciences
Co-founded scikit-learn
I don’t like wearing ties

Tabular learning is everywhere
Detecting credit-card fraud
Evaluating major health risks
Predicting airplane delays
Detecting cyber attacks
...
Machine learning from tables has a huge impact on our lives

“AI”: amazing breakthroughs on images, sound, text
But the most precious data is in tables
Learning on tables is not as easy as it seems

1 Learning on tables

Sex  Experience  Age  Employee Position Title
M    10 yrs      42   Master Police Officer
F    23 yrs      NA   Social Worker IV
M    3 yrs       28   Police Officer III
F    16 yrs      45   Police Aide
M    13 yrs      48   Electrician I
M    6 yrs       36   Bus Operator
M    NA          62   Bus Operator

Tabular data
Categorical data
Columns of different nature, different distributions

sex  experience  age  company size
m    10          42   3
f    23          NA   121
m    3           28   12
f    16          45   23
m    13          48   3 231
m    6           36   593
m    NA          62   32
f    9           35   NA
f    NA          39   238

Tabular data: feature distributions
Distribution heterogeneity
Feature distributions: all different, very non-Gaussian (some long-tailed)

Good data preparation facilitates statistical modeling
Gaussianization, standardization, outlier removal...
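
As a rough illustration of the preparation steps named on this slide, here is a minimal sketch in plain Python. The helper names `standardize` and `gaussianize` are hypothetical, the rank-based Gaussianization is only one common way to do it, and reading the table entry `3 231` as 3231 is an assumption.

```python
from statistics import NormalDist, mean, pstdev

def standardize(values):
    """Center to zero mean and scale to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def gaussianize(values):
    """Map values to standard-normal quantiles through their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), as inv_cdf requires
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out

# Long-tailed "company size" column of the example table, missing value dropped
# (assumption: the entry "3 231" means 3231)
sizes = [3, 121, 12, 23, 3231, 593, 32, 238]
z = standardize(sizes)   # still dominated by the 3231 outlier
g = gaussianize(sizes)   # same ordering, bounded tails
```

Standardization keeps the long tail, whereas the rank-based transform compresses it while preserving the ordering of the rows.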

Chaining transformations to facilitate learning?
That sounds like a job for deep learning

On tabular data: tree models > deep learning [Grinsztajn... 2022]
Tree-based methods outperform tailored deep architectures
sklearn HistGradientBoosting...
Faster and better predictor

Tree models!?
Feature-by-feature decision
Sequence of simple decisions
Natural categorical handling
Missing-value handling easier than imputing [Le Morvan... 2021, Le Morvan and Varoquaux 2024]
[Figure: decision tree with splits x10 < -1.5, x2 < 2, x7 < 0.3 and x1 < 0.5; at each split, missing values follow one designated branch (Yes/Missing or No/Missing); one leaf predicts +1.3]
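
The idea of sending missing values down a designated branch can be sketched in a few lines of plain Python. This is an illustration only: the tree layout, thresholds and leaf values below are made up, and `predict` is a hypothetical helper rather than any library's API.

```python
import math

def is_missing(x):
    return x is None or (isinstance(x, float) and math.isnan(x))

# Internal node: (feature index, threshold, missing_goes_left, left, right)
# Leaf: a float prediction. Structure and numbers are illustrative only.
tree = (10, -1.5, True,
        (2, 2.0, False, 1.3, -0.4),
        (7, 0.3, True, 0.2, -0.9))

def predict(node, row):
    """Walk the tree; a missing feature follows the branch chosen at training."""
    while not isinstance(node, float):
        feature, threshold, missing_left, left, right = node
        value = row.get(feature)
        if is_missing(value):
            go_left = missing_left
        else:
            go_left = value < threshold
        node = left if go_left else right
    return node

complete = predict(tree, {10: -2.0, 2: 1.0})            # all features observed
partial = predict(tree, {2: 1.0})                        # x10 missing
mostly_missing = predict(tree, {10: float("nan"), 2: None})
```

No imputation step is needed: the row with missing entries still reaches a leaf.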

Tabular data’s geometry [Grinsztajn... 2022]
Feature distributions: all different, very non-Gaussian
Trees don’t care
Euclidean distance?!
The data’s natural geometry is neither smooth nor vectorial

Deep learning struggles on tables
Gradient-boosted trees ♥
And scikit-learn is still most popular⋆
⋆ https://www.kaggle.com/kaggle-survey-2022

Maybe I’m just living in the past
Co-founded scikit-learn 15 years ago
Trying to still be cool

The numbers don’t lie
Number of monthly downloads⋆
PyTorch       34M
scikit-learn  80M
pandas        282M
⋆ from pypistats.org
Pandas!? Just a dataframe library
Data preparation > scikit-learn

Data preparation = more than numbers
Open-ended strings?
Information normalization?
Merging in information from external tables?

Employer   Experience  Employee Position Title
LAPD       10 yrs      Master Police Officer
LA city    23 yrs      Social Worker IV
LAPD       3 yrs       Police Officer III
LAPD       16 yrs      Police Aide
NA         13 yrs      Electrician I
Greyhound  6 yrs       Bus Operator
Greyhound  NA          Bus Operator
LA city    9 yrs       Social Worker III

Strings!
More data preparation?
A challenge? An opportunity
Deep learning is good at text

2 Deep learning strikes back
Table foundation models

Successes of neural networks in vision, text...
Large data
Pretrained
Can we bring background information to tables?

Enriching via strings
Real-estate market
Predict price of a property

Age  Surface  # rooms  City
102  76       4        Paris
18   123      6        Vitry
12   155      7        Reno
53   23       1        NYC
39   32       3        London
23   52       4        Vancouver
39   114      7        Prince Rupert

Enriching via strings
Recognized strings (entities) enable bringing in background information

City   Rent
Paris  1100€
Vitry  700€
Paris  1300€

City   Pop.
Paris  2.2M
Vitry  33k

Joined on City:

City   Rent   Population
Paris  1100€  2.2M
Vitry  700€   33k
Paris  1300€  2.2M

Enriching faces representation and summarization challenges

City   Rent
Paris  1100€
Vitry  700€
Paris  1300€

City   Pop.
Paris  2.2M
Vitry  33k

Person ID  City   Salary
P1         Paris  50k€
P2         Paris  40k€
P3         Vitry  34k€
P4         Vitry  38k€

City             Department
Paris            Paris
Vitry-sur-Seine  Val-de-Marne

Department    Poverty rate
Paris         15.2%
Val-de-Marne  13.3%

Enriched table (salaries aggregated with GroupBy + Avg):

City   Rent   Poverty rate  Population  Mean salary
Paris  1100€  15.2%         2.2M        45k€
Vitry  700€   13.3%         33k         36k€
Paris  1300€  15.2%         2.2M        45k€

Varying available features, granularity of information
The richer the data source, the worse it gets
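
The enrichment on this slide (aggregate, then join on the city) can be sketched with plain Python dictionaries. This is only an illustration with the slide's numbers: variable names are hypothetical, and matching “Vitry” to “Vitry-sur-Seine” is assumed to be already solved, which is precisely one of the hard parts.

```python
from collections import defaultdict
from statistics import mean

rents = [("Paris", 1100), ("Vitry", 700), ("Paris", 1300)]   # euros
population = {"Paris": 2_200_000, "Vitry": 33_000}
salaries = [("P1", "Paris", 50), ("P2", "Paris", 40),
            ("P3", "Vitry", 34), ("P4", "Vitry", 38)]         # k euros
# Assumption: "Vitry" has already been aligned with "Vitry-sur-Seine"
city_to_department = {"Paris": "Paris", "Vitry": "Val-de-Marne"}
poverty_rate = {"Paris": 15.2, "Val-de-Marne": 13.3}          # percent

# GroupBy + Avg: one row per person becomes one number per city
by_city = defaultdict(list)
for _, city, salary in salaries:
    by_city[city].append(salary)
mean_salary = {city: mean(vals) for city, vals in by_city.items()}

# Join everything onto the rent table, following City -> Department when needed
enriched = [
    {"city": city,
     "rent": rent,
     "population": population[city],
     "poverty_rate": poverty_rate[city_to_department[city]],
     "mean_salary": mean_salary[city]}
    for city, rent in rents
]
```

Each source needs its own aggregation or chain of joins, which is why the manual approach scales poorly as the data source gets richer.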

Semantic heterogeneity
Heterogeneous features – surface, age, # rooms
Discrete objects, high cardinality – 36 000 towns in France
Different tables, different objects – properties, persons, cities, departments

Vectorial embeddings
Object → $\mathbb{R}^p$
Capture implicit regularities
Distill information in a relational data source
KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]

KEN: Overall approach [Cvetkov-Iliev... 2023]
[Figure: database tables (a City table and a County table) are turned into a triplet representation (head, relation, tail); embeddings are trained with negative sampling, combining a numerical attribute embedding, an entity embedding and a relation operator; the trained embeddings are then analysed and transferred to a second database (county votes)]
Strategy:
Data represented as graphs (RDF triplets)
Adapt knowledge-graph embedding approaches
Capture relations and numerical attributes

From tables to (knowledge) graphs
Knowledge graphs = list of triples (head, relation, tail) or (h, r, t), e.g. (Paris, capitalOf, France)

Table representation (“Head” column: City):

City           Population  State
San Francisco  0.87M       California
San Diego      1.4M        California

Triple / knowledge graph:

(San Francisco, Population, 0.87M)
(San Francisco, State, California)
(San Diego, Population, 1.4M)
(San Diego, State, California)

The two representations are (almost) equivalent
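
The conversion from a table to triples is mechanical once a “head” column is chosen. A minimal sketch, where `table_to_triples` is a hypothetical helper name and rows are plain dictionaries:

```python
def table_to_triples(rows, head_column):
    """One (head, relation, tail) triple per non-missing cell outside the head column."""
    triples = []
    for row in rows:
        head = row[head_column]
        for column, value in row.items():
            if column != head_column and value is not None:
                triples.append((head, column, value))
    return triples

cities = [
    {"City": "San Francisco", "Population": "0.87M", "State": "California"},
    {"City": "San Diego", "Population": "1.4M", "State": "California"},
]
triples = table_to_triples(cities, head_column="City")
for triple in triples:
    print(triple)
```

The column name plays the role of the relation, and the cell value that of the tail.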

Entity embeddings: contextual < relational
Contextual: two entities have close embeddings if they co-occur
- In NLP: word2vec = word co-occurrences
- In knowledge graphs: RDF2vec
Relational: two entities are close if they have the same relations to other entities

Input triples:
(Facebook, FoundedIn, Massachusetts)
(Facebook, HeadquartersIn, California)
(MathWorks, FoundedIn, California)
(MathWorks, HeadquartersIn, Massachusetts)
(Google, FoundedIn, California)
(Google, HeadquartersIn, California)
(Apple, FoundedIn, California)
(Apple, HeadquartersIn, California)

[Figure: a) Contextual: RDF2vec embeddings; b) Relational: knowledge graph embeddings; both panels place Google, Apple, Facebook, MathWorks, California and Massachusetts, with the FoundedIn and HeadquartersIn relations]

Knowledge-graph embeddings to capture relations
TransE [Bordes... 2013] represents relation r as a translation vector $r \in \mathbb{R}^p$ between entity embeddings h and t
Scoring function: $f(h, r, t) = -\|h + r - t\|$
[Figure: Paris, France, Rome and Italy, linked by the capitalOf translation]
Training: optimize h, r, t to minimize a margin loss:
$$\mathcal{L} = \sum_{\substack{(h, r, t) \in G \\ (h', r, t') \notin G,\ h' = h \text{ or } t' = t}} \left[ f(h', r, t') - f(h, r, t) + \gamma \right]_+$$
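
To make the scoring function and the margin loss concrete, here is a small sketch in plain Python. The 2-d embeddings are hand-picked toy values, not trained ones, and `transe_score` and `margin_loss` are hypothetical helper names.

```python
import math

def transe_score(h, r, t):
    """f(h, r, t) = -||h + r - t|| with the Euclidean norm."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(positive, negative, gamma=1.0):
    """[f(h', r, t') - f(h, r, t) + gamma]_+ for one positive/negative pair."""
    return max(0.0, transe_score(*negative) - transe_score(*positive) + gamma)

# Toy embeddings in which capitalOf is the same translation for both countries
paris, france = [1.0, 0.0], [1.0, 1.0]
rome, italy = [3.0, 0.0], [3.0, 1.0]
capital_of = [0.0, 1.0]

true_triple = (paris, capital_of, france)
corrupted = (paris, capital_of, italy)      # tail replaced: not in the graph

good = margin_loss(true_triple, corrupted)  # true triple scores higher by more than gamma
bad = margin_loss(corrupted, true_triple)   # roles swapped: positive loss
```

The loss is zero as soon as the true triple beats the corrupted one by at least the margin γ, so training only pushes on pairs that are still confused.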

KEN: embeddings to capture diverse information [Cvetkov-Iliev... 2023]
We use MuRE [Balazevic... 2019]
Good e.g. for one-to-many relations
Scoring function: $f(h, r, t) = -d(\rho_r \odot h,\ t + r_r)^2 + b_h + b_t$
- $\rho_r \odot h$: contraction / projection
- $t + r_r$: translation
[Figure: embeddings of Google, Apple, Facebook, MathWorks, California and Massachusetts, with the FoundedIn and HeadquartersIn relations]
Enables rich relational geometry
Different directions = different information
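
The scoring function above can be written out directly. The sketch below takes d to be the Euclidean distance and uses made-up 2-d vectors; `mure_score` is a hypothetical helper name, and the example is one reading of why a per-relation contraction helps when several entities share the same relation to one entity.

```python
def mure_score(h, t, rho_r, r_r, b_h=0.0, b_t=0.0):
    """f(h, r, t) = -d(rho_r * h, t + r_r)^2 + b_h + b_t, with d Euclidean."""
    sq_dist = sum((p * hi - (ti + ri)) ** 2
                  for p, hi, ti, ri in zip(rho_r, h, t, r_r))
    return -sq_dist + b_h + b_t

# Toy embeddings: the two companies differ only along the second dimension
google, apple = [1.0, 2.0], [1.0, -3.0]
california = [1.0, 0.0]
no_shift = [0.0, 0.0]

# A relation that contracts the second dimension to zero ignores that difference
project = [1.0, 0.0]
score_google = mure_score(google, california, project, no_shift)
score_apple = mure_score(apple, california, project, no_shift)

# Without contraction, the same triple is penalized
score_apple_plain = mure_score(apple, california, [1.0, 1.0], no_shift)
```

With a pure translation, two heads can only score perfectly against the same tail if their embeddings coincide; the contraction lets them stay distinct along directions the relation does not use.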

Entity embeddings that distill background information
KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]
$X \in \mathbb{R}^p$
soda-inria.github.io/ken embeddings
6 million common entities: cities, people, companies...

Table foundation model CARTE⋆
Breakthrough: transfer across tables ⇒ pretraining
Deep learning ≫ tree models
⋆ Context Aware Representation of Table Entries

Pretraining for data tables?
What prior for a bunch of numbers?

72  68  174  1
64  79  181  1
56  59  166  0
81  62  161  1

And now? Cardiovascular cohort

Age  Weight  Height  Comorbidity         Cardiovascular event
72   68      174     Diabetes            1
64   79      181     Cardiac arrhythmia  1
56   59      166     NA                  0
81   62      161     Asthma              1

Column names and string entries contextualize the numbers

Data integration challenges
Cell level
- Entity alignment: Londres / London
- String / language modeling
Column level
- Different schemas
- Local relational structure

CARTE: graph representation [Kim... 2024]

Title   ISSN      Publisher                Country         Region            H index
Nature  14764687  Nature Publishing Group  United Kingdom  Western Europe    1331
JMLR    15337928  NaN                      United States   Northern America  239

[Figure: the JMLR row drawn as a graph; nodes carry the cell values, edges carry the column names (Title, ISSN, Country, Region, H index); feature initialization: language model, with “⨀ Num. Values” for numerical cells]
Graph representation to bridge tables
Cell values ⇒ string embeddings on nodes
Column titles ⇒ string embeddings on edges
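
The row-to-graph idea can be sketched without any embedding model: each row becomes a small star-shaped graph whose edges are labelled by column names and whose leaves hold cell values. `row_to_graphlet` is a hypothetical helper, and skipping missing cells is an assumption of this sketch, not a statement about CARTE's code.

```python
import math

def row_to_graphlet(row):
    """Return (edge_label, node_value) pairs hanging off a central row node."""
    graphlet = []
    for column, value in row.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # assumption: a missing cell simply produces no node
        graphlet.append((column, value))
    return graphlet

nature = {"Title": "Nature", "ISSN": "14764687",
          "Publisher": "Nature Publishing Group", "Country": "United Kingdom",
          "Region": "Western Europe", "H index": 1331}
jmlr = {"Title": "JMLR", "ISSN": "15337928", "Publisher": float("nan"),
        "Country": "United States", "Region": "Northern America",
        "H index": 239}

g_nature = row_to_graphlet(nature)
g_jmlr = row_to_graphlet(jmlr)   # one node fewer: Publisher is missing
```

Because the graphlet is just a set of (column name, value) pairs, rows from tables with different schemas end up in the same kind of object, which is what lets one model read many tables.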

CARTE: graph-attention network [Kim... 2024]
[Figure: the Nature / JMLR example table and the graphlet of one row; feature initialization: language model, with “⨀ Num. Values” for numerical cells]
Graph attention
Attention:
- varying # columns
- invariance to column order
New attention with relational information
Attention key & query: E ⊙ X (adapted from KEN), with E the edge and X the cell value
Pre-training
Contrastive learning
Large knowledge base
[Figure: actual graphlet vs. negative]
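
The two properties listed for attention (any number of columns, no dependence on column order) can be checked on a toy version. The sketch below is not CARTE's actual layer: it builds queries, keys and values from E ⊙ X, leaves out all learned projection matrices, and mean-pools the result; every number is made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def graphlet_attention(nodes, edges):
    """Self-attention over the cells of one row, then mean pooling.

    nodes[i]: embedding X of a cell value; edges[i]: embedding E of its column.
    Queries, keys and values are all Z = E * X (element-wise) in this sketch.
    """
    z = [[e * x for e, x in zip(E, X)] for E, X in zip(edges, nodes)]
    dim = len(z[0])
    scale = math.sqrt(dim)
    pooled = [0.0] * dim
    for query in z:
        weights = softmax([sum(q * k for q, k in zip(query, key)) / scale
                           for key in z])
        for w, value in zip(weights, z):
            for d in range(dim):
                pooled[d] += w * value[d] / len(z)
    return pooled

# Toy 2-d embeddings for a row with three columns
nodes = [[1.0, 0.5], [0.2, -1.0], [0.7, 0.3]]
edges = [[0.5, 1.0], [1.0, 1.0], [-1.0, 2.0]]
out = graphlet_attention(nodes, edges)

# Same row with its columns shuffled
perm = [2, 0, 1]
out_shuffled = graphlet_attention([nodes[i] for i in perm],
                                  [edges[i] for i in perm])

# A row with only two columns goes through the same function
out_two = graphlet_attention(nodes[:2], edges[:2])
```

The output has the same size whatever the number of columns, and shuffling the columns changes it only by floating-point rounding.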

CARTE predicts best! Few-shot learning [Kim... 2024]
[Figure: a. Regression – 40 datasets; b. Classification – 11 datasets]
Legend:
TabVec – skrub’s TableVectorizer
XGB – XGBoost
RF – RandomForest
CN – Concat Numerical
EN – Embed Numerical

CARTE predicts best! With a marked benefit [Kim... 2024]
[Figure: rank over 42 datasets, axis running from 25 to 10, with an arrow pointing toward better prediction]
[9.047] CARTE
[11.218] S-LLM-CN-XGB-Bagging
[14.678] S-LLM-CN-XGB
[16.454] S-LLM-EN-XGB-Bagging
[16.929] CatBoost
[17.158] TabVec-Logistic
[17.455] S-LLM-CN-HGB-Bagging
[17.912] TabVec-Logistic-Bagging
[19.559] TabVec-LLM-XGB-Bagging
[20.287] S-LLM-EN-XGB
[20.903] CatBoost-Bagging
[21.058] TabVec-LLM-XGB
[21.901] TabVec-FT-XGB-Bagging
[22.191] TabVec-LLM-HGB-Bagging
[22.480] TabVec-XGB
[22.482] TabVec-XGB-Bagging
[22.552] S-LLM-EN-HGB-Bagging
[22.554] TarEnc-TabPFN
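
Assuming the bracketed numbers are ranks averaged over datasets (lower is better), such a summary can be computed as follows. The scores are made up for illustration, `average_ranks` is a hypothetical helper, and the sketch assumes every method is scored on every dataset with no ties.

```python
def average_ranks(scores):
    """scores: {dataset: {method: score}}, higher score is better; rank 1 = best."""
    totals = {}
    for per_method in scores.values():
        ordered = sorted(per_method, key=per_method.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(scores) for method, total in totals.items()}

# Made-up scores for three hypothetical methods on two datasets
toy_scores = {
    "dataset_1": {"A": 0.9, "B": 0.8, "C": 0.7},
    "dataset_2": {"A": 0.6, "B": 0.7, "C": 0.5},
}
ranks = average_ranks(toy_scores)
```

Ranks put datasets with very different score scales on a common footing, at the price of hiding how large each gap actually is.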

CARTE: learning across multiple tables [Kim... 2024]
Fine-tuning on multiple tables together
[Figure: ranking, axis running from 8 to 2, of the methods S-LLM-Single dataset, S-LLM-Not matched, S-LLM-Matched, CatBoost-Not matched, CatBoost-Single dataset, CARTE-Matched, CARTE-Not matched, CARTE-Single dataset, CatBoost-Matched]

CARTE naturally handles missingness [Kim... 2024]

Percentage decrease created by missing values
Train size (missing fraction)  64 (0.1)    64 (0.3)    512 (0.1)   512 (0.3)
CARTE                          13.28%      38.35%      10.19%      24.42%
CatBoost                       21.70%      53.32%      12.23%      29.70%
TabVec-XGB                     15.11%      51.27%      12.61%      30.35%
TabVec-RF                      7.68%       44.43%      12.77%      29.79%

Normalized absolute score
Train size (missing fraction)  64 (0.1)    64 (0.3)    512 (0.1)   512 (0.3)
CARTE                          0.44(0.20)  0.29(0.18)  0.75(0.12)  0.61(0.14)
CatBoost                       0.31(0.22)  0.17(0.17)  0.65(0.15)  0.50(0.15)
TabVec-XGB                     0.19(0.20)  0.11(0.15)  0.65(0.17)  0.50(0.17)
TabVec-RF                      0.23(0.21)  0.14(0.16)  0.63(0.15)  0.49(0.15)

Different benchmarks, different domains [Kim... 2024]
CARTE = tables with strings

                                  CARTE’s benchmark  TabLLM’s benchmark
Fraction of numerical cols.       0.194              0.613
Fraction of cols. with |C| > 10   0.625              0.043
(|C| > 10: high-cardinality categories)

Modeling numbers matters
[Figure: on the TabLLM benchmark, ranking (axis running from 4 to 1) of TabLLM, XGBoost, TabPFN and CARTE]
Need joint models of strings & numbers

3 A brave new world of foundation models

CARTE, deep learning
Table foundation model
Powerful
Try it out!⋆
Resource hungry
⋆ pypi.org/project/carte-ai

Resource usage matters

A worrying trend in AI [Varoquaux... 2024]
[Figure, 2010 to 2020: training cost of a single run, on an axis from $1 to $1M; inference FLOP compared with the FLOPS in a $100 GPU, on an axis from 10^6 to 10^15]
Ever-rising costs, and footprint

A second look at those benchmarks [Kim... 2024]
[Figure: ranks, axis running from 25 to 10, with an arrow pointing toward better prediction]
[9.047] CARTE
[11.218] S-LLM-CN-XGB-Bagging
[14.678] S-LLM-CN-XGB
[16.454] S-LLM-EN-XGB-Bagging
[16.929] CatBoost
[17.158] TabVec-Logistic
[17.455] S-LLM-CN-HGB-Bagging
[17.912] TabVec-Logistic-Bagging
[19.559] TabVec-LLM-XGB-Bagging
[20.287] S-LLM-EN-XGB
[20.903] CatBoost-Bagging
[21.058] TabVec-LLM-XGB
[21.901] TabVec-FT-XGB-Bagging
[22.191] TabVec-LLM-HGB-Bagging
[22.480] TabVec-XGB
[22.482] TabVec-XGB-Bagging
[22.552] S-LLM-EN-HGB-Bagging
[22.554] TarEnc-TabPFN
[22.573] TabVec-RandomForest
[22.610] TabVec-FT-XGB
CARTE is cool, but baselines are strong:
Preprocessing matters
LLMs + trees
skrub’s TableVectorizer
TableVectorizer + logistic regression

Exploring the trade-offs
CARTE, deep learning
- Powerful
- Difficult to install⋆
- Resource hungry
skrub
- Easy install and adopt
- TableVectorizer
⋆ pypi.org/project/carte-ai

Tabular Learning
Learning from tables is everywhere
- And data preparation is key
Tabular foundation models: background knowledge, implicit priors
Skrub: lightweight tabular-learning
skrub-data.org
@GaelVaroquaux

References I
J. Abécassis, E. Dumas, J. Alberge, and G. Varoquaux. From prediction to prescription: Machine learning and causal inference. 2024.
J. Alberge, V. Maladière, O. Grisel, J. Abécassis, and G. Varoquaux. Teaching models to survive: Proper scoring rule and stochastic optimization with competing risks. arXiv preprint arXiv:2406.14085, 2024.
I. Balazevic, C. Allen, and T. Hospedales. Multi-relational Poincaré graph embeddings. Neural Information Processing Systems, 32:4463, 2019.
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning, pages 1–34, 2023.

References II
K. Dadi, G. Varoquaux, J. Houenou, D. Bzdok, B. Thirion, and D. Engemann. Population modeling with machine learning can enhance measures of mental health. GigaScience, 10(10):giab071, 2021.
J. Dockès, G. Varoquaux, and J.-B. Poline. Preventing dataset shift from breaking machine-learning biomarkers. GigaScience, 10(9):giab055, 2021.
M. Doutreligne and G. Varoquaux. How to select predictive models for causal inference? 2023. URL https://hal.science/hal-03946902.
L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
J. Kim, L. Grinsztajn, and G. Varoquaux. CARTE: pretraining and transfer for tabular learning. ICML, 2024.

References III
M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict with missing values? NeurIPS, 2021.
M. Le Morvan and G. Varoquaux. Imputation for prediction: beware of diminishing returns. arXiv preprint arXiv:2407.19804, 2024.
X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.
G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):48, 2022.
G. Varoquaux, A. S. Luccioni, and M. Whittaker. Hype, sustainability, and the price of the bigger-is-better paradigm in AI. arXiv preprint arXiv:2409.14160, 2024.

References IV
J. K. Winkler, C. Fink, F. Toberer, A. Enk, T. Deinlein, R. Hofmann-Wellenhof, L. Thomas, A. Lallas, A. Blum, W. Stolz, ... Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatology, 155(10):1135–1141, 2019.
