Table foundation models for analytics

Gael Varoquaux
December 15, 2024

Deep learning typically does not outperform tree-based models on tabular data. This is often explained by the small size of such datasets. For images, sound, and text, the solution has been pretrained models, leading to foundation models that are adapted and reused for many tasks. I will discuss the challenges of bringing these ideas to tabular learning, and the progress that we have made, leading to a first tabular foundation model: the CARTE tabular model.

For machine learning on tables, CARTE outperforms the best baselines, including the combination of XGBoost with large language models.


Transcript

  1. About me

    15 years of machine learning for health and social sciences. Co-founded scikit-learn. I don’t like wearing ties. G Varoquaux
  2. Tabular learning is everywhere

    Detecting credit-card fraud. Evaluating major health risks. Predicting airplane delays. Detecting cyber attacks... Machine learning from tables has a huge impact on our lives.
  3. “AI”: amazing breakthroughs on images, sound, text

    But the most precious data is in tables. Learning on tables is not as easy as it seems.
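  4. Part 1: Learning on tables

    | Sex | Experience | Age | Employee Position Title |
    |-----|------------|-----|-------------------------|
    | M | 10 yrs | 42 | Master Police Officer |
    | F | 23 yrs | NA | Social Worker IV |
    | M | 3 yrs | 28 | Police Officer III |
    | F | 16 yrs | 45 | Police Aide |
    | M | 13 yrs | 48 | Electrician I |
    | M | 6 yrs | 36 | Bus Operator |
    | M | NA | 62 | Bus Operator |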
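  5. Tabular data

    Categorical data. Columns of different nature, different distributions.

    | sex | experience | age | company size |
    |-----|------------|-----|--------------|
    | m | 10 | 42 | 3 |
    | f | 23 | NA | 121 |
    | m | 3 | 28 | 12 |
    | f | 16 | 45 | 23 |
    | m | 13 | 48 | 3 231 |
    | m | 6 | 36 | 593 |
    | m | NA | 62 | 32 |
    | f | 9 | 35 | NA |
    | f | NA | 39 | 238 |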
  6. Good data preparation facilitates statistical modeling

    Gaussianization, standardization, outlier removal...

    | experience | age | size |
    |------------|-----|------|
    | 10 | 42 | 3 |
    | 23 | NA | 121 |
    | 3 | 28 | 12 |
    | 16 | 45 | 23 |
    | 13 | 48 | 3 231 |
    | 6 | 36 | 593 |
    | NA | 62 | 32 |
    | 9 | 35 | NA |
    | NA | 39 | 238 |
  7. On tabular data: tree models > deep learning [Grinsztajn... 2022]

    Tree-based methods outperform tailored deep architectures. sklearn HistGradientBoosting...: faster and better predictor.
  8. Tree models!?

    Feature-by-feature decisions. Sequence of simple decisions. Natural categorical handling.
  9. Tree models!?

    Feature-by-feature decisions. Sequence of simple decisions. Natural categorical handling. Missing-value handling, easier than imputing [Le Morvan... 2021, Morvan and Varoquaux 2024].

    [Figure: decision tree with splits x10 < -1.5, x2 < 2, x7 < 0.3 and x1 < 0.5; each split sends missing values down one of its branches (Yes/Missing or No/Missing); example leaf: Predict +1.3]
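The routing of missing values drawn on this slide can be sketched in a few lines. This is a minimal illustration, not scikit-learn's implementation: the dictionary-based tree, the `missing_left` flag, the thresholds and the leaf values are all assumptions made up for the example.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def predict(node, x):
    """Route sample x down the tree; each split says where missing values go."""
    while isinstance(node, dict):
        value = x[node["feature"]]
        if is_missing(value):
            go_left = node["missing_left"]      # the "Yes/Missing" or "No/Missing" branch
        else:
            go_left = value < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node                                 # a leaf is a plain number

# Hypothetical tree: the root sends missing values to "yes", the second split to "no".
tree = {
    "feature": 0, "threshold": -1.5, "missing_left": True,
    "left": {"feature": 1, "threshold": 2.0, "missing_left": False,
             "left": 1.3, "right": -0.4},
    "right": 0.2,
}

print(predict(tree, [-2.0, 1.0]))           # both tests answer "yes"
print(predict(tree, [float("nan"), 1.0]))   # missing first feature follows "yes"
print(predict(tree, [-2.0, None]))          # missing second feature follows "no"
```

No imputation step is needed: the missing entry is handled by the split itself.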
  10. Tabular data’s geometry [Grinsztajn... 2022]

    Feature distributions: all different, very non-Gaussian. Euclidean distance?! The data’s natural geometry is neither smooth nor vectorial.
  11. Deep learning struggles on tables

    Gradient-boosted trees ♥. And scikit-learn is still most popular⋆.

    ⋆ https://www.kaggle.com/kaggle-survey-2022
  12. Maybe I’m just living in the past

    Co-founded scikit-learn 15 years ago. Trying to still be cool.
  13. The numbers don’t lie

    Number of monthly downloads⋆: PyTorch 34M; scikit-learn 80M; pandas 282M.

    ⋆ from pypistats.org
  14. The numbers don’t lie

    Number of monthly downloads⋆: PyTorch 34M; scikit-learn 80M; pandas 282M. Pandas!? Just a dataframe library. Data preparation > scikit-learn.

    ⋆ from pypistats.org
  15. Data preparation = more than numbers

    Open-ended strings? Information normalization? Merging in information from external tables?

    | Employer | Experience | Employee Position Title |
    |----------|------------|-------------------------|
    | LAPD | 10 yrs | Master Police Officer |
    | LA city | 23 yrs | Social Worker IV |
    | LAPD | 3 yrs | Police Officer III |
    | LAPD | 16 yrs | Police Aide |
    | NA | 13 yrs | Electrician I |
    | Greyhound | 6 yrs | Bus Operator |
    | Greyhound | NA | Bus Operator |
    | LA city | 9 yrs | Social Worker III |
  16. Strings!

    More data preparation? A challenge? An opportunity: deep learning is good at text.

    | Sex | Experience | Age | Employee Position Title |
    |-----|------------|-----|-------------------------|
    | M | 10 yrs | 42 | Master Police Officer |
    | F | 23 yrs | NA | Social Worker IV |
    | M | 3 yrs | 28 | Police Officer III |
    | F | 16 yrs | 45 | Police Aide |
    | M | 13 yrs | 48 | Electrician I |
    | M | 6 yrs | 36 | Bus Operator |
    | M | NA | 62 | |
  17. Successes of neural networks in vision, text...

    Large data. Pretrained. Can we bring background information to tables?
  18. Enriching via strings

    Real-estate market: predict the price of a property.

    | Age | Surface | # rooms | City |
    |-----|---------|---------|------|
    | 102 | 76 | 4 | Paris |
    | 18 | 123 | 6 | Vitry |
    | 12 | 155 | 7 | Reno |
    | 53 | 23 | 1 | NYC |
    | 39 | 32 | 3 | London |
    | 23 | 52 | 4 | Vancouver |
    | 39 | 114 | 7 | Prince Rupert |
  19. Enriching via strings

    Recognized strings (entities) enable bringing in background information.

    Background table:

    | City | Pop. |
    |------|------|
    | Paris | 2.2M |
    | Vitry | 33k |

    Enriched table:

    | City | Rent | Population |
    |------|------|------------|
    | Paris | 1100€ | 2.2M |
    | Vitry | 700€ | 33k |
    | Paris | 1300€ | 2.2M |
  20. Enriching faces representation and summarization challenges

    Varying available features, granularity of information. The richer the data source, the worse it gets.

    Background tables:

    | City | Pop. |
    |------|------|
    | Paris | 2.2M |
    | Vitry | 33k |

    | Person ID | City | Salary |
    |-----------|------|--------|
    | P1 | Paris | 50k€ |
    | P2 | Paris | 40k€ |
    | P3 | Vitry | 34k€ |
    | P4 | Vitry | 38k€ |

    | City | Department |
    |------|------------|
    | Paris | Paris |
    | Vitry-sur-Seine | Val-de-Marne |

    | Department | Poverty rate |
    |------------|--------------|
    | Paris | 15.2% |
    | Val-de-Marne | 13.3% |

    Enriched table (mean salary obtained with GroupBy + Avg):

    | City | Rent | Poverty rate | Population | Mean salary |
    |------|------|--------------|------------|-------------|
    | Paris | 1100€ | 15.2% | 2.2M | 45k€ |
    | Vitry | 700€ | 13.3% | 33k | 36k€ |
    | Paris | 1300€ | 15.2% | 2.2M | 45k€ |
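The "GroupBy + Avg" enrichment on this slide can be written out explicitly. The sketch below uses only the Python standard library and the slide's toy values; the function and variable names are hypothetical, and a real pipeline would more likely use a dataframe library.

```python
from collections import defaultdict

# Toy tables from the slide (salaries in k€, rents in €).
rents = [("Paris", 1100), ("Vitry", 700), ("Paris", 1300)]
population = {"Paris": "2.2M", "Vitry": "33k"}
salaries = [("P1", "Paris", 50), ("P2", "Paris", 40),
            ("P3", "Vitry", 34), ("P4", "Vitry", 38)]

def mean_by_city(rows):
    """GroupBy city + average: summarize a person-level table at the city level."""
    totals = defaultdict(lambda: [0.0, 0])
    for _person, city, salary in rows:
        totals[city][0] += salary
        totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}

mean_salary = mean_by_city(salaries)

# Left join on the city string: every rent row keeps its place and gains columns.
enriched = [
    {"City": city, "Rent": rent,
     "Population": population.get(city),
     "Mean salary": mean_salary.get(city)}
    for city, rent in rents
]

for row in enriched:
    print(row)
```

The join only works because "Paris" and "Vitry" match exactly across tables. The department table on the slide spells the city "Vitry-sur-Seine", which an exact-match lookup like `population.get(city)` would miss.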
  21. Semantic heterogeneity

    Heterogeneous features: surface, age, # rooms. Discrete objects, high cardinality: 36 000 towns in France. Different tables, different objects: properties, persons, cities, departments.
  22. Vectorial embeddings

    Object → $\mathbb{R}^p$. Capture implicit regularities. Distill information in a relational data source. KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023].
  23. KEN: Overall approach [Cvetkov-Iliev... 2023]

    [Figure: database tables (City, County) are converted to a triplet representation (head, relation, tail), which is used to train embeddings with negative sampling]

    Strategy: data represented as graphs (RDF triplets).
  24. KEN: Overall approach [Cvetkov-Iliev... 2023]

    [Figure: triplet representation (head, relation, tail); training of embeddings with negative sampling (numerical attribute embedding, entity embedding, relation operator); analysis of the embeddings; transfer to a second database]

    Strategy: data represented as graphs (RDF triplets). Adapt knowledge-graph embedding approaches. Capture relations and numerical attributes.
  25. From tables to (knowledge) graphs

    Knowledge graphs = list of triples (head, relation, tail) or (h, r, t), e.g. (Paris, capitalOf, France).

    Table representation (City is the “head” column):

    | City | Population | State |
    |------|------------|-------|
    | San Francisco | 0.87M | California |
    | San Diego | 1.4M | California |

    Triple / knowledge graph representation:

    (San Francisco, Population, 0.87M)
    (San Francisco, State, California)
    (San Diego, Population, 1.4M)
    (San Diego, State, California)

    The two representations are (almost) equivalent.
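The table-to-triples conversion above is mechanical, as the following sketch shows. It assumes each row is a dictionary and that the caller names the "head" column; `table_to_triples` is a hypothetical helper, not part of any library mentioned in the talk.

```python
def table_to_triples(rows, head_column):
    """Turn each (row, column) cell into a (head, relation, tail) triple."""
    triples = []
    for row in rows:
        head = row[head_column]
        for column, value in row.items():
            if column == head_column or value is None:
                continue  # the head column names the entity; missing cells give no triple
            triples.append((head, column, value))
    return triples

rows = [
    {"City": "San Francisco", "Population": "0.87M", "State": "California"},
    {"City": "San Diego", "Population": "1.4M", "State": "California"},
]

triples = table_to_triples(rows, head_column="City")
for triple in triples:
    print(triple)
```

The reverse direction needs a choice of which relations become columns. This sketch does not explore that choice, which may be one reason the slide calls the two representations only "(almost)" equivalent.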
  26. Entity embeddings: contextual

    Contextual: two entities have close embeddings if they co-occur. In NLP: word2vec = word co-occurrences. In knowledge graphs: RDF2vec.

    Input triples:

    (Facebook, FoundedIn, Massachusetts)
    (Facebook, HeadquartersIn, California)
    (MathWorks, FoundedIn, California)
    (MathWorks, HeadquartersIn, Massachusetts)
    (Google, FoundedIn, California)
    (Google, HeadquartersIn, California)
    (Apple, FoundedIn, California)
    (Apple, HeadquartersIn, California)

    [Figure: a) Contextual: RDF2vec embeddings; b) Relational: knowledge graph embeddings]
  27. Entity embeddings: contextual < relational

    Contextual: two entities have close embeddings if they co-occur. Relational: two entities are close if they have the same relations to other entities.

    Input triples:

    (Facebook, FoundedIn, Massachusetts)
    (Facebook, HeadquartersIn, California)
    (MathWorks, FoundedIn, California)
    (MathWorks, HeadquartersIn, Massachusetts)
    (Google, FoundedIn, California)
    (Google, HeadquartersIn, California)
    (Apple, FoundedIn, California)
    (Apple, HeadquartersIn, California)

    [Figure: a) Contextual: RDF2vec embeddings; b) Relational: knowledge graph embeddings]
  28. Knowledge-graph embeddings to capture relations

    TransE [Bordes... 2013] represents relation $r$ as a translation vector $r \in \mathbb{R}^p$ between entity embeddings $h$ and $t$.

    Scoring function: $f(h, r, t) = -\|h + r - t\|$

    [Figure: capitalOf drawn as the same translation from Paris to France and from Rome to Italy]

    Training: optimize $h$, $r$, $t$ to minimize a margin loss:

    $$\mathcal{L} = \sum_{\substack{(h, r, t) \in G \\ (h', t') \text{ s.t. } (h', r, t') \notin G \text{ with } h' = h \text{ or } t' = t}} \left[ f(h', r, t') - f(h, r, t) + \gamma \right]_+$$
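To make the scoring function and the margin loss concrete, here is a small sketch in plain Python. The 2-dimensional embeddings are hand-picked for illustration (they are not trained), and `score` and `margin_loss` are hypothetical names; a real implementation would optimize the vectors by gradient descent.

```python
import math

def score(h, r, t):
    """TransE score f(h, r, t) = -||h + r - t|| with the Euclidean norm."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(positive, negatives, gamma=1.0):
    """Sum of [f(h', r, t') - f(h, r, t) + gamma]_+ over corrupted pairs (h', t')."""
    h, r, t = positive
    positive_score = score(h, r, t)
    return sum(
        max(0.0, score(h_neg, r, t_neg) - positive_score + gamma)
        for h_neg, t_neg in negatives
    )

# Hand-picked toy embeddings: capitalOf is the same translation for both pairs.
paris, france = [1.0, 0.0], [1.0, 1.0]
rome, italy = [3.0, 0.0], [3.0, 1.0]
capital_of = [0.0, 1.0]

print(score(paris, capital_of, france))   # true triple: head + relation lands on the tail
print(score(paris, capital_of, italy))    # corrupted tail: lower score

# The corrupted triple keeps the head (h' = h) and swaps the tail.
print(margin_loss((paris, capital_of, france), [(paris, italy)], gamma=1.0))
print(margin_loss((paris, capital_of, france), [(paris, italy)], gamma=3.0))
```

With a margin smaller than the score gap the loss is zero, so training stops pushing that pair apart. A larger margin keeps a positive loss.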
  29. KEN: embeddings to capture diverse information [Cvetkov-Iliev... 2023]

    We use MuRE [Balazevic... 2019], which is good e.g. for one-to-many relations.

    Scoring function: $f(h, r, t) = -d(\rho_r \odot h,\; t + r_r)^2 + b_h + b_t$, where $\rho_r \odot h$ is a contraction / projection and $t + r_r$ is a translation.

    [Figure: embeddings of Google, Apple, Facebook, MathWorks, California and Massachusetts, with the FoundedIn and HeadquartersIn relations]

    Enables rich relational geometry. Different directions = different information.
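The effect of the contraction term can be seen on a toy case. The sketch below implements the scoring function from the slide, taking $d$ as the Euclidean distance, with hand-picked 2-dimensional vectors; `mure_score` and all numeric values are assumptions for illustration, not KEN's trained parameters.

```python
def mure_score(h, t, rho_r, r_r, b_h=0.0, b_t=0.0):
    """f(h, r, t) = -d(rho_r ⊙ h, t + r_r)^2 + b_h + b_t, with d the Euclidean distance."""
    scaled_head = [rho * hi for rho, hi in zip(rho_r, h)]    # contraction / projection
    shifted_tail = [ti + ri for ti, ri in zip(t, r_r)]       # translation
    squared_distance = sum((a - b) ** 2 for a, b in zip(scaled_head, shifted_tail))
    return -squared_distance + b_h + b_t

# Toy embeddings: the two companies differ only along the second direction.
google, apple = [1.0, 2.0], [1.0, -3.0]
california = [1.0, 0.0]
no_translation = [0.0, 0.0]

# A relation that zeroes the second direction ignores what distinguishes the heads...
project = [1.0, 0.0]
print(mure_score(google, california, project, no_translation))
print(mure_score(apple, california, project, no_translation))

# ...whereas with no contraction the same heads cannot both match the same tail.
identity = [1.0, 1.0]
print(mure_score(google, california, identity, no_translation))
print(mure_score(apple, california, identity, no_translation))
```

A pure translation, as in TransE, maps one head to exactly one point. The per-relation scaling lets several distinct entities satisfy the same relation with the same partner, while other directions of the embedding remain free to carry other information.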
  30. Entity embeddings that distill background information

    KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]. $X \in \mathbb{R}^p$. soda-inria.github.io/ken_embeddings: 6 million common entities (cities, people, companies...).
  31. Table foundation model

    CARTE⋆. Breakthrough: transfer across tables ⇒ pretraining. Deep learning ≫ tree models.

    ⋆ Context Aware Representation of Table Entries
  32. Pretraining for data tables?

    What prior for a bunch of numbers?

    | 72 | 68 | 174 | 1 |
    | 64 | 79 | 181 | 1 |
    | 56 | 59 | 166 | 0 |
    | 81 | 62 | 161 | 1 |
  33. Pretraining for data tables?

    What prior for a bunch of numbers?

    | 72 | 68 | 174 | 1 |
    | 64 | 79 | 181 | 1 |
    | 56 | 59 | 166 | 0 |
    | 81 | 62 | 161 | 1 |

    And now? Cardiovascular cohort:

    | Age | Weight | Height | Comorbidity | Cardiovascular event |
    |-----|--------|--------|-------------|----------------------|
    | 72 | 68 | 174 | Diabetes | 1 |
    | 64 | 79 | 181 | Cardiac arrhythmia | 1 |
    | 56 | 59 | 166 | NA | 0 |
    | 81 | 62 | 161 | Asthma | 1 |
  34. Pretraining for data tables?

    What prior for a bunch of numbers?

    | 72 | 68 | 174 | 1 |
    | 64 | 79 | 181 | 1 |
    | 56 | 59 | 166 | 0 |
    | 81 | 62 | 161 | 1 |

    And now? Cardiovascular cohort:

    | Age | Weight | Height | Comorbidity | Cardiovascular event |
    |-----|--------|--------|-------------|----------------------|
    | 72 | 68 | 174 | Diabetes | 1 |
    | 64 | 79 | 181 | Cardiac arrhythmia | 1 |
    | 56 | 59 | 166 | NA | 0 |
    | 81 | 62 | 161 | Asthma | 1 |

    Column names and string entries contextualize the numbers.
  35. Data integration challenges

    Cell level: entity alignment (Londres / London); string / language modeling. Column level: different schemas; local relational structure.
  36. CARTE: graph representation

    | Title | ISSN | Publisher | Country | Region | H index |
    |-------|------|-----------|---------|--------|---------|
    | Nature | 14764687 | Nature Publishing Group | United Kingdom | Western Europe | 1331 |
    | JMLR | 15337928 | NaN | United States | Northern America | 239 |

    [Figure: the JMLR row drawn as a graph; features are initialized with a language model; cell values sit on the nodes, column names (Title, ISSN, Country, Region, H index) on the edges; numerical values enter through a ⨀ product]

    Graph representation to bridge tables. Cell values ⇒ string embeddings on nodes. Column titles ⇒ string embeddings on edges. [Kim... 2024]
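The row-to-graph step can be sketched as follows. This is an illustration under stated assumptions, not the CARTE code: `embed_text` is a toy stand-in for the language model, numeric cells are represented as the value times the column-name embedding (one reading of the "⨀ Num. Values" label), and missing cells simply produce no node.

```python
import math

def embed_text(text, dim=4):
    """Hypothetical stand-in for a language-model embedding (deterministic toy vector)."""
    code = sum(ord(character) for character in text)
    return [((code * (i + 1)) % 7) / 7.0 for i in range(dim)]

def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))

def row_to_graphlet(row):
    """One row becomes a star graph: one leaf node per non-missing cell."""
    nodes, edges = [], []
    for column, value in row.items():
        if is_missing(value):
            continue                                   # no node for a missing cell
        edge = embed_text(column)                      # column title on the edge
        if isinstance(value, (int, float)):
            node = [value * e for e in edge]           # numeric cell (assumed form)
        else:
            node = embed_text(value)                   # string cell on the node
        nodes.append(node)
        edges.append((column, edge))
    return {"nodes": nodes, "edges": edges}

row = {"Title": "JMLR", "ISSN": "15337928", "Publisher": float("nan"),
       "Country": "United States", "Region": "Northern America", "H index": 239}

graphlet = row_to_graphlet(row)
print([column for column, _ in graphlet["edges"]])
```

Because every table reduces to the same kind of object (embedded strings on nodes and edges), rows from tables with different schemas can be fed to one model.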
  37. CARTE: graph-attention network

    Example table: the Nature and JMLR rows, with columns Title, ISSN, Publisher, Country, Region, H index.

    [Figure: graph of one row, with features initialized by a language model and numerical values entering through a ⨀ product, fed to a graph-attention layer]

    Graph attention handles a varying number of columns and is invariant to column order. New attention with relational information: the attention key & query use E ⊙ X (edge ⊙ cell value), adapted from KEN. [Kim... 2024]
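The two properties listed on this slide (any number of columns, no dependence on column order) follow from attention being a weighted sum over neighbours. The sketch below is a deliberately simplified single-head version: there are no learned projection matrices, the query is the raw center vector, and only keys and values use the E ⊙ X product. These simplifications and all names are assumptions, not the CARTE architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relational_attention(center, nodes, edges):
    """Attend from the center node to the leaves, with keys/values built from E ⊙ X."""
    dim = len(center)
    modulated = [[e * x for e, x in zip(edge, node)]        # E ⊙ X per neighbour
                 for edge, node in zip(edges, nodes)]
    scores = [sum(c * k for c, k in zip(center, key)) / math.sqrt(dim)
              for key in modulated]
    weights = softmax(scores)
    output = [sum(w * value[i] for w, value in zip(weights, modulated))
              for i in range(dim)]
    return weights, output

center = [1.0, 0.0]
nodes = [[1.0, 2.0], [0.5, 1.0], [2.0, 0.0]]   # cell-value features X
edges = [[1.0, 1.0], [2.0, 0.0], [0.5, 1.0]]   # column-name features E

weights, output = relational_attention(center, nodes, edges)
print(weights, output)

# Shuffling the columns (nodes and edges together) leaves the output unchanged,
# and dropping a column just shortens the lists.
_, shuffled_output = relational_attention(center, nodes[::-1], edges[::-1])
_, two_column_output = relational_attention(center, nodes[:2], edges[:2])
print(shuffled_output, two_column_output)
```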
  38. CARTE: graph-attention network

    Example table: the Nature and JMLR rows, with columns Title, ISSN, Publisher, Country, Region, H index.

    [Figure: graph of one row, with features initialized by a language model and numerical values entering through a ⨀ product, fed to a graph-attention layer; pre-training contrasts an actual graphlet with a negative]

    Graph attention handles a varying number of columns and is invariant to column order. New attention with relational information: the attention key & query use E ⊙ X (edge ⊙ cell value), adapted from KEN. Pre-training: contrastive learning on a large knowledge base. [Kim... 2024]
  39. CARTE predicts best!

    Few-shot learning [Kim... 2024].

    [Figure panels: a. Regression, 40 datasets; b. Classification, 11 datasets]

    Legend: TabVec = skrub’s TableVectorizer; XGB = XGBoost; RF = RandomForest; CN = Concat Numerical; EN = Embed Numerical.
  40. CARTE predicts best!

    With a marked benefit [Kim... 2024]. Rank over 42 datasets (a lower rank means better prediction):

    CARTE (9.047), S-LLM-CN-XGB-Bagging (11.218), S-LLM-CN-XGB (14.678), S-LLM-EN-XGB-Bagging (16.454), CatBoost (16.929), TabVec-Logistic (17.158), S-LLM-CN-HGB-Bagging (17.455), TabVec-Logistic-Bagging (17.912), TabVec-LLM-XGB-Bagging (19.559), S-LLM-EN-XGB (20.287), CatBoost-Bagging (20.903), TabVec-LLM-XGB (21.058), TabVec-FT-XGB-Bagging (21.901), TabVec-LLM-HGB-Bagging (22.191), TabVec-XGB (22.480), TabVec-XGB-Bagging (22.482), S-LLM-EN-HGB-Bagging (22.552), TarEnc-TabPFN (22.554).
  41. CARTE: learning across multiple tables

    Fine-tuning on multiple tables together. [Kim... 2024]

    [Figure: ranking (axis from 8 to 2) of S-LLM-Single dataset, S-LLM-Not matched, S-LLM-Matched, CatBoost-Not matched, CatBoost-Single dataset, CARTE-Matched, CARTE-Not matched, CARTE-Single dataset, CatBoost-Matched]
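  42. CARTE naturally handles missingness

    Percentage decrease created by missing values, by train size (missing fraction):

    | Methods | 64 (0.1) | 64 (0.3) | 512 (0.1) | 512 (0.3) |
    |---------|----------|----------|-----------|-----------|
    | CARTE | 13.28% | 38.35% | 10.19% | 24.42% |
    | CatBoost | 21.70% | 53.32% | 12.23% | 29.70% |
    | TabVec-XGB | 15.11% | 51.27% | 12.61% | 30.35% |
    | TabVec-RF | 7.68% | 44.43% | 12.77% | 29.79% |

    Normalized absolute score, by train size (missing fraction):

    | Methods | 64 (0.1) | 64 (0.3) | 512 (0.1) | 512 (0.3) |
    |---------|----------|----------|-----------|-----------|
    | CARTE | 0.44(0.20) | 0.29(0.18) | 0.75(0.12) | 0.61(0.14) |
    | CatBoost | 0.31(0.22) | 0.17(0.17) | 0.65(0.15) | 0.50(0.15) |
    | TabVec-XGB | 0.19(0.20) | 0.11(0.15) | 0.65(0.17) | 0.50(0.17) |
    | TabVec-RF | 0.23(0.21) | 0.14(0.16) | 0.63(0.15) | 0.49(0.15) |

    [Kim... 2024]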
  43. Different benchmarks, different domains

    CARTE = tables with strings. [Kim... 2024]

    Fraction of numerical cols.: 0.194 on CARTE’s benchmark, 0.613 on TabLLM’s benchmark.
    Fraction of cols. with |C| > 10 (high-cardinality categories): 0.625 on CARTE’s benchmark, 0.043 on TabLLM’s benchmark.
  44. Different benchmarks, different domains

    CARTE = tables with strings. [Kim... 2024]

    Fraction of numerical cols.: 0.194 on CARTE’s benchmark, 0.613 on TabLLM’s benchmark.
    Fraction of cols. with |C| > 10 (high-cardinality categories): 0.625 on CARTE’s benchmark, 0.043 on TabLLM’s benchmark.

    Modeling numbers matters.

    [Figure: on the TabLLM benchmark, ranking (axis from 4 to 1) of TabLLM, XGBoost, TabPFN, CARTE]

    Need joint models of strings & numbers.
  45. CARTE, deep learning

    Table foundation model. Powerful. Resource hungry. Try it out!⋆

    ⋆ pypi.org/project/carte-ai
  46. A worrying trend in AI [Varoquaux... 2024]

    [Figure: training cost of a single run, 2010 to 2020, axis from $1 to $1M; inference FLOP compared with the FLOPS in a $100 GPU, 2010 to 2020, axis from 10^6 to 10^15]

    Ever-rising costs, and footprint.
  47. A second look at those benchmarks [Kim... 2024]

    Ranking (a lower rank means better prediction):

    CARTE (9.047), S-LLM-CN-XGB-Bagging (11.218), S-LLM-CN-XGB (14.678), S-LLM-EN-XGB-Bagging (16.454), CatBoost (16.929), TabVec-Logistic (17.158), S-LLM-CN-HGB-Bagging (17.455), TabVec-Logistic-Bagging (17.912), TabVec-LLM-XGB-Bagging (19.559), S-LLM-EN-XGB (20.287), CatBoost-Bagging (20.903), TabVec-LLM-XGB (21.058), TabVec-FT-XGB-Bagging (21.901), TabVec-LLM-HGB-Bagging (22.191), TabVec-XGB (22.480), TabVec-XGB-Bagging (22.482), S-LLM-EN-HGB-Bagging (22.552), TarEnc-TabPFN (22.554), TabVec-RandomForest (22.573), TabVec-FT-XGB (22.610).

    CARTE is cool, but baselines are strong. Preprocessing matters: LLMs + trees; skrub’s TableVectorizer; TableVectorizer + logistic regression.
  48. CARTE, deep learning

    CARTE: powerful, difficult to install⋆, resource hungry. skrub: exploring the trade-offs; easy to install and adopt; TableVectorizer.

    ⋆ pypi.org/project/carte-ai
  49. Tabular Learning

    Learning from tables is everywhere, and data preparation is key. Tabular foundation models: background knowledge, implicit priors. Skrub: lightweight tabular learning.

    skrub-data.org @GaelVaroquaux
  50. References I

    J. Abécassis, E. Dumas, J. Alberge, and G. Varoquaux. From prediction to prescription: Machine learning and causal inference. 2024.

    J. Alberge, V. Maladière, O. Grisel, J. Abécassis, and G. Varoquaux. Teaching models to survive: Proper scoring rule and stochastic optimization with competing risks. arXiv preprint arXiv:2406.14085, 2024.

    I. Balazevic, C. Allen, and T. Hospedales. Multi-relational Poincaré graph embeddings. Neural Information Processing Systems, 32:4463, 2019.

    A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.

    A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning, pages 1–34, 2023.
  51. References II

    K. Dadi, G. Varoquaux, J. Houenou, D. Bzdok, B. Thirion, and D. Engemann. Population modeling with machine learning can enhance measures of mental health. GigaScience, 10(10):giab071, 2021.

    J. Dockès, G. Varoquaux, and J.-B. Poline. Preventing dataset shift from breaking machine-learning biomarkers. GigaScience, 10(9):giab055, 2021.

    M. Doutreligne and G. Varoquaux. How to select predictive models for causal inference? 2023. URL https://hal.science/hal-03946902.

    L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

    J. Kim, L. Grinsztajn, and G. Varoquaux. CARTE: pretraining and transfer for tabular learning. ICML, 2024.
  52. References III

    M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict with missing values? NeurIPS, 2021.

    M. Le Morvan and G. Varoquaux. Imputation for prediction: beware of diminishing returns. arXiv preprint arXiv:2407.19804, 2024.

    X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

    L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.

    G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):48, 2022.

    G. Varoquaux, A. S. Luccioni, and M. Whittaker. Hype, sustainability, and the price of the bigger-is-better paradigm in AI. arXiv preprint arXiv:2409.14160, 2024.
  53. References IV

    J. K. Winkler, C. Fink, F. Toberer, A. Enk, T. Deinlein, R. Hofmann-Wellenhof, L. Thomas, A. Lallas, A. Blum, W. Stolz, ... Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatology, 155(10):1135–1141, 2019.