
Representation learning on relational data to automate data preparation


In standard data-science practice, significant effort is spent on preparing the data before statistical learning. One reason is that the data come from various tables, each with its own subject matter and specificities. This is unlike natural images, or even natural text, where universal regularities have enabled representation learning, fueling the deep learning revolution.

I will present progress on learning representations with data tables, overcoming the lack of simple regularities. I will show how these representations decrease the need for data preparation: matching entities, aggregating the data across tables. Character-level modeling enables statistical learning without normalized entities, as in the dirty-cat library. Representation learning across many tables, describing objects of different natures and varying attributes, can aggregate the distributed information, forming vector representations of entities. As a result, we created general-purpose embeddings that enrich many data analyses by summarizing all the numerical and relational information in Wikipedia for millions of entities: cities, people, companies, books.

[1] Marine Le Morvan, Julie Josse, Erwan Scornet, and Gaël Varoquaux (2021). What’s a good imputation to predict with missing values? Advances in Neural Information Processing Systems, 34, 11530-11540.

[2] Patricio Cerda and Gaël Varoquaux (2020). Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering.

[3] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux (2022). Analytics on non-normalized data sources: more learning, rather than more cleaning. IEEE Access, 10, 42420-42431.

[4] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux (2023). Relational data embeddings for feature enrichment with background information. Machine Learning, 1-34.

Gael Varoquaux

April 13, 2023


Transcript

  1. Representation learning on relational data
    to automate data preparation
    Gaël Varoquaux


  2. Data preparation
    is crucial to analysis
    Better pipelines can reduce
    this need
    Focus on supervised learning:
    “good” representations, models
    = gives good predictions
    But supervised learning is more:
    weakly-parametric estimators of
    conditional relations
    G Varoquaux 1


  3. 1 Data tables, not vector spaces
    Gender Experience Age Employee Position Title
    M 10 yrs 42 Master Police Officer
    F 23 yrs NA Social Worker IV
    M 3 yrs 28 Police Officer III
    F 16 yrs 45 Police Aide
    M 13 yrs 48 Electrician I
    M 6 yrs 36 Bus Operator
    M NA 62 Bus Operator
    F 9 yrs 35 Social Worker III
    F NA 39 Library Assistant II
    M 8 yrs NA Library Assistant I


  4. In data science
    most data is tabular
    G Varoquaux 3


  5. Data modeling practices
    Count, normalize, encode
    Transform everything to numbers
    It’s the nature of statistics
    We must feed the models
    G Varoquaux 4


  6. Adapting the data to our models
    Improving data & knowledge representation:
    curating it,
    transforming it,
    not automated by traditional machine learning
    Data massaging Mostly pandas and SQL scripts
    G Varoquaux 5


  7. Adapting the data to our models
    Improving data & knowledge representation:
    curating it,
    transforming it,
    not automated by traditional machine learning
    Data massaging Mostly pandas and SQL scripts
    Data preparation = #1 challenge (“Dirty data”)
    [Kaggle 2018, Lam... 2021]
    www.kaggle.com/ash316/novice-to-grandmaster
    G Varoquaux 5


  8. Data massaging
    is exhausting
    Will deep learning
    save us?
    G Varoquaux 6


  9. Deep learning underperforms on data tables [Grinsztajn... 2022]
    Tailored deep-learning architectures
    But tree-based methods perform best
    [Benchmark figure: normalized test accuracy of the best model so far (selected on the validation set) vs. random-search time in seconds, for FT Transformer, MLP, Resnet, SAINT, GradientBoostingTree, RandomForest, XGBoost]
    G Varoquaux 7


  10. Deep learning underperforms on data tables [Grinsztajn... 2022]
    Tabular data
    Various non-Gaussian marginals
    Many categorical features
    Trees’ inductive bias:
    Axis-aligned
    Each column is meaningful
    Non-smooth
    The data’s natural geometry is neither smooth nor vectorial
    Our toolkit is based on smooth optimization in vector spaces
    G Varoquaux 7


  11. Missing Data
    Frequent in
    health & social sciences
    ℝᵖ ∪ {NA} not a vector space
    G Varoquaux 8


  12. Impute & regress [Le Morvan... 2021]
    Impute: fill in the blanks with likely values
    Standard statistical inference needs missing at random:
    missingness is independent from unseen values
    G Varoquaux 9


  13. Impute & regress [Le Morvan... 2021]
    Impute: fill in the blanks with likely values
    Standard statistical inference needs missing at random:
    missingness is independent from unseen values
    Complete data Imputed data (manifolds)
    Theorem (informal): a universally consistent learner leads
    to optimal prediction for all missing data mechanisms and
    almost all imputation functions.
    Asymptotically, imputing well is not needed to predict well.
    G Varoquaux 9


  14. Impute & regress [Le Morvan... 2021]
    Impute: fill in the blanks with likely values
    Standard statistical inference needs missing at random:
    missingness is independent from unseen values
    Theorem (informal): a universally consistent learner leads
    to optimal prediction for all missing data mechanisms and
    almost all imputation functions.
    Asymptotically, imputing well is not needed to predict well.
    Imputation and regression must be jointly optimized
    When imputing with 𝔼[Xmis | Xobs],
    the optimal regressor to predict is discontinuous
    G Varoquaux 9


  15. Trees handling missing values
    MIA (Missing Incorporated Attribute)
    [Josse... 2019]
    x10< -1.5 ?
    x2< 2 ?
    Yes/Missing
    x7< 0.3 ?
    No
    ...
    Yes
    ...
    No/Missing
    x1< 0.5 ?
    Yes
    ...
    No/Missing
    ... Predict +1.3
    sklearn
    HistGradientBoostingClassifier
    The learner readily handles
    missing values
    G Varoquaux 10


  16. Trees handling missing values
    MIA (Missing Incorporated Attribute)
    [Josse... 2019]
    x10< -1.5 ?
    x2< 2 ?
    Yes/Missing
    x7< 0.3 ?
    No
    ...
    Yes
    ...
    No/Missing
    x1< 0.5 ?
    Yes
    ...
    No/Missing
    ... Predict +1.3
    sklearn
    HistGradientBoostingClassifier
    The learner readily handles
    missing values
    Benchmarks [Perez-Lebel... 2022]
    Tree handling of missing values works best
    Imputation works well, but expensive
    G Varoquaux 10
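    A minimal sketch of this in code (the dataset is synthetic, made up for illustration): scikit-learn's HistGradientBoostingClassifier accepts NaN values directly and learns MIA-style splits, so no imputation step is needed.

    ```python
    # HistGradientBoostingClassifier handles missing values natively
    # (MIA-style splits): NaN can go down either branch of a split.
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] > 0).astype(int)          # label depends on the first feature
    X[rng.random(X.shape) < 0.2] = np.nan  # inject ~20% missing values

    clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
    print(clf.score(X, y))  # trains and predicts despite the NaNs
    ```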


  17. Missing Data
    ℝᵖ ∪ {NA} not a vector space
    Imputation is not about finding
    likely values
    Rather a representation to
    facilitate learning
    [Le Morvan... 2021]
    G Varoquaux 11


  18. Entity representations
    Open-ended entries
    G Varoquaux 12
    Employee Position Title
    Master Police Officer
    Social Worker IV
    Police Officer III
    Police Aide
    Electrician I
    Bus Operator
    Bus Operator
    Social Worker III


  19. Modeling strings, rather than categories
    Notion of category ⇔ entity normalization
    Drug Name
    alcohol
    ethyl alcohol
    isopropyl alcohol
    polyvinyl alcohol
    isopropyl alcohol swab
    62% ethyl alcohol
    alcohol 68%
    alcohol denat
    benzyl alcohol
    dehydrated alcohol
    Employee Position Title
    Police Aide
    Master Police Officer
    Mechanic Technician II
    Police Officer III
    Senior Architect
    Senior Engineer Technician
    Social Worker III
    G Varoquaux 13


  20. Modeling strings: GapEncoder = string embeddings
    Factorizing sub-string (3-gram) count matrices
    Models strings as a linear combination of substrings
    [Figure: binary 3-gram count matrix; rows: police, officer, pol off, polis, policeman, policier; columns: pol, lic, ice, ce_, _of, off, fic, cer, er_]
    G Varoquaux 14
    [Cerda and Varoquaux 2020]


  21. Modeling strings: GapEncoder = string embeddings
    Factorizing sub-string (3-gram) count matrices
    Models strings as a linear combination of substrings
    [Figure: the 3-gram count matrix factorizes into (documents × topics) · (topics × 3-grams): which latent categories are in an entry, and which substrings are in a latent category]
    G Varoquaux 14
    [Cerda and Varoquaux 2020]
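    The factorization above can be sketched with standard tools; this toy example builds the 3-gram count matrix with scikit-learn and uses plain NMF as a stand-in for the GapEncoder's Gamma-Poisson factorization (the job titles are taken from the slides):

    ```python
    # Simplified sketch of the GapEncoder idea: factorize a character
    # 3-gram count matrix into latent "topics". The real GapEncoder
    # (dirty-cat) uses online Gamma-Poisson factorization; NMF stands
    # in here for illustration only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import NMF

    titles = ["Master Police Officer", "Police Officer III",
              "Social Worker IV", "Social Worker III",
              "Bus Operator", "Equipment Operator I"]

    # Rows: strings; columns: character 3-grams (word boundaries padded)
    counts = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(titles)

    # Non-negative activations of each string on 3 latent categories
    activations = NMF(n_components=3, random_state=0).fit_transform(counts)
    print(activations.shape)  # one 3-dimensional embedding per string
    ```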


  22. GapEncoder: String embeddings capturing latent categories
    [Figure: activations of job titles (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) on latent categories]
    G Varoquaux 15
    Code: dirty-cat.github.io [Cerda and Varoquaux 2020]


  23. GapEncoder: String embeddings capturing latent categories
    Plausible feature names
    [Figure: the same job titles against inferred category names, e.g. “assistant, library”, “equipment, operator”, “program, manager”, “rescuer, rescue”, “officer”]
    G Varoquaux 15
    [Cerda and Varoquaux 2020]


  24. Representations tailored to the data
    fasttext: almost as good as GapEncoder, if in the right language
    [Figure: relative score (%) of one-hot + SVD, similarity encoding, FastText + SVD, and Gamma-Poisson factorization; and of FastText + SVD (d=30) on English, French, and Hungarian data]
    G Varoquaux 16
    [Cerda and Varoquaux 2020]


  25. Vectorizing tables: the TableVectorizer
    The dirty-cat software
    dirty-cat.github.io
    TableVectorizer
    X = tab_vec.fit_transform(df)
    Heuristics for different columns
    strings with ≥ 30 categories ⇒ GapEncoder
    date/time ⇒ DateTimeEncoder
    non-string discrete ⇒ TargetEncoder
    ...
    Strong baseline
    G Varoquaux 17


  26. Data tables
    - Heterogeneous columns
    - Missing values
    - Open-ended strings
    Tree-based models
    sklearn HistGradientBoosting
    Column encoding
    dirty cat TableVectorizer
    G Varoquaux 18


  27. 2 Across tables
    We often start from many tables
    [Figure: matching the same entities, aggregating, analysis]


  28. Example data-science analysis
    Real-estate market
    Expected price of a property?
    Predict the price from
    relevant information available
    age
    surface area
    # of rooms
    floor
    location
    ...
    G Varoquaux 20


  29. Example data-science analysis
    Data may need to be merged across tables
    City Rent
    Paris 1100€
    Vitry 700€
    Paris 1300€
    City Pop.
    Paris 2.2M
    Vitry 33k
    Result: City Rent Population
    Paris 1100€ 2.2M
    Vitry 700€ 33k
    Paris 1300€ 2.2M
    G Varoquaux 21


  30. Example data-science analysis
    Aggregations may be needed across different data granularity
    City Rent
    Paris 1100€
    Vitry 700€
    Paris 1300€
    City Pop.
    Paris 2.2M
    Vitry 33k
    Person ID City Salary
    P1 Paris 50k€
    P2 Paris 40k€
    P3 Vitry 34k€
    P4 Vitry 38k€
    GroupBy + Avg ⇒ Mean salary: Paris 45k€, Vitry 36k€
    Result: City Rent Population Mean salary
    Paris 1100€ 2.2M 45k€
    Vitry 700€ 33k 36k€
    Paris 1300€ 2.2M 45k€
    G Varoquaux 22
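    The GroupBy + Avg step followed by the join can be sketched in pandas, with the toy tables from the slide:

    ```python
    # Aggregate-then-join: bring per-person salaries to city granularity,
    # then merge them onto the main (rent) table.
    import pandas as pd

    rent = pd.DataFrame({"City": ["Paris", "Vitry", "Paris"],
                         "Rent": [1100, 700, 1300]})
    salaries = pd.DataFrame({"City": ["Paris", "Paris", "Vitry", "Vitry"],
                             "Salary": [50_000, 40_000, 34_000, 38_000]})

    # GroupBy + Avg: one row per city
    mean_salary = (salaries.groupby("City", as_index=False)["Salary"].mean()
                   .rename(columns={"Salary": "Mean salary"}))

    # Join back onto the main table
    enriched = rent.merge(mean_salary, on="City", how="left")
    print(enriched)  # Paris rows get 45000, Vitry rows get 36000
    ```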


  31. Example data-science analysis
    Multiple hops may be needed
    City Rent
    Paris 1100€
    Vitry 700€
    Paris 1300€
    City Pop.
    Paris 2.2M
    Vitry 33k
    Person ID City Salary
    P1 Paris 50k€
    P2 Paris 40k€
    P3 Vitry 34k€
    P4 Vitry 38k€
    City Department
    Paris Paris
    Vitry-sur-Seine Val-de-Marne
    Department Poverty rate
    Paris 15.2%
    Val-de-Marne 13.3%
    GroupBy + Avg
    Result: City Rent Population Mean salary Poverty rate
    Paris 1100€ 2.2M 45k€ 15.2%
    Vitry 700€ 33k 36k€ 13.3%
    Paris 1300€ 2.2M 45k€ 15.2%
    G Varoquaux 23


  32. Example data-science analysis
    Joining tables Aggregations Multiple hops
    Difficult for humans:
    requires expertise on the data
    Difficult for machine learning:
    discrete choices, combinatorial optimization
    [Tables as on slide 31]
    G Varoquaux 23


  33. Example data-science analysis
    We need statistics and learning
    across tables
    [Tables as on slide 31]
    G Varoquaux 24


  34. Relational data challenges statistical learning
    Statistics and learning use repetitions and regularities
    Relational data
    Discrete objects, different tables, different natures
    properties, person, cities, departments...
    No clear repetition, regularity, metric, smoothness
    G Varoquaux 25


  35. Assembling data
    same entities
    Aggregating
    Analysis
    A “main” table
    Feature-enrichment tables
    G Varoquaux 26


  36. Deep Feature Synthesis [Kanter and Veeramachaneni 2015]
    Greedily - starts from a target table
    - recursively joins related tables, to a given depth
    One-to-many relations: Computes different aggregations
    COUNT, SUM, LAST, MAX...
    City Population City School School Students
    Palaiseau 33k Palaiseau Lycée Camille Claudel Lycée Camille Claudel 800
    Palaiseau Lycée Henri Poincaré Lycée Henri Poincaré 1000
    Target table
    Depth 0 City Department Department PovertyRate
    Palaiseau Essonne Essonne 13.3%
    Depth 1 Depth 2
    City Population
    COUNT(City.School)
    City.Department
    City.Department.PovertyRate
    SUM(City.School.Students)
    MAX(City.School.Students)
    Palaiseau 33k 2 Essonne 13.3% 1800 800
    Does not scale: # features explodes with depth and # tables
    G Varoquaux 27


  37. Embeddings as assembly
    Entity embeddings that
    distill information
    across tables
    Object → ℝᵖ
    KEN: knowledge embedding with
    numbers [Cvetkov-Iliev... 2023]
    G Varoquaux 28


  38. KEN: Overall approach [Cvetkov-Iliev... 2023]
    Strategy:
    Convert data to graph (RDF triplets)
    [Figure: database tables of cities and counties (Name, City, County, State, Lat, Long, Pop) are converted to a triplet representation (head, relation, tail), e.g. (Harris County, Pop, 4.7M), (Anaheim City, County, Orange); entity embeddings are then trained with negative sampling]
    G Varoquaux 29


  39. KEN: Overall approach [Cvetkov-Iliev... 2023]
    Strategy:
    Convert data to graph (RDF triplets)
    Adapt knowledge-graph embedding approaches
    Capture relations and numerical attributes
    [Figure: entity embeddings, relation operators, and numerical-attribute embeddings are trained jointly with negative sampling, then transferred to analyses on a second database (e.g. Votes per county)]
    G Varoquaux 29


  40. From tables to (knowledge) graphs
    Knowledge graphs = list of triples (head, relation, tail) or (h, r, t)
    e.g. (Paris, capitalOf, France)
    San Francisco
    San Diego
    California
    0.87M
    State
    1.4M
    Population
    City Population State
    San Francisco 0.87M California
    San Diego 1.4M California
    (San Francisco, Population, 0.87M)
    (San Francisco, State, California)
    (San Diego, Population, 1.4M)
    (San Diego, State, California)
    Table representation Triple / Knowledge graph
    “Head” column
    The two representations
    are (almost) equivalent:
    G Varoquaux 30
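    The table-to-triples conversion can be sketched directly (values from the slide's example):

    ```python
    # Each non-key cell of the table becomes one (head, relation, tail)
    # triple: head = the row's entity, relation = the column name.
    import pandas as pd

    cities = pd.DataFrame(
        {"City": ["San Francisco", "San Diego"],
         "Population": ["0.87M", "1.4M"],
         "State": ["California", "California"]})

    triples = [(row["City"], col, row[col])
               for _, row in cities.iterrows()
               for col in ["Population", "State"]]
    print(triples[0])  # ('San Francisco', 'Population', '0.87M')
    ```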


  41. Entity embeddings: contextual
    Contextual: two entities have close embeddings
    if they co-occur
    In NLP: word2vec = word co-occurrences
    In knowledge graphs: RDF2vec
    G Varoquaux 31
    Input triples:
    (Facebook, FoundedIn, Massachusetts)
    (Facebook, HeadquartersIn, California)
    (MathWorks, FoundedIn, California)
    (MathWorks, HeadquartersIn, Massachusetts)
    (Google, FoundedIn, California)
    (Google, HeadquartersIn, California)
    (Apple, FoundedIn, California)
    (Apple, HeadquartersIn, California)
    [Figure: a) contextual RDF2vec embeddings vs. b) relational knowledge-graph embeddings of Google, Apple, Facebook, MathWorks, California, Massachusetts]


  42. Entity embeddings: contextual < relational
    Contextual: two entities have close embeddings
    if they co-occur
    Relational: two entities are close
    if they have the same relations to other entities
    G Varoquaux 31
    [Figure: same input triples and embeddings as on slide 41]


  43. Knowledge-graph embeddings to capture relations
    TransE [Bordes... 2013] represents relation r as a translation
    vector r ∈ ℝᵖ between entity embeddings h and t.
    Scoring function:
    f(h, r, t) = −‖h + r − t‖
    [Figure: Paris → France and Rome → Italy linked by the same capitalOf translation]
    Training: optimize h, r, t to minimize a margin loss over true
    triples (h, r, t) ∈ G and corrupted triples (h′, r, t′) ∉ G
    with h′ = h or t′ = t:
    L = Σ [f(h′, r, t′) − f(h, r, t) + γ]₊
    G Varoquaux 32
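    A toy numeric illustration of the TransE scoring function, with random embeddings and a translation vector constructed by hand rather than learned:

    ```python
    # TransE scores a triple by how well head + relation lands on tail:
    # f(h, r, t) = -||h + r - t||. Embeddings here are random toys.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8
    emb = {name: rng.normal(size=dim)
           for name in ["Paris", "France", "Rome", "Italy"]}
    # Hand-built translation for illustration (TransE would learn it)
    capital_of = emb["France"] - emb["Paris"]

    def score(h, r, t):
        return -np.linalg.norm(emb[h] + r - emb[t])

    print(score("Paris", capital_of, "France"))  # 0 by construction
    print(score("Paris", capital_of, "Italy"))   # strictly lower
    ```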


  44. KEN: embeddings to distill information [Cvetkov-Iliev... 2023]
    1. Capture one-to-many relations Use MuRE [Balazevic... 2019]
    Scoring function: f(h, r, t) = −d(ρr ⊙ h, t + rr)² + bh + bt
    ρr ⊙ h: contraction / projection; t + rr: translation
    Enables rich relational geometry
    [Figure: companies (Google, Apple, Facebook, MathWorks) and states embedded with FoundedIn and HeadquartersIn relation operators]
    G Varoquaux 33


  45. KEN: embeddings to distill information [Cvetkov-Iliev... 2023]
    1. Capture one-to-many relations Use MuRE [Balazevic... 2019]
    Scoring function: f(h, r, t) = −d(ρr ⊙ h, t + rr)² + bh + bt
    ρr ⊙ h: contraction / projection; t + rr: translation
    2. Embed numerical attributes Attribute-specific mini-MLP
    A numerical relation (attribute) r of value x gets the representation
    er(x) = ReLU(x wr + br)
    Use er(x) in place of the tail embedding
    G Varoquaux 33
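    The attribute embedding can be sketched in a few lines; wr and br are random here, whereas KEN learns them per attribute:

    ```python
    # Sketch of KEN's numerical-attribute embedding e_r(x) = ReLU(x*w_r + b_r):
    # a scalar value x is mapped to a vector used in place of a tail embedding.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8
    w_r = rng.normal(size=dim)  # per-attribute weights (learned in KEN)
    b_r = rng.normal(size=dim)  # per-attribute bias (learned in KEN)

    def embed_attribute(x):
        return np.maximum(0.0, x * w_r + b_r)  # ReLU

    e = embed_attribute(2.2)  # e.g. a normalized population value
    print(e.shape)  # a dim-dimensional "tail" vector
    ```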


  46. KEN embeddings do distill the information [Cvetkov-Iliev... 2023]
    Object → ℝᵖ
    Feature vectors almost as performant for analysis as
    combinatorial feature generation
    Scalable to millions of entities
    Capture multi-hop information (across multiple tables)
    Reconstruct well the distributions of numerical attributes:
    mean, percentiles, counts, in one-to-many settings
    Good features for neural nets: X ∈ ℝᵖ
    G Varoquaux 34


  47. Entity embeddings that
    distill information
    across tables
    KEN: knowledge embedding with
    numbers [Cvetkov-Iliev... 2023]
    X ∈ ℝᵖ
    soda-inria.github.io/ken_embeddings
    6 million common entities
    cities, people, companies...
    Example usage in dirty-cat docs
    G Varoquaux 35


  48. Representation learning + rich machine learning
    Can partly automate data preparation
    The promise of less manual work
    But we have replaced one sausage factory (data massaging)
    by another (opaque representations and models)
    Why should we trust these?
    G Varoquaux 36


  49. Valid analysis?
    More learning or cleaning?
    [Cvetkov-Iliev... 2022]
    G Varoquaux 37


  50. More learning versus more cleaning [Cvetkov-Iliev... 2022]
    A cross-institution study of salary
    Comparing entity matching vs embeddings + machine learning
    Entity matching = 3 days of manual labor, results imperfect
    Analysis, not prediction
    Machine learning as flexible estimators of conditional relations
    Analytic questions can be reformulated
    e.g. sex pay gap = causal effect of sex ⇒ double-ML estimators
    G Varoquaux 38


  51. More learning versus more cleaning [Cvetkov-Iliev... 2022]
    A cross-institution study of salary
    Comparing entity matching vs embeddings + machine learning
    Entity matching = 3 days of manual labor, results imperfect
    Analysis, not prediction
    Machine learning as flexible estimators of conditional relations
    Analytic questions can be reformulated
    e.g. sex pay gap = causal effect of sex ⇒ double-ML estimators
    Validity established via error on observables (cross-validation)
    Conclusion: Both cleaning & learning help
    Embedding + learning goes far for little cost
    G Varoquaux 38


  52. Valid analyses
    Opinion: More learning
    rather than cleaning
    [Cvetkov-Iliev... 2022]
    Cleaning, modeling
    Human auditable
    Sometimes in the eye of the
    beholder
    Learning
    Validity on observables
    G Varoquaux 39


  53. The soda team: Machine learning for health and social sciences
    Tabular relational learning
    Relational databases, data lakes
    Health and social sciences
    Epidemiology, education, psychology
    Machine learning for statistics
    Causal inference, biases, missing values
    Data-science software
    scikit-learn, joblib, dirty-cat
    G Varoquaux 40


  54. Representations of relational data
    Trees work very well on a data table
    - Not tied to smooth geometry / gradients
    I seek continuous representations of complex discrete objects
    - Lack of obvious regularities
    - String-based representations
    - Embedding a large database graph
    software: dirty-cat
    dirty-cat.github.io
    @GaelVaroquaux


  55. References I
    I. Balazevic, C. Allen, and T. Hospedales. Multi-relational Poincaré graph embeddings.
    Neural Information Processing Systems, 32:4463, 2019.
    A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating
    embeddings for modeling multi-relational data. In Advances in Neural Information
    Processing Systems, pages 2787–2795, 2013.
    P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables.
    Transactions in Knowledge and Data Engineering, 2020.
    A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Analytics on non-normalized data
    sources: more learning, rather than more cleaning. IEEE Access, 2022.
    A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational data embeddings for feature
    enrichment with background information. Machine Learning, pages 1–34, 2023.
    L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform
    deep learning on typical tabular data? In Thirty-sixth Conference on Neural
    Information Processing Systems Datasets and Benchmarks Track, 2022.


  56. References II
    J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised
    learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
    Kaggle. Kaggle industry survey, 2018. URL
    https://www.kaggle.com/ash316/novice-to-grandmaster.
    J. M. Kanter and K. Veeramachaneni. Deep feature synthesis: Towards automating data
    science endeavors. In IEEE International Conference on Data Science and Advanced
    Analytics (DSAA), pages 1–10, 2015.
    H. T. Lam, B. Buesser, H. Min, T. N. Minh, M. Wistuba, U. Khurana, G. Bramble, T. Salonidis,
    D. Wang, and H. Samulowitz. Automated data science for relational data. In
    International Conference on Data Engineering (ICDE), page 2689. IEEE, 2021.
    M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict
    with missing values? NeurIPS, 2021.
    A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline. Benchmarking
    missing-values approaches for predictive models on health databases. GigaScience,
    11, 2022.
