Slide 1

Slide 1 text

Representation learning on relational data to automate data preparation
Gaël Varoquaux

Slide 2

Slide 2 text

Data preparation is crucial to analysis; better pipelines can reduce this need. Focus on supervised learning: "good" representations and models give good predictions. But supervised learning is more: weakly-parametric estimators of conditional relations. G Varoquaux 1

Slide 3

Slide 3 text

1 Data tables, not vector spaces

Gender  Experience  Age  Employee Position Title
M       10 yrs      42   Master Police Officer
F       23 yrs      NA   Social Worker IV
M       3 yrs       28   Police Officer III
F       16 yrs      45   Police Aide
M       13 yrs      48   Electrician I
M       6 yrs       36   Bus Operator
M       NA          62   Bus Operator
F       9 yrs       35   Social Worker III
F       NA          39   Library Assistant II
M       8 yrs       NA   Library Assistant I

Slide 4

Slide 4 text

In data science, most data is tabular

Slide 5

Slide 5 text

Data modeling practices: count, normalize, encode. Transform everything to numbers. It's the nature of statistics: we must feed the models.

Slide 6

Slide 6 text

Adapting the data to our models. Improving data & knowledge representation: curating it, transforming it; not automated by traditional machine learning. Data massaging: mostly pandas and SQL scripts.

Slide 7

Slide 7 text

Adapting the data to our models. Improving data & knowledge representation: curating it, transforming it; not automated by traditional machine learning. Data massaging: mostly pandas and SQL scripts. Data preparation = #1 challenge ("dirty data") [Kaggle 2018, Lam... 2021] www.kaggle.com/ash316/novice-to-grandmaster

Slide 8

Slide 8 text

Data massaging is exhausting. Will deep learning save us?

Slide 9

Slide 9 text

Deep learning underperforms on data tables [Grinsztajn... 2022]. Tailored deep-learning architectures exist, but tree-based methods perform best. [Figure: normalized test accuracy of the best model (on the validation set) vs. random-search time in seconds, comparing FT Transformer, MLP, Resnet, SAINT, GradientBoostingTree, RandomForest, and XGBoost.]

Slide 10

Slide 10 text

Deep learning underperforms on data tables [Grinsztajn... 2022]. Tabular data: various non-Gaussian marginals, many categorical features. Trees' inductive bias: axis-aligned (each column is meaningful), non-smooth. The data's natural geometry is neither smooth nor vectorial; our toolkit is based on smooth optimization in vector spaces.

Slide 11

Slide 11 text

Missing data. Frequent in health & social sciences. ℝᵖ ∪ {NA} is not a vector space.

Slide 12

Slide 12 text

Impute & regress [Le Morvan... 2021]. Impute: fill in the blanks with likely values. Standard statistical inference needs missing at random: missingness is independent of the unseen values.

Slide 13

Slide 13 text

Impute & regress [Le Morvan... 2021]. Impute: fill in the blanks with likely values. Standard statistical inference needs missing at random: missingness is independent of the unseen values. [Figure: complete data vs. imputed data (manifolds).] Theorem (informal): a universally consistent learner leads to optimal prediction for all missing-data mechanisms and almost all imputation functions. Asymptotically, imputing well is not needed to predict well.

Slide 14

Slide 14 text

Impute & regress [Le Morvan... 2021]. Impute: fill in the blanks with likely values. Standard statistical inference needs missing at random: missingness is independent of the unseen values. Theorem (informal): a universally consistent learner leads to optimal prediction for all missing-data mechanisms and almost all imputation functions. Asymptotically, imputing well is not needed to predict well. Imputation and regression must be jointly optimized: when imputing with 𝔼[X_mis | X_obs], the optimal regressor for prediction is discontinuous.
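The theorem's practical upshot can be sketched with scikit-learn: even a crude mean imputation, paired with a flexible learner, predicts reasonably well (a minimal illustration on synthetic data, not the paper's experimental setup).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = X[:, 0] + 0.5 * X[:, 1]
X[rng.rand(*X.shape) < 0.2] = np.nan  # 20% values missing completely at random

# impute-then-regress: mean imputation is a poor guess of the missing
# values, but the flexible downstream learner compensates
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
model.fit(X, y)
pred = model.predict(X)
```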

Slide 15

Slide 15 text

Trees handling missing values. MIA (Missing Incorporated in Attributes) [Josse... 2019]: missing values are sent down one side of each split, chosen during training, e.g. x10 < -1.5? (Yes/Missing) → x2 < 2?; (No) → x7 < 0.3? ... In sklearn: HistGradientBoostingClassifier. The learner readily handles missing values.

Slide 16

Slide 16 text

Trees handling missing values. MIA (Missing Incorporated in Attributes) [Josse... 2019]: missing values are sent down one side of each split, chosen during training, e.g. x10 < -1.5? (Yes/Missing) → x2 < 2?; (No) → x7 < 0.3? ... In sklearn: HistGradientBoostingClassifier. The learner readily handles missing values. Benchmarks [Perez-Lebel... 2022]: tree-based handling of missing values works best; imputation works well, but is expensive.

Slide 17

Slide 17 text

Missing data. ℝᵖ ∪ {NA} is not a vector space. Imputation is not about finding likely values; rather, it is a representation that facilitates learning [Le Morvan... 2021].

Slide 18

Slide 18 text

Entity representations. Open-ended entries, e.g. an Employee Position Title column: Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III.

Slide 19

Slide 19 text

Modeling strings, rather than categories. Notion of category ⇔ entity normalization. Drug Name: alcohol, ethyl alcohol, isopropyl alcohol, polyvinyl alcohol, isopropyl alcohol swab, 62% ethyl alcohol, alcohol 68%, alcohol denat, benzyl alcohol, dehydrated alcohol. Employee Position Title: Police Aide, Master Police Officer, Mechanic Technician II, Police Officer III, Senior Architect, Senior Engineer Technician, Social Worker III.

Slide 20

Slide 20 text

Modeling strings: GapEncoder = string embeddings [Cerda and Varoquaux 2020]. Factorizing sub-string count matrices (character 3-grams). Models strings as a linear combination of substrings. [Figure: binary 3-gram count matrix for entries such as "police officer", "pol off", "polis", "policeman", "policier", over 3-grams er_, cer, fic, off, _of, ce_, ice, lic, pol.]

Slide 21

Slide 21 text

Modeling strings: GapEncoder = string embeddings [Cerda and Varoquaux 2020]. Factorizing sub-string count matrices (character 3-grams). Models strings as a linear combination of substrings. The count matrix factorizes into (documents × topics) · (topics × substrings): what substrings are in a latent category, and what latent categories are in an entry. [Figure: binary 3-gram count matrix for entries such as "police officer", "pol off", "polis", "policeman", "policier", factorized over 3-grams er_, cer, fic, off, _of, ce_, ice, lic, pol.]
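The factorization idea can be sketched with scikit-learn: count character 3-grams, then apply a non-negative factorization. Plain NMF stands in here for the Gamma-Poisson model that GapEncoder actually fits.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

titles = ["Master Police Officer", "Police Officer III", "Police Aide",
          "Social Worker IV", "Social Worker III", "Bus Operator"]

# substring (character 3-gram) count matrix: entries x 3-grams
counts = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(titles)

# factorize into (entries x topics) . (topics x 3-grams):
# each entry becomes a non-negative mix of latent categories
embeddings = NMF(n_components=3, random_state=0, max_iter=500).fit_transform(counts)
```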

Slide 22

Slide 22 text

GapEncoder: string embeddings capturing latent categories [Cerda and Varoquaux 2020]. [Figure: entries such as Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant, mapped to latent categories.] Code: dirty-cat.github.io

Slide 23

Slide 23 text

GapEncoder: string embeddings capturing latent categories [Cerda and Varoquaux 2020]. [Figure: the same entries (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) mapped to latent categories with plausible, automatically derived feature names, e.g. (assistant, library), (equipment, operator), (rescuer, rescue).]

Slide 24

Slide 24 text

Representations tailored to the data. fasttext: almost as good as GapEncoder, if in the right language. [Figure: relative score (%) of One-hot + SVD, Similarity encoding, FastText + SVD (d=30), and Gamma-Poisson factorization, on English, French, and Hungarian data.] [Cerda and Varoquaux 2020]

Slide 25

Slide 25 text

Vectorizing tables: the TableVectorizer. The dirty-cat software (dirty-cat.github.io): X = tab_vec.fit_transform(df). Heuristics for different columns: strings with ≥ 30 categories ⇒ GapEncoder; date/time ⇒ DatetimeEncoder; non-string discrete ⇒ TargetEncoder; ... A strong baseline.
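TableVectorizer itself ships with dirty-cat; the dispatch logic it automates can be sketched as a scikit-learn ColumnTransformer (the columns and encoder choices below are illustrative, not dirty-cat's exact heuristics):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "position": ["Police Officer III", "Social Worker IV", "Bus Operator"],
    "experience_yrs": [3.0, 23.0, 6.0],
})

# one encoder per column kind; TableVectorizer picks encoders
# automatically from dtypes and cardinalities
tab_vec = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["position"]),
    (StandardScaler(), ["experience_yrs"]),
)
X = tab_vec.fit_transform(df)
```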

Slide 26

Slide 26 text

Data tables: heterogeneous columns, missing values, open-ended strings. Tree-based models: sklearn HistGradientBoosting. Column encoding: dirty-cat TableVectorizer.

Slide 27

Slide 27 text

2 Across tables. We often start from many tables: aggregating the same entities across tables for analysis.

Slide 28

Slide 28 text

Example data-science analysis: the real-estate market. What is the expected price of a property? Predict the price from the relevant information available: age, surface area, # of rooms, floor, location, ...

Slide 29

Slide 29 text

Example data-science analysis: data may need to be merged across tables. A rent table (City, Rent): Paris 1100€; Vitry 700€; Paris 1300€. A population table (City, Pop.): Paris 2.2M; Vitry 33k. Joining on City adds a Population column to the rent rows: 2.2M, 33k, 2.2M.

Slide 30

Slide 30 text

Example data-science analysis: aggregations may be needed across different data granularities. A salary table (Person ID, City, Salary): P1 Paris 50k€; P2 Paris 40k€; P3 Vitry 34k€; P4 Vitry 38k€. GroupBy + Avg gives the mean salary per city (Paris 45k€, Vitry 36k€), which can then be joined onto the rent rows along with Population.
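The aggregation-then-join step maps directly onto pandas:

```python
import pandas as pd

rent = pd.DataFrame({"City": ["Paris", "Vitry", "Paris"],
                     "Rent": [1100, 700, 1300]})
salaries = pd.DataFrame({"Person ID": ["P1", "P2", "P3", "P4"],
                         "City": ["Paris", "Paris", "Vitry", "Vitry"],
                         "Salary": [50, 40, 34, 38]})

# GroupBy + Avg: bring the per-person table to city granularity
mean_salary = (salaries.groupby("City", as_index=False)["Salary"].mean()
               .rename(columns={"Salary": "Mean salary"}))

# then join onto the table under analysis
enriched = rent.merge(mean_salary, on="City", how="left")
```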

Slide 31

Slide 31 text

Example data-science analysis: multiple hops may be needed. A City→Department table (Paris → Paris; Vitry-sur-Seine → Val-de-Marne) and a Department→Poverty rate table (Paris 15.2%; Val-de-Marne 13.3%) allow attaching a poverty rate to each rent row (15.2%, 13.3%, 15.2%), in addition to Population and Mean salary.

Slide 32

Slide 32 text

Example data-science analysis: joining tables, aggregations, multiple hops. Difficult for humans: requires expertise on the data. Difficult for machine learning: discrete choices, combinatorial optimization.

Slide 33

Slide 33 text

Example data-science analysis: the enriched table combines Rent, Population, Mean salary, and Poverty rate for each row. We need statistics and learning across tables.

Slide 34

Slide 34 text

Relational data challenges statistical learning. Statistics and learning use repetitions and regularities. Relational data: discrete objects, different tables, different natures (properties, persons, cities, departments...). No clear repetition, regularity, metric, or smoothness.

Slide 35

Slide 35 text

Assembling data: a "main" table plus feature-enrichment tables, aggregating the same entities for analysis.

Slide 36

Slide 36 text

Deep Feature Synthesis [Kanter and Veeramachaneni 2015]. Greedily: starts from a target table and recursively joins related tables, up to a given depth. One-to-many relations: computes different aggregations (COUNT, SUM, LAST, MAX...). Example: at depth 0, the target City table (Palaiseau, Population 33k) and a School table (Lycée Camille Claudel, 800 students; Lycée Henri Poincaré, 1000 students); at depth 1, a City→Department table (Palaiseau → Essonne); at depth 2, a Department→PovertyRate table (Essonne, 13.3%). Resulting features for Palaiseau: Population 33k, COUNT(City.School) = 2, City.Department = Essonne, City.Department.PovertyRate = 13.3%, SUM(City.School.Students) = 1800, MAX(City.School.Students) = 1000. Does not scale: the number of features explodes with depth and the number of tables.
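A single depth-1 step of this idea, sketched in pandas (an illustration of the one-to-many aggregations, not the full recursive algorithm):

```python
import pandas as pd

cities = pd.DataFrame({"City": ["Palaiseau"], "Population": [33000]})
schools = pd.DataFrame({
    "City": ["Palaiseau", "Palaiseau"],
    "School": ["Lycee Camille Claudel", "Lycee Henri Poincare"],
    "Students": [800, 1000],
})

# one-to-many relation: compute several aggregations at once,
# then join them back onto the target table
agg = schools.groupby("City")["Students"].agg(["count", "sum", "max"])
agg.columns = [f"{name.upper()}(City.School.Students)" for name in agg.columns]
features = cities.merge(agg, left_on="City", right_index=True, how="left")
```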

Slide 37

Slide 37 text

Embeddings as assembly. Entity embeddings that distill information across tables: object → ℝᵖ. KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]

Slide 38

Slide 38 text

KEN: overall approach [Cvetkov-Iliev... 2023]. Strategy: convert the data to a graph of (head, relation, tail) triplets (RDF). [Figure: database tables of cities (Name, State, Lat) and counties (Name, Pop) converted to a triplet representation, e.g. (Harris, Pop, 4.7M), (Anaheim, County, Orange); entity embeddings trained with negative sampling.]

Slide 39

Slide 39 text

KEN: overall approach [Cvetkov-Iliev... 2023]. Strategy: convert the data to a graph of (head, relation, tail) triplets (RDF); adapt knowledge-graph embedding approaches to capture relations and numerical attributes; transfer the trained entity embeddings to enrich a second database for analysis. [Figure: triplet representation; embedding training with entity embeddings, relation operators, numerical-attribute embeddings, and negative sampling; transfer to an analysis table of county-level vote counts.]

Slide 40

Slide 40 text

From tables to (knowledge) graphs. Knowledge graphs = lists of triples (head, relation, tail), or (h, r, t), e.g. (Paris, capitalOf, France). Table representation (with City as the "head" column): San Francisco, Population 0.87M, State California; San Diego, Population 1.4M, State California. Triple representation: (San Francisco, Population, 0.87M), (San Francisco, State, California), (San Diego, Population, 1.4M), (San Diego, State, California). The two representations are (almost) equivalent.
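The near-equivalence is mechanical: each non-head cell yields one triple, with the column name as the relation. A minimal sketch:

```python
import pandas as pd

table = pd.DataFrame({"City": ["San Francisco", "San Diego"],
                      "Population": ["0.87M", "1.4M"],
                      "State": ["California", "California"]})

# one (head, relation, tail) triple per non-head cell
triples = [(row["City"], col, row[col])
           for _, row in table.iterrows()
           for col in ["Population", "State"]]
```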

Slide 41

Slide 41 text

Entity embeddings: contextual. Contextual: two entities have close embeddings if they co-occur. In NLP: word2vec = word co-occurrences. In knowledge graphs: RDF2vec. Input triples: (Facebook, FoundedIn, Massachusetts), (Facebook, HeadquartersIn, California), (MathWorks, FoundedIn, California), (MathWorks, HeadquartersIn, Massachusetts), (Google, FoundedIn, California), (Google, HeadquartersIn, California), (Apple, FoundedIn, California), (Apple, HeadquartersIn, California). [Figure: (a) contextual RDF2vec embeddings vs. (b) relational knowledge-graph embeddings of these entities.]

Slide 42

Slide 42 text

Entity embeddings: contextual < relational. Contextual: two entities have close embeddings if they co-occur. Relational: two entities are close if they have the same relations to other entities. [Figure: on the same input triples, (a) contextual RDF2vec embeddings vs. (b) relational knowledge-graph embeddings.]

Slide 43

Slide 43 text

Knowledge-graph embeddings to capture relations. TransE [Bordes... 2013] represents relation r as a translation vector r ∈ ℝᵖ between entity embeddings h and t. Scoring function: f(h, r, t) = −||h + r − t||, e.g. Paris + capitalOf ≈ France, Rome + capitalOf ≈ Italy. Training: optimize h, r, t to minimize a margin loss over corrupted triples:

L = Σ_{(h,r,t) ∈ G} Σ_{(h′,r,t′) ∉ G, with h′ = h or t′ = t} [f(h′, r, t′) − f(h, r, t) + γ]₊
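The score and one term of the margin loss, sketched in NumPy (illustrative; real training optimizes all embeddings by gradient descent over many corrupted triples):

```python
import numpy as np

def transe_score(h, r, t):
    # f(h, r, t) = -||h + r - t||: largest (zero) when the relation
    # vector translates the head embedding exactly onto the tail
    return -np.linalg.norm(h + r - t)

rng = np.random.RandomState(0)
paris, france, italy = rng.randn(3, 8)
capital_of = france - paris  # embeddings where (Paris, capitalOf, France) holds exactly

true_score = transe_score(paris, capital_of, france)
corrupted_score = transe_score(paris, capital_of, italy)  # corrupted tail t'
gamma = 1.0
# margin loss: push true triples to score at least gamma above corrupted ones
loss = max(corrupted_score - true_score + gamma, 0.0)
```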

Slide 44

Slide 44 text

KEN: embeddings to distill information [Cvetkov-Iliev... 2023]. 1. Capture one-to-many relations: use MuRE [Balazevic... 2019]. Scoring function: f(h, r, t) = −d(ρ_r ⊙ h, t + r_r)² + b_h + b_t, where ρ_r ⊙ h is a relation-specific contraction/projection of the head and r_r a translation of the tail. Enables a rich relational geometry.

Slide 45

Slide 45 text

KEN: embeddings to distill information [Cvetkov-Iliev... 2023]. 1. Capture one-to-many relations: use MuRE [Balazevic... 2019]. Scoring function: f(h, r, t) = −d(ρ_r ⊙ h, t + r_r)² + b_h + b_t, where ρ_r ⊙ h is a relation-specific contraction/projection of the head and r_r a translation of the tail. 2. Embed numerical attributes with attribute-specific mini-MLPs: a numerical relation (attribute) r of value x gets the representation e_r(x) = ReLU(x w_r + b_r), used in place of the tail embedding.
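The attribute-specific mini-MLP is a one-layer map from a scalar into embedding space; a NumPy sketch (the dimension and the rescaled input value are illustrative):

```python
import numpy as np

def embed_numeric(x, w_r, b_r):
    # e_r(x) = ReLU(x * w_r + b_r): maps a scalar attribute value
    # into embedding space; used in place of the tail embedding t
    return np.maximum(x * w_r + b_r, 0.0)

rng = np.random.RandomState(0)
dim = 16
# learned parameters of one numerical relation, e.g. "Population"
w_pop, b_pop = rng.randn(dim), rng.randn(dim)

e = embed_numeric(0.33, w_pop, b_pop)  # attribute value, assumed rescaled to [0, 1]
```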

Slide 46

Slide 46 text

KEN embeddings do distill the information [Cvetkov-Iliev... 2023]. Object → feature vector X ∈ ℝᵖ, almost as performant for analysis as combinatorial feature generation. Scalable to millions of entries. Captures multi-hop information (across multiple tables). Reconstructs well the distributions of numerical attributes (means, percentiles, counts) in one-to-many settings. Good features for neural nets.

Slide 47

Slide 47 text

Entity embeddings that distill information across tables, X ∈ ℝᵖ. KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]. soda-inria.github.io/ken_embeddings: 6 million common entities (cities, people, companies...). Example usage in the dirty-cat docs.

Slide 48

Slide 48 text

Representation learning + rich machine learning can partly automate data preparation: the promise of less manual work. But we have replaced one sausage factory (data massaging) with another (opaque representations and models). Why should we trust these?

Slide 49

Slide 49 text

Valid analysis? More learning or cleaning? [Cvetkov-Iliev... 2022]

Slide 50

Slide 50 text

More learning versus more cleaning [Cvetkov-Iliev... 2022]. A cross-institution study of salary, comparing entity matching vs. embeddings + machine learning. Entity matching = 3 days of manual labor, with imperfect results. Analysis, not prediction: machine learning as flexible estimators of conditional relations. Analytic questions can be reformulated, e.g. the sex pay gap = the causal effect of sex ⇒ double-ML estimators.

Slide 51

Slide 51 text

More learning versus more cleaning [Cvetkov-Iliev... 2022]. A cross-institution study of salary, comparing entity matching vs. embeddings + machine learning. Entity matching = 3 days of manual labor, with imperfect results. Analysis, not prediction: machine learning as flexible estimators of conditional relations. Analytic questions can be reformulated, e.g. the sex pay gap = the causal effect of sex ⇒ double-ML estimators. Validity is established via error on observables (cross-validation). Conclusion: both cleaning and learning help; embedding + learning goes far for little cost.
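The double-ML reformulation can be sketched with scikit-learn: cross-fitted flexible models partial the covariates out of both the outcome and the variable of interest, and a final linear fit on the residuals estimates the effect (a generic partialling-out sketch on synthetic data, not the paper's estimator):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
n = 500
W = rng.randn(n, 3)                             # covariates (e.g. position, experience)
T = (W[:, 0] + rng.randn(n) > 0).astype(float)  # variable of interest, confounded by W
y = 2.0 * T + W[:, 0] + rng.randn(n)            # outcome: true effect of T is 2.0

# cross-fitted nuisance models: predict y and T from W out-of-fold
rf = RandomForestRegressor(n_estimators=50, random_state=0)
res_y = y - cross_val_predict(rf, W, y, cv=5)
res_t = T - cross_val_predict(rf, W, T, cv=5)

# the effect is the slope of outcome residuals on treatment residuals
effect = LinearRegression().fit(res_t.reshape(-1, 1), res_y).coef_[0]
```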

Slide 52

Slide 52 text

Valid analyses. Opinion: more learning, rather than cleaning [Cvetkov-Iliev... 2022]. Cleaning and modeling: human-auditable, but sometimes in the eye of the beholder. Learning: validity on observables.

Slide 53

Slide 53 text

The soda team: machine learning for health and social sciences. Tabular relational learning: relational databases, data lakes. Health and social sciences: epidemiology, education, psychology. Machine learning for statistics: causal inference, biases, missing values. Data-science software: scikit-learn, joblib, dirty-cat.

Slide 54

Slide 54 text

Representations of relational data. Trees work very well on a data table: not tied to smooth geometry or gradients. I seek continuous representations of complex discrete objects: lack of obvious regularities; string-based representations; embedding a large database graph. Software: dirty-cat (dirty-cat.github.io). @GaelVaroquaux

Slide 55

Slide 55 text

References I

I. Balazevic, C. Allen, and T. Hospedales. Multi-relational Poincaré graph embeddings. Neural Information Processing Systems, 32:4463, 2019.

A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.

P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. Transactions in Knowledge and Data Engineering, 2020.

A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Analytics on non-normalized data sources: more learning, rather than more cleaning. IEEE Access, 2022.

A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning, pages 1–34, 2023.

L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

Slide 56

Slide 56 text

References II

J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.

Kaggle. Kaggle industry survey, 2018. URL https://www.kaggle.com/ash316/novice-to-grandmaster.

J. M. Kanter and K. Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10, 2015.

H. T. Lam, B. Buesser, H. Min, T. N. Minh, M. Wistuba, U. Khurana, G. Bramble, T. Salonidis, D. Wang, and H. Samulowitz. Automated data science for relational data. In International Conference on Data Engineering (ICDE), page 2689. IEEE, 2021.

M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What's a good imputation to predict with missing values? NeurIPS, 2021.

A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline. Benchmarking missing-values approaches for predictive models on health databases. GigaScience, 11, 2022.