Representation learning on relational data to automate data preparation

In standard data-science practice, a significant effort is spent on preparing the data before statistical learning. One reason is that the data come from various tables, each with its own subject matter, its specificities. This is unlike natural images, or even natural text, where universal regularities have enabled representation learning, fueling the deep learning revolution.

I will present progress on learning representations with data tables, overcoming the lack of simple regularities. I will show how these representations decrease the need for data preparation, such as matching entities and aggregating the data across tables. Character-level modeling enables statistical learning without normalized entities, as in the dirty-cat library. Representation learning across many tables, describing objects of different natures and varying attributes, can aggregate the distributed information, forming vector representations of entities. As a result, we created general-purpose embeddings that enrich many data analyses by summarizing all the numerical and relational information in Wikipedia for millions of entities: cities, people, companies, books.
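
As a rough illustration of character-level modeling, the sketch below compares strings through their character 3-grams, so that closely related entries come out as similar without any prior entity normalization. It uses a plain Jaccard similarity, an assumption made for brevity rather than the encoders implemented in dirty-cat, and the function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are considered identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # Related job titles share many 3-grams, unrelated ones share few or none.
    print(ngram_similarity("Police Officer III", "Master Police Officer"))
    print(ngram_similarity("Police Officer III", "Bus Operator"))
```

Such a similarity is defined for any pair of strings, including entries never seen before, which is what makes it usable on non-normalized columns.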

[1] Marine Le Morvan, Julie Josse, Erwan Scornet, and Gaël Varoquaux. "What’s a good imputation to predict with missing values?" Advances in Neural Information Processing Systems 34 (2021): 11530-11540.

[2] Patricio Cerda, and Gaël Varoquaux. "Encoding high-cardinality string categorical variables." IEEE Transactions on Knowledge and Data Engineering (2020).

[3] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux. "Analytics on Non-Normalized Data Sources: more Learning, rather than more Cleaning." IEEE Access 10 (2022): 42420-42431.

[4] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux. "Relational data embeddings for feature enrichment with background information." Machine Learning (2023): 1-34.

Gael Varoquaux

April 13, 2023

Transcript

  1. Data preparation is crucial to analysis

    Better pipelines can reduce this need. Focus on supervised learning: “good” representations and models = those that give good predictions. But supervised learning is more: weakly-parametric estimators of conditional relations. G Varoquaux 1
  2. 1 Data tables, not vector spaces

    Gender  Experience  Age  Employee Position Title
    M       10 yrs      42   Master Police Officer
    F       23 yrs      NA   Social Worker IV
    M       3 yrs       28   Police Officer III
    F       16 yrs      45   Police Aide
    M       13 yrs      48   Electrician I
    M       6 yrs       36   Bus Operator
    M       NA          62   Bus Operator
    F       9 yrs       35   Social Worker III
    F       NA          39   Library Assistant II
    M       8 yrs       NA   Library Assistant I
  3. Data modeling practices

    Count, normalize, encode. Transform everything to numbers. It’s the nature of statistics. We must feed the models.
  4. Adapting the data to our models

    Improving data & knowledge representation: curating it, transforming it, not automated by traditional machine learning. Data massaging: mostly pandas and SQL scripts.
  5. Adapting the data to our models

    Improving data & knowledge representation: curating it, transforming it, not automated by traditional machine learning. Data massaging: mostly pandas and SQL scripts. Data preparation = #1 challenge (“Dirty data”) [Kaggle 2018, Lam... 2021] www.kaggle.com/ash316/novice-to-grandmaster
  6. Deep learning underperforms on data tables [Grinsztajn... 2022]

    Tailored deep-learning architectures. But tree-based methods perform best. [Figure: normalized test accuracy of the best model (on the validation set) up to this iteration, from 0.6 to 1.0, against random search time in seconds, from 1e+01 to 1e+05, for FT Transformer, GradientBoostingTree, MLP, RandomForest, Resnet, SAINT and XGBoost.]
  7. Deep learning underperforms on data tables [Grinsztajn... 2022]

    Tabular data: various non-Gaussian marginals, many categorical features. Trees’ inductive bias: axis-aligned, each column is meaningful, non-smooth. The data’s natural geometry is neither smooth nor vectorial. Our toolkit is based on smooth optimization in vector spaces.
  8. Missing Data

    Frequent in health & social sciences. $\mathbb{R}^p \cup \{\mathrm{NA}\}$ is not a vector space.
  9. Impute & regress [Le Morvan... 2021]

    Impute: fill in the blanks with likely values. Standard statistical inference needs missing at random: missingness is independent from unseen values.
  10. Impute & regress [Le Morvan... 2021]

    Impute: fill in the blanks with likely values. Standard statistical inference needs missing at random: missingness is independent from unseen values. [Figure: complete data and imputed data (manifolds).] Theorem (informal): a universally consistent learner leads to optimal prediction for all missing data mechanisms and almost all imputation functions. Asymptotically, imputing well is not needed to predict well.
  11. Impute & regress [Le Morvan... 2021]

    Impute: fill in the blanks with likely values. Standard statistical inference needs missing at random: missingness is independent from unseen values. Theorem (informal): a universally consistent learner leads to optimal prediction for all missing data mechanisms and almost all imputation functions. Asymptotically, imputing well is not needed to predict well. Imputation and regression must be jointly optimized. When imputing with $\mathbb{E}[X_{mis} \mid X_{obs}]$, the optimal regressor to predict is discontinuous.
  12. Trees handling missing values

    MIA (Missing Incorporated Attribute) [Josse... 2019]. [Figure: decision tree with splits x10 < -1.5, x2 < 2, x7 < 0.3 and x1 < 0.5, where each split sends missing values to one branch (Yes/Missing or No/Missing), down to a leaf such as Predict +1.3.] sklearn HistGradientBoostingClassifier: the learner readily handles missing values.
  13. Trees handling missing values

    MIA (Missing Incorporated Attribute) [Josse... 2019]. [Figure: decision tree with splits x10 < -1.5, x2 < 2, x7 < 0.3 and x1 < 0.5, where each split sends missing values to one branch (Yes/Missing or No/Missing), down to a leaf such as Predict +1.3.] sklearn HistGradientBoostingClassifier: the learner readily handles missing values. Benchmarks [Perez-Lebel... 2022]: tree handling of missing values works best. Imputation works well, but is expensive.
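
The idea of sending missing values down one branch of a split can be sketched in plain Python. This is a simplified illustration for a single regression split, assuming a squared-error criterion. It is not scikit-learn's implementation, it ignores the extra "missing versus observed" split that MIA also considers, and the function names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def mia_split(x, y, threshold):
    """Evaluate the split `x < threshold`, trying both branches for missing values.

    Returns (missing_goes_left, error) for the better of the two options.
    """
    left = [yi for xi, yi in zip(x, y) if not is_missing(xi) and xi < threshold]
    right = [yi for xi, yi in zip(x, y) if not is_missing(xi) and xi >= threshold]
    missing = [yi for xi, yi in zip(x, y) if is_missing(xi)]

    error_if_left = sse(left + missing) + sse(right)
    error_if_right = sse(left) + sse(right + missing)
    if error_if_left <= error_if_right:
        return True, error_if_left
    return False, error_if_right
```

Because the branch for missing values is chosen by the training criterion, the tree can exploit informative missingness without any imputation step.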
  14. Missing Data

    $\mathbb{R}^p \cup \{\mathrm{NA}\}$ is not a vector space. Imputation is not about finding likely values, but rather a representation to facilitate learning [Le Morvan... 2021].
  15. Entity representations

    Open-ended entries. Employee Position Title: Master Police Officer; Social Worker IV; Police Officer III; Police Aide; Electrician I; Bus Operator; Bus Operator; Social Worker III.
  16. Modeling strings, rather than categories

    Notion of category ⇔ entity normalization. Drug Name: alcohol ethyl alcohol isopropyl alcohol polyvinyl alcohol isopropyl alcohol swab 62% ethyl alcohol alcohol 68% alcohol denat benzyl alcohol dehydrated alcohol. Employee Position Title: Police Aide Master Police Officer Mechanic Technician II Police Officer III Senior Architect Senior Engineer Technician Social Worker III.
  17. Modeling strings: GapEncoder = string embeddings [Cerda and Varoquaux 2020]

    Factorizing sub-string count matrices. Models strings as a linear combination of substrings. [Figure: binary matrix of 3-gram occurrences (columns: er_, cer, fic, off, _of, ce_, ice, lic, pol) for the strings police, officer, pol off, polis, policeman, policier.]
  18. Modeling strings: GapEncoder = string embeddings [Cerda and Varoquaux 2020]

    Factorizing sub-string count matrices. Models strings as a linear combination of substrings. [Figure: the 3-gram count matrix (columns: er_, cer, fic, off, _of, ce_, ice, lic, pol) for the strings police, officer, pol off, polis, policeman, policier is factorized into two matrices: documents × topics (what latent categories are in an entry) and topics × substrings (what substrings are in a latent category).]
  19. GapEncoder: String embeddings capturing latent categories [Cerda and Varoquaux 2020]

    [Figure: activations of the latent categories for the entries Legislative Analyst II Legislative Attorney Equipment Operator I Transit Coordinator Bus Operator Senior Architect Senior Engineer Technician Financial Programs Manager Capital Projects Manager Mechanic Technician II Master Police Officer Police Sergeant.] Code: dirty-cat.github.io
  20. GapEncoder: String embeddings capturing latent categories [Cerda and Varoquaux 2020]

    Plausible feature names. [Figure: activations of the latent categories, labeled with feature names (istant, library; pment, operator; ion, specialist; rker, warehouse; rogram, manager; anic, community; rescuer, rescue; ection, officer), for the entries Legislative Analyst II Legislative Attorney Equipment Operator I Transit Coordinator Bus Operator Senior Architect Senior Engineer Technician Financial Programs Manager Capital Projects Manager Mechanic Technician II Master Police Officer Police Sergeant.]
  21. Representations tailored to the data [Cerda and Varoquaux 2020]

    fasttext: almost as good as GapEncoder, if in the right language. [Figure: relative score (%), from 0 to 100, of One-hot + SVD, Similarity encoding, FastText + SVD and Gamma-Poisson factorization; second panel: relative score (%) of FastText + SVD (d=30) for English, French and Hungarian.]
  22. Vectorizing tables: the TableVectorizer

    The dirty-cat software: dirty-cat.github.io. TableVectorizer: X = tab_vec.fit_transform(df). Heuristics for different columns: strings with ≥ 30 categories ⇒ GapEncoder; date/time ⇒ DateTimeEncoder; non-string discrete ⇒ TargetEncoder; ... Strong baseline.
  23. Data tables

    - Heterogeneous columns
    - Missing values
    - Open-ended strings
    Tree-based models: sklearn HistGradientBoosting. Column encoding: dirty_cat TableVectorizer.
  24. Example data-science analysis

    Real-estate market: expected price of a property? Predict the price from relevant information available: age, surface area, # of rooms, floor, location, ...
  25. Example data-science analysis

    Data may need to be merged across tables.
    Rents (City, Rent): (Paris, 1100€), (Vitry, 700€), (Paris, 1300€).
    Populations (City, Pop.): (Paris, 2.2M), (Vitry, 33k).
    Merged (City, Rent, Population): (Paris, 1100€, 2.2M), (Vitry, 700€, 33k), (Paris, 1300€, 2.2M).
  26. Example data-science analysis

    Aggregations may be needed across different data granularity.
    Rents (City, Rent): (Paris, 1100€), (Vitry, 700€), (Paris, 1300€).
    Populations (City, Pop.): (Paris, 2.2M), (Vitry, 33k).
    Persons (Person ID, City, Salary): (P1, Paris, 50k€), (P2, Paris, 40k€), (P3, Vitry, 34k€), (P4, Vitry, 38k€).
    GroupBy + Avg gives the mean salary per city: Paris 45k€, Vitry 36k€.
    Merged (City, Rent, Population, Mean salary): (Paris, 1100€, 2.2M, 45k€), (Vitry, 700€, 33k, 36k€), (Paris, 1300€, 2.2M, 45k€).
  27. Example data-science analysis

    Multiple hops may be needed.
    Rents (City, Rent): (Paris, 1100€), (Vitry, 700€), (Paris, 1300€).
    Populations (City, Pop.): (Paris, 2.2M), (Vitry, 33k).
    Persons (Person ID, City, Salary): (P1, Paris, 50k€), (P2, Paris, 40k€), (P3, Vitry, 34k€), (P4, Vitry, 38k€).
    Departments (City, Department): (Paris, Paris), (Vitry-sur-Seine, Val-de-Marne).
    Poverty (Department, Poverty rate): (Paris, 15.2%), (Val-de-Marne, 13.3%).
    GroupBy + Avg gives the mean salary per city: Paris 45k€, Vitry 36k€.
    Merged (City, Rent, Poverty rate, Population, Mean salary): (Paris, 1100€, 15.2%, 2.2M, 45k€), (Vitry, 700€, 13.3%, 33k, 36k€), (Paris, 1300€, 15.2%, 2.2M, 45k€).
  28. Example data-science analysis

    Joining tables, aggregations, multiple hops (same tables as on the previous slide). Difficult for humans: requires expertise on the data. Difficult for machine learning: discrete choices, combinatorial optim
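
The manual preparation shown on these slides (join, GroupBy + Avg, then two hops to reach the poverty rate) can be sketched in plain Python on the toy tables. The table contents are copied from the slides; the variable names, the `city_aliases` mapping and the helper functions are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Toy tables copied from the slides.
rents = [("Paris", 1100), ("Vitry", 700), ("Paris", 1300)]  # rent in €
population = {"Paris": "2.2M", "Vitry": "33k"}
salaries = [("P1", "Paris", 50), ("P2", "Paris", 40),
            ("P3", "Vitry", 34), ("P4", "Vitry", 38)]  # salary in k€
city_to_department = {"Paris": "Paris", "Vitry-sur-Seine": "Val-de-Marne"}
poverty_rate = {"Paris": 15.2, "Val-de-Marne": 13.3}  # in %

# Hypothetical manual entity matching: "Vitry" and "Vitry-sur-Seine"
# name the same city in different tables.
city_aliases = {"Vitry": "Vitry-sur-Seine"}


def mean_salary_by_city(rows):
    """GroupBy + Avg: aggregate person-level salaries to the city level."""
    by_city = defaultdict(list)
    for _person_id, city, salary in rows:
        by_city[city].append(salary)
    return {city: sum(values) / len(values) for city, values in by_city.items()}


def enrich_rents():
    """Join the rent table with population, mean salary and poverty rate."""
    mean_salary = mean_salary_by_city(salaries)
    enriched = []
    for city, rent in rents:
        # Two hops: city -> department -> poverty rate.
        department = city_to_department[city_aliases.get(city, city)]
        enriched.append({
            "city": city,
            "rent": rent,
            "population": population[city],
            "mean_salary": mean_salary[city],
            "poverty_rate": poverty_rate[department],
        })
    return enriched
```

Every step encodes a discrete choice (which tables to join, which aggregation to apply, which names to match), which is the expertise the slide describes as hard to automate.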
  29. Example data-science analysis

    We need statistics and learning across tables.
    Rents (City, Rent): (Paris, 1100€), (Vitry, 700€), (Paris, 1300€).
    Populations (City, Pop.): (Paris, 2.2M), (Vitry, 33k).
    Persons (Person ID, City, Salary): (P1, Paris, 50k€), (P2, Paris, 40k€), (P3, Vitry, 34k€), (P4, Vitry, 38k€).
    Departments (City, Department): (Paris, Paris), (Vitry-sur-Seine, Val-de-Marne).
    Poverty (Department, Poverty rate): (Paris, 15.2%), (Val-de-Marne, 13.3%).
    GroupBy + Avg and joins give, for the three rent rows: Poverty rate 15.2%, 13.3%, 15.2%; Population 2.2M, 33k, 2.2M; Mean salary 45k€, 36k€, 45k€.
  30. Relational data challenges statistical learning

    Statistics and learning use repetitions and regularities. Relational data: discrete objects, different tables, different natures (properties, person, cities, departments...). No clear repetition, regularity, metric, smoothness.
  31. Deep Feature Synthesis [Kanter and Veeramachaneni 2015]

    Greedily:
    - starts from a target table
    - recursively joins related tables, to a given depth
    One-to-many relations: computes different aggregations (COUNT, SUM, LAST, MAX...).
    Target table, depth 0 (City, Population): (Palaiseau, 33k).
    Depth 1 (City, School): (Palaiseau, Lycée Camille Claudel), (Palaiseau, Lycée Henri Poincaré). Depth 1 (City, Department): (Palaiseau, Essonne).
    Depth 2 (School, Students): (Lycée Camille Claudel, 800), (Lycée Henri Poincaré, 1000). Depth 2 (Department, PovertyRate): (Essonne, 13.3%).
    Result for Palaiseau: Population = 33k; COUNT(City.School) = 2; City.Department = Essonne; City.Department.PovertyRate = 13.3%; SUM(City.School.Students) = 1800; MAX(City.School.Students) = 800.
    Does not scale: # features explodes with depth and # tables.
  32. Embeddings as assembly

    Entity embeddings that distill information across tables: Object → $\mathbb{R}^p$. KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023].
  33. KEN: Overall approach [Cvetkov-Iliev... 2023]

    Strategy: convert data to graph (RDF triplets). [Figure: database tables describing cities and counties are converted to a triplet representation (head, relation, tail), which is used to train entity embeddings with negative sampling.]
  34. KEN: Overall approach [Cvetkov-Iliev... 2023]

    Strategy: convert data to graph (RDF triplets). Adapt knowledge-graph embedding approaches. Capture relations and numerical attributes. [Figure: the triplet representation (head, relation, tail) is used to train embeddings, with an entity embedding, a relation operator, a numerical attribute embedding, and negative sampling for training; the embeddings of entities such as Harris, Orange and Sherman are then transferred to the analysis of a second database with County and Votes columns.]
  35. From tables to (knowledge) graphs

    Knowledge graphs = list of triples (head, relation, tail) or (h, r, t), e.g. (Paris, capitalOf, France). The two representations are (almost) equivalent.
    Table representation (“head” column: City):
    City           Population  State
    San Francisco  0.87M       California
    San Diego      1.4M        California
    Triple / knowledge graph representation:
    (San Francisco, Population, 0.87M)
    (San Francisco, State, California)
    (San Diego, Population, 1.4M)
    (San Diego, State, California)
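
The conversion from a table to triples can be sketched in a few lines of Python: each non-head cell becomes one (head, relation, tail) triple, with the column name as the relation. The function name and the choice to skip `None` cells are assumptions made for this illustration.

```python
def table_to_triples(rows, head_column):
    """Turn table rows (dicts) into (head, relation, tail) triples.

    The value in `head_column` is the head entity, every other column name
    is a relation, and the cell value is the tail. Empty cells are skipped.
    """
    triples = []
    for row in rows:
        head = row[head_column]
        for column, value in row.items():
            if column != head_column and value is not None:
                triples.append((head, column, value))
    return triples


if __name__ == "__main__":
    cities = [
        {"City": "San Francisco", "Population": "0.87M", "State": "California"},
        {"City": "San Diego", "Population": "1.4M", "State": "California"},
    ]
    for triple in table_to_triples(cities, head_column="City"):
        print(triple)
```

Skipping empty cells is one reason the two representations are only almost equivalent: a missing value simply produces no triple.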
  36. Entity embeddings: contextual

    Contextual: two entities have close embeddings if they co-occur. In NLP: word2vec = word co-occurrences. In knowledge graphs: RDF2vec.
    Input triples: (Facebook, FoundedIn, Massachusetts), (Facebook, HeadquartersIn, California), (MathWorks, FoundedIn, California), (MathWorks, HeadquartersIn, Massachusetts), (Google, FoundedIn, California), (Google, HeadquartersIn, California), (Apple, FoundedIn, California), (Apple, HeadquartersIn, California).
    [Figure: a) Contextual: RDF2vec embeddings; b) Relational: knowledge graph embeddings. Both panels show Google, Apple, Facebook, MathWorks, California and Massachusetts, with the FoundedIn and HeadquartersIn relations.]
  37. Entity embeddings: contextual < relational

    Contextual: two entities have close embeddings if they co-occur. Relational: two entities are close if they have the same relations to other entities.
    Input triples: (Facebook, FoundedIn, Massachusetts), (Facebook, HeadquartersIn, California), (MathWorks, FoundedIn, California), (MathWorks, HeadquartersIn, Massachusetts), (Google, FoundedIn, California), (Google, HeadquartersIn, California), (Apple, FoundedIn, California), (Apple, HeadquartersIn, California).
    [Figure: a) Contextual: RDF2vec embeddings; b) Relational: knowledge graph embeddings. Both panels show Google, Apple, Facebook, MathWorks, California and Massachusetts, with the FoundedIn and HeadquartersIn relations.]
  38. Knowledge-graph embeddings to capture relations

    TransE [Bordes... 2013] represents relation r as a translation vector $r \in \mathbb{R}^p$ between entity embeddings h and t. Scoring function: $f(h, r, t) = -\lVert h + r - t \rVert$. [Figure: capitalOf as a translation between the embeddings of Paris, France, Rome and Italy.] Training: optimize h, r, t to minimize a margin loss: $L = \sum_{(h,r,t) \in G,\ (h',r,t') \notin G \text{ with } h'=h \text{ or } t'=t} \left[ f(h', r, t') - f(h, r, t) + \gamma \right]_+$
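
The TransE scoring function and the margin loss for one positive/negative pair of triples can be sketched in plain Python. This sketch assumes a Euclidean norm and leaves out the optimization loop and the negative sampling; the function names are hypothetical.

```python
import math


def transe_score(h, r, t):
    """f(h, r, t) = -||h + r - t||, assuming the Euclidean norm."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(positive, negative, margin=1.0):
    """[f(negative) - f(positive) + margin]_+ for one pair of triples.

    `positive` is a triple (h, r, t) of vectors from the graph; `negative`
    is a corrupted triple (h', r, t') with either the head or the tail replaced.
    """
    return max(0.0, transe_score(*negative) - transe_score(*positive) + margin)


if __name__ == "__main__":
    # Hand-picked 2-d embeddings such that paris + capital_of == france.
    paris, france, rome = [1.0, 0.0], [1.0, 1.0], [3.0, 0.0]
    capital_of = [0.0, 1.0]
    positive = (paris, capital_of, france)
    negative = (rome, capital_of, france)  # corrupted head
    print(transe_score(*positive), transe_score(*negative))
    print(margin_loss(positive, negative, margin=1.0))
```

The loss is zero as soon as the true triple outscores the corrupted one by at least the margin, so training only moves embeddings involved in pairs that are not yet separated.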
  39. KEN: embeddings to distill information [Cvetkov-Iliev... 2023]

    1. Capture one-to-many relations: use MuRE [Balazevic... 2019]. Scoring function: $f(h, r, t) = -d(\rho_r \odot h,\ t + r_r)^2 + b_h + b_t$, where $\rho_r \odot h$ is a contraction / projection and $t + r_r$ is a translation. Enables rich relational geometry. [Figure: embeddings of Google, Apple, Facebook, MathWorks, California and Massachusetts, with the FoundedIn and HeadquartersIn relations.]
  40. KEN: embeddings to distill information [Cvetkov-Iliev... 2023]

    1. Capture one-to-many relations: use MuRE [Balazevic... 2019]. Scoring function: $f(h, r, t) = -d(\rho_r \odot h,\ t + r_r)^2 + b_h + b_t$, where $\rho_r \odot h$ is a contraction / projection and $t + r_r$ is a translation.
    2. Embed numerical attributes: attribute-specific mini MLP. A numerical relation (attribute) r of value x has the representation $e_r(x) = \mathrm{ReLU}(x\, w_r + b_r)$. Use $e_r(x)$ in place of the tail embedding.
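
A minimal sketch of the two formulas on this slide: the MuRE-style score and the numerical attribute embedding used in place of a tail embedding. It assumes that d is the Euclidean distance and that all parameters are plain lists of floats; the function names and the toy values are hypothetical, and no training is shown.

```python
def embed_numeric(x, weights, biases):
    """e_r(x) = ReLU(x * w_r + b_r), applied coordinate-wise."""
    return [max(0.0, x * w + b) for w, b in zip(weights, biases)]


def mure_score(h, rho, r_vec, t, b_h=0.0, b_t=0.0):
    """f(h, r, t) = -d(rho_r * h, t + r_r)^2 + b_h + b_t.

    `rho` rescales the head coordinate-wise (contraction / projection) and
    `r_vec` translates the tail; d is assumed to be the Euclidean distance.
    """
    squared_distance = sum(
        (rho_i * h_i - (t_i + r_i)) ** 2
        for rho_i, h_i, t_i, r_i in zip(rho, h, t, r_vec)
    )
    return -squared_distance + b_h + b_t


if __name__ == "__main__":
    # Score a triple (city, Population, 2.0) whose tail is a number:
    # the numerical value is embedded, then used as the tail embedding.
    city = [1.0, 2.0]
    tail = embed_numeric(2.0, weights=[1.0, -1.0], biases=[0.0, 0.5])
    print(mure_score(city, rho=[2.0, 0.5], r_vec=[0.0, 1.0], t=tail))
```

Since the numerical embedding lives in the same space as entity embeddings, the same scoring function and training loss can be reused for relational and numerical triples.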
  41. KEN embeddings do distill the information [Cvetkov-Iliev... 2023]

    Object → $X \in \mathbb{R}^p$. Feature vector almost as performant for analysis as combinatorial feature generation. Scalable to millions of entries. Captures multi-hop information (across multiple tables). Reconstructs well the distributions of numerical attributes (mean, percentiles, counts) in one-to-many settings. Good features for neural nets.
  42. Entity embeddings that distill information across tables

    KEN: knowledge embedding with numbers [Cvetkov-Iliev... 2023]. $X \in \mathbb{R}^p$. soda-inria.github.io/ken_embeddings: 6 million common entities (cities, people, companies...). Example usage in dirty-cat docs.
  43. Representation learning + rich machine learning

    Can partly automate data preparation: the promise of less manual work. But we have replaced one sausage factory (data massaging) by another (opaque representations and models). Why should we trust these?
  44. More learning versus more cleaning [Cvetkov-Iliev... 2022]

    A cross-institution study of salary, comparing entity matching vs embeddings + machine learning. Entity matching = 3 days of manual labor, results imperfect. Analysis, not prediction: machine learning as flexible estimators of conditional relations. Analytic questions can be reformulated, e.g. sex pay gap = causal effect of sex ⇒ double ML estimators.
  45. More learning versus more cleaning [Cvetkov-Iliev... 2022]

    A cross-institution study of salary, comparing entity matching vs embeddings + machine learning. Entity matching = 3 days of manual labor, results imperfect. Analysis, not prediction: machine learning as flexible estimators of conditional relations. Analytic questions can be reformulated, e.g. sex pay gap = causal effect of sex ⇒ double ML estimators. Validity established via error on observables (cross-validation). Conclusion: both cleaning & learning help. Embedding + learning goes far for little cost.
  46. Valid analyses

    Opinion: more learning rather than cleaning [Cvetkov-Iliev... 2022]. Cleaning, modeling: human auditable, sometimes in the eye of the beholder. Learning: validity on observables.
  47. The soda team: Machine learning for health and social sciences

    Tabular relational learning: relational databases, data lakes. Health and social sciences: epidemiology, education, psychology. Machine learning for statistics: causal inference, biases, missing values. Data-science software: scikit-learn, joblib, dirty-cat.
  48. Representations of relational data

    Trees work very well on a data table
    - Not tied to smooth geometry / gradients
    I seek continuous representations of complex discrete objects
    - Lack of obvious regularities
    - String-based representations
    - Embedding a large database graph
    Software: dirty-cat, dirty-cat.github.io. @GaelVaroquaux
  49. References I

    I. Balazevic, C. Allen, and T. Hospedales. Multi-relational Poincaré graph embeddings. Neural Information Processing Systems, 32:4463, 2019.
    A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
    P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
    A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Analytics on non-normalized data sources: more learning, rather than more cleaning. IEEE Access, 2022.
    A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning, pages 1–34, 2023.
    L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  50. References II

    J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
    Kaggle. Kaggle industry survey, 2018. URL https://www.kaggle.com/ash316/novice-to-grandmaster.
    J. M. Kanter and K. Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10, 2015.
    H. T. Lam, B. Buesser, H. Min, T. N. Minh, M. Wistuba, U. Khurana, G. Bramble, T. Salonidis, D. Wang, and H. Samulowitz. Automated data science for relational data. In International Conference on Data Engineering (ICDE), page 2689. IEEE, 2021.
    M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict with missing values? NeurIPS, 2021.
    A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline. Benchmarking missing-values approaches for predictive models on health databases. GigaScience, 11, 2022.