In standard data-science practice, significant effort is spent on preparing the data before statistical learning. One reason is that the data come from various tables, each with its own subject matter and specificities. This is unlike natural images, or even natural text, where universal regularities have enabled representation learning, fueling the deep learning revolution.
I will present progress on learning representations with data tables, overcoming the lack of simple regularities. I will show how these representations decrease the need for data preparation: matching entities and aggregating data across tables. Character-level modeling enables statistical learning without normalized entities, as in the dirty-cat library. Representation learning across many tables, describing objects of different natures and varying attributes, can aggregate the distributed information into vector representations of entities. As a result, we created general-purpose embeddings that enrich many data analyses by summarizing all the numerical and relational information in Wikipedia for millions of entities: cities, people, companies, books.
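As an illustration of character-level modeling of non-normalized entities, the sketch below uses dirty-cat's SimilarityEncoder, which follows the scikit-learn transformer API and encodes string categories through character n-gram similarity. The job-title strings are made up for the example; this is a minimal sketch of the idea, not the exact pipeline discussed in the talk.

```python
# Minimal sketch: encoding dirty, non-normalized category strings with
# dirty-cat, so that near-duplicate entries get close representations
# without manual entity matching or deduplication.
from dirty_cat import SimilarityEncoder

# Hypothetical job titles: the same entities spelled inconsistently.
job_titles = [
    ["senior software engineer"],
    ["Senior Software Engr"],
    ["accountant"],
    ["accountant II"],
]

encoder = SimilarityEncoder()
X = encoder.fit_transform(job_titles)  # n-gram similarity to reference categories
print(X.shape)
print(X)
```

Because the encoding is based on substring overlap rather than exact string matching, "senior software engineer" and "Senior Software Engr" end up with similar feature vectors, which is what lets downstream statistical learning work on the raw, unnormalized entries.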
[1] Marine Le Morvan, Julie Josse, Erwan Scornet, and Gaël Varoquaux. "What's a good imputation to predict with missing values?" Advances in Neural Information Processing Systems 34 (2021): 11530-11540.
[2] Patricio Cerda, and Gaël Varoquaux. "Encoding high-cardinality string categorical variables." IEEE Transactions on Knowledge and Data Engineering (2020).
[3] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux. "Analytics on non-normalized data sources: more learning, rather than more cleaning." IEEE Access 10 (2022): 42420-42431.
[4] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux. "Relational data embeddings for feature enrichment with background information." Machine Learning (2023): 1-34.