
Representation learning on relational data to automate data preparation


In standard data-science practice, significant effort is spent on preparing the data before statistical learning. One reason is that the data come from various tables, each with its own subject matter and specificities. This is unlike natural images, or even natural text, where universal regularities have enabled representation learning, fueling the deep learning revolution.

I will present progress on learning representations with data tables, overcoming the lack of simple regularities. I will show how these representations decrease the need for data preparation: matching entities, aggregating the data across tables. Character-level modeling enables statistical learning without normalized entities, as in the dirty-cat library. Representation learning across many tables, describing objects of different natures and varying attributes, can aggregate the distributed information, forming vector representations of entities. As a result, we created general-purpose embeddings that enrich many data analyses by summarizing all the numerical and relational information in Wikipedia for millions of entities: cities, people, companies, books.

[1] Marine Le Morvan, Julie Josse, Erwan Scornet, and Gaël Varoquaux (2021). What’s a good imputation to predict with missing values? Advances in Neural Information Processing Systems, 34, 11530-11540.

[2] Patricio Cerda and Gaël Varoquaux (2020). Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering.

[3] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux (2022). Analytics on non-normalized data sources: more learning, rather than more cleaning. IEEE Access, 10, 42420-42431.

[4] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux (2023). Relational data embeddings for feature enrichment with background information. Machine Learning, 1-34.

Gael Varoquaux

April 13, 2023


Transcript

  1. Representation learning on relational data
    to automate data preparation
    Gaël Varoquaux


  2. Data preparation
    is crucial to analysis
    Better pipelines can reduce
    this need
    Focus on supervised learning:
    “good” representations, models
    = gives good predictions
    But supervised learning is more:
    weakly-parametric estimators of
    conditional relations
    G Varoquaux 1


  3. 1 Data tables, not vector spaces
    Gender Experience Age Employee Position Title
    M 10 yrs 42 Master Police Officer
    F 23 yrs NA Social Worker IV
    M 3 yrs 28 Police Officer III
    F 16 yrs 45 Police Aide
    M 13 yrs 48 Electrician I
    M 6 yrs 36 Bus Operator
    M NA 62 Bus Operator
    F 9 yrs 35 Social Worker III
    F NA 39 Library Assistant II
    M 8 yrs NA Library Assistant I


  4. In data science
    most data is tabular
    G Varoquaux 3


  5. Data modeling practices
    Count, normalize, encode
    Transform everything to numbers
    It’s the nature of statistics
    We must feed the models
    G Varoquaux 4


  6. Adapting the data to our models
    Improving data & knowledge representation:
    curating it,
    transforming it,
    not automated by traditional machine learning
    Data massaging Mostly pandas and SQL scripts
    G Varoquaux 5


  7. Adapting the data to our models
    Improving data & knowledge representation:
    curating it,
    transforming it,
    not automated by traditional machine learning
    Data massaging Mostly pandas and SQL scripts
    Data preparation = #1 challenge (“Dirty data”)
    [Kaggle 2018, Lam... 2021]
    www.kaggle.com/ash316/novice-to-grandmaster
    G Varoquaux 5


  8. Data massaging
    is exhausting
    Will deep learning
    save us?
    G Varoquaux 6


  9. Deep learning underperforms on data tables [Grinsztajn... 2022]
    Tailored deep-learning architectures
    But tree-based methods perform best
    [Benchmark figure: normalized test accuracy of the best model so far (selected on the validation set) vs. random-search time in seconds, for FT Transformer, MLP, Resnet, SAINT, GradientBoostingTree, RandomForest, XGBoost]
    G Varoquaux 7


  10. Deep learning underperforms on data tables [Grinsztajn... 2022]
    Tabular data
    Various non-Gaussian marginals
    Many categorical features
    Trees’ inductive bias:
    Axis-aligned
    Each column is meaningful
    Non-smooth
    The data’s natural geometry is neither smooth nor vectorial
    Our toolkit is based on smooth optimization in vector spaces
    G Varoquaux 7


  11. Missing Data
    Frequent in
    health & social sciences
    ℝᵖ ∪ {NA} not a vector space
    G Varoquaux 8


  12. Impute & regress [Le Morvan... 2021]
    Impute: fill in the blanks with likely values
    Standard statistical inference needs missing at random:
    missingness is independent from unseen values
    G Varoquaux 9


  13. Impute & regress [Le Morvan... 2021]
    Impute: fill in the blanks with likely values
    Standard statistical inference needs missing at random:
    missingness is independent from unseen values
    Complete data Imputed data (manifolds)
    Theorem (informal): a universally consistent learner leads
    to optimal prediction for all missing data mechanisms and
    almost all imputation functions.
    Asymptotically, imputing well is not needed to predict well.
    G Varoquaux 9


  14. Impute & regress [Le Morvan... 2021]
    Impute: fill in the blanks with likely values
    Standard statistical inference needs missing at random:
    missingness is independent from unseen values
    Theorem (informal): a universally consistent learner leads
    to optimal prediction for all missing data mechanisms and
    almost all imputation functions.
    Asymptotically, imputing well is not needed to predict well.
    Imputation and regression must be jointly optimized
    When imputing with 𝔼[Xmis | Xobs],
    the optimal regressor to predict is discontinuous
    G Varoquaux 9


  15. Trees handling missing values
    MIA (Missing Incorporated Attribute)
    [Josse... 2019]
    x10< -1.5 ?
    x2< 2 ?
    Yes/Missing
    x7< 0.3 ?
    No
    ...
    Yes
    ...
    No/Missing
    x1< 0.5 ?
    Yes
    ...
    No/Missing
    ... Predict +1.3
    sklearn
    HistGradientBoostingClassifier
    The learner readily handles
    missing values
    G Varoquaux 10


  16. Trees handling missing values
    MIA (Missing Incorporated Attribute)
    [Josse... 2019]
    x10< -1.5 ?
    x2< 2 ?
    Yes/Missing
    x7< 0.3 ?
    No
    ...
    Yes
    ...
    No/Missing
    x1< 0.5 ?
    Yes
    ...
    No/Missing
    ... Predict +1.3
    sklearn
    HistGradientBoostingClassifier
    The learner readily handles
    missing values
    Benchmarks [Perez-Lebel... 2022]
    Tree handling of missing values works best
    Imputation works well, but expensive
    G Varoquaux 10
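    A minimal sketch of this in code (the dataset is synthetic, made up for illustration): scikit-learn's HistGradientBoostingClassifier accepts NaN values directly and learns MIA-style splits, so no imputation step is needed.

    ```python
    # HistGradientBoostingClassifier handles missing values natively
    # (MIA-style splits): NaN can go down either branch of a split.
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] > 0).astype(int)          # label depends on the first feature
    X[rng.random(X.shape) < 0.2] = np.nan  # inject ~20% missing values

    clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
    print(clf.score(X, y))  # trains and predicts despite the NaNs
    ```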


  17. Missing Data
    ℝᵖ ∪ {NA} not a vector space
    Imputation is not about finding
    likely values
    Rather a representation to
    facilitate learning
    [Le Morvan... 2021]
    G Varoquaux 11


  18. Entity representations
    Open-ended entries
    G Varoquaux 12
    Employee Position Title
    Master Police Officer
    Social Worker IV
    Police Officer III
    Police Aide
    Electrician I
    Bus Operator
    Bus Operator
    Social Worker III


  19. Modeling strings, rather than categories
    Notion of category ⇔ entity normalization
    Drug Name
    alcohol
    ethyl alcohol
    isopropyl alcohol
    polyvinyl alcohol
    isopropyl alcohol swab
    62% ethyl alcohol
    alcohol 68%
    alcohol denat
    benzyl alcohol
    dehydrated alcohol
    Employee Position Title
    Police Aide
    Master Police Officer
    Mechanic Technician II
    Police Officer III
    Senior Architect
    Senior Engineer Technician
    Social Worker III
    G Varoquaux 13


  20. Modeling strings: GapEncoder = string embeddings
    Factorizing sub-string (3-gram) count matrices
    Models strings as a linear combination of substrings
    [Figure: binary 3-gram count matrix; rows: police, officer, pol off, polis, policeman, policier; columns: pol, lic, ice, ce_, _of, off, fic, cer, er_]
    G Varoquaux 14
    [Cerda and Varoquaux 2020]


  21. Modeling strings: GapEncoder = string embeddings
    Factorizing sub-string (3-gram) count matrices
    Models strings as a linear combination of substrings
    [Figure: the 3-gram count matrix factorizes into (documents × topics) · (topics × 3-grams): which latent categories are in an entry, and which substrings are in a latent category]
    G Varoquaux 14
    [Cerda and Varoquaux 2020]
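    The factorization above can be sketched with standard tools; this toy example builds the 3-gram count matrix with scikit-learn and uses plain NMF as a stand-in for the GapEncoder's Gamma-Poisson factorization (the job titles are taken from the slides):

    ```python
    # Simplified sketch of the GapEncoder idea: factorize a character
    # 3-gram count matrix into latent "topics". The real GapEncoder
    # (dirty-cat) uses online Gamma-Poisson factorization; NMF stands
    # in here for illustration only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import NMF

    titles = ["Master Police Officer", "Police Officer III",
              "Social Worker IV", "Social Worker III",
              "Bus Operator", "Equipment Operator I"]

    # Rows: strings; columns: character 3-grams (word boundaries padded)
    counts = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(titles)

    # Non-negative activations of each string on 3 latent categories
    activations = NMF(n_components=3, random_state=0).fit_transform(counts)
    print(activations.shape)  # one 3-dimensional embedding per string
    ```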


  22. GapEncoder: String embeddings capturing latent categories
    [Figure: activations of job titles (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) on latent categories]
    G Varoquaux 15
    Code: dirty-cat.github.io [Cerda and Varoquaux 2020]


  23. GapEncoder: String embeddings capturing latent categories
    Plausible feature names
    [Figure: the same job titles against inferred category names, e.g. “assistant, library”, “equipment, operator”, “program, manager”, “rescuer, rescue”, “officer”]
    G Varoquaux 15
    [Cerda and Varoquaux 2020]


  24. Representations tailored to the data
    fasttext: almost as good as GapEncoder, if in the right language
    [Figure: relative score (%) of one-hot + SVD, similarity encoding, FastText + SVD, and Gamma-Poisson factorization; and of FastText + SVD (d=30) on English, French, and Hungarian data]
    G Varoquaux 16
    [Cerda and Varoquaux 2020]


  25. Vectorizing tables: the TableVectorizer
    The dirty-cat software
    dirty-cat.github.io
    TableVectorizer
    X = tab_vec.fit_transform(df)
    Heuristics for different columns
    strings with ≥ 30 categories ⇒ GapEncoder
    date/time ⇒ DateTimeEncoder
    non-string discrete ⇒ TargetEncoder
    ...
    Strong baseline
    G Varoquaux 17


  26. Data tables
    - Heterogeneous columns
    - Missing values
    - Open-ended strings
    Tree-based models
    sklearn HistGradientBoosting
    Column encoding
    dirty cat TableVectorizer
    G Varoquaux 18


  27. 2 Across tables
    We often start from many tables
    [Figure: matching the same entities, aggregating, analysis]


  28. Example data-science analysis
    Real-estate market
    Expected price of a property?
    Predict the price from
    relevant information available
    age
    surface area
    # of rooms
    floor
    location
    ...
    G Varoquaux 20


  29. Example data-science analysis
    Data may need to be merged across tables
    City Rent
    Paris 1100€
    Vitry 700€
    Paris 1300€
    City Pop.
    Paris 2.2M
    Vitry 33k
    Result: City Rent Population
    Paris 1100€ 2.2M
    Vitry 700€ 33k
    Paris 1300€ 2.2M
    G Varoquaux 21


  30. Example data-science analysis
    Aggregations may be needed across different data granularity
    City Rent
    Paris 1100€
    Vitry 700€
    Paris 1300€
    City Pop.
    Paris 2.2M
    Vitry 33k
    Person ID City Salary
    P1 Paris 50k€
    P2 Paris 40k€
    P3 Vitry 34k€
    P4 Vitry 38k€
    GroupBy + Avg ⇒ Mean salary: Paris 45k€, Vitry 36k€
    Result: City Rent Population Mean salary
    Paris 1100€ 2.2M 45k€
    Vitry 700€ 33k 36k€
    Paris 1300€ 2.2M 45k€
    G Varoquaux 22
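    The GroupBy + Avg step followed by the join can be sketched in pandas, with the toy tables from the slide:

    ```python
    # Aggregate-then-join: bring per-person salaries to city granularity,
    # then merge them onto the main (rent) table.
    import pandas as pd

    rent = pd.DataFrame({"City": ["Paris", "Vitry", "Paris"],
                         "Rent": [1100, 700, 1300]})
    salaries = pd.DataFrame({"City": ["Paris", "Paris", "Vitry", "Vitry"],
                             "Salary": [50_000, 40_000, 34_000, 38_000]})

    # GroupBy + Avg: one row per city
    mean_salary = (salaries.groupby("City", as_index=False)["Salary"].mean()
                   .rename(columns={"Salary": "Mean salary"}))

    # Join back onto the main table
    enriched = rent.merge(mean_salary, on="City", how="left")
    print(enriched)  # Paris rows get 45000, Vitry rows get 36000
    ```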


  31. Example data-science analysis
    Multiple hops may be needed
    City Rent
    Paris 1100€
    Vitry 700€
    Paris 1300€
    City Pop.
    Paris 2.2M
    Vitry 33k
    Person ID City Salary
    P1 Paris 50k€
    P2 Paris 40k€
    P3 Vitry 34k€
    P4 Vitry 38k€
    City Department
    Paris Paris
    Vitry-sur-Seine Val-de-Marne
    Department Poverty rate
    Paris 15.2%
    Val-de-Marne 13.3%
    GroupBy + Avg
    Result: City Rent Population Mean salary Poverty rate
    Paris 1100€ 2.2M 45k€ 15.2%
    Vitry 700€ 33k 36k€ 13.3%
    Paris 1300€ 2.2M 45k€ 15.2%
    G Varoquaux 23


  32. Example data-science analysis
    Joining tables Aggregations Multiple hops
    Difficult for humans:
    requires expertise on the data
    Difficult for machine learning:
    discrete choices, combinatorial optimization
    [Tables as on slide 31]
    G Varoquaux 23


  33. Example data-science analysis
    We need statistics and learning
    across tables
    [Tables as on slide 31]
    G Varoquaux 24


  34. Relational data challenges statistical learning
    Statistics and learning use repetitions and regularities
    Relational data
    Discrete objects, different tables, different natures
    properties, person, cities, departments...
    No clear repetition, regularity, metric, smoothness
    G Varoquaux 25


  35. Assembling data
    same entities
    Aggregating
    Analysis
    A “main” table
    Feature-enrichment tables
    G Varoquaux 26


  36. Deep Feature Synthesis [Kanter and Veeramachaneni 2015]
    Greedily - starts from a target table
    - recursively joins related tables, to a given depth
    One-to-many relations: Computes different aggregations
    COUNT, SUM, LAST, MAX...
    City Population City School School Students
    Palaiseau 33k Palaiseau Lycée Camille Claudel Lycée Camille Claudel 800
    Palaiseau Lycée Henri Poincaré Lycée Henri Poincaré 1000
    Target table
    Depth 0 City Department Department PovertyRate
    Palaiseau Essonne Essonne 13.3%
    Depth 1 Depth 2
    City Population
    COUNT(City.School)
    City.Department
    City.Department.PovertyRate
    SUM(City.School.Students)
    MAX(City.School.Students)
    Palaiseau 33k 2 Essonne 13.3% 1800 800
    Does not scale: # features explodes with depth and # tables
    G Varoquaux 27


  37. Embeddings as assembly
    Entity embeddings that
    distill information
    across tables
    Object → ℝᵖ
    KEN: knowledge embedding with
    numbers [Cvetkov-Iliev... 2023]
    G Varoquaux 28


  38. KEN: Overall approach [Cvetkov-Iliev... 2023]
    Strategy:
    Convert data to graph (RDF triplets)
    [Figure: database tables of cities and counties (Name, City, County, State, Lat, Long, Pop) are converted to a triplet representation (head, relation, tail), e.g. (Harris County, Pop, 4.7M), (Anaheim City, County, Orange); entity embeddings are then trained with negative sampling]
    G Varoquaux 29


  39. KEN: Overall approach [Cvetkov-Iliev... 2023]
    Strategy:
    Convert data to graph (RDF triplets)
    Adapt knowledge-graph embedding approaches
    Capture relations and numerical attributes
    [Figure: entity embeddings, relation operators, and numerical-attribute embeddings are trained jointly with negative sampling, then transferred to analyses on a second database (e.g. Votes per county)]
    G Varoquaux 29


  40. From tables to (knowledge) graphs
    Knowledge graphs = list of triples (head, relation, tail) or (h, r, t)
    e.g. (Paris, capitalOf, France)
    San Francisco
    San Diego
    California
    0.87M
    State
    1.4M
    Population
    City Population State
    San Francisco 0.87M California
    San Diego 1.4M California
    (San Francisco, Population, 0.87M)
    (San Francisco, State, California)
    (San Diego, Population, 1.4M)
    (San Diego, State, California)
    Table representation Triple / Knowledge graph
    “Head” column
    The two representations
    are (almost) equivalent:
    G Varoquaux 30
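    The table-to-triples conversion can be sketched directly (values from the slide's example):

    ```python
    # Each non-key cell of the table becomes one (head, relation, tail)
    # triple: head = the row's entity, relation = the column name.
    import pandas as pd

    cities = pd.DataFrame(
        {"City": ["San Francisco", "San Diego"],
         "Population": ["0.87M", "1.4M"],
         "State": ["California", "California"]})

    triples = [(row["City"], col, row[col])
               for _, row in cities.iterrows()
               for col in ["Population", "State"]]
    print(triples[0])  # ('San Francisco', 'Population', '0.87M')
    ```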


  41. Entity embeddings: contextual
    Contextual: two entities have close embeddings
    if they co-occur
    In NLP: word2vec = word co-occurrences
    In knowledge graphs: RDF2vec
    G Varoquaux 31
    Input triples:
    (Facebook, FoundedIn, Massachusetts)
    (Facebook, HeadquartersIn, California)
    (MathWorks, FoundedIn, California)
    (MathWorks, HeadquartersIn, Massachusetts)
    (Google, FoundedIn, California)
    (Google, HeadquartersIn, California)
    (Apple, FoundedIn, California)
    (Apple, HeadquartersIn, California)
    [Figure: a) contextual RDF2vec embeddings vs. b) relational knowledge-graph embeddings of Google, Apple, Facebook, MathWorks, California, Massachusetts]


  42. Entity embeddings: contextual < relational
    Contextual: two entities have close embeddings
    if they co-occur
    Relational: two entities are close
    if they have the same relations to other entities
    G Varoquaux 31
    [Figure: same input triples and embeddings as on slide 41]


  43. Knowledge-graph embeddings to capture relations
    TransE [Bordes... 2013] represents relation r as a translation
    vector r ∈ ℝᵖ between entity embeddings h and t.
    Scoring function:
    f(h, r, t) = −‖h + r − t‖
    [Figure: Paris → France and Rome → Italy linked by the same capitalOf translation]
    Training: optimize h, r, t to minimize a margin loss over true
    triples (h, r, t) ∈ G and corrupted triples (h′, r, t′) ∉ G
    with h′ = h or t′ = t:
    L = Σ [f(h′, r, t′) − f(h, r, t) + γ]₊
    G Varoquaux 32
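    A toy numeric illustration of the TransE scoring function, with random embeddings and a translation vector constructed by hand rather than learned:

    ```python
    # TransE scores a triple by how well head + relation lands on tail:
    # f(h, r, t) = -||h + r - t||. Embeddings here are random toys.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8
    emb = {name: rng.normal(size=dim)
           for name in ["Paris", "France", "Rome", "Italy"]}
    # Hand-built translation for illustration (TransE would learn it)
    capital_of = emb["France"] - emb["Paris"]

    def score(h, r, t):
        return -np.linalg.norm(emb[h] + r - emb[t])

    print(score("Paris", capital_of, "France"))  # 0 by construction
    print(score("Paris", capital_of, "Italy"))   # strictly lower
    ```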


  44. KEN: embeddings to distill information [Cvetkov-Iliev... 2023]
    1. Capture one-to-many relations Use MuRE [Balazevic... 2019]
    Scoring function: f(h, r, t) = −d(ρr ⊙ h, t + rr)² + bh + bt
    ρr ⊙ h: contraction / projection; t + rr: translation
    Enables rich relational geometry
    [Figure: companies (Google, Apple, Facebook, MathWorks) and states embedded with FoundedIn and HeadquartersIn relation operators]
    G Varoquaux 33


  45. KEN: embeddings to distill information [Cvetkov-Iliev... 2023]
    1. Capture one-to-many relations Use MuRE [Balazevic... 2019]
    Scoring function: f(h, r, t) = −d(ρr ⊙ h, t + rr)² + bh + bt
    ρr ⊙ h: contraction / projection; t + rr: translation
    2. Embed numerical attributes Attribute-specific mini-MLP
    A numerical relation (attribute) r of value x gets the representation
    er(x) = ReLU(x wr + br)
    Use er(x) in place of the tail embedding
    G Varoquaux 33
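    The attribute embedding can be sketched in a few lines; wr and br are random here, whereas KEN learns them per attribute:

    ```python
    # Sketch of KEN's numerical-attribute embedding e_r(x) = ReLU(x*w_r + b_r):
    # a scalar value x is mapped to a vector used in place of a tail embedding.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8
    w_r = rng.normal(size=dim)  # per-attribute weights (learned in KEN)
    b_r = rng.normal(size=dim)  # per-attribute bias (learned in KEN)

    def embed_attribute(x):
        return np.maximum(0.0, x * w_r + b_r)  # ReLU

    e = embed_attribute(2.2)  # e.g. a normalized population value
    print(e.shape)  # a dim-dimensional "tail" vector
    ```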


  46. KEN embeddings do distill the information [Cvetkov-Iliev... 2023]
    Object → ℝᵖ
    Feature vectors almost as performant for analysis as
    combinatorial feature generation
    Scalable to millions of entities
    Capture multi-hop information (across multiple tables)
    Reconstruct well the distributions of numerical attributes:
    mean, percentiles, counts, in one-to-many settings
    Good features for neural nets: X ∈ ℝᵖ
    G Varoquaux 34


  47. Entity embeddings that
    distill information
    across tables
    KEN: knowledge embedding with
    numbers [Cvetkov-Iliev... 2023]
    X ∈ ℝᵖ
    soda-inria.github.io/ken_embeddings
    6 million common entities
    cities, people, companies...
    Example usage in dirty-cat docs
    G Varoquaux 35


  48. Representation learning + rich machine learning
    Can partly automate data preparation
    The promise of less manual work
    But we have replaced one sausage factory (data massaging)
    by another (opaque representations and models)
    Why should we trust these?
    G Varoquaux 36


  49. Valid analysis?
    More learning or cleaning?
    [Cvetkov-Iliev... 2022]
    G Varoquaux 37


  50. More learning versus more cleaning [Cvetkov-Iliev... 2022]
    A cross-institution study of salary
    Comparing entity matching vs embeddings + machine learning
    Entity matching = 3 days of manual labor, results imperfect
    Analysis, not prediction
    Machine learning as flexible estimators of conditional relations
    Analytic questions can be reformulated
    e.g. sex pay gap = causal effect of sex ⇒ double-ML estimators
    G Varoquaux 38


  51. More learning versus more cleaning [Cvetkov-Iliev... 2022]
    A cross-institution study of salary
    Comparing entity matching vs embeddings + machine learning
    Entity matching = 3 days of manual labor, results imperfect
    Analysis, not prediction
    Machine learning as flexible estimators of conditional relations
    Analytic questions can be reformulated
    e.g. sex pay gap = causal effect of sex ⇒ double-ML estimators
    Validity established via error on observables (cross-validation)
    Conclusion: Both cleaning & learning help
    Embedding + learning goes far for little cost
    G Varoquaux 38


  52. Valid analyses
    Opinion: More learning
    rather than cleaning
    [Cvetkov-Iliev... 2022]
    Cleaning, modeling
    Human auditable
    Sometimes in the eye of the
    beholder
    Learning
    Validity on observables
    G Varoquaux 39


  53. The soda team: Machine learning for health and social sciences
    Tabular relational learning
    Relational databases, data lakes
    Health and social sciences
    Epidemiology, education, psychology
    Machine learning for statistics
    Causal inference, biases, missing values
    Data-science software
    scikit-learn, joblib, dirty-cat
    G Varoquaux 40


  54. Representations of relational data
    Trees work very well on a data table
    - Not tied to smooth geometry / gradients
    I seek continuous representations of complex discrete objects
    - Lack of obvious regularities
    - String-based representations
    - Embedding a large database graph
    software: dirty-cat
    dirty-cat.github.io
    @GaelVaroquaux


  55. References I
    I. Balazevic, C. Allen, and T. Hospedales. Multi-relational Poincaré graph embeddings.
    Neural Information Processing Systems, 32:4463, 2019.
    A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating
    embeddings for modeling multi-relational data. In Advances in Neural Information
    Processing Systems, pages 2787–2795, 2013.
    P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables.
    Transactions in Knowledge and Data Engineering, 2020.
    A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Analytics on non-normalized data
    sources: more learning, rather than more cleaning. IEEE Access, 2022.
    A. Cvetkov-Iliev, A. Allauzen, and G. Varoquaux. Relational data embeddings for feature
    enrichment with background information. Machine Learning, pages 1–34, 2023.
    L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform
    deep learning on typical tabular data? In Thirty-sixth Conference on Neural
    Information Processing Systems Datasets and Benchmarks Track, 2022.


  56. References II
    J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised
    learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
    Kaggle. Kaggle industry survey, 2018. URL
    https://www.kaggle.com/ash316/novice-to-grandmaster.
    J. M. Kanter and K. Veeramachaneni. Deep feature synthesis: Towards automating data
    science endeavors. In IEEE International Conference on Data Science and Advanced
    Analytics (DSAA), pages 1–10, 2015.
    H. T. Lam, B. Buesser, H. Min, T. N. Minh, M. Wistuba, U. Khurana, G. Bramble, T. Salonidis,
    D. Wang, and H. Samulowitz. Automated data science for relational data. In
    International Conference on Data Engineering (ICDE), page 2689. IEEE, 2021.
    M. Le Morvan, J. Josse, E. Scornet, and G. Varoquaux. What’s a good imputation to predict
    with missing values? NeurIPS, 2021.
    A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline. Benchmarking
    missing-values approaches for predictive models on health databases. GigaScience,
    11, 2022.
