
Similarity encoding for learning with dirty categorical variables

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. ‘Dirty’ non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.


Patricio Cerda

September 11, 2018

Transcript

  1. Similarity encoding for learning with dirty categorical variables.
     Patricio Cerda, Gaël Varoquaux, Balázs Kégl. September 11, 2018.
     doi.org/10.1007/s10994-018-5724-2
  2. Two simple observations. 1. Tables often contain categorical variables.

     | Annual Salary | Gender | Year hired |
     | 42,053        | F      | 2014       |
     | 53,050        | M      | 2009       |
     | 63,492        | M      | 1998       |
     | 70,435        | F      | 1996       |
     | 75,524        | -      | 2010       |
     | 97,392        | F      | 1993       |
     | 102,664       | F      | 1980       |

     But statistical learning algorithms require a numerical feature matrix as input in order to make predictions.
  3. Two simple observations. 1. Tables often contain categorical variables.
     One-hot encoding of the Gender column:

     | Annual Salary | Gender=female | Gender=male | Year hired |
     | 42,053        | 1             | 0           | 2014       |
     | 53,050        | 0             | 1           | 2009       |
     | 63,492        | 0             | 1           | 1998       |
     | 70,435        | 1             | 0           | 1996       |
     | 75,524        | 0             | 0           | 2010       |
     | 97,392        | 1             | 0           | 1993       |
     | 102,664       | 1             | 0           | 1980       |
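To make the one-hot expansion above concrete, here is a minimal sketch (not from the slides) of how the Gender column could be expanded with pandas; the DataFrame and column names are illustrative.

```python
import pandas as pd

# Toy version of the salary table above (values copied from the slide).
df = pd.DataFrame({
    "annual_salary": [42053, 53050, 63492, 70435, 75524, 97392, 102664],
    "gender": ["F", "M", "M", "F", None, "F", "F"],   # None stands for the '-' row
    "year_hired": [2014, 2009, 1998, 1996, 2010, 1993, 1980],
})

# One 0/1 indicator column per category; the missing-gender row gets all zeros,
# matching the 0/0 row in the slide's table.
encoded = pd.get_dummies(df, columns=["gender"], prefix="gender", dtype=int)
print(encoded)
```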
  4. Two simple observations. 1. Tables often contain categorical variables. 2. Real-world (categorical) data is dirty.

     | Annual Salary | Gender | Year hired | Position Title        |
     | 42,053        | F      | 2014       | Bus Operator          |
     | 53,050        | M      | 2009       | Police Aide           |
     | 63,492        | M      | 1998       | Electrician I         |
     | 70,435        | F      | 1996       | Police Officer III    |
     | 75,524        | -      | 2010       | Social Worker III     |
     | 97,392        | F      | 1993       | Master Police Officer |
     | 102,664       | F      | 1980       | Social Worker IV      |
  5. Another "dirty data" example: the Midwest survey dataset. Question: "In your own words, in which part of the country you live in now?"

     | Count | Answer |
     | 675   | midwest |
     | 72    | the midwest |
     | 50    | upper midwest |
     | 33    | mid west |
     | 23    | mid-west |
     | 5     | upper mid-west |
     | 2     | central midwest |
     | 2     | midwesterner |
     | 1     | the rural mid-west |
     | 1     | midwwest |
     | 1     | the midwest duh |
     | 1     | midwest. the frozen tundra |
     | 1     | liberal midwest |
     | 1     | north america mid west |
     | 1     | mid atlantic mid west |
     | 1     | kind of midwest |
     | 1     | the midwest. the rust belt. the snow belt. |
     | 1     | the mid-west. the heartland, the best location in the nation. |
     | 1     | the south, but i being so close to missouri, i prefer midwest. |
     | ...   | ... |
  6. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
  7. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
     We are interested in categorical variables (strings) as part of the feature variables:
     - Overlapping categories
     - Typos
     - High cardinality
     - Rare categories
     - ...
     A data-cleaning problem?
  8. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
     We are interested in categorical variables (strings) as part of the feature variables:
     - Overlapping categories
     - Typos
     - High cardinality
     - Rare categories
     - ...
     A data-cleaning problem? A feature-engineering problem?
  9. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
     We are interested in categorical variables (strings) as part of the feature variables:
     - Overlapping categories
     - Typos
     - High cardinality
     - Rare categories
     - ...
     A data-cleaning problem? A feature-engineering problem? A problem of representations in high dimension.
  10. Our goals: to make predictions without performing data cleaning.
  11. Our goals: to make predictions without performing data cleaning. A statistical view of supervised learning on dirty categories (let the learning algorithm decide whether "midwest" and "the midwest duh" are the same category or not).
  12. Our goals: to make predictions without performing data cleaning. A statistical view of supervised learning on dirty categories (let the learning algorithm decide whether "midwest" and "the midwest duh" are the same category or not). A fast and readily usable encoder for string categories.
  13. Related work: database cleaning. Recognizing and merging entities.
      - Record linkage [Fellegi and Sunter 1969]: matching across different (clean) tables.
      - Deduplication / fuzzy matching: matching within one dirty table.
      Techniques: supervised learning (known matches), clustering, expectation-maximization to learn a metric. Outputs a "clean" database.
  14. Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules that needs to be adapted to each new language or domain.
  15. Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules that needs to be adapted to each new language or domain. Semantics: formal semantics (using a knowledge base); distributional semantics ("a word is characterized by the company it keeps").
  16. Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules that needs to be adapted to each new language or domain. Semantics: formal semantics (using a knowledge base); distributional semantics ("a word is characterized by the company it keeps"). Character-level NLP: for entity resolution [Klein et al. 2003]; for semantics [Bojanowski et al. 2016].
  17. Some string similarities.
      - Levenshtein: number of edit operations on one string to match the other.
      - Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m-t}{3m}$, where $m$ is the number of matching characters and $t$ the number of character transpositions.
      - n-gram similarity (an n-gram is a group of n consecutive characters): $\mathrm{similarity}(s_1, s_2) = \frac{\#\text{common n-grams}}{\#\text{total n-grams}}$.
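As an illustration of the n-gram similarity above, here is a minimal sketch in Python. The whitespace padding, the lowercasing, and reading "#total n-grams" as the size of the union of the two n-gram sets are assumptions; the paper's exact variant may differ slightly.

```python
def ngrams(s, n=3):
    """Set of character n-grams of a string, padded with spaces at the ends."""
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    """similarity(s1, s2) = #common n-grams / #total n-grams (union)."""
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / len(g1 | g2)

print(ngram_similarity("London", "Londres"))   # 0.3: morphologically close
print(ngram_similarity("London", "Paris"))     # 0.0: no 3-grams in common
```

With these choices the pair (London, Londres) comes out at 0.3, the value used in the similarity-encoding table a few slides below.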
  18. Similarity encoding: generalizing one-hot encoding by using string similarities.
      One-hot encoding:
      |         | London | Londres | Paris |
      | Londres | 0      | 1       | 0     |
      | London  | 1      | 0       | 0     |
      | Paris   | 0      | 0       | 1     |
  19. Similarity encoding: generalizing one-hot encoding by using string similarities.
      One-hot encoding:
      |         | London | Londres | Paris |
      | Londres | 0      | 1       | 0     |
      | London  | 1      | 0       | 0     |
      | Paris   | 0      | 0       | 1     |
      What if we add string similarities to one-hot encoding?
  20. Similarity encoding: generalizing one-hot encoding by using string similarities.
      One-hot encoding:
      |         | London | Londres | Paris |
      | Londres | 0      | 1       | 0     |
      | London  | 1      | 0       | 0     |
      | Paris   | 0      | 0       | 1     |
      What if we add string similarities to one-hot encoding?
      Similarity encoding:
      |         | London | Londres | Paris |
      | Londres | 0.3    | 1       | 0     |
      | London  | 1      | 0.3     | 0     |
      | Paris   | 0      | 0       | 1     |
      Here 0.3 = string_similarity(Londres, London).
  21. Similarity encoding: generalizing one-hot encoding by using string similarities.
      $x_i = \big(\mathrm{sim}(s_i, s^{(1)}), \mathrm{sim}(s_i, s^{(2)}), \ldots, \mathrm{sim}(s_i, s^{(k)})\big) \in \mathbb{R}^k$
      - Categories with similar morphology are closer to each other.
      - Unseen categories can be easily encoded.
      - The new feature vector can be interpreted in the same way as one-hot encoding.
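A minimal sketch of the encoder this formula defines, reusing `ngram_similarity` from the snippet above: each string is represented by its similarities to the k categories seen at train time. This is an illustrative re-implementation, not the authors' code (the dirty-cat project referenced at the end of the deck provides a ready-made encoder).

```python
import numpy as np

class SimpleSimilarityEncoder:
    """Encode strings by their similarity to the train-set categories."""

    def fit(self, values):
        # Reference categories s(1), ..., s(k): the unique train-set strings.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # x_i = (sim(s_i, s(1)), ..., sim(s_i, s(k))) in R^k
        return np.array([[ngram_similarity(v, c) for c in self.categories_]
                         for v in values])

enc = SimpleSimilarityEncoder().fit(["Londres", "London", "Paris"])
print(enc.categories_)                       # ['London', 'Londres', 'Paris']
print(enc.transform(["Londres", "London", "Paris"]))
# Reproduces the similarity-encoding table above: 1 for exact matches,
# 0.3 for the London/Londres pair, 0 elsewhere.
print(enc.transform(["Lond0n"]))             # a misspelled, unseen category can still be encoded
```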
  22. Empirical study: 7 real-world datasets, each with at least one string categorical feature of high cardinality.

      | Dataset            | Rows    | Categories | Most frequent (count) | Least frequent (count) | Prediction type |
      | medical charges    | 1.6E+05 | 100        | 3023                  | 613                    | regression      |
      | employee salaries  | 9.2E+03 | 385        | 883                   | 1                      | regression      |
      | open payments      | 1.0E+05 | 973        | 4016                  | 1                      | binary-clf      |
      | midwest survey     | 2.8E+03 | 1009       | 487                   | 1                      | multiclass-clf  |
      | traffic violations | 1.0E+05 | 3043       | 7817                  | 1                      | multiclass-clf  |
      | road safety        | 1.0E+04 | 4617       | 589                   | 1                      | binary-clf      |
      | beer reviews       | 1.0E+04 | 4634       | 25                    | 1                      | multiclass-clf  |
  23. Empirical study: the number of categories grows (slowly) with the number of samples.
      [Figure: number of categories vs. number of rows (log-log) for the seven datasets, with reference curves √n and 5 log2(n).]
  24. Benchmarking classifiers with gradient boosted trees.
      [Figure: prediction scores per dataset (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety) for similarity encoding (3-gram, Levenshtein-ratio, Jaro-Winkler), bag of 3-grams, target encoding, MDV, one-hot encoding, and hash encoding.]
  27. Benchmarking classifiers with gradient boosted trees.
      [Figure: prediction scores per dataset (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, beer reviews) for similarity encoding (3-gram, Levenshtein-ratio, Jaro-Winkler), bag of 3-grams, target encoding, MDV, one-hot encoding, and hash encoding, with the average ranking of each encoder across datasets (from 1.9, best, to 7.6, worst).]
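For context, a sketch of the kind of pipeline such a benchmark implies: similarity-encode the dirty column, then fit a gradient-boosted model and cross-validate. Everything here (the toy job titles, the synthetic salaries, and the inlined 3-gram similarity) is illustrative, not the authors' experimental setup.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def ngram_similarity(a, b, n=3):
    """Same 3-gram similarity as in the earlier sketch, inlined for completeness."""
    grams = lambda s: {(" " + s.lower() + " ")[i:i + n]
                       for i in range(len(s) + 3 - n)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

# Toy dirty column and a salary target that depends (noisily) on the job family.
titles = np.array(["Police Officer III", "Police Oficer III", "Bus Operator",
                   "Bus operator", "Social Worker IV", "Master Police Officer"] * 50)
rng = np.random.default_rng(0)
base = {"Police": 70_000, "Bus": 45_000, "Social": 60_000, "Master": 90_000}
salary = np.array([base[t.split()[0]] for t in titles]) + rng.normal(0, 5_000, len(titles))

# Similarity-encode against the unique categories, then learn on the dense features.
# (In a real benchmark the encoder would be fit inside each CV fold, e.g. via a Pipeline.)
categories = sorted(set(titles))
X = np.array([[ngram_similarity(t, c) for c in categories] for t in titles])
scores = cross_val_score(HistGradientBoostingRegressor(), X, salary, cv=5)
print(scores.mean())
```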
  28. Benchmarking classifiers with ridge.
      [Figure: same comparison as slide 27 with a ridge model; average rankings across datasets range from 1.1 (best) to 7.3 (worst).]
  29. One-hot vs. similarity encoding for different learners.
  30. Problem: dimensionality.
      [Figure: number of categories vs. number of rows for the seven datasets, as on slide 23.]
      For one-hot and similarity encoders: #features = #unique categories in the train set.
  31. Dimensionality reduction (gradient boosted trees).
      [Figure: one-hot encoding and 3-gram similarity encoding at d=100, d=300, and full dimensionality, with reduction strategies including random projections, most frequent categories, K-means, and deduplication with K-means.]
  33. Dimensionality reduction (gradient boosted trees).
      [Figure: prediction scores at d=100, d=300, and full dimensionality for one-hot encoding, 3-gram similarity encoding, and bag of 3-grams, reduced via random projections, most frequent categories, K-means, and deduplication with K-means. Datasets, with cardinality k of the categorical variable and full dimensionality d: employee salaries (k=359, d=1251), open payments (k=912, d=2933), midwest survey (k=722, d=2330), traffic violations (k=2555, d=3810), road safety (k=4000, d=4923).]
  35. Dimensionality reduction (gradient boosted trees).
      [Figure: same comparison as above, now including beer reviews (k=4067, d=6553), with the average ranking of each encoder/reduction combination across datasets (best average rank 2.8*).]
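A sketch of one of the reduction strategies above: keep only the d most frequent categories as prototypes, so the encoding has d columns instead of one per unique category. This is an illustrative reading of the "most frequent categories" strategy, reusing the `ngram_similarity` helper from the earlier sketches; for the "random projections" strategy, scikit-learn's GaussianRandomProjection could instead be applied to the full encoding.

```python
from collections import Counter
import numpy as np

def most_frequent_prototypes(train_values, d=100):
    """Pick the d most frequent categories as reference prototypes."""
    return [cat for cat, _ in Counter(train_values).most_common(d)]

def encode_with_prototypes(values, prototypes):
    """Similarity-encode against the prototypes only: d columns instead of k."""
    return np.array([[ngram_similarity(v, p) for p in prototypes]
                     for v in values])

answers = (["midwest"] * 50 + ["the midwest"] * 10 + ["upper midwest"] * 5
           + ["mid west", "mid-west", "midwwest", "the midwest duh"])
prototypes = most_frequent_prototypes(answers, d=3)
X_small = encode_with_prototypes(answers, prototypes)
print(X_small.shape)   # (n_samples, 3) rather than (n_samples, n_unique_categories)
```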
  36. Dimensionality reduction (ridge).
      [Figure: same comparison as above with a ridge model, with the average ranking of each encoder/reduction combination across datasets (best average rank 2.7*).]
  37. Just a string similarity? What similarity is defined by our encoding? (a kernel)
      $\langle s_i, s_j \rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}(s_i, s^{(l)})\, \mathrm{sim}(s_j, s^{(l)})$
      Reference categories: the categories in the train set shape the similarity!
  38. Just a string similarity? What similarity is defined by our encoding? (a kernel)
      $\langle s_i, s_j \rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}(s_i, s^{(l)})\, \mathrm{sim}(s_j, s^{(l)})$
      Reference categories: the categories in the train set shape the similarity!
      [Figure: the ridge benchmark from slide 28, repeated for reference.]
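A small numeric check of this kernel reading (again reusing `ngram_similarity` from the earlier sketch, with made-up reference categories): the ordinary dot product of two similarity-encoded vectors is exactly the sum above, so the train-set categories are what shapes the similarity between any two strings.

```python
import numpy as np

reference = ["midwest", "the midwest", "upper midwest", "mid-west"]  # s(1), ..., s(k)
s_i, s_j = "midwest.", "the midwest duh"

x_i = np.array([ngram_similarity(s_i, r) for r in reference])
x_j = np.array([ngram_similarity(s_j, r) for r in reference])

# <s_i, s_j>_sim = sum_l sim(s_i, s(l)) * sim(s_j, s(l)) = x_i . x_j
print(x_i @ x_j)
```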
  39. Conclusion. Similarity encoding:
      - improves prediction performance of the learning task;
      - works even with strong dimensionality reduction;
      - needs no data cleaning of categories;
      - is a fast and readily usable embedding.
  40. Conclusion. Similarity encoding:
      - improves prediction performance of the learning task;
      - works even with strong dimensionality reduction;
      - needs no data cleaning of categories;
      - is a fast and readily usable embedding.
      Ongoing work: a theoretical understanding of similarity encoding; more datasets and proper statistical tests; several extensions, e.g. considering more than one variable.
  41. Conclusion. Similarity encoding:
      - improves prediction performance of the learning task;
      - works even with strong dimensionality reduction;
      - needs no data cleaning of categories;
      - is a fast and readily usable embedding.
      Ongoing work: a theoretical understanding of similarity encoding; more datasets and proper statistical tests; several extensions, e.g. considering more than one variable.
      Datasets and examples on learning with dirty categories: dirty-cat.github.io