
Similarity encoding for learning with dirty categorical variables

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. ‘Dirty’ non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.


Patricio Cerda

September 11, 2018

Transcript

  1. Similarity encoding for learning with dirty categorical variables.
     Patricio Cerda, Gaël Varoquaux, Balázs Kégl. September 11, 2018.
     doi.org/10.1007/s10994-018-5724-2
  2. Two simple observations. 1. Tables often contain categorical variables.

     | Annual Salary | Gender | Year hired |
     | 42,053        | F      | 2014       |
     | 53,050        | M      | 2009       |
     | 63,492        | M      | 1998       |
     | 70,435        | F      | 1996       |
     | 75,524        | -      | 2010       |
     | 97,392        | F      | 1993       |
     | 102,664       | F      | 1980       |

     But statistical learning algorithms require a numerical feature matrix as input in order to make predictions.
  3. Two simple observations. 1. Tables often contain categorical variables.
     One-hot encoding of the Gender column:

     | Annual Salary | Gender=female | Gender=male | Year hired |
     | 42,053        | 1             | 0           | 2014       |
     | 53,050        | 0             | 1           | 2009       |
     | 63,492        | 0             | 1           | 1998       |
     | 70,435        | 1             | 0           | 1996       |
     | 75,524        | 0             | 0           | 2010       |
     | 97,392        | 1             | 0           | 1993       |
     | 102,664       | 1             | 0           | 1980       |
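To make the one-hot expansion above concrete, here is a minimal sketch (not from the slides) of how the Gender column could be expanded with pandas; the DataFrame and column names are illustrative.

```python
import pandas as pd

# Toy version of the salary table above (values copied from the slide).
df = pd.DataFrame({
    "annual_salary": [42053, 53050, 63492, 70435, 75524, 97392, 102664],
    "gender": ["F", "M", "M", "F", None, "F", "F"],   # None stands for the '-' row
    "year_hired": [2014, 2009, 1998, 1996, 2010, 1993, 1980],
})

# One 0/1 indicator column per category; the missing-gender row gets all zeros,
# matching the 0/0 row in the slide's table.
encoded = pd.get_dummies(df, columns=["gender"], prefix="gender", dtype=int)
print(encoded)
```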
  4. Two simple observations. 1. Tables often contain categorical variables. 2. Real-world (categorical) data is dirty.

     | Annual Salary | Gender | Year hired | Position Title        |
     | 42,053        | F      | 2014       | Bus Operator          |
     | 53,050        | M      | 2009       | Police Aide           |
     | 63,492        | M      | 1998       | Electrician I         |
     | 70,435        | F      | 1996       | Police Officer III    |
     | 75,524        | -      | 2010       | Social Worker III     |
     | 97,392        | F      | 1993       | Master Police Officer |
     | 102,664       | F      | 1980       | Social Worker IV      |
  5. Another "dirty data" example: the Midwest survey dataset. Question: "In your own words, in which part of the country you live in now?"

     | Count | Answer |
     | 675   | midwest |
     | 72    | the midwest |
     | 50    | upper midwest |
     | 33    | mid west |
     | 23    | mid-west |
     | 5     | upper mid-west |
     | 2     | central midwest |
     | 2     | midwesterner |
     | 1     | the rural mid-west |
     | 1     | midwwest |
     | 1     | the midwest duh |
     | 1     | midwest. the frozen tundra |
     | 1     | liberal midwest |
     | 1     | north america mid west |
     | 1     | mid atlantic mid west |
     | 1     | kind of midwest |
     | 1     | the midwest. the rust belt. the snow belt. |
     | 1     | the mid-west. the heartland, the best location in the nation. |
     | 1     | the south, but i being so close to missouri, i prefer midwest. |
     | ...   | ... |
  6. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
  7. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
     We are interested in categorical variables (strings) as part of the feature variables:
     - Overlapping categories
     - Typos
     - High cardinality
     - Rare categories
     - ...
     A data-cleaning problem?
  8. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
     We are interested in categorical variables (strings) as part of the feature variables:
     - Overlapping categories
     - Typos
     - High cardinality
     - Rare categories
     - ...
     A data-cleaning problem? A feature-engineering problem?
  9. Dirty data? Incomplete, erroneous, inaccurate, or non-standardized data.
     We are interested in categorical variables (strings) as part of the feature variables:
     - Overlapping categories
     - Typos
     - High cardinality
     - Rare categories
     - ...
     A data-cleaning problem? A feature-engineering problem? A problem of representations in high dimension.
  10. Our goals: to make predictions without performing data cleaning.
  11. Our goals: to make predictions without performing data cleaning. A statistical view of supervised learning on dirty categories (let the learning algorithm decide whether "midwest" and "the midwest duh" are the same category or not).
  12. Our goals: to make predictions without performing data cleaning. A statistical view of supervised learning on dirty categories (let the learning algorithm decide whether "midwest" and "the midwest duh" are the same category or not). A fast and readily usable encoder for string categories.
  13. Related work: database cleaning. Recognizing and merging entities.
      - Record linkage [Fellegi and Sunter 1969]: matching across different (clean) tables.
      - Deduplication / fuzzy matching: matching within one dirty table.
      Techniques: supervised learning (known matches), clustering, expectation-maximization to learn a metric. Outputs a "clean" database.
  14. Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules that needs to be adapted to each new language or domain.
  15. Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules that needs to be adapted to each new language or domain. Semantics: formal semantics (using a knowledge base); distributional semantics ("a word is characterized by the company it keeps").
  16. Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules that needs to be adapted to each new language or domain. Semantics: formal semantics (using a knowledge base); distributional semantics ("a word is characterized by the company it keeps"). Character-level NLP: for entity resolution [Klein et al. 2003]; for semantics [Bojanowski et al. 2016].
  17. Some string similarities.
      - Levenshtein: number of edit operations on one string to match the other.
      - Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m-t}{3m}$, where $m$ is the number of matching characters and $t$ the number of character transpositions.
      - n-gram similarity (an n-gram is a group of n consecutive characters): $\mathrm{similarity}(s_1, s_2) = \frac{\#\text{common n-grams}}{\#\text{total n-grams}}$.
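As an illustration of the n-gram similarity above, here is a minimal sketch in Python. The whitespace padding, the lowercasing, and reading "#total n-grams" as the size of the union of the two n-gram sets are assumptions; the paper's exact variant may differ slightly.

```python
def ngrams(s, n=3):
    """Set of character n-grams of a string, padded with spaces at the ends."""
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    """similarity(s1, s2) = #common n-grams / #total n-grams (union)."""
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / len(g1 | g2)

print(ngram_similarity("London", "Londres"))   # 0.3: morphologically close
print(ngram_similarity("London", "Paris"))     # 0.0: no 3-grams in common
```

With these choices the pair (London, Londres) comes out at 0.3, the value used in the similarity-encoding table a few slides below.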
  18. Similarity encoding: generalizing one-hot encoding by using string similarities.
      One-hot encoding:
      |         | London | Londres | Paris |
      | Londres | 0      | 1       | 0     |
      | London  | 1      | 0       | 0     |
      | Paris   | 0      | 0       | 1     |
  19. Similarity encoding: generalizing one-hot encoding by using string similarities.
      One-hot encoding:
      |         | London | Londres | Paris |
      | Londres | 0      | 1       | 0     |
      | London  | 1      | 0       | 0     |
      | Paris   | 0      | 0       | 1     |
      What if we add string similarities to one-hot encoding?
  20. Similarity encoding: generalizing one-hot encoding by using string similarities.
      One-hot encoding:
      |         | London | Londres | Paris |
      | Londres | 0      | 1       | 0     |
      | London  | 1      | 0       | 0     |
      | Paris   | 0      | 0       | 1     |
      What if we add string similarities to one-hot encoding?
      Similarity encoding:
      |         | London | Londres | Paris |
      | Londres | 0.3    | 1       | 0     |
      | London  | 1      | 0.3     | 0     |
      | Paris   | 0      | 0       | 1     |
      Here 0.3 = string_similarity(Londres, London).
  21. Similarity encoding: generalizing one-hot encoding by using string similarities.
      $x_i = \big(\mathrm{sim}(s_i, s^{(1)}), \mathrm{sim}(s_i, s^{(2)}), \ldots, \mathrm{sim}(s_i, s^{(k)})\big) \in \mathbb{R}^k$
      - Categories with similar morphology are closer to each other.
      - Unseen categories can be easily encoded.
      - The new feature vector can be interpreted in the same way as one-hot encoding.
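A minimal sketch of the encoder this formula defines, reusing `ngram_similarity` from the snippet above: each string is represented by its similarities to the k categories seen at train time. This is an illustrative re-implementation, not the authors' code (the dirty-cat project referenced at the end of the deck provides a ready-made encoder).

```python
import numpy as np

class SimpleSimilarityEncoder:
    """Encode strings by their similarity to the train-set categories."""

    def fit(self, values):
        # Reference categories s(1), ..., s(k): the unique train-set strings.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # x_i = (sim(s_i, s(1)), ..., sim(s_i, s(k))) in R^k
        return np.array([[ngram_similarity(v, c) for c in self.categories_]
                         for v in values])

enc = SimpleSimilarityEncoder().fit(["Londres", "London", "Paris"])
print(enc.categories_)                       # ['London', 'Londres', 'Paris']
print(enc.transform(["Londres", "London", "Paris"]))
# Reproduces the similarity-encoding table above: 1 for exact matches,
# 0.3 for the London/Londres pair, 0 elsewhere.
print(enc.transform(["Lond0n"]))             # a misspelled, unseen category can still be encoded
```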
  22. Empirical study: 7 real-world datasets, each with at least one string categorical feature of high cardinality.

      | Dataset            | Rows    | Categories | Most frequent (count) | Least frequent (count) | Prediction type |
      | medical charges    | 1.6E+05 | 100        | 3023                  | 613                    | regression      |
      | employee salaries  | 9.2E+03 | 385        | 883                   | 1                      | regression      |
      | open payments      | 1.0E+05 | 973        | 4016                  | 1                      | binary-clf      |
      | midwest survey     | 2.8E+03 | 1009       | 487                   | 1                      | multiclass-clf  |
      | traffic violations | 1.0E+05 | 3043       | 7817                  | 1                      | multiclass-clf  |
      | road safety        | 1.0E+04 | 4617       | 589                   | 1                      | binary-clf      |
      | beer reviews       | 1.0E+04 | 4634       | 25                    | 1                      | multiclass-clf  |
  23. Empirical study: the number of categories grows (slowly) with the number of samples.
      [Figure: number of categories vs. number of rows (log-log) for the seven datasets, with reference curves √n and 5 log2(n).]
  24. Benchmarking classifiers with gradient boosted trees.
      [Figure: prediction scores per dataset (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety) for similarity encoding (3-gram, Levenshtein-ratio, Jaro-Winkler), bag of 3-grams, target encoding, MDV, one-hot encoding, and hash encoding.]
  27. Benchmarking classifiers with gradient boosted trees.
      [Figure: prediction scores per dataset (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, beer reviews) for similarity encoding (3-gram, Levenshtein-ratio, Jaro-Winkler), bag of 3-grams, target encoding, MDV, one-hot encoding, and hash encoding, with the average ranking of each encoder across datasets (from 1.9, best, to 7.6, worst).]
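For context, a sketch of the kind of pipeline such a benchmark implies: similarity-encode the dirty column, then fit a gradient-boosted model and cross-validate. Everything here (the toy job titles, the synthetic salaries, and the inlined 3-gram similarity) is illustrative, not the authors' experimental setup.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def ngram_similarity(a, b, n=3):
    """Same 3-gram similarity as in the earlier sketch, inlined for completeness."""
    grams = lambda s: {(" " + s.lower() + " ")[i:i + n]
                       for i in range(len(s) + 3 - n)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

# Toy dirty column and a salary target that depends (noisily) on the job family.
titles = np.array(["Police Officer III", "Police Oficer III", "Bus Operator",
                   "Bus operator", "Social Worker IV", "Master Police Officer"] * 50)
rng = np.random.default_rng(0)
base = {"Police": 70_000, "Bus": 45_000, "Social": 60_000, "Master": 90_000}
salary = np.array([base[t.split()[0]] for t in titles]) + rng.normal(0, 5_000, len(titles))

# Similarity-encode against the unique categories, then learn on the dense features.
# (In a real benchmark the encoder would be fit inside each CV fold, e.g. via a Pipeline.)
categories = sorted(set(titles))
X = np.array([[ngram_similarity(t, c) for c in categories] for t in titles])
scores = cross_val_score(HistGradientBoostingRegressor(), X, salary, cv=5)
print(scores.mean())
```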
  28. Benchmarking classifiers with ridge.
      [Figure: same comparison as slide 27 with a ridge model; average rankings across datasets range from 1.1 (best) to 7.3 (worst).]
  29. One-hot vs. similarity encoding for different learners.
  30. Problem: dimensionality.
      [Figure: number of categories vs. number of rows for the seven datasets, as on slide 23.]
      For one-hot and similarity encoders: #features = #unique categories in the train set.
  31. Dimensionality reduction (gradient boosted trees).
      [Figure: one-hot encoding and 3-gram similarity encoding at d=100, d=300, and full dimensionality, with reduction strategies including random projections, most frequent categories, K-means, and deduplication with K-means.]
  33. Dimensionality reduction (gradient boosted trees).
      [Figure: prediction scores at d=100, d=300, and full dimensionality for one-hot encoding, 3-gram similarity encoding, and bag of 3-grams, reduced via random projections, most frequent categories, K-means, and deduplication with K-means. Datasets, with cardinality k of the categorical variable and full dimensionality d: employee salaries (k=359, d=1251), open payments (k=912, d=2933), midwest survey (k=722, d=2330), traffic violations (k=2555, d=3810), road safety (k=4000, d=4923).]
  35. Dimensionality reduction (gradient boosted trees).
      [Figure: same comparison as above, now including beer reviews (k=4067, d=6553), with the average ranking of each encoder/reduction combination across datasets (best average rank 2.8*).]
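A sketch of one of the reduction strategies above: keep only the d most frequent categories as prototypes, so the encoding has d columns instead of one per unique category. This is an illustrative reading of the "most frequent categories" strategy, reusing the `ngram_similarity` helper from the earlier sketches; for the "random projections" strategy, scikit-learn's GaussianRandomProjection could instead be applied to the full encoding.

```python
from collections import Counter
import numpy as np

def most_frequent_prototypes(train_values, d=100):
    """Pick the d most frequent categories as reference prototypes."""
    return [cat for cat, _ in Counter(train_values).most_common(d)]

def encode_with_prototypes(values, prototypes):
    """Similarity-encode against the prototypes only: d columns instead of k."""
    return np.array([[ngram_similarity(v, p) for p in prototypes]
                     for v in values])

answers = (["midwest"] * 50 + ["the midwest"] * 10 + ["upper midwest"] * 5
           + ["mid west", "mid-west", "midwwest", "the midwest duh"])
prototypes = most_frequent_prototypes(answers, d=3)
X_small = encode_with_prototypes(answers, prototypes)
print(X_small.shape)   # (n_samples, 3) rather than (n_samples, n_unique_categories)
```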
  36. Dimensionality reduction (ridge).
      [Figure: same comparison as above with a ridge model, with the average ranking of each encoder/reduction combination across datasets (best average rank 2.7*).]
  37. Just a string similarity? What similarity is defined by our encoding? (a kernel)
      $\langle s_i, s_j \rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}(s_i, s^{(l)})\, \mathrm{sim}(s_j, s^{(l)})$
      Reference categories: the categories in the train set shape the similarity!
  38. Just a string similarity? What similarity is defined by our encoding? (a kernel)
      $\langle s_i, s_j \rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}(s_i, s^{(l)})\, \mathrm{sim}(s_j, s^{(l)})$
      Reference categories: the categories in the train set shape the similarity!
      [Figure: the ridge benchmark from slide 28, repeated for reference.]
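A small numeric check of this kernel reading (again reusing `ngram_similarity` from the earlier sketch, with made-up reference categories): the ordinary dot product of two similarity-encoded vectors is exactly the sum above, so the train-set categories are what shapes the similarity between any two strings.

```python
import numpy as np

reference = ["midwest", "the midwest", "upper midwest", "mid-west"]  # s(1), ..., s(k)
s_i, s_j = "midwest.", "the midwest duh"

x_i = np.array([ngram_similarity(s_i, r) for r in reference])
x_j = np.array([ngram_similarity(s_j, r) for r in reference])

# <s_i, s_j>_sim = sum_l sim(s_i, s(l)) * sim(s_j, s(l)) = x_i . x_j
print(x_i @ x_j)
```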
  39. Conclusion. Similarity encoding:
      - improves prediction performance of the learning task;
      - works even with strong dimensionality reduction;
      - needs no data cleaning of categories;
      - is a fast and readily usable embedding.
  40. Conclusion. Similarity encoding:
      - improves prediction performance of the learning task;
      - works even with strong dimensionality reduction;
      - needs no data cleaning of categories;
      - is a fast and readily usable embedding.
      Ongoing work: a theoretical understanding of similarity encoding; more datasets and proper statistical tests; several extensions, e.g. considering more than one variable.
  41. Conclusion. Similarity encoding:
      - improves prediction performance of the learning task;
      - works even with strong dimensionality reduction;
      - needs no data cleaning of categories;
      - is a fast and readily usable embedding.
      Ongoing work: a theoretical understanding of similarity encoding; more datasets and proper statistical tests; several extensions, e.g. considering more than one variable.
      Datasets and examples on learning with dirty categories: dirty-cat.github.io