Slide 1

How to encode categorical features for GBDT
Ryuji Sakata (Jack), Competitions Grandmaster

Slide 2

Self Introduction
• Name: Ryuji Sakata
• Kaggle account: Jack (Japan) (https://www.kaggle.com/rsakata)
• Work at Panasonic Corporation as a data scientist and a researcher
• Co-author of the Kaggle book in Japan

Slide 3

Agenda
1. Overview of encoding techniques for categorical features
   • one-hot encoding
   • label encoding
   • target encoding
   • categorical feature support of LightGBM
   • other encoding techniques
2. Experiments and the results

Slide 4

1. Overview of encoding techniques for categorical features

Slide 5

One-hot Encoding
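The figure on this slide is not reproduced in the transcript; as a minimal illustration (toy data, hypothetical column name), one-hot encoding in pandas looks like this:

```python
import pandas as pd

# Toy data: one categorical column with levels A, B, C.
df = pd.DataFrame({"cat": ["A", "B", "C", "A"]})

# One-hot encoding: one binary indicator column per level.
# Each row gets a 1 (True) in the column of its own level, 0 elsewhere.
one_hot = pd.get_dummies(df["cat"], prefix="cat")
print(one_hot)  # columns: cat_A, cat_B, cat_C
```

With high-cardinality features this produces one column per level, which is the source of the scaling problem discussed later in the experiments.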

Slide 6

Label Encoding
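Again as a minimal toy illustration (column name hypothetical), label encoding simply maps each level to an arbitrary integer code:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["A", "B", "C", "A"]})

# Label encoding: each level becomes an integer code
# (here A -> 0, B -> 1, C -> 2; the ordering carries no meaning).
df["cat_label"] = df["cat"].astype("category").cat.codes
print(df)
```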

Slide 7

Target Encoding (basic idea)
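A minimal sketch of the basic idea (toy data): replace each level by the mean of the target over the training rows that have that level.

```python
import pandas as pd

train = pd.DataFrame({
    "cat":    ["A", "A", "B", "B", "C"],
    "target": [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Mean target per level, computed on the training data ...
means = train.groupby("cat")["target"].mean()   # A: 2.0, B: 3.0, C: 5.0

# ... then used to replace the categorical column (also applied to test data).
train["cat_te"] = train["cat"].map(means)
```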

Slide 8

Leakage risk of Target Encoding
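A small sketch (toy data) of why the naive version leaks: when a level is rare, its encoded value is dominated by that row's own target.

```python
import pandas as pd

# Toy training data where the level "D" appears only once.
train = pd.DataFrame({
    "cat":    ["A", "A", "B", "B", "D"],
    "target": [0.0, 1.0, 0.0, 1.0, 7.0],
})

# Naive target encoding computed on the full training data:
means = train.groupby("cat")["target"].mean()
train["cat_te"] = train["cat"].map(means)
# The row with "D" gets cat_te = 7.0, i.e. exactly its own target value,
# so the model can "cheat" on the training data and overfit.
print(train)
```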

Slide 9

Target Encoding (out-of-fold)
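A sketch of the out-of-fold variant, assuming a 5-fold split as in the experiments later in the talk; the function and column names are illustrative, not the speaker's code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(train, col, target, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded using only the
    other folds, so its own target never leaks into its encoded value."""
    encoded = pd.Series(np.nan, index=train.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Means computed on the other folds only.
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train[col].iloc[enc_idx].map(fold_means).to_numpy()
    # Levels unseen in the other folds fall back to the global mean.
    return encoded.fillna(train[target].mean())

# Usage (hypothetical column names):
# train["cat1_te"] = target_encode_oof(train, "cat1", "target", n_splits=5)
```

Test data is encoded with means computed on the full training set, since there is no leakage in that direction.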

Slide 10

Categorical Feature Support of LightGBM
● From the LightGBM documentation (https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features):
○ The basic idea is to sort the categories according to the training objective at each split.
○ More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
→ this idea is the same as target encoding
● There is a risk of overfitting, as mentioned before, and the official documentation encourages using the parameters 'min_data_per_group' or 'cat_smooth' for very high cardinality features to avoid overfitting.
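A sketch of using this native categorical support together with the two regularization parameters mentioned above; the data here is a small random placeholder, not the experiment data.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Tiny synthetic placeholder just to show the API.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "cat1": rng.choice(list("ABC"), size=1000),
    "cat2": rng.choice(list("XYZ"), size=1000),
})
y = rng.normal(size=1000)

# Declare categorical columns so LightGBM handles them natively
# (no manual encoding needed).
cat_cols = ["cat1", "cat2"]
for c in cat_cols:
    train[c] = train[c].astype("category")

dtrain = lgb.Dataset(train, label=y, categorical_feature=cat_cols)

params = {
    "objective": "regression",
    "learning_rate": 0.05,
    "num_leaves": 16,
    # Regularization suggested by the docs for high-cardinality features:
    "min_data_per_group": 100,
    "cat_smooth": 10.0,
}
model = lgb.train(params, dtrain, num_boost_round=100)
```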

Slide 11

Other encoding techniques
● feature hashing (hash encoding)
● frequency encoding
● embedding
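Brief sketches of the first two (embedding is omitted); the bucket count and column names are illustrative only.

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"cat": ["A", "B", "B", "C", "C", "C"]})

# Frequency encoding: replace each level by its relative frequency
# in the training data.
freq = df["cat"].value_counts(normalize=True)
df["cat_freq"] = df["cat"].map(freq)

# Feature hashing: map each level into one of a fixed number of buckets
# via a deterministic hash, so the feature dimension stays bounded.
n_buckets = 8
df["cat_hash"] = df["cat"].map(
    lambda v: int(hashlib.md5(v.encode()).hexdigest(), 16) % n_buckets
)
print(df)
```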

Slide 12

2. Experiments and the results

Slide 13

Question
For GBDT, label encoding and target encoding are typically used to handle categorical features.
Should we use different methods depending on the characteristics of the dataset?

Slide 14

Synthetic Datasets for Experiments
● I created datasets which have only categorical features and a real-valued target variable (N rows and M categorical features).
● Each categorical feature has the same number of levels, and the level assigned to each row was chosen at random.
● Each level of each categorical feature has a corresponding real value, and the target variable is calculated as the sum of those values plus random noise.

Slide 15

Synthetic Datasets for Experiments

Example of dataset (3 levels per feature):

 v1  v2  …  vM | target
  B   C  …   B |   6.52
  A   A  …   A |  -1.28
  B   B  …   A |  -5.46
  C   A  …   A |  -3.83
  :   :  …   : |     :

Corresponding values (each value was sampled from (−1, 1)):

 level |    v1     v2   …     vM
     A | -0.58  -0.38   …  -0.24
     B | -0.77  -0.70   …   0.49
     C | -0.71   0.05   …   0.39

The target is obtained by summing up the corresponding values of each row (e.g. −0.77, 0.05, …, 0.49 for the first row) and adding a random noise that follows N(0, 3²).
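A sketch of how such a dataset could be generated. N = 100,000 rows, M = 25 features and the N(0, 3²) noise come from the experiment overview on the next slide; drawing the per-level values uniformly from (−1, 1) is an assumption, since the slide only gives the interval.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def make_dataset(n_rows, n_features, n_levels, noise_std=3.0):
    """Synthetic data as described on the slide: each level of each
    categorical feature carries a hidden real value, and the target is
    the sum of those values plus Gaussian noise."""
    # Hidden value for every (feature, level) pair, sampled from (-1, 1)
    # (uniform draw assumed).
    level_values = rng.uniform(-1.0, 1.0, size=(n_features, n_levels))

    # Random level assignment for each row and feature.
    codes = rng.integers(0, n_levels, size=(n_rows, n_features))
    X = pd.DataFrame(
        {f"v{j + 1}": codes[:, j].astype(str) for j in range(n_features)}
    )

    # Target = sum of the hidden values of the row's levels + noise.
    signal = level_values[np.arange(n_features), codes].sum(axis=1)
    noise = rng.normal(0.0, noise_std, size=n_rows)
    return X, signal + noise

X_train, y_train = make_dataset(n_rows=100_000, n_features=25, n_levels=20)
```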

Slide 16

Overview of experiments

Modeling:
● Algorithm: LightGBM
● learning_rate: 0.05
● num_leaves: 4, 8, 16, 32, 64
● lambda_l2: 1
● early_stopping_rounds: 500
● other parameters: default

Dataset:
● N = 100,000 rows (both training and test)
● M = 25 categorical features

Encoding:
● one-hot, label, target, LGBM
● number of folds for target encoding: 5
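A minimal sketch of this training setup, assuming a recent LightGBM version where early stopping is passed as a callback; X_train/X_valid stand for the already encoded feature matrices and are placeholders.

```python
import lightgbm as lgb

def fit_lgbm(X_train, y_train, X_valid, y_valid, num_leaves):
    """Train one LightGBM regressor with the parameters listed above."""
    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)
    params = {
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": 0.05,
        "num_leaves": num_leaves,   # 4, 8, 16, 32 or 64 in the experiments
        "lambda_l2": 1.0,
    }
    return lgb.train(
        params,
        dtrain,
        num_boost_round=100_000,    # large cap; early stopping decides
        valid_sets=[dvalid],
        callbacks=[lgb.early_stopping(stopping_rounds=500)],
    )
```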

Slide 17

Results of One-hot Encoding (cardinality ≤ 20)
If the categorical variables have higher cardinality, RMSE becomes worse (intuitively obvious).
But the performance drop is limited with a smaller number of leaves (because of the simple data structure, using complex tree structures is likely to overfit).
The theoretical limit of RMSE is 3 because of the random noise N(0, 3²). Cardinality = 20 corresponds to about 5,000 data points per level.

Slide 18

Results of Label Encoding (cardinality ≤ 20)
It seems that label encoding is more likely to overfit compared with one-hot encoding.
At a cardinality of 4 or 5, however, label encoding had a slightly better performance with num_leaves below 16.
(cardinality = 20: about 5,000 data points per level)

Slide 19

Results of Target Encoding (cardinality ≤ 20)
The performance gaps between different numbers of leaves are smaller.
When the cardinality of the categorical features is high and we use complex tree structures, target encoding seems to be the better strategy.
(cardinality = 20: about 5,000 data points per level)

Slide 20

Results of LGBM Encoding (cardinality ≤ 20)
The tendency was very similar to that of target encoding.
With simple tree structures LGBM encoding was better, and with complex tree structures target encoding was better.
(cardinality = 20: about 5,000 data points per level)

Slide 21

Summary of performances (cardinality ≤ 20), RMSE

num_leaves = 4
cardinality | One-hot | Label | Target | LGBM
          3 |   3.002 | 3.002 |  3.002 | 3.002
          4 |   3.003 | 3.002 |  3.003 | 3.003
          5 |   3.004 | 3.003 |  3.005 | 3.005
         10 |   3.009 | 3.010 |  3.010 | 3.009
         15 |   3.012 | 3.016 |  3.012 | 3.011
         20 |   3.023 | 3.027 |  3.019 | 3.015

num_leaves = 8
cardinality | One-hot | Label | Target | LGBM
          3 |   3.005 | 3.006 |  3.006 | 3.005
          4 |   3.009 | 3.009 |  3.008 | 3.009
          5 |   3.013 | 3.011 |  3.010 | 3.010
         10 |   3.019 | 3.019 |  3.018 | 3.017
         15 |   3.021 | 3.025 |  3.020 | 3.019
         20 |   3.032 | 3.039 |  3.028 | 3.027

num_leaves = 16
cardinality | One-hot | Label | Target | LGBM
          3 |   3.009 | 3.011 |  3.009 | 3.009
          4 |   3.014 | 3.015 |  3.013 | 3.013
          5 |   3.017 | 3.018 |  3.015 | 3.016
         10 |   3.031 | 3.034 |  3.025 | 3.023
         15 |   3.033 | 3.041 |  3.027 | 3.030
         20 |   3.042 | 3.052 |  3.036 | 3.037

Slide 22

Summary of performances (cardinality ≤ 20), RMSE (continued)

num_leaves = 32
cardinality | One-hot | Label | Target | LGBM
          3 |   3.014 | 3.015 |  3.015 | 3.015
          4 |   3.019 | 3.022 |  3.018 | 3.019
          5 |   3.025 | 3.027 |  3.021 | 3.020
         10 |   3.044 | 3.050 |  3.032 | 3.033
         15 |   3.050 | 3.060 |  3.034 | 3.041
         20 |   3.057 | 3.074 |  3.044 | 3.049

num_leaves = 64
cardinality | One-hot | Label | Target | LGBM
          3 |   3.019 | 3.023 |  3.019 | 3.019
          4 |   3.026 | 3.029 |  3.023 | 3.026
          5 |   3.032 | 3.037 |  3.027 | 3.028
         10 |   3.060 | 3.066 |  3.040 | 3.040
         15 |   3.069 | 3.082 |  3.042 | 3.047
         20 |   3.076 | 3.101 |  3.050 | 3.059

Slide 23

Learning Curves
Target encoding converged the most quickly; label encoding converged the most slowly.

Slide 24

What happens when the cardinality of categorical features is higher?

Slide 25

Results of Label Encoding (cardinality ≤ 200)
The num_leaves parameter made a large difference to the performance.
It seems that larger num_leaves causes more severe overfitting, especially with high-cardinality categorical variables.
(cardinality = 200: about 500 data points per level)

Slide 26

Results of Target Encoding (cardinality ≤ 200)
The result is very different from that of label encoding.
In contrast with label encoding, larger num_leaves did not deteriorate the performance when the cardinality of the categorical features is high.
(cardinality = 200: about 500 data points per level)

Slide 27

Results of LGBM Encoding (50 ≤ cardinality ≤ 200)
The tendency was similar to that of target encoding, but there was a slightly stronger dependency on num_leaves.
With simple tree structures LGBM encoding was better, and with complex tree structures target encoding was better.
(cardinality = 200: about 500 data points per level)

Slide 28

Comparison (cardinality ≤ 200)
[Plot comparing target encoding and LGBM encoding for cardinality = 50, 100, 150, 200, marking the regions where target encoding was better and where LGBM encoding was better.]

Slide 29

Summary of performances (50 ≤ cardinality ≤ 200), RMSE

num_leaves = 4
cardinality | Label | Target | LGBM
         50 | 3.084 |  3.039 | 3.028
        100 | 3.175 |  3.076 | 3.050
        150 | 3.270 |  3.113 | 3.071
        200 | 3.342 |  3.129 | 3.093

num_leaves = 8
cardinality | Label | Target | LGBM
         50 | 3.112 |  3.048 | 3.044
        100 | 3.229 |  3.085 | 3.066
        150 | 3.355 |  3.122 | 3.095
        200 | 3.436 |  3.139 | 3.122

num_leaves = 16
cardinality | Label | Target | LGBM
         50 | 3.132 |  3.057 | 3.060
        100 | 3.269 |  3.092 | 3.083
        150 | 3.406 |  3.129 | 3.116
        200 | 3.500 |  3.145 | 3.142

Slide 30

Summary of performances (50 ≤ cardinality ≤ 200), RMSE (continued)

num_leaves = 32
cardinality | Label | Target | LGBM
         50 | 3.154 |  3.064 | 3.073
        100 | 3.302 |  3.100 | 3.105
        150 | 3.454 |  3.136 | 3.142
        200 | 3.554 |  3.152 | 3.162

num_leaves = 64
cardinality | Label | Target | LGBM
         50 | 3.194 |  3.071 | 3.092
        100 | 3.341 |  3.107 | 3.127
        150 | 3.500 |  3.144 | 3.173
        200 | 3.609 |  3.161 | 3.199

Slide 31

Learning Curves
Very slow convergence!

Slide 32

Learning Curves

Slide 33

Much higher cardinality!

Slide 34

Results of Target & LGBM Encoding (cardinality ≤ 5000)
[Plots of the target encoding and LGBM encoding results; cardinality = 5000 corresponds to about 20 data points per level.]

Slide 35

Summary of performances (500 ≤ cardinality ≤ 5000), RMSE

num_leaves = 4
cardinality | Target | LGBM
        500 |  3.276 | 3.268
       1000 |  3.437 | 3.393
       1500 |  3.559 | 3.574
       2000 |  3.641 | 3.696
       3000 |  3.758 | 3.897
       5000 |  3.905 | 4.067

num_leaves = 8
cardinality | Target | LGBM
        500 |  3.283 | 3.291
       1000 |  3.443 | 3.391
       1500 |  3.565 | 3.566
       2000 |  3.647 | 3.688
       3000 |  3.763 | 3.890
       5000 |  3.909 | 4.066

num_leaves = 16
cardinality | Target | LGBM
        500 |  3.290 | 3.287
       1000 |  3.449 | 3.393
       1500 |  3.571 | 3.558
       2000 |  3.652 | 3.679
       3000 |  3.768 | 3.879
       5000 |  3.911 | 4.064

num_leaves = 32
cardinality | Target | LGBM
        500 |  3.296 | 3.296
       1000 |  3.454 | 3.406
       1500 |  3.578 | 3.559
       2000 |  3.656 | 3.671
       3000 |  3.772 | 3.867
       5000 |  3.915 | 4.054

num_leaves = 64
cardinality | Target | LGBM
        500 |  3.302 | 3.330
       1000 |  3.460 | 3.446
       1500 |  3.580 | 3.579
       2000 |  3.661 | 3.671
       3000 |  3.775 | 3.858
       5000 |  3.918 | 4.048

Slide 36

Comparison (cardinality ≤ 5000)
[Plot comparing target encoding and LGBM encoding for cardinality = 1000, 1500, 2000, 3000, 5000, marking the regions where target encoding was better and where LGBM encoding was better.]

Slide 37

If there are many useless categorical features, how does the result change?
Add 25 additional useless categorical features which do not affect the target.

Slide 38

Results of One-hot Encoding (with useless features added)
[Plot annotations: deterioration due to the useless features; higher cardinality.]

Slide 39

Results of Label Encoding (with useless features added)

Slide 40

Results of Target Encoding (with useless features added)
Almost no deterioration!

Slide 41

Results of LGBM Encoding (with useless features added)

Slide 42

Conclusion

Experiment conditions:
● The effect of each feature on the target is additive
● No feature interaction
● Levels of each categorical feature are evenly distributed

Insights from the experiment results:
● If the cardinality of the categorical features is high, it is difficult to capture all of the effects with one-hot encoding or label encoding, so target encoding or LGBM encoding is preferable.
● If the cardinality is much higher, LGBM encoding also causes overfitting.
● Even if there are many useless features, target encoding is hardly affected by them.