
How to encode categorical features for GBDT

Jack
December 11, 2019


Transcript

  1. How to encode categorical features for GBDT Ryuji Sakata (Jack)

    Competitions Grandmaster
  2. Self Introduction • Name: Ryuji Sakata • Kaggle account: Jack

    (Japan) (https://www.kaggle.com/rsakata) • Work at Panasonic Corporation as a data scientist and a researcher • Coauthor of the Kaggle book in Japan
  3. Agenda 1. Overview of encoding techniques for categorical features •

    one-hot encoding • label encoding • target encoding • categorical feature support of LightGBM • other encoding techniques 2. Experiments and the results
  4. 1. Overview of encoding techniques for categorical features

  5. One-hot Encoding
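One-hot encoding turns each level of a categorical feature into its own 0/1 indicator column. A minimal plain-Python sketch (the helper name is my own; in practice pandas.get_dummies or scikit-learn's OneHotEncoder would be used):

```python
def one_hot_encode(values):
    """Map a list of category labels to rows of 0/1 indicator columns."""
    levels = sorted(set(values))                       # fixed column order
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)                        # all zeros ...
        row[index[v]] = 1                              # ... except this level
        rows.append(row)
    return levels, rows

levels, encoded = one_hot_encode(["B", "A", "C", "A"])
# columns are ["A", "B", "C"]; each row has exactly one 1
```

Note that the number of columns grows with the cardinality, which is what makes this scheme expensive for high-cardinality features.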

  6. Label Encoding
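Label encoding replaces each level with a single integer id. The ordering is arbitrary, which is why a tree needs several splits to isolate individual levels. A minimal sketch (helper name is my own; scikit-learn's LabelEncoder is the usual tool):

```python
def label_encode(values):
    """Assign an arbitrary (here: alphabetical) integer id to each level."""
    mapping = {level: i for i, level in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["B", "A", "C", "A"])
# codes == [1, 0, 2, 0]
```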

  7. Target Encoding (basic idea)
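The basic idea of target encoding is to replace each level by the mean of the target over the rows with that level. A deliberately naive sketch (helper name is my own) that computes the means on the full training data, which is exactly the leakage risk the next slide warns about:

```python
from collections import defaultdict

def target_encode(cats, target):
    """Replace each category by the in-sample mean of the target."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(cats, target):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in cats]

encoded = target_encode(["A", "B", "A", "B"], [1.0, 3.0, 2.0, 5.0])
# "A" -> mean(1.0, 2.0) = 1.5, "B" -> mean(3.0, 5.0) = 4.0
```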

  8. Leakage risk of Target Encoding

  9. Target Encoding (out-of-fold)
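The out-of-fold variant avoids the leakage: each row is encoded with target means computed only from the other folds, so no row ever sees its own target. A stdlib sketch (helper name is my own; it uses simple contiguous folds and assumes rows are already shuffled; real code would typically use scikit-learn's KFold):

```python
from collections import defaultdict

def oof_target_encode(cats, target, n_folds=5):
    """Out-of-fold target encoding with contiguous folds."""
    n = len(cats)
    global_mean = sum(target) / n            # fallback for unseen levels
    fold_of = [i * n_folds // n for i in range(n)]
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Accumulate statistics from all *other* folds.
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if fold_of[i] != fold:
                sums[cats[i]] += target[i]
                counts[cats[i]] += 1
        # Encode the rows of this fold with those out-of-fold means.
        for i in range(n):
            if fold_of[i] == fold:
                if counts[cats[i]]:
                    encoded[i] = sums[cats[i]] / counts[cats[i]]
                else:
                    encoded[i] = global_mean
    return encoded
```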

  10. Categorical Feature Support of LightGBM • From LightGBM document (https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features)

    ◦ The basic idea is to sort the categories according to the training objective at each split. ◦ More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram. • There is a risk of overfitting, as mentioned before, and the official documentation encourages using the parameters ‘min_data_per_group’ or ‘cat_smooth’ for very high-cardinality features to avoid overfitting. This idea is essentially the same as target encoding.
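The sorting trick quoted above can be illustrated with a toy implementation (my own sketch, not LightGBM's actual code): sort the per-category (sum_gradient, sum_hessian) statistics by their ratio, then scan prefix splits as if the feature were ordered, scoring each candidate with the standard second-order gain G²/H (no regularization, for clarity):

```python
def best_categorical_split(stats):
    """Toy many-vs-many categorical split.

    stats: {category: (sum_gradient, sum_hessian)} at the current node.
    Returns (set of categories sent left, split gain).
    """
    # Sort categories by sum_gradient / sum_hessian, as in LightGBM's docs.
    order = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])
    G = sum(g for g, _ in stats.values())
    H = sum(h for _, h in stats.values())
    best_gain, best_left = 0.0, None
    gl = hl = 0.0
    for k, c in enumerate(order[:-1]):       # try every prefix as the left side
        gl += stats[c][0]
        hl += stats[c][1]
        gain = gl * gl / hl + (G - gl) ** 2 / (H - hl) - G * G / H
        if gain > best_gain:
            best_gain, best_left = gain, set(order[:k + 1])
    return best_left, best_gain
```

With squared loss and a constant prediction, sum_gradient per category is proportional to minus the category's target sum, so this sorting is the same ordering that target encoding would induce, which is the point the slide makes.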
  11. Other encoding techniques • feature hashing (hash encoding) • frequency

    encoding • embedding
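The first two of these can be sketched in a few lines of stdlib Python (helper names are my own). Feature hashing maps levels into a fixed number of buckets and simply accepts collisions; frequency encoding replaces each level by how often it occurs:

```python
import hashlib
from collections import Counter

def hash_encode(values, n_buckets=8):
    """Feature hashing: stable hash of the level, modulo a fixed bucket count."""
    # hashlib is used (not the built-in hash) so results are stable across runs.
    return [int(hashlib.md5(v.encode()).hexdigest(), 16) % n_buckets
            for v in values]

def frequency_encode(values):
    """Replace each level by its occurrence count in the data."""
    counts = Counter(values)
    return [counts[v] for v in values]

print(frequency_encode(["A", "B", "A", "A"]))   # [3, 1, 3, 3]
```

Embedding (learning a dense vector per level, e.g. with a neural network) is not shown here since it requires a trainable model.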
  12. 2. Experiments and the results

  13. Question For GBDT, label encoding and target encoding are typically

    used to handle categorical features. Should we use a different method depending on the characteristics of the dataset?
  14. Synthetic Datasets for Experiments • I created datasets which have

    only categorical features and a real-valued target variable. ( rows and categorical features) • Each categorical feature has the same number of levels (), and the level assigned to each row was chosen at random. • Each level of each categorical feature has a corresponding real value, and the target variable is calculated as the sum of those values plus a random noise term.
  15. Synthetic Datasets for Experiments: example of a dataset with 3 levels (A, B, C) per feature.

    Dataset:

        v1  v2  …  vM | target
        B   C   …  B  |  6.52
        A   A   …  A  | -1.28
        B   B   …  A  | -5.46
        C   A   …  A  | -3.83
        :   :   …  :  |    :

    Corresponding values (each sampled from U(−1, 1)):

        level   v1      v2     …   vM
        A      -0.58   -0.38   …  -0.24
        B      -0.77   -0.70   …   0.49
        C      -0.71    0.05   …   0.39

    The target is the sum of each row's corresponding values plus a random noise term drawn from N(0, 3²).
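The generation process described above can be sketched with the stdlib (function name is my own; it assumes, per the slide, level effects drawn from U(−1, 1) and Gaussian noise with standard deviation 3):

```python
import random

def make_dataset(n_rows, n_features, n_levels, seed=0):
    """Synthetic data: target = sum of per-level effects + N(0, 3^2) noise."""
    rng = random.Random(seed)
    # One effect value per (feature, level) pair, sampled from U(-1, 1).
    effects = [[rng.uniform(-1, 1) for _ in range(n_levels)]
               for _ in range(n_features)]
    X, y = [], []
    for _ in range(n_rows):
        row = [rng.randrange(n_levels) for _ in range(n_features)]
        signal = sum(effects[j][row[j]] for j in range(n_features))
        X.append(row)
        y.append(signal + rng.gauss(0, 3))   # additive Gaussian noise, sd = 3
    return X, y

X, y = make_dataset(n_rows=1000, n_features=25, n_levels=3)
```

Because the noise has standard deviation 3, no model can push the test RMSE below about 3, which is the "theoretical limit" referred to in the result slides.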
  16. Overview of experiments Modeling: • Algorithm: LightGBM • learning_rate: 0.05

    • num_leaves: 4, 8, 16, 32, 64 • lambda_l2: 1 • early_stopping_rounds: 500 • other parameters: default Dataset: • = 100,000 (both training and test) • = 25 Encoding: • one-hot, label, target, LGBM • N folds for target encoding: 5
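For reference, the settings above written out as a LightGBM parameter dict. The objective is my assumption (the slides report RMSE on a real-valued target, implying regression); everything else mirrors the slide, with unlisted parameters left at their defaults:

```python
params = {
    "objective": "regression",   # assumption: real-valued target, RMSE metric
    "learning_rate": 0.05,
    "num_leaves": 4,             # varied over {4, 8, 16, 32, 64} per experiment
    "lambda_l2": 1,
}
# Training would then look something like:
#   lgb.train(params, train_set, valid_sets=[valid_set],
#             callbacks=[lgb.early_stopping(500)])
```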
  17. Results of One-hot Encoding ( ≤ ) If categorical variables

    have higher cardinality, RMSE becomes worse (intuitively obvious), but the performance drop is limited with a smaller number of leaves. (Because of the simple data structure, using complex tree structures is likely to overfit.) Theoretical limit: RMSE ≈ 3, because of the random noise N(0, 3²). = (about 5000 data points per level)
  18. Results of Label Encoding ( ≤ ) It seems that

    label encoding is more likely to overfit compared with one-hot encoding. At = 4 or 5, however, label encoding performed slightly better with num_leaves less than 16. = (about 5000 data points per level)
  19. Results of Target Encoding ( ≤ ) There are smaller

    gaps in performance across the different numbers of leaves. When the cardinality of the categorical features is high and we use complex tree structures, target encoding seems to be the better strategy. = (about 5000 data points per level)
  20. Results of LGBM Encoding ( ≤ ) The tendency was

    very similar to that of target encoding. With simple tree structures, LGBM encoding was better, and with complex tree structures, target encoding was better. = (about 5000 data points per level)
  21. Summary of performances ( ≤ ): RMSE by number of levels.

                 num_leaves = 4               num_leaves = 8               num_leaves = 16
    levels  One-hot Label  Target LGBM   One-hot Label  Target LGBM   One-hot Label  Target LGBM
         3   3.002  3.002  3.002  3.002   3.005  3.006  3.006  3.005   3.009  3.011  3.009  3.009
         4   3.003  3.002  3.003  3.003   3.009  3.009  3.008  3.009   3.014  3.015  3.013  3.013
         5   3.004  3.003  3.005  3.005   3.013  3.011  3.010  3.010   3.017  3.018  3.015  3.016
        10   3.009  3.010  3.010  3.009   3.019  3.019  3.018  3.017   3.031  3.034  3.025  3.023
        15   3.012  3.016  3.012  3.011   3.021  3.025  3.020  3.019   3.033  3.041  3.027  3.030
        20   3.023  3.027  3.019  3.015   3.032  3.039  3.028  3.027   3.042  3.052  3.036  3.037
  22. Summary of performances ( ≤ ): RMSE by number of levels.

                 num_leaves = 32              num_leaves = 64
    levels  One-hot Label  Target LGBM   One-hot Label  Target LGBM
         3   3.014  3.015  3.015  3.015   3.019  3.023  3.019  3.019
         4   3.019  3.022  3.018  3.019   3.026  3.029  3.023  3.026
         5   3.025  3.027  3.021  3.020   3.032  3.037  3.027  3.028
        10   3.044  3.050  3.032  3.033   3.060  3.066  3.040  3.040
        15   3.050  3.060  3.034  3.041   3.069  3.082  3.042  3.047
        20   3.057  3.074  3.044  3.049   3.076  3.101  3.050  3.059
  23. Learning Curves ( = , = ) Target encoding converged

    fastest; label encoding converged slowest.
  24. What happens when the cardinality of categorical features is higher?

  25. Results of Label Encoding ( ≤ ) The num_leaves parameter

    made a large difference to the performance. Larger num_leaves seems to cause more severe overfitting, especially with high-cardinality categorical variables. = (about 500 data points per level)
  26. Results of Target Encoding ( ≤ ) The result is

    quite different from that of label encoding. In contrast to label encoding, larger num_leaves did not deteriorate the performance when the cardinality of the categorical features was high. = (about 500 data points per level)
  27. Results of LGBM Encoding ( ≤ ≤ ) The tendency

    was similar to that of target encoding, but there was a slightly stronger dependency on num_leaves. With simple tree structures, LGBM encoding was better, and with complex tree structures, target encoding was better. = (about 500 data points per level)
  28. Comparison ( ≤ ): plots of target encoding vs. LGBM encoding at = 50, = 100, = 150, and = 200 levels, marking the regions where target encoding was better and where LGBM encoding was better.
  29. Summary of performances ( ≤ ≤ ): RMSE by number of levels.

              num_leaves = 4         num_leaves = 8         num_leaves = 16
    levels  Label  Target LGBM    Label  Target LGBM    Label  Target LGBM
        50  3.084  3.039  3.028   3.112  3.048  3.044   3.132  3.057  3.060
       100  3.175  3.076  3.050   3.229  3.085  3.066   3.269  3.092  3.083
       150  3.270  3.113  3.071   3.355  3.122  3.095   3.406  3.129  3.116
       200  3.342  3.129  3.093   3.436  3.139  3.122   3.500  3.145  3.142
  30. Summary of performances ( ≤ ≤ ): RMSE by number of levels.

              num_leaves = 32        num_leaves = 64
    levels  Label  Target LGBM    Label  Target LGBM
        50  3.154  3.064  3.073   3.194  3.071  3.092
       100  3.302  3.100  3.105   3.341  3.107  3.127
       150  3.454  3.136  3.142   3.500  3.144  3.173
       200  3.554  3.152  3.162   3.609  3.161  3.199
  31. Learning Curves ( = , = ) Very slow convergence!

  32. Learning Curves ( = , = )

  33. Much higher cardinality!

  34. Results of Target & LGBM Encoding ( ≤ ): plots of target encoding vs. LGBM encoding; = (about 20 data points per level)
  35. Summary of performances ( ≤ ≤ ): RMSE by number of levels.

            num_leaves = 4   num_leaves = 8   num_leaves = 16  num_leaves = 32  num_leaves = 64
    levels  Target  LGBM     Target  LGBM     Target  LGBM     Target  LGBM     Target  LGBM
       500  3.276   3.268    3.283   3.291    3.290   3.287    3.296   3.296    3.302   3.330
      1000  3.437   3.393    3.443   3.391    3.449   3.393    3.454   3.406    3.460   3.446
      1500  3.559   3.574    3.565   3.566    3.571   3.558    3.578   3.559    3.580   3.579
      2000  3.641   3.696    3.647   3.688    3.652   3.679    3.656   3.671    3.661   3.671
      3000  3.758   3.897    3.763   3.890    3.768   3.879    3.772   3.867    3.775   3.858
      5000  3.905   4.067    3.909   4.066    3.911   4.064    3.915   4.054    3.918   4.048
  36. Comparison ( ≤ ): plots of target encoding vs. LGBM encoding at = 1000, = 1500, = 2000, = 3000, and = 5000 levels, marking the regions where target encoding was better and where LGBM encoding was better.
  37. If there are many useless categorical features, how does the

    result change? Add 25 additional useless categorical features which do not affect the target.
  38. Results of One-hot Encoding ( ≤ ): deterioration due to the useless features, growing with higher cardinality
  39. Results of Label Encoding ( ≤ )

  40. Results of Target Encoding ( ≤ ) almost no deterioration!

  41. Results of LGBM Encoding ( ≤ )

  42. Conclusion Experiment conditions: • The effect of each feature on

    the target is additive • No feature interactions • Levels of each categorical feature are evenly distributed Insights from the experiment results: • If the cardinality of categorical features is high, it is difficult to capture all the effects with one-hot encoding or label encoding, so target encoding or LGBM encoding is preferable. • If the cardinality is much higher, LGBM encoding also causes overfitting. • Even if there are many useless features, target encoding is not affected by them.