
How to encode categorical features for GBDT

Jack
December 11, 2019


Transcript

  1. How to encode categorical features for GBDT Ryuji Sakata (Jack)

    Competitions Grandmaster
  2. Self Introduction • Name: Ryuji Sakata • Kaggle account: Jack

    (Japan) (https://www.kaggle.com/rsakata) • Work at Panasonic Corporation as a data scientist and a researcher • Coauthor of the Kaggle book in Japan
  3. Agenda 1. Overview of encoding techniques for categorical features •

    one-hot encoding • label encoding • target encoding • categorical feature support of LightGBM • other encoding techniques 2. Experiments and the results
  4. 1. Overview of encoding techniques for categorical features

  5. One-hot Encoding
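One-hot encoding turns each level of a categorical feature into its own 0/1 indicator column. A minimal plain-Python sketch (the helper name is my own; in practice pandas.get_dummies or scikit-learn's OneHotEncoder would be used):

```python
def one_hot_encode(values):
    """Map a list of category labels to rows of 0/1 indicator columns."""
    levels = sorted(set(values))                       # fixed column order
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)                        # all zeros ...
        row[index[v]] = 1                              # ... except this level
        rows.append(row)
    return levels, rows

levels, encoded = one_hot_encode(["B", "A", "C", "A"])
# columns are ["A", "B", "C"]; each row has exactly one 1
```

Note that the number of columns grows with the cardinality, which is what makes this scheme expensive for high-cardinality features.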

  6. Label Encoding
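Label encoding replaces each level with a single integer id. The ordering is arbitrary, which is why a tree needs several splits to isolate individual levels. A minimal sketch (helper name is my own; scikit-learn's LabelEncoder is the usual tool):

```python
def label_encode(values):
    """Assign an arbitrary (here: alphabetical) integer id to each level."""
    mapping = {level: i for i, level in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["B", "A", "C", "A"])
# codes == [1, 0, 2, 0]
```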

  7. Target Encoding (basic idea)
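The basic idea of target encoding is to replace each level by the mean of the target over the rows with that level. A deliberately naive sketch (helper name is my own) that computes the means on the full training data, which is exactly the leakage risk the next slide warns about:

```python
from collections import defaultdict

def target_encode(cats, target):
    """Replace each category by the in-sample mean of the target."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(cats, target):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in cats]

encoded = target_encode(["A", "B", "A", "B"], [1.0, 3.0, 2.0, 5.0])
# "A" -> mean(1.0, 2.0) = 1.5, "B" -> mean(3.0, 5.0) = 4.0
```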

  8. Leakage risk of Target Encoding

  9. Target Encoding (out-of-fold)
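The out-of-fold variant avoids the leakage: each row is encoded with target means computed only from the other folds, so no row ever sees its own target. A stdlib sketch (helper name is my own; it uses simple contiguous folds and assumes rows are already shuffled; real code would typically use scikit-learn's KFold):

```python
from collections import defaultdict

def oof_target_encode(cats, target, n_folds=5):
    """Out-of-fold target encoding with contiguous folds."""
    n = len(cats)
    global_mean = sum(target) / n            # fallback for unseen levels
    fold_of = [i * n_folds // n for i in range(n)]
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Accumulate statistics from all *other* folds.
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if fold_of[i] != fold:
                sums[cats[i]] += target[i]
                counts[cats[i]] += 1
        # Encode the rows of this fold with those out-of-fold means.
        for i in range(n):
            if fold_of[i] == fold:
                if counts[cats[i]]:
                    encoded[i] = sums[cats[i]] / counts[cats[i]]
                else:
                    encoded[i] = global_mean
    return encoded
```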

  10. Categorical Feature Support of LightGBM • From LightGBM document (https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features)

    ◦ The basic idea is to sort the categories according to the training objective at each split. ◦ More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram. • There is a risk of overfitting, as mentioned before, and the official documentation encourages using the parameters ‘min_data_per_group’ or ‘cat_smooth’ for very high-cardinality features to avoid overfitting. This idea is essentially the same as target encoding.
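The sorting trick quoted above can be illustrated with a toy implementation (my own sketch, not LightGBM's actual code): sort the per-category (sum_gradient, sum_hessian) statistics by their ratio, then scan prefix splits as if the feature were ordered, scoring each candidate with the standard second-order gain G²/H (no regularization, for clarity):

```python
def best_categorical_split(stats):
    """Toy many-vs-many categorical split.

    stats: {category: (sum_gradient, sum_hessian)} at the current node.
    Returns (set of categories sent left, split gain).
    """
    # Sort categories by sum_gradient / sum_hessian, as in LightGBM's docs.
    order = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])
    G = sum(g for g, _ in stats.values())
    H = sum(h for _, h in stats.values())
    best_gain, best_left = 0.0, None
    gl = hl = 0.0
    for k, c in enumerate(order[:-1]):       # try every prefix as the left side
        gl += stats[c][0]
        hl += stats[c][1]
        gain = gl * gl / hl + (G - gl) ** 2 / (H - hl) - G * G / H
        if gain > best_gain:
            best_gain, best_left = gain, set(order[:k + 1])
    return best_left, best_gain
```

With squared loss and a constant prediction, sum_gradient per category is proportional to minus the category's target sum, so this sorting is the same ordering that target encoding would induce, which is the point the slide makes.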
  11. Other encoding techniques • feature hashing (hash encoding) • frequency

    encoding • embedding
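The first two of these can be sketched in a few lines of stdlib Python (helper names are my own). Feature hashing maps levels into a fixed number of buckets and simply accepts collisions; frequency encoding replaces each level by how often it occurs:

```python
import hashlib
from collections import Counter

def hash_encode(values, n_buckets=8):
    """Feature hashing: stable hash of the level, modulo a fixed bucket count."""
    # hashlib is used (not the built-in hash) so results are stable across runs.
    return [int(hashlib.md5(v.encode()).hexdigest(), 16) % n_buckets
            for v in values]

def frequency_encode(values):
    """Replace each level by its occurrence count in the data."""
    counts = Counter(values)
    return [counts[v] for v in values]

print(frequency_encode(["A", "B", "A", "A"]))   # [3, 1, 3, 3]
```

Embedding (learning a dense vector per level, e.g. with a neural network) is not shown here since it requires a trainable model.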
  12. 2. Experiments and the results

  13. Question For GBDT, label encoding and target encoding are typically

    used to handle categorical features. Should we use a different method depending on the characteristics of the dataset?
  14. Synthetic Datasets for Experiments • I created datasets which have

    only categorical features and a real-valued target variable. ( rows and categorical features) • Each categorical feature has the same number of levels (), and the level assigned to each row was chosen at random. • Each level of each categorical feature has a corresponding real value, and the target variable is calculated as the sum of those values plus a random noise term.
  15. Synthetic Datasets for Experiments: example of a dataset with 3 levels (A, B, C) per feature.

    Dataset:

        v1  v2  …  vM | target
        B   C   …  B  |  6.52
        A   A   …  A  | -1.28
        B   B   …  A  | -5.46
        C   A   …  A  | -3.83
        :   :   …  :  |    :

    Corresponding values (each sampled from U(−1, 1)):

        level   v1      v2     …   vM
        A      -0.58   -0.38   …  -0.24
        B      -0.77   -0.70   …   0.49
        C      -0.71    0.05   …   0.39

    The target is the sum of each row's corresponding values plus a random noise term drawn from N(0, 3²).
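The generation process described above can be sketched with the stdlib (function name is my own; it assumes, per the slide, level effects drawn from U(−1, 1) and Gaussian noise with standard deviation 3):

```python
import random

def make_dataset(n_rows, n_features, n_levels, seed=0):
    """Synthetic data: target = sum of per-level effects + N(0, 3^2) noise."""
    rng = random.Random(seed)
    # One effect value per (feature, level) pair, sampled from U(-1, 1).
    effects = [[rng.uniform(-1, 1) for _ in range(n_levels)]
               for _ in range(n_features)]
    X, y = [], []
    for _ in range(n_rows):
        row = [rng.randrange(n_levels) for _ in range(n_features)]
        signal = sum(effects[j][row[j]] for j in range(n_features))
        X.append(row)
        y.append(signal + rng.gauss(0, 3))   # additive Gaussian noise, sd = 3
    return X, y

X, y = make_dataset(n_rows=1000, n_features=25, n_levels=3)
```

Because the noise has standard deviation 3, no model can push the test RMSE below about 3, which is the "theoretical limit" referred to in the result slides.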
  16. Overview of experiments Modeling: • Algorithm: LightGBM • learning_rate: 0.05

    • num_leaves: 4, 8, 16, 32, 64 • lambda_l2: 1 • early_stopping_rounds: 500 • other parameters: default Dataset: • = 100,000 (both training and test) • = 25 Encoding: • one-hot, label, target, LGBM • N folds for target encoding: 5
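For reference, the settings above written out as a LightGBM parameter dict. The objective is my assumption (the slides report RMSE on a real-valued target, implying regression); everything else mirrors the slide, with unlisted parameters left at their defaults:

```python
params = {
    "objective": "regression",   # assumption: real-valued target, RMSE metric
    "learning_rate": 0.05,
    "num_leaves": 4,             # varied over {4, 8, 16, 32, 64} per experiment
    "lambda_l2": 1,
}
# Training would then look something like:
#   lgb.train(params, train_set, valid_sets=[valid_set],
#             callbacks=[lgb.early_stopping(500)])
```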
  17. Results of One-hot Encoding ( ≤ ) If categorical variables

    have higher cardinality, RMSE becomes worse (intuitively obvious), but the performance drop is limited with a smaller number of leaves. (Because of the simple data structure, using complex tree structures is likely to overfit.) Theoretical limit: RMSE ≈ 3, because of the random noise N(0, 3²). = (about 5000 data points per level)
  18. Results of Label Encoding ( ≤ ) It seems that

    label encoding is more likely to overfit compared with one-hot encoding. At = 4 or 5, however, label encoding performed slightly better with num_leaves less than 16. = (about 5000 data points per level)
  19. Results of Target Encoding ( ≤ ) There are smaller

    gaps in performance across the different numbers of leaves. When the cardinality of the categorical features is high and we use complex tree structures, target encoding seems to be the better strategy. = (about 5000 data points per level)
  20. Results of LGBM Encoding ( ≤ ) The tendency was

    very similar to that of target encoding. With simple tree structures, LGBM encoding was better, and with complex tree structures, target encoding was better. = (about 5000 data points per level)
  21. Summary of performances ( ≤ ): RMSE by number of levels.

                 num_leaves = 4               num_leaves = 8               num_leaves = 16
    levels  One-hot Label  Target LGBM   One-hot Label  Target LGBM   One-hot Label  Target LGBM
         3   3.002  3.002  3.002  3.002   3.005  3.006  3.006  3.005   3.009  3.011  3.009  3.009
         4   3.003  3.002  3.003  3.003   3.009  3.009  3.008  3.009   3.014  3.015  3.013  3.013
         5   3.004  3.003  3.005  3.005   3.013  3.011  3.010  3.010   3.017  3.018  3.015  3.016
        10   3.009  3.010  3.010  3.009   3.019  3.019  3.018  3.017   3.031  3.034  3.025  3.023
        15   3.012  3.016  3.012  3.011   3.021  3.025  3.020  3.019   3.033  3.041  3.027  3.030
        20   3.023  3.027  3.019  3.015   3.032  3.039  3.028  3.027   3.042  3.052  3.036  3.037
  22. Summary of performances ( ≤ ): RMSE by number of levels.

                 num_leaves = 32              num_leaves = 64
    levels  One-hot Label  Target LGBM   One-hot Label  Target LGBM
         3   3.014  3.015  3.015  3.015   3.019  3.023  3.019  3.019
         4   3.019  3.022  3.018  3.019   3.026  3.029  3.023  3.026
         5   3.025  3.027  3.021  3.020   3.032  3.037  3.027  3.028
        10   3.044  3.050  3.032  3.033   3.060  3.066  3.040  3.040
        15   3.050  3.060  3.034  3.041   3.069  3.082  3.042  3.047
        20   3.057  3.074  3.044  3.049   3.076  3.101  3.050  3.059
  23. Learning Curves ( = , = ) Target encoding converged

    fastest; label encoding converged slowest.
  24. What happens when the cardinality of categorical features is higher?

  25. Results of Label Encoding ( ≤ ) The num_leaves parameter

    made a large difference to the performance. Larger num_leaves seems to cause more severe overfitting, especially with high-cardinality categorical variables. = (about 500 data points per level)
  26. Results of Target Encoding ( ≤ ) The result is

    quite different from that of label encoding. In contrast to label encoding, larger num_leaves did not deteriorate the performance when the cardinality of the categorical features was high. = (about 500 data points per level)
  27. Results of LGBM Encoding ( ≤ ≤ ) The tendency

    was similar to that of target encoding, but there was a slightly stronger dependency on num_leaves. With simple tree structures, LGBM encoding was better, and with complex tree structures, target encoding was better. = (about 500 data points per level)
  28. Comparison ( ≤ ): plots of target encoding vs. LGBM encoding at = 50, = 100, = 150, and = 200 levels, marking the regions where target encoding was better and where LGBM encoding was better.
  29. Summary of performances ( ≤ ≤ ): RMSE by number of levels.

              num_leaves = 4         num_leaves = 8         num_leaves = 16
    levels  Label  Target LGBM    Label  Target LGBM    Label  Target LGBM
        50  3.084  3.039  3.028   3.112  3.048  3.044   3.132  3.057  3.060
       100  3.175  3.076  3.050   3.229  3.085  3.066   3.269  3.092  3.083
       150  3.270  3.113  3.071   3.355  3.122  3.095   3.406  3.129  3.116
       200  3.342  3.129  3.093   3.436  3.139  3.122   3.500  3.145  3.142
  30. Summary of performances ( ≤ ≤ ): RMSE by number of levels.

              num_leaves = 32        num_leaves = 64
    levels  Label  Target LGBM    Label  Target LGBM
        50  3.154  3.064  3.073   3.194  3.071  3.092
       100  3.302  3.100  3.105   3.341  3.107  3.127
       150  3.454  3.136  3.142   3.500  3.144  3.173
       200  3.554  3.152  3.162   3.609  3.161  3.199
  31. Learning Curves ( = , = ) Very slow convergence!

  32. Learning Curves ( = , = )

  33. Much higher cardinality!

  34. Results of Target & LGBM Encoding ( ≤ ): plots of target encoding vs. LGBM encoding; = (about 20 data points per level)
  35. Summary of performances ( ≤ ≤ ): RMSE by number of levels.

            num_leaves = 4   num_leaves = 8   num_leaves = 16  num_leaves = 32  num_leaves = 64
    levels  Target  LGBM     Target  LGBM     Target  LGBM     Target  LGBM     Target  LGBM
       500  3.276   3.268    3.283   3.291    3.290   3.287    3.296   3.296    3.302   3.330
      1000  3.437   3.393    3.443   3.391    3.449   3.393    3.454   3.406    3.460   3.446
      1500  3.559   3.574    3.565   3.566    3.571   3.558    3.578   3.559    3.580   3.579
      2000  3.641   3.696    3.647   3.688    3.652   3.679    3.656   3.671    3.661   3.671
      3000  3.758   3.897    3.763   3.890    3.768   3.879    3.772   3.867    3.775   3.858
      5000  3.905   4.067    3.909   4.066    3.911   4.064    3.915   4.054    3.918   4.048
  36. Comparison ( ≤ ): plots of target encoding vs. LGBM encoding at = 1000, = 1500, = 2000, = 3000, and = 5000 levels, marking the regions where target encoding was better and where LGBM encoding was better.
  37. If there are many useless categorical features, how does the

    result change? Add 25 additional useless categorical features which do not affect the target.
  38. Results of One-hot Encoding ( ≤ ): deterioration due to the useless features, growing with higher cardinality
  39. Results of Label Encoding ( ≤ )

  40. Results of Target Encoding ( ≤ ) almost no deterioration!

  41. Results of LGBM Encoding ( ≤ )

  42. Conclusion Experiment conditions: • The effect of each feature on

    the target is additive • No feature interactions • Levels of each categorical feature are evenly distributed Insights from the experiment results: • If the cardinality of categorical features is high, it is difficult to capture all the effects with one-hot encoding or label encoding, so target encoding or LGBM encoding is preferable. • If the cardinality is much higher, LGBM encoding also causes overfitting. • Even if there are many useless features, target encoding is not affected by them.