How to encode categorical features for GBDT

Jack
December 11, 2019

Transcript

  1. How to encode
    categorical features for GBDT
    Ryuji Sakata (Jack)
    Competitions Grandmaster

  2. Self Introduction
    • Name: Ryuji Sakata
    • Kaggle account: Jack (Japan)
    (https://www.kaggle.com/rsakata)
    • Work at Panasonic Corporation as a
    data scientist and a researcher
    • Coauthor of the Kaggle book in Japan

  3. Agenda
    1. Overview of encoding techniques for categorical features
    • one-hot encoding
    • label encoding
    • target encoding
    • categorical feature support of LightGBM
    • other encoding techniques
    2. Experiments and the results

  4. 1. Overview of encoding techniques for categorical features

  5. One-hot Encoding
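    One-hot encoding turns each level of a categorical feature into its own binary column. A minimal sketch with pandas (the column name v1 and the values are illustrative placeholders, not from the slides):

```python
import pandas as pd

# Toy categorical column (names and values are illustrative).
df = pd.DataFrame({"v1": ["A", "B", "C", "A", "B"]})

# One binary indicator column per level.
one_hot = pd.get_dummies(df["v1"], prefix="v1", dtype=int)
print(one_hot)
#    v1_A  v1_B  v1_C
# 0     1     0     0
# 1     0     1     0
# 2     0     0     1
# 3     1     0     0
# 4     0     1     0
```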

  6. Label Encoding
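    Label encoding simply maps each level to an integer code. A minimal sketch with pandas (illustrative column name):

```python
import pandas as pd

df = pd.DataFrame({"v1": ["A", "B", "C", "A", "B"]})

# Map each level to an arbitrary integer code (here: position in sorted order).
df["v1_label"] = df["v1"].astype("category").cat.codes
# A -> 0, B -> 1, C -> 2
```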

  7. Target Encoding (basic idea)
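    The basic idea of target encoding is to replace each level with the mean of the target over the rows that have that level. A minimal sketch of the naive version, computed on the whole training data, which is exactly what the next slide warns about (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "v1":     ["A", "B", "C", "A", "B", "C"],
    "target": [1.0, 2.0, 0.5, 3.0, 1.0, 1.5],
})

# Per-level mean of the target, computed over ALL training rows,
# then mapped back onto each row.
level_means = df.groupby("v1")["target"].mean()   # A: 2.0, B: 1.5, C: 1.0
df["v1_te"] = df["v1"].map(level_means)
```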

  8. Leakage risk of Target Encoding

  9. Target Encoding (out-of-fold)
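    A sketch of the out-of-fold variant: each row is encoded with level means computed only on the other folds, so its own target value never leaks into its feature. The function name and column names are mine; the 5 folds match the setting used in the experiments later:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(train, col, target, n_splits=5, seed=42):
    """Encode column `col` with out-of-fold means of `target`."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Level means from the other folds only.
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(fold_means).to_numpy()
    # Levels unseen in the other folds fall back to the global mean.
    return encoded.fillna(train[target].mean())

# Usage (illustrative): train["v1_te"] = target_encode_oof(train, "v1", "target")
```

    For test data, the level means computed on the full training data would be used instead.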

  10. Categorical Feature Support of LightGBM
    ● From the LightGBM documentation (https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features):
    ○ The basic idea is to sort the categories according to the training objective at each split.
    ○ More specifically, LightGBM sorts the histogram (for a categorical feature) according
    to its accumulated values (sum_gradient / sum_hessian) and then finds the best split
    on the sorted histogram.
    ● There is an overfitting risk, as mentioned before, and the official documentation encourages
    using the parameters 'min_data_per_group' or 'cat_smooth' for very high cardinality
    features to avoid overfitting (see the sketch below).
    This idea is essentially the same as target encoding.
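    A sketch of how this native handling is typically enabled, using the two parameters named in the documentation (the data and parameter values here are illustrative, not the ones used in the experiments below):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Toy data: a categorical column stored with pandas "category" dtype.
X = pd.DataFrame({"v1": pd.Categorical(rng.choice(list("ABCDE"), size=n))})
y = rng.normal(size=n)

params = {
    "objective": "regression",
    "learning_rate": 0.05,
    # Parameters the documentation suggests for very high-cardinality
    # categorical features (illustrative values).
    "min_data_per_group": 100,
    "cat_smooth": 10.0,
    "verbose": -1,
}

# Columns with "category" dtype are treated as categorical; they can also
# be listed explicitly via categorical_feature.
train_set = lgb.Dataset(X, label=y, categorical_feature=["v1"])
model = lgb.train(params, train_set, num_boost_round=100)
```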

  11. Other encoding techniques
    ● feature hashing (hash encoding)
    ● frequency encoding (see the sketch below)
    ● embedding
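    A minimal sketch of frequency encoding, which replaces each level with how often it appears in the training data (column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"v1": ["A", "B", "B", "C", "B", "A"]})

# Relative frequency of each level, mapped back onto the rows.
freq = df["v1"].value_counts(normalize=True)   # B: 0.5, A: 1/3, C: 1/6
df["v1_freq"] = df["v1"].map(freq)
```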

  12. 2. Experiments and the results

  13. Question
    For GBDT, label encoding and target encoding are typically used to
    handle categorical features.
    Should we use different methods depending on the characteristics of
    the dataset?

  14. Synthetic Datasets for Experiments
    ● I created datasets which have only categorical features and a real-valued
    target variable (N rows and M categorical features).
    ● Each categorical feature has the same number of levels (K), and the level
    assigned to each row was chosen at random.
    ● Each level of each categorical feature has a corresponding real value, and the
    target variable is calculated as the sum of those values plus a random noise.

  15. Synthetic Datasets for Experiments
    Example of a dataset (K = 3):

    Dataset:
     v1  v2  …  vM   target
     B   C   …  B      6.52
     A   A   …  A     -1.28
     B   B   …  A     -5.46
     C   A   …  A     -3.83
     :   :       :       :

    Corresponding values (each value was sampled from U(−1, 1)):
     level   v1     v2     …   vM
     A      -0.58  -0.38   …  -0.24
     B      -0.77  -0.70   …   0.49
     C      -0.71   0.05   …   0.39

    The target of each row is the sum of the corresponding values of its levels
    plus a random noise that follows N(0, 3²). For the first row (B, C, …, B) the
    summed values are -0.77, 0.05, …, 0.49.
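    A sketch of how such a dataset could be generated, assuming the distributions reconstructed above (level values from U(−1, 1), noise from N(0, 3²)); the function and variable names are mine, not from the slides:

```python
import numpy as np
import pandas as pd

def make_dataset(n_rows=100_000, n_features=25, n_levels=20, seed=0):
    """Synthetic data: each (feature, level) pair has a hidden value from
    U(-1, 1); the target is the sum of a row's hidden values plus N(0, 3^2) noise."""
    rng = np.random.default_rng(seed)
    level_values = rng.uniform(-1.0, 1.0, size=(n_features, n_levels))
    # Random level assignment (as integer codes) for every row and feature.
    codes = rng.integers(0, n_levels, size=(n_rows, n_features))
    target = level_values[np.arange(n_features), codes].sum(axis=1)
    target += rng.normal(0.0, 3.0, size=n_rows)
    levels = [f"L{k}" for k in range(n_levels)]
    X = pd.DataFrame({
        f"v{j + 1}": pd.Categorical.from_codes(codes[:, j], categories=levels)
        for j in range(n_features)
    })
    return X, pd.Series(target, name="target")
```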

  16. Overview of experiments
    Modeling:
    ● Algorithm: LightGBM
    ● learning_rate: 0.05
    ● num_leaves: 4, 8, 16, 32, 64
    ● lambda_l2: 1
    ● early_stopping_rounds: 500
    ● other parameters: default
    Dataset:
    ● N = 100,000 (both training and test)
    ● M = 25
    Encoding:
    ● one-hot, label, target, LGBM
    ● number of folds for target encoding: 5
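    A sketch of how a single run could be wired up with the parameters above, reusing make_dataset() from the earlier sketch. Generating 200,000 rows and splitting them in half is my way of getting training and test sets that share the same hidden level values, and using the test set for early stopping is an assumption about the setup. With the "category" dtype kept, this corresponds to the LGBM-encoding setting; converting the columns to integer codes instead would give label encoding.

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Training and test data drawn from the same synthetic generator.
X, y = make_dataset(n_rows=200_000, n_features=25, n_levels=20, seed=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.05,
    "num_leaves": 16,   # swept over 4, 8, 16, 32, 64 in the experiments
    "lambda_l2": 1.0,
    "verbose": -1,
}

train_set = lgb.Dataset(X_train, label=y_train)
test_set = lgb.Dataset(X_test, label=y_test, reference=train_set)

model = lgb.train(
    params,
    train_set,
    num_boost_round=100_000,   # large cap; early stopping decides
    valid_sets=[test_set],
    callbacks=[lgb.early_stopping(stopping_rounds=500)],
)
```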

  17. Results of One-hot Encoding (K ≤ 20)
    If categorical variables have higher cardinality, RMSE becomes worse
    (intuitively obvious), but the performance drop is limited with a smaller
    number of leaves. (Because the data structure is simple, complex tree
    structures are likely to overfit.)
    [Plot annotations: theoretical limit of RMSE ≈ 3, because of the random
    noise N(0, 3²); K = 20 gives about 5,000 data points per level.]

  18. Results of Label Encoding (K ≤ 20)
    It seems that label encoding is more likely to overfit compared
    with one-hot encoding.
    At K = 4 or 5, however, label encoding performed slightly better
    when num_leaves was less than 16.
    [Plot annotation: K = 20 (about 5,000 data points per level).]

  19. Results of Target Encoding (K ≤ 20)
    The performance gaps between different numbers of leaves are smaller.
    When the cardinality of categorical features is high and we use complex
    tree structures, target encoding seems to be the better strategy.
    [Plot annotation: K = 20 (about 5,000 data points per level).]

  20. Results of LGBM Encoding (K ≤ 20)
    The tendency was very similar to that of target encoding.
    With simple tree structures, LGBM encoding was better, and with complex
    tree structures, target encoding was better.
    [Plot annotation: K = 20 (about 5,000 data points per level).]

  21. Summary of performances (K ≤ 20)
    RMSE by number of levels (K), encoding method, and num_leaves:

    K  |        num_leaves = 4        |        num_leaves = 8        |        num_leaves = 16
       | One-hot Label  Target LGBM   | One-hot Label  Target LGBM   | One-hot Label  Target LGBM
    3  | 3.002   3.002  3.002  3.002  | 3.005   3.006  3.006  3.005  | 3.009   3.011  3.009  3.009
    4  | 3.003   3.002  3.003  3.003  | 3.009   3.009  3.008  3.009  | 3.014   3.015  3.013  3.013
    5  | 3.004   3.003  3.005  3.005  | 3.013   3.011  3.010  3.010  | 3.017   3.018  3.015  3.016
    10 | 3.009   3.010  3.010  3.009  | 3.019   3.019  3.018  3.017  | 3.031   3.034  3.025  3.023
    15 | 3.012   3.016  3.012  3.011  | 3.021   3.025  3.020  3.019  | 3.033   3.041  3.027  3.030
    20 | 3.023   3.027  3.019  3.015  | 3.032   3.039  3.028  3.027  | 3.042   3.052  3.036  3.037

  22. Summary of performances (K ≤ 20)
    RMSE by number of levels (K), encoding method, and num_leaves:

    K  |        num_leaves = 32       |        num_leaves = 64
       | One-hot Label  Target LGBM   | One-hot Label  Target LGBM
    3  | 3.014   3.015  3.015  3.015  | 3.019   3.023  3.019  3.019
    4  | 3.019   3.022  3.018  3.019  | 3.026   3.029  3.023  3.026
    5  | 3.025   3.027  3.021  3.020  | 3.032   3.037  3.027  3.028
    10 | 3.044   3.050  3.032  3.033  | 3.060   3.066  3.040  3.040
    15 | 3.050   3.060  3.034  3.041  | 3.069   3.082  3.042  3.047
    20 | 3.057   3.074  3.044  3.049  | 3.076   3.101  3.050  3.059

  23. Learning Curves
    Target encoding converged the most quickly.
    Label encoding converged the most slowly.

  24. What happens when the cardinality of
    categorical features is higher?

  25. Results of Label Encoding (K ≤ 200)
    The num_leaves parameter made a large difference to the performance.
    It seems that larger num_leaves causes more severe overfitting,
    especially with high-cardinality categorical variables.
    [Plot annotation: K = 200 (about 500 data points per level).]

  26. Results of Target Encoding (K ≤ 200)
    The result is very different from that of label encoding.
    In contrast to label encoding, larger num_leaves did not deteriorate
    the performance when the cardinality of categorical features was high.
    [Plot annotation: K = 200 (about 500 data points per level).]

  27. Results of LGBM Encoding (50 ≤ K ≤ 200)
    The tendency was similar to that of target encoding, but there was
    a slightly stronger dependency on num_leaves.
    With simple tree structures, LGBM encoding was better, and with complex
    tree structures, target encoding was better.
    [Plot annotation: K = 200 (about 500 data points per level).]

  28. Comparison (K ≤ 200)
    [Plots comparing target encoding and LGBM encoding for K = 50, 100, 150,
    and 200, marking where target encoding was better and where LGBM
    encoding was better.]

  29. Summary of performances (50 ≤ K ≤ 200)
    RMSE by number of levels (K), encoding method, and num_leaves:

    K   |   num_leaves = 4     |   num_leaves = 8     |   num_leaves = 16
        | Label  Target LGBM   | Label  Target LGBM   | Label  Target LGBM
    50  | 3.084  3.039  3.028  | 3.112  3.048  3.044  | 3.132  3.057  3.060
    100 | 3.175  3.076  3.050  | 3.229  3.085  3.066  | 3.269  3.092  3.083
    150 | 3.270  3.113  3.071  | 3.355  3.122  3.095  | 3.406  3.129  3.116
    200 | 3.342  3.129  3.093  | 3.436  3.139  3.122  | 3.500  3.145  3.142

  30. Summary of performances (50 ≤ K ≤ 200)
    RMSE by number of levels (K), encoding method, and num_leaves:

    K   |   num_leaves = 32    |   num_leaves = 64
        | Label  Target LGBM   | Label  Target LGBM
    50  | 3.154  3.064  3.073  | 3.194  3.071  3.092
    100 | 3.302  3.100  3.105  | 3.341  3.107  3.127
    150 | 3.454  3.136  3.142  | 3.500  3.144  3.173
    200 | 3.554  3.152  3.162  | 3.609  3.161  3.199

  31. Learning Curves
    Very slow convergence!

  32. Learning Curves

  33. Much higher cardinality!

  34. Results of Target & LGBM Encoding (K ≤ 5000)
    [Plots for target encoding and LGBM encoding; K = 5000 gives about
    20 data points per level.]

  35. Summary of performances (500 ≤ K ≤ 5000)
    RMSE by number of levels (K), encoding method, and num_leaves:

    K    | num_leaves = 4  | num_leaves = 8  | num_leaves = 16 | num_leaves = 32 | num_leaves = 64
         | Target  LGBM    | Target  LGBM    | Target  LGBM    | Target  LGBM    | Target  LGBM
    500  | 3.276   3.268   | 3.283   3.291   | 3.290   3.287   | 3.296   3.296   | 3.302   3.330
    1000 | 3.437   3.393   | 3.443   3.391   | 3.449   3.393   | 3.454   3.406   | 3.460   3.446
    1500 | 3.559   3.574   | 3.565   3.566   | 3.571   3.558   | 3.578   3.559   | 3.580   3.579
    2000 | 3.641   3.696   | 3.647   3.688   | 3.652   3.679   | 3.656   3.671   | 3.661   3.671
    3000 | 3.758   3.897   | 3.763   3.890   | 3.768   3.879   | 3.772   3.867   | 3.775   3.858
    5000 | 3.905   4.067   | 3.909   4.066   | 3.911   4.064   | 3.915   4.054   | 3.918   4.048

  36. Comparison (K ≤ 5000)
    [Plots comparing target encoding and LGBM encoding for K = 1000, 1500,
    2000, 3000, and 5000, marking where target encoding was better and
    where LGBM encoding was better.]

  37. If there are many useless categorical features,
    how does the result change?
    Add 25 additional useless categorical features
    which have no effect on the target.

  38. Results of One-hot Encoding
    [Plot annotations: "deterioration due to useless features" and
    "higher cardinality".]

  39. Results of Label Encoding

  40. Results of Target Encoding
    [Plot annotation: almost no deterioration!]

  41. Results of LGBM Encoding

  42. Conclusion
    Experiment conditions:
    ● The effect of each feature on the target is additive
    ● No feature interactions
    ● Levels of each categorical feature are evenly distributed
    Insights from the experiment results:
    ● If the cardinality of categorical features is high, it is difficult to capture all the effects with
    one-hot encoding or label encoding, so target encoding or LGBM encoding is preferable.
    ● If the cardinality is much higher, LGBM encoding also causes overfitting.
    ● Even if there are many useless features, target encoding is hardly affected by them.
