Kaggle_Days_Tokyo_-_Feature_Engineering_and_GBDT_Implementation.pdf

threecourse
December 11, 2019

Transcript

  1. Feature Engineering Techniques
    and GBDT Implementation
    Daisuke Kadowaki


  2. Self introduction
    • https://www.kaggle.com/threecourse
    • Kaggle Competitions Master
    (Walmart Recruiting II Winner, Coupon Purchase Prediction 3rd)
    • author of “Data Analysis Techniques to Win Kaggle” (book written in Japanese)
    • organizer of Kaggle Meetup Tokyo
    • freelance engineer(?), used to be an actuary


  3. Agenda
    • I. Feature Engineering Techniques
    About “Data Analysis Techniques to Win Kaggle”
    Some feature engineering techniques
    • II. Gradient Boosting Decision Tree Implementation
    Algorithm overview
    Implementation (in Kaggle Kernel)
    Algorithm variations and computational complexity


  4. I. Feature Engineering Techniques


  5. Agenda - I. Feature Engineering Techniques
    1. Introduce “Data Analysis Techniques to Win Kaggle”
    2. Categorize feature engineering techniques
    3. Aggregate and calculate statistics
    4. Other techniques and ideas


  6. “Data Analysis Techniques to Win Kaggle”
    • book written in Japanese, published in Oct. 2019.
    (https://www.amazon.co.jp/dp/4297108437)
    • authors are threecourse (me), jack, hskksk, maxwell
    • table of contents in English is here (my blog).
    • sold much better than expected
    (more than 10,000 copies in the first month,
    possibly one of the best-selling IT development
    books of the year)


  7. Why did it sell so well?
    - simply great
    readable and comprehensive
    - covers intermediate-level content
    there are many books for beginners, but few for intermediate readers.
    - Kaggle is catchy
    many data scientists know of and are interested in Kaggle,
    even if they might not want to participate.
    - insights from Kaggle are useful not only for competitions but also for business
    in particular, the material on evaluation metrics and validation methods was well received.


  8. Categorize feature engineering techniques
    We categorized feature engineering techniques into:
    • a. transform variables (ex. one-hot encoding, RankGauss)
    • b. merge tables
    • c. aggregate and calculate statistics (see following slides)
    • d. time-series (ex. lag/lead features)
    • e. dimensionality reduction and unsupervised learning (ex. UMAP, clustering)
    • f. other techniques and ideas (see following slides)


  9. Aggregate and calculate statistics
    (excerpt from 3.9 aggregation and statistics)
    How can we aggregate transaction data and create features from it?
    (figure: a user master table and a user log table; the user log is the table to be aggregated)


  10. Aggregate and calculate statistics
    a. simple statistics
    - count, unique count, exist or not
    - sum, average, ratio
    - max, min, std, median, quantile, kurtosis, skewness
    b. statistics using temporal information (ex. for log data)
    - first, most recent
    - interval, frequency
    - interval and record just after key events
    - focusing on order, transition, cooccurrence, repetition
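
    A minimal sketch of a few of the statistics listed above, using pandas. The log table and its column names (user_id, item_id, amount, timestamp) are made up for illustration and are not taken from the book.

```python
import pandas as pd

# Hypothetical user log; all column names and values are illustrative only.
log = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "item_id": ["a", "b", "a", "c", "c"],
    "amount": [100, 250, 50, 300, 100],
    "timestamp": pd.to_datetime(
        ["2019-01-01", "2019-01-03", "2019-01-10", "2019-01-02", "2019-01-04"]
    ),
})

grouped = log.sort_values("timestamp").groupby("user_id")

# a. simple statistics per user
features = grouped.agg(
    n_logs=("amount", "count"),
    n_unique_items=("item_id", "nunique"),
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
    amount_max=("amount", "max"),
)

# b. statistics using temporal information
features["first_time"] = grouped["timestamp"].min()
features["last_time"] = grouped["timestamp"].max()
# mean interval between consecutive logs of the same user, in days
features["mean_interval_days"] = grouped["timestamp"].apply(
    lambda s: s.diff().dt.days.mean()
)

print(features)
```

    The result has one row per user, so it can be merged back onto a user master table by user_id.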


  11. Aggregate and calculate statistics
    c. filter
    - filter by types of logs (ex. types of events or purchased products)
    - filter by time or period (ex. within a week, only holidays)
    d. change the unit of aggregation
    - for example, aggregate not only by user,
    but also over users of the same gender/age/occupation/location
    e. focus not only on users, but also on items
    - aggregate by item
    - group items in the same category
    - focus on special types of products (ex. organic, Asian food)
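
    As a rough illustration of items c and d, the sketch below filters the log by one type of purchase and also aggregates at the occupation level instead of the user level. The tables, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical user master and purchase log (illustrative names and values).
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "occupation": ["engineer", "engineer", "teacher"],
})
log = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "category": ["food", "book", "food", "food", "book"],
    "amount": [100, 200, 300, 50, 150],
})

# c. filter by type of log, then aggregate per user
food_sum = (
    log[log["category"] == "food"]
    .groupby("user_id")["amount"].sum()
    .rename("food_amount_sum")
)

# d. change the unit of aggregation: mean amount over users with the same occupation
merged = log.merge(users, on="user_id", how="left")
occupation_mean = (
    merged.groupby("occupation")["amount"].mean()
    .rename("occupation_amount_mean")
)

# Attach both features to the user master.
features = users.join(food_sum, on="user_id").join(occupation_mean, on="occupation")
print(features)
```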


  12. Other techniques and ideas
    (excerpt of some topics from 3.12 other techniques)
    3.12.1 focus on underlying mechanisms
    • consider users' behavior
    • consider the service provider’s behavior
    • check common practice in the industry (ex. disease diagnostic criteria)
    • combine variables to create an index (ex. Body Mass Index from height and weight)
    • consider the mechanisms of natural phenomena
    • try out the competition host's service yourself


  13. Other techniques and ideas
    3.12.2 focus on relationship between records
    example: Caterpillar Tube Pricing
    • The task was to predict the price for each
    combination of Tube and Quantity (= amount purchased).
    • There were multiple Quantity records for each tube,
    and the Quantity combinations followed a few patterns.
    (ex. some tubes have 4 records where Quantity is [1, 2, 10, 20],
    others have 3 records where Quantity is [1, 5, 10])
    • Here, a feature indicating which Quantity pattern the tube belongs to was effective.
    (cf. https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/16264#91207)
    tube-id quantity target
    tube-001 1 2
    tube-001 2 4
    tube-001 10 20
    tube-001 20 40
    tube-002 1 2
    tube-002 5 4
    tube-002 10 6
    tube-003 1 3
    tube-003 5 6
    tube-003 10 9
    tube-004 1 3
    tube-004 2 6
    tube-004 10 30
    tube-004 20 60
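
    One possible way to build such a pattern feature with pandas, using the tube ids and quantities from the table above. The column names and the string encoding of the pattern are assumptions for illustration, not the method from the linked discussion.

```python
import pandas as pd

# Tube ids and quantities copied from the example table; column names are illustrative.
df = pd.DataFrame({
    "tube_id": ["tube-001"] * 4 + ["tube-002"] * 3 + ["tube-003"] * 3 + ["tube-004"] * 4,
    "quantity": [1, 2, 10, 20, 1, 5, 10, 1, 5, 10, 1, 2, 10, 20],
})

# Represent each tube's set of quantities as a string such as "1_2_10_20".
pattern = (
    df.sort_values(["tube_id", "quantity"])
    .groupby("tube_id")["quantity"]
    .apply(lambda q: "_".join(map(str, q)))
    .rename("quantity_pattern")
)
df = df.join(pattern, on="tube_id")

# Encode the pattern as an integer id so it can be used as a categorical feature.
df["quantity_pattern_id"] = pd.factorize(df["quantity_pattern"])[0]
print(df.drop_duplicates("tube_id"))
```

    Here tube-001 and tube-004 end up with one pattern id, and tube-002 and tube-003 with another.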


  14. Other techniques and ideas
    3.12.2 focus on relationship between records
    example: Quora Question Pairs
    • The task was to classify whether the two questions in a pair have the same content.
    • A question often appeared in more than one question pair.
    Here, when questions (A, B) are the same and (B, C) are the same,
    it can be deduced that questions (A, C) are the same.
    • Also, the number of vertices in the maximum clique containing a question was used as a feature.
    (cf. https://www.slideshare.net/tkm2261/quora-76995457)
    https://qard.is.tohoku.ac.jp/T-Wave/?p=434
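
    The transitive deduction can be sketched with a small union-find structure over pairs already known to be duplicates. This is only an illustration of the idea in pure Python; the function names are hypothetical, the pairs are made up, and the maximum-clique feature is not covered here.

```python
def find(parent, x):
    """Return the root of x, shortening the path to the root on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_duplicates(pairs):
    """Map every question to a representative of its group of duplicates."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups
    return {q: find(parent, q) for q in list(parent)}


# Made-up pairs labeled as "same content".
known_same = [("A", "B"), ("B", "C"), ("D", "E")]
group = group_duplicates(known_same)


def deduced_same(q1, q2):
    """True if q1 and q2 are linked by a chain of known duplicate pairs."""
    return q1 in group and q2 in group and group[q1] == group[q2]


print(deduced_same("A", "C"), deduced_same("A", "D"))
```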


  15. Other techniques and ideas
    3.12.2 focus on relationship between records
    example: Bosch Production Line Performance
    • The task was to classify whether a product was good or bad.
    • Products pass through multiple sensors;
    the data recorded when and which sensors each product passed.
    The features below were used (cf. https://www.slideshare.net/hskksk/kaggle-bosch):
    • Patterns based on which sensors the product passed.
    Several distinct sensor-passing patterns were visible.
    • Other products that had just passed the same sensor.


  16. Other techniques and ideas
    3.12.3 focus on relative values
    • difference and ratio of price compared to
    average of same product/category/user/location
    (ex. Avito Demand Prediction Challenge, 9th solution)
    • loan amount compared to the average of users with the same occupation.
    (ex. Home Credit Default Risk)
    • relative return compared to market average return
    (ex. Two Sigma Financial Modeling Challenge)
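
    A small sketch of difference and ratio features relative to a group average, using pandas. The table is invented and does not come from any of the competitions mentioned.

```python
import pandas as pd

# Hypothetical listings with a category and a price.
df = pd.DataFrame({
    "category": ["phone", "phone", "phone", "sofa", "sofa"],
    "price": [100.0, 200.0, 300.0, 400.0, 600.0],
})

# transform("mean") returns the group mean aligned to every original row.
category_mean = df.groupby("category")["price"].transform("mean")
df["price_diff_to_category_mean"] = df["price"] - category_mean
df["price_ratio_to_category_mean"] = df["price"] / category_mean
print(df)
```

    When the target itself is not involved, as here, the group average can be computed over train and test data together; that choice is left to the reader.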


  17. II. Gradient Boosting Implementation


  18. Agenda - Gradient Boosting Implementation
    • Algorithm overview
    • Implementation (see Kaggle Kernel)
    • Algorithm variations and computational complexity


  19. Algorithm - overview
    Each new decision tree learns from the difference between the target and the value predicted by the existing trees;
    decision trees are added iteratively.
    (figure: trees 1, 2, 3, ..., M. Tree 1 is fit to target y; tree m is fit using target y and $\hat{y}^{(m-1)}$, the prediction of the first m-1 trees.)


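  20. Algorithm overview - prediction
    (figure: a record is passed through trees 1, 2, ..., M and picks up the weight of the leaf it reaches in each tree)
    predicted value $\hat{y} = \sum_{m=1}^{M} w_m$, where $w_m$ is the weight of the leaf reached in the m-th tree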



  21. Algorithm overview - prediction
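    How to predict with the weights of the trees?
    Regression:
    predicted value is $\hat{y}_i = \sum_k w_{k,i}$ ($w_{k,i}$ is the node weight of the k-th tree for the i-th record)
    Classification:
    predicted probability is $p_i = \frac{1}{1 + \exp\left(-\sum_k w_{k,i}\right)}$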
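
    A numeric sketch of the two prediction rules with NumPy, assuming the usual sigmoid link for binary classification. The array of node weights is invented; in a real model each entry would be the weight of the leaf that record i reaches in tree k.

```python
import numpy as np

# node_weights[k, i]: weight of the leaf reached by record i in tree k (made-up values).
node_weights = np.array([
    [0.5, -0.2, 0.1],
    [0.3, -0.4, 0.0],
    [0.2, -0.4, -0.1],
])

# Regression: the prediction is the sum of node weights over trees.
raw_score = node_weights.sum(axis=0)

# Classification: the same sum is passed through a sigmoid to get a probability.
proba = 1.0 / (1.0 + np.exp(-raw_score))

print(raw_score, proba)
```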


  22. Algorithm overview – gradient and hessian
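    How to calculate the gradient and hessian? (note: regularization omitted)
    Regression – objective function RMSE:
    objective: $L = \sum_i \frac{1}{2} (\hat{y}_i - y_i)^2$
    gradient: $g_i = \hat{y}_i - y_i$
    hessian: $h_i = 1$
    Classification – objective function logloss:
    objective: $L = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$
    gradient: $g_i = p_i - y_i$
    hessian: $h_i = p_i (1 - p_i)$
    here, $p_i = \frac{1}{1 + \exp(-\hat{y}_i)}$, where $\hat{y}_i$ is the sum of node weights before the sigmoid.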
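
    A sketch of the two gradient/hessian calculations in NumPy, assuming a squared-error objective with a factor of 1/2 for regression and a sigmoid link for logloss. The function names are hypothetical and are not taken from the Kaggle Kernel.

```python
import numpy as np


def calc_grad_regression(y_true, y_pred):
    """Squared error 0.5 * (y_pred - y_true)^2: gradient and hessian w.r.t. y_pred."""
    grad = y_pred - y_true
    hess = np.ones_like(y_true, dtype=float)
    return grad, hess


def calc_grad_logloss(y_true, raw_score):
    """Logloss on p = sigmoid(raw_score): gradient and hessian w.r.t. raw_score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    grad = p - y_true
    hess = p * (1.0 - p)
    return grad, hess


grad_r, hess_r = calc_grad_regression(np.array([3.0, 1.0]), np.array([2.5, 2.0]))
grad_c, hess_c = calc_grad_logloss(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
print(grad_r, hess_r, grad_c, hess_c)
```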


  23. Algorithm overview – pre-sorted based algorithm
    xgboost’s pre-sorted algorithm is explained below.
    (It is simplified; some optimization techniques are omitted.)
    1. Pre-sort data
      for each feature: pre-sort data to efficiently iterate over split values.
    2. Construct trees
      for each tree construction:
        i. update prediction with existing trees
        ii. calculate gradient and hessian of each record
        for each depth and for each node:
          iii-a. find best split – iterate over features and possible split values
          iii-b. update node and create new child nodes
          iii-c. assign records to child nodes


  24. Algorithm overview – find best split
    Finding the best split is the key to the algorithm. How is the best split decided?
    • The best weight and loss of a group can be approximately calculated
    from the sums of the gradients and hessians of the group.
    (note: constant term and regularization omitted)
    • Thus, once a split is chosen and the group is divided into two nodes,
    the best weight and loss of the two new nodes can be calculated.
    • Iterate over features and possible split values, efficiently with the pre-sorted data.
    • The split that yields the minimum total loss of the two new groups is chosen.
    (cf. XGBoost: A Scalable Tree Boosting System 2.2 Gradient Tree Boosting)
    best weight and loss: $w^* = -\frac{G}{H}$, $\tilde{L}(w^*) = -\frac{G^2}{2H}$, with $G = \sum_i g_i$ and $H = \sum_i h_i$ over the group
    objective function: $\tilde{L}(w) = \sum_i \left( g_i w + \frac{1}{2} h_i w^2 \right) = G w + \frac{1}{2} H w^2$
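
    A sketch of the split search for a single feature, assuming the unregularized formulas from the XGBoost paper (best weight -G/H, best loss -G^2/(2H)). The function names are hypothetical, and the real xgboost code handles many more details.

```python
import numpy as np


def leaf_weight(sum_grad, sum_hess):
    """Best weight of a group, without regularization."""
    return -sum_grad / sum_hess


def leaf_loss(sum_grad, sum_hess):
    """Loss of a group at its best weight, without regularization."""
    return -0.5 * sum_grad ** 2 / sum_hess


def find_best_split(values, grad, hess):
    """Scan one feature in sorted order and return (best_loss, threshold).

    threshold is None when no split improves on the unsplit group.
    """
    order = np.argsort(values, kind="stable")
    v, g, h = values[order], grad[order], hess[order]
    g_left, h_left = np.cumsum(g), np.cumsum(h)
    g_total, h_total = g_left[-1], h_left[-1]
    best_loss, best_threshold = leaf_loss(g_total, h_total), None
    for i in range(len(v) - 1):
        if v[i] == v[i + 1]:
            continue  # no threshold can separate equal values
        loss = leaf_loss(g_left[i], h_left[i]) + leaf_loss(
            g_total - g_left[i], h_total - h_left[i]
        )
        if loss < best_loss:
            best_loss, best_threshold = loss, 0.5 * (v[i] + v[i + 1])
    return best_loss, best_threshold


# Squared-error example: targets [0, 0, 10, 10], current prediction 0 everywhere.
values = np.array([1.0, 2.0, 3.0, 4.0])
grad = np.array([0.0, 0.0, -10.0, -10.0])  # prediction minus target
hess = np.ones(4)
best_loss, best_threshold = find_best_split(values, grad, hess)
print(best_loss, best_threshold)
```

    In this example the chosen threshold separates the two records with target 0 from the two with target 10, and the right child's best weight is leaf_weight(-20, 2) = 10.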


  25. Implementation
    Here, a simple implementation of xgboost’s pre-sorted algorithm is explained.
    See the Kaggle Kernels:
    https://www.kaggle.com/threecourse/gbdt-implementation-kaggle-days-tokyo
    https://www.kaggle.com/threecourse/gbdt-implementation-cython-kaggle-days-tokyo


  26. Implementation – Data class
    field/method | type/return type | description
    values | np.ndarray[float, ndim=2] | values for each feature
    target | np.ndarray[float, ndim=1] | target
    sorted_id | np.ndarray[int, ndim=2] | sorted index (= pointer to record) for each feature
    def __init__(self, x: np.ndarray[float, ndim=2], y: np.ndarray[float, ndim=1]): | None | initializer
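
    A guess at what such a Data class could look like, based only on the field descriptions above. Storing sorted_id as one column of record indices per feature is an assumption; the Kaggle Kernel may lay it out differently.

```python
import numpy as np


class Data:
    """Holds feature values, targets, and a pre-sorted record index per feature."""

    def __init__(self, x: np.ndarray, y: np.ndarray):
        self.values = x  # shape (n_records, n_features)
        self.target = y  # shape (n_records,)
        # sorted_id[:, j] lists record indices in ascending order of feature j.
        self.sorted_id = np.argsort(x, axis=0)


x = np.array([[3.0, 10.0], [1.0, 30.0], [2.0, 20.0]])
y = np.array([0.0, 1.0, 0.0])
data = Data(x, y)
print(data.sorted_id)
```

    Sorting once up front means the split search can walk each feature in value order without re-sorting for every node.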


  27. Implementation – Node class
    field/method | type/return type | description
    id | int | node id
    weight | float | weight
    feature_id | int | split feature id
    feature_value | float | split feature value
    def __init__(self, id: int, weight: float): | None | initializer
    def is_leaf(self) | bool | whether the node is a leaf


  28. Implementation – TreeUtil class
    (all methods are @classmethod)
    field/method | type/return type | description
    def left_child_id(cls, id: int) | int | id of left child node
    def right_child_id(cls, id: int) | int | id of right child node
    def loss(cls, sum_grad: float, sum_hess: float) | float | best loss of the group with sum gradient and hessian
    def weight(cls, sum_grad: float, sum_hess: float) | float | best weight of the group with sum gradient and hessian
    def node_ids_depth(cls, d: int) | List[int] | node ids belonging to the depth
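
    A sketch of the node-id helpers only, assuming nodes are numbered like a binary heap (root is 0, children of node i are 2i+1 and 2i+2). That numbering is an assumption chosen for illustration; the Kaggle Kernel may use another scheme. The loss and weight helpers are left out here.

```python
from typing import List


class TreeUtil:
    """Node-id arithmetic for a complete binary tree with heap-style numbering."""

    @classmethod
    def left_child_id(cls, id: int) -> int:
        return 2 * id + 1

    @classmethod
    def right_child_id(cls, id: int) -> int:
        return 2 * id + 2

    @classmethod
    def node_ids_depth(cls, d: int) -> List[int]:
        # Depth d holds 2**d nodes, starting right after the 2**d - 1 nodes above it.
        return list(range(2 ** d - 1, 2 ** (d + 1) - 1))


print(TreeUtil.node_ids_depth(2))
```

    With this numbering, the children of every node at depth d are exactly the nodes at depth d+1, which makes a depth-wise construction loop simple to write.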


  29. Implementation – Tree class
    field/method | type/return type | description
    def __init__(self, params: dict): | None | initializer
    def construct(self, data: Data, grad: np.ndarray[float, ndim=1], hess: np.ndarray[float, ndim=1]) | None | for each depth, for each node: 1. find best split; 2. update nodes and create new nodes; 3. update the node ids that records belong to
    def predict(self, x: np.ndarray[float, ndim=2]) | np.ndarray[float, ndim=1] | predict with constructed tree


  30. Implementation – GBDTEstimator class
    field/method | type/return type | description
    def __init__(self, params: dict): | None | initializer
    def calc_grad(self, y_true: np.ndarray[float, ndim=1], y_pred: np.ndarray[float, ndim=1]) | Tuple[np.ndarray[float, ndim=1], np.ndarray[float, ndim=1]] | calculate gradient and hessian from target and prediction (abstract method, implemented in Regressor/Classifier)
    def fit(self, x: np.ndarray[float, ndim=2], y: np.ndarray[float, ndim=1]): | None | train by constructing trees
    def predict(self, x: np.ndarray[float, ndim=2]) | np.ndarray[float, ndim=1] | predict with constructed trees


  31. Implementation – GBDTClassifier class
    (inherits GBDTEstimator class)
    field/method | type/return type | description
    def calc_grad(self, y_true: np.ndarray[float, ndim=1], y_pred: np.ndarray[float, ndim=1]) | Tuple[np.ndarray[float, ndim=1], np.ndarray[float, ndim=1]] | calculate gradient and hessian from target and prediction
    def predict_proba(self, x: np.ndarray[float, ndim=2]) | np.ndarray[float, ndim=1] | predict probability with constructed trees


  32. Implementation – use Cython
    The Python implementation is not fast; Cython can be used to speed up the calculation.
    Tips:
    • focus on inner loops
    • turn frequently called functions into C functions
    • declare frequently used variables with C types
    • turn for-each loops into simple range loops


  33. Implementation - results
    Experimented on data from the Otto Group Product Classification Challenge
    (converted into binary classification: classes 1-5 as 0 and classes 6-9 as 1)
    train/valid: 10,000 records each, cpu: i9-9900K
    model | parameter | time | logloss
    xgboost | round:125, max_depth:5, eta:0.1, subsample:0.8, colsample_bytree:0.8, n_thread:1 | 2.8s | 0.1706
    simpleBoost | round:125, max_depth:5, eta:0.1 | 1318.4s | 0.1706
    simpleBoost (w/ Cython) | round:125, max_depth:5, eta:0.1 | 10.6s | 0.1707


  34. Implementation - things to improve
    This implementation leaves a lot of room for improvement.
    • efficient calculation
    • handling of missing values
    • sparse data
    • regularization, pruning
    • monitoring, early stopping
    • multi-class classification
    • histogram-based algorithm
    • LightGBM features – leaf-wise, categorical feature, exclusive feature bundling


  35. Algorithm variations and computational complexity
    – pre-sort based algorithm
    This algorithm is used in xgboost with “exact” tree_method (usually default option).
    1. for each feature, presort records
    2. for each tree construction
    for each depth (for each node)
    a. find best split – iterate over records for each feature
    b. create new nodes and assign records to them
    Finding the best split dominates the computational cost;
    complexity is O(#trees x #depths x #records x #features)


  36. Algorithm variations and computational complexity
    – histogram-based algorithm
    This algorithm is used in lightgbm and in xgboost with the “hist” tree_method.
    1. for each feature
    assign records to histogram bins based on their values of the feature.
    2. for each tree construction
    for each depth (for each node)
    a. make a histogram for each feature.
    the histogram contains the sums of gradient and hessian per bin.
    b. find the split based on the histograms
    Making histograms dominates the computational cost;
    complexity is O(#trees x #depths x #records x #features), the same as the pre-sorted algorithm.
    However, making a histogram is less costly than finding a split over sorted records,
    which is why the histogram-based method is faster than the pre-sorted method.
    (cf. https://lightgbm.readthedocs.io/en/latest/Features.html)
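
    A sketch of the histogram steps for one feature with NumPy, assuming fixed, hand-picked bin edges and the unregularized loss -G^2/(2H). The function names are hypothetical; LightGBM and xgboost choose bins and search splits in more elaborate ways.

```python
import numpy as np


def build_histogram(bin_ids, grad, hess, n_bins):
    """Sum gradients and hessians per bin (one pass over the records)."""
    g_hist = np.bincount(bin_ids, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bin_ids, weights=hess, minlength=n_bins)
    return g_hist, h_hist


def best_split_from_histogram(g_hist, h_hist):
    """Return (best_loss, last bin id of the left child), scanning bin boundaries only."""
    g_total, h_total = g_hist.sum(), h_hist.sum()
    g_left, h_left = np.cumsum(g_hist)[:-1], np.cumsum(h_hist)[:-1]
    g_right, h_right = g_total - g_left, h_total - h_left
    valid = (h_left > 0) & (h_right > 0)  # both children must be non-empty
    loss = np.full(len(g_left), np.inf)
    loss[valid] = -0.5 * (
        g_left[valid] ** 2 / h_left[valid] + g_right[valid] ** 2 / h_right[valid]
    )
    best = int(np.argmin(loss))
    return loss[best], best


values = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.75])
grad = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
hess = np.ones(6)

# Step 1 (done once per feature): assign each record to a bin.
bin_edges = np.array([0.25, 0.5, 0.75])  # hypothetical edges giving 4 bins
bin_ids = np.digitize(values, bin_edges)

# Step 2 (per node): build the histogram, then search over bins instead of records.
g_hist, h_hist = build_histogram(bin_ids, grad, hess, n_bins=4)
best_loss, best_bin = best_split_from_histogram(g_hist, h_hist)
print(bin_ids, g_hist, best_loss, best_bin)
```

    The split search touches only the bins, so its cost depends on the number of bins rather than the number of records.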


  37. Algorithm variations and computational complexity
    – EFB (Exclusive Feature Bundling)
    This technique is used in lightgbm.
    Bundle “exclusive features” to reduce computational complexity.
    “Exclusive features” are features that do not take non-zero values at the same time.
    With EFB,
    • the complexity of making histograms becomes
    O(#features x #records) -> O(#bundles x #records)
    • when finding splits, bundled features are unbundled and
    the original features are used.
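
    A toy sketch of the bundling idea for two binned features, where the second feature's non-zero bins are shifted so that both features share one column without colliding. The function names and the offset scheme are illustrative assumptions; LightGBM's actual procedure (choosing which features to bundle, tolerating a small rate of conflicts) is more involved.

```python
import numpy as np


def are_exclusive(a_bins, b_bins):
    """True if the two features are never non-zero on the same record."""
    return not np.any((a_bins != 0) & (b_bins != 0))


def bundle(a_bins, b_bins, n_bins_a):
    """Merge two exclusive binned features into one column.

    Bin 0 means "zero" for both features. Feature a keeps bins 0..n_bins_a-1;
    non-zero bins of feature b are shifted to start at n_bins_a.
    """
    return np.where(b_bins != 0, b_bins + n_bins_a - 1, a_bins)


# Two sparse binned features (bin 0 = zero value), made up for illustration.
a_bins = np.array([0, 1, 2, 0, 0])
b_bins = np.array([0, 0, 0, 1, 2])

exclusive = are_exclusive(a_bins, b_bins)
bundled = bundle(a_bins, b_bins, n_bins_a=3)
print(exclusive, bundled)
```

    One histogram over the bundled column then carries the information of both original features, since bins 1-2 belong to the first feature and bins 3-4 to the second.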


  38. References
    XGBoost: A Scalable Tree Boosting System
    https://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf
    LightGBM: A Highly Efficient Gradient Boosting Decision Tree
    https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
    XGBoost documentation
    https://xgboost.readthedocs.io/en/latest/index.html
    LightGBM documentation
    https://lightgbm.readthedocs.io/en/latest
    A Gentle Introduction to XGBoost for Applied Machine Learning
    https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
    XGBoost Mathematics Explained
    https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
    A Kaggle Master Explains Gradient Boosting
    http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
    (in Japanese)
    NIPS2017読み会 LightGBM: A Highly Efficient Gradient Boosting Decision Tree
    https://www.slideshare.net/tkm2261/nips2017-lightgbm-a-highly-efficient-gradient-boosting-decision-tree
    XGBoostのお気持ちを一部理解する
    https://qiita.com/kenmatsu4/items/226f926d87de86c28089
