Practical Usage of Spark GBDT at ele.me

David Chen

August 06, 2018

Transcript

  1. Practical Usage of Spark GBDT at ele.me, by David Chen, senior data engineer (http://mvj3.com), Aug 2018
  2. GBDT: Gradient Boosting Decision Tree; (1) classification tree, (2) regression tree
  3. Ensemble Learning: Bagging (Random Forest); Boosting (Adaptive Boosting; Gradient Boosting, which includes GBDT) * copied from a colleague's slides * http://qr.ae/TUIJi8
  4. Gradient Boosting * http://explained.ai/gradient-boosting/L2-loss.html#sec:2.3
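
The residual-fitting idea behind gradient boosting with L2 loss can be sketched in a few lines of plain Python. This is an illustrative toy, not the Spark implementation used in the talk: it assumes a single numeric feature with at least two distinct values, uses depth-1 trees (stumps), and every function name here is hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree; assumes xs has at least two distinct values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        # Sum of squared errors if each side predicts its own mean.
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbdt(xs, ys, num_iterations=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(num_iterations):
        # For L2 loss the negative gradient is simply the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbdt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Made-up toy data: a step from 10 to 20.
toy_xs = [1, 2, 3, 4, 5, 6]
toy_ys = [10, 10, 10, 20, 20, 20]
toy_model = fit_gbdt(toy_xs, toy_ys)
```

Each round shrinks the remaining residual by a factor of (1 - learning_rate) on this toy data, which is why many small steps are combined rather than one large one.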

  5. Feature Importance

  6. Predicting Models: Delivery Time, Route Plan. Metrics: MAE <= 3 min, MAE <= 10 min, Accuracy 83.5%
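
Slide 6 states its targets as MAE (mean absolute error) in minutes. As a reminder of what that metric measures, here is a minimal sketch; the delivery times below are made-up numbers, not data from the talk, and the function name is hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical delivery times in minutes (illustrative only).
actual_minutes = [30.0, 42.0, 25.0, 38.0]
predicted_minutes = [28.0, 45.0, 26.0, 36.0]

mae = mean_absolute_error(actual_minutes, predicted_minutes)
meets_target = mae <= 3.0  # compare against an "MAE <= 3 min" style target
```

Because MAE is in the same unit as the prediction, a target such as "MAE <= 3 min" reads directly as "off by three minutes or less on average".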
  7. Some Data
    training time: 20 min - 2 hours
    training samples: 3-30 million
    DAG tasks: 20-50+
    model configuration: {"numIterations": 200, "maxDepth": 5, "maxBins": 28}
    daily requests: 10 million - 30+ million
    single prediction / response time: 2 ms / 5-12 ms
    serialised model size: 50 KB - 1.5 MB
    feature size: 40-100
    Spark version: 2.1.0
    alternative frameworks: XGBoost, TensorFlow DNN, Facebook GBDT+LR
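
The model configuration on slide 7 uses parameter names that match the RDD-based `GradientBoostedTrees` API in `pyspark.mllib.tree`. The sketch below shows one way such a JSON config could be passed to `trainRegressor`. It rests on assumptions the slides do not confirm: that the model is a regressor, that the RDD-based API is the one in use, and that there are no categorical features. `build_train_kwargs` and `train` are hypothetical helper names.

```python
import json

# Configuration as shown on the slide.
CONFIG_JSON = '{"numIterations": 200, "maxDepth": 5, "maxBins": 28}'


def build_train_kwargs(config_json):
    """Translate the JSON config into keyword arguments for trainRegressor."""
    config = json.loads(config_json)
    return {
        # Assumption: no categorical features, so the map is empty.
        "categoricalFeaturesInfo": {},
        "numIterations": config["numIterations"],
        "maxDepth": config["maxDepth"],
        "maxBins": config["maxBins"],
    }


def train(training_rdd, config_json=CONFIG_JSON):
    """Train a GBDT regressor on an RDD of LabeledPoint (requires pyspark)."""
    # Imported lazily so the config helper works without a Spark installation.
    from pyspark.mllib.tree import GradientBoostedTrees
    return GradientBoostedTrees.trainRegressor(
        training_rdd, **build_train_kwargs(config_json))


train_kwargs = build_train_kwargs(CONFIG_JSON)
```

Keeping the configuration as JSON makes it easy to store alongside the serialised model and to vary per model without code changes.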
  8. Useful Links: (1) https://spark.apache.org/docs/latest/mllib-decision-tree.html (2) https://spark.apache.org/docs/latest/mllib-ensembles.html