Slide 1

Slide 1 text

Practical Usage of Spark GBDT at ele.me
David Chen, Senior Data Engineer
http://mvj3.com
Aug 2018

Slide 2

Slide 2 text

GBDT: Gradient Boosting Decision Tree
1. Classification tree
2. Regression tree

Slide 3

Slide 3 text

Ensemble Learning
  Bagging
    Random Forest
  Boosting
    Adaptive Boosting
    Gradient Boosting
      GBDT
* copied from a colleague's slides
* http://qr.ae/TUIJi8

Slide 4

Slide 4 text

Gradient Boosting
* http://explained.ai/gradient-boosting/L2-loss.html#sec:2.3
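
The idea behind gradient boosting with squared-error (L2) loss can be sketched in a few lines: start from the mean of the targets, then repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the model. The code below is only an illustration under stated assumptions, not the Spark implementation: it uses one feature, depth-1 trees (stumps), and made-up toy data. All function names are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: pick the threshold that minimises squared error."""
    best = None
    # Candidate thresholds are the distinct feature values, except the largest,
    # so that both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbdt(xs, ys, num_iterations=50, learning_rate=0.1):
    """Boosting loop for L2 loss: each stump is fitted to the current residuals."""
    f0 = mean(ys)                      # initial model: constant prediction
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(num_iterations):
        # For L2 loss the negative gradient is simply the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return f0, stumps, learning_rate


def predict_gbdt(model, x):
    f0, stumps, learning_rate = model
    return f0 + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return mean((y - predict_gbdt(model, x)) ** 2 for x, y in zip(xs, ys))


# Toy data (made up for illustration only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 11, 12, 20, 21, 22, 30, 31]

mse_0 = mse(fit_gbdt(xs, ys, num_iterations=0), xs, ys)
mse_1 = mse(fit_gbdt(xs, ys, num_iterations=1), xs, ys)
mse_50 = mse(fit_gbdt(xs, ys, num_iterations=50), xs, ys)
print(mse_0, mse_1, mse_50)  # training error shrinks as stumps are added
```

The learning rate trades off the number of iterations against the size of each correction; a small value with many iterations is the usual choice.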

Slide 5

Slide 5 text

Feature Importance

Slide 6

Slide 6 text

Prediction Models
Delivery Time, Route Plan
MAE <= 3 min, MAE <= 10 min, Accuracy 83.5%
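
For readers unfamiliar with the metric, MAE (mean absolute error) is the average of |actual - predicted| over all samples. The sketch below computes it on made-up delivery times. The slide does not say how the 83.5% accuracy figure is defined, so the tolerance-based hit rate shown here is only one plausible reading and should be treated as an assumption; the function names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def within_tolerance_rate(actual, predicted, tolerance):
    """Share of predictions whose absolute error is at most `tolerance`.

    Assumption: this is one possible definition of "accuracy" for a
    regression model; the slide does not specify the one actually used.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)


# Made-up delivery times in minutes, for illustration only.
actual_minutes = [30.0, 42.0, 25.0, 38.0]
predicted_minutes = [28.0, 45.0, 26.0, 31.0]

mae = mean_absolute_error(actual_minutes, predicted_minutes)
rate = within_tolerance_rate(actual_minutes, predicted_minutes, tolerance=3.0)
print(mae, rate)  # absolute errors are 2, 3, 1, 7
```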

Slide 7

Slide 7 text

Some Data
training time: 20 min - 2 hours
training samples: 3-30 million
DAG tasks: 20-50+
model configuration: {"numIterations": 200, "maxDepth": 5, "maxBins": 28}
daily requests: 10 million - 30+ million
single predict / response time: 2 ms / 5-12 ms
serialised model size: 50 KB - 1.5 MB
number of features: 40-100
Spark version: 2.1.0
alternative frameworks: XGBoost, TensorFlow DNN, Facebook GBDT+LR
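
As a rough sketch, the model configuration on this slide could be loaded and passed to Spark's RDD-based GBDT trainer as shown below. The talk does not say which Spark API or language binding was used in production, so the choice of `pyspark.mllib.tree.GradientBoostedTrees.trainRegressor` is an assumption, as are the helper names `load_gbdt_params` and `train_regressor`. The Spark import is deferred so that the config handling runs without a Spark installation.

```python
import json

# Configuration as shown on the slide.
CONFIG_JSON = '{"numIterations": 200, "maxDepth": 5, "maxBins": 28}'

# Keys this sketch knows how to forward to the trainer.
ALLOWED_KEYS = {"numIterations", "maxDepth", "maxBins"}


def load_gbdt_params(raw_json):
    """Parse the JSON config and check that every value is a positive integer."""
    params = json.loads(raw_json)
    unknown = set(params) - ALLOWED_KEYS
    if unknown:
        raise ValueError("unknown config keys: %s" % sorted(unknown))
    for key, value in params.items():
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("%s must be a positive integer" % key)
    return params


def train_regressor(training_rdd, params):
    """Train a GBDT regressor on an RDD of LabeledPoint (requires pyspark).

    Assumption: all features are continuous, hence the empty
    categoricalFeaturesInfo mapping.
    """
    from pyspark.mllib.tree import GradientBoostedTrees  # deferred import

    return GradientBoostedTrees.trainRegressor(
        training_rdd,
        categoricalFeaturesInfo={},
        numIterations=params["numIterations"],
        maxDepth=params["maxDepth"],
        maxBins=params["maxBins"],
    )


params = load_gbdt_params(CONFIG_JSON)
print(params)
```

With 200 iterations of depth-5 trees, the model holds at most 200 x 31 split nodes, which is consistent with the small serialised model sizes listed on the slide.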

Slide 8

Slide 8 text

Useful Links
1. https://spark.apache.org/docs/latest/mllib-decision-tree.html
2. https://spark.apache.org/docs/latest/mllib-ensembles.html