Slide 2
Slide 2 text
Paper overview
Many libraries exist that implement high-performance tree boosting
Explains in what ways, and by how much, XGBoost is superior among them
→ It comprehensively implements the elements needed for scalability and speed
※ Accuracy itself is comparable to that of other libraries
Table quoted from the original paper
Table 1: Comparison of major tree boosting systems.
System        exact greedy  approximate global  approximate local  out-of-core  sparsity aware  parallel
XGBoost       yes           yes                 yes                yes          yes             yes
pGBRT         no            no                  yes                no           no              yes
Spark MLLib   no            yes                 no                 no           partially       yes
H2O           no            yes                 no                 no           partially       yes
scikit-learn  yes           no                  no                 no           no              no
R GBM         yes           no                  no                 no           partially       no
exact greedy: exhaustively searches all possible candidate splits when splitting the tree (toy sketch after this list)
approximate: discretizes the feature values and finds splits approximately
  global: the same split candidates are used for all branches
  local: the candidates are re-proposed at every split
out-of-core: works by reading from external storage when the data does not fit in memory
parallel: implementation of distributed processing
sparsity aware: implementation of efficient handling of sparse features (API sketch after this list)
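To make the first two columns concrete, below is a minimal toy sketch in Python, not XGBoost's actual implementation: best_split_exact enumerates every distinct feature value as a threshold, while best_split_approx only considers quantile-based candidates, i.e. the feature is effectively discretized. Computing the candidates once on all the data corresponds to the "global" variant; recomputing them on the data reaching each node would be the "local" variant. The variance-reduction gain and all function names here are illustrative assumptions, not the paper's gradient-statistics-based gain.

import numpy as np

def variance_gain(y, mask):
    """Toy split criterion: reduction in squared error.
    (Illustrative only; XGBoost uses a gradient/hessian-based gain.)"""
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    sse = lambda v: ((v - v.mean()) ** 2).sum()
    return sse(y) - (sse(left) + sse(right))

def best_split_exact(x, y):
    """Exact greedy: try every distinct feature value as a threshold."""
    best_gain, best_t = -np.inf, None
    for t in np.unique(x):
        gain = variance_gain(y, x <= t)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

def best_split_approx(x, y, n_candidates=8):
    """Approximate: only try quantile-based candidate thresholds,
    i.e. the feature is discretized into a small number of buckets.
    Proposing the candidates once on all data ~ the 'global' variant."""
    candidates = np.quantile(x, np.linspace(0, 1, n_candidates + 2)[1:-1])
    best_gain, best_t = -np.inf, None
    for t in np.unique(candidates):
        gain = variance_gain(y, x <= t)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = (x > 0.3).astype(float) + rng.normal(scale=0.1, size=1000)
    print("exact :", best_split_exact(x, y))
    print("approx:", best_split_approx(x, y))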
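The XGBoost row of the table also maps onto its Python API. A minimal usage sketch, assuming the xgboost and scipy packages are available; parameter values are illustrative, and out-of-core or distributed training needs additional setup that is not shown here:

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# Sparse input: DMatrix accepts scipy CSR matrices directly, and the
# sparsity-aware split finding visits only the non-missing entries.
rng = np.random.default_rng(0)
X = sp.random(1000, 50, density=0.05, format="csr", random_state=0)
y = rng.integers(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)

params = {
    "objective": "binary:logistic",
    # "exact"  -> exact greedy split enumeration
    # "approx" -> approximate split finding via quantile sketches
    # "hist"   -> histogram-based variant (added after the paper's Table 1)
    "tree_method": "approx",
    "max_depth": 4,
    "eta": 0.1,
}

booster = xgb.train(params, dtrain, num_boost_round=50)
print(booster.eval(dtrain))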