2019 Data Science Bowl competition solution

DSB Competition Participation Report
2020/05/20 Yu Sato @Yust
Kaggle LT Meetup

Self-introduction
▪ On Kaggle since around February 2019.
▪ Likes: Momoclo, Kaggle, saunas, eel
▪ Curious about: Hotcook

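What is the 2019 Data Science Bowl competition?
From the logs of a children's app, estimate how accurately a user will answer the assessment they are given.

train
installation_id  title            event_code  event_data  accuracy_group
0006a69f         Cart Balancer    4010        NG
                                  4010        OK          2
                 Bird Measurer    4010        NG
                                  4010        NG          0

test
installation_id  title            event_code  event_data  accuracy_group
015776b4         Bird Measurer    4010        OK          3
                 Mushroom Sorter                          ?

Target variable (accuracy_group):
3: OK on the 1st attempt
2: OK on the 2nd attempt
1: OK on the 3rd attempt or later
0: all NG
?: estimate the accuracy_group of the last title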
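The target definition above can be sketched as a small function. This is only an illustration: the function name `accuracy_group` and the encoding of attempts as a list of booleans (True = OK, False = NG) are assumptions, not part of the competition data format.

```python
def accuracy_group(attempts):
    """Map the attempts on one assessment to the target variable.

    attempts: list of booleans in chronological order (True = OK, False = NG).
    This encoding is a hypothetical simplification of the event log.
    """
    if True not in attempts:
        return 0  # all NG
    first_ok = attempts.index(True) + 1  # 1-based number of the first OK attempt
    if first_ok == 1:
        return 3  # OK on the 1st attempt
    if first_ok == 2:
        return 2  # OK on the 2nd attempt
    return 1      # OK on the 3rd attempt or later
```

With this encoding, the two train rows in the table correspond to `accuracy_group([False, True])` and `accuracy_group([False, False])`.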

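Approach
With only two weeks available, we set a plan to work efficiently.
・FE
  - Use features from public kernels, after checking that none of them leak.
  - Select useful features with null importance.
・model
  - Ensemble LGBM models by seed averaging.
  - Optimize the QWK thresholds.
・final submit
  - Pick one submission with the highest public LB score and one with the highest CV score.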

My Team Solution
train, test → Feature Engineering → Feature Selection → LGBM (x3) → Threshold (x3) → Ensemble → submit
- kernel + my team
- Null importance
- KFold
- Seed Averaging
- Optimized Rounder
- Optimized coefficients
- public: 0.532 / private: 0.545
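As a rough illustration of the seed averaging step in this pipeline, the sketch below averages the predictions of runs that differ only in their random seed. `seed_average` and `toy_fit_predict` are hypothetical names; the toy function merely stands in for "train an LGBM model with this seed and predict", which the slides do not show.

```python
import random


def seed_average(fit_predict, seeds):
    """Average the predictions of runs that differ only in the random seed."""
    runs = [fit_predict(seed) for seed in seeds]
    n_rows = len(runs[0])
    return [sum(run[i] for run in runs) / len(runs) for i in range(n_rows)]


def toy_fit_predict(seed):
    # Hypothetical stand-in for training a model with this seed:
    # a fixed signal plus seed-dependent noise.
    rng = random.Random(seed)
    signal = [0.5, 1.5, 2.5]
    return [s + rng.gauss(0, 0.3) for s in signal]


averaged = seed_average(toy_fit_predict, seeds=[0, 1, 2, 3, 4])
```

Averaging over seeds reduces the run-to-run variance of the regression output before any thresholds are applied.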

FeatureEngineering (kernel)
Checked whether any feature is extremely highly correlated with the target (= possible overfit).
→ No such feature was found.

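FeatureEngineering (my team)
Within each installation_id, aggregate the number of assessments taken and the accuracy rate.
User A: 1st → OK on the 4th attempt, 2nd → OK on the 2nd attempt, 3rd → OK on the 1st attempt. The next one will probably be OK on the 1st attempt too.
User B: 1st → all 4 attempts NG. Will the next one be all NG again, or OK after several attempts at best?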
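A minimal sketch of this kind of per-user aggregation is shown below. The record layout `(installation_id, n_correct, n_attempts)` and the feature names `prev_assessments` and `prev_accuracy` are assumptions made for illustration; the team's actual feature code is not shown in the slides.

```python
from collections import defaultdict


def history_features(records):
    """Build history features for each assessment.

    records: chronological tuples (installation_id, n_correct, n_attempts).
    Each output row uses only earlier assessments of the same installation_id,
    so the outcome being predicted is not included in its own features.
    """
    n_assessments = defaultdict(int)
    correct = defaultdict(int)
    attempts = defaultdict(int)
    rows = []
    for inst_id, n_correct, n_attempts in records:
        tried = attempts[inst_id]
        rows.append({
            "installation_id": inst_id,
            "prev_assessments": n_assessments[inst_id],
            # Accuracy rate so far; None when the user has no history yet.
            "prev_accuracy": correct[inst_id] / tried if tried else None,
        })
        n_assessments[inst_id] += 1
        correct[inst_id] += n_correct
        attempts[inst_id] += n_attempts
    return rows


# User A improves over time; user B fails every attempt.
rows = history_features([("A", 1, 4), ("A", 1, 2), ("B", 0, 4), ("A", 1, 1)])
```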

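FeatureSelection
Compare the feature importance obtained with a shuffled target against the feature importance obtained with the actual target.
→ Of about 1,000 features, about 200 were dropped.
Reference: https://www.kaggle.com/ogrellier/feature-selection-with-null-importances
・Importance is high only with the actual data → important feature
・Importance does not change even after shuffling → noise feature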
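The idea can be sketched without any model library. Note the substitution: the deck uses LGBM feature importance, whereas this sketch swaps in the variance reduction of a single best split (`stump_gain`) as a crude importance measure. All function names, the number of shuffles, and the toy data are assumptions.

```python
import random


def stump_gain(x, y):
    """Best variance reduction from one split on x (stand-in for gain importance)."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    base = sse(y)
    best = 0.0
    for t in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [b for a, b in zip(x, y) if a <= t]
        right = [b for a, b in zip(x, y) if a > t]
        best = max(best, base - sse(left) - sse(right))
    return best


def null_importance_score(x, y, n_shuffles=30, seed=0):
    """Fraction of shuffled-target runs whose importance is below the actual one."""
    rng = random.Random(seed)
    actual = stump_gain(x, y)
    y_shuffled = list(y)
    below = 0
    for _ in range(n_shuffles):
        rng.shuffle(y_shuffled)  # break the link between feature and target
        if stump_gain(x, y_shuffled) < actual:
            below += 1
    return below / n_shuffles


# Toy data: one feature that tracks the target and one that is pure noise.
rng = random.Random(42)
y = [rng.random() for _ in range(60)]
signal = [v + rng.gauss(0, 0.05) for v in y]
noise = [rng.random() for _ in range(60)]
signal_score = null_importance_score(signal, y)
noise_score = null_importance_score(noise, y)
```

A feature whose score stays close to 1 is important only with the real target; a feature whose importance looks the same under shuffling is a candidate for removal.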

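Threshold: classifying with thresholds
For QWK, we first estimated a value by regression and then classified that value with thresholds (see the Kaggle book, p. 100). We examined how to decide those thresholds.

f_1, f_2, f_3, f_4 → prediction by regression → thresholds → class

target (regression output)  0.45  1.12  2.90  2.04  0.80  1.77
target (class)              0     1     3     2     0     1

threshold
0.00 ~ 0.80 → 0
0.81 ~ 1.80 → 1
1.81 ~ 2.50 → 2
2.51 ~ 3.00 → 3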
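Applying fixed thresholds to regression outputs can be sketched with the standard library. The function name `to_class` is an assumption; the cut points and sample values are the ones shown on the slide, with each upper bound treated as inclusive so that 0.80 falls in class 0.

```python
from bisect import bisect_left


def to_class(value, thresholds=(0.80, 1.80, 2.50)):
    """Return the class index for a regression output.

    bisect_left counts the thresholds strictly below the value, so a value
    equal to a threshold stays in the lower class (inclusive upper bound).
    """
    return bisect_left(thresholds, value)


preds = [0.45, 1.12, 2.90, 2.04, 0.80, 1.77]
labels = [to_class(p) for p in preds]
```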

Threshold
Find the optimized thresholds on train, then use them to classify test.

train: LGBM → OptimizedRounder
test:  LGBM → OptimizedRounder

train (prob_target, target): (0.45, 1), (1.12, 2), (2.90, 2), (2.04, 0)
test: target 0.45, 1.12, 2.90, 2.04 → target 0, 1, 3, 2

threshold
0.00 ~ 0.80 → 0
0.81 ~ 1.80 → 1
1.81 ~ 2.50 → 2
2.51 ~ 3.00 → 3

OptimizedRounder performs two steps:
・deriving the optimal thresholds
・classifying the regression values
Arai-san has put the code together (*).
(*) https://qiita.com/kaggle_master-arai-san/items/d59b2fb7142ec7e270a5
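The two steps can be sketched as follows. This is not the OptimizedRounder code from the linked article: it replaces that implementation with a plain coordinate grid search so that it runs with the standard library only. The names `qwk` and `fit_thresholds`, the starting thresholds, and the grid resolution are all assumptions.

```python
from bisect import bisect_left


def qwk(y_true, y_pred, n_classes=4):
    """Quadratic weighted kappa for integer labels 0..n_classes-1."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2  # the 1/(N-1)^2 factor cancels in the ratio
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    return 1.0 - num / den if den else 0.0  # 0.0 for the degenerate case


def fit_thresholds(values, y_true, init=(0.5, 1.5, 2.5),
                   upper=3.0, n_grid=61, sweeps=3):
    """Coordinate search: move one threshold at a time, keep strict improvements."""
    th = list(init)
    grid = [upper * i / (n_grid - 1) for i in range(n_grid)]

    def score(thresholds):
        return qwk(y_true, [bisect_left(thresholds, v) for v in values])

    best = score(th)
    for _ in range(sweeps):
        for k in range(len(th)):
            lo = th[k - 1] if k > 0 else float("-inf")
            hi = th[k + 1] if k + 1 < len(th) else float("inf")
            for c in grid:
                if lo < c < hi:  # keep the thresholds sorted
                    trial = th[:k] + [c] + th[k + 1:]
                    s = score(trial)
                    if s > best:
                        best, th = s, trial
    return th, best


# Step 1: derive thresholds on train. Step 2: classify test with them.
train_values = [0.45, 1.12, 2.90, 2.04, 0.80, 1.77]
train_target = [0, 1, 3, 2, 0, 1]
thresholds, train_qwk = fit_thresholds(train_values, train_target)
test_labels = [bisect_left(thresholds, v) for v in [0.45, 1.12, 2.90, 2.04]]
```

The thresholds are fitted once on train and reused unchanged on test, which is the point the slide makes.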

submit
Selected two submissions: the one with the highest public LB score and the one with the highest CV score.
→ #5, which had the highest CV score, also had the highest private LB score.

        #  PublicLB  cv       final sub  PrivateLB
kernel  1  0.550     -        x          0.520
        2  0.542     -                   0.520
        3  0.537     -                   0.534
myTeam  4  0.533     0.60357             0.544
        5  0.532     0.60630  x          0.545
        6  0.532     0.60333             0.544
        7  0.527     0.60583             0.543

Got a silver medal!!

A maxim long handed down among Kagglers.
bestfitting, ranked 1st, also puts it as follows:
"A good CV is half of success. I won't go to the next step if I can't find a good way to evaluate my model."
Trust your CV.

Summary
▪ Even in a short challenge, do the bare minimum that has to be done.
▪ Check similar past competitions.
▪ Trust your CV

Thank you very much.
▪ Kaggle: @Yust
▪ Twitter: @yust_kaggle
▪ e-mail: [email protected]

[Supplement] Evaluation metric (QWK)
・Quadratic Weighted Kappa
・The penalty is large when a prediction misses by a distant class.
Reference: https://bellcurve.jp/statistics/blog/14200.html
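For reference, the standard definition of the metric can be written as below. The symbols are introduced here and do not appear in the slides: O_{ij} is the number of samples with true class i and predicted class j, E_{ij} is the count expected if the true and predicted labels were independent (row total times column total divided by the number of samples), and N is the number of classes (4 in this competition).

```latex
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i - j)^2}{(N - 1)^2}
```

Because the weight grows with the square of the distance between classes, predicting 0 for a true 3 costs nine times as much as predicting 2 for a true 3.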

Yust0724
