Reading group: Kaggleで勝つデータ分析の技術, Chapter 5

Yust0724
February 05, 2020

Slides for an in-house reading group on the book Kaggleで勝つデータ分析の技術.
This session covers Chapter 5.

Transcript

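  1. What policy should we follow when building a validation set?
     [Diagram: train, valid, model, test (Public), test (Private); fit, predict; local PC (∞ runs/day)]

    From within train, create a valid set whose relationship to the rest of train matches the relationship between train and test.
    → Read the Kaggle book to learn how to build an appropriate valid set!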
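  2. Validation methods
     The right method differs greatly depending on whether train and test are split along a time series.
     ・General methods → p.273 ~ p.280
     ・Methods for time-series data → p.281 ~ p.289

    Note: u++'s blog has a diagram for each method, which makes it a useful reference. https://upura.hatenablog.com/entry/2018/12/04/224436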
  3. 2019 Data Science Bowl
     A competition to predict the correct-answer rate in a children's app. Train and test were split by installation_id, and the Public test set had few samples. Many participants, including the top teams, used GroupKFold.
     [Diagram: train, test (Private), test (Public); axes: time, installation_id]
     train / test → by installation_id

    Solution
    ・1st: GroupKFold, Nested CV
    ・2nd: StratifiedGroupKFold
    ・8rd: GroupKFold
  4. 2019 Data Science Bowl
     A competition to predict the correct-answer rate in a children's app. Train and test were split by installation_id, and the Public test set had few samples. Many participants, including the top teams, used GroupKFold.
     [Diagram: train, test (Private), test (Public); axes: time, installation_id]
     train / test → by installation_id

    Solution
    ・1st: GroupKFold, Nested CV
    ・2nd: StratifiedGroupKFold
    ・8rd: GroupKFold
    Incidentally, we used KFold and placed 114/3497 (Silver).
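
The group-wise splitting mentioned on the slides above can be sketched as follows. This is a minimal pure-Python illustration of the idea, not the code used by any of the listed teams and not scikit-learn's implementation; the `group_kfold` helper and the sample `installation_ids` are hypothetical. In practice, scikit-learn's `GroupKFold` offers this behavior through the `groups` argument of `split`.

```python
from collections import defaultdict


def group_kfold(groups, n_splits):
    """Yield (train_idx, valid_idx) so that no group spans both sides."""
    # Collect the row indices that belong to each group.
    rows_by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        rows_by_group[g].append(idx)

    # Assign the largest groups first, each to the currently smallest fold.
    folds = [[] for _ in range(n_splits)]
    ordered = sorted(rows_by_group, key=lambda g: (-len(rows_by_group[g]), str(g)))
    for g in ordered:
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(rows_by_group[g])

    # Each fold serves once as the valid set; the rest form the train set.
    for k in range(n_splits):
        valid_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train_idx, valid_idx


# Hypothetical installation_id per row (not real competition data).
installation_ids = ["a", "a", "a", "b", "b", "c", "c", "d", "e", "e"]
splits = list(group_kfold(installation_ids, n_splits=3))
for train_idx, valid_idx in splits:
    train_ids = {installation_ids[i] for i in train_idx}
    valid_ids = {installation_ids[i] for i in valid_idx}
    print(sorted(valid_ids), "shared ids:", train_ids & valid_ids)
```

With a plain KFold, rows from the same installation_id can land on both sides of a split, so the valid score is measured on devices the model has already seen, which is not how train and test relate in this competition.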
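  5. ASHRAE - Great Energy Predictor III
     A competition to predict the electricity and water usage of buildings in various countries. Most validation schemes split the data by time. LeakValidation, which uses the leaked data as the valid set, was also popular. The top teams built a separate Model per id, presumably because the trends differ greatly from id to id.
     [Diagram: train, test (Private), test (Public); time axis: 2016, 2017, 2018; other axis: site_id, building_id, meter]
     train / test → by time

    Solution
    ・1st  Model: Site, Meter, Building_id+Meter  cv: time-series CrossValidation (a simple split by time)
    ・2nd  Model: Site+Meter  cv: LeakValidation
    ・5th  Model: Building_id+Meter  cv (fit): TimeSplit (1-5 / 9-12)  cv (predict): use all train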
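  6. ASHRAE - Great Energy Predictor III
     A competition to predict the electricity and water usage of buildings in various countries. Most validation schemes split the data by time. LeakValidation, which uses the leaked data as the valid set, was also popular. The top teams built a separate Model per id, presumably because the trends differ greatly from id to id.
     [Diagram: train, test (Private), test (Public); time axis: 2016, 2017, 2018; other axis: site_id, building_id, meter]
     train / test → by time

    Solution
    ・1st  Model: Site, Meter, Building_id+Meter  cv: time-series CrossValidation (a simple split by time)
    ・2nd  Model: Site+Meter  cv: LeakValidation
    ・5th  Model: Building_id+Meter  cv (fit): TimeSplit (1-5 / 9-12)  cv (predict): use all train
    Incidentally, we used StratifiedKFold and placed 535/3614 (124→535).
    Lesson: do not use StratifiedKFold for regression!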
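
A time-based holdout like the ones listed above can be sketched as follows. The reading of "TimeSplit (1-5 / 9-12)" as "fit on months 1 to 5, validate on months 9 to 12" is an assumption, and the `time_holdout` helper and the sample timestamps are hypothetical, not taken from any team's solution.

```python
from datetime import datetime


def time_holdout(timestamps, train_months, valid_months):
    """Return (train_idx, valid_idx), selecting rows by calendar month."""
    train_idx = [i for i, t in enumerate(timestamps) if t.month in train_months]
    valid_idx = [i for i, t in enumerate(timestamps) if t.month in valid_months]
    return train_idx, valid_idx


# Hypothetical meter readings: one timestamp per month of 2016.
timestamps = [datetime(2016, m, 1) for m in range(1, 13)]
train_idx, valid_idx = time_holdout(
    timestamps, train_months=range(1, 6), valid_months=range(9, 13)
)
print(train_idx)  # rows from January to May
print(valid_idx)  # rows from September to December
```

Because the valid rows all come after the train rows, the split mimics a test set that lies in the future, which a shuffled or stratified K-fold does not.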
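  7. IEEE-CIS Fraud Detection
     A competition to predict whether a transaction is fraudulent. Train and test were split by time, and no User ID was provided. In reality, however, each User's transactions were concentrated within one month.
     [Two diagrams: train, test (Private), test (Public); axes: time, User]

    Our assumption: train and test are split along the time series across all Users.
    Reality: train and test are effectively split by User.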
  8. IEEE-CIS Fraud Detection
     A competition to predict whether a transaction is fraudulent. Train and test were split by time, and no User ID was provided. In reality, however, each User's transactions were concentrated within one month.
     [Diagram: train, test (Private), test (Public); axes: time, User]
     train / test → by user (≒ Month)

    Solution
    ・1st  cv (fit): time holdout  cv (predict): GroupKFold by Month
    ・2nd  cv (fit): time holdout  cv (predict): 1. use all train 2. Time KFold
    ・5th  cv: GroupKFold by Month
    Incidentally, we used use all train and placed 1237/6381 (796→1237).
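
The "GroupKFold by Month" idea above can be sketched by treating each month as one group and holding out one month at a time. The slides do not state how many folds the teams used, so one fold per month is an assumption here, as are the `month_folds` helper, the sample day offsets, and the 30-day month.

```python
def month_folds(month_of_row):
    """Yield (held_out_month, train_idx, valid_idx), one month held out per fold."""
    for month in sorted(set(month_of_row)):
        valid_idx = [i for i, m in enumerate(month_of_row) if m == month]
        train_idx = [i for i, m in enumerate(month_of_row) if m != month]
        yield month, train_idx, valid_idx


# Hypothetical day offsets of transactions; a 30-day "month" is an assumption.
days = [0, 3, 29, 30, 45, 61, 75, 89]
month_of_row = [d // 30 for d in days]
folds = list(month_folds(month_of_row))
for month, train_idx, valid_idx in folds:
    print("held-out month:", month, "valid rows:", valid_idx)
```

Since each User's transactions fall mostly within a single month, holding out whole months keeps most Users on one side of the split, which approximates the User-wise split between train and test.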