These are my solution slides for the recently held Kaggle competition (M5 Forecasting).
Kaggle name: IHiroaki. M5 Forecasting: Accuracy / Uncertainty
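Contents: 1. Self-introduction 2. Results 3. Approach and thinking 4. Model overview 5. Data exploration 6. Feature selection 7. Model details 8. Reflections and open issues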
1. Self-introduction
2. Results
Competition overview (Accuracy): https://www.kaggle.com/c/m5-forecasting-accuracy/overview
Competition overview (Uncertainty): https://www.kaggle.com/c/m5-forecasting-uncertainty/overview
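3. Approach and thinking
<Effort>
• First competition.
• Worked on it almost without a break for about three months, from mid-March until the competition ended in June.
• Spent 12 to 16 hours a day on it on average.
<Thinking>
Accuracy:
• No features built from predicted values, and no use of the previous day's prediction to predict the next day (the recursive approach). The recursive approach may work for a 2- or 3-day forecast, but over a 28-day horizon the accumulated error could become too large.
• The 28 days immediately before the forecast period are used as training data. (This carries a risk of both overfitting and underfitting, so the model has to be built with care.)
Uncertainty:
• The final Accuracy submission is used as the 50% quantile.
• The difference between actual and predicted values of the Accuracy model over the validation periods is used as the uncertainty.
• Building an Accuracy model that generalizes well is therefore essential.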
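4. Model overview
Accuracy
Model: LightGBM only.
Model structure: to predict all 28 days accurately, a separate model is built for each forecast day. Because of memory constraints, the models are also split by store_id. Total: 28 days × 10 store_ids = 280 models.
Important features: nothing special, only standard features. e.g.
• Basic lag features (mean, max, min, std, median)
• Average encoding (per aggregation level)
• IDs (the IDs given in the training data)
Training time: 8 to 9 (time unit not legible in the source), with risk removed as far as possible.
Note: ways to shorten training time (predictions get a little coarser, but the damage is small):
• Raise the learning rate and reduce num_iter (for lr 0.03, roughly 500 to 700 iterations).
• Drop basic lag features (especially multi_2, 3, 5).
• Drop the per-store_id models (memory problems occur unless the features are cut down enough).
Uncertainty
Model: the same models that were built for Accuracy.
Calculation: the 50% quantile uses the Accuracy final submission. For the other 8 quantiles, the difference between actual and predicted values over the Accuracy validation periods is treated as the uncertainty and spread around the 50% quantile.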
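The 28 × 10 layout and the non-recursive setup can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's code: the store IDs are the ten M5 stores, while `direct_training_pairs` and its `n_lags` parameter are hypothetical names introduced here. Each forecast day gets its own training pairs, and the features use only observed sales, so no prediction is ever fed back in.

```python
from itertools import product

# The ten store_id values in the M5 data.
STORE_IDS = ["CA_1", "CA_2", "CA_3", "CA_4",
             "TX_1", "TX_2", "TX_3",
             "WI_1", "WI_2", "WI_3"]
HORIZONS = range(1, 29)  # forecast day 1 .. 28

# One model per (store_id, forecast day): 10 x 28 = 280 keys.
MODEL_KEYS = list(product(STORE_IDS, HORIZONS))


def direct_training_pairs(sales, horizon, n_lags=7):
    """Build (features, target) pairs for one horizon (hypothetical helper).

    Features are the last `n_lags` observed values up to day t;
    the target is the observed value at day t + horizon.
    """
    pairs = []
    for t in range(n_lags - 1, len(sales) - horizon):
        features = sales[t - n_lags + 1 : t + 1]
        target = sales[t + horizon]
        pairs.append((features, target))
    return pairs
```

A model for a given key would then be trained only on the pairs built for its own horizon, which is what lets feature importance differ between day 1 and day 28.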
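5. Data exploration: (1) Trend
Plot of units sold (total): at first glance, sales appear to rise over the whole period.
Number of registered items per day: items are added day by day, which is presumably what drives the rise in the total. (Figure label: 30490.)
[Figure, top: daily total units sold. Bottom: daily number of registered items.]
The top figure seems to show growth over the whole period, but the bottom figure shows that items are registered continually. The rise is therefore likely caused by newly introduced items, in which case the top figure cannot reveal the underlying tendency. So next we look at daily total units sold grouped by item registration date (the date sales started).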
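5. Data exploration: (1) Trend
Plot of units sold, grouped by when sales started.
[Figure: total units sold over time, one panel per sales-start group; x-axis 2011 to 2016.]
In every panel, sales decline until the earlier part of 2015 and then increase from the later part of 2015 into 2016. This may indicate that some trend changed, so care is needed in choosing the validation periods and in how the model is built. However, it is unclear whether the change comes from some measure taken by the company or from a consumer trend, and this was one of the hard points of the competition.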
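5. Data exploration: (1) Trend
Plot of units sold per 28-day period (by store_id).
[Figure: total units sold per 28-day period, by store_id and by sales-start group.]
The graphs show how total units sold per 28-day period fluctuate. Perhaps because demand is fairly constant, a 28-day period that follows a sharp rise looks somewhat suppressed, as if correcting. x = 70 is the Public LB period, and many graphs show a sharp rise there. Judging by eye, there is therefore some chance that total units sold in the 28-day Private period will fall relative to the Public LB period. (Can rolling lag features bring this into the model?)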
5. Data exploration: (1) Trend
Plot of units sold per 28-day period (by store_id). [Figure only.]
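5. Data exploration: (2) Various zero patterns
[Figure: various zero patterns in individual items, with panels labeled "?", "Change strategy?", "Irregular", and "Long term".]
There were a great many opinions about zero patterns in the Discussion forum. The time series contain many zeros, and a large share of them are strategic or inevitable zeros. Companies have inventory and sales strategies, and these can change from year to year. Predicting zero patterns without knowing those strategies therefore carries a fair amount of risk. Even for zeros caused by stock-outs or by days that simply had no sales, the evaluation metric does not tolerate being off by a single day, so predicting zero patterns is risky.
Should zero patterns not be predicted at all when the inventory and sales strategies are unknown?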
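6. Feature selection
Important features
• Basic lag features
  → lag features per store_id × item_id
  → lag features per store_id × item_id and day of week
• Averages
  → day-of-week averages (Monday to Sunday) per store_id × item_id, state_id × item_id, and item_id
  → day-of-month averages (1 to 31) per store_id × item_id, state_id × item_id, and item_id
• Price changes
• IDs given in the training data
Features that were tried but did not work
• Features built from predicted values (for example, features that quantify zero-sales patterns)
• New categories derived by clustering (similar items, complementary items)
• External data, etc.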
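As a rough illustration of the day-of-week averages listed above, the sketch below computes a mean per (series key, weekday) using only rows before a cutoff date, so that nothing from the forecast period leaks in. The function name `weekday_mean_encoding`, the row layout, and the cutoff argument are assumptions made for this example and are not taken from the slides.

```python
from collections import defaultdict
from datetime import date


def weekday_mean_encoding(rows, cutoff):
    """Mean sales per (key, weekday), using only rows dated before `cutoff`.

    rows: iterable of (key, day, sales), where `key` identifies the series
    (for example a (store_id, item_id) tuple) and `day` is a datetime.date.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, day, sales in rows:
        if day < cutoff:  # exclude the period to be predicted
            group = (key, day.weekday())
            sums[group] += sales
            counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Toy data: three observations of one series, seven days apart.
ROWS = [
    (("CA_1", "ITEM_A"), date(2016, 5, 2), 2),
    (("CA_1", "ITEM_A"), date(2016, 5, 9), 4),
    (("CA_1", "ITEM_A"), date(2016, 5, 23), 100),  # on the cutoff, ignored
]
ENCODING = weekday_mean_encoding(ROWS, cutoff=date(2016, 5, 23))
```

The same function covers the state_id × item_id and item_id variants by changing what is used as the key, and a day-of-month version only needs `day.day` in place of `day.weekday()`.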
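6. Feature selection
[Figure: Feature Importance Plot - Top 20, panels pred_day1 and pred_day28.]
The day-1 model and the day-28 model differ considerably in which features rank highest. The closer to day 1, the more important the lag features are; the closer to day 28, the more important generalized features such as averages and IDs become. This is the justification for splitting the model into 28.
Note: feature names are explained on the next slide.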
6. Feature selection
Explanation of feature names
• sales_residual_diff_28_roll_365 : the target (details on the next slide)
• multi_5_sales_residual_diff_28_roll_365_shift_1_roll_4_mean :
  Code:
    df["Target_shift_1"] = df.groupby(["id"])["Target"].transform(lambda x: x.shift(1))
    df.groupby(["id", "multi_5"])["Target_shift_1"].transform(lambda x: x.rolling(4).mean())
• private_sales_residual_diff_28_roll_365_enc_week(day)_LEVEL12_mean :
  private: uses data up to the day before the private period.
  enc_week(day)_LEVEL12_mean: average sales per day of week (or day of month) at LEVEL12.
• sell_price_minority12 : the first and second decimal places of sell_price. ex) 10.58345 => 58
• id_serial : a serial number from 0 to 30489 assigned to each ID
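The sell_price_minority12 example (10.58345 => 58) can be reproduced with a few lines of Python. This is only one possible way to extract the two digits, and the function name `price_minority12` is hypothetical. Going through `Decimal` is a choice made here to avoid binary floating-point surprises such as 0.29 * 100 evaluating to 28.999...

```python
from decimal import Decimal, ROUND_DOWN


def price_minority12(price):
    """Return the first two decimal digits of a price as an integer 0..99."""
    # str() gives the shortest decimal representation of the float,
    # so 10.58345 becomes Decimal("10.58345") rather than a binary expansion.
    hundredths = (Decimal(str(price)) * 100).to_integral_value(rounding=ROUND_DOWN)
    return int(hundredths) % 100
```

For example, `price_minority12(10.58345)` gives 58 and `price_minority12(0.29)` gives 29.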
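7. Model details: (1) Trend removal
TARGET = TARGET - TARGET.shift(28).rolling(365)
As noted in the Discussion forum, tree-based models need the trend removed before they can forecast the future, so I decided to remove the trend from the given data. However, I did not want to use values predicted by machine learning or similar methods as the basis for trend removal, because that is risky. I did try trend removal based on predictions, but the future predictions differed widely depending on the formula fitted to the trend. Among the options that use actual values, the most stable was to subtract TARGET.shift(28).rolling(365) from TARGET and use the result as the new TARGET. Because it relies on actual values, it does not remove the trend completely, and I feel its effect was limited. Still, as far as I could verify locally, the detrended version was more stable in the later part of the 28-day forecast (the days close to day 28).
[Figure: three series, TARGET, TARGET.shift(28).rolling(365), and TARGET - TARGET.shift(28).rolling(365).]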
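A minimal pure-Python sketch of this transformation follows. The slide writes `rolling(365)` without naming the aggregation, so the rolling mean used here is an assumption, and the function names are hypothetical. Days without a full window are returned as `None`, mirroring the missing values pandas would produce with its default settings.

```python
def lagged_rolling_mean(sales, lag=28, window=365):
    """Mean of the `window` values ending `lag` days before each day.

    Assumption: the aggregation is a mean; the slides only say rolling(365).
    """
    baseline = []
    for t in range(len(sales)):
        end = t - lag              # last day included in the window
        start = end - window + 1   # first day included in the window
        if start < 0:
            baseline.append(None)  # not enough history yet
        else:
            baseline.append(sum(sales[start:end + 1]) / window)
    return baseline


def detrend(sales, lag=28, window=365):
    """TARGET minus its lagged rolling baseline, None where undefined."""
    baseline = lagged_rolling_mean(sales, lag, window)
    return [None if b is None else s - b for s, b in zip(sales, baseline)]
```

Because the baseline is lagged by 28 days, it can be computed from actual sales for every one of the 28 forecast days, so a predicted residual can be shifted back to the sales scale without relying on any predicted value.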
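7. Model details: (2) Weight
lightgbm.Dataset(x_train, y_train, weight=myweight)
objective : regression
The gradient of the squared WRMSSE (the evaluation metric) was computed and passed as the weight of lightgbm.Dataset.
Weight^2 ÷ Scaled was computed in advance for all 42,840 series, expanded to the 30,490 items, and summed.
Steps:
1. Compute (Weight^2 ÷ Scaled) for all levels (42,840 series).
2. Convert to 30,490 items × 12 levels: the values of series outside the 30,490-item level are assigned to items according to each ID category.
3. Sum along the level axis.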
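The three steps can be sketched on a toy hierarchy with three levels instead of twelve. Everything here is an illustrative assumption: the function name `expand_weights`, the level names, and the weight and scale numbers are invented, and the real computation would use the competition's 12 levels and 42,840 series. The result is one value per bottom-level series, which is the kind of per-row weight that `lightgbm.Dataset` accepts through its `weight` argument.

```python
def expand_weights(bottom_keys, level_maps, level_stats):
    """Sum weight**2 / scale over every level that contains each bottom series.

    bottom_keys: list of bottom-level keys, here (store_id, item_id) tuples.
    level_maps:  level name -> function mapping a bottom key to its group key.
    level_stats: level name -> {group key: (weight, scale)}.
    """
    sample_weights = []
    for key in bottom_keys:
        total = 0.0
        for level, to_group in level_maps.items():
            weight, scale = level_stats[level][to_group(key)]
            total += weight ** 2 / scale
        sample_weights.append(total)
    return sample_weights


# Toy hierarchy (invented numbers): total -> store -> store x item.
BOTTOM_KEYS = [("CA_1", "A"), ("CA_1", "B"), ("TX_1", "A")]
LEVEL_MAPS = {
    "total": lambda key: "all",
    "store": lambda key: key[0],
    "store_item": lambda key: key,
}
LEVEL_STATS = {
    "total": {"all": (1.0, 2.0)},
    "store": {"CA_1": (0.6, 1.0), "TX_1": (0.4, 0.5)},
    "store_item": {
        ("CA_1", "A"): (0.3, 1.0),
        ("CA_1", "B"): (0.3, 0.5),
        ("TX_1", "A"): (0.4, 2.0),
    },
}
TOY_WEIGHTS = expand_weights(BOTTOM_KEYS, LEVEL_MAPS, LEVEL_STATS)
```

Each row of the training data would then receive the value belonging to its own series.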
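7. Model details: (3) Number of iterations
Considering training time, LearningRate → 0.03 with Iter → 500 ~ 700 would have been enough. However, the point of convergence varied somewhat by store_id and by period, and since this was a competition, I adopted LearningRate → 0.01 with Iter → 1200 & 1500 (blend).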
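7. Model details: (4) Day-by-day models
As expected, the per-day models score considerably better. The effect is especially large for days 1 to 3, and if accuracy is the goal, I feel the 28 models are important.
[Figure annotation: 0.016.]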
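7. Model details: (5) Other
• Validation periods
  (Validation period 1) 2016-04-25 ~ 2016-05-22 : score 0.53 (Public LB)
  (Validation period 2) 2016-03-28 ~ 2016-04-24 : score 0.51
  (Validation period 3) 2016-02-29 ~ 2016-03-27 : score 0.60
  (Test period) 2016-05-23 ~ 2016-06-19 : score 0.576 (Private LB)
  => Because new items are introduced day after day, and because the trend may have changed recently, the validation periods were taken as close to the test period as possible.
• Parameters: changed slightly depending on store_id.
• Metric: written with reference to the following notebook (it uses matrix operations, so it computes quickly).
  (https://www.kaggle.com/girmdshinsei/for-japanese-beginner-with-wrmsse-in-lgbm)
• No recursive approach and no use of predicted values
• No post-processing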
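For readers unfamiliar with the metric behind these scores, here is a small plain-Python sketch of RMSSE and its weighted sum, as I understand the competition definition: the squared forecast error is scaled by the mean squared one-day difference of the training series, measured from the first non-zero sale. It is not the matrix implementation from the notebook linked above, the function names are chosen for this example, and the weights are simply supplied by the caller.

```python
import math


def rmsse(train, actual, forecast):
    """Root mean squared scaled error for a single series."""
    # Start the scale calculation at the first non-zero observation.
    first = next((i for i, value in enumerate(train) if value != 0), None)
    if first is None or len(train) - first < 2:
        raise ValueError("need at least two observations from the first sale")
    history = train[first:]
    squared_diffs = [(b - a) ** 2 for a, b in zip(history, history[1:])]
    scale = sum(squared_diffs) / len(squared_diffs)
    mse = sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)
    return math.sqrt(mse / scale)


def wrmsse(series, weights):
    """Weighted sum of per-series RMSSE values.

    series: list of (train, actual, forecast) tuples, one per series.
    weights: one weight per series, expected to sum to 1.
    """
    return sum(w * rmsse(*s) for s, w in zip(series, weights))
```

With `train=[0, 0, 1, 3, 2]`, `actual=[2, 4]`, and `forecast=[1, 2]`, both the scale and the mean squared error are 2.5, so the RMSSE is 1.0.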
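7. Model details: (1) Computing the 50% quantile
The 50% quantile is the final submission from M5 - Accuracy.
The reasoning, for the Accuracy prediction model, is:
  if WRMSSE on the validation periods ≈ WRMSSE on the test period,
  then error (uncertainty) on the validation periods ≈ error (uncertainty) on the test period.
If the Accuracy prediction model generalizes, it can be used as it is to derive the uncertainty.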
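7. Model details: (2) Computing the uncertainty range (quantiles other than 50%)
Using the model that produced the final Accuracy submission, take the error = |actual - predicted| over the validation periods and sort the errors in ascending order. Three validation periods were used, so there are 28 × 3 = 84 errors in total.
ex) diff = [0.5, 0.7, 1.4, 1.6, 1.7, 2.2, 2.6 ... 8.2, 8.5]
[Figure: the 84 sorted errors plotted, with positions 42, 56, 81, and 84 marked and the offsets A, B, C, D; a second plot is labeled True and Pred.]
A: uncertainty for the 75.0% and 25.0% quantiles
B: uncertainty for the 83.5% and 16.5% quantiles
C: uncertainty for the 97.5% and 2.5% quantiles
D: uncertainty for the 99.5% and 0.5% quantiles
Quantile   Value
0.5%       50.0% - D
2.5%       50.0% - C
16.5%      50.0% - B
25.0%      50.0% - A
50.0%      Accuracy final submission
75.0%      50.0% + A
83.5%      50.0% + B
97.5%      50.0% + C
99.5%      50.0% + D
Among the sorted errors, the one corresponding to the 25% and 75% quantiles (the 42nd error) is subtracted from and added to the 50% value to give the 25% and 75% quantiles. The other quantiles are derived in the same way.
Note: these errors are the uncertainty of this model.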
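The procedure above can be sketched as follows. The central coverage levels (50%, 67%, 95%, 99%) follow from the nine quantile levels, but the rule for picking a position in the sorted errors (rounding the rank up) is an assumption made for this example. It reproduces the 42nd of 84 errors for A, and is not guaranteed to match the other positions marked in the figure. The function names are hypothetical.

```python
# Central coverage, in percent, behind each offset:
# A -> 25%/75%, B -> 16.5%/83.5%, C -> 2.5%/97.5%, D -> 0.5%/99.5%.
COVERAGE_PERCENT = {"A": 50, "B": 67, "C": 95, "D": 99}


def error_offsets(abs_errors, coverage=COVERAGE_PERCENT):
    """Pick one absolute validation error per coverage level."""
    ordered = sorted(abs_errors)
    n = len(ordered)
    offsets = {}
    for name, percent in coverage.items():
        # Assumed rank rule: ceil(percent * n / 100), at least 1 (1-based).
        rank = max(1, (percent * n + 99) // 100)
        offsets[name] = ordered[rank - 1]
    return offsets


def quantile_forecasts(point_forecast, offsets):
    """Spread the offsets around the point forecast (the 50% quantile)."""
    return {
        0.005: point_forecast - offsets["D"],
        0.025: point_forecast - offsets["C"],
        0.165: point_forecast - offsets["B"],
        0.25: point_forecast - offsets["A"],
        0.5: point_forecast,
        0.75: point_forecast + offsets["A"],
        0.835: point_forecast + offsets["B"],
        0.975: point_forecast + offsets["C"],
        0.995: point_forecast + offsets["D"],
    }
```

With 84 errors the assumed rank rule selects positions 42, 57, 80, and 84. Grouping the errors by forecast day before calling `error_offsets` would give separate offsets per day.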
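7. Model details
The validation here used 28 × 3 days, but to guard against outliers I feel that double that, 28 × 6, would have been better. It simply takes too long, so making the Accuracy model lighter while keeping its accuracy is the room for improvement in several senses (future work). For this competition, the predictions on the validation periods were made with slightly coarse settings, earlystop=100 and lr=0.08 (a countermeasure against overfitting and underfitting risk on the Accuracy side).
(3) Uncertainty per forecast day
The size of the uncertainty differs by forecast day. For Accuracy, a model was built for each day (28 models), and the day-1 model is more accurate than the day-28 model. For the uncertainty it is therefore preferable to compute the errors (previous page) for each of the 28 models and spread them accordingly.
Note: A, B, C, and D in the table have the same meaning as on the previous page.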
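8. Reflections and open issues
Training time: with better validation, I think training time could have been cut substantially with almost no loss in score.
• Reduce the features and drop the per-store_id models.
• Tune the learning rate and the number of iterations, etc.
Importance of validation: early in the competition I paid too much attention to the Public LB score and spent time on things that, in hindsight, I should not have done. It was good to learn the hard way how much validation matters.
Handling large data: early on in particular, I spent a lot of effort working out how to stay within memory limits. Before machine learning itself, there is a lot I still need to study, such as distributed processing and data types.
Understanding the evaluation metric: I learned the hard way that a deep understanding of the evaluation metric has to come first. At the start I went ahead with only a vague understanding of it and did many things I should not have. The model that should be built differs greatly depending on the evaluation metric.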