superChing
January 18, 2016

Transcript

  1. Rossmann Store
    Kaggle Challenge
    My experience finishing 55th/3303, and what the top teams did.


  2. About Kaggle
    Hosts: companies, research institutions, non-profits


  3. Why Kaggle
    PROS and CONS


  4. Cons:
    Not practical: there are no resource constraints at all, whether memory, CPU, time, maintainability, robustness, latency, ...
    Pros:
    Learning.
    The Kaggle forums are great fun; you never get to meet people like this inside your own company.
    You find out where you stand (and let others know where you are!)


  5. (image-only slide)

  6. Rossmann Store


  7. Problem & Data


  8. Problem
    Forecast 6 weeks of daily sales for 856 stores.
    Data
    Training data: 1,115 stores with nearly 3 years of time series,
    plus features of each store (cross-sectional data).


  9. The data fields
    ● Time Series each Store
    ○ Date
    ○ Sales - The turnover for any given day (variable to be
    predicted)
    ○ Customers - The number of customers on a given day
    ○ Open - An indicator for whether the store was open:
    ○ Promotion
    i. Promo - Indicates whether a store is running a promo
    on that day
    ii. Promo2 - Continuing and consecutive promotion for
    some stores:
    iii. Promo2Since [Year/Week] - Describes the year and
    calendar week when the store started participating in
    Promo2
    iv. PromoInterval - Describes the consecutive intervals
    Promo2 is started, naming the months the promotion is
    started anew.
    ● Holiday Information
    ○ StateHoliday - Indicates a state holiday
    ○ SchoolHoliday - Indicates if the (Store, Date) was affected by
    the closure of public schools
    ● Store Information
    ○ StoreID - Unique Id for each store
    ○ Assortment - Describes an assortment level:
    ○ StoreType - Differentiates between 4 different store
    models
    ● Store Competitor information
    ○ CompetitionDistance - Distance in meters to the
    nearest competitor store
    ○ CompetitionOpenSince [Month/Year] -
    Approximate year and month of the time the nearest
    competitor was opened
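The fields above are split across a per-(Store, Date) training table and a per-Store table, so a typical first step is to join them on the store id. A minimal sketch with tiny inline stand-ins for the real CSVs (values here are made up; the competition's id column is `Store`, not `StoreID`):

```python
import pandas as pd

# Inline stand-ins for train.csv (one row per Store and Date)
# and store.csv (one row per Store); the numbers are invented.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2015-07-01", "2015-07-02", "2015-07-01"],
    "Sales": [5263, 5020, 6064],
    "Open": [1, 1, 1],
})
store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "Assortment": ["a", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})

# Attach the static store attributes to every daily row.
df = train.merge(store, on="Store", how="left")
```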


  10. (image-only slide)

  11. Feature and Validation


  12. Features
    “ the model should have features on
    1) past information
    2) temporal information
    3) trends.”
    -- winner Gert


  13. The features: past information
    ● data of
    ○ yesterday
    ○ last week
    ○ ...
    ● measures of centrality and variability over
    ○ the last quarter,
    ○ the last half year,
    ○ the last year
    ○ ...
    Recursive prediction:
    it gave me much worse results, but the 30th-place finisher said it worked.
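The lag and rolling-window features above can be sketched in a few lines of pandas; the column names are my own, not from the competition code:

```python
import numpy as np
import pandas as pd

# One store's daily series (synthetic: Sales simply counts up).
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=120, freq="D"),
    "Sales": np.arange(120, dtype=float),
}).sort_values("Date")

df["Sales_lag1"] = df["Sales"].shift(1)               # yesterday
df["Sales_lag7"] = df["Sales"].shift(7)               # same weekday last week
df["Sales_mean_90"] = df["Sales"].rolling(90).mean()  # centrality, last quarter
df["Sales_std_90"] = df["Sales"].rolling(90).std()    # variability, last quarter
```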


  14. The features: temporal information
    “The day counter indicates either the number of days before, after or within the event.” -- Gert
    What I extracted:
    ● time until the next event, and time since the last event;
    ● how long the current state has lasted, and how long until it ends;
    ● the total length of the state.


  15. The features: trends
    I used moving averages and moving standard deviations at various scales over the past data.
    But I missed one key thing, the current trend:
    ● “I fit a store specific linear model on the day number - to extrapolate the trend into the six week
    period.” -- winner Gert
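A minimal sketch of the idea Gert describes, on synthetic noise-free sales for one store (fit a line on the day number, then extend it over the six-week test window):

```python
import numpy as np

days = np.arange(100)
sales = 50.0 + 0.3 * days                 # hypothetical upward trend

# Store-specific linear model on the day number.
slope, intercept = np.polyfit(days, sales, 1)

# Extrapolate into the six-week (42-day) forecast window.
future_days = np.arange(100, 100 + 42)
trend_forecast = intercept + slope * future_days
```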


  16. The features: some of the more creative features
    Nearby stores
    Mark stores as one group if they are
    1. in the same state,
    2. have competitors that opened in the same year and month,
    3. and the sum of distances to those competitors is below x.
    Pay day
    7th: “As for other interesting tricks, one was "payday loans" feature engineering. We noticed that if the
    {28,29,30} day of the month is Monday, OR the {28} day is either Thursday or Friday - there was an evident increase in
    Sales. So one could reason that people are taking short-term loans before their paydays to buy stuff.”
    Renovation
    From inspecting the data, I suspect the competition data does not flag all promotional events,
    e.g. clearance and grand-opening sales around store renovations.
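The payday rule quoted above is easy to turn into a binary feature; a sketch with my own function name, following the quoted rule literally:

```python
import datetime

def payday_flag(date):
    """1 if the date matches the 7th place's payday pattern, else 0."""
    wd = date.weekday()                      # Monday == 0
    if date.day in (28, 29, 30) and wd == 0:
        return 1
    if date.day == 28 and wd in (3, 4):      # Thursday or Friday
        return 1
    return 0
```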


  17. The features: variance-stabilizing transformation
    Log transform: a math wizard on the forum proved it is optimal only under assumptions
    that do not hold for this data, and in my own experiments the log transform actually made results worse.
    Another forum math wizard showed you should multiply predictions by a magic 0.985 to better
    approximate the true sales; many people jumped hundreds of ranks just by multiplying by this number.
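The intuition behind a constant shrink factor: the metric was RMSPE, and under multiplicative prediction error the RMSPE-optimal prediction sits slightly below the conditional mean. A synthetic sketch (0.985 is the factor reported on the forum, not something derived here):

```python
import numpy as np

rng = np.random.default_rng(0)
sales = rng.uniform(1000.0, 9000.0, size=10_000)
pred = sales * rng.lognormal(0.0, 0.1, size=10_000)   # multiplicative error

def rmspe(y, yhat):
    # Root mean square percentage error, the competition metric.
    return float(np.sqrt(np.mean(((y - yhat) / y) ** 2)))

base = rmspe(sales, pred)
scaled = rmspe(sales, 0.985 * pred)   # shrinking helps under this noise model
```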


  18. The features: external data
    Some competitions allow external data, but require that you publish it on the forum.
    On the forum you could find:
    ● German States derived from StateHoliday
    ● detailed holidays
    ● Weather
    ● Google Trends
    ● sporting events
    ● consumer price index
    ● stock market
    ● unemployment rate
    ● and more……
    So the data for this competition is effectively unlimited: send out your scraper army.
    The problem is that the stores are anonymized, so a lot of the data cannot be joined.
    Fortunately someone reverse-engineered each store's German state, so plenty of external data can still be linked.


  19. 1. CompetitionDistance
    2. MeanSales
    3. MaxSales
    4. SalesByWeekDay_EMA21_lag49
    5. Open_EMA14_lag-7
    6. CustomersByWeekDay_Diff_lag49
    7. MaxCustomers
    8. Open_EMA14_lag7
    9. CloudCover
    10. CustomersByWeekDay_Diff_lag147
    11. Assortment
    12. maxS/C
    13. StateSpan_Open
    14. TimeToEnd_Open
    15. SalesByWeekDay_Diff_lag49
    16. Store
    17. YearDay
    18. CustomersByWeekDay_EMA21_lag49
    19. MonthDay
    20. TimeSincePromo2Open
    21. CloudCover_lag-1
    22. TimeSinceCompetitionOpen
    23. MeanCustomers
    24. CustomersByWeekDay_Diff_lag51
    25. CustomersByWeekDay_Diff_lag50
    26. TimeFromStart_Open
    27. Promo
    The features
    RFE selected 100 out of 420 features; the top-ranked ones are listed above.
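Recursive feature elimination, as used for the 420 → 100 selection, can be sketched without any library (standing in for e.g. sklearn's `RFE`): repeatedly fit a model and drop the feature with the least importance, here the smallest least-squares coefficient magnitude on synthetic data where only the first two features matter:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.01 * rng.normal(size=500)

def rfe(X, y, n_keep):
    """Drop the weakest feature per round until n_keep remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        keep.pop(int(np.argmin(np.abs(coef))))
    return keep

selected = rfe(X, y, n_keep=2)
```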


  20. CrossValidation


  21. CrossValidation
    (diagram: walk-forward validation along the time axis; each fold trains on an earlier window
    and validates on the period immediately after it)
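The staircase of train/valid blocks in the diagram corresponds to a walk-forward splitter; a minimal sketch over day indices (function name and parameters are my own):

```python
def walk_forward_splits(n_days, valid_size, n_folds):
    """Yield (train_idx, valid_idx) pairs: each fold validates on a
    block of days and trains only on the days before that block."""
    for k in range(n_folds):
        valid_end = n_days - k * valid_size
        valid_start = valid_end - valid_size
        yield list(range(valid_start)), list(range(valid_start, valid_end))

folds = list(walk_forward_splits(n_days=100, valid_size=10, n_folds=3))
```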


  22. (chart: public vs. private leaderboard score as the approach evolved, from a mean baseline
    and open scripts through ARIMA, RF, and GBT, adding many features, and changing the CV method)


  23. (pipeline diagram: Data → Impute → Feature Engineering → Feature Selection,
    feeding several ARIMAs, an RF, and several GBs, whose predictions are averaged)
    ● a GB with a lucky seed from the script board
    ● another 2 GBs with different seeds and features


  24. Ensemble


  25. Some findings
    ● When averaging, a better-scoring model is not necessarily the better ensemble member:
    ○ my RF scored better than the luckyGB, but
    ○ adding the RF made the ensemble worse,
    ○ while adding the luckyGB made it better.
    ● Features are still what matters: the winner said his best single model alone could have placed top 3.
    ● The top models can be quite lean: 4th place used a weighted average of 6 GBTs and a special
    post-processing; their best single model had only 22 hand-picked features and could have taken 5th place:
    c("WeekOfMonth","month","week","day","Store","Promo","DayOfWeek","year","
    SchoolHoliday","CompDist0","CompOpenSince0","Promo2Since0","MeanLogSalesByStore","
    MeanLogSalesByState","MeanLogSalesByStateHoliday","MeanLogSalesByAssortment","
    MeanLogSalesByPromoInterval","MeanLogSalesByStorePromoDOW","
    MeanLogCustByStorePromoDOW","MeanLogSalesBySchoolHoliday2Type","
    Max_TemperatureC","SONNENSCHEINDAUER")
    Ensemble


  26. One weakness of trees: extrapolation
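Why trees cannot extrapolate a trend: a tree predicts a constant per leaf, so outside the training range it keeps returning the nearest leaf's value. A one-point-per-leaf toy "tree" makes this visible:

```python
import numpy as np

train_x = np.arange(10, dtype=float)
train_y = 2.0 * train_x                 # a clean linear trend

def tree_predict(x):
    # Piecewise constant: return the value of the closest training point,
    # mimicking a fully-grown regression tree with one leaf per sample.
    return float(train_y[np.argmin(np.abs(train_x - x))])

in_range = tree_predict(4.0)    # recovers the trend
beyond = tree_predict(15.0)     # trend lost: stuck at the last leaf's value
```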


  27. Ensemble
    The 1st place's final ensemble consists of the harmonic mean of:
    ● 2 * all features (different seeds) and 2 * all features (using month-ahead features)
    ● 1 * sales model
    ● 1 * customer model
    ● REPEAT all six models for the months May to September
    ● For September, all of the 2*6 models used month-ahead features
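Combining predictions by harmonic mean is a one-liner; a sketch over a few hypothetical per-model prediction vectors (3 models, 2 test days):

```python
import numpy as np

preds = np.array([
    [8000.0, 5000.0],
    [9000.0, 4000.0],
    [8500.0, 4500.0],
])

# Harmonic mean across models for each test day: n over the sum of
# reciprocals. It is always <= the arithmetic mean, so the ensemble
# is pulled slightly toward the lower predictions.
harmonic = preds.shape[0] / np.sum(1.0 / preds, axis=0)
```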


  28. (diagram of the winner's pipeline: the features feed a Sales GB, a Customer GB, and
    1,115 per-store linear models; their predictions are added together and a multiplexer
    routes them into 4 main GBs, which leave out the recent features. On the right, the
    same Sales GB, Customer GB, and 4 main GBs are trained only on the data from the same
    months as the test period, while the left side uses all the data.)


  29. Conclusion
    You can reach the top 10% by doing feature engineering well.
    Programming skill still matters.


  30. Outlier, or typo?


  31. (image-only slide)