Slide 1

Slide 1 text

Rossmann Store Kaggle Challenge: my experience finishing 55th of 3,303, and what the top teams did.

Slide 2

Slide 2 text

About Kaggle: it hosts competitions for companies, research groups, and non-profits.

Slide 3

Slide 3 text

Why Kaggle: pros and cons

Slide 4

Slide 4 text

Cons: not practical. There are no resource constraints of any kind: memory, CPU, time, maintainability, robustness, latency, and so on.
Pros: learning. The Kaggle forums are great fun, and you meet people you would never meet at your own company. You learn where you stand (and let others know where you are!).

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Rossmann Store

Slide 7

Slide 7 text

Problem & Data

Slide 8

Slide 8 text

Problem: forecast 6 weeks of daily sales for 856 stores.
Data: 1,115 stores with nearly 3 years of time series data; time series plus cross-sectional features for each store.

Slide 9

Slide 9 text

The data fields
● Time series per store
  ○ Date
  ○ Sales - the turnover for a given day (the variable to be predicted)
  ○ Customers - the number of customers on a given day
  ○ Open - an indicator for whether the store was open
  ○ Promotion
    i. Promo - indicates whether a store is running a promo on that day
    ii. Promo2 - a continuing and consecutive promotion for some stores
    iii. Promo2Since [Year/Week] - the year and calendar week when the store started participating in Promo2
    iv. PromoInterval - the consecutive intervals in which Promo2 starts, naming the months the promotion starts anew
● Holiday information
  ○ StateHoliday - indicates a state holiday
  ○ SchoolHoliday - indicates whether the (Store, Date) was affected by the closure of public schools
● Store information
  ○ StoreID - a unique id for each store
  ○ Assortment - describes an assortment level
  ○ StoreType - differentiates between 4 different store models
● Store competitor information
  ○ CompetitionDistance - distance in meters to the nearest competitor store
  ○ CompetitionOpenSince [Month/Year] - approximate year and month when the nearest competitor opened

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Feature and Validation

Slide 12

Slide 12 text

Features
"The model should have features on 1) past information, 2) temporal information, 3) trends." -- winner Gert

Slide 13

Slide 13 text

The features: past information
● data from
  ○ yesterday
  ○ last week
  ○ ...
● measures of centrality and variability over
  ○ the last quarter
  ○ the last half year
  ○ the last year
  ○ ...
Recursive prediction gave me much worse results, but the 30th-place finisher said it worked.
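The lag and rolling windows above can be sketched in a few lines of pandas. This is a minimal illustration on a toy single-store frame; all column names and the 4-day window are made up for the example, and real data would group over all 1,115 stores the same way.

```python
import pandas as pd

# Toy single-store frame; real data would have 1,115 stores.
df = pd.DataFrame({
    "Store": [1] * 10,
    "Sales": [10, 12, 11, 13, 15, 14, 16, 18, 17, 19],
})
g = df.groupby("Store")["Sales"]

df["Sales_lag1"] = g.shift(1)   # yesterday's sales
df["Sales_lag7"] = g.shift(7)   # sales one week ago
# centrality and variability over a trailing window (a toy 4-day scale here;
# the slide used last quarter / half year / year)
df["Sales_mean4"] = g.transform(lambda s: s.shift(1).rolling(4).mean())
df["Sales_std4"] = g.transform(lambda s: s.shift(1).rolling(4).std())
```

The `shift(1)` inside the rolling window keeps the feature strictly backward-looking, so no information from the day being predicted leaks in.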

Slide 14

Slide 14 text

The features: temporal information
"The day counter indicates either the number of days before, after or within the event." -- Gert
What I extracted:
● how long until the next event, and how long since the last event;
● how long the current state has lasted, and how long until it ends;
● the total length of the state.
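The day counters above can be sketched as follows, assuming a toy calendar with two event days (e.g. promo or holiday dates); the sentinel value -1 for "no such event in range" is my own convention for the example.

```python
import numpy as np
import pandas as pd

# Toy calendar: 8 days with events on the 3rd and 7th day.
dates = pd.date_range("2015-06-01", periods=8)
is_event = np.array([0, 0, 1, 0, 0, 0, 1, 0], dtype=bool)
event_days = dates[is_event]

def days_since_last(d):
    """Days elapsed since the most recent event (0 on the event day)."""
    past = event_days[event_days <= d]
    return (d - past.max()).days if len(past) else -1

def days_to_next(d):
    """Days remaining until the next event (0 on the event day)."""
    future = event_days[event_days >= d]
    return (future.min() - d).days if len(future) else -1

since = [days_since_last(d) for d in dates]
to_next = [days_to_next(d) for d in dates]
```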

Slide 15

Slide 15 text

The features: trends
I used moving averages and moving standard deviations at various scales over past data, but I missed one key piece: the current trend.
● "I fit a store specific linear model on the day number - to extrapolate the trend into the six week period." -- winner Gert
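Gert's store-specific trend model can be sketched like this, on made-up data for a single store with a steady upward trend:

```python
import numpy as np

# One store's training days and sales (toy data with a linear trend).
day = np.arange(30, dtype=float)
sales = 100 + 2.0 * day

# Fit a linear model on the day number...
slope, intercept = np.polyfit(day, sales, 1)

# ...then extrapolate it over the six-week (42-day) test horizon.
future_days = np.arange(30, 30 + 42)
trend = intercept + slope * future_days
```

The extrapolated `trend` values then serve as a feature, giving tree models a handle on out-of-range growth they cannot learn on their own.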

Slide 16

Slide 16 text

The features: some more creative features
Neighboring stores: mark stores as one group if they 1. are in the same state, 2. have competitors that opened in the same year and month, and 3. have a total competitor distance below x.
Payday, from the 7th-place team: "As for other interesting tricks, one was 'payday loans' feature engineering. We noticed that if day {28,29,30} of the month is a Monday, OR day {28} is either Thursday or Friday, there was an evident increase in sales. So one could reason that people take short-term loans before their paydays to buy stuff."
Renovation: after inspecting the data, I guessed that the competition data does not label all promotional events, such as clearance and grand-opening sales.
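The payday rule quoted above translates almost directly into code. The function name and example month are my own for illustration:

```python
import pandas as pd

def payday_effect(d):
    """Flag the 7th-place team's payday pattern for a single date."""
    if d.day in (28, 29, 30) and d.weekday() == 0:    # Monday
        return True
    if d.day == 28 and d.weekday() in (3, 4):         # Thursday or Friday
        return True
    return False

# Example: flag every day of an arbitrary month.
flags = {d: payday_effect(d) for d in pd.date_range("2015-09-01", "2015-09-30")}
```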

Slide 17

Slide 17 text

The features: variance-stabilizing transformation
Log transform: a math wizard on the forum proved it is only justified under certain assumptions, and those assumptions do not hold for this data; in my own experiments the log transform actually made results worse. Another forum math wizard proved you should multiply predictions by a magic 0.985 to better approximate the true sales; many people jumped several hundred places just by applying that factor.
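The competition metric was RMSPE, which helps explain why a small shrink factor can pay off. A toy illustration, under the made-up assumption that the model systematically over-predicts by 3%:

```python
import numpy as np

def rmspe(y, p):
    """Root mean square percentage error, the competition metric."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return np.sqrt(np.mean(((y - p) / y) ** 2))

y = np.array([100.0, 200.0, 150.0])   # true sales (made up)
p = y * 1.03                          # a model that over-predicts by 3%

# Shrinking the predictions moves them back toward the truth,
# so rmspe(y, 0.985 * p) < rmspe(y, p) in this scenario.
```

This only shows the mechanism, not a proof that 0.985 is optimal for the real data; that value came from the forum derivation.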

Slide 18

Slide 18 text

The features: external data
Some competitions allow external data, with the rule that anything you use must be published on the forum. On the forum you could find:
● German states derived from StateHoliday
● detailed holidays
● weather
● Google Trends
● sporting events
● consumer price index
● stock market
● unemployment rate
● and more......
So the data for this competition is effectively unlimited: your scraper army can move out. The problem is that the stores are anonymized, so much of the data cannot be joined. Fortunately someone recovered each store's state, so plenty of external data can still be linked.

Slide 19

Slide 19 text

The features: RFE reduced 420 features to 100. The top-ranked features:
1. CompetitionDistance
2. MeanSales
3. MaxSales
4. SalesByWeekDay_EMA21_lag49
5. Open_EMA14_lag-7
6. CustomersByWeekDay_Diff_lag49
7. MaxCustomers
8. Open_EMA14_lag7
9. CloudCover
10. CustomersByWeekDay_Diff_lag147
11. Assortment
12. maxS/C
13. StateSpan_Open
14. TimeToEnd_Open
15. SalesByWeekDay_Diff_lag49
16. Store
17. YearDay
18. CustomersByWeekDay_EMA21_lag49
19. MonthDay
20. TimeSincePromo2Open
21. CloudCover_lag-1
22. TimeSinceCompetitionOpen
23. MeanCustomers
24. CustomersByWeekDay_Diff_lag51
25. CustomersByWeekDay_Diff_lag50
26. TimeFromStart_Open
27. Promo
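Recursive feature elimination as used above can be sketched with scikit-learn at toy scale (20 features down to 5 here, instead of 420 down to 100; the synthetic dataset and the tree estimator are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: 20 features, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature
# (by the tree's feature importances) until 5 remain.
selector = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=5)
selector.fit(X, y)
kept = selector.support_   # boolean mask over the 20 features
```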

Slide 20

Slide 20 text

CrossValidation

Slide 21

Slide 21 text

CrossValidation
[Diagram: time-ordered validation splits along the time axis; each fold trains on earlier data ("train") and validates on the block that follows it ("valid").]
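The diagram's time-ordered splits can be sketched as follows; the dates, fold count, and expanding-window choice are assumptions for illustration, with the 42-day horizon matching the test period:

```python
import pandas as pd

# Training calendar (roughly the competition's span).
dates = pd.date_range("2013-01-01", "2015-07-31", freq="D")
horizon = 42   # six weeks, like the test period

# Four folds: each validates on a six-week block and trains
# only on the data strictly before it.
folds = []
for k in range(4):
    valid_end = len(dates) - k * horizon
    valid_start = valid_end - horizon
    folds.append((dates[:valid_start], dates[valid_start:valid_end]))
```

Unlike random K-fold, no fold ever trains on data from its validation block's future, mirroring the real forecasting task.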

Slide 22

Slide 22 text

[Chart: private and public leaderboard scores of my models (ARIMA, RF, GBT, and their mean) against the open-script baselines, annotated where I changed the CV method and added many features.]

Slide 23

Slide 23 text

[Pipeline diagram: Data → Impute → Feature Engineering → Feature Selection → models (4 ARIMAs, an RF, and 3 GBs) → average.]
● a GB with a lucky seed from the script board.
● another 2 GBs with different seeds and features.

Slide 24

Slide 24 text

Ensemble

Slide 25

Slide 25 text

Some findings
● When averaging, a model with a better score is not necessarily more useful than one with a worse score.
  ○ My RF scored better than the lucky GB, yet
  ○ adding the RF made the ensemble worse, and
  ○ adding the lucky GB made it better.
● Features are still what matters: the winner said his best single model alone could have taken a top-3 place.
● The top models can be very lean. The 4th place used a weighted average of 6 GBTs and a special post-processing step; their best single model used only 22 hand-picked features and could have taken 5th place:
c("WeekOfMonth","month","week","day","Store","Promo","DayOfWeek","year","SchoolHoliday","CompDist0","CompOpenSince0","Promo2Since0","MeanLogSalesByStore","MeanLogSalesByState","MeanLogSalesByStateHoliday","MeanLogSalesByAssortment","MeanLogSalesByPromoInterval","MeanLogSalesByStorePromoDOW","MeanLogCustByStorePromoDOW","MeanLogSalesBySchoolHoliday2Type","Max_TemperatureC","SONNENSCHEINDAUER")

Slide 26

Slide 26 text

One weakness of trees: extrapolation
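A toy demonstration of that weakness: trained on an upward trend, a tree predicts a constant outside the range it has seen, while the true trend keeps rising.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Perfectly linear trend: y = 2x on days 0..99.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
outside = tree.predict([[200.0]])[0]   # far beyond the training range

# The true trend says 400, but a tree's prediction is always some
# average of training targets, so it can never exceed y.max().
```

This is exactly why the winner fed an extrapolated linear-trend feature into the model instead of hoping the trees would extend the trend themselves.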

Slide 27

Slide 27 text

Ensemble
The winner's final ensemble consists of the harmonic mean of:
● 2 × all features (different seeds) and 2 × all features (using month-ahead features)
● 1 × sales model
● 1 × customer model
● all six models repeated for each month from May to September
● for September, all 2 × 6 models used month-ahead features
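Combining by harmonic mean is a one-liner; unlike the arithmetic mean it leans toward the smaller prediction (the numbers below are made up, and the intuition that mild under-prediction is safer under RMSPE comes from the forum discussion, not from the winner's write-up):

```python
import numpy as np

# Two models' predictions for two (store, day) pairs (made-up values).
preds = np.array([
    [100.0, 210.0],   # model A
    [120.0, 190.0],   # model B
])

# Harmonic mean across models: n divided by the sum of reciprocals.
harmonic = preds.shape[0] / (1.0 / preds).sum(axis=0)
```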

Slide 28

Slide 28 text

[Winner's architecture diagram: features feed ×1,115 per-store linear models, a Sale GB, a Customer GB, and ×4 main GBs. One branch trains on all data but leaves out recent features; a parallel branch of the same structure predicts using only the same months as the test set. A multiplexer combines the two branches' predictions.]

Slide 29

Slide 29 text

Conclusion: you can reach the top 10% by doing feature engineering well. Programming skill still matters.

Slide 30

Slide 30 text

Outlier, or typo?

Slide 31

Slide 31 text

No content