

January 18, 2016





  1. Downside: not practical. There are no resource constraints, whether on memory, CPU, time, maintainability, robustness, latency, ...

    Upside: learning. The Kaggle discussion boards are great fun, full of people you would never meet at your own company, and you find out where you stand (and let others know where you are!).
  2. Data: a time series for each store (cross-sectional data).

    Problem: forecast 6 weeks of daily sales for 856 stores. Training data: 1,115 stores with nearly 3 years of daily time series.
  3. The data fields

    • Time series for each store
      ◦ Date
      ◦ Sales - the turnover for any given day (the variable to be predicted)
      ◦ Customers - the number of customers on a given day
      ◦ Open - an indicator for whether the store was open
      ◦ Promotion
        i. Promo - indicates whether a store is running a promo on that day
        ii. Promo2 - a continuing and consecutive promotion for some stores
        iii. Promo2Since [Year/Week] - the year and calendar week when the store started participating in Promo2
        iv. PromoInterval - the consecutive intervals in which Promo2 is started, naming the months the promotion is started anew
    • Holiday information
      ◦ StateHoliday - indicates a state holiday
      ◦ SchoolHoliday - indicates whether the (Store, Date) was affected by the closure of public schools
    • Store information
      ◦ StoreID - unique id for each store
      ◦ Assortment - describes an assortment level
      ◦ StoreType - differentiates between 4 different store models
    • Store competitor information
      ◦ CompetitionDistance - distance in meters to the nearest competitor store
      ◦ CompetitionOpenSince [Month/Year] - approximate year and month when the nearest competitor store opened
  4. Features

    "The model should have features on 1) past information, 2) temporal information, 3) trends." -- winner Gert
  5. The features: past information

    • data of
      ◦ yesterday
      ◦ last week
      ◦ ...
    • measures of centrality and measures of variability over
      ◦ the last quarter
      ◦ the last half year
      ◦ the last year
      ◦ ...
    Recursive prediction gave me much worse results, but the 30th-place finisher said it worked for them.
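A minimal pandas sketch of these past-information features, on toy data with hypothetical column names (in the real data the transforms would be applied per store, e.g. via `groupby("Store")`):

```python
import pandas as pd

# Toy daily sales for a single store (hypothetical values).
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": [10, 12, 11, 13, 15, 14, 16, 18, 17, 19],
})

# Past information: yesterday's and last week's sales.
df["Sales_lag1"] = df["Sales"].shift(1)
df["Sales_lag7"] = df["Sales"].shift(7)

# Centrality and variability over a trailing window; a 4-day window is
# used purely for illustration, the slides use quarter / half year / year.
# shift(1) keeps the window strictly in the past, avoiding target leakage.
df["Sales_mean4"] = df["Sales"].shift(1).rolling(4).mean()
df["Sales_std4"] = df["Sales"].shift(1).rolling(4).std()
```

The `shift(1)` before `rolling` matters: without it, the window would include the current day's sales, which is exactly the value being predicted.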
  6. The features: temporal information

    "The day counter indicates either the number of days before, after or within the event." -- Gert
    What I extracted:
    • days until the next event, and days since the last event;
    • time elapsed in the current state, and time remaining until it ends;
    • the length of the current state.
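A sketch of such day counters, assuming events are marked by a 0/1 flag per day (toy data, hypothetical function names):

```python
import numpy as np

# Toy promo flag per day (hypothetical); 1 marks an "event" day.
promo = [0, 0, 1, 1, 0, 0, 0, 1]

def days_since_event(flags):
    """Days since the most recent event day (0 on an event day itself,
    NaN before the first event is ever seen)."""
    out, last = [], None
    for i, f in enumerate(flags):
        if f:
            last = i
        out.append(np.nan if last is None else i - last)
    return out

def days_until_event(flags):
    """Days until the next event day (0 on an event day itself),
    computed by running the backward counter on the reversed series."""
    return days_since_event(flags[::-1])[::-1]

since = days_since_event(promo)   # [nan, nan, 0, 0, 1, 2, 3, 0]
until = days_until_event(promo)   # [2, 1, 0, 0, 3, 2, 1, 0]
```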
  7. The features: trends

    I used moving averages and moving standard deviations at various scales over past data, but I missed one key point, the current trend:
    • "I fit a store specific linear model on the day number - to extrapolate the trend into the six week period." -- winner Gert
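A sketch of the quoted trick on synthetic data: fit a per-store linear model on the day number, then extrapolate it over the 6-week test window (all data here is toy, with a known drift of 0.3 sales per day):

```python
import numpy as np

# Synthetic single-store history: base level 50, upward drift 0.3/day, noise.
rng = np.random.default_rng(0)
day_number = np.arange(100)
sales = 50 + 0.3 * day_number + rng.normal(0, 1, size=100)

# Store-specific linear model on the day number.
slope, intercept = np.polyfit(day_number, sales, 1)

# Extrapolated trend feature for the next 42 days (6 weeks).
future_days = np.arange(100, 142)
trend = intercept + slope * future_days
```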
  8. The features: some of the more creative ones

    Neighboring stores: mark stores as one group if they are 1. in the same state, 2. have competitors that opened in the same year and month, and 3. have a total competitor distance below x.
    Payday, from the 7th-place finisher: "As for other interesting tricks, one was 'payday loans' feature engineering. We noticed that if day {28, 29, 30} of the month is a Monday, or day {28} is either a Thursday or Friday, there was an evident increase in sales. So one could reason that people are taking short-term loans before their paydays to buy stuff."
    Renovations: after inspecting the data, I suspect the competition data does not flag all the promotional events, such as clearance and grand-opening sales.
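The payday rule above can be encoded directly as a binary feature. A minimal sketch (hypothetical function name, my own reading of the quoted rule):

```python
import pandas as pd

def payday_flag(dates):
    """1 if the date matches the quoted 'payday loans' pattern:
    day 28-30 falling on a Monday, or day 28 falling on a Thursday
    or Friday; 0 otherwise."""
    dates = pd.to_datetime(pd.Series(dates))
    dow = dates.dt.dayofweek          # Monday=0 ... Sunday=6
    day = dates.dt.day
    monday_case = (dow == 0) & day.isin([28, 29, 30])
    thu_fri_case = (day == 28) & dow.isin([3, 4])
    return (monday_case | thu_fri_case).astype(int)

# 2015-09-28 is a Monday, 2015-08-28 is a Friday, 2015-09-15 is a Tuesday.
flags = payday_flag(["2015-09-28", "2015-08-28", "2015-09-15"])  # 1, 1, 0
```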
  9. The features: external data

    Some competitions allow external data, with the rule that anything you use must be published on the discussion board. On the board you could find:
    • German states derived from StateHoliday
    • detailed holidays
    • weather
    • Google Trends
    • sporting events
    • consumer price index
    • stock market
    • unemployment rate
    • and more...
    So the data for this competition is effectively unlimited; your army of scrapers can move out. The catch is that the stores are anonymised, which makes a lot of data unusable. Fortunately someone recovered each store's German state, so plenty of external data can still be joined.
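Once a store-to-state mapping exists, external data keyed by state and date can be joined with two merges. A minimal sketch with entirely hypothetical toy frames:

```python
import pandas as pd

# Hypothetical mapping recovered by forum users: store id -> German state.
store_state = pd.DataFrame({"Store": [1, 2], "State": ["BY", "HE"]})

# Hypothetical sales rows and external weather data keyed by (State, Date).
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2015-06-01", "2015-06-02", "2015-06-01"]),
})
weather = pd.DataFrame({
    "State": ["BY", "BY", "HE"],
    "Date": pd.to_datetime(["2015-06-01", "2015-06-02", "2015-06-01"]),
    "CloudCover": [3, 5, 1],
})

# Attach the state to each sales row, then join the weather on (State, Date).
merged = sales.merge(store_state, on="Store").merge(weather, on=["State", "Date"])
```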
  10. The features: 420 candidates reduced to 100 by RFE. Top-ranked features:

    1. CompetitionDistance
    2. MeanSales
    3. MaxSales
    4. SalesByWeekDay_EMA21_lag49
    5. Open_EMA14_lag-7
    6. CustomersByWeekDay_Diff_lag49
    7. MaxCustomers
    8. Open_EMA14_lag7
    9. CloudCover
    10. CustomersByWeekDay_Diff_lag147
    11. Assortment
    12. maxS/C
    13. StateSpan_Open
    14. TimeToEnd_Open
    15. SalesByWeekDay_Diff_lag49
    16. Store
    17. YearDay
    18. CustomersByWeekDay_EMA21_lag49
    19. MonthDay
    20. TimeSincePromo2Open
    21. CloudCover_lag-1
    22. TimeSinceCompetitionOpen
    23. MeanCustomers
    24. CustomersByWeekDay_Diff_lag51
    25. CustomersByWeekDay_Diff_lag50
    26. TimeFromStart_Open
    27. Promo
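Recursive feature elimination (RFE) of this kind can be sketched with scikit-learn; the numbers below are toy-scale (30 features down to 10) rather than the slide's 420 down to 100:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic regression data standing in for the engineered feature matrix.
X, y = make_regression(n_samples=200, n_features=30, n_informative=8,
                       random_state=0)

# RFE repeatedly fits the model, drops the `step` least important
# features, and refits until `n_features_to_select` remain.
selector = RFE(
    GradientBoostingRegressor(n_estimators=20, random_state=0),
    n_features_to_select=10,
    step=5,
)
selector.fit(X, y)

kept = selector.support_          # boolean mask of the retained features
```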
  11. Changing the CV method

    [Chart residue: leaderboard progression of public and private scores, comparing open scripts against my submissions: RF, GB, their mean, adding many features, ARIMA, RF, GBT.]
  12. Pipeline

    Data → Feature Engineering → Feature Selection → Impute → models (GB, several ARIMAs, RF, GB) → average of:
    • a GB with a lucky seed from the script board
    • another 2 GBs with different seeds and features
  13. Ensemble: some findings

    • When averaging, a model with a better score is not necessarily a better ensemble member than one with a worse score.
      ◦ My RF scored better than the lucky GB, yet
      ◦ adding the RF made the ensemble score worse, while
      ◦ adding the lucky GB made it better.
    • Features are still what matters most: the winner said his best single model alone could have taken a top-3 spot.
    • The top models can be quite lean. 4th place used a weighted average of 6 GBTs and special post-processing; the best single model had only 22 hand-picked features and could have taken 5th place:
      c("WeekOfMonth","month","week","day","Store","Promo","DayOfWeek","year","SchoolHoliday","CompDist0","CompOpenSince0","Promo2Since0","MeanLogSalesByStore","MeanLogSalesByState","MeanLogSalesByStateHoliday","MeanLogSalesByAssortment","MeanLogSalesByPromoInterval","MeanLogSalesByStorePromoDOW","MeanLogCustByStorePromoDOW","MeanLogSalesBySchoolHoliday2Type","Max_TemperatureC","SONNENSCHEINDAUER")
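The first finding is easy to reproduce on toy numbers: under RMSPE (the Rossmann metric), a model with a worse solo score but decorrelated errors can improve the blend more than a better-scoring but redundant one. All predictions below are hypothetical:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error, the Rossmann competition metric."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

y = np.full(4, 100.0)                     # true sales
a = y + np.array([10.0, 10, -10, -10])    # base model
b = y + np.array([8.0, 8, -8, -8])        # better solo score, same error pattern as a
c = y + np.array([-12.0, -12, 12, 12])    # worse solo score, opposite error pattern

solo_b, solo_c = rmspe(y, b), rmspe(y, c)   # 0.08 vs 0.12: b wins alone
blend_ab = rmspe(y, (a + b) / 2)            # 0.09: barely better than a alone
blend_ac = rmspe(y, (a + c) / 2)            # 0.01: c's opposite errors cancel a's
```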
  14. Ensemble

    1st place's final ensemble consists of the harmonic mean of:
    • 2 * all-features models (different seeds) and 2 * all-features models (using month-ahead features)
    • 1 * sales model
    • 1 * customer model
    • REPEAT all six models for the months May to September
    • For September, all of the 2*6 models used month-ahead features
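Combining predictions with a harmonic rather than arithmetic mean is a one-liner; a sketch on hypothetical model outputs:

```python
import numpy as np

# Hypothetical sales predictions from three models for two (store, day) cells.
preds = np.array([
    [5000.0, 6200.0],   # model 1
    [5200.0, 6000.0],   # model 2
    [4800.0, 6400.0],   # model 3
])

# Harmonic mean across models: n / sum of reciprocals.
harmonic = preds.shape[0] / (1.0 / preds).sum(axis=0)

# For positive values the harmonic mean never exceeds the arithmetic mean,
# so it leans the blend toward the lower predictions.
arithmetic = preds.mean(axis=0)
```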
  15. [Winner's architecture diagram: per-store linear models (x1115) plus the features feed a Sales GB and a Customer GB, whose outputs go through a multiplexer into 4 main GBs. The left branch leaves out recent features and predicts using only the same months as the test period; the right branch uses all the data with the same Sales GB, Customer GB, and 4 main GBs.]
  16. Conclusion

    You can reach the top 10% by doing feature engineering well. Programming skill still matters.