superChing
January 18, 2016

Transcript

  1. Rossmann Store
    Kaggle Challenge
    My experience finishing 55th/3303, and what the top teams did.


  2. About Kaggle
    Hosts: companies, research institutions, non-profits


  3. Why Kaggle
    PROS and CONS


  4. Cons:
    Not practical: there are no resource constraints at all, whether memory, CPU, time, maintainability, robustness, latency, ...
    Pros:
    Learning.
    The Kaggle forums are great fun; you never get to meet people like this inside your own company.
    You find out where you stand (and let others know where you are!)


  5. (image-only slide)

  6. Rossmann Store


  7. Problem & Data


  8. Problem
    Forecast 6 weeks of daily sales for 856 stores.
    Data
    Training data: 1,115 stores with nearly 3 years of time series,
    plus features of each store (cross-sectional data).


  9. The data fields
    ● Time Series each Store
    ○ Date
    ○ Sales - The turnover for any given day (variable to be
    predicted)
    ○ Customers - The number of customers on a given day
    ○ Open - An indicator for whether the store was open:
    ○ Promotion
    i. Promo - Indicates whether a store is running a promo
    on that day
    ii. Promo2 - Continuing and consecutive promotion for
    some stores:
    iii. Promo2Since [Year/Week] - Describes the year and
    calendar week when the store started participating in
    Promo2
    iv. PromoInterval - Describes the consecutive intervals
    Promo2 is started, naming the months the promotion is
    started anew.
    ● Holiday Information
    ○ StateHoliday - Indicates a state holiday
    ○ SchoolHoliday - Indicates if the (Store, Date) was affected by
    the closure of public schools
    ● Store Information
    ○ StoreID - Unique Id for each store
    ○ Assortment - Describes an assortment level:
    ○ StoreType - Differentiates between 4 different store
    models
    ● Store Competitor information
    ○ CompetitionDistance - Distance in meters to the
    nearest competitor store
    ○ CompetitionOpenSince [Month/Year] -
    Approximate year and month of the time the nearest
    competitor was opened
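The fields above are split across a per-(Store, Date) training table and a per-Store table, so a typical first step is to join them on the store id. A minimal sketch with tiny inline stand-ins for the real CSVs (values here are made up; the competition's id column is `Store`, not `StoreID`):

```python
import pandas as pd

# Inline stand-ins for train.csv (one row per Store and Date)
# and store.csv (one row per Store); the numbers are invented.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2015-07-01", "2015-07-02", "2015-07-01"],
    "Sales": [5263, 5020, 6064],
    "Open": [1, 1, 1],
})
store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "Assortment": ["a", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})

# Attach the static store attributes to every daily row.
df = train.merge(store, on="Store", how="left")
```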


  10. (image-only slide)

  11. Feature and Validation


  12. Features
    “ the model should have features on
    1) past information
    2) temporal information
    3) trends.”
    -- winner Gert


  13. The features: past information
    ● data of
    ○ yesterday
    ○ last week
    ○ ...
    ● measures of centrality and variability over
    ○ the last quarter,
    ○ the last half year,
    ○ the last year
    ○ ...
    Recursive prediction:
    it gave me much worse results, but the 30th-place finisher said it worked.
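The lag and rolling-window features above can be sketched in a few lines of pandas; the column names are my own, not from the competition code:

```python
import numpy as np
import pandas as pd

# One store's daily series (synthetic: Sales simply counts up).
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=120, freq="D"),
    "Sales": np.arange(120, dtype=float),
}).sort_values("Date")

df["Sales_lag1"] = df["Sales"].shift(1)               # yesterday
df["Sales_lag7"] = df["Sales"].shift(7)               # same weekday last week
df["Sales_mean_90"] = df["Sales"].rolling(90).mean()  # centrality, last quarter
df["Sales_std_90"] = df["Sales"].rolling(90).std()    # variability, last quarter
```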


  14. The features: temporal information
    “The day counter indicates either the number of days before, after or within the event.” -- Gert
    What I extracted:
    ● time until the next event, and time since the last event;
    ● how long the current state has lasted, and how long until it ends;
    ● the total length of the state.


  15. The features: trends
    I used moving averages and moving standard deviations at various scales over the past data.
    But I missed one key thing, the current trend:
    ● “I fit a store specific linear model on the day number - to extrapolate the trend into the six week
    period.” -- winner Gert
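A minimal sketch of the idea Gert describes, on synthetic noise-free sales for one store (fit a line on the day number, then extend it over the six-week test window):

```python
import numpy as np

days = np.arange(100)
sales = 50.0 + 0.3 * days                 # hypothetical upward trend

# Store-specific linear model on the day number.
slope, intercept = np.polyfit(days, sales, 1)

# Extrapolate into the six-week (42-day) forecast window.
future_days = np.arange(100, 100 + 42)
trend_forecast = intercept + slope * future_days
```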


  16. The features: some of the more creative features
    Nearby stores
    Mark stores as one group if they are
    1. in the same state,
    2. have competitors that opened in the same year and month,
    3. and the sum of distances to those competitors is below x.
    Pay day
    7th: “As for other interesting tricks, one was "payday loans" feature engineering. We noticed that if the
    {28,29,30} day of the month is Monday, OR the {28} day is either Thursday or Friday - there was an evident increase in
    Sales. So one could reason that people are taking short-term loans before their paydays to buy stuff.”
    Renovation
    From inspecting the data, I suspect the competition data does not flag all promotional events,
    e.g. clearance and grand-opening sales around store renovations.
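The payday rule quoted above is easy to turn into a binary feature; a sketch with my own function name, following the quoted rule literally:

```python
import datetime

def payday_flag(date):
    """1 if the date matches the 7th place's payday pattern, else 0."""
    wd = date.weekday()                      # Monday == 0
    if date.day in (28, 29, 30) and wd == 0:
        return 1
    if date.day == 28 and wd in (3, 4):      # Thursday or Friday
        return 1
    return 0
```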


  17. The features: variance-stabilizing transformation
    Log transform: a math wizard on the forum proved it is optimal only under assumptions
    that do not hold for this data, and in my own experiments the log transform actually made results worse.
    Another forum math wizard showed you should multiply predictions by a magic 0.985 to better
    approximate the true sales; many people jumped hundreds of ranks just by multiplying by this number.
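The intuition behind a constant shrink factor: the metric was RMSPE, and under multiplicative prediction error the RMSPE-optimal prediction sits slightly below the conditional mean. A synthetic sketch (0.985 is the factor reported on the forum, not something derived here):

```python
import numpy as np

rng = np.random.default_rng(0)
sales = rng.uniform(1000.0, 9000.0, size=10_000)
pred = sales * rng.lognormal(0.0, 0.1, size=10_000)   # multiplicative error

def rmspe(y, yhat):
    # Root mean square percentage error, the competition metric.
    return float(np.sqrt(np.mean(((y - yhat) / y) ** 2)))

base = rmspe(sales, pred)
scaled = rmspe(sales, 0.985 * pred)   # shrinking helps under this noise model
```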


  18. The features: external data
    Some competitions allow external data, but require that you publish it on the forum.
    On the forum you could find:
    ● German States derived from StateHoliday
    ● detailed holidays
    ● Weather
    ● Google Trends
    ● sporting events
    ● consumer price index
    ● stock market
    ● unemployment rate
    ● and more……
    So the data for this competition is effectively unlimited: send out your scraper army.
    The problem is that the stores are anonymized, so a lot of the data cannot be joined.
    Fortunately someone reverse-engineered each store's German state, so plenty of external data can still be linked.


  19. 1. CompetitionDistance
    2. MeanSales
    3. MaxSales
    4. SalesByWeekDay_EMA21_lag49
    5. Open_EMA14_lag-7
    6. CustomersByWeekDay_Diff_lag49
    7. MaxCustomers
    8. Open_EMA14_lag7
    9. CloudCover
    10. CustomersByWeekDay_Diff_lag147
    11. Assortment
    12. maxS/C
    13. StateSpan_Open
    14. TimeToEnd_Open
    15. SalesByWeekDay_Diff_lag49
    16. Store
    17. YearDay
    18. CustomersByWeekDay_EMA21_lag49
    19. MonthDay
    20. TimeSincePromo2Open
    21. CloudCover_lag-1
    22. TimeSinceCompetitionOpen
    23. MeanCustomers
    24. CustomersByWeekDay_Diff_lag51
    25. CustomersByWeekDay_Diff_lag50
    26. TimeFromStart_Open
    27. Promo
    The features
    RFE selected 100 out of 420 features; the top-ranked ones are listed above.
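Recursive feature elimination, as used for the 420 → 100 selection, can be sketched without any library (standing in for e.g. sklearn's `RFE`): repeatedly fit a model and drop the feature with the least importance, here the smallest least-squares coefficient magnitude on synthetic data where only the first two features matter:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.01 * rng.normal(size=500)

def rfe(X, y, n_keep):
    """Drop the weakest feature per round until n_keep remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        keep.pop(int(np.argmin(np.abs(coef))))
    return keep

selected = rfe(X, y, n_keep=2)
```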


  20. CrossValidation


  21. CrossValidation
    (diagram: walk-forward validation along the time axis; each fold trains on an earlier window
    and validates on the period immediately after it)
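The staircase of train/valid blocks in the diagram corresponds to a walk-forward splitter; a minimal sketch over day indices (function name and parameters are my own):

```python
def walk_forward_splits(n_days, valid_size, n_folds):
    """Yield (train_idx, valid_idx) pairs: each fold validates on a
    block of days and trains only on the days before that block."""
    for k in range(n_folds):
        valid_end = n_days - k * valid_size
        valid_start = valid_end - valid_size
        yield list(range(valid_start)), list(range(valid_start, valid_end))

folds = list(walk_forward_splits(n_days=100, valid_size=10, n_folds=3))
```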


  22. (chart: public vs. private leaderboard score as the approach evolved, from a mean baseline
    and open scripts through ARIMA, RF, and GBT, adding many features, and changing the CV method)


  23. (pipeline diagram: Data → Impute → Feature Engineering → Feature Selection,
    feeding several ARIMAs, an RF, and several GBs, whose predictions are averaged)
    ● a GB with a lucky seed from the script board
    ● another 2 GBs with different seeds and features


  24. Ensemble


  25. Some findings
    ● When averaging, a better-scoring model is not necessarily the better ensemble member:
    ○ my RF scored better than the luckyGB, but
    ○ adding the RF made the ensemble worse,
    ○ while adding the luckyGB made it better.
    ● Features are still what matters: the winner said his best single model alone could have placed top 3.
    ● The top models can be quite lean: 4th place used a weighted average of 6 GBTs and a special
    post-processing; their best single model had only 22 hand-picked features and could have taken 5th place:
    c("WeekOfMonth","month","week","day","Store","Promo","DayOfWeek","year","
    SchoolHoliday","CompDist0","CompOpenSince0","Promo2Since0","MeanLogSalesByStore","
    MeanLogSalesByState","MeanLogSalesByStateHoliday","MeanLogSalesByAssortment","
    MeanLogSalesByPromoInterval","MeanLogSalesByStorePromoDOW","
    MeanLogCustByStorePromoDOW","MeanLogSalesBySchoolHoliday2Type","
    Max_TemperatureC","SONNENSCHEINDAUER")
    Ensemble


  26. One weakness of trees: extrapolation
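Why trees cannot extrapolate a trend: a tree predicts a constant per leaf, so outside the training range it keeps returning the nearest leaf's value. A one-point-per-leaf toy "tree" makes this visible:

```python
import numpy as np

train_x = np.arange(10, dtype=float)
train_y = 2.0 * train_x                 # a clean linear trend

def tree_predict(x):
    # Piecewise constant: return the value of the closest training point,
    # mimicking a fully-grown regression tree with one leaf per sample.
    return float(train_y[np.argmin(np.abs(train_x - x))])

in_range = tree_predict(4.0)    # recovers the trend
beyond = tree_predict(15.0)     # trend lost: stuck at the last leaf's value
```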


  27. Ensemble
    The 1st place's final ensemble consists of the harmonic mean of:
    ● 2 * all features (different seeds) and 2 * all features (using month-ahead features)
    ● 1 * sales model
    ● 1 * customer model
    ● REPEAT all six models for the months May to September
    ● For September, all of the 2*6 models used month-ahead features
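Combining predictions by harmonic mean is a one-liner; a sketch over a few hypothetical per-model prediction vectors (3 models, 2 test days):

```python
import numpy as np

preds = np.array([
    [8000.0, 5000.0],
    [9000.0, 4000.0],
    [8500.0, 4500.0],
])

# Harmonic mean across models for each test day: n over the sum of
# reciprocals. It is always <= the arithmetic mean, so the ensemble
# is pulled slightly toward the lower predictions.
harmonic = preds.shape[0] / np.sum(1.0 / preds, axis=0)
```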


  28. (diagram of the winner's pipeline: the features feed a Sales GB, a Customer GB, and
    1,115 per-store linear models; their predictions are added together and a multiplexer
    routes them into 4 main GBs, which leave out the recent features. On the right, the
    same Sales GB, Customer GB, and 4 main GBs are trained only on the data from the same
    months as the test period, while the left side uses all the data.)


  29. Conclusion
    You can reach the top 10% by doing feature engineering well.
    Programming skill still matters.


  30. Outlier, or typo?


  31. (image-only slide)