Kaggle_meetup_3rd LT ( Sberbank Russian Housing Market )

Maxwell
October 28, 2017

Tips for cleansing messy data

This deck shows the methodology I used in the following financial competition on Kaggle.





  1. Kaggle Tokyo Meetup #3 Lightning Talk 28 Oct 2017 HoxoMaxwell

  2. Prediction of house prices in the Russian market (regression task)

  3. Messy data competition (participants' voice)

    Many NA's and invalid data.
    Inaccurate longitude and latitude.
    About 10% abnormal target values (fake prices). In Russia, houses are sometimes sold at prices lower than they would otherwise be.
    Macroeconomic dependency. The price data is time-series data, so we had to consider macroeconomic change over time, but data was limited…
  4. It’s tough to predict fake prices.

    [Figure: histogram of prices and price change with time]
    About 10% of all data are outliers (fake prices).
    Exclude them, or they would harm model accuracy.
    There is no indicator for fakes.
    How to exclude fake prices?
  5. Evaluation metric is RMSLE.

    [Figure: actual vs. predicted prices, with and without fake prices]
    If fake prices were not included, RMSLE would be around 0.1. But without any cleansing, local CV was around 0.4 and LB was around 0.3, so the test data would also include fake prices.
    Ideal RMSLE: around 0.1
    LB RMSLE: around 0.3
  6. Methodology for excluding fake prices

    1. Fit and predict prices by XGBoost with the initial data.
    2. Compare the actual and predicted prices, then exclude (throw away) data whose predicted prices deviate largely from the actual ones.
    3. Train the model again with the data remaining after step 2.
    4. Predict prices by XGBoost and go back to step 2. If there is no more data to exclude, stop the process.
  7. Invitation to pre-seminar: join the private seminar of DataRobot!

    Date and time: 10/8 18:30 – 21:00
    Location: Tokyo Station, Shin-Maru Building
    https://www.datarobot.com/jp/AI-experience-tokyo-2017/?utm_source=database&utm_campaign=JPDREXP
  8. Thank you !
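
The RMSLE metric from slide 5 and the exclusion loop from slide 6 can be sketched roughly as follows. This is a minimal illustration, not the code used in the competition. Several things are assumptions: the names `rmsle`, `exclude_suspected_fakes`, `fit_predict` and `median_model` are hypothetical; the deviation test (absolute difference of log prices above `threshold=1.0`) is one possible reading of "largely deviated", since the slides give no cutoff; predictions are made on the same rows the model was trained on; and a median predictor stands in for XGBoost so the sketch runs with the standard library only. With XGBoost, `fit_predict` would presumably wrap fitting a regressor and calling its predict method. The prices are made up.

```python
import math
from statistics import median


def rmsle(actual, predicted):
    """Root mean squared logarithmic error for paired, non-negative prices."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((math.log1p(p) - math.log1p(a)) ** 2
                for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


def exclude_suspected_fakes(rows, prices, fit_predict, threshold=1.0, max_rounds=20):
    """Return the indices of rows kept after repeated fit / compare / drop rounds.

    fit_predict(rows, prices) is a hypothetical callable that trains a model and
    returns one predicted price per row. threshold is an assumed cutoff on
    |log1p(predicted) - log1p(actual)|; the slides do not state one.
    """
    keep = list(range(len(rows)))
    for _ in range(max_rounds):
        if not keep:
            break
        x = [rows[i] for i in keep]
        y = [prices[i] for i in keep]
        preds = fit_predict(x, y)  # steps 1, 3 and 4: train, then predict
        # Step 2: keep only rows whose prediction stays close to the actual price.
        survivors = [i for i, p, a in zip(keep, preds, y)
                     if abs(math.log1p(p) - math.log1p(a)) <= threshold]
        if len(survivors) == len(keep):  # nothing left to exclude, so stop
            break
        keep = survivors
    return keep


def median_model(rows, prices):
    """Stand-in for XGBoost: predicts the median training price for every row."""
    return [median(prices)] * len(prices)


# Made-up prices: nine ordinary sales and one suspiciously low one (index 9).
prices = [5.0e6, 5.2e6, 4.8e6, 5.1e6, 4.9e6, 5.0e6, 5.3e6, 4.7e6, 5.0e6, 1.0e6]
rows = [[i] for i in range(len(prices))]  # placeholder features, unused by the stand-in

kept = exclude_suspected_fakes(rows, prices, median_model)
kept_rows = [rows[i] for i in kept]
kept_prices = [prices[i] for i in kept]

before = rmsle(prices, median_model(rows, prices))
after = rmsle(kept_prices, median_model(kept_rows, kept_prices))
print("kept indices:", kept)
print("RMSLE before cleansing:", before)
print("RMSLE after cleansing:", after)
```

Passing the model in as a callable keeps the loop independent of any particular library. Note that cleansing the training data this way lowers the error measured on the cleaned rows only; as slide 5 points out, the test data would still contain fake prices.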