Kaggle_meetup_3rd LT ( Sberbank Russian Housing Market )

Maxwell
October 28, 2017

Material for the 3rd Kaggle Meetup Tokyo.

This deck shows part of my solution to the Sberbank Russian Housing Market competition on Kaggle.

Kaggle competition page:
https://www.kaggle.com/c/sberbank-russian-housing-market

Updated: Jul.2022

Transcript

  1. Messy data competition

    01 Many NAs (an ocean of NAs).
    02 Invalid data: inaccurate longitude and latitude.
    03 About 10% abnormal target values (fake prices): in Russia, houses are sometimes sold at prices abnormally lower than their natural value.
    04 Macro-economic dependency: the price data is time-series data, so we had to consider macro-economic change over time. But the data was limited…
    Kaggler's voice: he got 1st place, but was critical of this competition.
  2. It’s tough to predict fake prices, because they do not follow any rule.

    About 10% of the train data are outliers...
    Exclude them, or they would harm model accuracy.
    There is no indicator for fakes.
    How to exclude fake prices? These are not simple outliers, but fake prices!
    [Figures: histogram of house price; histogram of logarithmic house price; price change with time (mean price, median price); logarithmic price change with time; price localizing]
  3. Actual Prediction

    The evaluation metric is RMSLE. If fake prices were not included, it would be approximately 0.1XX based on my simulation.
    The test data would also include fake prices, because scores on the public LB are much higher than 0.1XX.
    Fake-free RMSLE: ~ 0.1XX. Public LB RMSLE: ~ 0.3XX.
    But without any cleansing, local CV was approximately 0.4XX, and LB was around 0.3XX. How about the test data?
    [Figure labels: w/ fake price; w/o fake price; prediction]
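As a rough illustration of why a few fake prices dominate the score, RMSLE can be sketched in plain Python as the root of the mean squared difference between `log(1 + predicted)` and `log(1 + actual)`. The prices below are invented for illustration only and are not taken from the competition data.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error for paired lists of prices."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices (illustrative values, not competition data).
clean_actual = [5.0e6, 6.0e6, 7.0e6]
clean_pred = [5.5e6, 5.7e6, 7.2e6]

# Same rows plus one abnormally low "fake" price that the model
# still predicts at a natural level.
fake_actual = clean_actual + [1.0e6]
fake_pred = clean_pred + [6.0e6]

print("RMSLE without fake price:", rmsle(clean_actual, clean_pred))
print("RMSLE with fake price:   ", rmsle(fake_actual, fake_pred))
```

Because the error is taken on the log scale, a single price that is several times lower than the prediction contributes far more than many ordinary rows combined.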
  4. Algorithm for excluding fake prices

    Step 1. Fit XGBoost on the initial train data and predict prices.
    Step 2. Compare the actual and predicted prices, then remove (throw away!) the rows where the predicted price deviates largely from the actual price.
    Step 3. Train the model again on the data left after Step 2.
    Step 4. Predict prices with XGBoost and go back to Step 2. Stop when there is no more data to remove.
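The loop in Steps 1 to 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's actual code: the slide uses XGBoost, while here the model is passed in as a hypothetical `fit_predict` callable and a simple per-group median stands in for it so the example runs with the standard library alone. The log-scale comparison, the `threshold` value, and the `max_rounds` cap are also assumptions; the slide only says the predicted price is "largely deviated" from the actual one.

```python
import math
from statistics import median


def iterative_exclusion(rows, prices, fit_predict, threshold=1.0, max_rounds=20):
    """Return the indices of rows kept after repeated refit-and-remove rounds.

    fit_predict(train_rows, train_prices, rows_to_score) is a hypothetical
    interface: it trains on the first two arguments and returns one
    predicted price per row in rows_to_score.
    """
    keep = list(range(len(prices)))
    for _ in range(max_rounds):
        train_rows = [rows[i] for i in keep]
        train_prices = [prices[i] for i in keep]
        # Steps 1, 3, 4: fit on the current data and predict the same rows.
        preds = fit_predict(train_rows, train_prices, train_rows)
        # Step 2: keep only rows whose log-price is close to the prediction.
        survivors = [
            i for i, pred in zip(keep, preds)
            if abs(math.log1p(prices[i]) - math.log1p(pred)) <= threshold
        ]
        if not survivors:
            return []
        if len(survivors) == len(keep):
            break  # nothing removed in this round, so stop
        keep = survivors
    return keep


def median_by_group(train_rows, train_prices, rows_to_score):
    """Stand-in model (not XGBoost): predict the median price of each group."""
    groups = {}
    for key, price in zip(train_rows, train_prices):
        groups.setdefault(key, []).append(price)
    overall = median(train_prices)
    return [median(groups[k]) if k in groups else overall for k in rows_to_score]


# Hypothetical data: two districts, each with one abnormally low price.
districts = ["A"] * 5 + ["B"] * 5
sale_prices = [5.0e6, 5.2e6, 4.8e6, 5.1e6, 1.0e6,
               9.0e6, 9.5e6, 8.8e6, 9.2e6, 2.0e6]

kept = iterative_exclusion(districts, sale_prices, median_by_group)
print(kept)
```

To follow the slide more closely, `fit_predict` would wrap an XGBoost regressor trained on the real feature matrix. The threshold then controls how aggressively rows are discarded and would need tuning against a validation score.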
  5. Invitation to the pre-seminar

    Join the private seminar of DataRobot! Legendary Kaggler and more...
    Date: Oct.8.2017, 18:30 – 21:00
    Location: Tokyo Station Shin-Maru-building
    https://www.datarobot.com/jp/AI-experience-tokyo-2017/?utm_source=database&utm_campaign=JPDREXP