Kaggle_meetup_3rd LT ( Sberbank Russian Housing Market )

Slide 1

Slide 1 text

Kaggle Tokyo Meetup #3 Lightning Talk Oct.28.2017 Maxwell_110

Slide 2

Slide 2 text

Predict house prices in the Russian housing market.
Sberbank is a Russian bank.

Slide 3

Slide 3 text

Messy data competition
01 Many NAs (an ocean of NAs)
02 Invalid data: inaccurate longitude and latitude.
03 About 10% abnormal target values (fake prices): in Russia, houses are sometimes sold at an abnormally low price compared with the natural one.
04 Macro-economic dependency: the price data is time-series data, so we had to consider macro-economic change over time. But the data was limited…
Kaggler's voice: he got 1st place, but was critical of this competition.

Slide 4

Slide 4 text

How to exclude fake prices?
It’s tough to predict fake prices, because they do not follow any rule.
About 10% of the training data are outliers... Not simple outliers, but fake prices!
Exclude them, or they would harm model accuracy.
There is no indicator for fakes.
[Figures: histogram of house price; histogram of logarithmic house price; price change with time (mean price, median price); logarithmic price change with time; price localizing]

Slide 5

Slide 5 text

How about test data?
The evaluation metric is RMSLE. If fake prices are not included, it would be approximately 0.1XX based on my simulation.
The test data would also include fake prices, because scores on the public LB are much higher than 0.1XX.
Fake-free RMSLE: ~ 0.1XX
Public LB RMSLE: ~ 0.3XX
But without any cleansing, local CV was approximately 0.4XX, and LB was around 0.3XX.
[Figure: actual vs. prediction, with fake prices and without fake prices]
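For reference, RMSLE can be computed as in the minimal sketch below. The function name `rmsle` and the use of NumPy are assumptions for illustration, not code from the talk. Because the error is taken on log-transformed prices, one price that is abnormally low relative to its prediction adds a large squared term, which fits the gap between the fake-free and public LB scores.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so a price of 0 is still well defined
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))
```

Training a model on `np.log1p(price)` with an ordinary squared-error objective optimizes the same quantity.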

Slide 6

Slide 6 text

Algorithm for excluding fake prices
Step 1. Fit XGBoost on the initial training data and predict prices.
Step 2. Compare the actual price and the predicted price, then remove rows where the predicted price deviates largely from the actual one. Throw away!
Step 3. Train the model again with the data remaining after Step 2.
Step 4. Predict prices by XGBoost and go back to Step 2. If there is no more data to remove, stop the process.
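The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's actual code: the deviation is measured on log prices (to match RMSLE), the cutoff `threshold` and the cap `max_rounds` are arbitrary, and the names `remove_fake_prices`, `make_model` and `LinearStandIn` are hypothetical. The talk uses XGBoost; here any object with `fit` and `predict` works, and the tiny linear model only keeps the sketch runnable without extra packages.

```python
import numpy as np


class LinearStandIn:
    """Least-squares stand-in for the XGBoost model used in the talk."""

    def fit(self, X, y):
        A = np.column_stack([np.ones(len(X)), X])
        self.coef_ = np.linalg.lstsq(A, y, rcond=None)[0]
        return self

    def predict(self, X):
        return np.column_stack([np.ones(len(X)), X]) @ self.coef_


def remove_fake_prices(X, y, make_model, threshold=0.5, max_rounds=20):
    """Return a boolean mask of the rows kept after iterative removal.

    threshold: assumed cutoff on |predicted log price - actual log price|.
    """
    X = np.asarray(X, dtype=float)
    log_y = np.log1p(np.asarray(y, dtype=float))
    keep = np.ones(len(log_y), dtype=bool)
    for _ in range(max_rounds):
        model = make_model()
        model.fit(X[keep], log_y[keep])            # Steps 1 and 3: (re)train
        resid = np.abs(model.predict(X) - log_y)   # Steps 2 and 4: compare
        drop = keep & (resid > threshold)
        if not drop.any():                         # nothing left to remove
            break
        keep &= ~drop                              # throw away suspected fakes
    return keep
```

With the `xgboost` package installed, passing something like `make_model=lambda: xgboost.XGBRegressor()` would follow the talk more closely. The returned mask can then be used to select the rows for the final training run.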

Slide 7

Slide 7 text

Invitation to the pre-seminar
Join the private seminar of DataRobot!
Legendary Kaggler and more...
Date: Oct.8.2017, 18:30 – 21:00
Location: Tokyo Station Shin-Maru-building
https://www.datarobot.com/jp/AI-experience-tokyo-2017/?utm_source=database&utm_campaign=JPDREXP

Slide 8

Slide 8 text

Happy Kaggling!