02 Invalid data
Inaccurate longitude and latitude.

03 About 10% abnormal target values (fake prices)
In Russia, houses are sometimes sold at an abnormally low price, well below the natural one.

04 Macro-economic dependency
The price data is time-series data, so we had to account for macro-economic change over time. But the data was limited…

Kaggler's voice
He took 1st place, but was critical of this competition.
be any rule.

About 10% of the train data are outliers...
[Figure: Histogram of house price]

Exclude them, or they will harm model accuracy.
There is no indicator for fakes.
How to exclude fake prices? These are not simple outliers, but fake prices!

[Figure: Price change with time (mean price, median price)]
[Figure: Histogram of logarithmic house price]
[Figure: Logarithmic price change with time]

Price localizing
not included, it would be approximately 0.1XX based on my simulation. The test data would also include fake prices, because scores on the public LB are much higher than 0.1XX.

Fake-free RMSLE: ~0.1XX
Public LB RMSLE: ~0.3XX

[Figure: Prediction, w/ fake price vs. w/o fake price]

But without any cleansing, local CV was approximately 0.4XX, and LB was around 0.3XX.

How about test data?
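To make the gap between the fake-free score and the leaderboard score concrete, here is a minimal sketch of RMSLE in plain Python. The prices are made up for illustration and are not competition data; the names `rmsle`, `genuine`, and `with_fake` are hypothetical. It only shows the direction of the effect: one abnormally low recorded price pulls the score up even when the prediction itself is reasonable.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Made-up prices (assumption): the model predicts plausible market values.
predicted = [5_000_000, 7_000_000, 6_000_000, 5_500_000]
genuine = [5_200_000, 6_800_000, 6_100_000, 5_400_000]
# Same sales, but the last one is recorded far below its plausible value.
with_fake = genuine[:3] + [1_000_000]

score_clean = rmsle(genuine, predicted)
score_with_fake = rmsle(with_fake, predicted)
print(f"fake-free: {score_clean:.3f}, with one fake: {score_with_fake:.3f}")
```

Because the error is taken on log prices, a sale recorded at a fraction of its real value contributes a large term no matter how expensive the house is, which fits the observation that the leaderboard score sits far above the fake-free estimate.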
train data.

Algorithm for excluding fake prices

Step 2. Compare the actual price with the predicted price, then remove data where the predicted price deviates largely from the actual price. Throw away!
Step 3. Train the model again with the data remaining after Step 2.
Step 4. Predict prices with XGBoost and go back to Step 2. If there is no data left to remove, stop the process.
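The loop in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the author's code. To stay self-contained it replaces XGBoost with a stand-in model that predicts the median log price per district. The names `fit_predict_group_median` and `remove_fake_prices`, the `threshold` of 0.5 in log space, the `max_rounds` cap, and the toy rows are all hypothetical.

```python
import math
import statistics
from collections import defaultdict


def fit_predict_group_median(train_rows, rows):
    """Stand-in for the XGBoost model (assumption, not the author's model).

    Learns the median log price per district from train_rows and returns
    that median as the predicted log price for every row in rows.
    """
    logs_by_district = defaultdict(list)
    for district, price in train_rows:
        logs_by_district[district].append(math.log1p(price))
    medians = {d: statistics.median(v) for d, v in logs_by_district.items()}
    return [medians[district] for district, _ in rows]


def remove_fake_prices(rows, fit_predict, threshold=0.5, max_rounds=20):
    """Repeat train -> predict -> drop large deviations until nothing is dropped."""
    kept = list(rows)
    for _ in range(max_rounds):
        # Train on the current data and predict a log price for each row.
        predictions = fit_predict(kept, kept)
        # Step 2: keep only rows whose actual log price is close to the prediction.
        survivors = [
            row
            for row, predicted in zip(kept, predictions)
            if abs(math.log1p(row[1]) - predicted) <= threshold
        ]
        if len(survivors) == len(kept):  # nothing was removed: stop
            break
        kept = survivors  # Steps 3 and 4: retrain on the cleaned data
    return kept


# Toy rows (district, price); the two 1,000,000 entries play the fake prices.
rows = [
    ("A", 5_000_000), ("A", 5_200_000), ("A", 4_800_000),
    ("A", 5_100_000), ("A", 1_000_000),
    ("B", 9_000_000), ("B", 9_500_000), ("B", 8_700_000),
    ("B", 1_000_000),
]
cleaned = remove_fake_prices(rows, fit_predict_group_median)
print(len(rows), "->", len(cleaned))
```

With a flexible model such as XGBoost, in-sample predictions can fit the fake prices themselves, so one would probably compare against out-of-fold predictions instead. The slides do not say which was used, so that too is an assumption.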