Martin O’Leary

Last weekend, Kaggle and Data Science London ran a second hackathon, this time focused on the EMI One Million Interview Dataset, a large database of musical preferences. I took third place globally in this competition, and this is an attempt to explain how my model worked. The code is available on GitHub.

For this competition I took a “blitz” approach. Rather than focusing on a single model and trying to squeeze all the performance I could out of it, I threw together a bunch of simple models and blended their results. In the end, I combined ten separate predictions for my final submission.

Because I knew that I was going to be blending the results, for each model I retained a set of cross-validation predictions for the training set. These were used as input to the blending process, as well as to give me an idea of how well each model was performing, without having to use up my submission quota. In general, I used ten-fold cross-validation, but for models which used random forests, I simply used the out-of-bag predictions for each data point.

Preprocessing

As given, the data consists of a table of (user, artist, track, rating) quadruples, along with tables of data about users (demographic information, etc.) and about user-artist pairs (descriptive words about each artist, whether the user owned any of their music, etc.). These secondary tables were quite messy, with lots of missing data and no standard encoding of values. I generally don’t enjoy data cleaning, so I did one quick pass through this data to tidy it up a little, then used it as-is for all the models.

I merged the two “Good lyrics” columns, which differed only in capitalisation. For the “OWN_ARTIST_MUSIC” column, I collapsed the multiple encodings for “Don’t know”. Similarly, I collapsed several of the levels of the “HEARD_OF” column. The responses for “LIST_OWN” and “LIST_BACK” needed to be converted to numbers, rather than the mish-mash of numeric and text values which were there to begin with. To fill in missing values, I used the median value for numeric columns and the most common value for categorical columns. I then joined these tables with the training data.

In most cases, the results were improved by first removing “global effects”. I subtracted the overall mean rating, and then estimated effects for users and tracks, with Bayesian priors which shrank these effects towards zero for poorly sampled users and tracks. These effects were then added back in after the model prediction had been made.

Chuck it all in a Random Forest/GBM/Linear Regression

The first thing I tried was an attempt to mimic Ben Hamner’s success in the last hackathon, by throwing everything into a random forest and hoping for the best. It turned out that while this was a pretty good approach, it was also extremely slow. I was only able to run a limited number of trees, with a reduced sample size, so the results probably weren’t as good as they could have been. I also originally ran this without removing global effects, and didn’t have time to go back and do it again.
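To make the cross-validation bookkeeping concrete, here is a minimal sketch of collecting out-of-fold predictions for later blending. It assumes a scikit-learn style model and NumPy arrays; the function itself is illustrative rather than the actual competition code.

```python
import numpy as np
from sklearn.model_selection import KFold


def cv_predictions(model, X, y, n_folds=10, seed=0):
    """Return an out-of-fold prediction for every training example."""
    preds = np.zeros(len(y))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X):
        # Fit on the other folds, predict the held-out fold.
        model.fit(X[train_idx], y[train_idx])
        preds[valid_idx] = model.predict(X[valid_idx])
    return preds
```

The out-of-fold predictions from each model can then be stacked as columns and fed to a simple blender, with no risk of the blend overfitting to in-sample predictions.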
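For illustration, the kind of cleaning described in the preprocessing section might look roughly like this in pandas; the column labels, variant spellings and the regular expression are guesses rather than the actual encodings in the dataset.

```python
import pandas as pd


def clean_table(df):
    """Rough tidy-up of one of the secondary tables (illustrative only)."""
    # Merge the two "Good lyrics" columns, which differ only in
    # capitalisation (the exact labels here are assumptions).
    if "Good Lyrics" in df.columns and "Good lyrics" in df.columns:
        df["Good lyrics"] = df["Good lyrics"].fillna(df.pop("Good Lyrics"))

    # Collapse the multiple encodings of "Don't know" into a single level
    # (the variant spellings are illustrative).
    if "OWN_ARTIST_MUSIC" in df.columns:
        df["OWN_ARTIST_MUSIC"] = df["OWN_ARTIST_MUSIC"].replace(
            {"Don`t know": "Don't know", "don't know": "Don't know"})

    # LIST_OWN / LIST_BACK mix text and numeric answers; pull out the
    # numeric part and treat it as hours of listening.
    for col in ["LIST_OWN", "LIST_BACK"]:
        if col in df.columns:
            df[col] = (df[col].astype(str)
                              .str.extract(r"(\d+)", expand=False)
                              .astype(float))

    # Fill missing values: median for numeric columns, most common value
    # for categorical ones.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```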
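The global-effects removal can be sketched as follows, assuming a pandas DataFrame with "User", "Track" and "Rating" columns; the column names and the shrinkage weights are placeholders, and the post does not say exactly how the priors were chosen.

```python
import pandas as pd


def remove_global_effects(train, lam_user=25.0, lam_track=25.0):
    """Subtract the overall mean plus shrunk user and track effects."""
    mu = train["Rating"].mean()
    resid = train["Rating"] - mu

    # User effect: mean residual, pulled towards zero for users with few
    # ratings by adding a pseudo-count (lam_user) to the denominator.
    user_eff = (resid.groupby(train["User"]).sum()
                / (train.groupby("User").size() + lam_user))
    resid = resid - train["User"].map(user_eff)

    # Track effect, estimated on what is left after the user effect.
    track_eff = (resid.groupby(train["Track"]).sum()
                 / (train.groupby("Track").size() + lam_track))
    resid = resid - train["Track"].map(track_eff)

    return mu, user_eff, track_eff, resid


def add_global_effects(pred_resid, test, mu, user_eff, track_eff):
    """Add the effects back in after the model has predicted the residuals."""
    return (pred_resid
            + mu
            + test["User"].map(user_eff).fillna(0.0)
            + test["Track"].map(track_eff).fillna(0.0))
```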
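A rough scikit-learn equivalent of the kitchen-sink forest, keeping the out-of-bag predictions for the blend, might look like the sketch below; the one-hot encoding step and the parameter values are assumptions, not a description of the actual model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def fit_kitchen_sink_forest(train, feature_cols, target_col="Rating"):
    """Fit a random forest on everything and return its OOB predictions."""
    # One-hot encode categorical columns so the forest can use them.
    X = pd.get_dummies(train[feature_cols])
    y = train[target_col]

    # oob_score=True makes scikit-learn keep out-of-bag predictions, which
    # stand in for explicit cross-validation when blending.
    forest = RandomForestRegressor(n_estimators=100, n_jobs=-1,
                                   oob_score=True, random_state=0)
    forest.fit(X, y)
    return forest, forest.oob_prediction_
```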