Data visualisation competition winner on Kaggle.com

08.02.2013 - Coding

Kaggle-what?! Apparently I have reached a new level of geekiness: not one of my IT colleagues has heard of the Kaggle platform. Okay, simply put, Kaggle hosts something like the Champions League of data mining. Companies and organisations such as Google, IBM, NASA or Facebook publish current problems on the platform, which thousands of "Kagglers" then try to solve in competitions with substantial prize money. The open Kaggle community now comprises around 100,000 scientists and private individuals from a wide range of disciplines, spread across the globe.

The competitions are mostly predictive modelling tasks: the goal is to find the model (algorithm) that best characterises the data, so that it can then make forecasts for new, similar data. Example: a model is needed for the probability of surviving the sinking of the Titanic. One part of the passenger data (name, age, gender, cabin class, ticket price...) comes with the information whether each person survived; in a second part, the survival information is missing. A model is built from the first part and then used to predict the unknown survival information in the second part. A simple model would be "If the person is female, she survived, otherwise not", which already gives a correct prediction in 70% of the cases. Kaggle automatically calculates a score from the uploaded predictions, and the participant sees it in a ranking list on the Kaggle website.
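To make the "simple model" a bit more concrete, here is a minimal Python sketch of the gender rule. The passenger records are a tiny, made-up toy sample (not the real Titanic data), and the names `predict_survival` and `accuracy` are my own choices for illustration, so the accuracy it yields says nothing about the real competition.

```python
# Toy, invented passenger records (NOT the real Titanic data).
# "survived" is 1 for survivors and 0 otherwise.
train = [
    {"sex": "female", "survived": 1},
    {"sex": "female", "survived": 1},
    {"sex": "female", "survived": 0},
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 1},
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 0},
]


def predict_survival(passenger):
    """The rule from the text: women survive, everyone else does not."""
    return 1 if passenger["sex"] == "female" else 0


def accuracy(rows):
    """Share of labelled rows for which the rule predicts correctly."""
    hits = sum(predict_survival(row) == row["survived"] for row in rows)
    return hits / len(rows)


# The "second part" of the data: same kind of records, survival unknown.
test = [{"sex": "female"}, {"sex": "male"}]
predictions = [predict_survival(passenger) for passenger in test]

print("accuracy on the labelled part:", accuracy(train))
print("predictions for the unlabelled part:", predictions)
```

The predictions for the unlabelled part are what a participant would upload; the score is then computed on Kaggle's side against the hidden survival information.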

I came across Kaggle in mid-2012 and found the crowdsourcing idea behind the platform brilliant and the tasks extremely exciting. During my last Christmas holidays I finally had enough time to take part in my first Kaggle competition. Successfully, as it turned out. The Leaping Leaderboard Leapfrogs competition was run by Kaggle itself. For a change, it was less about data models and more about a new visualisation of the Kaggle competition leaderboards. The data consisted of the leaderboard entries of various past competitions, each with team name, score and date. The task was to find a visualisation that was as appealing as possible and that showed the tough battles for the top places in the rankings.

I stuck to the mantra "keep it simple": unlike most other participants, who built entire web applications to track the absolute scores of all teams over time, I simply focused on the relative ranking positions. The result is a fairly simple but illustrative "Battles of the Best" graphic that shows the battles of the leading teams over time. The Kaggle community liked it. After the submission deadline, my entry collected the most votes, and I am now the (proud) owner of an iPad ;-)
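The step from absolute scores to relative ranking positions can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind my submission: the entries are invented, the function name `ranks_over_time` is hypothetical, and I assume that a higher score is better and that each team is ranked by its best score so far.

```python
from datetime import date

# Invented leaderboard entries: (team name, score, submission date).
entries = [
    ("Alpha", 0.71, date(2012, 12, 1)),
    ("Beta", 0.74, date(2012, 12, 1)),
    ("Gamma", 0.69, date(2012, 12, 2)),
    ("Alpha", 0.78, date(2012, 12, 3)),
    ("Gamma", 0.80, date(2012, 12, 4)),
]


def ranks_over_time(entries, higher_is_better=True):
    """Return {day: {team: position}} based on each team's best score so far."""
    best = {}     # best score seen so far per team
    history = {}  # ranking snapshot per day
    for day in sorted({d for _, _, d in entries}):
        # Fold in the submissions of this day.
        for team, score, d in entries:
            if d != day:
                continue
            previous = best.get(team)
            improved = previous is None or (
                score > previous if higher_is_better else score < previous
            )
            if improved:
                best[team] = score
        # Order the teams by best score and number the positions from 1.
        ordered = sorted(best, key=best.get, reverse=higher_is_better)
        history[day] = {team: pos for pos, team in enumerate(ordered, start=1)}
    return history


history = ranks_over_time(entries)
for day, positions in history.items():
    print(day, positions)
```

Plotting one line per team with the day on the x-axis and the position on the y-axis then shows the overtaking manoeuvres directly, without the reader having to compare raw score values.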

Technical Details