Kaggle Days 2018

This year saw the first edition of Kaggle Days. The event was a good opportunity for new Kagglers to get started: it consisted of two days spent with Kaggle masters.

Presentation Day

The first day consisted of two parallel tracks – presentations and workshops.
One of the first presentations, given by Mikhail Trofimov, was an overview of how to set up pipelines for competitions. The main takeaway is to split the data by type (text, categorical, continuous and so on) and then aggregate the features so that they can be fed into a single model. The suggested approach is to start small, use gradient boosting by default, and only then start incorporating more arcane models (such as neural networks).
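To make the "split by type, then aggregate" idea concrete, here is a minimal sketch in plain Python. It is not the pipeline shown in the talk: the toy rows, the column names (`title`, `category`, `score`) and all helper functions are made up for illustration. Each column type gets its own small feature builder, and the resulting blocks are concatenated into one matrix that could then be handed to a single model, for example a gradient boosting library of your choice.

```python
from collections import Counter

# Hypothetical toy rows; the column names are illustrative only.
rows = [
    {"title": "how to train a model", "category": "ml", "score": 3.5},
    {"title": "train schedule today", "category": "travel", "score": 1.0},
]


def continuous_features(rows, column):
    """Pass a numeric column through as a single feature."""
    return [[float(r[column])] for r in rows]


def one_hot_features(rows, column):
    """One indicator column per distinct category level (sorted for stability)."""
    levels = sorted({r[column] for r in rows})
    return [[1.0 if r[column] == lvl else 0.0 for lvl in levels] for r in rows]


def bag_of_words_features(rows, column):
    """Word counts over a vocabulary built from whitespace-split tokens."""
    vocab = sorted({w for r in rows for w in r[column].lower().split()})
    features = []
    for r in rows:
        counts = Counter(r[column].lower().split())
        features.append([float(counts[w]) for w in vocab])
    return features


def build_feature_matrix(rows):
    """Build one block per data type, then concatenate the blocks row by row."""
    blocks = [
        continuous_features(rows, "score"),
        one_hot_features(rows, "category"),
        bag_of_words_features(rows, "title"),
    ]
    return [sum(parts, []) for parts in zip(*blocks)]


X = build_feature_matrix(rows)
```

Keeping one builder per data type makes it easy to start with the simplest blocks and add or swap others later without touching the rest of the pipeline.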

The most interesting presentation was, in my opinion, the one on semantic image segmentation. After introducing the problem and outlining its usefulness for other tasks, the author presented commonly used architectures (they are fully convolutional, for example UNet). Perhaps surprisingly, segmentation models can be trained on relatively small datasets: compared to classification, each example carries more information, because it comes with a label for every pixel rather than one label per image.

Other presentations worth mentioning were given by Kaggle staff, who outlined the goals of Kaggle. Their vision is to make Kaggle more of a data science platform, rather than focusing only on competitions. As an example, they mentioned their newly published tutorials, which can be found here.

Competition Day

Finally, the competition. The task was to predict the number of upvotes for answers to Reddit questions. The top teams succeeded by reverse-engineering the structure of threads (which was not given explicitly in the data), so the competition was basically about extracting good features and using gradient boosting on top of them (no team in the top 3 reported using neural networks or text features beyond bag of words). The competition was a good place to get familiar with the typical pipelines used in Kaggle competitions. The challenge's authors deserve credit for setting up the data: there was some leakage, but the training and test subsets had very similar distributions (hence there was no need to establish a complicated cross-validation scheme).
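As an illustration of what thread-structure features can look like, here is a small sketch. It does not reproduce what the winning teams did, and it assumes each record already has an `id` and a `parent_id` field, which are hypothetical names (in the competition the structure had to be reconstructed first). Given such links, features like depth in the thread, number of direct replies and number of siblings are cheap to compute.

```python
from collections import defaultdict

# Hypothetical records: `id` and `parent_id` are assumed field names.
comments = [
    {"id": "q1", "parent_id": None},  # the question itself
    {"id": "a1", "parent_id": "q1"},
    {"id": "a2", "parent_id": "q1"},
    {"id": "r1", "parent_id": "a1"},
]


def thread_features(comments):
    """Compute simple structural features; assumes the links contain no cycles."""
    parent = {c["id"]: c["parent_id"] for c in comments}

    # Invert the parent links to get the direct replies of every node.
    children = defaultdict(list)
    for cid, pid in parent.items():
        if pid is not None:
            children[pid].append(cid)

    def depth(cid):
        # Walk up the parent links until a root (no parent) is reached.
        d = 0
        while parent.get(cid) is not None:
            cid = parent[cid]
            d += 1
        return d

    features = {}
    for cid, pid in parent.items():
        features[cid] = {
            "depth": depth(cid),
            "n_replies": len(children[cid]),
            "n_siblings": len(children[pid]) - 1 if pid is not None else 0,
        }
    return features


features = thread_features(comments)
```

Columns like these can be appended to the rest of the feature matrix and passed to a gradient boosting model.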


All in all, it seems that the first edition of Kaggle Days was a success. Sadly, there were not enough places in the workshops, but otherwise the organization was solid.
