Today, we share our experience from a Kaggle competition in which we needed to detect ships in satellite images. We faced numerous challenges along the way and worked out effective strategies for solving them.

Make sure to understand the problem

When starting a new project, it is important not to jump into building models before everyone on the team understands the objective. In our case, the problem was much more complicated than it initially seemed: while “detecting ships” sounds like a straightforward object detection task, the target bounding boxes were in fact oriented, and the scoring metric therefore required extreme precision.

Another challenge was working with an extremely tough dataset. First of all, most satellite images contain nothing but ocean or clouds. Even when there are ships in an image, they occupy only a small portion of it (satellite images are usually quite large). To make things even worse, ships often appear close to one another (especially in ports), which makes separating individual ships significantly harder.

After analyzing possible approaches to this problem, we decided to go with image segmentation. The main reason behind that choice was that standard object detection methods like YOLO produce axis-aligned rectangular bounding boxes around object instances, which, given the evaluation metric for this problem, were simply not accurate enough. To use a detection approach effectively, we would have needed an architecture like Mask R-CNN that detects precise borders, which requires much heavier preprocessing to create target masks for training.

A bounding box can be significantly less accurate than a segmentation mask
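To make the difference concrete, here is a small, self-contained numpy illustration (a toy example, not code from our pipeline): the axis-aligned box around a diagonally oriented ship is mostly background, so its overlap with the true mask is low.

```python
import numpy as np
from scipy.ndimage import rotate

# Synthetic "ship": a 60x12 rectangle rotated 45 degrees on a 128x128 tile.
mask = np.zeros((128, 128))
mask[58:70, 34:94] = 1.0
ship = rotate(mask, angle=45, reshape=False, order=0) > 0.5

# Tightest axis-aligned bounding box around the rotated ship.
ys, xs = np.where(ship)
box = np.zeros_like(ship)
box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True

iou = (ship & box).sum() / (ship | box).sum()
print(f"IoU of axis-aligned box vs. oriented mask: {iou:.2f}")  # roughly 0.3
```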

We decided to focus our efforts on TernausNet – a state-of-the-art image segmentation network that improves on the classical U-Net architecture. It proved very good at segmentation in multiple Kaggle competitions (1, 2). It also allows changing the model size, so we could run experiments and test the training pipeline before launching the full-sized model on an expensive machine.

Design the pipeline around the model

Any model working with images requires a lot of time and computing power to train. The only reasonable solution for such problems is transfer learning. In this case, we used a pre-trained VGG16 encoder with weights frozen after ImageNet training, and tuned only the decoder. This meant that the images and masks needed to be processed into a format the model could consume.
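Below is a minimal PyTorch sketch of this transfer-learning setup (illustrative only – the decoder here is a crude stand-in, while TernausNet's real decoder also uses skip connections from the encoder):

```python
import torch
import torch.nn as nn
from torchvision import models

# VGG16 feature extractor pre-trained on ImageNet, used as the encoder.
encoder = models.vgg16(pretrained=True).features
for param in encoder.parameters():
    param.requires_grad = False  # freeze the encoder weights

# Stand-in decoder: maps the 512-channel feature map back to a 1-channel mask.
decoder = nn.Sequential(
    nn.Conv2d(512, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, kernel_size=1),
    nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
)

# Only the decoder's parameters are tuned.
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

x = torch.randn(1, 3, 256, 256)   # a batch of RGB tiles
features = encoder(x)             # (1, 512, 8, 8) for a 256x256 input
logits = decoder(features)        # (1, 1, 256, 256) mask logits
```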

Because image segmentation models do not recognize object instances on their own, we had to merge the masks of all the ships in an image into a single mask for training, and split the model output back into separate masks when needed. The loss function and the training target (masks can have more than one channel) affected the model's ability to separate individual ships, so we needed more than one dataset ready for training. Since the classes were heavily unbalanced (the ratio of ship area to non-ship area was below 1%), we also had to sample a subset from the chosen training set – otherwise the model would learn to predict no ships at all.
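The sketch below shows how such mask handling typically looks (the function names are ours for illustration, not our exact pipeline code): per-ship masks are merged into one training target, and a binary prediction is split back into instances with connected-component labelling.

```python
import numpy as np
from scipy import ndimage

def merge_masks(ship_masks):
    """Combine per-ship binary masks into a single binary target mask."""
    return np.logical_or.reduce(ship_masks).astype(np.uint8)

def split_mask(predicted_mask, min_size=10):
    """Split a binary prediction into per-ship masks via connected components."""
    labeled, num_ships = ndimage.label(predicted_mask)
    instances = []
    for ship_id in range(1, num_ships + 1):
        instance = labeled == ship_id
        if instance.sum() >= min_size:  # drop tiny speckles
            instances.append(instance)
    return instances
```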

All the factors described above added many more steps to our training pipeline than a typical machine learning project would have. To run as many experiments as possible, we had to think about how to organize our work and minimize the iteration time (the time required to judge whether a change to the model works or not).

Our main challenge was to shorten the process and run as many experiments as possible

Speed things up with distribution

We knew we would need to re-sample the training data based on different criteria and to change the model architecture or learning parameters (optimizer, loss function, etc.). Despite doing our best to organize the project, the training process itself was incredibly long due to the large size of the images. Running multiple experiments concurrently would have wasted resources, since we just wanted to tune the model to a specific dataset. Instead, to shorten the iteration time, we decided to distribute as many parts of the pipeline as possible.

The biggest bottleneck was, of course, training the model, which for some of our setups took around 4-5 hours per epoch (mainly due to the large images and the VGG16 encoder). To distribute it, we used the Horovod library and moved to a server with 8 GPUs (a p2.8xlarge AWS instance). One epoch of distributed training was not much longer than a standard epoch, but it incorporated gradients from all 8 GPUs. As a result, we could train the model nearly 8 times faster.
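The Horovod recipe for this is short; the sketch below follows Horovod's documented PyTorch API (build_model is a hypothetical stand-in for our model construction):

```python
import horovod.torch as hvd
import torch

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one process per GPU

model = build_model().cuda()             # hypothetical model factory
# A common convention: scale the learning rate by the number of workers.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())

# Average gradients across all 8 workers at every optimization step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Start all workers from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

Launched with something like `horovodrun -np 8 python train.py`, each of the 8 processes then trains on its own shard of the data.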

Distributing steps such as mask generation, model prediction and mask splitting was easy – these tasks did not require any synchronization. For the GPU-intensive steps we spread the computation across the 8 GPUs, while for the other tasks a cheaper instance with 36 CPU cores cut the time by a factor of 10-15 (disk I/O turned out to be the bottleneck there). Thanks to all of this, a full training experiment took days to complete, not weeks, and in most cases we knew whether an experiment was worth continuing within just 2-3 hours.
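For the CPU-bound steps, plain Python multiprocessing is enough, since the tasks are embarrassingly parallel. A sketch (image_ids and generate_mask_for_image are hypothetical placeholders):

```python
from multiprocessing import Pool

def generate_mask_for_image(image_id):
    # Hypothetical worker: decode the RLE annotations for one image,
    # merge the per-ship masks and write the result to disk.
    ...

if __name__ == "__main__":
    image_ids = [...]  # hypothetical: identifiers of all training images
    with Pool(processes=36) as pool:   # one worker per CPU core
        pool.map(generate_mask_for_image, image_ids)
```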

Results

After a month of experiments, we ended up with a reasonably good model and post-processing strategy that could accurately identify ships and split them into separate instances. We successfully trained it to pick up all the ships, but the model needed more training time to limit the number of false positives. Unfortunately, the Kaggle competition suffered a data leak from the training set into the test set, so we put the project on hold. When the contest resumes, we will certainly continue training, as the model has yet to reach its full potential.

Our model’s prediction on a sample image from an open dataset

Our Data Science team is always on the lookout for new challenges – reach out and we will see how we can help you.
