How to decide which images to label next?

1

We have a custom dataset of 20,000 images with pixel-wise labels for two classes. We also have 1 million more raw images, which we would like to label.

We want to label the most important new images first. Importance is defined as:

  • Images that carry the most new information
  • Images that help our deep learning bounding box classifier improve the most

So instead of labeling images for which we already have about a thousand similar ones, we first want to label the images that differ most from the already-labeled ones and that help our classifier improve the most.

How can we assign priorities and decide which images to label first?

Laurenz

Posted 2017-10-04T14:36:46.880

Reputation: 113

Answers

2

This type of problem falls under 'active learning'. There is a lot of research being done on this topic at the moment, but some first approaches are relatively easy to implement, depending on the type of model you are using. Since you mentioned that you are using deep learning bounding box detectors, I will outline a few ways to approach this problem with convolutional neural networks.

The core idea is that we want some measure of the potential gain from labeling an unlabeled sample. We train our model on the labeled training set, predict on the unlabeled set, and use that measure to rank which examples would be most useful to label.
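As a rough illustration of that loop, here is a minimal sketch of a single selection round. The names `active_learning_round`, `train_fn` and `score_fn` are hypothetical placeholders and not part of any library; `train_fn` stands for whatever trains your detector, and `score_fn` for whatever usefulness measure you choose.

```python
import numpy as np

def active_learning_round(labeled_idx, unlabeled_idx, train_fn, score_fn, batch_size):
    """One round: train on the labeled set, score the pool, pick a batch to label."""
    model = train_fn(labeled_idx)                          # assumption: returns a fitted model
    scores = np.asarray(score_fn(model, unlabeled_idx))    # higher score = more useful to label
    order = np.argsort(-scores)[:batch_size]               # positions of the highest scores
    chosen = [unlabeled_idx[i] for i in order]
    chosen_set = set(chosen)
    remaining = [i for i in unlabeled_idx if i not in chosen_set]
    return chosen, remaining

# Toy stand-ins: no real training, and a fixed made-up score per image id.
toy_scores = {10: 0.1, 11: 0.9, 12: 0.4, 13: 0.7}
chosen, remaining = active_learning_round(
    labeled_idx=[0, 1, 2],
    unlabeled_idx=[10, 11, 12, 13],
    train_fn=lambda idx: None,
    score_fn=lambda model, pool: [toy_scores[i] for i in pool],
    batch_size=2,
)
```

In practice you would send `chosen` to your annotators, add the new labels to the training set, and repeat with `remaining` as the new pool.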

For classification you could use the sigmoid/softmax output as a rough uncertainty estimate: a sample whose predicted class probabilities are close to each other is one the model is unsure about. However, deep learning models are usually fairly confident in their predictions, and a high predicted probability doesn't automatically mean that the prediction is correct.
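To make the softmax idea concrete, here is a small sketch of two common ways to turn class probabilities into an uncertainty score: predictive entropy and least confidence. The probabilities below are made-up numbers for four unlabeled images, and the function names are my own, not from a library.

```python
import numpy as np

def entropy_scores(probs, eps=1e-12):
    """Predictive entropy per sample; probs has shape (n_samples, n_classes)."""
    probs = np.clip(probs, eps, 1.0)   # avoid log(0)
    return -np.sum(probs * np.log(probs), axis=1)

def least_confidence_scores(probs):
    """One minus the probability of the most likely class."""
    return 1.0 - probs.max(axis=1)

def top_k_uncertain(scores, k):
    """Indices of the k highest scores, most uncertain first."""
    return np.argsort(-scores)[:k]

# Hypothetical softmax outputs for four unlabeled images and two classes.
probs = np.array([[0.99, 0.01],
                  [0.55, 0.45],
                  [0.80, 0.20],
                  [0.50, 0.50]])

to_label = top_k_uncertain(entropy_scores(probs), k=2)
```

Both scores rank the near-50/50 images first. Keep the caveat above in mind: if the network is overconfident, these scores will be low even for images it gets wrong.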

Another approach is to use dropout in your model during training, and then keep dropout active when predicting on your unlabeled set as well. By sampling multiple dropout masks and comparing the resulting predictions, you can measure how much the outputs differ. If the outputs are very similar, your model is unlikely to learn much from a label for this example; if the outputs vary wildly, the example may lie in a part of your feature space that your model doesn't know or understand very well yet.
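Here is a minimal NumPy sketch of that dropout sampling idea. It uses a tiny two-layer network with random weights as a stand-in for a trained model, and random vectors as a stand-in for image features; both are assumptions made purely for illustration. The disagreement measure (variance across passes) is just one simple choice among several.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, w1, w2, rng, n_passes=20, p_drop=0.5):
    """Run n_passes stochastic forward passes with dropout kept on.

    Returns an array of shape (n_passes, n_samples, n_classes).
    """
    outputs = []
    for _ in range(n_passes):
        hidden = np.maximum(x @ w1, 0.0)              # ReLU hidden layer
        mask = rng.random(hidden.shape) >= p_drop     # fresh dropout mask per pass
        hidden = hidden * mask / (1.0 - p_drop)       # inverted dropout scaling
        outputs.append(softmax(hidden @ w2))
    return np.stack(outputs)

def disagreement_scores(mc_probs):
    """Variance across passes, averaged over classes; higher = less stable."""
    return mc_probs.var(axis=0).mean(axis=1)

rng = np.random.default_rng(0)
x_unlabeled = rng.normal(size=(100, 16))   # hypothetical features of 100 unlabeled images
w1 = rng.normal(size=(16, 32))             # stand-in for trained weights
w2 = rng.normal(size=(32, 3))

mc_probs = mc_dropout_predict(x_unlabeled, w1, w2, rng)
scores = disagreement_scores(mc_probs)
to_label = np.argsort(-scores)[:10]        # the 10 images with the least stable predictions
```

With a real framework you would get the same effect by leaving the dropout layers in training mode at prediction time and running the forward pass several times per image.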

There are many ways to approach this; what I have written here is just an introduction to the concept of 'active learning'. There are a lot of papers available on this topic! EDIT: I haven't actually read much of this research, but here are a few:

https://arxiv.org/pdf/1703.02910.pdf

https://arxiv.org/pdf/1707.05928.pdf

https://arxiv.org/pdf/1701.03551.pdf

Jan van der Vegt

Posted 2017-10-04T14:36:46.880

Reputation: 8 538

Awesome! Could you kindly link one or two good papers? :) – Laurenz – 2017-10-04T14:48:41.143

I don't know about the quality of these (I have only read the one about named entity recognition), but I added some research papers. Gal is a researcher who knows a lot about uncertainty in deep learning, so I would guess that one is useful. – Jan van der Vegt – 2017-10-04T15:10:53.497