Python package for machine-learning aided data labelling



In many cases, unlabelled data needs to be turned into labelled data. The best solution is to use (multiple) human classifiers. However, going through all the data by hand (e.g. in text mining or image processing) is often a daunting task. Is there software that can combine human classifiers and machine-learning techniques in real time? I am especially interested in Python packages.

To illustrate, classifying images from video streams is very repetitive. After 100 images (from different streams), a machine-learning algorithm could be used to predict the labels given by the human classifier. The machine classifier might be very confident about some (un)seen samples and very uncertain about others. The human classifier can then focus on the uncertain samples, helping the machine classifier learn what it does not yet know.


Posted 2017-04-18T16:28:16.577

Reputation: 941

Look for sloth – enterML – 2017-04-18T19:28:05.047 – D.W. – 2017-04-18T22:24:54.223

Am I correct that with sloth the computer is helping the human label the images, and not the other way around? I am looking for tools where humans and machines predict the same objective and aid each other. – Pieter – 2017-04-19T07:06:25.213



It sounds like you are looking for active learning. In active learning, the classifier learns which samples would be most useful to have labelled by a human.

There are many techniques for active learning, and many ways to adapt an existing (standard) learning algorithm to the active learning setting. The particular approach you mentioned is called "uncertainty sampling", and can be applied to any standard classifier that outputs confidence/certainty scores. There are other selection methods as well, which may perform better in some settings.
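To make uncertainty sampling concrete, here is a minimal sketch in plain Python. It assumes you already have per-class probabilities for each unlabelled sample (for instance, the kind of array a scikit-learn classifier returns from `predict_proba`). The function name `least_confident` and the example numbers are made up for illustration, and "least confidence" is only one of several possible uncertainty measures.

```python
def least_confident(probabilities, n_queries):
    """Return the indices of the n_queries samples the model is least sure about.

    probabilities: one row per unlabelled sample, one column per class.
    """
    # Confidence of a sample = probability of its most likely class.
    confidences = [max(row) for row in probabilities]
    # Rank sample indices from least to most confident.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:n_queries]


# Hypothetical class probabilities for five unlabelled samples, two classes.
probs = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.51, 0.49],
    [0.80, 0.20],
]

# Ask the human to label the two samples closest to a coin flip.
to_label = least_confident(probs, 2)
print(to_label)
```

In a full loop you would have the human label the returned samples, add them to the training set, refit the classifier, and repeat.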

You can also apply unsupervised methods to cluster the samples, then label one or a few samples from each cluster.
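As a rough sketch of the clustering idea, the following plain-Python code runs a very small k-means and then picks, for each cluster, the sample nearest to the centroid as the one to hand to the human. Everything here is an assumption made for illustration: the choice of k-means, the deterministic initialisation from the first `k` points, the function names `kmeans` and `representatives`, and the toy data. In practice you would probably use a library clustering implementation instead.

```python
import math


def kmeans(points, k, n_iter=20):
    """Tiny k-means; returns (centroids, cluster index per point)."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def representatives(points, k):
    """Return one sample index per non-empty cluster: the one nearest the centroid."""
    centroids, assignment = kmeans(points, k)
    reps = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            reps.append(min(members, key=lambda i: math.dist(points[i], centroids[c])))
    return reps


# Toy 2-D feature vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reps = representatives(points, 2)
print(reps)
```

The human labels only the returned samples, and each label can then be propagated to the rest of its cluster or used to seed a supervised classifier.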


Posted 2017-04-18T16:28:16.577

Reputation: 2 721

Thanks for your answer! Do you know any implementations which can handle arbitrary machine-learning pipelines (that output confidence)? – Pieter – 2017-04-18T22:30:16.210

The query strategies on Wikipedia form a really nice list of the possibilities! – Pieter – 2017-04-19T07:08:15.923