Is there a clustering algorithm which accepts some clusters as input and outputs some more clusters?


Here's the task: I have data I don't know much about. The end goal is to build a classifier that assigns the samples to a few categories. Some of the categories are pretty clear, and we can easily use these as labels for a classifier. But I guess there are more useful categories possible, because right now most of my samples don't belong to any category. As I am no expert in the specific field, I would like to use some clustering algorithm to suggest possible label ideas. When I use traditional clustering algorithms, they find all sorts of patterns in the data that I am not interested in.

So I am looking for a way to tell the algorithm: "Hey, find some clusters in my data, but please take the existing clusters (or labeled data) into account." This should tell the clustering algorithm what I am interested in and what I am not.

Does something like this exist? Or is there any other idea for how to solve the problem of finding additional labels?

BTW: in my case, I am doing NLP.


Posted 2020-10-30T15:15:05.467

Reputation: 91



You are describing semi-supervised learning, where the training dataset is only partially labeled.

One common set of techniques to solve that problem is active learning. In active learning, there is a learning loop where the algorithm makes predictions and a human corrects those predictions.

Pre-clustering is a specific active learning technique that addresses the problem you describe. The goal is to repeatedly select the most representative training examples for labeling while avoiding labeling several samples from the same cluster. "Active Learning Using Pre-clustering" by Nguyen and Smeulders goes into greater detail.
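As a rough illustration of the selection idea only (this is not the method from the Nguyen and Smeulders paper), the sketch below assumes cluster assignments are already available from any clustering algorithm. For each cluster that has no labeled sample yet, it picks the sample closest to the cluster centroid as the next one to show to a human. The function name `pick_queries` and the toy data are hypothetical.

```python
import math

def pick_queries(points, assign, labeled_idx):
    """Return one representative sample index per cluster that has no label yet."""
    clusters = {}
    for i, c in enumerate(assign):
        clusters.setdefault(c, []).append(i)

    queries = []
    for c, members in sorted(clusters.items()):
        if any(i in labeled_idx for i in members):
            continue  # this cluster is already covered by a label, skip it
        # Centroid = per-dimension mean of the cluster members.
        centroid = [sum(col) / len(members)
                    for col in zip(*(points[i] for i in members))]
        # The member nearest to the centroid is treated as "most representative".
        queries.append(min(members, key=lambda i: math.dist(points[i], centroid)))
    return queries

# Toy 2-D data (assumed); in NLP these would be document embeddings.
points = [(0, 0), (0.2, 0.1), (0.1, 0.3),
          (5, 5), (5.2, 4.8), (4.9, 5.3),
          (10, 0), (10.3, 0.1), (9.8, -0.2)]
assign = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # assumed output of some clustering step
labeled_idx = {0}                     # only sample 0 has a label so far

queries = pick_queries(points, assign, labeled_idx)
print(queries)
```

Each returned index is a sample to hand to the annotator. After labeling, the loop would be repeated with the updated `labeled_idx`.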

Brian Spiering

Posted 2020-10-30T15:15:05.467

Reputation: 10 864

Thanks for your answer. As a matter of fact, I trained my models with active learning (which works great). But the problem still exists that I am not sure which new labels to add, because I don't have an overview of my data (as I am no domain expert). That's why I wanted to use a semi-supervised clustering algorithm that proposes new clusters, to see if there are other clusters I haven't seen before. On the other hand, I don't want to add new labels if there are only a few samples for that specific label (in that case I have to look for a more general label). Any idea? – chefhose – 2020-11-02T18:45:16.150


You basically have partially labeled data. You can cluster the data without regard to the labels, and then assign each unlabeled sample the majority label found in its cluster.
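A minimal sketch of this idea, using only the Python standard library: a tiny k-means with a deliberately naive initialization (the first `k` points), followed by a majority vote per cluster. The helper names, the toy points, and the labels are all assumptions for illustration. In practice a library implementation of the clustering step would be the more sensible choice.

```python
from collections import Counter
import math

def kmeans(points, k, iters=20):
    """Very small k-means; initial centroids are simply the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def propagate_majority(assign, labels):
    """Fill missing labels (None) with the majority label of the same cluster."""
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    majority = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    # Clusters without any labeled sample keep None.
    return [y if y is not None else majority.get(a)
            for a, y in zip(assign, labels)]

# Toy 2-D data (assumed); None marks an unlabeled sample.
points = [(0, 0), (5, 5), (10, 0),
          (0.2, 0.1), (5.1, 4.9), (9.8, 0.2),
          (0.1, 0.3), (4.8, 5.2), (10.1, -0.1)]
labels = ["sports", "politics", None, None, None, None, "sports", None, None]

assign = kmeans(points, k=3)
filled = propagate_majority(assign, labels)
unlabeled_clusters = sorted({a for a, y in zip(assign, filled) if y is None})
print(filled)
print(unlabeled_clusters)
```

Clusters listed in `unlabeled_clusters` contain no labeled sample at all, so they may be worth inspecting by hand as candidates for a new label.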

The same approach can be done using KNN. Simply try KNN with different values of K and different distance metrics on a validation split of your labeled training data, and when it shows good performance, apply it to the whole dataset to predict the labels of the unlabeled samples.
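The KNN variant could look roughly like the sketch below, again in plain Python. The candidate values of K, the two metrics, the split, and the toy data are assumptions. A real setup would use a proper random or stratified validation split and a library KNN implementation.

```python
from collections import Counter
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k, metric):
    """Majority label among the k training samples closest to query."""
    neighbours = sorted(train, key=lambda item: metric(item[0], query))[:k]
    return Counter(y for _, y in neighbours).most_common(1)[0][0]

def accuracy(train, val, k, metric):
    hits = sum(knn_predict(train, x, k, metric) == y for x, y in val)
    return hits / len(val)

# Toy labeled data (assumed), as (features, label) pairs.
labeled = [((0, 0), "a"), ((0.2, 0.1), "a"), ((0.1, 0.3), "a"), ((0.3, 0.2), "a"),
           ((5, 5), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"), ((5.2, 5.1), "b")]
unlabeled = [(0.1, 0.1), (4.9, 5.0)]

# Hold out one sample per class as a small validation split.
val = [labeled[3], labeled[7]]
train = [item for item in labeled if item not in val]

# Score every (K, metric) combination on the validation split.
metrics = {"euclidean": math.dist, "manhattan": manhattan}
scores = {(k, name): accuracy(train, val, k, fn)
          for k in (1, 3, 5) for name, fn in metrics.items()}
best_k, best_metric = max(scores, key=scores.get)

# Refit on all labeled data and guess the labels of the unlabeled samples.
predictions = [knn_predict(labeled, x, best_k, metrics[best_metric])
               for x in unlabeled]
print(best_k, best_metric, predictions)
```

If no combination reaches acceptable validation accuracy, that is a hint that the existing labels do not separate well in the chosen feature space, and the predicted labels should not be trusted.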

Kasra Manshaei

Posted 2020-10-30T15:15:05.467

Reputation: 5 323