Clustering of multi-label data


The dataset consists of

1) a set of objects and

2) a set of labels, which are used to describe the objects.

For the moment, for simplicity's sake, each label is marked as either true or false (in a more complex setup, each label would take a value from 1 to 10).

However, not all the labels are actually applied to all the objects (in principle, every label can and should be applied to every object, but in practice they are not). Also, when a label has not been applied to an object, one cannot simply assume that the label's value for that particular object is false. Therefore, the missing labels will be ignored in the model.
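
To make the three states concrete, here is a minimal sketch, assuming a plain-Python encoding in which `None` stands for a label that was never applied. The object names and values are made-up toy data, and `observed_labels` and `coverage` are hypothetical helper names, not part of any library:

```python
# Hypothetical toy data: one row per object, one column per label.
# True/False are applied labels; None means the label was never applied.
objects = {
    "obj_a": [True, False, None, True],
    "obj_b": [True, None, None, False],
    "obj_c": [None, False, True, None],
}

def observed_labels(row):
    """Return {label_index: value} for the labels actually applied."""
    return {i: v for i, v in enumerate(row) if v is not None}

def coverage(row):
    """Fraction of all labels that were applied to this object."""
    return len(observed_labels(row)) / len(row)
```

Keeping `None` separate from `False` is what allows later steps to skip missing labels instead of treating them as false.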

I need to cluster the objects based on their labels.

Any tips on how to do this and which algorithms to use would be appreciated.


Posted 2019-03-31T05:54:52.710

First you need to decide whether you want to do clustering (ignore the labels?) or classification (predict missing labels). – Has QUIT--Anony-Mousse – 2019-03-31T07:13:17.253

Ignore the missing labels. Wrongly predicted missing labels can mess things up. – Yogesch – 2019-03-31T07:42:16.217

That sounds pretty much like the standard setup of recommender systems? – Has QUIT--Anony-Mousse – 2019-03-31T10:40:32.203

Ok, maybe... At first look, the crux of any sort of clustering in a recommendation system is being able to define a "distance" metric between arbitrary points (objects). For each point/object, I have a set {L1, L2, ... Ln} where each Li can be 0, 1, or na. So how do I define this "distance" metric in a consistent/coherent way? Should that be another question? Sorry, I have yet to figure out what's a trivial question and what's a serious question in the datascience business. – Yogesch – 2019-03-31T15:57:48.290

By the way, I have briefly checked various similarity measures, like Euclidean, Cosine, etc., but it is not clear to me which would be most appropriate for my case, or how to decide that. Is there maybe a long list with short descriptions of various possible distance metrics? – Yogesch – 2019-03-31T16:10:37.737
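
One candidate for the {0, 1, na} case is to compare two objects only on the labels applied to both, and to report the fraction on which they disagree. The following is a minimal sketch under that assumption, not a recommendation of a specific measure; `masked_hamming` and its `min_shared` parameter are hypothetical names, and `None` stands in for na:

```python
def masked_hamming(a, b, min_shared=1):
    """Fraction of disagreeing labels among those applied to both objects.

    Returns None when fewer than `min_shared` labels were applied to
    both objects, because there is then too little evidence to compare.
    """
    shared = [(x, y) for x, y in zip(a, b)
              if x is not None and y is not None]
    if len(shared) < min_shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


# Only the first two labels are applied to both objects; they disagree on one.
d = masked_hamming([1, 0, None, 1], [1, 1, None, None])  # 0.5
```

Note that this is a dissimilarity rather than a true metric: with a = [0, 0], b = [None, 0], c = [1, 0] it gives d(a, b) = 0 and d(b, c) = 0 but d(a, c) = 0.5, so the triangle inequality can fail. Clustering methods that only need a pairwise dissimilarity matrix can still work with it, but pairs that return `None` need an explicit policy.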

From your question it wasn't even clear whether you can distinguish 0 and na at all. But think of users giving 0 to 5 stars (or not having rated a product at all): isn't that exactly the recommender system setting you have? – Has QUIT--Anony-Mousse – 2019-03-31T19:26:53.490

No. Most computer systems are able to distinguish between a 0 and a null or na value. So I'm not sure why I'd not be able to distinguish them. It'd be more helpful if you could clarify. – Yogesch – 2019-04-01T05:34:14.413

@Anony-Mousse again, No. I do not have a system where someone rates the object on a scale of 1-5. Like I repeatedly mentioned in the question, I am asking about multiple labels being applied to each object. Each label can take one of a few (two) values. – Yogesch – 2019-04-01T05:36:02.200

Consider each label to be a user! – Has QUIT--Anony-Mousse – 2019-04-01T05:42:49.893

That is a clever idea :-) – Yogesch – 2019-04-01T09:03:14.413
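
Read that way, the object-by-label table plays the role of a ratings matrix with missing entries, with labels in place of users. Below is a rough sketch of one standard technique for that setting: low-rank matrix factorization fitted by stochastic gradient descent on the applied labels only. Nobody in the thread proposed this specific method; the function names, toy matrix, and hyperparameters are all assumptions chosen for illustration.

```python
import random

def factorize(rows, rank=2, steps=500, lr=0.05, reg=0.02, seed=0):
    """Fit rows[i][j] ~ dot(P[i], Q[j]) using only the applied labels."""
    rng = random.Random(seed)
    n, m = len(rows), len(rows[0])
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    # Missing labels (None) never enter the loss.
    observed = [(i, j, float(v)) for i, row in enumerate(rows)
                for j, v in enumerate(row) if v is not None]
    for _ in range(steps):
        for i, j, v in observed:
            err = v - sum(p * q for p, q in zip(P[i], Q[j]))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q

def squared_error(rows, P, Q):
    """Sum of squared errors over the applied labels only."""
    return sum(
        (float(v) - sum(p * q for p, q in zip(P[i], Q[j]))) ** 2
        for i, row in enumerate(rows)
        for j, v in enumerate(row)
        if v is not None
    )

# Hypothetical toy matrix: rows are objects, columns are labels, None is na.
rows = [
    [1, 1, None, 0],
    [1, None, 1, 0],
    [0, 0, None, 1],
    [None, 0, 0, 1],
]
P, Q = factorize(rows)
```

Each row of `P` is then a dense vector for one object, with no missing entries, so an ordinary clustering algorithm could be run on those vectors. Whether the learned vectors are meaningful depends on how many labels are applied per object, which this sketch does not check.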

No answers