While trying to replicate an ML model similar to the one described in this paper, I eventually got good clustering results on some sample data after a bit of tweaking. By "good" results, I mean that

- Each observation was put in a cluster with high probability (as opposed to being split roughly 50/50 between two clusters, or something similar)
- A high proportion of the observations were put in the correct cluster, indicating that the model actually did work.

For example, if we had observation $a$ belonging to cluster $A$ and observation $b$ belonging to cluster $B$, then the model might output `(0.99, 0.01)` for observation $a$ (where the `0.99` indicates a high probability of $a$ belonging to $A$ and the `0.01` a low probability of belonging to $B$) and `(0.02, 0.98)` for observation $b$. (These specific numbers were chosen arbitrarily, but in general the good results give probabilities close to 0 and 1.)

However, every few times I train the model, I get a weird result: one that still 'clusters' the data in a sense, but is technically wrong. A bad result might give `(0.99, 0.01)` for observation $a$ (which is good), but then something like `(0.65, 0.35)` for observation $b$. In this case, when I look at the output I can tell the data has been "clustered", since the observations belonging to $A$ give different cluster probability distributions than the observations of $B$, but the model has not accomplished either of the two goals above when actually assigning clusters.
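To make the distinction concrete, here is a small sketch of how one could quantify the difference between a "good" and a "bad" run; the example arrays and the 0.9 confidence threshold are my own illustrative assumptions, not from the paper:

```python
import numpy as np

# Hypothetical posterior cluster probabilities, one row per observation.
good = np.array([[0.99, 0.01],   # observation a, confidently in A
                 [0.02, 0.98]])  # observation b, confidently in B
bad = np.array([[0.99, 0.01],    # observation a, confidently in A
                [0.65, 0.35]])   # observation b, only weakly assigned

def confident_fraction(probs, threshold=0.9):
    """Fraction of observations whose top cluster probability exceeds threshold."""
    return float(np.mean(probs.max(axis=1) > threshold))

print(confident_fraction(good))  # 1.0 -- every observation confidently assigned
print(confident_fraction(bad))   # 0.5 -- observation b is split 65/35
```

A metric like this makes "good vs. weird run" an automatic check rather than a visual inspection.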

I would like to make my model more robust so that I get "good" results more often, but I am not sure how. One thing that would certainly help is to run the cluster training many times and take the average result; the problem is that each training run takes on the order of hours, so running several could take days, which I want to avoid.
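For what it's worth, with likelihood-based clustering models a common alternative to averaging full runs is to do several cheap restarts from different random initializations and keep only the run with the best likelihood. A minimal sketch of that idea, using scikit-learn's `GaussianMixture` as a stand-in for the actual model (the toy data and number of restarts are assumptions for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy two-cluster data standing in for the real observations.
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

best_model, best_ll = None, -np.inf
for seed in range(5):  # several restarts instead of averaging full runs
    gm = GaussianMixture(n_components=2, random_state=seed).fit(X)
    ll = gm.score(X)  # mean log-likelihood of the data under this fit
    if ll > best_ll:
        best_model, best_ll = gm, ll

probs = best_model.predict_proba(X)  # posterior cluster probabilities
```

`GaussianMixture` even exposes this directly via its `n_init` parameter; whether the same trick transfers to the paper's model depends on how expensive one initialization-plus-fit is there.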

If there are any (hopefully quick) fixes that would help me avoid these odd results, I would love to hear them. I can also answer questions if you need more information, but the paper I linked should have most of the pertinent details.

+1 for the referenced paper – Nikos M. – 2020-07-26T08:53:44.883

Your analysis indicates that this clustering method may be more sensitive to initialization than the paper suggests. You should contact the author about this. – Pedro Henrique Monforte – 2020-08-04T02:45:38.903