While trying to emulate an ML model similar to the one described in this paper, I eventually got good clustering results on some sample data after a bit of tweaking. By "good" results, I mean that:
- Each observation was put in a cluster with high probability (as opposed to being split 50/50 between two clusters or something similar)
- A high proportion of the observations were put in the correct cluster, indicating that the model actually did work.
For example, if we had observation $a$ that belonged to cluster $A$, and observation $b$ belonging to cluster $B$, then the model might output (0.99, 0.01) for observation $a$ (where 0.99 indicates a high probability of $a$ belonging to $A$ and 0.01 a low probability of it belonging to $B$) and (0.02, 0.98) for observation $b$. (These specific numbers are made up for illustration, but in general the good results give probabilities close to 0 and 1.)
However, every few times I train the model, I get a weird result: one that still seems to 'cluster' the data in a sense, but is technically wrong. A bad result would give something like
(0.99, 0.01) for observation $a$ (which is good), but then give something like (0.65, 0.35) for observation $b$. So when I look at the output, I can tell that the data has been "clustered", in the sense that the observations belonging to $A$ get different cluster probability distributions from the observations belonging to $B$, but the model has not accomplished either of the two goals above when actually assigning clusters.
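To make the two criteria concrete, here is a minimal sketch of how they could be checked on the example numbers above. It assumes the model output is available as an array of per-cluster probabilities, that column 0 corresponds to cluster $A$ and column 1 to $B$, and that the true labels are known for the sample data. The function name `summarize_assignments` and the 0.9 confidence threshold are arbitrary choices for illustration, not something taken from the paper.

```python
import numpy as np

def summarize_assignments(probs, true_labels, threshold=0.9):
    """Score soft cluster assignments against the two 'good result' criteria.

    probs:       (n_observations, n_clusters) array of cluster probabilities
    true_labels: known cluster index for each observation (assumed available)
    threshold:   hypothetical cutoff for calling an assignment 'confident'
    """
    probs = np.asarray(probs, dtype=float)
    true_labels = np.asarray(true_labels)

    predicted = probs.argmax(axis=1)             # most likely cluster per observation
    confident = probs.max(axis=1) >= threshold   # criterion 1: high-probability assignment

    return {
        "confident_fraction": float(confident.mean()),
        "accuracy": float((predicted == true_labels).mean()),  # criterion 2
    }

# Observation a belongs to cluster A (index 0), observation b to cluster B (index 1).
good = summarize_assignments([[0.99, 0.01], [0.02, 0.98]], [0, 1])
bad = summarize_assignments([[0.99, 0.01], [0.65, 0.35]], [0, 1])

print(good)
print(bad)
```

On the "good" example both numbers come out as 1.0; on the "bad" example both drop to 0.5, because observation $b$ is neither confidently assigned nor placed in cluster $B$.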
I would like to make my model more robust so that I get "good" results more often, but I don't know how to go about it. One thing that would definitely help would be to run the training many times and average the results; the problem is that each training run takes on the order of hours, so doing it multiple times could take days, which I want to avoid.
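For reference, a rough sketch of what averaging over runs could look like. It assumes each run yields an array of cluster probabilities for the same observations in the same order, and that cluster indices may be swapped between runs (an assumption on my part, though common for clustering models), so each run is first aligned to the first one. The helper names are hypothetical, and the brute-force search over column orderings is only sensible for a small number of clusters.

```python
from itertools import permutations

import numpy as np

def align_to_reference(reference, run):
    """Reorder the columns of `run` so its clusters best match `reference`."""
    k = reference.shape[1]
    # overlap[i, j] is large when reference cluster i and run cluster j
    # receive high probability on the same observations.
    overlap = reference.T @ run
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(overlap[i, perm[i]] for i in range(k)),
    )
    return run[:, list(best)]

def average_runs(runs):
    """Average cluster probabilities over runs after aligning cluster indices."""
    runs = [np.asarray(r, dtype=float) for r in runs]
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    return np.mean(aligned, axis=0)

# Two made-up runs that agree on the clustering but number the clusters differently.
run_1 = [[0.99, 0.01], [0.02, 0.98]]
run_2 = [[0.03, 0.97], [0.96, 0.04]]

averaged = average_runs([run_1, run_2])
print(averaged)
```

Without the alignment step, averaging these two runs directly would give roughly (0.5, 0.5) for every observation, which is exactly the kind of uninformative output described above.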
If there are any (hopefully quick) fixes I can make to avoid these odd results, I would love to hear them. I can also answer questions if you need more information, but the paper I linked should have most of the pertinent info.