I'm trying to cluster words based on pre-trained embeddings. I ran a simple experiment where I gathered around 100 words relating to "food taste", looked up their word embeddings in a pre-trained set, and ran k-means on the result.
I do get sensible results on certain clusters, but very strange results on others. For example:

- Cluster 1: [fatty, oily, greasy] (good)
- Cluster 2: [crumbly, powdery, grainy, flakey, chalky] (good)
- Cluster 3: [flavorful, hearty, unflavored, savory, full-bodied] (bad)
- Cluster 4: [seasoned, unseasoned] (bad)
Can anyone suggest why seemingly opposite words like (seasoned, unseasoned) and (flavorful, unflavored) would be clustered together?
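For reference, here is a minimal sketch of my pipeline. The word list is abbreviated and the vectors are random placeholders standing in for the real pre-trained embeddings (which I load from the fastText/GloVe files), so the sketch runs on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder vocabulary and vectors; in my real setup the vectors come
# from pre-trained fastText / GloVe lookups, not random draws.
words = ["fatty", "oily", "greasy", "seasoned", "unseasoned", "flavorful"]
vectors = {w: rng.normal(size=300) for w in words}

X = np.stack([vectors[w] for w in words])  # shape: (n_words, 300)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cluster_id in range(2):
    print(cluster_id, [w for w, lab in zip(words, labels) if lab == cluster_id])
```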
What I tried:
1) Using fastText embeddings and GloVe embeddings. The latest results come from concatenating the fastText Wikipedia and Common Crawl embeddings.
2) Normalizing the vectors to unit length before running k-means (with Euclidean distances), which should be roughly equivalent to using cosine distances. I also tried without normalizing, but normalizing gave better results.
3) Trying other clustering methods such as DBSCAN, but k-means seems to work better.
4) Using PCA to reduce the dimensionality of the word vectors; it didn't change the results much. I fit the PCA only on the embeddings of the words I need to cluster, not on the whole vocabulary.
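Steps 2) and 4) above look like this in my code. The input matrix is a random placeholder for my ~100 stacked word vectors, and the component/cluster counts are just the values I happened to use:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))  # placeholder for ~100 stacked word vectors

# Step 2: L2-normalize each row; for unit vectors, Euclidean distance is a
# monotonic function of cosine distance (||u - v||^2 = 2 - 2*cos(u, v)).
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

# Step 4: fit PCA only on the ~100 in-vocabulary vectors being clustered,
# not on the whole pre-trained embedding set.
X_red = PCA(n_components=50, random_state=0).fit_transform(X_norm)

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_red)
```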
Any suggestions to improve my results are welcome. If anyone has come across research articles discussing similar issues, please post them as well.