K-means clustering of word embeddings gives strange results



I'm trying to cluster words based on pre-trained embeddings. I ran a simple experiment: I collected around 100 words relating to "food taste", looked up their word embeddings in a pre-trained set, and ran k-means on the result.

I get sensible results for some clusters, but very strange results for others. For example:

Cluster 1: [fatty, oily, greasy] -- (good)
Cluster 2: [crumbly, powdery, grainy, flakey, chalky] -- (good)
Cluster 3: [flavorful, hearty, unflavored, savory, full-bodied] -- (bad)
Cluster 4: [seasoned, unseasoned] -- (bad)

Can anyone suggest why seemingly opposite words like (seasoned, unseasoned) and (flavorful, unflavored) would be clustered together?

What I tried:

1) Using fastText embeddings and GloVe embeddings. The latest results are from concatenating the fastText Wikipedia and Common Crawl embeddings.

2) Normalizing the vectors to the same length before running k-means (using Euclidean distances). I think this is roughly equivalent to using cosine distances. I also tried not normalizing, but normalizing gave better results.

3) Tried some other clustering methods like DBSCAN, but k-means seems better.

4) Tried PCA to reduce the dimensionality of the word vectors; it didn't change the results much. I fit PCA only on the embeddings of the words in my vocabulary that need to be clustered, not on the whole word set.
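
To make point 2 concrete: if "same length" means unit length (an assumption here), then squared Euclidean distance between two normalized vectors equals 2 * (1 - cosine similarity), so Euclidean k-means on normalized vectors orders neighbours the same way cosine distance would. A minimal sketch that checks this numerically; the vectors are random stand-ins, not real embeddings, and the helper names are made up for illustration:

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Random stand-in vectors (hypothetical data, 300 dimensions chosen arbitrarily).
random.seed(0)
dim = 300
a = [random.gauss(0, 1) for _ in range(dim)]
b = [random.gauss(0, 1) for _ in range(dim)]

lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2.0 * (1.0 - cosine_similarity(a, b))

# The two quantities should agree up to floating-point error.
identity_holds = math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12)
print(identity_holds)
```

Note that this only relates pairwise distances; k-means centroids are means of unit vectors and are generally not unit length themselves, so this is close to, but not exactly, spherical k-means.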

Any suggestions to improve my results are welcome. If anyone has come across research articles discussing similar issues, please post them as well.

Thanks! TNC


Posted 2018-04-27T00:38:13.767

Reputation: 101

Antonyms are still similar by distribution, or context. If your goal is to separate them, try a different model, such as LWET: Revisit Word Embeddings with Semantic Lexicons for Modeling Lexical Contrast. Welcome to the site!

– Emre – 2018-04-27T01:15:44.750


Paragram-SL999 is a large set of word embeddings that gets nearly human performance on a word-word similarity benchmark. You might get better results with these: https://www.cs.cmu.edu/~jwieting/

– Russell Richie – 2018-09-11T19:51:12.363



Word embeddings are trained by substitutability, not similarity.

Consider a sentence like "This food is unflavored." A good substitute word would be "flavored": the sentence is still "correct" after the swap.

In many cases, substitutability arises from similarity (crunchy, crispy), but it also arises from opposites. You may consider "king" and "queen" to be opposites, too.
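
As a rough illustration of the point above, here is a toy sketch that builds bag-of-context count vectors from a tiny made-up corpus. This is not how fastText or GloVe are actually trained; the corpus, function names, and the simple co-occurrence counting are all assumptions chosen to show the effect. Words that appear in the same contexts end up with near-identical vectors, whether they are synonyms or opposites:

```python
import math
from collections import Counter, defaultdict

# Tiny invented corpus: the opposites share exactly the same contexts.
corpus = [
    "this food is flavored",
    "this food is unflavored",
    "this chip is crunchy",
    "this chip is crispy",
]


def context_vectors(sentences):
    """Map each word to a Counter of the other words in its sentences."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:
                    vectors[word][context] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


vecs = context_vectors(corpus)
sim_opposites = cosine(vecs["flavored"], vecs["unflavored"])
sim_synonyms = cosine(vecs["crunchy"], vecs["crispy"])
sim_unrelated = cosine(vecs["flavored"], vecs["crunchy"])

print(sim_opposites, sim_synonyms, sim_unrelated)
```

In this toy setting the opposite pair is exactly as similar as the synonym pair, and both are more similar than the cross-topic pair, which is the same pattern as the (seasoned, unseasoned) cluster in the question.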

If you need opposites separated, you should probably use a supervised approach.

Has QUIT--Anony-Mousse

Posted 2018-04-27T00:38:13.767

Reputation: 7 331