How to cluster a Twitter data set?



I have a Twitter data set and I want to extract the topics it covers. So I decided to group my tweets into clusters using an unsupervised machine learning algorithm such as k-means. I chose this because the training process in supervised approaches is too time-consuming.

As a first step, after cleaning my tweets, I will extract features (e.g. hashtags) from them and enrich them with side information from knowledge bases (e.g. Wikipedia). Second, they will be represented in a vector space. Then, using k-means with a given K=6, the enriched tweets will be grouped into 6 clusters.
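A minimal sketch of the pipeline described above, on toy data. Everything here is an assumption for illustration: the tokenizer, the plain term-count vectors, the deterministic seeding, k=2 instead of K=6, and the helper names (`tokenize`, `vectorize`, `kmeans`, `top_terms`). The knowledge-base enrichment step is left out. Listing the highest-weighted terms of each centroid is one common way to put a rough label on a cluster.

```python
import re
from collections import Counter

def tokenize(tweet):
    """Lowercase a tweet and keep words and hashtags as features."""
    return re.findall(r"#?\w+", tweet.lower())

def vectorize(tweets):
    """Turn tweets into dense term-count vectors over a shared vocabulary."""
    counts = [Counter(tokenize(t)) for t in tweets]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd iterations; the first k vectors seed the centroids
    (deterministic seeding is an illustrative shortcut, not a recommendation)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def top_terms(vocab, centroid, n=3):
    """Highest-weighted terms of a centroid, usable as a rough cluster label."""
    ranked = sorted(zip(centroid, vocab), key=lambda p: (-p[0], p[1]))
    return [word for weight, word in ranked[:n] if weight > 0]

tweets = [
    "Loving the new #python release",
    "Great match tonight #football",
    "#python tips for data science",
    "#football transfer rumours again",
]
vocab, vectors = vectorize(tweets)
labels, centroids = kmeans(vectors, k=2)
for j, c in enumerate(centroids):
    print(j, top_terms(vocab, c, n=1))
```

On real tweets the labels will be much noisier than on this toy input, and TF-IDF weighting plus stopword removal would probably be needed before the top terms become readable.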

However, I don't know how to automatically identify the topics related to these clusters. Are there any solutions?

amir ladhar

Posted 2016-06-19T11:52:41.187

Reputation: 81



k-means is very sensitive to noise

because it is designed as a least-squares approach. Noise deviations, when squared, become even larger.
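A small numeric sketch of that effect, using made-up one-dimensional values: four nearby points and one outlier. It compares how much of the total deviation from the mean the outlier accounts for with absolute deviations versus squared deviations.

```python
# Four nearby points plus one outlier (made-up values for illustration).
points = [1.0, 2.0, 3.0, 4.0, 100.0]
mean = sum(points) / len(points)

abs_dev = [abs(p - mean) for p in points]
sq_dev = [(p - mean) ** 2 for p in points]

# Share of the total deviation contributed by the outlier alone.
abs_share = abs_dev[-1] / sum(abs_dev)
sq_share = sq_dev[-1] / sum(sq_dev)

print(f"absolute deviations: {abs_share:.0%} from the outlier")
print(f"squared deviations:  {sq_share:.0%} from the outlier")
```

With these numbers the outlier accounts for half of the absolute deviation but roughly 80% of the squared deviation, which is the quantity k-means minimizes.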

Twitter is mostly noise

Twitter is full of spam and nonsense tweets. These will be entirely unlike any other and thus have the largest deviations.

Chances are you get one "cluster" that contains almost everything, and the other k-1 clusters consist of a few tweets with their duplicates. Clusters are not topics. They are more likely to be duplicates than topics.

An appropriate clustering algorithm for tweets should probably discard 90% of tweets and produce thousands of clusters. But it will rarely be better than finding all tweets in common - most tweets only have 2-3 usable words.

Has QUIT--Anony-Mousse

Posted 2016-06-19T11:52:41.187

Reputation: 7 331

Clear. So what's the alternative for getting topics from a Twitter data set? – amir ladhar – 2016-06-19T23:40:04.720

I don't think Twitter data clusters. Nor will it have well defined "topics". It's too diverse. Topic modeling works reasonably when you have e.g. 20news data which has much longer texts, and much better separated topics. – Has QUIT--Anony-Mousse – 2016-06-20T05:40:06.580

So my recommendation is: don't bother. Find a way to work without this. Why would you need to cluster Tweets? It's not as if there is anything important on Twitter. – Has QUIT--Anony-Mousse – 2016-06-20T05:41:07.523

I wanna discover topics from a large twitter data-set, that's why. I was thinking that clustering can be a solution.. – amir ladhar – 2016-06-20T06:09:10.870

Good luck then... you are looking for something which probably does not exist. – Has QUIT--Anony-Mousse – 2016-06-20T11:33:30.820

@amirladhar you should use a text mining technique like LDA or LSA. – Ricardo Cruz – 2016-06-20T11:38:25.613


Have you found a good approach? I am involved in the same work right now. My approach is the following:

1) Build a vector representation of every text in the data set, for example with TF-IDF.

2) Take the first vector and put it in a pile.

3) Enter in the following loop:

3a) Take the next vector and compute the cosine similarity between it and the centroid of each existing pile.

3b) If one of these cosine similarities exceeds a predefined threshold, add the vector to the corresponding pile (a higher cosine similarity means a closer match). Otherwise, start a new pile with this vector.

3c) Recompute the centroid of each modified pile.

This algorithm finds similar tweets, which we assume are related to the same topic.
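The steps above can be sketched roughly as follows, using only the standard library. Several details are assumptions rather than part of the description: the smoothed IDF formula, the threshold of 0.3, picking the single best-matching pile when more than one would qualify, and treating a similarity at or above the threshold as a match (high cosine similarity means the tweet is close to the pile). The function names are hypothetical.

```python
import math
import re
from collections import Counter

def tfidf_vectors(texts):
    """Step 1: sparse TF-IDF vectors as {term: weight} dicts (smoothed IDF)."""
    docs = [Counter(re.findall(r"#?\w+", t.lower())) for t in texts]
    df = Counter(term for d in docs for term in d)  # document frequency
    n = len(docs)
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
            for d in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def pile_clustering(vectors, threshold=0.3):
    """Steps 2-3: single pass over the vectors, growing piles as it goes."""
    piles, centroids = [], []  # piles hold indices into `vectors`
    for i, v in enumerate(vectors):
        sims = [cosine(v, c) for c in centroids]           # step 3a
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:   # step 3b: join pile
            piles[best].append(i)
            centroids[best] = centroid([vectors[j] for j in piles[best]])  # 3c
        else:                                              # step 3b: new pile
            piles.append([i])
            centroids.append(dict(v))
    return piles

tweets = [
    "new #python release is out",
    "#python release notes are out",
    "great #football match tonight",
    "what a #football match",
]
piles = pile_clustering(tfidf_vectors(tweets))
print(piles)
```

Because the pass is sequential, the result depends on the order of the tweets and on the threshold, so both are worth varying on real data.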

Federico Caccia

Posted 2016-06-19T11:52:41.187

Reputation: 660


Basically, to rephrase your task: you have a large document that you want to summarize. Text mining is your tool. You can choose traditional approaches such as tf-idf, tf, etc. I would recommend using the holmertz technique; within such a framework things get easier, as it can detect stopwords on its own, extract features, etc. Hierarchical clustering can also work; check that you do not get obvious words as cluster centers, since filtering them requires subject-matter knowledge and additional time.

Vitaly Portnoy

Posted 2016-06-19T11:52:41.187

Reputation: 61