How can I discover topics in a social media data-set?



I'm working on a project and I need to discover the topics present in a social media data set. For instance, I want to extract the topics in 200K tweets. Can anyone recommend machine learning algorithms for this?

amir ladhar

Posted 2016-06-16T10:14:26.183

Reputation: 81

What do you mean by "topics"? Are you sure that a "topic" from topic modelling (like LDA) matches your concept of a topic? Or do you want to do some clustering or labelling task? – None – 2016-06-20T12:03:13.750

I want to discover topics using two methods: LDA, and clustering or labelling. – amir ladhar – 2016-06-20T14:48:22.727



You can take a look at Latent Dirichlet Allocation. In my experience this does very well without too much effort. You need to remove words that don't help like stopwords (and in your case Twitter handles and probably URLs) before feeding it to the algorithm. The only really important parameter that you need to give it is the number of topics. This will depend on your population (are these random tweets, or only tweets from a specific subgroup/hashtag?) and you need to compare some settings. What you can do is print the most important words per topic and see if they indeed do belong together.

If there are different languages in your tweets you need to deal with that beforehand, maybe classify them on language and only keep the English ones for example.

Jan van der Vegt

Posted 2016-06-16T10:14:26.183

Reputation: 8 538

Thank you, Jan, for your reply. However, are there any algorithms other than topic modelling ones? My teacher asked me to find alternatives. – amir ladhar – 2016-06-17T00:04:59.833


A different direction (although not necessarily a better one) is to cluster the texts you receive, perhaps using an algorithm that doesn't require many input parameters, such as the number of clusters. Note that not every text clustering algorithm will do - some are optimized for clustering much longer texts. This paper gives a survey of short text clustering methods: It's not the newest, but it's a starting point. From experience, I agree with @Jan van der Vegt that it's recommended at the very least to look at English separately from other languages.


Posted 2016-06-16T10:14:26.183

Reputation: 31

For example, say I clustered my data set into N clusters and I want to know the topic of each one. How can I automatically assign a category X to a cluster produced by unsupervised ML? – amir ladhar – 2016-06-20T06:30:49.003

How about looking at the most common words or phrases from each cluster (filtering out stopwords)? – Sharon – 2016-06-20T07:54:30.777

Looking at the most common words would be much like LDA, an unsupervised solution that returns the topics. In that case it would be easier to apply LDA to my Twitter data set directly, without any clustering. However, I'm searching for alternatives to LDA, using unsupervised or supervised techniques... – amir ladhar – 2016-06-20T09:04:26.597


One approach could be the following:

1) Make a vector representation of every text in the data set, for example with the tf-idf technique.

2) Take the first vector and put it in a pile.

3) Enter in the following loop:

3a) take the next vector and compute the cosine similarity between this vector and the centroid of each built pile.

3b) if the highest of these cosine similarities exceeds a predefined threshold, add this document representation to the corresponding pile. Otherwise, start a new pile with this vector.

3c) recompute the centroid of each modified pile.

This algorithm groups similar tweets, which we assume are related to the same topic.
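The steps above can be sketched in plain Python as follows. This is only one possible reading of the procedure: the function names, the whitespace tokenizer, the particular tf-idf weighting, and the threshold of 0.2 are assumptions made for the example, and a document joins its most similar pile when the similarity is at least the threshold.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Step 1: one L2-normalised sparse tf-idf vector (a dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {t: c * (math.log(n_docs / doc_freq[t]) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def pile_cluster(vectors, threshold=0.2):
    """Steps 2-3: single pass that assigns each vector to a pile."""
    piles = []      # each pile is a list of document indices
    centroids = []  # centroids[k] is the centroid of piles[k]
    for i, vec in enumerate(vectors):
        sims = [cosine(vec, c) for c in centroids]                        # 3a
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:                  # 3b
            piles[best].append(i)
            centroids[best] = centroid([vectors[j] for j in piles[best]])  # 3c
        else:
            piles.append([i])
            centroids.append(vec)
    return piles


# Toy data, invented for this example only.
docs = [
    "new phone battery life is great",
    "phone battery drains fast",
    "football match tonight",
    "great football match result",
]
print(pile_cluster(tfidf_vectors(docs), threshold=0.2))
```

Note that the result depends on the order in which the documents are processed and on the threshold, so it is worth trying a few threshold values and shuffling the input to see how stable the piles are.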

Federico Caccia

Posted 2016-06-16T10:14:26.183

Reputation: 660