Spam detection in social media


I want to train a binary classification algorithm for spam detection using a labeled data set of mentions from social media. Each mention has the following features:

URL, text message, posting date, is deleted, author

where the author has its own fields:

URL, nick, type (person or community), number of subscribers, registration date, last activity date

I have $\approx 13.5k$ observations with $\approx 450$ spam messages among them.

My suggestions:

  • transform the features into real numbers
  • binarize URLs by media sources (e.g. facebook, instagram, etc.)
  • use SVM with Gaussian kernel for all data

Questions: Am I on the right track? What else can I do? How can I process the text data?

Thanks in advance for any suggestions!


Posted 2015-07-15T14:15:17.773

Reputation: 163

Can you clarify what you're asking? What have you tried and what is the result? What's the problem? – Sean Owen – 2015-07-15T14:57:37.437

There is no problem, except that this is my first task in machine learning. I just want to make sure that I have chosen the right algorithm and to ask, for example, whether I should normalize the features. I am also still unsure whether it is correct to combine categorical and continuous features. – nmerci – 2015-07-15T15:12:14.460



The approach seems correct, don't worry. But I'd suggest using a Naive Bayes classifier instead of an SVM: it works very well for this task (and popular email services actually use this algorithm for spam detection).
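To make the suggestion concrete, and to touch on the question of how to process the text, here is a minimal sketch of a multinomial Naive Bayes classifier over word counts with Laplace smoothing. It uses only the Python standard library; the tokenizer regex, the `alpha` value, the class name, and the toy messages are assumptions made for illustration, not part of the original data set.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is an assumed default

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)                # documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(token | class) for each class."""
        n_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[c] / n_docs)
            for tok in tokens:
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
texts = [
    "win a free prize now",
    "free followers click here",
    "lovely photo from our trip",
    "meeting moved to friday",
]
labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(texts, labels)
prediction = model.predict("click for a free prize")
```

With real data, the non-text features (author type, number of subscribers, and so on) would need to be handled separately or combined with the text scores; this sketch only covers the message text.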

Maksim Khaitovich

Posted 2015-07-15T14:15:17.773

Reputation: 383


Yes, you are essentially on the right track. You can normalize the data if your text samples vary in length. You can also use categorical data, but do not pack several categories into a single numeric feature. For example, one feature that takes the value 1 for facebook and 2 for instagram is a poor choice, because it imposes an ordering that does not exist. Use a separate {0,1} feature for each category instead. I have two additional suggestions.
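As a rough sketch of the separate {0,1} features described above, the snippet below maps a mention URL to one indicator per source. The list of sources, the "other" bucket, and the function names are assumptions for illustration; only the Python standard library is used.

```python
from urllib.parse import urlparse

# Hypothetical list of known sources; extend it to match the actual data.
SOURCES = ["facebook", "instagram", "twitter"]


def source_of(url):
    """Return the known source found in the URL's host name, else 'other'.

    URLs without a scheme (e.g. 'facebook.com/x') have an empty host under
    urlparse and therefore fall into 'other'.
    """
    host_parts = urlparse(url).netloc.lower().split(".")
    for name in SOURCES:
        if name in host_parts:
            return name
    return "other"


def one_hot_source(url):
    """Encode the source as one {0, 1} feature per category, in SOURCES order plus 'other'."""
    src = source_of(url)
    return [1 if src == name else 0 for name in SOURCES + ["other"]]


features = one_hot_source("https://www.facebook.com/some/post")
```

Each mention gets exactly one 1 among these columns, so no artificial ordering between sources is introduced.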

First, use a systematic grid search instead of guessing the parameters. This is a very good practice that I picked up very late. If you are using a soft-margin SVM (which you should) with the Gaussian kernel, you have just two parameters to search over, C and gamma. A grid search helps you find the best-performing values of both for your classifier.
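The search loop itself is small. The sketch below tries every (C, gamma) pair from two exponentially spaced grids and keeps the best-scoring one. The exponent ranges are an assumed starting point, and `toy_score` is a made-up stand-in: in practice the score function would train the SVM with the given parameters and return a cross-validated metric (something like F1 rather than accuracy, given how few spam samples there are).

```python
import math
from itertools import product


def log_grid(low_exp, high_exp, step=2, base=2.0):
    """Return base**low_exp, base**(low_exp + step), ..., up to base**high_exp."""
    return [base ** e for e in range(low_exp, high_exp + 1, step)]


def grid_search(score, c_values, gamma_values):
    """Evaluate score(C, gamma) on every pair; return the best pair and its score."""
    best_params, best_score = None, float("-inf")
    for c, gamma in product(c_values, gamma_values):
        s = score(c, gamma)
        if s > best_score:
            best_params, best_score = (c, gamma), s
    return best_params, best_score


def toy_score(c, gamma):
    """Stand-in for a cross-validated score; it peaks at C = 2**3, gamma = 2**-5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


c_values = log_grid(-5, 15)      # assumed range for C
gamma_values = log_grid(-15, 3)  # assumed range for gamma
best_params, best_score = grid_search(toy_score, c_values, gamma_values)
```

A common refinement is to run a coarse grid like this first and then a finer grid around the best pair.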

Second, you have 13.5K samples and only 450 are spam, which leaves roughly 13K legitimate mentions. So you have 450 positive samples and about 13K negative samples. If this is indeed the data distribution, your data set is unbalanced. You will have to weight the classes according to their proportions; otherwise the SVM will end up classifying most of the samples as negative. To read about this in more detail, look at Section 7 of this document. Weighted SVMs are implemented in LIBSVM, so you can use that directly.
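One common way to pick the weights is inverse class frequency, so that both classes contribute equally to the total penalty. The sketch below applies that heuristic to the counts from the question; the function name and the label encoding (1 for spam, 0 for non-spam) are assumptions for illustration. The resulting values are what you would pass as per-class weights, for example through LIBSVM's `-wi` options.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Counts from the question: about 13.5k mentions, about 450 of them spam.
labels = [1] * 450 + [0] * 13050
weights = balanced_class_weights(labels)

# Misclassifying a spam sample costs this many times more than a non-spam one.
ratio = weights[1] / weights[0]
```

Here the spam class ends up weighted about 29 times more heavily than the non-spam class, which matches the 13050 : 450 ratio of the class sizes.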


Posted 2015-07-15T14:15:17.773

Reputation: 303