I want to train a binary classifier for spam detection using a labeled data set of mentions from social media. Each mention has the following features:
URL, text message, posting date, is deleted, author
where the author has its own fields:
URL, nick, type (person or community), number of subscribers, registration date, last activity date
I have $\approx 13.5k$ observations with $\approx 450$ spam messages among them.
My plan so far:
- transform the features into real numbers
- binarize URLs by media source (e.g. facebook, instagram, etc.)
- use an SVM with a Gaussian (RBF) kernel on all the data
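To make the first two steps concrete, here is a minimal sketch in plain Python of how one mention could be turned into a numeric vector. The dictionary keys, the `MEDIA_SOURCES` list, the ISO date format, and the choice of derived date features (account age, days until last activity) are all assumptions for illustration, not part of my actual data schema. The text message is left out on purpose, and the example record is made up.

```python
from datetime import datetime
from urllib.parse import urlparse

# Hypothetical list of media sources to binarize; extend as needed.
MEDIA_SOURCES = ["facebook", "instagram", "twitter"]


def media_source(url):
    """Return the known media source found in the URL's host name, else 'other'."""
    host = urlparse(url).hostname or ""
    for source in MEDIA_SOURCES:
        if source in host.split("."):
            return source
    return "other"


def one_hot_source(url):
    """Binarize the URL: one indicator per known source plus an 'other' slot."""
    source = media_source(url)
    return [1.0 if source == s else 0.0 for s in MEDIA_SOURCES + ["other"]]


def days_between(earlier, later):
    """Whole days from one ISO date string (YYYY-MM-DD) to another."""
    delta = datetime.fromisoformat(later) - datetime.fromisoformat(earlier)
    return float(delta.days)


def mention_to_vector(mention):
    """Map one mention (assumed dict layout) to a list of real numbers."""
    author = mention["author"]
    return one_hot_source(mention["url"]) + [
        1.0 if mention["is_deleted"] else 0.0,
        1.0 if author["type"] == "community" else 0.0,
        float(author["subscribers"]),
        # Account age at posting time, in days.
        days_between(author["registered"], mention["posted"]),
        # Days from the posting date to the author's last activity.
        days_between(mention["posted"], author["last_active"]),
    ]


# Made-up example record, only to show the expected input shape.
example = {
    "url": "https://www.facebook.com/some/post",
    "posted": "2016-03-10",
    "is_deleted": False,
    "author": {
        "type": "person",
        "subscribers": 120,
        "registered": "2016-03-01",
        "last_active": "2016-03-12",
    },
}
print(mention_to_vector(example))
# prints [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 120.0, 9.0, 2.0]
```

Since an RBF kernel depends on distances between vectors, features on very different scales (subscriber counts versus 0/1 indicators) would presumably need to be standardized before the SVM step.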
Questions: Am I on the right track? What else can I do? How should I process the text data?
Thanks in advance for any suggestions!