Biasing an SVM algorithm towards a particular subset of data


I'm training an SVM model for sentiment analysis, based on social media data, e.g. tweets.

The model will be trained on a small selection of a particular company's tweets in order to classify new ones. However, since that training set is too small to produce an accurate model, I will combine the company's data with a much larger dataset of general tweets.

Because it is specific to one company, the content of the company data differs slightly from the content of the general dataset. Since the data to be predicted is also company-specific, it seems logical to me to bias the model's training towards the company-related tweets, giving them greater importance in order to improve accuracy. My first thought was simply to increase the magnitude of the polarity of the company's tweets, i.e. general tweets are labelled -1 or 1 and company tweets are labelled -3 or 3, for example.

Is this the right idea/method?


Posted 2020-06-24T15:24:59.793

Reputation: 11



I don't think that's a very good idea: the goal is not to make the model predict a more extreme polarity when the tweet relates to the company.

Instead, you might want to consider oversampling the few instances from this specific company. For instance, if you have 100 company-specific tweets and 1000 general tweets in your training set, you could duplicate the company-specific ones 10 times so that they carry a higher weight in the data. If possible, you should tune the number of duplications as a parameter in order to find the optimal value.
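As a minimal sketch of the duplication step, assuming the training data is held as plain lists of (text, label) pairs: the helper name `oversample` and the toy tweets are hypothetical and only illustrate the idea. Note that the labels stay at -1 and 1; only the number of copies changes.

```python
def oversample(company, general, factor):
    """Return a training list in which each company example appears `factor` times."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    # List repetition duplicates the company examples; general examples appear once.
    return general + company * factor


# Toy (text, label) pairs; hypothetical data for illustration only.
company = [("love the new ACME app", 1), ("ACME support was useless", -1)]
general = [("great day", 1), ("awful traffic", -1), ("nice weather", 1)]

train = oversample(company, general, factor=10)
texts = [text for text, _ in train]
labels = [label for _, label in train]
```

The resulting `texts` and `labels` can then be vectorised and passed to whichever SVM implementation you use. If that implementation accepts per-sample weights (scikit-learn's `SVC.fit` has a `sample_weight` argument, for example), weighting the company tweets there is an alternative to physically duplicating them, and the weight can be tuned in the same way as the duplication factor.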


Posted 2020-06-24T15:24:59.793

Reputation: 12 600


Try duplicating the specific company's data ten times or more, and include more samples from the company-specific data in the cross-validation/test data (3:1). I hope this improves your results.
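One way this could be set up, sketched under assumptions: hold out some company tweets for validation/testing first, and duplicate only the remaining ones, so that no copy of a held-out tweet ends up in the training set. The function name, the `holdout_fraction` default and the toy data below are hypothetical, and the sketch does not try to interpret the 3:1 ratio above.

```python
import random


def split_then_oversample(company, general, factor, holdout_fraction=0.25, seed=0):
    """Hold out part of the company data, then duplicate the rest `factor` times."""
    rng = random.Random(seed)
    shuffled = company[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]             # evaluated on, never duplicated
    company_train = shuffled[n_holdout:]
    train = general + company_train * factor   # duplication happens after the split
    return train, holdout


# Toy (text, label) pairs; hypothetical data for illustration only.
company = [(f"company tweet {i}", 1 if i % 2 else -1) for i in range(8)]
general = [(f"general tweet {i}", 1 if i % 2 else -1) for i in range(20)]

train, holdout = split_then_oversample(company, general, factor=10)
```

Evaluating on `holdout` measures performance on the kind of tweets the model will actually see, which also makes it a reasonable set for comparing different duplication factors.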

Muhammad Shahzad

Posted 2020-06-24T15:24:59.793

Reputation: 3