Improving Naive Bayes accuracy for text classification



I am performing document (text) classification to predict the category of websites, using the website content (tokenized, stemmed, and lowercased) as features.

My problem is that I have one over-represented category with vastly more data points than any other: roughly 70% (about 4000) of my data points belong to this one category, while about 20 other categories make up the remaining 30%, some of them with fewer than 50 data points.

My first question:

What could I do to improve the accuracy of my classifier in this case of sparse data for some of the labels? Should I simply discard a certain proportion of the data points in the category which is over-represented? Should I use something other than Gaussian Naive Bayes with tf-idf?

My second question:

After I perform the classification, I save the tf-idf vectorizer as well as the classifier to disk. However, when I re-run the classification on the same data, I sometimes get different results from the initial run (for example, a data point previously classified as "Entertainment" might now receive "News"). Is this indicative of an error in my implementation, or is it expected?


Posted 2014-12-22T09:31:31.747

Reputation: 261

It's hard to say about Gaussian NB with tf*idf, but I have successfully used the multinomial variant (and sometimes the binomial one). The multinomial distribution is much easier to reason about, and it maps clearly to the text classification task (the value of a feature is the number of occurrences of a word in the text; the probability of a specific text follows a multinomial distribution over all the words in it). Class imbalance is also one of the most important components of an NB classifier (recall the prior P(C)). You can omit it if that improves your results, but then it's no longer NB; it's closer to plain MLE. – ffriend – 2014-12-30T22:58:11.760
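Following the suggestion in the comment above, a minimal sketch of multinomial Naive Bayes over tf-idf features with scikit-learn might look like this (the corpus and labels here are toy data, not the asker's):

```python
# Minimal sketch: multinomial NB over tf-idf features (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

corpus = [
    "movie review celebrity gossip film",
    "film star premiere red carpet",
    "election government policy vote",
    "parliament vote policy debate",
]
labels = ["Entertainment", "Entertainment", "News", "News"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse tf-idf matrix
clf = MultinomialNB().fit(X, labels)   # tf-idf values are non-negative, so this is valid

print(clf.predict(vectorizer.transform(["celebrity film premiere"])))  # → ['Entertainment']
```

MultinomialNB handles the sparse, non-negative tf-idf matrix directly, whereas GaussianNB would require densifying it and assumes a Gaussian per feature, which is a poor fit for term weights.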

Based on my experience as a newbie: yes, I've seen that discarding some elements that were introducing noise (they were not helping to form a linear continuity) ultimately helped in achieving better results. – Andrea Moro – 2020-03-17T11:38:40.897



Regarding your first question...

Do you anticipate the majority category to be similarly over-represented in real-world data as it is in your training data? If so, perhaps you could perform two-step classification:

  1. Train a binary classifier (on all your training data) to predict membership (yes/no) in the majority class.
  2. Train a multi-class classifier (on the rest of the training data) to predict membership in the remaining minority classes.
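The two steps above could be sketched roughly as follows (function names are hypothetical, and MultinomialNB is just one possible base classifier for each stage):

```python
# Sketch of the two-step scheme: a binary majority-vs-rest classifier,
# then a multi-class classifier over the minority classes only.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def fit_two_step(X, y, majority_label):
    X, y = np.asarray(X), np.asarray(y)
    is_major = y == majority_label
    binary = MultinomialNB().fit(X, is_major)          # step 1: trained on all data
    multi = MultinomialNB().fit(X[~is_major], y[~is_major])  # step 2: minority classes only
    return binary, multi

def predict_two_step(binary, multi, X, majority_label):
    # Use the minority prediction unless step 1 says "majority".
    return np.where(binary.predict(X), majority_label, multi.predict(X))

# Toy count features: feature 0 marks the majority class.
X = [[3, 0, 0], [2, 1, 0], [3, 0, 1], [0, 3, 0], [0, 0, 3]]
y = ["Major", "Major", "Major", "News", "Sport"]
binary, multi = fit_two_step(X, y, "Major")
preds = predict_two_step(binary, multi, X, "Major")  # recovers the labels on this toy data
```

One design note: `np.where` is used for the final combination so the output dtype can hold both the (possibly longer) majority label and the minority labels without truncation.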

Charlie Greenbacker


Reputation: 1 451


The correct term for what you're describing is 'class imbalance' (or the 'class imbalance problem'). It would be great if you could include that term in the title to make it more meaningful.

Concerning your first question:

Have you plotted a confusion matrix of the resulting classifications? Is the reason the accuracy is unsatisfactory really that instances are being wrongly classified as the most common class?
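As a quick sketch of what to look for, here is a confusion matrix on toy labels; off-diagonal mass in the first column would mean minority instances are being absorbed by the majority class:

```python
# Toy example: rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

y_true = ["Major", "Major", "News", "Sport", "News"]
y_pred = ["Major", "Major", "Major", "Major", "News"]
labels = ["Major", "News", "Sport"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[2 0 0]
#  [1 1 0]
#  [1 0 0]]
# Here one "News" and one "Sport" instance fell into the "Major" column.
```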

Given your application context, it seems you could apply a certain degree of oversampling. How much oversampling to apply should depend on the frequency of each underrepresented class.

If there is reasonably high variation in the value distributions of the underrepresented class instances, one could argue that this should be taken into account when oversampling.

Also, the approach Charlie suggested in his answer could be considered, provided the instances of the underrepresented classes form a dataset that is suitable for classification on its own.

Edit: I haven't used naive Bayes for text classification yet, so I'm not sure exactly what your attributes look like. Do you just use the frequencies of the terms that scored best with tf-idf? More generally, are your attributes discrete or continuously valued?

If the latter, you should consider applying some kind of discretisation.
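For continuous attributes, one option is to bin them before feeding them to a discrete NB variant; a sketch with scikit-learn's `KBinsDiscretizer` (the bin count here is arbitrary):

```python
# Sketch: discretising a continuous attribute into equal-width bins.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.1], [0.4], [0.35], [0.8], [0.9], [0.6]])
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X)  # each value is replaced by its bin index (0, 1, or 2)
```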

Regarding your second question:

Are you splitting the dataset in any way? Normally, if the classifier has been trained on the same data, the outcome for the same instance should also be the same; a trained model's predictions are deterministic.
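One way to rule out a persistence bug is to save the fitted vectorizer and classifier together, reload them, and check that predictions match exactly; a sketch with joblib (corpus, labels, and file handling are illustrative):

```python
# Sketch: persist vectorizer + classifier together, reload, and compare predictions.
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

corpus = ["sports news today", "movie premiere tonight", "election results news"]
labels = ["News", "Entertainment", "News"]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(corpus), labels)

# Dump both artifacts as one object so they can never get out of sync.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump((vec, clf), path)
vec2, clf2 = joblib.load(path)

same = (clf.predict(vec.transform(corpus)) == clf2.predict(vec2.transform(corpus))).all()
print(same)  # → True
```

If the reloaded pair ever disagrees with the original on identical input, the likely culprit is re-fitting the vectorizer (so the vocabulary indices shift between runs) rather than the classifier itself.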



Reputation: 151