I am classifying websites by category (document/text classification), using the website content (tokenized, stemmed and lowercased) as input.
My problem is that one category is heavily over-represented: roughly 70% of my data points (about 4,000) belong to this one category, while about 20 other categories make up the remaining 30%, and some of those have fewer than 50 data points.
My first question:
What could I do to improve the accuracy of my classifier, given that some of the labels have very little data? Should I simply discard a proportion of the data points in the over-represented category? Should I use something other than Gaussian Naive Bayes with tf-idf?
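For concreteness, the "discard a proportion" option could look roughly like the sketch below. It is illustrative only: the function name `downsample_majority`, the cap of 25 per class, and the toy labels are assumptions made for the example, not taken from the actual pipeline. It uses only the Python standard library and a fixed seed so that the retained subset is the same on every run.

```python
import random
from collections import Counter, defaultdict


def downsample_majority(docs, labels, max_per_class, seed=0):
    """Keep at most `max_per_class` randomly chosen examples per label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_label[label].append(doc)

    kept_docs, kept_labels = [], []
    for label in sorted(by_label):  # sorted for a stable output order
        items = by_label[label]
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        kept_docs.extend(items)
        kept_labels.extend([label] * len(items))
    return kept_docs, kept_labels


# Toy data mimicking the imbalance: one dominant label, two rarer ones.
labels = ["News"] * 70 + ["Entertainment"] * 20 + ["Sports"] * 10
docs = [f"doc {i}" for i in range(len(labels))]

small_docs, small_labels = downsample_majority(docs, labels, max_per_class=25)
print(Counter(small_labels))
```

Only classes above the cap lose data points; the small categories are left untouched, so this reduces the imbalance without making the sparse labels any sparser.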
My second question:
After I perform the classification, I save the tf-idf vector as well as the classifier to disk. However, when I rerun the classification on the same data, I sometimes get different results from what I initially got (for example, a data point that was previously classified as "Entertainment" might now be classified as "News"). Is this indicative of an error in my implementation, or is it expected?
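One way to narrow this down would be to check whether the saved objects, once reloaded, reproduce the original predictions on identical input. The sketch below assumes scikit-learn's `TfidfVectorizer` and `GaussianNB` and uses `pickle` for persistence; the question does not name a library or storage format, so these are assumptions, and the tiny training set is made up purely for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up toy corpus; the real data would be the preprocessed website text.
train_docs = [
    "movie film actor cinema",
    "film star premiere movie",
    "election government policy vote",
    "minister parliament vote policy",
]
train_labels = ["Entertainment", "Entertainment", "News", "News"]
new_docs = ["actor premiere film", "parliament election vote"]

# Fit once. GaussianNB needs dense input, hence .toarray().
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs).toarray()
clf = GaussianNB().fit(X_train, train_labels)
before = clf.predict(vectorizer.transform(new_docs).toarray())

# Round-trip both fitted objects through pickle, as when saving to disk.
blob = pickle.dumps((vectorizer, clf))
loaded_vectorizer, loaded_clf = pickle.loads(blob)

# Reuse the loaded vectorizer with transform(); do not call fit again.
after = loaded_clf.predict(loaded_vectorizer.transform(new_docs).toarray())

same = list(before) == list(after)
print(same)
```

If a check like this passes but the results still differ between runs, the difference is more likely to come from refitting the vectorizer or classifier, or from a change in the preprocessing, than from the saved objects themselves.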