Dealing with diverse text data

7

I'm currently working with a dataset that has a wide range of document lengths -- anywhere from a single word to a full page of text. In addition, the grammatical structure and use of punctuation vary wildly from document to document. The goal is to classify these documents into one of about 10-15 categories. I'm using ridge regression and logistic regression for the task, with cross-validation to select the alpha values for ridge. The feature vectors are tf-idf n-grams.
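
For reference, my setup looks roughly like the following (a minimal sketch, assuming scikit-learn; the documents, labels, and classifier choice below are hypothetical placeholders for the real data and model):

    # Minimal sketch of the described setup, assuming scikit-learn.
    # `docs` and `labels` are hypothetical stand-ins for the real data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    docs = [
        "refund",                                                # single-word document
        "please cancel my order and refund the charge",
        "the parcel arrived a week late and the box was damaged",
        "great service, thank you!",
    ]
    labels = ["billing", "billing", "shipping", "feedback"]      # hypothetical categories

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),   # tf-idf unigram + bigram features
        LogisticRegression(max_iter=1000),     # ridge with CV over alpha slots in similarly
    )
    model.fit(docs, labels)
    print(model.predict(["where is my refund"]))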

Recently I've noticed that longer documents are much less likely to be categorized. Why might this be the case, and how can one "normalize" for this kind of variation? As a more general question, how does one typically deal with diverse data sets? Should documents be grouped based on metrics like document length, use of punctuation, grammatical rigor, etc., and then fed through different classifiers?

Madison May

Posted 2014-06-20T14:58:09.320

Reputation: 1 959

Can you clarify your question by defining the goals of this analysis? What is the nature of the 10 to 15 categories? Are these categories you define a priori, or are they clusters suggested by the data itself? It appears that your question is centered on choosing a good data encoding/transformation process rather than on data analysis methods (e.g. discriminant analysis, classification). – MrMeritology – 2014-06-20T18:25:53.873

If your documents range from single words to a full page of text, and you aim to have any combination of document lengths/types in any category, then you'll need to use a very simple encoding method such as Bag of Words. Anything more complicated (e.g. grammar style) won't scale across that range. – MrMeritology – 2014-06-20T18:29:26.120

Answers

5

I'm not sure how you are applying a regression framework to document classification. The way I'd approach the problem is to apply a standard discriminative classification approach such as an SVM.

In a discriminative classification approach, the notion of similarity or inverse distance between data points (documents in this case) is pivotal. Fortunately, for documents there is a standard way of defining pairwise similarity: the standard cosine similarity measure, which uses document length normalization to account for differing document lengths.
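
Concretely, the cosine similarity between two document vectors $d_1$ and $d_2$ is

$$ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}, $$

and the division by the vector norms is exactly the length normalization: only the direction of the tf-idf vector matters, not its magnitude.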

Thus, practically speaking, with cosine similarity you work with relative term weights normalized by document length, and hence document length diversity should not be a major issue in the similarity computation.
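
As a rough illustration (a sketch assuming scikit-learn; the documents and labels below are made up): with L2-normalized tf-idf vectors, cosine similarity reduces to a plain dot product, and a linear SVM can be trained directly on those normalized vectors.

    # Sketch, assuming scikit-learn; `docs` and `labels` are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.svm import LinearSVC

    docs = ["short text about refunds",
            "a much longer document about refunds " * 40]
    X = TfidfVectorizer(norm="l2").fit_transform(docs)  # l2 norm = length normalization

    # With unit-length vectors, cosine similarity is just the dot product,
    # so the two numbers below match and raw length no longer dominates the score.
    print(cosine_similarity(X[0], X[1]))
    print((X[0] @ X[1].T).toarray())

    # A discriminative classifier (e.g. a linear SVM) is then trained on the
    # normalized vectors; the labels here are hypothetical category names.
    labels = ["billing", "shipping"]
    clf = LinearSVC().fit(X, labels)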

One also has to be careful when applying idf in the term weights. If the number of documents is not sufficiently large, the idf measure may be statistically imprecise, adding noise to the term weights. It is also standard practice to ignore stopwords and punctuation.
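
In scikit-learn terms, these points map onto a few vectorizer settings (again a sketch, not a prescription):

    # Sketch, assuming scikit-learn: stop words are dropped explicitly, punctuation
    # is ignored by the default word tokenizer, and idf can be smoothed (or disabled
    # via use_idf=False) when the collection is too small for reliable idf estimates.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        stop_words="english",   # ignore common stopwords
        smooth_idf=True,        # dampen idf noise on small collections (the default)
        sublinear_tf=True,      # log-scaled term frequency, less sensitive to length
        # use_idf=False,        # alternative: drop idf entirely if documents are few
    )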

Debasis

Posted 2014-06-20T14:58:09.320

Reputation: 1 476