I'm currently working with a dataset that has a wide range of document lengths -- anywhere from a single word to a full page of text. In addition, the grammatical structure and use of punctuation vary wildly from document to document. The goal is to classify these documents into one of roughly 10-15 categories. I'm currently using ridge regression and logistic regression for the task, with cross-validation to select the alpha values for ridge. The feature vectors are tf-idf n-grams.
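For reference, my pipeline looks roughly like the sketch below (assuming scikit-learn; the toy documents, alpha grid, and n-gram range are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline

# Documents of wildly varying length, from one word to a full page
docs = [
    "word",
    "another word",
    "A full page of text, with punctuation, clauses, and so on.",
    "Yet another long document that goes on and on about its topic.",
]
labels = ["cat_a", "cat_a", "cat_b", "cat_b"]  # one of ~10-15 categories

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),              # tf-idf n-gram features
    RidgeClassifierCV(alphas=[0.01, 0.1, 1.0, 10.0]), # CV over ridge alphas
)
clf.fit(docs, labels)
print(clf.predict(["another short doc"]))
```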
Recently I've noticed that longer documents are much less likely to be categorized. Why might this be the case, and how can one "normalize" for this kind of variation? As a more general question, how does one typically deal with such diverse datasets? Should documents be grouped based on metrics like document length, use of punctuation, grammatical rigor, etc., and then fed through different classifiers?
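For concreteness, these are the vectorizer options I've found that seem related to length normalization, though I'm not sure they fully address the issue (again assuming scikit-learn; parameter values are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    norm="l2",          # L2-normalize each document vector (the default),
                        # so longer documents don't get larger magnitudes
    sublinear_tf=True,  # replace raw term counts with 1 + log(tf),
                        # damping the high counts long documents produce
)
```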