Using NLP to automate the categorization of user descriptions



I have a huge file of customer complaints about my company's products. I would like to analyze those descriptions and tag each one with a category.

For example, I need to count how many complaints concern the software side of my product and how many concern the hardware side. Currently I do this analysis in Excel, which requires a significant amount of manual work to assign a tag to each complaint.

Is there a way in NLP to build and train a model to automate this process? I have been reading about NLP for the past couple of days, and it looks like it has a lot of good features for getting a head start on this problem. Could someone please guide me on how I should use NLP to address it?


Posted 2014-12-09T20:49:37.093

Reputation: 945

Do you have any programming skills? There are many ways to do this; I can suggest something, ideally based on tools you have used in the past. – sheldonkreger – 2014-12-09T20:51:47.973

I have good knowledge of Java. I have used R for a couple of data mining tasks. Currently I am learning Python in order to use NLP. – SRS – 2014-12-09T20:53:30.183



One way to handle this is to use 'supervised classification'. In this model, you manually classify a subset of the data and use it to train your algorithm. Then, you feed the remaining data into your software to classify it.
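As a rough illustration of that workflow, here is a minimal sketch of a Naive Bayes text classifier written with only the Python standard library. It is a stand-in for what a toolkit would give you, not the toolkit's own API; the complaint texts, the labels, and the helper names `tokenize`, `train` and `classify` are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def train(labeled):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled subset (invented for illustration).
labeled = [
    ("the screen cracked after one week", "hardware"),
    ("battery drains and the charger overheats", "hardware"),
    ("the app crashes when I open settings", "software"),
    ("update failed and login page shows an error", "software"),
]
model = train(labeled)

# The remaining, unlabeled complaints get a predicted tag.
unlabeled = ["the charger stopped working", "the app shows an error after update"]
predictions = [classify(model, text) for text in unlabeled]
```

With real data you would label far more than four complaints and hold some of them back to check how often the predictions are right.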

This can be done with NLTK for Python.

If you are simply looking for strings like "hardware" and "software", this is a simple use case, and you will likely get decent results using a 'feature extractor', which informs your classifier which phrases in the document are relevant.
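A feature extractor of that kind can be sketched as a plain function that turns a complaint into a dictionary of yes/no features. The keyword list and the name `complaint_features` are hypothetical; as far as I know, a dictionary per document paired with its label is also the shape NLTK's classifiers expect as training input, but check the NLTK documentation before relying on that.

```python
# Hypothetical keyword list; replace it with the terms you already have in mind.
KEYWORDS = ["hardware", "software", "screen", "battery", "crash", "update"]


def complaint_features(text, keywords=KEYWORDS):
    """Map a complaint to boolean 'contains(<keyword>)' features."""
    lowered = text.lower()
    # Substring matching, so "crash" also matches "crashes" or "crashed".
    return {f"contains({kw})": kw in lowered for kw in keywords}


features = complaint_features("Software update made the app crash on startup")
```

Substring matching is crude (for example, "update" would also match "updated"), which may or may not be what you want for your tags.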

While it's possible to implement an automated method for finding the keywords, it sounds like you have a list in mind already, so you can skip that step and just use the tags you are aware of. (If your results aren't satisfactory the first time, this is something you might try later on.)

That's an overview for getting started. If you are unhappy with the initial results, you can refine your classifier by introducing more complex methods, such as sentence segmentation, identification of dialogue act types, and decision trees. The sky is the limit (or more likely, your time is the limit)!

More info here.



Reputation: 1 139

This basic strategy would also work if you found an NLP toolkit in another language you know, like Java. I'm just not familiar with those. – sheldonkreger – 2014-12-09T22:00:30.657


Sheldon is correct: this sounds like a fairly typical use case for supervised classification. If all of your customer complaints are either software or hardware (i.e., no individual complaint covers both categories, and none falls outside these two classes), then all you need is a binary classifier, which makes things simpler than they otherwise could be.

If you're looking for a Java-based NLP toolkit that supports something like this, you should check out the Stanford Classifier -- it's licensed as open source software under the GPL.

Their wiki page should help you get started using the classifier -- keep in mind that you'll need to manually annotate a large sample of your data as a training set, as Sheldon mentioned.

Charlie Greenbacker


Reputation: 1 451

Software/hardware classification was just a sample task I used to try out the categorization. I have several other categories in mind that would require a deep understanding of what is wrong with the product, gained by reading the customer's case, in order to tag the appropriate category. I started reading about NLTK with Python, but I would like to know which kinds of functions I should be looking at to address this case. – SRS – 2014-12-10T15:44:21.057

This isn't a simple matter of looking for magic functions. What you want to do is build a classifier using supervised machine learning. These are the steps... 1. manually annotate a sample of your data as a training set, 2. extract features from your data to train on (for text, this might be something like ngrams), 3. build the classifier model using a machine learning library, 4. apply the classifier model to new data. Some libraries like Stanford Classifier will help you with steps 2 & 3. – Charlie Greenbacker – 2014-12-10T17:27:06.407
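
For step 2, n-gram features can be sketched in a few lines of standard-library Python. The function names `ngrams` and `ngram_features` and the sample complaint are hypothetical, and whitespace tokenization is an assumption; a real toolkit would normally handle tokenization and feature extraction for you.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Count every 1..max_n-gram in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Invented complaint text, used only to show the shape of the output.
features = ngram_features("screen will not turn on")
```

Bigrams such as ("not", "turn") keep a little word order that single words lose, which can matter once the categories are finer than software versus hardware.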