How to create a good list of stopwords



I am looking for some hints on how to curate a list of stopwords. Can someone recommend a good method to extract stopword lists from the dataset itself, for use in preprocessing and filtering?

The Data:

A huge amount of human text input of variable length (search terms and whole sentences of up to 200 characters), collected over several years. The text contains a lot of spam (machine input from bots, single words, nonsense searches, product searches, ...) and only a few percent of it seems useful. I realised that sometimes (though very rarely) people search my site by asking really good questions. These questions are interesting enough that I think it is worth taking a deeper look at them, to see how people search over time and what topics they have been interested in on my website.

My problem:

is that I am really struggling with the preprocessing (i.e. dropping the spam). I have already tried some stopword lists from the web (NLTK etc.), but these don't really meet my needs for this dataset.

Thanks for your ideas and discussion, folks!


Posted 2015-05-24T21:45:02.207

Reputation: 323

The Python NLTK module provides stopwords data; if it did not help you, you should provide more info about your dataset. Why was it not helpful in your case? – Kasra Manshaei – 2015-05-25T01:25:41.937

@kasramsh: When I filtered for these stopwords, I had the impression that it did not significantly filter out the spam. I think the reason is that these lists are generated from natural text (I'm not sure), and are therefore not usable for search terms and site queries. For example, when clustering (based on search-string similarity), I had the impression that the spam has a strong effect at the entropy level and thereby mixes up the end result :-/. – PlagTag – 2015-05-25T09:53:01.813

I think @PlagTag doesn't understand what stop words are. Stop words are a list of the most common words in a language, for example "I", "the", "a" and so on. You just remove these words from your text before you start training an algorithm that tries to identify which text is spam and which is not. It won't help you identify spam by itself, but it can give your learning algorithm some improvement. – itdxer – 2015-05-25T14:18:51.500

@itdxer, thanks for your comment. I used the term stopwords here in a broader sense (as I thought it might be OK for the purpose). Thank you for clearing up the issue ;-) – PlagTag – 2015-05-26T08:49:23.637



One approach would be to use the tf-idf score. Words that occur in most of the queries will be of little help in differentiating the good search queries from the bad ones. But words that occur very frequently (high tf, or term frequency) in only a few queries (high idf, or inverse document frequency) are likely to be more important in distinguishing the good queries from the bad ones.
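A minimal sketch of the IDF half of this idea (the query list and the cut-off of three words are made up for illustration): the lowest-IDF words are the ones that appear in almost every query, and those are the stopword candidates.

```python
# Sketch: find stopword candidates by document frequency / IDF.
# Words appearing in most queries get a low IDF and are candidates
# for removal; the query list and cut-off here are illustrative.
import math
from collections import Counter

queries = [
    "how to train a neural network",
    "how to bake bread at home",
    "how to fix a flat tire",
    "best neural network library",
]

# Document frequency: in how many queries does each word appear?
doc_freq = Counter(word for q in queries for word in set(q.split()))
n_docs = len(queries)

# Classic IDF; low values mean the word occurs almost everywhere.
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# The words with the lowest IDF are stopword candidates.
stopword_candidates = sorted(idf, key=idf.get)[:3]
print(stopword_candidates)
```

On this toy corpus "how" and "to" surface first, since they occur in three of the four queries.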

Shagun Sodhani

Posted 2015-05-24T21:45:02.207

Reputation: 722

Actually, a high IDF score alone would do the trick. – CpILL – 2018-05-16T00:37:54.523

Thanks a lot, I will try this out and report back here! – PlagTag – 2015-05-26T08:46:17.480


It depends on your application.

When you are doing topic modelling, try the default stopwords first. If some words occur prominently in many topics (note my rather vague formulation), they are good candidates for additional stopwords.

E.g., in a corpus of texts containing figures and tabular material, the words "fig", "figure", "tab", and "table" are good additional stopwords. As a result, your topics become better defined.


Posted 2015-05-24T21:45:02.207



An approach I have used to build a stopword list is to train a logistic regression model (chosen for its interpretability) on your text data. Take the absolute value of the coefficient for each token and sort the tokens by it in descending order. Then create a list of all the tokens with a high coefficient absolute value that might lead to overfitting, or that meet some other criterion for being a stopword. That list is your stopword list. You can then apply it to another set of documents of this type (like a test set) to see whether removing these words increases the accuracy, precision, or recall of a model on the test set.

This strategy is effective because it takes into account the impact of tokens when building a stopword list.


Posted 2015-05-24T21:45:02.207

Reputation: 111


Using TF-IDF (term frequency–inverse document frequency) will serve your purpose. Compute the TF-IDF score for each word in your documents and sort the words by their scores; this lets you select the important words in your data.

Thilak Adiboina

Posted 2015-05-24T21:45:02.207

Reputation: 76


Stopwords may be part of the solution at some point, but they are not the key. In any case, good lists of stop words exist for all major languages; a stopword list should not be domain-specific.

I also don't think that using TF-IDF alone is really correct: there could be very rare (potentially garbage) words in poor-quality strings.

Instead of trying to guess which exact features are useful, I would start by creating a dataset: randomly select some of the data and label it by hand (as good or bad, or on a scale from 0.0 to 1.0). Then code something up that pulls out many features: length, number of words (tokens), spam score, whether it contains URLs or bot-ish characters, detected language, whether it has a question mark, whether it has proper capitalisation. Also don't forget to include any non-linguistic features you may have, like the country of the geoIP of the user who made the query, whether the user was logged in, and how old the user's account is. At this point you will have a massive table/CSV, and a smaller one with one extra column for the label you have added.

Then train a machine learning model on those labeled examples until it is accurate enough for you, and let that model run on the rest of the data.
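A bare-bones sketch of that pipeline; the feature set, the hand-labeled examples, and the classifier choice are all illustrative assumptions, and a real labeled sample would be far larger.

```python
# Sketch: hand-crafted features per query + a simple classifier,
# trained on a small hand-labeled sample and then applied to the
# rest of the data. All example queries and labels are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(query: str) -> list:
    tokens = query.split()
    return [
        len(query),                           # character length
        len(tokens),                          # number of tokens
        int("http" in query.lower()),         # contains a URL?
        int("?" in query),                    # has a question mark?
        int(query[:1].isupper()),             # proper capitalisation?
        sum(not c.isalnum() and not c.isspace()
            for c in query),                  # count of "bot-ish" chars
    ]

labeled = [  # (query, label): 1 = good question, 0 = spam/junk
    ("How do your two premium plans differ in practice?", 1),
    ("What should I look for when comparing these tools?", 1),
    ("Why does the advanced search ignore my filters?", 1),
    ("Can I export my results as a spreadsheet?", 1),
    ("xxzz http://spam.example/##!!", 0),
    ("buy buy buy $$$", 0),
    ("asdf", 0),
    ("cheap deal!!! http://junk.example", 0),
]

X = np.array([extract_features(q) for q, _ in labeled])
y = np.array([label for _, label in labeled])
model = LogisticRegression(max_iter=1000).fit(X, y)

# Run the model over the unlabeled remainder of the data.
unlabeled = ["How can I compare prices across sellers?", "zzz $$$ click"]
predictions = model.predict(np.array([extract_features(q) for q in unlabeled]))
print(predictions)
```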

If you prefer not to code too much, you could even just get those features into CSV form and feed them to the Google Prediction API's spreadsheet interface.

Adam Bittlingmayer

Posted 2015-05-24T21:45:02.207

Reputation: 504