NLP - why is "not" a stop word?

I am trying to remove stop words before performing topic modeling. I noticed that some negation words (not, nor, never, none, etc.) are usually considered stop words. For example, NLTK, spaCy, and sklearn all include "not" on their stop word lists. However, if we remove "not" from the sentences below, they lose their essential meaning, which would hurt topic modeling or sentiment analysis.

1). StackOverflow is helpful      => StackOverflow helpful
2). StackOverflow is not helpful  => StackOverflow helpful
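To make the information loss concrete, here is a minimal sketch of naive stop-word filtering. The stop list here is a tiny hand-picked excerpt for illustration only, not any library's exact list (though NLTK, spaCy, and sklearn do all include "not"):

```python
# Illustrative excerpt of a typical English stop word list;
# real lists are much longer, but they do include "not" and "nor".
STOP_WORDS = {"is", "the", "a", "an", "and", "not", "nor", "never"}

def remove_stop_words(sentence):
    """Naively drop stop words, keeping the original token order."""
    return " ".join(w for w in sentence.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words("StackOverflow is helpful"))      # StackOverflow helpful
print(remove_stop_words("StackOverflow is not helpful"))  # StackOverflow helpful
```

Both sentences collapse to the same token sequence, which is exactly the loss the question is pointing at.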

Can anyone please explain why these negation words are typically considered to be stop words?

E.K.

Posted 2016-12-15T22:20:16.537


If you're doing a semantic analysis of sentences, logical connectives obviously matter: (1) iff not (2). If you intend to model the logic of these sentences, keep them out of the stops bag. They're usually thrown in there because, from a data mining point of view, the presence of 'not' in a document isn't going to tell us much about the topic to help us distinguish it from other documents; it's not rare enough. There are probably other reasons for ignoring them in NLP tasks. – Hunan Rostomyan – 2016-12-15T22:48:27.803

Answers

Stop words are usually thought of as "the most common words in a language". However, other definitions based on different tasks are possible.

It clearly makes sense to consider 'not' as a stop word if your task is based on word frequencies (e.g. tf–idf analysis for document classification).
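The frequency argument is visible in the idf factor itself: a word that occurs in nearly every document gets a weight close to the floor, so keeping it adds little discriminative signal. A minimal sketch with made-up document counts, using one common smoothed idf variant:

```python
import math

def idf(n_docs, doc_freq):
    """One common smoothed inverse document frequency:
    log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Hypothetical corpus of 1000 documents:
# "not" appears in almost all of them, "helpful" in relatively few.
print(round(idf(1000, 950), 3))  # 1.051 -> near the floor, contributes little
print(round(idf(1000, 40), 3))   # 4.195 -> far more discriminative
```

Whatever the exact idf formula, the ordering is the same: near-ubiquitous words like "not" end up with weights too small to help separate documents by topic.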

If you're concerned with the context of the text (e.g. for sentiment analysis), it might make sense to treat negation words differently. Negation changes the so-called valence of a text; handling it correctly needs care and is usually not trivial. One example would be the Twitter negation corpus. An explanation of the approach is given in this paper.
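One simple way to keep the negation signal without keeping "not" as an ordinary token is negation marking (the idea behind e.g. `nltk.sentiment.util.mark_negation`): suffix every token between a negation word and the next clause boundary, so negated and non-negated uses become distinct features. A pure-Python sketch of the idea; the suffix, the negator set, and the scope rule are illustrative choices, and unlike NLTK's version this one drops the negator itself:

```python
NEGATORS = {"not", "no", "never", "nor"}
CLAUSE_ENDS = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    """Suffix tokens inside a negation scope with _NEG, so that
    'not helpful' survives stop-word removal as 'helpful_NEG'."""
    out, in_scope = [], False
    for tok in tokens:
        if tok.lower() in NEGATORS:
            in_scope = True       # open a negation scope; drop the negator itself
        elif tok in CLAUSE_ENDS:
            in_scope = False      # punctuation closes the scope
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
    return out

print(mark_negation(["StackOverflow", "is", "not", "helpful"]))
# ['StackOverflow', 'is', 'helpful_NEG']
```

After marking, a downstream classifier sees "helpful" and "helpful_NEG" as different features, so stop-word filtering no longer conflates the two example sentences from the question.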

oW_

