Under what circumstances is lemmatization not an advisable step when working with text data?



Disregarding possible computational constraints, are there general applications where lemmatization would be a counterproductive step when analyzing text data?

For example, should lemmatization be skipped when building a context-aware model?

For reference, lemmatization, per dictionary.com, is the act of grouping together the inflected forms of (a word) for analysis as a single item.

For example, the word 'cook' is the lemma of the word 'cooking'. Lemmatization means replacing 'cooking' with 'cook' after you have tokenized your text data. Similarly, the word 'worse' has 'bad' as its lemma, so replacing 'worse' with 'bad' is also lemmatization.
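To make the definition concrete, here is a minimal sketch of dictionary-based lemmatization. The `TOY_LEMMAS` table and the `lemmatize` helper are hypothetical names made up for illustration; a real lemmatizer would rely on a full lexicon and part-of-speech information rather than a five-entry lookup.

```python
# Toy lookup table standing in for a real lemmatizer's lexicon (illustrative only).
TOY_LEMMAS = {
    "cooking": "cook",
    "cooked": "cook",
    "cooks": "cook",
    "worse": "bad",
    "worst": "bad",
}


def lemmatize(tokens):
    """Lowercase each token and replace it with its lemma if the table knows it."""
    return [TOY_LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]


# Tokenize first (here: naive whitespace split), then lemmatize.
tokens = "Cooking at home is worse than eating out".split()
lemmas = lemmatize(tokens)
print(lemmas)
```

Tokens missing from the table (such as 'eating' here) pass through unchanged, which is one reason a toy table like this is only a sketch.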


Posted 2018-08-08T22:26:50.310

Reputation: 155

1I think this question would be improved with a short description of what lemmatization is – kbrose – 2018-08-08T23:09:54.590

1@kbrose Alright, I can add a short description. Thank you for the suggestion. – Zer0k – 2018-08-08T23:22:59.873

1Thanks! Interesting question. I think there are simple things like part-of-speech tagging that would definitely be harmed by lemmatization. Curious to see if there are more – kbrose – 2018-08-09T14:33:51.830



NLP tasks that would be harmed by lemmatization:

1) Tense classification

      sentence        |  tense
He cooked a nice meal |  past
He cooks a nice meal  |  present

The sequence of characters at the end of a verb can help in this task. The verbs cooked and cooks differ in their final characters, ed and s respectively.

With lemmatization, this information is lost. Both verbs become cook, making both sentences appear (in this case) to be in the present tense.
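The loss of the tense signal can be sketched in a few lines. This assumes a deliberately naive rule (a token ending in "ed" means past tense) and a hypothetical two-entry lemma table; it is not a real tense classifier, only an illustration of which feature disappears.

```python
# Hypothetical lemma table covering only the two verbs from the example.
TOY_LEMMAS = {"cooked": "cook", "cooks": "cook"}


def suffix_tense(sentence):
    """Naive heuristic: any token ending in 'ed' marks the sentence as past."""
    for tok in sentence.lower().split():
        if tok.endswith("ed"):
            return "past"
    return "present"


def lemmatize_sentence(sentence):
    """Replace each known token with its lemma; leave the rest untouched."""
    return " ".join(TOY_LEMMAS.get(t, t) for t in sentence.lower().split())


sentences = ["He cooked a nice meal", "He cooks a nice meal"]

raw_labels = [suffix_tense(s) for s in sentences]
lemmatized_labels = [suffix_tense(lemmatize_sentence(s)) for s in sentences]

print(raw_labels)         # the suffix rule separates the two sentences
print(lemmatized_labels)  # after lemmatization both look identical to the rule
```

Once both verbs are mapped to cook, the two sentences become the same string, so no classifier working on the lemmatized text could tell them apart.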

2) Author identification


Given:

  • a set of documents $\mathcal{P}$ written by author $a$,
  • a set of documents $\mathcal{Q}$ written by author $b$,
  • a set of documents $\mathcal{S}$ written by either author $a$ or $b$,

classify whether a document $s\in\mathcal{S}$ is written by author $a$ or $b$.

One way to achieve this is to compare the histogram of words in $s$ with the histograms of the documents in $\mathcal{P}$ and $\mathcal{Q}$, and pick the author whose documents are most similar.

This works because different authors use certain words with different frequencies. However, lemmatization distorts these frequencies, impairing the performance of your model.
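A rough sketch of this histogram comparison follows, assuming cosine similarity over raw word counts as the similarity measure (the answer does not name one). The three one-sentence "documents", the `TOY_LEMMAS` table, and the helper names are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical lemma table; a real lemmatizer would cover the whole vocabulary.
TOY_LEMMAS = {
    "walking": "walk", "walked": "walk",
    "talking": "talk", "talked": "talk",
    "cooking": "cook", "cooked": "cook",
    "was": "be",
}


def histogram(text, lemmatize=False):
    """Word-count histogram of a text, optionally after toy lemmatization."""
    tokens = text.lower().split()
    if lemmatize:
        tokens = [TOY_LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)


def cosine(h1, h2):
    """Cosine similarity between two word-count histograms."""
    dot = sum(h1[w] * h2[w] for w in h1)  # Counter yields 0 for missing words
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


# Made-up data: author a favors "-ing" forms, author b favors simple past.
doc_a = "she was walking and talking and cooking"
doc_b = "she walked and talked and cooked"
doc_s = "he was walking and cooking"  # unknown author, stylistically like a


def similarity_gap(lemmatize):
    """How much closer s is to author a than to author b."""
    s = histogram(doc_s, lemmatize)
    sim_a = cosine(s, histogram(doc_a, lemmatize))
    sim_b = cosine(s, histogram(doc_b, lemmatize))
    return sim_a - sim_b


raw_gap = similarity_gap(lemmatize=False)
lemma_gap = similarity_gap(lemmatize=True)
print(f"gap without lemmatization: {raw_gap:.3f}")
print(f"gap with lemmatization:    {lemma_gap:.3f}")
```

On this toy data the margin separating the two authors shrinks after lemmatization, because the inflected forms that distinguished them collapse onto the same lemmas. With very short or sparse texts the trade-off can go the other way, so this is an illustration of the mechanism rather than a general result.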

Bruno Lubascher

Posted 2018-08-08T22:26:50.310

Reputation: 2 833

So basically, when the structure and the style of the sentence/document are relevant, lemmatization is something detrimental. Did I understand this correctly? – Zer0k – 2018-08-10T18:40:53.313

1@Zer0k, correct. When the important features sit at the level of individual word forms, you don't want lemmatization. If you have higher-level tasks, for example sentiment analysis, you don't need this granularity. "This is the worst restaurant" and "This is the bad restaurant" will both give you negative sentiment. – Bruno Lubascher – 2018-08-10T18:45:11.930

1I am afraid I do not agree with the example of author identification. Especially with short texts, lemmatization helps a lot. Otherwise the feature vectors are too sparse. – Claude – 2018-08-14T19:51:43.643

@Claude, can you please expand a bit on that? What do you define as short text? – Zer0k – 2018-10-10T18:06:03.360

1@Zer0k 200 tokens or up to 1000 or so. – Claude – 2018-10-10T18:24:47.390

@Claude, so basically many different short texts lead to very sparse matrices. Thank you! – Zer0k – 2018-10-10T18:34:44.553