Preprocessing for Text Classification in Transformer Models (BERT variants)


This might be a silly question, but I am wondering whether one should carry out the conventional text preprocessing steps when training one of the transformer models.

I remember that for training Word2Vec or GloVe we needed to perform extensive text cleaning: tokenization, stopword removal, punctuation removal, stemming or lemmatization, and more. However, during the last few days I have taken a quick dive into transformer models (fascinating, by the way), and I have noticed that most of these models ship with a built-in tokenizer (cool), but none of the demos, examples, or tutorials perform any of these text preprocessing steps. Take fast-bert, for instance: no text preprocessing is involved in the demos (maybe it is just a demo), and at inference whole sentences are passed without any cleaning:

texts = ['I really love the Netflix original movies',
         'this movie is not worth watching']
predictions = learner.predict_batch(texts)

The same is true for the original transformers library by HuggingFace, and for many tutorials I have looked at (take this or another one). I can imagine that, depending on the task, preprocessing might not be required, e.g. for next word prediction, machine translation, and more. More importantly, I think this is part of the contextual approach these models offer (that is the innovation, so to say): they are meant to keep most of the text, so that we can obtain a minimal but still good representation of each token (even an out-of-vocabulary word). Borrowed from a Medium article by HuggingFace:

Tokenisation: BERT-Base, uncased uses a vocabulary of 30,522 words. The process of tokenisation involves splitting the input text into a list of tokens that are available in the vocabulary. In order to deal with words not available in the vocabulary, BERT uses a technique called BPE-based WordPiece tokenisation. In this approach an out-of-vocabulary word is progressively split into subwords and the word is then represented by a group of subwords. Since the subwords are part of the vocabulary, we have learned representations and context for these subwords, and the context of the word is simply the combination of the contexts of the subwords.
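To make the quoted WordPiece behaviour concrete, here is a minimal sketch of the greedy longest-match-first subword splitting it describes. The tiny `TOY_VOCAB` below is an assumption for illustration only; the real BERT-Base uncased vocabulary has 30,522 entries, and HuggingFace's actual tokenizers add extra details (lowercasing, max word length, special tokens) not shown here:

```python
# Toy vocabulary: "##" marks a continuation piece (not word-initial).
TOY_VOCAB = {"watch", "##ing", "netflix", "movie", "##s",
             "un", "##watch", "##able"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedily split one lowercase word into the longest subwords
    found in `vocab`; fall back to [UNK] if no decomposition exists."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate span from the right
        if cur is None:
            return [unk]  # whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("watching"))     # → ['watch', '##ing']
print(wordpiece("unwatchable"))  # → ['un', '##watch', '##able']
```

So even a word the model has never seen, like "unwatchable", still gets a meaningful representation from pieces it has learned, which is why aggressive cleaning is usually unnecessary.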

But does that hold true for tasks like multi-label text classification? In my use case the text is full of unhelpful stopwords, punctuation, stray characters, and abbreviations, and as mentioned it is a multi-label text classification task. In fact, the prediction accuracy is not good (after a few rounds of training with fast-bert). What am I missing here?


Posted 2019-11-08T06:28:48.750

Reputation: 3 728


You might want to look at this great resource about tokenization:

– Astariul – 2020-03-23T01:02:35.050



A quick experiment you can do is this: first apply the preprocessing steps you usually do, feed the result to the model, and record the results; then feed the dataset to the model as-is and compare the difference.

In my experience, doing the preprocessing won't make much difference: depending on the dataset it gave me about one percentage point of accuracy more or less (not a considerable change).

When these models are pretrained, no such preprocessing is done, since they are meant to learn the context of all sorts of sentences.

A reason your results are not good enough might be your label distribution. Often a dataset is dominated by one or two labels while the remaining labels make up only a small portion of it. If that is the case, you might want to look into oversampling solutions.
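The label check and a naive oversampling fix can be sketched as follows; `data` is a hypothetical list of (text, labels) pairs, and real pipelines would use a more careful scheme for multi-label data (duplicating an example boosts all of its labels at once):

```python
import random
from collections import Counter

# Hypothetical multi-label dataset: (text, labels) pairs.
data = [("text a", ["sports"]), ("text b", ["sports"]),
        ("text c", ["sports", "finance"]), ("text d", ["politics"])]

# Count how often each label occurs to spot the imbalance.
counts = Counter(label for _, labels in data for label in labels)

def oversample(data, label, factor, seed=0):
    """Duplicate every example carrying `label` (factor - 1) extra times."""
    rng = random.Random(seed)
    extra = [ex for ex in data if label in ex[1]] * (factor - 1)
    out = data + extra
    rng.shuffle(out)  # avoid long runs of identical examples
    return out

balanced = oversample(data, "finance", factor=3)
```

If one or two labels dominate `counts`, boosting the rare ones this way (or downsampling the frequent ones) is usually worth trying before blaming the model.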

Fatemeh Rahimi

Posted 2019-11-08T06:28:48.750

Reputation: 412