This might be silly to ask, but I am wondering if one should carry out the conventional text preprocessing steps for training one of the transformer models?
I remember for training a Word2Vec or Glove, we needed to perform an extensive text cleaning like: tokenize, remove stopwords, remove punctuations, stemming or lemmatization and more. However, during last few days I have had a quick jump into transformer models (fascinated btw), and what I have noticed that most of these models have a built-in tokenizer (cool), but none of the demos, examples, or tutorials are performing any of the these text preprocessing steps. You may take fast-bert for instance, there are no text preprocessing involved for the demos (maybe it is just a demo), but at inference the whole sentences are passed without any cleaning:
texts = ['I really love the Netflix original movies', 'this movie is not worth watching'] predictions = learner.predict_batch(texts)
The same is true for the original transformer by HuggingFace. Or many tutorials that I have looked at (take this or another one). I can imagine that depending on the task this might not be required, e.g. next work prediction or machine translation and more. More importantly I think this is part of the contextual-based approach that these models offer (that is the innovation so to say) that are meant to keep most of the text and we may obtain a minimum but still good representation of the each token (out of vocabulary word). Borrowed from medium article by HuggingFace:
Tokenisation BERT-Base, uncased uses a vocabulary of 30,522 words. The processes of tokenisation involves splitting the input text into list of tokens that are available in the vocabulary. In order to deal with the words not available in the vocabulary, BERT uses a technique called BPE based WordPiece tokenisation. In this approach an out of vocabulary word is progressively split into subwords and the word is then represented by a group of subwords. Since the subwords are part of the vocabulary, we have learned representations an context for these subwords and the context of the word is simply the combination of the context of the subwords.
But does that hold true for tasks like multi-label text classification? In my use case the text is full of not useful stopwords, punctuation, characters and abbreviations and it is multi-label text classification as mentioned earlier. And in fact the prediction accuracy is not good (after a few rounds of training using fast-bert). What do I miss here?