In order to better understand the role of [CLS], let's recall that the BERT model was trained on two main tasks:
- Masked language modeling: some random words are replaced by the [MASK] token, and the model learns to predict those words during training. This task is why we need the [MASK] token.
- Next sentence prediction: given 2 sentences, the model learns to predict whether the 2nd sentence is the real sentence that follows the 1st one. For this task we need another token whose output tells us how likely it is that the current sentence follows the 1st sentence, and this is where [CLS] comes in. You can think of the output of [CLS] as a probability that the 2nd sentence follows the 1st (see the sketch right after this list).
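For a concrete picture, here is a minimal sketch using the Hugging Face transformers library, assuming `bert-base-uncased`; the next-sentence-prediction head sits on top of the [CLS] output and produces two logits (is-next vs. not-next):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The sky is blue."
sentence_b = "It rarely rains in the desert."

# Both sentences are packed into one input: [CLS] A [SEP] B [SEP]
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # shape (1, 2)

# Index 0 = "B follows A", index 1 = "B is a random sentence"
prob_is_next = torch.softmax(logits, dim=-1)[0, 0].item()
print(prob_is_next)
```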
Now you may ask: instead of using [CLS]'s output, can we just output a number (a probability)? Yes, we could, if predicting the next sentence were a separate task. However, BERT was trained on both tasks simultaneously. Organizing the inputs and outputs in this format (with both [MASK] and [CLS]) lets BERT learn both tasks at the same time and boosts its performance.
When it comes to a classification task (e.g. sentiment classification), as mentioned in other answers, the output of [CLS] can be helpful because it contains BERT's sentence-level understanding.
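As a rough sketch (again assuming `bert-base-uncased` and the transformers library), you can take the [CLS] vector from the last hidden layer and feed it into your own classifier head; the head below is purely hypothetical, just to illustrate the idea:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first token of the last hidden layer is [CLS]; shape (1, 768)
cls_embedding = outputs.last_hidden_state[:, 0, :]

# In practice you would train a small head on top of this vector,
# e.g. a single linear layer mapping 768 -> num_labels (hypothetical here).
classifier = torch.nn.Linear(768, 2)
logits = classifier(cls_embedding)
```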