What is the purpose of the [CLS] token, and why is its encoding output important?


I am reading this article on how to use BERT by Jay Alammar and I understand things up until:

For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.

I have read this topic, but still have some questions:

Isn't the [CLS] token at the very beginning of each sentence? Why is it that "we are only interested in BERT's output for the [CLS] token"? Can anyone help me get my head around this? Thanks!


Posted 2020-01-09T17:20:10.963

Reputation: 327



CLS stands for classification, and it's there to support sentence-level classification.

In short, this token was introduced to make BERT's pooling scheme work. I suggest reading this blog post, where this is also covered in detail.

Noah Weber

Posted 2020-01-09T17:20:10.963

Reputation: 4 932

The article you shared is quite helpful. Thanks! – user3768495 – 2020-01-09T18:14:06.310


[CLS] stands for classification. It is added at the beginning because one of the training tasks is sentence classification. And because they need an input that can represent the meaning of the entire sentence, they introduce a new tag.

They can’t take any other word from the input sequence, because that word's output is its own word-level representation. So they add a tag that has no other purpose than being a sentence-level representation for classification.
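To make "select that slice of the cube" from the quoted article concrete, here is a minimal sketch in NumPy. The encoder output is a mock random tensor standing in for BERT's real output (BERT-base uses a hidden size of 768); the point is only that the [CLS] vector is the slice at sequence position 0.

```python
import numpy as np

# Mock encoder output with BERT-like shape: (batch_size, seq_len, hidden_size).
# Random values stand in for real BERT activations; only the shape matters here.
batch_size, seq_len, hidden_size = 2, 10, 768
encoder_output = np.random.rand(batch_size, seq_len, hidden_size)

# [CLS] is always the first token of each input, so its vector sits at index 0.
cls_embedding = encoder_output[:, 0, :]  # shape: (batch_size, hidden_size)
print(cls_embedding.shape)
```

Everything at positions 1 and onward (the per-word representations) is simply discarded for sentence classification.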


Posted 2020-01-09T17:20:10.963

Reputation: 131


To better understand the role of [CLS], let's recall that the BERT model has been trained on 2 main tasks:

  1. Masked language modeling: some random words are masked with the [MASK] token, and the model learns to predict those words during training. For that task we need the [MASK] token.
  2. Next sentence prediction: given 2 sentences, the model learns to predict whether the 2nd sentence really follows the 1st sentence. For this task, we need another token, whose output will tell us how likely it is that the 2nd sentence follows the 1st. And here comes [CLS]. You can think of the output of [CLS] as a probability.

Now you may ask the question: instead of using [CLS]'s output, can we just output a number (as a probability)? Yes, we could, if next-sentence prediction were a separate task. However, BERT has been trained on both tasks simultaneously. Organizing inputs and outputs in this format (with both [MASK] and [CLS]) helps BERT learn both tasks at the same time and boosts its performance.
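As a sketch of how a probability can be read off the [CLS] output: feed the [CLS] vector through a small head. The vector and weights below are random stand-ins (real BERT's next-sentence head is a learned two-way classifier, but the idea is the same single-vector-in, score-out pattern).

```python
import numpy as np

rng = np.random.default_rng(42)
hidden_size = 768  # BERT-base hidden size

# Hypothetical final-layer [CLS] vector for a sentence pair (random stand-in).
cls_vector = rng.standard_normal(hidden_size)

# Hypothetical next-sentence head: one linear unit followed by a sigmoid,
# squashing the score into a probability that sentence 2 follows sentence 1.
w = rng.standard_normal(hidden_size) * 0.02
b = 0.0
p_is_next = 1.0 / (1.0 + np.exp(-(w @ cls_vector + b)))
print(p_is_next)  # a single number in (0, 1)
```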

When it comes to classification tasks (e.g. sentiment classification), as mentioned in the other answers, the output of [CLS] is helpful because it contains BERT's sentence-level understanding.
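For a downstream task like sentiment classification, the same pattern applies with more than one class: take the [CLS] vector and put a small classification head (linear layer plus softmax) on top. This is a sketch with random stand-in values, not a trained model; the class count and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes = 768, 2  # e.g. positive vs. negative sentiment

# Hypothetical [CLS] vector from BERT's last layer (random stand-in).
cls_vector = rng.standard_normal(hidden_size)

# Hypothetical classification head: one linear layer, then softmax over classes.
W = rng.standard_normal((num_classes, hidden_size)) * 0.02
b = np.zeros(num_classes)
logits = W @ cls_vector + b

# Numerically stable softmax turns the logits into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape, probs.sum())
```

In practice this head is trained (fine-tuned) together with BERT on labeled sentences, while the rest of the per-token outputs are ignored.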

hoang tran

Posted 2020-01-09T17:20:10.963

Reputation: 101