How to get sentence embedding using BERT?




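import torch
from transformers import BertModel, BertTokenizer

# The checkpoint name is an assumption; the 768-dimensional output below matches a BERT-base model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

sentence = 'I really enjoyed this movie a lot.'

1. Tokenize the sequence:

tokens = tokenizer.tokenize(sentence)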

2. Add [CLS] and [SEP] tokens:

tokens = ['[CLS]'] + tokens + ['[SEP]']
print(" Tokens are \n {} ".format(tokens))

3. Padding the input:

T = 15  # maximum sequence length; 15 matches the hidden_reps shape shown below
padded_tokens = tokens + ['[PAD]' for _ in range(T - len(tokens))]
print("Padded tokens are \n {} ".format(padded_tokens))
attn_mask = [1 if token != '[PAD]' else 0 for token in padded_tokens]
print("Attention mask is \n {} ".format(attn_mask))

4. Maintain a list of segment tokens:

seg_ids=[0 for _ in range(len(padded_tokens))]
print("Segment Tokens are \n {}".format(seg_ids))

5. Obtaining indices of the tokens in BERT’s vocabulary:

sent_ids = tokenizer.convert_tokens_to_ids(padded_tokens)
print("Sentence indexes \n {} ".format(sent_ids))
# Add a batch dimension of size 1 to each input
token_ids = torch.tensor(sent_ids).unsqueeze(0)
attn_mask = torch.tensor(attn_mask).unsqueeze(0)
seg_ids = torch.tensor(seg_ids).unsqueeze(0)

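6. Feed them to BERT:

# With transformers v4 or later, pass return_dict=False so the call returns a tuple that can be unpacked like this.
hidden_reps, cls_head = bert_model(token_ids, attention_mask=attn_mask, token_type_ids=seg_ids)
print(hidden_reps.shape)  # hidden states of each token in the input sequence
print(cls_head.shape)  # hidden states of each [cls]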

hidden_reps size 
torch.Size([1, 15, 768])

cls_head size
torch.Size([1, 768])

Which vector represents the sentence embedding here? Is it hidden_reps or cls_head?

Is there any other way to get sentence embedding from BERT in order to perform similarity check with other sentences?


Posted 2019-11-04T15:22:32.240

Reputation: 861



There is an academic paper that addresses exactly this problem: Sentence-BERT (S-BERT).
The authors also maintain a GitHub repo (sentence-transformers) which is easy to work with.

Fatemeh Rahimi

Posted 2019-11-04T15:22:32.240

Reputation: 412


Which vector represents the sentence embedding here? Is it hidden_reps or cls_head?

If we look in the forward() method of the BERT model, we see the following lines explaining the return types:

outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]  # add hidden_states and attentions if they are here
return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)

So the first element of the tuple is the "sequence output": each token in the input is embedded in this tensor. In your example, you have 1 input sequence, which was 15 tokens long, and each token was embedded into a 768-dimensional space.

The second element of the tuple is the "pooled output". You'll notice that the "sequence" dimension has been squashed, so this represents a pooled embedding of the input sequence.

So they both represent the sentence embedding. You can think of hidden_reps as a "verbose" representation, where each token has been embedded. You can think of cls_head as a condensed representation, where the entire sequence has been pooled.

Is there any other way to get sentence embedding from BERT in order to perform similarity check with other sentences?

Using the transformers library is the easiest way I know of to get sentence embeddings from BERT.

There are, however, many ways to measure similarity between embedded sentences. The simplest approach would be to measure the Euclidean distance between the pooled embeddings (cls_head) for each sentence.
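As a rough illustration of the Euclidean-distance idea, here is a minimal sketch in plain Python. The function name and the 4-dimensional toy vectors are made up for the example; real pooled embeddings would have 768 values each, and a smaller distance means the two vectors are closer.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy stand-ins for two pooled sentence embeddings (hypothetical values).
emb_a = [0.5, -0.25, 0.0, 1.0]
emb_b = [0.5, 0.25, 0.0, 0.5]

print(euclidean_distance(emb_a, emb_b))  # distance between different vectors
print(euclidean_distance(emb_a, emb_a))  # a vector is at distance 0 from itself
```

Note that Euclidean distance is sensitive to vector length, so whether it separates sentences well depends on the embeddings being compared.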


Posted 2019-11-04T15:22:32.240

Reputation: 1 816

@zachdj Thanks for the information. So should I use hidden_reps or cls_head to get the sentence vector? cls_head has only one vector with 768 dimensions, but hidden_reps has 15 vectors with 768 dimensions. How should I convert these 15 vectors into a single vector? Should I add them, take the mean, or use some other way to represent the 15 token vectors as a single vector? – Aj_MLstater – 2019-11-05T12:53:32.700

There are many ways to pool the 15 token embeddings into a single vector. You could use mean pooling or max pooling. You could also avoid pooling altogether and use all 15 embeddings. – zachdj – 2019-11-05T17:29:15.690

For your question about whether to use hidden_reps or cls_head, it just depends on what you're trying to do. They both represent the sentence. One represents each token, and one has already been pooled. – zachdj – 2019-11-05T17:30:01.290

@zachdj Thanks for the information. Can you share the syntax for mean pooling and max pooling? I tried torch.mean(hidden_reps[0],1), but when I computed the cosine similarity for 2 different sentences it gave me a high score, so I am not sure whether I am getting the sentence embedding the right way. – Aj_MLstater – 2019-11-06T11:15:27.990

s1 = 'what is your age?' tensor([-0.0106, -0.0101, -0.0144, -0.0115, -0.0115, -0.0116, -0.0173, -0.0071, -0.0083, -0.0070], grad_fn=<MeanBackward1>)

s2 = 'Today is monday' tensor([-0.0092, -0.0094, -0.0113, -0.0106, -0.0166, -0.0071, -0.0073, -0.0074, -0.0080, -0.0076], grad_fn=<MeanBackward1>)

cos = torch.nn.CosineSimilarity(dim=0)
The score was 0.93, but ideally it should be very low, since the 2 sentences are not similar. I am not sure why BERT is giving such a high score. – Aj_MLstater – 2019-11-06T11:15:36.220

In that case, cosine similarity may not be a good choice after all. It was just the first thing that would occur to me. – zachdj – 2019-11-06T17:00:41.180

Any suggestions on how I should calculate the similarity of sentences using BERT, then? – Aj_MLstater – 2019-11-07T06:03:49.533

You could try Euclidean distance. Intuitively, you would expect similar words to be nearby in the embedding space. This is certainly true for Word2Vec. – zachdj – 2019-11-07T14:12:54.857
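One thing worth checking in the thread above: torch.mean(hidden_reps[0],1) averages over dimension 1 of a [tokens, hidden] matrix, that is, over the 768 hidden units. That appears to be why the tensors shown have 10 entries (one per token position) rather than 768. Pooling into a sentence vector means reducing over the token axis instead (for a tensor, torch.mean(hidden_reps[0], 0)), ideally skipping [PAD] positions. Below is a minimal sketch of masked mean pooling and max pooling in plain Python; the function names and the toy 4x3 input are illustrative assumptions, not part of any library.

```python
import math

def mean_pool(token_vectors, attn_mask):
    """Average over the token axis, using only positions whose mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attn_mask) if m == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]

def max_pool(token_vectors, attn_mask):
    """Per-dimension maximum over the token axis, ignoring masked positions."""
    kept = [vec for vec, m in zip(token_vectors, attn_mask) if m == 1]
    return [max(column) for column in zip(*kept)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy input: 4 token positions with 3 hidden units each (hypothetical values).
hidden = [
    [1.0, 2.0, 3.0],
    [3.0, 0.0, 1.0],
    [2.0, 4.0, -1.0],
    [9.0, 9.0, 9.0],  # [PAD] position, excluded by the mask
]
mask = [1, 1, 1, 0]

sentence_mean = mean_pool(hidden, mask)  # one value per hidden unit
sentence_max = max_pool(hidden, mask)
print(sentence_mean)
print(sentence_max)
print(cosine_similarity(sentence_mean, sentence_max))
```

The pooled result has one entry per hidden unit (3 here, 768 for the model in the question), so two sentences of different lengths still produce vectors that can be compared directly.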


There is a very cool tool called bert-as-service which does the job for you. It maps a sentence to a fixed-length embedding based on the pre-trained model you use. It also allows a lot of parameter tweaking, which is covered extensively in the documentation.


Posted 2019-11-04T15:22:32.240

Reputation: 189


In your example, the hidden state corresponding to the first token ([CLS]) in hidden_reps can be used as a sentence embedding.

By contrast, the pooled output (mistakenly referred to as hidden states of each [cls] in your code) proved a bad proxy for a sentence embedding in my experiments.
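To make the difference concrete: in the transformers implementation, the pooled output is not the raw [CLS] hidden state but that state passed through an extra dense layer followed by tanh. The sketch below mimics this with plain Python lists; the 2-dimensional vectors and the weight and bias values are hypothetical toy numbers, whereas the real layer is 768x768 with learned weights.

```python
import math

def cls_hidden_state(token_vectors):
    """Raw hidden state of the first token ([CLS]), i.e. hidden_reps[0][0] in the question."""
    return token_vectors[0]

def pooled_output(token_vectors, weight, bias):
    """Dense layer followed by tanh, applied to the [CLS] hidden state only."""
    cls_vec = token_vectors[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vec)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy hidden states for 2 tokens with 2 hidden units each (hypothetical values).
hidden = [[0.5, -1.0], [0.2, 0.3]]
weight = [[1.0, 0.0], [0.0, 2.0]]  # hypothetical stand-in for the learned dense weights
bias = [0.0, 0.0]

print(cls_hidden_state(hidden))
print(pooled_output(hidden, weight, bias))
```

Because of the tanh, every entry of the pooled output lies strictly between -1 and 1, which is one visible way in which it differs from the raw [CLS] vector.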


Posted 2019-11-04T15:22:32.240

Reputation: 151


bert-as-service provides a very easy way to generate embeddings for sentences.

It is explained very well in the bert-as-service repository:


pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Download one of the pre-trained models linked from the repository.

Start the service:

bert-serving-start -model_dir /your_model_directory/ -num_worker=4 

Generate the vectors for the list of sentences:

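from bert_serving.client import BertClient
bc = BertClient()
# encode() returns one fixed-length vector per input sentence
vectors = bc.encode(['First do it', 'then do it right', 'then do it better'])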


Posted 2019-11-04T15:22:32.240

Reputation: 101