BERT Fine-Tuning with additional features



I want to use BERT for an NLP task, but I also have additional features that I would like to include.

From what I have seen, with fine-tuning, one only changes the labels and retrains the classification layer.

Is there a way to use pre-trained BERT models and include additional features?


Posted 2019-03-05T02:57:48.780

Reputation: 183

Hard to say with no detail. BERT retraining often involves only the last layer; you could feed this last layer with both BERT's previous layers and your new features. – Robin – 2019-03-05T14:38:56.250
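The idea in the comment above can be sketched with NumPy stand-ins (the shapes and names here are assumptions for illustration, not BERT's actual API): concatenate BERT's final-layer output with the extra features, then feed the combined vector to a fresh classification layer.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, bert_dim, n_extra, n_classes = 4, 768, 5, 2

# Stand-ins for BERT's pooled [CLS] output and the additional features.
bert_output = rng.normal(size=(batch_size, bert_dim))
extra_features = rng.normal(size=(batch_size, n_extra))

# Concatenate along the feature axis and feed a new classification layer,
# which is the part that gets trained during fine-tuning.
combined = np.concatenate([bert_output, extra_features], axis=1)
W = rng.normal(scale=0.02, size=(bert_dim + n_extra, n_classes))
b = np.zeros(n_classes)
logits = combined @ W + b

print(combined.shape)  # (4, 773)
print(logits.shape)    # (4, 2)
```

In a real setup the same concatenation would happen inside the model graph, so gradients can still flow into BERT if you choose to unfreeze it.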

@debzsud, Want to make that an answer so we can upvote it? – D.W. – 2019-03-09T23:59:32.180



To add additional features using BERT, one way is to use the existing WordPiece vocabulary and run pre-training for more steps on the additional data; the model should then learn the compositionality. The WordPiece vocabulary can basically be used to build features that didn't already exist before.
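To see why the existing vocabulary can cover new domain terms, here is a simplified sketch of WordPiece's greedy longest-match-first tokenization (the toy vocabulary is an assumption for illustration): an unseen word is decomposed into subword pieces that already exist, so further pre-training can learn their composition.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, as in BERT's WordPiece
    tokenizer (simplified: no max-length cap, single word only)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

# Toy vocabulary: "unaffable" is not in it, but its pieces are.
vocab = {"un", "##aff", "##able", "aff"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Because the unseen word is still representable as known pieces, additional pre-training steps on new data can teach the model what those piece combinations mean.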

Another approach to include additional features would be to add more vocabulary while training. The following approaches are possible:

  1. Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
  2. Append new vocabulary words to the end of the vocab file, and update the vocab_size parameter in bert_config.json. Then write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization, tf.truncated_normal_initializer(stddev=0.02) was used). This will likely require mucking around with some tf.concat() and tf.assign() calls.
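The checkpoint-surgery step in approach 2 can be sketched as follows. This is a NumPy illustration of the idea only (the sizes are BERT-base's, but the array is a stand-in, not a real TensorFlow checkpoint; in TensorFlow the concatenation would go through tf.concat() and be written into the new variable with tf.assign()):

```python
import numpy as np

rng = np.random.default_rng(0)

old_vocab_size, hidden_size, n_new_tokens = 30522, 768, 100

# Stand-in for the pre-trained word-embedding table loaded from the checkpoint.
old_embeddings = rng.normal(scale=0.02, size=(old_vocab_size, hidden_size))

# New rows initialized the way BERT initializes embeddings
# (truncated normal with stddev=0.02; a plain normal is used here for brevity).
new_rows = rng.normal(scale=0.02, size=(n_new_tokens, hidden_size))

# The tf.concat() step: a bigger table in which the pre-trained rows
# are preserved and only the appended rows start out random.
new_embeddings = np.concatenate([old_embeddings, new_rows], axis=0)

assert new_embeddings.shape == (old_vocab_size + n_new_tokens, hidden_size)
assert np.array_equal(new_embeddings[:old_vocab_size], old_embeddings)
```

The key property is that the original rows are untouched, so the new checkpoint behaves identically on the old vocabulary and only the appended token embeddings need to be learned.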

Please note that I haven't tried any of these approaches myself.


Posted 2019-03-05T02:57:48.780

Reputation: 606

Could you elaborate on what needs to be changed with the tf.concat and tf.assign calls, and where? I'm looking to do the same as above. Also, as a side note, when increasing the vocab I'm seeing some variable names that appear to be new, seem to be related to the optimizer, and are absent in the original checkpoint. Do you know what these are? ('bert/encoder/layer_8/output/dense/kernel/adam_m', [3072, 768]), ('bert/encoder/layer_8/output/dense/kernel/adam_v', [3072, 768]) – D.S. – 2020-03-01T16:25:50.910


The answer is from this source: . But I wonder where to find this kind of script.

– Pankaj Kumar – 2020-03-03T09:50:12.720