I found this way of using BERT in my translation system, and it allows me to load and use more data to train my model.
I was getting a memory error when I tried to use larger datasets, around 100k sentence pairs, for my task, and it turned out that my tokenizer was the problem: building a vocabulary from scratch over such a volume of data takes a lot of memory. Pre-trained models like BERT are the solution here, letting you feed 200k examples or more to your model without worrying too much about memory errors.
Also, in my task, I was worried about words that do not appear in the training phase but do appear in the test phase. BERT solved this problem for me too, because it was trained on a large corpus.
Let's dive in and find out how I used BERT to fix my problem. Here I am going to build an English-to-English translation system.
- Loading pre-trained BERT for English (if your source and target languages differ, you have to load a model for each; you can look for them on tfhub.dev)
import tensorflow as tf
import tensorflow_hub as hub
import bert

max_seq_length = 50 # I need to test BERT first, so I will keep this small for now
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")
# this is the path to the pre-trained BERT model
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=True)
- then I create my tokenizer
FullTokenizer = bert.bert_tokenization.FullTokenizer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)
Here is an example of what the tokenizer does: a WordPiece tokenizer splits out-of-vocabulary words into known sub-word pieces, so for example "embeddings" becomes:
['em', '##bed', '##ding', '##s']
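To make the splitting rule concrete, here is a hedged, pure-Python sketch of WordPiece's greedy longest-match-first algorithm over a tiny hypothetical vocabulary (the real FullTokenizer uses BERT's full vocab file loaded from the hub module):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest possible piece first, shrinking until a match.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched at all
        tokens.append(cur)
        start = end
    return tokens

# Tiny hypothetical vocab, just to demonstrate the mechanism.
vocab = {"em", "##bed", "##ding", "##s"}
print(wordpiece("embeddings", vocab))  # ['em', '##bed', '##ding', '##s']
```

This is why unseen test-time words are rarely a problem: they get decomposed into pieces that are in the vocabulary.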
- then I used my training data to get my sequences
s = "This is a nice sentence."
stokens = tokenizer.tokenize(s)
stokens = ["[CLS]"] + stokens + ["[SEP]"]
input_ids = get_ids(stokens, tokenizer, max_seq_length)
the output for me was:
input Ids: [101, 2023, 2003, 1037, 3835, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- this method is also needed for getting the token ids, and you should add it to your code
def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from tokenizer vocab, zero-padded to max_seq_length"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length - len(token_ids))
    return input_ids
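Here is a runnable sketch of that padding logic, together with companion get_masks / get_segments helpers that produce the other two BERT inputs. The pairing of these helpers is my assumption, and the ToyTokenizer vocabulary is hypothetical — the real FullTokenizer uses the vocab file loaded from the hub module:

```python
class ToyTokenizer:
    """Hypothetical stand-in for FullTokenizer, with a tiny vocab."""
    vocab = {"[CLS]": 101, "[SEP]": 102, "this": 2023, "is": 2003}

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from tokenizer vocab, zero-padded to max_seq_length."""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return token_ids + [0] * (max_seq_length - len(token_ids))

def get_masks(tokens, max_seq_length):
    """1 for real tokens, 0 for padding positions."""
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))

def get_segments(tokens, max_seq_length):
    """All zeros for a single-sentence input (no sentence-pair task)."""
    return [0] * max_seq_length

tokens = ["[CLS]", "this", "is", "[SEP]"]
print(get_ids(tokens, ToyTokenizer(), 8))  # [101, 2023, 2003, 102, 0, 0, 0, 0]
print(get_masks(tokens, 8))                # [1, 1, 1, 1, 0, 0, 0, 0]
print(get_segments(tokens, 8))             # [0, 0, 0, 0, 0, 0, 0, 0]
```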
- for your model's vocabulary size you can use len(tokenizer.vocab), which in my case (bert_en_uncased) is 30522
After these steps, you can get the sequences for your inputs and outputs and then feed them to your model.
I hope this helps, and please share your opinion if I got anything wrong here.