How do I load FastText pretrained model with Gensim?


I tried to load the fastText pretrained model from here: Fasttext model. I am using wiki.simple.en

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)

But it shows the following error:

Traceback (most recent call last):
  File "nltk_check.py", line 28, in <module>
    word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
  File "P:\major_project\venv\lib\sitepackages\gensim\models\keyedvectors.py",line 206, in load_word2vec_format
     header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "P:\major_project\venv\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Question 1: How do I load the fastText model with Gensim?

Question 2: Also, after loading the model, I want to find the similarity between two words:

 model.find_similarity('teacher', 'teaches')  # something like this
 # Output: 0.99

How do I do this?

Sabbiu Shah

Posted 2017-06-30T02:14:39.717

Reputation: 663

Is gensim an absolute requirement? In the end, I just wound up going with the fasttext library directly, since I really just needed the words to get transformed – information_interchange – 2020-05-10T16:24:43.417

Answers


Here's the link to the methods available in Gensim's fastText implementation: fasttext.py

from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('wiki.simple')

print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]

print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754

Sabbiu Shah

Posted 2017-06-30T02:14:39.717

Reputation: 663

I get DeprecationWarning: Call to deprecated `load_fasttext_format` (use load_facebook_vectors), so I am using from gensim.models.fasttext import load_facebook_model – Hrushikesh Dhumal – 2019-10-29T22:35:25.227


For .bin use load_fasttext_format() (this typically contains the full model, with parameters, n-grams, etc.).

For .vec use load_word2vec_format() (this contains ONLY word vectors, so no n-grams, and you can't continue training the model).
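A minimal sketch of both loading paths (this assumes a pre-4.0 Gensim, where the wrappers module still exists, and the wiki.simple files from the question):

from gensim.models.wrappers import FastText
from gensim.models import KeyedVectors

# Full model from wiki.simple.bin: subword n-grams are included,
# so out-of-vocabulary words still get vectors.
model = FastText.load_fasttext_format('wiki.simple')

# Word vectors only from wiki.simple.vec: no n-grams, no further training.
word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.vec')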

Note: If you are facing memory issues or cannot load .bin models, check out the pyfasttext library instead.

Credits : Ivan Menshikh (Gensim Maintainer)

Akash Kandpal

Posted 2017-06-30T02:14:39.717

Reputation: 241


"For .bin.... you can continue training after loading."

This is not true, as documentation states: "Due to limitations in the FastText API, you cannot continue training with a model loaded this way."

https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format

– Andriy Drozdyuk – 2018-11-18T04:57:14.793

This is no longer true: DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead. – mickythump – 2019-08-17T16:02:54.593

@mickythump Can you please suggest some edits here? – Akash Kandpal – 2020-09-28T12:57:24.907


Update 04/2020

load_fasttext_format() is now deprecated; the updated way to load the models is with gensim.models.fasttext.load_facebook_model() or gensim.models.fasttext.load_facebook_vectors(). Both read Facebook's native .bin format; the former returns the full trainable model, the latter only the word vectors.

For example:

from gensim.models.fasttext import load_facebook_model

model = load_facebook_model('<path_to_bin>')
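If you only need the vectors, load_facebook_vectors() reads the same .bin file but returns just the KeyedVectors (a sketch, reusing the placeholder path from above):

from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors('<path_to_bin>')
print(wv.similarity('teacher', 'teaches'))  # answers Question 2 directly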

jcaliz

Posted 2017-06-30T02:14:39.717

Reputation: 131

It is the .bin file, not the .gz file – information_interchange – 2020-05-10T16:07:46.603


The FastText binary format (which is what it looks like you're trying to load) isn't compatible with Gensim's word2vec format; the former contains additional information about subword units, which word2vec doesn't make use of.

There's some discussion of the issue (and a workaround) on the fastText GitHub page. In short, you'll have to load the text format (available at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

Once you've loaded the text format, you can use Gensim to save it in binary format, which will dramatically reduce the model size and speed up future loading.

https://github.com/facebookresearch/fastText/issues/171#issuecomment-294295302
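A minimal sketch of that round trip, assuming the wiki.simple.vec file from the question (the output filename is just an illustrative choice):

from gensim.models import KeyedVectors

# Load the large text-format vectors once (slow)...
wv = KeyedVectors.load_word2vec_format('wiki.simple.vec')

# ...then save them in word2vec binary format for much faster future loads.
wv.save_word2vec_format('wiki.simple.w2v.bin', binary=True)
wv = KeyedVectors.load_word2vec_format('wiki.simple.w2v.bin', binary=True)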

Fred

Posted 2017-06-30T02:14:39.717

Reputation: 131


I really wanted to use gensim, but ultimately found that using the native fasttext library worked out better for me. You can copy/paste the following code into Google Colab, and it will work out of the box:

pip install fasttext

import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

Works for out-of-vocabulary words too:

ft.get_word_vector("another")
ft.get_word_vector("dkjeri37id20hnd")

information_interchange

Posted 2017-06-30T02:14:39.717

Reputation: 111

But how do I load this .bin file into the embedding layer of a Keras NN? – MrRaghav – 2020-08-06T17:40:00.783

I have found a way! I used the .vec file and loaded it. – MrRaghav – 2020-08-07T11:12:36.780


Let’s use a pre-trained model rather than training our own word embeddings. For this, you can download pre-trained vectors from here. Each line of this file contains a word and its corresponding n-dimensional vector. We will create a dictionary from this file, mapping each word to its vector representation.

import numpy as np
from tqdm import tqdm

def load_fasttext():
    print('loading word embeddings...')
    embeddings_index = {}
    with open('../input/fasttext/wiki.simple.vec', encoding='utf-8') as f:
        next(f)  # skip the "<vocab_size> <dimensions>" header line of the .vec file
        for line in tqdm(f):
            values = line.strip().rsplit(' ')
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('found %s word vectors' % len(embeddings_index))
    return embeddings_index

embeddings_index = load_fasttext()


Let’s check the embedding for a word:


embeddings_index['london'].shape

Here’s a bit more info, from a blog post I wrote for my company, on FastText and other document classification methods (for smaller datasets)

Jakub Czakon

Posted 2017-06-30T02:14:39.717

Reputation: 29