How do I load FastText pretrained model with Gensim?



I tried to load the fastText pretrained model from here: Fasttext model. I am using wiki.simple.en.

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)

But it shows the following error:

Traceback (most recent call last):
  File "", line 28, in <module>
    word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
  File "P:\major_project\venv\lib\site-packages\gensim\models\", line 206, in load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "P:\major_project\venv\lib\site-packages\gensim\", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Question 1: How do I load the fastText model with Gensim?

Question 2: After loading the model, I want to find the similarity between two words:

 model.find_similarity('teacher', 'teaches')
 # Something like this
 Output : 0.99

How do I do this?

Sabbiu Shah

Posted 2017-06-30T02:14:39.717

Reputation: 663

Is gensim an absolute requirement? In the end, I just wound up going with the fasttext library directly, since I really just needed the words to get transformed – information_interchange – 2020-05-10T16:24:43.417



Here's the link for the methods available for the fastText implementation in gensim.

from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('wiki.simple')

print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]

print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754
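For context, similarity() here is plain cosine similarity between the two word vectors. A minimal sketch with made-up 3-d vectors (not real fastText embeddings, which are typically 300-d):

```python
import numpy as np

def cosine_similarity(a, b):
    # what model.similarity(w1, w2) computes on the two word vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up vectors standing in for the embeddings of 'teacher' and 'teaches'
teacher = np.array([0.2, 0.8, 0.5])
teaches = np.array([0.25, 0.7, 0.6])
print(cosine_similarity(teacher, teaches))
```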

Sabbiu Shah

Posted 2017-06-30T02:14:39.717

Reputation: 663

I get DeprecationWarning: Call to deprecated load_fasttext_format (use load_facebook_vectors), so I am using from gensim.models.fasttext import load_facebook_model – Hrushikesh Dhumal – 2019-10-29T22:35:25.227


For .bin use: load_fasttext_format() (this typically contains the full model with parameters, ngrams, etc.).

For .vec use: load_word2vec_format() (this contains ONLY word-vectors -> no ngrams, and you can't update the model).
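The .vec file is just the plain word2vec text format: a "<vocab_size> <dimensions>" header line, then one word and its floats per line, which is why load_word2vec_format can read it while the .bin (which carries the n-gram data) cannot be read that way. A small sketch parsing a made-up two-word file:

```python
import numpy as np

# a tiny made-up file in the plain word2vec text format used by .vec files
vec_text = """2 3
teacher 0.1 0.2 0.3
teaches 0.2 0.1 0.4
"""

lines = vec_text.strip().split('\n')
vocab_size, dim = map(int, lines[0].split())  # header: "<vocab_size> <dim>"
vectors = {}
for line in lines[1:]:
    parts = line.split(' ')
    vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

print(vocab_size, dim)   # 2 3
print(sorted(vectors))   # ['teacher', 'teaches']
```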

Note: If you are facing memory issues or are not able to load .bin models, then check the pyfasttext library as an alternative.

Credits : Ivan Menshikh (Gensim Maintainer)

Akash Kandpal

Posted 2017-06-30T02:14:39.717

Reputation: 241


"For .bin.... you can continue training after loading."

This is not true, as documentation states: "Due to limitations in the FastText API, you cannot continue training with a model loaded this way."

– Andriy Drozdyuk – 2018-11-18T04:57:14.793

1This is no longer true: DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead. – mickythump – 2019-08-17T16:02:54.593

@mickythump Can you please suggest some edits here? – Akash Kandpal – 2020-09-28T12:57:24.907


Update 04/2020

load_fasttext_format() is now deprecated; the updated way to load the models is with gensim.models.fasttext.load_facebook_model() or gensim.models.fasttext.load_facebook_vectors(), for binaries and vecs respectively.

For example:

from gensim.models.fasttext import load_facebook_model

model = load_facebook_model('<path_to_bin>')


Posted 2017-06-30T02:14:39.717

Reputation: 131

It is the .bin file, not the .gz file – information_interchange – 2020-05-10T16:07:46.603


The FastText binary format (which is what it looks like you're trying to load) isn't compatible with Gensim's word2vec format; the former contains additional information about subword units, which word2vec doesn't make use of.

There's some discussion of the issue (and a workaround) on the FastText GitHub page. In short, you'll have to load the text format instead (the .vec file, available from the same download page).

Once you've loaded the text format, you can use Gensim to save it in binary format, which will dramatically reduce the model size and speed up future loading.
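To see why a binary save is so much smaller and faster, here is a rough stand-alone illustration with NumPy (not gensim's actual on-disk format): the same float32 matrix stored as text versus raw binary.

```python
import io
import numpy as np

# the same float32 matrix stored as text vs. raw binary
vectors = np.random.rand(1000, 50).astype('float32')

text_buf = io.StringIO()
np.savetxt(text_buf, vectors)   # text: roughly 25 characters per float
bin_buf = io.BytesIO()
np.save(bin_buf, vectors)       # binary: 4 bytes per float32 (+ small header)

text_bytes = len(text_buf.getvalue())
bin_bytes = bin_buf.getbuffer().nbytes
print(text_bytes, 'bytes as text vs', bin_bytes, 'bytes as binary')
```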


Posted 2017-06-30T02:14:39.717

Reputation: 131


I really wanted to use gensim, but ultimately found that using the native fasttext library worked out better for me. You can copy/paste the following code into Google Colab and it will work out of the box:

!pip install fasttext

import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

Works for out of vocab words too:
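It handles out-of-vocabulary words because fastText represents a word as the average of its character n-gram vectors. A toy sketch of that idea, using a random stand-in table and Python's built-in hash in place of fastText's trained n-gram buckets and FNV hashing:

```python
import numpy as np

def char_ngrams(word, nmin=3, nmax=6):
    # fastText wraps the word in '<' and '>' and takes all character n-grams
    w = f'<{word}>'
    return [w[i:i + n] for n in range(nmin, nmax + 1) for i in range(len(w) - n + 1)]

rng = np.random.default_rng(0)
DIM, BUCKETS = 8, 1000
ngram_table = rng.standard_normal((BUCKETS, DIM))  # stand-in for trained n-gram vectors

def oov_vector(word):
    # an out-of-vocabulary vector is built from the word's n-gram vectors
    grams = char_ngrams(word)
    return sum(ngram_table[hash(g) % BUCKETS] for g in grams) / len(grams)

print(char_ngrams('teach', 3, 3))       # ['<te', 'tea', 'eac', 'ach', 'ch>']
print(oov_vector('teacherish').shape)   # (8,)
```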



Posted 2017-06-30T02:14:39.717

Reputation: 111

but how to load this BIN file for the embedding layer of keras NN? – MrRaghav – 2020-08-06T17:40:00.783

I have found a way! I used .vec file and loaded it. – MrRaghav – 2020-08-07T11:12:36.780


Let’s use a pre-trained model rather than training our own word embeddings. For this, you can download pre-trained vectors from here. Each line of this file contains a word and its corresponding n-dimensional vector. We will create a dictionary using this file, mapping each word to its vector representation.

import numpy as np
from tqdm import tqdm

def load_fasttext():
    print('loading word embeddings...')
    embeddings_index = {}
    with open('../input/fasttext/wiki.simple.vec', encoding='utf-8') as f:
        next(f)  # skip the "<vocab_size> <dimensions>" header line
        for line in tqdm(f):
            values = line.rstrip().rsplit(' ')
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('found %s word vectors' % len(embeddings_index))
    return embeddings_index
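As a quick sanity check, here is the same parsing logic run on a tiny stand-in for wiki.simple.vec (the file contents and 4-d vectors are made up for the example):

```python
import os
import tempfile
import numpy as np

# write a tiny stand-in .vec-style file (made-up words and 4-d vectors)
tmp = tempfile.NamedTemporaryFile('w', suffix='.vec', delete=False, encoding='utf-8')
tmp.write('teacher 0.1 0.2 0.3 0.4\nteaches 0.2 0.1 0.4 0.3\n')
tmp.close()

embeddings_index = {}
with open(tmp.name, encoding='utf-8') as f:
    for line in f:
        values = line.strip().rsplit(' ')
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
os.unlink(tmp.name)

print(embeddings_index['teacher'])  # [0.1 0.2 0.3 0.4]
```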



Let’s check the embedding for a word, e.g. embeddings_index['teacher'].


Here’s a bit more info, from a blog post I wrote for my company, on FastText and other document classification methods (for smaller datasets).

Jakub Czakon

Posted 2017-06-30T02:14:39.717

Reputation: 29