Using word embeddings for Kaggle?


I'm not sure if this is the right forum, so please redirect me if it is not.

I have started on an NLP problem on Kaggle. The dataset provides word embeddings from Google News, wiki, and GloVe in a single zipped folder. I want to use one of them, say GloVe, without unzipping the archive. This is because extracting it exceeds the 4.9 GB disk space limit, so the notebook throws an error and stops.

Any idea on how to deal with it?

I found a workaround: read the file straight out of the zip archive, line by line.

    import io
    import zipfile

    import numpy as np
    from tqdm import tqdm

    dim = 300
    embeddings1_index = {}

    # Open the archive and stream a single member; nothing is extracted to disk.
    with zipfile.ZipFile("../input/quora-insincere-questions-classification/embeddings.zip") as zf:
        with io.TextIOWrapper(zf.open("glove.840B.300d/glove.840B.300d.txt"), encoding="utf-8") as f:
            for line in tqdm(f):
                # ".split(' ')" only for glove-840b-300d; for all other files, ".split()" works
                values = line.split(' ')
                word = values[0]
                vectors = np.asarray(values[1:], 'float32')
                embeddings1_index[word] = vectors
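
If memory rather than disk becomes the bottleneck, one option is to keep only the words your dataset actually uses while streaming. Below is a minimal sketch of that idea, not the exact code I ran. The helper `load_embeddings_from_zip`, the `vocab` parameter, and the toy archive `toy_embeddings.zip` are hypothetical names made up for illustration. It stores each vector in a standard-library `array("f")` so the sketch has no third-party dependencies; `np.asarray(..., 'float32')` as in the code above should work just as well. It also assumes every line ends with exactly `dim` numeric fields.

```python
import io
import os
import tempfile
import zipfile
from array import array


def load_embeddings_from_zip(zip_path, member, dim, vocab=None):
    """Stream one member of a zip archive and return {word: float32 vector}.

    `vocab` is an optional set of words to keep; all other words are skipped.
    (Hypothetical helper, written for illustration only.)
    """
    index = {}
    with zipfile.ZipFile(zip_path) as zf:
        # zf.open() returns a binary stream; TextIOWrapper decodes it lazily,
        # so the member is never written to disk in extracted form.
        with io.TextIOWrapper(zf.open(member), encoding="utf-8") as f:
            for line in f:
                values = line.rstrip().split(" ")
                if len(values) <= dim:
                    continue  # header or malformed line: no room for a token
                # Assumption: the last `dim` fields are the vector and
                # whatever precedes them is the token.
                word = " ".join(values[:-dim])
                if vocab is not None and word not in vocab:
                    continue
                index[word] = array("f", map(float, values[-dim:]))
    return index


def demo():
    """Build a tiny toy archive and load two of its three words."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "toy_embeddings.zip")
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(
                "toy/toy.3d.txt",
                "cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\nfish 0.7 0.8 0.9\n",
            )
        # infolist() reports member names and uncompressed sizes
        # without extracting anything.
        with zipfile.ZipFile(zip_path) as zf:
            sizes = {info.filename: info.file_size for info in zf.infolist()}
        vectors = load_embeddings_from_zip(
            zip_path, "toy/toy.3d.txt", dim=3, vocab={"cat", "fish"}
        )
        return sizes, vectors


if __name__ == "__main__":
    sizes, vectors = demo()
    print(sizes)
    print(sorted(vectors))
```

Checking `infolist()` first is also a quick way to find the exact member path to pass to `zf.open()`.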


Do you mean you're trying to unzip the file in memory? If so, an obvious solution is to unzip it to disk and then read only a specific file when needed. – Erwan – 2020-05-27T20:17:44.657

@Erwan Thanks for the response. However, I am not sure what you mean by unzipping it to disk. I have added an image of my Kaggle code, in case it helps. – fred – 2020-05-29T06:42:02.120

Is it a single very large file in the zip? If yes, I don't think you can use it if you don't have enough memory. If not, i.e. the zip file contains multiple small files, then you can extract the content from your file explorer and load each file one by one in Python. – Erwan – 2020-05-29T13:31:08.343

@Erwan Yes, there are multiple huge files within the zip, and my issue is specific to Kaggle. I can extract it easily on my local system or on Colab. – fred – 2020-05-29T16:44:36.783

It is resolved now. – fred – 2020-05-31T05:09:19.367