How to divide a dataset for training and testing when the features and targets are in two different files?



I am trying to divide a dataset into a training set and a testing set for multi-label classification. The dataset I am working on is this one. It is split into one file that contains the features and another file that contains the targets. They look like the previews below.

[Image: preview of the features file]

[Image: preview of the targets file]

I intend to use this dataset for multi-label classification, and I am following this tutorial. There, the dataset looks like this:

[Image: dataset layout from the tutorial]

The dataset I am working on has 17203824 samples, and the target file contains 58255 distinct labels. So, to follow the tutorial, I intend to create a new 2-D NumPy array with 17203824 rows and 58255 columns, where the appropriate indices are marked with 1. I am able to create the array, but when I try to populate the appropriate indices with 1s, I get an error saying that I don't have enough memory. My code is given below.

questions = pd.read_csv("/kaggle/input/stacklite/questions.csv")
question_tags = pd.read_csv("/kaggle/input/stacklite/question_tags.csv")
d = {v: i[0] for i, v in np.ndenumerate(question_tags["Tag"].unique())}

y = np.zeros([questions.shape[0], len(question_tags["Tag"].unique())], dtype = int)

for k in question_tags["Tag"]:
    j = d[k]
    for i, l in enumerate(y):
        y[i][j] = 1

Can anyone please tell me how I should proceed?


Posted 2020-06-12T13:39:42.747

Reputation: 123



I suggest you look at some of the common Python libraries that will convert a column value into labels. Many of these functions have been around for a while and are optimized to use less memory and/or run faster. For example, you can use what is called "One Hot Encoding" from get_dummies() in pandas, or "LabelEncoder" from sklearn.
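As a minimal sketch of the get_dummies() approach on a tiny invented tag table (the real question_tags file is far too large for this): one-hot encode the Tag column, then collapse to one row per question Id so each row becomes a multi-label indicator vector.

```python
import pandas as pd

# Invented miniature stand-in for question_tags: each Id may carry several tags.
question_tags = pd.DataFrame({
    "Id":  [1, 1, 2, 3, 3, 3],
    "Tag": ["python", "pandas", "python", "numpy", "python", "pandas"],
})

# One indicator column per tag, then one row per question.
one_hot = pd.get_dummies(question_tags, columns=["Tag"])
y = one_hot.groupby("Id").max()
print(y)
```

Each row of y is now the multi-label target for one question, aligned with the question Ids rather than with the (longer) tag file.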

Here is a good reference with many methods to try, depending on your needs.

Here is a sample of how word embedding works. Each word in this example is reduced to two values (x and y):

[Image: 2-D scatter plot of word embeddings]

Donald S


Reputation: 1 493

Thank you for the article. It was very enlightening. I have tried get_dummies: print(pd.get_dummies(question_tags, columns=["Tag"]).head()). I am getting MemoryError: Unable to allocate 2.68 TiB for an array with shape (50576842, 58254) and data type uint8. – odbhut.shei.chhele – 2020-06-13T08:35:59.133

OK. I think you have too many labels for one-hot encoding to be useful, or even successful, on a normal desktop computer. The data will be too sparse for most purposes. In general, is there any similarity between the labels that can be exploited, or do you need the labels to be unique? Any grouping you can do would make this simpler to implement. – Donald S – 2020-06-13T11:26:29.610
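Since the comments above establish that the dense indicator matrix is too sparse to fit in memory, one option (a sketch with invented miniature data; column names mirror the question's files) is to build the same 0/1 matrix as a scipy sparse matrix, which stores only the 1s.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Invented miniatures of the two files in the question.
questions = pd.DataFrame({"Id": [1, 2, 3, 4]})
question_tags = pd.DataFrame({
    "Id":  [1, 1, 2, 4, 4, 4, 4],
    "Tag": ["python", "pandas", "python", "numpy", "python", "pandas", "csv"],
})

# Map every question Id to a row index and every tag to a column index.
row_of = {qid: i for i, qid in enumerate(questions["Id"])}
col_of = {tag: j for j, tag in enumerate(question_tags["Tag"].unique())}

rows = question_tags["Id"].map(row_of).to_numpy()
cols = question_tags["Tag"].map(col_of).to_numpy()

# CSR stores only the nonzero entries, not the ~10^12-cell dense array
# that the real data would require.
y = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.int8), (rows, cols)),
    shape=(len(row_of), len(col_of)),
)
print(y.shape, y.nnz)
```

Memory scales with the number of (question, tag) pairs rather than questions × labels, so the real 17203824 × 58255 matrix stays manageable.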

If that is the case, you can try word embedding to decrease the number of columns for the encoding. This web site has a good explanation of the technique, which uses the similarity between words to decrease the number of dimensions needed.

– Donald S – 2020-06-13T11:28:04.177

Another idea is to use sampling, but you need to be careful, since you have so many labels, not to exclude some of them entirely. However, it may be okay to miss a few rare target values, depending on your goal. – Donald S – 2020-06-13T12:57:06.403
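A sketch of the sampling idea from the comment above, on an invented tag table: sample question Ids, then check which labels survive so rare tags are not dropped silently.

```python
import pandas as pd

# Invented tag table; in the real data this is question_tags.
question_tags = pd.DataFrame({
    "Id":  [1, 1, 2, 3, 4, 5, 6, 7, 8],
    "Tag": ["python", "pandas", "python", "python", "pandas",
            "python", "pandas", "python", "numpy"],
})

# Sample half of the question Ids, then check label coverage.
sampled_ids = question_tags["Id"].drop_duplicates().sample(frac=0.5, random_state=0)
sampled = question_tags[question_tags["Id"].isin(sampled_ids)]

kept = set(sampled["Tag"])
lost = set(question_tags["Tag"]) - kept
print(f"kept {len(kept)} labels, lost {len(lost)} labels: {sorted(lost)}")
```

Here "numpy" is attached to a single question, so whether it survives depends entirely on the draw, which is exactly the risk the comment warns about.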


You can load the features into a variable (X) and the 2nd column of the targets into another variable (y). Then use train_test_split, available in the sklearn library.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=__)

BTW, you need to preprocess the data first. There are a lot of NaN values, and id and date probably have no value in predicting the target.
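Because the tag file has several rows per question, the features and targets must be aligned before splitting. A sketch with invented miniatures of the two files (the Score feature column and split parameters are illustrative assumptions): group the tag file by question Id into tag lists, binarize them, then split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Invented miniatures of the two files.
questions = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Score": [5, 0, 2, 7],          # stand-in feature column
})
question_tags = pd.DataFrame({
    "Id":  [1, 1, 2, 3, 4, 4],
    "Tag": ["python", "pandas", "python", "numpy", "python", "csv"],
})

# Collapse the tag file to one tag-list per question, so features and
# targets line up row for row before splitting.
tags_per_q = question_tags.groupby("Id")["Tag"].apply(list)
X = questions.set_index("Id")
y = MultiLabelBinarizer().fit_transform(tags_per_q.loc[X.index])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, y_train.shape)
```

After the groupby, X and y have one row per question, so train_test_split keeps them paired correctly.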

Pranshu Mishra


Reputation: 78

Won't it be a problem that X_train and y_train have different numbers of rows? – odbhut.shei.chhele – 2020-06-12T15:01:16.123

You need them to have an equal number of rows. A dataset provided by Kaggle should already have that, though. – Pranshu Mishra – 2020-06-12T20:38:52.793

This is multi-label classification. If you look again at the first two images I posted, you will see that sample 4 has 4 different targets, so y_train has far more rows than X_train. – odbhut.shei.chhele – 2020-06-13T08:21:14.440