Dropping missing rows in two dataframes


I have two files:

Test_data - contains the features of a dataset to find predictions for
Submission_data - contains two columns: the index column for the test data and another column for its corresponding predicted value

So I have to make predictions on the test data and store the predicted values in the submission file.

During preprocessing of the test data, I am dropping rows that have non-missing (non-NaN) values for fewer than 50% of the features (columns):

test_data = test_data.dropna(thresh=math.ceil(test_data.shape[1]/2))

Now, how do I remove the corresponding rows in the submission dataframe? If I drop some rows in the test data, I cannot make a prediction for the corresponding rows in the submission dataframe/file.

The problem is that there is an Index column that does NOT have unique values (in both the test data and the submission data).

So how do I drop the rows in the submission data that were also dropped in the test data?

I am new to ML challenges and I find this challenging.

Bharathi A

Posted 2020-09-18T06:44:31.197

Reputation: 45



When you read the two CSV files and store the data in two dataframes, you could then combine them into one dataframe, do the dropna, and then split it back. I will give an example using pandas:

import math
import pandas as pd

df1 = pd.read_csv('test_data.csv')
df2 = pd.read_csv('submission_data.csv')
df3 = pd.concat([df1, df2], axis=1)  # combine the two dataframes side by side; rows line up on the default row labels from read_csv

# Note: thresh counts non-missing values across all columns of df3, including those that came from df2.
reduced_data = df3.dropna(thresh=math.ceil(df1.shape[1] / 2))
predictions = reduced_data.loc[:, ['predictions']]
reduced_data = reduced_data.drop(columns=['predictions'])

# Instead of 'predictions', use whatever column name you have for the predictions in the submission_data.csv file.
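
If you would rather not merge the two files, one alternative is to build a boolean mask from the test data and apply that same mask to both dataframes by position. The following is only a minimal sketch. It assumes both files have the same number of rows in the same order, and the small dataframes and column names ('Index', 'f1', 'f2', 'f3', 'predictions') are made up for illustration.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data.csv and submission_data.csv.
# 'Index' is deliberately non-unique, as in the question.
test_data = pd.DataFrame({
    'Index': [1, 1, 2, 2],
    'f1': [0.5, np.nan, 1.5, np.nan],
    'f2': [1.0, np.nan, np.nan, 2.0],
    'f3': [2.0, np.nan, 3.0, np.nan],
})
submission_data = pd.DataFrame({
    'Index': [1, 1, 2, 2],
    'predictions': [np.nan] * 4,
})

# Same rule as dropna(thresh=...): keep rows with at least `thresh` non-missing values.
thresh = math.ceil(test_data.shape[1] / 2)
keep = (test_data.notna().sum(axis=1) >= thresh).to_numpy()

# Apply the same positional mask to both dataframes so their rows stay in sync.
test_data = test_data[keep]
submission_data = submission_data[keep]
```

Because the mask is a plain NumPy array, pandas applies it by position, so the non-unique Index column never comes into play.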

Hope this helps.

Deepika Kalra

Posted 2020-09-18T06:44:31.197

Reputation: 94