How can I fill NaN values in a Pandas DataFrame in Python?

6

I am trying to learn data analysis and machine learning by working through some problems.

I found the "House Prices" competition, which is a playground competition. Since I am very new to this field, I got confused after exploring the data. The data has 81 columns, one of which is the target column (the house value). Several columns consist mostly of "NaN" values. When I ran:

nulls = data.isnull().sum()
nulls[nulls > 0]

This shows the columns with missing values:

LotFrontage     259 
Alley           1369
MasVnrType      8   
MasVnrArea      8   
BsmtQual        37  
BsmtCond        37  
BsmtExposure    38  
BsmtFinType1    37  
BsmtFinType2    38  
Electrical      1   
FireplaceQu     690 
GarageType      81  
GarageYrBlt     81  
GarageFinish    81  
GarageQual      81  
GarageCond      81  
PoolQC          1453
Fence           1179
MiscFeature     1406

At this point I am totally lost and I don't know how to get rid of these "NaN" values.
Any help would be appreciated.

Ahmed Dhanani

Posted 2016-12-25T22:29:59.157

Reputation: 163

Answers

7

You can use the DataFrame.fillna function to fill the NaN values in your data. For example, assuming your data is in a DataFrame called df,

df.fillna(0, inplace=True)

will replace the missing values with the constant value 0. You can also do more clever things, such as replacing the missing values with the mean of that column:

df.fillna(df.mean(numeric_only=True), inplace=True)

or take the last value seen for a column:

df.ffill(inplace=True)

Filling the NaN values is called imputation. Try a range of different imputation methods and see which ones work best for your data.
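
As a rough sketch of mixing imputation methods by column type: the toy DataFrame below borrows three column names from the question, but its values are invented for illustration, and the "Missing" placeholder label is an assumption rather than anything the dataset prescribes. Numeric columns get their median, and text columns get the placeholder as an extra category.

```python
import numpy as np
import pandas as pd

# Toy data: column names come from the question, values are made up
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 162.0],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
})

# Split the columns by dtype so each group gets a suitable fill value
numeric_cols = df.select_dtypes(include="number").columns
text_cols = df.select_dtypes(exclude="number").columns

# Numeric columns: fill each one with its own median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Text columns: treat "missing" as its own category (placeholder label is an assumption)
df[text_cols] = df[text_cols].fillna("Missing")

# Count the NaN values that are left after filling
remaining_nulls = int(df.isnull().sum().sum())
print(remaining_nulls)
```

Whether the median, the mean, or a placeholder category is the better choice depends on the column, so it is worth comparing a few variants against a validation score.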

timleathart

Posted 2016-12-25T22:29:59.157

Reputation: 3 345

Thanks for the response. The dataset also contains string values. I think df.fillna() will only work on float or integer values. Any pointers on converting string values to numeric values? – Ahmed Dhanani – 2016-12-26T13:07:21.590

1

Ah, I had assumed the data was numeric for some reason. By string values, do you mean categorical data, i.e. strings drawn from a fixed set of values? If so, you can use scikit-learn's LabelEncoder. Natural language, on the other hand, is more difficult to deal with. Bag-of-words is probably the easiest to think about, but have a look at these options.

– timleathart – 2016-12-26T22:01:41.893
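
A minimal sketch of the LabelEncoder suggestion, assuming a made-up categorical column (the name GarageType is taken from the question, the values and the "Missing" label are invented). The missing entries are filled first so that every value passed to the encoder is a string.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column; values are invented for illustration
garage = pd.Series(["Attchd", "Detchd", None, "Attchd"], name="GarageType")

# Fill missing entries first so the encoder only sees strings
garage_filled = garage.fillna("Missing")

# Map each distinct string to an integer code
encoder = LabelEncoder()
codes = encoder.fit_transform(garage_filled)

print(list(encoder.classes_))
print(codes.tolist())
```

The integer codes follow the sorted order of the labels, which implies an ordering the categories may not really have. The scikit-learn documentation describes LabelEncoder as intended for target labels, so for input features OneHotEncoder or pandas.get_dummies may be a better fit.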

1

  # Taking care of missing data
  # Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
  import numpy as np
  from sklearn.impute import SimpleImputer
  imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
  imputer = imputer.fit(X[:, 1:3])
  X[:, 1:3] = imputer.transform(X[:, 1:3])

Suppose my array is named $X$ and I want to handle the missing data in the columns indexed $1$ and $2$ by replacing it with the column mean. The SimpleImputer class from the sklearn library (called Imputer in older releases) is a convenient way to do this.
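
For a quick check of what mean imputation does, here is a small self-contained sketch on a made-up array. It uses SimpleImputer from sklearn.impute, the class available in current scikit-learn releases; the array values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up array with one missing value in each of columns 1 and 2
X = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 200.0],
    [3.0, 30.0, np.nan],
])

# Learn the column means from columns 1 and 2, then fill the gaps with them
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

print(X)
```

Each missing entry is replaced by the mean of the observed values in its own column, so column 0 is left untouched here.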

smit patel

Posted 2016-12-25T22:29:59.157

Reputation: 11