How to handle missing date data?


I have a column named GarageYrBlt which just lists the year the garage of that house was built. I have one nan value for this column. Does it make sense to fill it with the columns median (This was a common approach for the missing age values in the Titanic competition)?

Andros Adrianopolos

Posted 2019-07-05T08:21:24.277

Reputation: 322



I suspect you're working with the Ames house price dataset - one of Kaggle's introductory competitions.

Replacing the missing values with the dataset mean / median is very general.

I believe you have a variable which describes which neighbourhood the house is in. A more focussed approach would be to replace that missing value in GarageYrBlt with the mean / median for that neighbourhood (subject to additional sensibility checks like the garage not being built before the house).


Posted 2019-07-05T08:21:24.277

Reputation: 1 387

Actually its Boston – Andros Adrianopolos – 2019-07-05T08:41:07.677

And yea I do have a column like that but can you break down your approach a little more? – Andros Adrianopolos – 2019-07-05T08:41:44.577

It assumes that houses in the same neighbourhood have similar characteristics - e.g. built around the same time, of similar construction, similar sized plots of land, etc. So my suggestion would be that if your missing data point is from a house in neighbourhood A, replace the missing value with a statistic calculated using other houses in neighbourhood A. You don't have to use the neighbourhood - you could instead use year of construction, type of house, etc. – bradS – 2019-07-05T12:17:08.833

The sense check was just to make sure that the statistic calculated using a broader dataset makes sense in a narrow application. E.g. I think garages are generally built at the same time (or after) a house is built, so it would make sense that the statistic you calculate to replace your missing value doesn't imply that the garage was built before the house. – bradS – 2019-07-05T12:19:35.313

In the Titanic competition, I filled the age data with the mean of its column, relative to the sex so all female passengers with missing age got the mean age of all female passengers. Is this the idea you're referring to? – Andros Adrianopolos – 2019-07-05T13:38:41.180

Yes, that's the a different application but is based on the same idea. – bradS – 2019-07-05T14:04:55.777

I tried to do train['GarageYrBlt'] = train['GarageYrBlt'].fillna(train.groupby('YearBuilt')['GarageYrBlt'].transform("median"), inplace=True) but now all of my values appear as None – Andros Adrianopolos – 2019-07-06T02:18:37.100

Maybe: train['GarageYrBlt'] = train.groupby('YearBuilt')['GarageYrBlt'].apply(lambda x: x.fillna(x.median(),inplace=True)) – bradS – 2019-07-06T20:25:44.970

transform broadcasts a scalar value back to the group, so each row in the grouping has the same value. – bradS – 2019-07-06T20:26:44.120

Same thing. I still get NaN for the entire column of GarageYrBlt – Andros Adrianopolos – 2019-07-07T01:14:10.920

Maybe try without the inplace = True – bradS – 2019-07-07T18:00:30.670

Okay so now I'm able to see everything but the original NaN value is still intact. – Andros Adrianopolos – 2019-07-08T05:21:28.527

Third time lucky... I think I may have misunderstood the transform method before. Try train['GarageYrBlt'] = train.groupby('YearBuilt')['GarageYrBlt'].transform(lambda x: x.fillna(x.median()). If that doesn't work, I would recommend searching Stack Overflow. – bradS – 2019-07-08T08:22:27.543