How to find the mean of a column relative to another column?

0

1

I am working on the Boston house price prediction problem. I have a column named GarageYrBlt that holds the year the garage was built for each house. My assumption is that a garage is most likely built at the same time as the house, so I want to fill the missing values in GarageYrBlt with the median of GarageYrBlt among the houses that share the same YearBuilt.

To explain my idea further: while I was working on the Titanic problem, I filled the missing values in the Age column with the median relative to the Sex column. So every female passenger with a missing age got the median age of all female passengers.
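For reference, the Titanic-style fill described above could look roughly like the sketch below. The rows are made up and only the column names Sex and Age come from the description; it is an illustration of the idea, not the code that was actually run.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic passengers (values are made up).
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# transform("median") returns one value per row:
# the median Age of the group that row's Sex belongs to.
group_median = df.groupby("Sex")["Age"].transform("median")

# fillna with a Series aligns on the index, so each gap
# receives the median of its own group.
df["Age"] = df["Age"].fillna(group_median)
print(df)
```

The same pattern carries over to GarageYrBlt grouped by YearBuilt.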

This is what I did:

train['GarageYrBlt'] = train['GarageYrBlt'].fillna(
    train.groupby('YearBuilt')['GarageYrBlt'].transform("median"),
    inplace=True)

And when I do print(train['GarageYrBlt']), this is my output:

0       None
1       None
2       None
3       None
4       None
5       None
6       None
7       None
8       None
9       None
10      None
11      None
12      None
13      None
14      None
15      None
16      None
17      None
18      None
19      None
20      None
21      None
22      None
23      None
24      None
25      None
26      None
27      None
28      None
29      None
        ... 
1430    None
1431    None
1432    None
1433    None
1434    None
1435    None
1436    None
1437    None
1438    None
1439    None
1440    None
1441    None
1442    None
1443    None
1444    None
1445    None
1446    None
1447    None
1448    None
1449    None
1450    None
1451    None
1452    None
1453    None
1454    None
1455    None
1456    None
1457    None
1458    None
1459    None
Name: GarageYrBlt, Length: 1460, dtype: object

Andros Adrianopolos

Posted 2019-07-06T02:29:42.350

Reputation: 322

Answers

1

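You need to change inplace to False (which is the default).

With inplace=True, fillna modifies the Series it is called on and returns None. Assigning that return value back to train['GarageYrBlt'] is what overwrote the whole column with None. So either keep the assignment and drop inplace=True, or keep inplace=True and drop the assignment: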

train['GarageYrBlt'] = train['GarageYrBlt'].fillna(
    train.groupby('YearBuilt')['GarageYrBlt'].transform("median"),
    inplace=False)

or

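train['GarageYrBlt'].fillna(
    train.groupby('YearBuilt')['GarageYrBlt'].transform("median"),
    inplace=True)

Note that in recent pandas versions, calling an in-place method on a column selected like this may not update train itself, so the first form is the safer choice.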
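Here is a minimal sketch of the difference on a toy DataFrame. The rows are made up and only the column names come from the question. The in-place call is made on a separate copy of the column (the name standalone is just for illustration) so that its effect is easy to see.

```python
import numpy as np
import pandas as pd

# Made-up rows using the question's column names.
train = pd.DataFrame({
    "YearBuilt": [2000, 2000, 2000, 1990, 1990],
    "GarageYrBlt": [2000.0, 2002.0, np.nan, 1991.0, np.nan],
})

# One median per row, taken over the rows that share that row's YearBuilt.
medians = train.groupby("YearBuilt")["GarageYrBlt"].transform("median")

# In-place call on a standalone Series: the Series itself gets filled.
# Older pandas releases return None here, which is what wiped the
# column in the question once it was assigned back.
standalone = train["GarageYrBlt"].copy()
returned = standalone.fillna(medians, inplace=True)
print(returned)
print(standalone.tolist())

# Assigning the non-inplace result keeps the filled values in the DataFrame.
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(medians)
print(train)
```

Either way the filled values are the same; what differs is whether the return value is worth assigning.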

Daren

Posted 2019-07-06T02:29:42.350

Reputation: 106

Why does setting it to True return None? – Andros Adrianopolos – 2019-07-08T06:08:13.273

1

Lots of discussion about this: https://github.com/pandas-dev/pandas/issues/1893 – Daren – 2019-07-08T06:09:40.707

I tried your first snippet of code and the original nan value stays intact. – Andros Adrianopolos – 2019-07-08T06:10:42.113

Hi, did you try re-reading the DataFrame? – Daren – 2019-07-08T06:20:24.757

How so? – Andros Adrianopolos – 2019-07-08T06:21:32.733

Because after you have run your code in your question, the whole column is already None. So you have to read_csv or whatever to get back your original GarageYrBlt column (i.e. go back and revert to the original state). – Daren – 2019-07-08T06:23:29.857

So how do I solve that? Just call read_csv twice? – Andros Adrianopolos – 2019-07-08T07:09:11.787
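Reading the file once is enough, as long as it is read again before the corrected fill runs (for example by re-running the loading cell or restarting the script). The sketch below assumes a hypothetical load_train helper and uses an in-memory string with made-up rows in place of the real CSV. It also shows one possible reason a NaN can survive the fill: if no house with a given YearBuilt has a garage year, that group's median is itself NaN. Falling back to YearBuilt is an assumption taken from the question's premise, not something the answer suggested.

```python
import io
import pandas as pd

# Stand-in for the CSV on disk (made-up rows); in practice the real
# file path would be passed to read_csv instead.
CSV_TEXT = """YearBuilt,GarageYrBlt
2000,2000
2000,
1990,1992
1985,
"""

def load_train():
    # Hypothetical helper: re-reading restores the untouched column.
    return pd.read_csv(io.StringIO(CSV_TEXT))

train = load_train()

# Group-wise median fill, assigned back without inplace.
medians = train.groupby("YearBuilt")["GarageYrBlt"].transform("median")
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(medians)

# The 1985 group has no garage year at all, so its median is NaN
# and that gap is still there after the fill.
remaining = train["GarageYrBlt"].isna().sum()
print(remaining)

# Assumed fallback, following the question's premise that the garage
# was built with the house: use YearBuilt for whatever is left.
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(train["YearBuilt"])
print(train)
```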