I'm working on imputing null values in the Titanic dataset. The `'Embarked'` column has some. I do NOT want to just set them all to the most common value, `'S'`. I want to impute `'Embarked'` based on its correlation with the other columns.

I have tried applying this function to the `'Embarked'` column:

```
def embark(e):
    # Map each port to an arbitrary integer; anything else (including NaN) becomes 4.
    if e == 'S': return 1
    if e == 'Q': return 2
    if e == 'C': return 3
    return 4
```

This lets me call `data.corr()`, but I think it's trickier than that, since I'll get a different correlation with a different value assignment (right?). I also thought about using a four-dimensional one-hot vector (for S, Q, C, NaN), but I doubt that would work.
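To make the concern about value assignments concrete, here is a minimal sketch on a tiny made-up sample (not the real Titanic data; the `Fare` values and both codings are invented for illustration). It compares the Pearson correlation under two arbitrary integer codings with the per-category correlations from one-hot columns, which do not depend on any ordering.

```python
import pandas as pd

# Tiny made-up sample, NOT the real Titanic data.
df = pd.DataFrame({
    "Embarked": ["S", "S", "Q", "C", "C", "S", "Q", "C"],
    "Fare":     [7.0, 8.0, 7.5, 70.0, 80.0, 10.0, 8.0, 60.0],
})

# Two arbitrary integer codings of the same three categories.
coding_a = {"S": 1, "Q": 2, "C": 3}
coding_b = {"S": 2, "Q": 3, "C": 1}

# Pearson correlation of the integer-coded column with Fare.
corr_a = df["Embarked"].map(coding_a).corr(df["Fare"])
corr_b = df["Embarked"].map(coding_b).corr(df["Fare"])
print("coding A:", corr_a)
print("coding B:", corr_b)

# One-hot columns: one 0/1 indicator per category, no implied order.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked").astype(float)
onehot_corr = dummies.corrwith(df["Fare"])
print(onehot_corr)
```

On this sample the two codings even give correlations of opposite sign, so the number from `data.corr()` says more about the chosen integers than about the column itself.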

Is there a scikit-learn method that does this in some way? Any further insights on the matter?

Thanks for the sources. I was going to use `KNNImputer`, but what about test data that has NaN values? Do I preprocess the test data in the same way? I guess I would, right? – JTalbott – 2020-07-14T18:59:35.450

Yes, you can impute the data in your test dataset, using the imputation calculated on the training dataset. I included more details in my answer above. – Donald S – 2020-07-15T02:22:04.597
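A minimal sketch of that train/test pattern with scikit-learn's `KNNImputer`, on invented toy rows rather than the real Titanic data. The integer codes, the `encode`/`decode` helpers, and the choice of `n_neighbors=1` (so the imputed value is always an existing category code rather than an average of codes) are assumptions made for illustration. Features are left unscaled here, so `Fare` dominates the distance; in practice you would probably scale them first.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical integer codes for the ports; only used to get a numeric matrix.
codes = {"S": 0, "Q": 1, "C": 2}
labels = {v: k for k, v in codes.items()}

# Invented toy rows, NOT the real Titanic data.
train = pd.DataFrame({
    "Pclass":   [3, 3, 1, 1, 3],
    "Fare":     [7.0, 8.0, 70.0, 80.0, 7.2],
    "Embarked": ["S", "S", "C", "C", None],
})
test = pd.DataFrame({
    "Pclass":   [1, 3],
    "Fare":     [72.0, 7.2],
    "Embarked": [None, "S"],
})

def encode(df):
    """Replace port letters with codes; missing ports become NaN."""
    out = df.copy()
    out["Embarked"] = out["Embarked"].map(codes)
    return out.astype(float)

def decode(arr, columns):
    """Turn the imputer's array back into a frame with port letters."""
    out = pd.DataFrame(arr, columns=columns)
    out["Embarked"] = out["Embarked"].round().astype(int).map(labels)
    return out

# Fit on the training data only, then reuse the fitted imputer on both sets.
imputer = KNNImputer(n_neighbors=1)
imputer.fit(encode(train))

train_filled = decode(imputer.transform(encode(train)), train.columns)
test_filled = decode(imputer.transform(encode(test)), test.columns)
print(train_filled)
print(test_filled)
```

The key point is that `fit` sees only the training rows, while `transform` is applied to both sets, so missing values in the test data are filled from training neighbours.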