Changing categorical data to binary data is not reflected on the dataset


I am working through the Titanic competition. This is my code so far:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

train = pd.read_csv("")
test = pd.read_csv("")

train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])

# Fill missing values in Age feature with each sex’s median value of Age
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)

linReg = LinearRegression()

data = train[['Pclass', 'Sex', 'Parch', 'Fare', 'Age']]

# implement train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, train['Survived'], test_size=0.2, random_state=0)

# Training the machine learning algorithm
linReg.fit(x_train, y_train)

# Checking the accuracy score of the model
accuracy = linReg.score(x_test, y_test)
print(accuracy*100, '%')

The line defining data previously looked like this: data = train[['Pclass', 'Parch', 'Fare', 'Age']], which gave me an accuracy score of 19.5%. I realized that I hadn't included Sex, so I changed it to this:

data = train[['Pclass', 'Sex', 'Parch', 'Fare', 'Age']]

Then, I got the following error:

ValueError: could not convert string to float: 'female'

Here I realized that the changes I made to train['Sex'] and train['Age'] were not reflected in the training and testing of the model, which seems to be the reason why my model scored 19.5%. How do I overcome this problem?

Andros Adrianopolos

Posted 2019-06-05T04:32:47.217

Reputation: 322



You have converted the values to integers, but you are not assigning the result back to the column.

train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])

should be:

train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])


train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)

This line works without assigning the value only because you used inplace=True. Otherwise, you have to assign the result back, as shown for "Sex" and "Embarked".
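To make the difference concrete, here is a minimal sketch on a made-up four-row frame (the `toy` DataFrame and its values are hypothetical, not the Titanic data). It shows that `replace` leaves the column untouched unless the result is assigned back. The last step uses the same assign-back pattern for `fillna`, since recent pandas versions warn about calling an in-place method on a selected column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic training data
toy = pd.DataFrame({
    "Sex": pd.Series(["female", "male", "female", "male"], dtype=object),
    "Age": [22.0, None, 30.0, 40.0],
})

# replace() returns a new Series; without assignment the frame is unchanged
toy["Sex"].replace(["female", "male"], [0, 1])
unchanged = toy["Sex"].tolist()

# Assigning the result back stores the encoded values in the frame
toy["Sex"] = toy["Sex"].replace(["female", "male"], [0, 1])
encoded = toy["Sex"].tolist()

# Same pattern for fillna: fill missing ages with the per-sex median
toy["Age"] = toy["Age"].fillna(toy.groupby("Sex")["Age"].transform("median"))
ages = toy["Age"].tolist()

print(unchanged)
print(encoded)
print(ages)
```

After the first call the column still holds strings; only the assigned version holds integers.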

sklearn.preprocessing provides utilities for handling these issues, such as LabelEncoder and Imputer (replaced by SimpleImputer in sklearn.impute from scikit-learn 0.20 onward).

LabelEncoder converts strings to integer values, whereas the imputer replaces missing values.

Sample code for your reference:

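from collections import defaultdict

import pandas as pd
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.preprocessing import LabelEncoder

data = train[['Pclass', 'Sex', 'Parch', 'Fare', 'Age']].copy()

# To convert every string feature to integers, keep one LabelEncoder per column
d = defaultdict(LabelEncoder)
for col in data.columns:
    if not pd.api.types.is_numeric_dtype(data[col]):
        data[col] = d[col].fit_transform(data[col])

# Otherwise, convert each string feature separately, as shown below
encoder = LabelEncoder()
data['Sex'] = encoder.fit_transform(data['Sex'])

# Replace missing values with each column's median (returns a NumPy array)
imputer = SimpleImputer(strategy="median")
data = imputer.fit_transform(data)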

vipin bansal

Posted 2019-06-05T04:32:47.217

Reputation: 1 322


You need to transform your independent variables into numeric values. For binary variables, we normally use 0-1 encoding. Introduce a new variable called is_female. If the observation is a male person, give the variable the value $0$; if the observation is a female person, give it the value $1$.
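As a rough sketch of the is_female idea in pandas (the `people` frame below is hypothetical example data):

```python
import pandas as pd

# Hypothetical example data with a string-valued Sex column
people = pd.DataFrame({"Sex": ["male", "female", "female"]})

# The comparison yields booleans; astype(int) turns them into 0/1
people["is_female"] = (people["Sex"] == "female").astype(int)

print(people)
```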

If you have categorical variables like a city with three categories you will need to create additional variables. Imagine we have the possible values ["New York", "London", "Moskau"]. Then you create three variables is_new_york, is_london, is_moskau. If we have an observation from New York this will result in is_new_york=1, is_london=0, is_moskau=0. If we have an observation from London then the values will be is_new_york=0, is_london=1, is_moskau=0 and if we have an observation from Moskau then the values will be is_new_york=0, is_london=0, is_moskau=1. This type of encoding is called one-hot encoding. You can also have multiple cities. For example, if a person lives in Moskau and London then you can use is_new_york=0, is_london=1, is_moskau=1.
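One possible way to build these indicator variables is pandas' `get_dummies`. The sketch below assumes a hypothetical `cities` frame with a single `city` column; the renaming step is only there to match the is_new_york style of names used above.

```python
import pandas as pd

# Hypothetical example data: one city per observation
cities = pd.DataFrame({"city": ["New York", "London", "Moskau"]})

# One indicator column per category, stored as 0/1 integers
one_hot = pd.get_dummies(cities["city"], dtype=int)

# Rename the columns to is_new_york, is_london, is_moskau
one_hot.columns = ["is_" + c.lower().replace(" ", "_") for c in one_hot.columns]

print(one_hot)
```

Each row has exactly one 1. The multi-city case described above (a person living in both Moskau and London) needs a different construction, because `get_dummies` on a single column assumes one category per row.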


Posted 2019-06-05T04:32:47.217

Reputation: 1 254

Yea but I've already done that. I've changed my Sex variable to binaries of 0/1. – Andros Adrianopolos – 2019-06-05T06:57:20.360