Categorical vs continuous feature selection/engineering



I'm working with a dataset with a number of potential predictors, such as:

Age : continuous

Number of children : discrete and numerical

Marital Situation : Categorical ( Married/Single/Divorced.. )

Id_User : Categorical ( an id of the user who conducted the first interview with this person )

I'm stopping at four potential predictors; there are more, but for the sake of brevity, these are enough to illustrate my question.

Question : Continuous features are easy to deal with: normalize them and feed them to the model. But what about categorical features whose values are independent of each other?

Note : I get that categorical features that follow a certain order can be encoded as integers and fed to the model, but what if those categorical features have no meaning as integers? (1 for single, 2 for married, 3 for divorced: for a model that treats this as a quantitative predictor, it doesn't make sense to feed it in like that.)

Any ways to deal with these different types of features?


Posted 2019-04-12T10:17:40.903

Reputation: 1 704



What you are looking for are called dummy variables: they convert your categorical data into a matrix with one column per category, where the column is 1 if the person belongs to that category and 0 otherwise.

The ID variable should not be converted, because you don't want your model to overfit on it (meaning: you don't want your model to memorize the result for every ID; you want your model to generalize).

import pandas as pd
# one column per category, 1 where the row belongs to it, 0 otherwise
dataset2 = pd.get_dummies(dataset)

Juan Esteban de la Calle


Reputation: 2 102


For encoding categorical features, there are two common approaches:

Ordinal encoder

This is the approach you mentioned as 'encoded as integers'. In this method, an integer starting from 0 is assigned to each category. The problem with this method is that it imposes an arbitrary order on the categories. So in cases where there is no natural priority among categories, this encoding is meaningless, as you mentioned. It only works when assigning a larger integer to some categories is meaningful.
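As a sketch of an ordinal encoding with pandas (the column and category names here are made up for illustration; an ordered feature like education level is the kind of case where the integer codes are actually meaningful):

```python
import pandas as pd

# Hypothetical example: a categorical feature with a genuine order.
order = ["Primary", "Secondary", "Tertiary"]
df = pd.DataFrame({"education": ["Secondary", "Primary", "Tertiary", "Primary"]})

# pd.Categorical with explicit categories maps each value to its position in `order`
df["education_code"] = pd.Categorical(df["education"], categories=order).codes
print(df["education_code"].tolist())  # [1, 0, 2, 0]
```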

One-hot encoder

This method builds a feature vector (a one-hot vector) for each categorical feature, with the same size as the number of categories, assigning each component of the vector to one of the categories. For each data sample, the component whose corresponding category is present in the sample is set to 1, and all other components are set to 0. The benefit of this method is that, unlike the ordinal encoder, it does not impose any order on the categories.

So in your case, I highly recommend the one-hot encoder.
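A minimal sketch of what the one-hot encoder does, in plain Python (the `one_hot` helper and its sample values are hypothetical, just to make the mechanics concrete):

```python
def one_hot(values):
    """Encode a list of category labels as one-hot vectors."""
    categories = sorted(set(values))               # fix an ordering of the K categories
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)                # K-dimensional vector of zeros
        vec[index[v]] = 1                          # 1 at the matching category
        vectors.append(vec)
    return categories, vectors

cats, vecs = one_hot(["Married", "Single", "Divorced", "Single"])
print(cats)  # ['Divorced', 'Married', 'Single']
print(vecs)  # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```

In practice you would use a library implementation such as `pd.get_dummies` (shown in another answer) rather than rolling your own.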



Reputation: 1 162

The number of columns 'One-Hot' adds to the dataset doesn't affect the outcome of my model in any way, right? – Blenz – 2019-04-12T13:46:20.463

Are you afraid of overfitting? – pythinker – 2019-04-12T13:51:39.553

Aren't we all?! I'm under the impression that, on the contrary, this method doesn't 'encourage' overfitting – Blenz – 2019-04-12T13:53:30.000

Yes, you are right. We are all afraid of over-fitting. By over-fitting I meant: when we increase the number of inputs, the model has to learn more weights to map these inputs to outputs. So, I should say, it somehow affects the outcome of your model, but it's not a serious concern. – pythinker – 2019-04-12T13:59:29.487

I believe that in the context of machine learning, "dummy variable" is more commonly used for what you are referring to as "one-hot". – Acccumulation – 2019-04-12T15:21:27.530


One possibility to deal with categorical inputs is to introduce the category input vector $\boldsymbol{t}$. The category input vector of the $n^{\text{th}}$ observation is given by

$\boldsymbol{t}_n=[t_{1n}, t_{2n},...,t_{Kn}],$ in which $K$ is the number of categories. If the continuous input vector $\boldsymbol{x}_n$ belongs to category $k$, then $t_{in}=1$ for $i=k$ and $t_{in}=0$ for $i\neq k$.

This type of encoding is called one-hot encoding and is widely used in classification.
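A minimal sketch of building these vectors with NumPy (the category indices here are made up for illustration):

```python
import numpy as np

K = 3                        # number of categories
k = np.array([1, 0, 2, 1])   # hypothetical category index k of each observation n

# Row n of T is t_n: indexing the identity matrix puts a 1 at component k,
# and 0 everywhere else, exactly as in the definition above.
T = np.eye(K, dtype=int)[k]
print(T)
```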



Reputation: 1 254

I have a lot of possible values (100+) in, let's say, Id_User; wouldn't that add 100 additional columns to my dataset? – Blenz – 2019-04-12T11:08:26.640

@Blenzus: Yes, you are right, but the columns are sparse. You have to remember that having so many categories is only feasible if you have a lot of data, such that your data set is representative. – MachineLearner – 2019-04-12T12:49:17.543


As others have said, dummy variables are one method. Another method is to take quantitative statistics from the populations having that property. For instance, you can create a "marital situation average" column and populate it with the average value of the target variable among people with the same marital situation as that subject.
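A sketch of this target-averaging idea with pandas (the data is made up; note that in practice the averages should be computed from the training data only, to avoid leaking the target into the features):

```python
import pandas as pd

# Hypothetical data: a categorical predictor plus a binary target.
df = pd.DataFrame({
    "marital": ["Married", "Single", "Married", "Divorced", "Single"],
    "target":  [1, 0, 0, 1, 1],
})

# Replace each category with the mean of the target within that category.
df["marital_avg"] = df.groupby("marital")["target"].transform("mean")
print(df["marital_avg"].tolist())  # [0.5, 0.5, 0.5, 1.0, 0.5]
```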

If you are using a tree method, simply assigning integers to each category will approximate dummy variables, especially if there are only a few categories. For instance, if the only categories for marital situation are Married, Single, Divorced, and Widowed, and you assign them 0, 1, 2, 3 respectively, then the only possible splits are Married vs. everything else, Widowed vs. everything else, or Married/Single vs. Divorced/Widowed. So two-thirds of the splits are effectively dummy variables, and the last one turns into a dummy variable as soon as you split on that variable again.
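The available threshold splits for that hypothetical 0–3 coding can be enumerated directly:

```python
# Mapping assumed above: 0=Married, 1=Single, 2=Divorced, 3=Widowed.
labels = {0: "Married", 1: "Single", 2: "Divorced", 3: "Widowed"}

# A tree splitting on this integer code can only cut between adjacent codes.
for t in (0.5, 1.5, 2.5):
    left = [labels[c] for c in sorted(labels) if c < t]
    right = [labels[c] for c in sorted(labels) if c > t]
    print(left, "vs", right)
```

The first and last splits isolate a single category (effectively dummy variables); only the middle one groups categories on both sides.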



Reputation: 241


There could be a number of ways of handling categorical data, but what I have seen most often is to create a numeric mapping of the categories and then one-hot-encode the mappings before feeding them into the neural network.

If you are working with Keras, you can use the to_categorical function to transform your mappings accordingly.

>>> from keras.utils import to_categorical

>>> y = [0, 1, 0, 1, 1]
>>> oh_y = to_categorical(y, num_classes=2)
>>> print(oh_y)
[[1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]




Reputation: 1 913

Thanks for the answer. No, actually I'm using "regular" classification algorithms, but yes, I've used the to_categorical method while testing an ANN. – Blenz – 2019-04-15T11:16:50.323