Binary encoding and its interpretation in Python

1

I have a column named Street that has 2 values: Paved and Gravel. Here is what print(train[binary_columns[0]].unique().tolist()) gives me:

['Pave', 'Grvl']

I want to encode these values in binary like this:

df['Street'] = df['Street'].replace(['Pave', 'Grvl'], [1, 0])

But I wonder if this is a good idea. Wouldn't the computer interpret this as Pave > Grvl? How does the computer differentiate between binary and integer encoding?

Andros Adrianopolos

Posted 2019-06-13T05:27:02.577

Reputation: 322

Answers

2

Your categorical variable has two levels, so there is no practical difference between dummy-coding it and simply entering it into the analysis as a single 0/1 variable. Dummy-coding a two-level variable creates one new variable with two values, and your original variable is already one variable with two values. Dummy-coding matters for variables with more than two possible values. So in this case the 0/1 coding does not impose an ordering like Pave > Grvl; it only marks which of the two categories a row belongs to.

But if a variable has more than two values, then you should use dummy variables.

For your data, you can use pandas.get_dummies() or sklearn's OneHotEncoder to achieve your result.
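To illustrate the point about two-level variables, here is a minimal sketch on a toy frame (the four rows are made up; only the Street values come from the question). It compares the manual 0/1 mapping with what pandas.get_dummies() produces when the first level is dropped.

```python
import pandas as pd

# Toy data (assumed for illustration); values mirror the Street column.
df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Manual 0/1 mapping, equivalent to the replace() call in the question.
manual = df["Street"].map({"Pave": 1, "Grvl": 0})

# Dummy coding: levels are sorted, so drop_first=True drops 'Grvl'
# and leaves a single indicator column named 'Pave'.
dummies = pd.get_dummies(df["Street"], drop_first=True, dtype=int)

print(manual.tolist())
print(dummies.columns.tolist())
print(dummies["Pave"].tolist())
```

Both approaches should give the same single column here, which is why dummy coding adds nothing for a two-level variable.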

bkshi

Posted 2019-06-13T05:27:02.577

Reputation: 1 907

So then when am I allowed to use binary encoding? Because that same problem would arise even for variables like gender, which makes a lot of data scientists depend on binary encoding. – Andros Adrianopolos – 2019-06-13T06:24:49.473

Sorry I got mixed up, have edited. – bkshi – 2019-06-13T06:34:56.553

I already know how to do OHE. My concern was whether the computer will interpret it as Pave > Grvl. Thank you. – Andros Adrianopolos – 2019-06-13T06:38:40.903

0

  1. How to encode?

sklearn.preprocessing provides several classes for this purpose; LabelBinarizer is one of them.
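A minimal sketch of LabelBinarizer on a toy list (the list itself is made up; only the two labels come from the question). Note that the classes are sorted alphabetically, so which label becomes 1 is decided by the encoder, not by you.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy values (assumed for illustration).
streets = ["Pave", "Grvl", "Pave", "Pave"]

lb = LabelBinarizer()
# With exactly two classes the result is a single column of 0/1 values.
encoded = lb.fit_transform(streets)

print(lb.classes_)      # sorted classes; the second one is encoded as 1
print(encoded.ravel())  # flattened 0/1 codes
```

If you need a specific label to be 1, an explicit mapping like the replace() call in the question gives you that control.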

  2. Wouldn't the computer interpret this as Pave > Grvl?

Consider an example where people prefer a paved street to a gravel one. Then there is a relationship between the values, and they should be treated as ordered, as you mentioned. Otherwise they should be treated as independent values (see the next point).

  3. How does the computer differentiate between binary and integer encoding?

As mentioned above, if the categorical values have some relationship, they should be encoded as integers (0, 1, 2, and so on); otherwise they should be binary. A binary representation presents each value to the ML model as independent (although this makes little difference in your case, since you have only 2 values). But consider a feature that has more than 2 categorical values. If they are all independent, they should be represented as binary values, i.e. with one-hot encoding (see the classes in sklearn.preprocessing).
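The distinction can be sketched as follows. The Quality and Roof columns and their levels are hypothetical, chosen only to show one ordered and one unordered feature; they are not taken from the question's dataset.

```python
import pandas as pd

# Hypothetical columns (assumed for illustration).
df = pd.DataFrame({
    "Quality": ["Low", "High", "Medium", "Low"],  # levels have an order
    "Roof": ["Tile", "Metal", "Slate", "Tile"],   # levels have no order
})

# Ordered levels: map to integers that respect the order.
quality_order = {"Low": 0, "Medium": 1, "High": 2}
df["Quality_code"] = df["Quality"].map(quality_order)

# Unordered levels: one 0/1 indicator column per level (one-hot encoding).
roof_dummies = pd.get_dummies(df["Roof"], prefix="Roof", dtype=int)

print(df["Quality_code"].tolist())
print(roof_dummies.columns.tolist())
```

With the integer codes a model can use the fact that High > Medium > Low, while the indicator columns give it no way to rank Tile, Metal, and Slate against each other.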

vipin bansal

Posted 2019-06-13T05:27:02.577

Reputation: 1 322