When should ordinal data be represented categorically and when as integers?


I am doing the Kaggle competition House Prices: Advanced Regression Techniques to learn more about data analysis. I would like to apply multiple models to the data (regularized LR, random forests, neural networks, and ensemble methods).

When inspecting the data, I found that many fields are ordinal data represented as categorical data. Two examples:

HeatingQC: Heating quality and condition

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor

LotShape: General shape of property

   Reg  Regular
   IR1  Slightly irregular
   IR2  Moderately irregular
   IR3  Irregular

I was wondering whether I should keep the fields like this, or whether I should encode them as integers (i.e. give each class in the category a number like 1, 2, 3 or 4). Since the answer is probably "it depends", I hope you can give me some more general insight into when I should keep this data categorical and when I should transform it into integers.


Posted 2018-08-18T16:35:07.987

Reputation: 121



Either way, you shouldn't leave the fields as raw strings. One option is one-hot encoding, but since your variables are ordinal, there is little point in it: you can simply map the levels to natural numbers.

One-hot encoding significantly increases the number of independent variables (one new column per level), which can hurt your model. It also discards the ordering, and your variables clearly are ordinal: every value has a well-defined position in the order.

So, in this case, transforming them into natural numbers is the best option.
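As a minimal sketch of that mapping, assuming pandas and a toy frame that reuses the category codes from the question: the two rank dictionaries below are illustrative assumptions (they are not taken from the competition's data description), and the direction of the LotShape ranking in particular is a modelling choice.

```python
import pandas as pd

# Toy data using the category codes quoted in the question.
df = pd.DataFrame({
    "HeatingQC": ["Ex", "TA", "Gd", "Po", "Fa"],
    "LotShape": ["Reg", "IR1", "IR3", "IR2", "Reg"],
})

# Assumed rankings: worst -> best quality, most -> least irregular shape.
heating_rank = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
lot_shape_rank = {"IR3": 1, "IR2": 2, "IR1": 3, "Reg": 4}

# Series.map replaces each label with its rank; unknown labels become NaN.
df["HeatingQC_int"] = df["HeatingQC"].map(heating_rank)
df["LotShape_int"] = df["LotShape"].map(lot_shape_rank)

print(df)
```

Note that this encoding implicitly tells the model that the step from Po to Fa is as large as the step from Gd to Ex, which may or may not hold for house prices.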


Posted 2018-08-18T16:35:07.987

Reputation: 437


Most machine learning algorithms operate only on numerical input variables.

An ordinal variable can either be handled implicitly, through an R 'factor' or a pandas 'Categorical', or you will need to convert/encode it into numbers yourself. Refer to the "encoding categorical data" section in https://www.datacamp.com/community/tutorials/categorical-data
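A short sketch of the pandas route, assuming the HeatingQC levels from the question and an assumed worst-to-best ordering: an ordered categorical dtype keeps the labels readable while still exposing integer codes and order-aware comparisons.

```python
import pandas as pd

# Assumed level order, worst to best (not taken from the linked tutorial).
levels = ["Po", "Fa", "TA", "Gd", "Ex"]
quality_dtype = pd.CategoricalDtype(categories=levels, ordered=True)

heating = pd.Series(["Ex", "TA", "Gd", "Po", "Fa"]).astype(quality_dtype)

# Integer codes follow the declared category order (0-based).
codes = heating.cat.codes

# Ordered categoricals support comparisons against a level.
at_least_good = heating >= "Gd"

print(codes.tolist())
print(at_least_good.tolist())
```

The codes can be fed to a model as integers, while the original column stays a labelled categorical for inspection and plotting.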


Posted 2018-08-18T16:35:07.987

Reputation: 21


Like you said, it depends! For example, some tree-based implementations can handle text-based categorical features directly, so you don't have to convert them to numerical variables.

If you are using an algorithm that relies on statistical measures, e.g. a chi-squared (chi2) test, encoding categorical values as numbers will cause problems. In that case I would recommend one-hot encoding, which generates a binary vector for each category value.
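For illustration, one-hot encoding can be sketched with pandas' get_dummies; the toy LotShape column below is an assumption made for the example, not data from the competition.

```python
import pandas as pd

# Toy column using the LotShape codes from the question.
df = pd.DataFrame({"LotShape": ["Reg", "IR1", "IR3", "IR2", "Reg"]})

# One binary column per distinct level; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["LotShape"], prefix="LotShape", dtype=int)

print(one_hot)
```

Each row has exactly one 1, so no numeric distance between levels is implied, at the cost of one extra column per level and of losing the ordering.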

Guy Gabson Junior

Posted 2018-08-18T16:35:07.987

Reputation: 16


Consider looking at Likert scales. These are exactly the type of ordinal scale that you give in your examples. It is common to quantify them as integers, and this assumption is widely accepted as valid.

However, one should be cautious, in particular when working with statistical descriptors, which depend on the topology of the domain space (i.e. whether it is categorical, ordinal or interval). This is illustrated by this article.
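A small sketch of why the scale type matters, using made-up responses and two hypothetical integer codings that both preserve the order: the comparison of group means flips between the codings, while the comparison of medians does not.

```python
import statistics

# Two order-preserving codings (both are illustrative assumptions).
equal_spacing = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
stretched = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 10}

# Made-up ratings for two groups of houses.
group_a = ["TA", "TA", "TA"]
group_b = ["Po", "Po", "Ex"]


def summarize(group, coding):
    """Return (mean, median) of a group under a given integer coding."""
    values = [coding[label] for label in group]
    return statistics.mean(values), statistics.median(values)


results = {}
for name, coding in [("equal", equal_spacing), ("stretched", stretched)]:
    results[name] = {
        "A": summarize(group_a, coding),
        "B": summarize(group_b, coding),
    }
    print(name, results[name])
```

The median only uses the order of the levels, so it is safe for ordinal data; the mean also uses the spacing between codes, which is exactly the part that an integer encoding assumes.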


Posted 2018-08-18T16:35:07.987

Reputation: 664