## Encoding features like month and hour as categorical or numeric?

38

12

Is it better to encode features like month and hour as factor or numeric in a machine learning model?

On the one hand, I feel numeric encoding might be reasonable, because time is a forward-progressing process (the fifth month is followed by the sixth month). On the other hand, categorical encoding might be more reasonable because of the cyclic nature of years and days (the 12th month is followed by the first one).

Is there a general solution or convention for this?

I faced the same issue when defining an hour-of-the-day (1 to 24) variable in an RF model. If I convert the variable to categorical, the VarImp function shows an importance value for each hour, and it looks very disorganized. I am just wondering whether it is necessary to convert a numerical variable like 'hour of the day' to categorical? – Mahmudur Rahman – 2018-02-13T02:51:03.833

30

Have you considered adding the (sine, cosine) transformation of the time of day variable? This will ensure that the 0 and 23 hour for example are close to each other, thus allowing the cyclical nature of the variable to shine through.
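As a minimal sketch of the transformation suggested here, assuming the general form 2*pi*X/period with a period of 24 for hours (use 12 for months, 7 for days of the week). The helper name `encode_cyclical` is hypothetical and not taken from any library; only the standard `math` module is used.

```python
import math


def encode_cyclical(value, period):
    """Map a cyclic value (e.g. hour 0..23) to a point on the unit circle.

    Both components are returned: sine alone cannot tell apart two values
    that are mirror images of each other on the circle.
    """
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Hour 23 and hour 0 end up as close together as hour 0 and hour 1.
late = encode_cyclical(23, 24)
midnight = encode_cyclical(0, 24)
one_am = encode_cyclical(1, 24)
dist_23_to_0 = math.dist(late, midnight)
dist_0_to_1 = math.dist(midnight, one_am)

# 6 am and 6 pm land on opposite sides of the circle.
six_am = encode_cyclical(6, 24)
six_pm = encode_cyclical(18, 24)
```

Each hour then becomes two numeric columns instead of one. Keeping both columns is what gives every hour a unique position on the circle.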

I kind of have a problem with this, because if I do sin(pi*X/24) where X is in [0, 23], we get the same value for 6 am and 6 pm, as sin(pi*6/24) == sin(pi*18/24). But these are totally different hours – Eran Moshe – 2018-02-14T09:25:15.443

Can do the cycle like this: sin(pi*X/12). Thanks Eran :] – Eran Moshe – 2018-02-14T09:27:11.237

@EranMoshe fyi in the post from the link above they use a factor of 2*pi instead, so it would be sin(2*pi*X/12) - they give some reasoning for this in the comments – tsando – 2018-05-19T17:37:06.173

And it's (2*pi*X/24), which is (pi*X/12) :] As you see, I've struggled with the exact same problem that the author of http://blog.davidkaleko.com/feature-engineering-cyclical-features.html struggled with. And in the comments you can see "Mariel G" correcting him exactly as I've come to realize: pi*X/12 will cycle over the hours of the day. What I have also come to learn is that you must take both the cos and sin components of this to define a true 24-hour period! (You need a true circle, and not just a periodic function.)

– Eran Moshe – 2018-05-21T07:16:43.763

@EranMoshe ah yes, if you want to do it over hours then it can be reduced to pi*X/12, but if you want to do months, then it would be 2*pi*X/12, i.e. pi*X/6. So in general it would be 2*pi*X/period – tsando – 2018-05-21T16:49:02.327

Should I add the sine or cosine transformation, or both? – Funkwecker – 2018-06-26T07:15:47.967

Actually, isn't it sin(pi*X/24) that you want for time and sin(pi*X/12) for months? – Matt Cremeens – 2019-07-12T03:32:09.593

How about days of the week? Or months? I think the answer doesn't cover the rest of the question :) – user702846 – 2020-01-14T09:48:27.093

I find that the (sine, cosine) representation does not work so well for decision tree algorithms (try CatBoost). I have read at https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca that for decision tree algorithms we have to use a single-column representation, but a [sine, cosine] pair is not a single number and cannot be passed as one column, and the two-column representation is not great.

Can someone elaborate on how to pass (sine, cosine) cyclical features as one single column to decision tree algorithms?

– Ivan Shelonik – 2020-12-19T17:53:38.283

16

The answer depends on the kind of relationship that you want to represent between the time feature and the target variable.

If you encode time as numeric, then you are imposing certain restrictions on the model. For a linear regression model, the effect of time is now monotonic: the target will either increase or decrease with time. For decision trees, time values close to each other will be grouped together.

Encoding time as categorical gives the model more flexibility, but in some cases, the model may not have enough data to learn well. One technique that may be useful is to group time values together into some number of sets, and use the set as a categorical attribute.

Some example groupings:

• For month, group into quarters or seasons, depending upon the use case, e.g. Jan-Mar, Apr-Jun, etc.
• For hour-of-day, group into time-of-day buckets: morning, evening, etc.
• For day-of-week, group into weekday and weekend.

Each of the above can also be used directly as a categorical attribute, given enough data. Further, groupings can also be discovered by data analysis, to complement an approach based on domain knowledge.
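The example groupings above could be sketched as follows. The function names and the hour boundaries for each time-of-day bucket are assumptions chosen for illustration and should be adapted to the use case; the weekday convention (0 = Monday, 6 = Sunday) follows Python's `datetime.date.weekday()`.

```python
def month_to_quarter(month):
    """Group months 1..12 into calendar quarters 'Q1'..'Q4'."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return "Q" + str((month - 1) // 3 + 1)


def hour_to_part_of_day(hour):
    """Group hours 0..23 into coarse buckets (boundaries are assumptions)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"  # 22:00 to 05:59, so the bucket wraps around midnight


def day_to_week_part(weekday):
    """Group weekdays 0..6 (0 = Monday) into 'weekday' or 'weekend'."""
    if not 0 <= weekday <= 6:
        raise ValueError("weekday must be in 0..6")
    return "weekend" if weekday >= 5 else "weekday"


quarters = [month_to_quarter(m) for m in (1, 3, 4, 12)]
parts = [hour_to_part_of_day(h) for h in (23, 0, 7, 13, 19)]
```

The resulting labels can then be treated as a categorical attribute with far fewer levels than the raw month or hour. Note that the "night" bucket places hours 23 and 0 in the same category, which handles the wrap-around at midnight.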

8

I recommend using numerical features. Using categorical features essentially means that you don't consider the distance between two categories to be relevant (e.g. category 1 is as close to category 2 as it is to category 3). This is definitely not the case for hours or months.

However, the issue that you raise is that you want to represent hours and months in a manner where 12 is as close to 11 as it is to 1. In order to achieve that, I recommend going with what was suggested in the comments and using a sine/cosine function before using the hours/months as numerical features.
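The distance argument above can be checked with a small sketch that compares three encodings of the month. The helper names are hypothetical, one-hot encoding is assumed as the categorical representation, and Euclidean distance is used as the notion of closeness.

```python
import math


def one_hot(month, n_months=12):
    """One-hot vector for a month in 1..12."""
    vec = [0.0] * n_months
    vec[month - 1] = 1.0
    return vec


def on_circle(month, period=12):
    """(sine, cosine) pair for a month, using the angle 2*pi*month/period."""
    angle = 2.0 * math.pi * month / period
    return (math.sin(angle), math.cos(angle))


def distances(a, b):
    """Distance between months a and b under each encoding."""
    return {
        "numeric": float(abs(a - b)),
        "one_hot": math.dist(one_hot(a), one_hot(b)),
        "sin_cos": math.dist(on_circle(a), on_circle(b)),
    }


dec_nov = distances(12, 11)
dec_jan = distances(12, 1)
dec_jun = distances(12, 6)
```

Under one-hot encoding all three pairs are equally far apart. The raw numeric encoding puts December eleven steps from January but only one step from November. With the sine/cosine pair, December is equally close to November and January, and farthest from June.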

4

It depends on which algorithm you're using.

If you're using tree-based algorithms like random forest, you can skip this question. Categorical encoding isn't necessary for tree-based algorithms.

For other algorithms like neural networks, I suggest trying both methods (continuous and categorical). The effect differs from one situation to another.

It depends on the tree-based implementation. Widely used packages like scikit-learn and xgboost do not recognize categorical variables. You are expected to one-hot encode them. – Ricardo Cruz – 2017-03-24T09:15:28.913

1

According to this post: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159 you should not use one-hot encoding for anything based on decision trees, which is pretty much what I am finding out the hard way.

– ashley – 2019-04-04T20:47:27.373

2

To rephrase the answer provided by @raghu: one major difference between categorical and numerical features is whether the magnitudes of the numbers are comparable, i.e., is 2019 bigger than 2018, or December (12) bigger than March (3)? Not really. While there is a sequential order in these numbers, their magnitudes are not comparable. Thus, transforming them into categorical values may make more sense.

1

Because all the data you have is well defined, I would suggest a categorical encoding, which is also easier to apply.