Quasi-categorical variables - any ideas?



Let's say I'm trying to predict a person's electricity consumption, using the time of day as a predictor (hours 00-23), and further assume I have a hefty but finite amount of historical measurements.

Now, I'm trying to set up a linear model akin to

$power.used = \alpha \cdot hr.of.day + \beta \cdot temperature$

Problem: using $hr.of.day$ as a numerical value is a very bad idea for many reasons. The fact that 23 and 0 are actually quite close values is one problem, which can be solved with a simple transformation [1]. The fact that electrical consumption is often bi-modal over the day is another problem, which isn't solved by a simple transformation.

A possible solution that works rather well is to treat the time of day as a categorical variable. That does the trick, but it suffers from a significant drawback in that there's no information sharing between neighbouring hours.

So what I'm asking is this: does anyone know of a "soft" version of categorical variables? I'm suggesting something quite loosely defined: ideally I would have some parameter alpha that reduces the regression to numerical regression when $\alpha = 1$, reduces it to categorical regression when $\alpha = 0$, and behaves "in between" for any other value.

Right now the only answer I can think of is to alter the weights in the regression in such a way that they tend towards zero the further away the quasi-categorical value is from the desired value. Surely there are other approaches?
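One possible reading of the weighting idea above, as a rough sketch only: replace the hard one-hot encoding of the hour with weights that decay with circular distance from the observed hour. The Gaussian kernel, the `bandwidth` parameter and the function name `soft_one_hot` are assumptions made for illustration, not an established method from this thread. A very small bandwidth approaches plain categorical (one-hot) coding, while a larger one shares information between neighbouring hours.

```python
import math

def soft_one_hot(hour, bandwidth, period=24):
    """Return `period` weights that decay with circular distance from `hour`.

    `bandwidth` is a hypothetical smoothing parameter: close to 0 gives an
    (almost) hard one-hot vector, larger values spread weight to neighbours.
    """
    weights = []
    for k in range(period):
        d = abs(hour - k) % period
        d = min(d, period - d)  # shortest way around the clock
        weights.append(math.exp(-0.5 * (d / bandwidth) ** 2))
    total = sum(weights)
    return [w / total for w in weights]  # normalise so the weights sum to 1

smooth = soft_one_hot(23, bandwidth=1.0)  # hours 22 and 0 get equal weight
narrow = soft_one_hot(23, bandwidth=0.1)  # nearly all weight on hour 23
```

The resulting 24 weights would then replace the 24 dummy columns in the regression.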

[1] introduce the hour variable as two new variables: $\cos(2\pi \cdot time.of.day/24)$ and $\sin(2\pi \cdot time.of.day/24)$

Uri Merhav

Posted 2015-01-28T13:42:15.430

Reputation: 206

Did you consider using a non-linear regression method? Or do you need the coefficients for interpretation? – Alexander Bauer – 2016-05-28T10:01:19.290

Use both variables and let the model figure out the coefficients? – Emre – 2015-01-29T03:57:08.760



I would suggest using the idea of so-called 'fuzzy clustering', where you put each hour-of-day value into several clusters at the same time. Details in this tutorial: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html

The idea is trivial:

You decide how many clusters you want to have. For example, 4 (so you divide your day hours into 4 categories). Instead of computing just 1 number (which defines cluster membership) for each of your day hours, you compute 4 numbers which represent the degree of membership in each of the 4 clusters. So, for example, if your 4 clusters contain the periods 12 AM - 6 AM, 6 AM - 12 PM, 12 PM - 6 PM and 6 PM - 12 AM, then you would replace the 4 AM hour in the original data with a vector of 4 numbers: the first one is the biggest, the second is smaller, the third is the smallest, etc.

Then you could use these 4 numbers in your model to fit a regression line.

Of course, if you want, you could use 24 clusters, in which case each hour of the day would have a high 'relation' with nearby hours and almost 0 with distant hours.
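A minimal sketch of how such membership numbers could be computed, assuming the standard fuzzy c-means membership formula with fuzzifier `m = 2`, fixed cluster centres at the midpoints of the four periods, and distance measured around the clock. The centres, the function names and the choice of `m` are illustrative assumptions; in real fuzzy clustering the centres would be learned from the data.

```python
import math

def circular_distance(h1, h2, period=24.0):
    """Shortest distance between two hours on a 24-hour clock."""
    d = abs(h1 - h2) % period
    return min(d, period - d)

def fuzzy_memberships(hour, centers, m=2.0, period=24.0):
    """Fuzzy c-means style membership of `hour` in each cluster centre.

    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)); the memberships sum to 1.
    """
    dists = [circular_distance(hour, c, period) for c in centers]
    # An hour sitting exactly on a centre belongs fully to that cluster.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]

# Assumed centres: midpoints of 12 AM-6 AM, 6 AM-12 PM, 12 PM-6 PM, 6 PM-12 AM.
centers = [3.0, 9.0, 15.0, 21.0]
features = fuzzy_memberships(4.0, centers)  # four numbers for the 4 AM hour
```

For 4 AM the first membership is the largest and the third (12 PM - 6 PM) the smallest, matching the description above; the four numbers can then be used as regression inputs in place of the raw hour.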

Maksim Khaitovich

Posted 2015-01-28T13:42:15.430

Reputation: 383

Interesting. On a somewhat different problem I ended up using an expectation maximization approach, where you ascribe a "probability of belonging to a cluster" to every sample rather than a strict assignment, and a "probability of electrical output given that you belong to said cluster", and maximize over the probabilities.

Thanks for the link (I'd upvote if I had the reputation for it), I'll look into it... – Uri Merhav – 2015-01-29T09:41:02.003

Probability also makes sense, though I find it overkill for simple tasks. When I face similar tasks that involve regression and where the hour of day matters, I usually rely on a simple 'hours to midnight' transformation; in the majority of problems out there it works best of all. I just calculate the number of hours between the current hour and midnight (if hour>12 return 24 - hour, else return hour). More complicated transformations usually cause my models to overfit. – Maksim Khaitovich – 2015-01-29T16:09:23.243
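The 'hours to midnight' rule from the comment above, written out as a small sketch. The function name is made up for illustration; the rule itself is exactly the one stated in the comment.

```python
def hours_to_midnight(hour):
    """Number of hours between `hour` (0-23) and the nearest midnight."""
    return 24 - hour if hour > 12 else hour

# 11 PM and 1 AM both end up one hour away from midnight.
late = hours_to_midnight(23)   # 1
early = hours_to_midnight(1)   # 1
noon = hours_to_midnight(12)   # 12
```

Note that this folds the day in half, so morning and evening hours at the same distance from midnight receive the same value.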


A few things,

1) Have you determined whether the relationship between hr_of_day and power_used is statistically significant?

I recommend computing Kendall's tau correlation if you haven't already. I like Kendall's tau because it handles non-linear (monotonic) relationships and can be read as the probability that a pair of observations is concordant minus the probability that it is discordant.

2) Also, I would check whether temperature and hour are related. If there is multicollinearity you might need to reexamine the factors applied.

3) If you know that the independent variable fits a bimodal distribution, then run some P-P plots against known bimodal distributions (such as 'beta distributions'). It would also be interesting to extract the potential Gaussian distributions underlying the population, running two OLS models sequentially.
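For checks 1) and 2), a rough pure-Python sketch of Kendall's tau (the simple tau-a variant, which ignores tie corrections) might look like the following. The toy numbers are invented purely to show the behaviour and are not measurements. Libraries such as SciPy provide a tested implementation (`scipy.stats.kendalltau`) that also reports a p-value.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # s == 0 is a tie and counts for neither side in tau-a
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy, made-up values: a monotonic but non-linear relationship ...
tau_monotonic = kendall_tau_a([1, 2, 3, 4], [1, 4, 9, 16])    # 1.0
# ... and a symmetric V-shaped one, where tau comes out as 0.
tau_v_shape = kendall_tau_a([0, 1, 2, 3, 4], [2, 1, 0, 1, 2])  # 0.0
```

The second case is worth keeping in mind for hour of day: tau measures monotonic association, so a strong but non-monotonic daily pattern can still score close to zero.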

In the end, you may be better off working from time-series analysis, where the function is expressed in terms of hour. But I wouldn't go adding your own factors: most regression analyses will "tell" you that something is missing. The way to tell is if the model coefficients are not statistically significant or the amount of explained variance is small. But I reiterate that coercing the hour to a cosine or sine function implies a relationship between the dependent and independent variable that may not be real.

It seems like you might want to go back and do some (more) due diligence before throwing it into an ML model.

Dan Temkin

Posted 2015-01-28T13:42:15.430

Reputation: 181


The most logical way to transform hour is into two variables that swing back and forth out of sync. Imagine the position of the end of the hour hand of a 24-hour clock. The x position swings back and forth out of sync with the y position. For a 24-hour clock you can accomplish this with $x=\sin(2\pi \cdot hour/24)$, $y=\cos(2\pi \cdot hour/24)$.

You need both variables or the proper movement through time is lost. Either sin or cos alone takes the same value at two different hours, and its rate of change varies over the day, whereas the (x,y) position varies smoothly, at constant speed, as it travels around the unit circle.
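A small sketch of this encoding, using only the standard library. The function names `encode_hour` and `distance` are illustrative choices, not from any particular library.

```python
import math

def encode_hour(hour, period=24.0):
    """Map an hour to its (x, y) = (sin, cos) position on the unit circle."""
    angle = 2.0 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

def distance(h1, h2):
    """Straight-line distance between two encoded hours."""
    x1, y1 = encode_hour(h1)
    x2, y2 = encode_hour(h2)
    return math.hypot(x1 - x2, y1 - y2)

# 23:00 and 00:00 end up close together; 00:00 and 12:00 are farthest apart.
near = distance(23, 0)
far = distance(0, 12)

# sin alone cannot tell 03:00 from 09:00; the cos component separates them.
x3, y3 = encode_hour(3)
x9, y9 = encode_hour(9)
```

Here `x3` and `x9` coincide (up to rounding) while `y3` and `y9` have opposite signs, which is why both columns are needed.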

Finally, consider whether it is worthwhile to add a third feature to trace linear time, which can be constructed as hours (or minutes or seconds) from the start of the first record, or as a Unix time stamp or something similar. These three features then provide proxies for both the cyclic and linear progression of time, e.g. you can pull out cyclic phenomena like sleep cycles in people's movement and also linear growth like population vs. time.
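Putting the three features together could look roughly like this. The function name `time_features` and the example dates are made up for illustration; "hours since the first record" is just one of the linear-time options mentioned above.

```python
import math
from datetime import datetime

def time_features(timestamp, first_record):
    """Return (sin, cos, elapsed_hours) for one timestamp.

    sin/cos capture the position within the day; elapsed_hours is the
    linear time since `first_record`.
    """
    hour = timestamp.hour + timestamp.minute / 60.0
    angle = 2.0 * math.pi * hour / 24.0
    elapsed_hours = (timestamp - first_record).total_seconds() / 3600.0
    return math.sin(angle), math.cos(angle), elapsed_hours

# Hypothetical records: the same clock time on two consecutive days.
first = datetime(2015, 1, 1, 0, 0)
a = time_features(datetime(2015, 1, 1, 6, 0), first)
b = time_features(datetime(2015, 1, 2, 6, 0), first)
```

The two records share the same cyclic components but differ in the linear one (6 vs. 30 hours), so a model can separate the daily pattern from any long-term trend.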

Juan Esteban de la Calle

Posted 2015-01-28T13:42:15.430

Reputation: 2 102