Let's say I'm trying to predict a person's electricity consumption, using the time of day as a predictor (hours 00-23), and further assume I have a hefty but finite amount of historical measurements.

Now, I'm trying to set up a linear model akin to

$power.used = \alpha* hr.of.day + \beta * temperature$

Problem: using $hr.of.day$ as a numerical value is a bad idea for several reasons. One is that 23 and 0 are actually adjacent hours, although that can be solved with a simple transformation [1]. Another is that electricity consumption is often bimodal over the day, and a simple transformation doesn't solve that.

A possible solution that works rather well is to treat the time of day as a categorical variable. That does the trick, but it suffers from a significant drawback in that there's no information sharing between neighbouring hours.

So what I'm asking is this: does anyone know of a "soft" version of categorical variables? I'm suggesting something quite loosely defined: ideally I would have some parameter, say $\lambda$ (not the coefficient $\alpha$ above), such that the model reduces to numerical regression when $\lambda = 1$, reduces to categorical regression when $\lambda = 0$, and behaves "in between" for other values.

Right now the only answer I can think of is to alter the weights in the regression in such a way that they tend towards zero the further away the quasi-categorical value is from the desired value. Surely there are other approaches?
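To make that weighting idea concrete, here is a minimal sketch of what I have in mind, assuming a Gaussian kernel on the circular distance between hours. The function name `soft_hour_encoding` and the `bandwidth` parameter are purely illustrative. Each observation gets 24 features that decay with distance from its own hour, so neighbouring hours share information.

```python
import numpy as np

def soft_hour_encoding(hour, bandwidth=1.0):
    """Return an (n, 24) matrix of kernel weights, one column per hour.

    Assumption: Gaussian kernel on circular hour distance; `bandwidth`
    (in hours) controls how much neighbouring hours overlap.
    """
    hour = np.asarray(hour, dtype=float).reshape(-1, 1)
    centers = np.arange(24, dtype=float).reshape(1, -1)
    diff = np.abs(hour - centers)
    circ = np.minimum(diff, 24.0 - diff)  # wrap around: 23 and 0 are 1 apart
    weights = np.exp(-0.5 * (circ / bandwidth) ** 2)
    # Normalise each row so the weights sum to 1
    return weights / weights.sum(axis=1, keepdims=True)

# Feature matrix for each hour of the day, with moderate smoothing
X = soft_hour_encoding(np.arange(24), bandwidth=1.5)
```

With a very small `bandwidth` this collapses to ordinary dummy variables (the categorical end), and larger values smooth across neighbouring hours. It does not, however, give me the plain numerical regression at the other extreme, so it is only half of what I'm after.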

[1] Introduce the hour variable as two new variables: $\cos(2\pi \cdot time.of.day/24)$ and $\sin(2\pi \cdot time.of.day/24)$
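A minimal sketch of the transformation in [1], assuming hours are given as numbers in 0-23; the helper name `encode_hour_cyclic` is just for illustration. The hour is mapped to an angle covering one full turn per 24 hours, so 23 and 0 end up as close together as 0 and 1.

```python
import numpy as np

def encode_hour_cyclic(hour):
    """Map hour of day (0-23) to (cos, sin) coordinates on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(hour, dtype=float) / 24.0
    return np.cos(angle), np.sin(angle)

cos_h, sin_h = encode_hour_cyclic(np.arange(24))

def encoded_distance(a, b):
    """Euclidean distance between two hours in the encoded space."""
    return np.hypot(cos_h[a] - cos_h[b], sin_h[a] - sin_h[b])
```

Both columns go into the regression together in place of the raw hour. On its own this only captures a single daily cycle, which is why it doesn't help with the bimodal shape.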

Did you consider using a non-linear regression method? Or do you need the coefficients for interpretation? – Alexander Bauer – 2016-05-28T10:01:19.290

Use both variables and let the model figure out the coefficients? – Emre – 2015-01-29T03:57:08.760