KNN Regression: Distance function and/or vector representation for datetime features


Context: Trying to forecast some sort of consumption value (e.g. water) using datetime features and exogenous variables (like temperature).

Take some datetime features like week days (mon=1, tue=2, ..., sun=7) and months (jan=1, ..., dec=12).

A naive KNN regressor will judge that the distance between Sunday and Monday is 6, between December and January is 11, though it is in fact 1 in both cases.


import numpy as np

hours = np.arange(1, 25)
days = np.arange(1, 8)
months = np.arange(1, 13)

days
>>> array([1, 2, 3, 4, 5, 6, 7])
type(days)
>>> numpy.ndarray


A custom distance function is possible:

def distance(x, y, domain):
   direct = abs(x - y)
   round_trip = domain - direct
   return min(direct, round_trip)

Resulting in:

# weeks
distance(x=1, y=7, domain=7)
>>> 1
distance(x=4, y=2, domain=7)
>>> 2

# months
distance(x=1, y=11, domain=12)
>>> 2
distance(x=1, y=3, domain=12)
>>> 2

However, passing a custom distance function to scikit-learn's KNeighborsRegressor makes it slow, and I don't want to apply this distance to the other features anyway.


An alternative I was thinking of is using a tuple to represent coordinates in vector space, much like we represent the hours of the day on a round clock.

def to_coordinates(domain):
    """ Projects a linear range on the unit circle, 
        by dividing the circumference (c) by the domain size,
        thus giving every point equal spacing.
    """
    # circumference
    c = np.pi * 2
    # equal spacing 
    a = c / max(domain)
    # array of x and y
    return np.sin(a*domain), np.cos(a*domain)

Resulting in:

import matplotlib.pyplot as plt

x, y = to_coordinates(days)

# figure
plt.figure(figsize=(8, 8), dpi=80)

# draw unit circle
t = np.linspace(0, np.pi*2, 100)
plt.plot(np.cos(t), np.sin(t), linewidth=1)

# add coordinates
plt.scatter(x, y);

[Figure: linear domain to unit circle coordinates]

Clearly, this gets me the symmetry I am looking for when computing the distance.
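To make the symmetry concrete, here is a minimal sketch that computes the pairwise Euclidean distances between the encoded days. It re-declares a compact version of to_coordinates so it runs on its own; the names points and dist are only illustrative.

```python
import numpy as np

def to_coordinates(domain):
    # Map each value of a cyclic range onto the unit circle.
    angle = 2 * np.pi * domain / domain.max()
    return np.sin(angle), np.cos(angle)

days = np.arange(1, 8)                 # mon=1, ..., sun=7
x, y = to_coordinates(days)
points = np.column_stack([x, y])       # shape (7, 2): one row per day

# Pairwise Euclidean (chord) distances between all encoded days.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

sun_mon = dist[6, 0]   # Sunday  -> Monday
mon_tue = dist[0, 1]   # Monday  -> Tuesday
mon_thu = dist[0, 3]   # Monday  -> Thursday
```

Here sun_mon and mon_tue come out equal, and mon_thu is larger. Note that the chord length grows with the number of steps around the circle but not linearly, so it preserves the ordering of the cyclic distance rather than its exact values.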


Now what I cannot figure out is: what data type should I use to represent these vectors, so that the KNN regressor automatically calculates the distance? Perhaps an array of tuples, or a 2D NumPy array?


It becomes problematic as soon as I try to mix coordinates with other variables. Currently, the most intuitive attempt raises an exception:

data = df.values

Where df is:
[Figure: nested numpy arrays attempt]

The target variable, for simple demonstration purposes, is the categorical domain variable days.

TypeError                                 Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-112-a34d184ab644> in <module>
      1 neigh = KNeighborsClassifier(n_neighbors=3)
----> 2, days)

ValueError: setting an array element with a sequence.

I just want the algorithm to be able to process a new observation (a coordinate representing the day of the week and temperature) and find the closest matches. I am aware the coordinate is, of course, a direct representation of the target variable, and thus leaks the answer, but it's about enabling the math of the algorithm.
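The traceback above is consistent with an object-dtype array whose cells hold arrays, which cannot be cast to a plain float matrix. The sketch below reproduces that situation with NumPy alone and shows a flat layout that casts cleanly. The contents of df are not shown in the post, so the values and the names nested and flat are assumptions made for illustration.

```python
import numpy as np

# Assumed layout: one (x, y) coordinate pair and one temperature per row.
coords = [np.array([0.78, 0.62]), np.array([0.97, -0.22]), np.array([0.43, -0.90])]
temperature = np.array([18.5, 21.0, 19.2])

# Nested version: each coordinate pair is stored inside a single cell.
nested = np.empty((3, 2), dtype=object)
for i in range(3):
    nested[i, 0] = coords[i]
    nested[i, 1] = temperature[i]

try:
    nested.astype(float)           # a cell holding an array is not a scalar
    conversion_failed = False
except (TypeError, ValueError):
    conversion_failed = True

# Flat version: x, y and temperature each get their own column.
flat = np.column_stack([np.vstack(coords), temperature])   # shape (3, 3)
```

The flat array has a float dtype and one scalar per cell, which is the shape of input an estimator's fit method expects.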

Thank you in advance.

Robin Teuwens

Posted 2020-08-11T16:11:07.113

Reputation: 43

An alternative - it looks like there's a 'precomputed' option for distance, which will let you use the distance you "really" (?) want, and should not be slow, since there's no computation to be done. btw, I like your idea of converting to 2d (the unit circle), 2d numpy array would be the way to go here I think. There could be issues if you have both days and months, since the distance may not "know" to separate them - depends on the details of your setup. – bogovicj – 2020-08-11T18:17:17.127

In response to the attempt section, for this code, days) what are the shapes of data and days? Am I understanding that you're predicting temperature from datetime? – bogovicj – 2020-08-12T13:54:37.940

Thank you, @bogovicj , for pointing that out. I have edited the post to clarify.

Naively, I'd simply pass two columns for the algorithm to .fit(): day of week (int) and temperature (float).

However, this gets me in trouble due to the mentioned lack of symmetry (it will compute Monday-Sunday=6).

Instead, I try using coordinates. This gets me the desired symmetry but results in columns with nested arrays: coordinates (list of tuples/numpy array of numpy arrays) and temperature (float).

The last part is my hurdle. – Robin Teuwens – 2020-08-12T15:48:22.237

Could it be that the answer is dead simple, and that is just to split the days_coordinates (x, y) into separate columns days_x and days_y? Thereafter, If I want one feature to be more important than the other I can standardize all features first, and then multiply them by custom weights? – Robin Teuwens – 2020-08-12T16:25:43.597

Yes, putting days_x and days_y into separate columns is the way to go if you take the unit circle approach. On feature importance - my intuition is that re-weighting features as you said will do what you want if by "feature importance", you mean "how much each feature matters for determining distance". – bogovicj – 2020-08-12T17:09:02.680



I like your idea of converting to 2d (the unit circle); a 2d numpy array would be the way to go here. Specifically, try putting the days_x and days_y into separate columns if you take the unit circle approach.
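A minimal sketch of that layout, assuming made-up data and a hypothetical encode helper: the day of week becomes two columns (sine and cosine of its angle), the temperature is standardized, and per-column weights control how much each feature contributes to the distance. Only the temperature is standardized here, so that the circle keeps its shape; that is a design choice for this sketch, not a requirement.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: day of week (1..7), temperature, consumption.
day = np.array([1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7])
temperature = np.array([18.0, 19.5, 21.0, 22.5, 20.0, 17.5, 16.0,
                        19.0, 20.5, 23.0, 24.0, 21.5, 18.5, 17.0])
consumption = np.array([110.0, 112.0, 118.0, 121.0, 115.0, 130.0, 128.0,
                        111.0, 114.0, 122.0, 125.0, 117.0, 133.0, 129.0])

def encode(day, temperature, temp_mean, temp_std, weights):
    # Hypothetical helper: days_x, days_y and scaled temperature as columns.
    angle = 2 * np.pi * day / 7
    temp_scaled = (temperature - temp_mean) / temp_std
    features = np.column_stack([np.sin(angle), np.cos(angle), temp_scaled])
    return features * weights

temp_mean, temp_std = temperature.mean(), temperature.std()
weights = np.array([1.0, 1.0, 0.5])   # assumed weights: temperature counts half

X = encode(day, temperature, temp_mean, temp_std, weights)   # shape (14, 3)
knn = KNeighborsRegressor(n_neighbors=3).fit(X, consumption)

# New observation: a Sunday at 16.5 degrees, encoded the same way.
query = encode(np.array([7]), np.array([16.5]), temp_mean, temp_std, weights)
prediction = knn.predict(query)
```

The default Euclidean metric is used, so no custom distance function is needed; the cyclic structure lives entirely in the two day columns.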

An alternative idea - it looks like there's a 'precomputed' option for distance, which will let you use the distance you "really" want, and should not be slow, since there's no computation to be done.
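A sketch of the 'precomputed' route, assuming toy data and two hypothetical helpers (cyclic_distance and combined_distance). The distance matrices still have to be built up front, but that can be done with vectorized NumPy operations rather than a per-pair Python callback. How the day and temperature distances are combined below is an assumption, not something prescribed by scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def cyclic_distance(a, b, period):
    # Shortest way around a cycle, for every pair (a[i], b[j]).
    direct = np.abs(a[:, None] - b[None, :])
    return np.minimum(direct, period - direct)

def combined_distance(day_a, temp_a, day_b, temp_b):
    # Assumed combination: Euclidean norm of the per-feature distances.
    d_day = cyclic_distance(day_a, day_b, 7)
    d_temp = np.abs(temp_a[:, None] - temp_b[None, :])
    return np.sqrt(d_day ** 2 + d_temp ** 2)

# Toy training data: one row per week day, constant temperature.
days = np.arange(1, 8)
temps = np.full(7, 20.0)
target = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0])

# Training matrix must be square: (n_train, n_train).
D_train = combined_distance(days, temps, days, temps)
knn = KNeighborsRegressor(n_neighbors=3, metric="precomputed").fit(D_train, target)

# Query matrix has shape (n_queries, n_train).
q_days = np.array([7])
q_temps = np.array([20.0])
D_query = combined_distance(q_days, q_temps, days, temps)
prediction = knn.predict(D_query)
```

For the Sunday query, the three nearest rows are Sunday, Monday and Saturday, so the prediction is the mean of their targets. The training matrix grows quadratically with the number of samples, which is worth keeping in mind for larger datasets.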


Posted 2020-08-11T16:11:07.113

Reputation: 409