Is it valid to shuffle time-series data for a prediction task?

8

2

I have a time-series dataset that records some participants' daily features from wearable sensors and their daily mood status.

The goal is to use one day's daily features and predict the next day's mood status for participants with machine learning models such as linear regression.

I think cross-validation could be a good way to evaluate the model's performance. However, would it be fine to shuffle the data randomly?

Someone told me that because I am using a time-series dataset and I am trying to do a prediction task, shuffling the data randomly will cause some mix-up of future and past, which makes my model meaningless. However, I think I can still use the strategy of randomly shuffling the dataset because the learning model is not a time-series model and, for each step, the model only learns from exactly 1 label value instead of a series of labels.

Han

Posted 2019-06-21T17:48:55.183

Reputation: 83

what method will you use exactly? – Peter – 2019-06-21T18:15:05.103

Thanks! I am currently using a linear regression model and a multitask linear regression model. – Han – 2019-06-22T17:12:42.760

Answers

6

It depends on how you formulate the problem.

Let's say you have a time-series of measurements X and are trying to predict some derived series of values (mood) Y into the future:

X = [x0, x1, x2,.....]
Y = [y0, y1, y2,.....]

Now, if your model has no memory, that is, it is merely mapping xN -> yN, then the model does not care what order it sees x and y in. Feed-forward neural networks, linear regressors etc. are memory-less models.
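As a rough illustration of the memory-less case, the sketch below fits ordinary least squares on the same rows in their original order and in a shuffled order and compares the coefficients. The data is synthetic and the helper name `fit_ols` is made up for this example. It only shows that the order of the training rows does not change the fit; it says nothing about how the rows should be split for evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for daily features and mood (assumed data, not the asker's).
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [b0, b1, b2, b3]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


# Apply the same permutation to X and y so each row keeps its own label.
perm = rng.permutation(len(X))
coef_ordered = fit_ols(X, y)
coef_shuffled = fit_ols(X[perm], y[perm])

# The two fits agree up to floating-point error.
same_fit = np.allclose(coef_ordered, coef_shuffled)
print(same_fit)
```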

But if your model has memory, that is, it is mapping a sequence of features to a sequence of moods, xN-2, xN-1, xN... -> yN-2, yN-1, yN..., then the ordering matters. In this case the model keeps track of what it has seen before: its internal state changes and persists with each new example it sees, so the current prediction depends on the previous one. Recurrent neural networks have memory, so order matters.

You can get around the memory requirement by restructuring your dataset. You can concatenate consecutive features and map them to next day's mood instantaneously (xN-2, xN-1, xN) -> yN. In this way, your input feature will incorporate information about the past, but the model won't care about the order since all the temporal information is encoded in the new feature, and not the model. Your new dataset will look like:

Xnew = [(x0, x1, x2), (x1, x2, x3), (x2, x3, x4),...]
Y    = [         y2,           y3,           y4,...]
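A minimal sketch of that restructuring with NumPy, assuming one participant's days are already in chronological order. The function name `make_windows` and the `width` parameter are invented for this example.

```python
import numpy as np


def make_windows(x, y, width=3):
    """Stack `width` consecutive feature rows and pair each stack
    with the label of the last day in the window."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D series as one feature per day
    n = len(x) - width + 1
    # Row i holds x[i], x[i+1], ..., x[i+width-1] flattened into one vector.
    x_new = np.stack([x[i:i + width].ravel() for i in range(n)])
    y_new = y[width - 1:]
    return x_new, y_new


# Toy series: x0..x5 = 0..5 and y0..y5 = 0, 10, ..., 50.
x = np.arange(6)
y = np.arange(6) * 10
x_new, y_new = make_windows(x, y, width=3)
print(x_new)  # first row is (x0, x1, x2)
print(y_new)  # first label is y2
```

With several participants, the windows would need to be built per participant so that no window spans two people. Note also that neighbouring rows of Xnew share days, so rows that end up in different cross-validation folds are not independent of each other.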

hazrmard

Posted 2019-06-21T17:48:55.183

Reputation: 273

Thank you, really helpful! – Han – 2019-06-22T17:10:06.477

1

You say you use linear regression. There are two possibilities here: modelling the time dependence with or without a "lag" of y (see also hazrmard's answer).

In the case of a person's "mood", you may assume that today's mood also depends (among other things) on yesterday's mood. So I think adding (or at least testing) a lag term, as in y(t) = b0 + b1 X(t) + b2 y(t-1), may be useful.

In this case I would not randomly shuffle data since a sequence in time is relevant for model training and testing as well.
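A rough sketch of such a lagged regression with a chronological train/test split instead of a random shuffle. The series is simulated, the coefficients 0.5, 1.0 and 0.6 are arbitrary values picked for the simulation, and the 80/20 split is just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated single-participant series (assumed data): mood depends on
# today's feature and on yesterday's mood.
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 + 1.0 * x[t] + 0.6 * y[t - 1] + rng.normal(scale=0.1)

# Design matrix for y(t) = b0 + b1*x(t) + b2*y(t-1).
# The first day is dropped because it has no previous mood.
A = np.column_stack([np.ones(n - 1), x[1:], y[:-1]])
target = y[1:]

# Chronological split: fit on the earlier 80%, evaluate on the later 20%.
split = int(0.8 * len(target))
A_train, A_test = A[:split], A[split:]
t_train, t_test = target[:split], target[split:]

b, *_ = np.linalg.lstsq(A_train, t_train, rcond=None)
test_mse = float(np.mean((A_test @ b - t_test) ** 2))
print(b, test_mse)
```

Every test row here comes after every training row, so the model is never evaluated on days that precede the ones it was fitted on.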

Peter

Posted 2019-06-21T17:48:55.183

Reputation: 4 724