Predictive modeling on time series: How far should I look back?


I am building a classification model on a dataset collected by recording a system's behaviour over a two-year period. The model will be used in the same system in real time.

Right now I am using the whole dataset (2 years) to build my classifier, but I suspect this might not be the right approach. Since I am trying to model the behaviour of a system in real time, it is possible that older data points have become irrelevant or uninformative relative to the system's current environment (for example, the distribution of the system's inputs may have drifted drastically over time).

My question is: how can I determine which part of the dataset to use for training (for example, the last 1.5 years instead of all 2 years)? Is there a statistical way to decide that a specific time period is not helping, or is possibly hurting, the model's ability to correctly classify more recent data points?


Posted 2018-05-30T13:41:39.417

Reputation: 33

Consider the time period a model hyperparameter and optimize it using a holdout set drawn from the most recent data. – Timofey Chernousov – 2018-05-31T10:35:07.467



Hold out the most recent block of data (something like a month) to use as your validation set.

Try multiple models and different ways of training them: for example, training only on the most recent couple of months versus the whole two-year period.

Take a small number of your top performers, test them on the validation set, and see which does better.

There's nothing special about this answer. Any time you want to know whether X is better than Y, measuring performance on a held-out validation set is almost always the way to go.
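A minimal sketch of the procedure above, assuming a hypothetical drift scenario with synthetic data. The nearest-centroid classifier is a deliberately simple stand-in for whatever model you actually use, and all numbers (730 days, a drift at day 365, a 30-day holdout) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scenario: 2 years (730 days) of labelled data, one
# observation per day, with a drastic input drift after day 365
# (class 1 jumps from mean +2 to mean -2 in both features).
n_days = 730
X = rng.normal(size=(n_days, 2))
y = rng.integers(0, 2, size=n_days)
X[y == 1] += 2.0
X[(np.arange(n_days) >= 365) & (y == 1)] -= 4.0

# Hold out the most recent month as the validation set.
X_val, y_val = X[n_days - 30:], y[n_days - 30:]

def centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Nearest-centroid classifier: a simple stand-in model."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_va - c1, axis=1)
            < np.linalg.norm(X_va - c0, axis=1)).astype(int)
    return float((pred == y_va).mean())

# Treat the lookback window as a hyperparameter: train on the last
# `lookback` days before the holdout and score on the holdout.
results = {}
for lookback in (730, 365, 180, 90):
    start = max(0, n_days - 30 - lookback)
    acc = centroid_accuracy(X[start:n_days - 30], y[start:n_days - 30],
                            X_val, y_val)
    results[lookback] = acc
    print(f"lookback={lookback:3d} days  validation accuracy={acc:.2f}")
```

Under this synthetic drift, the windows that exclude the pre-drift year should score noticeably higher than the full-history window; that gap is exactly the signal you would use to justify dropping the older data.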


Reputation: 1 637