How to methodologically show that apparently 'time-series/sequential' data is not really sequential?

5

I have an apparently time-series/sequential dataset (supervised, multi-class classification problem) in which each data point is time-stamped. However, domain intuition tells me that the data is static and need not be treated as time-series data. I have also verified that a non-time-series method fits the data well (even with rigorous cross-validation and other good practices). However, I want to show this in a more rigorous, methodological way, preferably in Python, though R is also fine if there is no alternative. Is there any method or test that can validate my hypothesis?

Thanks.

dbm

Posted 2019-01-29T23:52:17.133

Reputation: 191

You could try computing the autocorrelation of the signal you think is static. By autocorrelation I mean the cross-correlation of signal $X$ with itself. Low autocorrelation values could imply that the signal is static, or that there are no cycles or trends in the time series. Here is some info about autocorrelation: https://ipython-books.github.io/103-computing-the-autocorrelation-of-a-time-series/

– maksylon – 2019-01-30T11:24:50.060

What is the input of the classifier? Do you change the $(x, t) = ((t_1, x_1), (t_2, x_2), ..., (t_n, x_n))$ sequence to $x = (x_1, x_2,...,x_n)$ and feed $x$ to the classifier? Or do you use only one data point $(t_i, x_i)$, discard its time, and feed $x_i$ to the classifier? – Esmailian – 2019-02-28T11:14:05.160

Answers

1

TL;DR: There is no off-the-shelf formula to determine if you should go for time-series or not. Hypothesis testing is the most formal approach, AFAIK.

Just to find common ground, let's first distinguish between using a timestamp and dealing with a time-series.

Timestamp data itself may contain lots of information that could be quite useful in many domains. For instance, hour, day of week and month are widely used in demand forecasting applications. However, using the timestamp in this fashion does not make it a time-series approach. Here, time is just another feature.
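To illustrate the "time as just another feature" idea, here is a minimal sketch using only the standard library. The helper name `calendar_features` and the sample timestamp are made up for illustration; which calendar fields matter depends on your domain.

```python
from datetime import datetime

def calendar_features(timestamp):
    """Turn an ISO-8601 timestamp string into plain calendar features."""
    t = datetime.fromisoformat(timestamp)
    # weekday() counts from Monday = 0 to Sunday = 6
    return {"hour": t.hour, "day_of_week": t.weekday(), "month": t.month}

# Hypothetical timestamp, used only as an example input
features = calendar_features("2019-01-29T23:52:17")
print(features)
```

Features like these can be fed to any static classifier; no ordering between rows is assumed.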

When dealing with time-series, one would expect to benefit from temporal dependencies present in the data (as well as from spatial dependencies). BTW, most of the methods are limited to evenly spaced time-series. If you have timestamps of some arbitrary events, you would have to do pre-processing/aggregation before being able to treat them as a time-series.
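One possible form of that aggregation step is to count events per fixed-width bucket, so that irregular event timestamps become an evenly spaced series. The sketch below assumes hourly buckets and event counts as the quantity of interest; `hourly_counts` and the event times are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(timestamps):
    """Count events per hour, including empty hours, from the first to the last event."""
    hours = [t.replace(minute=0, second=0, microsecond=0) for t in timestamps]
    counts = Counter(hours)
    series = []
    current, end = min(hours), max(hours)
    while current <= end:
        series.append((current, counts.get(current, 0)))  # empty hours become 0
        current += timedelta(hours=1)
    return series

# Hypothetical, irregularly spaced event times
events = [datetime.fromisoformat(s) for s in (
    "2019-01-29T09:05:00",
    "2019-01-29T09:40:00",
    "2019-01-29T11:15:00",
    "2019-01-29T12:59:59",
)]
series = hourly_counts(events)
for hour, count in series:
    print(hour.isoformat(), count)
```

The bucket width is a modelling choice: too narrow gives mostly zeros, too wide hides short-term structure.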

Should you treat your data as a time-series?

I would suggest considering two basic hypotheses (i.e., try proving the opposite if you believe your non-time-series model is just right):

  • (1) there is a time-domain relationship between observations;
  • (2) it is indeed relevant to your problem.

First, you may try to investigate (1) by checking whether there is a time-series model that explains your observations well. As suggested in the comments, an autocorrelation test is a way to get a clue about whether there is a linear relationship between observations through time.
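A minimal sketch of such an autocorrelation check, assuming a single numeric, evenly spaced series (for a classification problem this could be one numeric feature, or the residuals of the static model). The helper `sample_acf` and both synthetic series are illustrative; the ±1.96/√n band is the usual large-sample approximation for white noise.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D series at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(0)
n = 500
noise = rng.standard_normal(n)      # synthetic series with no temporal structure
ar1 = np.zeros(n)                   # synthetic AR(1) series: each value depends on the previous one
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

band = 1.96 / np.sqrt(n)            # approximate 95% band under the white-noise null
acf_noise = sample_acf(noise, 10)
acf_ar1 = sample_acf(ar1, 10)
print("white noise, lags outside band:", int(np.sum(np.abs(acf_noise) > band)))
print("AR(1),       lags outside band:", int(np.sum(np.abs(acf_ar1) > band)))
```

If most lags of your series stay inside the band, there is little evidence of linear dependence through time. This says nothing about non-linear dependence.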

You may wish to go beyond linearity, and try to build a non-linear predictor by using a time delay (TDNN) or a recurrent (RNN) neural model.

Before diving deep into pure statistics for a formal answer, I would suggest a simple test: train your best predictor model on the real time-series and also on a randomly shuffled (broken) copy of it. If there is no significant difference in performance, there is probably nothing in the time domain but noise or some sort of complex chaos.
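The shuffle test can be sketched as a small permutation test. Everything here is an assumption made for illustration: the synthetic noisy sine series, the linear model on three lagged values standing in for "your best predictor", and the helper name `lagged_mse`.

```python
import numpy as np

def lagged_mse(series, n_lags=3, train_frac=0.7):
    """Fit a linear model on the previous n_lags values; return held-out MSE."""
    s = np.asarray(series, dtype=float)
    # Row j holds s[j], ..., s[j + n_lags - 1]; the target is s[j + n_lags]
    X = np.column_stack([s[i:len(s) - n_lags + i] for i in range(n_lags)])
    X = np.column_stack([X, np.ones(len(X))])   # intercept column
    y = s[n_lags:]
    split = int(train_frac * len(y))
    coef, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    resid = y[split:] - X[split:] @ coef
    return float(np.mean(resid ** 2))

rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(0.2 * t) + 0.1 * rng.standard_normal(len(t))  # synthetic ordered signal

mse_ordered = lagged_mse(series)
# Null distribution: the same model trained on shuffled copies of the series
shuffled_mses = [lagged_mse(rng.permutation(series)) for _ in range(200)]
# Share of shuffles that do at least as well as the real ordering
p_value = (1 + sum(m <= mse_ordered for m in shuffled_mses)) / (1 + len(shuffled_mses))
print(f"ordered MSE={mse_ordered:.4f}, mean shuffled MSE={np.mean(shuffled_mses):.4f}, p={p_value:.3f}")
```

A small p-value means the ordering carries information the model can use. A large one is consistent with static data, at least as far as this particular model can tell.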

For instance, a time-series of a chaotic process like x(t+1) = 4 * (1-x(t))*x(t) looks quite meaningless, but a very simple non-linear model will be able to predict its next value perfectly. On the other hand, good luck trying to predict a randn() generator, which is also a chaotic time-series :-).
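A small sketch of that contrast, assuming a quadratic fit of x(t+1) on x(t) as the "very simple non-linear model"; the helper `one_step_mse`, the starting value 0.2 and the series length are arbitrary choices.

```python
import numpy as np

def one_step_mse(series, degree=2, train_frac=0.7):
    """Fit a polynomial x(t) -> x(t+1) on the first part; return MSE on the rest."""
    s = np.asarray(series, dtype=float)
    x, y = s[:-1], s[1:]
    split = int(train_frac * len(x))
    coefs = np.polyfit(x[:split], y[:split], degree)
    pred = np.polyval(coefs, x[split:])
    return float(np.mean((y[split:] - pred) ** 2))

n = 1000
logistic = np.empty(n)
logistic[0] = 0.2
for t in range(n - 1):
    # The chaotic map from the text: x(t+1) = 4 * (1 - x(t)) * x(t)
    logistic[t + 1] = 4.0 * (1.0 - logistic[t]) * logistic[t]

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(n)   # stand-in for the randn() generator

mse_logistic = one_step_mse(logistic)
mse_gaussian = one_step_mse(gaussian)
print("logistic map, one-step MSE:", mse_logistic)
print("gaussian noise, one-step MSE:", mse_gaussian)
```

The near-zero error holds only one step ahead: iterating the fitted model over many steps diverges from the true trajectory, because the map is sensitive to tiny errors.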

If you cannot reject (1), i.e. there is strong evidence of time-domain correlations, then you should verify whether (2) is true. This could be as easy or as hard as it gets, depending on your domain. More formally, I'd try to find out if the estimates of

P[Y(t) | X(t), X(t-1), ..., X(t-n)] # predictions of the target variable given current and past observations

are as good as

P[Y(t) | X(t)]. # predictions of the target variable given only the current observation

You could try to answer it empirically by comparing the performance of time-series vs. non-time-series models over a sufficient number of trials, or by using some specific knowledge of your domain.

E.g., P[cat | pixels of the current picture] = P[cat | pixels of the current picture, pixels of the previous picture], which may or may not be true, depending on whether the pictures were taken in the same or different households.
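The empirical comparison of P[Y(t) | X(t)] against P[Y(t) | X(t), ..., X(t-n)] can be sketched as follows. This is only an illustration under strong assumptions: binary rather than multi-class labels, synthetic data, and a least-squares linear classifier standing in for whatever model you actually use. `add_lags` and `holdout_accuracy` are hypothetical helpers.

```python
import numpy as np

def add_lags(X, n_lags):
    """Stack X(t), X(t-1), ..., X(t-n_lags) side by side; drops the first n_lags rows."""
    n = len(X)
    return np.column_stack([X[n_lags - k:n - k] for k in range(n_lags + 1)])

def holdout_accuracy(features, labels, train_frac=0.7):
    """Least-squares linear classifier on 0/1 labels; accuracy on the later part of the data."""
    F = np.column_stack([features, np.ones(len(features))])   # intercept column
    split = int(train_frac * len(F))
    w, *_ = np.linalg.lstsq(F[:split], labels[:split].astype(float), rcond=None)
    pred = (F[split:] @ w > 0.5).astype(int)
    return float(np.mean(pred == labels[split:]))

rng = np.random.default_rng(0)
n, n_lags = 2000, 2
X = rng.standard_normal((n, 1))
prev = np.concatenate([[0.0], X[:-1, 0]])       # X(t-1), with 0 before the first sample
y_static = (X[:, 0] > 0).astype(int)            # label depends on X(t) only
y_dynamic = (X[:, 0] + prev > 0).astype(int)    # label depends on X(t) and X(t-1)

results = {}
for name, y in (("static", y_static), ("dynamic", y_dynamic)):
    acc_current = holdout_accuracy(X[n_lags:], y[n_lags:])         # P[Y(t) | X(t)]
    acc_lagged = holdout_accuracy(add_lags(X, n_lags), y[n_lags:])  # P[Y(t) | X(t), ..., X(t-n)]
    results[name] = (acc_current, acc_lagged)
    print(f"{name}: current-only={acc_current:.3f}, with lags={acc_lagged:.3f}")
```

When the lagged model is no better than the current-only one across repeated splits or seeds, that supports treating the data as static. A single split like this one is not enough to call it significant.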

M0nZDeRR

Posted 2019-01-29T23:52:17.133

Reputation: 186