One-class classifier for time series data classification

My problem is different from the common time series data problem. What I need to do is check if future time series data is in accord with previous time series data I already consider to be correct.

In short, I need a one-class classifier applied to time series data of variable length (ranging from 110 to 125 points).

Does anyone know of any library that might help me with this task? I usually use Python + scikit-learn, but I'm open to using any other setup.

Could you explain what you mean by "if future time series data is in accord with previous time series data"? – Shagun Sodhani – 2016-05-18T12:36:42.647

Suppose I have a set of 30 time series. The next time I receive a new series, I have to check whether it belongs to the same class as the 30 sequences I already have, but I don't have to add the new one to the set. – Minoru – 2016-05-19T15:38:28.367

Do you also have an opposite class, i.e. one that isn't that of the first 30 sequences? – K3---rnc – 2016-05-19T16:53:34.160

No, I don't. The only data I have previously is the 30 sequences that are from the positive class. – Minoru – 2016-05-20T17:14:50.470

I think it'll be hard or even impossible to detect a useful pattern if you only have 30 samples and ~100 features. You could try some smart manual feature engineering to reduce the 100 features down to 2 or 3 meaningful ones, and then use a one-class SVM, local outlier factor, or Gaussian mixture model. – stmax – 2016-05-24T10:44:25.030
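A sketch of that suggestion, using synthetic stand-in data (the three summary features here are arbitrary examples, not a recommendation for your equipment data):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

def summarize(ts):
    # Hand-crafted reduction of a variable-length series to 3 features.
    return [ts.mean(), ts.std(), ts.max() - ts.min()]

# 30 synthetic stand-ins for the "good" series (110-125 points each).
train = [rng.normal(0.0, 1.0, rng.integers(110, 126)) for _ in range(30)]
X_train = np.array([summarize(ts) for ts in train])

# novelty=True lets LOF score unseen samples via predict().
lof = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(X_train)

pred_ok = lof.predict([summarize(rng.normal(0.0, 1.0, 118))])[0]
pred_bad = lof.predict([summarize(rng.normal(5.0, 3.0, 118))])[0]
# +1 = fits the one class, -1 = outlier
```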

What kind of time series are you analyzing? Address randomness, independence, and stationarity; if you don't know, tell us more about your data. That very much determines how you'd look for an answer to your question of whether the future is "in accord" with the past. – Pete – 2016-05-25T16:42:46.657

@Pete The data is collected from a piece of equipment whose configuration does not change. The variability of the data can be described as a stochastic process. – Minoru – 2016-05-31T15:58:25.387

Apart from the approach @Rolf Schorpion mentioned, there are others. For example, you could use a deep neural network, specifically, an auto-encoder (see here for a tutorial).
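As an illustration of the auto-encoder idea (a minimal sketch only: it uses scikit-learn's MLPRegressor as a crude stand-in for a real auto-encoder framework, synthetic sine data in place of your series, resampling to a fixed length, and an arbitrary threshold):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def resample(ts, n=100):
    # Interpolate a variable-length series onto a fixed-length grid.
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(ts)), ts)

def make_series():
    # Synthetic stand-in for one "good" series (110-125 noisy sine points).
    m = rng.integers(110, 126)
    return np.sin(np.linspace(0, 6, m)) + rng.normal(0, 0.05, m)

train = [make_series() for _ in range(30)]
X = np.array([resample(ts) for ts in train])

# A small MLP trained to reproduce its own input acts as a crude auto-encoder.
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=3000, random_state=0).fit(X, X)

def recon_error(ts):
    v = resample(ts)
    return float(np.mean((ae.predict(v.reshape(1, -1))[0] - v) ** 2))

# Series that reconstruct much worse than anything in training are flagged.
threshold = 3 * max(recon_error(ts) for ts in train)
is_anomaly = recon_error(rng.normal(0, 1, 118)) > threshold
```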

But there's an important catch to all purely "data-driven" approaches: if the figure of 30 time series you mention in the comments is a typical order of magnitude for your training set, the results will be more or less arbitrary. If you don't define "accordance" in any way, this is a classification problem with only positive training data which consists of 30 data points with more than 100 features each.

Unless your data is very, very special (e.g., all time series are identical), there is just too much freedom in the solution of this problem. Different algorithms (or different parameter sets for the same algorithm) will use this freedom in very different ways. So you will probably see very different solutions when you experiment with different methods.

So if you don't want to run more or less arbitrary experiments and pick from the solutions afterwards, you have to use a method in which you somehow define the meaning of "in accordance" in advance. You could try traditional time series techniques to find a common model that fits each of the series well, and then postulate that "accordance" means a good fit to this model.
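For instance, such a common model could be a low-order autoregressive fit pooled over all training series, with "accordance" defined as a residual variance comparable to the training data. A sketch with synthetic stand-in data (the AR(2) form and the 2x threshold are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def ar2_design(ts):
    # Lagged design matrix for x_t ~ a*x_{t-1} + b*x_{t-2} + c
    X = np.column_stack([ts[1:-1], ts[:-2], np.ones(len(ts) - 2)])
    return X, ts[2:]

# Synthetic "good" series: random walks of variable length.
train = [np.cumsum(rng.normal(0, 1, rng.integers(110, 126))) for _ in range(30)]

# Pool all training series into one least-squares AR(2) fit.
designs = [ar2_design(ts) for ts in train]
coef, *_ = np.linalg.lstsq(np.vstack([X for X, _ in designs]),
                           np.concatenate([y for _, y in designs]), rcond=None)

def resid_var(ts):
    X, y = ar2_design(ts)
    return float(np.mean((y - X @ coef) ** 2))

# "In accordance" = residual variance comparable to the training series.
threshold = 2 * max(resid_var(ts) for ts in train)
ok = resid_var(np.cumsum(rng.normal(0, 1, 118))) <= threshold
bad = resid_var(rng.normal(0, 10, 118)) > threshold
```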

Or you might just do some exploratory analysis of your data to decide upon a simple rule which could be used to detect accordance. There's lots of possibilities, and without further details on the application it's difficult to decide which one is appropriate. Delegating the decision of what's "in accordance" to some algorithm doesn't seem to be a good choice in this case, though.

Thanks for the complete answer. My data is a bit similar (all of them have the same number of peaks, around the same number of medium values), but some of them have some disparity in time. I'll go for exploratory analysis and may post some of my findings here. – Minoru – 2016-05-24T20:10:17.117

It sounds like 'novelty detection' is what you might be looking for.

scikit-learn supports this directly: it fits a one-class model and predicts whether new observations belong to that class or not.

For further reading, see the novelty detection section of the scikit-learn documentation, where you can also find an example using an SVM classifier.
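A minimal sketch of that approach with OneClassSVM, assuming synthetic sine data in place of your series and interpolation to a common length (the `nu` value and resampling scheme are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

def resample(ts, n=100):
    # Bring variable-length series (110-125 points) to a common length.
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(ts)), ts)

def make_series():
    m = rng.integers(110, 126)
    return np.sin(np.linspace(0, 6, m)) + rng.normal(0, 0.05, m)

X = np.array([resample(make_series()) for _ in range(30)])

# nu bounds the fraction of training samples allowed outside the boundary.
clf = OneClassSVM(nu=0.05, gamma="scale").fit(X)

similar = clf.predict(resample(make_series()).reshape(1, -1))[0]
different = clf.predict(resample(rng.normal(0, 1, 118)).reshape(1, -1))[0]
# +1 = fits the one class, -1 = novelty
```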

Thanks for the clear answer. I was a bit worried about the performance of the SVM for time series data, but as I commented in @MightyCurious answer, I'll go for exploratory analysis and try the SVM too. – Minoru – 2016-05-24T20:12:22.627

Based on your description, a simple cross-correlation should do it. You are testing whether a future series correlates with a past series you consider correct, and that is exactly what the cross-correlation between the two series measures.

Another good thing about cross-correlation is that it handles series of different lengths, which gives you a lot of flexibility.

Cross-correlation by itself might not be enough for your system, as it only provides a raw similarity score, but you can still build a model for the actual "classification": since your output is binary, anything from a simple threshold detector (i.e., if xcorr > value then 1 else 0) to a small neural network should work fine.
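A sketch of the threshold-detector version using `numpy.correlate` (synthetic data; the normalization and the 0.8 cutoff are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_norm_xcorr(a, b):
    # Peak of the normalized cross-correlation over all lags; taking the
    # peak (rather than lag 0) tolerates small time shifts between series.
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.correlate(a, b, mode="full").max())

t = np.linspace(0.0, 6.0, 120)
reference = np.sin(t)                  # a series known to be "correct"
shifted = np.sin(t + 0.3)[:115]        # same shape, shifted, shorter
noise = rng.normal(0, 1, 118)          # unrelated series

# Threshold detector: xcorr above the cutoff counts as "in accord".
threshold = 0.8
accept_shifted = max_norm_xcorr(reference, shifted) > threshold
accept_noise = max_norm_xcorr(reference, noise) > threshold
```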

Thanks for the answer. My first approach to the problem involved DTW, another measure of difference. The problem is that I don't have a negative class to compare the distances against and decide. I'll try a neural network next. – Minoru – 2016-05-24T20:03:18.547

What if the data are completely uncorrelated to begin with, as in a random, independent, stationary process? What if the data are correlated, say temperatures from the same thermometer, but the future set is still uncorrelated to the past set (enough time has elapsed)? – Pete – 2016-05-25T16:53:02.507

@Pete From what I understood of the question, the OP is looking for whether "future time series data is in accord with previous time series", so I assumed the two data sets are correlated and the future process is somewhat dependent on the past process, maybe with hidden states. I guess the OP is only looking for a 2-state model (either it changed or it did not), and that threshold/distribution can only be trained on whatever the OP decides to train on. – GameOfThrows – 2016-05-26T08:09:36.813

@GameOfThrows Is there any lib where I can find the cross-correlation? numpy has one, but I suspect it is not what you were talking about. – Minoru – 2016-06-30T15:51:47.993

As you (OP) further clarified, you've got equipment sensor data. Your first time-series was recorded when you knew the machine was in good operating condition. Later, you sample another time series, and you want to know if anything has changed. This is called anomaly detection.

You described the time-series as stochastic, meaning there is an element of randomness. There are several interesting methods you can use to find the probability that the two series were generated by the same process (nothing has changed). I'm afraid the answers I've read to your question so far would only work in special cases, and are perhaps too complicated for the job.

The general concept is that there is an underlying probability distribution (described by a 'probability density function' or PDF) which generates your time-series. When the machine gets out of order, that PDF changes. You test if the new distribution is different from the old one. You should use a completely general approach which does not assume a particular form of the PDF (Gaussian, Poisson, Beta, etc).

First, you can always construct and graph the empirical cumulative distribution function or ECDF for each time-series and compare them visually. This requires a qualitative judgement on what "different" looks like.
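Constructing the ECDFs needs only numpy; a sketch with synthetic stand-in data (the shifted distribution is just an example of a change):

```python
import numpy as np

rng = np.random.default_rng(0)

def ecdf(ts):
    # Empirical CDF: sorted values against cumulative fractions.
    x = np.sort(ts)
    return x, np.arange(1, len(x) + 1) / len(x)

old = rng.normal(0.0, 1.0, 120)    # series from the known-good period
new = rng.normal(0.5, 1.0, 115)    # later series, possibly changed

x_old, y_old = ecdf(old)
x_new, y_new = ecdf(new)

# For a visual comparison: plt.step(x_old, y_old); plt.step(x_new, y_new).
# As a single number, the largest vertical gap between the two curves:
grid = np.union1d(x_old, x_new)
gap = float(np.max(np.abs(
    np.searchsorted(x_old, grid, side="right") / len(x_old)
    - np.searchsorted(x_new, grid, side="right") / len(x_new))))
```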

To make it quantitative, classical statistics offers the Kolmogorov-Smirnov (K-S) test. It's a one-liner in R. The disadvantage of this test is, well, it's not Bayesian. :) There are other issues too. That being said, I use the K-S test all the time because it's so fast and easy.
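Since the thread uses Python, the R one-liner has a direct equivalent in scipy (synthetic data here; the drift is just an illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

baseline = rng.normal(0.0, 1.0, 120)   # from the healthy machine
same = rng.normal(0.0, 1.0, 118)       # same underlying process
drifted = rng.normal(1.0, 1.0, 115)    # the process has shifted

# The two-sample test handles unequal lengths; a small p-value rejects
# the hypothesis that both samples come from the same distribution.
stat_same, p_same = ks_2samp(baseline, same)
stat_drift, p_drift = ks_2samp(baseline, drifted)
```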

There are many caveats here. You'll probably want to use the first-differences (changes) of sensor data instead of the values themselves. If you have more than a few sensors, you'll probably want to use dimensionality reduction such as principal component analysis (PCA) as well.
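Both preprocessing steps are short in numpy/scikit-learn; a sketch with simulated multi-sensor data (the five correlated sensors are an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulated readings: 5 sensors that mostly track one shared signal.
shared = np.cumsum(rng.normal(0.0, 1.0, 120))
readings = shared[:, None] + rng.normal(0.0, 0.1, (120, 5))

# First differences remove the trend, leaving step-to-step changes.
diffs = np.diff(readings, axis=0)

# Keep just enough principal components to explain 90% of the variance.
pca = PCA(n_components=0.9).fit(diffs)
reduced = pca.transform(diffs)
```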

I much prefer a Bayesian A/B test. This requires you to construct parametric models of the PDFs and fit them to your data. The result is a posterior distribution for each parameter of your model; you then look at how much the parameter posteriors from the two series overlap. Not enough overlap means an anomaly. To get started with Bayesian probability I recommend Python with PyMC and the Bayesian Methods for Hackers book.
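To show the overlap idea without pulling in PyMC, here is a heavily simplified conjugate stand-in (flat prior, Normal model for the first differences, so the posterior of the mean is approximately Normal(xbar, s/sqrt(n)); all data synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

old = rng.normal(0.0, 1.0, 119)   # first differences, healthy baseline
new = rng.normal(0.8, 1.0, 114)   # first differences after a suspected change

def posterior_mean_samples(x, size=10_000):
    # Posterior of the mean under a flat prior: Normal(xbar, s/sqrt(n)).
    return rng.normal(x.mean(), x.std(ddof=1) / np.sqrt(len(x)), size)

# Fraction of posterior draws where one mean exceeds the other; values
# near 0.5 mean heavy overlap (no anomaly), near 0 or 1 an anomaly.
p_shift = float(np.mean(posterior_mean_samples(new) > posterior_mean_samples(old)))
```

A full PyMC model would replace the closed-form posterior with MCMC samples, but the overlap check at the end is the same.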