## How to compare different sets of time series data


I am trying to do some anomaly detection between time-series using Python and sklearn (but other package suggestions are definitely welcome!).

I have a set of 10 time-series; each time-series consists of torque values collected from one tire (so 10 tires in total), and the series may not contain the same number of data points (the set sizes differ). Each time-series is pretty much just the tire_id, the timestamp, and the sig_value (the value from the signal, or the sensor). Sample data for one time-series looks like this:

tire_id        timestamp        sig_value
tire_1           23:06.1            12.75
tire_1           23:07.5                0
tire_1           23:09.0            -10.5

Now I have 10 of them, and 2 of them behave strangely. I understand that this is an anomaly detection problem, but most of the articles I read online detect anomalous points within a single time-series (i.e. points at which the torque values are not normal for that tire).

To detect which 2 tires are behaving abnormally, I tried a clustering method, namely k-means clustering (since it is unsupervised).

To prepare the data to feed into k-means, for each time-series (i.e. for each tire), I calculated:

1. The top 3 sets of adjacent local maximum and local minimum with highest amplitude (difference)
2. Mean of torque value
3. Standard Deviation of torque values
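A minimal sketch of how features like the three above could be computed for one tire, assuming the torque values are available as a plain numeric sequence. The helper names `local_extrema` and `tire_features` are hypothetical, and reading "adjacent local maximum and local minimum" as consecutive turning points of the signal is an assumption, not necessarily the exact procedure used here.

```python
import numpy as np

def local_extrema(values):
    """Return the values at interior turning points (local maxima and minima)."""
    v = np.asarray(values, dtype=float)
    # Collapse runs of equal values so a plateau counts as a single point.
    keep = np.concatenate(([True], np.diff(v) != 0))
    v = v[keep]
    if v.size < 3:
        return np.array([])
    slope = np.sign(np.diff(v))
    # A turning point is where the slope changes sign.
    turning = np.where(slope[:-1] != slope[1:])[0] + 1
    return v[turning]

def tire_features(values, k=3):
    """Top-k amplitudes between adjacent extrema, then mean and std."""
    v = np.asarray(values, dtype=float)
    ext = local_extrema(v)
    # Consecutive turning points alternate max/min, so adjacent
    # differences are max-to-min (or min-to-max) amplitudes.
    amps = np.sort(np.abs(np.diff(ext)))[::-1][:k]
    top = np.zeros(k)  # zero-filled if fewer than k amplitudes exist
    top[:amps.size] = amps
    # Note: np.std uses ddof=0 by default (pandas uses ddof=1).
    return np.concatenate((top, [v.mean(), v.std()]))

example = [0, 5, -3, 4, 1, 2, 0]  # made-up values, not real torque data
features = tire_features(example)
print(features)
```

Stacking one such vector per tire gives a feature matrix with a single row per tire.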

I also set the number of clusters to be only 2, so either cluster 1 or 2.

So my end result (after assigning clusters) looks like the following:

        amplitude  local maxima  local minima  sig_value_std  sig_value_average  cluster
tire_0     558.50        437.75       -120.75      77.538645          12.816990        0
tire_0     532.75        433.75        -99.00      77.538645          12.816990        0
tire_0     526.25        438.00        -88.25      77.538645          12.816990        0
tire_1     552.50       -116.50        436.00      71.125912          11.588038        1
tire_1     542.75        439.25       -103.50      71.125912          11.588038        0

Now my question is what to do with this result. Each tire has 3 rows of data, because I picked the 3 pairs of local max/min with the largest amplitudes. That means each row is assigned to a cluster on its own, and sometimes the rows of a single tire even end up in different clusters. Also the cluster size is normally larger than just 2.

My questions are:

1. How do I do anomaly detection on a "set of time-series", not just on individual data points?
2. Is my approach reasonable/logical? If it is, how can I clean up my result to get what I want? And if not, what can I do to improve?


Pretty interesting question!

First of all, have a look at my edit, as your question was a bit unclear with respect to the standard terminology. You have a set of time-series and you want to detect the outliers (abnormalities).

1. Your approach is pretty clear and logical, and it shows an understanding of the problem and the solution. The issue is the way you chose to apply it.
2. K-means is not the best way. I would like to point out that the choice of 2 clusters is very smart here, as you hope the clusters form along a normal/abnormal split. It just does not work well in practice if the features you extracted from your time-series do not exhibit the abnormal behavior.
3. I assume embedding algorithms are the right way to do this. Most probably, if you apply a simple PCA, you will see the abnormal time-series lying farther away than the others. Below is the code. Try it and drop me a line if it does not work, so I can suggest more sophisticated solutions (e.g. you may construct a phase space and look at your data there, or monitor the recurrence of the time-series, etc.).
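```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# X must already be defined: shape (10, N), one row per time-series
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)  # project each series onto the first 2 PCs

plt.figure(figsize=(10, 10))
plt.plot(X_new[:, 0], X_new[:, 1], "*")  # one marker per time-series
plt.show()
```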

where X is a 10xN matrix in which each row is one time-series.

You may choose more components for PCA and check different PCs against each other.
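As a hedged complement to eyeballing the plots, here is a minimal sketch that keeps 3 components and ranks the series by their distance from the median point in PC space. The synthetic `X` (8 similar sine-like series plus 2 with a different frequency and amplitude) is made up purely for illustration, and the median-distance score in the hypothetical helper `outlier_scores` is an assumption of this sketch, not part of PCA itself.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)

# Synthetic stand-in for the 10 x N matrix: rows 0-7 behave alike,
# rows 8-9 have a different frequency and amplitude.
normal = [np.sin(t) + 0.1 * rng.standard_normal(t.size) for _ in range(8)]
odd = [3 * np.sin(3 * t) + 0.1 * rng.standard_normal(t.size) for _ in range(2)]
X = np.vstack(normal + odd)

pca = PCA(n_components=3)
X_new = pca.fit_transform(X)  # shape (10, 3)

def outlier_scores(points):
    """Euclidean distance of each row from the component-wise median."""
    center = np.median(points, axis=0)
    return np.linalg.norm(points - center, axis=1)

scores = outlier_scores(X_new)
# Indices of the two series lying farthest from the bulk.
suspects = np.argsort(scores)[::-1][:2]
print(sorted(suspects.tolist()))
```

The median is used as the center because, unlike the mean, it is not pulled toward the few abnormal series. On real data the separation may be far less clean than in this toy example.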

Anyway, the problem is not that difficult, and if this does not work I will update my answer with another solution.

Hope it helps and good luck!

hi Kasra! Thank you SOOO much for trying to help! I'm trying out your method and I noticed one shortage/limitation of your method... that is that your approach is assuming that every time series data set i'm using has the same number of data points, which is not the case here... any other suggestions? :( – alwaysaskingquestions – 2018-03-04T03:34:56.330

Sure, if you upvote/accept the answer if it worked. As a very first step, just clip the time-series: cut them all to the same length as the shortest one. If that does not help, drop another comment here. – Kasra Manshaei – 2018-03-04T03:45:43.087

Hi Kasra, I dont want to lose the data; is it possible to not cut the data? I want to use all of them. – alwaysaskingquestions – 2018-03-04T03:47:18.770

So replace the tail of your time-series with the last value. Just try it and let me know if it worked – Kasra Manshaei – 2018-03-04T03:54:46.477

doesnt that change the data basically? cuz now im adding value to the shorter data sets.... so that is changing my results right? (thank you so much for being patient with me!) – alwaysaskingquestions – 2018-03-04T03:56:06.743

It does but what can we do? Either cut the long ones or replicate short ones or "extract a set of good representative features" from every time-series. – Kasra Manshaei – 2018-03-04T13:39:49.117
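The first two options from the comments above (cutting to the shortest length, or repeating the last value of the shorter series) can be sketched as follows. This is a minimal sketch, assuming each series is a non-empty numeric sequence already sorted by timestamp. The function names `truncate_to_shortest` and `pad_with_last_value` are hypothetical.

```python
import numpy as np

def truncate_to_shortest(series_list):
    """Cut every series to the length of the shortest one."""
    n = min(len(s) for s in series_list)
    return np.vstack([np.asarray(s, dtype=float)[:n] for s in series_list])

def pad_with_last_value(series_list):
    """Extend shorter series by repeating their final value."""
    n = max(len(s) for s in series_list)
    rows = []
    for s in series_list:
        s = np.asarray(s, dtype=float)
        # mode="edge" repeats the edge value; padding only on the right
        # therefore repeats the last observation.
        rows.append(np.pad(s, (0, n - s.size), mode="edge"))
    return np.vstack(rows)

series = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up values
X_cut = truncate_to_shortest(series)
X_pad = pad_with_last_value(series)
print(X_cut.shape, X_pad.shape)
```

Either result is a rectangular matrix with one row per series, which is the shape that `PCA.fit_transform` expects. Both change the data, as noted above: truncation discards the tail of the longer series, while padding adds artificial flat segments to the shorter ones.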

@KasraManshaei May I kindly draw your attention to my question in this regard? It will be highly appreciated. – Mario – 2020-12-03T15:42:13.180

@Mario Sure my friend. As soon as I find a bit of time I will get back to your question :) – Kasra Manshaei – 2020-12-09T11:13:31.740

Thanks for your consideration. I also reformulated the question in another community, if you don't mind. – Mario – 2020-12-09T12:40:27.753