Similarity measure for multivariate time series with heterogeneous length and content



I am interested in clustering N multivariate time series, each with its own number of values T (the lengths differ), using Python. Each variable shows many trends, and a single variable can take both numeric and nominal values.

A sample $T_{i}$ in the dataset has the following format:

TimeStamp       | Sensor0 | Sensor1| Sensor2
2015-02-05 11:30|<Min     | On     | off
2015-02-05 11:31|<Min     | on     | off 
2015-02-05 11:32| Action2 | 10     | 0.0001  
2015-02-07 11:33| Action2 | 10     | 0.00012 
2015-02-07 11:34| Action2 | 10     | 0.00012 
2015-02-07 11:35| Action2 | 20     | 0.00015 

Another sample $T_{j}$ in the dataset has the following format:

TimeStamp       | Sensor0 | Sensor1| Sensor2
2015-10-05 11:30| Action2 | 11     | off
2015-10-05 11:31| Action1 | 11     | off 
2015-10-05 11:32| Action2 | NAN    | 0.0001  
2015-10-07 11:33| Action3 | NAN    | 0.00012 
2015-10-07 11:34| <Min    | 10     | 0.00012 
2015-10-07 11:35| <Min    | 15     | on 

The missing (non-numeric) values were not collected by the sensors, so my idea was to replace them with the minimum value, given that all values are strictly positive. Otherwise, they would be treated as missing values, in which case the problem becomes finding a similarity measure that can compare missing values (off, on, ...) with numeric values.
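An alternative to overwriting the status labels is to split each reading into a numeric part and a nominal part, so neither kind of information is lost before a distance is chosen. The following is only a minimal sketch in plain Python; the function name `split_mixed` and the decision to lowercase the labels are assumptions made for illustration.

```python
def split_mixed(value):
    """Split one raw sensor reading into (numeric_part, nominal_part).

    Numeric readings give (float, None); status labels such as 'on',
    'off' or '<Min' give (None, lowercased label); 'NAN' or an empty
    string gives (None, None).
    """
    text = str(value).strip()
    # Treat empty cells and the literal NAN marker as truly missing.
    if text == "" or text.lower() == "nan":
        return (None, None)
    try:
        return (float(text), None)
    except ValueError:
        # Anything that is not a number is kept as a nominal label.
        return (None, text.lower())


# Sensor1 column of the first sample shown above.
sensor1 = ["On", "on", "10", "10", "10", "20"]
parts = [split_mixed(v) for v in sensor1]
```

Each column then yields two aligned sequences, one numeric and one nominal, which can be compared with different per-type costs.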

I am wondering whether a similarity/distance measure already exists in the literature for comparing such multivariate time series with heterogeneous lengths, and whether this kind of problem has already been formulated in papers, books, or elsewhere, for R and Python.

Thanks for your advice.


Posted 2016-08-16T08:04:51.537


Fit the time series to a model, and cluster the model parameters. – Emre – 2016-08-16T19:45:10.313
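As a rough illustration of this suggestion, the sketch below fits a first-order autoregressive model to each series and uses the fitted parameters as a fixed-length feature vector, so series of different lengths become comparable. The AR(1) model and the name `ar1_features` are assumptions chosen for brevity, not the commenter's method, and the sketch only covers numeric channels.

```python
import math


def ar1_features(series):
    """Return (mean, phi), where phi is the least-squares AR(1) coefficient."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    # Regress each centered value on the previous one (no intercept).
    num = sum(a * b for a, b in zip(centered[1:], centered[:-1]))
    den = sum(a * a for a in centered[:-1])
    phi = num / den if den > 0 else 0.0  # constant series: no dynamics
    return (mean, phi)


# Two series of different lengths map to vectors of the same length.
short = [1.0, 2.0, 3.0, 4.0]
long = [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
f_short = ar1_features(short)
f_long = ar1_features(long)
distance = math.dist(f_short, f_long)
```

The resulting parameter vectors can be passed to any standard clustering algorithm; a richer model would simply produce a longer vector.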

@Emre Thank you for your response. I just can't seem to find the right modeling framework for this context. Do you have a specific method in mind? – user23440 – 2016-08-16T21:48:50.593

I'd use a neural network.

– Emre – 2016-08-16T21:51:15.613

That's indeed a very good paper. I am a newbie to deep learning, so it will be difficult for me to implement it in Python. Do you know of good resources for a beginner (books, MOOCs, GitHub, ...)? – user23440 – 2016-08-16T22:24:44.630

Start here, then try these time series tutorials.

– Emre – 2016-08-16T23:03:46.060

Thank you so much for your advice. So once the model parameters are learned, is it possible to cluster using a similarity measure like Euclidean distance? Is it still a valid metric for such a feature representation? – user23440 – 2016-08-16T23:48:31.010

That's where things get hairy, because clustering is subjective, and rescaling the features will change the clusters with a metric like the Euclidean distance. I suggest just trying the various clustering algorithms and looking into metric learning. But finding this embedding (time series representation) to make clustering feasible in the first place is the hard part, so don't fret!

– Emre – 2016-08-17T17:44:46.833
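The point about rescaling can be seen in a tiny example: with the Euclidean distance, changing the units of one feature can change which point is nearest, and therefore which cluster a point joins. The data and the helper name `nearest` below are made up purely for illustration.

```python
import math


def nearest(query, points):
    """Index of the point closest to query under the Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))


query = (0.0, 0.0)
points = [(1.0, 100.0), (3.0, 10.0)]

# Raw units: the second feature dominates, so point 1 is nearest.
raw_choice = nearest(query, points)

# Rescale the second feature by 0.01: now point 0 is nearest.
scaled = [(x, 0.01 * y) for x, y in points]
scaled_choice = nearest(query, scaled)
```

This is why standardizing the features, or learning the metric, is usually considered before clustering such representations.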



Try this recent paper: Consistent Algorithms for Clustering Time Series.

Your question is very much a current research topic.

Here's an older but excellent paper which talks about the fundamentals: Generalized Feature Extraction for Structural Pattern Recognition in Time-series Data.


Posted 2016-08-16T08:04:51.537


@Pete Thank you for your response. This seems to me an interesting avenue to explore. I will get back to you when I finish reading. – user23440 – 2016-08-16T21:49:40.233