Using time series data from a sensor for ML

I have the following data for a little side project. It's from an accelerometer sitting on top of a washer/dryer and I'd like it to tell me when the machine has finished.

[Image: data]

x is the input data (x/y/z movement as one value), y is the label on/off

Because the x values overlap for y=1 and y=0, I was thinking of using x and a rolling 3-minute window as inputs for an SVM:

xyz60 = res.xyz.resample("60S").max()  # maximum per one-minute bin
X["x"] = xyz60
X["max3"] = xyz60.rolling(window=3, min_periods=1).max()  # maximum over the last 3 minutes

[Image: data]

Is this a good approach for this kind of problem? Are there alternatives that might produce better results?

laktak

Posted 2017-06-01T13:52:10.327

Reputation: 181

By a three-minute rolling window, do you mean that you want to use input from a three-minute window time=1, 2, 3 and then move to time=2, 3, 4, and get a label 0/1 for off/on for each window? – StatsSorceress – 2017-06-01T14:03:01.700

@StatsSorceress basically yes - I'm using a window because the x values overlap (updated) – laktak – 2017-06-01T14:12:14.987

Answers

You have time series data that measures acceleration. You wish to identify when the machine is in its nominal state (OFF) and when it is in its anomalous state (ON). This problem is well suited to anomaly detection algorithms, but there are many ways you can approach it.

Preparing your data

All of the methods rely on the feature extraction method you select. Assume we continue to use the three-sample time window you suggested. For each window you calculate a single statistic. I would suggest the mean, which I assume you are already doing: take the average of the three resultant acceleration samples in the window. Doing this for the nominal state $y = 0$ leaves you with a large number of values in a training set $S$ defined as

$S = \{s_2, s_3, \ldots, s_n\}$

where each $s_i$ is the mean of the three samples in a window. $s_i$ is defined as

$s_i = \frac{1}{3} \sum_{k=i-2}^{i} x_k$

where $x_k$ are your sample observations and $i \geq 2$.

Then, if possible, collect more data with the machine active, so that $y = 1$.
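As a minimal sketch of the windowed mean defined above, the following plain-Python function computes $s_i$ for every index where a full window is available. The function name `rolling_mean` and the sample values are made up for illustration; they are not taken from the question's data.

```python
def rolling_mean(x, window=3):
    """Return the mean of x[i-window+1 .. i] for every i >= window - 1."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(x[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(x))
    ]


# Hypothetical per-minute acceleration magnitudes, for illustration only.
x = [0.1, 0.2, 0.3, 1.2, 1.5, 1.8]
s = rolling_mean(x)  # one value per full three-sample window
```

With six observations and a window of three, `s` has four entries, matching $i = 2, \ldots, 5$.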

Now you can choose whether to train your algorithm on a one-class dataset (pure anomaly detection), a biased dataset (anomaly detection), or a well-balanced dataset. The balance of a dataset is the ratio between its classes. A perfectly balanced dataset for a 2-class classifier has a 1:1 ratio, with 50% of the data belonging to each class. You seem to have a biased dataset, assuming you don't want to waste a lot of electricity.
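To check how biased a labelled dataset is, you can simply count the labels. This is a small sketch with invented labels; `class_balance` is a hypothetical helper name.

```python
from collections import Counter


def class_balance(y):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(y)
    total = len(y)
    return {label: n / total for label, n in counts.items()}


# Invented labels: the machine is off (0) most of the time.
y = [0, 0, 0, 0, 0, 0, 1, 1]
balance = class_balance(y)
```

Here three quarters of the samples are OFF, so the dataset is biased towards $y = 0$.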

Do note that there is nothing stopping you from keeping the neighboring samples as separate features of each instance in your dataset. For example:

$x_i$ $x_{i-1}$ $x_{i-2}$ | $y_i$

This makes a 3-dimensional input space for each output, where the output is the label of the most recently taken sample.
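One way to build such rows is sketched below, assuming `x` and `y` are equal-length lists ordered in time. The helper name `lag_features` and the values are illustrative only.

```python
def lag_features(x, y, n_lags=3):
    """Build rows [x_i, x_{i-1}, ..., x_{i-n_lags+1}] paired with label y_i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rows, labels = [], []
    for i in range(n_lags - 1, len(x)):
        rows.append([x[i - k] for k in range(n_lags)])  # newest sample first
        labels.append(y[i])
    return rows, labels


# Invented observations and on/off labels, for illustration only.
x = [0.1, 0.2, 0.3, 1.2, 1.5]
y = [0, 0, 0, 1, 1]
X, labels = lag_features(x, y)
```

The first two samples produce no row because they do not have two predecessors yet.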


A Biased Dataset


Easy Solution

The easiest approach I would suggest is the following. Assume you are using a single statistic $s$ to describe what is happening throughout the three-sample window. From the collected data, take the maximum $s$ of your nominal points ($y=0$) and the minimum $s$ of your anomalous points ($y=1$). Then take the halfway point between these two and use that as your threshold.

If a new test sample $\hat{s}$ is larger than the threshold, assign $y=1$; otherwise assign $y=0$.
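A minimal sketch of this threshold rule, assuming the window statistics have already been split by label. The names `midpoint_threshold` and `classify` and the numbers are hypothetical.

```python
def midpoint_threshold(s_off, s_on):
    """Halfway point between the largest OFF value and the smallest ON value."""
    return (max(s_off) + min(s_on)) / 2


def classify(s_new, threshold):
    """Assign y=1 (ON) if the statistic exceeds the threshold, else y=0 (OFF)."""
    return 1 if s_new > threshold else 0


# Invented window statistics for each state, for illustration only.
s_off = [0.10, 0.15, 0.20]
s_on = [0.90, 1.20, 1.50]
threshold = midpoint_threshold(s_off, s_on)
```

If the largest OFF value is above the smallest ON value, the two classes overlap in this statistic and a single threshold will misclassify some windows.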

You can extend this by calculating the mean $s$ over all of your nominal samples ($y=0$) and the mean $s$ over all of your anomalous samples ($y=1$). If a new sample falls closer to the mean of the anomalous samples, classify it as $y=1$; otherwise classify it as $y=0$.
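The nearest-mean variant could look roughly like this. It is only a sketch with invented values; `fit_class_means` and `predict_nearest_mean` are hypothetical helper names.

```python
from statistics import mean


def fit_class_means(s, y):
    """Return the mean window statistic for each label present in y."""
    return {
        label: mean(v for v, lab in zip(s, y) if lab == label)
        for label in set(y)
    }


def predict_nearest_mean(s_new, means):
    """Return the label whose class mean is closest to s_new."""
    return min(means, key=lambda label: abs(s_new - means[label]))


# Invented window statistics and labels, for illustration only.
s = [0.10, 0.15, 0.20, 0.90, 1.20, 1.50]
y = [0, 0, 0, 1, 1, 1]
means = fit_class_means(s, y)
```

Compared with the midpoint rule, this uses all training points rather than only the two extremes, so a single outlier has less influence on the decision boundary.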

But I want to get fancy!

There are a number of other techniques you can use to do this exact task.

  • k-Nearest Neighbors
  • Neural Networks
  • Linear Regression
  • SVM

Simply put, almost every machine learning algorithm is well suited for this purpose. It just depends on how much data is available to you and its distribution.


I really want to use SVM


If this is the case, keep the three samples completely separated. Your training matrix will have 3 columns, as discussed above, alongside your outputs $y$. Using SVM in Python is very easy: http://scikit-learn.org/stable/modules/svm.html.

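from sklearn import svm

# Each row of X is one window [x_i, x_{i-1}, x_{i-2}]; "..." marks elided rows.
X = [[0, 0, 0], [1, 1, 1], ..., [1, 0, 1]]
y = [0, 1, ..., 1]  # on/off label for each row
clf = svm.SVC()  # default RBF kernel
clf.fit(X, y)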

This trains your model. Then you will want to predict the outcome for a new sample.

clf.predict([[2., 2., 1]])

JahKnows

Posted 2017-06-01T13:52:10.327

Reputation: 7 863

Let me know if you want some more information about specific things. – JahKnows – 2017-06-01T15:53:58.127

+1 for the detailed answer - I will test this as soon as the washing machine generates more data ;) – laktak – 2017-06-01T22:24:04.970

Are there any alternatives to 'Preparing your data'? I've tested my old method and yours with 3 and 5 input values but I always have problems at the 'edges' when y changes (like y 1/0/1/0/1 instead of 1/1/1/1/1). – laktak – 2017-06-02T21:55:11.033

At the edges? I'm not sure I understand what you mean. Can you elaborate please? – JahKnows – 2017-06-05T13:40:41.277

For example when the machine turns off, y can jump from 1 to 0 and back multiple times. Instead of one end time I get several. I'm only interested in the start and end times, is there maybe a better approach for that? – laktak – 2017-06-05T19:06:38.853

If you really care about the functionality and not the process (for learning purposes), go about it using the simple solution I proposed. Just some questions: what is your sampling speed? What is your margin of error for start and end times? – JahKnows – 2017-06-05T19:25:17.883

I'd rather like to know if/how this can be solved with ML. I can adjust the sampling speed; at the moment I take one every 100ms, store the max every 500ms and use the max (rather than mean) of a one minute interval as input. start/end should be detected within <5 minutes. – laktak – 2017-06-05T19:47:44.807

OK, so I would tackle the problem a bit differently than I stated. I would collect the samples for every distinct minute and use that as my feature space. This will give you 10 samples for each minute interval, and then you can use that as your input. Can you send me a link to your data, please? Host it somewhere and post the link, preferably on Kaggle. – JahKnows – 2017-06-05T19:57:15.943