## What are the best ways to use time series data for binary classification?

I have a large number of CSV files, each containing a time series sampled every 5 seconds for 2-3 minutes. There are 20k such files with 200-300 variables in each file. I am aggregating the data by taking the mean over the entire 2-3 minute window and using it for binary classification.

Currently I use the mean of each column in a CSV file to represent that file, so I am summarizing each CSV with one scalar value per column, and each file becomes one sample represented by its column means. Could anyone suggest better ways to summarize the time series data?

1) Why do you have to summarise it by only one number? 2) What do you mean by “better”? What is your criterion? – aivanov – 2017-12-08T22:10:02.320

I need to summarize each of the CSVs because I have a huge number of them. I am using the mean for this summarization, so I wanted to know whether there are other plausible ways to summarize time series data, because I believe the mean is too flat. – Anurag Upadhyaya – 2017-12-09T10:42:13.733

Your question isn’t answerable as it stands now. 1) It is still unclear what you mean by “better”. Do you have any quality criterion or procedure that could tell you that the mean is worse than, let's say, the standard deviation, or just the first measurement of your time series? You can’t optimise your “compression approach” without an optimisation criterion. 2) If you can’t formulate the criterion yourself, try to explain to us how you intend to use the aggregated data (mean). Do you pass it to some ML algorithm? What kind of ML? – aivanov – 2017-12-09T11:05:21.740

Okay, so yes, I am aggregating each file by its mean and then training a binary classifier on it. The data is pretty imbalanced, and the mean is not giving me variables that are separable: both classes have similar distributions across all the variables. Since the data is a time series, I suspect the mean may not be the right way to aggregate it, so I was thinking of using some other aggregation. – Anurag Upadhyaya – 2017-12-09T11:55:19.773

4

From your comment, I understand that you are trying to solve the binary classification problem using your aggregated data and you are getting very poor results when you simply use the mean.

Depending on the specifics of your data and the shape of your time series, there are several alternatives that you could try. Note that you might need (significantly) more than just a single number per time series to solve your problem.

1. In addition to the mean, you could use the quantiles or some other summary statistic, like standard deviation, min, or max.
2. You could try to sample the data, i.e. instead of taking the entire time series, pick only the values that are minutes, hours or days apart. Or pick only mid-day values. The frequency of the sampling depends on your data.
3. Or just pre-aggregate by calculating averages for every hour, day, month, etc.
4. Additionally, you could calculate the periodicity of your time series and use it as a new feature.
5. Or calculate some trends.
6. Try to fit some standard time series models to your data, e.g. ARIMA and use the coefficients as informative features.
7. Last but not least, use domain knowledge about what could be a relevant feature for your classification problem: the biggest jump (max first-order difference), change of regime, etc.
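As a rough illustration of items 1, 5 and 7, here is a minimal sketch that turns each column of one file into several summary features instead of a single mean. The function names (`summarize_series`, `summarize_file`), the column names, and the 5-second step are assumptions made for this example; it uses only NumPy and takes the columns as a plain dict rather than reading a real CSV.

```python
import numpy as np

def summarize_series(values, step_seconds=5.0):
    """Return a dict of summary features for one 1-D time series.

    step_seconds is an assumed sampling interval (5 s in the question).
    """
    x = np.asarray(values, dtype=float)
    t = np.arange(x.size) * step_seconds
    # Slope of a least-squares straight line through the series (item 5).
    slope = np.polyfit(t, x, 1)[0] if x.size > 1 else 0.0
    # First-order differences capture jumps between consecutive samples (item 7).
    diffs = np.diff(x)
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return {
        "mean": x.mean(),
        "std": x.std(),
        "min": x.min(),
        "max": x.max(),
        "q25": q25,
        "median": q50,
        "q75": q75,
        "slope": slope,
        "max_abs_diff": np.abs(diffs).max() if diffs.size else 0.0,
    }

def summarize_file(columns):
    """Flatten per-column summaries of one file into a single feature row.

    `columns` maps a column name to its sequence of samples.
    """
    row = {}
    for name, values in columns.items():
        for stat, value in summarize_series(values).items():
            row[f"{name}_{stat}"] = float(value)
    return row

# Hypothetical two-column file with four samples per column.
example = {"sensor_a": [1.0, 2.0, 3.0, 4.0], "sensor_b": [5.0, 5.0, 9.0, 5.0]}
features = summarize_file(example)
```

Each file then contributes one row with nine features per column, so 200-300 columns become a few thousand candidate features, which is why some form of feature selection or regularisation is needed afterwards.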

Edit: I’d pick at least 10-20 features per time series generated as described above and apply logistic regression with LASSO or even xgboost.

After selecting 10-20 features per time series, you could also try PCA to reduce the dimension.
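The PCA step can be sketched as follows. This is a minimal NumPy-only version built on the singular value decomposition, not a specific library's implementation; the function name `pca_reduce` and the random demo matrix are assumptions for illustration, standing in for a real samples-by-features matrix.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project a (samples x features) matrix onto its leading principal components.

    Columns are standardised first so that features measured on different
    scales contribute comparably.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns at zero after centering
    Z = (X - mu) / sigma
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    scores = Z @ Vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic stand-in for the real feature matrix (assumption: 200 files, 30 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))
scores, ratio = pca_reduce(X_demo, n_components=5)
```

In a real workflow the mean, scale and directions would be estimated on the training set only and then applied to validation and test data, to avoid leaking information into the evaluation.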

Thanks for your ideas. I have also tried std, variance, skewness and kurtosis as single-number representations. Do you have ideas on how to use this kind of data to perform binary classification? If I don't aggregate the data into a single number, I won't be able to frame the problem for classification. – Anurag Upadhyaya – 2017-12-09T13:21:17.537

I don’t understand why you are limited to single number only. Can you please explain? The classifier can take into account multiple features at the same time. Please provide some snippet of your data including the class label. – aivanov – 2017-12-09T13:25:37.313

The reason is that each CSV has 100-200 columns and there are 20k CSVs. So I am currently aggregating each column of each CSV into a single value, which gives me 20k samples of 100-200 features, and then I train a classifier on that. – Anurag Upadhyaya – 2017-12-09T13:41:18.697

What is your AUC? – aivanov – 2017-12-09T14:01:11.823

My AUC is around 0.81 on test data. – Anurag Upadhyaya – 2017-12-09T14:20:13.393

On your edit: so you want me to use something among those 5 options, then select some 15-20 features and try logistic regression, xgboost or lasso. Correct me if I am wrong! – Anurag Upadhyaya – 2017-12-10T10:48:52.703

You can use a combination of the 5 options above, e.g. max, mean, slope of linear trend, several monthly averages (average in June, July, ...). Considering your AUC, which is not so bad from my point of view, you might start with 10 features per time series. Option 2 is unlikely to be helpful in your particular case, but you can try it anyway. I meant a) “logistic with Lasso penalty” or b) xgboost (make sure to test colsample < 1). You can also try c) “logistic with elastic net penalty”. Please keep me posted on your results. – aivanov – 2017-12-10T13:42:28.113

Sure, I will keep you posted. At the AUC I am getting, I have 30-35% false positives on the validation and test sets. My data is sampled every 5 seconds for 3-4 mins, so I think I can try the slope of a linear trend or the periodicity approach. Currently I am using xgboost, because the data is imbalanced and the other algorithms do not cope with it as well. Thanks btw for the ideas :) – Anurag Upadhyaya – 2017-12-10T14:20:06.960

@AnuragUpadhyaya btw, consider editing the title of your question and the question itself. Include the details provided in the comments and you might get more responses. – aivanov – 2017-12-10T17:12:12.057

@AnuragUpadhyaya just wanted to check whether you were able to get higher AUC? – aivanov – 2017-12-21T18:48:43.620

I am still struggling to increase my AUC further. I will let you know if anything turns up. – Anurag Upadhyaya – 2018-01-08T05:07:55.677

For point 5 mentioned above: if I am not wrong, can I go ahead and compute seasonality and trends on the raw data? – Anurag Upadhyaya – 2018-01-09T06:47:59.383

1

You can try taking the Fourier transform of the series if your domain suggests that frequency components might have some meaning or relevance for classification. I once took the top 10 coefficients of the transform along with the usual statistical features suggested by aivanov, and it helped me classify my data. Before taking the transform, you might also benefit from passing the series through a high-pass, low-pass or band-pass filter.
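A minimal sketch of this idea with NumPy's real FFT is shown below. It rests on one assumption: "top 10 coefficients" is read here as the 10 largest spectral magnitudes, although it could equally mean the first 10 (lowest-frequency) coefficients. The function name `fft_features` and the 5-second step are likewise illustrative, and the optional filtering step is not shown.

```python
import numpy as np

def fft_features(values, k=10, step_seconds=5.0):
    """Return the k largest FFT magnitudes and the dominant frequency in Hz.

    Assumption: "top k" means largest magnitude, not lowest frequency.
    """
    x = np.asarray(values, dtype=float)
    x = x - x.mean()  # remove the constant offset so it does not dominate
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=step_seconds)
    # Drop the zero-frequency bin, which is ~0 after mean removal.
    spectrum, freqs = spectrum[1:], freqs[1:]
    order = np.argsort(spectrum)[::-1][:k]
    top = np.zeros(k)  # zero-padded if the series has fewer than k bins
    top[:order.size] = spectrum[order]
    dominant = float(freqs[order[0]]) if order.size else 0.0
    return top, dominant

# Synthetic 3-minute series sampled every 5 s with a 20-second period (0.05 Hz).
t = np.arange(36) * 5.0
signal = np.sin(2 * np.pi * 0.05 * t)
top_mags, dominant_hz = fft_features(signal)
```

The magnitudes can be appended to the statistical features as extra columns, and the dominant frequency doubles as a simple periodicity feature. With only 24-36 samples per series, the frequency resolution is coarse, so these features are most useful when the signal has a clear oscillation.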