When should you balance a time series dataset?

5

I'm training a machine learning algorithm to classify up/down trends in a time series, and my dataset is imbalanced. Balancing the data seems necessary, since otherwise the algorithm could learn a bias towards one trend, but it comes at the cost of a non-representative dataset. Should I balance my data? And if so, is random undersampling the right method?

Jonathan Shobrook

Posted 2018-02-22T18:10:43.190

Reputation: 303

1Which types of models do you use? Some models are less sensitive to imbalanced datasets. – Omri374 – 2018-02-24T20:41:22.073

1@Omri374: I'm testing an LSTM network, SVM, and Random Forest classifier. – Jonathan Shobrook – 2018-02-25T05:28:10.190

1For SVMs and Random Forests, are you using a sliding window to create samples? If so, you can perform the sampling on the created windows. – Omri374 – 2018-02-25T09:07:20.167

You might want to read this paper

– iso_9001_ – 2019-03-14T14:09:08.027

Answers

2

If you can change the loss function of the algorithm, that will help a great deal, and you won't need to downsample your data. Many useful metrics have been introduced for evaluating classifiers on imbalanced datasets, among them Kappa, CEN, MCEN, MCC, and DP.
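One common way to change the loss without resampling is to weight each class inversely to its frequency. The sketch below is only an illustration of that idea in plain Python, not necessarily what is meant above: the helper names `inverse_frequency_weights` and `weighted_log_loss` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def weighted_log_loss(labels, probs_up, weights, eps=1e-12):
    """Class-weighted binary cross-entropy; probs_up[i] is P(label == 1)."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs_up):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        w = weights[y]
        total += -w * (math.log(p) if y == 1 else math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Toy data (hypothetical): 8 "up" samples (1) and 2 "down" samples (0).
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
weights = inverse_frequency_weights(labels)

# A model that always leans "up" is penalised more under the weighted loss.
always_up = [0.9] * len(labels)
plain = weighted_log_loss(labels, always_up, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, always_up, weights)
print(weights, plain, weighted)
```

With these weights both classes contribute equally to the total loss, so ignoring the minority trend is no longer cheap, and the time ordering of the samples is left untouched.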

Disclaimer:

If you use Python, the PyCM module can help you compute these metrics.

Here is a simple snippet that gets the recommended metrics from this module:

>>> from pycm import *

>>> cm = ConfusionMatrix(matrix={"Class1": {"Class1": 1, "Class2":2}, "Class2": {"Class1": 0, "Class2": 5}})  

>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]

After that, any of these metrics can be used as the loss function as follows:

>>> y_pred = model.predict(data.data)   #the prediction of the implemented model (data.data is assumed to hold the features)

>>> y_actu = data.target        #data labels

>>> cm = ConfusionMatrix(y_actu, y_pred)

>>> loss = cm.Kappa             #or any other parameter (Example: cm.SOA1)
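To see what two of these metrics measure, here is a rough sketch that computes accuracy, Cohen's Kappa, and MCC by hand for a binary confusion matrix, using the standard textbook formulas rather than PyCM. The function name `binary_metrics` is hypothetical, and Class1 from the example matrix above is treated as the positive class.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Return (accuracy, Cohen's kappa, MCC) for a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the actual and predicted marginals.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 by convention
    return accuracy, kappa, mcc

# The example matrix: actual Class1 -> (1, 2), actual Class2 -> (0, 5).
acc, kappa, mcc = binary_metrics(tp=1, fn=2, fp=0, tn=5)

# A classifier that always predicts the majority class (Class2).
maj_acc, maj_kappa, maj_mcc = binary_metrics(tp=0, fn=3, fp=0, tn=5)
print(acc, kappa, mcc)
print(maj_acc, maj_kappa, maj_mcc)
```

The always-majority classifier still gets a respectable accuracy, but its Kappa and MCC are both zero, which is why these metrics are preferred on imbalanced data. Note that Kappa and MCC are scores where higher is better, so they have to be negated (or subtracted from 1) before being minimized as a loss.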

Alireza Zolanvari

Posted 2018-02-22T18:10:43.190

Reputation: 616