How to handle extremely 'long' images?


After transforming timeseries into an image format, I get a width-to-height ratio of ~135. Typical image CNN applications involve either square or reasonably rectangular proportions, whereas mine look nearly like lines.

Example dimensions: (16000, 120, 16) = (width, height, channels).

Are 2D CNNs expected to work well with such aspect ratios? What hyperparameters are appropriate - namely, in Keras/TF terms, strides, kernel_size (is 'unequal' preferred, e.g. strides=(16, 1))? Relevant publications would help.

Clarification: width == timesteps. The images are obtained via a transform of the timeseries, e.g. Short-time Fourier Transform. channels are the original channels. height is the result of the transform, e.g. frequency information. The task is binary classification of EEG data (w/ sigmoid output).

Relevant thread


Posted 2020-04-05T18:08:08.727

Reputation: 141

Are the spatial correlations in the vertical (height) dimension relevant? I.e. does the vertical distance have physical meaning? – MPA – 2020-04-06T11:51:53.633

@MPA Question updated – OverLordGoldDragon – 2020-04-06T11:58:59.943

What is the type of the desired output? A class label (e.g. for detection), a time series (e.g. for filtering), etc.? – MPA – 2020-04-06T12:07:52.987

@MPA Binary classification, updated again – OverLordGoldDragon – 2020-04-06T12:09:05.097

Depending on the physical nature of the input data, you could first perform 1D convolutions with dimensionality reduction along the time axis to extract the most relevant features for each frequency band, and then combine the features of each frequency band into a class prediction. Or if the data are sufficiently smooth, you could try an auto-encoder to compress the time axis into a smaller latent space representation, and perform additional analysis on the latent space. – MPA – 2020-04-06T14:00:14.183

@MPA 16000 is precisely from the autoencoder approach; the full original timeseries spans 200,000+ timesteps. Compressing it even further is unrealistic without loss of information, though, and the timeseries transform used is promising for discriminative feature extraction. The linked thread has some sound suggestions I might summarize in an answer. – OverLordGoldDragon – 2020-04-06T14:05:51.903

I suppose the EEG data is quasi-periodic. What about Dynamic Mode Decomposition (DMD) and performing the classification on the dominant modes? – MPA – 2020-04-06T14:15:35.520



I recently used a slightly unorthodox method to process such images, which involved RNNs.

Assume the image dimensions to be (16000, 120, 16) = (width, height, channels), as in the question.

Apply a 2D convolution (or several such convolutions) with a kernel of size (1, k) and c output filters, such that the output of the convolutions has shape (16000, 1, c). If you use only a single convolutional layer (without padding), k=120, so the kernel spans the full height.

Then, squeeze the extra dimension, to get the shape (16000, c).

The problem has now been transformed back into a sequence problem! You can use RNN variants for further processing.
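The steps above can be sketched without any framework to make the shapes explicit. This is only an illustration under stated assumptions, not the exact model I used: the width is shrunk to 200 so it runs quickly, `filters = 8` and `hidden = 4` are arbitrary, the weights are random rather than trained, and `simple_rnn` is a hypothetical helper standing in for whatever RNN variant you prefer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small stand-in for the (16000, 120, 16) input: (width, height, channels).
width, height, channels = 200, 120, 16
filters = 8  # assumed number of output channels ("c" above)

x = rng.standard_normal((width, height, channels))

# A (1, k) kernel with k == height: weights indexed by (height, in_channel, filter).
kernel = rng.standard_normal((height, channels, filters)) * 0.01

# An unpadded convolution with a (1, height) kernel reduces to a contraction
# over height and input channels at every timestep, so height collapses to 1.
conv_out = np.einsum("whc,hcf->wf", x, kernel)[:, None, :]  # (width, 1, filters)

# Squeeze the singleton height axis to obtain a plain sequence.
seq = np.squeeze(conv_out, axis=1)  # (width, filters)


def simple_rnn(seq, hidden, rng):
    """Minimal Elman-style RNN (hypothetical helper); returns the final hidden state."""
    w_in = rng.standard_normal((seq.shape[1], hidden)) * 0.1
    w_rec = rng.standard_normal((hidden, hidden)) * 0.1
    h = np.zeros(hidden)
    for step in seq:
        h = np.tanh(step @ w_in + h @ w_rec)
    return h


hidden = 4  # assumed hidden size
h_final = simple_rnn(seq, hidden, rng)

# Sigmoid output for binary classification, as in the question.
w_out = rng.standard_normal(hidden)
prob = 1.0 / (1.0 + np.exp(-(h_final @ w_out)))

print(conv_out.shape, seq.shape)
```

In Keras terms, the contraction corresponds to a Conv2D layer with kernel_size=(1, 120) and no padding, followed by a reshape that drops the height axis before the recurrent layer.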

Susmit Agrawal

Posted 2020-04-05T18:08:08.727

Reputation: 113

This defeats the purpose of the transform; see updated question. – OverLordGoldDragon – 2020-04-06T11:58:10.537