## How can I use variable length inputs to train a regression model?

5

1

I'm working on predicting a value $y \in \mathbb{R}$ from $x_{n+1}$, where $n$ is the number of samples $x_i$, $i \in [1,n]$, used for training.

Each training sample $x_i$ is a time series of variable length. How can I engineer features to predict $y$ while making the most of the samples available?

Nota bene:

• I'm working with Python & Scikit-learn
• the length of the time series may be considered as an explanatory variable

Hello, I have asked the same question here, along with my current solution: http://datascience.stackexchange.com/questions/16929/predict-rank-from-physical-measurements-with-various-lengths This forum does not seem very active, or perhaps my question was not properly asked.

– Wli – 2017-02-17T11:57:41.650

4

A quick approach is to use a library like tsfresh, which extracts features from your time series, e.g. max value, number of peaks, median value, etc.

Normally, a thoughtful solution to this kind of problem involves domain knowledge, that is, insights from an expert that can tell you which aspects of your time series are important, e.g. frequent variations, very high peaks, plateaus, local patterns, etc. This knowledge is then used to derive ad hoc feature extractors.
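To illustrate the idea of turning each series into a fixed-length feature vector, here is a minimal hand-written sketch. It does not use tsfresh; it only mimics a few of the summary features mentioned above with NumPy and feeds them to a scikit-learn regressor. The function name `summarize`, the synthetic data, the made-up target, and the choice of `RandomForestRegressor` are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def summarize(series):
    """Map one variable-length series to a fixed-length feature vector."""
    s = np.asarray(series, dtype=float)
    # Count strict local maxima as a crude "number of peaks" feature.
    peaks = np.sum((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])) if s.size >= 3 else 0
    return np.array([
        s.size,        # series length, kept as an explanatory variable
        s.mean(),
        s.std(),
        s.min(),
        s.max(),
        np.median(s),
        float(peaks),
    ])


# Synthetic stand-in data (assumption): 50 series with lengths in [20, 100].
rng = np.random.default_rng(0)
series_list = [rng.normal(size=rng.integers(20, 101)) for _ in range(50)]
y = np.array([s.max() - s.min() for s in series_list])  # made-up target

# Every series becomes one row of 7 features, whatever its length.
X = np.vstack([summarize(s) for s in series_list])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# A new series of any length goes through the same feature map.
x_new = rng.normal(size=37)
y_pred = model.predict(summarize(x_new).reshape(1, -1))
```

With domain knowledge, the generic statistics in `summarize` would be replaced or extended by features an expert considers meaningful.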

3

If I understood your question correctly, the training samples $x_i$ have different lengths $l_i$ for $i \in [1,n]$. A very common method is to pad every training and test sample to the same length, or to use a fixed-length time window to sample your time series. Here, you could pad all the $x_i$ to length $L = \max_i l_i$, with $i \in \{1, 2, \dots, n, n+1, \dots\}$. The padding value depends on your data: use a value that is not supposed to occur in your data to represent the "padding".
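A minimal sketch of the padding step with NumPy is shown here. The helper name `pad_to_length`, the sentinel `-1.0`, and the toy data are assumptions; the sentinel only makes sense if that value cannot occur in the real measurements. Since the length may itself be informative, the sketch also appends it as an extra column.

```python
import numpy as np

PAD_VALUE = -1.0  # assumption: this value never occurs in the real data


def pad_to_length(series, target_len, pad_value=PAD_VALUE):
    """Right-pad (or truncate) a 1-D series to exactly target_len values."""
    s = np.asarray(series, dtype=float)[:target_len]
    out = np.full(target_len, pad_value, dtype=float)
    out[:s.size] = s
    return out


# Toy series of lengths 3, 2 and 4.
series_list = [[0.2, 0.5, 0.1], [0.9, 0.4], [0.3, 0.8, 0.6, 0.7]]
L = max(len(s) for s in series_list)

# One row per series, all of length L.
X = np.vstack([pad_to_length(s, L) for s in series_list])

# Keep the original length as an additional feature column.
lengths = np.array([len(s) for s in series_list])
X_with_len = np.column_stack([X, lengths])
```

The truncation in `pad_to_length` is a design choice for the case where a later sample is longer than $L$; if all samples are known in advance, $L$ can simply be taken over all of them and nothing is cut.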

To make a prediction for $x_{n+1}$, first pad it or sample it with the same time window, then apply the model you trained.
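The time-window alternative can be sketched as follows: the series is cut into windows of equal length, and each window becomes one fixed-length input. The helper name `fixed_windows` and the `window` and `step` values are assumptions; how per-window predictions are combined for $x_{n+1}$ (for example, by averaging) is a separate choice not covered here.

```python
import numpy as np


def fixed_windows(series, window, step):
    """Cut a 1-D series into windows of equal length; drop the short tail."""
    s = np.asarray(series, dtype=float)
    starts = range(0, s.size - window + 1, step)
    # reshape keeps a (0, window) shape when the series is shorter than window
    return np.array([s[i:i + window] for i in starts]).reshape(-1, window)


# A toy series of length 10, cut into windows of 4 values every 3 steps.
x_new = np.arange(10, dtype=float)
windows = fixed_windows(x_new, window=4, step=3)
```

The same `window` and `step` would have to be used for the training samples and for $x_{n+1}$ so that the model always sees inputs of the same shape.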

I don't know what kind of model you are using or why your time series vary in length; I assume you have subsampled your original data. For deep learning models such as RNNs and LSTMs, you can use fixed-length time series data, since the model extracts features over time and subsampling is unnecessary.