## Multiple time-series predictions with Random Forests (in Python)

I am interested in time-series forecasting with Random Forests. The basic approach is to use a rolling window and use the data points within the window as features for the Random Forest regression, regressing the values that follow the window on the values within it. It is just a plain autoregressive model (with lags), but with a Random Forest instead of linear regression.

### Problem:

If I have more than one time series (multiple time series), how do I pass them to the RF regression?

For example: given two time series $$y_1(t)$$ and $$y_2(t)$$ and an outcome time series $$z(t)$$, I am interested in predicting the values of $$z(t)$$ from the combination of $$y_1$$ and $$y_2$$.

What I need is to apply a rolling window to each of $$y_1$$ and $$y_2$$, and then feed the values within the windows from both time series into the RF regression to predict the value of $$z(t)$$.

### Question:

How do I incorporate the data from both rolling windows into the input for RF regression?

Random forest (like most supervised learning models) accepts a feature vector $x = (x_1, \dots, x_k)$ for each observation and tries to correctly predict the output $y$. So you need to convert your training data into this format. The following pandas-based function will help:

    import pandas as pd

    def table2lags(table, max_lag, min_lag=0, separator='_'):
        """Given a dataframe, return a dataframe with different lags of all its columns."""
        values = []
        for i in range(min_lag, max_lag + 1):
            values.append(table.shift(i).copy())
            values[-1].columns = [c + separator + str(i) for c in table.columns]
        return pd.concat(values, axis=1)


For example, the following code:

    df = pd.DataFrame({'y1': [1, 2, 3, 4, 5], 'y2': [10, 20, 40, 50, 30], 'z': [1, 4, 9, 16, 25]})
    x = table2lags(df[['y1', 'y2']], 2)
    print(x)


will produce the output

       y1_0  y2_0  y1_1  y2_1  y1_2  y2_2
    0   1.0  10.0   NaN   NaN   NaN   NaN
    1   2.0  20.0   1.0  10.0   NaN   NaN
    2   3.0  40.0   2.0  20.0   1.0  10.0
    3   4.0  50.0   3.0  40.0   2.0  20.0
    4   5.0  30.0   4.0  50.0   3.0  40.0


The first two rows contain missing values, because lags 1 and 2 are undefined for them. You can fill them with whatever you find appropriate, or simply omit those rows.
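
For instance, a minimal sketch of both options (dropping the incomplete rows versus imputing with column means; the choice between them, and the mean as the imputation value, are arbitrary here):

```python
import pandas as pd

df = pd.DataFrame({'y1': [1, 2, 3, 4, 5], 'y2': [10, 20, 40, 50, 30]})
# Build the same lag features inline (equivalent to table2lags(df, 2)).
x = pd.concat([df.shift(i).add_suffix('_' + str(i)) for i in range(3)], axis=1)

x_complete = x.dropna()        # omit the rows with undefined lags
x_filled = x.fillna(x.mean())  # or impute, e.g. with column means
```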

When you have the matrix of $x$ values, you can feed it, for example, to a scikit-learn regressor:

    from sklearn.ensemble import RandomForestRegressor
    rf = RandomForestRegressor().fit(x[2:], df['z'][2:])
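
To forecast at a new time step, you pass that step's window of $y_1$ and $y_2$ values in the same column layout the model was trained on. A self-contained sketch (the new observations 6 and 60 below are made up for illustration, and `n_estimators`/`random_state` are arbitrary):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Rebuild the toy training data from above.
df = pd.DataFrame({'y1': [1, 2, 3, 4, 5], 'y2': [10, 20, 40, 50, 30],
                   'z': [1, 4, 9, 16, 25]})
x = pd.concat([df[['y1', 'y2']].shift(i).add_suffix('_' + str(i))
               for i in range(3)], axis=1)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x[2:], df['z'][2:])

# Columns are y1_0, y2_0, y1_1, y2_1, y1_2, y2_2: the hypothetical new
# values (y1=6, y2=60) followed by the last two observed values of each.
new_row = pd.DataFrame([[6, 60, 5, 30, 4, 50]], columns=x.columns)
z_hat = rf.predict(new_row)
```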


Finally, a piece of advice. Your model could improve considerably if you used not only the raw lagged values as features, but also various aggregations of them: means, other linear combinations (e.g. exponentially weighted moving averages), quantiles, etc. Adding linear combinations of features to a linear model is useless, but for tree-based models they can help a lot.
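
A sketch of such aggregated features (the window length 3 and the smoothing parameter `alpha=0.5` are arbitrary choices; any of these columns could be concatenated with the lag matrix above):

```python
import pandas as pd

df = pd.DataFrame({'y1': [1, 2, 3, 4, 5], 'y2': [10, 20, 40, 50, 30]})

# Trailing-window aggregations to use alongside the raw lags.
features = pd.concat([
    df.add_suffix('_lag0'),                           # raw current values
    df.rolling(3).mean().add_suffix('_mean3'),        # rolling mean
    df.ewm(alpha=0.5).mean().add_suffix('_ewm'),      # exponentially weighted mean
    df.rolling(3).quantile(0.5).add_suffix('_med3'),  # rolling median
], axis=1)
```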