
For a project, I have to benchmark different algorithms that fill in missing values in time series.

To be clear, this is **imputation**, not forecasting.

In my case, I have access to **15 years** of complete temperature data from **20 stations**.

I have several algorithms that, given the positions of the missing values, try to complete the missing data.

However, these algorithms have **parameters** that must be tuned, so I want to use a classical method such as **k-fold cross-validation**.
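To make my current setup concrete, here is a minimal sketch of what the random k-fold masking would look like (the daily resolution and the dummy data are my own assumptions; `impute` stands for any of my candidate algorithms):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 15 years of daily temperatures for 20 stations.
n_steps, n_stations = 15 * 365, 20
data = rng.normal(10.0, 5.0, size=(n_steps, n_stations))

# Classical k-fold on individual cells: every cell is assigned
# to one of k folds completely at random.
k = 5
cell_fold = rng.integers(0, k, size=data.shape)

for fold in range(k):
    holdout = (cell_fold == fold)          # cells hidden in this fold
    train = np.where(holdout, np.nan, data)
    # impute(train) with a candidate parameter setting, then score
    # only on the held-out cells, e.g.:
    # error = rmse(imputed[holdout], data[holdout])
```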

Moreover, these algorithms have to be tested on a classical time-series completion scenario. Here is a typical drawing of such a setup:

Each row is one time series (note that some series are, and must remain, complete for the benchmark).

The red cells are the missing data to be completed, and the green cells are the unknown data. In practice, these algorithms exploit **spatio-temporal** structure to complete the missing data.

However, I am facing a problem: k-fold cross-validation holds data out **at random**, and it is precisely this purely random character that troubles me. As the drawing shows, some algorithms might perform better when the held-out data are chosen completely at random, whereas in reality they will only ever be tested on a **template like the one in the image**.
In fact, I know in advance that some of my algorithms work very well when blocks of several years of known data are present, and not at all when the known data are selected purely at random.
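One idea I have considered is to replace the random folds with validation masks that imitate the template: hide one contiguous multi-year block per selected station and leave the other stations fully observed. This is only a sketch of that idea (the block length, the number of masked stations, and the daily resolution are all assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(1)

n_years, steps_per_year, n_stations = 15, 365, 20
n_steps = n_years * steps_per_year
data = rng.normal(10.0, 5.0, size=(n_steps, n_stations))

def block_mask(n_steps, n_stations, steps_per_year,
               n_masked_stations=10, block_years=3, rng=rng):
    """Hide one contiguous multi-year block in each selected station;
    the remaining stations stay fully observed, as in the template."""
    mask = np.zeros((n_steps, n_stations), dtype=bool)
    stations = rng.choice(n_stations, size=n_masked_stations, replace=False)
    block_len = block_years * steps_per_year
    for s in stations:
        start = rng.integers(0, n_steps - block_len + 1)
        mask[start:start + block_len, s] = True
    return mask

# Draw k independent templates instead of k random folds:
k = 5
for fold in range(k):
    holdout = block_mask(n_steps, n_stations, steps_per_year)
    train = np.where(holdout, np.nan, data)
    # impute(train) and score on data[holdout], as before
```

Each "fold" is then a fresh realization of the realistic missingness pattern rather than a random scattering of cells.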

Do you have any ideas on how to tune my algorithms' parameters so that they are optimal for a missing-data pattern like the one in the image?

Thanks in advance

(Feel free to ask me questions if I haven't been clear enough)