Training a network with variable frame rates?


I would like to train a temporal network, but the available videos have different frame rates (e.g. 7, 12, 15 and 30 fps). How should I train this network without down-sampling the higher-frame-rate videos?

I tried up-sampling everything, but that generates some artifacts.

What is the suitable approach?


Posted 2018-09-17T13:24:18.450

Reputation: 43



I don't believe there is a well-known method to deal with this.

Simple pre-processing

While I haven't done this with images/videos, I know from general time-series analysis that you basically have to either interpolate the lower-frequency series or down-sample the higher-frequency ones. If you think about it, what else is there to do?
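As a rough illustration of the simple pre-processing route, here is a minimal sketch that resamples a clip to a target frame rate by linearly blending the two nearest source frames. The function name `resample_linear` and the `(T, H, W, C)` array layout are assumptions made for this example, not something from the question. Plain blending like this tends to produce ghosting on moving objects, which may well be the kind of artifact the question mentions.

```python
import numpy as np

def resample_linear(frames, src_fps, dst_fps):
    """Resample a clip of shape (T, H, W, C) from src_fps to dst_fps.

    Hypothetical helper: each output frame is a linear blend of the two
    source frames that surround its timestamp.
    """
    frames = np.asarray(frames, dtype=np.float32)
    n_src = frames.shape[0]
    # Number of output frames that fit between the first and last source frame.
    n_dst = int(np.floor((n_src - 1) * dst_fps / src_fps + 1e-9)) + 1
    # Position of every output frame, expressed as a fractional source index.
    pos = np.arange(n_dst) * (src_fps / dst_fps)
    lo = np.clip(np.floor(pos).astype(int), 0, n_src - 1)
    hi = np.clip(lo + 1, 0, n_src - 1)
    # Blend weight of the later frame, broadcast over H, W and C.
    w = (pos - lo).astype(np.float32).reshape(-1, 1, 1, 1)
    return (1.0 - w) * frames[lo] + w * frames[hi]

# Three tiny 1x1 "frames" recorded at 15 fps, up-sampled to 30 fps.
clip_15 = np.array([0.0, 10.0, 20.0]).reshape(3, 1, 1, 1)
clip_30 = resample_linear(clip_15, src_fps=15, dst_fps=30)
```

The same function also down-samples (for example 30 fps to 15 fps), in which case it simply picks every second frame.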

Modelling solution

Nvidia released a research paper with an accompanying video showing how they were able to train a model that estimates intermediate frames between existing frames, effectively interpolating video and increasing its frame rate. This would allow you to scale up your lower-frame-rate videos to match the higher-frame-rate ones. The paper is named:

Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation

... sounds like something worth reading.

There are older algorithms that try to do the same thing (e.g. "twixtor"), but I read they have problems with things such as rotating objects. Another thing to keep in mind is the usual GIGO: garbage in, garbage out. There are still some interpolation artefacts in the Nvidia video, but those likely come from blurry input images used during training, e.g. when objects were moving faster than the recording frame rate could handle.

It seems that they train two models: the first encodes the optical flow between frames and the second model uses that, along with the base images to perform the interpolation. Please read the paper for more details. It also outlines how they train the model (learning rates, number of epochs, augmentation steps, etc.).

Here is the sketch of their model for flow computation/interpolation:

[Figure: sketch of the Super SloMo model for flow computation/interpolation]

We can see that it is an encoder/decoder-looking model, introducing a bottleneck that condenses the information, before upsampling again. This is based on the U-net model architecture: an encoder/decoder that also introduces skip connections between layers of different scales.
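To make the encoder/decoder-with-skip-connections idea concrete, here is a toy, weight-free sketch of how data flows through a U-net-like structure. It is not the Super SloMo network: there are no learned convolutions, the depth (two scales) is arbitrary, and the names `downsample`, `upsample` and `toy_unet_forward` are made up for this illustration. It only shows how the resolution shrinks towards the bottleneck and how the skip connections re-inject the higher-resolution features on the way back up.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling on an array of shape (C, H, W); H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour 2x up-sampling on an array of shape (C, H, W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def toy_unet_forward(x):
    """Encoder/decoder pass with skip connections (no learned layers)."""
    skip1 = x                        # full resolution
    skip2 = downsample(skip1)        # 1/2 resolution
    bottleneck = downsample(skip2)   # 1/4 resolution: the condensed representation
    # Decoder: up-sample, then concatenate the matching encoder features
    # along the channel axis (this concatenation is the skip connection).
    d2 = np.concatenate([upsample(bottleneck), skip2], axis=0)
    d1 = np.concatenate([upsample(d2), skip1], axis=0)
    return d1

# Two RGB frames stacked along the channel axis, 8x8 pixels each.
pair = np.random.rand(6, 8, 8)
out = toy_unet_forward(pair)
```

In a real U-net each of these stages would be followed by learned convolutions, so the channel counts would be set by the layer definitions rather than growing through concatenation alone.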


Posted 2018-09-17T13:24:18.450

Reputation: 12 573

Hmmm... I think my approach would be something like this: if I go with 15 fps, I would split each 30 fps video into 2 videos (hopefully giving more data). – user1589759 – 2018-09-17T21:44:43.637
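The splitting idea in the comment above could be sketched as follows, assuming the clip is already loaded as an array with time on the first axis. The helper name `split_by_phase` is hypothetical. It only works for integer ratios such as 30 fps to 15 fps, and the resulting clips are strongly correlated, so they should probably be treated as a form of augmentation rather than as independent data.

```python
import numpy as np

def split_by_phase(frames, factor):
    """Split one clip into `factor` lower-frame-rate clips.

    Clip k takes frames k, k + factor, k + 2 * factor, ...
    With factor=2, a 30 fps clip becomes two 15 fps clips
    (the even-indexed frames and the odd-indexed frames).
    """
    frames = np.asarray(frames)
    return [frames[offset::factor] for offset in range(factor)]

# Six frame indices standing in for a short 30 fps clip.
clip_30 = np.arange(6)
even_clip, odd_clip = split_by_phase(clip_30, factor=2)
```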