I don't believe there is a well-known method to deal with this.
While I haven't done this with images/videos, I know from general time-series analysis that you basically have to either interpolate the lower-frequency series or down-sample the higher-frequency one. If you think about it, what else is there to do...?
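To make the two options concrete, here is a minimal numpy sketch on a toy 1-D signal (the frame rates and the sine signal are made up for illustration): upsample a low-rate series by linear interpolation, or down-sample a high-rate one by keeping every n-th sample.

```python
import numpy as np

# Hypothetical example: a 15 fps series brought up to 60 fps, and back down.
fps_low, fps_high = 15, 60
duration = 1.0  # seconds

t_low = np.arange(0, duration, 1 / fps_low)    # 15 sample times
t_high = np.arange(0, duration, 1 / fps_high)  # 60 sample times

signal_low = np.sin(2 * np.pi * t_low)         # toy low-rate signal

# Option 1: interpolate the low-rate signal at the high-rate timestamps.
signal_up = np.interp(t_high, t_low, signal_low)

# Option 2: down-sample the high-rate signal by keeping every 4th sample.
signal_down = signal_up[:: fps_high // fps_low]
```

Linear interpolation is the crudest choice; the Nvidia approach discussed below is essentially a learned, far more sophisticated version of option 1 for video frames.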
Nvidia released a research paper, with an accompanying video, showing how they trained a model that estimates intermediate frames between existing frames - effectively interpolating a video and increasing its frame rate. This would let you scale your lower-frame-rate videos up to match the higher-frame-rate ones. The paper is named:
Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation
... sounds like something worth reading.
There are older algorithms that try to do the same thing (e.g. "Twixtor"), but I have read that they have problems with things such as rotating objects. Another thing to keep in mind is the usual GIGO principle: garbage in, garbage out. There are still some interpolation artefacts in the Nvidia video, but those likely come from blurry input frames used during training, e.g. when objects were moving faster than the recording frame rate could handle.
It seems that they train two models: the first estimates the optical flow between frames, and the second uses that flow, along with the input frames, to perform the interpolation. Please read the paper for more details; it also outlines how they train the models (learning rates, number of epochs, augmentation steps, etc.).
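This is not Nvidia's actual method, but a rough numpy sketch of how the second stage can use a flow field, under the (strong) simplifying assumptions that the flow from frame 0 to frame 1 is already given, that motion is linear, and that nearest-neighbour warping is good enough. The function names `backward_warp` and `interpolate_frame` are my own, not from the paper.

```python
import numpy as np

def backward_warp(frame, flow):
    """Sample `frame` at positions displaced by `flow` (nearest neighbour).
    frame: (H, W) grayscale; flow: (H, W, 2) holding (dy, dx) per pixel."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

def interpolate_frame(frame0, frame1, flow01, t=0.5):
    """Synthesize the frame at time t in (0, 1) between frame0 and frame1.
    flow01 is the flow from frame0 to frame1; in the paper this comes
    from the first, learned model - here it is simply assumed known."""
    # Under linear motion, the flow from time t back to each endpoint is
    # a scaled version of flow01.
    warped0 = backward_warp(frame0, -t * flow01)         # pull pixels from frame0
    warped1 = backward_warp(frame1, (1 - t) * flow01)    # pull pixels from frame1
    # Blend, weighting the temporally closer frame more heavily.
    return (1 - t) * warped0 + t * warped1
```

For example, a single bright pixel at column 2 in frame 0 and column 4 in frame 1 lands at column 3 in the t=0.5 interpolated frame. The learned interpolation model in the paper replaces the naive blend here, which is exactly where occlusions and non-linear motion would otherwise produce artefacts.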
Here is the sketch of their model for flow computation/interpolation:
We can see that it is an encoder/decoder-style model: a bottleneck condenses the information before it is upsampled again. This follows the U-Net architecture: an encoder/decoder that additionally introduces skip connections between layers at different scales.
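As a schematic of just the wiring (ignoring the convolutions and learned weights entirely), the shapes flowing through such a U-Net-style pass look like this; `unet_shapes`, `pool2`, and `up2` are illustrative names of mine, not from the paper:

```python
import numpy as np

def pool2(x):
    """2x2 average pooling over a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def up2(x):
    """Nearest-neighbour 2x upsampling over a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def unet_shapes(x):
    """Trace a two-level U-Net-style pass: encode, bottleneck, decode.
    Each decoder stage concatenates the matching encoder output (the
    skip connection) along the channel axis."""
    skip1 = x                         # full-resolution features, saved
    d1 = pool2(x)                     # encoder: halve the resolution
    skip2 = d1                        # half-resolution features, saved
    bottleneck = pool2(d1)            # most condensed representation
    u1 = np.concatenate([up2(bottleneck), skip2], axis=0)  # decode + skip
    u2 = np.concatenate([up2(u1), skip1], axis=0)          # decode + skip
    return bottleneck, u2

x = np.zeros((3, 16, 16))             # e.g. a 3-channel input
bottleneck, out = unet_shapes(x)
# bottleneck: (3, 4, 4); out: (9, 16, 16) after concatenating the skips
```

The skip connections are the key point: they let fine spatial detail from the encoder bypass the bottleneck, which matters for pixel-accurate outputs like flow fields and interpolated frames.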