I work in physics. We have lots of experimental runs, with each run yielding a result, `y`, and some parameters that should predict the result, `x`. Over time, we have found more and more parameters to record. So our data looks like the following:
Year 1 data (2000 runs): parameters `x1, x2, x3`, target `y`
Year 2 data (2000 runs): parameters `x1, x2, x3, x4, x5`, target `y`
Year 3 data (2000 runs): parameters `x1, x2, x3, x4, x5, x6, x7`, target `y`
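To make the structure concrete, here is a sketch (with synthetic stand-in data, assuming pandas) of what the combined dataset looks like: concatenating the yearly tables aligns on column names, and the parameters a year didn't record become `NaN`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-ins for the three yearly datasets (real runs would be loaded from disk).
year1 = pd.DataFrame(rng.normal(size=(2000, 3)), columns=["x1", "x2", "x3"])
year2 = pd.DataFrame(rng.normal(size=(2000, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
year3 = pd.DataFrame(rng.normal(size=(2000, 7)),
                     columns=["x1", "x2", "x3", "x4", "x5", "x6", "x7"])
for df in (year1, year2, year3):
    df["y"] = rng.normal(size=len(df))

# Concatenation takes the union of columns; unrecorded parameters are NaN.
data = pd.concat([year1, year2, year3], ignore_index=True)
print(data.shape)               # (6000, 8)
print(data["x6"].isna().sum())  # 4000: years 1 and 2 never recorded x6
```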
How does one build a regression model that incorporates the additional information we recorded, without throwing away what it "learned" about the older parameters?
- just set `x5`, etc. to `-1` when I'm not using them?
- completely ignore `x4, x5, x6, x7` and only use `x1, x2, x3`?
- add another parameter that is simply the number of parameters?
- train separate models for each year, and combine them somehow?
- "weight" the parameters, so as to ignore them if I set the weight to 0?
- make three different models, one per year's parameter set (the last including the `x6, x7` parameters), and then interpolate somehow?
- make a custom "imputer" to guesstimate the missing parameters from the available ones?
I have tried imputation using mean and median, but neither works very well because the parameters are not independent, but rather fairly correlated.
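For example, this is the kind of model-based imputation I mean (a sketch with synthetic data, using scikit-learn's `IterativeImputer`): because it regresses each column on the others, the imputed values track the correlated parameters, which a column-wise mean or median cannot do:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Correlated parameters: x4 is roughly 2*x1, so a mean-imputed x4 ignores x1.
x1 = rng.normal(size=1000)
x4 = 2.0 * x1 + rng.normal(scale=0.1, size=1000)
X = np.column_stack([x1, x4])
X[:500, 1] = np.nan  # x4 was not recorded for the first 500 runs

# IterativeImputer fits a regression of x4 on x1 over the complete rows,
# then predicts the missing x4 values from the observed x1.
X_filled = IterativeImputer(random_state=0).fit_transform(X)
corr = np.corrcoef(X_filled[:500, 0], X_filled[:500, 1])[0, 1]
print(round(corr, 2))  # close to 1: the imputed x4 follows x1
```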