Regression model with variable number of parameters in dataset?



I work in physics. We have lots of experimental runs, each yielding a result, y, and some parameters that should predict it, x. Over time, we have found more and more parameters worth recording, so our data looks like the following:

Year 1 data: (2000 runs)
    parameters: x1,x2,x3                target: y
Year 2 data: (2000 runs)
    parameters: x1,x2,x3,x4,x5          target: y
Year 3 data: (2000 runs)
    parameters: x1,x2,x3,x4,x5,x6,x7    target: y

How does one build a regression model that incorporates the additional information we recorded, without throwing away what it "learned" about the older parameters?

Should I:

  • just set x4, x5, etc to 0 or -1 when I'm not using them?
  • completely ignore x4,x5,x6,x7 and only use x1,x2,x3?
  • add another parameter that is simply the number of parameters?
  • train separate models for each year, and combine them somehow?
  • "weight" the parameters, so as to ignore them if I set the weight to 0?
  • make three different models, using x1,x2,x3, x4,x5, and x6,x7 parameters, and then interpolate somehow?
  • make a custom "imputer" to guesstimate the missing parameters (using the available parameters)?

I have tried imputing with the mean and with the median, but neither works well because the parameters are not independent; they are fairly strongly correlated.


Posted 2016-07-14T18:52:58.407

Reputation: 173

I'd set up a neural network and disconnect the neurons corresponding to the missing inputs as necessary. – Emre – 2016-07-14T19:20:45.543

impute the missing vals? – Brandon Loudermilk – 2016-07-14T20:12:53.817

Could impute the missing values... I would have to make my own imputer, though; the imputer in scikit-learn is not very smart (mean, median, etc.). Not sure I want to replace every missing value with the average. – OrangeSherbet – 2016-07-14T20:44:41.313

What kind of model are you intending to make? When used to predict, do you need it to work with limited parameters same as your early experiments? Or will the final model assume that users will always input all the parameters you have determined are interesting? – Neil Slater – 2016-07-15T06:46:21.513

The model needs to predict y well across the historical data and the new data. One of the parameters, x1, is a time, and I need to model y as a function of this time parameter, y(x1), and fit an exponential to it for each run. y(x1) is the goal, but to improve y(x1)'s predicted form, I hope to incorporate more than just x1,x2 by using x3,x4,... as these new measurement channels become used. As the experiment continues, new measurement channels will probably be added. – OrangeSherbet – 2016-07-15T19:13:09.787

My worry about imputing the missing values is that it might dilute the information when the values are not missing. – OrangeSherbet – 2016-07-15T21:36:57.240

If you are looking at "pointwise estimates" of partially observed random variables, imputing the missing values with the mean might be a valid approach. If the data are independently and identically distributed across Years 1, 2, ..., the mean is least likely to introduce bias.

Or you can model the problem using a probabilistic graphical model, in which case you can marginalize out the missing variables. In this kind of modeling, you also take the spread of the variable into account while predicting y. So missing variables with high variance might not dilute the information. – abhnj – 2016-07-16T01:07:56.873

Tried it...I guess my data is such that mean doesn't actually work very well. I'm going to need a better imputation method. – OrangeSherbet – 2016-08-01T19:06:03.003



One simple idea, no imputation needed: build a model using the parameters that have always existed; then, each time a new set of parameters gets added, use those to model the residual of the previous model. At prediction time you sum the contributions of all the models that apply to the data you happen to have. (If effects tend to multiply rather than add, you could do this in log space.)
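A minimal sketch of this residual-stacking idea, using synthetic data and plain linear models (the column layout, coefficients, and choice of `LinearRegression` are all illustrative assumptions, not from the thread):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for three years of runs, 2000 each, with all 7
# channels generated so we can compare feature "eras" cleanly.
n = 2000
X_all = rng.normal(size=(3 * n, 7))  # x1..x7 for every run
y_all = X_all @ np.array([1.0, 0.5, -0.3, 0.8, 0.2, -0.6, 0.4]) \
        + rng.normal(scale=0.1, size=3 * n)
year = np.repeat([1, 2, 3], n)       # which era each run belongs to

# Model 1: x1..x3, trained on all runs.
m1 = LinearRegression().fit(X_all[:, :3], y_all)
r1 = y_all - m1.predict(X_all[:, :3])

# Model 2: x4..x5 predict model 1's residual, trained on years 2-3 only.
mask23 = year >= 2
m2 = LinearRegression().fit(X_all[mask23, 3:5], r1[mask23])
r2 = r1[mask23] - m2.predict(X_all[mask23, 3:5])

# Model 3: x6..x7 predict the remaining residual, year 3 only.
mask3 = year[mask23] == 3
m3 = LinearRegression().fit(X_all[mask23][mask3, 5:7], r2[mask3])

def predict(x):
    """Sum the contributions of whichever models the run's channels allow."""
    x = np.asarray(x)
    yhat = m1.predict(x[None, :3])[0]
    if x.size >= 5:
        yhat += m2.predict(x[None, 3:5])[0]
    if x.size >= 7:
        yhat += m3.predict(x[None, 5:7])[0]
    return yhat
```

A year-1 run with only (x1, x2, x3) gets model 1's prediction alone; a year-3 run gets the sum of all three contributions.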

Ken Arnold

Posted 2016-07-14T18:52:58.407

Reputation: 206

I didn't try this, but this sounds very similar to boosting (e.g. Boosted Decision Trees). Boosting has only impressed me. – OrangeSherbet – 2018-11-03T22:26:18.057


If the old variables and the new variables are highly correlated, you could do a more advanced form of imputation: build a model for each new input that predicts it from the old inputs. This model would probably be pretty good at predicting the new inputs because, as you said, the inputs are strongly correlated. Then split your data across the years so that you have an equal proportion of old records and new records in your training, validation, and test sets.
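A hedged sketch of this model-based imputation, on synthetic data (the coefficients and the use of a linear model are assumptions; any regressor would do):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Year 2-3 runs, where x4 was recorded and correlates with x1..x3.
X_old = rng.normal(size=(4000, 3))   # x1, x2, x3
x4 = X_old @ np.array([0.9, -0.4, 0.3]) + rng.normal(scale=0.05, size=4000)

# Fit "x4 given x1..x3" on the runs that actually have x4.
imputer = LinearRegression().fit(X_old, x4)

# Year-1 runs lack x4; fill it in from the always-present channels.
X_year1 = rng.normal(size=(2000, 3))
x4_filled = imputer.predict(X_year1)
```

The same pattern repeats for x5, x6, x7, each predicted from whichever channels are always available.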

Ryan Zotti

Posted 2016-07-14T18:52:58.407

Reputation: 3 849


I would multiply-impute the values for x4, x5, x6, and x7. For the number of imputations, look at the whole dataset, compute the percentage of fields missing, and round up to the nearest integer. Don't use mean or median imputation; use PROC MI in SAS or an equivalent. Because your data are monotone-missing, you could probably use a MONOTONE statement. This is likely the most conservative approach, since excluding information, whether variables or observations, opens you up to bias.

The Baron

Posted 2016-07-14T18:52:58.407

Reputation: 1