I work in physics. We have lots of experimental runs, with each run yielding a result, `y`, and some parameters that should predict the result, `x`. Over time, we have found more and more parameters to record. So our data looks like the following:

```
Year 1 data: (2000 runs)
parameters: x1,x2,x3 target: y
Year 2 data: (2000 runs)
parameters: x1,x2,x3,x4,x5 target: y
Year 3 data: (2000 runs)
parameters: x1,x2,x3,x4,x5,x6,x7 target: y
```

How does one build a regression model that incorporates the additional information we recorded, without throwing away what it "learned" about the older parameters?

Should I:

- just set `x4`, `x5`, etc. to `0` or `-1` when I'm not using them?
- completely ignore `x4,x5,x6,x7` and only use `x1,x2,x3`?
- add another parameter that is simply the number of parameters?
- train separate models for each year, and combine them somehow?
- "weight" the parameters, so as to ignore them if I set the weight to 0?
- make three different models, using `x1,x2,x3`, `x4,x5`, and `x6,x7` parameters, and then interpolate somehow?
- make a custom "imputer" to guesstimate the missing parameters (using available parameters)?

I have tried imputation using mean and median, but neither works very well because the parameters are not independent, but rather fairly correlated.
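Because the parameters are correlated, one option is to predict a missing parameter from the ones that were recorded instead of filling in a constant. The following is only a minimal sketch on synthetic data: the linear relation between `x4` and `x1,x2,x3`, the noise level, and names such as `make_runs` are assumptions made up for illustration, not properties of the real experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_runs(n):
    """Synthetic runs (illustrative only): x4 depends linearly on x1..x3 plus noise."""
    x123 = rng.normal(size=(n, 3))
    x4 = 2.0 * x123[:, 0] - 1.0 * x123[:, 1] + 0.5 + 0.1 * rng.normal(size=n)
    return np.column_stack([x123, x4])

year2 = make_runs(2000)        # x1..x4 all recorded
year1_full = make_runs(2000)   # ground truth, kept only to score the imputation
year1 = year1_full.copy()
year1[:, 3] = np.nan           # x4 was not recorded in year 1

# Fit x4 ~ x1, x2, x3 (with intercept) on the runs where x4 was recorded.
A2 = np.column_stack([year2[:, :3], np.ones(len(year2))])
coef, *_ = np.linalg.lstsq(A2, year2[:, 3], rcond=None)

# Use the fitted relation to fill in x4 for the older runs.
A1 = np.column_stack([year1[:, :3], np.ones(len(year1))])
year1_imputed = year1.copy()
year1_imputed[:, 3] = A1 @ coef

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Compare against filling x4 with the year-2 mean.
rmse_regression = rmse(year1_imputed[:, 3], year1_full[:, 3])
rmse_mean = rmse(np.full(len(year1), year2[:, 3].mean()), year1_full[:, 3])
print(rmse_regression, rmse_mean)
```

On data like this the regression fill should track the true values much more closely than the mean, but how much it helps on real runs depends on how strong the correlations actually are. scikit-learn's experimental `IterativeImputer` applies a similar regress-each-column-on-the-others idea across all columns in turn.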

I'd set up a neural network and disconnect the neurons corresponding to the missing inputs as necessary. – Emre – 2016-07-14T19:20:45.543
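One way to read "disconnecting" an input is to multiply it by a 0/1 mask before the first layer, so a missing parameter contributes nothing to the hidden units. The sketch below is a rough illustration with random, untrained weights; the layer sizes, the `forward` function, and the example values are all hypothetical. It only checks that masking inputs gives the same output as a smaller network that never had those inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 7, 16

# Random (untrained) weights for a one-hidden-layer network; illustrative only.
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(x, mask):
    """x: (n, 7) inputs; mask: (n, 7) with 1 = recorded, 0 = missing."""
    # Missing entries (even NaN) are replaced by 0, so their weights have no effect.
    x = np.where(mask == 1, x, 0.0)
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

# A year-1 style run: only x1..x3 were recorded.
x_year1 = np.full((1, n_inputs), np.nan)
x_year1[0, :3] = [0.2, -1.0, 0.7]
mask_year1 = np.array([[1, 1, 1, 0, 0, 0, 0]])

y_masked = forward(x_year1, mask_year1)

# Same result as a network that only has the first three input rows connected.
y_reduced = np.tanh(x_year1[:, :3] @ W1[:3] + b1) @ W2 + b2
print(y_masked, y_reduced)
```

With this kind of masking, the gradient of the loss with respect to the first-layer weights of a masked input is zero for that run, so older runs would leave the weights for `x4` and later parameters untouched during training.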

Impute the missing values? – Brandon Loudermilk – 2016-07-14T20:12:53.817

I could impute the missing values, but I would have to make my own imputer; the imputer in scikit-learn is not very smart (mean, median, etc.). I'm not sure I want to just replace every missing value with the average. – OrangeSherbet – 2016-07-14T20:44:41.313

What kind of model are you intending to make? When used to predict, do you need it to work with limited parameters same as your early experiments? Or will the final model assume that users will always input all the parameters you have determined are interesting? – Neil Slater – 2016-07-15T06:46:21.513

The model needs to predict `y` well across the historical data and the new data. One of the parameters, `x1`, is a time, and I need to model `y` as a function of this time parameter, y(x1), and fit an exponential to it for each run. y(x1) is the goal, but to improve y(x1)'s predicted form, I hope to incorporate more than just `x1,x2` by using `x3,x4,...` as these new measurement channels come into use. As the experiment continues, new measurement channels will probably be added. – OrangeSherbet – 2016-07-15T19:13:09.787

My worry about imputing the missing values is that it might dilute the information when the values are not missing. – OrangeSherbet – 2016-07-15T21:36:57.240

If you are looking at "pointwise estimates" of partially observed random variables, imputing the missing values with the mean might be a valid approach. If the data are independent and identically distributed across Years 1, 2, ..., the mean is the least likely to introduce bias.

Or you can model the problem using a probabilistic graphical model, in which case you can marginalize out the missing variables. In this kind of modeling, you also take the spread of the variable into account while predicting y. So missing variables with high variance might not dilute the information. – abhnj – 2016-07-16T01:07:56.873
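As a rough illustration of marginalizing out missing variables, the sketch below assumes the parameters and `y` are jointly Gaussian (an assumption made here for simplicity, not something established about the experiment). For a Gaussian, marginalizing a variable just means dropping its row and column from the mean and covariance, after which `y` can be conditioned on whatever was observed. The data, coefficients, and the `predict_y` helper are synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic fully observed runs: columns 0..3 are parameters, column 4 is y.
n = 5000
X = rng.normal(size=(n, 4))
X[:, 3] = 0.8 * X[:, 0] + 0.6 * rng.normal(size=n)   # a correlated parameter
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 3] + 0.1 * rng.normal(size=n)
Z = np.column_stack([X, y])

# Joint Gaussian estimated from the fully observed runs.
mu = Z.mean(axis=0)
Sigma = np.cov(Z, rowvar=False)
Y = Z.shape[1] - 1   # index of y in the joint vector

def predict_y(x_obs, obs_idx):
    """Mean and variance of y given only the parameters listed in obs_idx."""
    obs_idx = np.asarray(obs_idx)
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]   # covariance of observed parameters
    S_yo = Sigma[Y, obs_idx]                 # covariance between y and observed
    w = np.linalg.solve(S_oo, S_yo)
    mean = mu[Y] + w @ (np.asarray(x_obs) - mu[obs_idx])
    var = Sigma[Y, Y] - w @ S_yo
    return float(mean), float(var)

# Same run, predicted with all four parameters and with the last one missing.
mean_all, var_all = predict_y([0.5, -1.0, 0.3, 0.4], [0, 1, 2, 3])
mean_old, var_old = predict_y([0.5, -1.0, 0.3], [0, 1, 2])
print(mean_all, var_all)
print(mean_old, var_old)
```

The prediction for the run with a missing parameter comes with a larger variance rather than a made-up value, which is the sense in which the spread of the variable is taken into account.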

Tried it... I guess my data is such that the mean doesn't actually work very well. I'm going to need a better imputation method. – OrangeSherbet – 2016-08-01T19:06:03.003