Machine learning with informative missing values



I have a dataset of reviews completed by human reviewers. The target variable is whether the review decision is correct or incorrect, and one of my features is a trailing 4-week accuracy score for the reviewer.

These accuracy scores are not always available, however. My question is about how to model this data: the fact that no accuracy score is available might itself be a signal. Everything I have found in my research says that missing values must be imputed or removed. I am wondering whether there are techniques to incorporate the fact that the data is missing into the dataset.

Perhaps I could convert the score into a categorical variable {low, medium, high, not available} - would this be common practice? I am open to suggestions and would love to hear what is commonly done in these scenarios.
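Concretely, the conversion I have in mind would look something like this (the 0.5 and 0.8 cut points are placeholders, not recommendations):

```python
def bin_accuracy(score):
    """Map a trailing 4-week accuracy score (a 0-1 float, or None
    when no history is available) to a coarse category, keeping
    missingness as its own level."""
    if score is None:
        return "not available"
    if score < 0.5:       # placeholder threshold
        return "low"
    if score < 0.8:       # placeholder threshold
        return "medium"
    return "high"

print(bin_accuracy(None))  # -> not available
print(bin_accuracy(0.93))  # -> high
```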

Stats DUB01

Posted 2021-01-20T15:56:22.030

Reputation: 151



The common case of missing values, for which data is imputed or removed, assumes that missing values appear randomly in the data, so that the absence of a value has no relevance to the task.

From your description, in your data the fact that a value is missing is significant in itself. So I'd say that yes, it makes sense in this case to represent this information as a categorical variable. Note that it can indeed be represented as a special value of the score feature, but it doesn't have to be the same variable.


Posted 2021-01-20T15:56:22.030

Reputation: 12 600

Thanks Erwan. I'm not sure what you mean when you say that it can be represented as a special value of the score feature.

Ideally, I would not have to change the score feature to a categorical variable, since there might be value in maintaining the continuous numerical format (0-1). Is there any method you know of to do this - perhaps a modelling technique that allows for null values? I have read that some decision trees can allow for this – Stats DUB01 – 2021-01-20T20:37:08.500

@StatsDUB01 I suspected that it would be best to leave the regular score variable unchanged, that's why I mentioned the possibility of a different variable: you could simply add a binary feature, 0 for "regular score" and 1 for "missing value". It's not certain that this would work well with every type of algorithm: decision trees should be able to use it pretty well (essentially the model would not process the score the same way depending on this variable); for methods based on a numerical approach, it's hard to say. – Erwan – 2021-01-20T21:16:47.863
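A plain-Python sketch of that binary feature (the mean imputation here is just one illustrative choice, and the helper name is mine):

```python
def add_missing_indicator(scores):
    """Given a list of scores where None means "no score available",
    return (imputed, indicator): missing scores are replaced by the
    mean of the observed ones, and the indicator is 1 for rows that
    were missing, 0 otherwise."""
    observed = [s for s in scores if s is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if s is None else s for s in scores]
    indicator = [1 if s is None else 0 for s in scores]
    return imputed, indicator

scores = [0.5, None, 1.0, None]
imputed, indicator = add_missing_indicator(scores)
print(imputed)    # [0.5, 0.75, 1.0, 0.75]
print(indicator)  # [0, 1, 0, 1]
```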

@StatsDUB01 You're right that there are some models which can handle missing values by themselves; some variants of decision trees do. In this case it depends on the implementation: not all of them have a specific way to represent a value as missing. I know that at least Weka supports missing values in the data (simply represented as NA, as far as I remember). There are certainly other libraries as well, but I don't know which ones. – Erwan – 2021-01-20T21:20:06.807

@Erwan IIRC Weka's trees are from the Quinlan family, and they treat missing values by splitting the observations, with appropriate weights, to go down each path in the tree; in that way, I think you lose the information from the feature being missing. – Ben Reiniger – 2021-01-21T23:25:18.493


To help you find other resources, this is commonly referred to as "Missing Not At Random" (MNAR).

Some models, like xgboost, handle missing values inherently, making tree splits at a real value but then choosing which branch to send the missing values along. (Other implementations of CART don't do that, and the Quinlan family of trees does something entirely different.)
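To illustrate the mechanism (a toy sketch, not xgboost's actual implementation): for a candidate split, try sending the missing rows down each branch in turn and keep whichever "default direction" gives the lower loss.

```python
def split_loss(rows, threshold, missing_goes_left):
    """rows: list of (feature_value_or_None, label) pairs.
    Returns the number of misclassifications if each side of the
    split predicts its own majority label."""
    left, right = [], []
    for x, y in rows:
        if x is None:
            (left if missing_goes_left else right).append(y)
        elif x < threshold:
            left.append(y)
        else:
            right.append(y)
    loss = 0
    for side in (left, right):
        if side:
            majority = max(set(side), key=side.count)
            loss += sum(1 for y in side if y != majority)
    return loss

# Toy data: reviewers with no score (None) happen to be mostly
# incorrect (label 1), so missingness carries signal.
rows = [(0.2, 0), (0.3, 0), (0.9, 1), (None, 1), (None, 1)]
left_loss = split_loss(rows, 0.5, missing_goes_left=True)    # 2
right_loss = split_loss(rows, 0.5, missing_goes_left=False)  # 0
default = "left" if left_loss <= right_loss else "right"     # "right"
```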

For other models, I'd recommend adding a "missingness indicator" feature and then imputing. For linear models especially, the coefficient on the original feature can fit the "real" slope, while the coefficient on the indicator corrects for the missing values (and whatever imputation you use). See e.g. this stats.SE answer.
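As a small synthetic sketch of that (assuming numpy is available; the data-generating process is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_score = rng.uniform(0, 1, n)
missing = rng.random(n) < 0.3          # ~30% of scores unobserved

# Invented target: depends on the score when it is observed, while
# missing-score rows follow their own baseline (the MNAR signal).
y = np.where(missing, 0.3, 2.0 * true_score) + rng.normal(0, 0.05, n)

# Mean-impute the score and add the missingness indicator.
imputed = np.where(missing, true_score[~missing].mean(), true_score)
X = np.column_stack([np.ones(n), imputed, missing.astype(float)])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope, miss_coef = coef
# slope recovers the "real" relationship (close to 2.0), while
# miss_coef shifts predictions for the imputed rows to their baseline.
```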

Ben Reiniger

Posted 2021-01-20T15:56:22.030

Reputation: 7 097