I have a large genetic dataset on which I'm using XGBoost to score genes by how likely they are to be disease-causing, giving each gene a score between 0 and 1.
I try to avoid features with a lot of missing data, but this can be hard with genetic data; in the worst case, roughly half the values in one feature column are missing.
Currently I run my XGBoost model in two versions: one with random forest imputation of missing values, and one without imputation, where XGBoost handles the missing data directly. Under nested cross-validation, the imputed version achieves an R² of 0.7 and the version with missing values achieves 0.8.
My question is: how do I choose which version to carry forward? Can I trust that the higher R² of 0.8 with missing data means XGBoost is finding useful patterns in the missingness itself? Are there rules around missing data I should be abiding by? I come from a biology background, so I'm unsure what best practice is from a data science perspective. Most resources I've found online conclude that this is a case-by-case problem, which I find hard to translate into what I should specifically look into. Any help would be appreciated.