Apologies for a very case-specific question. I have a dataset of genes, which I am using with machine learning to predict whether a gene causes a disease. One of the features I have is a beta value (the effect size of the gene's impact on the disease), and I'm not sure how best to interpret and use this feature.
I condense the beta values from the variant level to the gene level, so a gene is left with multiple beta values like this:
Gene    Beta
ACE     -0.7, 0.1, 0.6
NOS     0.2, 0.4, 0.5
BRCA    -0.1, 0.1, 0.2
Currently I am trying two options for selecting a single beta value per gene: one where I select the absolute value per gene (ignoring whether it was originally negative), and another where I select the absolute value and then restore the negative sign on values that were originally negative. I am trying this because, for beta values, the sign indicates the direction of the effect a gene has on the disease (and the magnitude indicates its size), so I would think it's important to retain the negative information (as I understand it).
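To make the two options concrete, here is a minimal sketch. It assumes that "the absolute value per gene" means the beta with the largest magnitude among a gene's variants; that selection rule, and the helper name `strongest_beta`, are assumptions for illustration only. The values are the example values above.

```python
# Example values from the question; the selection rule below is an assumption.
gene_betas = {
    "ACE": [-0.7, 0.1, 0.6],
    "NOS": [0.2, 0.4, 0.5],
    "BRCA": [-0.1, 0.1, 0.2],
}

def strongest_beta(betas, keep_sign):
    """Pick the beta with the largest magnitude (hypothetical rule).

    keep_sign=False returns its absolute value (option 1);
    keep_sign=True returns it with its original sign (option 2).
    """
    strongest = max(betas, key=abs)
    return strongest if keep_sign else abs(strongest)

unsigned = {gene: strongest_beta(b, keep_sign=False) for gene, b in gene_betas.items()}
signed = {gene: strongest_beta(b, keep_sign=True) for gene, b in gene_betas.items()}

print(unsigned)  # {'ACE': 0.7, 'NOS': 0.5, 'BRCA': 0.2}
print(signed)    # {'ACE': -0.7, 'NOS': 0.5, 'BRCA': 0.2}
```

In this example, only ACE differs between the two options, because it is the only gene whose largest-magnitude beta is negative.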
However, I've been advised to use just the absolute values without retaining the sign, and I'm not sure if there's a way for me to know whether one option is better than the other from a machine learning perspective. I am also having a problem in either case where my model ranks this feature as much more important than any other feature in my dataset. For example, gradient boosting gives it an importance of 0.01, while the next most important feature is at 0.001.
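One possible way to check whether either option is better is to fit the same model with each version of the feature and compare cross-validated scores. The sketch below assumes scikit-learn's `GradientBoostingClassifier` and a binary disease label. All data in it is random placeholder data, not the real dataset, and the helper name `mean_cv_auc` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Random placeholder data standing in for the real gene-level dataset.
rng = np.random.default_rng(0)
n_genes = 200
signed_beta = rng.normal(0.0, 0.5, size=n_genes)   # one signed beta per gene
other_features = rng.normal(size=(n_genes, 3))      # remaining features
labels = rng.integers(0, 2, size=n_genes)           # 1 = disease gene, 0 = not

def mean_cv_auc(beta_column):
    """Mean 5-fold cross-validated ROC AUC using the given beta encoding."""
    X = np.column_stack([beta_column, other_features])
    model = GradientBoostingClassifier(random_state=0)
    scores = cross_val_score(model, X, labels, cv=5, scoring="roc_auc")
    return scores.mean()

auc_signed = mean_cv_auc(signed_beta)            # option 2: sign retained
auc_unsigned = mean_cv_auc(np.abs(signed_beta))  # option 1: absolute value only

print(f"signed beta:   mean AUC = {auc_signed:.3f}")
print(f"unsigned beta: mean AUC = {auc_unsigned:.3f}")
```

With real data, if the two scores differ by less than the fold-to-fold variation, that would suggest the sign adds little for this model. The placeholder labels here are random, so the printed scores carry no meaning.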
So my question is: how best can I interpret a highly important feature like this? If it is much more important than the others, is that actually a bias, likely due to my own handling/preprocessing of the feature, or is it acceptable that it is just very important? Would it be possible for me to set my model to re-weight the importance of this particular feature? I have a biology background, so I'm not sure what the normal or least biased approach is.