How can I turn a classification problem into a regression problem?


I have data describing genes, each of which gets 1 of 4 labels; I use this to train models to predict labels for other, unlabelled genes. I have a huge class imbalance: ~10k genes in one label and 50-100 genes in each of the other 3 labels. Because of this imbalance I'm trying to convert my labels into numeric values so that a model predicts a score rather than a label, hoping to reduce bias.

Currently I convert my 4 labels (most likely, likely, possible, and least likely to affect a disease) into scores between 0 and 1: most likely: 0.9, likely: 0.7, possible: 0.4, and least likely: 0.1 (chosen based on how similar the label definitions are in their data). I've been using scatter plots with a linear model to try to understand which model would best fit my data and to reduce overfitting, but I'm not sure there's more I can infer from this except that the data appears homoskedastic (I think? I have a biology background, so I'm learning as I go):
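The label-to-score conversion described above can be sketched like this (label names and score values are the ones from the question; the function name is just illustrative):

```python
# Map the four ordinal labels to regression targets in [0, 1],
# using the scores chosen in the question.
LABEL_SCORES = {
    "most likely": 0.9,
    "likely": 0.7,
    "possible": 0.4,
    "least likely": 0.1,
}

def labels_to_scores(labels):
    """Convert a list of label strings to numeric regression targets."""
    return [LABEL_SCORES[label] for label in labels]
```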

[scatter plot of the scored data with a fitted linear model]

I'm not sure if there is a more principled way I should be approaching this, or if this conversion to regression is problematic in ways I don't realise. Should I be trying to develop more scores in my training data, or is there something else I should be doing?

Edit for more information: I create the current 4 labels based on drug knowledge of the genes and the drug data I currently have for each gene; I think I could incorporate other biological knowledge to make further labels. For example, the 'most likely' genes are currently labelled as such because they are drug targets for my disease, the 'likely' genes because they interact with drugs to cause a side effect which leads to the disease, and the other 2 labels go down in relevance until the 'least likely' genes have no drug-related or statistical evidence of causing the disease.


Posted 2020-03-27T10:29:58.173


Did you try classic strategies for dealing with imbalanced classes? To turn a classification problem into a regression one you need to change the labels, and the risk of introducing artefacts is really high. Despite the close similarity between classification and regression, I would never suggest doing that. – Edoardo Guerriero – 2020-03-27T12:38:00.607

Thank you for your reply. I've tried over- and undersampling, but the model still seemed to be overfitting, and ultimately I was recommended to try this as a regression. If this turns out to have worse problems, I may go back to an imbalance-aware classification approach though – DN1 – 2020-03-27T12:41:41.657

Then could you explain a little better how you would change the initial 4 classes into continuous values? You start with 4 classes; from what you wrote I understood that you want to predict a single probability for each class, but by doing this you're still doing classification. What would be the continuous values to predict in this scenario? For example, if I want to turn a classification task, i.e. classifying ironic vs non-ironic sentences, into a regression task, a potential regression task would be predicting the level of irony with a real value from 0 to 5. I don't feel you're doing the same here. – Edoardo Guerriero – 2020-03-27T12:49:18.367

And to better explain why I feel this is a wrong approach, let me stress an important aspect of my previous example. If I have labels for classes, i.e. I know which sentences in a dataset are ironic and which aren't, I don't have enough information to also say how ironic the sentences are. So to turn the classification problem into a regression one I would need to relabel the whole dataset, or I would need a special function that tells me how ironic a sentence is (which is impossible, because that's the model I want to train). Do you have other labels, or just the risk level? – Edoardo Guerriero – 2020-03-27T12:53:51.840

Thank you for answering this further. I've tried to respond by expanding the detail in my question about how I do my labelling. I think the continuous value to predict in this case would be the risk of a gene causing a disease. – DN1 – 2020-03-27T13:16:37.607



So, the direct answer here is clearly NO.

The answer follows from the definitions of classification and regression. In a classification task the model predicts the probability of an instance belonging to a class (e.g. 'image with clouds' vs 'image without clouds'); in regression you are trying to predict continuous values (e.g. the level of 'cloudiness' of an image).

Sometimes you can turn a regression problem into a classification one. For example, if I have a dataset of images labelled with cloudiness levels from 0 to 5, I could define a threshold, e.g. 2.5, use it to turn the continuous values into discrete ones, and use those discrete values as classes (cloudiness level < 2.5 means 'image without clouds'). But the opposite is definitely not possible.
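The discretisation step in that example is just a threshold check (a minimal sketch; the function name and class labels are only illustrative):

```python
def cloudiness_to_class(score, threshold=2.5):
    """Discretise a continuous cloudiness score (0-5) into two classes."""
    return "with clouds" if score >= threshold else "without clouds"
```

Going the other way is impossible because the class labels simply don't contain the continuous information you would need to recover.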

Here's also a link to a similar question: Could I turn a classification problem into a regression problem by encoding the classes?

There are many ways to solve the problem of imbalanced classes, not just oversampling: you can generate artificial data, add class weights to the loss function, use active learning to gather new data, or use models that return an uncertainty score for each prediction (like Bayesian networks). I'm sure there are plenty of answers and strategies you can try.
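One of the strategies mentioned above, class weights in the loss function, can be sketched with scikit-learn (assuming tabular features; the toy dataset here just stands in for the gene data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced dataset standing in for the gene features and labels.
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.95, 0.05], random_state=0)

# class_weight='balanced' reweights each sample inversely to its class
# frequency, so the minority class contributes as much to the loss as
# the majority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```

Most scikit-learn classifiers (and the loss functions of most deep learning frameworks) accept class weights in a similar way.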

Edoardo Guerriero


Thank you for this, this is incredibly helpful; you mention methods I haven't heard of before for addressing class imbalance and I will look into them. For my own understanding, if you also know: would it be possible for me to do classification and set probability thresholds, so that genes which get probabilities like '0.4, 0.3, 0.29, 0.34' per class aren't sorted into the label with probability 0.4 but are left unlabelled, because they don't meet a threshold I set at, say, 0.5? – DN1 – 2020-03-27T13:24:07.027

Yes, you definitely can, even though the situation can be tricky here too. During training I would say it is not advisable to ignore these data (especially because in the first iterations lots of data are misclassified). During testing, you can access the individual probabilities to perform whatever check you desire; this process is usually referred to as 'model calibration'. Be aware that calibration does not solve overfitting at all, i.e. it does not change the final predictions, but if for research purposes you can ignore arbitrary data and focus on the rest, then go for it. – Edoardo Guerriero – 2020-03-27T13:38:26.817
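The rejection threshold discussed in these two comments can be sketched as a post-processing step on the predicted probabilities (the function name is only illustrative; -1 stands for "unlabelled"):

```python
import numpy as np

def predict_with_reject(proba, threshold=0.5):
    """Return the argmax class per row of a (n_samples, n_classes)
    probability array, or -1 when no class reaches the threshold."""
    proba = np.asarray(proba)
    best = proba.argmax(axis=1)
    best[proba.max(axis=1) < threshold] = -1  # reject uncertain predictions
    return best
```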


Yes, you can go this route, using regression rather than classification, but you should one-hot encode your classes. This means that your model will have 4 outputs (alternatively, you can think of it as having 4 models). The first output will be the certainty that label1 applies, the second output label2, etc.

For example, if you have 10 data points with labels 1,2,3,4,2,4,3,1,1,2, your "one-hot" encoded labels look like this:

1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 1 0 0
0 0 0 1
0 0 1 0
1 0 0 0
1 0 0 0
0 1 0 0
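That encoding can be produced in a couple of lines (a sketch using NumPy; the labels are 1-based as in the example above):

```python
import numpy as np

labels = np.array([1, 2, 3, 4, 2, 4, 3, 1, 1, 2])

# Row i of the identity matrix has a 1 only in column i, so indexing
# np.eye(4) with the 0-based labels produces the one-hot matrix.
one_hot = np.eye(4, dtype=int)[labels - 1]
```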

And a prediction for one data point could look like this:

0.445 0.129 1.234 -0.231

This data point has a high score for label3, but also a small score for label1.

Marijn van Vliet


By doing this you're not turning a classification problem into a regression one, you're just stopping at the logits instead of returning probabilities; i.e. if you apply softmax and then argmax to the scores you wrote, you arrive at the same point where you started. One-hot encoding does not change the labels to make them suitable for regression; to do so you need to turn the discrete values into continuous ones in the initial dataset, which is the definition of regression. – Edoardo Guerriero – 2020-03-27T12:41:03.680

Marijn, one-hot encoding the target but keeping it classification is just the one-vs-all multiclass approach. Going from there to regression on 0/1 variables is not advisable: e.g., a score of -0.231 will get penalized for being "away" from zero the same amount as +0.231. @EdoardoGuerriero, it isn't the same as modeling logits though. – Ben Reiniger – 2020-03-27T13:22:38.970