So I have a dataset containing the results of running problem instances with different solver strategies. Simplified example:
| Problem_instance | Problem_Size | Used_Solver | Cost |
|------------------|--------------|-------------|------|
| P1  | 50  | A | 75  |
| P1  | 50  | B | 125 |
| P1  | 50  | C | 225 |
| P1  | 50  | D | 100 |
| P2  | 150 | A | 165 |
| P2  | 150 | B | 360 |
| P2  | 150 | C | 275 |
| P2  | 150 | D | 45  |
| P3  | 25  | A | 35  |
| P3  | 25  | B | 65  |
| ... | ... | ... | ... |
I'm trying to use machine learning to predict the best-performing solver for a given problem instance. In the data preprocessing stage, I need to standardize or scale my data, but I'm not sure how best to do this.
Firstly, I'm not sure which of sklearn's scalers to use (`MinMaxScaler`, `StandardScaler`, ...).
Secondly, I'm confused about how to handle the different records for each instance. If I first group the data by `Problem_instance` and then apply a `MinMaxScaler` per group, the record with scaled `Cost = 0` would be the best solution for that instance and the one with `Cost = 1` the worst. But if I use the same strategy to scale `Problem_Size`, it would be equal to 0 everywhere. On the other hand, if I use global scaling, the information about which solver is the best for each instance is lost.
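To make the per-instance case concrete, here is a rough sketch of what I mean by grouping and scaling (using pandas and sklearn's `MinMaxScaler` on just the first two instances from the table above; the `scale_group` helper is only for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# First two problem instances from the example table.
df = pd.DataFrame({
    "Problem_instance": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
    "Problem_Size":     [50, 50, 50, 50, 150, 150, 150, 150],
    "Used_Solver":      ["A", "B", "C", "D", "A", "B", "C", "D"],
    "Cost":             [75, 125, 225, 100, 165, 360, 275, 45],
})

def scale_group(group):
    # Fit a separate MinMaxScaler for each problem instance.
    group = group.copy()
    scaler = MinMaxScaler()
    group[["Problem_Size", "Cost"]] = scaler.fit_transform(
        group[["Problem_Size", "Cost"]]
    )
    return group

scaled = df.groupby("Problem_instance", group_keys=False).apply(scale_group)
print(scaled)
# Within each instance: Cost = 0 is the best solver, Cost = 1 the worst.
# Problem_Size is constant within a group, so it is scaled to 0 everywhere
# and the size information is lost.
```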
Can someone advise me on how to handle the data preprocessing for this problem?