So I have a dataset containing the results of solving problem instances with different solver strategies. A simplified example:

```
| Problem_instance | Problem_Size | Used_Solver | Cost |
| P1 | 50 | A | 75 |
| P1 | 50 | B | 125 |
| P1 | 50 | C | 225 |
| P1 | 50 | D | 100 |
| P2 | 150 | A | 165 |
| P2 | 150 | B | 360 |
| P2 | 150 | C | 275 |
| P2 | 150 | D | 45 |
| P3 | 25 | A | 35 |
| P3 | 25 | B | 65 |
| ... | ... | ... | ... |
```
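For reference, this is roughly how the data looks as a pandas DataFrame (only the rows shown above; the later snippets assume this `df`):

```python
import pandas as pd

# Minimal reconstruction of the example data above (just the rows shown)
df = pd.DataFrame({
    "Problem_instance": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2", "P3", "P3"],
    "Problem_Size":     [50, 50, 50, 50, 150, 150, 150, 150, 25, 25],
    "Used_Solver":      ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"],
    "Cost":             [75, 125, 225, 100, 165, 360, 275, 45, 35, 65],
})
```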

I'm trying to use machine learning to predict the best-performing solver for a given problem instance. In the data preprocessing stage, I need to standardize or scale my data, but I'm not sure how to do this best.

Firstly, I'm not sure which of sklearn's scalers to use (`StandardScaler` / `MinMaxScaler` / ...).
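For illustration, applying either scaler to a single column looks like this (just a sketch using the `df` from above; the `_std` / `_minmax` column names are made up):

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Option 1: zero mean, unit variance per column
df["Cost_std"] = StandardScaler().fit_transform(df[["Cost"]]).ravel()

# Option 2: rescale each column to the [0, 1] range
df["Cost_minmax"] = MinMaxScaler().fit_transform(df[["Cost"]]).ravel()
```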

Secondly, I'm confused about how to handle the different records for each instance. When I first group the data by `Problem_instance` and then apply a `MinMaxScaler`, the record with `Cost = 0` is the best solution for that problem and the one with `Cost = 1` the worst. But if I use the same strategy to scale `Problem_Size`, it becomes 0 everywhere, because the size is constant within each instance. On the other hand, if I scale globally, the information about which solver is the best for each instance is lost.
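In code, the two options I'm weighing look roughly like this (again a sketch building on the `df` above; the `*_scaled` column names and the `minmax_per_group` helper are just things I made up for illustration):

```python
from sklearn.preprocessing import MinMaxScaler

def minmax_per_group(s):
    # Min-max scale one group's values; a constant column (like Problem_Size
    # within a single instance) just becomes 0 everywhere
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng != 0 else s * 0.0

# Per-instance scaling: within each instance the best solver gets Cost 0 and
# the worst gets 1, but Problem_Size collapses to 0 in every group
df["Cost_scaled_per_instance"] = df.groupby("Problem_instance")["Cost"].transform(minmax_per_group)
df["Size_scaled_per_instance"] = df.groupby("Problem_instance")["Problem_Size"].transform(minmax_per_group)

# Global scaling: Problem_Size keeps its information, but Cost values are no
# longer comparable as "best vs. worst within one instance"
df["Cost_scaled_global"] = MinMaxScaler().fit_transform(df[["Cost"]]).ravel()
df["Size_scaled_global"] = MinMaxScaler().fit_transform(df[["Problem_Size"]]).ravel()
```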

Can someone help me figure out how to handle the data preprocessing for this problem?