In short, xgboost tries to mitigate this, and although it is very good at reducing overfitting, it is not perfect.
Adding new features is not always beneficial, because you increase the dimension of your search space and thus make the problem harder. In your particular case, the increased complexity outweighs the value added by the extra features.
I understand you've tested quite a wide range of hyper-parameters and enough combinations. If you apply regularisation via colsample_bytree and/or colsample_bylevel, it can happen that at some stage the randomly chosen columns (features) are less informative than your original features, yet the algorithm is forced to use them for further splits. Does that make sense to you?
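To see why column subsampling can backfire here, a back-of-the-envelope sketch (the 100 added features and the 0.3 subsampling rate are made-up numbers, not taken from your setup): with colsample_bylevel, each tree level only sees a random subset of columns, and the chance that this subset is dominated by the weaker added features grows with their count.

```python
from math import comb

n_original = 20     # the original features (from the question)
n_added = 100       # hypothetical number of added features
n_total = n_original + n_added
k = round(0.3 * n_total)  # columns visible per level with colsample_bylevel=0.3

# Probability that a random column sample contains NO original feature at all
p_none = comb(n_added, k) / comb(n_total, k)
print(f"P(no original feature available at a level) = {p_none:.4f}")

# Expected number of original features in the sample (hypergeometric mean)
expected = k * n_original / n_total
print(f"Expected original features per sample: {expected:.1f} of {k}")
```

Even when a few original features do land in the sample, many splits will still be forced onto the less informative added columns.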
The number of added features may be crucial: if it is too high relative to the original 20, the new features simply become too dominant. For example, this can happen when one adds nominal features with high cardinality, which are then dummy-encoded.
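As a quick illustration of the dummy-encoding blow-up (the 50-level "zip_code" column is a made-up example), a single high-cardinality nominal feature can easily swamp the original numeric ones:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
categories = [f"Z{i:02d}" for i in range(50)]  # 50 distinct levels

# 200 rows; np.repeat guarantees every level actually occurs
codes = rng.permutation(np.repeat(categories, 4))
df = pd.DataFrame({
    "x1": rng.normal(size=200),  # stand-in for one original numeric feature
    "zip_code": codes,
})

encoded = pd.get_dummies(df, columns=["zip_code"])
print(encoded.shape[1])  # 1 numeric column + 50 dummy columns
```

One such column turns into 50 new ones, so the original features become a small minority of the search space.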
In order to improve the results on the wider dataset, you might want to play with the parameters controlling early stopping and stop the fitting even earlier.
The rationale for my last suggestion:
I assumed that the worsened performance on the wider dataset is due to overfitting, i.e. significantly worse performance on the test set than on the training data. Early stopping should prevent, or at least control, the overfitting, but it seems it didn't work so well in your case and should therefore be tightened.
You can and should test different combinations of new features, but a trustworthy quality metric is crucial for the model choice. If you overfit significantly, the performance on the training data (or even in cross-validation) will not help you choose the model (or feature combination) that performs well on the test set.
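A sketch of comparing feature combinations on held-out folds rather than on the training score (it uses sklearn's GradientBoostingRegressor as a stand-in booster so the snippet is self-contained, and the two candidate feature sets are made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=300)  # 5 real signals

# Hypothetical candidate feature combinations to compare
candidate_sets = {
    "original_only": list(range(5)),
    "original_plus_noise": list(range(30)),
}

cv_means = {}
for name, cols in candidate_sets.items():
    model = GradientBoostingRegressor(random_state=0)
    # Score each candidate set on held-out folds, never on the training fit
    scores = cross_val_score(model, X[:, cols], y, cv=5, scoring="r2")
    cv_means[name] = scores.mean()
    print(f"{name}: CV R^2 = {cv_means[name]:.3f}")
```

The point is the protocol, not the numbers: every candidate feature combination gets the same out-of-fold metric, and that metric, not the training score, decides which combination survives.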