I have tried 5 different types of model but all return really low training accuracy (~64%) and low testing accuracy (~14%). What should I do?


I am working with a typical regression problem. There are $6$ features in the dataset that I am concerned with, and about $800$ data points. The features have a highly non-linear correlation with the target, so the features are not useless (as far as I understand). The target values have a bimodal distribution, so I disregarded linear models pretty quickly.

So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and the XGBoost regressor. The training dataset returns $64\%$ accuracy and the test data returns $15\%$. Both numbers scare me, haha. I tried tuning the parameters for the random forest, but nothing seems to make a drastic difference.
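A large train/test gap like this is usually easier to diagnose with cross-validation than with a single split. A minimal sketch of that check, using synthetic data standing in for the 800-point, 6-feature dataset described above (the data and model settings here are illustrative, not the original ones):

```python
# Cross-validated R^2 check on synthetic data (illustrative stand-in
# for the real 800-sample, 6-feature dataset).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=800, n_features=6, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
# 5-fold CV gives a less optimistic (and less noisy) picture than
# one train/test split, especially with only ~800 points.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print("Mean R^2:", scores.mean())
```

If the fold scores vary wildly, the single-split numbers above are not trustworthy on their own.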

What should I do next if you guys don't mind giving me a recommendation?

Function to tune the hyperparameters

def hyperparatuning(model, train_features, train_labels, param_grid={}):
    # Requires: from sklearn.model_selection import GridSearchCV
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                               cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
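For context, here is a hypothetical call to that tuning helper on synthetic data. The parameter grid values are illustrative, not the ones actually used in the question:

```python
# Hypothetical usage of the hyperparatuning helper (grid values are
# made up for illustration; substitute your own data and grid).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid={}):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                               cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_

X, y = make_regression(n_samples=200, n_features=6, random_state=0)
grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
best = hyperparatuning(RandomForestRegressor(random_state=0), X, y, grid)
# `best` is a fitted RandomForestRegressor with the winning parameters.
```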

Function to evaluate the model

def evaluate(model, test_features, test_labels):
    # Requires: import numpy as np
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
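Worth noting: the MAPE-based "accuracy" above divides by the test labels, so it blows up when labels are near zero and flips sign when labels are negative, which may explain the alarming numbers. A minimal sketch using standard scikit-learn regression metrics instead (the toy arrays here are purely illustrative):

```python
# Standard regression metrics avoid the division-by-label problem
# in the MAPE-style accuracy above. Toy values for illustration only.
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([2.0, -1.5, 3.2, 0.1])   # toy labels
y_pred = np.array([1.8, -1.2, 3.0, 0.4])   # toy predictions

print("MAE :", mean_absolute_error(y_true, y_pred))  # mean of |errors|
print("R^2 :", r2_score(y_true, y_pred))             # 1.0 is a perfect fit
```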

Hang Nguyen

Posted 2019-07-15T05:07:57.073

Reputation: 9

More of a query: are ensemble methods some kind of cocktail of models? If so, and if there's any distinct pattern in the error terms, you may try the ensemble route. (ref. ESL, pg 605). – Continue2Learn – 2019-07-15T06:33:03.623

The first thing to try is improving your validation method - 64% on training dataset and 15% on test implies that your model is severely overtrained. – timchap – 2019-07-16T09:12:39.073

No answers