How to evaluate feature quality for a decision tree model

3

Most tutorials assume that the features are known before building the model and give no way to select 'good' features and discard 'bad' ones.

The naive method is to test the model with new features and see how the results change compared to the previous model, but this can be hard to interpret when the tree is complex.

Is there an academic way to select good features and to discard bad ones?

(resources appreciated)

Bertrand

Posted 2019-01-03T17:55:45.450

Reputation: 197

Answers

3

The main reasons for seeking efficient feature selection are that it speeds up training of the machine learning algorithm, reduces the complexity of the model, facilitates interpretation, and can improve the model's accuracy.

Look into filter methods, wrapper methods and embedded methods to learn more about your issue.

Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable. Here you should look at linear discriminant analysis, Pearson's correlation and the chi-squared test.
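For illustration, here is a minimal scikit-learn sketch of a filter method using the chi-squared test; the dataset and the number of kept features (k=2) are placeholders:

    # A minimal sketch of a filter method: score each feature with the
    # chi-squared test and keep the k highest-scoring ones.
    # The dataset and k=2 are placeholders.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    selector = SelectKBest(score_func=chi2, k=2)
    X_selected = selector.fit_transform(X, y)

    print(selector.scores_)        # chi-squared score per original feature
    print(selector.get_support())  # boolean mask of the retained features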

Some common examples of wrapper methods are:

Forward selection: an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves performance.

Backward elimination: here, we start with all the features and remove the least significant feature at each iteration, as long as doing so improves the performance of the model. We repeat this until removing a feature no longer yields an improvement.

Recursive feature elimination (RFE): a greedy optimization algorithm which aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each iteration, then builds the next model with the remaining features until all features are exhausted. It finally ranks the features based on the order of their elimination.
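As a concrete example, here is a minimal sketch of recursive feature elimination in scikit-learn, wrapped around a decision tree; the dataset and n_features_to_select are placeholders:

    # A minimal sketch of recursive feature elimination (RFE) wrapped around
    # a decision tree; the dataset and n_features_to_select are placeholders.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Repeatedly fit the tree and drop the least important feature
    # until only n_features_to_select remain.
    rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
              n_features_to_select=5)
    rfe.fit(X, y)

    print(rfe.support_)   # mask of the selected features
    print(rfe.ranking_)   # 1 = selected; larger values were dropped earlier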

Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection.

Some of the most popular examples of these methods are LASSO and ridge regression, which have built-in penalization functions to reduce overfitting.
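For instance, a minimal sketch of LASSO-based selection with scikit-learn's SelectFromModel could look like the following; the dataset and the alpha value are placeholders, and alpha would need tuning in practice:

    # A minimal sketch of an embedded method: LASSO regression whose L1
    # penalty drives the coefficients of uninformative features to zero.
    # The dataset and alpha value are placeholders; alpha would need tuning.
    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)

    selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)

    print(selector.estimator_.coef_)  # zero coefficients mark discarded features
    print(selector.get_support())     # mask of the retained features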

Another example of an embedded method that could fit your case is regularized trees.

Follow the link below for implementations of some of these algorithms in sklearn:

sklearn - Feature Selection
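From that page, here is one more minimal sketch of embedded, tree-based selection with SelectFromModel; the dataset, the ensemble and the threshold are placeholders:

    # A minimal sketch of embedded, tree-based selection: fit a tree ensemble
    # and keep only the features whose importance is above the threshold.
    # The dataset, ensemble and threshold are placeholders.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = load_breast_cancer(return_X_y=True)

    selector = SelectFromModel(
        ExtraTreesClassifier(n_estimators=200, random_state=0),
        threshold="median",            # keep the more important half
    ).fit(X, y)

    print(X.shape, "->", selector.transform(X).shape)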

I hope this helps you get started.

Marcelo

Posted 2019-01-03T17:55:45.450

Reputation: 46

Thx for the complete answer Marcelo! Give me some time to try that out before validating it :) – Bertrand – 2019-01-04T09:45:20.413

2

Another approach to evaluating features is called permutation importance. In short, this approach randomly shuffles (permutes) the values of each feature in turn and measures the negative impact this has on the model's performance. The feature whose shuffling has the largest negative impact on performance is considered the most important to the model.
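A minimal sketch using scikit-learn's permutation_importance, assuming a decision tree and a held-out test set; the dataset, model and n_repeats are placeholders:

    # A minimal sketch of permutation importance on a held-out test set;
    # the dataset, model and n_repeats are placeholders.
    from sklearn.datasets import load_breast_cancer
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle each feature column in turn and measure how much the test
    # score drops; a large drop means the model relies on that feature.
    result = permutation_importance(tree, X_test, y_test,
                                    n_repeats=10, random_state=0)
    print(result.importances_mean)    # mean score drop per feature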

Franziska W.

Posted 2019-01-03T17:55:45.450

Reputation: 291

0

I have no reputation to add a comment. I think Marcelo Silva gave a very nice answer (I don't know how to link his name).

In "Muñoz-Mas, R., Fukuda, S., Vezza, P., Martínez-Capel, F., 2016. Comparing four methods for decision-tree induction: A case study on the invasive Iberian gudgeon (Gobio lozanoi; Doadrio and Madeira, 2004). Ecol. Inform. 34, 22–34. 10.1016/j.ecoinf.2016.04.011" we used a wrapper approach to simultaneously search for the best hyper-parameters and variable subset using cross-validation and a genetic algorithm. We included a thorough review in that paper, so it may be worth a look. The forest-based variant is currently under review. I'm the first author, so write to me if you want to discuss anything about trees and forests.

An alternative would be to use conditional random forests "Strobl, C., Hothorn, T., Zeileis, A., 2009. Party on! R J. 1 (2), 14–17. (and linked references)" on the entire set of variables, and then use only those that proved most relevant to train the decision tree. Nevertheless, take into account that hyper-parameters also affect the final decision tree, so they will need some tuning as well. If you are not interested in interpretability, I would use a forest instead of a single decision tree. Good luck.
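For illustration only, a rough scikit-learn sketch of that forest-then-tree idea; note that it uses an ordinary random forest as a stand-in for the conditional random forests implemented in the R package party, and the number of kept variables and the tree depth are just placeholders:

    # A rough sketch only: an ordinary random forest stands in for the
    # conditional random forests of the R package party, and the number of
    # retained variables (5) and the tree depth are placeholders.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # 1. Rank all variables with a forest.
    forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    top = np.argsort(forest.feature_importances_)[::-1][:5]

    # 2. Train an interpretable single tree on the top-ranked variables only.
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[:, top], y)
    print(top, tree.score(X[:, top], y))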

Rafael Muñoz-Mas

Posted 2019-01-03T17:55:45.450

Reputation: 81