Determining Important Attributes with Feature Selection



I have a labeled dataset with two classes, say sick and healthy patients. My features are patient data as well as diagnostic data such as blood-test results, both categorical and continuous. In total there are several thousand features.

I want to identify the features that best differentiate the sick patients from the healthy ones.

It is not really an anomaly detection problem, because the sickness is not directly indicated by anything in my features: for none of my features would the sick patients show up as outliers. For example, a result could be that the risk of sickness is higher for patients over the age of 50, but being over 50 is not in itself uncommon.

The obvious approach would be to use a dedicated feature selection technique, for example a chi-square test, variable importance in decision trees, or backward elimination.
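As a sketch of what the first of these looks like in practice, here is a chi-square test of independence between one categorical feature and the class label, using scipy. The feature values and labels below are made-up illustrative data, not from any real dataset:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical categorical feature (smoking status) vs. sick/healthy label.
feature = np.array(["smoker", "smoker", "non-smoker", "non-smoker",
                    "smoker", "non-smoker", "smoker", "non-smoker"])
label = np.array(["sick", "sick", "healthy", "healthy",
                  "sick", "healthy", "healthy", "sick"])

# Build the contingency table: rows = feature levels, columns = classes.
levels, classes = np.unique(feature), np.unique(label)
table = np.array([[np.sum((feature == lv) & (label == cl)) for cl in classes]
                  for lv in levels])

# A small p-value suggests the feature and the class are not independent.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Repeating this per feature and ranking by p-value (with multiple-testing correction) gives a simple univariate screen.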

The problem with all of these approaches is that they are usually preprocessing steps before another learning algorithm is deployed. I, on the other hand, don't want to use another learning algorithm: the entire task is to figure out the most important features.

Are feature selection algorithms really the best way to proceed, or are there better approaches? Are there perhaps techniques designed especially for this kind of problem?

Many thanks in advance!


Posted 2017-06-07T11:51:19.770

Reputation: 159


Keep in mind that "variable importance" as defined by sklearn.ensemble.RandomForest is based on node impurity, meaning the truly important variables might not be the ones RF identifies as most "important". Also, consider that if two important features are correlated and RF chooses one of them for a split, the measured importance of the other is reduced. Have a look at this post. Regarding your question, have a look at feature selection here.

– Jekaterina Kokatjuhha – 2017-06-07T16:37:13.300

@Jekaterina Kokatjuhha Thank you for your answer and the links. In fact those are exactly the problems I encountered. My data set has A LOT of correlated variables. That's why I'm not certain that 'classic' feature selection approaches are the best way to tackle my problem. – AutoMiner – 2017-06-08T07:43:11.507
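The correlated-features effect discussed in these comments is easy to demonstrate on synthetic data: when two informative columns are nearly identical, a random forest splits the impurity importance between them, so each looks weaker than the signal really is. This is an illustrative sketch, not an analysis of the asker's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
X = np.column_stack([
    signal,                              # informative feature
    signal + 0.01 * rng.normal(size=n),  # near-duplicate of it
    rng.normal(size=n),                  # pure noise
])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.feature_importances_)
# The two correlated columns share the importance that a single copy of the
# signal would otherwise receive; the noise column stays near zero.
```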



Time for a more general answer.

Your approach should be pragmatic, based on your objectives and time scales. Are you trying to find nuanced links between predictors and response, or are you hoping for more of a rough idea of "important" variables? Do you have the time and resources to prepare your data and run more complicated algorithms / processes?

Something to bear in mind is that there are often many ways to achieve the same goal. You may find that for your particular data and task, simple ("traditional") tests may perform as well as more complicated techniques.

That said, perhaps consider something along the lines of the following - in increasing order of complexity:

  1. Chi-square / Cramer's V for categorical predictors
  2. ANOVA test for continuous predictors
  3. Forward selection (a GLM technique)
  4. Decision trees / bagged trees / random forest importance
  5. Boosted trees
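A minimal sketch of item 2 above, in Python rather than R: scikit-learn's `f_classif` runs a one-way ANOVA F-test per continuous column. The data here is synthetic, standing in for real features:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(1)
n = 500
y = rng.integers(0, 2, size=n)       # sick / healthy labels
X = np.column_stack([
    rng.normal(loc=y, scale=1.0),    # mean shifts with the class
    rng.normal(size=n),              # unrelated to the class
])

# Larger F (smaller p) means the class means differ more for that feature.
F, p = f_classif(X, y)
for i, (fi, pi) in enumerate(zip(F, p)):
    print(f"feature {i}: F = {fi:.1f}, p = {pi:.3g}")
```

Ranking features by F-statistic (or p-value) is exactly the kind of cheap first pass this list starts with.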

Alternatively, if you're looking for R packages to help you with feature selection, have a look at caret and FSelector.


Posted 2017-06-07T11:51:19.770

Reputation: 1 387


Sounds like pretty orthodox feature importance analysis. Easy option:

featurePlot from caret

It basically creates a matrix of plots of the features against the outcome, so you can see which ones seem to separate the classes most.

Slightly fancier:

varImpPlot from randomForest

This uses the random forest ML technique to directly tell you which features were most important. It sounds very similar to what you're looking for.

Edit: Here is a more detailed discussion of VarImpPlot.
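For readers working in Python rather than R, a rough analogue of varImpPlot is to fit scikit-learn's RandomForestClassifier and rank features by permutation importance (similar in spirit to randomForest's MeanDecreaseAccuracy). The data here is a toy stand-in:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(50, 10, size=n)
noise = rng.normal(size=n)
y = (age > 50).astype(int)            # toy "sick" label driven by age
X = np.column_stack([age, noise])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: shuffle one column at a time and measure how much
# the score drops; informative features cause a large drop.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["age", "noise"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Permutation importance also sidesteps some of the impurity-based bias mentioned in the comments, though it still shares importance between correlated features.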


Posted 2017-06-07T11:51:19.770

Reputation: 1 548


You could play a bit with the classic "Pima Indians Diabetes" dataset of Native American women tested negative/positive for diabetes.

(In R, there are a few variants (with and without missing data) of this dataset in the MASS package.)

However, a "diabetes.arff" datafile it is also present as a sample dataset in the Weka software, and Weka has a few "Attribute Selection" algorithms built in.

If you know Weka, I suggest that you try them yourself. With a few clicks you can learn a lot about your dataset (without getting a definitive answer, of course).

I have added a screenshot illustrating my approach.

weka screenshot

Weka starts with the small "Weka GUI Chooser" window. I have opened the Weka Explorer (1), and the big window shows up.

I've imported the file in the "Preprocess" tab (2). I just loaded it and didn't apply any preprocessing, as the dataset is already clean.

In (3) I've chosen the "Select attributes" tab and tried a few of Weka's algorithms. Digits 4, 5, 6 indicate the buttons I had to click.

(7) shows my last run: the "GainRatioAttributeEval" method, which Weka combined with the "Ranker" method of weighting the results.

The main panel (8) shows which attributes/features that algorithm considers the most important, ranked by importance:

=== Attribute selection 10 fold cross-validation (stratified), seed: 1 ===

average merit      average rank  attribute
 0.104 +- 0.008     1.1 +- 0.3     2 plas   ---  Plasma glucose
 0.092 +- 0.008     1.9 +- 0.3     6 mass   ---  Body mass
 0.067 +- 0.009     3.1 +- 0.3     8 age    ---  age
 0.052 +- 0.005     3.9 +- 0.3     1 preg   ---  
 0.04  +- 0.003     5   +- 0       5 insu   ---  
 0.02  +- 0.011     6.4 +- 0.66    7 pedi   --- 
 0.016 +- 0.011     7.1 +- 0.7     4 skin   --- 
 0.009 +- 0.009     7.5 +- 0.67    3 pres   --- 

# Column Positions in the original datatable:
%    1. Number of times pregnant
%    2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
%    3. Diastolic blood pressure (mm Hg)
%    4. Triceps skin fold thickness (mm)
%    5. 2-Hour serum insulin (mu U/ml)
%    6. Body mass index (weight in kg/(height in m)^2)
%    7. Diabetes pedigree function
%    8. Age (years)
%    9. Class variable (0 or 1)

So this attribute selection algorithm thinks the top 3 attributes (in that order) are #2, #6, and #8: plasma glucose, body mass index, and age. Sounds pretty reasonable to me.
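For intuition about what GainRatioAttributeEval computes, here is a hedged sketch: the information gain of an attribute about the class, divided by the attribute's own entropy (the "split information"). Toy categorical data, not the Pima dataset:

```python
import numpy as np

def entropy(values):
    """Shannon entropy (base 2) of a discrete sequence."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(attribute, label):
    """Information gain of `attribute` about `label`, normalised by the
    attribute's split information."""
    h_label = entropy(label)
    # Expected class entropy after partitioning on the attribute's values.
    cond = 0.0
    for v in np.unique(attribute):
        mask = attribute == v
        cond += mask.mean() * entropy(label[mask])
    gain = h_label - cond
    split_info = entropy(attribute)
    return gain / split_info if split_info > 0 else 0.0

attr = np.array(["a", "a", "b", "b", "b", "a"])
lab = np.array([1, 1, 0, 0, 0, 1])
print(gain_ratio(attr, lab))   # → 1.0 (perfectly informative attribute)
```

The normalisation by split information is what keeps many-valued attributes (like an ID column) from looking spuriously informative, which plain information gain suffers from.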

Keep in mind that this is an idealized example.

How much missing data does your dataset contain? You didn't tell us. I suspect your patient data will be very sparse, with lots of empty cells, and many attribute selection algorithms work best when there are few missing values.


Posted 2017-06-07T11:51:19.770

Reputation: 535