Feature selection for gene expression dataset


I am searching for a feature selection algorithm which selects features that are:

  • relevant for discriminating between groups of samples (a group label is provided for each sample)
  • highly variable across all samples

This should be applied to a gene expression dataset in which each sample has a group label, so it should be possible to select, for each group, a set of features to check against.

I currently have two candidates:

  • selecting features by the feature importances of a Random Forest classifier
  • using the Minimum Redundancy Maximum Relevance (mRMR) algorithm

However, I am unsure which of the two is better, or whether there are better candidates for this purpose.

If the algorithm is implemented in Python's scikit-learn, that would be a plus.


Posted 2016-06-28T08:17:05.813

Reputation: 849



It would help if you described your dataset in more detail. Gene expression datasets often have very high dimensionality (many more genes than samples), and Lasso-regularized (L1) logistic regression is a popular way to approach this problem. This paper takes it a little further and might help you out: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-198
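To make the Lasso idea concrete, here is a minimal sketch in scikit-learn. It runs on synthetic data, since I don't know your dataset: the matrix size, the two-group setup and the value of C are all assumptions for illustration. The L1 penalty pushes most coefficients to exactly zero, and the genes with non-zero coefficients are the selected ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix: 60 samples x 500 "genes"
# (assumed sizes; replace X and y with your own data and group labels).
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           n_redundant=0, random_state=0)

# Put all genes on the same scale so the penalty treats them equally.
X_scaled = StandardScaler().fit_transform(X)

# L1-penalized logistic regression; a smaller C means a stronger penalty
# and fewer selected genes. C=0.5 is an arbitrary illustrative value.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5,
                         random_state=0)
clf.fit(X_scaled, y)

# Indices of the genes whose coefficient was not shrunk to zero.
selected = np.flatnonzero(clf.coef_.ravel() != 0)
print(len(selected), "genes kept out of", X.shape[1])
```

In practice you would choose C by cross-validation (for example with LogisticRegressionCV) rather than fixing it by hand.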

Random forest can generally provide a meaningful importance ranking, but how useful that ranking is depends on what your dataset looks like.
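As a rough sketch of the random forest route in scikit-learn, again on synthetic data (the matrix size, the number of trees and the cut-off of 20 genes are assumptions, not recommendations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an expression matrix: 60 samples x 500 "genes".
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           n_redundant=0, random_state=0)

# Many trees help stabilise the importances when features outnumber samples.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one value per gene, normalised to sum to 1.
importances = forest.feature_importances_

# Gene indices sorted from most to least important; keep the top 20.
ranking = np.argsort(importances)[::-1]
top_genes = ranking[:20]
print(top_genes)
```

The ranking can change from one random seed to another when there are few samples, so it is worth repeating the fit with several seeds before trusting any particular gene.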

mRMR sounds like it was specifically designed for selecting genes from expression data, so definitely give it a try.

There's also Principal Component Analysis (PCA), which is commonly used for gene expression data.
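A small PCA sketch in scikit-learn, on synthetic data with assumed sizes and an arbitrary choice of 5 components. Keep in mind that PCA ignores the group labels and returns combinations of genes rather than a subset of them, so looking at the largest loadings is only an indirect way to pick individual genes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix: 60 samples x 500 "genes".
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Project the samples onto the first 5 principal components.
pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)   # shape: (n_samples, 5)

# Fraction of total variance captured by each component.
print(pca.explained_variance_ratio_)

# Genes with the largest absolute loading on the first component.
top_loading_genes = np.argsort(np.abs(pca.components_[0]))[::-1][:10]
print(top_loading_genes)
```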

There are lots of options, but your question is not detailed enough to go any further, and providing code as a full solution at this point isn't realistic. The scikit-learn documentation has many good explanations and examples.


Posted 2016-06-28T08:17:05.813

Reputation: 1 329