Feature Selection and PCA



I have a classification problem with 30 features, and I want to reduce them to 4. I'm wondering why I get a better classification result when I use correlation-based feature selection (CFS) first and then apply PCA, compared with applying PCA alone (the latter is worse than the former). It should also be mentioned that the variance loss is larger in the first approach: PCA alone covers 0.8 of the variance (0.2 lost), while CFS + PCA covers only 0.6 (0.4 lost)!

Thank you in advance


Posted 2016-06-17T12:32:37.780

Reputation: 433



PCA simply finds more compact ways of representing correlated data. PCA does not explicitly compact the data in order to better explain the target variable. In some cases, most of your inputs might be correlated with each other but have minimal relevance to your target variable. That's probably what is happening in your case.

Consider a toy example. Let's say I want to predict stock prices, and I'm given four predictors:

  1. Year-over-year earnings growth (relevant)
  2. Percent chance of rain (irrelevant)
  3. Humidity (irrelevant)
  4. Temperature (irrelevant)

If I apply PCA to this data set, the first principal component would relate to weather, since 75% of the predictors are weather related. Is this principal component relevant? It's not.

The two options you've highlighted boil down to using CFS or not using it. The option that uses CFS does better because it explicitly selects variables that have relevance to the target variable.
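To make the toy example concrete, here is a small sketch using synthetic data (the variable names and noise levels are assumptions, not real stock data): three predictors are driven by a shared latent "weather" factor, one independent predictor is the relevant signal, and the first principal component ends up dominated by the correlated but irrelevant weather features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000

earnings = rng.normal(size=n)               # the one relevant predictor
weather = rng.normal(size=n)                # latent weather factor
rain = weather + 0.1 * rng.normal(size=n)   # three noisy copies of the
humidity = weather + 0.1 * rng.normal(size=n)  # same irrelevant factor
temp = weather + 0.1 * rng.normal(size=n)

X = np.column_stack([earnings, rain, humidity, temp])
pca = PCA(n_components=1).fit(X)
loadings = np.abs(pca.components_[0])

# The first principal component loads heavily on the three correlated
# weather columns and almost not at all on earnings: PCA maximizes
# explained variance, not relevance to the target.
print(loadings)
```

Because PCA only looks at the covariance of the inputs, the "most compact" direction is the weather factor, even though it carries no information about stock prices.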

Ryan Zotti

Posted 2016-06-17T12:32:37.780

Reputation: 3 849

Thank you Ryan. So you mean it comes down to the input? That is, if the inputs to PCA are all relevant to the target, the compression will work better than when we blindly apply PCA to the whole dataset (assuming most of the features aren't relevant to the target)? And can we also say that the number of features after CFS is much smaller, so PCA can handle it better? – Arkan – 2016-06-17T21:32:45.060

Yes, CFS is better because it takes variable relevance into account. Therefore, in theory, PCA + CFS should always be better than PCA alone if predicting the target variable is ultimately what you care about. Regarding your last question - PCA isn't impacted by the number of features. It's not feature reduction that makes CFS better; it's the selection of variables that have a strong correlation with the target variable that matters – Ryan Zotti – 2016-06-17T22:13:05.477

Thank you Ryan. I really appreciate your help. – Arkan – 2016-06-18T08:10:21.963

Ryan, it just occurred to me to ask: in your view, in what (perhaps rare) situations would applying CFS before PCA give a worse result than applying PCA alone, i.e. have a negative effect? What are the probable reasons? – Arkan – 2016-06-23T00:05:20.823

In most cases I think it would be better, but if I were to play "devil's advocate" I'd say that the approach would be problematic if you implemented CFS in such a way that you selected only the N most useful features, thereby completely missing out on some features that had small but still helpful predictive value. If you think this might affect you, I recommend using a technique like ridge regression; it has the effect of being like CFS and PCA combined into a single algorithm – Ryan Zotti – 2016-06-23T00:26:01.310


If you have a classification problem, you should use LDA instead of PCA. PCA ignores class labels, whereas LDA is class-aware.

For example, if your data is 2D and you use PCA in the following example, you get:

[Image: a 2D dataset with two linearly separable classes, before and after projection onto the first principal component]

So before PCA, the classes were perfectly linearly separable, but after PCA they are not separable at all. I'm not saying this happens in your case, but it could be.
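A minimal sketch of the situation the figure depicts, using synthetic 2D data and scikit-learn (the cluster geometry here is an assumption chosen to reproduce the effect): the direction of maximum variance is orthogonal to the direction that separates the classes, so PCA mixes them while LDA keeps them apart.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 500

# Two elongated, parallel clusters: large variance along x,
# but the classes are separated only along y.
x = rng.normal(scale=5.0, size=2 * n)
y = np.concatenate([rng.normal(-2, 0.3, n), rng.normal(2, 0.3, n)])
X = np.column_stack([x, y])
labels = np.array([0] * n + [1] * n)

pca_1d = PCA(n_components=1).fit_transform(X)                       # ignores labels
lda_1d = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, labels)

# PCA projects onto (roughly) the x-axis, the max-variance direction,
# so the two classes overlap completely in pca_1d.
# LDA projects onto the class-separating y direction, so the two
# classes remain well separated in lda_1d.
```

This is exactly the failure mode described above: the classes were perfectly separable in 2D, but after unsupervised PCA they are not.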

Martin Thoma

Posted 2016-06-17T12:32:37.780

Reputation: 15 590


Correlated variables should be removed before PCA, as together they tend to exaggerate the effect they express. CFS selects uncorrelated subsets of variables.
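As a rough illustration (CFS proper also scores relevance to the target, which this sketch omits), a simple redundancy filter that drops highly correlated columns before PCA might look like the following. The function `drop_correlated` is a hypothetical helper, not part of any library:

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedily keep columns of X whose absolute correlation with every
    already-kept column is at most `threshold`.

    This is only a crude stand-in for CFS-style redundancy removal:
    it filters out near-duplicate features but does not score
    relevance to a target variable.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep
```

Running PCA on the filtered matrix avoids a redundant group of variables dominating the leading components simply because the group is large.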


Posted 2016-06-17T12:32:37.780

Reputation: 101

Thank you mandata. You mean the fewer and more important (relevant) the inputs we give to PCA, the more accurate the results we get? – Arkan – 2016-06-17T21:37:24.707

I would imagine that would depend on the data. What I was saying was that if variables are redundant (highly correlated, or even just weakly correlated but carrying no other information), then they should be removed. If they do carry other information, then I guess it depends how much influence they have on a valid answer. – mandata – 2016-06-18T01:34:16.283

Thank you mandata. I really appreciate your help. – Arkan – 2016-06-18T08:09:54.803