## Feature reduction for an uncorrelated data set

I am working on a classification problem with a training data set that has 100 features. No pair of features shows a visible correlation, as can be seen in the example pair plot for some of the features:

I am trying to find the right way to reduce the number of features. Every feature selection method returns a different set of important features. For example:

• MARS returns only five important features.
• Correlation with a threshold of >1% selects twenty of them.
• Lasso selects about twenty, but they differ from the set returned by the correlation method.
• RandomForest selects about twenty-five features, which differ from those of the previous methods.
• The ranking of the features within the returned sets also differs.
• PCA is not applicable, because there are no visibly linearly correlated features.
• etc.

My point is that different tools return different important features. One can take the union of the sets, or one can intersect them. With N selection methods, one gets N feature sets.
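As a small illustration of the union, intersection, and "vote" options mentioned above, here is a minimal sketch. The feature names and the contents of each set are made up for the example; only the four method names come from the question, and the `min_votes` threshold is an assumed tuning knob.

```python
from collections import Counter

# Hypothetical feature sets returned by four selection methods
# (the feature names are invented for illustration).
selected = {
    "mars": {"f03", "f17", "f42", "f58", "f90"},
    "correlation": {"f03", "f11", "f17", "f25", "f42", "f77"},
    "lasso": {"f03", "f11", "f17", "f30", "f58"},
    "random_forest": {"f03", "f17", "f25", "f42", "f64", "f90"},
}

union = set.union(*selected.values())
intersection = set.intersection(*selected.values())

# Count how many methods voted for each feature.
votes = Counter(f for features in selected.values() for f in features)

# Keep the features chosen by at least `min_votes` methods.
min_votes = 3
consensus = sorted(f for f, n in votes.items() if n >= min_votes)

print(sorted(intersection))  # ['f03', 'f17']
print(consensus)             # ['f03', 'f17', 'f42']
```

Voting sits between the two extremes: the union keeps everything any method liked, the intersection keeps only what all methods agree on, and a vote threshold lets one choose a point in between.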

The "check them all" approach is not applicable here. As an example, consider the time needed for N feature sets multiplied by M prediction models, multiplied by 3 rounds of model tuning. The effort grows even further if one also takes unions or intersections of the feature sets. It will take forever! There should be some tactic, an algorithm for filtering down to a final selection.
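To put rough numbers on this, the sketch below counts model fits. It assumes N = 4 (the four methods listed in the question) and a made-up M = 5 models; the function name `count_runs` is hypothetical. When sets are combined, it counts every union and every intersection of two or more sets as a separate candidate, which is an upper bound because some combinations may coincide.

```python
def count_runs(n_sets, n_models, n_rounds, combine=False):
    """Number of model fits needed to check every candidate feature set."""
    candidates = n_sets
    if combine:
        # Every non-empty group of sets can be unioned (2**n - 1 groups,
        # including the single sets themselves), and every group of two
        # or more sets can additionally be intersected.
        groups = 2 ** n_sets - 1
        candidates = groups + (groups - n_sets)
    return candidates * n_models * n_rounds


# 4 selection methods from the question; 5 models is an assumed value for M.
print(count_runs(4, 5, 3))                # 60
print(count_runs(4, 5, 3, combine=True))  # 390
```

The number of combined candidates grows exponentially in N, which is why exhaustive checking stops being practical after only a handful of selection methods.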

How does one select the best set of features if the data set is large, uncorrelated, and noisy?

Have you tried Lasso (L1) Logistic Regression? And how many classes do you have? – Elliot – 2019-09-05T08:28:08.547

@Elliot Yes, I tried Lasso with the glmnet library. The ranking of the most valuable features from Lasso feature selection is almost the same as the result I obtained from the correlation matrix. – Ruben Kazumov – 2019-09-05T16:19:39.647

If you're interested in the predictive power of your model, you can just search for the best feature selection/model combination. If you're interested in explaining important features, I would go with a union/intersection (if any) of the features selected by trees and a lasso. – Elliot – 2019-09-06T09:42:30.747

@Elliot I edited part of the question to clarify it. The time needed to combine the sets and re-run the prediction models is huge. – Ruben Kazumov – 2019-09-06T18:06:28.317

How many observations do you have? What are the specs of your machine? Did you try packages that use all cores in parallel? – Ilker Kurtulus – 2019-10-07T17:03:50.347

1) Why do you want to select the best set of features? 2) What do you mean by "best"? Do you have any metric in mind?

@aivanov I asked this question before I was introduced to neural networks. For "classic" RandomForest models, 100 features are a heavy load. That is why. – Ruben Kazumov – 2019-11-06T20:09:14.727

Interesting question. Is there any rationale behind those 100 independent features? How were they obtained or built? Are they related to a specific field? Is their apparent independence natural or constructed? Very often, feature engineering relies on understanding the underlying problem, not on trying different techniques. – lcrmorin – 2020-01-05T10:26:32.740

@lcrmorin It was a test case. The nature of the data is unknown. It might be any formula over random feature data. Now I know that this type of data should be studied with a simple neural network. In this case a DFF network will be enough. The network will adjust to the data, and feature removal is not necessary. – Ruben Kazumov – 2020-01-07T18:49:39.383

I believe that what you are looking for is Best Subset, Forward Stepwise, or Lasso.

Here is the R implementation: Best Subset

If your dataset is too large, try subsampling your data and running the feature selection algorithm several times. If the results are similar, you are done; if they differ, rank the features by how often they are selected and keep the most frequent ones.
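The subsample-and-rank idea can be sketched as follows. This is only an illustration under stated assumptions: the selector is a stand-in (top-k features by absolute correlation with the label), the data is synthetic, and the names `select_top_k` and `selection_frequency` are hypothetical. Any other selector (Lasso, MARS, a forest importance ranking) could be plugged in instead.

```python
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def select_top_k(X, y, k):
    """Stand-in selector: the k features most correlated with the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def selection_frequency(X, y, k, n_rounds=30, fraction=0.5, seed=0):
    """Run the selector on random subsamples; count how often each feature is picked."""
    rng = random.Random(seed)
    counts = Counter()
    size = int(len(X) * fraction)
    for _ in range(n_rounds):
        idx = rng.sample(range(len(X)), size)
        counts.update(select_top_k([X[i] for i in idx], [y[i] for i in idx], k))
    return counts


# Synthetic data: 10 random features, label driven by features 2 and 7 only.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(400)]
y = [1 if row[2] + row[7] + data_rng.gauss(0, 0.5) > 0 else 0 for row in X]

counts = selection_frequency(X, y, k=3)
ranking = [feature for feature, _ in counts.most_common()]
print(ranking[:3])
```

Features that carry real signal should be picked in nearly every round, while noise features only show up occasionally, so the frequency count doubles as a stability score.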

If anyone knows a Python implementation of the best subset algorithm, please let me know!

If the dimensions are not linearly correlated, you may use an autoencoder to perform the dimensionality reduction. It is like PCA in that it can reconstruct its input, but it is non-linear. You can then perform classification in the latent space.

An autoencoder is a multi-dimensional model that regresses its input onto itself, with a dimensional bottleneck somewhere in the middle. In general, it contains two neural networks: an encoder and a decoder. The encoder takes one data sample as input and produces an output in a latent space with fewer dimensions than the input. The decoder takes the encoder's output as its input and tries to reproduce the original data sample. To train an autoencoder, we minimize the reconstruction error (the difference between the input sample and the reconstructed sample). The bottleneck forces the autoencoder to preserve as much information about the original input as possible in a smaller number of dimensions.
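To make the encoder/decoder/bottleneck description concrete, here is a deliberately tiny sketch in plain Python: a single tanh encoder layer, a linear decoder, and stochastic gradient descent on the squared reconstruction error. Everything in it (the function name `train_autoencoder`, the layer sizes, the learning rate, the toy data) is an assumption for illustration. In practice one would use a deep learning library and a deeper, non-linear decoder.

```python
import math
import random


def train_autoencoder(data, n_latent, epochs=100, lr=0.05, seed=0):
    """Fit x -> z = tanh(W1 x + b1) -> x_hat = W2 z + b2 by SGD."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_latent)]
    b1 = [0.0] * n_latent
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_latent)] for _ in range(n_in)]
    b2 = [0.0] * n_in

    def encode(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(W1, b1)]

    def decode(z):
        return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(W2, b2)]

    def mse():
        total = sum(sum((r - v) ** 2 for r, v in zip(decode(encode(x)), x))
                    for x in data)
        return total / len(data)

    history = [mse()]  # reconstruction error before training
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass
            z = encode(x)
            # Gradient of the squared error w.r.t. x_hat (factor 2 folded into lr).
            err = [r - v for r, v in zip(decode(z), x)]
            # Backpropagate through the decoder, then through tanh.
            dz = [sum(err[i] * W2[i][j] for i in range(n_in)) * (1 - z[j] ** 2)
                  for j in range(n_latent)]
            for i in range(n_in):
                for j in range(n_latent):
                    W2[i][j] -= lr * err[i] * z[j]
                b2[i] -= lr * err[i]
            for j in range(n_latent):
                for i in range(n_in):
                    W1[j][i] -= lr * dz[j] * x[i]
                b1[j] -= lr * dz[j]
        history.append(mse())  # reconstruction error after each epoch
    return encode, decode, history


# Toy data: 5 observed features generated from 2 hidden factors a and b.
data_rng = random.Random(1)
data = []
for _ in range(200):
    a, b = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    data.append([a, b, a + b, a - b, a * b])

encode, decode, history = train_autoencoder(data, n_latent=2)
latent = [encode(x) for x in data]  # 2-D codes to feed into a classifier
print(history[0], history[-1])
```

The `latent` list is what the classifier would be trained on instead of the original features. With a linear decoder like this one, the reconstruction can only capture part of the non-linear feature `a * b`, which is one reason real autoencoders use non-linear layers on both sides of the bottleneck.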

Clarify the term autoencoder please. – Ruben Kazumov – 2020-02-05T03:04:29.530

@RubenKazumov I've edited the answer. – koonyook – 2020-02-09T03:41:54.403