What is dimensionality reduction? What is the difference between feature selection and extraction?



From Wikipedia:

Dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.

What is the difference between feature selection and feature extraction?

What is an example of dimensionality reduction in a Natural Language Processing task?


Posted 2014-05-18T06:26:15.673

Reputation: 2 242



Simply put:

  • feature selection: you select a subset of the original feature set; while
  • feature extraction: you build a new set of features from the original feature set.

Examples of feature extraction: extraction of contours in images, extraction of bigrams from a text, extraction of phonemes from recordings of spoken text, etc.

Feature extraction involves a transformation of the features, which often is not reversible because some information is lost in the process of dimensionality reduction.
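
Not part of the original answer, but a minimal sketch of the distinction using scikit-learn (assuming scikit-learn is available; Iris is used purely as an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)           # 150 samples, 4 original features

# Feature selection: keep 2 of the 4 original columns, scored by ANOVA F-value.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as linear combinations of all 4.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both (150, 2), built differently
```

The selected matrix contains original columns unchanged; the extracted one contains new columns that no longer correspond to any single original feature.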


Posted 2014-05-18T06:26:15.673

Reputation: 1 446

Both of these fall into the category of feature engineering as they involve manually creating or selecting features. Dimensionality reduction typically involves a change of basis or some other mathematical re-representation of the data – ragingSloth – 2014-06-16T21:05:47.767

The way I see it: for some feature extraction methods you can still reconstruct the original dimensions approximately, but for feature selection there is no reconstruction, as you have removed the discarded dimensions entirely. – Bob – 2019-01-15T13:03:00.877

@ragingSloth, I think the first one is definitely feature selection - and not feature engineering. While image and text processing examples indeed seem to be feature engineering – Alexey Grigorev – 2015-10-02T07:26:54.503


Dimensionality reduction is typically choosing a basis or mathematical representation within which you can describe most but not all of the variance within your data, thereby retaining the relevant information while reducing the amount of information necessary to represent it. There are a variety of techniques for doing this, including but not limited to PCA, ICA, and matrix factorization. These take existing data and reduce it to the most discriminative components, allowing you to represent most of the information in your dataset with fewer, more discriminative features.

Feature selection is hand-selecting features which are highly discriminative. This has a lot more to do with feature engineering than analysis, and requires significantly more work on the part of the data scientist. It requires an understanding of which aspects of your dataset are important in whatever predictions you're making, and which aren't. Feature extraction usually involves generating new features which are composites of existing features. Both of these techniques fall into the category of feature engineering. Generally, feature engineering is important if you want to obtain the best results, as it involves creating information that may not exist in your dataset as given, and increasing your signal-to-noise ratio.
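
The answer names PCA, ICA, and matrix factorization; as a hedged sketch of the last one, here is scikit-learn's NMF on synthetic data (NMF requires non-negative input; the shapes and component count are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 20))                # 100 instances, 20 non-negative features

nmf = NMF(n_components=5, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(X)                 # (100, 5): new, more compact features
H = nmf.components_                      # (5, 20): how new features mix old ones

print(W.shape, H.shape)
print(np.linalg.norm(X - W @ H))         # reconstruction error: information lost
```

The nonzero reconstruction error is exactly the "not reversible" point made in the earlier answer.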


Posted 2014-05-18T06:26:15.673

Reputation: 1 694


I mostly agree, with one clarification: feature selection need not be done by hand; it can be automatic. See for instance the Lasso method (http://en.wikipedia.org/wiki/Least_squares#Lasso_method).

– jrouquie – 2014-09-29T09:00:26.110
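
Not part of the original comment, but a tiny sketch of what automatic selection via the Lasso can look like in scikit-learn (synthetic data; the alpha value is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)  # only 2 relevant

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # L1 penalty zeroes out useless features
print(selected)                          # typically [0, 3]
```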

I agree with your Dimensionality Reduction clause but differ a bit on Feature Engineering usage - which from what I've seen is only Feature Extraction: Feature Selection is considered separately. It's just a difference in terminology. – StephenBoesch – 2017-12-03T23:30:37.513


As in @damienfrancois' answer, feature selection is about selecting a subset of features. So in NLP it would be selecting a set of specific words (the typical setup in NLP is that each word represents a feature, with value equal to the frequency of the word or some other weight based on TF-IDF or similar).

Dimensionality reduction is the introduction of a new feature space in which the original features are represented. The new space is of lower dimension than the original space. In the case of text, an example would be the hashing trick, where a piece of text is reduced to a vector of a few bits (say 16 or 32) or bytes. The amazing thing is that the geometry of the space is preserved (given enough bits), so relative distances between documents remain roughly the same as in the original space, and you can deploy standard machine learning techniques without having to deal with the unbounded (and huge) number of dimensions found in text.
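
Not from the original answer, but a rough sketch of the hashing trick with scikit-learn's HashingVectorizer (assuming scikit-learn is available; the dimension count is an illustrative choice):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# 2**10 = 1024 hashed dimensions instead of an unbounded vocabulary.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vectorizer.transform(docs)

print(X.shape)   # (2, 1024), sparse; distances are approximately preserved
```

Because the mapping is a fixed hash function, no vocabulary needs to be stored, which is what keeps the dimensionality bounded.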


Posted 2014-05-18T06:26:15.673

Reputation: 599


Feature selection is about choosing some of the features based on some statistical score, whereas feature extraction uses techniques to extract second-layer information from the data, e.g. interesting frequencies of a signal using the Fourier transform.
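
As a small illustration of the Fourier-transform example (not from the original answer; the signal is synthetic):

```python
import numpy as np

fs = 1000                                  # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The dominant frequencies become the extracted "second layer" features.
top = freqs[np.argsort(spectrum)[-2:]]
print(np.sort(top))                        # roughly [ 50. 120.]
```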

Dimensionality reduction is all about transforming data into a low-dimensional space in which the data preserves its Euclidean structure but does not suffer from the curse of dimensionality. For instance, assume you extract some word features $[x_1,...,x_n]$ from a dataset where each document can be modeled as a point in n-dimensional space and n is too large (a toy example). In this case many algorithms do not work well, owing to the distance distortion of high-dimensional spaces. You then need to reduce dimensionality, either by selecting the most informative features or by transforming them onto a low-dimensional manifold using dimensionality reduction methods, e.g. PCA, LLE, etc.
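
A hedged sketch of LLE, one of the methods named above, using scikit-learn's swiss-roll data as a stand-in for a real dataset (neighbor count is an illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D points on a roll

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_2d = lle.fit_transform(X)              # unrolled onto a 2-D manifold

print(X.shape, "->", X_2d.shape)         # (1000, 3) -> (1000, 2)
```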


Posted 2014-05-18T06:26:15.673

Reputation: 163

Out of the answers available this one best matches what I've seen in several Data Science and ML Platform teams – StephenBoesch – 2017-12-03T23:28:24.197


To complete Damien's answer, an example of dimensionality reduction in NLP is a topic model, where you represent the document by a vector indicating the weights of its constituent topics.
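
The answer names no library, but as one concrete, hedged illustration, scikit-learn's LatentDirichletAllocation can produce exactly such a vector of topic weights (the toy corpus and topic count are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets",
        "stocks and bonds are investments",
        "dogs chase cats",
        "investors buy stocks"]

counts = CountVectorizer().fit_transform(docs)       # high-dim word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)               # (4, 2) topic weights

print(doc_topics.round(2))   # each document as a short vector of topic weights
```

Each document goes from a vocabulary-sized vector of counts down to just two topic weights.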


Posted 2014-05-18T06:26:15.673

Reputation: 9 953


A1. What is dimensionality reduction: If you think of data as a matrix, where rows are instances and columns are attributes (or features), then dimensionality reduction is mapping this data matrix to a new matrix with fewer columns. For visualization, if you think of each matrix column (attribute) as a dimension in feature space, then dimensionality reduction is the projection of instances from the higher-dimensional space (more columns) to a lower-dimensional subspace (fewer columns). (Figure: dimensionality reduction as subspace projection.) Typical objectives for this transformation are (1) preserving the information in the data matrix while reducing computational complexity, and (2) improving the separability of the different classes in the data.

A2. Dimensionality reduction as feature selection or feature extraction: I'll use the ubiquitous Iris dataset, which is arguably the 'hello world' of data science. Briefly, the Iris dataset has 3 classes and 4 attributes (columns). I'll illustrate feature selection and extraction for the task of reducing the Iris dataset's dimensionality from 4 to 2.

I plot the pairwise relationships in this dataset using the Python library seaborn. The code is: sns.pairplot(iris, hue="species", markers=["o", "s", "D"]). (Figure: Iris pair plot.) From the plot, I can select the pair of attributes (2 dimensions) that gives me the greatest separation between the 3 classes (species) in the Iris dataset. This would be a case of feature selection.
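
For completeness, the inline call above runs as follows with the imports it needs (seaborn bundles the Iris dataset used here):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")                            # 150 rows, 4 features
sns.pairplot(iris, hue="species", markers=["o", "s", "D"])  # all pairwise scatter plots
plt.show()
```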

Next up is feature extraction. Here, I am projecting the 4-dimensional feature space of Iris onto a new 2-dimensional subspace which is not axis-aligned with the original space. These are new attributes, typically based on the distribution in the original high-dimensional space. The most popular method is Principal Component Analysis, which computes eigenvectors in the original space. (Figure: PCA using SVD.) Obviously, we are not restricted to a linear, global projection onto a subspace based on eigenvectors; we can use non-linear projection methods as well. Here is an example of non-linear PCA using neural networks, where the attributes (dimensions) are extracted from the original 4 attributes by a neural network. (Figure: non-linear PCA using a neural network.) You can experiment with various flavors of PCA for the Iris dataset yourself using this PCA methods code.
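
Not from the original answer, but a sketch of both projections it describes, with scikit-learn's KernelPCA standing in for the neural-network variant (a swapped-in non-linear method, not the one shown in the original figure):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA

X, y = load_iris(return_X_y=True)

X_lin = PCA(n_components=2).fit_transform(X)       # linear: eigenvector basis
X_nonlin = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)  # non-linear

print(X_lin.shape, X_nonlin.shape)   # both (150, 2), with different geometries
```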

Summary: While feature extraction methods may appear superior in performance to feature selection, the choice is predicated on the application. The attributes produced by feature extraction typically lose their physical interpretation, which may or may not be an issue depending on the task at hand. For example, if you are designing a very expensive data collection task with costly sensors and need to economize on the attributes (the number of different sensors), you'd want to collect a small pilot sample using all available sensors and then select the ones that are most informative for the big data collection task.

Dynamic Stardust

Posted 2014-05-18T06:26:15.673

Reputation: 1 163


For a proper review and definitions, you may take a look at Dimension Reduction vs. Variable Selection. Also, in the book Feature Extraction: Foundations and Applications, feature extraction is decomposed into two steps: feature construction and feature selection.


Posted 2014-05-18T06:26:15.673

Reputation: 21


Extracted from Hands-On Machine Learning with Scikit-Learn & TensorFlow:

  1. Data cleaning: Fix or remove outliers (optional). Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
  2. Feature selection (optional): Drop the attributes that provide no useful information for the task.
  3. Feature engineering, where appropriate: Discretize continuous features. Decompose features (e.g., categorical, date/time, etc.). Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.). Aggregate features into promising new features.
  4. Feature scaling: standardize or normalize features.
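
Not from the book, but a minimal sketch of steps 1 and 4 as a scikit-learn pipeline (the column values are made up for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],      # missing value to be imputed
              [3.0, 180.0]])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # step 1: fill missing values
    ("scale", StandardScaler()),                    # step 4: standardize features
])
print(prep.fit_transform(X))
```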

Hadi Askari

Posted 2014-05-18T06:26:15.673

Reputation: 11


Several great answers here; in particular, @damienfrancois's answer very succinctly captures the general idea.

However, I don't see any examples of feature engineering for relational or time-series data. In that case, data scientists generally extract statistical patterns across relationships and over time. For instance, in order to predict what customers will buy in the future from an e-commerce database, one might extract quantities like the average historical purchase amount or the frequency of prior purchases.
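
As one hedged illustration of that idea with pandas (made-up purchase data; the column names are illustrative, not from the linked post):

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
})

# One row per customer: average historical amount and purchase frequency.
features = purchases.groupby("customer_id")["amount"].agg(
    avg_purchase="mean", n_purchases="count"
)
print(features)
```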

I wrote a piece on this topic that goes into much more detail with several examples here: https://www.featurelabs.com/blog/feature-engineering-vs-feature-selection/


Posted 2014-05-18T06:26:15.673

Reputation: 101


Let me start in reverse order: with feature extraction, and why there is a need for feature selection and dimensionality reduction.

Feature extraction is used mainly for classification purposes. Classification is the process of deciding which category a particular object belongs to. It has two phases: i) a training phase, where, given the data or objects, their properties are learned by some process (feature extraction); ii) a testing phase, where an unknown object is classified using the features learned in the previous (training) phase.

Feature extraction, as the name suggests, aims to find the underlying pattern in the given data. This underlying pattern is termed the feature corresponding to that data. There are various existing methodologies for feature extraction, such as Support Vector Machines (SVMs).

Now, feature extraction should generate features which are

  • robust,
  • discriminative, and
  • an optimal (minimal) set of features.

Feature selection: A specific set of data can be represented either by a single feature or by a set of features. In the classification process, a system is trained on at least two classes, so the training system will generate either a single feature or a set of features. These features should possess the properties stated above.

The problem arises when there is a feature set for each class and there is correlation between some of the features. This implies that, among those correlated features, one or a few are sufficient for representation, and that is where feature selection comes into the picture. Moreover, these features need to be stored, and as the feature set grows, the memory requirement grows with it.

Then comes dimensionality reduction, which is essentially part of the feature selection process. It is the process of choosing the optimal set of features which best describe the data. There are many techniques for this, such as principal component analysis, independent component analysis, and matrix factorization.
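
Not from the original answer, but a rough sketch of the correlation argument above using pandas (synthetic data; the 0.95 threshold is an arbitrary illustrative choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.01, size=200),  # near-duplicate of a
                   "c": rng.normal(size=200)})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)            # ['b']: redundant with 'a', so one of them suffices
```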

Chirag Arora

Posted 2014-05-18T06:26:15.673

Reputation: 1


For example, if you have an agricultural field, then selecting one particular area of that field would be feature selection. If you aim to find the affected plants in that area, then you need to observe each plant based on a particular feature that is common to all the plants, so as to find the abnormalities; for this you would be doing feature extraction. In this example, reducing the original agricultural field to the selected area corresponds to dimensionality reduction.


Posted 2014-05-18T06:26:15.673

Reputation: 75

No, it has nothing to do with spatial data in particular. It's applicable to temporal, spatio-temporal, and other sorts of data too. – Emre – 2014-06-21T06:10:06.757