What is dimensionality reduction? What is the difference between feature selection and extraction?



From Wikipedia:

dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.

What is the difference between feature selection and feature extraction?

What is an example of dimensionality reduction in a Natural Language Processing task?


Posted 2014-05-18T06:26:15.673

Reputation: 501



Simply put:

  • feature selection: you select a subset of the original feature set; while
  • feature extraction: you build a new set of features from the original feature set.

Examples of feature extraction: extraction of contours in images, extraction of bigrams from a text, extraction of phonemes from recordings of spoken text, etc.

Feature extraction involves a transformation of the features, which often is not reversible because some information is lost in the process of dimensionality reduction.



Both of these fall into the category of feature engineering, as they involve manually creating or selecting features. Dimensionality reduction typically involves a change of basis or some other mathematical re-representation of the data. – ragingSloth 2014-06-16T21:05:47.767

@ragingSloth, I think the first one is definitely feature selection - and not feature engineering. While the image and text processing examples indeed seem to be feature engineering. – Alexey Grigorev 2015-10-02T07:26:54.503


Dimensionality reduction typically means choosing a basis or mathematical representation within which you can describe most, but not all, of the variance in your data, thereby retaining the relevant information while reducing the amount of information needed to represent it. There are a variety of techniques for doing this, including but not limited to PCA, ICA, and matrix factorization. These take existing data and reduce it to its most discriminative components, allowing you to represent most of the information in your dataset with fewer, more discriminative features.
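To make this concrete, here is a minimal sketch of the first technique named above (PCA), using scikit-learn and its bundled Iris dataset — both my assumptions, as the answer names neither a library nor a dataset:

```python
# Hedged sketch: PCA applied to the 4-feature Iris dataset,
# reducing it to 2 discriminative components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # shape (150, 4)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # shape (150, 2)

# Fraction of the original variance the 2 components retain:
retained = pca.explained_variance_ratio_.sum()
```

On Iris, the two leading components retain the large majority of the variance, which is exactly the "most but not all" trade-off described above.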

Feature selection is hand-selecting features that are highly discriminative. This has a lot more to do with feature engineering than analysis, and it requires significantly more work on the part of the data scientist: you need an understanding of which aspects of your dataset are important for the predictions you're making, and which aren't. Feature extraction usually involves generating new features that are composites of existing features. Both of these techniques fall into the category of feature engineering. Generally, feature engineering is important if you want the best results, as it involves creating information that may not exist explicitly in your dataset, and increasing your signal-to-noise ratio.




I mostly agree, with one precision: feature selection need not be done by hand; it can be automatic. See for instance the Lasso method (http://en.wikipedia.org/wiki/Least_squares#Lasso_method).

jrouquie 2014-09-29T09:00:26.110

I agree with your dimensionality reduction clause but differ a bit on the feature engineering usage - from what I've seen, feature engineering covers only feature extraction, and feature selection is considered separately. It's just a difference in terminology. – javadba 2017-12-03T23:30:37.513


As in @damienfrancois' answer, feature selection is about selecting a subset of features. In NLP this would mean selecting a set of specific words (typically in NLP each word represents a feature, with a value equal to the frequency of the word or some other weight based on TF-IDF or similar).
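A small sketch of such word-count features, using scikit-learn's CountVectorizer (my choice of library; the answer doesn't name one). Restricting the vocabulary to a chosen word list is precisely the "selecting a set of specific words" described above:

```python
# Hedged sketch: bag-of-words features, then feature selection
# by keeping only a chosen subset of words.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate the cat"]

# One feature (column) per distinct word, value = word frequency:
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Feature selection: keep only the features for "cat" and "dog".
vec_sel = CountVectorizer(vocabulary=["cat", "dog"])
X_sel = vec_sel.fit_transform(docs)   # a subset of the columns above
```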

Dimensionality reduction is the introduction of a new feature space in which the original features are represented. The new space is of lower dimension than the original space. In the case of text, an example would be the hashing trick, where a piece of text is reduced to a vector of a few bits (say 16 or 32) or bytes. The amazing thing is that the geometry of the space is preserved (given enough bits), so relative distances between documents remain roughly the same as in the original space, and you can deploy standard machine learning techniques without having to deal with the unbounded (and huge) number of dimensions found in text.
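A minimal sketch of the hashing trick, assuming scikit-learn's HashingVectorizer (one possible implementation; the answer doesn't name one). Every document, whatever its vocabulary, lands in the same fixed-size space:

```python
# Hedged sketch: hash arbitrary text into a fixed 256-dimensional
# vector without ever building a vocabulary.
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["dimensionality reduction in NLP",
        "feature selection versus feature extraction"]
vec = HashingVectorizer(n_features=2**8)
X = vec.fit_transform(docs)   # shape (2, 256) regardless of vocabulary
```

Because the output dimension is fixed up front, unseen words at prediction time pose no problem: they simply hash into the same space.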




Feature selection is about choosing some features based on a statistical score, whereas feature extraction uses techniques to extract second-layer information from the data, e.g. the interesting frequencies of a signal via the Fourier transform.
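The first half of that sentence can be sketched directly; here I assume scikit-learn's SelectKBest with a chi-squared score on the Iris dataset (my choice of score and dataset, not the answer's):

```python
# Hedged sketch: score-based feature selection - keep the k=2
# features with the highest chi-squared score against the labels.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)        # (150, 4)
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)     # (150, 2): a column subset

mask = selector.get_support()            # which features survived
```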

Dimensionality reduction is all about transforming data into a low-dimensional space in which the data preserves its Euclidean structure but does not suffer from the curse of dimensionality. For instance, suppose you extract word features $[x_1,...,x_n]$ from a dataset, so that each document can be modeled as a point in n-dimensional space, and n is too large (a toy example). In this case many algorithms do not work well, owing to the distance distortion of high-dimensional space. You then need to reduce the dimensionality, either by selecting the most informative features or by transforming them onto a low-dimensional manifold using dimensionality reduction methods, e.g. PCA, LLE, etc.



Out of the answers available, this one best matches what I've seen in several Data Science and ML Platform teams. – javadba 2017-12-03T23:28:24.197


To complete Damien's answer, an example of dimensionality reduction in NLP is a topic model, where you represent the document by a vector indicating the weights of its constituent topics.
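A minimal sketch of such a topic model, assuming scikit-learn's LatentDirichletAllocation on a tiny toy corpus (both are my assumptions; the answer names only the idea):

```python
# Hedged sketch: each document is reduced from a word-count vector
# to a 3-dimensional vector of topic weights.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets",
        "dogs chase cats",
        "stocks and bonds are investments",
        "bond markets fell today"]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_weights = lda.fit_transform(counts)   # shape (4, 3)
# Each row is a distribution over topics, so it sums to 1.
```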




For a proper review and definitions, you may take a look at Dimension Reduction vs. Variable Selection. Also, in the book Feature Extraction: Foundations and Applications, feature extraction is decomposed into two steps: feature construction and feature selection.




A1. What is dimensionality reduction: If you think of data as a matrix, where rows are instances and columns are attributes (or features), then dimensionality reduction is mapping this data matrix to a new matrix with fewer columns. For visualization, if you think of each matrix column (attribute) as a dimension in feature space, then dimensionality reduction is the projection of instances from the higher-dimensional space (more columns) to a lower-dimensional subspace (fewer columns). In short, dimensionality reduction is subspace projection. Typical objectives for this transformation are (1) preserving the information in the data matrix while reducing computational complexity, and (2) improving the separability of different classes in the data.
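The "fewer columns" view can be sketched in a few lines of numpy; the projection matrix W here is random purely for illustration (real methods such as PCA choose W to preserve variance):

```python
# Hedged sketch: dimensionality reduction as a matrix mapping -
# 150 instances with 4 attribute-columns projected to 2 columns.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))   # data matrix: rows = instances
W = rng.normal(size=(4, 2))     # maps the 4-D space to a 2-D subspace

X_low = X @ W                   # same instances, fewer columns
```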

A2. Dimensionality reduction as feature selection or feature extraction: I'll use the ubiquitous Iris dataset, which is arguably the 'hello world' of data science. Briefly, the Iris dataset has 3 classes and 4 attributes (columns). I'll illustrate feature selection and extraction for the task of reducing Iris dataset dimensionality from 4 to 2.

I compute pair-wise scatter plots of this dataset using a Python library called seaborn. The code is: sns.pairplot(iris, hue="species", markers=["o", "s", "D"]). The result is the Iris pair-plot. From it, I can select the pair of attributes (2 dimensions) that gives me the greatest separation between the 3 classes (species) in the Iris dataset. This would be a case of feature selection.

Next up is feature extraction. Here, I am projecting the 4-dimensional feature space of Iris onto a new 2-dimensional subspace that is not axis-aligned with the original space. These are new attributes, typically based on the distribution of the data in the original high-dimensional space. The most popular method is Principal Component Analysis, which computes the eigenvectors of the data in the original space (equivalently, via the SVD of the centered data matrix). Obviously, we are not restricted to linear, global projections onto a subspace based on eigenvectors; we can use non-linear projection methods as well, for example non-linear PCA using neural networks, where the new attributes (dimensions) are extracted from the original 4 attributes by a neural network. You can experiment with various flavors of PCA on the Iris dataset yourself using PCA methods code.
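The "PCA using SVD" route mentioned above can be sketched directly in numpy (the Iris loader from scikit-learn is an assumed convenience):

```python
# Hedged sketch: PCA via SVD - center the data, decompose, and
# project onto the top-2 right singular vectors.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
Xc = X - X.mean(axis=0)               # center each attribute
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]                   # top-2 principal directions
X_2d = Xc @ components.T              # Iris projected to 2 dimensions

# Fraction of variance retained by the 2-D projection:
var_ratio = (s[:2] ** 2).sum() / (s ** 2).sum()
```

Note that the two extracted columns are linear combinations of all four original attributes, which is why, unlike with feature selection, they lose their physical interpretation.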

Summary: While feature extraction methods may appear superior in performance to feature selection, the choice is dictated by the application. The attributes obtained from feature extraction typically lose their physical interpretation, which may or may not be an issue depending on the task at hand. For example, if you are designing a very expensive data-collection task with costly sensors and need to economize on attributes (the number of different sensors), you'd collect a small pilot sample using all available sensors and then select the ones that are most informative for the big data-collection task.

Dynamic Stardust



Several great answers on here, in particular, @damienfrancois's answer very succinctly captures the general idea.

However, I don't see any examples of feature engineering for relational or time-series data. In that case, data scientists generally extract statistical patterns across relationships and over time. For instance, to predict what customers will buy in the future from an e-commerce database, one might extract quantities like the average historical purchase amount or the frequency of prior purchases.
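Those two example quantities can be sketched with a pandas groupby over hypothetical toy purchase records (the column names and data are mine, purely for illustration):

```python
# Hedged sketch: per-customer statistical features extracted from
# a (toy) relational table of purchase events.
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount":   [10.0, 30.0, 5.0, 5.0, 20.0],
})
features = purchases.groupby("customer")["amount"].agg(
    avg_amount="mean",     # average historical purchase amount
    n_purchases="count",   # frequency of prior purchases
)
```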

I wrote a piece on this topic that goes into much more detail with several examples here: https://www.featurelabs.com/blog/feature-engineering-vs-feature-selection/




For example, if you have an agricultural plot of land, then selecting one particular area of that land would be feature selection. If you aim to find the affected plants in that area, then you need to observe each plant based on a particular feature that is common to every plant, so as to find the abnormalities; for this you would be doing feature extraction. In this example, the original agricultural land corresponds to dimensionality reduction.



No, it has nothing to do with spatial data in particular. It's applicable to temporal, spatio-temporal, and other sorts of data too. – Emre 2014-06-21T06:10:06.757