A basic, albeit somewhat painstaking, explanation of **PCA vs Factor analysis**, with the help of scatterplots, in logical steps. (I thank @amoeba, who in his comment to the question encouraged me to post an answer here instead of making links to elsewhere. So here is a leisurely, late response.)

### PCA as variable summarization (feature extraction)

I hope you already have some understanding of PCA; this is a quick refresher.

Suppose we have two correlated variables $V_1$ and $V_2$. We center them (subtract the mean) and do a scatterplot. Then we perform PCA on these centered data. PCA is a form of *axes rotation*: it offers axes P1 and P2 instead of V1 and V2. The key property of PCA is that P1 - called the 1st principal component - gets oriented so that the variance of the data points along it is maximized. The new axes are new variables whose values are computable as long as we know the coefficients of rotation $a$ (PCA provides them) [**Eq.1**]:

$P1 = a1_1V_1 + a1_2V_2$

$P2 = a2_1V_1 + a2_2V_2$

Those coefficients are cosines of rotation (= direction cosines, principal directions) and comprise what are called eigenvectors, while the eigenvalues of the covariance matrix are the principal component variances. In PCA, we typically discard the weak last components: we thus summarize the data by the few first extracted components, with little information loss.

```
Covariances
        V1       V2
V1  1.07652   .73915
V2   .73915   .95534

----PCA----
    Eigenvalues       %
P1      1.75756  86.500
P2       .27430  13.500

Eigenvectors
         P1       P2
V1   .73543  -.67761
V2   .67761   .73543
```
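These numbers are easy to reproduce; below is a sketch using numpy (an assumption of tooling on my part - any eigendecomposition routine will do; note that eigenvector signs may come out flipped, which is immaterial):

```python
import numpy as np

# Covariance matrix of the centered V1, V2 (the table above)
C = np.array([[1.07652, 0.73915],
              [0.73915, 0.95534]])

# Eigendecomposition: eigenvalues are the component variances,
# eigenvectors are the rotation coefficients (direction cosines) a
evals, evecs = np.linalg.eigh(C)          # returned in ascending order
order = np.argsort(evals)[::-1]           # re-sort descending: P1 first
evals, evecs = evals[order], evecs[:, order]

print(evals)                       # ~ [1.75756, 0.27430]
print(100 * evals / evals.sum())   # ~ [86.5, 13.5] percent of total variance
print(evecs[:, 0])                 # ~ [.73543, .67761], up to sign
```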

With our plotted data, the P1 component values (scores) are `P1 = .73543*V1 + .67761*V2`, and component P2 we discard. P1's variance is `1.75756`, the 1st eigenvalue of the covariance matrix, and so P1 explains `86.5%` of the *total* variance, which equals `(1.07652+.95534) = (1.75756+.27430)`.

### PCA as variable prediction ("latent" feature)

So, we discarded P2 and expect that P1 alone can reasonably represent the data. That is equivalent to saying that $P1$ can reasonably well "reconstruct" or predict $V_1$ and $V_2$ [**Eq.2**]:

$V_1 = a1_{1}P1 + E_1$

$V_2 = a1_{2}P1 + E_2$

where the coefficients $a$ are what we already know and $E$ are the errors (unpredictedness). This is actually a "regressional model" where the observed variables are predicted (back) by the latent variable (if we allow calling a component "latent") P1, extracted from those same variables. Look at plot **Fig.2**; it is nothing other than **Fig.1**, only more detailed:

The P1 axis is shown tiled with its values (P1 scores) in green (these values are the projections of the data points onto P1). Some arbitrary data points were labeled A, B,..., and their departures (errors) from P1 are the bold black connectors. For point A, the details are shown: the coordinates of the P1 score (green A) on the V1 and V2 axes are the P1-reconstructed values of V1 and V2 according to **Eq.2**, $\hat{V_1} = a1_{1}P1$ and $\hat{V_2} = a1_{2}P1$. The reconstruction errors $E_1 = V_1-\hat{V_1}$ and $E_2 = V_2-\hat{V_2}$ are also displayed, in beige. The connector "error" length squared is the sum of the two squared errors, by the Pythagorean theorem.

Now, **what is characteristic of PCA** is that if we compute E1 and E2 for every point in the data and plot these coordinates - i.e. make the scatterplot of the errors alone - **the "error data" cloud will coincide with the discarded component P2.** And it does: that cloud is plotted on the same picture, as the beige cloud - and you see it actually forms axis P2 (of **Fig.1**), tiled with the P2 component scores.
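This coincidence is easy to verify numerically. A sketch follows; since the original sample is not published, I draw a Gaussian sample with the same covariance matrix - the identity itself is exact for any sample:

```python
import numpy as np

# Simulated centered data with the covariance matrix from the text
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.07652, 0.73915],
                                     [0.73915, 0.95534]], size=500)
X -= X.mean(axis=0)

evals, evecs = np.linalg.eigh(np.cov(X.T))
evals, evecs = evals[::-1], evecs[:, ::-1]     # P1 first

p1 = X @ evecs[:, 0]                # P1 scores
X_hat = np.outer(p1, evecs[:, 0])   # Eq.2: reconstruction by P1 alone
E = X - X_hat                       # the errors E1, E2

# The error cloud is exactly the discarded component P2:
p2 = X @ evecs[:, 1]
print(np.allclose(E, np.outer(p2, evecs[:, 1])))   # True
```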

No wonder, you may say. It is obvious: **in PCA**, the discarded junior component(s) *is precisely what* decompose(s) into the prediction errors E, in the model which explains (restores) the original variables V by the latent feature(s) P1. The errors E together just constitute the left-out component(s). Here is where **factor analysis** starts to differ from PCA.

### The idea of common FA (latent feature)

Formally, the model predicting manifest variables by the extracted latent feature(s) is the same in FA as in PCA; [**Eq.3**]:

$V_1 = a_{1}F + E_1$

$V_2 = a_{2}F + E_2$

where F is the latent common **factor** extracted from the data and replacing what was P1 in **Eq.2**. The difference in the model is that in FA, unlike PCA, **error variables (E1 and E2) are required** to be uncorrelated with each other.

*Digression*. Here I want to interrupt the story for a moment and say a few words about the coefficients $a$. In PCA, we said, these were the entries of the eigenvectors found within PCA (via eigen- or singular-value decomposition), while latent P1 had its native variance. If we choose to standardize P1 to *unit variance*, we'll have to compensate by appropriately scaling up the coefficients $a$, in order to support the equation. Those scaled-up $a$s are called *loadings*; they are of numerical interest because they are the covariances (or correlations) between the latent and the observable variables and therefore can help interpret the latent feature. In both models - **Eq.2** and **Eq.3** - you are free to decide, without harming the equation, which way the terms are scaled. If F (or P1) is considered unit-scaled, $a$ is a loading; while if F (P1) has to have its native scale (variance), then $a$ should be de-scaled accordingly - in PCA that will equal the eigenvector entries, but in FA they will be different and usually *not* called "eigenvectors". In most texts on factor analysis, F is assumed to have unit variance, so $a$ are *loadings*. In the PCA literature, P1 is typically discussed as having its real variance, and so $a$ are eigenvectors.
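The rescaling from eigenvector entries to loadings is just a multiplication by the square root of the component's variance; a two-line sketch with our numbers:

```python
import numpy as np

eigenvalue = 1.75756                       # variance of P1 (from above)
eigenvector = np.array([0.73543, 0.67761]) # 1st eigenvector entries

# Standardizing P1 to unit variance scales the coefficients up by sqrt(variance):
loadings = eigenvector * np.sqrt(eigenvalue)
print(loadings)    # ~ [0.97497, 0.89832], the PCA loadings used further below
```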

OK, back to the thread. E1 and E2 are uncorrelated in factor analysis; thus, they should form a cloud of errors that is either round or elliptic, but not diagonally oriented - while in PCA their cloud formed a straight line coinciding with the diagonally going P2. Both ideas are demonstrated in the picture:

Note that the errors form a round (not diagonally elongated) cloud in FA. The factor (latent) in FA is oriented somewhat differently, i.e. it is not exactly the first principal component, which is the "latent" of PCA. In the picture, the factor line is drawn, a bit strangely, as a wedge - it will become clear why in the end.

**What is the meaning of this difference between PCA and FA?** The variables are correlated, which is seen in the diagonally elliptical shape of the data cloud. P1 skimmed off the maximal variance, so the ellipse is co-directed with P1. Consequently P1 explained, by itself, the correlation; but it did not explain *the existing amount of correlation* adequately; it looked to explain *variation* in the data points, not correlatedness. Actually, it over-accounted for the correlation, and the result was the appearance of the diagonal, correlated cloud of errors, which compensates for the over-account. P1 *alone* cannot explain the strength of correlation/covariation comprehensively. Factor F *can* do it alone; and the condition under which it becomes able to do so is exactly the one where the errors can be forced to be uncorrelated. Since the error cloud is round, no correlatedness - positive or negative - has remained after the factor was extracted; hence it is the factor which skimmed it all off.

As a dimensionality reduction, **PCA explains variance** but explains correlations imprecisely. **FA explains correlations** but cannot account (by the common factors) for as much data variation as PCA can. Factor(s) in FA account for that portion of variability which is the net correlational portion, called *communality*; and therefore factors can be interpreted as real yet unobservable forces/features/traits which hide "in" or "behind" the input variables to bring them to correlate - because they explain the correlation well mathematically. Principal components (the few first ones) do not explain it mathematically as well, and so can be called a "latent trait" (or such) only at some stretch, and tentatively.

Multiplication of *loadings* is what explains (restores) correlation, or correlatedness in the form of covariance - if the analysis was based on the covariance matrix (as in our example) rather than the correlation matrix. The factor analysis that I did with the data yielded `a_1=.87352, a_2=.84528`, so the product `a_1*a_2 = .73837` is almost equal to the covariance `.73915`. On the other hand, the PCA loadings were `a1_1=.97497, a1_2=.89832`, so `a1_1*a1_2 = .87584` overestimates `.73915` considerably.
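A quick arithmetic check of these numbers (pure Python, using the quoted values):

```python
# Values quoted in the text
cov_v1v2 = 0.73915
fa_a  = (0.87352, 0.84528)    # FA loadings
pca_a = (0.97497, 0.89832)    # PCA (P1) loadings

fa_restored  = fa_a[0] * fa_a[1]     # ~ .73837, close to the covariance
pca_restored = pca_a[0] * pca_a[1]   # ~ .87584, a considerable overestimate
print(fa_restored, pca_restored)
```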

Having explained the main theoretical distinction between PCA and FA, let's get back to our data to exemplify the idea.

### FA: approximate solution (factor scores)

Below is the scatterplot showing the results of the analysis that we'll provisionally call "sub-optimal factor analysis", **Fig.3**.

```
A technical detail (you may skip): PAF method used for factor extraction.
Factor scores computed by Regression method.
Variance of the factor scores on the plot was scaled to the true
factor variance (sum of squared loadings).
```
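For the curious, the PAF idea can itself be sketched in a few lines. This is a minimal illustration, not the exact routine of any package: starting communality estimates, convergence criteria and safeguards differ between implementations, so the resulting loadings match the values quoted earlier only approximately, though their product restores the covariance:

```python
import numpy as np

def paf(C, n_factors=1, n_iter=200):
    """Iterated principal-axis factoring (sketch): put communality
    estimates on the diagonal, take the leading eigenpair(s), update
    the communalities from the loadings, repeat."""
    h2 = np.diag(C).copy()               # crude start: full variances
    for _ in range(n_iter):
        R = C.copy()
        np.fill_diagonal(R, h2)          # the "reduced" covariance matrix
        evals, evecs = np.linalg.eigh(R)
        idx = np.argsort(evals)[::-1][:n_factors]
        L = evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))
        h2 = (L ** 2).sum(axis=1)        # updated communalities
    return np.abs(L)                     # fix sign for readability

C = np.array([[1.07652, 0.73915],
              [0.73915, 0.95534]])
L = paf(C)
print(L.ravel())           # one-factor loadings
print(L[0, 0] * L[1, 0])   # their product restores the covariance ~ .73915
```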

See the departures from **Fig.2** of PCA. The beige cloud of the errors isn't round; it is diagonally elliptical - yet it is evidently much fatter than the thin diagonal line that occurred in PCA. Note also that the error connectors (shown for some points) are no longer parallel (in PCA, they were by definition parallel to P2). Moreover, if you look, for example, at points "F" and "E", which lie mirror-symmetrically over the factor's **F** axis, you'll find, unexpectedly, that their corresponding factor scores are quite different values. In other words, factor scores are not just linearly transformed principal component scores: factor F is found in its own way, different from the P1 way. And their axes do not fully coincide if shown together on the same plot, **Fig.4**:

Apart from being a bit differently oriented, F (as tiled with scores) is shorter, i.e. it accounts for less variance than P1 does. As noted earlier, the factor accounts only for the variability which is responsible for the correlatedness of V1 and V2, i.e. the portion of total variance that is sufficient to bring the variables from the primeval covariance `0` to the factual covariance `.73915`.

### FA: optimal solution (true factor)

An optimal factor solution is when the errors form a round or non-diagonal elliptic cloud: E1 and E2 are *fully uncorrelated*. Factor analysis actually **returns** such an optimal solution. I did not show it on a simple scatterplot like the ones above. Why didn't I? - it would have been the most interesting thing, after all.

The reason is that it would be impossible to show it adequately on a scatterplot, even adopting a 3D plot. It is quite an interesting point theoretically. In order to make E1 and E2 completely uncorrelated, it appears that all three variables - F, E1, E2 - **have to lie not** in the space (plane) defined by V1, V2; and **the three must be uncorrelated with each other**. I believe it is possible to draw such a scatterplot in 5D (and maybe, with some gimmick, in 4D), but we live in a 3D world, alas. Factor F must be uncorrelated with both E1 and E2 (while those two are uncorrelated with each other too) because F is supposed to be the **only (clean) and complete** source of correlatedness in the observed data. **Factor analysis splits the total variance** of the `p` input variables into two uncorrelated (nonoverlapping) parts: the *communality* part (`m`-dimensional, where the `m` common factors rule) and the *uniqueness* part (`p`-dimensional, where the errors - also called unique factors, mutually uncorrelated - reside).
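In numbers for our example, using the loadings quoted earlier: each variable's variance splits into the communality $a^2$ plus the uniqueness.

```python
# Variance split of Eq.3: Var(V) = a^2 (communality) + Var(E) (uniqueness)
variances = {'V1': 1.07652, 'V2': 0.95534}
loadings  = {'V1': 0.87352, 'V2': 0.84528}   # FA loadings quoted earlier

for v in variances:
    h2 = loadings[v] ** 2            # communality
    u2 = variances[v] - h2           # uniqueness
    print(v, round(h2, 5), round(u2, 5))
```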

So pardon me for not showing the true factor of our data on a scatterplot here. It could be visualized quite adequately via vectors in "subject space", as done here without showing data points.

Above, in the section "The idea of common FA (latent feature)", I displayed the factor (axis F) as a wedge, in order to warn that the true factor axis does *not* lie on the plane V1 V2. That means that - in contrast to principal component P1 - factor F, as an axis, is not a rotation of axis V1 or V2 in their space, and F, as a variable, is *not a linear combination* of variables V1 and V2. Therefore F is modeled (extracted from variables V1, V2) as if an outer, independent variable, not a derivation of them. Equations like **Eq.1**, from where PCA begins, are inapplicable to compute the *true* (optimal) factor in factor analysis, whereas the formally isomorphic equations **Eq.2** and **Eq.3** are valid for both analyses. That is, in PCA variables generate components and components back-predict variables; in FA **factor(s) generate/predict variables, and not back** - the common factor model conceptually assumes *so*, even though technically factors are extracted from the observed variables.

Not only is the *true* factor not a function of the manifest variables, the *true* factor's *values* are not even uniquely defined. In other words, they are simply unknown. That is all due to the fact that we're in the excessive 5D analytic space and not in our home 2D space of the data. Only good *approximations* (a number of methods exist) to the true factor values, called *factor scores*, are there for us. Factor scores do lie in the plane V1 V2, as principal component scores do; they too are computed as linear functions of V1, V2; and it *was they* that I plotted in the section "FA: approximate solution (factor scores)". Principal component scores are true component values; factor scores are only a reasonable approximation to the undetermined true factor values.
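Concretely, the Regression method of computing factor scores (mentioned in the technical note earlier) boils down, in the covariance metric, to the weights $C^{-1}a$, since the loadings are the covariances between the factor and the variables. A minimal sketch with our numbers - only the weight vector is shown, since the raw data points are not published:

```python
import numpy as np

C = np.array([[1.07652, 0.73915],   # covariance matrix of V1, V2
              [0.73915, 0.95534]])
a = np.array([0.87352, 0.84528])    # factor loadings (covariance metric)

# Regression-method weights b: factor scores = b1*V1 + b2*V2
b = np.linalg.solve(C, a)
print(b)    # ~ [0.435, 0.548]
```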

### FA: roundup of the procedure

To gather into one small clot what the two previous sections said, and to add the final strokes. Actually, FA can (*if* you do it right, and see also data assumptions) find the true factor solution (by "true" I mean here: optimal for the data sample). However, various methods of extraction exist (they differ in some secondary constraints they put). The true factor solution **is up to the loadings** $a$ only. Thus, the loadings are those of optimal, true factors. **Factor scores** - if you need them - are computable out of *those loadings* in various ways, and return approximations to the factor values.

Thus, the "factor solution" displayed by me in the section "FA: approximate solution (factor scores)" was actually based on optimal loadings, i.e. on true factors. But the scores were, inevitably, not optimal. The scores are computed to be a linear function of the observed variables, as component scores are, so the two could be compared on a scatterplot, and I did so in a didactic pursuit, to present it as a gradual pass from the PCA idea towards the FA idea.

One must be wary when plotting factor loadings together with factor scores on the same *biplot* in the "space of factors": be conscious that the loadings pertain to true factors while the scores pertain to surrogate factors (see my comments to this answer in this thread).

Rotation of factors (loadings) helps interpret the latent features. Rotation of loadings can also be done in PCA, if you use PCA as if it were factor analysis (that is, see PCA as variable prediction). PCA tends to converge in results with FA as the number of variables grows (see the extremely rich thread on practical and conceptual similarities and differences between the two methods). See my list of differences between PCA and FA at the end of this answer. Step-by-step computations of PCA vs FA on the *iris* dataset are found here. There is a considerable number of good links to other participants' answers on the topic outside this thread; I'm sorry I only used a few of them in the current answer.
