The math behind PCA


I am trying to understand the math behind PCA. I can only solve it for the case of mapping vectors to a 1-dimensional space. How do I solve it when the reduced number of dimensions is greater than 1?

$$\max \mathrm{Tr}(\mathbf{w}^T\mathbf{X}\mathbf{X}^T\mathbf{w})$$ $$\text{s.t. } \mathbf{w}^T\mathbf{w} = 1$$

Zac Jonathan

Posted 2020-09-01T08:29:41.013

Reputation: 19

To me, the part of this question asking how to solve the math when the reduced number of dimensions is greater than 1 is unclear. Can you clarify that? – nbro – 2020-09-01T11:16:46.693


– Rodrigo de Azevedo – 2020-09-01T22:51:33.927



You might want to have a look at the Wikipedia article on PCA, where it says:

"The $k$th component can be found by subtracting the first $k - 1$ principal components from $\mathbf{X}$:"

$$\hat{\mathbf{X}}_k = \mathbf{X} - \sum_{s=1}^{k-1}\mathbf{X}\mathbf{w}_s\mathbf{w}_s^T$$

Then you repeat the process to find the next component:

$$\mathbf{w}_k = \arg\max_{\mathbf{w}} \mathbf{w}^T\hat{\mathbf{X}}_k^T\hat{\mathbf{X}}_k\mathbf{w}$$ $$\text{s.t. } \mathbf{w}^T\mathbf{w} = 1$$
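The deflation procedure above can be sketched in NumPy. This is only an illustration under some assumptions: $\mathbf{X}$ is taken to have one sample per row and to be column-centred already, the function name `pca_by_deflation` is made up for this example, and each one-dimensional problem is solved with `np.linalg.eigh` rather than by hand.

```python
import numpy as np

def pca_by_deflation(X, k):
    """Return a (p, k) matrix whose columns are the first k principal directions.

    Assumes X has shape (n_samples, p) and is already column-centred.
    """
    n, p = X.shape
    W = np.zeros((p, 0))  # no components found yet
    for _ in range(k):
        # Subtract the part of X explained by the components found so far:
        # X_hat = X - sum_s X w_s w_s^T
        X_hat = X - X @ W @ W.T
        # Solve the 1-D problem: the unit vector maximising w^T X_hat^T X_hat w
        # is the eigenvector with the largest eigenvalue.
        eigvals, eigvecs = np.linalg.eigh(X_hat.T @ X_hat)
        w = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order
        W = np.column_stack([W, w])
    return W

# Synthetic data with clearly separated variances along each axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)

W = pca_by_deflation(X, 3)
```

Up to sign, the columns of `W` should agree with the three leading eigenvectors of $\mathbf{X}^T\mathbf{X}$ computed in one shot, which is why in practice a single eigendecomposition (or SVD) is normally used instead of repeated deflation.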


Posted 2020-09-01T08:29:41.013

Reputation: 413


You can also understand the logic from the point of view of constrained optimisation. Introduce the Lagrangian: $$ \mathcal{L} = \text{Tr} (w^{T} X X^{T} w) - \lambda w^{T} w $$ and take the derivative with respect to $w$: $$ \frac{\partial \mathcal{L}}{\partial w} = 2 (X X^{T} - \lambda I) w $$ In the general case of dimension $\geqslant 1$, $w$ is a set of vectors $w = (w_1 \ w_2 \ \ldots \ w_n)$. The derivative vanishes if, for some index $i$, $w_i$ is an eigenvector of $XX^{T}$ with eigenvalue $\lambda_i$ and all other components are set to zero. In other words, the stationary points are the eigenvectors of $X X^{T}$.

The condition $w^T w = 1$ (read as $w^T w = I$ when $w$ has several columns) imposes orthogonality on the eigenvectors. In fact, going back to the initial functional, one sees that $w_i^{T} X X^{T} w_j = \lambda_j w_i^{T} w_j = 0$ for $i \neq j$. Therefore, we finally have: $$ \mathcal{L} =\sum \lambda_i - \lambda $$ which is maximized, for any $k \geq 1$, by taking the $k$ largest eigenvalues.
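A quick numerical check of this conclusion, as a sketch rather than a proof: the data matrix below is random, the helper name `top_k_directions` is invented for the example, and $X$ follows the question's convention of one sample per column, so that $XX^T$ is the (unnormalised) covariance matrix.

```python
import numpy as np

def top_k_directions(X, k):
    """Eigenvectors of X X^T belonging to the k largest eigenvalues.

    Assumes X has shape (d, n_samples) with each row centred.
    """
    S = X @ X.T
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 100))
X = X - X.mean(axis=1, keepdims=True)

k = 2
W, lams = top_k_directions(X, k)

# Objective from the question, evaluated at the eigenvector solution.
objective = np.trace(W.T @ X @ X.T @ W)

# Any other set of k orthonormal columns, here taken from a QR
# factorisation of a random matrix, for comparison.
Q, _ = np.linalg.qr(rng.normal(size=(6, k)))
random_objective = np.trace(Q.T @ X @ X.T @ Q)
```

Here `objective` should equal `lams.sum()`, the sum of the $k$ largest eigenvalues, and `random_objective` should not exceed it.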


Posted 2020-09-01T08:29:41.013

Reputation: 156

1.) Is the duality gap zero for such functions? 2.) $L$ is minimised for the $k$ largest eigenvalues. My main concern is that I haven't seen this type of formulation without building the dual problem; can you link a resource? (Not doubting, but I am interested) – DuttaA – 2020-09-01T23:16:48.040