Principal Component Analysis

Matthieu Bloch

Principal Component Analysis

  • Feature extraction method: unsupervised, linear, and based on a sum-of-squared-errors criterion

  • Idea is to find an approximation of the data as \[\bfx_i \approx \bfmu + \bfA\bftheta_i\textsf{ with }\bfmu\in\bbR^d,\bfA\in\bbR^{d\times k},\bftheta_i\in\bbR^{k} \] where \(\bfA\) has orthonormal columns

  • Principal Component Analysis consists in solving the problem \[\argmin_{\bfmu,\bfA,\bftheta_i}\sum_{i=1}^N\norm[2]{\bfx_i-\bfmu-\bfA\bftheta_i}^2\] (see the code sketch after this list)
  • Hard part is finding \(\bfA\)

  • Given \(\bfA\), relatively easy to find \(\bftheta_i\) and \(\bfmu\)
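As a concrete reading of the objective above, here is a minimal NumPy sketch; the data-matrix layout and the names `X`, `mu`, `A`, `Theta` are illustrative choices, not notation from the slides.

```python
import numpy as np

def pca_objective(X, mu, A, Theta):
    """Sum of squared reconstruction errors: sum_i ||x_i - mu - A theta_i||^2.

    X     : (d, N) data matrix, one column per sample x_i
    mu    : (d,)   offset vector
    A     : (d, k) matrix with orthonormal columns
    Theta : (k, N) coefficient matrix, one column per theta_i
    """
    residual = X - mu[:, None] - A @ Theta
    return float(np.sum(residual ** 2))
```

PCA then amounts to minimizing this quantity jointly over `mu`, `A` (with orthonormal columns), and `Theta`, which is solved in closed form below.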

Solving PCA

  • Assume that \(\bfmu\) and \(\bfA\) are fixed. Then (see the derivation sketch after this list), \[\bftheta_i=\bfA^{\intercal}(\bfx_i-\bfmu)\]

  • Assume \(\bfA\) is fixed and \(\bftheta_i = \bfA^\intercal(\bfx_i-\bfmu)\). Then, \[\bfmu=\frac{1}{N}\sum_{i=1}^N\bfx_i\]

  • One possible choice of \(\bfA\) is \[\bfA=[\bfu_1,\cdots,\bfu_k]\] where the \(\bfu_i\)’s are the eigenvectors corresponding to the \(k\) largest eigenvalues of the scatter matrix \(\bfS\eqdef\sum_{i=1}^N(\bfx_i-\bfmu)(\bfx_i-\bfmu)^\intercal\) of the centered data
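A short sketch of why the two “easy” steps above hold, using only first-order optimality conditions and the orthonormality of the columns of \(\bfA\) (notation as on the slides): for fixed \(\bfmu\) and \(\bfA\), setting the gradient of \(\norm[2]{\bfx_i-\bfmu-\bfA\bftheta_i}^2\) with respect to \(\bftheta_i\) to zero gives \[ 0=-2\bfA^\intercal(\bfx_i-\bfmu-\bfA\bftheta_i) \quad\Rightarrow\quad \bftheta_i=(\bfA^\intercal\bfA)^{-1}\bfA^\intercal(\bfx_i-\bfmu)=\bfA^\intercal(\bfx_i-\bfmu). \] Substituting back, the objective becomes \(\sum_{i=1}^N\norm[2]{(\mathbf{I}-\bfA\bfA^\intercal)(\bfx_i-\bfmu)}^2\), whose gradient with respect to \(\bfmu\) vanishes whenever \(\sum_{i=1}^N(\bfx_i-\bfmu)\) lies in the column space of \(\bfA\); the sample mean \(\bfmu=\frac{1}{N}\sum_{i=1}^N\bfx_i\) is the canonical choice, since it makes this sum zero.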

Solving PCA

  • Proof steps for the eigenvector characterization of \(\bfA\) above
    • Step 1: introduce \(\bfS\eqdef\sum_{i=1}^N(\bfx_i-\bfmu)(\bfx_i-\bfmu)^\intercal=\bfX\bfX^\intercal\), where the columns of \(\bfX\) are the centered points \(\bfx_i-\bfmu\)
    • Step 2: introduce linear program
    • Step 3: solve linear program
  • Connection to the SVD \(\bfX = \bfU\bfSigma \bfV^\dagger\) of the centered data matrix (columns of \(\bfX\) are \(\bfx_i-\bfmu\)): \(\bfA\) consists of the first \(k\) columns of \(\bfU\), and the coefficient matrix \(\bfTheta\eqdef[\bftheta_1,\cdots,\bftheta_N]\) is read off from the first \(k\) rows of \(\bfSigma\bfV^\dagger\), i.e., \[ \bfX \approx \underbrace{\mat{ccc}{\vert & & \vert\\ \bfu_1 & \cdots & \bfu_k\\ \vert & & \vert}}_{\bfA\;=\;\text{first }k\text{ columns of }\bfU}\;\underbrace{\bfTheta}_{\text{first }k\text{ rows of }\bfSigma\bfV^\dagger} \]
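One standard way Steps 2 and 3 can be fleshed out (a sketch; the precise linear program intended on the slide may differ in details): with \(\bfmu\) the sample mean and \(\bftheta_i=\bfA^\intercal(\bfx_i-\bfmu)\), the objective equals \(\text{tr}(\bfS)-\text{tr}(\bfA^\intercal\bfS\bfA)\), so it suffices to maximize \(\text{tr}(\bfA^\intercal\bfS\bfA)\) subject to \(\bfA^\intercal\bfA=\mathbf{I}\). Writing \(\bfS=\sum_{j=1}^d\lambda_j\bfu_j\bfu_j^\intercal\) with \(\lambda_1\geq\cdots\geq\lambda_d\) and setting \(h_j\eqdef\norm[2]{\bfA^\intercal\bfu_j}^2\), \[ \text{tr}(\bfA^\intercal\bfS\bfA)=\sum_{j=1}^d\lambda_j h_j,\qquad 0\leq h_j\leq 1,\qquad \sum_{j=1}^d h_j=\text{tr}(\bfA\bfA^\intercal)=k, \] where \(h_j\leq 1\) because \(\bfA\bfA^\intercal\) is an orthogonal projection; this is a linear program in \((h_1,\cdots,h_d)\), maximized by putting \(h_j=1\) on the \(k\) largest eigenvalues, i.e., by \(\bfA=[\bfu_1,\cdots,\bfu_k]\).

The eigenvector characterization and the SVD connection can also be checked numerically. A minimal NumPy sketch (synthetic data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 5, 200, 2

# Synthetic data with low-rank structure plus noise (illustrative only)
X = rng.normal(size=(d, k)) @ rng.normal(size=(k, N)) + 0.1 * rng.normal(size=(d, N))

# Center: the columns of Xc are x_i - mu
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu

# Route 1: top-k eigenvectors of the scatter matrix S = Xc Xc^T
S = Xc @ Xc.T
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
A_eig = eigvecs[:, order[:k]]

# Route 2: first k left singular vectors of Xc (the SVD connection)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A_svd = U[:, :k]

# Same subspace: the columns agree up to sign
print(np.allclose(np.abs(A_eig.T @ A_svd), np.eye(k), atol=1e-6))

# Coefficients Theta = A^T Xc equal the first k rows of Sigma V^T
Theta = A_svd.T @ Xc
print(np.allclose(Theta, np.diag(s[:k]) @ Vt[:k]))
```

In practice the SVD route is usually preferred numerically, since it avoids forming \(\bfX\bfX^\intercal\) explicitly.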

Wrapping up PCA

  • Customary to center and scale a data set so that it has zero mean and unit variance along each feature

  • Typically select \(k\) so that the residual error, i.e., the sum of the eigenvalues of \(\bfS\) beyond the \(k\) largest, is a small fraction of the total (see the sketch after this list)

  • PCA is a good idea when
    • the data forms a single point cloud in space
    • the data is approximately Gaussian, or some other elliptical distribution
    • a low-dimensional subspace captures the majority of the variation
  • Big picture: PCA generates a low-dimensional embedding of the data
    • Euclidean structure of data is approximately preserved
    • Distances between points are roughly the same before and after PCA
    • What about other distances?
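Before turning to other notions of distance, a hedged end-to-end sketch of the wrap-up recipe (synthetic data; the 95% threshold and all variable names are illustrative choices, not from the slides): center and scale, pick \(k\) from the eigenvalue spectrum, embed, and check that pairwise Euclidean distances are approximately preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 300, 10

# Synthetic data whose variation is concentrated in a few directions (illustrative)
Z = rng.normal(size=(N, 3)) @ rng.normal(size=(3, d)) + 0.05 * rng.normal(size=(N, d))

# Center and scale: zero mean, unit variance along each feature
Xs = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Eigenvalue spectrum of the scatter matrix; choose k so the residual error is small
S = Xs.T @ Xs
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # sort in decreasing order
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1             # smallest k capturing ~95%

# Low-dimensional embedding: theta_i = A^T (x_i - mu), one row per point
A = eigvecs[:, :k]
Theta = Xs @ A

# Pairwise Euclidean distances before and after the embedding are roughly equal
def pairwise_distances(M):
    sq = np.sum(M ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * M @ M.T, 0.0))

D_before = pairwise_distances(Xs)
D_after = pairwise_distances(Theta)
rel_err = np.abs(D_before - D_after) / (D_before + np.eye(N))  # eye avoids 0/0 on diagonal
print(f"k = {k}, median relative distance distortion = {np.median(rel_err):.3e}")
```

Distances can only shrink under the orthogonal projection, and the distortion is controlled by the discarded eigenvalues.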