Principal Component Analysis

Matthieu Bloch

Principal Component Analysis

  • Feature extraction method: unsupervised, linear, and based on a sum-of-squared-errors criterion

  • Idea is to find an approximation of the data as \[\bfx_i \approx \bfmu + \bfA\bftheta_i\textsf{ with }\bfmu\in\bbR^d,\bfA\in\bbR^{d\times k},\bftheta_i\in\bbR^{k} \] where \(\bfA\) has orthonormal columns

  • Principal Component Analysis consists in solving the problem \[\argmin_{\bfmu,\bfA,\bftheta_i}\sum_{i=1}^N\norm[2]{\bfx_i-\bfmu-\bfA\bftheta_i}^2\] (see the code sketch after this list)
  • Hard part is finding \(\bfA\)

  • Given \(\bfA\), relatively easy to find \(\bftheta_i\) and \(\bfmu\)
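As a concrete reading of the objective above, here is a minimal NumPy sketch; the data-matrix layout and the names `X`, `mu`, `A`, `Theta` are illustrative choices, not notation from the slides.

```python
import numpy as np

def pca_objective(X, mu, A, Theta):
    """Sum of squared reconstruction errors: sum_i ||x_i - mu - A theta_i||^2.

    X     : (d, N) data matrix, one column per sample x_i
    mu    : (d,)   offset vector
    A     : (d, k) matrix with orthonormal columns
    Theta : (k, N) coefficient matrix, one column per theta_i
    """
    residual = X - mu[:, None] - A @ Theta
    return float(np.sum(residual ** 2))
```

PCA then amounts to minimizing this quantity jointly over `mu`, `A` (with orthonormal columns), and `Theta`, which is solved in closed form below.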

Solving PCA

  • Assume that \(\bfmu\) and \(\bfA\) are fixed. Then (see the derivation sketch after this list), \[\bftheta_i=\bfA^{\intercal}(\bfx_i-\bfmu)\]

  • Assume \(\bfA\) is fixed and \(\bftheta_i = \bfA^\intercal(\bfx_i-\bfmu)\). Then, \[\bfmu=\frac{1}{N}\sum_{i=1}^N\bfx_i\]

  • One possible choice of \(\bfA\) is \[\bfA=[\bfu_1,\cdots,\bfu_k]\] where the \(\bfu_i\)’s are the eigenvectors corresponding to the \(k\) largest eigenvalues of the scatter matrix \(\bfS\eqdef\sum_{i=1}^N(\bfx_i-\bfmu)(\bfx_i-\bfmu)^\intercal\) of the centered data
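A short sketch of why the two “easy” steps above hold, using only first-order optimality conditions and the orthonormality of the columns of \(\bfA\) (notation as on the slides): for fixed \(\bfmu\) and \(\bfA\), setting the gradient of \(\norm[2]{\bfx_i-\bfmu-\bfA\bftheta_i}^2\) with respect to \(\bftheta_i\) to zero gives \[ 0=-2\bfA^\intercal(\bfx_i-\bfmu-\bfA\bftheta_i) \quad\Rightarrow\quad \bftheta_i=(\bfA^\intercal\bfA)^{-1}\bfA^\intercal(\bfx_i-\bfmu)=\bfA^\intercal(\bfx_i-\bfmu). \] Substituting back, the objective becomes \(\sum_{i=1}^N\norm[2]{(\mathbf{I}-\bfA\bfA^\intercal)(\bfx_i-\bfmu)}^2\), whose gradient with respect to \(\bfmu\) vanishes whenever \(\sum_{i=1}^N(\bfx_i-\bfmu)\) lies in the column space of \(\bfA\); the sample mean \(\bfmu=\frac{1}{N}\sum_{i=1}^N\bfx_i\) is the canonical choice, since it makes this sum zero.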

Solving PCA

  • Proof steps for the eigenvector characterization of \(\bfA\) above
    • Step 1: introduce \(\bfS\eqdef\sum_{i=1}^N(\bfx_i-\bfmu)(\bfx_i-\bfmu)^\intercal=\bfX\bfX^\intercal\), where the columns of \(\bfX\) are the centered points \(\bfx_i-\bfmu\)
    • Step 2: introduce linear program
    • Step 3: solve linear program
  • Connection to the SVD \(\bfX = \bfU\bfSigma \bfV^\dagger\) of the centered data matrix (columns of \(\bfX\) are \(\bfx_i-\bfmu\)): \(\bfA\) consists of the first \(k\) columns of \(\bfU\), and the coefficient matrix \(\bfTheta\eqdef[\bftheta_1,\cdots,\bftheta_N]\) is read off from the first \(k\) rows of \(\bfSigma\bfV^\dagger\), i.e., \[ \bfX \approx \underbrace{\mat{ccc}{\vert & & \vert\\ \bfu_1 & \cdots & \bfu_k\\ \vert & & \vert}}_{\bfA\;=\;\text{first }k\text{ columns of }\bfU}\;\underbrace{\bfTheta}_{\text{first }k\text{ rows of }\bfSigma\bfV^\dagger} \]
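One standard way Steps 2 and 3 can be fleshed out (a sketch; the precise linear program intended on the slide may differ in details): with \(\bfmu\) the sample mean and \(\bftheta_i=\bfA^\intercal(\bfx_i-\bfmu)\), the objective equals \(\text{tr}(\bfS)-\text{tr}(\bfA^\intercal\bfS\bfA)\), so it suffices to maximize \(\text{tr}(\bfA^\intercal\bfS\bfA)\) subject to \(\bfA^\intercal\bfA=\mathbf{I}\). Writing \(\bfS=\sum_{j=1}^d\lambda_j\bfu_j\bfu_j^\intercal\) with \(\lambda_1\geq\cdots\geq\lambda_d\) and setting \(h_j\eqdef\norm[2]{\bfA^\intercal\bfu_j}^2\), \[ \text{tr}(\bfA^\intercal\bfS\bfA)=\sum_{j=1}^d\lambda_j h_j,\qquad 0\leq h_j\leq 1,\qquad \sum_{j=1}^d h_j=\text{tr}(\bfA\bfA^\intercal)=k, \] where \(h_j\leq 1\) because \(\bfA\bfA^\intercal\) is an orthogonal projection; this is a linear program in \((h_1,\cdots,h_d)\), maximized by putting \(h_j=1\) on the \(k\) largest eigenvalues, i.e., by \(\bfA=[\bfu_1,\cdots,\bfu_k]\).

The eigenvector characterization and the SVD connection can also be checked numerically. A minimal NumPy sketch (synthetic data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 5, 200, 2

# Synthetic data with low-rank structure plus noise (illustrative only)
X = rng.normal(size=(d, k)) @ rng.normal(size=(k, N)) + 0.1 * rng.normal(size=(d, N))

# Center: the columns of Xc are x_i - mu
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu

# Route 1: top-k eigenvectors of the scatter matrix S = Xc Xc^T
S = Xc @ Xc.T
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
A_eig = eigvecs[:, order[:k]]

# Route 2: first k left singular vectors of Xc (the SVD connection)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A_svd = U[:, :k]

# Same subspace: the columns agree up to sign
print(np.allclose(np.abs(A_eig.T @ A_svd), np.eye(k), atol=1e-6))

# Coefficients Theta = A^T Xc equal the first k rows of Sigma V^T
Theta = A_svd.T @ Xc
print(np.allclose(Theta, np.diag(s[:k]) @ Vt[:k]))
```

In practice the SVD route is usually preferred numerically, since it avoids forming \(\bfX\bfX^\intercal\) explicitly.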

Wrapping up PCA

  • Customary to center and scale a data set so that it has zero mean and unit variance along each feature

  • Typically select \(k\) so that the residual error, i.e., the sum of the eigenvalues of \(\bfS\) beyond the \(k\) largest, is a small fraction of the total (see the sketch after this list)

  • PCA is a good idea when
    • the data forms a single point cloud in space
    • the data is approximately Gaussian, or some other elliptical distribution
    • a low-dimensional subspace captures the majority of the variation
  • Big picture: PCA generates a low-dimensional embedding of the data
    • Euclidean structure of data is approximately preserved
    • Distances between points are roughly the same before and after PCA
    • What about other distances?
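Before turning to other notions of distance, a hedged end-to-end sketch of the wrap-up recipe (synthetic data; the 95% threshold and all variable names are illustrative choices, not from the slides): center and scale, pick \(k\) from the eigenvalue spectrum, embed, and check that pairwise Euclidean distances are approximately preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 300, 10

# Synthetic data whose variation is concentrated in a few directions (illustrative)
Z = rng.normal(size=(N, 3)) @ rng.normal(size=(3, d)) + 0.05 * rng.normal(size=(N, d))

# Center and scale: zero mean, unit variance along each feature
Xs = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Eigenvalue spectrum of the scatter matrix; choose k so the residual error is small
S = Xs.T @ Xs
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # sort in decreasing order
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1             # smallest k capturing ~95%

# Low-dimensional embedding: theta_i = A^T (x_i - mu), one row per point
A = eigvecs[:, :k]
Theta = Xs @ A

# Pairwise Euclidean distances before and after the embedding are roughly equal
def pairwise_distances(M):
    sq = np.sum(M ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * M @ M.T, 0.0))

D_before = pairwise_distances(Xs)
D_after = pairwise_distances(Theta)
rel_err = np.abs(D_before - D_after) / (D_before + np.eye(N))  # eye avoids 0/0 on diagonal
print(f"k = {k}, median relative distance distortion = {np.median(rel_err):.3e}")
```

Distances can only shrink under the orthogonal projection, and the distortion is controlled by the discarded eigenvalues.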