Multi-Dimensional Scaling

Matthieu Bloch

Multidimensional scaling

  • There are situations for which Euclidean distance is not appropriate

  • Suppose we have access to a dissimilarity matrix \(\bfD\eqdef[d_{ij}]\in\bbR^{N\times N}\) and some distance function \(\rho\)

  • A dissimilarity matrix \(\bfD\eqdef[d_{ij}]\in\bbR^{N\times N}\) satisfies \[\forall i,j\qquad d_{ij}\geq0\quad d_{ij}=d_{ji}\quad d_{ii}=0\]

  • Triangle inequality not required

  • Multidimensional scaling (MDS)
    • Find dimension \(k\leq d\) and \(\{\bfx_i\}_{i=1}^N\in\bbR^k\) such that \(\rho(\bfx_i,\bfx_j)\approx d_{ij}\)
    • In general, perfect embedding into the desired dimension will not exist
    • Many variants of MDS exist, based on the choice of \(\rho\), whether \(\bfD\) is completely known, and the meaning of \(\approx\)
  • Two types of MDS
    • Metric MDS: try to ensure that \(\rho(\bfx_i,\bfx_j)\approx d_{ij}\)
    • Non-metric MDS: try to ensure that the ordering of dissimilarities is preserved, i.e., \(d_{ij}\leq d_{\ell m}\Rightarrow \rho(\bfx_i,\bfx_j)\leq \rho(\bfx_\ell,\bfx_m)\)
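  • As a small illustration, the three defining properties of a dissimilarity matrix can be checked numerically (a minimal sketch; the function name is illustrative, not from the lecture)

```python
# Check that D is a valid dissimilarity matrix: non-negative, symmetric, zero diagonal.
import numpy as np

def is_dissimilarity_matrix(D, tol=1e-12):
    return bool((D >= -tol).all() and np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0))
```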

Euclidean Embeddings

  • Assume \(\bfD\) is completely known (no missing entry) and \(\rho(\bfx,\bfy)\eqdef \norm[2]{\bfx-\bfy}\)

  • Algorithm to create embedding \(\bfX\in\bbR^{k\times N}\)
    • Form \(\bfB=-\frac{1}{2}\bfH\bfD^2\bfH\), where \(\bfH\eqdef\bfI - \frac{1}{N}\bfone\bfone^\intercal\) and \(\bfD^2\) denotes the matrix of squared dissimilarities \([d_{ij}^2]\)
    • Compute the eigendecomposition \(\bfB = \bfV\bfLambda \bfV^\intercal\) (eigenvalues sorted in decreasing order)
    • Return \(\bfX = (\bfV_k\bfLambda_k^{\frac{1}{2}})^\intercal\), where \(\bfV_k\) consists of the first \(k\) columns of \(\bfV\) and \(\bfLambda_k\) is the \(k\times k\) diagonal matrix of the \(k\) largest eigenvalues
  • Where is this coming from?

  • The above algorithm returns the best rank-\(k\) approximation in the sense that it minimizes both \(\norm[2]{\bfX^\intercal\bfX-\bfB}\) and \(\norm[F]{\bfX^\intercal\bfX-\bfB}\) (a minimal implementation sketch follows below)
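  • A minimal NumPy sketch of the algorithm above, assuming \(\bfD\) is given as an array; the function name and the clipping of small negative eigenvalues (which appear when \(\bfD\) is not exactly Euclidean) are illustrative choices

```python
# Classical MDS: embed an N x N dissimilarity matrix D into R^k.
import numpy as np

def classical_mds(D, k):
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix H = I - (1/N) 1 1^T
    B = -0.5 * H @ (D ** 2) @ H                # B = -1/2 H D^2 H (entrywise squares)
    eigvals, eigvecs = np.linalg.eigh(B)       # eigendecomposition B = V Lambda V^T
    idx = np.argsort(eigvals)[::-1][:k]        # keep the k largest eigenvalues
    Lk = np.clip(eigvals[idx], 0.0, None)      # clip tiny negative eigenvalues
    Vk = eigvecs[:, idx]
    return (Vk * np.sqrt(Lk)).T                # X = (V_k Lambda_k^{1/2})^T, shape (k, N)
```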

PCA vs MDS

  • Suppose we have \(\{\bfx_i\}_{i=1}^N\in\bbR^d\) and \(\bfD\) such that \(d_{ij}\eqdef\norm[2]{\bfx_i-\bfx_j}\)

  • Set \(\bfX = \mat{ccc}{|&&|\\\bfx_1&\cdots&\bfx_N\\|&&|}\in\bbR^{d\times N}\)

  • PCA computes an eigendecomposition of \(\bfS=\sum_{i=1}^N\bfx_i\bfx_i^\intercal=\bfX\bfX^\intercal\in\bbR^{d\times d}\)
    • Equivalent to computing the SVD of \(\bfX=\bfU\bfSigma\bfV^\intercal\)
    • New representation computed as \(\bftheta_i= \bfU_k^\intercal\bfx_i\), where \(\bfU_k\in\bbR^{d\times k}\) consists of the first \(k\) columns of \(\bfU\) \[\bfX = \underbrace{\mat{cc}{\bfA & \cdots}}_{\bfU}\underbrace{\mat{c}{\bfTheta\\ \vdots}}_{\bfSigma\bfV^\intercal}\] with \(\bfA\eqdef\bfU_k\) the first block of columns of \(\bfU\) and \(\bfTheta\) the first \(k\) rows of \(\bfSigma\bfV^\intercal\)
  • MDS computes an eigendecomposition of \(\bfX^\intercal\bfX \in\bbR^{N\times N}\)
    • \(\bfX^\intercal\bfX= \bfV\bfSigma^2 \bfV^\intercal\)
    • Return \(\bfTheta = (\bfV_k\bfSigma_k)^\intercal\)
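  • A quick numerical check of this correspondence on synthetic data (an illustrative sketch; the random data and variable names are not from the lecture)

```python
# PCA (SVD of X) and MDS (eigendecomposition of X^T X) give the same k-dimensional
# representation, up to a sign per component.
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 5, 50, 2
X = rng.standard_normal((d, N))                  # columns are the data points x_i

# PCA: X = U Sigma V^T, new representation Theta = U_k^T X = Sigma_k V_k^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Theta_pca = U[:, :k].T @ X

# MDS: X^T X = V Sigma^2 V^T, Theta = (V_k Sigma_k)^T
eigvals, V = np.linalg.eigh(X.T @ X)
idx = np.argsort(eigvals)[::-1][:k]
Theta_mds = (V[:, idx] * np.sqrt(eigvals[idx])).T

print(np.allclose(np.abs(Theta_pca), np.abs(Theta_mds)))   # True: same embedding up to signs
```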

PCA vs MDS

  • Subtle difference between PCA and MDS
    • PCA gives us access to \(\bfA\) and \(\bfmu\): we can extract features and reconstruct approximations
    • MDS only returns \(\bfTheta\): we need \(\bfX\) to recover \(\bfA\)
    • How can we extract features in MDS and compute \(\bfA^\intercal(\bfx-\bfmu)\)?
    • This matters when adding new points to the embedding
  • Assume we have access to \(\bfD\) and want to add a new point \(\bfx\) to our embedding. Define \[d_{\bfmu}\eqdef\frac{1}{N}\bfD^2\bfone\qquad d_{\bfx}\eqdef\mat{c}{\norm[2]{\bfx-\bfx_1}^2\\\vdots\\\norm[2]{\bfx-\bfx_N}^2}\] Then \(\bfA^\intercal(\bfx-\bfmu)=(\bfTheta^\dagger)^\intercal(\frac{1}{2}(d_\bfmu-d_{\bfx}))\), where \(\bfTheta^\dagger\) consists of the first \(k\) columns of \(\bfV\bfSigma^{-1}\) from the SVD \(\bfX\bfH=\bfU\bfSigma\bfV^\intercal\)
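  • A sketch verifying this out-of-sample formula on synthetic data (illustrative names; with \(k=d\) the identity is exact)

```python
# Verify A^T (x - mu) = (Theta^dagger)^T (1/2)(d_mu - d_x) on random data.
import numpy as np

rng = np.random.default_rng(1)
d, N, k = 3, 40, 3
X = rng.standard_normal((d, N))
mu = X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)       # X H = U Sigma V^T
A = U[:, :k]                                                # principal directions
Theta_pinv = Vt[:k].T / s[:k]                               # Theta^dagger = V_k Sigma_k^{-1}

x = rng.standard_normal((d, 1))                             # new point
D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # squared pairwise distances
d_mu = D2.mean(axis=1, keepdims=True)                       # (1/N) D^2 1
d_x = np.sum((x - X) ** 2, axis=0, keepdims=True).T         # squared distances to x

print(np.allclose(A.T @ (x - mu), Theta_pinv.T @ (0.5 * (d_mu - d_x))))   # True
```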

Extensions of MDS

  • Classical MDS minimizes the loss function \(\norm[F]{\bfX^\intercal\bfX-\bfB}\)
    • Many other choices exist
  • A common choice is the stress function \(\sum_{i,j}w_{i,j}(d_{ij}-\norm[2]{\bfx_i-\bfx_j})^2\)
    • \(w_{i,j}\) are fixed weights
    • \(w_{i,j}\in\{0,1\}\) handles missing data; \(w_{i,j}=\frac{1}{d_{ij}^2}\) penalizes errors on nearby points more heavily
  • Nonlinear embeddings
    • High-dimensional data sets can have nonlinear structure that is not captured by linear methods
    • Kernelize PCA and MDS with a nonlinear feature map \(\Phi:\bbR^d\to\bbR^p\), typically with \(p\gg d\)
    • Use PCA on \(\Phi(\bfX)\Phi(\bfX)^\intercal\) or MDS on \(\Phi(\bfX)^\intercal\Phi(\bfX)\)
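  • A minimal sketch of the weighted stress objective (illustrative; \(w_{i,j}=0\) simply ignores a missing entry of \(\bfD\)); off-the-shelf implementations such as sklearn.manifold.MDS minimize stress objectives of this kind with the SMACOF algorithm

```python
# Weighted stress: sum_{i,j} w_ij (d_ij - ||x_i - x_j||_2)^2
import numpy as np

def stress(X, D, W):
    """X: (k, N) embedding, D: (N, N) target dissimilarities, W: (N, N) weights."""
    diff = X[:, :, None] - X[:, None, :]          # pairwise differences of embedded points
    dist = np.sqrt(np.sum(diff ** 2, axis=0))     # ||x_i - x_j||_2
    return float(np.sum(W * (D - dist) ** 2))
```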

Computing Kernel PCA

  • Dataset \(\calD=\{\bfx_i\}_{i=1}^N\in\bbR^d\), kernel \(k\), dimension \(r\)

  • Kernel PCA
    1. Form \(\tilde{\bfK} = \bfH\bfK\bfH\) where \(\bfK\) is kernel matrix and \(\bfH\) is centering matrix
    2. Compute the eigendecomposition \(\tilde{\bfK} = \bfV\bfLambda\bfV^\intercal\)
    3. Set \(\bfTheta\) to the first \(r\) rows of \((\bfV\bfLambda^\frac{1}{2})^\intercal\)
  • The embedding of a new point is computed with \(f:\bbR^d\to\bbR^r\) defined as \[f(\bfx) = (\bfTheta^\dagger)^\intercal\Phi(\bfX)^\intercal(\Phi(\bfx)-\Phi(\bfmu)) = (\bfTheta^\dagger)^\intercal(k(\bfx)-\frac{1}{N}\bfK\bfone)\] with \(k(\bfx)\eqdef \mat{ccc}{k(\bfx,\bfx_1)&\cdots & k(\bfx,\bfx_N)}^\intercal\) and \(\bfTheta^\dagger=\bfV_r\bfLambda_r^{-\frac{1}{2}}\)

  • No computation in the high-dimensional Hilbert space is required!
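  • A minimal sketch of kernel PCA with an RBF kernel (the kernel choice, the clipping of small eigenvalues, and all names are illustrative assumptions, not from the lecture)

```python
# Kernel PCA: center K, eigendecompose, keep the top-r components; embed new points
# with f(x) = (Theta^dagger)^T (k(x) - (1/N) K 1).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma ||a - b||^2); A is (d, n), B is (d, m)."""
    sq = np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq)

def kernel_pca(X, r, gamma=1.0):
    """X: (d, N) data. Returns Theta (r, N) and a function embedding new points."""
    N = X.shape[1]
    K = rbf_kernel(X, X, gamma)
    H = np.eye(N) - np.ones((N, N)) / N
    eigvals, V = np.linalg.eigh(H @ K @ H)              # centered kernel matrix K_tilde
    idx = np.argsort(eigvals)[::-1][:r]
    Lr, Vr = np.clip(eigvals[idx], 1e-12, None), V[:, idx]
    Theta = (Vr * np.sqrt(Lr)).T                        # first r rows of (V Lambda^{1/2})^T
    Theta_pinv = Vr / np.sqrt(Lr)                       # Theta^dagger = V_r Lambda_r^{-1/2}

    def embed(x):
        kx = rbf_kernel(X, x.reshape(-1, 1), gamma)     # vector k(x), shape (N, 1)
        return Theta_pinv.T @ (kx - K.mean(axis=1, keepdims=True))

    return Theta, embed
```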

Isometric feature mapping

  • Can be viewed as an extension of MDS

  • Assumes that the data lies on a low-dimensional manifold (which looks Euclidean in small neighborhoods)

  • Given a dataset \(\calD=\{\bfx_i\}_{i=1}^N\in\bbR^d\), try to compute an estimate of the geodesic distance along the manifold

Figure: Swiss roll manifold
  • How do we estimate the geodesic distance?

Estimating geodesic distance

  • Compute shortest path using a proximity graph

  • Form a matrix \(\bfW\) as follows
    1. For every \(\bfx_i\) define local neighborhood \(\calN_i\) (e.g., \(k\) nearest neighbors, all \(\bfx_j\) s.t. \(\norm[2]{\bfx_j-\bfx_i}\leq \epsilon\))
    2. For each \(\bfx_j\in\calN_i\), set \(w_{ij}=\norm[2]{\bfx_i-\bfx_j}\) (no edge between \(\bfx_i\) and \(\bfx_j\) otherwise)
  • \(\bfW\) is a weighted adjacency matrix of the graph

  • Compute \(\bfD\) by setting \(d_{ij}\) to length of shortest path from node \(i\) to node \(j\) in graph described by \(\bfW\)

  • Can compute embedding similarly to MDS

  • Challenge: Isomap's distance estimates can become inaccurate for points that are far apart on the manifold
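  • A minimal sketch of the geodesic-distance estimate (nearest-neighbor graph plus graph shortest paths); the helper classical_mds from the earlier sketch and all names are illustrative assumptions

```python
# Isomap-style distances: build a k-NN graph, then take shortest-path distances.
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_distances(X, n_neighbors=10):
    """X: (d, N) data. Returns the (N, N) matrix of shortest-path distances."""
    N = X.shape[1]
    dist = np.sqrt(np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0))
    W = np.full((N, N), np.inf)                       # inf = no edge
    for i in range(N):
        nn = np.argsort(dist[i])[1:n_neighbors + 1]   # skip the point itself
        W[i, nn] = dist[i, nn]
    W = np.minimum(W, W.T)                            # symmetrize the proximity graph
    return shortest_path(W, method='D', directed=False)

# D_geo = isomap_distances(X); Theta = classical_mds(D_geo, k)  # embed as in classical MDS
```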

Locally linear embedding (LLE)

  • Idea: a data manifold that is globally nonlinear still appears linear in local pieces
    • Don’t try to explicitly model global geodesic distances
    • Try to preserve structure in data by patching together local pieces of the manifold
  • LLE algorithm for dataset \(\calD=\{\bfx_i\}_{i=1}^N\in\bbR^d\)
    1. For each \(\bfx_i\), define local neighborhood \(\calN_i\)
    2. Solve \[\min_\bfW \sum_{i=1}^N\norm[2]{\bfx_i-\sum_{j=1}^N w_{ij}\bfx_j}^2\text{ s.t. }\sum_{j}w_{ij}=1\text{ and }w_{ij}=0\text{ for }j\notin\calN_i\]
    3. Fix \(\bfW\) and solve \[\min_{\{\bftheta_i\}}\sum_{i=1}^N\norm[2]{\bftheta_i-\sum_{j=1}^Nw_{ij}\bftheta_j}^2\] subject to normalization constraints (e.g., \(\sum_i\bftheta_i=0\) and \(\frac{1}{N}\sum_i\bftheta_i\bftheta_i^\intercal=\bfI\)) that rule out the trivial solution \(\bftheta_i=0\)
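  • A sketch of step 2 for a single point: the reconstruction weights solve a small constrained least-squares problem (the regularization of the local Gram matrix is an illustrative choice, not from the lecture)

```python
# Reconstruction weights of one point from its neighborhood N_i.
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    """x_i: (d,), neighbors: (d, m) with columns in N_i. Returns w of length m."""
    Z = neighbors - x_i[:, None]                   # shift so that x_i sits at the origin
    G = Z.T @ Z                                    # local Gram matrix
    G += reg * np.trace(G) * np.eye(G.shape[0])    # regularize (needed when m > d)
    w = np.linalg.solve(G, np.ones(G.shape[0]))    # solve G w = 1
    return w / w.sum()                             # enforce sum_j w_ij = 1
```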

Locally linear embedding (LLE)

  • Eigenvalue problem in compact form: \[\min_{\bfTheta}\norm[F]{\bfTheta-\bfTheta\bfW^\intercal}^2=\min_{\bfTheta}\trace{\bfTheta(\bfI-\bfW)^\intercal(\bfI-\bfW)\bfTheta^\intercal}\]
    • Same type of trace-minimization problem encountered in PCA!
    • Use the eigendecomposition of \((\bfI-\bfW)^\intercal(\bfI-\bfW)\): keep the eigenvectors with the smallest nonzero eigenvalues (the constant eigenvector is discarded)
  • Can compute the embedding of a new point \(\bfx\) as \[\bftheta = \sum_{i=1}^N w_{\bfx,\bfx_i}\bftheta_i\] with \(w_{\bfx,\bfx_i}\) computed by solving the same constrained least-squares problem for \(\bfx\) and its neighbors
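  • A sketch of step 3 given the full weight matrix \(\bfW\) (illustrative names)

```python
# Embedding from the bottom eigenvectors of (I - W)^T (I - W), discarding the
# constant eigenvector associated with the (near-)zero eigenvalue.
import numpy as np

def lle_embedding(W, k):
    """W: (N, N) with rows summing to one. Returns Theta of shape (k, N)."""
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, V = np.linalg.eigh(M)                 # eigenvalues in ascending order
    return V[:, 1:k + 1].T                         # skip the constant eigenvector
```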

Demo notebook