Clustering

Matthieu Bloch

Clustering

  • Clustering problem: given samples \(\{\bfx_i\}_{i=1}^N\), assign points to disjoint subsets called clusters, so that points in the same cluster are more similar to each other than points in different clusters

  • A clustering is a map \(C:[1:N]\to[1:K]\), with \(K\) the number of clusters; how do we choose \(C\)?

  • Quality of a clustering \(C\) measured by the within-cluster point scatter \[W(C) = \frac{1}{2}\sum_{k=1}^K \sum_{i:C(i)=k}\sum_{j:C(j)=k}\norm[2]{\bfx_i-\bfx_{j}}^2\]

  • \(K\)-means clustering: find \(C\) minimizing \(W(C)\)

  • The number of possible clusterings of \(N\) points into \(K\) clusters is given by the Stirling number of the second kind \[S(N,K)=\frac{1}{K!}\sum_{k=1}^K(-1)^{K-k}{K \choose k}k^N\]

  • No known efficient search strategy for this space
    • exact solution by complete enumeration is computationally intractable (see the quick computation below)
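
As a rough illustration of how quickly this space grows, here is a minimal Python sketch (the function name `stirling2` is an arbitrary choice) that evaluates the formula above:

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labeled points into k non-empty clusters."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1)) // factorial(k)

print(stirling2(10, 4))   # 34105
print(stirling2(19, 4))   # on the order of 10^10: complete enumeration is hopeless
```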

Sub-optimal K-means clustering

  • We want to find \(C^*=\argmin_C W(C)\)

  • \[C^* = \argmin_{C}\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfmu_k}^2\] where \(\bfmu_k=\frac{1}{N_k}\sum_{i:C(i)=k}\bfx_i\) and \(N_k=\card{\{i:C(i)=k\}}\) (a short derivation of this identity is sketched after this list)

  • For a fixed clustering \(C\) we have \[\bfmu_k=\argmin_{\bfm}\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm}^2\]
  • Solve the enlarged optimization problem \[\argmin_{C,\{\bfm_k\}}\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm_k}^2\]
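  • The identity behind this rewriting follows by expanding each squared distance around the cluster mean: \[\frac{1}{2}\sum_{i:C(i)=k}\sum_{j:C(j)=k}\norm[2]{\bfx_i-\bfx_j}^2 = \frac{1}{2}\sum_{i:C(i)=k}\sum_{j:C(j)=k}\left(\norm[2]{\bfx_i-\bfmu_k}^2+\norm[2]{\bfx_j-\bfmu_k}^2-2(\bfx_i-\bfmu_k)^\intercal(\bfx_j-\bfmu_k)\right) = N_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfmu_k}^2\] since the cross terms vanish because \(\sum_{i:C(i)=k}(\bfx_i-\bfmu_k)=0\)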

Alternating optimization procedure

  • To find the \(K\)-means solution, alternate between two subproblems (a code sketch follows this list)
    1. Given \(C\), choose \(\{\bfm_k\}\) to minimize \[\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm_k}^2\]
    2. Given \(\{\bfm_k\}\), choose \(C\) to minimize \[\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm_k}^2\]
  • The solution to subproblem 1 is \[\bfm_k^* = \frac{1}{N_k}\sum_{i:C(i)=k}\bfx_i\]
  • The solution to subproblem 2 is \[C(i)=\argmin_k \norm[2]{\bfx_i-\bfm_k}^2\]
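
A minimal NumPy sketch of this alternating procedure; the function name, the initialization, and the convergence test are illustrative choices rather than part of the notes:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternating optimization for K-means: X has shape (N, d); returns assignments and means."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]  # initialize with random data points
    for _ in range(n_iter):
        # Subproblem 2: assign each point to its closest current mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        C = dists.argmin(axis=1)
        # Subproblem 1: recompute each mean as the centroid of its cluster
        new_means = np.array([X[C == k].mean(axis=0) if np.any(C == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):  # stop once the means no longer move
            break
        means = new_means
    return C, means
```

In practice one reruns this from several random initializations and keeps the clustering with the smallest \(W(C)\), as noted in the remarks below.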

K-means remarks

  • Algorithmic notes
    • Algorithm typically initialized with each \(\bfm_k\) set to a random point in the dataset
    • Run several random initializations to avoid poor local minima
  • Cluster boundaries are pieces of hyperplanes
    • Regions are intersections of halfspaces, hence convex
    • \(K\)-means fails if clusters are non-convex
    • The geometry changes if we replace the \(\ell_2\) norm with another metric

Gaussian mixture models

  • Extend the idea behind \(K\)-means clustering to allow for more general cluster shapes
    • clusters are elliptical
    • each cluster can be modeled using a multivariate Gaussian density \[\phi(\bfx;\bfmu,\bfSigma) = \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{\det{\bfSigma}}}\exp\left(-\frac{1}{2}(\bfx-\bfmu)^\intercal\bfSigma^{-1}(\bfx-\bfmu)\right)\] with \(\bfx\in\bbR^d\), \(\bfmu\in\bbR^d\), \(\bfSigma\succcurlyeq 0\in\bbR^{d\times d}\)
    • the full data set is modeled using a Gaussian mixture model (GMM) \[f(\bfx) = \sum_{k=1}^K \alpha_k\phi(\bfx;\bfmu_k,\bfSigma_k)\] where \(\alpha_k\in[0,1]\), \(\sum_k \alpha_k=1\), \(\bfmu_k\in\bbR^d\), \(\bfSigma_k\succcurlyeq 0\in\bbR^{d\times d}\)
    • clusters are estimated by performing MLE of the GMM parameters
  • Interpretation of GMM (a sampling sketch follows this list)
    • State variable \(S\in\intseq{1}{K}\) such that \(\P{S=k} = \alpha_k\); conditioned on \(S=k\), \(\bfx\) is drawn from \(\phi(\cdot;\bfmu_k,\bfSigma_k)\)
    • every observed sample comes with a hidden realization of the state variable
    • the challenge is to perform clustering without observing the hidden states
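
As a small illustration of this generative view (not part of the lecture), the sketch below samples from a two-component GMM by first drawing the hidden state and then the corresponding Gaussian, and evaluates the mixture density \(f(\bfx)\); the parameter values are made up, and SciPy's `multivariate_normal` provides \(\phi\):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM in R^2 (parameters chosen arbitrarily)
alphas = np.array([0.3, 0.7])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

rng = np.random.default_rng(0)

def sample_gmm(n):
    """Draw n samples: hidden state S ~ alpha first, then x | S=k ~ N(mu_k, Sigma_k)."""
    s = rng.choice(len(alphas), size=n, p=alphas)  # hidden states, unobserved when clustering
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in s])
    return x, s

def gmm_density(x):
    """Evaluate f(x) = sum_k alpha_k * phi(x; mu_k, Sigma_k)."""
    return sum(a * multivariate_normal(m, S).pdf(x) for a, m, S in zip(alphas, mus, Sigmas))

X, S_hidden = sample_gmm(500)
print(gmm_density(np.array([1.0, 1.0])))
```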

Maximum likelihood estimation

  • MLE of a single multivariate Gaussian: closed form

  • MLE of a mixture of multivariate Gaussians with incomplete data (hidden states not observed): no closed form

  • MLE of a mixture of multivariate Gaussians with complete data (hidden states observed): closed form per cluster (expressions sketched after this list)
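
  • For reference, the standard closed-form expressions in the two tractable cases are \[\hat{\bfmu} = \frac{1}{N}\sum_{i=1}^N\bfx_i,\qquad \hat{\bfSigma} = \frac{1}{N}\sum_{i=1}^N(\bfx_i-\hat{\bfmu})(\bfx_i-\hat{\bfmu})^\intercal\] for a single Gaussian, and, with complete data \(\{(\bfx_i,s_i)\}\), the per-cluster estimates \(\hat{\alpha}_k=\frac{N_k}{N}\), \(\hat{\bfmu}_k=\frac{1}{N_k}\sum_{i:s_i=k}\bfx_i\), \(\hat{\bfSigma}_k=\frac{1}{N_k}\sum_{i:s_i=k}(\bfx_i-\hat{\bfmu}_k)(\bfx_i-\hat{\bfmu}_k)^\intercal\) with \(N_k=\card{\{i:s_i=k\}}\); the incomplete-data case has no such closed form, which motivates the EM algorithm below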

Expectation-Maximization (EM) algorithm

  • Efficient algorithm to address incomplete data

  • The key idea is to work with the complete-data likelihood and average out the unobserved hidden states

  • EM Algorithm
    1. Initialize \(\bftheta^{(0)}\)
    2. For \(j\geq 1\)
      • E step: evaluate \(Q(\bftheta^{(j)},\bftheta)\) where \[Q(\bftheta^{(j)},\bftheta) = \E[\{s_i\}|\{\bfx_i\},\bftheta^{(j)}]{\ell(\bftheta, \{\bfx_i\},\{s_i\})}\]
      • M step: set \(\bftheta^{(j+1)} = \argmax_{\bftheta}Q(\bftheta^{(j)},\bftheta)\)
  • The likelihood is non-decreasing across iterations, so the algorithm improves monotonically (though it may only converge to a local maximum); a GMM-specific sketch follows this list
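
Below is a compact sketch of EM specialized to a GMM, with the E step computing the responsibilities \(\P{S=k\mid\bfx_i,\bftheta^{(j)}}\) and the M step re-estimating \((\alpha_k,\bfmu_k,\bfSigma_k)\); the function name, the fixed iteration count, and the small covariance regularizer are illustrative choices, not part of the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: returns weights, means, covariances, responsibilities."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    alphas = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma[i, k] = P(S = k | x_i, current parameters)
        pdfs = np.column_stack([alphas[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                                for k in range(K)])
        gamma = pdfs / pdfs.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log-likelihood
        Nk = gamma.sum(axis=0)
        alphas = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        Sigmas = np.array([
            ((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k] + 1e-6 * np.eye(d)
            for k in range(K)
        ])
    return alphas, mus, Sigmas, gamma
```

Each point can then be assigned to the cluster with the largest responsibility, e.g. `gamma.argmax(axis=1)`, recovering a soft analogue of the \(K\)-means assignment step.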