Clustering

Matthieu Bloch

Clustering

  • Clustering problem: given samples \(\{\bfx_i\}_{i=1}^N\), assign points to disjoint subsets called clusters, so that points in the same cluster are more similar to each other than points in different clusters

  • A clustering is a map \(C:[1:N]\to[1:K]\), with \(K\) the number of clusters; how do we choose \(C\)?

  • Quality of a clustering \(C\) measured by the within-cluster point scatter \[W(C) = \frac{1}{2}\sum_{k=1}^K \sum_{i:C(i)=k}\sum_{j:C(j)=k}\norm[2]{\bfx_i-\bfx_{j}}^2\]

  • \(K\)-means clustering: find \(C\) minimizing \(W(C)\)

  • The number of possible clusterings of \(N\) points into \(K\) clusters is given by the Stirling number of the second kind \[S(N,K)=\frac{1}{K!}\sum_{k=1}^K(-1)^{K-k}{K \choose k}k^N\]

  • No known efficient search strategy for this space
    • exact solution by complete enumeration is computationally intractable (see the quick computation below)
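
As a rough illustration of how quickly this space grows, here is a minimal Python sketch (the function name `stirling2` is an arbitrary choice) that evaluates the formula above:

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n labeled points into k non-empty clusters."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1)) // factorial(k)

print(stirling2(10, 4))   # 34105
print(stirling2(19, 4))   # on the order of 10^10: complete enumeration is hopeless
```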

Sub-optimal K-means clustering

  • We want to find \(C^*=\argmin_C W(C)\)

  • \[C^* = \argmin_{C}\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfmu_k}^2\] where \(\bfmu_k=\frac{1}{N_k}\sum_{i:C(i)=k}\bfx_i\) and \(N_k=\card{\{i:C(i)=k\}}\) (a short derivation of this identity is sketched after this list)

  • For a fixed clustering \(C\) we have \[\bfmu_k=\argmin_{\bfm}\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm}^2\]
  • Solve the enlarged optimization problem \[\argmin_{C,\{\bfm_k\}}\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm_k}^2\]
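  • The identity behind this rewriting follows by expanding each squared distance around the cluster mean: \[\frac{1}{2}\sum_{i:C(i)=k}\sum_{j:C(j)=k}\norm[2]{\bfx_i-\bfx_j}^2 = \frac{1}{2}\sum_{i:C(i)=k}\sum_{j:C(j)=k}\left(\norm[2]{\bfx_i-\bfmu_k}^2+\norm[2]{\bfx_j-\bfmu_k}^2-2(\bfx_i-\bfmu_k)^\intercal(\bfx_j-\bfmu_k)\right) = N_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfmu_k}^2\] since the cross terms vanish because \(\sum_{i:C(i)=k}(\bfx_i-\bfmu_k)=0\)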

Alternating optimization procedure

  • To find the \(K\)-means solution, alternate between two subproblems (a code sketch follows this list)
    1. Given \(C\), choose \(\{\bfm_k\}\) to minimize \[\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm_k}^2\]
    2. Given \(\{\bfm_k\}\), choose \(C\) to minimize \[\sum_{k=1}^KN_k\sum_{i:C(i)=k}\norm[2]{\bfx_i-\bfm_k}^2\]
  • The solution to subproblem 1 is \[\bfm_k^* = \frac{1}{N_k}\sum_{i:C(i)=k}\bfx_i\]
  • The solution to subproblem 2 is \[C(i)=\argmin_k \norm[2]{\bfx_i-\bfm_k}^2\]
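
A minimal NumPy sketch of this alternating procedure; the function name, the initialization, and the convergence test are illustrative choices rather than part of the notes:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternating optimization for K-means: X has shape (N, d); returns assignments and means."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]  # initialize with random data points
    for _ in range(n_iter):
        # Subproblem 2: assign each point to its closest current mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        C = dists.argmin(axis=1)
        # Subproblem 1: recompute each mean as the centroid of its cluster
        new_means = np.array([X[C == k].mean(axis=0) if np.any(C == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):  # stop once the means no longer move
            break
        means = new_means
    return C, means
```

In practice one reruns this from several random initializations and keeps the clustering with the smallest \(W(C)\), as noted in the remarks below.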

K-means remarks

  • Algorithmic notes
    • Algorithm typically initialized with each \(\bfm_k\) set to a random point in the dataset
    • Run several random initializations to avoid poor local minima
  • Cluster boundaries are pieces of hyperplanes
    • Regions are intersections of halfspaces, hence convex
    • \(K\)-means fails if clusters are non-convex
    • The geometry changes if we replace the \(\ell_2\) norm with another metric

Gaussian mixture models

  • Extend the idea behind \(K\)-means clustering to allow for more general cluster shapes
    • clusters are elliptical
    • each cluster can be modeled using a multivariate Gaussian density \[\phi(\bfx;\bfmu,\bfSigma) = \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{\det{\bfSigma}}}\exp\left(-\frac{1}{2}(\bfx-\bfmu)^\intercal\bfSigma^{-1}(\bfx-\bfmu)\right)\] with \(\bfx\in\bbR^d\), \(\bfmu\in\bbR^d\), \(\bfSigma\succcurlyeq 0\in\bbR^{d\times d}\)
    • the full data set is modeled using a Gaussian mixture model (GMM) \[f(\bfx) = \sum_{k=1}^K \alpha_k\phi(\bfx;\bfmu_k,\bfSigma_k)\] where \(\alpha_k\in[0,1]\), \(\sum_k \alpha_k=1\), \(\bfmu_k\in\bbR^d\), \(\bfSigma_k\succcurlyeq 0\in\bbR^{d\times d}\)
    • clusters are estimated by performing MLE of the GMM parameters
  • Interpretation of GMM (a sampling sketch follows this list)
    • State variable \(S\in\intseq{1}{K}\) such that \(\P{S=k} = \alpha_k\); conditioned on \(S=k\), \(\bfx\) is drawn from \(\phi(\cdot;\bfmu_k,\bfSigma_k)\)
    • every observed sample comes with a hidden realization of the state variable
    • the challenge is to perform clustering without observing the hidden states
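
As a small illustration of this generative view (not part of the lecture), the sketch below samples from a two-component GMM by first drawing the hidden state and then the corresponding Gaussian, and evaluates the mixture density \(f(\bfx)\); the parameter values are made up, and SciPy's `multivariate_normal` provides \(\phi\):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM in R^2 (parameters chosen arbitrarily)
alphas = np.array([0.3, 0.7])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

rng = np.random.default_rng(0)

def sample_gmm(n):
    """Draw n samples: hidden state S ~ alpha first, then x | S=k ~ N(mu_k, Sigma_k)."""
    s = rng.choice(len(alphas), size=n, p=alphas)  # hidden states, unobserved when clustering
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in s])
    return x, s

def gmm_density(x):
    """Evaluate f(x) = sum_k alpha_k * phi(x; mu_k, Sigma_k)."""
    return sum(a * multivariate_normal(m, S).pdf(x) for a, m, S in zip(alphas, mus, Sigmas))

X, S_hidden = sample_gmm(500)
print(gmm_density(np.array([1.0, 1.0])))
```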

Maximum likelihood estimation

  • MLE of a single multivariate Gaussian: closed form

  • MLE of a mixture of multivariate Gaussians with incomplete data (hidden states not observed): no closed form

  • MLE of a mixture of multivariate Gaussians with complete data (hidden states observed): closed form per cluster (expressions sketched after this list)
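
  • For reference, the standard closed-form expressions in the two tractable cases are \[\hat{\bfmu} = \frac{1}{N}\sum_{i=1}^N\bfx_i,\qquad \hat{\bfSigma} = \frac{1}{N}\sum_{i=1}^N(\bfx_i-\hat{\bfmu})(\bfx_i-\hat{\bfmu})^\intercal\] for a single Gaussian, and, with complete data \(\{(\bfx_i,s_i)\}\), the per-cluster estimates \(\hat{\alpha}_k=\frac{N_k}{N}\), \(\hat{\bfmu}_k=\frac{1}{N_k}\sum_{i:s_i=k}\bfx_i\), \(\hat{\bfSigma}_k=\frac{1}{N_k}\sum_{i:s_i=k}(\bfx_i-\hat{\bfmu}_k)(\bfx_i-\hat{\bfmu}_k)^\intercal\) with \(N_k=\card{\{i:s_i=k\}}\); the incomplete-data case has no such closed form, which motivates the EM algorithm below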

Expectation-Maximization (EM) algorithm

  • Efficient algorithm to address incomplete data

  • The key idea is to work with the complete-data likelihood and average out the unobserved hidden states

  • EM Algorithm
    1. Initialize \(\bftheta^{(0)}\)
    2. For \(j\geq 1\)
      • E step: evaluate \(Q(\bftheta^{(j)},\bftheta)\) where \[Q(\bftheta^{(j)},\bftheta) = \E[\{s_i\}|\{\bfx_i\},\bftheta^{(j)}]{\ell(\bftheta, \{\bfx_i\},\{s_i\})}\]
      • M step: set \(\bftheta^{(j+1)} = \argmax_{\bftheta}Q(\bftheta^{(j)},\bftheta)\)
  • The likelihood is non-decreasing across iterations, so the algorithm improves monotonically (though it may only converge to a local maximum); a GMM-specific sketch follows this list
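
Below is a compact sketch of EM specialized to a GMM, with the E step computing the responsibilities \(\P{S=k\mid\bfx_i,\bftheta^{(j)}}\) and the M step re-estimating \((\alpha_k,\bfmu_k,\bfSigma_k)\); the function name, the fixed iteration count, and the small covariance regularizer are illustrative choices, not part of the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: returns weights, means, covariances, responsibilities."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    alphas = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma[i, k] = P(S = k | x_i, current parameters)
        pdfs = np.column_stack([alphas[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                                for k in range(K)])
        gamma = pdfs / pdfs.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log-likelihood
        Nk = gamma.sum(axis=0)
        alphas = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        Sigmas = np.array([
            ((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k] + 1e-6 * np.eye(d)
            for k in range(K)
        ])
    return alphas, mus, Sigmas, gamma
```

Each point can then be assigned to the cluster with the largest responsibility, e.g. `gamma.argmax(axis=1)`, recovering a soft analogue of the \(K\)-means assignment step.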