Sunday, May 17, 2009

K-means Clustering via Principal Component Analysis


The most important result of this paper is that the principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, and that the subspace spanned by the cluster centroids is given by the spectral expansion of the data covariance matrix truncated at K-1 terms.

I'd say their discovery is very similar to the relationship between the normalized cut and the spectral expansion of the Laplacian matrix. Let
J_K = \sum_{k = 1}^K \sum_{i \in C_k} \|x_i - m_k \|^2
be the objective of K-means clustering,
d(C_k, C_l) = \sum_{i \in C_k} \sum_{j \in C_l} \| x_i - x_j \|^2
be the distance between two clusters. It can be shown that
J_K = N \bar{y}^2 - \frac{1}{2} J_D,
where N \bar{y}^2 = \sum_i \| x_i \|^2 (the data are assumed centered) and, in the 2-way clustering case,
J_D = \frac{N_1 N_2}{N} \left( 2 \frac{d(C_1, C_2)}{N_1 N_2} - \frac{d(C_1, C_1)}{N_1^2} - \frac{d(C_2, C_2)}{N_2^2}\right).
Further, we have
\frac{d(C_1, C_2)}{N_1 N_2} = \frac{d(C_1, C_1)}{2 N_1^2} + \frac{d(C_2, C_2)}{2 N_2^2} + \| m_1 - m_2 \|^2.
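
Both identities are easy to check numerically. Below is a minimal sketch in Python with NumPy, on random centered data and an arbitrary 2-way partition of my own choosing; the variable names are mine, not the paper's.

# Numerical check of the two identities above, assuming centered data
# and an arbitrary 2-way partition (a toy setup of my own).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)                      # center the data, as the paper assumes
mask = rng.random(50) < 0.4              # an arbitrary 2-way partition
C1, C2 = X[mask], X[~mask]
N1, N2, N = len(C1), len(C2), len(X)

def d(A, B):
    # d(C_k, C_l) = sum_{i in C_k} sum_{j in C_l} ||x_i - x_j||^2
    return np.sum((A[:, None, :] - B[None, :, :]) ** 2)

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)
J_K = np.sum((C1 - m1) ** 2) + np.sum((C2 - m2) ** 2)   # K-means objective
J_D = (N1 * N2 / N) * (2 * d(C1, C2) / (N1 * N2)
                       - d(C1, C1) / N1 ** 2
                       - d(C2, C2) / N2 ** 2)

# J_K = N * ybar^2 - J_D / 2, with N * ybar^2 = sum_i ||x_i||^2
assert np.isclose(J_K, np.sum(X ** 2) - J_D / 2)
# cross-cluster identity with the 1/2 factors
assert np.isclose(d(C1, C2) / (N1 * N2),
                  d(C1, C1) / (2 * N1 ** 2) + d(C2, C2) / (2 * N2 ** 2)
                  + np.sum((m1 - m2) ** 2))
print("both identities hold")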
The important theorem they prove says that in 2-way clustering, the continuous solution of the cluster indicator vector is given by the leading principal component: the points whose entries are positive form one cluster and those negative the other. The optimal value of the K-means objective in this case can be bounded by the leading spectrum,
N \bar{y}^2 - \lambda_1 \leq J_2 \leq N \bar{y}^2,
where \lambda_1 is the principal eigenvalue of the centered Gram matrix.
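Here is a hedged sketch of the theorem on a synthetic two-Gaussian dataset (the setup and the scikit-learn usage are my choices, not the paper's): threshold the first principal component at zero, compare against 2-means, and check the spectral bound.

# Sign of the first PC vs. 2-means, plus the bound N*ybar^2 - lambda_1 <= J_2 <= N*ybar^2.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(60, 2)),
               rng.normal(+2, 1, size=(60, 2))])
X -= X.mean(axis=0)                          # center, so PCA = SVD of X

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
J_2 = km.inertia_                            # K-means objective of the solution found

U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ Vt[0]                              # scores on the first principal component
labels_pca = (pc1 > 0).astype(int)           # threshold the continuous indicator at 0

agreement = max(np.mean(labels_pca == km.labels_),
                np.mean(labels_pca != km.labels_))   # labels are defined up to a swap
lambda_1 = S[0] ** 2                         # principal eigenvalue of the Gram matrix
total = np.sum(X ** 2)                       # N * ybar^2 for centered data

print(f"PCA-sign vs 2-means agreement: {agreement:.3f}")
print(f"{total - lambda_1:.1f} <= J_2 = {J_2:.1f} <= {total:.1f}")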
If we create a fully connected graph with a uniform weight of \frac{1}{N} on every edge, is the linearization of the N-cut algorithm then PCA, and is N-cut itself simply K-means? The difference is that PCA is done in the feature space, while spectral clustering works in the embedding space that Laplacian eigenmaps learns.

For K-way clustering we have similar results, but unlike the 2-way case, we cannot interpret the individual principal components as cluster indicators in the same way. Maybe the matting paper has a better interpretation. Their second result is about the centroids: the subspace the cluster centroids span is captured by the top K-1 principal directions, as the sketch below illustrates.
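A small sketch of this centroid result, again on a synthetic mixture of my own (a demonstration under assumptions, not the paper's experiment): compare the span of the fitted centroids with the span of the top K-1 principal directions via principal angles.

# Principal angles between the centroid subspace and the top K-1 PC subspace.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
K, D = 3, 5
centers = rng.normal(0, 4, size=(K, D))
X = np.vstack([c + rng.normal(size=(80, D)) for c in centers])
X -= X.mean(axis=0)

km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
M = km.cluster_centers_            # K centroids; for centered data their
                                   # weighted mean is zero, so they span
                                   # an (essentially) (K-1)-dim subspace
Qc, _ = np.linalg.qr(M.T)          # orthonormal basis of the centroid span

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Qp = Vt[:K - 1].T                  # top K-1 principal directions

# Cosines of the principal angles; values near 1 mean the subspaces agree.
cosines = np.linalg.svd(Qc.T @ Qp, compute_uv=False)
print("cosines of principal angles:", np.round(cosines, 3))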

This result has several implications. Spectral clustering is done in the embedding space learned by Laplacian eigenmaps; PCA/K-means is done in the feature space; KPCA/kernel K-means is done in the kernel-induced space. So if we consider the unsupervised KDR idea, can we find a similar clustering interpretation?
