Tuesday, February 5, 2008

An Investigation of Computational and Informational Limits in Gaussian Mixture Clustering


I thought it was about the ability of GMMs in density estimation; somehow it wasn't. I am not familiar with the literature on the clustering ability of GMMs, though I know they are trained with the EM algorithm, which is usually the textbook example of EM in general.

They put the whole problem under very restricted conditions where theoretical analysis is feasible: they assume each Gaussian has equal mixing probability and all Gaussians share a spherical (scalar diagonal) covariance matrix. They list several important results on the sample size necessary for training a GMM to lead to correct clustering. The factors involved are the sample size N, the separation s between cluster means, the number of clusters k, and the dimension d. They further assume the covariance matrix is the identity, so only s needs to be considered instead of the ratio s/σ.
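To make the setting concrete, here is a minimal sketch (in NumPy; the function and argument names are mine, not the paper's) of the generative model they analyze: k equal-weight spherical Gaussians with σ = 1 and pairwise mean separation s.

```python
import numpy as np

def sample_spherical_gmm(N, k, d, s, seed=0):
    """Draw N points from the restricted model: k equal-weight
    Gaussians in R^d, identity covariance (sigma = 1), pairwise
    mean separation s.  All names here are mine, not the paper's."""
    rng = np.random.default_rng(seed)
    # Scaled standard-basis vectors make every pair of means exactly
    # distance s apart: ||a*e_i - a*e_j|| = a*sqrt(2) with a = s/sqrt(2).
    # Assumes k <= d, in line with the high-dimensional setting.
    means = (s / np.sqrt(2)) * np.eye(k, d)
    labels = rng.integers(0, k, size=N)              # uniform mixing weights
    X = means[labels] + rng.standard_normal((N, d))  # unit-variance noise
    return X, labels
```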

They mention several tips for training with EM:
  • PCA helps clustering with a GMM. The algorithm projects the data onto a (k-1)-dimensional subspace and applies EM there to train a model that gives each sample its posterior probability for each cluster; these values are then substituted into the EM training in the original space. This kind of cascaded EM improves clustering results (see the first sketch after this list).
  • They use a pruning strategy: set a deliberately larger number of components k', which is O(k log k), then merge nearby clusters to get the final clustering result (see the second sketch after this list).
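One plausible reading of the first tip, sketched with scikit-learn (this is my guess at the cascade, not the paper's exact algorithm): fit EM in the (k-1)-D PCA subspace, then use the resulting responsibilities to warm-start EM in the original space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def cascaded_em(X, k, seed=0):
    """Run EM in the top-(k-1) PCA subspace, then use the resulting
    soft assignments to warm-start EM in the original space."""
    Z = PCA(n_components=k - 1).fit_transform(X)        # (k-1)-D projection
    low = GaussianMixture(n_components=k, covariance_type="spherical",
                          random_state=seed).fit(Z)
    resp = low.predict_proba(Z)                         # per-sample cluster probabilities
    # Responsibility-weighted means in the ORIGINAL space seed the second EM.
    means_init = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return GaussianMixture(n_components=k, covariance_type="spherical",
                           means_init=means_init, random_state=seed).fit(X)
```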
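And a sketch of the second tip, the overfit-then-prune idea; the greedy merging rule below is my own placeholder, as the paper's exact merging criterion may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.mixture import GaussianMixture

def overfit_then_merge(X, k, seed=0):
    """Fit k' = O(k log k) components, then greedily merge the
    closest pair of means until only k remain."""
    k_prime = max(k, int(np.ceil(k * np.log(k + 1))))  # illustrative choice of k'
    gm = GaussianMixture(n_components=k_prime, covariance_type="spherical",
                         random_state=seed).fit(X)
    means = list(gm.means_)
    while len(means) > k:
        D = squareform(pdist(np.asarray(means)))
        np.fill_diagonal(D, np.inf)
        i, j = np.unravel_index(np.argmin(D), D.shape)
        means[i] = (means[i] + means[j]) / 2           # merge the closest pair
        means.pop(j)
    return np.asarray(means)
```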

Their main result is a procedure that illustrates three phases of GMM training with respect to the sample size N:
  • Given too little data, even the MLE does not correspond to a good clustering (it yields a wrong one), and EM will not find the MLE anyway, resulting in an unstable model. This is the random phase.
  • Given more data, EM still does not find the MLE, but the MLE now does correspond to a good clustering and hence is what we desire. This is the gap phase.
  • Given enough data, EM finds the MLE, which is the clustering we desire. This is the success phase.
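A crude way to see the three phases empirically (my own toy probe, not the paper's procedure): for growing N, compare plain EM against EM warm-started at the true means, as a stand-in for the MLE. In the random phase even the warm-started fit clusters badly; in the gap phase it clusters well while plain EM does not; in the success phase the two agree.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

def clustering_error(y_true, y_pred, k):
    """Error rate under the best matching of predicted to true labels."""
    C = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)
    return 1.0 - C[rows, cols].sum() / len(y_true)

def phase_probe(Ns, k=3, d=10, s=3.0, seed=0):
    """For growing N, compare plain EM with EM warm-started at the
    true means (a stand-in for the MLE).  This just prints the raw
    numbers; deciding which phase you are in is left to the reader."""
    rng = np.random.default_rng(seed)
    means = (s / np.sqrt(2)) * np.eye(k, d)
    for N in Ns:
        y = rng.integers(0, k, size=N)
        X = means[y] + rng.standard_normal((N, d))
        em = GaussianMixture(n_components=k, covariance_type="spherical",
                             random_state=seed).fit(X)
        warm = GaussianMixture(n_components=k, covariance_type="spherical",
                               means_init=means, random_state=seed).fit(X)
        print(f"N={N:6d}  EM logL={em.score(X):.3f}  "
              f"warm logL={warm.score(X):.3f}  "
              f"EM err={clustering_error(y, em.predict(X), k):.2f}  "
              f"warm err={clustering_error(y, warm.predict(X), k):.2f}")

phase_probe([50, 200, 1000, 5000])
```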

They derive approximations of the sample sizes at which these phase transitions take place, from which they can suggest when the gap phase exists.

The difficulty is how to judge, for a real data set, whether we are in the gap phase. We would also have to extend their results to more general cases. I suspect that most of the time we are dealing with the random phase and the gap phase :-(
