Thursday, March 26, 2009

Kernel Independent Component Analysis


ICA has long been studied by researchers in signal processing, machine learning and related fields. Although independence can be characterized mathematically with mutual information, mutual information is not directly computable in practice. Many ideas have been proposed: approximating the mutual information (e.g. with moments) or finding an alternative contrast function (e.g. kurtosis).

In recent years I have noticed that Bach has published lots of papers in machine learning, and I found out he is another student of Jordan. What a supervisor! After reading this piece of work, I have no doubt about the novelty of their approach to ICA, and it relates closely to later work. The core idea of ICA, independent components, must be formulated in some concrete way, and a measure of independence computed this way will surely be applicable to other problems as well; e.g. one notion in dimensionality reduction for regression (supervised dimensionality reduction) is sufficiency: finding a subspace such that the response y is independent of x conditioned on the projection of x onto that subspace. This is therefore the key idea for understanding the papers to be scanned next.

There are several steps in tackling independence:
  • From correlation to independence: if x and z are independent, then f(x) and g(z) are independent and hence uncorrelated (correlation coefficient 0). For a fixed pair (f, g) we cannot conclude that x and z are independent just because f(x) and g(z) are uncorrelated, but x and z are independent if f(x) and g(z) are uncorrelated for every pair (f, g) (for example, letting f and g range over the complex exponentials of the Fourier transform recovers the characteristic-function criterion for independence).
  • A kernel gives us an RKHS, i.e. a whole space of functions to apply to the random variables under test. Empirically, with finite samples, the quantity used for ICA is the maximal correlation between f(x) and g(z) over all pairs (f, g) in the RKHS (the F-correlation), which is then minimized over the demixing matrix.
  • The maximal correlation can be computed with the standard CCA machinery or its kernelized version (a generalized eigenvalue problem), and this extends to more than two r.v.s (variables or vectors). It is argued that the smallest eigenvalue of the (C, D) generalized eigenvalue problem, where C is the covariance matrix with blocks K_i K_j in kernelized form and D is the regularized block-diagonal variance matrix with blocks (K_i + s I)^2, is the more suitable quantity. (This gives KCCA as a contrast function.)
  • The last idea is that when the r.v.s are Gaussian, the mutual information is -log(|C| / |D|) / 2, so we just need to minimize this quantity computed from the kernelized (C, D) above. This is another contrast function, the kernel generalized variance (KGV). A small code sketch of both contrasts follows this list.
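
To make the two contrasts concrete, here is a minimal NumPy sketch; the function names, the Gaussian kernel of width sigma and the regularizer s = N*kappa/2 are my own choices, not necessarily the paper's exact settings. It builds the (C, D) pencil from centered Gram matrices and returns both contrast values. It forms the dense N x N blocks explicitly, so it is only meant for small samples (the implementation notes below are exactly about avoiding this).

    import numpy as np

    def centered_gram(y, sigma=1.0):
        # Gaussian-kernel Gram matrix of one estimated component y (1-D array),
        # centered in feature space
        d2 = (y[:, None] - y[None, :]) ** 2
        K = np.exp(-d2 / (2.0 * sigma ** 2))
        N = len(y)
        H = np.eye(N) - np.ones((N, N)) / N
        return H @ K @ H

    def kcca_kgv_contrast(components, kappa=2e-2, sigma=1.0):
        # components: list of m 1-D arrays of length N (current source estimates)
        N = len(components[0])
        m = len(components)
        Ks = [centered_gram(y, sigma) for y in components]
        Kreg = [K + (N * kappa / 2.0) * np.eye(N) for K in Ks]   # K_i + s I

        C = np.empty((m * N, m * N))
        D = np.zeros((m * N, m * N))
        for i in range(m):
            for j in range(m):
                if i == j:
                    block = Kreg[i] @ Kreg[i]                    # (K_i + s I)^2
                    D[i*N:(i+1)*N, i*N:(i+1)*N] = block
                else:
                    block = Ks[i] @ Ks[j]                        # K_i K_j
                C[i*N:(i+1)*N, j*N:(j+1)*N] = block

        # Generalized eigenvalues of (C, D): factor D = L L^T and take the
        # ordinary eigenvalues of L^{-1} C L^{-T}
        L = np.linalg.cholesky(D)
        Linv = np.linalg.inv(L)
        lam = np.linalg.eigvalsh(Linv @ C @ Linv.T)

        kcca = -0.5 * np.log(lam.min())      # contrast from the smallest eigenvalue
        kgv = -0.5 * np.sum(np.log(lam))     # = -log(|C| / |D|) / 2
        return kcca, kgv

Both values are non-negative and should be close to 0 when the components are (nearly) independent, which is why they can serve as contrast functions to minimize.
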
So these ideas must be turned into a working algorithm. There are several important points:
  • As usual in ICA, we first whiten the data Y so that it has zero mean and identity covariance. Then we only need to find an orthogonal matrix instead of a full affine transform.
  • The generalized eigenvalue problem defined by (C, D) is too expensive to compute directly on the full Gram matrices. Therefore the Gram matrices are approximated with an incomplete Cholesky decomposition (another candidate is the Nyström approximation).
  • Notice that the optimization is constrained to orthogonal matrices, i.e. it lives on a Stiefel manifold. We must use the gradient on this manifold (so as not to break the constraint). The overall optimization is simply steepest gradient descent; a toy sketch of this loop follows the list.
  • The authors then propose two possible initializations for the gradient descent. One point is that a kernel implying complicated features may make the optimization tough, while a simpler kernel such as a polynomial kernel may be easier to optimize.
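
Putting the pieces of this list together, here is a toy sketch of the outer loop (reusing kcca_kgv_contrast from the sketch above): whiten the mixtures, then minimize the KGV over orthogonal demixing matrices by stepping along geodesics of the orthogonal group. The finite-difference gradient and the crude halving line search are only there to keep the sketch short and runnable on tiny examples; the paper instead derives the exact gradient of the contrast and combines it with the low-rank Gram approximations.

    import numpy as np
    from scipy.linalg import expm

    def whiten(X):
        # X: (N, m) observed mixtures -> zero mean, identity covariance
        Xc = X - X.mean(axis=0)
        cov = Xc.T @ Xc / len(Xc)
        val, vec = np.linalg.eigh(cov)
        return Xc @ vec @ np.diag(val ** -0.5) @ vec.T

    def kernel_ica(X, n_iter=30, step=1.0):
        Xw = whiten(X)
        m = Xw.shape[1]
        W = np.eye(m)      # trivial start; one of the proposed initializations would go here

        def J(W):          # KGV contrast of the current source estimates
            return kcca_kgv_contrast(list((Xw @ W.T).T))[1]

        for _ in range(n_iter):
            # Euclidean gradient by finite differences (the paper uses the exact gradient)
            f0, eps = J(W), 1e-4
            G = np.zeros_like(W)
            for i in range(m):
                for j in range(m):
                    Wp = W.copy()
                    Wp[i, j] += eps
                    G[i, j] = (J(Wp) - f0) / eps
            # Skew-symmetric descent direction; expm of a skew-symmetric matrix is a
            # rotation, so the update never leaves the set of orthogonal matrices
            S = G @ W.T - W @ G.T
            t = step
            while J(expm(-t * S) @ W) > f0 and t > 1e-6:   # crude halving line search
                t /= 2.0
            W = expm(-t * S) @ W
        return W, Xw @ W.T                  # demixing rotation and recovered sources

As a quick sanity check, one can mix a couple of non-Gaussian sources with a random matrix and verify that the recovered components match the originals up to permutation and sign.
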
This journal paper contains much more information about everything :-D I will return to it when considering implementation.
