
Monday, May 11, 2009

Measuring Statistical Dependence with Hilbert-Schmidt Norms


This paper uses the Hilbert-Schmidt norm instead of the operator (spectral) norm (which corresponds to the COCO independence test) in the previously scanned paper, resulting in the so-called HSIC (Hilbert-Schmidt independence criterion), which is in practice much easier to calculate. Compared with COCO, which uses the square root of the largest singular value of \tilde{K}^{(x)} \tilde{K}^{(y)}, HSIC uses the whole spectrum, i.e. the trace of \tilde{K}^{(x)} \tilde{K}^{(y)}. Since the Frobenius norm is an upper bound for the operator norm, a similar equivalence between a zero criterion and independence can be obtained.
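Not from the paper itself, but as a quick illustration: a minimal numpy sketch of the biased empirical HSIC with a Gaussian kernel. The bandwidth sigma and the 1/n^2 normalization are my conventional choices here, not something this post prescribes.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # biased empirical HSIC: (1/n^2) tr(K~x K~y), with K~ = H K H centered
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kx = H @ gaussian_gram(X, sigma) @ H
    Ky = H @ gaussian_gram(Y, sigma) @ H
    return np.trace(Kx @ Ky) / n ** 2
```

For independent samples this should be near zero, and noticeably larger for dependent ones (e.g. Y = X + small noise).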

Please notice the relationship of KCC, COCO and HSIC. For KCC, we use the maximum kernel correlation as the criterion for the independence test, but then we have to use the regularized version to avoid the common case where it does not give the desired result. In COCO, we constrain both covariances to identity, and the problem becomes seeking the largest singular value of the product of the two centered Gram matrices, while KCC solves a generalized eigenvalue problem. Therefore, when we deal with multiple r.v.s, we can handle them in the same way CCA does. So in a way you can see why it is called COCO. These three ideas are closely related, just as KMI and KGV are closely related in the KMI paper.

Sunday, May 10, 2009

Kernel Constrained Covariance for Dependence Measurement


For measuring independence, they propose another kernel-based criterion, COCO (constrained covariance; in a previous paper it was called kernel covariance, KC, though their objective actually comes from the mutual information of Gaussian r.v.s), besides the previously scanned paper of Jordan's (KCC and KGV). This has been mentioned in their earlier work.

This work has quite different concerns, since previous papers did not say how to choose the kernels. They prove COCO is zero only at independence when universal kernels are used. A so-called universal kernel is a kernel k(\cdot, \cdot) such that, on a compact metric space (\mathcal{X}, d), the RKHS it induces is dense in the space of continuous functions over \mathcal{X}, namely C(\mathcal{X}), w.r.t. the topology induced by the infinity norm \| f - g \|_\infty. They prove Gaussian kernels and Laplacian kernels are universal.

They also point out the limitation of independence tests based on universal kernels. The proposed COCO has its weakness too: the empirical COCO can be small while the r.v.s are still dependent. But they found that their empirical COCO converges to the true COCO value at an exponential rate.

The calculation of COCO is equivalent to
\mathrm{COCO}(z, F, G) = \frac{1}{n}\sqrt{\| \tilde{K}^f \tilde{K}^g \|_2}.
Their later paper actually uses the Frobenius norm. Their experiments show an example of applying COCO to ICA on fMRI data, compared with KMI and a correlation-based method (using only correlation as a test of independence).
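A minimal sketch of the empirical COCO from the formula above, assuming the centered Gram matrices \tilde{K}^f and \tilde{K}^g have already been computed (e.g. as in the HSIC sketch in the post above); the spectral norm here is the largest singular value:

```python
import numpy as np

def coco(Kf_tilde, Kg_tilde):
    # empirical COCO(z, F, G) = (1/n) sqrt(||K~f K~g||_2),
    # where ||.||_2 is the spectral norm (largest singular value)
    n = Kf_tilde.shape[0]
    spec = np.linalg.norm(Kf_tilde @ Kg_tilde, ord=2)
    return np.sqrt(spec) / n
```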

The Kernel Mutual Information


We know mutual information can be used to test the independence of r.v.s. The difficulty with mutual information is that the distribution is usually unknown, so we have to estimate either the density or the entropy from samples.

This paper introduces the so-called KMI (kernel mutual information), which in practice has performance comparable to KGV (in Bach and Jordan's paper). It comes from the mutual information between two multivariate Gaussian vectors,
I(x; y) = -\frac{1}{2} \log \left( \prod_{i = 1}^{\min(p_x, p_y)} (1 - \rho_i^2)\right).
We use the correlation in the RKHS for \rho_i,
\rho_i = \frac{c_i^\top( P_{x, y} - p_x p_y) d_i}{\sqrt{c_i^\top D_x c_i d_i^\top D_y d_i}},
where P_{x, y}, p_x, p_y, D_x, D_y are approximated with samples on an assumed grid (which cancels out in the end). By relaxing the denominator (to a larger value), we obtain an upper bound for the mutual information,
M(z) = -\frac{1}{2} \log\left( \big| I - (\nu_x \nu_y)\tilde{K}^{(x)} \tilde{K}^{(y)} \big|\right),
where \tilde{K}^{(x)} and \tilde{K}^{(y)} are the centered Gram matrices. This is the criterion for KMI. Since it is an upper bound for the mutual information, we can use it as a contrast function (I don't know why the authors think we can derive this result from Theorem 1; I am confused about their idea).
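To make the criterion concrete, here is a minimal numpy sketch of M(z), assuming the centered Gram matrices are given. The default choice of the constant \nu = \nu_x \nu_y below is my own heuristic (just keeping I - \nu \tilde{K}^{(x)} \tilde{K}^{(y)} away from singularity), not the paper's prescription:

```python
import numpy as np

def kmi(Kx_tilde, Ky_tilde, nu=None):
    # KMI criterion M(z) = -1/2 * log |I - nu * K~x K~y|
    n = Kx_tilde.shape[0]
    P = Kx_tilde @ Ky_tilde
    if nu is None:
        # heuristic (my assumption): scale so the spectrum of nu*P
        # stays inside (-1, 1), keeping the determinant positive
        nu = 0.99 / (np.linalg.norm(P, ord=2) + 1e-12)
    sign, logdet = np.linalg.slogdet(np.eye(n) - nu * P)
    return -0.5 * logdet
```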

In their formulation, (regularized) KGV can be regarded as another way of relaxing the covariance. Therefore this independence measure can likewise be used for the ICA task. In a way, this paper is a generalization of KGV, but I am still confused about their idea.

==
Later I realized that KMI is an upper bound of KC; therefore when KMI = 0, KC will be 0 too.

Thursday, March 26, 2009

Kernel Independent Component Analysis


ICA has long been studied by researchers from signal processing, machine learning and the like. However, though independence can be characterized mathematically with mutual information, it is not computable in practice. Many ideas have been proposed: approximations of mutual information (e.g. using moments) or alternative contrast functions (e.g. kurtosis).

In recent years, I noticed Bach has published lots of papers in machine learning, and I found he is another student of Jordan's. What a supervisor! After reading this piece of work, I have no doubt about the novelty of their approach to ICA. It relates closely to a later work. The core idea of ICA, independent components, must be formulated with some criterion, and an independence measure calculated this way will surely be applicable to other problems as well; e.g. one concept of dimensionality reduction for regression (supervised dimensionality reduction) is sufficiency: finding a subspace such that the response y is independent of x conditioned on the projection of x onto the subspace. Therefore this is the key idea for understanding the following papers to be scanned.

There are several steps in tackling independence:
  • Correlation-to-independence: if x and z are independent, then f(x) and g(z) are independent and therefore uncorrelated (correlation coefficient is 0). Given a fixed pair (f, g), we cannot assert that x and z are independent just because f(x) and g(z) are uncorrelated, but x and z are independent if f(x) and g(z) are uncorrelated for arbitrary (f, g) (then we can bring the Fourier transform into it).
  • Kernels provide us with an RKHS, which means we can use many, many functions of the r.v.s to be tested. But empirically, with given samples, for ICA we have to minimize the maximum correlation of x and z over all pairs (f, g).
  • The maximum correlation can be obtained with the standard CCA idea or its kernelized version (a generalized eigenvalue problem); see the sketch after this list. This can be extended to multiple r.v.s (variables or vectors). It is argued that the smallest eigenvalues are more suitable (for the (\mathcal{C}, \mathcal{D})-generalized eigenvalue problem, where \mathcal{C} is the covariance matrix, with blocks K_i K_j in kernelized form, and \mathcal{D} is a regularized block-diagonal variance matrix with blocks (K_i + \sigma I)^2). (KCCA as a contrast function.)
  • The last idea is that when the r.v.s are Gaussian, the mutual information is -\frac{1}{2}\log\left(|\mathcal{C}| / |\mathcal{D}|\right). So we just need to minimize this. This is another contrast function (the kernel generalized variance, KGV).
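Here is a minimal sketch of the two-variable kernelized CCA step as a generalized eigenvalue problem, in the regularized form described above. The regularization constant kappa and the use of scipy's generalized symmetric solver are my assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np
from scipy.linalg import eigh

def max_kernel_correlation(K1_tilde, K2_tilde, kappa=1e-2):
    # Generalized eigenproblem A v = rho * B v with
    #   A = [[0, K1 K2], [K2 K1, 0]]   (the "covariance" side)
    #   B = blockdiag((K1 + c I)^2, (K2 + c I)^2), c = n*kappa/2
    # K1_tilde, K2_tilde are centered Gram matrices.
    n = K1_tilde.shape[0]
    Z = np.zeros((n, n))
    R1 = K1_tilde + (n * kappa / 2) * np.eye(n)
    R2 = K2_tilde + (n * kappa / 2) * np.eye(n)
    A = np.block([[Z, K1_tilde @ K2_tilde],
                  [K2_tilde @ K1_tilde, Z]])
    B = np.block([[R1 @ R1, Z],
                  [Z, R2 @ R2]])
    # largest generalized eigenvalue = first (regularized) kernel
    # canonical correlation
    return eigh(A, B, eigvals_only=True)[-1]
```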
So we must get these ideas into a working algorithm. There are several important tips,
  • As in standard ICA, we first whiten the data Y so that it has zero mean and identity covariance (see the sketch after this list). Then we only need to find an orthogonal matrix instead of a full affine transform.
  • The (\mathcal{C}, \mathcal{D}) pair, converted into an eigenvalue problem, is too expensive to compute directly. Therefore we approximate the Gram matrices with an incomplete Cholesky decomposition (another candidate is the Nyström approximation).
  • Please notice the optimization is constrained to orthogonal matrices, which live on a Stiefel manifold. We must use the gradient on this manifold (so as not to break the constraints). The overall optimization is simply steepest gradient descent.
  • The authors then propose two possible initializations for gradient descent. One point is that a kernel implying complicated features may make the optimization tough, while a simpler kernel such as a polynomial kernel might be easier to optimize.
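A minimal sketch of the whitening step from the first tip, using an eigendecomposition of the sample covariance; the ZCA form here is one of several equivalent choices:

```python
import numpy as np

def whiten(Y):
    # center the data, then map it to zero mean and identity covariance
    # (assumes the sample covariance is full rank)
    Yc = Y - Y.mean(axis=0)
    cov = Yc.T @ Yc / Yc.shape[0]
    d, E = np.linalg.eigh(cov)
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T  # ZCA whitening matrix
    return Yc @ W
```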
This journal paper contains much more information about everything :-D I will return to it when considering implementation.