Sunday, May 17, 2009

Supervised Feature Selection via Dependence Estimation


Feature selection has been mentioned in a KDR paper. This paper uses HSIC with a backward greedy algorithm: at each step they pick a dimension to throw away such that the remaining features have the highest HSIC. I thought they might simply throw away the one with the lowest HSIC, but they prefer to use HSIC as a lower bound of dependence.
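To make the backward greedy step concrete, here is a minimal sketch I would write with a biased empirical HSIC estimator; the function names, the RBF bandwidth and the delta kernel for the labels are my own choices, not the paper's code.

import numpy as np

def rbf_gram(X, sigma=1.0):
    # Gaussian (RBF) Gram matrix of the rows of X
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(K, L):
    # biased empirical HSIC: tr(K H L H) / (n - 1)^2
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1)**2

def backward_select(X, L_y, n_keep, sigma=1.0):
    # greedily drop the feature whose removal leaves the remaining
    # features with the highest HSIC against the label kernel L_y
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        scores = [hsic(rbf_gram(X[:, [i for i in active if i != j]], sigma), L_y)
                  for j in active]
        active.pop(int(np.argmax(scores)))
    return active

For discrete labels y one can take the delta kernel, L_y = (y[:, None] == y[None, :]).astype(float).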

The paper also mentions the connection between HSIC and other criteria such as MMD, from a previously scanned paper, and KTA (which looks like an uncentered version of HSIC).

K-means Clustering via Principal Component Analysis


The most important result of this paper is that the principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, and that the subspace spanned by the cluster centroids is given by the spectral expansion of the data covariance matrix truncated at K-1 terms.

I'd like to say their discovery is very similar to that of normalized cut and the spectral expansion of the Laplacian matrix. Let
J_K = \sum_{k = 1}^K \sum_{i \in C_k} \|x_i - m_k \|^2
be the objective of K-means clustering,
d(C_k, C_l) = \sum_{i \in C_k} \sum_{j \in C_l} \| x_i - x_j \|^2
be the distance between two clusters. It can be shown that
J_K = N \bar{y}^2 - \frac{1}{2} J_D,
where in the 2-way clustering case
J_D = \frac{N_1 N_2}{N} \left( 2 \frac{d(C_1, C_2)}{N_1 N_2} - \frac{d(C_1, C_1)}{N_1^2} - \frac{d(C_2, C_2)}{N_2^2}\right).
Further we have
\frac{d(C_1, C_2)}{N_1 N_2} = \frac{d(C_1, C_1)}{N_1^2} + \frac{d(C_2, C_2)}{N_2^2} + \| m_1 - m_2 \|^2.
The important theorem they prove says that in 2-way clustering, the continuous solution is given by the leading principal component: samples with positive entries go to one cluster and those with negative entries to the other. The optimal K-means objective in this case can be bounded using the leading part of the spectrum,
N \bar{y}^2 - \lambda_1 \leq J_{2} \leq N \bar{y}^2.
If we create a fully connected graph where every edge weight is \frac{1}{N}, is the linearization of the N-cut algorithm then PCA, and is N-cut simply K-means? The difference is that PCA is done in the feature space, while spectral clustering works in the embedding space that Laplacian eigenmaps learns.
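As a quick sanity check of the 2-way result (my own toy sketch, not from the paper), we can split by the sign of the first principal component and verify the bound above, reading N \bar{y}^2 as the total centered sum of squares and \lambda_1 as the leading eigenvalue of the Gram matrix of the centered data.

import numpy as np

def pca_two_way(X):
    # relaxed 2-way K-means: split by the sign of the first principal component
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    labels = (U[:, 0] * s[0] > 0).astype(int)
    total_ss = np.sum(Xc**2)       # N * ybar^2: total centered sum of squares
    lam1 = s[0]**2                 # leading eigenvalue of Xc Xc^T
    return labels, total_ss, lam1

def kmeans_objective(X, labels):
    # J_K = sum_k sum_{i in C_k} ||x_i - m_k||^2
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0))**2)
               for k in np.unique(labels))

X = np.random.randn(200, 5)
labels, total_ss, lam1 = pca_two_way(X)
J2 = kmeans_objective(X, labels)
assert total_ss - lam1 <= J2 <= total_ss   # the spectral bound holds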

For K-way clustering, we have similar results. But unlike the 2-way case, we cannot interpret the principal components in the same way. Maybe the matting paper has a better interpretation. Their second result is about the centroids.

This result has several implications. Spectral clustering is done in the embedding space learned by Laplacian eigenmaps; PCA/K-means is done in the feature space; kernel PCA/kernel K-means is done in the kernel-induced space. So if we consider the unsupervised KDR idea, can we find a similar interpretation, such as clustering?

Tuesday, May 12, 2009

Visualizing Data using t-SNE

by Laurens van der Maaten and Geoffrey Hinton

This paper proposes a variant of SNE. SNE was previously proposed by Hinton and Roweis. Its idea is to find a projection such that the neighboring probabilities are preserved: given the coordinates x_i, the neighboring probability is
p_{i \mid j} = \frac{g_j(x_j, x_i)}{\sum_{k \neq j} g_j(x_j, x_k)}.
In the projection, if we define a similar probability q_{i \mid j}, we wish the distribution over these samples to be as similar as possible, and therefore we minimize
\sum_j \sum_i p_{i \mid j} \log \frac{p_{i \mid j}}{q_{i \mid j}}.
The original setting takes g_j(x_i, x_j) to be a Gaussian kernel (with a different variance for each sample). By setting a so-called perplexity, we then binary-search for a proper variance for each sample; the perplexity corresponds to a continuous version of the effective number of neighbors. We then compute the gradient with respect to y_j:
\frac{\partial \mathrm{KL}}{\partial y_j} = 2 \sum_i ( p_{i \mid j} - q_{i \mid j} + p_{j \mid i} - q_{j \mid i})(y_j - y_i).
The gradient descent is supplemented with a momentum term.
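Here is a small sketch of the per-sample bandwidth search as I understand it (my own code; the log-scale search and the fixed iteration count are implementation choices, not taken from the paper).

import numpy as np

def cond_probs(d2_row, sigma):
    # conditional p_{i|j} for one sample j, given its squared
    # distances d2_row to all other samples k != j
    logits = -d2_row / (2 * sigma**2)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(d2_row, target_perp, iters=50):
    # binary search (on a log scale) for the Gaussian bandwidth whose
    # conditional distribution has perplexity 2^H(p), the continuous
    # notion of "effective number of neighbors"
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = np.sqrt(lo * hi)
        p = cond_probs(d2_row, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))
        if 2.0**H > target_perp:
            hi = sigma                     # too many effective neighbors: shrink
        else:
            lo = sigma
    return np.sqrt(lo * hi)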

The so-called t-SNE makes two major modifications to the SNE model. For one thing, it uses a symmetric p_{i, j} instead, which is simply (p_{i \mid j} + p_{j \mid i}) / (2N) for N samples; this simplifies the derivatives of the objective function,
\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i, j} - q_{i, j})(y_j - y_i).
For another thing, a different density is used in the projection space instead of the Gaussian kernel. The "t" comes from here, since they propose using a Student t-distribution with 1 degree of freedom (hence a Cauchy distribution). The derivatives then become
\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i, j} - q_{i,j})(y_j - y_i)(1 + \| y_j - y_i\|^2)^{-1}.
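The gradient above is a few lines of NumPy. The following is only a sketch: it assumes the symmetric joint P is already given and leaves out the momentum and other optimization tricks from the paper.

import numpy as np

def tsne_grad(P, Y):
    # gradient of KL(P || Q) w.r.t. the map points Y (n x d),
    # with the Student-t (Cauchy) kernel in the low-dimensional space
    d2 = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    W = 1.0 / (1.0 + d2)                   # (1 + ||y_j - y_i||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                        # joint q_{i,j}
    PQW = (P - Q) * W
    # 4 * sum_i (p_{ij} - q_{ij}) (y_j - y_i) / (1 + ||y_j - y_i||^2)
    return 4.0 * (np.diag(PQW.sum(axis=1)) - PQW) @ Y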


The great thing about t-SNE is its ability to yield a good embedding. But somehow it seems a little difficult to increase the embedding dimension. I am not sure about this though; I am doing some experiments with it.

Monday, May 11, 2009

A Dependence Maximization View of Clustering


HSIC is an upper bound of COCO, which in theory has been proved to be equivalent to independence. Therefore HSIC can be used as an independence test. However, as an upper bound, it might not be suitable if we want to maximize the dependence of r.v.s, IMHO. So the motivation for using HSIC for this purpose (clustering) is still vague.

For clustering, we maximize
\tr( H K_x H K_y )
to make the features and the labels dependent. So we have to find a kernel for the discrete variable y: the delta kernel, or the kernel used in the paper,
\Pi A \Pi^\top,
where A is the Gram matrix over the different clusters and \Pi is the assignment matrix. The paper also considers other kernels, relating this algorithm to kernel PCA (and maybe kernel k-means as well?).

The optimization is as difficult as in other clustering algorithms. It initializes an assignment matrix randomly and repeatedly updates it until convergence: we loop over the samples, finding for each one the assignment that maximizes the HSIC. Notice that the matrix A is updated whenever the assignment matrix is updated. As in k-means, each sample has the same weight and A is simply a diagonal matrix holding the inverses of the cluster sizes. Another choice generalizes the weights, which leads to a family of clustering algorithms obtained by choosing A.
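A naive sketch of this loop follows (my own code: it recomputes the objective from scratch for every candidate assignment, whereas an efficient implementation would update it incrementally, and the function and parameter names are mine).

import numpy as np

def hsic_clustering(Kx, n_clusters, n_sweeps=20, seed=0):
    # maximize tr(H Kx H * Pi A Pi^T) over the assignment matrix Pi,
    # with A diagonal holding the inverse cluster sizes (k-means-like weights)
    rng = np.random.default_rng(seed)
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc = H @ Kx @ H

    def objective(labels):
        Pi = np.eye(n_clusters)[labels]                   # one-hot assignment
        A = np.diag(1.0 / np.maximum(Pi.sum(axis=0), 1.0))
        return np.trace(Kxc @ Pi @ A @ Pi.T)

    labels = rng.integers(0, n_clusters, size=n)
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            old = labels[i]
            vals = []
            for c in range(n_clusters):
                labels[i] = c
                vals.append(objective(labels))
            labels[i] = int(np.argmax(vals))
            changed = changed or labels[i] != old
        if not changed:                                   # no sample moved: converged
            break
    return labels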

Maybe it's worth thinking about the design idea more carefully.

A Kernel Approach to Comparing Distributions


This paper takes advantage of kernel methods to test whether two r.v.s have the same distribution. The idea is very close to kernel-based independence tests. We know that if two r.v.s have identical distributions, then \mathbb{E} f(x) = \mathbb{E} f(y) for every function f. Given a large enough function class, we may use
\mathrm{MMD}(\mathcal{F}, x, y ) = \sup_{f \in \mathcal{F}} \left( \mathbb{E} f(x) - \mathbb{E} f(y) \right).
Then we take some universal kernel and its corresponding RKHS. With some derivation, the empirical evaluation of MMD is based on
\mathrm{MMD}^2(\mathcal{F}, x, y) = \mathbb{E} \langle \phi(x), \phi(x')\rangle + \mathbb{E} \langle \phi(y), \phi(y')\rangle - 2 \mathbb{E} \langle \phi(x), \phi(y)\rangle \approx \frac{1}{m^2} \sum_{i, j = 1}^m k(x_i, x_j) + \frac{1}{n^2} \sum_{i, j = 1}^n k(y_i, y_j) - 2 \frac{1}{mn} \sum_{i = 1}^m \sum_{j = 1}^n k(x_i, y_j).
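The empirical estimate is a few lines once a kernel is chosen; here is a minimal sketch with a Gaussian kernel (this is the biased version that keeps the diagonal terms, and the bandwidth is an arbitrary choice of mine).

import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian kernel matrix between the rows of X and the rows of Y
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    # biased empirical estimate of MMD^2 between samples X (m x d) and Y (n x d)
    m, n = len(X), len(Y)
    return (rbf(X, X, sigma).sum() / m**2
            + rbf(Y, Y, sigma).sum() / n**2
            - 2.0 * rbf(X, Y, sigma).sum() / (m * n))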


I am not sure whether a later work in NIPS 2008 is based on this; it will be scanned soon.

Measuring Statistical Dependence with Hilbert-Schmidt Norms


This paper uses the so-called Hilbert-Schmidt norm instead of the L^2 norm (which corresponds to the COCO independence test) used in the previously scanned paper, resulting in the so-called HSIC (Hilbert-Schmidt independence criterion), which is in practice much easier to calculate. Compared with COCO, which uses the square root of the largest singular value of \tilde{K}^{(x)} \tilde{K}^{(y)}, HSIC actually uses the whole spectrum, i.e. the trace of \tilde{K}^{(x)} \tilde{K}^{(y)}. Since the Frobenius norm is an upper bound for the L^2 norm, a similar independence equivalence can be obtained.
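The contrast is easy to state in code: both criteria start from the product of the two centered Gram matrices; COCO keeps only the largest singular value, while HSIC sums the whole spectrum. A small sketch (the 1/n and 1/(n-1)^2 scalings follow the formulas quoted in these notes; the normalizations differ between papers).

import numpy as np

def centered(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def coco_and_hsic(Kx, Ky):
    n = Kx.shape[0]
    M = centered(Kx) @ centered(Ky)
    coco = np.sqrt(np.linalg.norm(M, 2)) / n    # sqrt of the largest singular value
    hsic = np.trace(M) / (n - 1)**2             # the whole spectrum via the trace
    return coco, hsic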

Please notice the relationship among KCC, COCO and HSIC. For KCC, we use the maximal kernel correlation as the criterion for the independence test, but then we have to use the regularized version to avoid the common case where it does not give the desired result. In COCO, we constrain both covariances to the identity, and then the problem becomes seeking the largest singular value of the product of the two centered Gram matrices, while KCC solves a generalized eigenvalue problem. Therefore, when we deal with multiple r.v.s, we can proceed in the same way as CCA does. So in a way you see why it is called COCO (constrained covariance). These three ideas are closely related, just as KMI and KGV are closely related in the KMI paper.

Sunday, May 10, 2009

Kernel Constrained Covariance for Dependence Measurement


For measuring independence, they propose another kernel-based criterion, COCO (constrained covariance; in a previous paper it was called kernel covariance, KC, although that objective actually comes from the mutual information of Gaussian r.v.s), besides the previously scanned paper of Jordan's (KCC and KGV). This has been mentioned in their earlier work.

This work has quite different concerns, since the previous papers did not say how to choose the kernels. They prove that COCO is zero only at independence for universal kernels. A so-called universal kernel is a kernel such that, on a compact metric space (\mathcal{X}, d), the RKHS induced by the kernel k(\cdot, \cdot) is dense in the space of continuous functions over \mathcal{X}, namely C(\mathcal{X}), w.r.t. the topology induced by the infinity norm \| f - g \|_\infty. They prove that Gaussian kernels and Laplacian kernels are universal.

They also point out the limitation of independence tests based on universal kernels. The proposed COCO has its adversarial cases too: even when the empirical COCO is small, the r.v.s may still be dependent. But they show that their empirical COCO converges to the true COCO value at an exponential rate.

The calculation of COCO is equivalent to
\mathrm{COCO}(z, F, G) = \frac{1}{n}\sqrt{\| \tilde{K}^f \tilde{K}^g \|_2}.
Their later paper actually uses the Frobenius norm. Their experiments show an example of applying COCO to ICA on fMRI data, compared with KMI and a correlation method (using only correlation as a test of independence).

The Kernel Mutual Information


We know mutual information can be used to test the independence of r.v.s. The difficulty with mutual information is that the distribution is usually unknown, so we have to make either a density estimate or an entropy estimate from samples.

This paper introduces the so-called KMI (kernel mutual information), which in practice has performance comparable to KGV (from Bach and Jordan's paper). It comes from the mutual information between two multivariate Gaussian vectors,
I(x; y) = -\frac{1}{2} \log \left( \prod_{i = 1}^{\min(p_x, p_y)} (1 - \rho_i^2)\right).
We use the correlation in the RKHS for this \rho_i,
\rho_i = \frac{c_i^\top( P_{x, y} - p_x p_y) d_i}{\sqrt{c_i^\top D_x c_i d_i^\top D_y d_i}},
where P_{x, y}, p_x, p_y, D_x, D_y are approximated with samples and an assumed grid (it cancels out in the end). By relaxing the denominator (to a bigger value), we find an upper bound for the mutual information,
M(z) = -\frac{1}{2} \log\left( \big| I - (\nu_x \nu_y)\tilde{K}^{(x)} \tilde{K}^{(y)} \big|\right),
where \tilde{K}^{(x)} and \tilde{K}^{(y)} are the centered Gram matrices. This is the criterion for KMI. Since it is an upper bound for the mutual information, we can use it as a contrast function (I don't know why the authors think we can derive this result from Theorem 1; I am confused about their idea).
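A direct transcription of the criterion as I read it (a sketch only: the constants \nu_x, \nu_y from the paper are left as parameters, and a log-determinant routine is used for numerical stability).

import numpy as np

def kmi(Kx, Ky, nu_x, nu_y):
    # KMI criterion: -1/2 log | I - (nu_x nu_y) K~x K~y |
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    M = (H @ Kx @ H) @ (H @ Ky @ H)
    sign, logdet = np.linalg.slogdet(np.eye(n) - nu_x * nu_y * M)
    return -0.5 * logdet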

In their formulation, (regularized) KGV can be regarded as another way of relaxing the covariance. Therefore this independence measure can likewise be used for the ICA task. In a way, this paper is a generalization of KGV, but I am still confused about their idea.

==
Later I realized that KMI is an upper bound of KC; therefore when KMI = 0, KC is 0 too.