Tuesday, May 12, 2009

Visualizing Data using t-SNE

by Laurens van der Maaten and Geoffrey Hinton

This paper proposes a variant of SNE, which was previously proposed by Hinton and Roweis. Its idea is to find a projection that preserves the neighboring probabilities: given the coordinates x_i, the probability that sample j picks sample i as its neighbor is
p_{i \mid j} = \frac{g_j(x_j, x_i)}{\sum_{k \neq j} g_j(x_j, x_k)}.
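
As a quick illustration (my own NumPy sketch, not the authors' code), the conditional probabilities for a single shared variance could be computed like this:

import numpy as np

def conditional_probs(X, sigma):
    """p_{i|j}: probability that sample j picks sample i as its neighbor,
    here with a Gaussian kernel sharing one variance sigma**2 (the paper
    chooses a per-sample variance, as discussed below)."""
    D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)  # pairwise squared distances
    G = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(G, 0.0)                 # a point is not its own neighbor
    return G / G.sum(axis=1, keepdims=True)  # row j holds p_{. | j}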
In the projection, we define a similar probability q_{i \mid j}; we wish the two distributions to be as close as possible, and therefore minimize the sum of KL divergences
\sum_j \sum_i p_{i \mid j} \log \frac{p_{i \mid j}}{q_{i \mid j}}.
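
In code, the objective is just (a sketch of mine; P and Q hold the conditional probabilities row-wise):

import numpy as np

def sne_objective(P, Q, eps=1e-12):
    # sum over j of KL(p_{.|j} || q_{.|j}); eps guards the log at zero entries
    return np.sum(P * np.log((P + eps) / (Q + eps)))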
The original setting takes g_j to be a Gaussian kernel (with a different variance for each sample). By fixing a so-called perplexity, we binary-search a proper variance for each sample; the perplexity is a smooth analogue of the effective number of neighbors. We then compute the gradient with respect to each low-dimensional coordinate y_j:
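
The binary search might look like this (a hypothetical helper of my own; dist_row holds the squared distances from sample j to all other samples, self excluded, and the search runs over the precision beta = 1/(2 sigma^2) for numerical convenience):

import numpy as np

def sigma_for_perplexity(dist_row, perplexity, tol=1e-5, max_iter=50):
    # binary-search beta = 1/(2 sigma^2) until 2**H(p_{.|j}) hits the target
    target_entropy = np.log2(perplexity)
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(max_iter):
        p = np.exp(-dist_row * beta)
        p /= p.sum()
        entropy = -(p * np.log2(p + 1e-12)).sum()
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:   # too flat: shrink sigma, grow beta
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
        else:                          # too peaked: grow sigma, shrink beta
            hi = beta
            beta = (beta + lo) / 2.0
    return np.sqrt(1.0 / (2.0 * beta))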
\frac{\partial \mathrm{KL}}{\partial y_j} = 2 \sum_i ( p_{i \mid j} - q_{i \mid j} + p_{j \mid i} - q_{j \mid i})(y_j - y_i).
The gradient descent is supplemented with a momentum term.
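
Putting it together, one update step could be sketched as follows (again my own illustration, not the authors' implementation; the low-dimensional q_{i|j} uses a fixed Gaussian kernel exp(-||y_j - y_i||^2), i.e. variance 1/2):

import numpy as np

def sne_step(Y, P, eta=10.0, momentum=0.5, prev=None):
    diff = Y[:, None, :] - Y[None, :, :]           # diff[j, i] = y_j - y_i
    G = np.exp(-np.square(diff).sum(-1))
    np.fill_diagonal(G, 0.0)
    Q = G / G.sum(axis=1, keepdims=True)           # q_{i|j}, row-wise
    # gradient above: coeff[j, i] = p_{i|j} - q_{i|j} + p_{j|i} - q_{j|i}
    coeff = (P - Q) + (P - Q).T
    grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
    step = momentum * (prev if prev is not None else 0.0) - eta * grad
    return Y + step, step                          # feed `step` back in as `prev`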

The so-called t-SNE makes two major modifications to the SNE model. For one thing, it uses a symmetric joint probability p_{i, j} instead, which is simply (p_{i \mid j} + p_{j \mid i}) / (2n); this simplifies the derivative of the objective function,
\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i, j} - q_{i, j})(y_j - y_i).
For another thing, it uses a different kernel for the projection instead of the Gaussian. The "t" comes from here: they propose using a Student t-distribution with one degree of freedom (hence a Cauchy distribution). The derivative then becomes
\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i, j} - q_{i,j})(y_j - y_i)(1 + \| y_j - y_i\|^2)^{-1}.
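
A direct transcription of this gradient (just a sketch; P is the symmetric matrix p_{i,j}, and Q is normalized over all pairs since it is now a single joint distribution):

import numpy as np

def tsne_gradient(Y, P):
    diff = Y[:, None, :] - Y[None, :, :]           # diff[j, i] = y_j - y_i
    inv = 1.0 / (1.0 + np.square(diff).sum(-1))    # Cauchy kernel (1 + ||y_j - y_i||^2)^{-1}
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                            # joint q_{i,j}
    return 4.0 * (((P - Q) * inv)[:, :, None] * diff).sum(axis=1)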


The great thing about t-SNE is its capability to yield a good embedding. But somehow it seems difficult to increase the embedding dimension, presumably because the heavy-tailed kernel is tuned for 2-D or 3-D maps. Not entirely sure about it though; I am doing some experiments with it.
