This paper proposes a variant of SNE. SNE was previously proposed by Hinton and Roweis. Its idea is to find a projection such that the neighboring probabilities are preserved: given the coordinates $x_i$, the neighboring probability is
$$p_{i \mid j} = \frac{g_j(x_j, x_i)}{\sum_{k \neq j} g_j(x_j, x_k)}.$$
In the projection, if we define a similar probability $q_{i \mid j}$, we wish the two distributions to be as similar as possible, so we minimize the KL divergence
$$\mathrm{KL} = \sum_j \sum_i p_{i \mid j} \log \frac{p_{i \mid j}}{q_{i \mid j}}.$$
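To make the definitions concrete, here is a minimal NumPy sketch of the conditional probabilities and the KL objective; the function names and the per-sample bandwidths `sigmas` are my own notation, not from the paper:

```python
import numpy as np

def conditional_probs(X, sigmas):
    """P[i, j] = p_{i|j}: probability that sample j picks i as a neighbor."""
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of rows of X.
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    for j in range(n):
        # Gaussian kernel g_j centered at x_j with bandwidth sigma_j.
        g = np.exp(-D[j] / (2.0 * sigmas[j] ** 2))
        g[j] = 0.0                    # exclude k = j from the normalization
        P[:, j] = g / g.sum()         # column j holds p_{. | j}
    return P

def kl_objective(P, Q, eps=1e-12):
    """sum_j sum_i p_{i|j} log(p_{i|j} / q_{i|j}); zero p terms contribute 0."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
```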
The original setting takes $g_j(x_j, x_i)$ to be a Gaussian kernel (with a different variance for each sample). By setting a so-called perplexity, we binary-search a proper variance for each sample; the perplexity is a continuous analogue of the effective number of neighbors. The gradient with respect to the projected coordinates $y_j$ is
$$\frac{\partial \mathrm{KL}}{\partial y_j} = 2 \sum_i (p_{i \mid j} - q_{i \mid j} + p_{j \mid i} - q_{j \mid i})(y_j - y_i).$$
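The bandwidth search could look like the following sketch, which would feed the `sigmas` argument of `conditional_probs` above; the tolerance, bracket values, and iteration cap are assumptions on my part:

```python
import numpy as np

def sigma_for_perplexity(d_row, perplexity, tol=1e-5, max_iter=50):
    """Binary-search the Gaussian bandwidth sigma_j for one sample so that
    the entropy of p_{.|j} matches log(perplexity) (natural log).

    d_row: squared distances from x_j to the other samples (self excluded).
    """
    target = np.log(perplexity)
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        g = np.exp(-d_row / (2.0 * sigma ** 2))
        p = g / g.sum()
        h = -np.sum(p * np.log(p + 1e-12))   # Shannon entropy in nats
        if abs(h - target) < tol:
            break
        if h > target:      # too many effective neighbors: shrink sigma
            hi = sigma
            sigma = (lo + sigma) / 2.0
        else:               # too few effective neighbors: grow sigma
            lo = sigma
            sigma = (sigma + hi) / 2.0 if hi < 1e10 else sigma * 2.0
    return sigma
```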
The gradient descent is supplemented with a momentum term.

The so-called t-SNE makes two major modifications to the SNE model. For one thing, it uses a symmetric $p_{i,j}$ instead, defined as $(p_{i \mid j} + p_{j \mid i}) / (2n)$, where $n$ is the number of samples (the factor $2n$ makes the $p_{i,j}$ sum to one). This simplifies the derivative of the objective function to
$$\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i,j} - q_{i,j})(y_j - y_i).$$
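A sketch of the symmetrization and of this simplified gradient, assuming a matching symmetric matrix `Q` of projected affinities is available; the vectorized form below is my rewriting of the sum:

```python
import numpy as np

def symmetric_affinities(P_cond):
    """p_ij = (p_{i|j} + p_{j|i}) / (2n), so that all p_ij sum to one."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

def symmetric_sne_gradient(P, Q, Y):
    """dKL/dy_j = 4 * sum_i (p_ij - q_ij)(y_j - y_i), for all j at once."""
    A = P - Q                              # symmetric, since P and Q are
    # diag(column sums of A) @ Y collects the y_j terms; A @ Y the y_i terms.
    return 4.0 * (np.diag(A.sum(axis=0)) - A) @ Y
```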
For another thing, t-SNE uses a different density for the projection instead of the Gaussian kernel. The "t" comes from here, since they propose using a t-distribution with one degree of freedom (therefore a Cauchy distribution). The derivative then becomes
$$\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i,j} - q_{i,j})(y_j - y_i)\left(1 + \|y_j - y_i\|^2\right)^{-1}.$$
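The same vectorization carries over to the t-SNE gradient, where the Cauchy kernel supplies both $q_{i,j}$ and the extra weight; the learning rate `eta` and momentum coefficient `alpha` in the usage comment are illustrative assumptions:

```python
import numpy as np

def tsne_gradient(P, Y):
    """t-SNE gradient: the Cauchy kernel contributes the extra weight w_ij."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D)                    # w_ij = (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                        # q_ij from the t-distribution kernel
    A = (P - Q) * W                        # (p_ij - q_ij) * w_ij, still symmetric
    return 4.0 * (np.diag(A.sum(axis=0)) - A) @ Y

# One momentum-supplemented descent step (eta, alpha are assumed values):
#   velocity = alpha * velocity - eta * tsne_gradient(P, Y)
#   Y = Y + velocity
```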
The great thing about t-SNE is its capability to yield good embeddings. But it seems somewhat difficult to increase the embedding dimension; I am not sure about this yet and am running some experiments on it.