Monday, August 30, 2010

Replicated Softmax: an undirected topic model


This paper describes an application of the restricted Boltzmann machine to topic modeling. The visible layer is a matrix of 0-1 variables (the (i, k)-th entry indicating whether the ith word of the document is the kth word of the dictionary), and the hidden layer again consists of binary variables. The conditional distribution of the visible layer given the hidden units is multinomial, i.e. a softmax over the hidden units, which is where the model's name comes from; "replicated" refers to the weight matrix being shared across word positions. Once the conditionals are defined, the model can be trained with contrastive divergence. For topic models, the perplexity on held-out documents (a function of the log-probability of the test samples) has to be computed so as to compare with other topic models (such as LDA); annealed importance sampling is employed to estimate the partition function.
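
To make the training step concrete, here is a rough numpy sketch of one CD-1 update for a single document, based on my reading of the model (the function and variable names, e.g. cd1_step and v_counts, are mine, not the paper's; mini-batching and learning-rate schedules are omitted):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_counts, W, a, b, lr=0.01):
    """One CD-1 update for a single document.

    v_counts : (K,) word-count vector of the document (sums to D words)
    W        : (F, K) shared ("replicated") weight matrix
    a, b     : hidden / visible biases
    """
    D = v_counts.sum()                      # document length scales the hidden bias

    # positive phase: hidden activations given the observed counts
    h_prob = sigmoid(W @ v_counts + D * a)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)

    # negative phase: reconstruct D words from the softmax over the dictionary
    logits = h_samp @ W + b
    p_word = np.exp(logits - logits.max())
    p_word /= p_word.sum()
    v_recon = rng.multinomial(int(D), p_word).astype(float)
    h_recon = sigmoid(W @ v_recon + D * a)

    # contrastive-divergence parameter updates
    W += lr * (np.outer(h_prob, v_counts) - np.outer(h_recon, v_recon))
    b += lr * (v_counts - v_recon)
    a += lr * D * (h_prob - h_recon)        # the hidden bias is replicated D times
    return W, a, b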

There is some link between this model and the semantic hashing work previously developed by the authors. The experiments show favorable results for the undirected topic model.

Wednesday, August 25, 2010

Learning a Parametric Embedding by Preserving Local Structure


This paper proposes a parametric embedding for t-SNE. The mapping is parametrized by a deep belief net, and the t-SNE objective is back-propagated through the weights of its several layers. The idea is simple, but the implementation should be tedious.
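
For reference, the objective that gets back-propagated is just the t-SNE KL divergence evaluated on the network's outputs. A rough sketch (my own names; the layer-wise RBM pretraining and mini-batching the paper uses are ignored):

import numpy as np

def tsne_loss(P, Y):
    """KL(P || Q) of t-SNE, where Y are the low-dimensional codes produced
    by the network and P is the precomputed high-dimensional similarity
    matrix. The gradient of this loss w.r.t. Y is what gets pushed back
    through the network's weights."""
    # pairwise squared distances between the low-dimensional codes
    sq = np.sum(Y ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T
    # Student-t kernel with one degree of freedom
    num = 1.0 / (1.0 + dist2)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()
    Q = np.maximum(Q, 1e-12)
    return np.sum(P * np.log(np.maximum(P, 1e-12) / Q))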

Parametric Embedding for Class Visualization


This paper addresses the problem of visualizing the posterior probabilities produced by topic models. We have scanned another paper on visualizing documents directly via a probabilistic graphical model; this paper is actually the inspiration for that one.

By introducing an objective function resembling SNE's, this paper solves an easier problem: the p table of SNE is now given (the posterior probabilities), instead of being computed via binary searches, and the q table is much smaller (documents versus classes rather than documents versus documents).
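
The q table is just a softmax over distances between document coordinates and class coordinates, so the whole objective fits in a few lines. A rough sketch under my reading of the model (names are mine, and I am glossing over any class priors or variance terms the paper may include):

import numpy as np

def pe_objective(P, X, Phi):
    """Parametric-Embedding-style objective (my sketch, not the authors' code).

    P   : (N, C) given posteriors p(c | x_n), e.g. from a topic model
    X   : (N, 2) embedded coordinates of the documents
    Phi : (C, 2) embedded coordinates of the classes
    Returns sum_n KL(p(. | x_n) || q(. | x_n)) with a Gaussian kernel for q.
    """
    dist2 = ((X[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)   # (N, C)
    logits = -0.5 * dist2
    logits -= logits.max(axis=1, keepdims=True)
    Q = np.exp(logits)
    Q /= Q.sum(axis=1, keepdims=True)
    return np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))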

Actually the latter paper decouples the number of topics from the embedding dimension (though it couples it with the number of clusters). We may observe a quite similar phenomenon in this model.

Like SNE, this model is also optimized with a gradient-descent method (in an alternating fashion over the two sets of coordinates, though).

Sunday, August 22, 2010

Probabilistic Latent Semantic Visualization: Topic Model for Visualizing Documents


This paper proposes a model based on LDA, but the Dirichlet prior is replaced by topic proportions generated from the latent coordinates of each document and each topic. The paper only deals with MAP estimation of the latent coordinates, which can be solved via an EM-like algorithm.
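
Concretely, if I read the model right, the topic proportions of a document are a softmax over the (negative, squared) Euclidean distances between the document coordinate x_d and the topic coordinates \phi_z:

\Pr(z \mid x_d, \Phi) = \frac{\exp(-\frac{1}{2}\|x_d - \phi_z\|^2)}{\sum_{z'} \exp(-\frac{1}{2}\|x_d - \phi_{z'}\|^2)}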

The learning is simple for the distributions of words conditioned on topics (closed-form updates) but difficult for the latent coordinates, whose optimization has to be carried out with gradient-based numerical methods.

The idea is interesting, though. Instead of seeking a representation of the documents in the topic space learned by LDA-like models, the visualization is directly modeled via a probabilistic graphical model.

A Probabilistic Approach to Semantic Representation


This paper actually introduces a Gibbs sampler for the LDA model. The sampler, however, does not sample all latent variables: the multinomial parameters are integrated out, and only the latent topic assignments are sampled (a collapsed Gibbs sampler). I guess many DP models actually do inference this way. But is this the so-called Bayesian way of learning? Sampling ?= learning. Well... I don't know why they may do this. I have to see more...
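
For my own reference, the collapsed update is simple enough to sketch: each topic assignment is resampled from a conditional that only involves count tables (variable names are mine, not the paper's; the counts are assumed consistent with the current assignments z):

import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One sweep of the collapsed Gibbs sampler for LDA.

    docs : list of word-id lists
    z    : list of topic-assignment lists, same shape as docs
    n_dk : (D, K) document-topic counts
    n_kw : (K, W) topic-word counts
    n_k  : (K,)   topic totals
    """
    K, W = n_kw.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # remove the current assignment from the counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # conditional over topics given all other assignments
            p = (n_kw[:, w] + beta) / (n_k + W * beta) * (n_dk[d] + alpha)
            p /= p.sum()
            k = rng.choice(K, p=p)
            # add the new assignment back
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1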

Monday, August 9, 2010

Relation between PLSA and NMF and Implications


This paper's critical point is that PLSA solves the NMF problem with a KL-divergence loss, which is easy to see if you are familiar with the two algorithms (a small sketch of the correspondence follows the list). The implications are:
  • The NMF multiplicative updates can be interpreted as an EM algorithm;
  • NMF does not hold many advantages over PLSA;
  • But NMF might benefit from using other loss functions for different purposes. 
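
Here is the sketch: Lee and Seung's multiplicative updates for KL-divergence NMF; normalizing the resulting factors gives PLSA-style conditional distributions (the function name nmf_kl and the details are mine, not code from the paper):

import numpy as np

def nmf_kl(V, K, n_iter=200, seed=0):
    """NMF with the (generalized) KL-divergence loss, V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K)) + 1e-3
    H = rng.random((K, m)) + 1e-3
    for _ in range(n_iter):
        WH = W @ H + 1e-12
        W *= ((V / WH) @ H.T) / H.sum(axis=1)             # update word factors
        WH = W @ H + 1e-12
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]    # update document factors
    return W, H

Normalizing the columns of W gives \Pr(w \mid z), and similarly for H, which is the PLSA parametrization up to scaling.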

Friday, August 6, 2010

Probabilistic Latent Semantic Analysis


This paper introduces a probabilistic model for the LSA problem. In traditional LSA, we have a word-document matrix (each column corresponds to a document, each row contains the counts of a certain word across documents). LSA takes an SVD of the count matrix and interprets the left singular vectors as latent topics. NMF might be more appropriate, since the bases it finds are nonnegative and can be seen as distributions over words.
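
For comparison, classical LSA is just a truncated SVD of the count matrix; a small sketch (my own naming, not from the paper):

import numpy as np

def lsa(X, k):
    """Classical LSA: truncated SVD of the word-document count matrix X
    (rows = words, columns = documents)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    topics = U[:, :k]                      # left singular vectors ~ latent topics
    doc_coords = np.diag(s[:k]) @ Vt[:k]   # documents in the latent space
    return topics, doc_coords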

This paper builds the first probabilistic model for the latent topics. The model is quite simple:
\Pr(w, d) = \sum_z \Pr(z) \Pr(d\mid z) \Pr(w \mid z)
which can be trained with the EM algorithm. The inference in this model is a bit awkward (there is no generative process for unseen documents), but we may simply use \Pr(w \mid z) for inference problems.
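
For completeness, the EM updates in the same notation (standard, as far as I remember the derivation), where n(d, w) is the count of word w in document d:

E-step: \Pr(z \mid d, w) = \frac{\Pr(z) \Pr(d \mid z) \Pr(w \mid z)}{\sum_{z'} \Pr(z') \Pr(d \mid z') \Pr(w \mid z')}

M-step: \Pr(w \mid z) \propto \sum_d n(d, w) \Pr(z \mid d, w), \quad \Pr(d \mid z) \propto \sum_w n(d, w) \Pr(z \mid d, w), \quad \Pr(z) \propto \sum_{d, w} n(d, w) \Pr(z \mid d, w)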

Later, LDA actually endows Dirichlet priors on the mixing proportions over topics and on the distributions of words.