Friday, August 6, 2010

Probabilistic Latent Semantic Analysis


This paper introduces a probabilistic model for the LSA problem. In traditional LSA, we have a word-document count matrix (each column corresponds to a document, and each row records the counts of one word across the documents). LSA computes an SVD of this count matrix and interprets the left singular vectors as latent topics. NMF is arguably more appropriate here, since the nonnegative bases it finds can be read directly as distributions over words.
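Here is a toy illustration of the SVD view (a minimal sketch with a made-up 4x4 count matrix; all names are mine, not from the paper):

    import numpy as np

    # Toy word-document count matrix: rows = words, columns = documents.
    X = np.array([[2, 0, 1, 0],
                  [1, 3, 0, 0],
                  [0, 1, 0, 2],
                  [0, 0, 2, 1]], dtype=float)

    # LSA: truncated SVD of the count matrix, keeping k latent "topics".
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    topics = U[:, :k]                         # word loadings per topic
    doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the latent space

    # Entries of `topics` can be negative, which is exactly why NMF's
    # nonnegative factors are easier to read as word distributions.
    print(topics)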

This paper builds the first probabilistic model for the latent topics. The model is quite simple:
\Pr(w, d) = \sum_z \Pr(z) \Pr(d \mid z) \Pr(w \mid z),
which can be trained with the EM algorithm. Inference in this model is a bit awkward: there is no generative story for documents outside the training set, so a new document has to be "folded in" by rerunning EM with the learned parameters held fixed. But we may simply reuse \Pr(w \mid z) for inference problems.
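To make the EM updates concrete, here is a minimal sketch of the symmetric model above in plain numpy (the count matrix and the function plsa_em are my own illustration, not code from the paper):

    import numpy as np

    def plsa_em(N, K, iters=100, seed=0):
        """EM for Pr(w,d) = sum_z Pr(z) Pr(d|z) Pr(w|z) on counts N (W x D)."""
        rng = np.random.default_rng(seed)
        W, D = N.shape
        Pz = np.full(K, 1.0 / K)                        # Pr(z)
        Pw_z = rng.random((W, K)); Pw_z /= Pw_z.sum(0)  # Pr(w|z)
        Pd_z = rng.random((D, K)); Pd_z /= Pd_z.sum(0)  # Pr(d|z)
        for _ in range(iters):
            # E-step: posterior Pr(z | d, w) for every pair, shape (W, D, K).
            joint = Pz[None, None, :] * Pw_z[:, None, :] * Pd_z[None, :, :]
            post = joint / joint.sum(axis=2, keepdims=True)
            # M-step: reweight the posteriors by the observed counts n(d, w).
            weighted = N[:, :, None] * post
            Pw_z = weighted.sum(axis=1)  # accumulate over documents
            Pd_z = weighted.sum(axis=0)  # accumulate over words
            Pz = Pw_z.sum(axis=0)
            Pw_z /= Pw_z.sum(axis=0, keepdims=True)
            Pd_z /= Pd_z.sum(axis=0, keepdims=True)
            Pz /= Pz.sum()
        return Pz, Pd_z, Pw_z

    N = np.array([[2, 0, 1, 0],
                  [1, 3, 0, 0],
                  [0, 1, 0, 2],
                  [0, 0, 2, 1]], dtype=float)
    Pz, Pd_z, Pw_z = plsa_em(N, K=2)
    print(Pw_z)  # each column is a word distribution for one latent topic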

Later, LDA endows the mixing proportions (the per-document distribution over topics and, in the smoothed variant, the per-topic distribution over words) with Dirichlet priors, which makes the model properly generative for unseen documents.
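For comparison, scikit-learn's LatentDirichletAllocation exposes both of those Dirichlet priors directly (a quick sketch on the same made-up count matrix; sklearn expects documents as rows, hence the transpose):

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    N = np.array([[2, 0, 1, 0],
                  [1, 3, 0, 0],
                  [0, 1, 0, 2],
                  [0, 0, 2, 1]])
    lda = LatentDirichletAllocation(n_components=2,
                                    doc_topic_prior=0.1,   # Dirichlet prior on topic mixing
                                    topic_word_prior=0.1,  # Dirichlet prior on word mixing
                                    random_state=0)
    doc_topics = lda.fit_transform(N.T)  # per-document topic proportions
    print(doc_topics)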
