Wednesday, July 8, 2009

Probabilistic Dyadic Data Analysis with Local and Global Consistency


This paper introduces a graph regularizer for PLSA. There are probably some interesting stories behind NLP algorithms such as PLSA and LDA (latent Dirichlet allocation), but I am not very familiar with them. From the modeling point of view, they deal with the same thing. We are given a term-document matrix X, whose elements are the counts of occurrences of word w_j in document d_i. The generative model says each document is a mixture of multinomial distributions: we have several latent topics z_k, and a document mixes these topics with mixing proportions \Pr(z_k \mid d_i). Each topic determines a distribution over words, namely \Pr(w_j \mid z_k). The PLSA algorithm simply uses EM to train this mixture model.
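To make the model concrete, here is a minimal NumPy sketch of the plain PLSA EM iteration under the notation above. This is my own illustration, not code from the paper; the names plsa_em, p_z_d, p_w_z are mine.

import numpy as np

def plsa_em(X, K, n_iter=100, seed=0):
    # X: (N, M) array of counts n(d_i, w_j); returns P(z|d) of shape (N, K) and P(w|z) of shape (K, M)
    rng = np.random.default_rng(seed)
    N, M = X.shape
    p_z_d = rng.random((N, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)        # \Pr(z_k | d_i)
    p_w_z = rng.random((K, M))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)        # \Pr(w_j | z_k)
    for _ in range(n_iter):
        # E-step: responsibilities \Pr(z_k | d_i, w_j), shape (N, M, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate both conditionals from expected counts n(d_i, w_j) \Pr(z_k | d_i, w_j)
        expected = X[:, :, None] * resp
        p_w_z = expected.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z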

LDA is not as far from PLSA as I had thought: it is just a Bayesian version of PLSA. The natural way to take a frequentist mixture model into the Bayesian setting is to add a Dirichlet prior on the mixing proportions of the latent variable, and the more sophisticated nonparametric Bayesian technique is to use a Dirichlet process as the prior. The fully Bayesian method needs the posterior distribution, which is approximated with variational inference.
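For comparison, the standard LDA generative process (written here in my own notation, with \theta_i standing for the topic proportions \Pr(z \mid d_i) and \beta_k for the word distribution \Pr(w \mid z_k)) is just
\theta_i \sim \mathrm{Dirichlet}(\alpha), \quad z_{in} \mid \theta_i \sim \mathrm{Multinomial}(\theta_i), \quad w_{in} \mid z_{in} = k \sim \mathrm{Multinomial}(\beta_k)
so dropping the Dirichlet prior and treating \theta_i as a free parameter essentially recovers PLSA.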

The two algorithms do not take local information into consideration. This paper simply adds a regularizer to the PLSA loss (the negative log-likelihood). The authors assume that if two documents are similar, their mixing proportions should also be similar. So if d_{i_1} and d_{i_2} are similar according to some rule (shared labels in the supervised case, or the cosine of the angle between the two tf-idf vectors in the unsupervised case), the two distributions \Pr(z \mid d_{i_1}) and \Pr(z \mid d_{i_2}) should be close. This closeness is measured with the symmetric KL divergence:
\min -\sum_{i = 1}^N \sum_{j = 1}^M n(d_i, w_j) \log\sum_{k = 1}^K \Pr(w_j \mid z_k) \Pr(z_k \mid d_i) + \frac{\lambda}{2} \sum_{i_1, i_2 = 1}^N W_{i_1, i_2} \big( D(\Pr(z \mid d_{i_1}) \parallel \Pr(z \mid d_{i_2})) + D(\Pr(z \mid d_{i_2}) \parallel \Pr(z \mid d_{i_1})) \big)
The optimization can still be solved with EM, with a small approximation. The difficult part is the maximization step, since the regularizer couples the \Pr(z \mid d_i) of neighboring documents and the update no longer has a simple closed form.
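As a sanity check on the formula, here is a small NumPy sketch that evaluates the regularized objective above for given parameters. It only illustrates the objective, not the paper's EM procedure; the names regularized_nll, W, lam (for \lambda) are mine.

import numpy as np

def regularized_nll(X, W, p_z_d, p_w_z, lam):
    # X: (N, M) counts n(d_i, w_j); W: (N, N) document similarity weights W_{i_1, i_2}
    # p_z_d: (N, K) rows are \Pr(z | d_i); p_w_z: (K, M) rows are \Pr(w | z_k)
    eps = 1e-12
    # PLSA term: -sum_{i,j} n(d_i, w_j) log sum_k \Pr(w_j | z_k) \Pr(z_k | d_i)
    p_w_d = p_z_d @ p_w_z
    nll = -np.sum(X * np.log(p_w_d + eps))
    # graph term: (lam / 2) * sum_{i1,i2} W_{i1,i2} (D(P_{i1} || P_{i2}) + D(P_{i2} || P_{i1}))
    log_p = np.log(p_z_d + eps)
    kl = np.einsum('ik,ijk->ij', p_z_d, log_p[:, None, :] - log_p[None, :, :])  # kl[i1, i2] = D(P_{i1} || P_{i2})
    return nll + 0.5 * lam * np.sum(W * (kl + kl.T))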

The results are quite good: in the unsupervised case, it beats NMF, PLSA, LDA, and even normalized cut.
