Wednesday, August 25, 2010

Parametric Embedding for Class Visualization


This paper addresses the problem of visualizing posterior probabilities of topic models. We have scanned another paper on how to visualize the documents via a probabilistic graphical model. Actually this paper is the inspiration of the latter.

By introducing an objective function resembling SNE, this paper solves an easier problem: the p table in SNE is now given, instead of computed via binary searches; the q table has a much smaller size.

Actually the latter paper decouples the number of topics with the embeding dimensions directly (but coupled with the number of clusters). We may observe quite similar phenomenon in this model.

As SNE, this model is also optimized with a gradient-descent method (alternatively, though).

Sunday, August 22, 2010

Probabilistic Latent Semantic Visualization: Topic Model for Visualizing Documents


This paper proposed a model based on LDA. But the Dirichlet prior is replaced with the probability generated by the latent coordinates of each documents and the topics. The paper only deals with MAP estimation of the latent coordinates, which can be solved via EM-like algorithm.

The learning is simple for the distribution of words conditioned on topics (analytic solution) while difficult for the latent coordinates due to the optimization (has to be solved via gradient-based numerical solutions).

The idea is interesting though. Instead of seeking a representation of the documents in the topic space learn by LDA-like models, the visualization is directly modeled via a probabilistic graphical model.

A Probabilistic Approach to Semantic Representation


This paper actually introduces a Gibbs sampler for LDA model. The Gibbs sampler, however, does not sample all latent variables. Only latent topics are sampled. I guess many DP models actually are making inferences in this way. But is this the so-called Bayesian way of learning? Sampling ?= learning. Well.... I don't know why they may do this. I have to see more...

Monday, August 9, 2010

Relation between PLSA and NMF and Implications


This paper's critical point is PLSA solve the NMF problem with a KL divergence loss, which can be easily seen if you are familiar with the two algorithms. The implications are:
  • The NMF algorithms can be interpreted as EM algorithms;
  • The NMF does not hold many advantages as pLSA.
  • But NMF might benefit with many loss functions for different purposes. 

Friday, August 6, 2010

Probabilistic Latent Semantic Analysis


This paper introduces a probabilistic model for LSA problem. In traditional LSA, we have a word-document matrix (each column correspond to a document, each row denotes the count of a certain word). The LSA employs a SVD of the count matrix and indicates that the left singular vectors are latent topics. NMF might be more appropriate since the bases found are nonnegative and can be seen as distributions of words.

This paper builds the first probabilistic model for the latent topics. The model is quite simple
\Pr(w, d) = \sum_z \Pr(z) \Pr(d\mid z) \Pr(w \mid z)
which can be trained with EM algorithm. The inference of this model is a bit awkward. But we may simply use \Pr(w \mid z) for inference problems.

Later, LDA actually endow Dirichlet priors to the mixing proportions to the topics and words.

Monday, July 26, 2010

Heavy-Tailed Symmetric Stochastic Neigbor Embedding


t-SNE is a very useful visualization algorithm, which inherits the idea from SNE but modifies the neighborhood probabilities to t-distributions instead of the original. This heavy-tailed distribution works pretty well on many data. This paper discusses a more general case.

After transforming the original problem into its Lagrange dual, the authors get a unified algorithm (fixed-point iteration) for a family of heavy tailed distribution (including t-distribution). This is somewhat not as interesting as I have expected.

The authors also discussed how to integrate supervised information into the SNE algorithm. But their idea is kind of rough: inserting a similarity computed from the supervised information into the similarity in the high-dimensional space. I don't know whether this would truly work...

Dirichlet Component Analysis: Feature Extraction for Compositional Data


This paper discusses a problem of extracting features from compositional data (nonnegative features that sums up to 1). The problem is interesting though I am not sure about the major difficulties. To get a proper projection into lower dimensional space, we have to conform to a certain set of constraints (balanced rearrangement). A regularization operator is devised to preserve the Euclidean geometry. The rearrangement will shrink the data while regularization expands them.

The optimization is quite strange (solved via genetic algorithm). The maximization w.r.t. \alpha looks like a MLE but the minimization w.r.t. the rearrangement matrix doesn't make much sense to me.

I am still unsure about the application's intention.