Tuesday, April 7, 2009

Hierarchical Dirichlet Processes


This paper is relevant to hierarchical modeling of MoE (hence related to HMoE). The DP extension of mixture models has also been studied in this paper (with variational approximation methods). The present paper uses only Monte Carlo methods for inference (instead of approximation algorithms). The authors simply introduce one DP prior for the parameters of the model that generates the data and another DP prior for the generation of groups. In short,
G_0 \mid \gamma, H \sim \mathrm{DP}( \gamma, H ), G_j \mid \alpha_0, G_0 \sim \mathrm{DP}( \alpha_0, G_0), \theta_{ji} \sim G_j, x_{ji} \sim F(\theta_{ji}).
Their idea of DP is implemented with SBP (as in this paper). However, we can only observe the x_{ji}. For a learning problem we must find proper settings for \gamma and \alpha_0. For inference, we must get the posterior distribution of the two DPs. Note that for a DP MoE, the base distribution of the DP is on the parameter space; in the HDP, the base distribution of the intermediate DP is itself a random measure, namely the draw G_0 from the top DP. Therefore the base distribution of the top DP is on the parameter space: sampling from the top DP results in \sum_{k = 1}^K \frac{m_k}{m + \gamma}\delta_{\phi_k} + \frac{\gamma}{m + \gamma} H, where m_k is the number of draws taking the value \phi_k, the \phi_k are drawn from H, and m = \sum_{k=1}^K m_k; the intermediate DP then groups the samples \theta_i according to \sum_{t = 1}^T \frac{n_t}{i-1 + \alpha_0} \delta_{\psi_t} + \frac{\alpha_0}{i-1 + \alpha_0} G_0, where n_t is the number of samples at atom \psi_t and T is the number of atoms currently in the group. A minimal simulation of these two urns is sketched below.
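
The following Python/NumPy sketch simulates the two predictive measures above for a single group. The function names, the choice H = N(0, 1), the likelihood F(\theta) = N(\theta, 1), and the values of \gamma and \alpha_0 are all illustrative assumptions, not specifics from the paper.

import numpy as np

rng = np.random.default_rng(0)

def sample_top_level(phis, counts, gamma):
    # Draw from sum_k m_k/(m+gamma) delta_{phi_k} + gamma/(m+gamma) H,
    # with H = N(0, 1) as an illustrative base measure.
    m = sum(counts)
    probs = np.array(counts + [gamma], dtype=float) / (m + gamma)
    k = rng.choice(len(probs), p=probs)
    if k == len(phis):                  # new atom phi_k ~ H
        phis.append(rng.standard_normal())
        counts.append(0)
    counts[k] += 1
    return phis[k]

def sample_in_group(psis, n_counts, alpha0, draw_from_G0):
    # Draw theta_{ji} from sum_t n_t/(i-1+alpha0) delta_{psi_t}
    # + alpha0/(i-1+alpha0) G0; draw_from_G0 samples from the
    # (random) top-level measure G0.
    i_minus_1 = sum(n_counts)
    probs = np.array(n_counts + [alpha0], dtype=float) / (i_minus_1 + alpha0)
    t = rng.choice(len(probs), p=probs)
    if t == len(psis):                  # new atom: its value comes from G0
        psis.append(draw_from_G0())
        n_counts.append(0)
    n_counts[t] += 1
    return psis[t]

# Generate a few x_{ji} for one group j, with F(theta) = N(theta, 1).
phis, m_counts = [], []                 # top-level atoms and their counts m_k
psis, n_counts = [], []                 # atoms psi_t of group j and counts n_t
for i in range(5):
    theta_ji = sample_in_group(psis, n_counts, alpha0=1.0,
                               draw_from_G0=lambda: sample_top_level(phis, m_counts, gamma=1.0))
    x_ji = theta_ji + rng.standard_normal()
    print(round(theta_ji, 3), round(x_ji, 3))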

Now we come to the details:
  • The SBP generates a set of v_i, \pi_i and \theta_i in the following way:
    v_i \sim \mathrm{Beta}(1, \alpha), \quad \pi_i = v_i \prod_{k = 1}^{i - 1}(1 - v_k), \quad \theta_i \sim G, \quad \Pr(\theta = \theta_i) = \pi_i.
    This is equivalent to a DP with parameters \alpha and G, and we may then generate data accordingly (see the first sketch after this list).
  • The HDP can be formulated as a variant of the CRP, the Chinese restaurant franchise: a customer chooses a dish (one already served, or a new one) and sits at a table (one serving that dish, or a new one). We can formulate the generation as follows: a customer \theta_i comes into the restaurant, sits at a table according to the intermediate DP, and (if it is a new table) orders a dish according to the top DP; see the second sketch after this list.
  • For inference, we use the SBP formulation and calculate the posterior of the hidden r.v.s. As we may see in the SBP formulation, we need an indicator variable (multinomially distributed) for each DP. The sampling is done with a Gibbs sampler; a simplified single-level sketch follows this list.
  • It can be used to enhance LDA (latent Dirichlet allocation) and HMMs.
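
A minimal Python/NumPy sketch of the stick-breaking construction from the first bullet, truncated at K atoms; the truncation level K, \alpha, and the standard normal base measure G are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, K=100):
    v = rng.beta(1.0, alpha, size=K)                 # v_i ~ Beta(1, alpha)
    # pi_i = v_i * prod_{k < i} (1 - v_k)
    pis = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    thetas = rng.standard_normal(K)                  # theta_i ~ G = N(0, 1)
    return pis, thetas

pis, thetas = stick_breaking(alpha=2.0)
# Sample observations: Pr(theta = theta_i) = pi_i (the stick mass beyond
# the truncation K is negligible for moderate K, so we renormalize).
idx = rng.choice(len(pis), size=10, p=pis / pis.sum())
print(thetas[idx])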
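
And a standalone sketch of the franchise itself from the second bullet, in which several restaurants (groups) share one menu so dishes (parameters) are reused across groups; the number of groups, \gamma, \alpha_0, and the normal base measure are again illustrative:

import numpy as np

rng = np.random.default_rng(3)
gamma, alpha0, J, N = 1.0, 1.0, 3, 8    # concentrations, groups, customers per group

dishes, m = [], []                       # franchise-wide menu and table counts m_k
for j in range(J):
    table_dish, n = [], []               # dish index per table, customers per table n_t
    for i in range(N):
        # sit at table t w.p. n_t/(i+alpha0), or at a new table w.p. alpha0/(i+alpha0)
        p = np.array(n + [alpha0]) / (i + alpha0)
        t = rng.choice(len(p), p=p)
        if t == len(table_dish):         # new table: order a dish from the franchise urn
            q = np.array(m + [gamma]) / (sum(m) + gamma)
            k = rng.choice(len(q), p=q)
            if k == len(dishes):         # brand-new dish: phi_k ~ H = N(0, 1)
                dishes.append(rng.standard_normal())
                m.append(0)
            m[k] += 1
            table_dish.append(k)
            n.append(0)
        n[t] += 1
    print(f"group {j}: dishes used = {sorted(set(table_dish))}")
print("menu size:", len(dishes))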
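
The paper's sampler works on the full two-level model; as a hint of the mechanics only, here is a minimal blocked Gibbs sketch for a single truncated DP mixture of Gaussians, in which the indicators z_i play the multinomial role mentioned in the third bullet. All hyperparameters and the conjugate normal likelihood are illustrative assumptions; this is not the HDP sampler from the paper.

import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])  # toy data
N, K, alpha, sigma, tau = len(x), 20, 1.0, 1.0, 5.0

mu = rng.normal(0, tau, K)          # component means, mu_k ~ N(0, tau^2)
z = rng.integers(0, K, N)           # indicator variables

for sweep in range(200):
    # 1) stick weights: v_k | z ~ Beta(1 + n_k, alpha + sum_{l>k} n_l)
    n = np.bincount(z, minlength=K)
    tail = np.concatenate([np.cumsum(n[::-1])[::-1][1:], [0]])
    v = rng.beta(1 + n, alpha + tail)
    v[-1] = 1.0                     # close the stick at the truncation level
    pi = v * np.concatenate([[1.0], np.cumprod(1 - v)[:-1]])
    # 2) indicators z_i | pi, mu (multinomial over the K components)
    loglik = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
    p = pi[None, :] * np.exp(loglik - loglik.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p[i]) for i in range(N)])
    # 3) atoms mu_k | z, x (conjugate normal posterior, prior mean 0)
    n = np.bincount(z, minlength=K)
    sums = np.bincount(z, weights=x, minlength=K)
    prec = 1 / tau**2 + n / sigma**2
    mu = rng.normal(sums / sigma**2 / prec, np.sqrt(1 / prec))

occupied = np.unique(z)
print(occupied.size, "occupied components; means:", np.round(np.sort(mu[occupied]), 2))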
