Monday, November 23, 2009

Two-view Feature Generation Model for Semi-supervised Learning


We first take a look at their logic: for semi-supervised learning, a generative model is usually preferred, since unlabeled data help estimate the marginal distribution \Pr(x). In a Bayesian MAP formulation, we are actually solving
\min_\alpha \; -\sum_i \log \Pr(y_i \mid \alpha, x_i) \;-\; \log \Pr(x_u \mid \alpha) \;-\; \log \Pr(\alpha)
which is a little different from a direct generative model. Here the first term is actually a discriminative term, and the remaining terms are a penalty from the unlabeled part plus a prior (so it is more similar to a ``supervised loss + penalty'' model). This paper does talk about models of the latter kind, using auxiliary problems.
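
To make the shape of this objective concrete, here is a minimal sketch in Python. The modeling choices (one-dimensional Gaussian class conditionals with unit variance and equal class priors, plus a Gaussian prior on the parameters) are my own illustrative assumptions, not the paper's model; the point is only the three-term structure of discriminative term + unlabeled penalty + prior.

# A minimal sketch of the MAP objective: supervised term + unlabeled penalty + prior.
# The model here (1-D Gaussian class conditionals with means alpha[0], alpha[1],
# unit variance, equal class priors) is an illustrative assumption, not the paper's model.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_p_x_given_class(x, mu):
    return norm.logpdf(x, loc=mu, scale=1.0)

def objective(alpha, x_lab, y_lab, x_unlab, prior_scale=10.0):
    mu = alpha  # alpha = (mu_0, mu_1), one class-conditional mean per class
    # -sum_i log Pr(y_i | alpha, x_i): class posterior under the generative model
    log_joint = np.stack([log_p_x_given_class(x_lab, mu[0]),
                          log_p_x_given_class(x_lab, mu[1])], axis=1) + np.log(0.5)
    log_post = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
    sup = -log_post[np.arange(len(y_lab)), y_lab].sum()
    # -log Pr(x_u | alpha): marginal likelihood of the unlabeled points
    log_joint_u = np.stack([log_p_x_given_class(x_unlab, mu[0]),
                            log_p_x_given_class(x_unlab, mu[1])], axis=1) + np.log(0.5)
    unsup = -logsumexp(log_joint_u, axis=1).sum()
    # -log Pr(alpha): Gaussian prior on the parameters
    prior = -norm.logpdf(mu, loc=0.0, scale=prior_scale).sum()
    return sup + unsup + prior

# In practice one would minimize this over alpha, e.g.
#   from scipy.optimize import minimize
#   minimize(objective, x0=np.array([-1.0, 1.0]), args=(x_lab, y_lab, x_unlab))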

The two-view model means that, analogous to co-training, we have two views of the feature vector x, namely z_1(x) and z_2(x), which are independent conditioned on the label. What is different about this model is that in order to compute \Pr(y \mid z_1, z_2), we need \Pr(y \mid z_1) and \Pr(y \mid z_2). For now we only consider \Pr(y \mid z_1). One possibility is to make a low-rank decomposition \Pr(z_2 \mid z_1) = \sum_y \Pr(z_2 \mid y) \Pr(y \mid z_1), but the LHS is sometimes impossible to compute. An approximation is to encode z_2 with a set of binary labels t_1^k(z_2). Then \Pr(t_1^k \mid z_1) = \sum_y \Pr(t_1^k \mid y) \Pr(y \mid z_1) can be computed. By increasing the number of related binary labels t_1^k, we may obtain a good estimate of \Pr(y \mid z_1); a small sketch of this idea follows.
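
Here is a small sketch of how the auxiliary labels pin down \Pr(y \mid z_1). This is my own construction, not the paper's estimator: for one fixed value of z_1, suppose we have already estimated A[k, y] = \Pr(t_1^k = 1 \mid y) (e.g. from labeled data) and b[k] = \Pr(t_1^k = 1 \mid z_1) (e.g. from unlabeled co-occurrence counts); the identity above then becomes a small linear system b ≈ A q in the unknown q[y] = \Pr(y \mid z_1).

# A minimal sketch (not the paper's estimator): recover Pr(y | z1) for one fixed z1
# from Pr(t^k | z1) = sum_y Pr(t^k | y) Pr(y | z1), treated as a small linear system.
# A and b are assumed to be estimated elsewhere (empirical frequencies).
import numpy as np
from scipy.optimize import nnls

def posterior_from_auxiliary(A, b):
    """A: (K, C) matrix of Pr(t^k=1 | y); b: (K,) vector of Pr(t^k=1 | z1)."""
    q, _ = nnls(A, b)          # non-negative least squares: b ~ A q, q >= 0
    if q.sum() > 0:
        q = q / q.sum()        # renormalize so q behaves like a distribution over y
    return q                   # q[y] approximates Pr(y | z1)

# Toy usage: K = 4 auxiliary binary labels, C = 2 classes.
A = np.array([[0.9, 0.2],
              [0.8, 0.3],
              [0.1, 0.7],
              [0.2, 0.9]])
b = A @ np.array([0.3, 0.7])   # consistent with Pr(y | z1) = (0.3, 0.7)
print(posterior_from_auxiliary(A, b))   # approximately [0.3, 0.7]

With more (and more informative) auxiliary labels t_1^k, the system becomes better conditioned, which is the sense in which adding related binary problems sharpens the estimate of \Pr(y \mid z_1).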

They proposed two models (one linear and the other log-linear, which resemble linear regression and logistic regression in a way). The linear version coincides with the SVD-ASO model in their JMLR paper. The log-linear model is solved via an EM-like algorithm.
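
For intuition about what "EM-like" means here, below is a generic EM skeleton for a two-view model with discrete features that are conditionally independent given y. It is not their algorithm (the names, shapes, and update rules are my own assumptions); it only shows how labeled data contribute hard counts while unlabeled data contribute soft counts from the E-step posterior \Pr(y \mid z_1, z_2) \propto \Pr(y)\Pr(z_1 \mid y)\Pr(z_2 \mid y).

# Generic EM skeleton for a two-view conditional-independence model with
# integer-coded discrete features z1, z2 -- an illustration, not the paper's algorithm.
import numpy as np

def em_two_view(z1_l, z2_l, y_l, z1_u, z2_u, n_classes, n_vals1, n_vals2,
                n_iter=50, smooth=1.0):
    rng = np.random.default_rng(0)
    # responsibilities (soft labels) for unlabeled points, initialized at random
    r = rng.dirichlet(np.ones(n_classes), size=len(z1_u))
    for _ in range(n_iter):
        # M-step: re-estimate Pr(y), Pr(z1|y), Pr(z2|y) from
        # hard labeled counts plus soft unlabeled counts
        py = np.bincount(y_l, minlength=n_classes) + r.sum(axis=0) + smooth
        p1 = np.full((n_classes, n_vals1), smooth)
        p2 = np.full((n_classes, n_vals2), smooth)
        np.add.at(p1, (y_l, z1_l), 1.0)
        np.add.at(p2, (y_l, z2_l), 1.0)
        for c in range(n_classes):
            np.add.at(p1[c], z1_u, r[:, c])
            np.add.at(p2[c], z2_u, r[:, c])
        py /= py.sum()
        p1 /= p1.sum(axis=1, keepdims=True)
        p2 /= p2.sum(axis=1, keepdims=True)
        # E-step: posterior over y for unlabeled points under current parameters
        logr = np.log(py) + np.log(p1)[:, z1_u].T + np.log(p2)[:, z2_u].T
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
    return py, p1, p2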

The question is: what kind of binary auxiliary functions would be essential to our semi-supervised problems? This might be a key to understanding their JMLR paper on multi-task learning.
