Thursday, February 26, 2009

Optimal Information Processing and Bayes' Theorem


It is really quite simple. The inputs are the likelihood Pr(x | θ) and the prior π(θ), and the outputs are the posterior Pr(θ | x) and the marginal Pr(x). The author proposes to minimize the output information (in the log sense), i.e. a KL divergence, and this leads to Bayes' theorem.
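As a hedged sketch of the argument (my own notation, not necessarily the paper's): among all candidate output densities q(θ), the one minimizing the KL divergence to the (normalized) product of the inputs is exactly the posterior,

\[
q^{*}(\theta) \;=\; \arg\min_{q}\, \mathrm{KL}\!\left( q(\theta) \,\middle\|\, \frac{\pi(\theta)\,\Pr(x \mid \theta)}{\Pr(x)} \right) \;=\; \frac{\pi(\theta)\,\Pr(x \mid \theta)}{\int \pi(\theta')\,\Pr(x \mid \theta')\, d\theta'},
\]

since the KL divergence is non-negative and vanishes only when the two densities coincide; that is, Bayes' theorem drops out as the optimum.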

Well, is this justification right?

The Infinite Gaussian Mixture Model


This is an even earlier paper on DP. Rasmussen employs the CRP, though, which is a little different from the SBP. He first studies the finite mixture model, then takes the number of components to infinity and modifies the mixing proportions correspondingly. Inference is then done by MCMC. I am not quite clear why he sets up such a complicated model -,-

Variational Inference for Dirichlet Process Mixtures


This is the first DP paper I read. I heard about DP a long time ago but hadn't taken the time for it until recently. This paper shows how to use (global) variational inference for DP mixtures of exponential-family distributions.

The thing about DP is quite peculiar. The formal definition of DP does not directly yield a model we can compute with; instead, several related processes are employed, e.g. the Chinese restaurant process and the stick-breaking process. In this paper the latter is adopted. One difficulty in understanding DP is the posterior distribution. For a mixture, we have several experts whose parameters come from a common space, which is endowed with a prior H. The DP simply works on this space. A DP is a stochastic measure, which means that the measure (probability) it assigns to a measurable set (event) is itself random. For a DP, given any finite measurable partition, the vector of probabilities of the events in the partition is Dirichlet distributed (with parameter αH evaluated on the partition). Therefore the posterior distribution of the DP is again of this form, with several delta functions added at the observed atoms.

To solve this problem, the proposed solution can be interpreted as a global variational approximation or as a mean-field analogue of Gibbs sampling. To see why, we have to use the stick-breaking process: first we generate v_j from a Beta(1, α) distribution; the approximation uses a truncated version (let T be the truncation level). From the v_j we can compute the mixing proportions π. We also generate η_j from H. The observation x_i is generated by taking π as the parameter of a multinomial distribution to select an index j, then sampling from the component with parameter η_j.
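A minimal sketch of this generative process in code (the choice of H as a Gaussian and the unit-variance emission are illustrative assumptions of mine, not the paper's):

```python
import numpy as np

def truncated_stick_breaking(alpha, T, rng):
    """Draw mixing proportions pi_1..pi_T from a truncated stick-breaking prior."""
    v = rng.beta(1.0, alpha, size=T)     # v_j ~ Beta(1, alpha)
    v[-1] = 1.0                          # truncation: the last stick takes all that remains
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                 # pi_j = v_j * prod_{l<j} (1 - v_l)

rng = np.random.default_rng(0)
T = 20
pi = truncated_stick_breaking(alpha=1.0, T=T, rng=rng)
eta = rng.normal(0.0, 5.0, size=T)       # eta_j ~ H, here H = N(0, 25) (an assumption)
z = rng.choice(T, size=100, p=pi)        # multinomial selection of component indices
x = rng.normal(eta[z], 1.0)              # observations from the selected components
```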

With this model, they propose a factorized posterior, and following the idea of global variational methods we maximize the variational bound coordinate by coordinate (each coordinate is one parameter of the approximate posterior). This resembles the Gibbs sampling procedure; the difference is that we use the mean (the variational parameter is usually a first-order moment) instead of drawing a sample. They compare the two in later experiments.

Tuesday, February 24, 2009

Approximations for Binary Gaussian Process Classification


This paper compares several GPC algorithms for binary classification (multi-class algorithms are more difficult to design, I guess). I had read about some of them in PRML and in Rasmussen's book on GPs, so I might not cover all of them here. The comparison shows that EP might be the most desirable algorithm for training a GPC.

First let's see the idea of using a GP for classification. It comes from probit regression. Just as in GP regression, here we simply use a GP to replace the parametric form <w, x>. The rest of the game is the link function, as in probit regression. The common choices are the logit and the CDF of a standard Gaussian; sometimes the latter is more convenient for computation (whereas in the parametric setting logistic regression is easier than probit regression). The label y in this literature usually takes the values +1 and -1 (unlike logistic regression, where 0 and 1 are more common), which also makes the setup harder to extend to multi-class cases. Let f be the Gaussian process; the likelihood is usually Pr(y | f) = sig(y f), where sig(.) is the link function.

The difficulty in GPC is that the posterior can't be computed in closed form (due to the link function), so we seek suitable approximations for it. Usually we simply find some Gaussian to approximate it, and basically we can do this globally or locally.

The simplest idea is the Laplace approximation. We take the logarithm of the posterior (without the normalization factor); a Taylor expansion keeping the first- and second-order terms yields the mean and covariance of the approximating Gaussian (the mode can be found with the Newton-Raphson method). This is obviously global, and it works poorly most of the time.
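A minimal sketch of that Newton iteration for the probit link, following the generic recipe in Rasmussen and Williams' book (my simplification, not the paper's code):

```python
import numpy as np
from scipy.stats import norm

def laplace_mode(K, y, n_iter=20):
    """Newton iteration for the mode of the GPC posterior, probit link.
    K: n x n covariance (Gram) matrix; y: labels in {+1, -1}."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        z = y * f
        ratio = norm.pdf(z) / norm.cdf(z)
        grad = y * ratio                   # d/df_i log Phi(y_i f_i)
        W = ratio * (ratio + z)            # -d^2/df_i^2 log Phi(y_i f_i), positive
        sqrtW = np.sqrt(W)
        B = np.eye(n) + sqrtW[:, None] * K * sqrtW[None, :]
        b = W * f + grad
        a = b - sqrtW * np.linalg.solve(B, sqrtW * (K @ b))
        f = K @ a                          # updated mode estimate
    return f, W  # approximate posterior: N(f, (K^{-1} + diag(W))^{-1})
```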

Expectation propagation (EP) originates from ADF (assumed-density filtering). It is a little complicated to fully explain its idea, but it's worth mentioning that we don't know whether it really decreases the free energy or converges, yet it yields the best results compared with the other methods. It is local.

KL divergence minimization (KL) minimizes the KL divergence between the proposed Gaussian and the true posterior, thereby increasing the marginal likelihood (with f as the latent variable). We simply take derivatives w.r.t. the proposed parameters. It is global.

The variational bound (VB) is something like a local variational method, which seeks a lower bound on the difficult parts of the posterior. Maximizing this lower bound increases the likelihood. It is local, since for each sample we find the best (log-quadratic) lower bound separately.

The factorized variational method (FV, related to the global variational method in PRML) assumes the posterior factorizes; minimizing the KL divergence between the proposed distribution and the true posterior yields the form of each factor. By updating the parameters one by one, we may find the most appropriate configuration. (It's dubious to call it local.)

Label regression (LR) is even simpler: it treats classification as regression.

They also implement MCMC to obtain ``accurate'' reference results. I am very interested in their implementation and might use it in later work.

Wednesday, February 18, 2009

Is Learning the n-th Thing Any Easier than Learning the First?


This is the first paper I read about transfer learning, and it does not look that difficult. The idea is simple: previously learned tasks should benefit the current one. The problem is how to put the knowledge already acquired into the current model.

The concept it uses is support sets X_k, which were used to train the previous discriminative functions f_k. The author compares several algorithms. Memory-based learning algorithms:
  • Nearest neighbor and Shepard's method: these cannot incorporate the support sets.
  • Learning a new representation: use a neural network to find a function g that maps data into another space such that, on the support sets, samples from the same class end up near each other while those from different classes end up far apart.
  • Learning a distance function: use a neural network to find a ``distance'' g such that, on the support sets, pairs from the same class score high (value 1) and pairs from different classes score low (value 0). With this distance and a memory-based method, we can classify in the new problem (see the sketch below).
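A tiny sketch of the inference step with such a learned distance (g here is a hypothetical stand-in for the trained network; the pairwise training on the support sets is omitted):

```python
import numpy as np

def classify_with_learned_distance(x_new, memory_x, memory_y, g):
    """Memory-based prediction: g(a, b) is assumed pre-trained to be near 1
    for same-class pairs and near 0 for different-class pairs."""
    scores = np.array([g(x_new, m) for m in memory_x])
    return memory_y[int(np.argmax(scores))]  # label of the most same-class-like neighbor
```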

He also studies neural network approaches; the base algorithm is a feedforward NN trained with backpropagation (BP).
  • Learning with Hints: share the weights across all learning problems.
  • Explanation-Based Neural Network Learning: I need to read more about it.

In the experiments, the author shows that even with a limited number of training examples, the transferred learner generalizes better. This might be the first transfer learning paper.

Tuesday, February 17, 2009

The Recurrent Temporal Restricted Boltzmann Machine


This paper is a follow-up to the earlier motion capture paper, which did not include many details about the model (at least I didn't understand the dynamic biases). Anyway, I found more details in this paper. The dynamic bias is actually a cross term of h_t and h_{t-1}; given h_{t-1}, it functions as a bias term.

The model proposed in this paper is a bit strange; I will read something about recurrent networks first. From the optimization point of view, it uses BP instead of CD learning (exact vs. stochastic approximation).

Anyway, I will return to this paper later. It is really weird to build a temporal model with an RBM, compared with HMMs and CRFs.

Implicit Mixtures of Restricted Boltzmann Machines


This is quite a simple extension. It employs a third-order RBM instead of a mixture of RBMs (mainly due to the complexity of training RBMs). Usually we use v for the input and h for the hidden units; now add another variable z for clustering. While v and h are binary, z is a multinomial r.v. (so that it corresponds to the cluster index). This model can be trained with CD. Given z and h, sampling v is simple; the only thing to consider is how to sample h and z given v. This is still easy. One way is to find z first (it has a number of configurations linear in the number of clusters, instead of exponential): by marginalizing out h, we get the free energy for each value of z, and a softmax (just a normalization of the probabilities) gives us the distribution of z. Then we can sample h.
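A sketch of that sampling step under assumed shapes (one weight matrix per component; I omit the visible biases for brevity, so this is not the paper's exact parameterization):

```python
import numpy as np

def sample_z_given_v(v, W, c, temperature=1.0, rng=np.random.default_rng(0)):
    """Sample the cluster index z given visible v in a third-order RBM.
    W: (K, n_visible, n_hidden) weights; c: (K, n_hidden) hidden biases."""
    K = W.shape[0]
    # free energy with h marginalized out: F(v, z=k) = -sum_j softplus(c_kj + (v @ W_k)_j)
    free_energy = -np.array([np.sum(np.logaddexp(0.0, c[k] + v @ W[k])) for k in range(K)])
    logits = -free_energy / temperature    # lower free energy => more probable component
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # softmax over the K components
    return rng.choice(K, p=p)
```

The temperature argument is where the annealing trick below plugs in.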

Therefore, the third-order RBM is trained with CD. They address one problem of the training: one component RBM might become too strong, so the others would never be chosen. They borrow a technique from annealing: by dividing the free energy by a temperature T (high at first, then decreasing), we can avoid this problem.

Bayesian Conditional Random Fields


Well, as the title of the paper suggests, we use priors and compute posteriors instead of tuning the parameters. I guess we could even use the GP-LVM for the features' part (since the model is something like a log-linear model: take part of the parameters, call them r.v.s, and it becomes reasonable to put a GP on them). The last author makes me think the final solution must be obtained by EP (the posterior can't be computed exactly, as in Bayesian logistic regression). So we have a general idea from a first glance.

Then you might wonder why bother with a Bayesian framework for this problem. The common answer is to avoid overfitting. There are comparisons of Bayesian methods with frequentist ones; some believe that for problems with limited data, Bayesian methods have the edge. As data increase, frequentist models with regularization also work well, and when we have enough data we can forget about all those tricky parts.

Since I have not tried EP on my own, I just note the tricky parts here and will check them later:
1. they drop the exponential part (I don't understand why the CDF of a Gaussian works here, since now it is something like a distribution on transitions; remember the label bias problem?);
2. a Gaussian prior, as expected;
3. EP or power EP (I will check later);
4. they estimate the normalization factor without MCMC;
5. they flatten the approximation structure (due to non-positive definiteness);
6. the speed is even higher than CRF (are the data too limited?).

1. The old model is log-linear; here a Gaussian CDF will not cause the label bias problem, since it is not a density over y_i. Gaussian CDFs are also easier to use with EP (in the step that updates the moments of the approximate posterior).

3. It looks very simple; now it is not the KL divergence that is minimized (power EP minimizes a more general divergence).

Monday, February 16, 2009

Max-Margin Markov Networks


As we see in CME and CRF, the features play the same role as in SVM; the parameters are learnt from data through a logistic-style model. MLE is just one way of training discriminative models; another widely used rule is max margin. How can we model a structured output as in CRF (namely, an MRF) with margins?

The idea comes from multi-class SVM: the feature vector of the true label should be separated from the vectors of all other configurations by at least γ. But now each sequence has a variable length, so it is reasonable to scale the margin inequality: the scale is simply the number of label errors. As in SVM, introducing slack variables for the inseparable case gives us the margin + loss expression. This optimization is still convex, but there are far too many constraints: in multi-class discrimination the configurations are comparatively few (c classes give c - 1 constraints per sample), but for a sequence of l labels, each taking one of c values, we have c^l constraints per sample.
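A hedged sketch of the resulting program (my notation): with Δf_i(y) = f(x_i, y_i) - f(x_i, y) and Δt_i(y) the number of label errors in y,

\[
\min_{w,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C \sum_i \xi_i
\qquad \text{s.t.} \qquad
w^{\top} \Delta f_i(y) \;\ge\; \Delta t_i(y) - \xi_i \quad \forall i,\ \forall y,
\]

so the required margin grows with how wrong the competing labeling is.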

To tackle this problem, they convert it into the dual. The dual variables can be regarded as a density (up to a scale). Although they are many, not every single variable must be computed: the structure of the MRF determines which combinations are necessary, and grouping them into edge dual variables and vertex dual variables gives a polynomial number of optimization variables. But now we have to enforce consistency: each vertex variable must be the marginal of the related edge variables. Then we can convert the dual back into the primal, and the optimization is done in SVM fashion, e.g. with SMO.

Sunday, February 15, 2009

An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation


This is a paper about several experiments that verify the effectiveness of deep architectures. One thing I found interesting is the autoassociator, which is a little different from Hinton's autoencoders. I am still considering how to add supervised information to the greedy training of DBNs. Once done, it might be useful for several problems. Hope I can figure it out ASAP.

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

by John Lafferty, Andrew McCallum and Fernando Pereira

A CRF is simply an MRF conditioned on the observation. In this paper they propose a chain MRF for modeling sequential data (actually more complex structures can be used, as in Bayesian nets, e.g. higher-order Markov chains in HMM or MEMM). CRF combines the merits of CME and other models: 1) human-designed features can be used; 2) the ME learning is convex, unlike HMM. It also avoids the label bias problem. That is, in a Markov model with few branches, since the conditional probability is normalized per transition, a transition can receive a large probability even when it is wrong; the model simply ignores the current observation. Then there is little discrimination between the right decoding and a wrong one (since each transition in either decoding has a large probability). This happens because the directed model constrains each transition to be a probability distribution, resulting in a competition among states at each transition. A better way is to model the whole sequence as one distribution (i.e. an undirected model is more suitable): potential functions have no local normalization requirement, which leaves the competition to whole sequences of transitions instead of individual states.

This observation is key to the design of CRF. Given a chain MRF, the training procedure involves not only per-state features f(x, y_t) as in CME, but also transitional features f(x, y_t, y_{t+1}) as in MEMM. By the representer theorem, the form of the solution suggests that GIS, IIS and the other optimization algorithms applicable to CME also apply to CRF. The optimization is no harder, except that the normalization factor now depends on the whole sequence (unlike MEMM, where it depends only on the current observed data). I will check this derivation manually soon.
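For concreteness, the standard chain-CRF form (from memory, not copied from the paper):

\[
\Pr(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y_t, y_{t+1}, x) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y'_t, y'_{t+1}, x) \Big),
\]

where Z(x) sums over all label sequences, hence the whole-sequence normalization.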

Maximum Entropy Markov Models for Information Extraction and Segmentation


MEMM is a modification of HMM such that: 1) we can use features in the model; 2) it is a discriminative model instead of a generative one. After all, it is a Markov model trained in the ME way.

There are some important points about CME in NLP I want to note here: 1) ME learning leads to an MRF; 2) in NLP applications it is a zero-order CRF; 3) the dual problem of ME is MLE, which can be optimized in many ways, e.g. GIS (something like coordinate descent), IIS, gradient-based search or L-BFGS.

In MEMM, unlike HMM where each observation is emitted by a Markov chain (depending only on the current latent state), each latent state is determined by the current observation as well. This graphical model implies a factorization in which we need to model Pr(y_{t+1} | y_t, x_{t+1}). For simplicity, we just write Pr(y | x), since y_t is given (e.g. state 1) and we focus on this case. We use a feature-based method to model it: let f_a(x, y) be a feature, where a is a pair indicating a binary predicate on x and another on y; the feature fires when both are satisfied. This is the same as CME. Then for each transition we have a probability model (see below).
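Concretely, each transition model is a locally normalized exponential (maximum-entropy) model; in my notation,

\[
\Pr(y_{t+1} \mid y_t, x_{t+1}) \;=\; \frac{1}{Z(x_{t+1}, y_t)} \exp\Big( \sum_{a} \lambda_a\, f_a(x_{t+1}, y_{t+1}) \Big),
\]

with one such model per previous state; this per-transition normalization is exactly what gets blamed for the label bias problem discussed above.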

MEMM is trained per transition type (one model per starting state). Then, given a sequence of observations, the decoding (finding the y's) can be computed with dynamic programming as in HMM, only maximizing the conditional probability instead. If we do not have the labels y in training, we must use something like the Baum-Welch algorithm.

The later CRF model is simply an undirected version of MEMM. Some other work in NLP suggests MEMM is more expressive than the Moore HMM (the one we commonly use) but less so than the Mealy HMM.

Diffusion Kernels on Graphs and Other Discrete Structures


This paper introduces a diffusion kernel for graphs. First, notice that the exponential of any symmetric matrix is positive definite and is therefore a Gram matrix. Then consider a diffusion process: at each time step, each node gives a proportion α of its value to its neighbors. In matrix form this corresponds to the negative Laplacian of the graph, which we simply use as the generator of the kernel (the diffusion kernel).
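A minimal sketch (the path-graph example is my own illustration):

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta=0.5):
    """Diffusion kernel K = exp(beta * H) for an undirected graph with
    adjacency matrix A, where H = A - D is the negative graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A      # graph Laplacian D - A
    return expm(-beta * L)              # exponential of a symmetric matrix: a Gram matrix

# a path graph on 4 nodes
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
K = diffusion_kernel(A, beta=1.0)
```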

The authors study its continuous form, its relationship to random walks on graphs, and cases on special graphs. The interesting findings are: 1) the Gaussian kernel on graphs is a special case of the diffusion kernel; 2) its applications to images and strings.

Friday, February 13, 2009

Generative versus Discriminative Training of RBMs for Classification of fMRI Images


This is much simpler than the previous RBM papers. It simply trains one RBM for each class, then uses Bayes' theorem to compute the posterior probability for classification (assuming an equal prior for each class). The only difficult part is the normalization factors: they learn this ratio discriminatively. In the discriminative learning, the expression is something like a logistic function, and the partial derivatives can be calculated very easily.

They carry out the experiments on fMRI data, on which logistic regression with regularization does not work well.

Saturday, February 7, 2009

Spectral Matting


This work is an extension of the previous paper. The idea is that since a graph with k components has k zero-valued eigenvalues, we can use the subspace of the smallest eigenvalues of a graph as an approximation of the subspace of zero-valued eigenvalues in the ideal case. The problem is that although we can get the subspace, the basis might not be the right one for matting. In practice we compute many eigenvectors and then select a smaller basis in the subspace for the matting components. The last step is to combine several matting components to get the final matte.

Then take a look at how to select a suitable basis. The objective should describe the sparsity of the components, under several constraints. This is a non-convex optimization, solved with Newton's method (taking the constraints into account) initialized by k-means. The combination of matting components is comparatively easy, since there are few components and a brute-force search will do us good.

A Closed Form Solution to Natural Image Matting


Matting is an interesting problem, very useful in digital image processing. A photo can be interpreted as a layer of foreground (including the subject) over a layer of background. The matting problem is to recover a foreground image, a background image, and the proportion of each pixel belonging to the foreground (something like an alpha channel). It is different from image segmentation (it is more like a soft classifier), though it can also be solved with spectral-clustering machinery.

The idea is quite simple. The alpha value c_i at pixel i is assumed to be a linear combination of the channel intensities (RGB): c_i = a_r I_i^r + a_g I_i^g + a_b I_i^b + b. So we minimize the sum of squared residuals with a regularizer ε‖a‖². This can be formulated as a quadratic form with a graph Laplacian matrix; therefore, given some constraints (e.g. user input), the matting problem can be solved as an eigenvalue problem.
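A hedged sketch of the local cost (my notation; w_j is a small window around pixel j, with one set of coefficients per window):

\[
J(c, a, b) \;=\; \sum_{j} \Bigg( \sum_{i \in w_j} \Big( c_i - \sum_{ch \in \{r,g,b\}} a_j^{ch} I_i^{ch} - b_j \Big)^{2} + \varepsilon \|a_j\|^{2} \Bigg),
\]

and eliminating each (a_j, b_j) in closed form leaves a quadratic c^T L c in a ``matting Laplacian'' L, so together with the user constraints the problem reduces to a sparse linear / eigenvalue problem.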

Thursday, February 5, 2009

Sparse Deep Belief Net Model for Visual Area V2


In fact the algorithm is quite simple. The sparsity is enforced by a regularizer on the conditional expectations of the hidden units (I don't know why that exact form works). This does not modify the CD learning process much: we simply add another update of the parameters, which is deterministic and very easy to compute.
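If I recall the penalty correctly (hedged; p is a small target activation and m the number of training cases), it is something like

\[
\lambda \sum_{j} \Big( p \;-\; \frac{1}{m} \sum_{l=1}^{m} \mathbb{E}\big[ h_j \mid v^{(l)} \big] \Big)^{2},
\]

whose gradient is deterministic, so it just contributes one extra, cheap term to each CD parameter update.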

The paper argues that the sparse RBM and DBN can learn Gabor-like filters from natural images, and the resulting corner-oriented filters are believed to resemble responses in the V2 area of the human cortex.

Wednesday, February 4, 2009

Semi-supervised Learning via Gaussian Processes


The authors present a way of modifying a discriminative model so it can use unlabelled data. As we know, the parameters are conditionally independent of the unlabelled data; they propose to add a null-category variable to the model, which ensures a certain dependency between the parameters and the unlabelled data.

The main idea is that unlabelled data get a 0 label, and the parametric model is replaced by a Gaussian latent variable, so the dependency x -> y -> z becomes x -> f -> y -> z. The nonparametric model needs hyperparameter tuning (EM to maximize Pr(y | x, z)), and inference is done with EP, as in a GPC. The conditional probability Pr(y = 0 | f) places low probability on a sample being unlabelled when the magnitude of f is large. Pr(z | y = 0) is simply 0-1 (an indicator of being labelled), and Pr(z | y = 1 or -1) is defined as a Bernoulli variable. Then our model is done.

Why does it work anyway? For the labelled data, positive samples should have f positive with a large absolute value, and negative samples f negative with a large absolute value. For unlabelled data we don't have y, and we use the conditional posterior Pr(y | x, z, D), which involves setting z = 1 (instead of 0, as for unlabelled samples) and Pr(z = 1 | y = 0) = 0. Then the posterior can be computed.

I have to do some experiments on non-parametric Bayesian methods to gain more intuitive insight.

Tuesday, February 3, 2009

Modeling Human Motion Using Binary Latent Variables


This paper introduces a method that enables RBMs to deal with temporal data. The proposed concept is the conditional RBM (CRBM): adding two flavors of directed edges to the (undirected) RBM model. I am not sure how they designed the whole model; it is said the two flavors of connections are treated as dynamic biases, and the update rule is quite similar to that of the biases in a normal RBM (see the sketch below).
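My reading of the dynamic biases (hedged; v_{<t} stacks the previous few visible frames):

\[
\hat{a}_i = a_i + \sum_{k} A_{ki}\, v_{<t,k},
\qquad
\hat{b}_j = b_j + \sum_{k} B_{kj}\, v_{<t,k},
\]

so, conditioned on the history, the model is an ordinary RBM whose biases are shifted by the directed connections, and CD applies unchanged.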

The point is that, given the previous frames, it is now possible to influence the distribution of the current frame (I guess that is where the `conditional' term comes in). The model does something quite similar to HMM: both are generative models for temporal data.

In Hinton's course, he suggests a stacked version of CRBM (just like DBN to RBM).

I am not sure whether this idea is really good. It looks reasonable but I have no idea about the details. I will put it there and wait for new progress.

Classification using Discriminative Restricted Boltzmann Machines


This paper formulates a ``discriminative'' RBM. The visible layer contains the features and the labels (encoded with a softmax). It can be trained in a generative way or a discriminative one: the generative way maximizes Pr(x, y) while the discriminative one maximizes Pr(y | x). As we can see, Pr(x, y) can be maximized with contrastive divergence as in a normal RBM (I remember Hinton's course last year included an example of classification using the free energy directly with this model). For the discriminative case, when the number of classes is small, an exhaustive sum over y is feasible, and the good news is that the partial derivatives have explicit expressions.
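A minimal sketch of that exhaustive sum (assumed shapes; this follows the closed form one gets by marginalizing h analytically, not code from the paper):

```python
import numpy as np

def drbm_posterior(x, W, U, c, d):
    """Pr(y | x) for an RBM whose visible layer is (features x, one-hot label y).
    W: (n_hidden, n_features), U: (n_hidden, n_classes),
    c: hidden biases (n_hidden,), d: class biases (n_classes,)."""
    n_classes = U.shape[1]
    # log-unnormalized Pr(y|x) = d_y + sum_j softplus(c_j + U_{jy} + (W x)_j)
    log_p = np.array([d[y] + np.sum(np.logaddexp(0.0, c + U[:, y] + W @ x))
                      for y in range(n_classes)])
    log_p -= log_p.max()                 # for numerical stability
    p = np.exp(log_p)
    return p / p.sum()
```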

Regarding the last scanned paper, they propose a hybrid model that combines the two loss functions (negated log-likelihoods, introducing a coefficient for the generative part). The discriminative part then has a non-stochastic gradient while the generative part has a stochastic one (due to CD).

As for semi-supervised learning, we just marginalize the label y out of the RBM; the gradient then becomes the CD update averaged over y. This negated log-likelihood of Pr(x) functions as a regularizer.

It's amazing that the DRBM alone could achieve a 1.81% error rate on MNIST with only 500 samples. The HDRBM achieves even lower. They mention a sparse version, which I will scan soon.

Monday, February 2, 2009

On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes


This paper is a little bit theoretical. The main result is that the discriminative model usually has a lower error rate than its corresponding generative model given enough data (asymptotically speaking, the generalization capability of the former is better). However, the generative model reaches its asymptotic bound sooner (it converges faster).

But please do notice there are some things not validated. The paper compares Naive Bayes with logistic regression, and the conclusion might not generalize to other model pairs. Also, the training of logistic regression analyzed here is not MLE (it minimizes the classification errors, though).

The gap between the two asymptotic bounds is O((log n) / m), where m is the number of samples and n is the dimension of the feature space.

This implies that with limited data, a generative model might be better, while with plenty of data, a discriminative model might be more suitable.