Tuesday, March 31, 2009

The Infinite Markov Model


The paper shows how an n-gram model can be extended with a Dirichlet process (here in its Chinese restaurant process interpretation; the more general form actually used is the Pitman-Yor process). An n-gram model is simply an order-n Markov model, which can be represented with a suffix tree. Generation can then be extended to arbitrary depth (unlike an n-gram model, where the depth is fixed). The important part is inference: the posterior is difficult to compute, so we simply use a Gibbs sampler (the conditional posterior is easy to compute, being a Beta distribution given a Beta prior).
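
To make the Beta/Bernoulli conjugacy concrete, here is a minimal sketch of what one such conditional update could look like; the counts, argument names and hyperparameters (a, b) are my own illustrative assumptions, not the paper's notation.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_stop_prob(n_stop, n_pass, a=1.0, b=1.0):
        """Gibbs update for a suffix-tree node's 'stop at this depth' probability.

        With a Beta(a, b) prior and Bernoulli stop/pass observations, the
        conditional posterior is Beta(a + n_stop, b + n_pass), so it can be
        sampled exactly.  (Illustrative only; the actual model adds
        Pitman-Yor machinery on top of this.)
        """
        return rng.beta(a + n_stop, b + n_pass)

    # e.g. a node where generation stopped 3 times and continued deeper 7 times
    print(sample_stop_prob(3, 7))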

Maybe we can try a variational version?

Friday, March 27, 2009

Sufficient Dimensionality Reduction


The authors propose a nonparametric approach to sufficient dimensionality reduction. Their idea is very similar to CME. We do not know Pr(x, y), but we can estimate Pr(x), Pr(y) and several measurements φ(x) (the reduced features). Therefore, given a φ(x), the minimum mutual information I(q) over the estimated joint densities q(x, y) can be seen as the goodness of φ(x), and maximizing this minimum leads to the best φ(x) available to us.

The minimum mutual information I(q) can be obtained with variational methods. We are actually finding the best q(x, y) such that q(x) = Pr(x), q(y) = Pr(y), and the expectation of φ(x) under the conditional q(x | y) matches that under Pr(x | y). The solution has a CME flavor (it is the so-called I-projection of q(x, y) onto the set of distributions satisfying the constraints): convex and unique. Let the Lagrange multipliers corresponding to φ(x) be ψ(y). Then, in a sense, the functional I(q) can be regarded as the Lagrange function under this parametrization of q(x, y).
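
In my own notation (which may differ from the paper's), the variational problem is roughly

    \min_{q} I(q) \quad \text{s.t.} \quad q(x) = \Pr(x),\;\; q(y) = \Pr(y),\;\;
    \mathbb{E}_{q(x \mid y)}[\phi(x)] = \mathbb{E}_{\Pr(x \mid y)}[\phi(x)] \;\; \forall y,

and the I-projection argument suggests an exponential-family solution of the form

    q(x, y) \propto \exp\bigl( a(x) + b(y) + \psi(y)^{\top}\phi(x) \bigr),

with ψ(y) playing the role of the Lagrange multipliers mentioned above.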

The maximization step can be converted into a minimization step, which can still be solved by I-projection. This is because I(q) = I(Pr(x, y)) - KL( Pr(x, y) || q ) for q in the ``dual feasible set.'' Therefore we need two I-projections.

The algorithm then works as follows: fix the φ(x) part and project the conditional q(y | x) onto the subspace of tunable ψ(y); update the joint density as q(y | x) Pr(x); then fix the ψ(y) part and project the conditional q(x | y) onto the subspace of tunable φ(x); update the joint density as q(x | y) Pr(y).
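
As a concrete illustration of just the ``update the joint density'' steps (not the I-projections themselves, which the paper carries out with a few GIS iterations), here is a minimal numpy sketch on a discrete grid; the variable names are my own.

    import numpy as np

    def recombine_with_px(Q, px):
        """Replace the x-marginal of the joint Q(x, y): Q <- q(y|x) Pr(x)."""
        qx = Q.sum(axis=1, keepdims=True)      # current marginal q(x)
        return (Q / qx) * px[:, None]          # q(y|x) * Pr(x)

    def recombine_with_py(Q, py):
        """Replace the y-marginal of the joint Q(x, y): Q <- q(x|y) Pr(y)."""
        qy = Q.sum(axis=0, keepdims=True)      # current marginal q(y)
        return (Q / qy) * py[None, :]          # q(x|y) * Pr(y)

    # toy joint over a 3x4 grid
    Q = np.random.default_rng(0).random((3, 4))
    Q /= Q.sum()
    px = np.array([0.2, 0.3, 0.5])
    py = np.array([0.1, 0.2, 0.3, 0.4])
    Q = recombine_with_px(Q, px)   # now Q.sum(axis=1) matches px
    Q = recombine_with_py(Q, py)   # now Q.sum(axis=0) matches py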

I am not quite sure about my own interpretation, but the convergence (to a local minimum) of the algorithm is not difficult to prove. The I-projections can be done partially, with a few GIS iterations. The most important idea in this paper comes from information geometry: the Pythagorean theorem can be interpreted via geodesic distance on the manifold, and the dual interpretation of φ(x) and ψ(y) is very intuitive.

A related stat article:
Francesca Chiaromonte and R. Dennis Cook: Sufficient dimension reduction and graphics in regression, AISM.

Thursday, March 26, 2009

Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces


This paper shows one way of doing SDR (sufficient dimensionality reduction) using KDR (kernel dimensionality reduction), though it takes a somewhat different path to reach the conclusion. I will scan the earlier nonparametric version.

The model proposed by the authors is kind of semi-parametric. Sufficiency of a dimensionality reduction for a regression problem means that y is independent of x conditioned on the projection of x onto a subspace (let the projection be Px = u and let v be the residual). Therefore we either maximize the mutual information of y and u or minimize the mutual information of y and v. As for the KDR technique: the mutual information can be approximated with kernel CCA or KGV (kernel generalized variance).

The different derivation starts with two definitions: one for the covariance operator and the other for the conditional covariance operator. In a way, these operators formulate the idea that ``two r.v.s are independent when arbitrary functions of them are uncorrelated,'' but in a more abstract and rigorous way. With these concepts we know that the conditional covariance operator ΣYY|U is no less than ΣYY|X, with equality when Y is independent of V; likewise ΣYX|U = 0 when Y is independent of V.
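
For reference, the conditional covariance operator is defined analogously to the conditional covariance of Gaussians (glossing over the technicalities of the inverse),

    \Sigma_{YY \mid X} := \Sigma_{YY} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY},

and similarly with X replaced by U for the projected covariates.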

When we make these operators concrete with matrices, we arrive at the same formulation as in KDR, so the optimization is done in the same way. This procedure is analogous to feature selection and may be applied to variable selection.
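
As a rough numerical sketch of what such a concretized objective can look like, here is my own guess at a KDR-style contrast, Tr[G_Y (G_U + nεI)^{-1}] with centered RBF Gram matrices; the paper's exact regularization and normalization may differ.

    import numpy as np

    def centered_rbf_gram(X, sigma=1.0):
        """Centered RBF Gram matrix G = H K H with H = I - 11^T / n."""
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq / (2.0 * sigma ** 2))
        n = len(X)
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def kdr_contrast(X, Y, B, eps=1e-3, sigma=1.0):
        """KDR-style objective for a projection matrix B (d x m); smaller is better."""
        U = X @ B                               # projected covariates u = B^T x
        G_y = centered_rbf_gram(Y.reshape(len(Y), -1), sigma)
        G_u = centered_rbf_gram(U, sigma)
        n = len(X)
        return np.trace(G_y @ np.linalg.inv(G_u + n * eps * np.eye(n)))

    # toy usage: 2-D x, scalar y depending only on the first coordinate
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
    print(kdr_contrast(X, Y, np.array([[1.0], [0.0]])))   # direction carrying the signal
    print(kdr_contrast(X, Y, np.array([[0.0], [1.0]])))   # orthogonal direction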

Kernel Independent Component Analysis


ICA has long been studied by researchers from signal processing, machine learning and the like. However, though independence can be characterized mathematically with mutual information, it is not computable in practice. Many ideas have been proposed: approximations of mutual information (e.g. using moments) or alternative contrast functions (e.g. kurtosis).

In recent years I have noticed that Bach has published lots of papers in machine learning, and it turns out he is another student of Jordan. What a supervisor! After reading this piece of work, I have no doubt about the novelty of their approach to ICA; it also relates closely to later work. The core idea of ICA, independent components, must be formulated somehow, and the notion of independence computed this way will surely be applicable to other problems as well. For example, one concept in dimensionality reduction for regression (supervised dimensionality reduction) is sufficiency: finding a subspace such that the response y is independent of x conditioned on the projection of x onto it. This is therefore the key idea for understanding the papers scanned below.

There are several steps in tackling independence:
  • Correlation-to-independence: if x and z are independent, then f(x) and g(z) are independent and therefore uncorrelated (correlation coefficient 0). Given a fixed (f, g), we cannot assert that x and z are independent just because f(x) and g(z) are uncorrelated; however, x and z are independent if f(x) and g(z) are uncorrelated for arbitrary (f, g) (then we can bring the Fourier transform into it).
  • Kernels provide us with an RKHS, which means we can test many, many functions of the r.v.s. Empirically, with given samples, for ICA we have to minimize the maximum correlation of x and z over all pairs (f, g).
  • The maximum correlation can be obtained with the standard CCA idea or its kernelized version (a generalized eigenvalue problem). This can be extended to multiple r.v.s (variables or vectors). It is argued that the smallest eigenvalues (of the (C, D) generalized eigenvalue problem, where C is the covariance matrix, KiKj in kernelized form, and D is a regularized diagonal variance matrix, (Ki + sI)²) are more suitable. (KCCA as a contrast function.)
  • The last idea is that when the r.v.s are Gaussian, the mutual information is -log( |C|/|D| )/2, so we just need to minimize this. This is another contrast function (the kernel generalized variance, KGV); a small numerical sketch follows this list.
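
Here is a minimal numpy sketch of the KGV-style contrast for two one-dimensional signals, built literally from the description above (off-diagonal blocks KiKj, regularized diagonal blocks (Ki + sI)², contrast -log(|C|/|D|)/2); the kernel width, the regularizer s and the use of centered Gram matrices are my own assumptions rather than the paper's exact recipe.

    import numpy as np

    def centered_gram(x, sigma=1.0):
        """Centered Gram matrix of a 1-D signal under an RBF kernel."""
        K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma ** 2))
        n = len(x)
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def kgv_contrast(x, z, s=0.1, sigma=1.0):
        """KGV-style contrast for two signals; smaller means 'more independent'."""
        K1, K2 = centered_gram(x, sigma), centered_gram(z, sigma)
        n = len(x)
        R1, R2 = K1 + s * np.eye(n), K2 + s * np.eye(n)
        C = np.block([[R1 @ R1, K1 @ K2],
                      [K2 @ K1, R2 @ R2]])            # full block matrix
        D = np.block([[R1 @ R1, np.zeros((n, n))],
                      [np.zeros((n, n)), R2 @ R2]])   # its block-diagonal part
        _, logdet_C = np.linalg.slogdet(C)
        _, logdet_D = np.linalg.slogdet(D)
        return -0.5 * (logdet_C - logdet_D)

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    print(kgv_contrast(x, rng.normal(size=100)))            # independent pair
    print(kgv_contrast(x, x + 0.1 * rng.normal(size=100)))  # dependent pair
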
So we must turn these ideas into a working algorithm. There are several important tips:
  • As in ICA, we first whiten the data Y so that it has zero mean and unit covariance. Then we only need to find an orthogonal matrix instead of a full affine transform (a small whitening sketch appears after this list).
  • The (C, D) pair, converted into an eigenvalue problem, is too expensive to handle directly. Therefore we approximate these matrices with an incomplete Cholesky decomposition (another candidate is the Nyström approximation).
  • Notice that the optimization is constrained to orthogonal matrices, which live on a Stiefel manifold. We must use the gradient on this manifold (so as not to break the constraints). The overall optimization is simply steepest gradient descent.
  • The authors then propose two possible initial values for the gradient descent. One point is that a kernel implying complicated features may make the optimization tough, whereas a simple kernel such as a polynomial kernel may be easier to optimize.
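
A minimal whitening sketch for the first tip (eigendecomposition-based; the variable names and toy mixing matrix are mine):

    import numpy as np

    def whiten(Y):
        """Center the data and map it to zero mean and identity covariance."""
        Yc = Y - Y.mean(axis=0)
        cov = np.cov(Yc, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # symmetric whitening matrix
        return Yc @ W

    mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
    X = whiten(np.random.default_rng(0).normal(size=(200, 3)) @ mix)
    print(np.round(np.cov(X, rowvar=False), 2))            # approximately the identity
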
This journal paper contains much more information about everything :-D I will return to it when considering implementation.

Tuesday, March 24, 2009

Localized Sliced Inverse Regression


Sliced inverse regression (SIR) for classification is equivalent to FDC. That is, we solve a generalized eigenvalue problem Γβ = λΣβ, where Γ is the between-class covariance matrix and Σ is the total covariance matrix. The largest eigenvalues of this problem correspond to the desired directions. Well, in some cases we use the in-class covariance matrix B in place of Σ.
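
A small numpy/scipy sketch of this generalized eigenvalue problem for the classification case (my own construction, with Γ as the covariance of the class means and Σ as the total covariance; SIR proper would slice a continuous response instead):

    import numpy as np
    from scipy.linalg import eigh

    def sir_directions(X, y, n_dirs=1):
        """Solve Gamma beta = lambda Sigma beta and return the top directions."""
        Sigma = np.cov(X, rowvar=False)                   # total covariance
        mu = X.mean(axis=0)
        Gamma = np.zeros_like(Sigma)                      # between-class covariance
        for c in np.unique(y):
            Xc = X[y == c]
            d = (Xc.mean(axis=0) - mu)[:, None]
            Gamma += (len(Xc) / len(X)) * (d @ d.T)
        vals, vecs = eigh(Gamma, Sigma)                   # generalized eigenproblem
        order = np.argsort(vals)[::-1]                    # largest eigenvalues first
        return vecs[:, order[:n_dirs]]

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 0], 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    print(sir_directions(X, y))   # should point roughly along the first coordinate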

This has been discussed extensively in Jieping Ye's papers. In a way, Σ is more likely to be non-singular than B. Since Σ = Γ + B, there exists a non-singular matrix A such that both can be diagonalized simultaneously with a transform Az = x. If they are non-singular, it does not matter whether we use B or Σ. When B or Σ is singular, we might use its principal subspace or its null space instead (well, it's not really an approximation, I guess). I wonder what we should do for SIR. Maybe we do the same rubbish too.

Now we come to LSIR. The authors propose a localized Γ built from the kNN samples of the same class. The only difference between LSIR and LFDC, I think, is the denominator part (i.e. a global covariance vs. a localized in-class covariance). Maybe I have to explore the corresponding graph to see their difference more clearly.

Their extension for SSL is to put all unlabelled samples into all slices. That might be natural in the graph embedding framework. Why should I read this paper then?

Sunday, March 22, 2009

Warped Gaussian Processes


This paper exploits the concept of a ``warped Gaussian process,'' referring to observed data which, after a warping function is applied, follow a GP. This does not make the problem much harder: we can compute the posterior GP as before and apply the inverse warping to the posterior in order to get the posterior distribution of the observed data. The question is how we could know the warping function a priori given the data. The authors argue that this method is for preprocessing data (so the warping function is known). So if the data do not look Gaussian, it may be possible to find a warping function such that the mapped data are better modeled with a GP.
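
A minimal sketch of the idea with a fixed log warping (a toy choice of mine, not the paper's learned warping): fit a plain GP to the warped targets and map the posterior mean back through the inverse warp.

    import numpy as np

    def rbf_kernel(a, b, ell=1.0):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * ell ** 2))

    def warped_gp_predict(x, y, x_star, noise=0.1):
        """GP regression on t = log(y) (the warp), then invert the warp."""
        t = np.log(y)                               # warping function: log
        K = rbf_kernel(x, x) + noise ** 2 * np.eye(len(x))
        K_star = rbf_kernel(x_star, x)
        t_mean = K_star @ np.linalg.solve(K, t)     # posterior mean in warped space
        return np.exp(t_mean)                       # inverse warp (posterior median in y-space)

    rng = np.random.default_rng(0)
    x = np.linspace(0, 5, 30)
    y = np.exp(np.sin(x) + 0.05 * rng.normal(size=30))   # positive data, Gaussian in log-space
    print(np.round(warped_gp_predict(x, y, np.linspace(0, 5, 7)), 2))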

On the whole, this paper is not very intriguing.

Wednesday, March 18, 2009

Efficient Approaches to Gaussian Process Classification


This paper introduces three methods for doing classification with GPs. The first one is a variational method (it seems not applicable to parametric models, though), but it looks a little different from the common variational inference methods. It has an explicit expression for the posterior Gaussian process's mean (why do they calculate this? Isn't it a fully Bayesian method? Maybe that's why they call it a mean-field method? Use the mean for inference of the labels? Omg, mode, median or mean?). The tricky part is solving a non-linear equation: the coefficients of the posterior mean must equal the mean calculated with that representation. This finally yields a bound for the marginal likelihood.

The second approach actually simplifies the procedure, though the derivation is not included. At a glance, the nonlinear equation is simplified: we no longer need to compute the inverse of the covariance matrix K. The sequential method resembles ADF, though I am not sure.

Gaussian Processes for Regression


This paper might be the first one introducing GPs to the machine learning community. Certain techniques are now very clear to me: how to compute the posterior and make predictions. Training such non-parametric models requires tuning the hyperparameters (MAP -,-b in a sense, or Type II MLE) or computing the posterior over the hyperparameters (the fully Bayesian way). In practice the empirical Bayes method works pretty well. But as we know, to maximize the marginal likelihood Pr( y | X ), we have to calculate the partial derivatives with respect to all hyperparameters.
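
For reference, the standard GP regression formulas (not copied from this particular paper) for the log marginal likelihood and its gradient with respect to a hyperparameter θⱼ are

    \log \Pr(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\,\mathbf{y}^{\top} K^{-1} \mathbf{y} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi,

    \frac{\partial}{\partial \theta_j}\log \Pr(\mathbf{y} \mid X, \theta)
      = \tfrac{1}{2}\,\mathbf{y}^{\top} K^{-1}\frac{\partial K}{\partial \theta_j}K^{-1}\mathbf{y}
        - \tfrac{1}{2}\operatorname{tr}\!\Bigl(K^{-1}\frac{\partial K}{\partial \theta_j}\Bigr),

where K is the covariance matrix including the noise term.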

That is why the Matlab code the authors provide includes a nonlinear CG algorithm. I guess it is also possible to write another version with second-order optimization techniques. The connection between GPs and neural nets is addressed in Neal's earlier work.

Thursday, March 12, 2009

Gaussian Process Classification for Segmenting and Annotating Sequences


This is quite an interesting paper. I find it interesting because it shows the delicate connections between several algorithms: GP, kernel logistic regression and SVM; GPSC, CRF and M3N.

A GP, when used for MAP inference, becomes comparable to KLR and the SVM. KLR uses the negated log-likelihood as the loss function, while the SVM traditionally employs the hinge loss, and the penalty (norm regularizer) has an intuitive interpretation in the SVM. The GP, though the nonparametric counterpart of PKLR in Bayesian statistics, can be viewed as an extension of KLR's dual form (from kernelization), and therefore its connection to the SVM (which has a natural Lagrange dual) is just a difference of loss.
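
To make the ``difference of loss'' concrete, here is a tiny sketch comparing the two losses on the margin m = y·f(x) (standard definitions, nothing specific to this paper):

    import numpy as np

    def logistic_loss(m):
        """Negated log-likelihood of logistic regression on the margin m = y*f(x)."""
        return np.log1p(np.exp(-m))

    def hinge_loss(m):
        """SVM hinge loss on the margin m = y*f(x)."""
        return np.maximum(0.0, 1.0 - m)

    m = np.linspace(-2, 2, 5)
    print(np.round(logistic_loss(m), 3))
    print(np.round(hinge_loss(m), 3))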

This paper further explores the relation in the structured setting. The relationship comes from the kernel matrix (two parts, one for node features, another for transitions). Just as in M3N, we encounter the problem of exponentially many labelings of a sequence. As M3N exploits the structure of the graph, we can also simplify the model using the graph information (so that only marginals are required).

Though it is still not a fully Bayesian method (not a decent GPC :-D), it sheds light on the relationship between several famous frameworks. Maybe we can try EP on the fully Bayesian model, as in the previous paper on Bayesian CRFs.

The Infinite Hidden Markov Model


This paper extends the HMM with DPs. In an HMM, the transition matrix rows and the emission probabilities are all multinomially distributed; the Bayesian adds a Dirichlet prior for them, and now we try Dirichlet processes instead. Their construction is via the Chinese restaurant process (CRP). The generative procedure is: the chain may go to any existing state (including itself) or ask an oracle (a DP); if it asks the oracle, the oracle may tell it to go to an existing state or hand it a new state (another DP). Finally it generates an observation using a DP as well (it may emit symbols it has already generated, or new ones). Therefore it is a hierarchical DP.
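
A rough sketch of the transition step as described above (a two-level CRP; the count arrays and the hyperparameters beta and gamma are my own illustrative names, not the paper's exact parametrization):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_next_state(i, n, m, beta=1.0, gamma=1.0):
        """Sample the next state from state i.

        n[i][j]: how often the chain has gone i -> j so far;
        m[j]:    how often the oracle has handed out state j.
        With probability proportional to n[i][j] reuse an existing transition,
        with probability proportional to beta ask the oracle; the oracle reuses
        state j proportional to m[j] or creates a new state proportional to gamma.
        """
        w = np.append(n[i], beta)
        choice = rng.choice(len(w), p=w / w.sum())
        if choice < len(n[i]):                   # direct transition to an existing state
            return choice
        w_oracle = np.append(m, gamma)           # ask the oracle
        return rng.choice(len(w_oracle), p=w_oracle / w_oracle.sum())  # len(m) means a new state

    n = np.array([[3.0, 1.0], [0.0, 2.0]])       # toy transition counts for 2 states
    m = np.array([2.0, 1.0])                     # toy oracle counts
    print(sample_next_state(0, n, m))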

The learning procedure requires five hyperparameters. The posterior has to be approximated with Gibbs sampling. The learning of the hyperparameters is rather vague; I am not sure why they introduce another prior for the hyperparameters (maybe we can only resort to sampling -,-b). The model has only been tested on simulated data, though. I really doubt its applicability in practice.

Monday, March 2, 2009

Semi-Markov Conditional Random Fields for Information Extraction


This paper modifies the CRF structure with the semi-Markov property, which allows the chain to enter a state and stay there for a period during which the Markov property does not hold (the state persists over the segment). This is useful for finding sites, e.g. NER (named entity recognition). I guess it might also be useful in bioinformatics.

Their basic strategy is to extend the label y in the feature functions into a tuple: not only y but also the starting and ending positions of the segment (under certain constraints). Then we derive the learning/inference algorithms in the new setting.

The search space of the inference increases (but it is still a polynomial-time algorithm) due to the search over segmentations. The learning algorithm does not change too much, though.
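
A small sketch of why the inference stays polynomial: a Viterbi-style recursion over segments, where the segment scorer, the maximum segment length and the label set are placeholders of mine.

    import numpy as np

    def semi_markov_viterbi(n, labels, seg_score, max_len):
        """Best-scoring segmentation/labeling of positions 1..n.

        seg_score(prev_label, label, start, end) scores the segment [start, end]
        given the previous segment's label; runtime is O(n * max_len * |labels|^2).
        """
        best = {(0, None): 0.0}                  # (position reached, last label) -> score
        back = {}
        for j in range(1, n + 1):
            for y in labels:
                best[(j, y)], back[(j, y)] = -np.inf, None
                for d in range(1, min(max_len, j) + 1):
                    i = j - d                    # previous boundary; segment covers i+1 .. j
                    for y_prev in ([None] if i == 0 else labels):
                        s = best[(i, y_prev)] + seg_score(y_prev, y, i + 1, j)
                        if s > best[(j, y)]:
                            best[(j, y)], back[(j, y)] = s, (i, y_prev)
        return best, back

    # toy scorer: reward labeling positions 3-5 as one 'B' segment, 'O' elsewhere
    def toy_score(y_prev, y, start, end):
        if y == 'B':
            return 2.0 if (start, end) == (3, 5) else -1.0
        return 0.5 * (end - start + 1)           # 'O' scores per covered position

    best, _ = semi_markov_viterbi(6, ['O', 'B'], toy_score, max_len=3)
    print(max((best[(6, y)], y) for y in ['O', 'B']))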