Sunday, May 18, 2008

Two-Dimensional Solution Path for Support Vector Regression


In this paper a method for tuning the parameters of SVR is proposed. There are two parameters, ε (the allowable error) and λ (the regularization parameter). Usually they are set according to experimental results.

In a previous NIPS 2005 paper, the path of λ was studied. The idea is that, starting from λ = infinity and decreasing its value towards 0, we can trace a path for each Lagrange multiplier and the bias, which are proved to be piecewise-linear functions of λ. The paths help in determining the regularization parameter, since the number of samples on the elbow gives a ``degree of freedom'', which is used to select λ with a GCV-like criterion, SE/(N - DF), instead of the MSE. The reason is still not quite clear to me.
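To make the criterion concrete, here is a minimal sketch (my own, not from the paper) of an SE/(N - DF) score, assuming DF is simply the number of samples on the elbow; the function name and the usage comment are hypothetical.

    import numpy as np

    def df_corrected_score(y_true, y_pred, df):
        """Sum of squared errors divided by (N - DF).

        Sketch of the GCV-like criterion described above; DF is taken
        as the number of samples on the elbow (my assumption).
        """
        residuals = np.asarray(y_true) - np.asarray(y_pred)
        n = residuals.size
        return np.sum(residuals ** 2) / (n - df)

    # Hypothetical usage: pick the lambda on the path with the lowest score.
    # best_lam = min(path, key=lambda p: df_corrected_score(y, p.pred, p.n_elbow))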

The ICML 06 paper aims at ε. Basically they use the same analysis, and the result is similar: the Lagrange multipliers and the bias are piecewise-linear functions of ε. This addresses the limitation of the NIPS paper, which requires ε to be known and set a priori; here the point is that a proper ε can be chosen along the path.
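Since the multipliers are piecewise linear in ε, the whole path is determined by its breakpoints and can be evaluated anywhere by linear interpolation; a rough sketch (mine, with made-up breakpoint values):

    import numpy as np

    def alpha_on_path(eps, breakpoints, alpha_at_breakpoints):
        """Evaluate a Lagrange multiplier at any eps by linear interpolation,
        exploiting the piecewise-linear structure of the path.

        breakpoints          -- increasing eps values where the active sets change
        alpha_at_breakpoints -- multiplier values at those breakpoints
        """
        return np.interp(eps, breakpoints, alpha_at_breakpoints)

    # Example with made-up numbers: the path is exact between breakpoints.
    bp = np.array([0.0, 0.3, 0.7, 1.2])
    al = np.array([1.0, 0.6, 0.6, 0.0])
    print(alpha_on_path(0.5, bp, al))  # 0.6, constant on [0.3, 0.7]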

Monday, May 12, 2008

Trace Ratio vs. Ratio Trace for Dimensionality Reduction

by Huan Wang, Shuicheng Yan, Dong Xu, Xiaoou Tang, Thomas Huang

This is a CVPR 2007 paper. I read Yan's CVPR paper in 2005, so I am quite familiar with this direction. Basically I agree with their argument for using the trace ratio, especially for graph embedding algorithms: it is more sound to directly maximize the ratio of between-class scatter to within-class compactness. The optimization, however, becomes difficult when the trace ratio is the objective instead of the formulations used previously, e.g. iteratively defined optimization, the trace formulation and the determinant formulation.

The constraints of their proposed optimization ensure orthogonality of the projection. The algorithm, which also works as an approximation in the principal subspace of St, iteratively updates the projection V such that
V* = arg max tr( V^T (Sp - λ_n St) V )   s.t.  V^T V = I
where λ_n is the current value of the trace ratio. The iteration increases the objective function monotonically and converges to a global optimum (the main contribution). For other constraints, such as the one arising in the kernelization, an eigendecomposition is applied first.
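A minimal numpy sketch of the iteration as I read it (not the authors' code), assuming Sp and St are symmetric and St is positive definite: alternate between setting λ_n to the current trace ratio and taking V as the top-d eigenvectors of Sp - λ_n St.

    import numpy as np

    def trace_ratio(Sp, St, d, n_iter=50):
        """Iteratively maximize tr(V^T Sp V) / tr(V^T St V) over V^T V = I."""
        n = Sp.shape[0]
        V = np.eye(n)[:, :d]              # arbitrary orthonormal start
        for _ in range(n_iter):
            lam = np.trace(V.T @ Sp @ V) / np.trace(V.T @ St @ V)
            w, U = np.linalg.eigh(Sp - lam * St)
            V = U[:, np.argsort(w)[::-1][:d]]   # top-d eigenvectors
        return V, lam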

Hmm... another work extending this (vector -> tensor) says a global optimum might not be obtainable; I might check that later.

Adaptive Online Gradient Descent


In this paper we are looking at a game-like optimization problem: we have to pick an x_t for an unknown objective f_t(x), which might be a loss an adversary imposes upon us. To evaluate the final result, the accumulated loss over t = 1, ..., T is compared with that of a fixed action x that minimizes the total loss. The difference of the two is called the regret (why not wait until the last f_t is given?).

There are several existing results for this problem:
  • Zinkevich showed that for linear functions the regret grows as sqrt(T) using his online gradient descent algorithm (a minimal sketch follows this list).
  • Hazan et al. showed that under the assumption of strong convexity the regret grows as log(T).
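Here is the minimal sketch of Zinkevich's online gradient descent mentioned above (my own rendering; the L2-ball feasible set, the step size eta0/sqrt(t) and the horizon are assumptions for illustration).

    import numpy as np

    def online_gradient_descent(grads, x0, radius=1.0, eta0=0.1, T=100):
        """Play x_t, receive the gradient of the (adversarial) loss f_t at x_t,
        step against it with a decaying step size, and project back onto an
        L2 ball of the given radius (the feasible set here is my assumption).

        grads -- callable: grads(t, x) returns the gradient of f_t at x
        """
        x = np.asarray(x0, dtype=float)
        plays = []
        for t in range(1, T + 1):
            plays.append(x.copy())
            g = grads(t, x)
            x = x - eta0 / np.sqrt(t) * g       # step size ~ 1/sqrt(t)
            norm = np.linalg.norm(x)
            if norm > radius:                    # project back onto the ball
                x = x * (radius / norm)
        return plays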

This paper has several results:
  • With a regularization technique, they propose an adaptive online gradient descent algorithm whose regret is within a small constant factor (1-6 times) of the regret achieved when the coefficients of the regularizer are chosen offline for the lowest regret.
  • It can be shown that for linear functions and strongly convex functions the corresponding rates, O(sqrt(T)) and O(log(T)), are still achieved.
  • There exists a similar algorithm for a general Lp norm, obtained by generalizing the L2 case with the help of the Bregman divergence.

    The Bregman divergence is D_ψ(x, y) = ψ(x) - ψ(y) - ∇ψ(y)^T (x - y), defined for a convex function ψ.
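As a quick sanity check of the definition (mine, not from the paper), for ψ(x) = ½||x||² the Bregman divergence reduces to the squared Euclidean distance:

    import numpy as np

    def bregman(psi, grad_psi, x, y):
        """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
        return psi(x) - psi(y) - grad_psi(y) @ (x - y)

    psi = lambda v: 0.5 * v @ v          # psi(x) = 1/2 ||x||^2
    grad_psi = lambda v: v
    x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
    print(bregman(psi, grad_psi, x, y))            # 1.0
    print(0.5 * np.linalg.norm(x - y) ** 2)        # 1.0, the same value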

Thursday, May 8, 2008

Bayesian Regression with Input Noise for High Dimensional Data


Usually in regression models no noise is assumed on the input side. This paper deals with the case where that assumption does not hold. The basic observation is that when input noise exists, the original model tends to underestimate the parameters (para. 3, sec. 2). I am interested in why this is (no reference or proof is given).
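My guess is that this is the classical attenuation (regression dilution) effect; the small simulation below (mine, not the paper's) shows how ordinary least squares underestimates the slope once the inputs are noisy.

    import numpy as np

    rng = np.random.default_rng(0)
    n, true_slope = 10000, 2.0
    x_clean = rng.normal(0.0, 1.0, n)
    y = true_slope * x_clean + rng.normal(0.0, 0.1, n)   # output noise only
    x_noisy = x_clean + rng.normal(0.0, 1.0, n)          # add input noise

    slope_clean = np.polyfit(x_clean, y, 1)[0]
    slope_noisy = np.polyfit(x_noisy, y, 1)[0]
    print(slope_clean)  # close to 2.0
    print(slope_noisy)  # roughly 1.0: attenuated toward zero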

To filter the noise, a probabilistic model that resembles JFA is proposed. I can't help recalling those in Jordan's unpublished book; now I can understand it better. The solution to the model (the training part) relies on EM, which is commonly used when hidden variables are present. The inference part is done by marginalizing out all hidden variables and conditioning the output on the input, as in conventional regression. The figure in the paper shows the evolution of their model.

I've decided to review Jordan's book; it is really worth reading. Hopefully I'll derive several famous models for later seminars. Now let's get closer to the model. The hidden state t is assumed to be Gaussian. The observed x and y (I don't think z is necessary at all, so just forget about it) are conditioned on t and another parameter, and they are Gaussian too. The only non-Gaussian hidden variable, α, is a Gamma-distributed random variable; it is interpreted as the precision of the two parameters and hence is the only hyperparameter of the model. Why Gamma? Presumably because the Gamma is the conjugate prior for the precision of a Gaussian.
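To make the structure concrete for myself, here is a rough generative sketch of the model as I read it; the dimensions, variable names and noise levels are my assumptions, not the paper's notation.

    import numpy as np

    rng = np.random.default_rng(0)
    d_x, d_t, n = 20, 3, 500                 # made-up dimensions

    alpha = rng.gamma(shape=2.0, scale=1.0)  # Gamma-distributed precision
    W_x = rng.normal(0.0, 1.0 / np.sqrt(alpha), (d_x, d_t))  # x-loading matrix
    w_y = rng.normal(0.0, 1.0 / np.sqrt(alpha), d_t)         # y-weight vector

    t = rng.normal(0.0, 1.0, (n, d_t))                  # Gaussian hidden state
    x = t @ W_x.T + rng.normal(0.0, 0.1, (n, d_x))      # noisy observed input
    y = t @ w_y + rng.normal(0.0, 0.1, n)               # observed output

    # Regression then amounts to inferring t from x and predicting y from it.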

About the results: the figure shows that when the test cases are noise-free, the proposed model yields the lowest errors. But when the test cases are also contaminated by noise, all methods perform equally badly. It is a little difficult to judge, since the test cases themselves are not accurate and we have no ground truth.