Tuesday, January 22, 2008

Inference with the Universum


The universum is a new concept to me. In the universum setting, we have a third data set (the universum) in addition to the training and test sets. Now let's have a look at the logic of the whole thing.

With a finite training set, only finitely many separations are possible, and each separation corresponds to an equivalence class of functions in the hypothesis space. In the SVM setting, the regularizer added to the loss is the margin of the function. Why do we control the margin? Because it is the ``inverse'' of the capacity, which governs generalization ability, while the capacity is the complexity of the hypothesis space measured over the whole feature space.

Real problems differ, however, in that the samples come from only a fraction of the whole feature space, so we need to measure the complexity of the function on the problem at hand. In Bayesian theory, the prior distribution is usually taken to encapsulate the information held a priori; in practice, though, it is difficult to design good priors.

By designing universum samples carefully so as to incorporate this information (hence universum samples belong to none of the labeled classes; they are relevant but uninteresting samples), we might find a way to measure that complexity. The following optimization problem is proposed to implement the idea:

min_{w,b}  1/2 ||w||^2 + C sum_{i=1..l} H[ y_i f(x_i) ] + C_U sum_{j=1..u} U[ f(x'_j) ]

where f(x) = <w, x> + b, H[t] = max(0, 1 - t) is the hinge loss on the labeled training samples (x_i, y_i), and x'_1, ..., x'_u are the universum samples.
The last term is the universum regularizer. The U() function is an epsilon-insensitive loss, U[t] = max(0, |t| - epsilon). The resulting function should have small values on the universum samples and is therefore unstable (with respect to the sign of those values) under a small perturbation (the perturbed function is most likely still in the same equivalence class). So the resulting function comes from an equivalence class that has many contradictions on the universum samples, which indicates high complexity on the universum samples and therefore comparatively low complexity on the samples we are actually interested in, given the limited total complexity.
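To make this concrete, here is a minimal numpy sketch of the objective above, assuming a linear model f(x) = <w, x> + b and plain subgradient descent. The function names, default parameters, and training loop are my own illustration; the paper itself solves the problem as a quadratic program rather than by gradient descent.

import numpy as np

def hinge(t):
    # hinge loss H[t] = max(0, 1 - t) for the labeled training samples
    return np.maximum(0.0, 1.0 - t)

def universum_loss(t, eps=0.1):
    # epsilon-insensitive loss U[t] = max(0, |t| - eps): zero inside the
    # band [-eps, eps], so minimizing it drives f toward 0 on the universum
    return np.maximum(0.0, np.abs(t) - eps)

def usvm_objective(w, b, X, y, X_u, C=1.0, C_u=1.0, eps=0.1):
    # 0.5*||w||^2 + C * sum H[y_i f(x_i)] + C_u * sum U[f(x'_j)]
    f_tr = X @ w + b
    f_un = X_u @ w + b
    return (0.5 * w @ w
            + C * hinge(y * f_tr).sum()
            + C_u * universum_loss(f_un, eps).sum())

def train_usvm(X, y, X_u, C=1.0, C_u=1.0, eps=0.1, lr=1e-3, n_iter=2000):
    # plain subgradient descent on the objective above
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        f_tr = X @ w + b
        f_un = X_u @ w + b
        # hinge subgradient: active on margin violations y_i f(x_i) < 1
        viol = y * f_tr < 1.0
        g_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        g_b = -C * y[viol].sum()
        # universum subgradient: active outside the eps-insensitive band,
        # pushing |f| back toward the band around zero
        s = np.sign(f_un) * (np.abs(f_un) > eps)
        g_w = g_w + C_u * (s[:, None] * X_u).sum(axis=0)
        g_b = g_b + C_u * s.sum()
        w -= lr * g_w
        b -= lr * g_b
    return w, b

Setting C_u = 0 recovers an ordinary linear SVM trained by subgradient descent, so the universum term really is just an extra regularizer added on top.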

Well, this is really interesting: adding ``useless'' samples and building a regularizer out of them yields better results :-)
