Monday, February 16, 2009

Max-Margin Markov Networks


As we see in conditional maximum entropy (CME) models and CRFs, the features play the same role as features in an SVM, and the parameters are learned from data just as those in a logistic model. MLE is only one way of training discriminative models; another widely used criterion is the max margin. So how can we train a structured-output model, as in a CRF (namely, an MRF over the labels), with margins?
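To make the shared-features point concrete, here is a minimal sketch (all names hypothetical) of the joint feature map ψ(x, y) for a linear chain: the same count vector can be plugged into a CRF's log-likelihood or into a max-margin objective.

```python
def chain_features(x, y, n_obs, n_labels):
    """Flat feature-count vector psi(x, y) for one labeled sequence:
    emission indicators (x_t, y_t) followed by transition indicators
    (y_{t-1}, y_t). Hypothetical helper for illustration only."""
    node = [0.0] * (n_obs * n_labels)      # emission counts
    edge = [0.0] * (n_labels * n_labels)   # transition counts
    for t, (xt, yt) in enumerate(zip(x, y)):
        node[xt * n_labels + yt] += 1.0
        if t > 0:
            edge[y[t - 1] * n_labels + yt] += 1.0
    return node + edge

# CRF training maximizes log P(y|x) with w . psi(x, y);
# max-margin training instead enforces margins on the same w . psi.
f = chain_features([0, 1, 1], [0, 1, 0], n_obs=2, n_labels=2)
```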

The idea comes from the multi-class SVM: the true label defines a feature vector that should be separated from the feature vectors of all other label configurations by at least γ. But now each sequence has a variable length, so it is reasonable to scale the margin inequality by a quantity proportional to γ: the number of classification errors (the Hamming loss) between the candidate labeling and the true one. As in the SVM, introducing slack variables for the non-separable case yields a margin + loss objective. The optimization is still convex, but there are far too many constraints: in multi-class discrimination the configurations are comparatively few (with c classes, c − 1 constraints per sample), whereas for a sequence of l labels, each taking one of c values, we get on the order of c^l constraints per sample.
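A brute-force sketch of the Hamming-scaled slack (hypothetical names; the score function is supplied by the caller). Enumerating all c^l labelings is exactly the exponential blow-up that motivates the dual trick below.

```python
from itertools import product

def hamming(y_true, y_cand):
    """Number of positions where the candidate labeling disagrees."""
    return sum(a != b for a, b in zip(y_true, y_cand))

def max_margin_slack(score, y_true, length, n_labels):
    """Slack xi = max_y [score(y) + Hamming(y_true, y)] - score(y_true).
    Brute force over all n_labels**length configurations."""
    s_true = score(y_true)
    best = max(score(y) + hamming(y_true, y)
               for y in product(range(n_labels), repeat=length))
    return max(0.0, best - s_true)

# Toy scores: one model separates the true labeling by a wide margin,
# the other gives every labeling the same score.
score_sep = lambda y: 10.0 if tuple(y) == (1, 1) else 0.0
score_flat = lambda y: 0.0
xi_sep = max_margin_slack(score_sep, (1, 1), length=2, n_labels=2)   # 0.0
xi_flat = max_margin_slack(score_flat, (1, 1), length=2, n_labels=2) # 2.0
```

The flat model pays a slack equal to the worst Hamming loss, since no candidate is pushed away by the required scaled margin.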

To tackle this problem, they convert it into the dual. The dual variables can be regarded as a density over labelings (up to a given scale). Though there are exponentially many of them, not every single variable must be computed: the structure of the MRF determines which combinations are necessary. Grouping them into edge dual variables and vertex dual variables gives us a polynomial number of optimization variables. But now we have to enforce consistency: each vertex variable must be the marginal of the related edge variables. We can then convert the dual problem back into the primal one, and the optimization is done in SVM flavor, e.g. with SMO.
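The local consistency condition can be sketched in a few lines (assumed notation): for a chain edge (t, t+1), summing the edge marginal over the second endpoint must recover the vertex marginal at t.

```python
def vertex_from_edge(mu_edge, n_labels):
    """Marginalize an edge table mu_e(y_t, y_{t+1}) over its second
    endpoint, yielding the implied vertex marginal mu_t(y_t).
    Illustrative helper; names are assumptions, not the paper's API."""
    return [sum(mu_edge[a][b] for b in range(n_labels))
            for a in range(n_labels)]

# A consistent pair: the edge table's row sums give the vertex marginal.
mu_e = [[0.2, 0.1],
        [0.3, 0.4]]
mu_t = vertex_from_edge(mu_e, 2)   # approximately [0.3, 0.7]
```

Feasibility of the compact dual only requires these local equalities on every edge, which is what brings the variable count down to polynomial.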
