by John Lafferty, Andrew McCallum and Fernando Pereira
A CRF is simply an MRF conditioned on the observation. In this paper the authors propose a chain-structured MRF for modeling sequential data (more complex structures can be used as well, just as Bayesian nets allow e.g. higher-order Markov chains in HMM or MEMM). CRF keeps the nice merits of CME and related models: 1) human-designed features can be used; 2) the maximum-entropy training objective is convex, unlike HMM. It also avoids the label bias problem: in a Markov model where a state has few outgoing branches, the conditional probability is normalized over each transition, so a transition receives a large probability even when it is wrong, because the model effectively ignores the current observation at that step. We then find little discrimination between the right decoding and a wrong one, since every transition in either decoding gets a large probability. The root cause is that the directed (belief-network) structure forces each transition to be a probability distribution, so the competition happens among the states at a single transition. A better way is to model the whole sequence as one distribution, i.e. an undirected model is more suitable: potential functions carry no local normalization requirement, which moves the competition from states at one step to entire sequences of transitions.
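Concretely, for the chain case (in the paper's notation, with transition features f, per-state features g and weights λ, μ) the conditional is globally normalized:

\[
p(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_t \big[\, \lambda \cdot f(x, y_{t-1}, y_t) + \mu \cdot g(x, y_t) \,\big] \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Big( \sum_t \big[\, \lambda \cdot f(x, y'_{t-1}, y'_t) + \mu \cdot g(x, y'_t) \,\big] \Big).
\]

Only Z(x) normalizes, and it sums over all label sequences y', so complete paths compete with each other rather than the successors of a single state.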
This observation is key to the design of CRF. Given a chain MRF, the training procedure involves not only state features f(x, y_t) as in CME, but also transition features f(x, y_t, y_{t+1}) as in MEMM. The form of the solution (cf. the representer theorem) suggests that GIS, IIS and other optimization algorithms applicable to CME also apply to CRF. The optimization is no harder, except that the normalization factor Z(x) is now a sum over entire label sequences rather than, as in MEMM, a per-position sum conditioned on the current observation. I will check this derivation manually soon.
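To make the whole-sequence normalization concrete, here is a minimal sketch (not from the paper; the array names emission/transition, the use of precomputed feature scores, and the toy sizes are my own assumptions) of computing log Z(x) for a linear-chain CRF by the forward recursion. Note that nothing is normalized per position:

```python
import numpy as np

def log_partition(emission, transition):
    """log Z(x) for a linear-chain CRF via the forward recursion.

    emission:   (T, K) array, emission[t, j]   = summed feature scores for label j at position t
    transition: (K, K) array, transition[i, j] = summed feature scores for moving from label i to j

    Both arrays are unnormalized log-potentials, not probabilities:
    unlike an MEMM, no per-position normalization is ever applied.
    """
    T, K = emission.shape
    alpha = emission[0].copy()                   # log-scores of all length-1 prefixes
    for t in range(1, T):
        # alpha_new[j] = logsumexp_i( alpha[i] + transition[i, j] ) + emission[t, j]
        scores = alpha[:, None] + transition     # (K, K): prefix score plus transition score
        alpha = np.logaddexp.reduce(scores, axis=0) + emission[t]
    return np.logaddexp.reduce(alpha)            # sum over all K**T label sequences

# toy example: 4 positions, 3 labels
rng = np.random.default_rng(0)
em, tr = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
logZ = log_partition(em, tr)
# p(y | x) of any labeling y is then exp(score(y) - logZ)
```

The same recursion (and its backward counterpart) gives the expected feature counts needed by GIS/IIS-style updates, so training costs about as much per iteration as HMM-style forward-backward.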