Sunday, February 15, 2009

Maximum Entropy Markov Models for Information Extraction and Segmentation


The MEMM is a modification of the HMM in two respects: 1) we can use arbitrary features in the model; 2) it is a discriminative model rather than a generative one. In essence, it is a Markov model trained in the maximum-entropy (ME) way.

There are some important points about conditional maximum entropy (CME) in NLP I want to address here: 1) ME learning leads to a Markov random field (MRF); 2) in typical NLP applications it amounts to a zero-order CRF; 3) the dual problem of ME is maximum likelihood estimation (MLE), which can be optimized in many ways, e.g. GIS (something like coordinate descent), IIS, gradient-based search, or L-BFGS.
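
As a concrete illustration, here is a minimal sketch of training such a zero-order conditional ME model, P(y | x) proportional to exp(sum_a lambda_a f_a(x, y)), by plain gradient ascent on the conditional log-likelihood (the ME dual). The data layout (feats, data), feature vectors, and learning rate are assumptions made for the sketch, not something from the post.

```python
import numpy as np

def log_likelihood_and_grad(lmbda, feats, data):
    """feats[x][y] is the binary feature vector f(x, y); data is [(x, y), ...]."""
    ll, grad = 0.0, np.zeros_like(lmbda)
    for x, y in data:
        scores = {yy: float(lmbda @ feats[x][yy]) for yy in feats[x]}
        log_z = np.logaddexp.reduce(list(scores.values()))
        ll += scores[y] - log_z
        grad += feats[x][y]                              # observed feature counts
        for yy, s in scores.items():
            grad -= np.exp(s - log_z) * feats[x][yy]     # expected counts under the model
    return ll, grad

def train(feats, data, dim, lr=0.1, iters=200):
    """Plain gradient ascent on the conditional log-likelihood (the ME dual)."""
    lmbda = np.zeros(dim)
    for _ in range(iters):
        _, grad = log_likelihood_and_grad(lmbda, feats, data)
        lmbda += lr * grad   # GIS, IIS, or L-BFGS could replace this update
    return lmbda
```

The gradient is just observed minus expected feature counts, which is why any of the optimizers listed above can be plugged into the same objective.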

In an MEMM, unlike an HMM in which each observation depends only on the current latent state of a Markov chain, each latent state depends on the current observation as well. This graphical model implies a factorization in which we need to model Pr(y_{t+1} | y_t, x_{t+1}). For simplicity we just write Pr(y | x), since y_t is given (e.g. state 1) and we focus on that case. We model this distribution with features: let f_a(x, y) be a feature, where a is a pair consisting of a binary predicate on x and a target value of y, and the feature takes a true value only when both are satisfied. This is the same setup as in CME. Thus for each transition (i.e. each source state) we have a separate probability model.
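
To make the paired-indicator features concrete, here is a small sketch, assuming word-string observations and made-up predicates and state labels; in an MEMM one such feature set and weight vector would be kept per source state.

```python
import math

def make_feature(predicate, target_state):
    """f_a(x, y) = 1 iff the predicate holds on x and y equals the target state."""
    return lambda x, y: 1.0 if (predicate(x) and y == target_state) else 0.0

# Hypothetical predicates/states; a real model would define many such pairs.
features = [
    make_feature(lambda x: x[:1].isupper(), "NAME"),
    make_feature(lambda x: x.endswith("ing"), "VERB"),
]

def transition_prob(weights, x, states):
    """P(y | x) for one source state, from that state's own weights and features."""
    scores = {y: sum(w * f(x, y) for w, f in zip(weights, features)) for y in states}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}
```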

An MEMM is trained separately for each kind of transition (i.e. for each source state). Then, given a sequence of observations, the decoding (finding the y's) can be done with dynamic programming just as in an HMM, only maximizing the conditional probability instead. If we do not have labels (y) in the training procedure, we must resort to something like the Baum-Welch algorithm.
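
Below is a minimal Viterbi-style sketch of that decoding step. It assumes trans_prob(y_prev, x, y) returns the learned P(y | y_prev, x), that start_prob(x, y) handles the first position, and that all probabilities are nonzero; the function names and interfaces are illustrative, not from the post.

```python
import math

def viterbi(obs, states, start_prob, trans_prob):
    """Find the state sequence maximizing the product of conditional probabilities."""
    scores = [{y: math.log(start_prob(obs[0], y)) for y in states}]
    back = []
    for x in obs[1:]:
        col, ptr = {}, {}
        for y in states:
            prev = max(states, key=lambda yp: scores[-1][yp] + math.log(trans_prob(yp, x, y)))
            col[y] = scores[-1][prev] + math.log(trans_prob(prev, x, y))
            ptr[y] = prev
        scores.append(col)
        back.append(ptr)
    # trace the best path backwards
    y = max(states, key=lambda s: scores[-1][s])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```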

The later CRF model is essentially an undirected version of the MEMM. Other work in NLP suggests that the MEMM is more expressive than the Moore HMM (the one we commonly use) but less expressive than the Mealy HMM.
