Friday, April 24, 2009

Isotonic Conditional Random Fields and Local Sentiment Flow


This paper introduces a variant of CRF that adds constraints derived from prior knowledge. We know that certain words indicate positive or negative sentiment, so the corresponding feature f_{(\sigma, w)}, defined as
f_{(\sigma, w)}(y) = \begin{cases} 1 & y = \sigma, x = w \\ 0 & \text{otherwise} \end{cases},
should have a larger (or smaller) parameter \mu_{(\sigma, w)}. That is, if w is a word indicating positive sentiment, then \mu_{(\sigma, w)} \geq \mu_{(\sigma', w)} whenever \sigma \geq \sigma'.

Constraints of this style still lead to a convex optimization problem. The authors prove that, given a sequence x and two labellings s and t, letting x' = (x_1, \ldots, x_j \cup \{ w\}, \ldots, x_n), if \mu_{(t_j, w)} \geq \mu_{(s_j, w)}, then
\frac{\Pr(s \mid x)}{\Pr(s \mid x')} \geq \frac{\Pr(t \mid x)}{\Pr(t \mid x')}.
This gives a new interpretation of the constraints. The model is reparameterized with the Möbius inversion theorem (what is that?) and then solved.
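
As a side note (my own illustration; the paper itself handles the constraints through the Möbius reparameterization rather than a projection): the ordering \mu_{(\sigma, w)} \geq \mu_{(\sigma', w)} for \sigma \geq \sigma' is an isotonic constraint, and the pool-adjacent-violators routine below projects one word's parameter vector onto the monotone cone. The numbers in mu_word are made up.

import numpy as np

def isotonic_projection(mu):
    # pool-adjacent-violators: least-squares projection onto {mu_1 <= ... <= mu_K}
    out = []                                  # list of [block mean, block size]
    for m in mu:
        out.append([float(m), 1])
        # merge blocks backwards while the monotone constraint is violated
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, s2 = out.pop()
            m1, s1 = out.pop()
            out.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    return np.concatenate([[m] * s for m, s in out])

# Hypothetical unconstrained estimates mu_(sigma, w) of one positive word w,
# indexed by the ordered sentiment labels sigma.
mu_word = np.array([0.3, -0.1, 0.4, 0.2])
print(isotonic_projection(mu_word))           # nondecreasing: [0.1 0.1 0.3 0.3]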

Monday, April 20, 2009

Tensor Decomposition and Applications


This is the first systematic paper I have read on tensors. A tensor in computer science is a multi-dimensional array (I am still wondering whether it has anything to do with the term tensor in differential geometry).

So before we talk about the decompositions, here are several notations we will use in the text (a small numpy sketch of some of these operations follows the list):
  • The order of the tensor is the number of dimensions: vectors are order-1 tensors, matrices are order-2.
  • An order-1 tensor is denoted with \mathbf{a}, an order-2 tensor with \mathbf{A}, and one of higher order with \mathcal{A}.
  • A fibre is the tensor analog of a row/column of a matrix. A tensor \mathcal{X} of order n has fibres of the form \mathbf{x}_{i_1, \ldots, i_{j-1}, :, i_{j+1}, \ldots, i_n}, i.e. the vector obtained by fixing all indices i_k except one, i_j.
  • A slice of a tensor is obtained by fixing all indices except two, which can be regarded as a plane.
  • A tensor \mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_n} is said to have rank 1 if \mathcal{X} = \mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(n)} where the \circ operator means x_{i_1, \ldots, i_n} = a_{i_1}^{(1)}\cdots a_{i_n}^{(n)}.
  • Matricization (a.k.a. unfolding, flattening) is the process of reordering the tensor data into a matrix. The mode-k unfolding of the order-n tensor \mathcal{X} is denoted as \mathbf{X}_{(k)} \in \mathbb{R}^{I_k\times (\prod_{s\neq k} I_s)}.
  • The k-mode product of a tensor \mathcal{X} and a matrix U \in \mathbb{R}^{J \times I_k} is
    (\mathcal{X} \times_k \mathbf{U})_{i_1, \ldots, j, \ldots, i_n} = \sum_{i_k = 1}^{I_k} x_{i_1, \ldots, i_n} u_{j, i_k}.
  • The Kronecker product of two matrices \mathbf{A}\in \mathbb{R}^{I \times J}, \mathbf{B} \in \mathbb{R}^{K \times L} is
    \mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{1, 1} \mathbf{B} & \cdots & a_{1, J}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I, 1} \mathbf{B} & \cdots & a_{I, J} \mathbf{B} \end{pmatrix}.
  • The Khatri-Rao product of the two matrices \mathbf{A} \in \mathbb{R}^{I \times K}, \mathbf{B} \in \mathbb{R}^{J \times K} is
    \mathbf{A} \odot \mathbf{B} = \begin{pmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{pmatrix}.
  • The Hadamard product is the elementwise product of two matrices \mathbf{A}, \mathbf{B} \in \mathbb{R}^{I \times J}, i.e.
    \mathbf{A} * \mathbf{B} = \begin{pmatrix} a_{1,1} b_{1, 1} & \cdots & a_{1, J} b_{1, J} \\ \vdots & \ddots & \vdots \\ a_{I,1} b_{I, 1} & \cdots & a_{I, J} b_{I, J}\end{pmatrix}.
  • The so-called CP decomposition (canonical decomposition, parallel factors) of a tensor is a sum of rank 1 components
    \mathcal{X} \approx \sum_{r = 1}^R \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(n)}.
    This is usually denoted with
    \mathcal{X} \approx [\![ \mathbf{\lambda}; \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(n)} ]\!] \stackrel{\triangle}{=} \sum_{r = 1}^R \lambda_r \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(n)}.
    where the columns of each \mathbf{A}^{(i)} are normalized to unit length, with the scalings absorbed into \mathbf{\lambda}.
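
To make the notation concrete, here is a small numpy sketch (my own, not from the survey) of the mode-k unfolding, the k-mode product and the Khatri-Rao product, checked on a random rank-1 tensor. The unfolding fixes one particular column ordering (remaining modes in their original order, last index fastest).

import numpy as np

def unfold(X, k):
    # mode-k matricization X_(k): shape I_k x prod_{s != k} I_s, remaining modes
    # kept in their original order with the last index varying fastest
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_product(X, U, k):
    # k-mode product X x_k U: contract mode k of X with the second index of U (U is J x I_k)
    return np.moveaxis(np.tensordot(X, U, axes=(k, 1)), -1, k)

def khatri_rao(A, B):
    # column-wise Kronecker product of A (I x R) and B (J x R), giving an (I*J) x R matrix
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(3), rng.standard_normal(4), rng.standard_normal(5)
X = np.einsum("i,j,k->ijk", a, b, c)                      # rank-1 tensor a o b o c
U = rng.standard_normal((2, 3))
print(np.allclose(mode_product(X, U, 0),
                  np.einsum("i,j,k->ijk", U @ a, b, c)))  # (a o b o c) x_1 U = (Ua) o b o c
print(np.allclose(unfold(X, 1), np.outer(b, np.kron(a, c))))  # unfolding of a rank-1 tensor
A, B = rng.standard_normal((3, 2)), rng.standard_normal((4, 2))
KR = khatri_rao(A, B)
print(np.allclose(KR.T @ KR, (A.T @ A) * (B.T @ B)))      # Khatri-Rao / Hadamard identity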

However, low-rank approximation for tensors is not as well defined as it is for matrices. For one thing, the set of tensors of rank at most k is not closed, so the best rank-2 approximation of a rank-3 tensor may simply not exist. Although there are various bounds on the rank, it is still not clear how to estimate the rank of a given tensor. Therefore it is difficult to find a good low-rank approximation.

The most popular algorithm for CP decomposition is ALS (alternating least squares). The idea is simple: fix all \mathbf{A}^{(i)} except \mathbf{A}^{(k)} and, using the mode-k unfolding, fit \mathbf{A}^{(k)} alone, which leads to the following least-squares problem,
\min_{\hat{\mathbf{A}}} \| \mathbf{X}_{(k)} - \hat{\mathbf{A}} (\mathbf{A}^{(1)} \odot \cdots \odot \mathbf{A}^{(k-1)} \odot \mathbf{A}^{(k+1)} \odot \cdots \odot \mathbf{A}^{(n)})^\top\|,
whose least-squares solution is
\hat{\mathbf{A}} = \mathbf{X}_{(k)} [ (\mathbf{A}^{(1)} \odot \cdots \odot \mathbf{A}^{(k-1)} \odot \mathbf{A}^{(k+1)} \odot \cdots \odot \mathbf{A}^{(n)})^\top ]^\dagger.
Please notice that (\mathbf{A} \odot \mathbf{B})^\top (\mathbf{A} \odot \mathbf{B}) = (\mathbf{A}^\top \mathbf{A}) * (\mathbf{B}^\top \mathbf{B}), therefore
\hat{\mathbf{A}} = \mathbf{X}_{(k)} (\mathbf{A}^{(1)} \odot \cdots \odot \mathbf{A}^{(k-1)} \odot \mathbf{A}^{(k+1)} \odot \cdots \odot \mathbf{A}^{(n)})[(\mathbf{A}^{(1)\top} \mathbf{A}^{(1)}) * \cdots * (\mathbf{A}^{(k-1)\top} \mathbf{A}^{(k-1)}) * (\mathbf{A}^{(k+1)\top} \mathbf{A}^{(k+1)}) * \cdots * (\mathbf{A}^{(n)\top} \mathbf{A}^{(n)})]^\dagger.
The values of \mathbf{\lambda} are updated with the column norms of \hat{\mathbf{A}}, and \mathbf{A}^{(k)} is set to the column-normalized \hat{\mathbf{A}}.
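
Below is a minimal numpy sketch of CP-ALS implementing exactly these updates (my own code; the unfold and khatri_rao helpers from the previous snippet are repeated so the block is self-contained, and there is no convergence check or line search, so treat it as an illustration rather than a robust implementation).

import numpy as np

def unfold(X, k):
    # mode-k matricization; remaining modes kept in order, last index fastest
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def khatri_rao(mats):
    # column-wise Kronecker product of a list of matrices that share R columns
    R = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = (out[:, None, :] * M[None, :, :]).reshape(-1, R)
    return out

def cp_als(X, R, n_iter=100, seed=0):
    # rank-R CP decomposition via alternating least squares
    rng = np.random.default_rng(seed)
    A = [rng.standard_normal((I, R)) for I in X.shape]
    lam = np.ones(R)
    for _ in range(n_iter):
        for k in range(X.ndim):
            others = [A[s] for s in range(X.ndim) if s != k]
            KR = khatri_rao(others)              # (prod_{s != k} I_s) x R
            V = np.ones((R, R))                  # Hadamard product of Gram matrices,
            for M in others:                     # i.e. KR.T @ KR computed cheaply
                V *= M.T @ M
            A_hat = unfold(X, k) @ KR @ np.linalg.pinv(V)
            lam = np.linalg.norm(A_hat, axis=0)  # column norms become the lambdas
            A[k] = A_hat / lam
    return lam, A

# Sanity check on a synthetic rank-3 tensor (sizes are arbitrary).
rng = np.random.default_rng(1)
true = [rng.standard_normal((I, 3)) for I in (4, 5, 6)]
X = np.einsum("ir,jr,kr->ijk", *true)
lam, A = cp_als(X, R=3, n_iter=200)
Xhat = np.einsum("r,ir,jr,kr->ijk", lam, *A)
print(np.linalg.norm(X - Xhat) / np.linalg.norm(X))   # should be close to zero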

Another important problem for tensors is the Tucker decomposition (often computed via HOSVD, the higher-order SVD): we seek
\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \times_3 \cdots \times_{n} \mathbf{A}^{(n)}.
Usually we need a truncated version of the HOSVD, which corresponds to a low n-rank approximation. This problem can also be solved with ALS.
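
A minimal sketch of the truncated HOSVD under the same conventions (my own code, not from the survey): each factor is taken as the leading left singular vectors of the corresponding unfolding, and the core is obtained by mode products with the transposed factors.

import numpy as np

def unfold(X, k):
    # mode-k matricization, same convention as before
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_product(X, U, k):
    # k-mode product X x_k U
    return np.moveaxis(np.tensordot(X, U, axes=(k, 1)), -1, k)

def truncated_hosvd(X, ranks):
    # factor A^(k): leading left singular vectors of the mode-k unfolding X_(k)
    factors = [np.linalg.svd(unfold(X, k), full_matrices=False)[0][:, :r]
               for k, r in enumerate(ranks)]
    core = X                                     # core G = X x_1 A^(1)T ... x_n A^(n)T
    for k, U in enumerate(factors):
        core = mode_product(core, U.T, k)
    return core, factors

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 7, 8))
core, factors = truncated_hosvd(X, ranks=(3, 3, 3))
Xhat = core
for k, U in enumerate(factors):
    Xhat = mode_product(Xhat, U, k)              # X ~ G x_1 A^(1) x_2 A^(2) x_3 A^(3)
print(np.linalg.norm(X - Xhat) / np.linalg.norm(X))   # relative truncation error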

There are a lot of decompositions related to tensor:
  • INDSCAL (individual differences in scaling), for order-3 tensors where the first two dimensions are symmetric.
  • PARAFAC2 (parallel factorization) is, strictly speaking, not a tensor decomposition. It seeks an approximation of several matrices \mathbf{X}_k, each approximated by \mathbf{U}_k \mathbf{S}_k \mathbf{V}^\top, where \mathbf{U}_k and \mathbf{V} are orthogonal matrices and \mathbf{S}_k is diagonal.
  • CANDELINC (canonical decomposition with linear constraints), in which we know the subspace of each dimension, can be turned into a standard CP decomposition this way.
  • DEDICOM (decomposition into directional components) seeks an approximation of an asymmetric matrix, \mathbf{X} \approx \mathbf{A} \mathbf{R} \mathbf{A}^\top.
  • PARATUCK2 seeks to find an approximation of \mathcal{X},
    \mathbf{X}_k \approx \mathbf{A} \mathbf{D}_k^A \mathbf{R} \mathbf{D}_k^B \mathbf{B}^\top
    where \mathbf{A}, \mathbf{B} are matrices and \mathbf{D}_k^A, \mathbf{D}_k^B are diagonal matrices.
  • NTF (nonnegative tensor factorization): on each mode, an NMF-style nonnegative factorization is applied instead of a PCA-style one.

Friday, April 10, 2009

An Introduction to Variational Methods for Graphical Models


This paper describes in detail the variational methods for graphical models. There are basically two important applications of variational methods (here meaning variational bounds): one is for Bayesian models whose posterior probability is hard to evaluate, serving as a replacement for slow MCMC methods; the other is for graphical models where too many dependencies (edges in the graph) give rise to a difficult inference problem. The authors describe two styles of variational bounds. Whichever is employed, the ultimate objective is the same: to simplify the graph by eliminating some edges, since the bound usually allows an easier computation.

The first bound exploits the Fenchel-Legendre transform of a convex/concave function. Generally speaking, a convex (concave) function has a natural lower (upper) bound, namely the tangent hyperplane at any point of its graph. The so-called Fenchel-Legendre transform (or variational transform, as the paper calls it) of a convex function is
\hat{f}(z) = \min_x \left\{ f(x) - z^\top x \right\},
which is the intercept of the tangent of f with slope z. Therefore
f(x) \geq \langle z, x \rangle + \hat{f}(z),
which is a lower bound for the function, with z as the variational parameter (for a concave function the max is taken instead, giving an upper bound). There are two examples in the paper (a small numeric check of this bound follows the list below):
  • The QMR-DT is a disease-symptom bipartite Bayesian belief net. Due to the ``explaining away'' problem (a ``v''-shaped structure: when the child is not given, the parent nodes are independent, but once the child is observed they become dependent, and summing over these variables is difficult since they are coupled), we have to replace the probability \Pr(f_i=1 \mid d ) = 1 - \exp\left( -\sum_{j \in \mathrm{pa}(i)} \theta_{i,j} d_j - \theta_{i, 0}\right) with an upper bound.
  • The Boltzmann machine can also be handled with this method. They derive a lower bound for the logistic function that decouples a single r.v. from all of its neighbours; an upper bound, though easy to compute and able to decouple the r.v. from its neighbours, couples all of the neighbours to one another.
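
As a small self-check of the tangent bound above (my own example, not one from the paper): for the convex function f(x) = \log(1 + e^x) the intercept \hat{f}(z) works out to the binary entropy H(z) = -z\log z - (1-z)\log(1-z), so f(x) \geq zx + H(z) with equality at z = \sigma(x).

import numpy as np

def f(x):
    # convex function log(1 + e^x), the log-partition of a Bernoulli/logistic model
    return np.log1p(np.exp(x))

def tangent_bound(x, z):
    # f(x) >= z*x + H(z), where the binary entropy H(z) is the intercept \hat f(z)
    H = -z * np.log(z) - (1 - z) * np.log(1 - z)
    return z * x + H

x = 1.7
zs = np.linspace(0.05, 0.95, 19)
print(f(x), tangent_bound(x, zs).max())                          # the bound stays below f(x)
print(np.isclose(f(x), tangent_bound(x, 1 / (1 + np.exp(-x)))))  # tight at z = sigmoid(x)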
The second bound exploits Jensen's inequality,
\log \Pr(x) = \log \left( \sum_z \Pr(x, z) \right) = \log \left( \sum_z q(z) \cdot \frac{\Pr(x, z)}{q(z)} \right) \geq \sum_z q(z) \log \frac{\Pr(x, z)}{q(z)},
which resembles the standard derivation of EM. Since x is observed, we usually introduce a variational distribution q(z \mid \lambda) and maximize the lower bound by selecting a proper \lambda. The paper illustrates the idea with several model examples (a small numeric check of this bound follows the list):
  • Boltzmann machine. Now we may put all the observed r.v.s into x and all latent r.v.s into z. The proposed q(z) is a fully factorized Bernoulli distribution, and its parameters (the means) are the variational parameters.
  • Neural network, here referring to the sigmoid belief network. Just like the Boltzmann machine (the difference is that one is an MRF while the other is a belief network).
  • Factorial hidden Markov model. Since the posterior distribution couples the hidden chains, the proposed q(z) is a decoupled prior with adjustable (variational) parameters.
  • Hidden Markov decision tree.
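
The promised numeric check of Jensen's bound, with made-up numbers for a discrete latent variable: the bound holds for any q(z) and becomes tight exactly when q(z) is the true posterior \Pr(z \mid x).

import numpy as np

# Toy joint p(x, z) for one fixed observation x and a discrete latent z with 3 states.
p_xz = np.array([0.05, 0.20, 0.10])              # made-up values of p(x, z = k)
log_px = np.log(p_xz.sum())                      # exact log marginal likelihood

def jensen_bound(q):
    q = q / q.sum()
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q_uniform = np.ones(3) / 3
q_post = p_xz / p_xz.sum()                       # the true posterior p(z | x)
print(jensen_bound(q_uniform) <= log_px)         # the bound holds for any q
print(np.isclose(jensen_bound(q_post), log_px))  # and is tight at the posterior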
This idea has relationships with other known models:
  • Helmholtz machine: it uses two sets of parameters, one for the generative model and one for the recognition model, instead of computing the posterior probability with Bayes' theorem.
  • Sampling methods: they compare the Gibbs sampler and variational methods. The Gibbs sampler is said to pass messages consisting of samples (conditioned on all other variables), while variational bounds pass the statistics (or parameters) themselves. Not sure; I think the global variational method is closer to this point.
  • Bayesian methods: they are actually referring to the global variational method. Not clear why it is not discussed here. Maybe it is not their work -,-b

Thursday, April 9, 2009

High-dimensional Data Analysis: The Curses and Blessings of Dimensionality


In this non-technical paper, Donoho surveys the development of data analysis and discusses both the difficulty of dealing with high-dimensional data (the curse of dimensionality) and the possible advantages we may exploit (the blessing of dimensionality). His advisor John Tukey coined the term data analysis, which differs from traditional mathematics or statistics in its degree of rigorous derivation: more heuristic ideas are applied. He argues that the style of exploration in science has changed: traditionally only a few variables were selected to build a model that could be formulated analytically and studied in the classical way, whereas now it is far more common to collect a large amount of data, or to enrich the information about each sample in all kinds of ways.

Curse of Dimensionality:
  • For a naive search algorithm in a high-dimensional discrete space, the number of candidates is exponential in the dimensionality.
  • For optimization, we need (1/\epsilon)^d evaluations of a Lipschitz continuous objective function to obtain an approximation with error less than \epsilon.
  • In approximating a Lipschitz continuous function, we also need O\big((1/\epsilon)^d \big) evaluations for a uniform approximation of error \epsilon.
  • In numerical integration of a Lipschitz continuous function, again we need O\big( (1/\epsilon)^d\big) evaluations on a grid to approximate the integral with error \epsilon.
  • Nonparametric estimation becomes difficult: the much slower convergence rate is where the curse of dimensionality rears its ugly head.
  • Regularization is usually employed to tackle the curse of dimensionality.
Blessing of Dimensionality:
  • The concentration of measure phenomenon holds for many distributions in high-dimensional spaces. Let (\Omega, \mathcal{F}, \mu) be a metric measure space, where d(\cdot, \cdot) is the distance and \mu is a probability measure on the \sigma-field \mathcal{F}. Let \alpha(\epsilon) = \sup_A \{ \mu( \Omega \backslash A_\epsilon) \mid \mu(A) = 1/2\}, where A_\epsilon = \{ x \in \Omega \mid d(x, A) < \epsilon\} is the \epsilon-extension of the measurable set A. A so-called Levy family (\Omega_n, \mathcal{F}_n, \mu_n) satisfies \forall \epsilon > 0, \lim_{n \to \infty} \alpha_n(\epsilon) = 0, which means the probability measure concentrates in the central part of the support as n increases; if the convergence to 0 is of order C \exp(-c n \epsilon^2), it is called a normal Levy family, whose tails behave like a Gaussian. In many cases the index n can be the dimensionality, so as the dimensionality grows we may observe concentration of measure. The upshot is that in a high-dimensional space a sample is likely to carry much of the information: once a set has fixed nonzero measure, the probability of lying far from it decays rapidly, so you need not go far from the data to cover most of the support (see the small simulation after this list).
  • Dimension asymptotics. This is the refinement of the previous argument.
  • Approach to continuum: there is an underlying compactness to the space of observed data which will be reflected by approximate finite-dimensionality and an increasing simplicity of analysis for large d. (sounds like the reason for compressive sensing).
  • Approximation through randomness. Random projections, CS are in this range.
  • Approximation through computational harmonic analysis, which is a way to the continuum.
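
Here is the small simulation promised in the concentration-of-measure item (my own illustration, not from the paper): for a standard Gaussian in d dimensions, the normalized radius \|x\|/\sqrt{d} concentrates ever more tightly around 1 as d grows.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    x = rng.standard_normal((10_000, d))
    r = np.linalg.norm(x, axis=1) / np.sqrt(d)        # normalized radius of each sample
    print(d, round(r.mean(), 3), round(r.std(), 3))   # the spread shrinks as d grows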
I am not familiar with harmonic analysis. Many points in this paper are still vague to me. I must reread it when I gain more knowledge.

Tuesday, April 7, 2009

Hierarchical Dirichlet Processes


This paper is relevant to hierarchical modeling of MoE (hence related to HMoE). The DP extension of the mixture model has also been studied in this paper (with variational approximation methods). This paper only uses Monte Carlo methods for inference (instead of approximation algorithms). They simply introduce one DP prior for the parameters of the model that generates the data and another DP prior for the generation of groups. In short,
G_0 \mid \gamma, H \sim \mathrm{DP}( \gamma, H ), G_j \mid \alpha_0, G_0 \sim \mathrm{DP}( \alpha_0, G_0), \theta_{ji} \sim G_j, x_{ji} \sim F(\theta_{ji}).
Their idea of the DP is implemented with the SBP (as in this paper). However, we can only observe the x_{ji}. For learning we must find proper settings of \gamma and \alpha_0; for inference, we must obtain the posterior distribution of the two DPs. Note that for a DP MoE, the base distribution of the DP lives on the parameter space; in the HDP, the base distribution of each intermediate DP is a random measure, namely the draw from the top DP, and it is the top DP whose base distribution lives on the parameter space. Sampling from the top DP results in \sum_{k = 1}^K \frac{m_k}{m + \gamma}\delta_{\phi_k} + \frac{\gamma}{m + \gamma} H, where m_k is the number of samples taking the value \phi_k, the \phi_k are drawn from H, and m=\sum_{k=1}^K m_k; the intermediate DP puts the samples \theta_i into groups according to \sum_{t = 1}^T \frac{n_t}{i-1 + \alpha_0} \delta_{\psi_t} + \frac{\alpha_0}{i-1 + \alpha_0} G_0, where n_t is the number of samples in group t and T is the number of groups.

Now we come to the details:
  • The SBP generates a set of v_i, \pi_i and \theta_i as follows,
    v_i \sim \mathrm{Beta}(1, \alpha), \quad \pi_i = v_i \prod_{k = 1}^{i - 1}(1 - v_k), \quad \theta_i \sim G, \quad \Pr(\theta = \theta_i) = \pi_i.
    This is equivalent to a DP with concentration \alpha and base distribution G, and we may then generate the data accordingly (a small truncated sampler sketch appears after this list).
  • The HDP can be formulated as a variant of the CRP, the Chinese restaurant franchise: a customer chooses a dish (already being served or a new one) and sits at a table (serving that dish or a new one). We can formulate the generation as follows: a customer \theta_i comes into the restaurant, sits at a table according to the intermediate DP, and asks for a dish (if it is a new table) according to the top DP.
  • For inference, we use the SBP formulation and calculate the posterior of the hidden r.v.s. As we may see in the SBP formulation, we need an indicator (multinomial distribution) for each DP. The sampling is done with a Gibbs sampler.
  • It can be used for enhancing LDA (latent Dirichlet allocation) and HMM.
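
Here is the truncated stick-breaking sketch promised above (my own toy code; the base measure H = N(0, 1), the truncation level T and both concentration parameters are arbitrary choices). The point of the hierarchy is visible in the code: the group-level DP draws its atoms from the discrete top-level draw G_0, so atoms are shared across groups.

import numpy as np

def dp_stick_breaking(alpha, base_sampler, T, rng):
    # truncated stick-breaking draw from DP(alpha, H): weights pi and atoms theta
    v = rng.beta(1.0, alpha, size=T)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return pi / pi.sum(), base_sampler(T, rng)   # renormalize because of the truncation

rng = np.random.default_rng(0)
T = 200

# Top level: G_0 ~ DP(gamma, H), with H = N(0, 1) standing in for the parameter prior.
pi0, atoms0 = dp_stick_breaking(1.0, lambda T, r: r.standard_normal(T), T, rng)

# Group level: G_j ~ DP(alpha_0, G_0). Its base draws are atoms of the discrete G_0,
# which is exactly why mixture components end up shared across groups.
pi_j, atoms_j = dp_stick_breaking(
    1.0, lambda T, r: atoms0[r.choice(len(pi0), size=T, p=pi0)], T, rng)

theta_ji = atoms_j[rng.choice(len(pi_j), p=pi_j)]    # parameter for one datum x_ji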

Monday, April 6, 2009

Random Projection for Manifold Learning


This paper tells us that we may project data from a higher-dimensional space (n dimensions) into a random subspace (m dimensions) without losing much of the intrinsic structural information. Therefore we may apply one of the manifold learning algorithms to the randomly projected data and still recover the structure in a proper space (d dimensions), where d < m << n and m = O(d log n).

The first theorem comes from satisfying the RIP: when
m \geq O\left( \frac{d \log (n V \tau^{-1}) \log(\rho^{-1})}{\epsilon^2}\right)
with probability no less than 1-ρ, for any x, z, we have
(1 - \epsilon) \sqrt{\frac{m}{n}} \leq \frac{d_i( \Phi x, \Phi z)}{d_i( x, z )} \leq (1 + \epsilon) \sqrt{\frac{m}{n}},
where Φ is the random orthogonal projection, V is the volume and 1/τ is the condition number of the latent manifold. To estimate the intrinsic dimension, the authors propose to employ the Grassberger-Procaccia algorithm (based on correlation). They also derive bounds for the estimated dimension: when
m \geq O\left( \frac{d \log( n V \tau^{-1}) \log (\rho^{-1})}{\beta^2 \delta^2} \right),
then the estimated correlation dimension satisfies
(1 - \delta) \hat{d} \leq \hat{d}_\Phi \leq (1 + \delta) \hat{d},
with probability exceeding 1-ρ. And the residual variance of the ISOMAP embedding is also bounded
R_\Phi \leq R + C \Gamma^2 \epsilon,
where Γ is the diameter of the point cloud.
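
A quick numeric illustration of the distance-preservation claim (my own sketch, with a simplification: a Gaussian matrix scaled by 1/\sqrt{m} instead of the paper's random orthoprojector, so the distance ratios concentrate around 1 rather than \sqrt{m/n}):

import numpy as np

rng = np.random.default_rng(0)
n, m, N = 1000, 50, 200                          # ambient dim, projected dim, #points
X = rng.standard_normal((N, n))
Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian projection
Y = X @ Phi.T

i, j = rng.integers(0, N, size=(2, 500))         # random pairs of points
mask = i != j
d_orig = np.linalg.norm(X[i] - X[j], axis=1)[mask]
d_proj = np.linalg.norm(Y[i] - Y[j], axis=1)[mask]
ratio = d_proj / d_orig
print(ratio.min(), ratio.max())                  # close to 1 on both sides for modest m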

Since in practice we are not sure about the intrinsic dimensionality, we usually try to add more orthogonal vectors and see whether the residual variance is small enough.

Well, this is a typical way of applying CS in machine learning: put all the data into a random projection subspace and prove bounds for the subsequent learning algorithms. It is not the way I'd like to use CS.

Hierarchical Mixtures of Experts and the EM Algorithm


This paper formulates the HMoE, which can be regarded as a soft decision tree. Each leaf node is an expert, usually a GLIM whose parameters are to be estimated. The internal nodes are soft partitions (something like a multinomial regression, with the number of classes equal to the number of children), so each internal node corresponds to a hidden r.v. whose parameters (the coefficients of the inner product) are to be estimated. As a whole the model is a Bayesian belief network (a poly-tree?). Training this kind of model can be done with the EM algorithm, and the posterior of the hidden variables must be calculated (via belief propagation, a typical inference problem). This is an old paper, which was wrongly picked out -,-b

Sunday, April 5, 2009

Mixture-of-Parents Maximum Entropy Markov Models


Like the skip-chain CRF, the mixture-of-parents MEMM also aims at exploiting long-range interactions for NER problems. The skip-chain CRF, though beautiful in theory, does not admit easy inference (there are loops). This paper adds directed edges between r.v.s, with the assumption that Pr( y | pa(y) ) is a mixture of Pr( y | y') for y' in pa(y) (note that x is omitted in this formulation and here y refers to a single element). With this assumption, the conditional probability Pr(y | x) can still be computed with dynamic programming (now we also have to sum over all parents, and here y refers to the whole sequence).

The tricky part is that the mixing proportions over the parents of each r.v. are assumed to be known and equal. They then obtain a convex optimization problem that can be solved just as in an MEMM: by computing the gradient, L-BFGS can be applied. They also consider another objective (log conditional marginals instead of the log joint conditional) which is non-convex but well behaved in practice.

Infinitely Imbalanced Logistic Regression


This paper studies binary classification when we have a fixed set of samples from the positive class but infinitely many samples from the negative class. Their main result is formulated in the following theorem:
Let n \geq 1 and x_i \in \mathbb{R}^d be fixed, and suppose that F_0 satisfies the tail condition
\int e^{x^\top \beta}( 1 + \| x\|) \,\mathrm{d} F_0(x) < \infty, \qquad \forall \beta \in \mathbb{R}^d
and surrounds
\bar{x} =\frac{1}{n} \sum_{i = 1}^n x_i.
Then the maximizer of the centered log-likelihood satisfies
\lim_{N \to\infty} \frac{\int e^{x^\top \beta} x \,\mathrm{d} F_0(x)}{\int e^{x^\top \beta}\,\mathrm{d} F_0(x)} = \bar{x}.

This theorem tells us several important things:
  • Two conditions (the tail condition and the requirement that F_0 surround \bar{x}) are needed for β not to diverge; if either is violated there is a counterexample in which β diverges.
  • The limiting β depends on the positive samples only through their mean, so it does NOT really matter how they are distributed.
  • The author suggests this may be good for understanding the behavior of logistic regression and by removing the outlier or moving it towards the mean we can get a better model.
  • If F_0 is a Gaussian or a mixture of Gaussians, β can be calculated as in LDA (the generative counterpart); a rough numerical check of this limit appears below.
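
A rough numerical check of the theorem (my own toy setup, not from the paper; the positive-class mean, the sample sizes and the Newton solver are arbitrary choices): for F_0 = N(0, I) the limiting tilted mean equals \beta itself, so the fitted \beta should roughly match \bar{x} when the negative class is huge. Convergence in N is slow, so expect only an approximate match.

import numpy as np

def fit_logistic(X, y, n_iter=50):
    # plain-numpy logistic regression with intercept, fitted by Newton's method
    Z = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ w))
        g = Z.T @ (y - p)                        # gradient of the log-likelihood
        H = (Z * (p * (1 - p))[:, None]).T @ Z   # negative Hessian
        w += np.linalg.solve(H, g)
    return w[0], w[1:]                           # intercept, slope beta

rng = np.random.default_rng(0)
x_pos = rng.standard_normal((20, 2)) + np.array([1.0, -0.5])   # fixed positive sample
x_neg = rng.standard_normal((200_000, 2))                      # F_0 = N(0, I), N "large"
X = np.vstack([x_pos, x_neg])
y = np.concatenate([np.ones(len(x_pos)), np.zeros(len(x_neg))])
intercept, beta = fit_logistic(X, y)
print(beta)                    # should be roughly the positive-class mean below
print(x_pos.mean(axis=0))      # x-bar; the intercept itself drifts to -infinity as N grows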

This is actually written by a stat guy and involves more derivations than other pure machine learning papers. This is quite interesting. I'd like to explore some theoretical properties of simple models but up to now I haven't found a proper point.

Wednesday, April 1, 2009

Compressive Sampling


This is the first paper I have read about compressive sampling (a.k.a. compressed sensing). The main idea of CS is that we have a measurement matrix Φ and a representation matrix Ψ, and the two are quite different. The traditional way of signal processing uses the same basis for measurement and representation. The Nyquist-Shannon theorem says we need a bandwidth as large as the connected support in the Φ-domain. But in practice our signal is usually sparse in the Ψ-domain, and the sparse coefficients do not occupy a connected subset of the whole space. Therefore, to apply the N-S theorem we need a large bandwidth to transmit the data, and we miss the chance to exploit the sparsity.

The new findings in CS say that if the measurement Φ satisfies a certain condition, we might recover the sparse encoding in Ψ-domain with a linear programming solver:
\min \| x \|_1 \quad \text{subject to} \quad \Phi \Psi x = y,
where y is the measurement of the groundtruth (i.e. Φf, where f is the original signal).
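
As a minimal sketch of this recovery step (my own toy sizes, not from the paper), the l1 problem can be handed to a stock LP solver by introducing auxiliary variables t with |x_i| \leq t_i:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, K, S = 128, 40, 5                                  # signal length, measurements, sparsity
Psi = np.linalg.qr(rng.standard_normal((N, N)))[0]    # orthonormal representation basis
x_true = np.zeros(N)
x_true[rng.choice(N, S, replace=False)] = rng.standard_normal(S)
Phi = rng.standard_normal((K, N))                     # random Gaussian measurement matrix
y = Phi @ (Psi @ x_true)                              # y = Phi f with f = Psi x

# min ||x||_1  s.t.  Phi Psi x = y, as an LP over (x, t) with |x_i| <= t_i
A = Phi @ Psi
c = np.concatenate([np.zeros(N), np.ones(N)])
A_ub = np.block([[ np.eye(N), -np.eye(N)],            #  x - t <= 0
                 [-np.eye(N), -np.eye(N)]])           # -x - t <= 0
b_ub = np.zeros(2 * N)
A_eq = np.hstack([A, np.zeros((K, N))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * N + [(0, None)] * N)
x_hat = res.x[:N]
print(np.linalg.norm(x_hat - x_true))                 # near zero: exact recovery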

The measurement matrix Φ must satisfy the UUP (uniform uncertainty principle) or RIP (restricted isometry property). To put it simply, any subset of coordinates (of cardinality less than S), under the transform of the corresponding part of Φ, keeps its norm within a certain bound, and this bound determines when exact decoding will succeed. Many stochastic matrices can serve as Φ, e.g. Gaussian, binary, Fourier (randomly selected rows of the Fourier transform), or incoherent measurements (randomly selected rows of a random orthogonal matrix). Experts are busy proving how few rows we need for exact decoding, i.e. proving tight bounds relating S (the sparsity in the Ψ-domain), N (the length of the signal) and K (the number of measurements, i.e. the number of rows of Φ).

It can be proved that the recovered x is optimal in a certain sense, and the idea extends to signals with noise.

There is research on CS in machine learning that exploits the sparsity. As SVM or RVM suggests, the classifier in the dual representation should be sparse, so it seems possible to train a model in a randomly projected space without losing information. There has been work along these lines; I will explore it later. For now we just need the idea.

Exponential Family for Conditional Random Fields


This paper is from UAI 2004, the theoretical version of the ICML 2004 paper previously scanned here. It connects the concepts of CRF, RKHS, exponential families and GPs. But in a sense I don't think this paper is of much interest (lots of the results are known from books, I guess). The important section is the one on optimization. Come back later when I try GP (soon~).