Monday, November 23, 2009

Two-view Feature Generation Model for Semi-supervised Learning


We first take a look at their logic: for semi-supervised learning, a generative model is usually preferred since unlabeled data help estimate the marginal distribution \Pr(x). In a Bayesian MAP formulation, we actually solve
\min_\alpha - \sum_i \log \Pr(y_i \mid \alpha, x_i) - \log \Pr(x_u \mid \alpha)\Pr(\alpha)
which differs slightly from a direct generative model. Here the first term is a discriminative term and the second term is a penalty from the unlabeled part (so it is closer to a ``supervised loss + penalty'' model). This paper does talk about models of the latter kind, using auxiliary problems.

The two-view model means, analogous to co-training, we have two views of the feature vector x, namely z_1(x) and z_2(x), which are independent conditioned on the label. What is different about this model is that in order to obtain \Pr(y \mid z_1, z_2), we need \Pr(y \mid z_1) and \Pr(y \mid z_2). Consider \Pr(y \mid z_1) only. One possibility is to make a low-rank decomposition \Pr(z_2 \mid z_1) = \sum_y \Pr(z_2 \mid y) \Pr(y \mid z_1), but the LHS is sometimes impossible to compute. An approximation is to encode z_2 with a set of binary labels t_1^k(z_2). Then \Pr(t_1^k \mid z_1) = \sum_y \Pr(t_1^k \mid y) \Pr(y \mid z_1) can be computed. By increasing the number of related binary labels t_1^k we may obtain a good estimate of \Pr(y \mid z_1).

They propose two models (one linear and the other log-linear, resembling linear regression and logistic regression respectively). The linear version coincides with the SVD-ASO model in their JMLR paper. The log-linear model is solved via an EM-like algorithm.

The question is what kind of binary auxiliary functions would be essential to our semi-supervised problems. This might be a key to understanding their JMLR paper on multi-task learning.

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data


This paper presents a framework for multi-task learning. Their idea is quite simple. There is a common factor \Theta which is shared across different but related problems. Therefore in each problem P_k, our parameters include w_k, which is problem-specific, and v_k, which depends on the common features controlled by \Theta. To solve the model it is usually desirable to alternately optimize over w_k, v_k and \Theta. Usually a regularizer is also included for better generalization capacity.

Using this idea, the authors propose a linear model which is solved by ASO, using an SVD in each iteration to find \Theta (SVD-ASO in their terms). With this idea, they analyze semi-supervised learning with auxiliary functions, which are essentially those multiple tasks.

Their extension of this piece of work is scanned here.

Saturday, November 21, 2009

Discriminative Semi-supervised Feature Selection via Manifold Regularization


This paper talks about feature selection via SVM. The semi-supervised part is enabled by adding a manifold regularizer. The method multiplies the features by a diagonal 0-1 matrix (selecting features). With these variables included in the optimization as well, we get the optimization problem for feature selection. The key idea for solving it is to reformulate it with the dual of the SVM while leaving the feature-selecting variables alone. Then the optimum is the saddle point of the optimization problem. This kind of problem can be found in multiple-kernel learning, which has a standard algorithm (alternating optimization w.r.t. the different variables).

The idea of using SVM for feature selection is not new. The contribution might be the semi-supervised setting. In my own research it seems that we still do not have a clear goal of achieving this with other methods. hmm...

Thursday, November 12, 2009

Dappled Photography: Mask Enhanced Cameras for Heterodyned Light Fields and Coded Aperture Refocusing


A light field conveys both the spatial and angular distribution of light incident on the camera sensor. The pioneering work to capture a light field in one photographic exposure is the plenoptic camera, a device that uses a microlens array to rearrange a 4D light field and capture it with a 2D sensor. However, the optics of the microlens array defines a fixed resolution tradeoff between spatial and angular sampling of the light field.
In this paper, the authors propose to modulate the light field by shadowing the incoming light with a mask in the optical path. In the Fourier Light Field Space (FLS), the mask creates a train of identical kernels positioned in a slanted slice, and thereby, via convolution, pulls high angular frequencies to the central angular slice, the only slice the camera measures in the FLS. Assuming that the incident light field is band limited, the captured image is the flattened version of the incident light field in the Fourier domain.
Moreover, the slant of the mask kernel, which decides the spatial-angular resolution tradeoff, is determined by the location of the mask. Consequently, the resolution tradeoff can be adjusted by translating the mask. The minimal and the maximal angular resolution are achieved by placing the mask at the aperture and at the conjugate plane respectively.
However, the mask-enhanced camera seems to trade reconstruction quality for flexibility. First, the optimal pattern of the mask varies with its location, yet in practice the mask pattern is fixed. Second, the mask blocks about half of the incident light and reduces the signal-to-noise ratio of the sensed image. Still, the paper provides a profound analysis of the principle of mask-enhanced cameras, a major category of computational camera, making itself influential in the Computational Photography community.

4D Frequency Analysis of Computational Cameras for Depth of Field Extension

by A. Levin, S. W. Hasinoff, P. Green, F. Durand and W. T. Freeman
Although many types of cameras have been invented to extend their depth of field (DoF), none of them optimizes the quality of the resulting image or, equivalently, maximizes the modulation transfer function (MTF). In this paper, the authors perform a 4D frequency analysis to estimate the maximal frequency spectra of optical systems.
The key of the analysis lies in the observation of the dimensionality gap between the 3D MTF and the 4D ambiguity function that characterizes a camera: the former is a manifold embedded in the latter, called “the focal segments”. To maximize the MTF, therefore, the ambiguity function should uniformly distribute all its energy on these segments. This analysis leads to an upper bound of the MTF.
Unfortunately, most contemporary computational cameras waste energy outside this region. The only exception is the focal sweep camera, but the phase incongruence of its OTF across various focus settings lowers the spectrum magnitude. The authors propose the lattice-focal lens. This lens is composed of a number of sub-squares, each responsible for focusing light rays from a specific depth. This spatial division of the aperture also concentrates energy on the focal region, but achieves a much higher spectrum than the focal sweep camera.
The ambiguity function, defined as auto-correlation of the 2D scalar field of an optical system, is a redundant representation. This prohibits the authors from determining the tight upper bound of the frequency spectrum. Still, the proposed analysis sheds much light on this question. Although there is no explicit analysis, it indicates that the key of maximizing MTFs may lie in phase incoherence of the optical system.

Flexible Depth of Field Photography


The depth of field (DoF) of a lens is confined to a fronto-parallel slab. In this paper, the authors attempt to break this limitation by proposing flexible depth of field photography, i.e. translating the detector during the exposure. Thus at each moment the lens produces a point spread function (PSF) associated with the sensor position, and the final PSF, called the integrated PSF (IPSF), is the sum of all the PSFs produced over the exposure.

Under this scheme, the authors suggest three applications for manipulating DoF. In the first application, the sensor is translated uniformly to produce a depth-independent, frequency-preserving IPSF, ensuring a good quality of the restored all-in-focus image. In the second application, the authors play with non-uniform translations: unwanted depth layers in the middle can be skipped so that their image is blurry enough to be unnoticeable. The authors also show that by rolling the detector's exposure during the translation, an arbitrary shape of DoF can be produced. All the applications are realized with a prototype flexible-DoF imaging system built by the authors.

The idea of flexible DoF is interesting, but it is not as novel as the authors claim in the paper. Although never thoroughly investigated, focus sweep is a widely used technique in photography under the name variable-focus photography; uniform sweeping of the DoF was proposed as early as 1972 by Hausler. The major difference between Hausler's work and this paper is merely replacing focus sweeping with sensor translation, which is minor. The usefulness of the other two applications is somewhat vague due to the underlying assumption that scene depth is known before capture.

Time-Constrained Photography


Current designs and evaluations of depth-of-field (DoF) extended cameras all assume a single photo is captured, yet within the same exposure time, a focal stack shot may give rise to a less noisy restoration. Therefore, the authors propose to evaluate the performance of camera designs in the multiple-shots scenario.

The authors first develop a Bayesian approach to restore the depth map and the all-in-focus image of a scene; they also estimate the scene-independent restoration error as a function of the depth range of interest and the capture schedule. Fixing the total exposure time, they seek the optimal number of photos to capture: a few more photos would increase the signal-to-noise ratio (SNR) of the restored image because the dominant source of image noise, photon noise, is signal-dependent; the other source, the additive read noise, penalizes excessive photos though.

According to the expected restoration error, the authors compare the performance of various camera designs. Surprisingly, the conventional camera performs as well as, if not better than, other types of cameras in all but the low-time-budget settings; it also benefits most from the multiple-shot scheme. In contrast, the coded aperture camera performs poorly in terms of light efficiency because its mask blocks most of the light from the sensor.

This paper constitutes a strong defense against intensity-masking techniques by highlighting the crucial role of time budgets in DoF extension. Indeed, there is no point in designing a DoF-extended camera without a time constraint: closing the aperture is most convenient. For the same reason, although computational cameras do not outperform conventional ones on most occasions, they serve the most demanding purpose: to capture a sharp image with the minimal exposure.

Fourier Slice Photography

by Ren Ng

Unlike the 2D notion of an image, the light field more expressively measures the light radiance along all possible 4D rays. This notion was initially suggested for synthesizing new views of non-Lambertian scenes by ray tracing, yet in this paper the author is interested in employing this tool to model photographic imaging.

The core of this work is the Fourier Slice Photography Theorem, developed from a generalization of Bracewell's Fourier Slice Theorem. The latter states that the Fourier transform (FT) of a signal's integral projection corresponds to a slice of the signal's FT. As the imaging process can be seen as a sheared integral of the light field, in the Fourier domain a photograph formed with a full aperture corresponds to a sheared 2D slice of the 4D light field. This discovery speeds up digital refocusing from O(n^4) to O(n^2 \log n) and improves precision, because the Fourier transform has a fast implementation and avoids the numerical loss of the integral. The advantage of the new algorithm is also validated by experiments.

To show the utility of the theorem, the author goes further to provide in closed form the performance limit of plenoptic cameras with a finite aperture. Previously, quantitative analysis was only possible for the simplest pinhole model, as a finite aperture complicates the problem by low-passing the light field. The author tackles this difficulty by working in the Fourier domain. First he shows that a 4D convolution of the light field yields a 2D convolution of the photographs focused at a variety of depths. Applying the Fourier Slice Photography Theorem to this statement, he concludes that a band-limit assumption on plenoptic cameras degrades their performance in digital refocusing, and the amount of degradation increases linearly with the directional resolution of the sampled light field.

This paper gives in-depth insight into the relationship between a light field and its photographic images, and pioneers the theoretical analysis of optical designs in the Computational Photography community. It also inspires the invention of new light field cameras, e.g. Raskar's design of dappled photography. Although the author only focuses on aberration-free lens models, the Fourier Slice Photography Theorem can be applied to a broader range of optical systems.

Extended Depth of Field through Wavefront Coding

by E. Dowski and W. Cathey

The authors work on extending the depth of field (DoF) of optical systems, the range of distances from which objects can be imaged in full detail. Beyond this range, the image becomes blurred, mathematically modeled as a convolution between the in-focus image and a point spread function (PSF) associated with the distance of the object. In a more general definition, an object is also considered to be within the DoF if its image can be ideally restored from what is read from the sensor.

Previously the mainstream approach to DoF extension was to block light at the aperture with an apodizer, but the optical power at the sensor is also decreased, resulting in a much longer exposure. The only prior approach with a full aperture was Hausler's focus sweep method. However, it is limited in application due to the requirement to continuously change the focus setting during the exposure.

The authors proposed to attach a phase mask to the optical system to achieve a PSF that is invariant to misfocus and beneficial to recovery of the full-resolution image, in the sense that its optical transfer function (OTF) has large values within its passband. Thus, the in-focus image could be fully recovered from the sensed image without knowledge about depth of the object.

To compute the profile of the phase mask, the authors first compute the OTF as a function of depth and the phase profile of the mask. The OTF at a specific distance is shown to be a slice of the ambiguity function through the center, and the slope of the slice corresponds to the amount of misfocus. Therefore, the PSF produced by an optical system is invariant to focal distance only when its corresponding ambiguity function is rotationally invariant over the angular region that corresponds to the extended DoF.

Limiting the profile of the phase mask to be monomial, the authors further derive that it has to be in cubic form. The bandwidth of cubic phase-masked (cubic-pm) systems is also analyzed as a function of the monomial coefficient to guarantee that the OTF has no zeros in its passband.

In experiments the authors compare the half-maximum amplitude and Fisher information of the cubic-pm PSFs to standard ones. Although the cubic-pm design greatly outperforms, this comparison may be unfair to the standard one because insensitivity to focus change is not indispensable to DoF extension. Nevertheless, the experiments do suggest that the cubic-pm design avoids the obstacle of PSF identification.

A comparison of restored images is also performed by simulation under a noise-free assumption. Again, the cubic-pm optics appears superior to the standard one. The key to its success lies in a wider support of the OTF. Acknowledging that noise is unavoidable, the authors estimate the signal-to-noise ratio of cubic-pm systems to be more than 20dB. Still, the experiments could have been more persuasive had the authors included results on real optical systems to account for not only noise, but also manufacturing imprecision of the phase mask.

In summary, this paper presents a novel solution to DoF extension. Although it was published 15 years ago, its influence in Computational Photography (CP) is still significant. On one hand, the need to maximize the amount of light at the sensor is increasingly emphasized in the area, and phase masking remains a unique solution to extended DoF under this constraint. On the other hand, this paper highlights the importance of the ambiguity function, which has recently been found to be the bridge between light field theory in CP and wavefront optics.

Saturday, October 3, 2009

Maximum Margin Clustering


This paper is interesting in that it formulates a clustering problem as a convex optimization problem. Its framework is not brand new but the idea is very enlightening for a current project. Regarding the maximum margin method, everyone knows how SVM is modelled as a quadratic programming problem with linear constraints. In clustering, we are not interested in the parametrized separation boundary but in the partition of the samples that maximizes the margin. Then the problem turns out to be an integer optimization problem, which is often NP-hard.

The key idea is to use the outer product, i.e. the kernel matrix of the assignment vector, M = y y^\top, in the dual formulation of the soft-margin SVM. Another interesting finding is that instead of using the non-convex constraint \mathrm{rank}(M) = 1, they use three families of linear constraints:
  • M encodes equivalent class information, i.e. transitive M_{i, k} \geq M_{i, j} + M_{j, k} - 1, reflexive M_{i,i} = 1 and symmetric M_{i, j} = M_{j, i};
  • M has at most 2 equivalent classes, i.e. M_{j, k} \geq - M_{i, j} - M_{i, k} - 1;
  • M has at least 2 equivalent classes, \sum_i M_{i, j} \leq N - 2.

Then the problem becomes an SDP. It might not be easy to extend the idea to multi-way clustering.
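As a concrete illustration, here is a minimal sketch (not the paper's full formulation) of how the three constraint families above can be written with cvxpy; the soft-margin SVM dual objective is replaced by a placeholder, the rank-1 constraint is relaxed to positive semidefiniteness, and N is kept tiny because the transitivity constraints number O(N^3).

```python
import cvxpy as cp

N = 8  # number of samples; kept small since there are O(N^3) constraints
M = cp.Variable((N, N), PSD=True)        # relaxation of M = y y^T (PSD implies symmetry)
constraints = [cp.diag(M) == 1]          # reflexivity M_{ii} = 1
for i in range(N):
    for j in range(N):
        for k in range(N):
            constraints += [M[i, k] >= M[i, j] + M[j, k] - 1,    # transitivity
                            M[j, k] >= -M[i, j] - M[i, k] - 1]   # at most 2 classes
constraints += [cp.sum(M, axis=0) <= N - 2]                      # at least 2 classes
# placeholder objective; the real one comes from the soft-margin SVM dual over M
problem = cp.Problem(cp.Maximize(0), constraints)
# problem.solve()  # requires an SDP-capable solver such as SCS
```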

Saturday, September 5, 2009

On the Influence of the Kernel on the Consistency of Support Vector Machines


This is kind of a math paper. I haven't really delved into mathematical stuff for a long time. This paper might be the first to explore the consistency of SVM (that is, the asymptotic behavior of the classifier compared with the optimal Bayes decision error). The main result might be as follows:
For universal kernels we have consistency results for the L2 and L1 soft-margin classifiers.
The universal kernels are those whose induced RKHS is dense in the space of continuous functions. Gaussian and Laplacian kernels are both universal. The author derives the corresponding consistency results.
At first glance I thought they had found some tricks for choosing regularization parameters. But I didn't find anything truly usable (e.g. Corollary 18).

The editor is Scholkopf... I guess only big names like him understand the key points. For now I just have quite a faint idea.

Wednesday, July 29, 2009

Efficient Euclidean Projection in Linear Time


I am really surprised that this paper got published. The previously scanned ICML 2008 paper noticed the linear algorithm already. Not sure...

On Sampling-based Approximate Spectral Decomposition


This paper proposes another approximation for kernel methods based on the earlier Nystrom method and column sampling. The prime disadvantage of the earlier methods is that they must compute the whole Gram matrix, while the proposed adaptive method does not need to. They prove several results (not that interesting though): column sampling is best for rank-1 approximation (no one will use rank-1 approximation I guess...); for a given form, neither is the optimal approximation (which is why they are not good enough?); under some cases (which look not useful in practice), Nystrom's recovery is exact.

The improvement in approximation quality is not salient, but I guess the efficiency might be improved.

Tuesday, July 28, 2009

Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations


This can be viewed as a deep version of the convolutional neural network (I am not quite familiar with this). In each layer (in a sense), there are two sublayers, one for detection, convolving the input with several filters, the other for max-pooling, shrinking (resizing) the convolved sublayer. Therefore on the whole, the layer's input is an image and the output is several convolved and downsampled images. They stack layers of this kind and get the deep version.

Let's see what we have to learn. For each convolutional sublayer, we have to train a convolution kernel and the corresponding biases (as in an RBM). The change is the max-pooling layer. Each neuron in the max-pooling layer only connects to a fixed-size patch (e.g. 2x2) of neurons in the convolutional sublayer. Since the neurons in the convolutional sublayer are 0-1 and max-pooling means the output is 0 only if none of the input neurons fires, from the outside it behaves like a big neuron which can take multiple values (e.g. 2x2+1 = 5). Then we may proceed as with an RBM, writing down the energy function, converting it to a probability, formulating the likelihood and using CD learning with a sparsity penalty.
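To make the layer structure concrete, here is a minimal numpy/scipy sketch of the forward pass only (detection by filtering, then max-pooling); the probabilistic max-pooling and the CD training described above are omitted, and the function and parameter names are my own.

```python
import numpy as np
from scipy.signal import correlate2d

def detection_and_pooling(image, filters, pool=2):
    """One CDBN-style layer, forward pass only: correlate the input with each
    filter (detection sublayer), then shrink each map by max-pooling.
    All filters are assumed to share the same shape."""
    maps = []
    for w in filters:                              # one feature map per filter
        h = correlate2d(image, w, mode='valid')    # detection sublayer
        H, W = h.shape
        H, W = H - H % pool, W - W % pool          # crop so the pool size divides evenly
        h = h[:H, :W].reshape(H // pool, pool, W // pool, pool).max(axis=(1, 3))
        maps.append(h)                             # pooled, downsampled map
    return np.stack(maps)
```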

The good idea behind the structure is that we first get some useful filters, then parts of the object and later whole objects. The features learned with the model give good results on several data sets.

Thursday, July 23, 2009

Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks


This paper describes a method to evaluate two ranking results based on users' clicks. We know the ranking affects the exposure of the links, i.e. the higher an item is ranked, the more attention it receives. Therefore it is more reasonable to use a discounted relevance called discounted cumulative gain (DCG)
\mathrm{DCG}_l = \mathrm{rel}_1 + \sum_{i = 2}^l \frac{\mathrm{rel}_i}{\log_2 i}.
That is, the rank-1 item has no discount while the rank-i item has a discount ratio of 1/\log_2 i.
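A small sketch of the DCG formula above in numpy (the relevance scores are assumed to be given in rank order):

```python
import numpy as np

def dcg(rel):
    """DCG_l = rel_1 + sum_{i=2}^l rel_i / log2(i), with rel listed by rank."""
    rel = np.asarray(rel, dtype=float)
    discounts = np.ones(len(rel))
    discounts[1:] = 1.0 / np.log2(np.arange(2, len(rel) + 1))
    return float(np.sum(rel * discounts))

# e.g. dcg([3, 2, 3, 0, 1]) scores one ranking of the judged documents
```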

So if we want to compare two ranking results, we have to know the relevance scores of the documents given the query. These are not always known. People might have been asked to score the documents with discrete values (on a scale of 5, e.g.) but not all documents will be marked. We model these scores with a multinomial distribution, simulate the scoring procedure and compare the two DCG values. If in 95% of the cases one is higher than the other, we might assert it is better.

The authors propose quite a simple model for \Pr(X_i \mid q, c). It is an ordinal logistic regression model with linear and quadratic features. So once we have trained the model, we can simulate the DCG and compare two rankings.

Herding Dynamical Weights to Learn


This paper talks about a novel way of learning MRFs such as Hopfield networks (is that also a Boltzmann machine?). For these models, the primal-dual structure of maximum entropy learning and the maximum likelihood estimator suggests the possible learning algorithms, e.g. gradient-based ones, since although the optimization is constrained in the former form, it is unconstrained in the dual form.

The gradient-based algorithms require solving the following inference problem
w_{\alpha}^{(t+1)} = w_\alpha^{(t)} + \eta( \bar{g}_\alpha - \mathbb{E}[g_\alpha]_P)
where the expectation is computed under the given model. The common ways of computing the expectation include a Gibbs sampler (MCMC) or a mean-field approximation (usually not good). For certain models, such as RBMs, we might think about Hinton's contrastive divergence, but we still need non-deterministic algorithms.

The herding algorithm the author proposes here is deterministic. It takes the zero-temperature limit of the annealed negated log-likelihood function, which results in a tipi function. He then formulates herding as first maximizing over states (due to the limit), which yields some pseudo samples, and then computing the gradient based on these pseudo samples.
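Here is a minimal herding sketch for a tiny fully visible binary model, just to make the two steps (maximize to get a pseudo sample, then move the weights toward the data moments) concrete; the brute-force state enumeration and the feature choice are my own simplifications, not the paper's setup.

```python
import numpy as np
from itertools import product

def feats(s):
    """Features of a binary state: singleton terms plus pairwise products."""
    s = np.asarray(s, dtype=float)
    return np.concatenate([s, np.outer(s, s)[np.triu_indices(len(s), 1)]])

def herd(g_bar, states, n_steps=1000):
    """Deterministic herding: s_t = argmax_s <w, g(s)>, then w <- w + (g_bar - g(s_t));
    the pseudo samples play the role of MCMC samples."""
    w = np.array(g_bar, dtype=float)
    samples = []
    for _ in range(n_steps):
        s = max(states, key=lambda s: float(w @ feats(s)))  # maximization step
        samples.append(s)
        w += g_bar - feats(s)                               # weight update
    return samples

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(50, 4))                # toy "training" data
g_bar = np.mean([feats(s) for s in data], axis=0)      # data moments
pseudo_samples = herd(g_bar, list(product([0, 1], repeat=4)))
```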

I am not familiar with dynamical systems and the intrinsic idea behind this stuff though.

Wednesday, July 22, 2009

Statistical Geometrical Features for Texture Classification


This paper simply provides us with 16 features for texture classification. Well... try them in some applications then.

Monday, July 20, 2009

An Efficient Projection for L_{1, infty} Regularization


This optimization problem is a little different from the L1-penalized version, in that the variable to be optimized is a matrix, and the corresponding penalty is the so-called L_{1, \infty} norm,
\| X \|_{1, \infty} = \sum_{i = 1}^m \max_j X_{i, j}.
The optimization is based on a previous ICML 07 paper, but now we are dealing with matrices instead of vectors. For example, the toy optimization problem is modified to its matrix version
\min_{B, \mu} \frac{1}{2} \sum_{i, j} (A_{i, j} - B_{i, j})^2,
such that \forall i, j: B_{i, j} \leq \mu_i, \sum_i \mu_i = C, and nonnegativity constraints for B, given a nonnegative matrix A.
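For reference, a tiny numpy helper for the norm as defined above (for general matrices the inner max is over |X_{i,j}|; the text's formula drops the absolute value for the nonnegative case):

```python
import numpy as np

def l1_inf_norm(X):
    """||X||_{1,inf} = sum over rows i of max_j |X_{i,j}|."""
    return float(np.abs(X).max(axis=1).sum())
```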

But somehow there seems to be nothing else new.

Support Vector Machine Learning for Image Retrieval


This is actually a vision paper. I am not sure whether the active learning here is really more than relevance feedback. In a query for similar images, the user labels some relevant and irrelevant images. An SVM is then used to rank the images accordingly. This helps the current query but will not help later searches. A possible improvement is adding some manifold regularization. Quite contrary to my expectation. Hmm...

Sunday, July 19, 2009

Regression by Dependence Minimization and its Application to Causal Inference in Additive Noise Models


This paper follows the NIPS 08 paper, with a different regressor. We know HSIC can be used for dependence maximization as well as dependence minimization. Here we simply use it to minimize dependence. Basically there are two ways of doing regression: one is to minimize the dependence between the residual and the input, the other is to maximize the dependence between the response and the prediction (is that possible?).

With HSIC it is easy to minimize the dependence between the input and the residual. The pro is that we do not need to specify the additive noise's distribution. In many regression problems, we actually assume the noise is Gaussian; now we may forget about possible violations of this assumption. But the con is that the optimization of the model becomes much more difficult.

The good thing about HSIC is that we have a statistical test, which allows us to judge whether it is statistically reliable to assert independence. I am not sure whether we have a similar statistic for KDR-like terms.

Nonlinear Causal Discovery with Additive Noise Models


This paper talks about making causal inferences from data. We know that if two r.v.s have an additive linear relationship (with noise), the reverse relationship also holds. If there is a non-linear relationship, our causal inference will be easier since the inverse might not hold. This is the main result of this paper: if there exists a reverse relationship, the densities must satisfy a differential equation. From this equation we know that if \nu''' = \xi''' = 0, where the two functions are the logarithms of the PDFs of x and the additive noise n, then f must be linear.

This also suggests a way of making causal inference on DAGs. It leaves us to find a powerful regressor to capture the possible relationship between two r.v.s; the authors choose GPR. Then the residual should be statistically independent of x, which can be tested with HSIC.

However, if we are to find the latent structure instead of doing a statistical test, the search will be intolerable if the DAG size is beyond 7.

Friday, July 17, 2009

Boosting Products of Base Classifiers


Boosting is quite a useful technique for practical tasks, since we may get robust and high-precision classifiers or regressors by combining weak ones. Among the variants, AdaBoost.MH is the most popular (a little different from the binary version). This paper trains products of base classifiers instead of tree- or MoE-like ones (considering Hinton's claim of the advantages of PoE over MoE). The design of the algorithm is not difficult though. The results on MNIST show it is the second best algorithm (the best is DBN).

Well, implement one :-p

Thursday, July 16, 2009

Efficient Projections onto the l1-Ball for Learning in High Dimensions


The projection onto the L1 ball is another way of handling the L1-norm penalty in certain regularization problems. That is, instead of
\min_w \Omega(w) + \lambda \| w \|_1,
we may optimize
\min_{w} \Omega(w) \quad \text{s.t. } \| w \|_1 \leq z
where z is a constant. The former can be seen as a series of optimization problems indexed by the regularization parameter \lambda and the latter is indexed by the constant z. We can show that the two forms are equivalent.

Therefore, according to the latter form, the optimization becomes finding a minimizer of the loss function on the L1 ball. If the minimizer happens to be inside the ball, it is the solution; otherwise, it must be on the boundary of the ball. So the difficult part becomes how to project vectors onto the L1 ball efficiently. This also explains why the solution should be sparse.

The first simple case is
\min_w \frac{1}{2} \| w - v \|_2^2 \quad \text{s.t. } \sum_{i = 1}^n w_i = z, w_i \geq 0,
where v is inside the positive cone. With a little analysis, we see this is actually very simple. When \| v\|_1 \leq z, the solution is obvious. Otherwise, we should decrease the coordinates. Using a Lagrange multiplier, it is seen that each coordinate is shrunk by the same value \theta. Therefore the norm \| w \|_1 decreases as n \theta, (n-1)\theta, \ldots, with the slope changing whenever a coordinate vanishes. We stop when this piecewise linear function meets \| w \|_1 = z; we just need to find the intersection. Implemented with a sort and a brute-force search, it ends up as an O(n \log n) algorithm, but with a little care (using a quickselect-style pivoting idea), the magic turns it into an O(n) algorithm.
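Here is a minimal sketch of the O(n \log n) sort-based variant of this projection (the O(n) pivoting version and the red-black-tree bookkeeping discussed below are omitted; the function names are mine).

```python
import numpy as np

def project_to_simplex(v, z=1.0):
    """Project a nonnegative vector v onto {w : w >= 0, sum(w) = z} by finding
    the shrinkage theta where the piecewise-linear norm curve crosses z."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (cssv - z))[0][-1]
    theta = (cssv[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_to_l1_ball(v, z=1.0):
    """Project an arbitrary vector onto {w : ||w||_1 <= z} using orthant symmetry."""
    if np.abs(v).sum() <= z:
        return v.copy()
    return np.sign(v) * project_to_simplex(np.abs(v), z)
```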

Then we extend to the cases where v lies in other orthants, by symmetry. Finally, a tree structure is used to speed things up. In an implementation of a gradient-based algorithm, such as
w^{(t + 1)} = \Pi_W \Big( w^{(t)} + g^{(t)} \Big),
we calculate the projection onto the L1 ball, i.e. \Pi_W(\cdot). The idea is to use a red-black tree (I should review these structures). Then the projection can be done in time that scales linearly in the number of non-zero entries of g^{(t)} and logarithmically in the total number of features n.

Let's compare this one with the ICML 09 paper and several other papers later.

Discriminative k-Metrics


This paper tells us about an old clustering technique called k q-flats. It can be regarded as a generalization of the k-means algorithm, but instead of a centroid, for each cluster we find a q-dimensional subspace to project onto,
\sum_{j = 1}^K \sum_{x \in K_j} \| x - P_{F_j} x \|^2
This can be solved with a k-means-like (EM-like) algorithm. Basically, if we want to apply it to classification problems, we may train k q-flats for each class, but this model lacks discriminative power.
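A minimal alternating sketch of plain k q-flats, assuming flats through the origin as in the objective above (assign each point to the nearest flat, then refit each flat by SVD); the handling of degenerate clusters is my own choice.

```python
import numpy as np

def k_q_flats(X, K, q, n_iter=50, seed=0):
    """Alternate between assigning points to the nearest q-dimensional subspace
    and refitting each subspace from the top right singular vectors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(0, K, size=n)
    for _ in range(n_iter):
        residuals = np.empty((n, K))
        for j in range(K):
            Xj = X[labels == j]
            if Xj.shape[0] < q:                      # degenerate cluster: reseed it
                Xj = X[rng.choice(n, size=q, replace=False)]
            _, _, Vt = np.linalg.svd(Xj, full_matrices=False)
            F = Vt[:q].T                             # d x q orthonormal basis of F_j
            # ||x - P_{F_j} x||^2 = ||x||^2 - ||F_j^T x||^2
            residuals[:, j] = (X ** 2).sum(1) - ((X @ F) ** 2).sum(1)
        new_labels = residuals.argmin(1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```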

But this can be solved in a natural way
\sum_{i = 1}^c\sum_{j = 1}^K \left( \sum_{x \in K_{i, j}}g_1(\| F_{i, j}^\top x \|^2) + \sum_{x \not \in C_i} g_2(\| F_{i, j}^\top x\|^2)\right)
A small difference is that the optimization has to be done with gradients. The selected g_i are hinge-like loss functions and their derivatives are simple piecewise constant functions.

Wednesday, July 15, 2009

On Primal and Dual Sparsity of Markov Networks


This paper mainly talks about the relationships among several M3N variants. The primal and dual sparsity are caused by the L1-norm penalty and the constraints (according to the KKT conditions) respectively. Therefore adding an L1-norm penalty to M3N causes both kinds of sparsity, which increases generalization capacity and selects important features.

The LapM3N proposed by the authors earlier is a Bayesian version of M3N. The MAP estimator of LapM3N reduces to M3N with different penalties, e.g. the L2 norm corresponding to a Gaussian prior and the L1 norm corresponding to a Laplace prior with its parameter going to infinity.

Another relationship is found with sparse Bayesian learning, since adding a prior for each parameter of M3N (think about RVM) results in a sparse solution. The adaptive M3N yields the same result as the L1-normed M3N.

The authors propose an EM-like algorithm for training the L1-normed M3N. Obviously, it has a connection to variational Bayesian approximation.

Maybe we should write our own structured learning tools.

Monday, July 13, 2009

Large-scale Deep Unsupervised Learning using Graphics Processors


This is a paper on implementing DBN and sparse coding algorithms on GPUs. The finer-grained parallelism provided by modern GPUs outperforms the CPU architecture. The bottleneck is the IO from main memory to the memory inside the video adapter. The finer-grained parallelism allows us to exploit data parallelism: the data are divided among blocks and the job assigned to each block is further divided among threads.

They said they would provide their code online. I have not found it.

Saturday, July 11, 2009

Gradient Descent with Sparsification: An Iterative Algorithm for Sparse Recovery with Restricted Isometry Property


This paper tells one story about the following optimization problem arising from compressive sampling
\min_x \quad \| x \|_0 \qquad \text{s.t.} \quad \Phi x = y,
which is usually approximated with the following version
\min_x \quad \| x \|_1 \qquad \text{s.t.} \quad \Phi x = y.
We have quite a few algorithms for solving the optimization under different conditions. This paper proposes the following gradient-based algorithm
x \leftarrow H_s \left( x + \frac{1}{\gamma} \cdot \Phi^\top (y - \Phi x)\right)
where H_s keeps the s largest elements of x in absolute value and sets the others to 0.
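A minimal numpy sketch of this update (a plain iterative hard-thresholding loop with step size 1/\gamma as above; the fixed iteration count is my own simplification):

```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    out[idx] = x[idx]
    return out

def sparse_recovery(Phi, y, s, gamma, n_iter=200):
    """Iterate x <- H_s( x + (1/gamma) * Phi^T (y - Phi x) )."""
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x + Phi.T @ (y - Phi @ x) / gamma, s)
    return x
```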

The theoretical results are the small computational cost (K per iteration), the fast convergence rate (2L / \log(\frac{1 - \delta_{2s}}{2\delta_{2s}})) and the lenient recovery condition (\delta_{2s} < \frac{1}{3}). I think it might be a good algorithm if we later want to run some lasso experiments.

Friday, July 10, 2009

Sparse Higher Order Conditional Random Fields for Improved Sequence Labeling


This paper adds higher-order features to the CRF model while keeping exact inference feasible. The feasibility comes from sparsity. That is to say, although we include higher-order features in the feature vector, they seldom fire. So in the inference stage (since we have to calculate the gradient), the computational complexity does not grow exponentially and the sparsity makes exact inference possible. This paper extends the quantities used in linear-chain CRFs to configurations which can handle higher-order features. That is the main contribution.

I will write something related to this problem soon. First I have to finish some optimization code.

Curriculum Learning


So-called curriculum learning expresses the idea of learning things from simple to difficult. The gradually hardening tasks help build a classifier with better generalization capacity.

There is evidence from cognitive science as well as from machine learning itself. In optimization theory, the famous continuation method actually has the same spirit. Another example is deep belief nets, in which the greedy pretraining can be seen as a simpler task than the subsequent fine-tuning. The examples provided by the authors come from simple toy experiments in low-dimensional space (training two Bayesian classifiers with or without difficult examples), a shape learning task with neural nets (with or without a switching epoch, at which the training set is switched from simple to difficult samples), and an NLP example.

Their claim includes several messages:
  • difficult examples may not be useful;
  • a better curriculum might speed up online learning and guide the result to a region where better generalization can be found;
  • the idea might be connected with active learning and boosting.

Thursday, July 9, 2009

Graph Construction and b-Matching for Semi-supervised Learning


This is an ``advertisement'' paper for their 2007 paper on b-matching. In general b-matching is a graph construction algorithm like k-NN. This paper enumerates all kinds of combinations for label diffusion (propagation?):
  • graph construction: kNN and b-matching;
  • edge weights: Gaussian kernel, LLR (like LLE);
  • diffusion algorithms: GRF, LGC and GTAM.

I don't buy their conclusion. Yes graph construction might be important to later algorithms, but the b-matching result doesn't seem so different from k-NN. Maybe we have to try for ourselves.

Minimum Volume Embedding


This is another piece of work by the authors of the previously scanned paper. This work is mainly based on the MVU paper, where the graph is embedded with isometry constraints (linear in the Gram matrix) and maximized variance (the trace of the Gram matrix). Therefore the Gram matrix can be obtained via SDP optimization techniques.

But the variance to be maximized is harmful since it might cause the variance in all directions to increase, which is not necessary (as illustrated in the example in the paper). The authors instead optimize the difference of the eigenvalues of the Gram matrix,
\max \sum_{i= 1}^d \lambda_i - \sum_{i = d+1}^N \lambda_i
where \lambda_i are the eigenvalues of K and K must satisfy the same constraints as in MVU.

To solve the problem, the authors propose an iterative algorithm based on SDP. In each iteration, the eigenvectors are renewed by PCA of the Gram matrix and the Gram matrix is updated with SDP. It can be proved that the algorithm converges to a local optimum. This forces the eigenvalues irrelevant to the embedding to zero and therefore yields a minimum volume embedding.

Structure Preserving Embedding


This paper proposes the concept of structure preserving embedding: given an algorithm to construct a graph \mathcal{G}, applying it to the learned Gram matrix K should yield the same graph as the given affinity matrix A_0.

There are two kinds of algorithms to construct the graph:
  • kNN and \epsilon-ball; we can find linear constraints for K which ensure that nearby samples are nearer than the other samples (there are at most O(N^2) constraints).
  • Maximum weight subgraph, b-matching subgraph, maximum weight spanning tree; the number of constraints might be exponential in N.

The objective is \mathrm{tr}(K A), subject to \mathrm{tr}(K) \leq 1 and K \succeq 0. The authors prove that under these conditions the Gram matrix has rank 1. The embedding is therefore computed with the constraints softened by a common slack variable \xi, which allows possible violations; the resulting embedding might then have more dimensions.

For the first category of constraints, SDP can be applied directly, while for the second category, we first solve the SDP without these constraints and then add the most violated constraints one by one, each time updating the model with SDP until convergence is observed.

This technique is best regarded as a visualization technique (for graphs), though it could also be used in dimensionality reduction tasks for classification. The experiments show that with SP constraints added to MVU and MVE, the classification rate improves.

Wednesday, July 8, 2009

Geometric-aware Metric Learning


This is an extension of the ICML 07 paper, with the introduction of graph regularization. The idea is to find a pair of kernel matrices, one from a task-dependent kernel set and the other from a set of parametrized data-dependent kernels. The two sets are quite different. The former is used in the later classification or related tasks, and they choose the Mahalanobis kernel. The latter contains locality information and is created from graph kernels (a subspace spanned by the eigenvectors corresponding to the smallest eigenvalues of the Laplacian matrix). To measure the similarity of the two kernels, the Bregman divergence is employed. The difference between the current optimization and the one in the ICML 07 paper is that the other matrix is not fixed. The rough idea is to update them alternately (just as in alternating least squares for tensors).

We may change the data-dependent kernel for different tasks (e.g. graph kernels for unsupervised tasks, labels for supervised tasks). There are connections with regularization theory: the solution has a representation as a combination of the graph kernel and another term which can be interpreted as a regularizer. Another possible connection is to GPs: a GP could be regarded as a special case of the proposed framework. I think this topic is very interesting. I will come back to it when I have time.

Probabilistic Dyadic Data Analysis with Local and Global Consistency


This paper introduces a graph regularizer for PLSA. There are some interesting stories behind NLP algorithms such as PLSA and LDA (latent Dirichlet allocation), I think, but I am not quite familiar with them. From the modeling aspect, they deal with the same thing. We are given a term-document matrix X, whose elements are the counts of occurrences of a given word w_j in a document d_i. The generative model is that each document is a mixture of multinomial distributions. Namely, we have several latent topics z_k such that a document is a mixture of the topics z_k with mixing proportions \Pr(z_k \mid d_i). Each topic determines a distribution over words, namely \Pr(w_j \mid z_k). The PLSA algorithm simply uses EM to train the mixture model.

LDA is not as far from PLSA as I thought; it's just a Bayesian version of PLSA. We know the natural Bayesian extension of a mixture model is adding a Dirichlet prior for the proportions of the latent variable, and the more sophisticated technique is non-parametric Bayes, using a Dirichlet process as the prior. The fully Bayesian method needs the posterior distribution, which is approximated with (global) variational inference.

The two algorithms do not take local information into consideration. This paper simply adds a regularizer to the PLSA loss (the negative log-likelihood). Here the authors assume that if two documents are similar, their mixing proportions must be similar. So if d_{i_1} and d_{i_2} are similar according to some rule (labels in the classification case, or the cosine of the angle between the two tf-idf vectors in the unsupervised case), the two distributions \Pr(z_k \mid d_{i_1}) and \Pr(z_k \mid d_{i_2}) should be similar. This similarity is measured with the symmetrized KL divergence:
\min -\sum_{i = 1}^N \sum_{j = 1}^M n(d_i, w_j) \log\sum_{k = 1}^K \Pr(w_j \mid z_k) \Pr(z_k \mid d_i) + \frac{\lambda}{2} \sum_{i_1, i_2 = 1}^N W_{i_1, i_2} \big( D(\Pr(z_k \mid d_{i_1})\parallel \Pr(z_k \mid d_{i_2})) + D(\Pr(z_k \mid d_{i_2})\parallel \Pr(z_k \mid d_{i_1})) \big)
The optimization can still be solved with EM with a little approximation. The difficult part is the maximization step.

The results are really good: in the unsupervised case, it is better than NMF, PLSA, LDA and even N-cut.

Information Theoretic Metric Learning


This paper talks about a metric learning algorithm. The basic tool is the so-called Bregman divergence. The setting is to find a Mahalanobis distance, i.e. a matrix A defining the inner product x^\top A x. The trick is that we want the matrix A to be as close as possible to a given A_0, which is measured via the KL divergence of two Gaussians with A_0 and A as their covariance matrices respectively. We further have two sets, one including pairs of samples that are near (their distances are less than a threshold), the other including dissimilar ones (their distances are larger than another threshold). It can be derived that the KL divergence is actually a form of Bregman divergence and equals \mathrm{LogDet}(A_0, A) = \mathrm{tr}(A A_0^{-1}) - \log \det (A A_0^{-1}) - n. The constraints are linear in the matrix A to be optimized. This divergence is the only scale-invariant one and the loss leads to a uniformly minimum-variance unbiased estimator. As in SVM, we may introduce slack variables to handle infeasible cases. To solve this optimization, a former algorithm (from their ICML 2006 paper) is extended.

There are some connections to the previous work, since it can be proved that the low-rank kernel matrix is simply induced by the Mahalanobis kernel found by the proposed metric learning algorithm. The paper also addresses the online metric learning problem and proves that the proposed algorithm converges.

Thursday, July 2, 2009

Robust Feature Extraction via Information Theoretic Learning


This paper is based on the so-called Renyi's entropy, which is defined as
H_2(p) = -\log \int p^2(x) \,\mathrm{d} x
Given samples of a random variable X, the empirical Renyi entropy is estimated with a kernel density estimator,
\hat{H}_2(X) = - \log \hat{V}(X) + \text{const}, \qquad \hat{V}(X) = \sum_i \sum_j g(x_i - x_j, \sigma)
where g(x-z, \sigma) = \exp( -\| x - z \|^2 / 2\sigma^2). \hat{V}(X) is called the information potential. For two r.v.s, we have a similar result,
\hat{H}_2(X_1, X_2) = -\log \hat{V}(X_1, X_2) + \text{const}, \qquad \hat{V}(X_1, X_2) = \sum_i \sum_j g(x_i^{(1)} - x_j^{(2)}, \sigma)
where the second quantity measures the correlation of the two r.v.s.
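For concreteness, a small numpy sketch of these estimators as written above (normalization constants are dropped, as in the text):

```python
import numpy as np

def information_potential(X, sigma):
    """V_hat(X) = sum_i sum_j exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return float(np.exp(-sq / (2.0 * sigma ** 2)).sum())

def cross_information_potential(X1, X2, sigma):
    """V_hat(X1, X2) = sum_i sum_j exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return float(np.exp(-sq / (2.0 * sigma ** 2)).sum())

# the empirical Renyi entropy is then -log(information_potential(X, sigma)) + const
```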

The proposed objective is to find a projection Y = WX such that
\max_{W} (1 - \lambda) \hat{V}(WX) + \lambda \hat{V}(WX, C) - \gamma \| W \|_\text{fro}^2
They find that when \lambda = 1 we have a so-called robust M-estimator. The optimization algorithm they use is very similar to majorization-minimization (with an auxiliary function). The interesting relationships they list in the paper include one with LPP: the iterative algorithm solves an LPP/LapRLS/SRDA problem in each iteration.

I am afraid this looks quite like the unsupervised HSIC for SDR problem. I will check it soon.

A Majorization-Minimization Algorithm for (Multiple) Hyperparameters Learning


First we must understand what the majorization-minimization (MM) algorithm is. It is in fact the auxiliary function method, which might be regarded as a generalization of the EM algorithm. To minimize (maximize) a function L(x), we find an upper (lower) bound Q(x; x') for the objective which is easier to minimize (maximize), where x' is the parameter of the bounding function. This can be seen as a generalization of the idea of CCCP, since in CCCP we use linear functions to bound the convex (concave) part and the parameter is simply the variable introduced by the Legendre-Fenchel transform. We also require that L(x) - Q(x; x') reaches its maximum (minimum) at x = x'. As we can see, the following inequality holds
L(x) \leq Q(x; x') \qquad \text{or} \qquad L(x) \geq Q(x; x')
The idea is that instead of optimizing L(x) directly, which is intractable, we find a suitable algorithm to optimize Q(x; x^{(n-1)}) and obtain x^{(n)}. Therefore for the minimization case we have the following chain of inequalities
L(x^{(n)}) = Q(x^{(n)}; x^{(n-1)}) + \big( L(x^{(n)}) - Q(x^{(n)}; x^{(n-1)}) \big) \leq Q(x^{(n-1)}; x^{(n-1)}) + \big( L(x^{(n-1)}) - Q(x^{(n-1)}; x^{(n-1)}) \big) = L(x^{(n-1)})
where the inequality uses two optimality properties: Q(x^{(n)}; x^{(n-1)}) is the minimum of Q(x; x^{(n-1)}), and L(x) - Q(x; x^{(n-1)}) reaches its maximum at x = x^{(n-1)}.

Now let's come back to the Bayesian learning problem. We know that to select a proper hyperparameter, either we use cross validation or we rely on maximizing the Type II likelihood (EM solves it :-p). Here the authors simply introduce another prior for the hyperparameter and integrate it out. E.g. for a Gaussian prior on the parameter, the hyperparameter is usually the precision, whose prior is then set to a Gamma distribution. But they kind of stealthily switch the concept. In the Bayesian framework, the joint distribution is \Pr(\mathcal{D} \mid w) \Pr(w \mid \alpha) \Pr(\alpha). They first integrate out the hyperparameter \alpha, which leaves us \Pr(\mathcal{D} \mid w) \Pr(w). Then we find the MAP estimate for simplicity, which results in
\min_w -\log \Pr(\mathcal{D} \mid w) - \log \Pr(w).
Now we may find that the second term is of the form -C\cdot\log (\| w \|^2 + \beta), which is different from the usual norm regularizer.

We solve this problem with MM optimization technique. We construct an upper bound for the second term
\log x \leq \log y + \frac{x - y}{y}
which is linear in x, and therefore the optimization problem turns back into the original ``loss + norm regularizer'' form, with an iteration-dependent weight.
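To see the effect of the bound, here is a minimal sketch assuming a Gaussian likelihood (squared loss): each MM iteration majorizes C \log(\|w\|^2 + \beta) by the linear bound above, so it reduces to a ridge problem with an adaptive weight \lambda_t = C / (\|w_t\|^2 + \beta). The names and the closed-form update are my own illustration, not the paper's algorithm.

```python
import numpy as np

def mm_adaptive_ridge(X, y, C=1.0, beta=1e-3, n_iter=30):
    """MM for min_w 0.5 ||Xw - y||^2 + C log(||w||^2 + beta):
    each step solves a ridge problem with lambda_t = C / (||w_t||^2 + beta)."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_iter):
        lam = C / (w @ w + beta)
        # minimizer of 0.5 ||Xw - y||^2 + lam ||w||^2
        w = np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(d), X.T @ y)
    return w
```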

I don't see anything special from this point of view.

Information Theoretic Measures for Clustering Comparisons: Is a Correction for Chance Necessary?


One problem in clustering is how to compare two clustering results. E.g. given two clustering results, which one is closer to the true clustering?

There are many analytical measures, including the Rand Index (RI) and the Adjusted Rand Index (ARI). They are based on the contingency table. Although RI could be useful, since it is 1 iff the two clusterings are identical and 0 when no pair is in the same cluster, it cannot distinguish a random partition from a ``good'' clustering.

To correct this (hence the name correction for chance), ARI was proposed. Another class of measures comes from information theory. They use the mutual information of the proportions of the two clusterings (each clustering's proportions form a distribution, so the mutual information is the information shared by the two clusterings). As we know, we can create a distance based on entropy, D(p, q) = H(p, q) - I(p, q) = H(p) + H(q) - 2 I(p, q), or its normalized version \mathrm{NMI}(p, q) = \frac{I(p, q)}{\sqrt{H(p) H(q)}}.

The authors propose a corrected-for-chance version of the information-theoretic measures and, from their experiments, it works pretty well. Their criteria are
\mathrm{AMI}(U, V) = \frac{I(U, V) - \mathbb{E}( I(M) \mid a, b)}{\sqrt{H(U) H(V)} - \mathbb{E}(I(M) \mid a, b)}
or
\mathrm{AVI} = \frac{2 I(U, V) - 2 \mathbb{E}(I(M) \mid a, b)}{H(U) + H(V) - 2 \mathbb{E}(I(M) \mid a, b)}
where \mathbb{E}(I(M) \mid a, b) is computed from the contingency table.

The authors provide examples where the correction is necessary, especially when there are few samples in each cluster.
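A quick sanity check of this point with scikit-learn (the numbers are illustrative, not from the paper): two random labelings with many clusters and few samples still get a sizeable NMI, while the chance-corrected AMI stays near zero.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
n_samples, n_clusters = 50, 10
u = rng.integers(0, n_clusters, size=n_samples)    # random labeling 1
v = rng.integers(0, n_clusters, size=n_samples)    # random labeling 2
print("NMI:", normalized_mutual_info_score(u, v))  # noticeably above 0
print("AMI:", adjusted_mutual_info_score(u, v))    # close to 0
```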

Characteristic Kernels on Groups and Semigroups


First let's see some key ideas in kernel-based methods for r.v.s:
To see whether two r.v.s have the same distribution, we compare the means of arbitrary function values for the two r.v.s. This leads to distances between the two means in an RKHS \mathcal{H}. Hence we introduce a map of the r.v. X into a linear functional on the RKHS, \mathbb{E}_X K(\cdot, X). Then for an arbitrary function f \in \mathcal{H}, the mean value is \mathbb{E} \langle K(\cdot, X), f \rangle = \mathbb{E} f(X). Given two r.v.s X, Y, we can look at the discrepancy over functions in the unit ball of \mathcal{H}: \mathrm{MMD}(X, Y; \mathcal{H}) = \sup_{f \in \mathcal{H}, \|f\| \leq 1} \mathbb{E} f(X) - \mathbb{E} f(Y). In practice we calculate the empirical squared version.

The critical issue for this method is whether the mapping from the r.v. to the linear functional is unique. Given that the dual space of the RKHS is simply itself, we can also interpret this as the uniqueness of the mapping from the r.v. to the feature space (RKHS). Notice the RKHS is induced by the kernel and their relationship is one-to-one. Hence what matters is what kind of kernel guarantees the uniqueness of the mapping.

This is explored in this paper via the so-called characteristic kernels, which have the property that the induced mapping is injective. With this kind of kernel, criteria such as MMD give a well-defined distance between two r.v.s (or measures, as in the paper). Here is one of the main results:
Lemma 1. Let (\Omega, \mathcal{B}) be a measurable space, k be a measurable, positive definite kernel on \Omega and \mathcal{H} be the corresponding RKHS. Then k is characteristic iff \mathcal{H} \oplus \mathbb{R} is dense in L^2(\mu) for every probability measure \mu on (\Omega, \mathcal{B}).

A positive definite function \phi is one such that the induced kernel k(x, y) = \phi(y^{-1} x) is positive definite. Here the inverse is defined on a group. E.g. for the linear space \mathbb{R}^n, the group operation is simply vector addition, and therefore we get kernels of the form k(x, y) = \phi(x - y), i.e. the so-called RBF kernels. Gaussian kernels and Laplacian kernels are both in this class; they are called shift-invariant kernels. Bochner's theorem characterizes all the shift-invariant kernels defined on \mathbb{R}^n: all such functions are characteristic functions (Fourier transforms) of Borel measures.

With some analysis they come to the result that if the shift-invariant kernel induced by a positive definite function \phi is characteristic, the corresponding Borel measure has \mathbb{R}^n as its support, or we may say the inverse Fourier transform of \phi has \mathbb{R}^n as its support.

This paper takes the result further, onto more general algebraic structures, i.e. groups and semigroups. Their claim is that as long as we can find an analog of the Fourier transform, we might have something similar to Bochner's theorem. So the algebraic structure considered first is a locally compact Abelian group.

The paper also considers non-Abelian groups and semigroups, with similar results. It's very theoretical... hmm... I don't have enough mathematical preparation though.

Sunday, May 17, 2009

Supervised Feature Selection via Dependence Estimation


Feature selection has been mentioned in a KDR paper. This paper uses HSIC and a backward greedy algorithm. That is, each time they pick a dimension to throw away such that the remaining features have the highest HSIC. I thought they might throw away the one with the lowest HSIC, but they prefer using HSIC as a lower bound of dependence.

The paper also mentions the connections between HSIC and other criteria such as MMD in a previously scanned paper and KTA (which looks like an uncentered version of HSIC).

K-means Clustering via Principal Component Analysis


The most important result of this paper is that the principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, and the subspace spanned by the cluster centroids is given by the spectral expansion of the data covariance matrix truncated at K-1 terms.

I'd like to say their discovery is very similar to that of normalized cut and the spectral expansion of Laplacian matrix. Let
J_K = \sum_{k = 1}^K \sum_{i \in C_k} \|x_i - m_k \|^2
be the objective of K-means clustering,
d(C_k, C_l) = \sum_{i \in C_k} \sum_{j \in C_l} \| x_i - x_j \|^2
be the distance between two clusters. It can be shown that
J_K = N \bar{y}^2 - \frac{1}{2} J_D,
where in 2-way clustering case
J_D = \frac{N_1 N_2}{N} \left( 2 \frac{d(C_1, C_2)}{N_1 N_2} - \frac{d(C_1, C_1)}{N_1^2} - \frac{d(C_2, C_2)}{N_2^2}\right).
Further we have
\frac{d(C_1, C_2)}{N_1 N_2} = \frac{d(C_1, C_1)}{N_1^2} + \frac{d(C_2, C_2)}{N_2^2} + (m_1 - m_2)^2.
The important theorem they prove says that in 2-way clustering, the continuous solution is given by the leading principal component: samples with positive components belong to one cluster and those with negative components to the other. The optimal K-means objective in this case can be bounded via the leading spectrum,
N \bar{y}^2 - \lambda_1 \leq J_{2} \leq N \bar{y}^2.
If we create a fully connected graph with all weights \frac{1}{N}, is the linearization of the N-cut algorithm then PCA, and is N-cut simply K-means? The difference is that PCA is done in the feature space while spectral clustering employs the embedding space that Laplacian eigenmaps learns.
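A small numerical illustration of the 2-way statement (my own toy check, not from the paper): the sign of the leading principal component of the centered data should largely agree with the K-means partition.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 5)),
               rng.normal(2.0, 1.0, size=(100, 5))])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_labels = (Xc @ Vt[0] > 0).astype(int)                  # sign of the leading PC
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agreement = max((pc_labels == km_labels).mean(), (pc_labels != km_labels).mean())
print("agreement between sign(PC1) and K-means:", agreement)
```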

For K-way clustering, we have similar results. But unlike the 2-way case, we cannot interpret the PCs in a similar way. Maybe the matting paper has a better interpretation. Their second result is about the centroids.

This result has several implications. Spectral clustering is done in the embedding space learned by Laplacian eigenmaps; PCA/K-means is done in the feature space; KPCA/kernel K-means is done in the kernel-induced space. So if we consider the unsupervised KDR idea, can we find a similar interpretation in terms of clustering?

Tuesday, May 12, 2009

Visualizing Data using t-SNE

by Laurens van der Maaten and Geoffrey Hinton

This paper proposes a variant of SNE. SNE was previously proposed by Hinton and Roweis. Its idea is to find a projection such that the neighboring probabilities are preserved: given the coordinates x_i, the neighboring probability is
p_{i \mid j} = \frac{g(x_i, x_j)}{\sum_k g_j(x_j, x_k)}.
In the projection, if we define a similar probability q_{i \mid j}, we wish these two distributions to be as similar as possible, so we minimize
\sum_j \sum_i p_{i \mid j} \log \frac{p_{i \mid j}}{q_{i \mid j}}.
The original setting takes g_j(x_j, x_i) to be a Gaussian kernel (with a different variance per sample). By fixing a so-called perplexity, we binary-search a proper variance for each sample; the perplexity is a continuous analog of the number of effective neighbors. We then compute the gradient with respect to y_j:
\frac{\partial \mathrm{KL}}{\partial y_j} = 2 \sum_i ( p_{i \mid j} - q_{i \mid j} + p_{j \mid i} - q_{j \mid i})(y_j - y_i).
The gradient descent is supplemented with a momentum term.

The so-called t-SNE makes two major modifications to the SNE model. First, it uses a symmetric p_{i, j} instead, which is simply (p_{i \mid j} + p_{j \mid i}) / (2N) (so that the joint probabilities sum to one); this simplifies the derivative of the objective function,
\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i, j} - q_{i, j})(y_j - y_i).
Second, it uses a different distribution in the projection space instead of the Gaussian kernel. The "t" comes from here: they propose the t-distribution with 1 degree of freedom (i.e. a Cauchy distribution). The derivative then becomes
\frac{\partial \mathrm{KL}}{\partial y_j} = 4 \sum_i (p_{i, j} - q_{i,j})(y_j - y_i)(1 + \| y_j - y_i\|^2)^{-1}.
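
As a sanity check on the formulas, here is a minimal numpy sketch of one t-SNE gradient evaluation (the Student-t q_{i,j} plus the gradient above). P is assumed to be the symmetrized input affinity matrix with zero diagonal, and the function name is my own.

import numpy as np

def tsne_gradient(P, Y):
    # pairwise squared distances in the embedding
    sq = np.sum(Y ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T
    # Student-t (Cauchy) kernel with zero diagonal, normalized to q_{i,j}
    num = 1.0 / (1.0 + D)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()
    # grad_i = 4 * sum_j (p_{i,j} - q_{i,j}) (y_i - y_j) (1 + ||y_i - y_j||^2)^{-1}
    PQ = (P - Q) * num
    return 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y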


The great thing about t-SNE is its capability to yield a good embedding. But somehow it seems a little difficult to increase the embedding dimension. Not sure about that though; I'm doing some experiments with it.

Monday, May 11, 2009

A Dependence Maximization View of Clustering


HSIC is an upper bound of COCO, which in theory has been proved to be equivalent to independence; therefore HSIC can be used as an independence test. However, as an upper bound, it might not be suitable if we want to maximize the dependence of r.v.s, IMHO. So the motivation for using HSIC for this purpose (clustering) is still vague to me.

For clustering, we maximize
\tr( H K_x H K_y )
to make the features and the labels dependent. So we have to find a kernel for the discrete variable y: the delta kernel, or the kernel used in the paper,
\Pi A \Pi^\top,
where A is the Gram matrix of the clusters and \Pi is the assignment matrix. The paper also considers other kernels, relating this algorithm to kernel PCA (maybe kernel k-means as well?).

The optimization is as difficult as in other clustering algorithms. The algorithm initializes an assignment matrix randomly and repeatedly updates it until convergence: we loop over the samples, finding for each one the assignment that maximizes the HSIC. Notice that the matrix A is updated whenever the assignment matrix is updated. As in k-means, each sample gets the same weight and A is simply a diagonal matrix whose entries are the inverses of the cluster sizes. Another choice generalizes the weights, which leads to a family of clustering algorithms obtained by choosing A.
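
A small sketch of the objective with the \Pi A \Pi^\top label kernel, for the k-means-like weighting (A diagonal with inverse cluster sizes). It only evaluates the HSIC objective for a given assignment; the greedy per-sample reassignment loop from the paper is not shown, and the names are mine.

import numpy as np

def clustering_hsic(Kx, assignment):
    # build the assignment matrix Pi and the k-means-like A (inverse cluster sizes)
    n = Kx.shape[0]
    k = int(assignment.max()) + 1
    Pi = np.zeros((n, k))
    Pi[np.arange(n), assignment] = 1.0
    sizes = Pi.sum(axis=0)
    A = np.diag(1.0 / np.maximum(sizes, 1.0))
    Ky = Pi @ A @ Pi.T
    # objective tr(H Kx H Ky)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ Kx @ H @ Ky)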

Maybe it's worth thinking about the design idea more carefully.

A Kernel Approach to Comparing Distributions


This paper takes advantage of kernel methods to test whether two r.v.s have the same distribution. The idea is very close to kernel-based independence tests. We know that \mathbb{E} f(x) = \mathbb{E} f(y) for every f if the two distributions are identical. Given a large enough function class, we may use
\mathrm{MMD}(\mathcal{F}, x, y ) = \sup_{f \in \mathcal{F}} \big( \mathbb{E} f(x) - \mathbb{E} f(y) \big).
Then we take a universal kernel and its corresponding RKHS as the function class. With some derivation, the empirical evaluation of MMD is based on
\mathrm{MMD}^2(\mathcal{F}, x, y) = \mathbb{E} \langle \phi(x), \phi(x')\rangle + \mathbb{E} \langle \phi(y), \phi(y')\rangle - 2 \mathbb{E} \langle \phi(x), \phi(y)\rangle \approx \frac{1}{m^2} \sum_{i, j = 1}^m k(x_i, x_j) + \frac{1}{n^2} \sum_{i, j = 1}^n k(y_i, y_j) - 2 \frac{1}{mn} \sum_{i = 1}^m \sum_{j = 1}^n k(x_i, y_j).
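
A direct transcription of the biased empirical estimate above into numpy (a sketch; the gaussian_kernel helper and its bandwidth are my own choices, not necessarily the paper's):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd_squared(X, Y, kernel=gaussian_kernel):
    # (1/m^2) sum k(x_i, x_j) + (1/n^2) sum k(y_i, y_j) - (2/mn) sum k(x_i, y_j)
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()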


I am not sure whether a later work in NIPS 2008 is based on this, which will be scanned soon.

Measuring Statistical Dependence with Hilbert-Schmidt Norms


This paper uses the so-called Hilbert-Schmidt norm instead of the L^2 (spectral) norm (which corresponds to the COCO independence test) in the previously scanned paper, resulting in the so-called HSIC (Hilbert-Schmidt independence criterion), which is in practice much easier to calculate. Compared with COCO, which uses the square root of the largest singular value of \tilde{K}^{(x)} \tilde{K}^{(y)}, HSIC uses the whole spectrum, i.e. the trace of \tilde{K}^{(x)} \tilde{K}^{(y)}. Since the Hilbert-Schmidt (Frobenius) norm is an upper bound for the spectral norm, a similar independence equivalence can be obtained.

Please notice the relationship among KCC, COCO and HSIC. For KCC, we use the maximum kernel correlation as the criterion for the independence test, but then we have to use the regularized version to avoid the common case where it does not give the desired result. In COCO, we constrain both covariances to the identity, and the problem becomes seeking the largest singular value of the product of the two centered Gram matrices, whereas KCC solves a generalized eigenvalue problem. Therefore when we deal with multiple r.v.s, we can proceed in the same way as CCA does. So in a way you can see why it is called COCO (constrained covariance). These three ideas are closely related, just as KMI and KGV are closely related in the KMI paper.

Sunday, May 10, 2009

Kernel Constrained Covariance for Dependence Measurement


For measuring independence, they propose another kernel-based criterion, COCO (constrained covariance; in a previous paper it is called kernel covariance, KC, although that objective actually comes from the mutual information of Gaussian r.v.s), besides the previously scanned paper of Jordan's (KCC and KGV). This has been mentioned in their earlier work.

This work has quite different concerns, since previous papers did not say how to choose the kernels. They prove that COCO is zero only at independence for universal kernels. A universal kernel is a kernel k(\cdot, \cdot) such that, on a compact metric space (\mathcal{X}, d), the induced RKHS is dense in the space of continuous functions C(\mathcal{X}) w.r.t. the topology induced by the infinity norm \| f - g \|_\infty. They proved that Gaussian kernels and Laplacian kernels are universal.

They also point out the limitation of independence tests based on universal kernels: the proposed COCO has adversarial cases in which COCO is small but the r.v.s are still dependent. However, they show that their empirical estimate of COCO converges to the true COCO value at an exponential rate.

The calculation of COCO amounts to
\mathrm{COCO}(z, F, G) = \frac{1}{n}\sqrt{\| \tilde{K}^f \tilde{K}^g \|_2}.
Their later paper actually uses the Frobenius norm. Their experiments show an example of applying COCO to ICA on fMRI data, compared with KMI and a correlation method (which uses only correlation as a test of independence).
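
Transcribing the criterion above (a sketch; the centering and the 1/n scaling follow the formula as written, and the function name is mine):

import numpy as np

def coco(Kf, Kg):
    n = Kf.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    # centered Gram matrices
    Kfc, Kgc = H @ Kf @ H, H @ Kg @ H
    # spectral norm = largest singular value of the product
    sigma_max = np.linalg.norm(Kfc @ Kgc, ord=2)
    return np.sqrt(sigma_max) / n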

The Kernel Mutual Information


We know that mutual information can be used to test the independence of r.v.s. The difficulty with mutual information is that the distribution is usually unknown, so we either have to make a density estimate or an entropy estimate from samples.

This paper introduces the so-called KMI (kernel mutual information), which in practice has performance comparable to KGV (from Bach and Jordan's paper). It comes from the mutual information between two multivariate Gaussian vectors,
I(x; y) = -\frac{1}{2} \log \left( \prod_{i = 1}^{\min(p_x, p_y)} (1 - \rho_i^2)\right).
We use the correlation in the RKHS for this \rho_i,
\rho_i = \frac{c_i^\top( P_{x, y} - p_x p_y) d_i}{\sqrt{c_i^\top D_x c_i d_i^\top D_y d_i}},
where P_{x, y}, p_x, p_y, D_x, D_y are approximated with samples and an assumed grid (which is cancelled out in the end). By relaxing the denominator (to a bigger value), we find an upper bound for the mutual information,
M(z) = -\frac{1}{2} \log\left( \big| I - (\nu_x \nu_y)\tilde{K}^{(x)} \tilde{K}^{(y)} \big|\right),
where \tilde{K}^{(x)} and \tilde{K}^{(y)} are centered Gram matrices. This is the criterion for KMI. Since it's an upper bound for the mutual information, we can use it as a contrast function (I don't know why the authors think we can derive this from Theorem 1; I'm confused about their idea).
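
A literal transcription of the KMI criterion M(z) above (a sketch; nu_x and nu_y are treated as given constants from the derivation, a log-determinant is used for numerical stability, and the names are mine):

import numpy as np

def kmi(Kx, Ky, nu_x, nu_y):
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    # M(z) = -1/2 log | I - nu_x nu_y K~x K~y |
    _, logdet = np.linalg.slogdet(np.eye(n) - nu_x * nu_y * Kxc @ Kyc)
    return -0.5 * logdet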

In their formulation, the (regularized) KGV can be regarded as another way of relaxing the covariance. Therefore this independence measure can likewise be used for the ICA task. In a way, this paper is a generalization of KGV, but I am still confused about their idea.

==
Later I realized that KMI is an upper bound of KC, so when KMI = 0, KC will be 0 too.

Friday, April 24, 2009

Isotonic Conditional Random Fields and Local Sentiment Flow


This paper introduces a variant of CRF, adding constraints from prior knowledge. Since we know that some words express a positive/negative sentiment, the corresponding feature f_{(\sigma, w)}, which means
f_{(\sigma, w)}(y) = \begin{cases} 1 & y = \sigma, x = w \\ 0 & \text{otherwise} \end{cases},
should have a larger/smaller parameter \mu_{(\sigma, w)}. That is to say, if w is a word indicating positive sentiment, then \mu_{(\sigma, w)} \geq \mu_{(\sigma', w)} whenever \sigma \geq \sigma'.

This style of constraints still leads to a convex optimization problem. The authors prove that, given a sequence x and labellings s, t, letting x' = (x_1, \ldots, x_j \cup \{ w\}, \ldots, x_n), if \mu_{(t_j, v)} \geq \mu_{(s_j, v)}, then
\frac{\Pr(s \mid x)}{\Pr(s \mid x')} \geq \frac{\Pr(t \mid x)}{\Pr(t \mid x')}.
This gives a new interpretation of the constraints. The model is reparameterized with the Möbius inversion theorem (what is that?) and then solved.

Monday, April 20, 2009

Tensor Decompositions and Applications


This is the first systematic paper I have read on tensors. A tensor in computer science is a multi-dimensional array (I am still wondering whether it has anything to do with the term tensor in differential geometry).

So before we talk about tensors, here are several notations we'd like to use in the text:
  • The order of a tensor is the number of dimensions: vectors are order-1 tensors, matrices are order-2.
  • An order one tensor is denoted with \mathbf{a}, an order-2 tensor with \mathbf{A} and one of higher-order with \mathcal{A}.
  • A fibre is the analog for tensors of a row/column of a matrix. A tensor \mathcal{X} of order n has fibres of the form \mathbf{x}_{i_1, \ldots, i_{j-1}, :,i_{j+1}, \ldots, i_n}, i.e. the vector obtained by fixing all indices i_k except one i_j.
  • A slice of a tensor is obtained by fixing all indices except two, which can be regarded as a plane.
  • A tensor \mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_n} is said to have rank 1 if \mathcal{X} = \mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(n)}, where the \circ operator means x_{i_1, \ldots, i_n} = a_{i_1}^{(1)}\cdots a_{i_n}^{(n)}.
  • Matricization (a.k.a. unfolding, flattening) is the process of reordering the tensor data into a matrix. The mode-k unfolding of the order-n tensor \mathcal{X} is denoted \mathbf{X}_{(k)} \in \mathbb{R}^{I_k\times (\prod_{s\neq k} I_s)}.
  • The k-mode product of a tensor \mathcal{X} and a matrix U \in \mathbb{R}^{J \times I_k} is
    (\mathcal{X} \times_k \mathbf{U})_{i_1, \ldots, j, \ldots, i_n} = \sum_{i_k = 1}^{I_k} x_{i_1, \ldots, i_n} u_{j, i_k}.
  • The Kronecker product of two matrices \mathbf{A}\in \mathbb{R}^{I \times J}, \mathbf{B} \in \mathbb{R}^{K \times L} is
    \mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{1, 1} \mathbf{B} & \cdots & a_{1, J}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I, 1} \mathbf{B} & \cdots & a_{I, J} \mathbf{B} \end{pmatrix}.
  • The Khatri-Rao product of the two matrices \mathbf{A} \in \mathbb{R}^{I \times K}, \mathbf{B} \in \mathbb{R}^{J \times K} is
    \mathbf{A} \odot \mathbf{B} = \begin{pmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{pmatrix}.
  • The Hadamard product is the elementwise product of two matrices \mathbf{A}, \mathbf{B} \in \mathbb{R}^{I \times J}, i.e.
    \mathbf{A} * \mathbf{B} = \begin{pmatrix} a_{1,1} b_{1, 1} & \cdots & a_{1, J} b_{1, J} \\ \vdots & \ddots & \vdots \\ a_{I,1} b_{I, 1} & \cdots & a_{I, J} b_{I, J}\end{pmatrix}.
  • The so-called CP decomposition (canonical decomposition, parallel factors) of a tensor is a sum of rank 1 components
    \mathcal{X} \approx \sum_{r = 1}^R \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(n)}.
    This is usually denoted with
    \mathcal{X} \approx [\![ \mathbf{\lambda}; \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(n)} ]\!] \stackrel{\triangle}{=} \sum_{r = 1}^R \lambda_r \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(n)}.
    where each \mathbf{A}^{(i)} has normalized columns.

However, low-rank approximation for tensors is not as well defined as for matrices. For one thing, there may not exist a rank-2 tensor that is nearest to a given rank-3 tensor. Although there are various bounds on the rank, it is still not clear how to estimate the rank of a tensor. Therefore it is difficult to find a good low-rank approximation.

The most popular algorithm for the CP decomposition is ALS (alternating least squares). It is very simple: fix all \mathbf{A}^{(i)} except \mathbf{A}^{(k)}. With the mode-k unfolding, we approximate the tensor by tuning \mathbf{A}^{(k)}, which leads to the following least squares problem,
\min_{\hat{\mathbf{A}}} \| \mathbf{X}_{(k)} - \hat{\mathbf{A}} (\mathbf{A}^{(1)} \odot \cdots \odot \mathbf{A}^{(k-1)} \odot \mathbf{A}^{(k+1)} \odot \cdots \odot \mathbf{A}^{(n)})^\top\|,
whose least squares solution is
\hat{\mathbf{A}} = \mathbf{X}_{(k)} [ (\mathbf{A}^{(1)} \odot \cdots \odot \mathbf{A}^{(k-1)} \odot \mathbf{A}^{(k+1)} \odot \cdots \odot \mathbf{A}^{(n)})^\top ]^\dagger.
Please notice that (\mathbf{A} \odot \mathbf{B})^\top (\mathbf{A} \odot \mathbf{B}) = (\mathbf{A}^\top \mathbf{A}) * (\mathbf{B}^\top \mathbf{B}), therefore
\hat{\mathbf{A}} = \mathbf{X}_{(k)} (\mathbf{A}^{(1)} \odot \cdots \odot \mathbf{A}^{(k-1)} \odot \mathbf{A}^{(k+1)} \odot \cdots \odot \mathbf{A}^{(n)}) [(\mathbf{A}^{(1)\top} \mathbf{A}^{(1)}) * \cdots * (\mathbf{A}^{(k-1)\top} \mathbf{A}^{(k-1)}) * (\mathbf{A}^{(k+1)\top} \mathbf{A}^{(k+1)}) * \cdots * (\mathbf{A}^{(n)\top} \mathbf{A}^{(n)})]^\dagger.
The values of \lambda are updated with the norms of the columns of \hat{\mathbf{A}}, and \mathbf{A}^{(k)} is the column-normalized \hat{\mathbf{A}}.
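
Here is a minimal CP-ALS sketch in numpy for a dense tensor, following the update above. The unfolding/Khatri-Rao ordering is chosen to be mutually consistent with numpy's C-ordering rather than any particular paper convention, and all helper names are mine.

import numpy as np

def khatri_rao(mats):
    # column-wise Kronecker product of matrices with the same number of columns
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, M.shape[1])
    return out

def unfold(X, k):
    # mode-k unfolding: rows indexed by i_k, columns by the remaining indices (C order)
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def cp_als(X, rank, n_iter=50):
    dims = X.shape
    A = [np.random.randn(d, rank) for d in dims]
    lam = np.ones(rank)
    for _ in range(n_iter):
        for k in range(len(dims)):
            others = [A[i] for i in range(len(dims)) if i != k]
            KR = khatri_rao(others)                    # Khatri-Rao of all other factors
            V = np.ones((rank, rank))
            for M in others:
                V *= M.T @ M                           # Hadamard product of the Gramians
            Ahat = unfold(X, k) @ KR @ np.linalg.pinv(V)
            lam = np.linalg.norm(Ahat, axis=0)         # column norms become lambda
            A[k] = Ahat / np.maximum(lam, 1e-12)       # normalized factor matrix
    return lam, A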

Another important problem for tensors is the Tucker decomposition (computed e.g. via HOSVD, higher-order SVD): we seek
\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \cdots \times_{n} \mathbf{A}^{(n)}.
Usually we need a truncated version of HOSVD, which relates to the rank-k approximation. This problem can also be solved with ALS.

There are a lot of decompositions related to tensor:
  • INDSCAL (individual differences in scaling), for order-3 tensors whose first two dimensions are symmetric.
  • PARAFAC2 (parallel factorization) is, strictly speaking, not a tensor decomposition. It seeks an approximation of several matrices \mathbf{X}_k, each approximated by \mathbf{U}_k \mathbf{S}_k \mathbf{V}^\top, where \mathbf{U}_k and \mathbf{V} are orthogonal matrices and the \mathbf{S}_k are diagonal matrices.
  • CANDELINC (canonical decomposition with linear constraints): when we know the subspace of each dimension, the problem can be turned into a standard CP decomposition.
  • DEDICOM (decomposition into directional components), seeks approximation of an asymmetric matrix \mathbf{X} = \mathbf{A} \mathbf{R} \mathbf{A}^\top.
  • PARATUCK2 seeks an approximation of \mathcal{X}, slice by slice,
    \mathbf{X}_k \approx \mathbf{A} \mathbf{D}_k^A \mathbf{R} \mathbf{D}_k^B \mathbf{B}^\top,
    where \mathbf{A}, \mathbf{B} are matrices and \mathbf{D}_k^A, \mathbf{D}_k^B are diagonal matrices.
  • NTF (nonnegative tensor factorization): on each dimension, an NMF-style nonnegative factor is used instead of PCA.

Friday, April 10, 2009

An Introduction to Variational Methods for Graphical Models


This paper describes in detail the variational methods for graphical models. There are basically two important applications of variational methods (here referring to variational bounds): one for Bayesian models where the posterior probability is hard to evaluate, as a replacement for slow MCMC methods; the other for graphical models where too many dependencies (edges in the graph) make inference difficult. The authors describe two styles of variational bounds. Whichever is employed, the ultimate objective is the same: to simplify the graph by eliminating some edges, since the bound usually provides an easier computation.

The first bound exploits the Fenchel-Legendre transform of a convex/concave function. Generally speaking, a convex (concave) function has a natural lower (upper) bound given by its tangent hyperplane at each point of its graph. The so-called Fenchel-Legendre transform (or variational transform, as referred to in the paper) of a convex function can be written as
\hat{f}(z) = \min_x \big( f(x) - z^\top x \big),
which is the intercept of the tangent to f with slope z. Therefore
f(x) \geq \langle z, x \rangle + \hat{f}(z),
which is a lower bound for the function, and z is the variational parameter (the concave case is symmetric and yields an upper bound; a small worked example is given right after the two examples below). There are two examples in the paper:
  • The QMR-DT model is a disease-symptom bipartite Bayesian belief net. Due to the ``explaining away'' problem (a ``v''-shaped structure: when the child is not given, the parental nodes are independent, but once the child is given they become dependent, and the sum over these variables is difficult since they are coupled), we have to bound the probability \Pr(f_i=1 \mid d ) = 1 - \exp\left( -\sum_{j \in \mathrm{pa}(i)} \theta_{i,j} d_j - \theta_{i, 0}\right) from above.
  • The Boltzmann machine can also be handled with this method. They derive a lower bound for the logistic function, which can delink all neighbours of a single r.v.; the upper bound, though easy to compute and able to decouple the r.v. from its neighbors, connects all of its neighbors to each other.
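
To make the variational transform concrete, here is a standard worked example of the concave case (my own, not taken from the paper). The function f(x) = \log x is concave, so its tangent lines are upper bounds. The tangent at x_0 is \log x_0 + (x - x_0)/x_0; writing z = 1/x_0 this becomes zx - \log z - 1, so
\log x = \min_{z > 0} \left( zx - \log z - 1 \right) \leq zx - \log z - 1
for every fixed variational parameter z > 0, with equality at z = 1/x. Replacing a troublesome \log by such a linear upper bound (and later optimizing over z) is exactly the kind of simplification used in the QMR-DT example.
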
The second bound exploits Jensen's inequality,
\log \Pr(x) = \log \left( \sum_z \Pr(x, z) \right) = \log \left( \sum_z q(z) \cdot \frac{\Pr(x, z)}{q(z)} \right) \geq \sum_z q(z) \log \frac{\Pr(x, z)}{q(z)},
which resembles the standard derivation of EM. Since x is observed, we usually introduce a variational parameter \lambda for q(z \mid \lambda) and maximize the lower bound by selecting a proper \lambda. The paper illustrates the idea with several model examples:
  • Boltzmann machine: now we put all the observed r.v.s into x and the rest are latent. The proposed q(z) is a decomposed (factorized) binomial distribution, and its parameter, i.e. the moment, is the variational parameter.
  • Neural network, here referring to the sigmoid belief network: just like the Boltzmann machine (the difference is that one is an MRF while the other is a belief network).
  • Factorial hidden Markov model: since the posterior distribution of the hidden variables is coupled, the proposed q(z) is a decoupled prior with adjustable parameters (the variational parameters).
  • Hidden Markov decision tree.
This idea has relationships with other known models:
  • Helmholtz machine: it uses two sets of parameters, for the generative model and the recognition model, instead of computing the posterior probability with Bayes' theorem.
  • Sampling methods: they compare the Gibbs sampler and variational methods. The Gibbs sampler is said to pass messages of samples (conditioned on all other variables) while variational bounds pass the statistics themselves (or the parameters). Not sure; I think the global variational method is closer to this point.
  • Bayesian methods: they are actually referring to the global variational method. Not clear why it is not mentioned here. Maybe it's not their work -,-b

Thursday, April 9, 2009

High-dimensional Data Analysis: The Curses and Blessings of Dimensionality


In this non-technical paper, Donoho surveys the development of data analysis, the difficulties of dealing with high-dimensional data (the curse of dimensionality) and the possible advantages we may take of it (the blessings of dimensionality). His advisor John Tukey coined the term data analysis, which differs from traditional mathematics or statistics in the role of rigorous derivation: more heuristic ideas are applied. He argues that people nowadays choose quite a different method of exploration in science: instead of selecting only a few variables to build a model that can be analytically formulated and studied in the traditional way, it is now much more common to collect a huge amount of data, or to enrich the information about each sample in all kinds of ways.

Curse of Dimensionality:
  • For a naive search algorithm in a high dimensional discrete space, the number of candidates is exponential of the dimensionality.
  • For optimization, we need (1/\epsilon)^d evaluations of a Lipschitz continuous objective function to obtain an approximation with error less than \epsilon.
  • In function approximation of a Lipschitz continuous function, we also need O\big((1/\epsilon)^d \big) evaluations for a uniform approximation with error \epsilon.
  • In numerical integration of a Lipschitz continuous function, again we need O\big( (1/\epsilon)^d\big) evaluations on a grid to obtain an approximate integral with error \epsilon.
  • Nonparametric estimation becomes difficult: the much slower convergence rate is where the curse of dimensionality rears its ugly head.
  • Regularization is usually employed to tackle the curse of dimensionality.
Blessing of Dimensionality:
  • The concentration of measure phenomenon holds for many distributions in high-dimensional spaces. Let (\Omega, \mathcal{F}, \mu) be a metric measure space, where d(\cdot, \cdot) is the distance and \mu is the probability measure on the \sigma-field \mathcal{F}. Let \alpha(\epsilon) = \sup_A \{ \mu( \Omega \backslash A_\epsilon) \mid \mu(A) = 1/2\}, where A_\epsilon = \{ x \in \Omega \mid d(x, A) < \epsilon\} is the \epsilon-extension of the measurable set A. A so-called Levy family (\Omega_n, \mathcal{F}_n, \mu_n), which satisfies \forall \epsilon > 0, \lim_{n \to \infty} \alpha_n(\epsilon) = 0, is one in which the probability measure concentrates in the central part of the support as n increases. If the convergence to 0 is of order C \exp(-c n \epsilon^2), it is called a normal Levy family. In many cases the index n can be the dimensionality, which means that as the dimensionality goes up we may observe concentration of measure; for a normal Levy family, the tail behaves like a Gaussian. Concentration of measure means that in a high-dimensional space the samples you draw are likely to carry much of the information: you don't need to go far from the data to cover most of the support, and the probability decays quickly as you move away from any set of fixed nonzero measure.
  • Dimension asymptotics. This is the refinement of the previous argument.
  • Approach to continuum: there is an underlying compactness to the space of observed data, which is reflected in approximate finite-dimensionality and an increasing simplicity of analysis for large d (sounds like the rationale behind compressive sensing).
  • Approximation through randomness. Random projections, CS are in this range.
  • Approximation through computational harmonic analysis, which is a way to the continuum.
I am not familiar with harmonic analysis. Many points in this paper are still vague to me. I must reread it when I gain more knowledge.

Tuesday, April 7, 2009

Hierarchical Dirichlet Processes


This paper is relevant to hierarchical modeling of MoE (hence related to HMoE). The DP extension of mixture models has been studied in this paper (with variational approximation methods) too. This paper only uses Monte Carlo methods for inference (instead of approximation algorithms). They simply introduce one DP prior for the parameters of the model that generates the data and another DP prior for the generation of groups. In one word,
G_0 \mid \gamma, H \sim \mathrm{DP}( \gamma, H ), G_j \mid \alpha_0, G_0 \sim \mathrm{DP}( \alpha_0, G_0), \theta_{ji} \sim G_j, x_{ji} \sim F(\theta_{ji}).
Their idea of the DP is implemented with the SBP (as in this paper). However, we can only observe the x_{ji}. For the learning problem we must find proper settings for \gamma and \alpha_0; for inference, we must get the posterior distribution of the two DPs. Please notice that for a DP MoE, the base distribution of the DP is on the parameter space; in the HDP, the base distribution of each intermediate (group-level) DP is a random measure, namely the top DP. Therefore the base distribution of the top DP is on the parameter space: sampling from the top DP results in \sum_{k = 1}^K \frac{m_k}{m + \gamma}\delta_{\phi_k} + \frac{\gamma}{m + \gamma} H, where m_k is the number of samples taking value \phi_k, the \phi_k are drawn from H, and m=\sum_{k=1}^K m_k; the intermediate DP puts these samples \theta_i into groups according to \sum_{t = 1}^T \frac{n_t}{i-1 + \alpha_0} \delta_{\psi_t} + \frac{\alpha_0}{i-1 + \alpha_0} G_0, where n_t is the number of samples in group t and T is the number of groups.

Now we come to the details:
  • The SBP generates sets of v_i, \pi_i and \theta_i in the following way,
    v_i \sim \mathrm{Beta}(1, \alpha), \quad \pi_i = v_i \prod_{k = 1}^{i - 1}(1 - v_k), \quad \theta_i \sim G, \quad \Pr(\theta = \theta_i) = \pi_i.
    This is equivalent to a DP with parameters \alpha and G. Then we may generate the data accordingly (a small sampling sketch is given after this list).
  • The HDP can be formulated as a variant of the CRP, the Chinese restaurant franchise: a customer chooses a dish (one already being served or a new one) and sits at a table (one serving that dish or a new one). We can formulate the generation as follows: a customer \theta_i comes into the restaurant, sits at a table according to the group-level DP, and asks for a dish (if it is a new table) according to the top-level DP.
  • For inference, we use the SBP formulation and calculate the posterior of the hidden r.v.s. As we may see in the SBP formulation, we need an indicator (multinomial distribution) for each DP. The sampling is done with a Gibbs sampler.
  • It can be used for enhancing LDA (latent Dirichlet allocation) and HMM.
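
To wrap up, here is a minimal generative sketch of a truncated stick-breaking HDP (generation only, not the Gibbs inference). It uses a common finite truncation where each group's weights are Dirichlet(\alpha_0 \beta) over the shared atoms; the truncation level, the base_sampler callable and all names are my own assumptions.

import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    # v_i ~ Beta(1, alpha), pi_i = v_i * prod_{k < i} (1 - v_k)
    v = rng.beta(1.0, alpha, size=n_atoms)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

def sample_hdp_groups(gamma, alpha0, base_sampler, n_groups, n_per_group,
                      n_atoms=200, seed=0):
    rng = np.random.default_rng(seed)
    # top-level DP(gamma, H): shared atoms phi_k ~ H with stick-breaking weights beta
    beta = stick_breaking_weights(gamma, n_atoms, rng)
    atoms = np.array([base_sampler(rng) for _ in range(n_atoms)])
    groups = []
    for _ in range(n_groups):
        # group-level DP(alpha0, G0) over the same atoms
        pi = rng.dirichlet(alpha0 * beta + 1e-12)
        ks = rng.choice(n_atoms, size=n_per_group, p=pi)
        groups.append(atoms[ks])
    return atoms, groups

For example, base_sampler could be lambda rng: rng.normal(0.0, 1.0) if H is a standard Gaussian over component means.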