Wednesday, July 29, 2009

Efficient Euclidean Projection in Linear Time

I am really surprised why this paper got published. The previously scanned paper in ICML 2008 noticed the linear algorithm already. Not sure...

On Sampling-based Approximate Spectral Decomposition

This paper proposed another approximation of kernel methods based on former Nystrom method and column sampling. The prime disadvantage of the earlier method is that they must compute the whole Gram matrix while the proposed adaptive method does not need to. They prove several results (not that interesting though): column sampling is best for rank 1 approximation (no one will use rank 1 approximation I guess...); for a given form, neither is optimal approximation (that is why they are not good enough?); under some cases (looks not useful in practice), Nystrom's recovery is exact.

The improvement of approximation is not salient though, but I guess the efficiency might be improved.

Tuesday, July 28, 2009

Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

This can be viewed as a deep version of convolutioal neural network (I am not quite familiar with this). In each layer (in a sense), there are two sublayers, one for detection, convolving the input with several filters, the other for max-pooling, shrinking (resize the image) the convolved sublayer. Therefore on a whole the layer's input is an image and the output is several convolved and downsampled images. They layer this kind of nets and get the deep version.

Let's see what we have to get. For each convolving sublayer, we have to train a convolution kernel and the corresponding biases (as in RBM). The change is the max-pooling layer. Each neuron in the max pooling layer only connects to a fixed size (a small patch, e.g. 2x2) of neurons in the convolutional sublayer. Since the neurons in the convolutional sublayer could be 0-1 and the max-pooling means only if none of the input neurons fires the output is 0, from the outside it is like a big neuron which can take multiple values (e.g. 2x2+1 = 5). The we may do as RBM, writing down the energy function, converting it to probability, formulating the likelihood and using CD learning with a sparsity penalty.

The good idea behind the structure is that we first get some useful filters, then parts of the object and later the whole objects. The features learned with the model gives good results on several data sets.

Thursday, July 23, 2009

Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

This paper tells us about a method to evaluating two ranking reseults based on clicks of users. We know the ranking affects exposures of the links, i.e. the higher rank one item has, the more attension human beings pay to. Therefore it is more reasonable to use a discounted relavance called discounted cumulative gain (DCG)
\mathrm{DCG}_l = \mathrm{rel}_1 + \sum_{i = 2}^l \frac{\mathrm{rel}_i}{\log_2 i}.
That is, the rank 1 item has no discount while rank i item has a discount ratio of 1/\log_2 i.

So if we want to compare two ranking results, we have to calculate the relevance score given by given the query. This is not always known. People might have been asked to score the documents with discrete values (on a scale of 5, e.g.) but not all documents will be marked. We model this scale with a multinomial distribution and simulated the scoring procedure and compare the two DCG values. If for 95% cases, one is higher than the other, we might assert it is better.

The author proposed quite a simple model to model \Pr(X_i \mid q, c, ). It is an ordinal logistic regression model with linear and quadratic features. So when we trained the model, we can simulate the DCG and compare two rankings.

Herding Dynamical Weights to Learn

This paper talks a novel way of learning MRF such as Hopfield networks (is that also a Boltzmann machine). For these models, the primal-dual structure of maximum entrpy learning and maximum likelihood estimator tells the possible learning algorithms, e.g. gradient-based, since though the optimization is constrained in the former form, it doesn't in the dual form.

The gradient-based algorithms require solving the following inference problem
w_{\alpha}^{(t+1)} = w_\alpha^{(t)} + \eta( \bar{g}_\alpha - \mathbb{E}[g_\alpha]_P)
where the expectation is computed in the given model. The common ways of computing the expectation include a Gibbs sampler (MCMC) or mean field approximation (usually not good). For certain models, such as RBM, we might think about Hinton's contrastive divergence, but still we need non-deterministic algorithms.

The herding algorithm the author proposed here is a deterministic algorithm. It takes the limit of the annealing version of the negated log-likelihood function and it results in a tipi function. He then formulated the herding algorithm as first maximizing (due to the limit) and resulting in some pseudo samples and then calculating the gradient based on these pseudo samples.

I am not familiar with the dynamic system and the intrinsic idea behind this stuff though.

Wednesday, July 22, 2009

Statistical Geometrical Features for Texture Classification

This paper simply provides us with 16 features for texture classification. Well... try them in some applications then.

Monday, July 20, 2009

An Efficient Projection for L_{1, infty} Regularization

This optimization problem is a little different from the L1 penalized version, in that the variables to optimized is a matrix, and the corresponding penalty is the so-called L_{1, \infty} norm,
\| X \|_{1, \infty} = \sum_{i = 1}^m \max_j X_{i, j}.
The optimization is based on a previous ICML 07 paper. But now we are dealing with matrices instead of vectors. For example, the toy optimization problem is modified to its matrix version
\min_{B, \mu} \frac{1}{2} \sum_{i, j} (A_{i, j} - B_{i, j})^2,
such that \forall i, j: B_{i, j} \leq \mu_i, \sum_i \mu = C and nonnegative constraints for B given a nonnegative matrix A.

But soemhow there seems to be nothing else new.

Support Vector Machine Learning for Image Retrieval

This is actually a vision paper. I am not sure whether the active learning is really just a version. In a query of similar images, the user labels some relevant and irrelevant images. A SVM is then used to rank the images correspondingly. This will help the query though but will not help the later search. A possible improvement is adding some manifold regularization. Quite contrary to my expectation. Hmm...

Sunday, July 19, 2009

Regression by Dependence Minimization and its Application to Causal Inference in Additive Noise Models

This paper follows the NIPS 08 paper, with a different regressor. We know HSIC could be used for dependence maximization as well as independence maximization. Here we simply use its dependence. Basically there are two ways of doing regression, one to minimize the dependence of the residual and the input, the other to maximize the dependence of the response and the prediction (is that possible?)

With HSIC it is easier to minimize the dependence of the response and the residue. The pro is that we do not need to specify the additive noise's distribution. In many regression problems, we actually assume the noise is a Gaussian. Now we may forget about the possible violation of this assumption. But the con is that now the optimization of the model will be much more difficult.

The good thing about HSIC is that we have a statistical test, which allows us to judge whether it is statistically reliable to assert the independence. I am not sure whether we have a similar statistic for KDR-like terms.

Nonlinear Causal Discovery with Additive Noise Models

This paper talks about making causal inference from data. We know if two r.v.s have an additive linear relationship (with noise), the reverse relationship also holds. If there is a non-linear relationship, our causal inference will be easier since the inverse might not hold. This is the main result of this paper, finding if there exists a reverse relationship, the probability must satisfy a differential equation. From this equation we know, if \nv''' = \xi''' = 0, where the two function are the logarithm of the PDF of x and the additive noise n, f must be linear.

This also suggests a way of making causal inference on DAGs. This leaves us to find a powerful regressor to capture the possible relationship between two r.v.s. The authors chooses GPR. Then the residue should be statistically independent of x, which can be tested with HSIC.

However, if we are to find the latent structure instead of doing a statistical test, this search will be intolerable if the DAG size is beyond 7.

Friday, July 17, 2009

Boosting Products of Base Classifiers

Boosting is quite a useful technique for practical tasks, since we may get robust and high-precision classifier or regressor by combining the weak ones. Among them AdaBoost.MH is the most popular version (a little different from the one for binary version). This paper trains products of base classifiers instead of tree or MoE-like ones (considering Hinton's claim of the advantages of PoE over MoE). The design of the algorithm is not difficult though. The results on MNIST shows it is the second best algorithm (the best is DBN).

Well, implement one :-p

Thursday, July 16, 2009

Efficient Projections onto the l1-Ball for Learning in High Dimensions

The projection on L1 ball is another way of computing L1 norm penalty for certain regularization problem. That is to say
\min_w \Omega(w) + \lambda \| w \|_1 
, we may optimize
\min_{w} \Omega(w) \quad \text{s.t. } \| w \|_1 \leq z
where z is a constant. The former one can be seen as a series of optimization problem indexed by the regularization parameter \lambda and the latter is indexed by the constant z. We can show that two forms are equivalent.

Theirfore, according to the latter, the optimization becomes finding a minimizer of the loss function on the L1 ball. If the minimizer happens to be inside the ball, it is the solution; otherwise, it must be on the boundary of the ball. So the difficult part becomes how can we project the vectors onto the L1 ball efficiently. This explains why the solution should be sparse as well.

The first simple case is
\min_w \frac{1}{2} \| w - v \|_2^2 \quad \text{s.t. } \sum_{i = 1}^n w_i = z, w_i \geq 0,
where v inside the positive cone. With a little analysis, we know this is actually very simple. When \| v\|_1 \leq z, the solution is obvious. Otherwise, we should decrease the coordinates. Using Lagrange multiplier, it is seen that each coordinate are shrunk with the same value. Therefore the norm \| w \|_1 will decrease, n \theta, (n-1)\theta, \ldots, the slope changes whenever a coordinates vanishes. Therefore we stop when the piecewise linear function meets \| w \| = z. We just need to find the intersection. If we implement it with a brute search, it ends up with a O(n \log n) algorithm, but with a little care (using the quick sort idea), the magic turns it into a O(n) algorithm.

Then we come to other cases when v can be in other orthants, due to the symmetry. Finally, we will use a tree structure to boost up the speed. In our implementation of a gradient-based algorithm, such as
w^{(t + 1)} = \Pi_W \Big( w^{(t)} + g^{(t)} \Big),
we calculate the projection onto the L1 ball, i.e. \Pi_W(\cdot). The idea is to use a red-black tree (gonna review these structure). The the projection can be done in the time which scales linearly in the non-zero entries of g^{(t)} and logarithmically in the total number of features n.

Let's compare this one with the ICML 09 paper and several other papers later.

Discriminative k-Metrics

This paper tells us about an old clustering technique called k q-flats. It can be regarded as a generalization of k-means algorithm but we find a q-dimensional space to project onto instead of a centroid for each cluster,
\sum_{j = 1}^K \sum_{x \in K_j} \| x - P_{F_j} x \|^2
This can be solved with k-means-like (EM-like) algorithms. Basically if we want to apply it to classification problems, we may train a k q-flats for each class but this model lacks discriminative power.

But this can be solved in a natural way
\sum_{i = 1}^c\sum_{j = 1}^K \left( \sum_{x \in K_{i, j}}g_1(\| F_{i, j}^\top x \|^2) + \sum_{x \not \in C_i} g_2(\| F_{i, j}^\top x\|^2)\right)
A little difference is the optimization has to be done with gradient. The selected g_i are Hinge-like loss funtions and the derivatives are simple piecewise constant functions.

Wednesday, July 15, 2009

On Primal and Dual Sparsity of Markov Networks

This paper mainly taks about the relationship of several M3Ns. The primal and dual sparsity are caused by L1 norm penalty and the constraints (according to KKT conditions). Therefore adding L1 norm penalty to M3N will cause both sparsities, which increases generalization capacity and selects important features.

The LapM3N proposed by the authors earlier is a Bayesian version of M3N. The MAP estimator of LapM3N would go as M3N with different penalties, e.g. L2 norm corresponding to a Gaussian prior and L1 norm corresponding to a Laplace prior with parameter going to infinity.

Another relationship of L1 norm is found to sparse Bayesian learning, since adding a prior for each parameter of M3N (think about RVM) would result a sparse solution. The adaptive M3N will yield the same result as the L1-normed M3N.

The authors proposed an EM-like algorithm for training L1-normed M3N. Obviously, it would have connection to variational Bayesian approximation.

Maybe we should write our own structured learning tools.

Monday, July 13, 2009

Large-scale Deep Unsupervised Learning using Graphics Processors

This is a paper on implementation of DBN and sparse coding algorithms with GPU. The finer-grain parallelism provided by the modern GPU outperforms CPU architecture. The bottleneck is the IO, from main memory to the memory inside the video adapter. The finer-grain parallelism allows us to deal with data parallelism and the data are divided for each block and the job assigned to each block is further divided into threads' labor.

They said they would provide their code online. I have not found it.

Saturday, July 11, 2009

Gradient Descent with Sparsification: An Iterative Algorithm for Sparse Recovery with Restricted Isometry Property

This paper tells one story about the following optimization problem arising from compressive sampling
\min_x \quad \| x \|_0 \qquad \text{s.t.} \quad \Phi x = y,
which is usually approximated with the following version
\min_x \quad \| x \|_1 \qquad \text{s.t.} \quad \Phi x = y.
We have quite a few algorithms for solving the optimization under different conditions. This paper proposes the following gradient-based algorithm
x \leftarrow H_s \left( x + \frac{1}{\gamma} \cdot \Phi^\top (y - \Phi x)\right)
where H_s take the largest (absolute value) s elements of x and other as 0.

The theoretical result is their small computation cost (K in each iteration), fast convergence rate (2L / \log(\frac{1 - \delta_{2s}}{2\delta_{2s}})) and lenient recovery condition (\delta_{2s} < \frac{1}{3}). I think it might be a good algorithm if we late want to make some lasso experiments.

Friday, July 10, 2009

Sparse Higher Order Conditional Random Fields for Improved Sequence Labeling

This paper adds higher-order features to the CRF model with exact inference. The feasibility is caused by the sparsity. That's to say although we include higher order features in the feature vector, it seldom fires. So in the inference stage (since we have to calculate the gradient), the computational complexity will not go exponentially and the sparsity makes the exact inference possible. This paper extends the terms for linear-chain CRF into configurations which can be used for higher-order features. That's the main contribution.

I will write something related to this problem soon. First I have to finish some optimization code.

Curriculum Learning

The so-called curriculum learning states the idea of learning things from simple ones to difficult ones. The gradually hardening tasks help build a classifier with better generalization capacity.

There are evidence from cognitive sciences as well as from machine learning itself. In optimization theory, the famous continuation mathod actually has the same spirit. Another example is the deep belief nets, in which the greedy pretraining can be seen as a simpler task than the succedent fine tuning. The examples provided by the authors comes from simple toy experiments in low-dimensional space (train two Bayesian classifier with or without difficult examples), a shape learning task with neural nets (with or without a switching epoch, at which the training set is switched from simple to difficult samples), an NLP example.

Their claim includes several messages:
  • difficult examples may not be useful;
  • better curriculum might speed up online learning and guid the result to the region where better generalization can be found;
  • the idea might be connected with active learning and boosting.

Thursday, July 9, 2009

Graph Construction and b-Matching for Semi-supervised Learning

This is an ``advertisement'' paper for their 2007 paper on b-matching. In general b-matching is a graph construction algorithm as k-NN. This paper enumerate all kinds of combinations of label diffusion (propagation?):
  • graph construction, kNN and b-matching
  • weight, Gaussian kernel, LLR (like LLE).
  • diffusion algorithms, GRF, LGC and GTAM.

I don't buy their conclusion. Yes graph construction might be important to later algorithms, but the b-matching result doesn't seem so different from k-NN. Maybe we have to try for ourselves.

Minimum Volume Embedding

This is another piece of work by the authors of the previously scanned paper. This work is mainly based on the MVU paper, where the graph is embedded with isometry constraints (linear for the Gram matrix) and maximized variance (the trace of the Gram matrix). Therefore the Gram matrix can be obtained via SDP optimization techniques.

But the variance to be maximized is harmful since it might cause the variance in all directions to increase, which is not necessary (as is illustrated in the example of the paper). The author takes the difference of the eigenvalues of the Gram matrix to be optimized
\max \sum_{i= 1}^d \lambda_i - \sum_{i = d+1}^N
where \lambda_i are the eigenvalues of K and K must satisfy the same contraints as MVU.

To solve the problem, the authors proposed an iterative algorithm based on SDP. In each iteration, the eigen vectors are renewed by PCA of the Gram matrix and the Gram matrix is updated with SDP. It can be proved the algorithm will converge to a local minima. This will force those eigen values irrelevant to the embedding to zero and therefore cause a minimum volume embedding.

Structure Preserving Embedding

This paper proposes the concept structure preservaing embedding: given an algorithm to construct a graph \mathcal{G}, it would yield the same graph as the given affinity matrix A_0 with the computed Gram matrix K.

There are two kind of algorithms to construct the graph:
  • kNN and \epsilon-ball; we can find linear constraints for K which ensure the nearby samples are nearer than other samples (the constrains are at most O(N^2)).
  • Maximum weight subgraph, b-matching subgraph, maximum weight spanning tree. The number of constrains might be exponential in N.

The objective is \mathrm{tr}(K A), subject to \mathrm{tr}(K) \leq 1 and K \succeq 0. The authors prove that under these conditions, the Gram matrix has rank 1. The embedding is calculated with constraints with a common slack variable \xi, which allows possible violation of the constraints. Then the resulted embedding might has more dimensions.

For the first category of constraints, SDP can be directly applied while for the second category, we first use SDP without constraints and then add most violated constraints one-by-one, each time the model is updated with SDP until convergence is observed.

This technique is best regarded as a visualization technique (for graphs) though it could also be used in dimensionality reduction tasks for classification. The experiments show with SP contraints added to MUV and MVE, the classification rate improves.

Wednesday, July 8, 2009

Geometric-aware Metric Learning

This is an extension to the ICML 07 paper, with introduction of graph regularization. The idea is to find a pair of kernel matrices, one from task-dependent kernel set and the other from a parametrized data-dependent kernels. The two sets are quite different. The former is used in later classification or related tasks and they choose the Mahalanobis kernel. The later contains locality information and is created with the graph kernels (a subspace spanned by the eigen vectors corresponding to the smallest eigenvalues of the Laplacian matrix). To measure the similarity of the two distance, the Bregman divergence is employed. the difference between the current optimization and the one in the ICML 07 paper is that the other matrix is not fixed. The rough idea is to update them alternatively (just as in LSQ in tensor).

We may change the data-dependent kernel for different tasks (e.g. unsupervised tasks using graph kernels, supervised tasks using labels). There are connections with regularization theory: the solution has a representation of a combination of graph kernel and another term which can be interpreted as a regularizer. Another possible connection is to GP: GP could be regarded as a special case of the proposed framework. I think this topic is by then very interesting. I will come back later to this topic when I have time.

Probabilistic Dyadic Data Analysis with Local and Global Consistency

This paper introduces a graph regularizer for PLSA. There are some interesting stories behind the NLP algorithms such as PLSA, LDA (latent Dirichlet allocation), I think, but I am not quite familiar with them. From the modeling aspect, they deal with the same thing. Given a term-document matrix, X, the element of which is the count of occurrences of a given word w_j in the document d_i. The generative model is each document is a mixture of multinomial distribution. Namely, we have several latent topics z_k such that a document is a mixture of the topics z_k with the mixing proportion \Pr(z_k \mid d_i). Each topic decide a kind of ditribution for the words, namely \Pr(w_j \mid z_k). The PLSA algorithm simply uses EM to train the mixture model.

The LDA is not as far as I think. It's just a Bayesian version of PLSA. We know the natural extesion of mixture models in frequentists' domain is adding a Dirichlet prior for the proportion of the latent variable. And the more sophisticated technique is non-parametric Bayesian, adding a Dirichlet process as the prior. The fully Bayesian method need the posterior distribution which is approximated with (global) variation inference.

The two algorithms do not take into consideration the local information. This paper simply adds a regularizer to the PLSA loss (-log likelihood). Here the authors assume if the documents are similar, their mixing proportions must be similar. So if d_{i_1} and d_{i_2} are similar, according to some rules (labels in classification or cosine of the angle between two tf-idf vectors in unsupervised case), the two distribution \Pr(z_k \mid d_{i_1}) and \Pr( z_k \mid d_{i_2}) are similar. This similarity is measured with the symmetric KL divergence:
\min -\sum_{i = 1}^N \sum_{j = 1}^M n(d_i, w_j) \log\sum_{k = 1}^K \Pr(w_j \mid z_k) \Pr(z_k \mid d_i) + \frac{\lambda}{2} \sum_{i_1, i_2 = 1}^N W_{i_1, i_2} \big( D(\Pr(z_k \mid d_{i_1})\parallel \Pr(z_k \mid d_{i_2})) + D(\Pr(z_k \mid d_{i_2})\parallel \Pr(z_k \mid d_{i_1})) \big)
The optimization can still be solved with EM with a little approximation. The difficult part is the maximization step.

The result is really good: in unsupervised case, it is better than NMF and PLSA, LDA and even N-cut.

Information Theoretic Metric Learning

This paper talks about a metric learning algorithm. The basic tool is the so-called Bregman divergence. the setting is to find a Mahalanobis distance, therefore a matrix A for inner product (x^\top A x). The trick is we need our matrix A is as close as to a given A_0, which is achieved via the KL divergence of two Gaussians with A_0 and A as their covariance matrix respectively. We further have two sets, one including pairs of samples that are near (their distances are less than a threshold), the other including dis-similar ones (their distances are larger than another threshold). It can be derived the KL divergence is actually a form of Bregman divergence and equals \mathrm{LogDet}(A_0, A) = \mathrm{tr}(A A_0^{-1}) - \log \det A A_0^{-1} - n. The constraints are linear to the matrix A to be optimized. This distance is the only scale-invariant and the loss leads to uniform minimum variance unbiased estimator. As SVM, we may introduct slack variabls to tackle the infeasible cases. To solve this optimization, a former algorithm (ICML 2006 paper) is extended.

There are some connection to the previous work to since it can be proved the low-rank kernel matrix is simply induced by the Mahalanobis kernel we find in the proposed metric learning algorithm. And the paper also addresses the online metric learning problem. They prove the proposed algorithm will converge.

Thursday, July 2, 2009

Robust Feature Extraction via Information Theoretic Learning

This paper is based on the so-called Renyi's entropy, which is defined as
H_2(p) = -log \int p^2(x) \,\mathrm{d} x
Given a random variable x's sample, the empirical Renyi's entropy is estimated with kernel density estimator
\hat{H}_2(X) = - \log \hat{V}(X) + \text{const}, \qquad \hat{V}(X) = \sum_i \sum_j g(x_i - x_j, \sigma)
where g(x-z, \sigma) = \exp( -\| x - z \|^2 / 2\sigma^2). \hat{V}(X) is called information potential. For two r.v.s, we have similar result,
\hat{H}_2(X_1, X_2) = -\log \hat{V}(X_1, X_2) + \text{const}, \qquad \hat{V}(X_1, X_2) = \sum_i \sum_j g(x_i^{(1)} - x_j^{(2)}, \sigma)
where the second term measures the correlation of two r.v.s.

The proposed objective is to find a projection Y = WX such that
\max_{W} (1 - \lambda) \hat{V}(WX) + \lambda \hat{V}(WX, C) - \gamma \| W \|_\text{fro}^2
They find that when \lambda = 1 we have a so-called robust M-estimator. The optimization algorithm they find is very similar to majorization minimization (auxilliary function). The interesting relationship they listed in the paper includes that with LPP. The iterative algorithm solves a LPP/LapRLS/SRDA in each iteration.

I am afraid this looks quite like the unsupervised HSIC for SDR problem. I will check it soon.

A Majorization-Minimization Algorithm for (Multiple) Hyperparameters Learning

First we must understand what the majorization-minimization (MM) algorithm is. It is in fact the auxilliary function method, which might be regarded as a generalization of EM algorithm. To minimize (maximize) a function L(x), we find an upper (lower) bound for the objective which is more easier to minimize (maximize), Q(x; x') where the x' is the parameter for the upper (lower) bounding function. This can be seen as a generalization of the idea of CCCP, since in CCCP, we use linear functions to bound the convex (concave) functions and the parameter is simply the variable introduced by Legendre-Fenchel transform. And we require L(x) - Q(x; x') reaches its minimum (maximum) at x = x'. As we can see here, the following inequality holds
L(x) \leq Q(x; x') \qquad \text{or} \qquad L(x) \geq Q(x; x')
The idea is instead of optimizing L(x) directly due to its intractability, we may find suitable algorithm to optimize Q(x; x') where x' = x^{(n)}. Therefore for the minimization case we have the following inequality
L(x^{(n)}) = Q(x^{(n)}; x^{(n-1)}) + L(x^{(n)} - Q(x^{(n)}; x^{(n-1)})) \leq Q(x^{(n-1)}; x^{(n-1)}) + L(x^{(n-1)}) - Q(x^{(n-1)}; x^{(n-1)}) = L(x^{(n-1)})
where the inequality uses two optima, Q(x^{(n)}; x^{(n-1)}) is minimum of Q(x; x^{(n-1)}) and L(x) - Q(x; x^{(n-1)}) reaches its maximum at x = x^{(n-1)}.

Now let's come back to the Bayes learning problem. We know to select a proper hyperparameter, either we use cross validation, or we rely on maximization of Type II likelihood (EM solves it :-p). Here the author simply introduce another prior for the hyperparameter and integrate it out. E.g. for a Gaussian prior for the parameter, usually the hyperparameter is the precision, whose prior is then set to a Gamma distribution. But they kind of stealthly switch the concept. In Bayes framework, the joint distribution is then \Pr(\mathcal{D} \mid w) \Pr(w \mid \alpha) \Pr(\alpha). They first integrate out the hyperparameter \alpha, which leaves us \Pr(\mathcal{D} \mid w) \Pr(w). The we find the MAP estimation for simplicity, which results in
\min_w -\log \Pr(\mathcal{D} \mid w) - \log \Pr(w).
Now we may find the second term is of the form of - C\cdot\log (\| w \|^2 + \beta ), which is a different from the norm regularizer.

We solve this problem with MM optimization technique. We construct an upper bound for the second term
\log x \leq \log y + \frac{x - y}{y}
which is a linear function of x and therefore the optimization problem turns back to the original ``loss + norm regularizer'' form.

Don't see any thing special from this point of view.

Information Theoretic Measures for Clustering Comparisons: Is a Correction for Chance Nessary?

One problem in clustering is how to measure two clustering result. E.g. given two clustering results, which one is more closed to the true clustering?

There are many analytical methods, including Rand Index (RI) and Adjusted Rand Index (ARI). They are based on contingency table. Although RI could be useful since it is 1 iff the two clustering are identical and 0 when no pair are in the same cluster, it can not extinguish a random partition with a ``good clustering''.

To correct this (why it is called correction for chance), ARI is proposed. Another class comes from information theory. They use the mutual information of proportion of two clustering (each clustering's proportion is a density, therefore the mutual information is the common information in the two clustering). As we know we can create a distance based on entropy D(p, q) = H(p, q) - I(p, q) = H(p) + H(q) - 2 I(p, q) or its normalized version \mathrm{NMI}(p, q) = \frac{I(p, q)}{\sqrt{H(p) H(q)}}.

The authors proposed a corrected-for-chance version based on information method and from their experiments, we may find it works pretty well. Their criterion is
\mathrm{AMI}(U, V) = \frac{I(U, V) - \mathbb{E}( I(M) \mid a, b)}{\sqrt{H(U) H(V)} - \mathbb{E}(I(M) \mid a, b)}
\mathrm{AVI} = \frac{2 I(U, V) - 2 \mathbb{E}(I(M) \mid a, b)}{H(U) + H(V) - 2 \mathbb{E}(I(M) \mid a, b)}
where \mathbb{E}(I(M) \mid a, b) is computed from the contingency table.

The author provided examples where the correction is necessary. Esp. when samples in each cluster is few.

Characteristic Kernels on Groups and Semigroups

First let's see some key ideas in kernel-based methods for r.v.s:
To see whether two r.v.s have the same distribution, we test the mean of arbitrary function values for two r.v.s. The leads to distances between two means in the RKHS \mathcal{H}. Hence we introduce an map of the r.v. X into the linear functional of a RKHS \mathbb{E}_X K(\cdot, X). Then for arbitrary function f \in \mathcal{H}, the mean value is \mathbb{E} \langle K(\cdot, X), f(\cdot) = \mathcal{E} f(X). Given two r.v.s X, Y, we can see the discrepancies on any given function f \in \mathcal{H}, \mathrm{MMD}(X, Y; \mathcal{H}) = \max_{f \in \mathcal{H}} \E f(X) - \E f(Y). In practice we calculate the empirical squared version.

The critical problem of this method is the mapping from the r.v. to the linear functional is unique. Given that the conjugate space of the RKHS is simply itself, we can also interpret this as the uniqueness of the mapping from the r.v. to the feature space (RKHS) is unique. Please notice the RKHS is induced by the kernel and their relationship is one-to-one. Hence the thing that matters is what kind of kernel might cause the uniqueness of the mapping.

This is explored in this paper, the so-called characteristic kernel, which has the property that the induced mapping is injective. With this kind of kernel, criterions such as MMD will be a well-defined distance between two r.v.s (or measures as in the paper). Here are some main results:
Lemma 1. Let (\Omega, \mathcal{B}) be a measurable space, k be a measurable, postive definite kernel on \Omega and \mathcal{H} be the corresponding RKHS. Then k is characteristic iff \mathcal{H} \oplus \mathbb{R} is dense in L^2(\mu) for every probability \mu on (\Omega, \mathcal{B}).

A positive definite function \phi is one such that the induced kernel k(x, y) = \phi(y^{-1} x) is positive definite. Here the inverse is defined on a group. E.g. for \mathbb{R}^n the linear space, the group is simply defined by the vector addition, therefore we get kernels in the form of k(x, y) = \phi(x - y) which is simply the so-called RBF. Gaussian kernels, Laplacian kernels are both in this domain. They are called shift-invariant kernels. Bochner theorem characterizes all the shift-invariant kernels defined on \mathbb{R}^n: all these functions are characteristic function (Fourier transform) of a given Borel measure.

With some analysis they come to the result if the shift-invariant kernel induced by positive definite function \phi is characteristic, the corresponding Borel measure has \mathbb{R}^n as its support or we may say the inverse Fourier transform of \phi has \mathbb{R}^n as its support.

This paper takes the result furthur onto more general algebraic structures, i.e. groups and semi-groups. Their claim is as long as we might find an analog of Fourier transform, we might have something similar to Bochner's theorem. So the algebraic structure is a locally-compact Abelian group.

The paper also consider non-Abelian groups and semi-groups. They get similar results. It's very theoretical... hmm... not enough mathematical preparation though.