Sunday, December 19, 2010

The Anatomy of a Large-scale Social Search Engine


This is a WWW 2010 paper on real-time social question answering. The idea is to build a "village" model of knowledge sharing instead of the traditional "library" model, e.g. the search-engine solution provided by Google.

Aardvark collects each user's social graph from several sources (e.g. Facebook friends, email contacts, IM contacts), and it is along this graph that questions are propagated. Users must also specify their expertise, either by selecting topics explicitly or by providing published content (e.g. Twitter, blogs) for the system to analyze. The system builds two indices, an ISAM index for the social graph and an inverted index for user expertise, which together drive the routing behavior.

A user's query is first analyzed (to decide whether it is a question at all and what topic it concerns), and the user is given a chance to confirm the topic (since fully automatic topic detection is still immature). A proper question is then handled by the routing component using the social graph and the expertise index.

Therefore, the core of the village model is the routing algorithm. The routing procedure is essentially a ranking (of users) problem:
s(u_i, u_j, q) = \Pr(u_j \mid u_i) \Pr(u_j \mid q)
where the first term measures the intimacy between users via the social network (taking into account social connection, demographic similarity, profile similarity, vocabulary match, chattiness, verbosity, politeness and speed) and the second term is learned with an aspect model (much like pLSA). In practice, the probability \Pr(t \mid u_i) is smoothed over the social network, on the grounds that if one's friends know about a topic, one is likely to either know it too or know whom to ask.
The ranking engine works as follows: first it retrieves users with matching expertise (if the question is location-sensitive, location is also considered at retrieval time); then it uses connectedness to find users with an appropriate relationship to the asker; finally it checks, using availability information, whether the candidate can actually handle the query at the moment. A rough sketch of such a scoring function is given below.
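Here is a minimal, hypothetical sketch of the two-factor routing score s(u_i, u_j, q) = \Pr(u_j \mid u_i) \Pr(u_j \mid q); the feature names, weights and the smoothing constant are my own placeholders, not the paper's.

  import numpy as np

  def connection_score(asker, candidate):
      # Pr(u_j | u_i): intimacy between asker and candidate.
      # The features and weights here are illustrative placeholders.
      feats = np.array([
          asker['friends'].get(candidate['id'], 0.0),          # social connection
          1.0 - abs(asker['age'] - candidate['age']) / 50.0,    # demographic similarity
          float(asker['vocab'] @ candidate['vocab']),           # vocabulary match
      ])
      weights = np.array([0.5, 0.2, 0.3])
      return float(weights @ feats)

  def expertise_score(candidate, query_topics, social_neighbors, gamma=0.3):
      # Pr(u_j | q): topic match, smoothed over the candidate's neighbors
      # (if my friends know a topic, I probably know it or know whom to ask).
      p_self = sum(candidate['topics'].get(t, 0.0) for t in query_topics)
      p_friends = np.mean([
          sum(n['topics'].get(t, 0.0) for t in query_topics)
          for n in social_neighbors
      ]) if social_neighbors else 0.0
      return (1 - gamma) * p_self + gamma * p_friends

  def routing_score(asker, candidate, query_topics, social_neighbors):
      # s(u_i, u_j, q) = Pr(u_j | u_i) * Pr(u_j | q)
      return connection_score(asker, candidate) * \
             expertise_score(candidate, query_topics, social_neighbors)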

The rest consists of many small pieces that have to be put together:
  • is the submitted text actually a question? (this needs a classifier)
  • is the question trivial? (a vertical search engine may retrieve the desired result without asking anyone)
  • is the question location-sensitive?

The whole platform is kind of difficult to construct but the idea is somewhat easy to grasp.

Wednesday, December 8, 2010

Search Logs as Information Footprints: Supporting Guided Navigation for Exploratory Search


This is a paper on building an exploratory search interface based on a clustering algorithm (star clustering) that organizes past search queries, according to their similarities, into a tree structure. The material is then shown to users in a tree-structured interface with the usual search utilities.
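For reference, here is a minimal sketch of the star clustering step on a query-query cosine similarity matrix; the threshold value and the greedy ordering are assumptions of mine, not details from the paper.

  import numpy as np

  def star_clustering(sim, threshold=0.5):
      # Greedy star clustering: repeatedly pick the unassigned vertex with the
      # highest degree in the thresholded similarity graph as a star center and
      # attach its unassigned neighbors as satellites.
      n = sim.shape[0]
      adj = (sim >= threshold) & ~np.eye(n, dtype=bool)   # thresholded graph
      unassigned = set(range(n))
      clusters = []
      while unassigned:
          # degree counted over still-unassigned vertices
          degrees = {v: sum(1 for u in unassigned if adj[v, u]) for v in unassigned}
          center = max(degrees, key=degrees.get)
          satellites = {u for u in unassigned if adj[center, u]}
          clusters.append((center, satellites))
          unassigned -= satellites | {center}
      return clusters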

It's quite similar to what we are doing, and maybe I will develop another parallel clustering algorithm based on affinity propagation. But let's first examine the results of star clustering.

Monday, December 6, 2010

The Stochastic Gradient Boosted Distributed Decision Trees


This paper proposes two solutions for implementing the exact stochastic GBDT, which was developed by the famous statistician Friedman. I'd like to scan his two earlier papers later as a first study of decision trees (the others would be CART and C4.5, I guess).

The map/reduce implementation is based on the previously scanned paper and uses horizontal splits, while the MPI implementation uses vertical splits. The former is quite direct; the latter requires communication via all-to-all broadcasting. A rough sketch of the horizontal case is given below.
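Here is a minimal sketch of how split finding might be parallelized over horizontal (per-sample) partitions: each worker builds per-feature histograms of residual statistics on its shard, a reducer sums them, and the best split is read off the merged histograms. The binning scheme and the squared-error criterion are my own simplifications, not the paper's exact procedure.

  import numpy as np

  def local_histograms(X_shard, residual_shard, bins):
      # Map step: per-feature histograms of (sum of residuals, count) on one shard.
      n_features = X_shard.shape[1]
      hists = np.zeros((n_features, len(bins) - 1, 2))   # [feature, bin, (sum, count)]
      for j in range(n_features):
          idx = np.clip(np.digitize(X_shard[:, j], bins) - 1, 0, len(bins) - 2)
          for b, r in zip(idx, residual_shard):
              hists[j, b, 0] += r
              hists[j, b, 1] += 1
      return hists

  def best_split(shard_hists):
      # Reduce step: merge shard histograms and pick the split minimizing squared error.
      H = sum(shard_hists)                       # element-wise sum over shards
      best = (None, None, np.inf)
      for j in range(H.shape[0]):
          sums, counts = H[j, :, 0], H[j, :, 1]
          for b in range(1, H.shape[1]):
              nl, nr = counts[:b].sum(), counts[b:].sum()
              if nl == 0 or nr == 0:
                  continue
              sl, sr = sums[:b].sum(), sums[b:].sum()
              score = -sl**2 / nl - sr**2 / nr   # SSE up to a constant
              if score < best[2]:
                  best = (j, b, score)
      return best                                # (feature, bin boundary index, score)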

Maybe after studying the GBDT, I would have a better understanding of this paper.

A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees


This paper addresses training on large-scale problems under two types of data split: horizontal fragmentation (subsets of samples) and vertical fragmentation (subsets of features). The orientation is relative to the row-wise sample matrix (samples as rows, features as columns).

The key idea of the paper is to extract sufficient statistics from the data splits so that, by aggregating the statistics at the end, we obtain exactly the model we would get from the pooled data. This is quite direct in the case of MoE (when each expert is an exponential family). This paper, however, emphasizes decision trees.

Decision trees are trained layer-wise, choosing the feature split that maximizes the information gain (or a similar criterion). The paper discusses two cases (a sketch of the horizontal case follows the list):
  • if the data are horizontally split, each site computes the required statistics (class counts per candidate split), which are then combined to find the best split; the chosen split is sent back to each site so that every site knows which node its samples fall into and can collect the next round of sufficient statistics;
  • if the data are vertically split, each site can evaluate its own candidate splits locally; the candidates are compared to find the best one, and the winning split is then communicated as a list of sample indices (so that every site knows which samples fall into which node).
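A minimal sketch of the horizontally partitioned case for a categorical feature: each site reports class-count tables (the sufficient statistics), the coordinator sums them and computes the information gain exactly as if the data were pooled. The data layout and helper names are my own, not the paper's.

  import numpy as np
  from collections import Counter

  def site_statistics(feature_values, labels):
      # Run at each site: class counts per feature value.
      counts = {}
      for v, y in zip(feature_values, labels):
          counts.setdefault(v, Counter())[y] += 1
      return counts

  def merge_statistics(all_site_counts):
      # Run at the coordinator: add up the per-site count tables.
      merged = {}
      for site in all_site_counts:
          for v, c in site.items():
              merged.setdefault(v, Counter()).update(c)
      return merged

  def entropy(counter):
      total = sum(counter.values())
      p = np.array(list(counter.values())) / total
      return -np.sum(p * np.log2(p))

  def information_gain(merged):
      # Information gain of splitting on this feature, from the merged statistics.
      overall = Counter()
      for c in merged.values():
          overall.update(c)
      n = sum(overall.values())
      cond = sum(sum(c.values()) / n * entropy(c) for c in merged.values())
      return entropy(overall) - cond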

This solution is exact and can be applied directly to implementing decision trees under a map/reduce framework. It should also carry over to many other hierarchical models.

Sunday, November 21, 2010

Deep Transfer via Second-Order Markov Logic


Transfer learning is a topic I am not familiar with. It deals with training on data that differs from the test data. There are two degrees of transfer: so-called shallow transfer, which handles different training and test distributions within the same domain, and so-called deep transfer, which handles training and testing in different domains.

Deep transfer is possible only because different domains share the same underlying regularities, and this is exactly where I think Markov logic can play an important role. The paper explains why second-order logic is needed (to express domain-independent knowledge) and why a relational representation is needed (for transfer). Their proposed algorithm is DTM (deep transfer via Markov logic).

For the experiments, the authors use three domains that seem not in the least related to each other (yeast protein data, WebKB and social network data from Facebook). I am still not sure whether these experiments really show how the transfer works. Maybe I should return to this paper after a more careful study.

Statistical Predicate Invention


This paper is about statistical predicate invention (SPI), i.e. how to discover new predicates. According to another talk by Domingos, this is the analogue of finding latent variables in traditional probabilistic learning, and one of the most important problems in machine learning for the coming decades. The setting of this paper is second-order Markov logic.

Their proposed approach is MRC (multiple relational clusterings). The idea is explained with a simple example in the paper: one's technical skills and hobbies should be mined from different groups (clusters) of people, e.g. coworkers tend to share technical skills while friends tend to share hobbies. Relational clustering finds these latent relationships between people and thereby ultimately discovers who are coworkers and who are friends (the latent predicates, or random variables). The resulting algorithm is not easy to understand; it looks like a clustering algorithm expressed in a logical language.

There is another piece of work (the infinite relational model) using the CRP for relational modelling, which I think should be very interesting. I will try to see the details later.

Saturday, November 13, 2010

The PageRank Citation Ranking: Bringing Order to the Web


This might be the famous paper that introduced PageRank to ranking research, and the famous search engine Google to the Internet. The key idea that differentiates PageRank from raw back-link counts is that a back link from an authoritative site should be more valuable; therefore back links must be weighted by the rank of the linking page itself. This also explains the recursion in the last scanned paper.

A more interesting question is how to compute it in a distributed way. I think this is essentially equivalent to solving a linear system, but I haven't really tried to derive it. One way to see this is sketched below.
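A quick way to see the linear-system view (my own aside, using the notation of the previously scanned intelligent-surfer entry): the PageRank vector is the fixed point of
P(j) = \frac{1-\beta}{N} + \beta \sum_{i \in B(j)} \frac{P(i)}{|F_i|}
\quad\Longleftrightarrow\quad
\bigl(I - \beta A^\top\bigr) P = \frac{1-\beta}{N}\,\mathbf{1}, \qquad A_{ij} = \begin{cases} 1/|F_i| & i \to j \\ 0 & \text{otherwise,} \end{cases}
so distributing PageRank amounts either to distributed power iteration on the fixed-point form or to a distributed sparse linear solver on the right-hand form.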

This could be one contributing factor in a real ranking algorithm (which uses many other features as well).

Multiple Instance Learning for Computer Aided Diagnosis


This paper talks about MIL problems, addressing them with a convex-hull-based algorithm. The key idea comes from a previously proposed relaxation (in another paper). In MIL, the basic assumption is that each positive bag contains at least one positive sample. The relaxation of this assumption is that the convex hull of each positive bag contains a point that is correctly classified.
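Written out (in my own notation, not necessarily the paper's), the relaxation replaces "some instance x_i in the positive bag B is positive" with the existence of a convex combination that the classifier f scores positive:
\exists\, \lambda \ge 0,\ \sum_{i \in B} \lambda_i = 1 : \quad f\Bigl(\sum_{i \in B} \lambda_i x_i\Bigr) > 0,
which turns the combinatorial instance-selection problem into a continuous one over the weights \lambda.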

With this relaxation, the authors propose a general form for MIL under which many known algorithms can be formulated, e.g. SVM and Fisher discriminant variants for MIL. On the whole this is an application paper.

Wednesday, November 10, 2010

The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank


I started to read papers by Domingos to follow his ideas on Markov logic. I have already read a few pages of the tutorials and want to know more about his past research background.

This paper talks about ranking, which I am not very familiar with. It reviews two earlier methods, HITS and PageRank, which I'd like to scan later. The HITS model is a little complicated and hard to serve online (hubs and authorities have to be computed at query time), while PageRank can be computed offline (i.e. after crawling, the PageRank is computed from the incoming and outgoing links and later serves as one factor contributing to the final ranking function).

The graph PageRank is computed on is built as follows:
  • use the outgoing and incoming links between pages;
  • if a page has no outgoing links, treat it as linking to all other pages.
So we may place a random walker on the graph and compute the probability that it arrives at each page using the formula
\displaystyle P(j) = \frac{1 - \beta}{N} + \beta \sum_{i \in B(j)} \frac{P(i)}{|F_i|}
where B(j) is the set of pages that link to page j and F_i is the set of pages that page i links to.
This paper's main idea is to incorporate query information into PageRank, which, as you can see from the above, contains nothing about the query q. Let
P_q(j) = (1 - \beta) P_q'(j) + \beta \sum_{i \in B(j)} P_q(i) P_q(i \to j)
where the query-dependent terms are all derived from query relevance scores R_q(\cdot), e.g.
P_q'(j) = R_q(j) / \sum_{k \in W} R_q(k), \qquad P_q(i \to j) = R_q(j) / \sum_{k \in F_i} R_q(k).
To overcome the computational problem, the authors suggest precomputing the query-dependent ranking scores offline.
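A minimal power-iteration sketch of this query-dependent PageRank on a small link graph; the relevance values and all names are placeholders of mine, not the paper's implementation.

  import numpy as np

  def qd_pagerank(out_links, relevance, beta=0.85, iters=100):
      # out_links[i] = list of pages i links to; relevance[j] = R_q(j) >= 0.
      n = len(out_links)
      r = np.asarray(relevance, dtype=float)
      jump = r / r.sum()                 # P'_q(j), the query-biased random jump
      p = jump.copy()
      for _ in range(iters):
          new_p = (1 - beta) * jump
          for i, links in enumerate(out_links):
              targets = links if links else list(range(n))   # dangling page: link to all
              w = r[targets]
              if w.sum() == 0:           # no relevant out-links: spread uniformly
                  w = np.ones(len(targets))
              w = w / w.sum()            # P_q(i -> j)
              new_p[targets] += beta * p[i] * w
          p = new_p
      return p

  # toy usage: 4 pages, page 3 is dangling; relevance of each page to the query
  print(qd_pagerank([[1, 2], [2], [0], []], relevance=[1.0, 0.5, 2.0, 0.1]))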

Thursday, October 7, 2010

Bundle Methods for Regularized Risk Minimization

by Choon Hui Teo, S.V.N. Vishwanathan, Alex Smola and Quoc V. Le

This paper introduces a bundle method for RRM (regularized risk minimization). In optimization, the standard technique from which bundle methods derive is the cutting plane method.

In cutting plane methods, a convex function is bounded from below by a series of cutting planes constructed from subgradients. With many of them, the region to search for the optimum shrinks. Concretely, the next iterate is selected as w_t = \arg\min_w J_t^\text{CP} (w) where J_t^\text{CP}(w) = \max_{i \le t} J(w_{i-1}) + \langle w - w_{i-1}, s_i \rangle.

The cutting plane method suffers from slow convergence. Bundle methods improve on it with proximal functions (basically quadratic terms that stabilize the approximation of the objective). There are several ways of using them:
  • proximal: w_t = \arg\min_w \frac{\zeta_t}{2} \| w - \hat{w}_{t-1} \|^2 + J_t^\text{CP} (w)
  • trust region: w_t = \arg\min_w \{ J_t^\text{CP}(w) \mid \frac{1}{2} \|w - \hat{w}_{t-1} \|^2 \leq \kappa_t\}
  • level set: w_t = \arg\min_w \{ \frac{1}{2} \| w - \hat{w}_{t-1} \|^2 \mid J_t^\text{CP} (w) \leq \tau_t\}

This paper proposes another bundle method, yielding an O(log(1/\epsilon)) convergence bound. It has several variants, one of which does not involve a line search. A toy cutting-plane sketch for RRM is given below.
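Below is a minimal sketch of the cutting-plane iteration for L2-regularized hinge loss (a BMRM-flavored toy of my own, not the paper's exact algorithm): the master problem keeps the regularizer exact and lower-bounds only the empirical risk by its cutting planes, solved here as an epigraph QP with scipy.

  import numpy as np
  from scipy.optimize import minimize

  def remp_and_subgrad(w, X, y):
      # Empirical hinge risk and one subgradient.
      margins = 1 - y * (X @ w)
      active = margins > 0
      risk = np.maximum(margins, 0).mean()
      grad = -(X[active] * y[active, None]).sum(axis=0) / len(y)
      return risk, grad

  def bmrm(X, y, lam=0.1, iters=30):
      d = X.shape[1]
      w = np.zeros(d)
      planes = []                          # (a, b) pairs with Remp(w) >= a.w + b
      for _ in range(iters):
          risk, a = remp_and_subgrad(w, X, y)
          planes.append((a, risk - a @ w))
          # master problem: min_{w, xi} lam/2 ||w||^2 + xi  s.t.  xi >= a_i.w + b_i
          def obj(z):
              return 0.5 * lam * z[:d] @ z[:d] + z[d]
          cons = [{'type': 'ineq',
                   'fun': (lambda z, a=a_i, b=b_i: z[d] - a @ z[:d] - b)}
                  for a_i, b_i in planes]
          z = minimize(obj, np.append(w, 0.0), constraints=cons, method='SLSQP').x
          w = z[:d]
      return w

  # toy usage with random, roughly separable data
  rng = np.random.default_rng(0)
  X = rng.normal(size=(50, 3)); y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))
  print(bmrm(X, y))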

Unfortunately, yourequation.com is down due to heavy request. I have to switch to another equation rendering method ASAP.

PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications


This paper discusses two implementations of parallel LDA, one using MPI and another using MapReduce. It seems that Google's MapReduce framework doesn't support iteration well either. The comparison of LDA inference via variational Bayes, EP and the Gibbs sampler is mentioned; somewhat strangely though, I would like to make that comparison myself some time later.
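PLDA essentially follows the AD-LDA recipe (approximate distributed Gibbs sampling): each worker sweeps over its document shard against a stale copy of the global topic-word counts, and the counts are re-synchronized after every iteration. A much-simplified single-machine sketch of one such iteration (my own simplification, sequentializing the "workers"; initialization is omitted):

  import numpy as np

  def local_sweep(docs, z, n_wt, n_t, n_dt, alpha=0.1, beta=0.01):
      # One collapsed-Gibbs sweep over a shard, against a (possibly stale) copy
      # of the global topic-word counts n_wt and topic totals n_t.
      K, V = n_wt.shape
      for d, doc in enumerate(docs):
          for i, w in enumerate(doc):
              k = z[d][i]
              n_wt[k, w] -= 1; n_t[k] -= 1; n_dt[d, k] -= 1
              p = (n_dt[d] + alpha) * (n_wt[:, w] + beta) / (n_t + V * beta)
              k = np.random.choice(K, p=p / p.sum())
              z[d][i] = k
              n_wt[k, w] += 1; n_t[k] += 1; n_dt[d, k] += 1
      return n_wt, n_t

  def adlda_iteration(shards, z_shards, ndt_shards, n_wt, n_t):
      # One global iteration: every "worker" sweeps on its own copy of the global
      # counts, then the per-worker deltas are merged (the reduce step).
      new_wt, new_t = n_wt.copy(), n_t.copy()
      for docs, z, n_dt in zip(shards, z_shards, ndt_shards):
          local_wt, local_t = local_sweep(docs, z, n_wt.copy(), n_t.copy(), n_dt)
          new_wt += local_wt - n_wt
          new_t += local_t - n_t
      return new_wt, new_t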


Some models can be naturally extended into parallel versions, such as many Monte Carlo methods. In the authors' implementation I see many similarities to another version by my colleagues. Wen-yen is now one of my colleagues too, small world!

Tuesday, October 5, 2010

Templates and Anchors: Neuromechanical Hypothesis of Legged Locomotion on Land


I am not a biologist and don't quite get the point. Generally, I think they discuss using templates to simplify the modelling of the complicated locomotion of organisms. They propose two templates, SLIP (spring-loaded inverted pendulum) and LLS (lateral leg spring), and try to justify the models biologically, maybe. I can hardly find any equations in the paper, so I guess the math required to model these templates is comparatively simple. Maybe using the Lagrangian, we could obtain the equations of motion easily?

Collaborative Filtering on a Budget


This paper talks about dealing with large-scale collaborative filtering. Collaborative filtering can be formulated as a matrix factorization problem, and we may try several loss functions with different regularizers. One typical example is the M3F paper previously scanned here. A convenient solver is stochastic gradient descent.

The core idea is to use two hash functions (one for users and one for the recommended items) to store the user and item factor matrices in small fixed-size arrays, eliminating the memory cost in large-scale problems. The approximation is made to work with the help of Rademacher (random sign) functions. I still have no idea why this is possible; maybe I will take a look some day. A sketch of the idea is below.
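A minimal sketch of the hashing idea as I understand it (the hash construction here is my own illustration, not necessarily the paper's exact scheme): each entry of a factor matrix is read from a shared budget array through a hash function, multiplied by a ±1 Rademacher hash so that collisions tend to cancel.

  import numpy as np

  class HashedFactors:
      # Store an (n_users x k) factor matrix in a budget of m floats.
      def __init__(self, m, k, seed=0):
          self.w = np.zeros(m)
          self.m, self.k, self.seed = m, k, seed

      def _slot_and_sign(self, i, f):
          h = hash((self.seed, i, f))
          return h % self.m, 1.0 if (h >> 20) & 1 else -1.0   # Rademacher sign

      def get(self, i):
          # Reconstruct the length-k factor vector of user (or item) i.
          out = np.empty(self.k)
          for f in range(self.k):
              slot, sign = self._slot_and_sign(i, f)
              out[f] = sign * self.w[slot]
          return out

      def add(self, i, grad_vec, lr):
          # SGD update of i's factors, written back through the hash.
          for f in range(self.k):
              slot, sign = self._slot_and_sign(i, f)
              self.w[slot] -= lr * sign * grad_vec[f]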

Saturday, October 2, 2010

Spacetime Constraints

by Andrew Witkin and Michael Kass

This is a graphics paper about how to generate animations of objects, such as Pixar's Luxo lamp (see details of Luxo Jr. on Wikipedia).

First consider a spacetime particle. Newton's law gives an ODE for the particle, which would typically be solved with numerical methods. Here we also have boundary-value constraints and an objective function (minimal fuel exhausted over the course of the motion), so it is really a variational optimization problem. The numerical solution discretizes time and approximates the derivatives by finite differences; the objective then becomes a quadratic form, so we end up with a linearly constrained quadratic optimization problem (a sketch of the discretization is below).
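Roughly (in my notation, not the paper's), for a particle of mass m under gravity g with control force f_t at time step t of size h, the discretized problem looks like
\min_{x, f}\ \sum_t \|f_t\|^2 h \quad \text{s.t.}\quad m\,\frac{x_{t+1} - 2x_t + x_{t-1}}{h^2} - f_t - m g = 0, \qquad x_0 = a,\ x_T = b,
a quadratic objective with linear finite-difference constraints, which is exactly the linearly constrained quadratic program mentioned above.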

For a complex articulated object, the ODE is obtained via Lagrangian dynamics (hmm... I've almost forgotten it). Then we solve a similar, but much more complicated, problem that has to be analysed case by case.

Monday, August 30, 2010

Replicated Softmax: an undirected topic model


This paper talks about an application of the restricted Boltzmann machine to topic modelling. The visible layer is a matrix of 0-1 variables (indicating whether the ith word of the document takes value k in the dictionary), and the hidden layer again consists of binary variables. The conditional distribution of the visible layer given the hidden units is a multinomial, i.e. a softmax over the dictionary, which is where the model's name comes from ("replicated" because the weight matrix is shared across all word positions). With the conditionals defined, contrastive divergence can be used to train the model. As usual for topic models, the perplexity of held-out documents has to be computed for comparison with other topic models (such as LDA); annealed importance sampling is employed to estimate the partition function.
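For reference, the two conditionals that make CD training easy are roughly the following (my notation, with \hat v_k the count of word k in the document, D the document length, W the shared weights and a, b the biases):
p(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(D\, b_j + \sum_k W_{jk} \hat v_k\Bigr), \qquad p(v_i = k \mid \mathbf{h}) = \frac{\exp\bigl(a_k + \sum_j W_{jk} h_j\bigr)}{\sum_{k'} \exp\bigl(a_{k'} + \sum_j W_{jk'} h_j\bigr)},
i.e. a logistic hidden layer whose bias is scaled by the document length, and a softmax visible layer sampled D times with the same weights.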

There is a link between this model and the semantic hashing previously developed by the authors. Experiments show favorable results for the undirected topic model.

Wednesday, August 25, 2010

Learning a Parametric Embedding by Preserving Local Structure


This paper proposes a parametric embedding for t-SNE. The mapping is parametrized by a deep belief net, and the t-SNE objective is back-propagated through the weights of the layers. The idea is simple but the implementation must be tedious.

Parametric Embedding for Class Visualization


This paper addresses the problem of visualizing the posterior probabilities produced by topic models. We have scanned another paper on visualizing documents via a probabilistic graphical model; this paper is actually the inspiration for that one.

By introducing an objective function resembling SNE, this paper solves an easier problem: the p table of SNE is now given as input (instead of being computed via binary searches), and the q table is much smaller.

Actually the latter paper decouples the number of topics from the embedding dimension directly (while coupling it with the number of clusters). We can observe a quite similar phenomenon in this model.

As with SNE, this model is also optimized with a gradient-descent method (alternately over the two sets of coordinates, though).

Sunday, August 22, 2010

Probabilistic Latent Semantic Visualization: Topic Model for Visualizing Documents


This paper proposes a model based on LDA, but the Dirichlet prior is replaced with topic proportions generated from the latent coordinates of each document and each topic. The paper only deals with MAP estimation of the latent coordinates, which can be obtained via an EM-like algorithm.

Learning is simple for the word distributions conditioned on topics (analytic solution) but harder for the latent coordinates, which must be found by gradient-based numerical optimization.

The idea is interesting though: instead of seeking a representation of the documents in the topic space learned by LDA-like models, the visualization is modeled directly via a probabilistic graphical model.

A Probabilistic Approach to Semantic Representation


This paper actually introduces a Gibbs sampler for the LDA model. The Gibbs sampler, however, does not sample all latent variables: only the latent topic assignments are sampled, with the multinomial parameters integrated out (a collapsed Gibbs sampler). I guess many DP models actually do inference this way. But is this the so-called Bayesian way of learning? Is sampling the same as learning? Well... I don't know why they may do this. I have to read more.
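For the record, the collapsed conditional that gets sampled is (standard notation; the counts exclude the current token i)
\Pr(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \bigl(n_{d_i,k}^{-i} + \alpha\bigr)\,\frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta},
where n_{d,k} counts tokens of document d assigned to topic k, n_{k,w} counts assignments of word w to topic k, and V is the vocabulary size.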

Monday, August 9, 2010

Relation between PLSA and NMF and Implications


This paper's central point is that PLSA solves an NMF problem with a KL-divergence loss, which is easy to see if you are familiar with the two algorithms. The implications are (see the correspondence written out after the list):
  • the NMF algorithms can be interpreted as EM algorithms;
  • NMF does not hold many advantages over pLSA;
  • but NMF may benefit from many different loss functions for different purposes.
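Written out (standard notation, not quoted from the paper), the correspondence is
\Pr(w, d) = \sum_z \Pr(z)\Pr(w \mid z)\Pr(d \mid z) \quad\Longleftrightarrow\quad V \approx W H, \qquad V_{wd} \propto n(w, d),\ W_{wz} = \Pr(w \mid z),\ H_{zd} = \Pr(z)\Pr(d \mid z),
and maximizing the pLSA likelihood is equivalent to minimizing the (generalized) KL divergence \mathrm{KL}(V \parallel WH) under the appropriate normalization constraints.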

Friday, August 6, 2010

Probabilistic Latent Semantic Analysis


This paper introduces a probabilistic model for the LSA problem. In traditional LSA we have a word-document matrix (each column corresponds to a document, each row holds the counts of a certain word). LSA applies an SVD to the count matrix and interprets the left singular vectors as latent topics. NMF might be more appropriate, since the bases it finds are nonnegative and can be read as distributions over words.

This paper builds the first probabilistic model for the latent topics. The model is quite simple
\Pr(w, d) = \sum_z \Pr(z) \Pr(d\mid z) \Pr(w \mid z)
which can be trained with the EM algorithm. The inference side of this model is a bit awkward, but we may simply use \Pr(w \mid z) for inference problems. (A small EM sketch is below.)
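A minimal numpy sketch of the EM updates for this model (tempering, stopping criteria and the folding-in step are all omitted; the array shapes are my own convention):

  import numpy as np

  def plsa_em(N, K, iters=100, seed=0):
      # N: word-document count matrix of shape (V, D); K: number of topics.
      rng = np.random.default_rng(seed)
      V, D = N.shape
      p_z = np.full(K, 1.0 / K)                     # Pr(z)
      p_w_z = rng.dirichlet(np.ones(V), size=K).T   # Pr(w|z), shape (V, K)
      p_d_z = rng.dirichlet(np.ones(D), size=K).T   # Pr(d|z), shape (D, K)
      for _ in range(iters):
          # E-step: responsibilities Pr(z | w, d), shape (V, D, K)
          joint = p_z[None, None, :] * p_w_z[:, None, :] * p_d_z[None, :, :]
          post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
          # M-step: reweight the responsibilities by the observed counts
          nz = N[:, :, None] * post                 # expected counts per topic
          p_w_z = nz.sum(axis=1) / nz.sum(axis=(0, 1))
          p_d_z = nz.sum(axis=0) / nz.sum(axis=(0, 1))
          p_z = nz.sum(axis=(0, 1)) / N.sum()
      return p_z, p_w_z, p_d_z

  # toy usage on a random 20-word, 5-document count matrix
  counts = np.random.default_rng(1).integers(0, 5, size=(20, 5))
  print(plsa_em(counts, K=3)[0])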

Later, LDA endowed the mixing proportions over topics and words with Dirichlet priors.

Monday, July 26, 2010

Heavy-Tailed Symmetric Stochastic Neighbor Embedding


t-SNE is a very useful visualization algorithm. It inherits its idea from SNE but replaces the original Gaussian neighborhood probabilities in the low-dimensional space with t-distributions. This heavy-tailed distribution works pretty well on many data sets. This paper discusses a more general case.
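For reference, the heavy-tailed low-dimensional similarity used by symmetric t-SNE (standard notation) is
q_{ij} = \frac{\bigl(1 + \|y_i - y_j\|^2\bigr)^{-1}}{\sum_{k \neq l} \bigl(1 + \|y_k - y_l\|^2\bigr)^{-1}},
and the paper considers a more general family of heavy-tailed kernels in place of the Student-t one.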

After transforming the original problem into its Lagrange dual, the authors obtain a unified fixed-point iteration for a whole family of heavy-tailed distributions (including the t-distribution). This is somewhat less interesting than I had expected.

The authors also discuss how to integrate supervised information into the SNE algorithm, but their idea is kind of rough: adding a similarity computed from the supervised information into the similarity in the high-dimensional space. I don't know whether this would really work...

Dirichlet Component Analysis: Feature Extraction for Compositional Data


This paper discusses extracting features from compositional data (nonnegative features that sum to 1). The problem is interesting, though I am not sure about the major difficulties. To get a proper projection into a lower-dimensional space, a certain set of constraints (balanced rearrangement) must be respected. A regularization operator is devised to preserve the Euclidean geometry: the rearrangement shrinks the data while the regularization expands them.

The optimization is quite strange (solved via a genetic algorithm). The maximization w.r.t. \alpha looks like an MLE, but the minimization w.r.t. the rearrangement matrix doesn't make much sense to me.

I am still unsure about the intended application.

Monday, February 15, 2010

Projected Gradient Methods for Nonnegative Matrix Factorization


This paper proposes an optimization technique for NMF problems, which I think could also be extended to more general Bregman divergences. The point is purely about optimization: most algorithms can only be expected to converge to stationary points, not local minima, and the multiplicative updating rule does not even guarantee convergence to a stationary point (let alone a local minimum). The proposed algorithm is quite simple, essentially projecting each gradient step back onto the feasible domain (i.e. setting negative coordinates to 0). The results look better than those of the multiplicative updating rule. A sketch of the projected step is below.
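A minimal sketch of alternating projected gradient steps for the Frobenius-norm NMF objective (fixed step size here; the paper actually uses a more careful step-size search, which I omit):

  import numpy as np

  def projected_gradient_nmf(X, k, iters=500, lr=1e-3, seed=0):
      # Minimize ||X - W H||_F^2 with W, H >= 0 via projected gradient steps.
      rng = np.random.default_rng(seed)
      n, m = X.shape
      W = rng.random((n, k))
      H = rng.random((k, m))
      for _ in range(iters):
          R = W @ H - X
          grad_W = R @ H.T
          grad_H = W.T @ R
          W = np.maximum(W - lr * grad_W, 0.0)   # gradient step, then project onto W >= 0
          H = np.maximum(H - lr * grad_H, 0.0)
      return W, H

  # toy usage
  X = np.abs(np.random.default_rng(1).normal(size=(30, 20)))
  W, H = projected_gradient_nmf(X, k=5)
  print(np.linalg.norm(X - W @ H))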

Friday, February 12, 2010

Generalized Nonnegative Matrix Approximation with Bregman Divergence


This paper introduces a family of objectives in the form of Bregman divergences. From this page, you may see how broad the family is. For any continuously differentiable and strictly convex function f(\cdot), the Bregman divergence is defined as
B_f(p \parallel q) = f(p) - f(q) - \langle \nabla f(q), p - q\rangle
which can be regarded as the difference between the change in function value and its linear approximation. It has many interesting properties; please check them out. This paper uses Bregman divergences to measure the goodness of the NMF approximation, which naturally extends the original NMF objectives of the previously scanned paper. Note that both B_\phi(x_i, Wv_i) and B_\phi(Wv_i, x_i) can be employed as the objective function, but the former requires less computational work.
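Two familiar members of the family, to make the breadth concrete (standard examples, not taken from the paper):
f(p) = \tfrac{1}{2}\|p\|^2 \;\Rightarrow\; B_f(p \parallel q) = \tfrac{1}{2}\|p - q\|^2, \qquad f(p) = \sum_i p_i \log p_i \;\Rightarrow\; B_f(p \parallel q) = \sum_i p_i \log\frac{p_i}{q_i} - p_i + q_i,
so both the Frobenius-norm NMF and the KL-divergence NMF of Lee and Seung are special cases.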

The most surprising fact is that the auxiliary-function technique used to derive the multiplicative updating rule can also be extended to the more general Bregman divergence case, as long as the auxiliary function satisfies the same constraints
G(x,x) = f(x), \qquad G(x, y) \geq f(x), \forall y.
The authors provide the following f(\cdot),
f(v) = B_\phi(Wv, x)
given the approximation problem
X = (x_1 ,\ldots, x_N) \approx W V = W(v_1, \ldots, v_N).
The function G(\cdot, \cdot) is carefully designed to mimic the original auxiliary function,
G(v, u) = \sum_{i, j} \lambda_{i, j} f\left( \frac{W_{i, j} v_j}{\lambda_{i, j}} \right) - \left( \sum_i f(x_i) + f'(x_i)(W v_i - x_i)\right)
They also derive another multiplicative rule from the KKT conditions.

This is a really interesting use of the Bregman divergence.

Algorithms for Nonnegative Matrix Factorization


This paper follows the authors' Nature paper. They propose two objectives for the decomposition, and their algorithms also rely on an "auxiliary function". The updating rules are interesting because of their multiplicative form; this was later generalized to Bregman divergences in a NIPS 2005 paper.
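For the squared-error objective \|V - WH\|_F^2, the multiplicative rules (in their standard form) are
H_{a\mu} \leftarrow H_{a\mu} \frac{(W^\top V)_{a\mu}}{(W^\top W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^\top)_{ia}}{(W H H^\top)_{ia}},
which keep the factors nonnegative automatically because every term in the update is nonnegative.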

Sunday, January 31, 2010

Non-metric Label Propagation


This paper follows the idea of analysing non-metric similarity matrices. By decomposing the Gram matrix into two separate graphs (one for the positive eigenvalues, the other for the negative ones), the authors build two separate Markov chains, which together form a mixture Markov model for label propagation (still just an explicit solution of a linear system). Their paper contains many experiments, as usual, which I think might be what my own research work lacks.
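For context, the usual single-graph label propagation they extend (Zhou et al.-style, my recollection rather than this paper's exact formulation) has the closed form
F^* = (1 - \alpha)\,(I - \alpha S)^{-1} Y,
where S is the normalized similarity matrix, Y the initial label matrix and \alpha \in (0,1) the propagation weight; the mixture of the two Markov chains plugs into the same kind of linear system.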

The idea is not that fancy but the application in label propagation might be novel, the research style of Zhou's :-p That requires keen olfaction.

Feature Discovery in Non-metric Pairwise Data


This is a paper about how to analyse pairwise "distance" or similarity matrices. Since not all similarity matrices can be transformed into a Gram matrix (as we do in MDS), it is interesting to take a deeper look at the details.

Basically, we may imagine there are two metrics, one for similarity and another for dissimilarity (penalizing the similarity in human perception). By applying a spectral transformation, we may use metric methods once the spectrum has been fixed (no negative eigenvalues).
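The usual spectral fixes (standard tricks from the indefinite-kernel literature, not necessarily all used in this paper): with K = \sum_i \lambda_i v_i v_i^\top,
K_{\text{clip}} = \sum_i \max(\lambda_i, 0)\, v_i v_i^\top, \qquad K_{\text{flip}} = \sum_i |\lambda_i|\, v_i v_i^\top, \qquad K_{\text{shift}} = K + |\lambda_{\min}|\, I,
each of which yields a positive semi-definite matrix that metric methods can consume.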

The problem is how we may utilize the negative part of the spectrum.