Sunday, December 19, 2010

The Anatomy of a Large-scale Social Search Engine


This is a WWW 2010 paper on real-time question answering. The idea is to build a "village" model of knowledge sharing instead of the traditional "library" model, exemplified by search engines such as Google.

Aardvark collects each user's social-graph information from several sources (e.g. Facebook friends, email contacts, IM contacts), through which users' questions are propagated. Users must specify their expertise, either by selecting topics directly or by providing published material for the system to analyze (e.g. a Twitter feed or blog). The system builds two indices, an ISAM index for the social graph and an inverted index for users' expertise, which together support question routing.

The user's query is analyzed (to determine whether it is a question and, if so, its topic), and the user is given a chance to confirm the type (since fully automatic topic determination is not yet mature). A proper question is then handled by the routing component using the social graph and expertise information.

Therefore, the core of the village model is the routing algorithm. The routing procedure is essentially a ranking (of users) problem:
s(u_i, u_j, q) = \Pr(u_j \mid u_i) \Pr(u_j \mid q)
where the first term measures the intimacy between users via the social network (considering social connection, demographic similarity, profile similarity, vocabulary match, chattiness, verbosity, politeness, and speed), and the second term is learned with an aspect model (as in pLSA). In practice, the probability \Pr(t \mid u_i) is smoothed over the social network, on the intuition that if one's friends know something, one either knows it too or knows whom to ask.
The ranking engine works as follows: first, it retrieves users with matching expertise (if the question is location-sensitive, location is also considered during retrieval); second, it uses connectedness to find users with a suitable relationship to the asker; last, it uses availability information to decide whether each candidate can actually handle the query.
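As a concrete illustration, the two factors in the scoring formula can be combined in a toy scoring function like the one below. The feature computations, weights, and user records are invented for this sketch; the paper uses a much richer feature set with learned parameters.

```python
# Illustrative sketch of the Aardvark-style routing score
#   s(u_i, u_j, q) = Pr(u_j | u_i) * Pr(u_j | q).
# All features, weights, and user data below are made up.

def connectedness(asker, candidate):
    """Pr(u_j | u_i): a toy intimacy score from mutual friends and
    vocabulary overlap (the paper also uses demographics, chattiness,
    verbosity, politeness, speed, etc.)."""
    mutual = len(asker["friends"] & candidate["friends"])
    vocab = len(asker["vocab"] & candidate["vocab"])
    return min(1.0, 0.1 * mutual + 0.05 * vocab)

def expertise(candidate, topic_dist):
    """Pr(u_j | q): expected expertise of the candidate under the
    question's topic distribution Pr(t | q)."""
    return sum(p * candidate["expertise"].get(t, 0.0)
               for t, p in topic_dist.items())

def score(asker, candidate, topic_dist):
    return connectedness(asker, candidate) * expertise(candidate, topic_dist)

asker = {"friends": {"bob", "carol"}, "vocab": {"python", "ml"}}
bob = {"friends": {"alice", "carol"}, "vocab": {"python"},
       "expertise": {"programming": 0.9, "cooking": 0.1}}
topic_dist = {"programming": 0.8, "cooking": 0.2}
print(round(score(asker, bob, topic_dist), 4))  # → 0.111
```

Candidates would then be sorted by this score and filtered by availability before the question is routed.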

The rest consists of many small pieces that have to be put together:
  • Is the proposed text actually a question? (needs a classifier)
  • Is the question trivial? (a vertical search engine may retrieve the desired result without asking anyone)
  • Is the question location-sensitive?
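As a toy stand-in for the first piece, a rule-based question detector might look like the following. A real system would train a classifier on labeled data; the patterns here are purely illustrative.

```python
# Toy rule-based question detector (a stand-in for the trained
# classifier the paper calls for; the word list is illustrative only).
import re

QUESTION_WORDS = {"who", "what", "when", "where", "why", "how",
                  "which", "can", "does", "is", "are", "should"}

def looks_like_question(text):
    text = text.strip()
    if text.endswith("?"):
        return True
    # fall back on a leading interrogative or auxiliary word
    first = re.split(r"\W+", text.lower(), maxsplit=1)[0]
    return first in QUESTION_WORDS

print(looks_like_question("Where can I find good sushi in SF"))  # → True
print(looks_like_question("I love sushi."))                      # → False
```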

The whole platform is rather difficult to construct, but the idea is easy to grasp.

Wednesday, December 8, 2010

Search Logs as Information Footprints: Supporting Guided Navigation for Exploratory Search


This is a paper on building an exploration interface based on a clustering algorithm (star clustering) that arranges search queries into a tree structure according to their similarities. The material is then shown to users in a tree-structured interface that incorporates search utilities.

It's quite similar to what we are doing, and maybe I will develop a parallel clustering algorithm based on affinity propagation. But let's first examine the results of star clustering.
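For reference, the offline star clustering step can be sketched as follows: build a similarity graph thresholded at sigma, then repeatedly take the highest-degree unassigned vertex as a star center and attach its unassigned neighbors as satellites. The toy Jaccard similarity over query terms and the threshold value are stand-ins for whatever the paper actually uses.

```python
# Minimal sketch of offline star clustering over search queries.
# Similarity (Jaccard over terms) and sigma are illustrative choices.

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def star_cluster(queries, sigma=0.3):
    n = len(queries)
    adj = {i: set() for i in range(n)}
    for i in range(n):                       # thresholded similarity graph
        for j in range(i + 1, n):
            if jaccard(queries[i], queries[j]) >= sigma:
                adj[i].add(j)
                adj[j].add(i)
    unassigned = set(range(n))
    clusters = []
    while unassigned:                        # greedy star covering
        center = max(unassigned, key=lambda v: len(adj[v] & unassigned))
        satellites = adj[center] & unassigned
        clusters.append((queries[center], [queries[s] for s in satellites]))
        unassigned -= satellites | {center}
    return clusters

qs = ["star clustering algorithm", "clustering algorithm survey",
      "weather tomorrow", "affinity propagation clustering"]
for center, sats in star_cluster(qs):
    print(center, "->", sats)
```

The star centers give natural labels for tree nodes, which is what makes the algorithm attractive for a guided-navigation interface.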

Monday, December 6, 2010

Stochastic Gradient Boosted Distributed Decision Trees


This paper proposes two solutions for implementing exact stochastic GBDT, which was developed by the famous statistician Friedman. I'd like to read his two earlier papers later as a first study of decision trees (the others would be CART and C4.5... I guess).

The map/reduce implementation is based on the previously scanned paper and uses horizontal splits, while the MPI implementation uses vertical splits. The former is quite direct; the latter requires communication via all-to-all broadcast.
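To keep the serial algorithm in mind while reading, here is a minimal single-feature sketch of gradient boosting with squared loss: each round fits a depth-1 regression stump to the current residuals and takes a shrunken step. This is the textbook procedure being distributed, not the paper's implementation (and it omits the stochastic subsampling).

```python
# Toy serial GBDT with squared loss: depth-1 stumps on one feature,
# shrinkage lr. Data are made up; real GBDT handles many features,
# deeper trees, and row subsampling.

def best_stump(xs, residuals):
    """Exhaustively search thresholds, minimizing squared error of
    the two leaf means. Returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=10, lr=0.5):
    pred = [0.0] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]  # -gradient of L2 loss
        t, lm, rm = best_stump(xs, residuals)
        pred = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, pred)]
    return pred

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
print([round(p, 2) for p in boost(xs, ys)])  # → [1.0, 1.0, 3.0, 3.0]
```

The distributed question is then: how do the sites cooperate to evaluate `best_stump` when no single machine holds all of `xs` (horizontal split) or all features (vertical split)?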

Maybe after studying the GBDT, I would have a better understanding of this paper.

A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees


This paper addresses training on large-scale problems using two types of data splits: horizontal fragmentation (sample subsets) and vertical fragmentation (feature subsets). The orientation is defined with respect to the row-wise sample matrix.

The key idea behind this paper is to extract sufficient statistics from the splits of the data so that aggregating the statistics at the end yields the exact model. This is quite direct in the case of MoE (when each expert is an exponential family). This paper, however, emphasizes decision trees.

Decision trees are trained layer by layer, finding splits of the data using the feature that maximizes the information gain (or a similar criterion). The paper discusses two cases:
  • If the data are split horizontally, each site computes the required statistics, which are combined to find the best split; the chosen split is then sent back to every site so that each site knows which node its samples fall into and can carry out the next round of sufficient-statistic collection.
  • If the data are split vertically, each site can find its own best split, and these candidates are compared to select the global best; the chosen split is then communicated as sample indices (so that each site knows which samples fall into which node).
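The horizontal case can be sketched as follows: each site emits (feature, value, label) counts as its sufficient statistics, and the coordinator merges them and evaluates information gain exactly as if the data were pooled. The data and feature names below are made up.

```python
# Sketch of exact split-finding under horizontal fragmentation:
# per-site class counts are the sufficient statistics. Data are toy.
from collections import Counter, defaultdict
from math import log2

def site_stats(rows):
    """Sufficient statistics from one site. rows: [(features_dict, label)].
    Returns (label counts, (feature, value, label) counts)."""
    labels, fvl = Counter(), Counter()
    for feats, y in rows:
        labels[y] += 1
        for f, v in feats.items():
            fvl[(f, v, y)] += 1
    return labels, fvl

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def best_feature(stats_from_sites):
    """Merge per-site statistics, then pick the feature with the
    highest information gain, exactly as on the pooled data."""
    labels, fvl = Counter(), Counter()
    for l, s in stats_from_sites:
        labels.update(l)
        fvl.update(s)
    n = sum(labels.values())
    base = entropy(labels.values())
    per_feature = defaultdict(lambda: defaultdict(Counter))
    for (f, v, y), c in fvl.items():
        per_feature[f][v][y] += c
    best = None
    for f, by_value in per_feature.items():
        cond = sum(sum(cnt.values()) / n * entropy(cnt.values())
                   for cnt in by_value.values())
        if best is None or base - cond > best[0]:
            best = (base - cond, f)
    return best[1]

site1 = [({"outlook": "sunny", "windy": "no"}, "play"),
         ({"outlook": "rainy", "windy": "yes"}, "stay")]
site2 = [({"outlook": "sunny", "windy": "yes"}, "play"),
         ({"outlook": "rainy", "windy": "no"}, "stay")]
print(best_feature([site_stats(site1), site_stats(site2)]))  # → outlook
```

Since the merged counts equal the pooled-data counts, the split chosen is identical to the centralized one, which is why the method is exact.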

This solution is exact and can be applied directly to implementing decision trees in a map/reduce framework. It should also apply to many other hierarchical models.