Sunday, December 19, 2010

The Anatomy of a Large-scale Social Search Engine


This is a WWW 2010 paper on real-time question answering. The idea is to build a "village" model of knowledge sharing, where you find the right person to ask, instead of the traditional "library" model, where you find the right document, e.g. the search engine solution provided by Google.

Aardvark users have their social graph information collected from several sources, e.g. Facebook friends, email contacts and IM contacts; their questions are propagated through this graph. Users must specify their expertise, either by selecting some topics or by pointing to content they have published (e.g. Twitter, blog) for the system to analyze. The system builds two indices, an ISAM index for the social graph and an inverted index for users' expertise, and then the user can start using the service.
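To make the two indices concrete, here is a minimal sketch. It assumes plain in-memory dictionaries stand in for both the ISAM index and the inverted index; the user names, topics and the helper names `build_inverted_index` and `candidates` are hypothetical, not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy data: who knows whom, and who claims which expertise.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice"},
}
expertise = {
    "alice": {"python", "hiking"},
    "bob": {"cooking", "python"},
    "carol": {"hiking"},
}


def build_inverted_index(expertise):
    """Map each topic to the set of users who list it as expertise."""
    index = defaultdict(set)
    for user, topics in expertise.items():
        for topic in topics:
            index[topic].add(user)
    return dict(index)


topic_index = build_inverted_index(expertise)


def candidates(asker, topic):
    """Friends of the asker who list the topic as expertise."""
    return topic_index.get(topic, set()) & friends.get(asker, set())
```

With this toy data, `candidates("alice", "python")` looks up the experts on "python" and keeps only those in alice's social graph.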

A user's query is analyzed (to decide whether it is a question and what topic it is about), and the user is given a chance to confirm or correct the topic (since automatic topic detection is not yet mature). A proper question is then handed to the routing engine, which suggests answerers using the social graph and expertise information.

Therefore, the core of the village model is the routing algorithm. Routing is essentially a problem of ranking users:
s(u_i, u_j, q) = \Pr(u_j \mid u_i) \Pr(u_j \mid q)
where the first term measures the intimacy between users based on the social network (in terms of social connection, demographic similarity, profile similarity, vocabulary match, chattiness, verbosity, politeness and speed) and the second term is learned with an aspect model (as in pLSA). In practice, the probability \Pr(t \mid u_i) of a user knowing about topic t is smoothed over the social network, since if one's friends know something, one either knows it too or knows whom to ask.
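As a rough illustration of the score, the sketch below assumes the aspect model decomposes the second term over latent topics, \Pr(u_j \mid q) = \sum_t \Pr(u_j \mid t) \Pr(t \mid q), as in pLSA. All probabilities are made-up numbers, and the function names are hypothetical.

```python
def expertise_given_query(p_user_given_topic, p_topic_given_query):
    """Pr(u_j | q) = sum over topics t of Pr(u_j | t) * Pr(t | q)."""
    return sum(
        p_user_given_topic.get(topic, 0.0) * p_tq
        for topic, p_tq in p_topic_given_query.items()
    )


def score(p_connect, p_user_given_topic, p_topic_given_query):
    """s(u_i, u_j, q) = Pr(u_j | u_i) * Pr(u_j | q)."""
    return p_connect * expertise_given_query(p_user_given_topic, p_topic_given_query)


# Made-up topic distribution for one query.
p_topic_given_query = {"hiking": 0.75, "travel": 0.25}

# Made-up candidates: (Pr(u_j | u_i), {topic: Pr(u_j | t)}).
answerers = {
    "bob": (0.5, {"hiking": 0.5}),
    "carol": (1.0, {"hiking": 0.25, "travel": 0.5}),
}

scores = {
    name: score(p_connect, p_ut, p_topic_given_query)
    for name, (p_connect, p_ut) in answerers.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here carol ranks above bob even though bob is the stronger hiking expert, because her connection to the asker is stronger and she also covers the second topic.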
The ranking engine works as follows: first, it retrieves the users with matching expertise (if the question is location-sensitive, location is also considered in the retrieval); second, it uses connectedness to find those with a suitable relationship to the asker; last, it uses availability information to decide whether each candidate could handle the query.
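The three stages of the ranking engine can be sketched as a simple filter pipeline. This is only an illustration under assumptions: the `Candidate` fields, the `min_connectedness` threshold and the `top_k` cutoff are hypothetical, and the real system presumably combines these signals in a more graded way.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    topics: set
    connectedness: float  # stand-in for Pr(u_j | u_i)
    available: bool


def route(topic, users, min_connectedness=0.1, top_k=3):
    """Return up to top_k candidate names for a question on `topic`."""
    # Stage 1: retrieve users whose expertise matches the topic.
    matched = [u for u in users if topic in u.topics]
    # Stage 2: keep users sufficiently connected to the asker, best first.
    connected = sorted(
        (u for u in matched if u.connectedness >= min_connectedness),
        key=lambda u: u.connectedness,
        reverse=True,
    )
    # Stage 3: drop users who are not currently available.
    return [u.name for u in connected if u.available][:top_k]


users = [
    Candidate("bob", {"hiking"}, 0.5, True),
    Candidate("carol", {"hiking", "travel"}, 0.9, True),
    Candidate("dave", {"hiking"}, 0.95, False),
    Candidate("erin", {"cooking"}, 0.8, True),
    Candidate("frank", {"hiking"}, 0.05, True),
]
routed = route("hiking", users)
```

In this toy run, erin is dropped for lack of expertise, frank for weak connectedness and dave for being unavailable.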

The rest consists of many small pieces that have to be put together:
  • Is the submitted text actually a question? (needs a classifier)
  • Is the question trivial? (vertical search engines may retrieve the desired result without asking anyone)
  • Is the question location-sensitive?

The whole platform is fairly difficult to build, but the idea is easy to grasp.
