Wednesday, November 10, 2010

The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank


I started to read those papers by Domingos to follow their idea on Markov logic. Now I have already read a few pages from their tutorials and wish to know more about their past research background.

This paper talks about ranking, which I am not quite familiar with. This paper introduces two previous methods, i.e. HITS and PageRank, which I'd like to scan later. The HITS model is little bit complicated and can't be served online (they have to compute hubs and authorities at query time) while PageRank can utilize the PageRank at offline stage (i.e. after the crawling, they may compute the PageRank using incoming links and outgoing links, which later serves as a factor contributing to the final ranking function).

The graph PageRank builds consists of links:
  • using the outgoing links and incoming links;
  • if a webpage has no outgoing links, it links to all other pages;
So we may set a random walker on the graph and find the probability it gets to the page using the formula
\displaystyle P(j) = \frac{1 - \beta}{N} + \beta \sum_{i \in B(j)} \frac{P(i)}{|F_i|}
where B(j) contains all pages link to page j and F(i) contains the pages linked from page i.
This paper's main idea is to incorporate the query information into the PageRank, which, you may have already seen from tabove, contains nothing about the query q. Let
P_q(j) = (1 - \beta) P_q'(j) + \beta \sum_{i \in B_j} P_q(i) P_q(i \to j)
, where those query related terms are all derived from query relevance scores, e.g.
P_q'(j) = R_q(i) / \sum_{i \in W}R_q(i), \qquad P_q(i \to j) = R_q(i) / \sum_{j \in F_i} R_q(j)
.
To overcome the computation problem, the author suggests we pre-compute the ranking score for the search queries offline.

No comments: