Friday, January 2, 2009

Clustering by Passing Messages Between Data Points


It's an interesting paper about clustering. The technique is affinity propagation, which the authors present as an instance of belief propagation. It has several important features: it identifies exemplars among the samples instead of computing a mean/center directly; when the similarity matrix is sparse, the algorithm is fast, since its cost is linear in the number of similarity values s(i, k); and instead of choosing the number of exemplars, the user only sets the preferences s(k, k), which influence the number of exemplars (the larger the preferences, the more exemplars emerge).
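
To make the similarities and preferences concrete, here is a minimal numpy sketch (my own illustration, not the authors' code). It uses negative squared Euclidean distance as the similarity and the median of the similarities as a default preference; the function and variable names are my own.

```python
import numpy as np

def similarity_matrix(X, preference=None):
    """Pairwise similarities as negative squared Euclidean distances
    (one common choice; the algorithm accepts arbitrary similarities)."""
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    if preference is None:
        # Median of the similarities is a common default giving a moderate
        # number of exemplars; larger preferences give more exemplars.
        preference = np.median(S[~np.eye(len(X), dtype=bool)])
    np.fill_diagonal(S, preference)    # s(k, k) is the preference
    return S
```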

The iterative algorithm is very simple. Two matrices are introduced to quantify how well sample k serves as the exemplar of sample i, from the viewpoint of sample i and of sample k respectively: the responsibility r(i, k) and the availability a(i, k). The updating rule for the responsibility is
r(i, k) <- s(i, k) - max{ a(i, j) + s(i, j) | j <> k }.
The availabilities are initialized to 0 (and the off-diagonal availabilities stay nonpositive afterwards). The maximum of a(i, j) + s(i, j) over the other candidates j identifies the strongest competing exemplar for sample i: a sample that is similar to i and at the same time considers itself available as an exemplar. Subtracting this maximum from s(i, k) therefore gives the evidence, from sample i's point of view, that sample k is its exemplar, evidence based on the observed availabilities and similarities. The updating rule for the availability is
a(i, k) <- min{ 0, r(k, k) + sum{ max{ 0, r(j, k) } | j <> i, k } }.
As the min function makes clear, the (off-diagonal) availability is nonpositive. The sum only includes samples with positive responsibility, namely those that still regard sample k as a possible exemplar. Adding r(k, k) to this sum therefore measures how available sample k is to act as the exemplar of sample i: it is negative when r(k, k) is negative and few other samples take sample k as their exemplar. This is the evidence sample k holds, based on its observation of the other samples. (The self-availability a(k, k) is updated separately, as the sum of the positive responsibilities that sample k receives from the others.)
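
To make the two update rules concrete, here is a minimal numpy sketch of one sweep of responsibility and availability updates (my own illustration; the vectorization and names are mine, not from the paper).

```python
import numpy as np

def update_messages(S, R, A):
    """One undamped sweep over the n x n similarity, responsibility and
    availability matrices S, R, A."""
    n = S.shape[0]

    # Responsibility: r(i, k) <- s(i, k) - max_{j != k} { a(i, j) + s(i, j) }
    AS = A + S
    idx = np.argmax(AS, axis=1)                  # strongest competitor per row
    first_max = AS[np.arange(n), idx]
    AS[np.arange(n), idx] = -np.inf
    second_max = AS.max(axis=1)                  # used where k is the argmax itself
    R_new = S - first_max[:, None]
    R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max

    # Availability: a(i, k) <- min{0, r(k, k) + sum_{j != i, k} max{0, r(j, k)}}
    Rp = np.maximum(R_new, 0)
    np.fill_diagonal(Rp, R_new.diagonal())       # keep r(k, k) itself in the column sum
    A_new = Rp.sum(axis=0)[None, :] - Rp         # column sum minus own contribution
    dA = A_new.diagonal().copy()                 # self-availability: sum of positive
    A_new = np.minimum(A_new, 0)                 #   responsibilities, not clipped at 0
    np.fill_diagonal(A_new, dA)
    return R_new, A_new
```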

After affinity propagation converges, there are two ways of deciding whether a sample is an exemplar: one chooses the samples with r(k, k) + a(k, k) > 0; the other chooses those with r(k, k) + a(k, k) > r(k, j) + a(k, j) for all j <> k.
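
Both criteria are easy to read off the converged messages. Here is a sketch of the first one, together with assigning each remaining point to the exemplar k that maximizes r(i, k) + a(i, k) (again my own code, not the authors'):

```python
import numpy as np

def extract_exemplars(R, A):
    """Pick exemplars with r(k, k) + a(k, k) > 0 and assign every sample to one."""
    E = R + A
    exemplars = np.flatnonzero(np.diag(E) > 0)
    # Each sample i follows the exemplar k maximizing r(i, k) + a(i, k);
    # exemplars end up representing themselves.
    labels = exemplars[np.argmax(E[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels
```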

One important ingredient of the algorithm is damping. I haven't explored this part carefully, but I encounter the technique quite frequently; similar ideas appear in training SRBMs and in backpropagation with momentum. Each message is set to a convex combination of the newly proposed value and the old value from the last update, which damps the numerical oscillations that can otherwise prevent convergence (why?).
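
Putting it together, the damped update is just a convex combination of old and new messages with a factor lambda (0.5 is a typical choice). A sketch of the full loop, reusing the helper functions from the sketches above:

```python
import numpy as np

def affinity_propagation(S, lam=0.5, max_iter=200):
    """Damped message passing: message <- lam * old + (1 - lam) * new."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(max_iter):
        R_new, A_new = update_messages(S, R, A)
        R = lam * R + (1 - lam) * R_new
        A = lam * A + (1 - lam) * A_new
    return extract_exemplars(R, A)

# Toy usage:
# X = np.random.randn(200, 2)
# exemplars, labels = affinity_propagation(similarity_matrix(X))
```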

The Science paper includes three experiments, on facial images, bioinformatics data, and airline travel data, demonstrating the effectiveness of the algorithm. It's really amazing how well it works. How did they discover this simple but useful technique? The algorithm is derived from (loopy) belief propagation on an elaborately constructed factor graph. I am not familiar with this and have to explore it ASAP.

Here is the important reference site from the PSI group.
