Sunday, June 19, 2011

SIGIR 2011

This time it is in Beijing.

Tutorials

I'd say most of them are quite elementary:
  • one covers machine learning techniques for IR;
  • another covers online algorithms, which are a trending topic in ML right now;

Keynotes

It is quite general but should be interesting enough.

Papers

Not sure about the content; there are no online proceedings yet. Not sure whether we can go there and listen to some of the talks.

Workshops

Some of them look interesting:

Hope I may sneak in.

Like like alike---Joint Friendship and Interest Propagation in Social Networks


This paper extends the idea of RLFM: RLFM can be interpreted as a supervised learning problem, and this paper extends that idea to a social network, mainly via regularizers. That is why the probabilistic model in RLFM may not be that important (it is just a disciplined interpretation). There are 7 regularizers in total, which leads to a very complicated objective function.

The optimization is solved via stochastic gradient descent. Maybe for other techniques, like the interior point method, the derivation is too difficult?
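
To make the flavor concrete, here is a minimal sketch assuming just one social regularizer that pulls friends' latent factors together (the paper has seven; the objective, variable names, and data below are toy assumptions of mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 5
U = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
V = 0.1 * rng.standard_normal((n_items, k))   # item latent factors

# toy observations: (user, item, rating) triples and a friendship list
ratings = [(rng.integers(n_users), rng.integers(n_items), rng.uniform(1, 5))
           for _ in range(500)]
friends = [(rng.integers(n_users), rng.integers(n_users)) for _ in range(200)]

lr, lam, beta = 0.01, 0.05, 0.05   # step size, L2 weight, social-regularizer weight
for epoch in range(20):
    for u, i, r in ratings:
        err = U[u] @ V[i] - r                      # prediction error
        gu, gi = err * V[i] + lam * U[u], err * U[u] + lam * V[i]
        U[u] -= lr * gu
        V[i] -= lr * gi
    for u, f in friends:                           # pull friends' factors together
        diff = U[u] - U[f]
        U[u] -= lr * beta * diff
        U[f] += lr * beta * diff
```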

Wednesday, June 15, 2011

Regression-based Latent Factor Models


This paper talks about another way of using the rating matrix.

Given features of users and items, there are some correlations between them. The CCA idea is "unsupervised" since it works on two kinds of data and finds the maximum correlations without further "supervised" information. If we do have some "supervised" information, namely how correlated a user is with certain items, the problem becomes a "supervised learning" version. Actually, most of the literature is in this style. Haha, you might have realized my point now: is there any "semi-supervised" version? Yeah, I'd like to develop one :-)

OK, let's go back to this paper. With the supervised information, the CCA-like problem becomes finding two linear mappings of the user's features and the item's features into a common subspace, so that their inner product approximates the given rating. This might be problematic, since the ratings might not be consistent with their relative face values, but anyway that is the basic idea: it's like what we get in PCA.
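
A minimal sketch of that basic idea (not the paper's actual algorithm): learn two linear maps that send user and item features into a shared low-dimensional space, fitted by plain gradient descent so that the inner product matches the observed rating. All names and data here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, du, di, k = 30, 40, 8, 6, 4
Fu = rng.standard_normal((n_users, du))     # user features
Fi = rng.standard_normal((n_items, di))     # item features
A = 0.1 * rng.standard_normal((k, du))      # maps user features -> shared space
B = 0.1 * rng.standard_normal((k, di))      # maps item features -> shared space

obs = [(rng.integers(n_users), rng.integers(n_items), rng.uniform(1, 5))
       for _ in range(600)]                 # observed (user, item, rating) triples

lr = 0.01
for epoch in range(50):
    for u, i, r in obs:
        zu, zi = A @ Fu[u], B @ Fi[i]       # project both sides into the shared space
        err = zu @ zi - r                   # inner product should approximate the rating
        A -= lr * err * np.outer(zi, Fu[u])
        B -= lr * err * np.outer(zu, Fi[i])
```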

The second step is to build a probabilistic model out of it (as with PCA to PPCA), so that everything gets an interpretation and there is a disciplined way to train the model and make inferences. The paper shows one interpretation: each user/item has a latent vector in a common subspace (zero mean, some variance?), which can be obtained from the original feature space via a linear transformation. The observed rating is the inner product plus noise. With this graphical model, we may do the learning via EM; here the authors chose MCEM (Monte Carlo to compute the expectation).

Let's try some of the ideas.

Monday, June 13, 2011

Inferring Relevant Social Networks from Interpersonal Communication


The paper makes the point that social networks are not entirely usable in their original form: we usually have to prune some edges, to an extent that depends on the specific task. My take is that social edges are like aqueducts or wires; different behaviors have different propagation coefficients.

Another important part of this paper is the set of features, in three categories (a small sketch of two of them follows the list):
  • reach
    • node degree
    • average neighbor degree
    • size of two-hop neighborhood
  • closure
    • embeddedness, the average ratio of common friends to the union of the two endpoints' friends;
    • normalized clustering coefficient, the average probability that two of my neighbors are friends
  • bridge
    • network constraint (a sum over all neighbors, each term an inner product with the node)
    • ego components (how many new components appear after the ego is removed)
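
For example, the two closure features could be computed along these lines with networkx (this is my reading of the definitions; the paper's may differ):

```python
import networkx as nx

G = nx.karate_club_graph()                 # stand-in social graph

def embeddedness(G, u, v):
    """Common friends over the union of friends (my reading of the feature)."""
    nu, nv = set(G[u]) - {v}, set(G[v]) - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

node = 0
avg_emb = sum(embeddedness(G, node, nbr) for nbr in G[node]) / G.degree(node)
clustering = nx.clustering(G, node)        # probability that two neighbors of `node` are linked
print(avg_emb, clustering)
```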

Sentiment Analysis with Global Topics and Local Dependency


This paper discusses the interaction between topic and sentiment. The first idea is to impose an additional factor in generating a word: the topic decides the sentiment, and the sentiment along with the topic decides the word. However, this structure can't determine where the writer changes his attitude, so they impose a hidden Markov structure on sentiment transitions. However, the paper only gives some inference results, and many parameters are set manually.
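
A toy ancestral-sampling sketch of that generative flavor, heavily simplified and surely not the paper's exact model: sentiment follows a Markov chain across sentences, a topic is drawn per sentence, and words come from a (topic, sentiment)-specific distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, V, n_sent, n_words = 3, 2, 20, 5, 8    # topics, sentiments, vocab, sentences, words/sentence

theta = rng.dirichlet(np.ones(K))            # document's topic proportions
trans = rng.dirichlet(np.ones(S), size=S)    # sentiment transition matrix (HMM)
phi = rng.dirichlet(np.ones(V), size=(K, S)) # word distribution per (topic, sentiment)

doc, sentiment = [], rng.integers(S)         # initial sentiment
for _ in range(n_sent):
    sentiment = rng.choice(S, p=trans[sentiment])   # sentiment follows a Markov chain
    topic = rng.choice(K, p=theta)                  # one topic per sentence
    words = rng.choice(V, size=n_words, p=phi[topic, sentiment])
    doc.append((topic, sentiment, words.tolist()))
print(doc)
```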

Monday, June 6, 2011

An Efficient Method For Compressive Sensing


This paper shows how the interior point method is applied to the L1-regularized least squares problem. The dual provides an estimate of the duality gap, which then serves as the convergence criterion. The log barrier method reformulates the primal objective, which is solved by Newton's method. Newton's method requires solving a linear system, which is addressed by conjugate gradient, and the accuracy of that inner solve can be adjusted as the search evolves.

On a single machine, this method can handle millions of variables in compressive sensing problems.
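
To make the duality-gap criterion concrete, here is a small sketch (not the paper's solver): given any candidate x, scale the residual into a dual-feasible point and measure the gap, which upper-bounds the suboptimality and so can drive the stopping rule.

```python
import numpy as np

def duality_gap(A, b, x, lam):
    """Gap between primal  ||Ax-b||^2 + lam*||x||_1  and a dual-feasible point.

    The dual of this L1-regularized least-squares problem is
        max  -nu@nu/4 - nu@b   s.t.  ||A.T @ nu||_inf <= lam,
    and scaling the residual 2*(Ax - b) into the feasible set gives the bound.
    """
    r = A @ x - b
    primal = r @ r + lam * np.abs(x).sum()
    nu = 2.0 * r
    nu *= min(1.0, lam / np.abs(A.T @ nu).max())      # force dual feasibility
    dual = -nu @ nu / 4.0 - nu @ b
    return primal - dual                              # >= 0; -> 0 at the optimum

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))                    # underdetermined, as in compressive sensing
b = A @ (rng.random(200) * (rng.random(200) < 0.05))  # sparse ground truth
print(duality_gap(A, b, np.zeros(200), lam=0.1))
```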

Near-Optimal Hashing Algorithms For Approximate Nearest Neighbor in High Dimensions


This is a survey paper on locality-sensitive hashing (LSH). LSH is quite useful for approximately retrieving similar items under a given distance. The fact is, once we have enough data, approximate nearest neighbors perform about as well as exact ones, so why bother computing the exact version?

The idea follows the randomized-algorithms style common in the literature: select random projections and combine their bins into the final hash table. Intuitively, given a query point, hashing by one random projection into a bin cuts down, to some extent, the probability that the retrieved points are far from the query; as more projections are combined, we get closer. For Hamming distance, for example, the projection can be simplified to a randomly chosen subset of coordinate indices.
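
A minimal sketch of the Hamming-distance case: each hash function is a random subset of coordinates, several tables are built, and anything that collides with the query in some table becomes a candidate (the parameter choices here are arbitrary).

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n, d = 1000, 64
data = rng.integers(0, 2, size=(n, d))                # binary vectors

L, k = 10, 12                                         # number of tables, bits per hash
subsets = [rng.choice(d, size=k, replace=False) for _ in range(L)]
tables = [defaultdict(list) for _ in range(L)]
for idx, x in enumerate(data):
    for t, S in enumerate(subsets):
        tables[t][tuple(x[S])].append(idx)            # hash = the sampled bits

def query(q):
    cands = {i for t, S in enumerate(subsets) for i in tables[t][tuple(q[S])]}
    # rank the (few) candidates by exact Hamming distance
    return sorted(cands, key=lambda i: int((data[i] != q).sum()))[:5]

print(query(data[0]))
```

Tuning the number of tables and the bits per hash trades recall against the size of the candidate set.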

There are software packages implementing the LSH idea. I would like to try them in the coming days.

Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models


This work might have inspired the previously scanned paper, since the proposed algorithm looks quite similar. This paper focuses only on CME (a special case of convex optimization), and therefore the result is of comparatively limited use in practice. The key difference between the two is that this paper only employs a normal batch solver, unlike the stochastic solver in the latter.

This algorithm is map/reduce friendly, though.

Parallelized Stochastic Gradient Descent


This paper presents a very interesting optimization technique. In online learning, SGD (stochastic gradient descent) is usually applied to optimize the model, since the data come in one by one (or mini-batch by mini-batch). For parametric models, the parameters are updated with one or a few samples at a time, sequentially.

This paper describes a parallelized version. Essentially, we run several SGD instances on different machines and aggregate their results by averaging. Notably, we do not even need to give every machine all of the data. The proof of convergence appears to depend on the convexity of the objective function, but I suspect that may not be essential.
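
A minimal sketch of the averaging scheme on a toy least-squares problem, simulating the machines with a loop (my own simplification, not the paper's exact algorithm or analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_machines = 10, 10000, 4
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

def sgd_shard(Xs, ys, lr=0.01, epochs=2):
    """Plain SGD for least squares on one machine's shard of the data."""
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(len(ys)):
            w -= lr * (Xs[i] @ w - ys[i]) * Xs[i]
    return w

shards = np.array_split(rng.permutation(n), n_machines)           # each "machine" gets a slice
w_avg = np.mean([sgd_shard(X[s], y[s]) for s in shards], axis=0)  # reduce step: average
print(np.linalg.norm(w_avg - w_true))
```

In map/reduce terms, the map step runs SGD on each shard and the reduce step is just the average.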

The result is quite interesting when we consider the popular parallel computation framework, map/reduce. We'd better implement one ASAP.