Saturday, December 31, 2011

AppJoy: Personalized Mobile Application Discovery


This paper introduces the authors' app AppJoy, which collects users' app usage on a mobile device, derives similarity scores between apps based on that usage, and recommends apps according to the similarity scores. The idea is to exploit the temporal and spatial information that comes along with the usage. In contrast to other recommendation algorithms, which are based on ratings, app usage may serve as a much better feature. I guess this shows that a simpler model with strong features might serve better than a complicated model with weak features.
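As a toy illustration of the usage-as-feature idea (my own sketch, not AppJoy's actual algorithm), item-item similarity can be computed directly from a user-by-app usage-time matrix:

import numpy as np

def app_similarity(usage):
    # usage: (n_users, n_apps) matrix of usage time; cosine similarity between app columns
    normalized = usage / (np.linalg.norm(usage, axis=0, keepdims=True) + 1e-12)
    return normalized.T @ normalized                      # (n_apps, n_apps)

def recommend(usage, user, top_k=5):
    scores = app_similarity(usage) @ usage[user]          # weight similar apps by the user's own usage
    scores[usage[user] > 0] = -np.inf                     # drop apps the user already uses
    return np.argsort(scores)[::-1][:top_k]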

It seems that, if we decide to move on to applications on mobile devices, we'd better get to know the details of the mobile platforms. E.g. iOS might not allow you to collect such usage information, so the experiments could only be done on Android.

DOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems


I heard the talk at HIC 2011 but hardly knew what this paper was about. After reading the paper for a while, I realized that the authors simply formulate jobs in a distributed system with a so-called DOT expression, i.e. data (a row vector), operators (several column vectors) and a transformation (another function on the aggregated output of the operators). A matrix-like expression can be formulated this way. Apparently MapReduce and Dryad jobs can be expressed with it.

The authors discuss the properties of the DOT formulation a little, but I think the math stays rather shallow. I don't see any necessity in formulating those jobs with this strange expression, and the algebra doesn't reveal anything appealing either. Maybe I am wrong; after all I am not in that field.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks


Dryad is a lower-level computational model than MapReduce. The programmer has to write a DAG for a specific task: he implements the job details on each vertex and defines what kind of edges connect the vertices. In this way, the programmer first decomposes the task into a DAG. The commonly seen parallelization of splitting data has to be done manually or by some library that reads data from a GFS-like distributed store. There is no shuffling as in the reduce step, so the programmer has to build his own for this type of operation (or Dryad's library has such an implementation). The communication (i.e. the edges) can be fine-tuned by the programmer, e.g. a TCP pipe or a temporary file.

On the whole, Dryad provides a lower-level computational model on top of which other computational models (like MapReduce) can be built. But it also raises the bar for its users. I don't think MS will open-source Dryad; that's MS's style: proprietary software, limited community...

Spark: Cluster Computing with Working Sets


Spark uses a so-called RDD (resilient distributed dataset) for data-intensive operations, such as iterative machine learning algorithms, which reduces the repeated I/O of MapReduce. Similar parallel operations like map, reduce, collect and foreach are defined. Reduce is even simpler (only an associative operator, like plus). The dataset is held in memory as long as the job is not finished. Parameterized models, whose parameters can be stored in memory, will surely benefit from this design.
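A toy sketch of why the working set matters (plain Python, not the actual Spark API): the data is loaded once and reused across iterations, instead of being re-read from disk on every pass as a chained MapReduce job would.

import numpy as np

def load_points():
    # stand-in for reading from distributed storage (hypothetical data)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 10))
    y = (X @ rng.normal(size=10) > 0).astype(float)
    return X, y

X, y = load_points()                     # held in memory once, like a cached RDD
w = np.zeros(X.shape[1])
for _ in range(50):                      # iterative logistic regression over the cached data
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)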

Sunday, November 6, 2011

Swing or not to Swing: Learn When (not) to Advertise


This paper talks about how to learn whether to show ads in a given case. The basic idea is to learn a classification model based on the candidate set's features, such as relevance and cohesiveness features. This is compared to a thresholding method, which can be trivially obtained from a commonly used ad retrieval system; the classifier proves more effective when the goal is not simply to maintain a high recall rate (i.e. the more-the-better case).

Another idea is to train the model based on click feedback instead of editorial judgement. The paper mentions using online learning techniques, since most online serving models have to take the time-evolving nature of the data into account and allow the model to be trained incrementally.
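A hedged sketch of the two strategies, on hypothetical features: a global threshold on a relevance score versus a learned classifier over candidate-set features (relevance, cohesiveness, ...).

import numpy as np
from sklearn.linear_model import LogisticRegression

def threshold_policy(relevance_scores, tau=0.5):
    # baseline: show ads iff the best candidate clears a global threshold
    return relevance_scores.max() >= tau

def train_swing_classifier(features, labels):
    # features: one row of candidate-set features per query; labels: 1 if ads should be shown
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf        # at serving time: show ads iff clf.predict_proba(f)[0, 1] > 0.5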

Impedance Coupling in Content-Targeted Advertising


This paper talks about expanding the keywords of a webpage by using similar webpages. With this expansion, the retrieval of related ads can be improved. It differs from another idea, in which the expansion set is mined from a search engine.

The basic model is a Bayesian belief net: for each document D0, we take its most similar pages Dk and create links from each document to its related terms. Originally the document's own terms are sparse when used for retrieval (due to the vocabulary impedance problem). With the similar pages the candidate set gets enlarged, but we still need to select which terms are good for the current page. Take Pr(Di) as a constant, the conditional probability Pr(R | Di) as some function of the similarity to D0, and Pr(Tj | Di) as a normalized term frequency. We may then evaluate Pr(Tj | R) for each candidate term as a ranking score to choose the corresponding terms.
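Putting that paragraph into a single formula, the ranking score for a candidate term Tj presumably marginalizes over the neighboring documents (my reading, not the paper's exact notation):
Pr(Tj | R) \propto \sum_{i=0}^{k} Pr(Tj | Di) Pr(R | Di) Pr(Di)
with Pr(Tj | Di) \propto tf(Tj, Di), Pr(R | Di) \propto sim(Di, D0) and Pr(Di) constant.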

Review Spotlight: A User Interface for Summarizing User-Generated Reviews Using Adjective-Noun Word Pairs


This is an interesting HCI paper on showing reviews in the form of adjective + noun pairs. Sentiment analysis determines how positive or negative each mined pair is, which sets the color of the phrase; the sizes are determined by the popularity of the phrases. The interface helps the user form a better impression of the product of interest. The layout of the phrases is not very important. When the mouse hovers over a selected phrase, the corresponding reviews are listed (maybe with a smart summary?).

In a way this interface resembles word/tag clouds, but it does have clear advantages. Shall we implement one for some projects?


Sunday, June 19, 2011

SIGIR 2011

This time it is in Beijing.

Tutorials

I'd say most of them are quite elementary:
  • one covers machine learning techniques for IR;
  • another talks about online algorithms, which are a trending topic in ML now;

Keynotes

It is quite general but should be interesting enough.

Papers

Not sure about the content; there are no online proceedings. Not sure whether we can go there and listen to some of the talks.

Workshops

Some of them look interesting:

Hope I may sneak in.

Like like alike---Joint Friendship and Interest Propagation in Social Networks


This paper extends the idea of RLFM: RLFM can be interpreted as a supervised learning problem, and this paper extends that idea with a social network, mainly via regularizers. So that's why the probabilistic model in RLFM may not be that important (it's just an interpretation in a disciplined way). There are 7 regularizers in total, which leads to a very complicated objective function.

The optimization is solved via stochastic gradient descent. Maybe for other techniques, like the interior point method, it is too difficult to derive?

Wednesday, June 15, 2011

Regression-based Latent Factor Models


This paper talks about another way of using the rating matrix.

Given features of users and items, there are correlations between them. The CCA idea is ``unsupervised,'' since it works on two kinds of data and finds the maximum correlations without further ``supervised'' information. If we do have some ``supervised'' information, namely how correlated a user is with certain items, the problem becomes a ``supervised learning'' version. Actually, most of the literature is in this style. Haha, you might have realized my point now: is there any ``semi-supervised'' version? Yeah, I'd like to develop one :-)

OK, let's go back to this paper. With supervised information, the CCA-like problem becomes finding two linear mappings, of the user's features and the item's features, into a common subspace, so that their inner product approximates the given rating. This might be problematic, since the ratings can obviously be inconsistent with their face values. But anyway we have the basic idea: it's like what we get in PCA.

The second step is to make a probabilistic model out of it (as PCA becomes PPCA), so that everything gets an interpretation along with a disciplined way to train the model and make inferences. The paper's interpretation: there is a common latent space from which each user/item factor is drawn (zero mean, some variance?), obtained from the original feature space via a linear transformation. The observed rating is the inner product of the factors plus noise. With this graphical model, we may do the learning via EM; here the authors chose MCEM (Monte Carlo sampling to compute the expectation).
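In symbols, the generative story above is roughly (biases omitted, notation mine):
u_i = G x_i + \epsilon^u_i,  v_j = D x_j + \epsilon^v_j,  r_{ij} = u_i^\top v_j + noise
where x_i and x_j are the user and item features, G and D are the learned linear (regression) mappings into the common latent space, and the \epsilon terms are zero-mean Gaussian noise.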

Let's try some of the ideas.

Monday, June 13, 2011

Inferring Relevant Social Networks from Interpersonal Communication


The paper makes the point that social networks are not really usable in their original form; we usually have to remove some edges, to an extent that depends on the specific task. My take is that social edges are like aqueducts or wires: different behaviors have different propagation coefficients.

Another important thing in this paper is the set of features, in three categories (a rough sketch of some of them follows the list):
  • reach
    • node degree
    • average neighbor degree
    • size of two-hop neighborhood
  • closure
    • embeddedness, an average of ratios of common friends over or-friends;
    • normalized clustering coefficient, the average probability that two of my neighbors are friends
  • bridge
    • network constraints (sum of all neighbors, each inner producted with the node)
    • ego components (after removal, how many new components will appear)
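A rough sketch of some of these features with networkx (the exact definitions in the paper may differ; embeddedness here is the Jaccard ratio of common friends over or-friends, averaged over neighbors):

import networkx as nx

def node_features(G, v):
    nbrs = set(G.neighbors(v))
    two_hop = {u for n in nbrs for u in G.neighbors(n)} - nbrs - {v}
    emb = []
    for n in nbrs:
        n_nbrs = set(G.neighbors(n))
        union = (nbrs | n_nbrs) - {v, n}
        emb.append(len(nbrs & n_nbrs) / len(union) if union else 0.0)
    return {
        "degree": G.degree(v),                                        # reach
        "avg_neighbor_degree": nx.average_neighbor_degree(G, nodes=[v])[v],
        "two_hop_size": len(two_hop),
        "embeddedness": sum(emb) / len(emb) if emb else 0.0,          # closure
        "clustering": nx.clustering(G, v),
    }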

Sentiment Analysis with Global Topics and Local Dependency


This paper talks about the interaction between topics and sentiment. The first idea is to impose an additional factor in generating a word: the topic decides the sentiment, and the sentiment together with the topic decides the word. However, this structure can't determine where the writer changes his attitude, so they impose a hidden Markov structure on sentiment transitions. Still, the paper only gives some inference results, and many parameters are set manually.

Monday, June 6, 2011

An Efficient Method For Compressive Sensing


This paper talks about how the interior point method is applied to the L1-regularized least-squares problem. The dual provides an estimate of the duality gap, which then serves as the convergence criterion. The log-barrier method formulates the primal objective function, which is solved by Newton's method. Newton's method requires solving a linear system, which is addressed by conjugate gradient. The accuracy of the solution can be adjusted as the search evolves.
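If I recall the formulation correctly, the L1 term is handled with auxiliary bound variables u, and the log barrier is applied to the resulting box constraints:
min_{x, u} ||A x - y||_2^2 + \lambda \sum_i u_i  subject to  -u_i \leq x_i \leq u_i
\phi_t(x, u) = t (||A x - y||_2^2 + \lambda \sum_i u_i) - \sum_i \log(u_i + x_i) - \sum_i \log(u_i - x_i)
Newton's method minimizes \phi_t for increasing t along the central path, and a dual feasible point constructed from the current iterate gives the duality-gap estimate mentioned above.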

On a single machine this method may deal with millions of variables, in compressive sensing problems.

Near-Optimal Hashing Algorithms For Approximate Nearest Neighbor in High Dimensions


This is a survey paper on locality-sensitive hashing (LSH). LSH is quite useful for approximately retrieving similar items according to a given distance. The fact is, once we have enough data, those approximate nearest neighbors perform about as well as exact ones, so why bother computing the exact version?

The idea is based on stochastic algorithms common in the literature: select a random projection and combine the resulting bins into the final hash table. To see intuitively how this works: given a query, a single random projection into a bin cuts down, to some extent, the probability that the retrieved points are far from the query, and when more projections are combined we get closer. E.g. for Hamming distance, the projection can be simplified further to a subset of randomly chosen coordinate indices.
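A minimal sketch of this bit-sampling scheme for Hamming distance (my own toy implementation, not the paper's code): each table hashes a point by a random subset of its coordinates, so near points collide with higher probability than far ones.

import random
from collections import defaultdict

class BitSamplingLSH:
    def __init__(self, dim, k=8, num_tables=10, seed=0):
        rng = random.Random(seed)
        self.subsets = [rng.sample(range(dim), k) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _keys(self, x):
        return [tuple(x[i] for i in subset) for subset in self.subsets]

    def add(self, idx, x):                    # x: binary vector (sequence of 0/1)
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, x):                       # candidates to re-rank by exact Hamming distance
        candidates = set()
        for table, key in zip(self.tables, self._keys(x)):
            candidates.update(table[key])
        return candidates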

There is some software implementing this LSH idea; I would like to try it in the coming days.

Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models


This work might have inspired the previously scanned paper, in that the proposed algorithm looks quite similar. This paper focuses only on CME (a special case of convex optimization), so the result is of comparatively limited use in practice. The key difference between the two is that this paper only employs a regular batch solver, unlike the stochastic solver in the latter.

This algorithm is map/reduce friendly, though.

Parallelized Stochastic Gradient Descent


This paper talks about a very interesting optimization technique. In online learning, SGD (stochastic gradient descent) is usually applied to optimize the model, since the data come in one by one (or mini-batch by mini-batch). For parametric models, the parameters are updated with one or a few samples at a time, sequentially.

This paper presents a parallelized version. Essentially we run several SGD instances on different machines and aggregate their results by averaging. Moreover, we may not even need to distribute all the data to every machine. The proof of convergence looks dependent on the convexity of the objective function, but I suspect it may not be necessary.
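A hedged sketch of the parameter-averaging pattern described above (squared loss as a stand-in objective; this is the pattern, not the paper's exact algorithm or analysis):

import numpy as np

def local_sgd(X, y, lr=0.01, epochs=1):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):              # one sample at a time, squared loss
            w -= lr * (xi @ w - yi) * xi
    return w

def parallel_sgd(X, y, num_workers=4):
    shards = np.array_split(np.arange(len(y)), num_workers)
    ws = [local_sgd(X[idx], y[idx]) for idx in shards]    # each shard would run on its own machine
    return np.mean(ws, axis=0)                            # aggregate by simple averaging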

The result is quite interesting when we consider the popular parallel computation framework, MapReduce. We'd better implement one ASAP.

Friday, January 21, 2011

Random Walks for Image Segmentation


This paper proposes a random-walk-based algorithm for image segmentation. The idea is quite straightforward: some parts of the image are labeled by the user, and the remaining areas are segmented using the probability that a random walk started from each pixel first reaches each of these landmarks.

The interesting part of the paper is that it reduces computing these probabilities to solving a (set of) linear systems, instead of an eigenvalue problem:
L_U X = -B M
where L_U is the graph Laplacian block of the unlabelled samples, B is the cross block between the unlabelled and labelled samples in the Laplacian of the whole graph, and M holds the seed labels.
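A sketch of the resulting computation with scipy's sparse solver (the ordering and the one-hot seed-label matrix M are my own notation):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def random_walker_probs(L, seed_idx, seed_labels, num_labels):
    # L: sparse CSR graph Laplacian (n x n); seed_idx: labelled node indices; seed_labels: their labels
    n = L.shape[0]
    unlabelled = np.setdiff1d(np.arange(n), seed_idx)
    L_U = sp.csc_matrix(L[unlabelled][:, unlabelled])
    B = L[unlabelled][:, seed_idx]                       # cross block (unlabelled x labelled)
    M = np.zeros((len(seed_idx), num_labels))            # one-hot seed labels
    M[np.arange(len(seed_idx)), seed_labels] = 1.0
    X = splu(L_U).solve(np.asarray(-(B @ M)))            # solve L_U X = -B M for all labels at once
    return X                                             # rows: unlabelled nodes, cols: label probabilities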

The idea was used in t-SNE to build p-table for landmarks.

Friday, January 14, 2011

Annotating Photo Collections by Label Propagation According to Multiple Similarity Cues


The image annotation problem is actually multi-label propagation, unlike the three previous papers. We are only interested in the label propagation part; actually it is just weighted counting for independent labels. Nothing new :-(

Thursday, January 13, 2011

Semi-supervised Classification Using Linear Neighborhood Propagation


The so-called linear neighborhood propagation relies on the idea from LLE. Instead of using the graph weights directly, as in the former methods (cf. Zhou's and Zhu's papers, the two previous posts), the weights are computed using the idea from LLE. So in a way Zhou's version corresponds to diffusion maps, Zhu's to Laplacian eigenmaps, and this one to LLE; we may find those counterparts in manifold learning.

The procedure for calculating the weights is identical to that of LLE (minimizing the affine reconstruction error). The weights are then used to propagate the labels with the same objective function as in semi-supervised LLE (or landmark LLE). In this way, it eliminates the need to select a width for the Gaussian kernel.
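A quick sketch of the LLE-style weight computation (the standard LLE recipe; the paper's exact regularization may differ): reconstruct each point from its k nearest neighbors with weights constrained to sum to one.

import numpy as np

def lle_weights(X, k=10, reg=1e-3):
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]   # skip the point itself
        Z = X[nbrs] - X[i]                                             # centered neighbors
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(k)                             # regularize the local Gram matrix
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                                       # affine (sum-to-one) reconstruction weights
    return W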

Well, why not make a LTSA version? Haha...

Semi-supervised Learning Using Gaussian Fields and Harmonic Functions


This is an even earlier piece of work on label propagation in the semi-supervised setting. There are many fancy nouns, Gaussian random fields, energy minimization and harmonics... but it ends up resembling the previous paper.

They explain the label propagation as the minimization of the so-called energy function of a Gaussian field, which is actually the quadratic form of the Laplacian and therefore corresponds to the so-called elliptic PDE (hence many more nouns: Green's functions, harmonics). A small difference is that they use D^{-1} W (the one-step transition matrix) instead of the symmetrized, normalized version D^{-1/2} W D^{-1/2} in the previously scanned paper. The closed-form solution is obtained via a Schur complement instead of the iterative procedure in Zhou's paper.
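For reference, the harmonic (closed-form) solution on the unlabelled block, writing P = D^{-1} W and splitting everything into labelled/unlabelled blocks, is
f_U = (D_{UU} - W_{UU})^{-1} W_{UL} f_L = (I - P_{UU})^{-1} P_{UL} f_L
which is exactly the Schur-complement solve mentioned above.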

The paper also discusses connections with random walks and electric networks, graph kernels, spectral clustering and N-cut. I believe that, due to these connections, this propagation idea must have been discussed in all kinds of variants (normalized? symmetrized?).

Wednesday, January 12, 2011

Learning with Local and Global Consistency


This paper is about label propagation in the semi-supervised learning setting. The basic idea of label propagation is to treat the graph as a random-walk structure. The label is propagated with the following iteration
Y(t + 1) = \alpha S Y(t) + (1 - \alpha) Y(0)
where Y(t) is the label matrix (in multi-class classification, each labeled sample's row is a unit vector), Y(0) holds the initial labels and S is the normalized weight matrix D^{-1/2} W D^{-1/2}. The iterative procedure converges to the limit
Y^\star = (1 - \alpha) (I - \alpha S)^{-1} Y(0) \propto (I - \alpha S)^{-1} Y(0)
This reveals some connection with the graph Laplacian.
The interesting part is how we may develop a parallel version of this idea.
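A quick numpy sketch of the iteration above (dense, for illustration only; Y0 holds one-hot rows for the labeled samples and zero rows otherwise):

import numpy as np

def propagate(W, Y0, alpha=0.99, iters=100):
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt                      # normalized weight matrix
    Y = Y0.copy()
    for _ in range(iters):
        Y = alpha * S @ Y + (1 - alpha) * Y0             # the iteration above
    return Y.argmax(axis=1)                              # predicted class per sample

# closed form for comparison: np.linalg.solve(np.eye(len(W)) - alpha * S, Y0)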

Correlated Label Propagation with Application to Multi-label Learning


This paper talks about multi-label propagation. Label propagation is usually seen in semi-supervised learning tasks; for multi-label tasks, the critical difference is that labels may be correlated (co-occur).

The paper proposes a linear programming model for the task
\max_{z \in \mathbb{R}^n} \sum_{k = 1}^n \alpha_k z_k \qquad \text{s.t. } \forall t \in \{ 0, 1\}^n, z^\top t \leq \sum_{i = 1}^N K(x_i, x_q) \Omega(t^\top t(S_i)), z \succeq 0
where \Omega is a concave function. The interpretation of this objective function is: for any possible labelling t of the test sample, the propagated score is bounded from above by the sum of scores from the samples sharing those labels, weighted by the similarity K(\cdot, \cdot).
The amazing thing about this objective is that a greedy algorithm finds the optimum when \Omega is concave, and the solution is determined only by the order of the undetermined coefficients \alpha_k. This optimization is related to so-called submodular optimization.
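If I read the LP correctly, it is a linear program over a polymatroid, so the greedy solution just assigns each z_k its marginal gain in the order of decreasing \alpha_k. A hedged sketch (the set function f and the label sets are my own notation for the right-hand side above):

import numpy as np

def make_f(K_vals, label_sets, omega):
    # f(S) = sum_i K(x_i, x_q) * Omega(|S ∩ labels_i|), the bound on the right-hand side
    return lambda S: sum(K * omega(len(S & labels)) for K, labels in zip(K_vals, label_sets))

def greedy_lp(alpha, f):
    order = np.argsort(-np.asarray(alpha))         # only the order of the alphas matters
    z = np.zeros(len(alpha))
    S = frozenset()
    for k in order:
        if alpha[k] <= 0:                          # non-positive coefficients get zero mass
            break
        S_new = S | {int(k)}
        z[k] = f(S_new) - f(S)                     # marginal gain of adding label k
        S = S_new
    return z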