Saturday, December 31, 2011

AppJoy: Personalized Mobile Application Discovery


This paper introduces the authors' app AppJoy, which collects users' app usage on a mobile device, derives similarity scores between apps based on that usage, and recommends apps based on the similarity scores. The idea is to utilize the time and spatial information along with the apps. In contrast to other recommendation algorithms based on ratings, app usage may serve as a much better feature. I guess this shows that a simpler model with strong features might serve better than a complicated model with weak features.
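As a toy illustration of usage-based recommendation (my own sketch, not AppJoy's actual algorithm; the data and function names are made up), apps can be represented by usage vectors over users and scored by cosine similarity:

```python
import math

# Hypothetical sketch (not AppJoy's actual method): each app is a usage
# vector over users, and apps with similar usage patterns rank higher.
def cosine(u, v):
    """Cosine similarity between two usage vectors (dicts: user -> minutes)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(usage, installed, top_n=3):
    """Rank apps the user lacks by their best similarity to an installed app."""
    scores = {app: max(cosine(vec, usage[a]) for a in installed)
              for app, vec in usage.items() if app not in installed}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

usage = {"maps": {"u1": 10, "u2": 5},
         "nav":  {"u1": 8, "u2": 6},
         "game": {"u3": 20}}
print(recommend(usage, {"maps"}))  # ['nav', 'game']
```

The point is only that overlapping usage patterns carry more signal than sparse explicit ratings; a real system would of course weigh in the time and location context too.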

It seems that, if we decide to move on to applications on mobile devices, we had better get to know the details of the mobile platforms. For example, iOS might not allow you to collect such usage information, so the experiments could only be done on Android.

DOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems


I heard the talk at HIC 2011 but hardly knew what this paper was about. After reading the paper for a while, I realized that the authors simply formulate jobs in a distributed system with a so-called DOT expression, i.e. data (a row vector), operators (several column vectors) and a transformation (another function on the aggregated output of the operators). A matrix-like expression can therefore be formulated this way. Apparently MapReduce and Dryad jobs can both be expressed with it.
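My toy reading of the DOT expression (an interpretation of mine, not the paper's exact formalism): apply every operator to every data chunk, then aggregate each operator's outputs with the transformation.

```python
# Toy DOT-style evaluation (my interpretation, not the paper's formalism):
# chunks of data form a "row vector", each operator is applied to every
# chunk (like map), and a transformation aggregates the operator outputs
# (like reduce).
def dot(data_chunks, operators, transform):
    # Matrix-like step: apply every operator to every data chunk.
    intermediate = [[op(chunk) for chunk in data_chunks] for op in operators]
    # Transformation step: aggregate each operator's outputs.
    return [transform(outputs) for outputs in intermediate]

# Word-count-style example: one operator counts words per chunk,
# the transform sums the counts across chunks.
chunks = ["a b a", "b c"]
count_words = lambda text: len(text.split())
print(dot(chunks, [count_words], sum))  # [5]
```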

The authors discussed a few properties of the DOT formulation, but I think the math is rather shallow. I don't see any necessity in formulating those jobs with this strange expression, and the algebra doesn't reveal anything appealing either. Maybe I am wrong; after all, I am not in that field.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks


Dryad is a lower-level computational model than MapReduce. The programmer has to write a DAG for a specific task: he implements the job details on each vertex and defines what kind of edges connect the vertices. In this way, the programmer first decomposes the task into a DAG. The commonly seen parallelization of splitting data has to be done manually or by some library that reads data from a GFS-like distributed storage. There is no shuffling as in the reduce step, so the programmer has to build his own for this type of operation (or Dryad's library has such an implementation). The communication (i.e. the edges) can be fine-tuned by the programmer, e.g. as a TCP pipe or a temporary file.
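A minimal sketch of the vertex-and-edge idea (my own toy model, nothing like Microsoft's real API; names like `Dag.vertex` are made up): each vertex runs a user-supplied function over the outputs of its predecessors, and edges just carry data between vertices.

```python
from collections import defaultdict

# Hypothetical Dryad-style DAG: vertices hold user code, edges carry data.
class Dag:
    def __init__(self):
        self.fn = {}                     # vertex -> fn(list_of_inputs) -> output
        self.edges = defaultdict(list)   # vertex -> predecessor vertices

    def vertex(self, name, fn, inputs=()):
        self.fn[name] = fn
        self.edges[name] = list(inputs)

    def run(self):
        done = {}
        def eval_vertex(v):
            # Recursively evaluate predecessors, memoizing finished vertices.
            if v not in done:
                done[v] = self.fn[v]([eval_vertex(p) for p in self.edges[v]])
            return done[v]
        return {v: eval_vertex(v) for v in self.fn}

dag = Dag()
dag.vertex("read", lambda _: [3, 1, 2])                       # source vertex
dag.vertex("sort", lambda ins: sorted(ins[0]), inputs=["read"])
result = dag.run()["sort"]   # [1, 2, 3]
```

In the real system each vertex would be a process on a cluster machine and each edge a TCP pipe or a file, but the decomposition burden on the programmer is the same.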

On the whole, Dryad provides a lower-level computational model, on top of which other computational models (like MapReduce) can be built. But it also raises the bar for its users. I don't think MS will open-source Dryad, and this is MS's style: proprietary software, limited community...

Spark: Cluster Computing with Working Sets


Spark uses a so-called RDD (resilient distributed dataset) for data-intensive operations, like iterative machine learning algorithms, which reduces the repeated IO of MapReduce. Similar parallel operations like map, reduce, collect and foreach are defined. Reduce is even simpler (only an associative operator, like plus). The dataset is held in memory as long as the job is not finished. Parameterized models, when the parameters can be stored in memory, will surely benefit from this design.
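A toy analogue of why the in-memory working set helps iterative jobs (my sketch, not Spark's API; the data and rates are made up): load the data once, then scan it on every iteration, where chained MapReduce jobs would re-read the input from disk each time.

```python
# Toy analogue of an RDD kept in memory across iterations (not Spark's API).
def load_points():
    # Stand-in for reading from distributed storage; chained MapReduce jobs
    # would pay this cost every iteration, here it is paid only once.
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

cached = load_points()   # "cache" the dataset in memory

w = 0.0                  # fit y ~ w * x by gradient descent
for _ in range(100):     # each iteration only scans the cached data
    grad = sum(2 * (w * x - y) * x for x, y in cached) / len(cached)
    w -= 0.05 * grad     # w converges to about 2.04, the least-squares slope
```

The loop body is exactly the kind of repeated map-plus-associative-reduce over a fixed dataset that the paper targets.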