Saturday, December 31, 2011

Spark: Cluster Computing with Working Sets


Spark introduces the RDD (resilient distributed dataset) for data-intensive workloads, like iterative machine learning algorithms, which removes the repeated disk I/O that MapReduce incurs between passes. Parallel operations similar to MapReduce's are defined: map, reduce, collect and foreach. Its reduce is actually simpler than MapReduce's, since it only takes an associative operator, like plus. A cached dataset is held in memory as long as the job is not finished, so parameterized models whose parameters fit in memory will surely benefit from this design.
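
A minimal Scala sketch of that pattern (the input path, learning rate and iteration count are placeholders of my own): the dataset is parsed and cached once, each iteration runs a map followed by a reduce with the associative plus operator, and the single model parameter stays in driver memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MeanByGradientDescent {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mean-gd-sketch"))

    // Parse once, then cache: later iterations reuse the in-memory RDD
    // instead of re-reading the input from disk each pass.
    val xs = sc.textFile("hdfs://.../numbers.txt")  // placeholder path
      .map(_.toDouble)
      .cache()

    val n = xs.count()
    var w = 0.0    // single model parameter, kept in driver memory
    val lr = 0.1   // placeholder learning rate

    for (_ <- 1 to 20) {
      // map emits the per-point gradient of (w - x)^2 / 2;
      // reduce combines the gradients with the associative "+".
      val grad = xs.map(x => w - x).reduce(_ + _) / n
      w -= lr * grad
    }

    println(s"Estimated mean after gradient descent: $w")
    sc.stop()
  }
}
```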
