Saturday, December 31, 2011

Spark: Cluster Computing with Working Sets


Spark introduces the RDD (resilient distributed dataset) for data-intensive workloads, like iterative machine learning algorithms, which removes the repeated disk I/O that MapReduce incurs between passes. Parallel operations similar to MapReduce's are defined: map, reduce, collect and foreach. Its reduce is actually simpler than MapReduce's, since it only takes an associative operator, like plus. A cached dataset is held in memory as long as the job is not finished, so parameterized models whose parameters fit in memory will surely benefit from this design.
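
A minimal Scala sketch of that pattern (the input path, learning rate and iteration count are placeholders of my own): the dataset is parsed and cached once, each iteration runs a map followed by a reduce with the associative plus operator, and the single model parameter stays in driver memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MeanByGradientDescent {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mean-gd-sketch"))

    // Parse once, then cache: later iterations reuse the in-memory RDD
    // instead of re-reading the input from disk each pass.
    val xs = sc.textFile("hdfs://.../numbers.txt")  // placeholder path
      .map(_.toDouble)
      .cache()

    val n = xs.count()
    var w = 0.0    // single model parameter, kept in driver memory
    val lr = 0.1   // placeholder learning rate

    for (_ <- 1 to 20) {
      // map emits the per-point gradient of (w - x)^2 / 2;
      // reduce combines the gradients with the associative "+".
      val grad = xs.map(x => w - x).reduce(_ + _) / n
      w -= lr * grad
    }

    println(s"Estimated mean after gradient descent: $w")
    sc.stop()
  }
}
```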
