Monday, June 6, 2011

Parallelized Stochastic Gradient Descent


This paper talks about a very interesting optimization technique. In online learning, SGD (stochastic gradient descent) is the usual way to optimize the model, since the data arrive one sample (or one mini-batch) at a time. For parametric models, the parameters are updated sequentially with one or a few samples per step.
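
As a concrete baseline, here is a minimal sketch of that sequential update. The squared-error loss, the fixed learning rate, and the helper name sgd are my own illustrative choices, not taken from the paper:

import numpy as np

def sgd(stream, dim, lr=0.01):
    # plain sequential SGD: one update per incoming sample
    w = np.zeros(dim)
    for x, y in stream:              # samples arrive one by one
        grad = (w @ x - y) * x       # gradient of 0.5 * (w.x - y)^2 w.r.t. w
        w -= lr * grad
    return w

# tiny usage example on synthetic linear-regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
w_hat = sgd(zip(X, X @ w_true), dim=5)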

This paper proposes a parallelized version: essentially, we run several independent SGD instances on different machines and aggregate their results by averaging the parameters. Interestingly, we do not even need to give every machine the full data set; each machine can work on just its own subset. The proof of convergence appears to rely on the convexity of the objective function, but I suspect that assumption may not be essential.
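
Here is a minimal sketch of that averaging scheme, again assuming a squared-error loss and made-up helper names (local_sgd, parallel_sgd). The separate machines are simulated by an ordinary loop, so only the partition-then-average idea is illustrated:

import numpy as np

def local_sgd(shard, dim, lr=0.01):
    # one machine's run: plain SGD over its own subset of the data
    w = np.zeros(dim)
    for x, y in shard:
        w -= lr * (w @ x - y) * x    # squared-error gradient step
    return w

def parallel_sgd(shards, dim, lr=0.01):
    # in practice each shard lives on a separate machine and the runs
    # execute concurrently; here a list comprehension stands in for that
    local_solutions = [local_sgd(s, dim, lr) for s in shards]
    return np.mean(local_solutions, axis=0)   # average the parameter vectors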

The result is quite interesting when we consider the popular parallel computation framework, MapReduce. We'd better implement one ASAP.
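
One way it could fit the framework, as a rough sketch rather than an actual Hadoop job: each mapper runs SGD over its own split and emits its local weights, and a single reducer averages them. The mapper/reducer/run_job names and the local driver below are illustrative only:

import numpy as np

def mapper(shard, dim=5, lr=0.01):
    # map phase: one mapper per data split runs SGD locally
    w = np.zeros(dim)
    for x, y in shard:
        w -= lr * (w @ x - y) * x
    yield "weights", w

def reducer(key, values):
    # reduce phase: a single reducer averages all local parameter vectors
    yield key, np.mean(list(values), axis=0)

def run_job(shards):
    # local stand-in for the framework: map over shards, group by key, reduce
    local_ws = [w for shard in shards for _, w in mapper(shard)]
    return next(reducer("weights", local_ws))[1]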
