Thursday, July 16, 2009

Efficient Projections onto the l1-Ball for Learning in High Dimensions


Projection onto the L1 ball is another way of imposing the L1-norm penalty in regularized learning problems. That is to say, instead of the penalized problem
\min_w \Omega(w) + \lambda \| w \|_1,
we may optimize the constrained one,
\min_{w} \Omega(w) \quad \text{s.t. } \| w \|_1 \leq z,
where z is a constant. The former can be seen as a family of optimization problems indexed by the regularization parameter \lambda, and the latter as a family indexed by the constant z. One can show that the two forms are equivalent.
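
To see the equivalence (a sketch, assuming \Omega is convex so that strong duality holds): introduce a multiplier \lambda \geq 0 for the constraint,
L(w, \lambda) = \Omega(w) + \lambda \big( \| w \|_1 - z \big).
At the optimal multiplier \lambda^\star, the term \lambda^\star z is a constant that does not affect the minimizer, so minimizing L(w, \lambda^\star) over w is exactly the penalized problem with regularization parameter \lambda = \lambda^\star.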

Therefore, working with the latter form, the optimization becomes finding a minimizer of the loss function on the L1 ball. If the unconstrained minimizer happens to lie inside the ball, it is the solution; otherwise, the solution must lie on the boundary of the ball. So the difficult part is how to project vectors onto the L1 ball efficiently. This also explains why the solution should be sparse: as we will see, the projection truncates small coordinates exactly to zero.

The first simple case is
\min_w \frac{1}{2} \| w - v \|_2^2 \quad \text{s.t. } \sum_{i = 1}^n w_i = z, \; w_i \geq 0,
where v lies inside the positive cone. With a little analysis, we see this is actually very simple. When \| v \|_1 \leq z, the solution is obvious. Otherwise, we should decrease the coordinates. Using a Lagrange multiplier, one sees that every surviving coordinate is shrunk by the same amount \theta. As \theta increases, the norm \| w \|_1 decreases piecewise linearly, first with slope n, then n - 1, and so on: the slope changes whenever a coordinate vanishes. Therefore we stop when this piecewise linear function meets \| w \|_1 = z; we just need to find the intersection. Implemented with a brute-force sort, this ends up as an O(n \log n) algorithm, but with a little care (a randomized pivoting scheme in the spirit of quicksort), the magic turns it into an expected O(n) algorithm.
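
Here is a minimal sketch of the sort-based O(n \log n) version in Python (the function name and NumPy usage are my own, not from the paper); the expected O(n) variant replaces the full sort with quicksort-style partitioning.

import numpy as np

def project_to_simplex(v, z):
    # Project v onto {w : sum(w) = z, w >= 0}: shift every coordinate
    # by the same theta and truncate at zero.
    u = np.sort(v)[::-1]                          # decreasing order
    cumsum = np.cumsum(u)
    # rho = last index at which shifting still leaves the coordinate positive
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cumsum - z)[0][-1]
    theta = (cumsum[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

For example, project_to_simplex(np.array([0.8, 0.6, 0.1]), 1.0) returns [0.6, 0.4, 0.0]: the two large coordinates are shrunk by \theta = 0.2 and the smallest is truncated to zero.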

Then we come to the case where v may lie in any other orthant: by symmetry, we can project the vector of absolute values and restore the signs afterwards. Finally, a tree structure is used to boost the speed. In an implementation of a gradient-based algorithm, such as
w^{(t + 1)} = \Pi_W \Big( w^{(t)} + g^{(t)} \Big),
we calculate the projection onto the L1 ball, i.e. \Pi_W(\cdot), at every step. The idea is to maintain the coordinates in a red-black tree (I need to review this data structure), so that the projection can be done in time that scales linearly in the number of non-zero entries of g^{(t)} and logarithmically in the total number of features n.
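
A minimal sketch of the dense version of this projection (not the tree-based one), reusing project_to_simplex from above; the toy gradient below is made up for illustration.

def project_to_l1_ball(v, z):
    # Inside the ball the projection is v itself.
    if np.abs(v).sum() <= z:
        return v.copy()
    # Otherwise project |v| onto the simplex and restore the signs.
    return np.sign(v) * project_to_simplex(np.abs(v), z)

# One projected-gradient update w^{(t+1)} = Pi_W(w^{(t)} + g^{(t)}):
w = np.zeros(5)
g = np.array([0.5, -2.0, 0.3, 0.0, 1.2])          # a toy (sub)gradient
w = project_to_l1_ball(w + g, 1.0)                # w stays in the L1 ball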

I will compare this one with the ICML 09 paper and several other papers later.
