Monday, December 6, 2010

A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees


This paper addresses training on large-scale data that is split in one of two ways: horizontal fragmentation (each site holds a subset of the samples) and vertical fragmentation (each site holds a subset of the features). The orientation refers to the sample matrix written row-wise, with one sample per row.

The key idea behind this paper is to extract sufficient statistics from each split of the data so that aggregating those statistics at the end yields exactly the model we would get from the pooled data. This is quite direct in the case of a mixture of experts (when each expert is an exponential-family model); this paper, however, puts the emphasis on decision trees.
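The exponential-family case can be sketched concretely. Below is a minimal, illustrative example (not from the paper): for a 1-D Gaussian, the sufficient statistics are the count, the sum, and the sum of squares; each site computes them locally, and summing them element-wise gives exactly the same maximum-likelihood fit as pooling the raw data.

```python
# Illustrative sketch: exact distributed MLE for a 1-D Gaussian via
# sufficient statistics. Function names and data are made up.

def site_stats(xs):
    """Sufficient statistics computed locally at one site: (n, sum x, sum x^2)."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def combine(stats_list):
    """Aggregate across sites by element-wise summation, then fit."""
    n = sum(s[0] for s in stats_list)
    s1 = sum(s[1] for s in stats_list)
    s2 = sum(s[2] for s in stats_list)
    mean = s1 / n
    var = s2 / n - mean * mean  # MLE (biased) variance
    return mean, var

site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
mean, var = combine([site_stats(site_a), site_stats(site_b)])
# Identical to fitting on the pooled data [1, 2, 3, 4, 5].
```

No approximation is involved: the aggregation loses nothing because the statistics are sufficient for the model parameters.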

The decision trees are trained level by level, choosing at each node the feature split that maximizes the information gain (or a similar criterion). The paper discusses two cases:
  • if the data are horizontally split, each site computes the required statistics locally; the statistics are combined to find the best split, and the chosen split is then passed back to each site so that every site knows which node each of its samples falls into and can collect the next round of sufficient statistics.
  • if the data are vertically split, each site can find the best split over its own features; these candidate splits are compared to pick the overall best. The winning split must then be communicated as sample indices, so that the other sites (which do not hold the splitting feature) know which samples fall into which node.
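The horizontal case can be sketched as follows. This is an illustrative reconstruction, not the paper's code: each site sends per-(feature, value) label counts, the coordinator sums them (plain Counter addition), and scores each candidate split by weighted child entropy, exactly as if the data were centralized.

```python
from collections import Counter
from math import log2

def site_counts(rows, labels):
    """Local sufficient statistics: label counts per (feature, value)."""
    counts = Counter()
    for row, y in zip(rows, labels):
        for feat, val in row.items():
            counts[(feat, val, y)] += 1
    return counts

def entropy(label_hist):
    total = sum(label_hist.values())
    return -sum(c / total * log2(c / total) for c in label_hist.values() if c)

def best_split(global_counts):
    """Pick the feature minimizing weighted child entropy (max info gain)."""
    by_feat = {}
    for (feat, val, y), c in global_counts.items():
        by_feat.setdefault(feat, {}).setdefault(val, Counter())[y] += c
    best = None
    for feat, by_val in by_feat.items():
        n = sum(sum(h.values()) for h in by_val.values())
        score = sum(sum(h.values()) / n * entropy(h) for h in by_val.values())
        if best is None or score < best[1]:
            best = (feat, score)
    return best[0]

# Two sites, each holding a subset of the samples (toy data).
rows1, labels1 = [{"a": 0, "b": 0}, {"a": 1, "b": 0}], [0, 1]
rows2, labels2 = [{"a": 0, "b": 1}, {"a": 1, "b": 1}], [0, 1]

# Counter addition implements the cross-site aggregation.
total = site_counts(rows1, labels1) + site_counts(rows2, labels2)
chosen = best_split(total)  # feature "a" separates the labels perfectly
```

Only the counts travel over the network; the per-sample data never leaves its site, yet the chosen split is the same one a centralized learner would pick.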

This solution is exact, and it maps directly onto implementing decision trees in a map/reduce framework. It should also apply to many other hierarchical models.
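The map/reduce fit is easy to see in miniature. In this hedged sketch (names are illustrative, not the paper's), the map phase emits local counts over each data shard and the reduce phase sums them; one such round corresponds to gathering the statistics for one level of the tree.

```python
from collections import Counter
from functools import reduce

def map_phase(shard):
    """Mapper: emit local (feature_value, label) counts for one shard."""
    return Counter(shard)

def reduce_phase(partials):
    """Reducer: sum the per-shard counts into global statistics."""
    return reduce(lambda a, b: a + b, partials, Counter())

shards = [
    [("sunny", 1), ("rain", 0)],  # worker 1's samples
    [("sunny", 1)],               # worker 2's samples
]
totals = reduce_phase(map_phase(s) for s in shards)
```

Because the reduction is a plain sum, it is associative and commutative, which is exactly what a map/reduce combiner step requires.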
