Sunday, April 5, 2009

Infinitely Imbalanced Logistic Regression


This paper studies binary classification when we have a fixed set of samples from the positive class but infinitely many samples from the negative class. The main result is formulated in the following theorem:
Let n >= 1 and x_1, ..., x_n in R^d be fixed, and suppose that F_0 satisfies the tail condition
\int e^{x^\top \beta}( 1 + \| x\|) \,\mathrm{d} F_0(x) < \infty, \qquad \forall \beta \in \mathbb{R}^d
and surrounds (roughly speaking, the point below must lie in the interior of the convex hull of the support of F_0)
\bar{x} =\frac{1}{n} \sum_{i = 1}^n x_i.
Then the maximizer \beta_N of the centered log-likelihood (based on the n positive samples and N negative samples) satisfies
\lim_{N \to\infty} \frac{\int e^{x^\top \beta_N} x \,\mathrm{d} F_0(x)}{\int e^{x^\top \beta_N}\,\mathrm{d} F_0(x)} = \bar{x}.
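
Out of curiosity, here is a minimal simulation sketch of the theorem in Python (my own; the choice F_0 = N(0, I), the sample sizes, and the use of scikit-learn are assumptions, not anything from the paper). Since the exponentially tilted mean of a standard Gaussian is beta itself, the theorem predicts that the fitted slope should approach \bar{x} as N grows, while the intercept drifts to minus infinity.

# Sketch: logistic regression with fixed positives and ever more negatives.
# F_0 = N(0, I) is my assumed negative-class distribution; for this F_0 the
# tilted mean equals beta, so beta_N should approach x_bar as N grows.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 2, 5
x_pos = rng.normal(loc=1.0, scale=0.5, size=(n, d))   # fixed positive samples
x_bar = x_pos.mean(axis=0)
print("x_bar =", x_bar)

for N in [10**3, 10**4, 10**5]:
    x_neg = rng.standard_normal((N, d))                # N draws from F_0 = N(0, I)
    X = np.vstack([x_pos, x_neg])
    y = np.concatenate([np.ones(n), np.zeros(N)])
    # nearly unregularized logistic regression with an intercept
    clf = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)
    print(f"N = {N:>6}: beta_N =", clf.coef_.ravel())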

This theorem tells us several important things:
  • Two conditions must hold for β not to diverge: the tail condition and the surrounding condition. Violating either one yields a counterexample in which β diverges.
  • The limiting β relates to the positive samples only through their mean \bar{x}, so it does NOT really matter how they are distributed.
  • The author suggests this is useful for understanding the behavior of logistic regression under extreme imbalance: removing an outlier among the positive samples, or moving it towards the mean, can give a better model.
  • If F_0 is a Gaussian (or a mixture of Gaussians), β can be calculated as in LDA, the generative counterpart; see the short derivation after this list for the Gaussian case.
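
For the Gaussian case the limit can be made explicit; the following short derivation is my own reading of the tilted-mean equation above, assuming F_0 = N(\mu_0, \Sigma). Multiplying the Gaussian density by e^{x^\top \beta} and completing the square gives another Gaussian, N(\mu_0 + \Sigma\beta, \Sigma), so
\frac{\int e^{x^\top \beta} x \,\mathrm{d} F_0(x)}{\int e^{x^\top \beta}\,\mathrm{d} F_0(x)} = \mu_0 + \Sigma \beta.
Setting this equal to \bar{x} gives
\beta = \Sigma^{-1} (\bar{x} - \mu_0),
which is exactly the LDA-style direction between the negative-class mean and the positive-sample mean.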

This paper is actually written by a statistician and involves more derivation than most pure machine learning papers, which I find quite interesting. I'd like to explore the theoretical properties of simple models myself, but so far I haven't found a good starting point.
