Sunday, April 5, 2009

Infinitely Imbalanced Logistic Regression


This paper studies binary classification when we have a fixed set of samples from the positive class but infinitely many samples from the negative class. The main result is formulated in the following theorem:
Let n >= 1 and x_1, ..., x_n in R^d be fixed, and suppose that F_0 satisfies the tail condition
\int e^{x^\top \beta}( 1 + \| x\|) \,\mathrm{d} F_0(x) < \infty, \qquad \forall \beta \in \mathbb{R}^d
and surrounds (roughly speaking, the point below must lie in the interior of the convex hull of the support of F_0)
\bar{x} =\frac{1}{n} \sum_{i = 1}^n x_i.
Then the maximizer \beta_N of the centered log-likelihood (based on the n positive samples and N negative samples) satisfies
\lim_{N \to\infty} \frac{\int e^{x^\top \beta_N} x \,\mathrm{d} F_0(x)}{\int e^{x^\top \beta_N}\,\mathrm{d} F_0(x)} = \bar{x}.
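
Out of curiosity, here is a minimal simulation sketch of the theorem in Python (my own; the choice F_0 = N(0, I), the sample sizes, and the use of scikit-learn are assumptions, not anything from the paper). Since the exponentially tilted mean of a standard Gaussian is beta itself, the theorem predicts that the fitted slope should approach \bar{x} as N grows, while the intercept drifts to minus infinity.

# Sketch: logistic regression with fixed positives and ever more negatives.
# F_0 = N(0, I) is my assumed negative-class distribution; for this F_0 the
# tilted mean equals beta, so beta_N should approach x_bar as N grows.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 2, 5
x_pos = rng.normal(loc=1.0, scale=0.5, size=(n, d))   # fixed positive samples
x_bar = x_pos.mean(axis=0)
print("x_bar =", x_bar)

for N in [10**3, 10**4, 10**5]:
    x_neg = rng.standard_normal((N, d))                # N draws from F_0 = N(0, I)
    X = np.vstack([x_pos, x_neg])
    y = np.concatenate([np.ones(n), np.zeros(N)])
    # nearly unregularized logistic regression with an intercept
    clf = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)
    print(f"N = {N:>6}: beta_N =", clf.coef_.ravel())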

This theorem tells us several important things:
  • Two conditions must hold for β not to diverge: the tail condition and the surrounding condition. Violating either one yields a counterexample in which β diverges.
  • The limiting β relates to the positive samples only through their mean \bar{x}, so it does NOT really matter how they are distributed.
  • The author suggests this is useful for understanding the behavior of logistic regression under extreme imbalance: removing an outlier among the positive samples, or moving it towards the mean, can give a better model.
  • If F_0 is a Gaussian (or a mixture of Gaussians), β can be calculated as in LDA, the generative counterpart; see the short derivation after this list for the Gaussian case.
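
For the Gaussian case the limit can be made explicit; the following short derivation is my own reading of the tilted-mean equation above, assuming F_0 = N(\mu_0, \Sigma). Multiplying the Gaussian density by e^{x^\top \beta} and completing the square gives another Gaussian, N(\mu_0 + \Sigma\beta, \Sigma), so
\frac{\int e^{x^\top \beta} x \,\mathrm{d} F_0(x)}{\int e^{x^\top \beta}\,\mathrm{d} F_0(x)} = \mu_0 + \Sigma \beta.
Setting this equal to \bar{x} gives
\beta = \Sigma^{-1} (\bar{x} - \mu_0),
which is exactly the LDA-style direction between the negative-class mean and the positive-sample mean.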

This paper is actually written by a statistician and involves more derivation than most pure machine learning papers, which I find quite interesting. I'd like to explore the theoretical properties of simple models myself, but so far I haven't found a good starting point.
