Thursday, April 9, 2009

High-dimensional Data Analysis: The Curses and Blessings of Dimensionality


In this non-technical paper, Donoho surveys the development of data analysis, the difficulties of dealing with high-dimensional data (the curse of dimensionality), and the advantages we may be able to exploit (the blessing of dimensionality). His advisor John Tukey coined the term data analysis, which differs from traditional mathematics or statistics in its standard of rigorous derivation: more heuristic ideas are admitted. Donoho argues that the style of exploration in science has changed: instead of selecting only a few variables to build a model that can be formulated analytically and studied in the traditional way, it is now far more common to collect a large amount of data, or to enrich the information about each sample in all kinds of ways.

Curse of Dimensionality:
  • For a naive search algorithm in a high-dimensional discrete space, the number of candidates is exponential in the dimensionality.
  • In optimization, we need on the order of (1/\epsilon)^d evaluations of a Lipschitz continuous objective function to locate its minimum to within error \epsilon.
  • In approximating a Lipschitz continuous function, we likewise need O\big((1/\epsilon)^d\big) evaluations for a uniform approximation with error \epsilon.
  • In numerical integration of a Lipschitz continuous function, we again need O\big((1/\epsilon)^d\big) evaluations on a grid to obtain an approximate integral with error \epsilon (a small sketch of this grid count follows the list).
  • Nonparametric estimation becomes difficult: the much slower convergence rate is where the curse of dimensionality rears its ugly head.
  • Regularization is usually employed to tackle the curse of dimensionality.
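
As a toy numerical check of the grid counts above (a minimal Python sketch; the resolution \epsilon = 0.1 and the dimensions tried are my own illustrative choices, not from the paper): a uniform \epsilon-grid on [0, 1]^d has about (1/\epsilon)^d points, so the number of required evaluations explodes with d.

    import numpy as np

    def grid_points_needed(eps, d):
        """Number of points of a uniform grid with spacing eps on [0, 1]^d."""
        return int(np.ceil(1.0 / eps)) ** d

    eps = 0.1  # target resolution; an arbitrary choice for illustration
    for d in (1, 2, 5, 10, 20):
        print(f"d = {d:2d}: about {grid_points_needed(eps, d):.2e} grid evaluations")

Already at d = 20 the count is around 10^20, which is why grid-based optimization, approximation, and integration are hopeless in high dimensions.
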
Blessing of Dimensionality:
  • The concentration of measure phenomenon holds for many distributions in high-dimensional spaces (a small numerical sketch follows the list). Let (\Omega, \mathcal{F}, \mu) be a metric measure space, where d(\cdot, \cdot) is the metric and \mu is a probability measure on the \sigma-field \mathcal{F}. Define \alpha(\epsilon) = \sup_A \{ \mu( \Omega \backslash A_\epsilon) \mid \mu(A) = 1/2\}, where A_\epsilon = \{ x \in \Omega \mid d(x, A) < \epsilon\} is the \epsilon-extension of the measurable set A. A so-called Lévy family is a sequence (\Omega_n, \mathcal{F}_n, \mu_n) satisfying \forall \epsilon > 0, \lim_{n \to \infty} \alpha_n(\epsilon) = 0; in such a family the probability measure concentrates around any set of measure one half as n increases. If the convergence is as fast as C \exp(-c n \epsilon^2), it is called a normal Lévy family, and the tail behaves like a Gaussian. In many cases the index n can be taken to be the dimensionality, so as the dimensionality goes up we may observe concentration of measure. Practically, this means a sample in high-dimensional space is likely to be very informative: once a set carries half of the probability mass, its \epsilon-extension already covers almost all of the support, and the mass outside it decays (like a Gaussian tail for a normal Lévy family) as \epsilon grows.
  • Dimension asymptotics. This is a refinement of the previous argument, obtained by studying the limiting behavior as the dimensionality tends to infinity.
  • Approach to continuum: there is an underlying compactness to the space of observed data which is reflected by approximate finite-dimensionality and an increasing simplicity of analysis for large d (this sounds like the rationale behind compressive sensing).
  • Approximation through randomness: random projections and compressed sensing (CS) fall into this category (see the random-projection sketch after the list).
  • Approximation through computational harmonic analysis, which is a way to the continuum.
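
To see the concentration of measure phenomenon numerically (a small experiment of my own, not from the paper): for a standard Gaussian in R^d, the norm ||x|| concentrates around sqrt(d), and its relative fluctuation shrinks as d grows, in line with the Gaussian-type tail of a normal Lévy family.

    import numpy as np

    rng = np.random.default_rng(0)

    # For a standard Gaussian in R^d, ||x|| concentrates around sqrt(d);
    # the relative spread of the norms shrinks as the dimension grows.
    for d in (2, 10, 100, 1000, 10000):
        x = rng.standard_normal((1000, d))      # 1000 sample points in R^d
        norms = np.linalg.norm(x, axis=1)
        print(f"d = {d:5d}: mean ||x||/sqrt(d) = {norms.mean() / np.sqrt(d):.4f}, "
              f"relative std = {norms.std() / norms.mean():.4f}")
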
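For the randomness bullet, here is a minimal random-projection sketch (the numbers of points and dimensions are my own illustrative choices): projecting Gaussian data onto a random k-dimensional subspace roughly preserves pairwise distances, in the spirit of the Johnson-Lindenstrauss lemma.

    import numpy as np
    from scipy.spatial.distance import pdist  # condensed pairwise Euclidean distances

    rng = np.random.default_rng(1)

    n, d, k = 50, 5000, 400                        # 50 points in R^5000, projected to R^400
    X = rng.standard_normal((n, d))
    P = rng.standard_normal((d, k)) / np.sqrt(k)   # random Gaussian projection matrix
    Y = X @ P

    ratios = pdist(Y) / pdist(X)                   # projected distance / original distance
    print(f"distance ratios: min = {ratios.min():.3f}, max = {ratios.max():.3f}")
    # The ratios stay close to 1, so the geometry survives the dimension reduction.
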
I am not familiar with harmonic analysis. Many points in this paper are still vague to me. I must reread it when I gain more knowledge.
