Systems and Networking Research highlights

Technical Perspective: Modeling High-Dimensional Data

Posted Feb 1 2012

Article
References
Author

Data in high dimension is difficult to visualize and understand. This has always been the case and is even more apparent now with the availability of large high-dimensional datasets and the need to make sense of them.

The classical statistics approach to understanding data is to find a simple probabilistic model that could have generated it. The model is usually a probability distribution on a large domain and each data point is generated independently from the same distribution. While more complex models without the assumption of independent, identically distributed data points are also studied, the independent data model is the predominant one and is reasonable for many situations. Even this framework would not be very useful or interesting without further assumptions—the distribution could be considered uniform over the data encountered. The key is to model complex, large data with a simple distribution. Finding such a fit (if one exists) would likely give an insightful explanation and the parameters of the simple distribution might even have predictive powers.

What distribution to use? The central limit theorem suggests the most reasonable candidate would be a Gaussian distribution. This is what any aggregate distribution will eventually converge. Indeed, trying to find the single best-fit Gaussian to a given dataset is a commonly used and efficient approach. The Gaussian is estimated using the mean and covariance matrix of the data. Unfortunately, this works well only in rather special cases.

Thus we arrive at a widely used framework in statistics, called a mixture model. Here we assume that data is generated from a mixture of a small number of distributions of known type; the most common assumption is a mixture of Gaussians. The problem is to find the best-fit mixture of a small number of Gaussians to the given data. The number of component Gaussians, k, is much smaller than n, the ambient dimension. Unlike the case of a single Gaussian (k = 1), where it is straightforward to estimate the underlying Gaussian, the problem becomes much more difficult for general k. Even the case of a two-Gaussian mixture remained open for a long time.

Kalai, Moitra, and Valiant³ show how to solve the problem for a mixture of two arbitrary n-dimensional Gaussians. Besides relying on a simple and ingenious reduction to the case of a mixture of two 1-dimensional Gaussians, their analysis relies on the following fundamental fact about the identifiability of Gaussian mixtures: two distinct mixtures of Gaussians have different density functions; as the density of two mixtures gets closer, so must the mixtures (the means, variances, and mixing weights of one mixture must be approximated by those of the other). Such a property is not true for general mixtures, not even for mixtures of nice distributions such as those with logconcave densities, but it is true for Gaussian mixtures. Moreover, unlike the classical proof, they establish a polynomial bound on the number of samples needed to identify the components of a mixture to within a desired accuracy. Surprisingly, this is the first improvement on the sample complexity from the classical exponential bound, in spite of mixture models being studied for over a century.⁵

The key insight of their method is to show that a finite set of moments (six of them for the case of two 1-dimensional Gaussians) suffice to identify the components. With this tool in hand, they consider several random, 1-D projections of an n-dimensional mixture, identify the projections of the components in each, correctly cluster them according to component of origin, and thereby gather enough constraints on the original components to estimate their means, covariance matrices, and mixing weights.

In a follow-up paper, Moitra and Valiant⁴ extended this approach to a mixture of k Gaussians, with complexity growing exponentially in k, but polynomially in all other parameters; thus a polynomial-time algorithm for any fixed k. A simple exponential dependence on k is unavoidable even in the sample complexity, at least if the goal is to identify the components of an arbitrary mixture of Gaussians with no separation condition. A similar bound was also proved independently using a different algorithm in a more abstract (and general) setting with possibly non-Gaussian components by Belkin and Sinha.¹

The work presented in the following paper settles an important open problem, establishes fundamental facts and thereby advances classical statistics, and raises very interesting questions for computer science; among them: What can we hope to do for non-Gaussian mixtures (for which robust identifiability does not hold in general)? Can we handle Gaussian mixtures with some noise? In other words, is there an agnostic algorithm for learning Gaussian mixtures? Perhaps most interestingly, what reasonable assumptions lead to fully polynomial or even practical algorithms? (much work in the field assumes separation between components, which might be unavoidable for efficiency; for example, a polynomial-time algorithm is given assuming each component can be mostly separated by some hyper-plane from the rest of the mixture;² one clean conjecture is that any probabilistically separable mixture is identifiable in polynomial-time).

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Technical Perspective: Modeling High-Dimensional Data

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/2076450.2076473

February 2012 Issue

Published: February 1, 2012

Vol. 55 No. 2

Page: 112

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

News Apr 18 2024

Keeping AI Out of Elections

Bennie Mols

Artificial Intelligence and Machine Learning

BLOG@CACM Apr 17 2024

Technical Marvels

Herbert Bruderer

Computer History

BLOG@CACM Apr 16 2024

The Value of Data in Embodied Artificial Intelligence

Shaoshan Liu

Artificial Intelligence and Machine Learning

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

Technical Perspective: Modeling High-Dimensional Data

DOI

February 2012 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.